From: ALEKS! on 24 Apr 2008 15:14

Hi all,

I am trying to use the MultiByteToWideChar function to detect invalidly encoded strings (in US-ASCII 7-bit, for example). The input string I use is the UTF-8 encoding of the euro sign, E2:82:AC, which is not a valid ASCII-7 string.

The call is as follows:

char *input_data = "\342\202\254"; /* octal */
long n_out = MultiByteToWideChar((UINT)20127, MB_ERR_INVALID_CHARS, input_data, 3, NULL, 0);

By passing NULL in lpWideCharStr and 0 in cchWideChar, I am asking the function for the number of wide chars the conversion would produce. I expected n_out to be 0, as the input is not a valid ASCII-7 string, but I get 3, which means that 3 wide chars are obtained.

If I pass a valid output buffer, I get this UTF-16LE string: 62:00:02:00:2C:00, which looks like what you would get by reading only the low 7 bits of each input byte. But why is the MSB not read? Why don't I get an ERROR_NO_UNICODE_TRANSLATION or similar? Why does it work?

The idea is to detect whether a given string is valid in a given encoding, not only ASCII-7.

Thanks in advance for the help.
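For the specific case of 7-bit US-ASCII, validity can also be checked without any conversion call: a byte string is valid ASCII-7 exactly when no byte has its most significant bit set. The following is a minimal portable C sketch of that check. The helper name `is_ascii7` is hypothetical and is not part of the Win32 API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Win32 function): returns 1 if every byte in
 * data[0..len) lies in the 7-bit range 0x00-0x7F, and 0 otherwise. */
int is_ascii7(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; ++i) {
        if (data[i] & 0x80) {   /* MSB set: not 7-bit ASCII */
            return 0;
        }
    }
    return 1;
}
```

With the euro-sign bytes from the post, `is_ascii7((const unsigned char *)"\xE2\x82\xAC", 3)` returns 0, while a plain string such as "abc" returns 1. This only covers ASCII-7; other encodings need their own rules.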
From: Joseph M. Newcomer on 24 Apr 2008 16:47

See below...

On Thu, 24 Apr 2008 12:14:10 -0700 (PDT), "ALEKS!" <aleksander.morgado(a)gmail.com> wrote:

>Hi all,
>
>I am trying to use MultiByteToWideChar function to detect invalid
>encoded strings (in US-ASCII 7-bit for example). The input string I
>use is UTF-8 encoded E2:82:AC (euro sign), which is not a valid
>ASCII-7 string.
>
>The call is as follows:
>char *input_data = "\342\202\254"; /* octal */
****
If you have data that you think of in hex, converting it to octal is a bit roundabout; why not write
"\xE2\x82\xAC"
?
****
>long n_out = MultiByteToWideChar((UINT)20127, MB_ERR_INVALID_CHARS,
>input_data, 3, NULL, 0);
****
What is 20127? Some random magical number? Perhaps a comment that this is the US-ASCII 7-bit code page, or a #define or static const UINT, would have helped...
****
>
>When passing NULL in lpWideCharStr and 0 in cchWideChar I am querying
>the function to get the number of the output wide chars after
>conversion. What I expected was a return of 0 in n_out, as the input
>string is not a valid ASCII-7 character, but the output I get is 3,
>what means that 3 wide chars are obtained.
****
That's probably right.
****
>
>If I pass a valid output place to store the output string, I get this
>UTF-16LE string: 62:00:02:00:2C:00, which seems the string that should
>be obtained when only reading the 7bits of each input byte. But: why
>is the MSB not read?
****
Probably because you told it not to read it! You DID say that it is 7-bit data, so it used only the low-order 7 bits. It will NOT treat what you clearly told it is 7-bit data as UTF-8 encoding. So it is doing precisely the correct thing, using precisely the data you told it to use.
****
>Why don't I get a ERROR_NO_UNICODE_TRANSLATION or
>such? Why does it work?
****
It works because it is supposed to. It is doing what you asked. If your input string is encoded in UTF-8, then the ONLY code page you can use for the translation is CP_UTF8. You will convert it to Unicode. Now, you can ask it to convert the Unicode back to 20127 (US-ASCII 7-bit), and if there is an illegal character, it will indicate that there is a problem, because WideCharToMultiByte will set the LPBOOL parameter to indicate that there was a translation error.
****
>
>The idea is to detect if a given string is valid in a given encoding,
>not only ASCII-7.
****
You can't ask it to treat UTF-8 as ASCII-7 and expect that it will translate correctly. It will do exactly what you asked, which is to treat the input string as a sequence of 7-bit ASCII bytes, which it does by ignoring the high-order bit (which is probably the parity bit).

By the way, I tried the technique I suggest above using my Locale Explorer (which you can download from my MVP Tips site; just select the MultiByte tab) and it returns ? for the result. If you set the lpUsedDefault radio button to "variable" it will actually tell you it set this value to TRUE.
joe
****
>
>Thanks for the help in advance.

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
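The round trip described above (decode the input as UTF-8, then convert to the 7-bit target and watch the "used default character" flag) can be imitated in portable C. This sketch does not call the Win32 API. The names `utf8_decode_one` and `utf8_to_ascii7` are hypothetical, the '?' substitution simply mirrors the result reported above, and the decoder is deliberately minimal: it rejects bad lead and continuation bytes but does not check for overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s[0]; n is the number of bytes left.
 * Stores the code point in *cp and returns the sequence length (1-4),
 * or 0 if the sequence is truncated or a byte is malformed. */
size_t utf8_decode_one(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len;
    unsigned long value;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { len = 1; value = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; value = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; value = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; value = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (len > n) return 0;          /* truncated sequence */
    for (size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        value = (value << 6) | (s[i] & 0x3F);
    }
    *cp = value;
    return len;
}

/* Convert UTF-8 input to 7-bit ASCII in out (capacity out_cap bytes).
 * Code points above 0x7F are written as '?' and set *used_default to 1.
 * Returns the number of bytes written, or -1 on malformed input or if
 * the output buffer is too small. */
long utf8_to_ascii7(const unsigned char *in, size_t in_len,
                    char *out, size_t out_cap, int *used_default)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL && used_default != NULL);
    *used_default = 0;
    while (i < in_len) {
        unsigned long cp;
        size_t len = utf8_decode_one(in + i, in_len - i, &cp);
        if (len == 0 || o >= out_cap) return -1;
        if (cp > 0x7F) { out[o++] = '?'; *used_default = 1; }
        else           { out[o++] = (char)cp; }
        i += len;
    }
    return (long)o;
}
```

For the three euro-sign bytes, this writes a single '?' and sets `*used_default` to 1, which is the signal that the text does not fit the 7-bit target.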
From: ALEKS! on 25 Apr 2008 06:09

Hi Joseph,

> >The call is as follows:
> >char *input_data = "\342\202\254"; /* octal */
>
> ****
> If you have data that you think of in hex, converting it to octal is a bit roundabout; why
> not write
>
> "\xE2\x82\xAC"
> ?

You don't read octal? Sorry. But this is not the point.

> ****
> >long n_out = MultiByteToWideChar((UINT)20127, MB_ERR_INVALID_CHARS,
> >input_data, 3, NULL, 0);
>
> ****
> What is 20127? Some random magical number? Perhaps a comment that this is US-ASCII 7-bit
> code page, or a #define or static const UINT, would have helped...
> ****

Sorry, I thought it was understood from the explanation above. BTW, don't blame me for that; I am not the one who decided to use random magical numbers (code pages) to identify encodings.

> >If I pass a valid output place to store the output string, I get this
> >UTF-16LE string: 62:00:02:00:2C:00, which seems the string that should
> >be obtained when only reading the 7bits of each input byte. But: why
> >is the MSB not read?
>
> ****
> Probably because you told it not to read it! You DID say that it is 7-bit data, so it
> used only the low-order 7 bits. It will NOT treat what you clearly told it as 7-bit data
> as UTF-8 encoding. So it is doing precisely the correct thing, using precisely the data
> you told it to use.

I disagree. What I am trying to test is that the function returns an error when the input string is not validly encoded (ASCII-7 is only an example). Forget about ASCII-7; the question ends up being: given an input string that is supposed to be in a given encoding, how do I detect that it really is a valid string in that encoding?

My example tries to do so: I pass an input string encoded in UTF-8 (with bytes which have the MSB set) and I tell the function that the encoding is ASCII-7. The truth is that the string is NOT encoded in ASCII-7, so I really expect an error returned from the function.

> Now, you can ask it to conver the Unicode back to 20127 (US-ASCII 7-bit), and if these is
> an illegal character, it will indicate that there is a problem, because
> WideCharToMultibyte will set the LPBOOL parameter to indicate that there was a translation
> error.

Wow. So is this the only way to do so? Convert to UTF-16LE and then back to the input encoding? That's useful, yes. *sigh*

> ****
> >
> >The idea is to detect if a given string is valid in a given encoding,
> >not only ASCII-7.
>
> ****
> You can't ask it to treat UTF-8 as ASCII-7 and expect that it will translate correctly. It
> will do exactly what you asked, which is to treat the input string as a sequence of 7-bit
> ASCII bytes, which it does by ignoring the high-order bit (which is probably the parity
> bit).

I am not expecting it to translate correctly. I am expecting an error telling me that the string is not ASCII-7. If I pass an invalid UTF-8 string to the function, and I also tell the function that the string is UTF-8 encoded, I would expect an error returned from the function. This is exactly the same case.

Regards,
Aleksander
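The general question raised here, whether a byte string is really valid in the encoding it claims to be in, has to be answered per encoding. For the UTF-8 case mentioned above, a stand-alone check might look like the following sketch. The helper name `is_valid_utf8` is hypothetical, and the rules it applies (no stray or missing continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF) are the well-formedness rules of the UTF-8 definition, not a description of what MultiByteToWideChar does internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if data[0..len) is well-formed UTF-8,
 * and 0 otherwise. */
int is_valid_utf8(const unsigned char *data, size_t len)
{
    size_t i = 0;

    assert(data != NULL || len == 0);
    while (i < len) {
        unsigned char b = data[i];
        size_t extra;           /* continuation bytes expected after b */
        unsigned long cp, min;  /* decoded value and smallest legal value */

        if (b < 0x80)                { extra = 0; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;          /* stray continuation or invalid lead byte */

        if (extra > len - i - 1) return 0;            /* truncated sequence */
        for (size_t k = 1; k <= extra; ++k) {
            unsigned char c = data[i + k];
            if ((c & 0xC0) != 0x80) return 0;         /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return 0;                       /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return 0;                  /* beyond Unicode range */
        i += extra + 1;
    }
    return 1;
}
```

The euro-sign bytes E2 82 AC pass this check, while the same sequence cut to its first two bytes does not.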
From: Giovanni Dicanio on 25 Apr 2008 06:50

"ALEKS!" <aleksander.morgado(a)gmail.com> wrote in message news:8a169515-c25e-46dd-9b49-71b959105d14(a)b64g2000hsa.googlegroups.com...

> The call is as follows:
> char *input_data = "\342\202\254"; /* octal */
> long n_out = MultiByteToWideChar((UINT)20127, MB_ERR_INVALID_CHARS,
> input_data, 3, NULL, 0);

If you use CP_UTF7 instead of the magic number 20127, you get the correct result (0 is returned):

<code>
char *input_data = "\342\202\254"; /* octal */
long n_out = MultiByteToWideChar(CP_UTF7, MB_ERR_INVALID_CHARS, input_data, 3, NULL, 0);
</code>

Note that CP_UTF7 is defined like this:

#define CP_UTF7 65000 // UTF-7 translation

HTH,
Giovanni
From: Giovanni Dicanio on 25 Apr 2008 06:53

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message news:gcr114lllsaifuhspo4f5dqstgu11du2d9(a)4ax.com...

> If you have data that you think of in hex, converting it to octal is a bit
> roundabout; why
> not write
>
> "\xE2\x82\xAC"
> ?

I'm with Joe about that :)
(Of course, it is not the "kernel" of the problem.)

> ****
>>long n_out = MultiByteToWideChar((UINT)20127, MB_ERR_INVALID_CHARS,
>>input_data, 3, NULL, 0);
> ****
> What is 20127? Some random magical number? Perhaps a comment that this
> is US-ASCII 7-bit
> code page, or a #define or static const UINT, would have helped...

I think that the OP's problem is in that magic number: the MultiByteToWideChar documentation clearly says that CP_UTF7 should be used as the CodePage parameter value:

http://msdn2.microsoft.com/en-us/library/ms776413(VS.85).aspx

Giovanni