From: ALEKS! on
Hi all,

I am trying to use the MultiByteToWideChar function to detect invalidly
encoded strings (in 7-bit US-ASCII, for example). The input string I
use is the UTF-8 encoding of the euro sign, E2:82:AC, which is not a
valid ASCII-7 string.

The call is as follows:
char *input_data = "\342\202\254"; /* octal */
long n_out = MultiByteToWideChar((UINT)20127, MB_ERR_INVALID_CHARS,
input_data, 3, NULL, 0);

By passing NULL in lpWideCharStr and 0 in cchWideChar I am asking the
function for the number of wide chars the conversion would produce.
I expected n_out to be 0, since the input is not a valid ASCII-7
string, but I get 3, which means that 3 wide chars would be produced.

If I pass a valid output buffer to receive the converted string, I get
this UTF-16LE string: 62:00:02:00:2C:00, which looks like what you get
when only the low 7 bits of each input byte are read. But why is the
MSB not read? Why don't I get an ERROR_NO_UNICODE_TRANSLATION or
similar? Why does it work?
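
For reference, the bytes above are exactly what results from clearing the top bit of each input byte. A minimal C sketch of that masking follows; strip_high_bits is a hypothetical helper written only for this illustration, and the idea that code page 20127 behaves this way is inferred from the output above, not taken from the documentation.

```c
#include <stddef.h>

/* Clear the most significant bit of every input byte.
   Hypothetical helper: it mimics the behavior observed in this thread
   for code page 20127; it is not a documented guarantee of the API. */
void strip_high_bits(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i;
    for (i = 0; i < len; i++)
        out[i] = (unsigned char)(in[i] & 0x7F);
}
```

Applied to E2 82 AC this yields 62 02 2C, which matches the low byte of each UTF-16LE code unit shown above.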

The idea is to detect if a given string is valid in a given encoding,
not only ASCII-7.

Thanks for the help in advance.
From: Joseph M. Newcomer on
See below...
On Thu, 24 Apr 2008 12:14:10 -0700 (PDT), "ALEKS!" <aleksander.morgado(a)gmail.com> wrote:

>Hi all,
>
>I am trying to use MultiByteToWideChar function to detect invalid
>encoded strings (in US-ASCII 7-bit for example). The input string I
>use is UTF-8 encoded E2:82:AC (euro sign), which is not a valid
>ASCII-7 string.
>
>The call is as follows:
>char *input_data = "\342\202\254"; /* octal */
****
If you have data that you think of in hex, converting it to octal is a bit roundabout; why
not write

"\xE2\x82\xAC"
?
****
>long n_out = MultiByteToWideChar((UINT)20127, MB_ERR_INVALID_CHARS,
>input_data, 3, NULL, 0);
****
What is 20127? Some random magical number? Perhaps a comment that this is the US-ASCII 7-bit
code page, or a #define or static const UINT, would have helped...
****
>
>When passing NULL in lpWideCharStr and 0 in cchWideChar I am querying
>the function to get the number of the output wide chars after
>conversion. What I expected was a return of 0 in n_out, as the input
>string is not a valid ASCII-7 character, but the output I get is 3,
>which means that 3 wide chars are obtained.
****
That's probably right.
****
>
>If I pass a valid output place to store the output string, I get this
>UTF-16LE string: 62:00:02:00:2C:00, which seems the string that should
>be obtained when only reading the 7bits of each input byte. But: why
>is the MSB not read?
****
Probably because you told it not to read it! You DID say that it is 7-bit data, so it
used only the low-order 7 bits. It will NOT treat what you clearly told it as 7-bit data
as UTF-8 encoding. So it is doing precisely the correct thing, using precisely the data
you told it to use.
****
>Why don't I get an ERROR_NO_UNICODE_TRANSLATION or
>such? Why does it work?
****
It works because it is supposed to. It is doing what you asked.

If your input string is encoded in UTF-8, then the ONLY code page you can use for the
translation is CP_UTF8. You will convert it to Unicode.

Now, you can ask it to convert the Unicode back to 20127 (US-ASCII 7-bit), and if there is
a character that has no representation in that code page, it will indicate that there is a
problem, because WideCharToMultiByte will set the lpUsedDefaultChar (LPBOOL) parameter to
indicate that a default character had to be substituted.
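
A rough sketch of that round trip in C, assuming a Windows build. The helper name utf8_fits_code_page is invented for this example, and two choices are assumptions rather than anything stated above: MB_ERR_INVALID_CHARS is passed with CP_UTF8 (older Windows versions may reject that flag for UTF-8), and WC_NO_BEST_FIT_CHARS is used so that a "close enough" substitution is not silently accepted. Note that this tests whether UTF-8 text is representable in the target code page; it does not by itself validate raw bytes against that code page.

```c
#ifdef _WIN32   /* Windows-only sketch */
#include <windows.h>
#include <stdlib.h>

/* Hypothetical helper. Returns 1 if the UTF-8 input converts to
   code_page without any substituted default character, 0 if a
   substitution was needed, and -1 if a conversion call failed
   (for example on invalid UTF-8 input or an empty string). */
int utf8_fits_code_page(const char *utf8, int len, UINT code_page)
{
    wchar_t *wide = NULL;
    char *narrow = NULL;
    BOOL used_default = FALSE;
    int result = -1;
    int wlen, nlen;

    if (utf8 == NULL || len <= 0)
        return -1;

    /* Step 1: UTF-8 -> UTF-16. Query the size, then convert. */
    wlen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                               utf8, len, NULL, 0);
    if (wlen == 0)
        return -1;
    wide = (wchar_t *)malloc((size_t)wlen * sizeof(wchar_t));
    if (wide == NULL)
        return -1;
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8, len, wide, wlen) == 0)
        goto done;

    /* Step 2: UTF-16 -> target code page, watching lpUsedDefaultChar. */
    nlen = WideCharToMultiByte(code_page, WC_NO_BEST_FIT_CHARS,
                               wide, wlen, NULL, 0, NULL, NULL);
    if (nlen == 0)
        goto done;
    narrow = (char *)malloc((size_t)nlen);
    if (narrow == NULL)
        goto done;
    if (WideCharToMultiByte(code_page, WC_NO_BEST_FIT_CHARS,
                            wide, wlen, narrow, nlen,
                            NULL, &used_default) == 0)
        goto done;

    result = used_default ? 0 : 1;

done:
    free(narrow);
    free(wide);
    return result;
}
#endif
```

For the euro sign and code page 20127, a call such as utf8_fits_code_page("\xE2\x82\xAC", 3, 20127) is expected to return 0, since the character has to be replaced by the default character.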
****
>
>The idea is to detect if a given string is valid in a given encoding,
>not only ASCII-7.
****
You can't ask it to treat UTF-8 as ASCII-7 and expect that it will translate correctly. It
will do exactly what you asked, which is to treat the input string as a sequence of 7-bit
ASCII bytes, which it does by ignoring the high-order bit (which is probably the parity
bit).

By the way, I tried the technique I suggest above using my Locale Explorer (which you can
download from my MVP Tips site; just select the MultiByte tab) and it returns ? for the
result. If you set the lpUsedDefault radio button to "variable" it will actually tell you
it set this value to TRUE.
joe
****
>
>Thanks for the help in advance.
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: ALEKS! on
Hi Joseph,

> >The call is as follows:
> >char *input_data = "\342\202\254"; /* octal */
>
> ****
> If you have data that you think of in hex, converting it to octal is a bit roundabout; why
> not write
>
> "\xE2\x82\xAC"
> ?

You don't read octal? Sorry. But this is not the point.


> ****
> >long n_out = MultiByteToWideChar((UINT)20127, MB_ERR_INVALID_CHARS,
> >input_data, 3, NULL, 0);
>
> ****
> What is 20127? Some random magical number? Perhaps a comment that this is US-ASCII 7-bit
> code page, or a #define or static const UINT, would have helped...
> ****

Sorry, I thought it was understood from the explanation above. BTW,
don't blame me for that; I am not the one who decided to use random
magical numbers (code page identifiers) to identify encodings.


>
> >If I pass a valid output place to store the output string, I get this
> >UTF-16LE string: 62:00:02:00:2C:00, which seems the string that should
> >be obtained when only reading the 7bits of each input byte. But: why
> >is the MSB not read?
>
> ****
> Probably because you told it not to read it! You DID say that it is 7-bit data, so it
> used only the low-order 7 bits. It will NOT treat what you clearly told it as 7-bit data
> as UTF-8 encoding. So it is doing precisely the correct thing, using precisely the data
> you told it to use.

I disagree. What I am trying to test is whether the function returns an
error when the input string is not validly encoded in the code page I
pass in (ASCII-7 is just an example). Forget about ASCII-7; the
question ends up being:

Given an input string that is supposed to be in a given encoding, how
do I detect whether the string really is validly encoded?

My example tries to do so: I pass an input string encoded in UTF-8
(with bytes which have the MSB set) and I tell the function that the
encoding is ASCII-7. The truth is that the string is NOT encoded in
ASCII-7, so I really expect an error returned from the function.
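
For the ASCII-7 case specifically, the check described here can be sketched without any conversion at all: a byte string is valid 7-bit ASCII only if no byte has the MSB set. The following is a minimal portable C sketch; the name first_non_ascii is made up for this example and is not part of any Windows API.

```c
#include <stddef.h>

/* Return the index of the first byte with the high bit set,
   or len if every byte is in the 7-bit range 0x00..0x7F.
   Hypothetical helper for illustration only. */
size_t first_non_ascii(const unsigned char *s, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (s[i] & 0x80)
            return i;   /* not valid ASCII-7 from this byte on */
    }
    return len;         /* whole string is 7-bit clean */
}
```

A return value equal to len means the string is valid ASCII-7; for E2 82 AC the function returns 0, because the very first byte already has the MSB set.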

> Now, you can ask it to convert the Unicode back to 20127 (US-ASCII 7-bit), and if there is
> an illegal character, it will indicate that there is a problem, because
> WideCharToMultiByte will set the LPBOOL parameter to indicate that there was a translation
> error.

Wow. So is this the only way to do so? Convert to UTF-16LE and then
back to the input encoding? That's useful, yes. *sigh*


> ****
>
> >The idea is to detect if a given string is valid in a given encoding,
> >not only ASCII-7.
>
> ****
> You can't ask it to treat UTF-8 as ASCII-7 and expect that it will translate correctly. It
> will do exactly what you asked, which is to treat the input string as a sequence of 7-bit
> ASCII bytes, which it does by ignoring the high-order bit (which is probably the parity
> bit).

I am not expecting it to be translated correctly. I am expecting an
error telling me that the string is not ASCII-7.


If I pass an invalid UTF-8 string to the function, and I also tell the
function that the string is UTF-8 encoded, I would expect an error
returned from the function. This is completely the same case.
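
To make concrete what "invalid UTF-8" would mean here, below is a minimal portable sketch of a validity check that is independent of MultiByteToWideChar. The function name is_valid_utf8 is an assumption of this example; the rules it applies (well-formed continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF) follow the usual UTF-8 definition.

```c
#include <stddef.h>

/* Return 1 if s[0..len) is well-formed UTF-8, 0 otherwise.
   Hypothetical helper for illustration only. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        unsigned long cp, min;
        size_t need, k;

        if (b < 0x80) {                 /* single-byte (ASCII) */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            need = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            need = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return 0;                   /* stray continuation or bad lead byte */
        }

        if (len - i - 1 < need)
            return 0;                   /* sequence is truncated */

        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;               /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min)
            return 0;                   /* overlong encoding */
        if (cp > 0x10FFFF)
            return 0;                   /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* UTF-16 surrogate, not allowed */

        i += need + 1;
    }
    return 1;
}
```

With this sketch the euro sign E2 82 AC is accepted, while the truncated sequence E2 82 is rejected.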

Regards,
Aleksander
From: Giovanni Dicanio on

"ALEKS!" <aleksander.morgado(a)gmail.com> wrote in message
news:8a169515-c25e-46dd-9b49-71b959105d14(a)b64g2000hsa.googlegroups.com...

> The call is as follows:
> char *input_data = "\342\202\254"; /* octal */
> long n_out = MultiByteToWideChar((UINT)20127, MB_ERR_INVALID_CHARS,
> input_data, 3, NULL, 0);

If you use CP_UTF7 instead of the magic number 20127, you get the expected result
(0 is returned):

<code>

char *input_data = "\342\202\254"; /* octal */
long n_out = MultiByteToWideChar(CP_UTF7, MB_ERR_INVALID_CHARS,
input_data, 3, NULL, 0);

</code>

Note that CP_UTF7 is defined like this:

#define CP_UTF7 65000 // UTF-7 translation

HTH,
Giovanni


From: Giovanni Dicanio on

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
news:gcr114lllsaifuhspo4f5dqstgu11du2d9(a)4ax.com...

> If you have data that you think of in hex, converting it to octal is a bit
> roundabout; why
> not write
>
> "\xE2\x82\xAC"
> ?

I'm with Joe about that :)
(Of course, it is not the "kernel" of the problem.)


> ****
>>long n_out = MultiByteToWideChar((UINT)20127, MB_ERR_INVALID_CHARS,
>>input_data, 3, NULL, 0);
> ****
> What is 20127? Some random magical number? Perhaps a comment that this
> is US-ASCII 7-bit
> code page, or a #define or static const UINT, would have helped...

I think that the OP's problem is in that magic number: the
MultiByteToWideChar documentation clearly says that CP_UTF7 should be used
as the CodePage parameter value:

http://msdn2.microsoft.com/en-us/library/ms776413(VS.85).aspx

Giovanni