From: Norman Diamond on
"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
news:Xns9813D785509FMihaiN(a)207.46.248.16...
[Norman Diamond:]
>> Except that not all Win32 APIs actually work that way. SOME Win32
>> APIs count TCHARs, i.e. counting chars in ANSI and counting wchar_ts in
>> Unicode. But SOME Win32 APIs really count characters. Microsoft has
>> responded to a few cases, including one personally, to say that for some
>> Win32 APIs, even in the ANSI versions, internal processing is performed
>> in Unicode and the limits are counted in actual characters rather than in
>> the number of bytes required for the ANSI representations.
>
> Can you give some examples?

The one for which Microsoft sent personal e-mail was CreateFile. Microsoft
assured me that even the ANSI version (CreateFileA) uses Unicode internally
and MAX_PATH is the limit on the number of characters internally, so if an
ANSI application needs more than MAX_PATH bytes to specify a usable filename
then it can indeed do so. I've been a bit negligent in not writing a test
program to test this answer yet.
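
To make the byte-versus-character distinction concrete, here is a minimal portable C sketch. It assumes the path is encoded in code page 932 and uses that code page's lead-byte ranges (0x81-0x9F and 0xE0-0xFC). The limit is redefined as SKETCH_MAX_PATH (260, the usual MAX_PATH value), and all function names are hypothetical. It does not call CreateFile and does not confirm how CreateFileA actually behaves; it only shows how the two ways of counting can disagree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: 260 is the usual value of MAX_PATH in the Windows headers.
   It is redefined here so the sketch builds on any platform. */
#define SKETCH_MAX_PATH 260

/* Lead-byte ranges of code page 932 (Shift-JIS): 0x81-0x9F and 0xE0-0xFC. */
static int cp932_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Count characters (not bytes) in a NUL-terminated CP932 string. */
size_t cp932_char_count(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    while (*s != '\0') {
        /* A lead byte followed by a trail byte forms one character. */
        if (cp932_is_lead_byte((unsigned char)*s) && s[1] != '\0')
            s += 2;
        else
            s += 1;
        count++;
    }
    return count;
}

/* Hypothetical check: the limit applied to characters. */
int fits_by_characters(const char *path)
{
    return cp932_char_count(path) < SKETCH_MAX_PATH;
}

/* Hypothetical check: the limit applied to bytes (chars). */
int fits_by_bytes(const char *path)
{
    return strlen(path) < SKETCH_MAX_PATH;
}
```

A name made of 200 double-byte characters is 400 bytes long, so it passes the character check and fails the byte check.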

The other cases that I recall were discussed in newsgroups, most likely
microsoft.public.win32.programmer.ui. It's been a while now. In general I
learned from it that even in cases where we think MSDN pretty obviously
doesn't mean what it says, sometimes it really does mean what it says.

> In my experience <<There are in fact very few APIs that deal with the
> "user character">> and those are decently documented.

Either that or there are very few that are decently documented ^_^

From: Norman Diamond on
> Multibyte Character Set is an *encoding* of a character set.

Yes, ANSI code page 932 is an encoding just like other ANSI code pages such
as (I might not be remembering these numbers correctly) 1252 and 850.

> however, StringCchPrintf, sprintf, etc. do only convert characters using
> code pages in special cases, e.g., %lc or %C format.

And %s and stuff like that. (If you're compiling in an ANSI environment
then simply use %s, but if you're compiling in a Unicode environment and
want to produce an ANSI encoded string then use %S.)

> For ANSI mode, this means that 'character' is 'byte'. In ANSI mode, one
> character is one byte.

For some reason I thought that you had sometimes written code targeting
ANSI code pages in which you knew that these statements are not true. It
looks like I misremembered. OK, then it seems that this is your
introduction to such code pages. In ANSI mode, one character is one or more
bytes. In the ANSI code pages that Microsoft implemented, one character is
one or two bytes, no more than two.
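
Because a character in such a code page may be one or two bytes, cutting a string at an arbitrary byte offset can leave half a character behind. The following is a minimal sketch in portable C, assuming code page 932 lead-byte ranges. The function name cp932_copy_truncate is hypothetical and is not part of any Windows or CRT API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Lead-byte ranges of code page 932 (Shift-JIS): 0x81-0x9F and 0xE0-0xFC. */
static int cp932_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Copy src into dst (capacity dst_size bytes, including the NUL) without
   splitting a double-byte character. Returns the number of bytes copied,
   not counting the terminating NUL. */
size_t cp932_copy_truncate(char *dst, size_t dst_size, const char *src)
{
    size_t i = 0;
    assert(dst != NULL && src != NULL && dst_size > 0);
    while (src[i] != '\0') {
        /* One character is one byte, or a lead byte plus a trail byte. */
        size_t len =
            (cp932_is_lead_byte((unsigned char)src[i]) && src[i + 1] != '\0')
                ? 2 : 1;
        if (i + len >= dst_size)  /* no room for this character plus NUL */
            break;
        memcpy(dst + i, src + i, len);
        i += len;
    }
    dst[i] = '\0';
    return i;
}
```

With a 4-byte buffer and a source of two double-byte characters, a byte-wise copy would keep three bytes and end on a dangling lead byte; this sketch keeps only the first whole character.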

I haven't been using Japanese Microsoft systems for nearly 20 years; I've
only been using them for half that length of time and occasionally seen them
in use the other half of that time while I was using Japanese Unix and
Japanese VMS systems. I've used %s format in printf in Japanese Unix and
VMS and Windows systems. This is one kind of experiment that you don't need
to tell me to do.

I will continue to respect your expertise on matters other than character
encodings.


"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
news:b9i1d2p7ca3n59258h63bc1mavfgjngicd(a)4ax.com...
> Multibyte Character Set is an *encoding* of a character set. In ANSI mode,
> MBCS can be used to encode 'characters' in an extended set; however,
> StringCchPrintf, sprintf, etc. do only convert characters using code pages
> in special cases, e.g., %lc or %C format. The formal definition for %c, the
> formatting code being discussed in this example, is that the int argument
> is converted to 'unsigned char' and formatted as a character. For ANSI
> mode, this means that 'character' is 'byte'. In ANSI mode, one character
> is one byte.
>
> In a multibyte character set, a glyph might be represented by one to four
> successive 8-bit bytes. Note that using %c would be erroneous for
> formatting an integer value, if the intent was to produce a multibyte
> sequence representing a single logical character.
>
> This can easily be seen by looking at the %c formatting code in output.c
> in the CRT source. %c formats exactly one byte in ANSI mode. So arguing
> that %c requires two bytes for a character is not correct.
>
> The exact code executed for %c formatting is
>
>     unsigned short temp;
>     temp = (unsigned short) get_int_arg(&argptr);
>     {
>         buffer.sz[0] = (char) temp;
>         textlen = 1;
>     }
>
> I see nothing here that can generate more than one byte of output. Note
> that the %C and %lc formats, which take wide character values and format
> them in accordance with the code page, *can* generate more than one byte
> of character, which does satisfy the objection raised. But the format here
> is clearly %c, and %c is clearly defined, and the implementation reflects
> that definition. So I'm not sure what the issue is here.
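
The one-byte behaviour of %c described in the quoted text can be checked with standard C alone. This is a minimal sketch using the standard snprintf rather than StringCchPrintf or the CRT's output.c, so it illustrates what the C standard specifies for %c and is not a test of the Microsoft implementation. The helper name format_with_percent_c is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Format `value` with %c into out (capacity out_size bytes) and return the
   number of bytes snprintf reports it produced. */
int format_with_percent_c(char *out, size_t out_size, int value)
{
    assert(out != NULL && out_size > 0);
    /* Standard C: for %c the int argument is converted to unsigned char,
       so only the low-order byte of `value` is written. */
    return snprintf(out, out_size, "%c", value);
}
```

Passing 0x93FA (the two bytes of a Shift-JIS character packed into an int) produces a single output byte, the low-order one, which is why %c cannot emit a double-byte character in one conversion.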
>
> StringCchPrintf is defined in terms of 8-bit characters and 16-bit
> characters, not in terms of logical characters encoded in an MBCS. MBCS
> does not enter the discussion; if you format using %lc or %C it will
> actually truncate the multibyte string to fit in the buffer. Thus, it
> obeys its requirement of not allowing a buffer overrun.
>
> This can be seen trivially simply by--get this--DOING THE EXPERIMENT!!!!!
> So while you can contend until the cows come home that you think that you
> know how to read the documentation, it is a matter of a couple minutes to
> actually do the experiment. I found that even when the wctomb function
> produces a sequence of multiple bytes to represent the wide character as a
> multibyte character, when formatting with %lc, the ANSI definition of
> StringCchPrintf is in terms of ANSI characters, 8-bit bytes, and it writes
> exactly one of the three bytes of the multibyte sequence, the first byte.
> So the sequence
>
>     StringCchPrintf(buffer, sizeof(buffer) / sizeof(buffer[0]), "%lc", 0xF95C);
>
> will simply transfer to the target buffer the first 8-bit byte of what
> turned out to be a 3-byte multibyte sequence.
>
> Note that since I don't have appropriate multinational support, I had to
> actually set a breakpoint and "fake" the results of wctomb, because what
> it does on my machine is fail the conversion and return -1. So I simply
> placed two bytes and a NUL into the buffer as if wctomb had worked
> correctly, changed the length to 2, and proceeded with the execution.
> Otherwise, I just get an empty string.
>
> UTF-8 is one of the many multibyte character encodings that exist. I
> chose it as an example because it is specified in the Unicode standard.
>
>
> joe
>
>
> On Wed, 2 Aug 2006 09:12:11 +0900, "Norman Diamond"
> <ndiamond(a)community.nospam> wrote:
>
>>I wrote:
>>>> The documentation for StringCchPrintf talks about counts of characters.
>>
>>Dr. Newcomer's response emphasises several times that the documentation
>>for StringCchPrintf talks about counts of ***** characters ***** EXACTLY
>>as I said it does. It is reassuring to see this agreement, though I wonder
>>why it's expressed so oddly.
>>
>>But then odd questions arise
>>
>>> Now where, in the above documentation, does it say that a 'character' is
>>> exactly one byte?
>>> How do you infer that a 'character', in ANSI mode, can occupy two bytes?
>>
>>Very very true. In the documentation of StringCchPrintf, MSDN correctly
>>refrains from saying that a 'character' is exactly one byte. Microsoft is
>>well aware that code page 932 (Shift-JIS) and the code page for the
>>world's largest country by population and a couple of other code pages
>>contain characters that, in ANSI mode, occupy two bytes. Dr. Newcomer, I
>>think you are well aware of this too, and I am really confused why you ask
>>these questions
From: Mihai N. on
> The one for which Microsoft sent personal e-mail was CreateFile. Microsoft
> assured me that even the ANSI version (CreateFileA) uses Unicode internally
> and MAX_PATH is the limit on the number of characters internally, so if an
> ANSI application needs more than MAX_PATH bytes to specify a usable
> filename then it can indeed do so. I've been a bit negligent in not
> writing a test program to test this answer yet.
But CreateFile does not take a number of chars as a parameter.
What I suspect is happening is that MAX_PATH is the limit if you don't
use "\\?\", and that it applies to both the W and A versions.
And since the A version does a conversion to Unicode and calls the W one,
the limit is probably there and expressed in UTF-16 code units, indeed.
Interesting for some week-end experiments :-)
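
For readers unsure what "UTF-16 code units" means as a unit of measure, here is a small portable C sketch. It only counts the code units that a sequence of Unicode code points would need; the function names are hypothetical, and it does not reproduce the conversion that the A version of an API performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of UTF-16 code units needed for one Unicode code point:
   1 inside the Basic Multilingual Plane, 2 (a surrogate pair) above it. */
size_t utf16_units_for(uint32_t code_point)
{
    assert(code_point <= 0x10FFFF);
    return code_point >= 0x10000 ? 2 : 1;
}

/* Total UTF-16 code units for an array of n code points. */
size_t utf16_length(const uint32_t *code_points, size_t n)
{
    size_t total = 0;
    assert(code_points != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += utf16_units_for(code_points[i]);
    return total;
}
```

A limit expressed in UTF-16 code units therefore matches the character count as long as every character is in the Basic Multilingual Plane, and counts two units for each character outside it.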

> In general I
> learned from it that even in cases where we think MSDN pretty obviously
> doesn't mean what it says, sometimes it really does mean what it says.
....
> Either that or there are very few that are decently documented ^_^
I know what you mean.
And I was very vocal about it, until I discovered the Mac OS X
documentation. Now I claim that MSDN is great, and should be a model :-)


--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
From: Mihai N. on
> In a multibyte character set, a glyph might be represented by one to four
> successive 8-bit bytes.
....
> UTF-8 is one of the many multibyte character encodings that exist.
> I chose it as an example because it is specified in the Unicode standard.

You should never use UTF-8 as an example in the Windows world. It is
guaranteed to give weird results, since it is not supported.
Windows only knows about ANSI code pages (and UTF-8 cannot be that) or
UTF-16.
The only place where UTF-8 is OK in Windows is in the APIs that convert
to/from UTF-16.



From: Norman Diamond on
"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
news:Xns9813EB11E9700MihaiN(a)207.46.248.16...
>> The one for which Microsoft sent personal e-mail was CreateFile.
>> Microsoft assured me that even the ANSI version (CreateFileA) uses
>> Unicode internally and MAX_PATH is the limit on the number of characters
>> internally, so if an ANSI application needs more than MAX_PATH bytes to
>> specify a usable filename then it can indeed do so. I've been a bit
>> negligent in not writing a test program to test this answer yet.
>
> But CreateFile does not take a number of chars as parameter.

So what? Where we intuitively think that the stated limit of MAX_PATH
characters means MAX_PATH chars in ANSI, Microsoft informed me that the
limit really is MAX_PATH characters even if it takes twice that many bytes.
You asked for examples of cases where we had been wrong in nearly always
assuming that MSDN's statements about characters meant TCHARs, and this is a
big example.

> What I suspect is happening is that the MAX_PATH is the limit if you don't
> use "\\?\" and is there both in the W and A versions.
> And since the A version does a conversion to Unicode and calls the W one,
> the limit is probably there and expressed in utf16 code units, indeed.

You suspect that Microsoft's e-mail to me was accurate, and as mentioned, I
have the same impression. Though they send a lot of unbelievable e-mails,
they send some believable e-mails too and this was one.

> Interesting for some week-end experiments :-)

Yup. By the way, considering that VFAT can store a filename consisting of
around 250 Kanji, one weekend experiment would be to try opening the file
under Windows 98 (Japanese version of course). But really I'll consider it
close enough if it works under Windows 2000, XP, 2003, and Vista beta. I
haven't had time to test it and I do believe that mail.