From: Alf P. Steinbach on
* Pete Becker:
>
> WCHAR has to be 2 bytes and store UTF-16 in little-endian format,
> because that's the way that the Windows API was designed. More recently,
> wchar_t has to do the same, because WCHAR is now defined as wchar_t.
> There's no essential connection, just the artificial one that the
> Windows headers create.

As a practical matter consider wide character literals.

FooW( L"Hello, hello?" );

where FooW is some API function.

If wchar_t isn't what the API level expects, this won't compile.

So essentially the OS limits the compilers regarding what wchar_t can be
by default.

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Eugene Gershnik on
kanze wrote:
> Eugene Gershnik wrote:
>> kanze wrote:
>>> Eugene Gershnik wrote:

>> since AFAIK é may
>> be encoded as e followed by the thingy on top or as a single
>> unit é. ;-)
>
> Not according to Unicode, at least not in correctly formed
> Unicode sequences.

Well, here is an AFAIK correctly formed Unicode sequence that means é:

U+0065 U+0301

All my editors seem to agree with that.

> But that's not the point. The point is more the opposite:
> simplistic solutions like looking for a single character are
> just that: simplistic.

I think we agree on that.

> The fact that you can find certain
> characters with such a single character search in UTF-8 is a
> marginal advantage, at best.

But not on this. The control parts of all textual network protocols are in
7-bit English, and special characters in filenames are within the ASCII range
on all filesystems I have ever seen. Lots of system and network programming
tasks consist of looking for these. This is of course just a few areas out of
many, but I would guess they are among the largest ones for C++.

>> Also note that you *can* use std::find with a filtering
>> iterator (which is easy to write) sacrificing performance.
>> Then again nobody uses std::find on strings. You either use
>> basic_string::find or strstr() and similar. Which both work
>> fine on é in UTF-8 as long as you pass it as a string and not
>> a single char.
>
> Agreed. But then, a lot of other multibyte character sets will
> work in that case as well.

Not necessarily. A lot of MBCS encodings make the fatal error of overlapping
the range of trail bytes with the range of single bytes, which makes a simple
search like strstr(string, ".") impossible.

>>>> It is also can be used (with caution) with std::string
>>>> unlike UTF-16 and UTF-32 for which you will have to invent
>>>> a character type and write traits.
>
>>> Agreed, but in practice, if you are using UTF-8 in
>>> std::string, your strings aren't compatible with the third
>>> party libraries using std::string in their interface.
>
>> This depends on a library. If it only looks for characters
>> below 0x7F and passes the rest unmodified I stay compatible.
>> Most libraries fall in this category. That's why so much Unix
>> code works perfectly in UTF-8 locale even though it wasn't
>> written with it in mind.
>
> Are you kidding? I've not found this to be the case at all.
> Most Unix tools are extremely primitive,

[...]

No argument about tools, but I am talking about _libraries_. The code that
breaks is probably a minority, but it is the visible one that deals with UI.

> This is one case where Windows has the edge on Unix: Windows
> imposes a specific encoding for filenames. IMHO, it would have
> been better if they had followed the Plan 9 example, and chosen
> UTF-8, but anything is better than the Unix solution, where
> nothing is defined, every application does whatever it feels
> like, and filenames with anything other than basic US ASCII end
> up causing a lot of problems.

Well, this is just one special case of Unix flexibility. You get totally
different behavior based on PATH, LD_LIBRARY_PATH, LC_CTYPE, etc. Not to
mention that on the open source ones you can never be sure that the kernel you
are running on wasn't modified by the user beyond recognition. As a developer
of shrink-wrapped software I don't like it, but that's life.

>>> Arguably, you want a different type, so that the compiler
>>> will catch errors.
>
>> Yes. When I want maximum safety I create struct utf8_char
>> {...}; with the same size and alignment as char. Then I
>> specialize char_traits, delegating to char_traits<char> and
>> have typedef basic_string<utf8_char> utf8_string.

[...]

> And you doubtlessly have to convert a lot:-). Or do you also
> create all of the needed facet's in locale?

I don't really use iostreams for anything but logging, so I don't
particularly care about extra conversions there. The big areas where I need
to convert are when calling system APIs and when interfacing with
Unicode-aware libraries that use UTF-16 (like most XML parsers). The first
is inevitable in portable code whatever you choose, and the second so far
hasn't been a performance problem.

> Still, it doesn't work if the code you're interfacing to is
> trying to line data up using character counts, and doesn't
> expect multi-byte characters. If, like a lot of software here
> in Europe, it assumes ISO 8859-1.

Of course, but the UIs I have to deal with are all either Web or Java ones
these days.

>> All the system APIs (not some but *all*) that deal with
>> strings accept UTF-16. None of them accept UTF-8 and UTF-32.
>> There is also no notion of UTF-8 locale. If you select
>> anything but UTF-16 for your application you will have to
>> convert everywhere.
>
> They've got to support UTF-8 somewhere. It's the standard
> encoding for all of the Internet protocols.

Win32 contains very little that deals with internet protocols; they are
usually handled by applications themselves.
All I can recall about UTF-8 on Win32 is that there is a function to convert
to and from wchar_t (UTF-16), plus some routines for DNS queries.

>> On the most fundamental level to do I18N correctly strings
>> have to be dealt with as indivisible units. When you want to
>> perform some operation you pass the string to a library and
>> get the results back. No hand written iteration can be
>> expected to deal with pre-composed vs. composite, different
>> canonical forms and all the other garbage Unicode brings us.
>> If a string is an indivisible unit then it doesn't really
>> matter what this unit is as long as it is what your libraries
>> expect to see.
>
> So we basically agree:-). All that's missing is the libraries.
> (I know, some exist. But all too often, you don't have a
> choice.)

In my experience it boils down to money and/or ignorance. It is not that
hard to write a specific library if you know the problem domain well (which
I don't, despite having to deal with it occasionally). However, most shops
don't even think about hiring an expert in this area. After all, I18N string
handling is easy: just use wchar_t everywhere, right? ;-)

--
Eugene




From: Eugene Gershnik on
Eugene Gershnik wrote:
>
> Well, here is an AFAIK correctly formed Unicode sequence that means é:
>
> U+0065 U+0301
>
> All my editors seem to agree with that.

And unintentionally this was also a good demonstration of how broken modern
software is with regard to Unicode. My NNTP client (Microsoft Outlook
Express) correctly showed the character as é while editing, but transmitted
it as the decomposed e followed by U+0301, as you can see above. This is
despite its being a Windows-only application that presumably uses UTF-16
wchar_t internally.
See also this nice test of modern search engines
http://blogs.msdn.com/michkap/archive/2005/11/15/492301.aspx

--
Eugene




From: Pete Becker on
Alf P. Steinbach wrote:

>
> As a practical matter consider wide character literals.
>
> FooW( L"Hello, hello?" );
>
> where FooW is some API function.
>
> If wchar_t isn't what the API level expects, this won't compile.
>
> So essentially the OS limits the compilers regarding what wchar_t can be
> by default.
>

Gosh, how did we survive back in the olden days when WCHAR wasn't wchar_t?

--

Pete Becker
Roundhouse Consulting, Ltd.


From: Pete Becker on
Eugene Gershnik wrote:
> Pete Becker wrote:
>
>>I'm well aware of the history of wchar_t in MS compilers. I was
>>talking about the definition of WCHAR in the Windows headers, which,
>>believe it
>>or not, at one time didn't ship with the compiler.
>
>
> I do believe it since I well remember it ;-) In any case WCHAR was obviously
> intended to stand for wchar_t.

Or, equally obviously, for wide character.

> When the compilers didn't uniformly provide
> it unsigned short was used as a substitute. This is still the case with many
> older Windows libraries.
>

Which is what I said in the first place.

--

Pete Becker
Roundhouse Consulting, Ltd.

