UTF8 and std::string [C++]

From: Eugene Gershnik on 13 Jun 2006 18:33

Bronek Kozicki wrote:
> jrm wrote:
> > std::wstring might not be a good idea according to the details section
> > here from ustring class:
>
> why not? std::wstring is typically implemented on top of Unicode support of
> target platform, and character type used is typically some fixed-width Unicode
> encoding, like UTF16 (on Windows) or UTF32 (on Linux; I do not know about
> other flavours of Unix).

wchar_t is locale dependent on Solaris. It is UTF-32 for UTF-8 locales
and something proprietary on others. This question has been beaten to
death in this NG in the past. The simple conclusion is standard C++
wchar_t != Unicode. IIRC P.J. Plauger once explained here why it should
be considered a good thing.

> UTF8 is not character type (neither UTF16 or UTF32
> are, but at least they are fixed width, so they can map to wchar_t) but fancy
> encoding.

UTF-16 is *not* fixed width. It is a variable width encoding where a
Unicode character can be represented by 1 or 2 16-bit units. At least
this was so last time I checked. I wouldn't be surprised if some new
Unicode standard broke it further.
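
To make the "1 or 2 16-bit units" point concrete, here is a minimal sketch
of UTF-16 encoding for a single code point. The function name encode_utf16
is made up for illustration and a C++11 compiler is assumed; it is not part
of any standard library.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (illustration only): encode one Unicode code point
// as a sequence of UTF-16 code units.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);              // outside the Unicode code space
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogate values are not characters

    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit is enough.
        return { static_cast<std::uint16_t>(cp) };
    }
    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    cp -= 0x10000;
    return { static_cast<std::uint16_t>(0xD800 + (cp >> 10)),
             static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)) };
}
```

With this sketch, encode_utf16(0x20AC) (the euro sign) yields one unit,
while encode_utf16(0x1D11E) (musical G clef) yields the pair 0xD834 0xDD1E.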

UTF-32 is the only fixed length encoding for Unicode available today.
Again see caveat above. It is also very wasteful if the bulk of your
text processing is ASCII compatible. (note that 4 bytes is the *worst*
case for UTF-8).
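
The size trade-off can be sketched as follows. The helper name utf8_length
is an assumption for illustration; it returns how many bytes UTF-8 needs
for one code point, where UTF-32 always needs 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (illustration only): number of bytes UTF-8 uses
// to encode a single code point.
std::size_t utf8_length(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);      // outside the Unicode code space
    if (cp < 0x80)    return 1;  // ASCII range
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;                    // worst case, same size as UTF-32
}
```

For mostly-ASCII text nearly every character takes the first branch, so
the UTF-8 form is roughly a quarter the size of the UTF-32 form.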

UTF-8 has special properties that make it very attractive for many
applications. In particular it guarantees that no byte of multi-byte
entry corresponds to a standalone single byte. Thus with UTF-8 you can
still search for English-only strings (like /, \\ or .) using
single-byte algorithms like strchr().
It can also be used (with caution) with std::string, unlike UTF-16
and UTF-32, for which you will have to invent a character type and write
traits.
IMO UTF-8 (and UTF-8 locales) is probably the best way to use Unicode
on Unix. Apparently I am also backed by known experts
http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux
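
As a rough illustration of searching UTF-8 with byte-oriented functions,
the sketch below uses std::strrchr to find the last '/' in a UTF-8 path.
The function name basename_utf8 is hypothetical, and the input is assumed
to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper (illustration only): return the part of a UTF-8
// encoded path that follows the last '/'.
std::string basename_utf8(const char* path)
{
    assert(path != nullptr);
    // Every byte of a multi-byte UTF-8 sequence has its high bit set, so a
    // byte equal to '/' (0x2F) can only be a real slash.
    const char* slash = std::strrchr(path, '/');
    return slash ? std::string(slash + 1) : std::string(path);
}
```

The search never looks at character boundaries, yet it cannot land in the
middle of an accented letter, because no such letter contains the byte 0x2F.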

UTF-16 is a good option on platforms that directly support it like
Windows, AIX or Java. UTF-32 is probably not a good option anywhere ;-)

--
Eugene

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Alf P. Steinbach on 14 Jun 2006 06:18

* Pete Becker:
> Wu Yongwei wrote:
>
>> A gotcha under Windows: wchar_t is 2 bytes wide.
>
> wchar_t is a type defined by the compiler. For some Windows compilers
> it's 2 bytes wide, for others it isn't.

Is there a C++ compiler for 32-bit Windows where wchar_t isn't 16 bits
by default?

If such a compiler exists it would be unable to compile existing source
code based on the identity assumption C++ wchar_t === Windows WCHAR.
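
Code that relies on that assumption could state it explicitly. The sketch
below is only an illustration, assuming a C++11 compiler; the names
wchar_bits and wchar_holds_any_code_point are made up here.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t, in bits, for the compiler building this file.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// Hypothetical guard for code that treats wchar_t and a 16-bit WCHAR as the
// same type. Left commented out so the sketch still compiles on platforms
// where wchar_t is wider.
// static_assert(wchar_bits == 16, "this code assumes a 16-bit wchar_t");

// True when one wchar_t can hold any code point up to U+10FFFF.
constexpr bool wchar_holds_any_code_point = (WCHAR_MAX >= 0x10FFFF);
```

Enabling the static_assert turns a silent mismatch into a compile-time
error on any compiler whose wchar_t is not 16 bits wide.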

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

From: Jeff Koftinoff on 14 Jun 2006 06:20

Bronek Kozicki wrote:
> Jeff Koftinoff wrote:
> > But UTF-16 and UTF-32 both are potentially multi-code-point per
> > character encodings... See the "Grapheme Boundaries" section of:
>
> they are best one can get now.
>
>
> B.
>

Right, but a 'best' solution would be to use a string class that can
iterate multi-byte characters.

--jeffk++

From: Pete Becker on 14 Jun 2006 06:28

Bronek Kozicki wrote:
>
> why not? std::wstring is typically implemented on top of Unicode support of
> target platform, and character type used is typically some fixed-width Unicode
> encoding, like UTF16 (on Windows) or UTF32 (on Linux; I do not know about
> other flavours of Unix).

UTF16 is not fixed-width, unless you are sure you will never have
characters represented by surrogate pairs. But if you're willing to do
that, then UTF8 is also fixed-width, so long as you are sure you will
never have characters represented by values greater than 0x7f. It's just
a question of how much stuff you're willing to ignore in order to claim
that a representation is fixed width.

--

Pete Becker
Roundhouse Consulting, Ltd.

From: kanze on 14 Jun 2006 06:30

Pete Becker wrote:
> jrm wrote:

> > std::wstring might not be a good idea according to the details section
> > here from ustring class:

> > <snip
> > src=http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html#_details>

> > In a perfect world the C++ Standard Library would contain a
> > UTF-8 string class. Unfortunately, the C++ standard doesn't
> > mention UTF-8 at all. Note that std::wstring is not a UTF-8
> > string class because it contains only fixed-width characters
> > (where width could be 32, 16, or even 8 bits).

> > </snip>

> Back in the olden days, the Japanese tried to work with
> multi-byte representations of Japanese characters. The result
> of that experience was that they insisted that C add wide
> character support so they wouldn't have to.

Times change. UTF-8 was designed with some of the problems
encountered in the Japanese encodings in mind.

Having said that, I think a lot depends on the application. I
certainly wouldn't like to have to write an editor using UTF-8,
for example. But for a lot of applications (including things
like compilers and interpreters), text handling is limited to
reading input sequentially, cutting it up into tokens, then only
comparing the tokens or pasting them together for output text.
As long as you're only accessing any string object sequentially
(which is the case in such applications), UTF-8 can be made to
work quite well.
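
A minimal sketch of that kind of sequential access is shown below. It
assumes the std::string holds well-formed UTF-8; the helper names
next_code_point and count_code_points are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (illustration only): given the index of the first
// byte of a UTF-8 sequence, return the index where the next one starts.
std::size_t next_code_point(const std::string& s, std::size_t i)
{
    assert(i < s.size());
    ++i;
    // Continuation bytes have the bit pattern 10xxxxxx; skip over them.
    while (i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
        ++i;
    return i;
}

// Count code points by walking the string once, front to back.
std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); i = next_code_point(s, i))
        ++n;
    return n;
}
```

Stepping forward this way costs one pass over the bytes; what UTF-8 does
not offer is constant-time access to the n-th character, which is where an
editor would run into trouble.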

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
