C++0x: u16string, u32string and string [C++]

From: german diago on 16 Feb 2010 22:40

Hello. I've been trying out the support for UTF strings in C++0x (from
GCC svn). I looked at the current draft of the language, N3000, and I
have a question.

The length() member function says it returns the number of char16_t,
char32_t or chars in a string, depending on the basic character type.

But the number of chars used to encode a symbol is variable, at least
for UTF-8 (and I believe this is also true for UTF-16). So these
functions don't return the real number of symbols in a string, but the
number of code units (char, char16_t or char32_t). To calculate the
real number of symbols, you cannot rely on a standard function. I think
a standard function that returns the number of "symbols", rather than
the number of chars in a string, should be included, maybe under
another name, since length() should be kept for compatibility.

Or is there one I'm not aware of? Thanks for your time.
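
To make the question concrete, here is a minimal sketch of what
length() reports for a single code point outside the Basic Multilingual
Plane. The helper names (clef_utf8 and friends) are made up for this
illustration, and the UTF-8 bytes are spelled out by hand so the result
does not depend on the source file encoding.

```cpp
#include <string>

// U+1D11E MUSICAL SYMBOL G CLEF: one code point, outside the BMP.
// Hypothetical helpers for illustration only.

// UTF-8: the code point is encoded as the four bytes F0 9D 84 9E.
inline std::string clef_utf8() {
    return std::string("\xF0\x9D\x84\x9E");
}

// UTF-16: the compiler encodes \U0001D11E as a surrogate pair.
inline std::u16string clef_utf16() {
    return std::u16string(u"\U0001D11E");
}

// UTF-32: one code unit per code point.
inline std::u32string clef_utf32() {
    return std::u32string(U"\U0001D11E");
}
```

With these, clef_utf8().length() is 4, clef_utf16().length() is 2 and
clef_utf32().length() is 1, although all three hold the same single
code point.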

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Joshua Maurice on 17 Feb 2010 08:13

On Feb 17, 7:40 am, german diago <germandi...(a)gmail.com> wrote:
> Hello. I've been trying support for utf strings in c++0x (from gcc
> svn). I looked at the current draft N3000 for the language, and I have
> a question.
>
> The length() member function says it returns the number of char16_t,
> char32_t or chars in a string, depending on the basic character type.
>
> But the number of chars that a symbol is encoded in, at least for
> utf-8 encoding (and I believe it's also true for utf-16) is variable.
> So these functions don't return the real number of symbols in each
> string, but the number of chars, depending on the size of the char.
> So to calculate the real number of symbols, you cannot rely on a
> standard function. I think a standard function to calculate the number
> of "symbols", not the number of chars of a string, should be included,
> maybe with another name, since length should be kept for
> compatibility.
>
> Or is there one I'm not aware of? Thanks for your time.

Then there should also be a function to return the total number of
grapheme clusters. Analogously, there probably ought to be iterators
for (1) code units, (2) symbols, aka Unicode code points, and (3)
grapheme clusters, aka what the end user thinks of as a character.
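
As a rough sketch of the middle level (code points), the following
counts code points in a UTF-8 std::string by skipping continuation
bytes. The name count_code_points_utf8 is hypothetical, nothing like it
is in the draft, and the function assumes the input is already
well-formed UTF-8. It says nothing about grapheme clusters, which need
the Unicode segmentation rules and data tables.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the standard library.
// Assumes s holds well-formed UTF-8; performs no validation.
std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        // Continuation bytes have the form 10xxxxxx; every other byte
        // starts a new code point.
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For example, "e\xCC\x81" (the letter e followed by U+0301 COMBINING
ACUTE ACCENT) has length() 3 and two code points, yet an end user sees
one character.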

Honestly, I haven't reviewed it yet, but I hold out little hope that
we'll actually have this basic functionality, and thus we'll still be
stuck with ICU for the foreseeable future.


From: Mathias Gaunard on 17 Feb 2010 08:14

On 17 Feb, 15:40, german diago <germandi...(a)gmail.com> wrote:

> The length() member function says it returns the number of char16_t,
> char32_t or chars in a string, depending on the basic character type.

Of course, since u16string is simply basic_string<char16_t>.
It's just a means of storing Unicode, and no string operation is
Unicode-aware.

> I think a standard function to calculate the number
> of "symbols", not the number of chars of a string, should be included,
> maybe with another name, since length should be kept for
> compatibility.

And what purpose would that function serve, alone?
Ideally you would need a whole set of Unicode support primitives.

Also, you might be misguided in thinking that a Unicode code point is a
"symbol". A grapheme is closer to that idea, and can be made of an
arbitrary number of code points (or rather, up to 32 if you restrict
yourself to stream-safe Unicode strings).


From: CornedBee on 17 Feb 2010 08:13

On Feb 17, 4:40 pm, german diago <germandi...(a)gmail.com> wrote:
> But the number of chars that a symbol is encoded in, at least for
> utf-8 encoding (and I believe it's also true for utf-16) is variable.
> So these functions don't return the real number of symbols in each
> string, but the number of chars, depending on the size of the char.
> So to calculate the real number of symbols, you cannot rely on a
> standard function. I think a standard function to calculate the number
> of "symbols", not the number of chars of a string, should be included,
> maybe with another name, since length should be kept for
> compatibility.

UTF-16 is also variable-length, yes.
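
A small sketch of why: code points above U+FFFF take two char16_t units
(a surrogate pair). The function below is a hypothetical helper, not
anything in the draft, and it assumes well-formed UTF-16 with no
unpaired surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the standard library.
// Assumes s holds well-formed UTF-16 (every surrogate is paired).
std::size_t count_code_points_utf16(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        // Trail (low) surrogates occupy 0xDC00..0xDFFF and never start
        // a code point, so count every unit outside that range.
        if (u < 0xDC00 || u > 0xDFFF) {
            ++n;
        }
    }
    return n;
}
```

For u"a\U0001D11E", length() gives 3 while this function gives 2.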

The problem is that providing a function that calculates the number of
code points opens a huge can of worms. The moment you do it, people
will start asking about the number of graphemes, and about
normalization forms, and within minutes you're looking at the job of
implementing Unicode collation.

C++0x simply doesn't have time for that.

