From: Chris Vine on
Kirit Sælensminde wrote:

> If you're using narrow character sequences in your code then I, as the
> person who _runs_ the software, get to decide how your string was
> encoded. Not you as the person who _wrote_ it. Using wide character
> strings is the only choice you have if you actually want to control
> what is in the string. It's a wonder to me that any of it works at all.

If that were true you would be right, but it is not. If you write string
literals then the author decides what the codeset is, for obvious reasons
(it is hard-coded into the binary). If narrow strings are being imported
from elsewhere (for example via NLS) then you can choose the codeset
in the call to bind_textdomain_codeset(); and because library interfaces
will specify the codeset to be used with them, you must do so. For keyboard
input, which is delivered in a form that depends on the user's locale, it
is for the writer of the code to ensure it is converted to the codeset
expected by the libraries with which it interfaces.
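
A minimal sketch of the NLS case (assuming GNU gettext/libintl; the
domain name "myapp" and the catalogue directory are invented for
illustration):

#include <clocale>
#include <cstdio>
#include <libintl.h>

int main()
{
    // Select the user's locale for message-catalogue lookup.
    std::setlocale(LC_ALL, "");

    // Hypothetical translation domain and catalogue directory.
    bindtextdomain("myapp", "/usr/share/locale");

    // The author of the code, not the user's locale, fixes the codeset
    // in which translated strings are handed back to the program.
    bind_textdomain_codeset("myapp", "UTF-8");
    textdomain("myapp");

    // gettext() now returns the translation re-encoded as UTF-8,
    // whatever codeset the message catalogue itself was written in.
    std::puts(gettext("Hello, world"));
}

The same applies at the keyboard-input boundary: convert (with iconv(3)
or whatever the platform provides) into the codeset the rest of the
program and its libraries expect, as early as possible.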

Even with wide characters, it is library specifications, not the user, that
will decide the codeset. It may be UTF-16, UTF-32, or something else.

Chris


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: hdante on
Why do people keep repeating that UTF-16 is a fixed-width Unicode
encoding? I'm not getting into the normalized/denormalized characters
discussion here, but even the most basic UTF-16 requires either 16 or 32
bits to encode a single Unicode character. The old 2-byte fixed-width
encoding was UCS-2, and I don't even know whether it is still in use.
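
A small sketch of the point (mine, added for illustration): a code point
outside the Basic Multilingual Plane needs a surrogate pair, so a single
Unicode character occupies either one or two 16-bit code units in UTF-16.

#include <cstdint>
#include <cstdio>
#include <vector>

// Encode one Unicode code point as UTF-16 code units (ignores the
// invalid D800-DFFF range for brevity).
std::vector<std::uint16_t> utf16_encode(std::uint32_t cp)
{
    if (cp < 0x10000)                         // BMP: one 16-bit unit
        return { static_cast<std::uint16_t>(cp) };

    cp -= 0x10000;                            // otherwise: surrogate pair
    return { static_cast<std::uint16_t>(0xD800 | (cp >> 10)),
             static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)) };
}

int main()
{
    // U+0065 LATIN SMALL LETTER E: 1 unit.
    std::printf("U+0065  -> %zu code unit(s)\n", utf16_encode(0x0065).size());
    // U+1D11E MUSICAL SYMBOL G CLEF: 2 units (D834 DD1E).
    std::printf("U+1D11E -> %zu code unit(s)\n", utf16_encode(0x1D11E).size());
}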

Bronek Kozicki wrote:
> jrm wrote:
> > std::wstring might not be a good idea according to the details section
> > here from ustring class:
>
> why not? std::wstring is typically implemented on top of the Unicode support
> of the target platform, and the character type used is typically some
> fixed-width Unicode encoding, like UTF16 (on Windows) or UTF32 (on Linux; I
> do not know about other flavours of Unix). UTF8 is not a character type
> (neither UTF16 nor UTF32 are, but at least they are fixed width, so they can
> map to wchar_t) but a fancy encoding. And the typical place for handling data
> encoding is not data processing, but input/output. Anything that can be
> represented in UTF8 can also be represented in UTF32 and in UTF16 (or almost
> anything - there are surrogates to compensate for shorter characters in
> UTF16, but I'm not sure how much value they provide)


From: Pete Becker on
Alf P. Steinbach wrote:

> {This thread is drifting too far off topic, follow-ups are likely to be
> rejected unless they include Standard C++ relevant content. -mod/fwg}
>
> * Pete Becker:
>
>> Alf P. Steinbach wrote:
>>
>>> As a practical matter consider wide character literals.
>>>
>>> FooW( L"Hello, hello?" );
>>>
>>> where FooW is some API function.
>>>
>>> If wchar_t isn't what the API level expects, this won't compile.
>>>
>>
>> Why should it compile? The idiomatic way to write this is
>>
>> Foo(_T("Hello, hello?"));
>>
>
>
> I'm sorry, no.
>
> I'm not sure how topical this is, but we're talking about a library (the
> Windows API)

No, it's an operating system interface. Libraries can impose whatever
application-level policies their writers like; if you don't want to
write things their way you don't have to use the library. OS interfaces
should be far less restrictive. Some application writers may prefer wide
characters that are wider than 16 bits, and the OS interface should not
make that significantly harder.

>
>> Once you abandon that indirection layer
>> you're locked into a specific choice of type.
>
>
> I'm sorry, no. There is no indirection layer (an indirection layer
> would be what many higher level libraries provide through smart string
> classes).

I'm the one who used the term, so you don't get to redefine it. An
indirection layer is anything that lets you change the meaning of code
that uses it, without rewriting that code.

> There is however a C/C++ "choice" layer, if a set of macros
> might be called a layer, choosing between only wchar_t or only char,
> which does not work for functions that come in only one variant, and
> which in modern versions of the OS has no benefits; in particular, the
> choice layer does not remove the coupling between wchar_t and a
> particular OS-defined size.
>

Macros can provide a far broader range of functionality than merely
choosing between char and wchar_t. The fact that the MS _T macro only
does these two is not inherent in macros, but is the result of a
decision made by the writer of that macro. It would be possible to write
a macro that converted literal character strings into a
compiler-specific array type that could be passed to the Windows API.
That approach would not impose an application-wide requirement that
wchar_t be 16 bits wide.
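
A rough sketch of that kind of decoupling (my illustration, not the
actual _T macro; FooW and to_utf16 are stand-in names): a helper
re-encodes a wide string, whatever width wchar_t happens to have, into
16-bit UTF-16 code units, so nothing outside the call site depends on
wchar_t being 16 bits.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: re-encode a wide string as UTF-16 code units.
// Assumes wchar_t holds Unicode code points when it is 32 bits wide;
// when wchar_t is already 16 bits the values pass through unchanged.
std::vector<std::uint16_t> to_utf16(const std::wstring& ws)
{
    std::vector<std::uint16_t> out;
    for (std::uint32_t cp : ws) {
        if (cp < 0x10000) {
            out.push_back(static_cast<std::uint16_t>(cp));
        } else {                               // needs a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
        }
    }
    out.push_back(0);                          // NUL terminator for C-style APIs
    return out;
}

// Stand-in for a UTF-16 API function; a real interface would live elsewhere.
void FooW(const std::uint16_t* s)
{
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    std::printf("received %zu UTF-16 code units\n", n);
}

int main()
{
    // The literal stays a plain wide literal; the width of wchar_t no
    // longer leaks into the interface type.
    FooW(to_utf16(L"Hello, hello?").data());
}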

To restate my earlier point, which none of the above addresses, it's not
a disaster if FooW(L"Hello, hello?"); doesn't compile. So it's not
self-evident that the argument to a function that traffics in UTF-16
should be wchar_t*.

--

Pete Becker
Roundhouse Consulting, Ltd.


From: Pete Becker on
Eugene Gershnik wrote:
>
> Not at all. Outlook Express informs me that my message was transmitted in
> what it calls "Western European (ISO)" encoding (presumably ISO 8859-1).

What I said about Unicode is correct. The fact that you are now
providing additional information about your software doesn't change
that.

--

Pete Becker
Roundhouse Consulting, Ltd.


From: David J. Littleboy on
{This thread is getting way off topic, though encodings are important to
C++ and not just for wchar_t. That is why I have let this one through.
But please try to make the C++ content less implicit. -mod/fg}

"Eugene Gershnik" <gershnik(a)hotmail.com> wrote:
> Pete Becker wrote:
>> Eugene Gershnik wrote:
>>> Eugene Gershnik wrote:
>>>
>>>> Well here is an AFAIK correctly formed Unicode sequence that means é
>>>>
>>>> U+0065 U+0301
>>>>
>>>> All my editors seem to agree with that.
>>>
>>>
>>> And unintentionally this was also a good demonstration of how broken
>>> modern software is with regard to Unicode. My NNTP client
>>> (Microsoft Outlook Express) had correctly shown the character as é
>>> while editing but transmitted it as e´, as you can see above. This is
>>> despite being a Windows-only application that presumably uses UTF-16
>>> wchar_t internally.
>>
>> That seems like how it ought to work. U+0065 is LATIN SMALL LETTER E,
>> and U+0301 is COMBINING ACUTE ACCENT. They're two distinct characters,
>> which is why they're written that way and transmitted that way.
>
> Not at all. Outlook Express informs me that my message was transmitted in
> what it calls "Western European (ISO)" encoding (presumably ISO 8859-1).

Outlook Express gives you the option to set the encoding you transmit in.
Two of those options are Unicode. If you don't select one of those, then it
has to convert your text.

> How the Unicode sequence in its editor is converted to this encoding is up
> to the application, but a reasonable user expectation is that what looks
> like é should be transmitted as é.

One man's "reasonable user expectation" is another's unacceptable
abomination. Just because you can't see a reason for transmitting as two
characters doesn't mean there isn't one. In particular, there are a lot of
combining characters in Unicode, most of which can't be encoded in "Western
European". So there simply isn't any general solution to the problem. The
whole point of Unicode is to remove as many assumptions as possible from the
encoding level, and you are asking it to make assumptions.
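
To keep the C++ angle explicit (my sketch, not from the original posts):
the two spellings under discussion display as the same glyph but are
different code unit sequences, and the standard library performs no
canonicalization, so comparisons and encoding conversions see two
different strings unless something like ICU normalizes them first.

#include <cstdio>
#include <string>

int main()
{
    // Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
    std::wstring composed   = L"\u00E9";

    // Decomposed form: LATIN SMALL LETTER E followed by
    // U+0301 COMBINING ACUTE ACCENT.
    std::wstring decomposed = L"e\u0301";

    // Both normally render as the single glyph é, but as wchar_t
    // sequences they differ, so a plain comparison reports them unequal.
    std::printf("lengths: %zu vs %zu\n", composed.size(), decomposed.size());
    std::printf("equal:   %s\n", composed == decomposed ? "yes" : "no");
}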

> Instead OE transmitted the sequence as two distinct characters, e and ´.
> This is *not* how it is supposed to work. What is supposed to happen is
> that an application canonicalizes the string prior to doing encoding
> conversions. Which it obviously didn't.

See above.

>> For display,
>> they combine to represent the single glyph that the editor shows. If
>> you want that glyph to be represented by a single character you have
>> to canonicalize the character sequence,
>
> I in this context am the *user* of the application. I type characters in
> my WYSIWYG editor and press the "Send" button. I am not supposed to know
> what canonicalization is, much less to do it manually. It is the
> application which is supposed to do it transparently for me. If it
> doesn't, it is broken.

Again, that's _your_ desire. There are a lot of other users out there. Some
of us speak an Oriental language or two, and realize that all bets are off
if you change encodings.

David J. Littleboy
Tokyo, Japan


