From: Martin B. on
On 24.07.2010 15:55, Stanley Friesen wrote:
> "Martin B."<0xCDCDCDCD(a)gmx.at> wrote:
>> Stanley Friesen wrote:
>>> "joe"<jc1996(a)att.net> wrote:
>>>
>>>> Francis Glassborow wrote:
>>>>> joe wrote:
>>>> [...]
>>>>> Anyway this has got very far from C++ where we certainly do need a way
>>>>> to handle text in more than just American English.
>>>> Not far at all from C++ given that it has lame support for Unicode,
>>>
>>> In C++0X there is actually considerable support. It allows many
>>> [...]
>>
>> As I see it, some support is added for better handling of Unicode at
>> compile time. (Unicode character literals, charXX_t, etc.)
>>
>> We are left with the same mess we always had at runtime. (modulo
>> char32_t, maybe):
>> [...]
> [...]
>> * No way to tell what character set a char* is encoded in (and this will
>> get worse with compile-time u8 constants).
>> * std::exception works only with char*
>
> Which still allows UTF-8 strings.

Yes, std::exception allows for UTF-8 strings. It already does so today;
C++0x doesn't add anything in this regard.
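
To make the point concrete, here is a minimal sketch (the function name
utf8_message_roundtrip and the message text are made up for illustration).
It throws a std::runtime_error whose message happens to be UTF-8 and shows
that what() returns the same bytes, with nothing in the type recording
which encoding they are in:

```cpp
#include <cassert>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper: throws with a UTF-8 encoded message and returns
// what() as seen by the handler. "\xC3\xA9" is U+00E9 encoded in UTF-8.
inline std::string utf8_message_roundtrip() {
    const char msg[] = "caf\xC3\xA9 not found";
    try {
        throw std::runtime_error(msg);
    } catch (const std::exception& e) {
        // what() is a plain const char*; the bytes pass through untouched,
        // but the handler has to know out of band that they are UTF-8.
        assert(std::strlen(e.what()) == sizeof msg - 1);
        return e.what();
    }
}
```

This works the same way in C++03, which is why nothing new is gained here.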

cheers,
Martin

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Stanley Friesen on
Mathias Gaunard <loufoque(a)gmail.com> wrote:

>On Jul 24, 2:55 pm, Stanley Friesen <sar...(a)friesen.net> wrote:
>> Mathias Gaunard <loufo...(a)gmail.com> wrote:
>> >On Jul 22, 1:26 pm, Stanley Friesen <sar...(a)friesen.net> wrote:
>>
>> >> It provides conversions between the three main
>> >> representations (UTF-8, UTF-16, and UTF-32).
>>
>> >Not really in a way that is practical to use though.
>>
>> Well, sstreams (string streams) should provide that capability, even if
>> that is a trifle clumsy.
>
>string streams do not invoke codecvt facets, only file streams do.

Oops, I forgot that point.

>Also note most current implementations do not allow N to M conversion
>with codecvt facets, and only allow one-way 1 to N (in-memory fixed
>width, in-file variable-width), so I'd be quite careful about this.

I challenge this one, however. The standard facet codecvt<char16_t,
char, mbstate_t> is required to support conversions between UTF-16 and
UTF-8 (22.4.1.4, para 3). Failure to properly convert surrogate pairs is
a failure to support UTF-16, as that is the difference between UTF-16
and UCS-2. And the draft standard clearly incorporates the distinction,
since the "extra" facet codecvt_utf8<Elem> is explicitly specified to
convert to and from either UCS-2 or UCS-4 (depending on Elem).
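
For reference, a rough sketch of applying that facet directly, assuming an
implementation that provides codecvt<char16_t, char, mbstate_t> in the
classic locale (the facet was later deprecated in C++20, so a newer
compiler may warn). The wrapper name utf16_to_utf8 is made up, and error
handling is reduced to returning an empty string:

```cpp
#include <cassert>
#include <cwchar>   // std::mbstate_t
#include <locale>
#include <string>

// Hypothetical wrapper: converts UTF-16 to UTF-8 by calling the facet's
// out() member on contiguous buffers. Returns "" on any conversion error.
inline std::string utf16_to_utf8(const std::u16string& in) {
    typedef std::codecvt<char16_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(std::locale::classic());

    if (in.empty()) return std::string();

    // Worst case is 3 bytes per UTF-16 code unit; a surrogate pair is
    // 2 units and produces 4 bytes, so this bound still holds.
    std::string out(in.size() * 3, '\0');
    std::mbstate_t state = std::mbstate_t();
    const char16_t* from_next = 0;
    char* to_next = 0;
    const std::codecvt_base::result r =
        cvt.out(state, in.data(), in.data() + in.size(), from_next,
                &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok) return std::string();

    assert(from_next == in.data() + in.size());  // all input was consumed
    out.resize(static_cast<std::string::size_type>(to_next - &out[0]));
    return out;
}
```

A conforming facet should turn the surrogate pair for U+1F600 into the
four bytes F0 9F 98 80; an implementation that only handles UCS-2 would
not.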
>
>The alternative is applying the codecvt facet directly, which has a
>fairly ugly interface and requires static contiguous buffers.

Yes, I agree it is a touch clumsy. The best way to use it would be to
wrap it in a simplified library interface.
>
>What we truly need is an iterator-based interface, that basically
>behaves like std::copy, or better yet, iterator adaptors that convert
>as you iterate.

Hmm, this may be tricky to specify, given the nature of the conversions.
Dereferencing such an iterator would have to resolve to some sort of
container (e.g. a specialization of basic_string), as there is no
guarantee that the result will be a single code unit.
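
The std::copy-style variant at least sidesteps the dereference question,
because the algorithm may write several code units to the output iterator
per input element. A minimal sketch, with made-up names encode_utf8 and
utf32_to_utf8, that assumes the input is already valid UTF-32 and does no
validation:

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Hypothetical helper: writes the UTF-8 encoding of one code point to out.
// Assumes cp is a valid Unicode scalar value.
template <class OutputIt>
OutputIt encode_utf8(char32_t cp, OutputIt out) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical std::copy-like algorithm: one input element may produce
// one to four output elements.
template <class InputIt, class OutputIt>
OutputIt utf32_to_utf8(InputIt first, InputIt last, OutputIt out) {
    for (; first != last; ++first) out = encode_utf8(*first, out);
    return out;
}
```

Called with std::back_inserter on a std::string, this needs no static
contiguous buffer on either side.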

>But that's not sufficient, you also need ways to segment strings
>(graphemes, words, sentences), do normalization, case conversion, etc.
>None of which is anywhere near possible in C++0x.

And I maintain that they are beyond the scope of the C++ standard. These
are things, I think, that should be supplied as domain libraries, since
different systems may well require different performance trade-offs.
>
>
>> >> The only thing it lacks that I see as a
>> >> substantial issue is UTF-16 and/or UTF-32 iostreams. This is
>> >> unfortunate, as both Windows and modern Unix support such files at the
>> >> OS level.
>>
>> >basic_istream<char16_t> etc. should work just fine.
>>
>> That will read or write a UTF-8 file, not a UTF-16/UTF-32 file. The
>> specification is quite clear - it is required to apply the appropriate
>> codecvt facet.
>
>That's not a problem at the stream level, but at the filebuf level.
>File streams invoke codecvt facets to convert from their type to char
>because filebufs are char-based.
>
Though this means one would need to instantiate one's own type of file
buffers to get basic_istream<char16_t> to actually input from a UTF-16
external file. This goes beyond merely clumsy to manifestly
labyrinthine. This is, I maintain, something that *should* be
standardized in the language, as it is widely useful, difficult to get
right, and has few design issues that would make alternative
implementations useful.
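
To show what one ends up writing by hand today, here is a sketch that
bypasses codecvt and filebuf altogether and decodes UTF-16 from a byte
stream opened in binary mode. The function name read_utf16 and the
BOM policy (FF FE or FE FF selects the byte order, little-endian if there
is no BOM) are assumptions of this sketch, not anything the draft
specifies:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads UTF-16 code units from a binary byte stream.
// A leading BOM picks the byte order and is not copied to the result;
// without a BOM the data is assumed to be little-endian.
// A trailing odd byte is silently dropped.
inline std::u16string read_utf16(std::istream& in) {
    std::u16string out;
    bool big_endian = false;
    bool first = true;
    char b[2];
    while (in.read(b, 2)) {
        const unsigned b0 = static_cast<unsigned char>(b[0]);
        const unsigned b1 = static_cast<unsigned char>(b[1]);
        if (first) {
            first = false;
            if (b0 == 0xFE && b1 == 0xFF) { big_endian = true; continue; }
            if (b0 == 0xFF && b1 == 0xFE) { continue; }
        }
        const unsigned unit = big_endian ? (b0 << 8) | b1 : (b1 << 8) | b0;
        out.push_back(static_cast<char16_t>(unit));
    }
    return out;
}
```

It works on a std::ifstream opened with std::ios::binary or on a
std::istringstream, but it yields a string rather than a
basic_istream<char16_t>, which is exactly the gap described above.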
>
>> >So GCC, the most widely used C and C++ compiler, is not a decent
>> >development environment?
>>
>> It is not a development environment at all, it is just a compiler. A
>> development environment includes build configuration, syntax-aware
>> editing, syntax-aware searches and so on.
>
>Looks like you only know the world of software development as you see
>it through your Microsoft Visual Studio window.
>
No. That also describes the Tornado development environment for
VxWorks, and its successor, as well as the development environment for
several other similar OS's.

There is also Eclipse, which is an OS- and compiler-independent
development environment. It can be configured to use gcc as the
compiler.
>
>
>> >As was clearly stated in the parent message, GCC only supports
>> >inputting Unicode characters in identifiers as escape codes.
>>
>> I understand. I also do not consider GCC's C++0X support complete as of
>> now.
>
>You said that any decent development environment that exists supports
>it NOW.

My point is that it is too early to judge if the lack of support for
directly encoded Unicode-extended identifiers is going to remain true of
complete C++0X implementations. It is clear the draft standard was
written to permit such an implementation under the as-if rule (it even
explicitly says this is allowed). Whether, once the new standard is
approved and final, any major vendors will actually implement that
feature remains to be seen.
--
The peace of God be with you.

Stanley Friesen


From: Mathias Gaunard on
On Jul 29, 4:34 am, Stanley Friesen <sar...(a)friesen.net> wrote:

> Hmm, this may be tricky to specify, given the nature of the conversions.
> Dereferencing such an iterator would have to resolve to some sort of
> container (e.g. a specialization of basic_string), as there is no
> guarantee that the result will be a single code.

Huh? Just return the results in multiple iteration steps.
Iterator adaptors do not have to be one-to-one...

See my Unicode library if you want examples.
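
A minimal sketch of the idea (not taken from that library; the class name
utf8_encoding_iterator is made up, and the input is assumed to be valid
UTF-32). Each dereference yields one UTF-8 byte, and the underlying
iterator only advances once all bytes of the current code point have been
handed out:

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical input-iterator adaptor over a range of char32_t values.
// One code point may take up to four increments of the adaptor.
template <class It>
class utf8_encoding_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef char value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const char* pointer;
    typedef char reference;

    utf8_encoding_iterator(It pos, It end)
        : pos_(pos), end_(end), buf_(), len_(0), idx_(0) { load(); }

    char operator*() const { assert(idx_ < len_); return buf_[idx_]; }

    utf8_encoding_iterator& operator++() {
        if (++idx_ == len_) { ++pos_; load(); }  // code point exhausted
        return *this;
    }
    utf8_encoding_iterator operator++(int) {
        utf8_encoding_iterator tmp(*this);
        ++*this;
        return tmp;
    }
    friend bool operator==(const utf8_encoding_iterator& a,
                           const utf8_encoding_iterator& b) {
        return a.pos_ == b.pos_ && a.idx_ == b.idx_;
    }
    friend bool operator!=(const utf8_encoding_iterator& a,
                           const utf8_encoding_iterator& b) {
        return !(a == b);
    }

private:
    void put(unsigned v) { buf_[len_++] = static_cast<char>(v); }

    // Encodes the code point at pos_ into buf_, or leaves it empty at end_.
    void load() {
        len_ = idx_ = 0;
        if (pos_ == end_) return;
        const char32_t cp = *pos_;
        if (cp < 0x80) {
            put(cp);
        } else if (cp < 0x800) {
            put(0xC0 | (cp >> 6));
            put(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            put(0xE0 | (cp >> 12));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        } else {
            put(0xF0 | (cp >> 18));
            put(0x80 | ((cp >> 12) & 0x3F));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        }
    }

    It pos_, end_;
    char buf_[4];
    unsigned len_, idx_;
};
```

A pair of these, built as (begin, end) and (end, end), can be passed
straight to the std::string range constructor or to std::copy.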

