From: Mathias Gaunard on
On Jul 23, 12:49 am, "Martin B." <0xCDCDC...(a)gmx.at> wrote:

> * No unicode aware string class

And exactly what would it do that would be of any use?


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Stanley Friesen on
"Martin B." <0xCDCDCDCD(a)gmx.at> wrote:

>Stanley Friesen wrote:
>> "joe" <jc1996(a)att.net> wrote:
>>
>>> Francis Glassborow wrote:
>>>> joe wrote:
>>> [...]
>>>> Anyway this has got very far from C++ where we certainly do need a way
>>>> to handle text in more than just American English.
>>> Not far at all from C++ given that it has lame support for Unicode,
>>
>> In C++0X there is actually considerable support. It allows many
>> non-punctuation characters in identifiers (e.g. variable names, class
>> names &c.). It provides conversions between the three main
>> representations (UTF-8, UTF-16, and UTF-32). It at least allows for
>> tailorable Unicode collation. The only thing it lacks that I see as a
>> substantial issue is UTF-16 and/or UTF-32 iostreams. This is
>> unfortunate, as both Windows and modern Unix support such files at the
>> OS level.
>
>As I see it, some support is added for better handling of unicode at
>compile time. (Uni character literals, charXX_t, etc.)
>
>We are left with the same mess we always had at runtime. (modulo
>char32_t, maybe):
>* No unicode aware string class

Support for u16string and u32string seems sufficient for low level
purposes, especially combined with conversions between the various
formats, and collation support. I am not sure that the *language*
should mandate much more than this. More complex Unicode processing is
generally task-specific. An editor has different needs than a Web
browser, for instance. (Also I think the "ctype" functionality in the
Unicode character traits classes has to apply proper Unicode semantics).

>* No way to tell what character set a char* is encoded in (and this will
>get worse with compile-time u8 constants).
>* std::exception works only with char*

Which still allows UTF-8 strings.
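To illustrate the point about std::exception, here is a minimal sketch, assuming the program simply agrees by convention that what() carries UTF-8. The function name and message text are made up for the example; the standard says nothing about the encoding of what(), so the convention is entirely the programmer's responsibility.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: throws a runtime_error whose message contains
// UTF-8 bytes and returns whatever what() reports after catching it.
// 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (e with acute accent).
std::string roundtrip_utf8_message() {
    const std::string msg = "caf\xC3\xA9" " not found";
    try {
        throw std::runtime_error(msg);
    } catch (const std::exception& e) {
        // what() returns const char*; the bytes pass through untouched,
        // but nothing records that they are UTF-8.
        return e.what();
    }
}
```

The bytes survive the trip, which supports the reply above, but the receiving side still has to know out of band that they are UTF-8, which is the original complaint.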
--
The peace of God be with you.

Stanley Friesen


From: Stanley Friesen on
Mathias Gaunard <loufoque(a)gmail.com> wrote:

>On Jul 22, 1:26 pm, Stanley Friesen <sar...(a)friesen.net> wrote:
>
>> It provides conversions between the three main
>> representations (UTF-8, UTF-16, and UTF-32).
>
>Not really in a way that is practical to use though.
>
Well, sstreams (string streams) should provide that capability, even if
that is a trifle clumsy.
>
>> The only thing it lacks that I see as a
>> substantial issue is UTF-16 and/or UTF-32 iostreams. This is
>> unfortunate, as both Windows and modern Unix support such files at the
>> OS level.
>
>basic_istream<char16_t> etc. should work just fine.

That will read or write a UTF-8 file, not a UTF-16/UTF-32 file. The
specification is quite clear - it is required to apply the appropriate
codecvt facet.
>
>
>> But any decent development
>> environment will allow actual Unicode source files, and apply the as-if
>> rule to treat valid non-ASCII characters identically to the escape
>> codes.
>
>So GCC, the most widely used C and C++ compiler, is not a decent
>development environment?

It is not a development environment at all, it is just a compiler. A
development environment includes build configuration, syntax-aware
editing, syntax-aware searches and so on. Still, I think it would be a
very useful improvement to allow it to accept UTF-8 text files, at the
very least.

>As was clearly stated in the parent message, GCC only supports
>inputting unicode characters in identifiers as escape codes.

I understand. I also do not consider GCC's C++0X support complete as of
now. Heck, the *standard* isn't even official yet, and there have been
significant changes to it in the last 6 months that GCC cannot possibly
have implemented yet. And last I checked, GCC's documentation made no
claim to implement the entirety even of the draft standard at that time.

So, before we judge it, let us wait until the GNU people claim full
support of the final standard.
--
The peace of God be with you.

Stanley Friesen


From: Mathias Gaunard on
On Jul 24, 2:55 pm, Stanley Friesen <sar...(a)friesen.net> wrote:
> Mathias Gaunard <loufo...(a)gmail.com> wrote:
> >On Jul 22, 1:26 pm, Stanley Friesen <sar...(a)friesen.net> wrote:
>
> >> It provides conversions between the three main
> >> representations (UTF-8, UTF-16, and UTF-32).
>
> >Not really in a way that is practical to use though.
>
> Well, sstreams (string streams) should provide that capability, even if
> that is a trifle clumsy.

String streams do not invoke codecvt facets; only file streams do.
Also note that most current implementations do not allow N-to-M
conversion with codecvt facets, and only allow one-way 1-to-N
(fixed-width in memory, variable-width in the file), so I'd be quite
careful about this.

The alternative is applying the codecvt facet directly, which has a
fairly ugly interface and requires static contiguous buffers.
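For readers who have not called a codecvt facet by hand, the following is a rough sketch of what that looks like. To keep it portable it uses the wchar_t facet of whatever locale is passed in rather than a Unicode-specific facet, so it shows only the shape of the interface; the function name widen_with_codecvt is an assumption made for this example.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Widen a narrow string by calling the locale's codecvt facet directly.
// in() takes seven arguments: a conversion state, the input range plus an
// out-parameter saying where reading stopped, and the same trio for the
// output buffer. Both buffers must be contiguous arrays.
std::wstring widen_with_codecvt(const std::string& in, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    if (in.empty()) return std::wstring();

    const facet_type& cvt = std::use_facet<facet_type>(loc);
    std::wstring out(in.size(), L'\0');  // at most one wchar_t per input byte
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;

    const std::codecvt_base::result r = cvt.in(
        state,
        in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    // Anything other than a complete conversion is treated as failure here.
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("codecvt conversion failed");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Calling widen_with_codecvt("abc", std::locale::classic()) should yield L"abc". Handling the partial result properly (refilling buffers and calling in() again) would add considerably more code, which is part of why the interface feels awkward.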

What we truly need is an iterator-based interface, that basically
behaves like std::copy, or better yet, iterator adaptors that convert
as you iterate.
But that's not sufficient, you also need ways to segment strings
(graphemes, words, sentences), do normalization, case conversion, etc.
None of which is anywhere near possible in C++0x.
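As a sketch of the std::copy-like interface described above, a UTF-8 to UTF-32 decoder over arbitrary iterators could look roughly like this. The name utf8_to_utf32 and the choice to throw std::invalid_argument on malformed input are assumptions for illustration, not anything proposed in the thread or present in C++0x. It covers only transcoding, not segmentation, normalization or case conversion.

```cpp
#include <cassert>
#include <iterator>
#include <stdexcept>
#include <string>

// Decode UTF-8 from [first, last) and write one char32_t code point per
// decoded sequence to out, in the style of std::copy. Returns the advanced
// output iterator. Throws std::invalid_argument on malformed input.
template <class InputIt, class OutputIt>
OutputIt utf8_to_utf32(InputIt first, InputIt last, OutputIt out) {
    while (first != last) {
        const unsigned char lead = static_cast<unsigned char>(*first++);
        char32_t cp = 0;
        char32_t min = 0;  // smallest code point allowed for this length
        int trail = 0;     // number of continuation bytes expected
        if (lead < 0x80)                      { cp = lead;        trail = 0; min = 0; }
        else if (lead >= 0xC0 && lead < 0xE0) { cp = lead & 0x1F; trail = 1; min = 0x80; }
        else if (lead >= 0xE0 && lead < 0xF0) { cp = lead & 0x0F; trail = 2; min = 0x800; }
        else if (lead >= 0xF0 && lead < 0xF8) { cp = lead & 0x07; trail = 3; min = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        for (int i = 0; i < trail; ++i) {
            if (first == last)
                throw std::invalid_argument("truncated UTF-8 sequence");
            const unsigned char c = static_cast<unsigned char>(*first++);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, surrogates and values past U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("overlong or out-of-range UTF-8 sequence");
        *out++ = cp;
    }
    return out;
}
```

Because it only needs input and output iterators, it works with std::back_inserter on a std::u32string, with stream iterators, or with any container, and needs no contiguous buffer.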


> >> The only thing it lacks that I see as a
> >> substantial issue is UTF-16 and/or UTF-32 iostreams. This is
> >> unfortunate, as both Windows and modern Unix support such files at the
> >> OS level.
>
> >basic_istream<char16_t> etc. should work just fine.
>
> That will read or write a UTF-8 file, not a UTF-16/UTF-32 file. The
> specification is quite clear - it is required to apply the appropriate
> codecvt facet.

That's not a problem at the stream level, but at the filebuf level.
File streams invoke codecvt facets to convert from their type to char
because filebufs are char-based.


> >So GCC, the most widely used C and C++ compiler, is not a decent
> >development environment?
>
> It is not a development environment at all, it is just a compiler. A
> development environment includes build configuration, syntax-aware
> editing, syntax-aware searches and so on.

Looks like you only know the world of software development as you see
it through your Microsoft Visual Studio window.



> >As was clearly stated in the parent message, GCC only supports
> >inputting unicode characters in identifiers as escape codes.
>
> I understand. I also do not consider GCCs C++0X support complete as of
> now.

You said that any decent development environment that exists supports
it NOW.
I'm just confronting you with your inaccurate statements.



From: Martin B. on
On 23.07.2010 23:43, Mathias Gaunard wrote:
> On Jul 23, 12:49 am, "Martin B."<0xCDCDC...(a)gmx.at> wrote:
>
>> * No unicode aware string class
>
> And exactly what would it do that would be of any use?
>

Like, make working with "normal" strings (as opposed to performance
relevant data-crunching strings) a no-brainer?

Like, provide a clear, easy and efficient interface to work with unicode
strings.
Clear like:
* If I have an object of such a class I *know* it is a valid unicode
string and not some locale-, system-, or implementation-defined
character array mumbo jumbo.
* No way to implicitly convert it to and from any character (array) type
without clearly specifying what encoding to use for this.
Efficient and easy like:
* The internal representation is configurable and it's efficient to
extract a primitive-type-array of the internal representation but the
normal joe-programmer doesn't have to care about the internal representation.
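A minimal sketch of a class along these lines might look as follows. Everything here (the name unicode_string, the from_utf8/to_utf8 members, the fixed UTF-32 storage) is hypothetical and chosen for brevity; in particular the storage is hard-wired rather than configurable, which simplifies the last bullet.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical sketch: an object of this type always holds valid Unicode
// scalar values, and bytes go in or out only via a named encoding.
class unicode_string {
public:
    // Validating named constructor; throws std::invalid_argument on bad UTF-8.
    static unicode_string from_utf8(const std::string& bytes) {
        std::u32string cps;
        std::size_t i = 0;
        while (i < bytes.size()) {
            const unsigned char lead = static_cast<unsigned char>(bytes[i++]);
            char32_t cp = 0, min = 0;
            int trail = 0;
            if (lead < 0x80)                      { cp = lead; }
            else if (lead >= 0xC0 && lead < 0xE0) { cp = lead & 0x1F; trail = 1; min = 0x80; }
            else if (lead >= 0xE0 && lead < 0xF0) { cp = lead & 0x0F; trail = 2; min = 0x800; }
            else if (lead >= 0xF0 && lead < 0xF8) { cp = lead & 0x07; trail = 3; min = 0x10000; }
            else throw std::invalid_argument("invalid UTF-8 lead byte");
            for (int k = 0; k < trail; ++k) {
                if (i == bytes.size()) throw std::invalid_argument("truncated UTF-8");
                const unsigned char c = static_cast<unsigned char>(bytes[i++]);
                if ((c & 0xC0) != 0x80) throw std::invalid_argument("bad continuation byte");
                cp = (cp << 6) | (c & 0x3F);
            }
            if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
                throw std::invalid_argument("overlong or out-of-range sequence");
            cps.push_back(cp);
        }
        return unicode_string(cps);
    }

    // Explicit export; there is deliberately no conversion operator.
    std::string to_utf8() const {
        std::string out;
        for (std::size_t i = 0; i < data_.size(); ++i) {
            const char32_t cp = data_[i];
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }

    std::size_t code_points() const { return data_.size(); }
    // Cheap read-only access to the internal array for performance code.
    const std::u32string& raw() const { return data_; }

private:
    explicit unicode_string(const std::u32string& d) : data_(d) {}
    std::u32string data_;
};
```

The private constructor is what enforces the first two bullets: the only way to obtain an instance is through a function that names the encoding and validates the input.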

cheers,
Martin
