The D Programming Language [C++]


From: Walter Bright on 28 Nov 2006 14:23

Nemanja Trifunovic wrote:
>> Third party users of std::string would assume (and correctly so,
>> according to the standard) that there is a one-to-one correspondence
>> between characters and string elements.
>
> I don't know about the standard ("The C++ Standard Library" 14.1.1
> mentions multibyte vs wide strings), but in practice that assumption
> is just silly unless the third party library is explicitly restricted
> to work with ASCII only. In general, multibyte strings (SHIFT_JIS, for
> instance) are often stored in std::string, or char[].

If you're using it to store SHIFT-JIS, that isn't going to interoperate
if you also use it to store UTF-8.

> The
> only scenario where the search is going to fail is if a user tries
> something like
>
> strchr(str, 0x45a);
>
> when searching for Cyrillic "nje" within a UTF-8 encoded string,

Exactly.

> but I've never seen anybody doing such a thing.

I've seen such.

> C and C++ programmers are
> aware that such functions search for "bytes", not "characters".

So they're carefully using a subset of std::string's capabilities based
on knowledge that the full generality of it doesn't work. If it did work
properly, there wouldn't be proposals to add utf8 and utf16 types to C++0x.

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
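
To make the strchr point concrete, here is a minimal sketch that is not taken from either poster. It assumes the Cyrillic "nje" is U+045A, which UTF-8 encodes as the two bytes 0xD1 0x9A, and the helper names are hypothetical. strchr converts its int argument to char, so on mainstream implementations 0x45A is reduced to 0x5A ('Z') and the call looks for the wrong byte, while a substring search for the encoded bytes finds the character.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: substring search for a UTF-8 encoded needle.
// Assumes both strings hold well-formed UTF-8. A character's full
// encoding cannot start in the middle of another character, so a
// byte-wise match is also a character match.
bool contains_utf8(const std::string& haystack, const std::string& needle) {
    assert(!needle.empty());
    return haystack.find(needle) != std::string::npos;
}

// Hypothetical helper mirroring the strchr(str, 0x45a) call from the post.
// strchr converts the int to char, so only the low byte of the code
// point is searched for (0x5A, 'Z', on mainstream implementations).
bool contains_via_strchr(const char* haystack, int code_point) {
    return std::strchr(haystack, code_point) != nullptr;
}
```

Under those assumptions, contains_via_strchr("ko\xD1\x9A", 0x45A) reports no match and contains_via_strchr("Zebra", 0x45A) reports a spurious one, while contains_utf8 with the needle "\xD1\x9A" gets both cases right.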

From: Walter Bright on 28 Nov 2006 14:21

{ This thread has generated a number of mod comments and I'm sorry to
add to that, but again: for follow-ups, please keep in mind the
connection back to C++ (here the library solutions philosophy?). -mod }

Peter Dimov wrote:
> Given that both compilers optimized out the imaginary part completely,
> that neither allocated a complex<> anywhere in memory, and that the x87
> FPU doesn't quite have a concept of a register pair, I'm really not
> sure what you mean. Can you give the D output of the program so that we
> can see what a real compiler should do?

Here's what Digital Mars C++ does, which implements C99 complex numbers:

------------------ program ------------------
#include <complex.h>

complex long double f( complex long double c )
{
    return c;
}
------------------- asm ---------------------------
?f@@YA_W_W@Z:
        fld     tbyte ptr 4[ESP]
        fld     tbyte ptr 0Eh[ESP]
        ret
--------------------------------------------------

Note that even though Digital Mars C++ implements complex as a native
type (per C99), this does NOT prevent it from implementing C++98 complex:

-------------------- program ----------------------
#include <complex>

std::complex<long double> f( std::complex<long double> c )
{
    return c;
}
-------------------- asm --------------------------
?f@@YA?AU?$complex@std@_Z@std@@U12@@Z:
        fld     tbyte ptr 8[ESP]
        mov     EAX,4[ESP]
        fld     tbyte ptr 012h[ESP]
        fxch    ST1
        fstp    tbyte ptr [EAX]
        fstp    tbyte ptr 0Ah[EAX]
        ret
----------------------------------------------------

BTW, the x87 does have a concept of a floating point register pair
(ST1,ST0) just like it does for long longs (EDX,EAX).


From: Nemanja Trifunovic on 28 Nov 2006 16:12

Walter Bright wrote:
> If you're using it to store SHIFT-JIS, that isn't going to interoperate
> if you also use it to store UTF-8.

Of course it won't. I mentioned SHIFT-JIS only to show that storing
multibyte characters in std::string or char[] is nothing new and
specific to utf-8. Users were "always" allowed to assume one-to-one
mapping between bytes and characters only for specific encodings such
as ASCII.

Walter Bright wrote:
> > C and C++ programmers are
> > aware that such functions search for "bytes", not "characters".
>
> So they're carefully using a subset of std::string's capabilities based
> on knowledge that the full generality of it doesn't work. If it did work
> properly, there wouldn't be proposals to add utf8 and utf16 types to C++0x.

That's orthogonal to what we are talking about here. I am saying that
std::string can be used for storing utf-8 encoded strings and doing
string operations on them. In fact, I'm doing that all the time and it
works just fine. The need to recognize new types comes from the fact
that we are moving away from the "legacy encodings" to Unicode and it
is a good idea to have specialized types that would be less general
than std::string, but would offer some Unicode-specific functionality
(conversions, etc).

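
A small sketch of what "storing UTF-8 in std::string and doing string operations on it" looks like in practice. It is an illustration under stated assumptions, not code from the thread: the input is assumed to be well-formed UTF-8, and utf8_code_points is a hypothetical helper. size() and find() keep working, but they count and report bytes, so turning a byte offset into a character position takes an extra pass.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts the code points in the first `byte_count`
// bytes of `s`, assuming well-formed UTF-8. Every byte that is not a
// continuation byte (bit pattern 10xxxxxx) starts a new code point.
std::size_t utf8_code_points(const std::string& s, std::size_t byte_count) {
    assert(byte_count <= s.size());
    std::size_t n = 0;
    for (std::size_t i = 0; i < byte_count; ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For a string holding "nje" (0xD1 0x9A) followed by "ab", size() is 4 while utf8_code_points(s, s.size()) is 3, and find("b") returns byte offset 3, which corresponds to character index 2.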

From: Walter Bright on 28 Nov 2006 19:47

David Abrahams wrote:
> Walter Bright <walter(a)digitalmars-nospamm.com> writes:
>
>> David Abrahams wrote:
>>> But then, I've never insisted on the ability to redefine
>>> syntax and tokens, and I don't even believe it's necessary in order to
>>> achieve the kind of flexibility and power I'm describing.
>> How would you do it, then?
>
> Several features could come together to do the job: excellent
> first-class constant folding

That's doable.

> and first-class code block/expression
> types come to mind. Some Haskell-style composed operators might help,
> but probably aren't strictly necessary.
>
> BTW, string literals would be a builtin feature, there's no question
> about *that*.

Ok, but that leaves open the question of what builtin type would the
builtin literal have?


From: Nemanja Trifunovic on 28 Nov 2006 19:45

Walter Bright wrote:
> Nemanja Trifunovic wrote:
> > But what would you expect? The user simply must take the string encoding
> > into consideration when doing string operations like that. If s1
> > contains a string in some multibyte encoding the user must be aware of
> > it. This is not specific to utf-8.
>
> If it supported utf-8, I would expect things like encoding and decoding
> of utf-8 to work. std::string right now offers nothing for the utf-8 user.
>

std::string is not a unicode string type. It is merely a multibyte
string type. It can be used to handle utf-8 encoded strings but of
course it does not provide encoding-specific functionality out of the
box; again, there are 3rd party libraries for that.

In fact, I don't know why I am defending std::string here - I don't
like it at all. Probably because you are attacking it for the wrong
reason :)

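
For readers wondering what the "encoding-specific functionality" that std::string lacks would amount to, here is a minimal sketch of UTF-8 encoding and decoding layered on top of it. The function names encode_utf8 and decode_utf8 are hypothetical and do not come from any library mentioned in the thread. The sketch assumes C++11 (char32_t) and does not reject overlong forms or surrogate code points, which a production library would.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: encodes one code point as 1 to 4 UTF-8 bytes.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF);  // largest Unicode code point
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: decodes a UTF-8 string into code points.
// Throws std::invalid_argument on a bad lead byte, a bad continuation
// byte, or a sequence truncated by the end of the string.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (extra >= s.size() - i) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

With these, encode_utf8(0x45A) yields the two bytes 0xD1 0x9A, and decoding those bytes followed by "Z" yields the code points U+045A and U+005A. The std::string itself stays a plain byte container throughout, which is the division of labour both posters describe from opposite sides.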
