From: Ilya Zakharevich on

I'm trying to use the tr/// operator (instead of RExen), and I do not
think it works. The simplified example is

>perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"
UTF-16 surrogate 0xdfff at -e line 1.
Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
abcdefg

The original code contained something like

perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
Unicode character 0x1fffff is illegal at -e line 1.
________

That spurious warning can be worked around, but I think the behaviour
does not match the documentation; does it?

Thanks,
Ilya
From: zbrg on
Ilya Zakharevich wrote on Tue, 11 Apr 2006 02:53:58 +0000 (UTC):
>I'm trying to use tr/// operator (instead of RExen), and do not think
>it works... The simplified example is
>
> >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"
> UTF-16 surrogate 0xdfff at -e line 1.
> Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
> abcdefg
>
[...]
>That spurious warning can be worked about, but I think the behaviour
>is not up to documentation; is it?

It's in the perldiag manpage:

UTF-16 surrogate %s
(W utf8) You tried to generate half of an UTF-16 surrogate by requesting a
Unicode character between the code points 0xD800 and 0xDFFF (inclusive). That
range is reserved exclusively for the use of UTF-16 encoding (by having two
16-bit UCS-2 characters); but Perl encodes its characters in UTF-8, so what you
got is a very illegal character. If you really know what you are doing you can
turn off this warning by "no warnings 'utf8';".
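
The last sentence of that entry can be illustrated with a minimal sketch. It assumes nothing beyond core perl; the variable name $s is only for illustration. The warning class is switched off lexically, so the rest of the program keeps its warnings.

```perl
use strict;
use warnings;

# Build a lone surrogate code point with the 'utf8' warning class
# disabled only inside this block.
my $s;
{
    no warnings 'utf8';
    $s = chr(0xD800);
}

# Only ASCII is printed, so no output-time warning is involved.
printf "length=%d ord=0x%X\n", length($s), ord($s);
```

This only silences the diagnostic; it does not change what tr/// does with the string.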

From: Ilya Zakharevich on
[A complimentary Cc of this posting was sent to
<zbrg(a)mail.invalid>], who wrote in article <443b8741$0$5170$626a54ce(a)news.free.fr>:
> > >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"
> > UTF-16 surrogate 0xdfff at -e line 1.
> > Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
> > abcdefg

> >That spurious warning can be worked about, but I think the behaviour
> >is not up to documentation; is it?

> Its in the perldiag manpage :
>
> UTF-16 surrogate %s
> (W utf8) You tried to generate half ...

First of all, I assume that "it" refers to this broken warning
(actually, one of two [duplicate] warnings). Since the warning does not
apply to the situation I am discussing, I can hardly consider your
finding this message in the list of warnings relevant.

Second, what I was discussing was not the warning, but the ACTION. Do
you think the RESULT ('abcdefg') is "correct"?

Thanks anyway,
Ilya

P.S. Actually, the text in perldiag is also wrong:

> of an UTF-16 surrogate by requesting a Unicode character between the
> code points 0xD800 and 0xDFFF (inclusive). That range is reserved
> exclusively for the use of UTF-16 encoding (by having two 16-bit
> UCS-2 characters); but Perl encodes its characters in UTF-8, so what
> you got is a very illegal character. If you really know what you
> are doing you can turn off this warning by "no warnings 'utf8';".

Perl (the language) does not encode its characters in UTF-8.
Characters are not encoded in any way, they just "are". And, if you
consider the implementation, the internal encoding is not UTF-8 either
(it is known in the perl world as "utf8", and is a proper superset of
UTF-8). Sigh...
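
The "proper superset" point can be made concrete with a small sketch. The encoder below is hand-rolled for illustration and is not perl's internal code; the names encode_lax and is_strict_utf8_scalar are hypothetical. It applies the UTF-8 bit layout without the Unicode restrictions, and a separate check states what UTF-8 proper allows (no surrogates, nothing above 0x10FFFF).

```perl
use strict;
use warnings;

# Hand-rolled encoder for the UTF-8 bit layout, without the Unicode
# restrictions. Illustration only; this is not perl's internal code.
sub encode_lax {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;
    return (0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x200000;
    die sprintf("0x%X needs a longer form than this sketch handles\n", $cp);
}

# What UTF-8 proper allows: no surrogates, nothing above 0x10FFFF.
sub is_strict_utf8_scalar {
    my ($cp) = @_;
    return 0 if $cp >= 0xD800 && $cp <= 0xDFFF;
    return 0 if $cp > 0x10FFFF;
    return 1;
}

# Code points taken from the examples in this thread.
for my $cp (0xE155, 0xDFFF, 0x1FFFFF) {
    printf "U+%04X  bytes=%s  strict=%s\n", $cp,
        join(' ', map { sprintf '%02X', $_ } encode_lax($cp)),
        is_strict_utf8_scalar($cp) ? 'yes' : 'no';
}
```

The bit layout happily produces byte sequences for 0xDFFF and 0x1FFFFF, while the strict check rejects both; only 0xE155 passes.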
From: Dr.Ruud on
Ilya Zakharevich wrote:

> I'm trying to use tr/// operator (instead of RExen), and do not think
> it works... The simplified example is
>
> >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"
> UTF-16 surrogate 0xdfff at -e line 1.
> Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
> abcdefg
>
> The original code contained something like
>
> perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
> tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
> Unicode character 0x1fffff is illegal at -e line 1.
> ________
>
> That spurious warning can be worked about,

Is it a "spurious warning"?

perl -MO=Deparse -e '$_ = qq(\x{d7ff}\x{d800})'

perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'


> but I think the behaviour
> is not up to documentation; is it?

It isn't.

--
Affijn, Ruud

"Gewoon is een tijger."

From: thundergnat on
Ilya Zakharevich wrote:
> I'm trying to use tr/// operator (instead of RExen), and do not think
> it works... The simplified example is
>
> >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"
> UTF-16 surrogate 0xdfff at -e line 1.
> Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
> abcdefg
>
> The original code contained something like
>
> perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
> tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
> Unicode character 0x1fffff is illegal at -e line 1.
> ________
>
> That spurious warning can be worked about, but I think the behaviour
> is not up to documentation; is it?
>

It /does/ appear to be a bug in tr. Not in that it has a problem with
characters in the range D800-DFFF; that doesn't surprise me much. Those
/aren't/ legal UTF-8 character codes. The thing that DOES surprise me is
that tr considers \x{e000} (and \x{d7ff}!) to be in the range
\x{d800}-\x{dfff}. Seems like tr is confused about the surrogates range.


no error:
perl -wle "$_ = q(abcdefg); tr/\x{e001}-\x{e0ff}/ /c; print"

error:
perl -wle "$_ = q(abcdefg); tr/\x{e000}/ /c; print"

error:
perl -wle "$_ = q(abcdefg); tr/\x{d7ff}/ /c; print"

no error:
perl -wle "$_ = q(abcdefg); tr/\x{d7fe}/ /c; print"
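
One reading that fits all of the cases above, offered as a guess and not checked against the perl source: with /c the search list is complemented, and the endpoints of the complemented ranges can land on a surrogate even when the original list does not. The sketch below does only the boundary arithmetic. The upper bound $MAX and the helper names complement and surrogate_endpoints are assumptions made for illustration.

```perl
use strict;
use warnings;

# Assumed upper bound of the code-point space; the real limit inside
# perl may differ and does not affect the boundary arithmetic below.
my $MAX = 0x7FFFFFFF;

sub is_surrogate {
    my ($cp) = @_;
    return $cp >= 0xD800 && $cp <= 0xDFFF;
}

# Complement of sorted, non-overlapping [lo, hi] ranges within 0..$MAX.
sub complement {
    my @ranges = @_;
    my @out;
    my $next = 0;
    for my $r (@ranges) {
        my ($lo, $hi) = @$r;
        push @out, [$next, $lo - 1] if $lo > $next;
        $next = $hi + 1;
    }
    push @out, [$next, $MAX] if $next <= $MAX;
    return @out;
}

# Endpoints of the complemented ranges that fall in D800-DFFF.
sub surrogate_endpoints {
    return grep { is_surrogate($_) } map { @$_ } complement(@_);
}

# The search lists tried in this thread, as [lo, hi] pairs.
for my $r ([0xE000, 0xE0FF], [0xE001, 0xE0FF], [0xE000, 0xE000],
           [0xD7FF, 0xD7FF], [0xD7FE, 0xD7FE]) {
    my @hits = surrogate_endpoints($r);
    printf "%04X-%04X: %s\n", @$r,
        @hits ? join(', ', map { sprintf '0x%x', $_ } @hits)
              : 'no surrogate endpoint';
}
```

Under this arithmetic, e000-e0ff and e000 alone both leave 0xdfff as the end of the lower complemented range (the value named in the warning), d7ff leaves 0xd800 as the start of the upper one, and e001-e0ff and d7fe leave no surrogate endpoint. That matches the error/no error pattern, but it does not explain why the string comes back unchanged.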