From: Ilya Zakharevich on
[A complimentary Cc of this posting was sent to
Dr.Ruud
<rvtol+news(a)isolution.nl>], who wrote in article <e1gu0l.1m4.1(a)news.isolution.nl>:
> > The original code contained something like
> >
> > perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
> > tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
> > Unicode character 0x1fffff is illegal at -e line 1.
> > ________
> >
> > That spurious warning can be worked around,
>
> Is it a "spurious warning"?

Looks so. What makes you doubt it? I'm working with Perl characters,
not Unicode characters; and IIRC, even Unicode goes up to 0x1fffff...
Or is it 0x10ffff?
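
For reference, a minimal sketch of where the two numbers come from. It is written in Python rather than Perl so that the result does not depend on the Perl build. Unicode's last code point is 0x10FFFF, while 0x1FFFFF is the largest value that fits in the 21 payload bits of a four-byte UTF-8 sequence. The constant and function names are illustrative only.

```python
import sys

# Highest code point defined by the Unicode standard.
UNICODE_MAX = 0x10FFFF

# A four-byte UTF-8 sequence has the bit layout
# 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, i.e. 3 + 6 + 6 + 6 = 21 payload bits.
FOUR_BYTE_PAYLOAD_BITS = 3 + 6 + 6 + 6
FOUR_BYTE_MAX = (1 << FOUR_BYTE_PAYLOAD_BITS) - 1


def is_unicode_code_point(n: int) -> bool:
    """Return True if n lies within the Unicode code space 0..0x10FFFF."""
    return 0 <= n <= UNICODE_MAX


if __name__ == "__main__":
    print(hex(sys.maxunicode))               # Python's own upper bound for chr()
    print(hex(FOUR_BYTE_MAX))                # largest value a 4-byte sequence can carry
    print(is_unicode_code_point(0x1FFFFF))   # the upper bound used in the tr/// range
```

So a range ending at 0x1FFFFF is representable in the four-byte layout but lies outside Unicode proper, which would be consistent with the wording of the warning.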

> perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'

What is your point? I do not see which output makes you think this is
relevant... Did you try

perl -MO=Deparse -e 'tr/\x{7ff}\x{800}//'

Thanks,
Ilya
From: Ilya Zakharevich on
[A complimentary Cc of this posting was sent to
Dr.Ruud
<rvtol+news(a)isolution.nl>], who wrote in article <e1gu0l.1m4.1(a)news.isolution.nl>:
> Is it a "spurious warning"?

> perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'

Oops, ignore my preceding message; I was using the wrong quotes... So I
see now where the Perl bug is:

>perl -MO=Deparse -e "tr/\x{0000}-\x{ffff}//"
Malformed UTF-8 character (character 0xffff) at -e line 1.
Malformed UTF-8 character (character 0xffff) at -e line 1.
use utf8 ();
tr/\000//;
-e syntax OK

>perl -MO=Deparse -e "tr/\x{0000}-\x{fff0}//"
use utf8 ();
tr/\000-\x{fff0}//;
-e syntax OK

So some Perl developer thought that Perl characters == Unicode
characters, and the code mangles the pattern without reporting errors...

A lot of thanks,
Ilya
From: Ilya Zakharevich on
[A complimentary Cc of this posting was sent to
thundergnat
<thundergnat(a)hotmail.com>], who wrote in article <g8idnSdjQ4e2lKHZRVn-iw(a)rcn.net>:
> It /does/ appear to be a bug in tr. Not in that it has a problem with
> characters in the range D800-DFFF, that doesn't surprise me much. Those
> /aren't/ legal utf-8 character codes.

Let me disagree. First, I know of no such thing as utf-8. Second, if
you mean utf8, legal codes are 0..MAX_UV (since the size of UV is
specific to Perl build, this depends on the build of Perl executable).

Some codes would not appear in Unicode strings; but one should be able
to treat "binary" data freely (including 0..31 and 0x80..0x9F ranges,
and other characters which have no Unicode-consortium-assigned
cultural information).
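
To make the distinction concrete, here is a small sketch in Python (not Perl; it says nothing about Perl's internal utf8 representation). It shows that a strict UTF-8 encoder accepts the control ranges 0..31 and 0x80..0x9F, but rejects the surrogates D800-DFFF unless explicitly told to pass them through. The helper name `strict_utf8` is hypothetical.

```python
def strict_utf8(code_point: int):
    """Encode one code point as strict UTF-8, or return None if it is rejected."""
    try:
        return chr(code_point).encode("utf-8")
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # Control characters and the neighbours of the surrogate block encode fine;
    # the surrogates themselves are refused by the strict encoder.
    for cp in (0x1F, 0x9F, 0xD7FF, 0xD800, 0xDFFF, 0xE000):
        print(hex(cp), strict_utf8(cp))

    # 'surrogatepass' applies the generic three-byte layout to a surrogate anyway,
    # which is roughly what a lax, "binary-friendly" encoding would do.
    print(chr(0xD800).encode("utf-8", "surrogatepass"))
```

The byte layout itself has no trouble carrying D800-DFFF. Rejecting them is a policy of the strict encoding, which is the difference being argued about here.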

Thanks,
Ilya
From: zbrg on
Ilya Zakharevich wrote on Tue, 11 Apr 2006 16:17:49 +0000 (UTC):
> Since it does not apply to the
> situation I discuss, I can hardly see how your finding this message in
> the list of warnings is relevant.
>
> Second, what I was discussing was not the warning, but the ACTION. Do
> you think the RESULT ('abcdefg') is "correct"?

The warning seems relevant, as avoiding the 0xD800-0xDFFF range seems to give a
good result:

$ perl -wle '$_ = q(abcdefg); tr/\x{d7ff}-\x{e0ff}/ /c; print'
From: Ben Bacarisse on
On Tue, 11 Apr 2006 22:11:32 +0000, Ilya Zakharevich wrote:

> Let me disagree. First, I know of no such thing as utf-8. Second, if
> you mean utf8

The proper form is UTF-8 (i.e. with caps) so your correction (further from
the accepted form) seems rather harsh!

Refs:
http://www.unicode.org/versions/Unicode3.0.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

--
Ben.