|
From: Ilya Zakharevich on 10 Apr 2006 22:53

I'm trying to use tr/// operator (instead of RExen), and do not think it works... The simplified example is

  >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"
  UTF-16 surrogate 0xdfff at -e line 1.
  Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
  abcdefg

The original code contained something like

  perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg); tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
  Unicode character 0x1fffff is illegal at -e line 1.
  ________

That spurious warning can be worked around, but I think the behaviour is not up to documentation; is it?

Thanks,
Ilya

From: zbrg on 11 Apr 2006 06:38

Ilya Zakharevich wrote on Tue, 11 Apr 2006 02:53:58 +0000 (UTC):
>I'm trying to use tr/// operator (instead of RExen), and do not think
>it works... The simplified example is
>
> >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"
> UTF-16 surrogate 0xdfff at -e line 1.
> Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
> abcdefg
>
[...]
>That spurious warning can be worked around, but I think the behaviour
>is not up to documentation; is it?

Its in the perldiag manpage :

  UTF-16 surrogate %s
    (W utf8) You tried to generate half of an UTF-16 surrogate by
    requesting a Unicode character between the code points 0xD800 and
    0xDFFF (inclusive). That range is reserved exclusively for the use
    of UTF-16 encoding (by having two 16-bit UCS-2 characters); but
    Perl encodes its characters in UTF-8, so what you got is a very
    illegal character. If you really know what you are doing you can
    turn off this warning by "no warnings 'utf8';".

From: Ilya Zakharevich on 11 Apr 2006 12:17

[A complimentary Cc of this posting was sent to <zbrg(a)mail.invalid>], who wrote in article <443b8741$0$5170$626a54ce(a)news.free.fr>:
> > >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"
> > UTF-16 surrogate 0xdfff at -e line 1.
> > Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
> > abcdefg

> >That spurious warning can be worked around, but I think the behaviour
> >is not up to documentation; is it?

> Its in the perldiag manpage :
>
> UTF-16 surrogate %s
> (W utf8) You tried to generate half ...

First of all, I assume that "its" is this broken warning (actually, one of two [duplicate] warnings). Since it does not apply to the situation I discuss, I can hardly find your finding this message in the list of warnings relevant.

Second, what I was discussing was not the warning, but the ACTION. Do you think the RESULT ('abcdefg') is "correct"?

Thanks anyway,
Ilya

P.S. Actually, the text in perldiag is also wrong:

> of an UTF-16 surrogate by requesting a Unicode character between the
> code points 0xD800 and 0xDFFF (inclusive). That range is reserved
> exclusively for the use of UTF-16 encoding (by having two 16-bit
> UCS-2 characters); but Perl encodes its characters in UTF-8, so what
> you got is a very illegal character. If you really know what you
> are doing you can turn off this warning by "no warnings 'utf8';".

Perl (the language) does not encode its characters in UTF-8. Characters are not encoded in any way, they just "are". And, if you consider the implementation, the internal encoding is not UTF-8 either (in the Perl world it is called "utf8", and is a proper superset). Sigh...

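[Editorial sketch, not part of any posting.] To make the disputed RESULT concrete, here is a minimal emulation of what the two tr/// calls would be expected to return, assuming the usual reading of perlop: with /c the search list is complemented, a short replacement list is padded with its last character, and when a character occurs more than once in the search list only its first occurrence counts. The helper names `expected_simplified`, `expected_original`, and `map_char` are made up for this illustration; nothing here inspects how tr/// is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: regex emulation of tr/\x{e000}-\x{e0ff}/ /c.
# Every character outside U+E000..U+E0FF is replaced by a space.
sub expected_simplified {
    my ($s) = @_;
    $s =~ s/[^\x{e000}-\x{e0ff}]/ /g;
    return $s;
}

# Hypothetical helper: maps one character the way
# tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/ is assumed to:
# U+E100..U+E1FF go to \x00..\xFF, everything else to "_".
sub map_char {
    my ($char) = @_;
    my $cp = ord $char;
    return ($cp >= 0xE100 && $cp <= 0xE1FF) ? chr($cp - 0xE100) : '_';
}

# Hypothetical helper: applies map_char to each character of the string.
sub expected_original {
    my ($s) = @_;
    return join '', map { map_char($_) } split //, $s;
}

print '[', expected_simplified('abcdefg'), "]\n";   # [       ]
print expected_original("abcd\x{e155}efg"), "\n";   # ____U___
```

Under these assumptions the simplified example should print seven spaces rather than "abcdefg", and the original one should print "____U___" rather than eight underscores.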
From: Dr.Ruud on 11 Apr 2006 12:46

Ilya Zakharevich wrote:
> I'm trying to use tr/// operator (instead of RExen), and do not think
> it works... The simplified example is
>
> >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"
> UTF-16 surrogate 0xdfff at -e line 1.
> Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
> abcdefg
>
> The original code contained something like
>
> perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
> tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
> Unicode character 0x1fffff is illegal at -e line 1.
> ________
>
> That spurious warning can be worked around,

Is it a "spurious warning"?

  perl -MO=Deparse -e '$_ = qq(\x{d7ff}\x{d800})'
  perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'

> but I think the behaviour
> is not up to documentation; is it?

It isn't.

--
Affijn, Ruud

"Gewoon is een tijger."

From: thundergnat on 11 Apr 2006 15:53

Ilya Zakharevich wrote:
> I'm trying to use tr/// operator (instead of RExen), and do not think
> it works... The simplified example is
>
> >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"
> UTF-16 surrogate 0xdfff at -e line 1.
> Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
> abcdefg
>
> The original code contained something like
>
> perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
> tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
> Unicode character 0x1fffff is illegal at -e line 1.
> ________
>
> That spurious warning can be worked around, but I think the behaviour
> is not up to documentation; is it?
>

It /does/ appear to be a bug in tr. Not in that it has a problem with characters in the range D800-DFFF; that doesn't surprise me much. Those /aren't/ legal utf-8 character codes.

The thing that DOES surprise me is that tr considers \x{e000} (and \x{d7ff}!) to be in the range \x{d800}-\x{dfff}. Seems like tr is confused about the surrogates range.

no error:
  perl -wle "$_ = q(abcdefg); tr/\x{e001}-\x{e0ff}/ /c; print"

error:
  perl -wle "$_ = q(abcdefg); tr/\x{e000}/ /c; print"

error:
  perl -wle "$_ = q(abcdefg); tr/\x{d7ff}/ /c; print"

no error:
  perl -wle "$_ = q(abcdefg); tr/\x{d7fe}/ /c; print"
|
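[Editorial sketch, not part of any posting.] One reading that fits all four results above, offered only as a hypothesis and not checked against the Perl source, is that the warning fires when an endpoint of the complemented range (built because of /c) lands on a surrogate, rather than when the search list itself contains one. The check below computes the code points adjacent to a search range and flags any that fall in U+D800..U+DFFF. The helper names `is_surrogate`, `complement_endpoints`, and `surrogate_endpoints` are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the code point is in the UTF-16 surrogate block.
sub is_surrogate {
    my ($cp) = @_;
    return $cp >= 0xD800 && $cp <= 0xDFFF;
}

# Hypothetical helper: the code points just outside the inclusive range
# [$lo, $hi], i.e. the inner endpoints of its complement.
sub complement_endpoints {
    my ($lo, $hi) = @_;
    my @ends;
    push @ends, $lo - 1 if $lo > 0;
    push @ends, $hi + 1;
    return @ends;
}

# Hypothetical helper: complement endpoints that are surrogates.
sub surrogate_endpoints {
    my ($lo, $hi) = @_;
    return grep { is_surrogate($_) } complement_endpoints($lo, $hi);
}

# The four search lists tried in the post above, in the same order.
for my $range ([0xE001, 0xE0FF], [0xE000, 0xE000],
               [0xD7FF, 0xD7FF], [0xD7FE, 0xD7FE]) {
    my ($lo, $hi) = @$range;
    my @hits = map { sprintf '0x%04x', $_ } surrogate_endpoints($lo, $hi);
    printf "searchlist U+%04X..U+%04X: %s\n", $lo, $hi,
        @hits ? "surrogate endpoint @hits" : "no surrogate endpoint";
}
```

This flags 0xdfff for \x{e000} (the value named in the original warning) and 0xd800 for \x{d7ff}, and nothing for the two "no error" cases, which is at least consistent with the observed pattern.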