From: Ben Morrow on 7 Jun 2010 12:21

Quoth Martijn Lievaart <m(a)rtij.nl.invlalid>:
> On Sat, 05 Jun 2010 21:33:20 +0100, Ben Morrow wrote:
> >
> > This is a bad plan. Locales (specifically, the 'locale' pragma) and
> > Unicode don't play nicely together in Perl, and if you're processing
> > international text you will probably end up with Unicode strings. A
>
> Can you expand on this? What exactly goes wrong (or is unexpected)?

A simple example would be something like this:

    LC_ALL=hu_HU.ISO8859-2 perl -Mlocale -E'
        binmode STDOUT, ":utf8";
        my $x = "\xa1";
        say $x =~ /\w/;
        say $x;
    '

0xA1 is inverted-! in ISO8859-1, but A-ogonek in ISO8859-2. The regex
match assumes the bytes of the string are to be interpreted according to
the locale's character set, so it matches; but the output layer upgrades
to UTF8 assuming the bytes are ISO8859-1, so it prints an inverted ! (in
UTF-8, of course).

A stranger example would be:

    LC_ALL=hu_HU.ISO8859-2 perl -Mlocale -E'
        binmode STDOUT, ":utf8";
        my $x = "\xa1";
        say $x, lc $x;
        $x .= "\x{2013}";
        say $x, lc $x;
    '

which produces the output (all in UTF8)

    (inverted-!)(plus-minus)
    (inverted-!)(en-dash)(inverted-!)(en-dash)

To understand this we need a table of character values:

    CODE      UNICODE       ISO8859-1     ISO8859-2
    \xA1      inverted-!    inverted-!    A-ogonek
    \xB1      plus-minus    plus-minus    a-ogonek
    \x2013    en-dash       -             -

What's happening here is that the first line, where $x contained only
8-bit characters, is performing the lc according to the locale, so \xA1
goes to \xB1; the second line, where $x has been 'upgraded' to Unicode
internally, is performing lc according to Unicode, so \xA1 stays as
\xA1. Both strings are then subject to the same
incorrect-translation-from-ISO8859-1 as above.

Confused yet? :)

Ben

From: Martijn Lievaart on 7 Jun 2010 13:00

On Mon, 07 Jun 2010 17:21:07 +0100, Ben Morrow wrote:
> Quoth Martijn Lievaart <m(a)rtij.nl.invlalid>:
>> On Sat, 05 Jun 2010 21:33:20 +0100, Ben Morrow wrote:
>>
>> > This is a bad plan. Locales (specifically, the 'locale' pragma) and
>> > Unicode don't play nicely together in Perl, and if you're processing
>> > international text you will probably end up with Unicode strings. A
>>
>> Can you expand on this? What exactly goes wrong (or is unexpected)?
>
(snip)
>
> Confused yet? :)

That's just plain buggy I would say, or is there some logic I don't see?

Besides, your examples did not work for me completely, to get the same
regex matching I had to set LANG as well.

M4

From: Peter J. Holzer on 7 Jun 2010 17:29

On 2010-06-07 17:00, Martijn Lievaart <m(a)rtij.nl.invlalid> wrote:
> On Mon, 07 Jun 2010 17:21:07 +0100, Ben Morrow wrote:
>> Quoth Martijn Lievaart <m(a)rtij.nl.invlalid>:
>>> On Sat, 05 Jun 2010 21:33:20 +0100, Ben Morrow wrote:
>>>
>>> > This is a bad plan. Locales (specifically, the 'locale' pragma) and
>>> > Unicode don't play nicely together in Perl, and if you're processing
>>> > international text you will probably end up with Unicode strings. A
>>>
>>> Can you expand on this? What exactly goes wrong (or is unexpected)?
>>
> (snip)
>>
>> Confused yet? :)
>
> That's just plain buggy I would say, or is there some logic I don't see?

The logic is that if you use locale then byte strings are supposed to be
encoded according to the current locale. This affects only regexps and
string comparisons according to perldoc locale. It could be argued that
it should also affect implicit upgrading to character strings.

> Besides, your examples did not work for me completely, to get the same
> regex matching I had to set LANG as well.

That looks like a bug. LC_ALL is supposed to override LANG.

hp
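
For comparison, here is a minimal sketch of the same operations done without the 'locale' pragma, by decoding the byte explicitly with the core Encode module. Nobody in the thread posted this; it assumes the input byte is known to be ISO8859-2, and the variable names are made up for illustration. Because the string is decoded to characters up front, the \w match, lc and the UTF-8 output layer should all agree on what the character is.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Byte 0xA1 as it would arrive from an ISO8859-2 source (assumption:
# we know the encoding, rather than inferring it from the locale).
my $bytes = "\xa1";

# Decode explicitly instead of relying on 'use locale'.
my $char = decode('ISO-8859-2', $bytes);    # U+0104, A-ogonek

# Both operations now use Unicode rules on a character string.
my $is_word = ($char =~ /\w/) ? 1 : 0;
my $lower   = lc $char;                     # U+0105, a-ogonek

# Appending a wide character no longer changes how lc treats the
# first character, since the string was never locale-interpreted.
my $appended       = $char . "\x{2013}";
my $lower_appended = lc $appended;

binmode STDOUT, ':encoding(UTF-8)';
printf "is word char: %d\n", $is_word;
printf "lc gives U+%04X\n", ord $lower;
printf "first char of lc(appended): U+%04X\n", ord $lower_appended;
print $char, $lower, "\n";
```

Encoding $lower back to ISO8859-2 gives byte 0xB1, which matches the a-ogonek row of Ben's table.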