From: Ben Morrow on

Quoth Martijn Lievaart <m(a)rtij.nl.invlalid>:
> On Sat, 05 Jun 2010 21:33:20 +0100, Ben Morrow wrote:
>
> > This is a bad plan. Locales (specifically, the 'locale' pragma) and
> > Unicode don't play nicely together in Perl, and if you're processing
> > international text you will probably end up with Unicode strings. A
>
> Can you expand on this? What exactly goes wrong (or is unexpected)?

A simple example would be something like this:

export LC_ALL=hu_HU.ISO8859-2
perl -Mlocale -E'
binmode STDOUT, ":utf8";
my $x = "\xa1";
say $x =~ /\w/;
say $x;
'

0xA1 is inverted-! in ISO8859-1, but A-ogonek in ISO8859-2. The regex
match assumes the bytes of the string are to be interpreted according to
the locale's character set, so it matches; but the output layer upgrades
to UTF-8 assuming the bytes are ISO8859-1, so it prints an inverted-! (in
UTF-8, of course).
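
To make the two interpretations of the byte visible side by side, here is a
minimal sketch without the locale pragma. It assumes the Encode module and its
'iso-8859-2' encoding name, neither of which appears in the example above; the
variable names are only illustrative. It compares an upgrade (which treats
the byte as ISO8859-1) with an explicit decode from ISO8859-2.

```perl
use strict;
use warnings;
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';

# The raw byte, as it might arrive from an ISO8859-2 source.
my $bytes = "\xa1";

# Upgrade: the byte is treated as ISO8859-1, so the code point
# stays U+00A1 (inverted-!).
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# Explicit decode (assumption: the data really is ISO8859-2): the byte is
# mapped to U+0104 (A-ogonek).
my $decoded = decode('iso-8859-2', $bytes);

for my $s ($upgraded, $decoded) {
    printf "U+%04X %s %s\n",
        ord($s), ($s =~ /\w/ ? 'word' : 'non-word'), $s;
}
```

With the explicit decode, the regex and the output layer both see the same
character, so the mismatch described above should not arise.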

A stranger example would be:

export LC_ALL=hu_HU.ISO8859-2
perl -Mlocale -E'
binmode STDOUT, ":utf8";
my $x = "\xa1";
say $x, lc $x;
$x .= "\x{2013}";
say $x, lc $x;
'

which produces the output (all in UTF-8):

(inverted-!) (plus-minus)
(inverted-!) (en-dash) (inverted-!) (en-dash)

To understand this we need a table of character values:

CODE       UNICODE      ISO8859-1    ISO8859-2
\xA1       inverted-!   inverted-!   A-ogonek
\xB1       plus-minus   plus-minus   a-ogonek
\x{2013}   en-dash      -            -

What's happening here is that in the first say, where $x contains only
8-bit characters, lc is performed according to the locale, so \xA1
goes to \xB1; in the second say, where $x has been 'upgraded' to Unicode
internally, lc is performed according to Unicode, so \xA1 stays as
\xA1. Both strings are then subject to the same incorrect translation
from ISO8859-1 as above.
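
For contrast, a sketch of the same two steps with the byte decoded up front
and no locale pragma in effect. It assumes the Encode module; codepoints() is
a hypothetical helper written just for this illustration. Since the string is
Unicode from the start, lc should give the same answer before and after the
en-dash is appended.

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: list the code points of a string as U+XXXX.
sub codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
}

# Assumption: the input byte is ISO8859-2, so 0xA1 decodes to A-ogonek.
my $x = decode('iso-8859-2', "\xa1");

my $short = lc $x;                 # lowercased under Unicode rules
say codepoints($x), ' -> ', codepoints($short);

$x .= "\x{2013}";                  # append an en-dash, as in the example
my $long = lc $x;                  # still lowercased under Unicode rules
say codepoints($x), ' -> ', codepoints($long);
```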

Confused yet? :)

Ben

From: Martijn Lievaart on
On Mon, 07 Jun 2010 17:21:07 +0100, Ben Morrow wrote:

> Quoth Martijn Lievaart <m(a)rtij.nl.invlalid>:
>> On Sat, 05 Jun 2010 21:33:20 +0100, Ben Morrow wrote:
>>
>> > This is a bad plan. Locales (specifically, the 'locale' pragma) and
>> > Unicode don't play nicely together in Perl, and if you're processing
>> > international text you will probably end up with Unicode strings. A
>>
>> Can you expand on this? What exactly goes wrong (or is unexpected)?
>
(snip)
>
> Confused yet? :)

That's just plain buggy I would say, or is there some logic I don't see?

Besides, your examples did not work for me completely, to get the same
regex matching I had to set LANG as well.

M4

From: Peter J. Holzer on
On 2010-06-07 17:00, Martijn Lievaart <m(a)rtij.nl.invlalid> wrote:
> On Mon, 07 Jun 2010 17:21:07 +0100, Ben Morrow wrote:
>> Quoth Martijn Lievaart <m(a)rtij.nl.invlalid>:
>>> On Sat, 05 Jun 2010 21:33:20 +0100, Ben Morrow wrote:
>>>
>>> > This is a bad plan. Locales (specifically, the 'locale' pragma) and
>>> > Unicode don't play nicely together in Perl, and if you're processing
>>> > international text you will probably end up with Unicode strings. A
>>>
>>> Can you expand on this? What exactly goes wrong (or is unexpected)?
>>
> (snip)
>>
>> Confused yet? :)
>
> That's just plain buggy I would say, or is there some logic I don't see?

The logic is that if you use locale then byte strings are supposed to be
encoded according to the current locale. This affects only regexps and
string comparisons according to perldoc locale. It could be argued that
it should also affect implicit upgrading to character strings.

> Besides, your examples did not work for me completely, to get the same
> regex matching I had to set LANG as well.

That looks like a bug. LC_ALL is supposed to override LANG.

hp