From: sln on
On Sat, 9 Jan 2010 00:50:51 -0800 (PST), Bart Van der Donck <bart(a)nijlen.com> wrote:

>Peter J. Holzer wrote:
>
>> On 2010-01-08 09:58, Bart Van der Donck <b...(a)nijlen.com> wrote:
>>
>>>   use Encode;
>>>   open(my $in, '<:raw', $mypath) || die "Couldn't open file: $!";
>>>   my $txt = do { local $/; <$in> };
>>>   close $in;
>>>   my @lines = split /\n/, decode('UTF-16LE', $txt);
>>
>> Shorter:
>>
>>     open(my $in, '<:encoding(UTF-16LE)', $mypath) || die "Couldn't open file: $!";
>>     my @lines = <$in>;
>>     chomp @lines;
>
>For my particular situation, it appears that I need the raw method
>anyhow. When I read directly with '<:encoding(UTF-16LE)', it says:
>
> "UTF-16LE:Unicode character fffe is illegal at script.pl line 32."
>
>(32 is the line with the 'open'-call)

Try:
  open(my $in, '<:encoding(UTF-16)', $mypath) || die "Couldn't open file: $!";
                           ^^^^^^
                           UTF-16

An fffe BOM means UTF-16LE, so the file should have opened OK.
However, on the first read, unless you have seeked past the
BOM (offset 2), fffe is read and is an illegal UTF-16 char.

When you open with UTF-16 instead, the layer expects a BOM and
automatically moves the file position past it for the first read.
It's called the BOM bug !!!

Of course if you don't have a BOM, using UTF-16 will die with
"no BOM". Another bug !!!

I posted code before that auto navigates these waters, if you
bothered to look.
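
For reference, here is a minimal sketch of the raw-read approach
discussed above. It is not the code referred to in the previous
paragraph. The helper names decode_utf16_bytes and read_utf16_lines are
made up for illustration, and the fallback to UTF-16LE when no BOM is
found is an assumption that happens to fit the file in this thread. It
only looks for the two UTF-16 BOMs and does not try to tell a UTF-16LE
BOM apart from a UTF-32LE one.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a UTF-16 byte string. If the string starts
# with a BOM, the BOM is removed and selects the byte order; otherwise
# the caller-supplied default (assumed UTF-16LE here) is used.
sub decode_utf16_bytes {
    my ($bytes, $default) = @_;
    my $enc = $default || 'UTF-16LE';
    if    ($bytes =~ s/\A\xFF\xFE//) { $enc = 'UTF-16LE' }   # little-endian BOM
    elsif ($bytes =~ s/\A\xFE\xFF//) { $enc = 'UTF-16BE' }   # big-endian BOM
    return decode($enc, $bytes);
}

# Hypothetical helper: slurp a file as raw bytes, decode it, and split
# it into lines. Accepts both LF and CRLF line endings.
sub read_utf16_lines {
    my ($path, $default) = @_;
    open(my $in, '<:raw', $path) or die "Couldn't open $path: $!";
    my $bytes = do { local $/; <$in> };
    close $in;
    return split /\r?\n/, decode_utf16_bytes($bytes, $default);
}
```

Usage would be something like: my @lines = read_utf16_lines($mypath);
Because the file is read with :raw and decoded afterwards, no encoding
layer ever sees the BOM, so neither complaint described above should
come up.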

-sln
From: sln on
On Sat, 09 Jan 2010 05:41:49 -0800, sln(a)netherlands.com wrote:

>On Sat, 9 Jan 2010 00:50:51 -0800 (PST), Bart Van der Donck <bart(a)nijlen.com> wrote:
>
>>Peter J. Holzer wrote:
>>
>>> On 2010-01-08 09:58, Bart Van der Donck <b...(a)nijlen.com> wrote:
>>>
>>>>   use Encode;
>>>>   open(my $in, '<:raw', $mypath) || die "Couldn't open file: $!";
>>>>   my $txt = do { local $/; <$in> };
>>>>   close $in;
>>>>   my @lines = split /\n/, decode('UTF-16LE', $txt);
>>>
>>> Shorter:
>>>
>>>     open(my $in, '<:encoding(UTF-16LE)', $mypath) || die "Couldn't open file: $!";
>>>     my @lines = <$in>;
>>>     chomp @lines;
>>
>>For my particular situation, it appears that I need the raw method
>>anyhow. When I read directly with '<:encoding(UTF-16LE)', it says:
>>
>> "UTF-16LE:Unicode character fffe is illegal at script.pl line 32."
>>
>>(32 is the line with the 'open'-call)
>
>Try:
>   open(my $in, '<:encoding(UTF-16)', $mypath) || die "Couldn't open file: $!";
>                            ^^^^^^
>                            UTF-16
>
>An fffe BOM means UTF-16LE, so the file should have opened OK.
>However, on the first read, unless you have seeked past the
>BOM (offset 2), fffe is read and is an illegal UTF-16 char.
>
>When you open with UTF-16 instead, the layer expects a BOM and
>automatically moves the file position past it for the first read.
>It's called the BOM bug !!!

The bug is that seeks are broken: you have to keep track of the BOM
offset yourself (if there is a BOM), and this should be transparent
with :encoding(UTF-16).
>
>Of course if you don't have a BOM, using UTF-16 will die with
>"no BOM". Another bug !!!
>
>I posted code before that auto navigates these waters, if you
>bothered to look.
>
>-sln

From: Ben Morrow on

Quoth Jürgen Exner <jurgenex(a)hotmail.com>:
> Bart Van der Donck <bart(a)nijlen.com> wrote:
> >For my particular situation, it appears that I need the raw method
> >anyhow. When I read directly with '<:encoding(UTF-16LE)', it says:
> >
> > "UTF-16LE:Unicode character fffe is illegal at script.pl line 32."
>
> The only place where 0xFFFE could possibly show up is the byte order
> mark (BOM) and I would be very surprised if Perl couldn't handle the
> BOM.

IIRC the Perl UTF-16 layers are a little too picky. If you ask for
UTF-16LE, it will complain if there is a BOM. If, OTOH, you ask for
UTF-16, it will correctly detect the BOM and set the byte order from it.

> I would suggest to check the file with a hex editor to make sure it does
> not contain an additional rouge BOM somewhere in the middle of the file.

I wasn't aware BOMs came in different colours :).

Ben