From: Bart Van der Donck on
Jürgen Exner wrote:

> Bart Van der Donck <b...(a)nijlen.com> wrote:
>
>>The source file looks odd and displays differently in various
>>plaintext readers. It looks like some sort of half binary / half ascii
>>format (including the headers). The body of the file is more-or-less
>>consistent. The address to be extracted is in the following format:
>>
>>   "-   n a m e @ h o s t . c o m   "
>>
>>All text in the source file has such spaces between the characters.
>>
>>Spaces can be displayed like EOL, space or nothing. Binary characters
>>seem to be inserted randomly; sometimes I can recognize a pattern of a
>>repeated string. Maybe someone is familiar with this format? The
>>messages were saved from MS Outlook.
>
> This file is likely in UTF-16 or UCS-2. Did you look at it in a
> hex/binary editor? Those spaces are probably 0x00 bytes and not really
> spaces at all.
>
> Use the proper encoding and Perl should be able to read the file just
> fine,

Yes, it was indeed a UTF-16 issue. I tried just about every
byte-order variant, and the following finally did
the trick:

use Encode;
open(my $in, '<:raw', $mypath) || die "Couldn't open file: $!";
my $txt = do { local $/; <$in> };
close $in;
my @lines = split /\n/, decode('UTF-16LE', $txt);
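
With the text decoded, pulling the address out gets much simpler. A
minimal sketch, assuming the decoded body contains an ordinary address;
extract_address and the sample string are made up for illustration, and
the pattern is deliberately loose rather than a full address validator:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode raw UTF-16LE bytes and return the first
# substring that looks like an email address, or undef if none is found.
sub extract_address {
    my ($bytes) = @_;
    my $text = decode('UTF-16LE', $bytes);
    $text =~ s/^\x{FEFF}//;    # drop a leading BOM if there is one
    return $text =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/ ? $1 : undef;
}

# Made-up sample standing in for the raw file contents
my $sample  = encode('UTF-16LE', "\x{FEFF}-   name\@host.com   \n");
my $address = extract_address($sample);
print defined $address ? "$address\n" : "no address found\n";
```

For a real file, the bytes would come from the '<:raw' slurp above
instead of the sample string.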

Thanks all for your help!

--
Bart
From: Permostat on
On Jan 7, 11:19 am, Bart Van der Donck <b...(a)nijlen.com> wrote:
> Hello,
>
> I have been assigned a task to filter out an email address from the
> body of a (.msg) source file.
>
> The source file looks odd and displays differently in various
> plaintext readers. It looks like some sort of half binary / half ascii
> format (including the headers). The body of the file is more-or-less
> consistent. The address to be extracted is in the following format:
>
>    "-   n a m e @ h o s t . c o m   "
>
> All text in the source file has such spaces between the characters.
>
> Spaces can be displayed like EOL, space or nothing. Binary characters
> seem to be inserted randomly; sometimes I can recognize a pattern of a
> repeated string. Maybe someone is familiar with this format? The
> messages were saved from MS Outlook.
>
> I tried many variants, my best shot goes to:
>
>    if (/(-)(\s\s\s)(.+)(@)(.+)(\.)(.+)(\s\s\s)/gs)  { ...
>
> But still no success. I was thinking of an encoding issue (Unicode/
> UTF?), but the source file seems too different for that.
>
> Thanks
>
> --
>  Bart

Do little tiny minor jobs usually make you break out in a sweat like
this??

PRONTOR
From: Peter J. Holzer on
On 2010-01-08 09:58, Bart Van der Donck <bart(a)nijlen.com> wrote:
> use Encode;
> open(my $in, '<:raw', $mypath) || die "Couldn't open file: $!";
> my $txt = do { local $/; <$in> };
> close $in;
> my @lines = split /\n/, decode('UTF-16LE', $txt);

Shorter:

open(my $in, '<:encoding(UTF-16LE)', $mypath) || die "Couldn't open file: $!";
my @lines = <$in>;
chomp @lines;

(untested)

hp
From: Bart Van der Donck on
Peter J. Holzer wrote:

> On 2010-01-08 09:58, Bart Van der Donck <b...(a)nijlen.com> wrote:
>
>>   use Encode;
>>   open(my $in, '<:raw', $mypath) || die "Couldn't open file: $!";
>>   my $txt = do { local $/; <$in> };
>>   close $in;
>>   my @lines = split /\n/, decode('UTF-16LE', $txt);
>
> Shorter:
>
>     open(my $in, '<:encoding(UTF-16LE)', $mypath) || die "Couldn't open file: $!";
>     my @lines = <$in>;
>     chomp @lines;

For my particular situation, it appears that I need the raw method
anyhow. When I read directly with '<:encoding(UTF-16LE)', it says:

"UTF-16LE:Unicode character fffe is illegal at script.pl line 32."

(32 is the line with the 'open'-call)

--
Bart
From: Jürgen Exner on
Bart Van der Donck <bart(a)nijlen.com> wrote:
>For my particular situation, it appears that I need the raw method
>anyhow. When I read directly with '<:encoding(UTF-16LE)', it says:
>
> "UTF-16LE:Unicode character fffe is illegal at script.pl line 32."

The only place where 0xFFFE could possibly show up is the byte order
mark (BOM), and I would be very surprised if Perl couldn't handle the
BOM.
I would suggest checking the file with a hex editor to make sure it does
not contain an additional rogue BOM somewhere in the middle of the file.
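
If no hex editor is at hand, a few lines of Perl can do the same check.
This is only a sketch: find_bom_offsets and the sample data are made up
for illustration, and it looks only at even byte offsets, i.e. at 16-bit
code unit boundaries:

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offsets of every 16-bit unit that
# is FF FE or FE FF. Read as little-endian, FF FE is U+FEFF (the BOM)
# and FE FF is U+FFFE.
sub find_bom_offsets {
    my ($bytes) = @_;
    my @hits;
    for (my $i = 0; $i + 1 < length $bytes; $i += 2) {
        my $pair = substr $bytes, $i, 2;
        push @hits, $i if $pair eq "\xFF\xFE" || $pair eq "\xFE\xFF";
    }
    return @hits;
}

# Made-up data: a BOM, "ab" in UTF-16LE, a stray BOM, then "c"
my $data = "\xFF\xFE" . "a\0b\0" . "\xFF\xFE" . "c\0";
printf "BOM-like unit at byte offset %d\n", $_ for find_bom_offsets($data);
```

To run it against the real file, read the bytes with '<:raw' and pass
them in; any hit beyond offset 0 would be worth a closer look.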

jue