Regex to extract email from .msg [Perl]


From: Bart Van der Donck on 8 Jan 2010 04:58

Jürgen Exner wrote:

> Bart Van der Donck <b...(a)nijlen.com> wrote:
>
>>The source file looks odd and displays differently in various
>>plaintext readers. It looks like some sort of half binary / half ascii
>>format (including the headers). The body of the file is more-or-less
>>consistent. The address to be extracted is in the following format:
>>
>> "- n a m e @ h o s t . c o m "
>>
>>All text in the source file is with such spaces between.
>>
>>Spaces can be displayed like EOL, space or nothing. Binary characters
>>seem to be inserted randomly; sometimes I can recognize a pattern of a
>>repeated string. Maybe someone is familiar with this format ? The
>>messages were saved from MS Outlook.
>
> This file is likely in UTF-16 or UCS-2. Did you look at it in a
> hex/binary editor? Those spaces are probably 0x00 bytes and not really
> spaces at all.
>
> Use the proper encoding and Perl should be able to read the file just
> fine,

Yes - it was indeed a UTF-16 issue. I tried nearly every byte-order
variant of the encoding, and the following finally did the trick:

use Encode;
open(my $in, '<:raw', $mypath) || die "Couldn't open file: $!";
my $txt = do { local $/; <$in> };
close $in;
my @lines = split /\n/, decode('UTF-16LE', $txt);

Thanks all for your help!

--
Bart
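
Editor's sketch of how the decoded text could then be searched for the address. The sample bytes, the variable names and the address pattern are assumptions made for illustration; they are not taken from the real .msg file, and the pattern is deliberately simple rather than a full address validator.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample standing in for the raw bytes read from the file.
# As raw bytes, every ASCII character is followed by a 0x00 byte, which
# is what the plaintext readers displayed as "spaces".
my $bytes = encode('UTF-16LE', "Some header\n- name\@host.com \nMore text\n");

# Decode to Perl characters first, then match against characters.
my $txt = decode('UTF-16LE', $bytes);

# Simple ASCII-only address pattern (an assumption, not RFC 5322).
my @addresses = $txt =~ /([A-Za-z0-9._%+-]+\@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)/g;

print "$_\n" for @addresses;
```

Once the text is decoded, the spaced-out regex from the original question is no longer needed, because the "spaces" were never characters in the first place.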

From: Permostat on 8 Jan 2010 09:08

On Jan 7, 11:19 am, Bart Van der Donck <b...(a)nijlen.com> wrote:
> Hello,
>
> I have been assigned a task to filter out an email address from the
> body of a (.msg) source file.
>
> The source file looks odd and displays differently in various
> plaintext readers. It looks like some sort of half binary / half ascii
> format (including the headers). The body of the file is more-or-less
> consistent. The address to be extracted is in the following format:
>
> "- n a m e @ h o s t . c o m "
>
> All text in the source file is with such spaces between.
>
> Spaces can be displayed like EOL, space or nothing. Binary characters
> seem to be inserted randomly; sometimes I can recognize a pattern of a
> repeated string. Maybe someone is familiar with this format ? The
> messages were saved from MS Outlook.
>
> I tried many variants, my best shot goes to:
>
> if (/(-)(\s\s\s)(.+)(@)(.+)(\.)(.+)(\s\s\s)/gs) { ...
>
> But still no success. I was thinking of an encoding issue (Unicode/
> UTF?), but the source file seems too different for that.
>
> Thanks
>
> --
> Bart

Do little tiny minor jobs usually make you break out in a sweat like
this??

PRONTOR

From: Peter J. Holzer on 8 Jan 2010 09:11

On 2010-01-08 09:58, Bart Van der Donck <bart(a)nijlen.com> wrote:
> use Encode;
> open(my $in, '<:raw', $mypath) || die "Couldn't open file: $!";
> my $txt = do { local $/; <$in> };
> close $in;
> my @lines = split /\n/, decode('UTF-16LE', $txt);

Shorter:

open(my $in, '<:encoding(UTF-16LE)', $mypath) || die "Couldn't open file: $!";
my @lines = <$in>;
chomp @lines;

(untested)

hp
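
Editor's sketch: both variants above decode explicitly as UTF-16LE. As far as the Encode documentation describes it, an explicit LE or BE encoding name treats a byte order mark as an ordinary character (U+FEFF), while the plain 'UTF-16' name expects a BOM and consumes it. The sample bytes below are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: a little-endian BOM (FF FE) followed by "hi".
my $bytes = "\xFF\xFE" . "h\x00i\x00";

my $le   = decode('UTF-16LE', $bytes);   # BOM survives as U+FEFF
my $auto = decode('UTF-16',   $bytes);   # BOM selects byte order and is dropped

printf "UTF-16LE: %d characters, first is U+%04X\n", length($le),   ord($le);
printf "UTF-16:   %d characters, first is U+%04X\n", length($auto), ord($auto);

# With an explicit LE/BE decode, a leading BOM can be removed by hand.
(my $clean = $le) =~ s/\A\x{FEFF}//;
```

If the file is known to start with a BOM, the plain 'UTF-16' name saves the manual cleanup; otherwise the explicit name plus the substitution is the safer assumption.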

From: Bart Van der Donck on 9 Jan 2010 03:50

Peter J. Holzer wrote:

> On 2010-01-08 09:58, Bart Van der Donck <b...(a)nijlen.com> wrote:
>
>> use Encode;
>> open(my $in, '<:raw', $mypath) || die "Couldn't open file: $!";
>> my $txt = do { local $/; <$in> };
>> close $in;
>> my @lines = split /\n/, decode('UTF-16LE', $txt);
>
> Shorter:
>
> open(my $in, '<:encoding(UTF-16LE)', $mypath) || die "Couldn't open file: $!";
> my @lines = <$in>;
> chomp @lines;

For my particular situation, it appears that I need the raw method
anyhow. When I read directly with '<:encoding(UTF-16LE)', it says:

"UTF-16LE:Unicode character fffe is illegal at script.pl line 32."

(32 is the line with the 'open'-call)

--
Bart

From: Jürgen Exner on 9 Jan 2010 04:20

Bart Van der Donck <bart(a)nijlen.com> wrote:
>For my particular situation, it appears that I need the raw method
>anyhow. When I read directly with '<:encoding(UTF-16LE)', it says:
>
> "UTF-16LE:Unicode character fffe is illegal at script.pl line 32."

The only place where 0xFFFE could possibly show up is the byte order
mark (BOM) and I would be very surprised if Perl couldn't handle the
BOM.
I would suggest checking the file with a hex editor to make sure it does
not contain an additional rogue BOM somewhere in the middle of the file.

jue
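
Editor's sketch of the check suggested above, done in Perl instead of a hex editor: walk the raw bytes in 2-byte units and report every BOM-like pair. Note that the byte pair FE FF, read as little-endian, is the code point U+FFFE named in the error message, while FF FE is the regular little-endian BOM. The sample data and variable names are hypothetical; for a real file, $bytes would come from a '<:raw' read as in the earlier posts.

```perl
use strict;
use warnings;

# Hypothetical sample: BOM, "ab", a stray FF FE pair, then "c".
my $bytes = "\xFF\xFE" . "a\x00b\x00" . "\xFF\xFE" . "c\x00";

my @offsets;
# Step through the data one 16-bit code unit at a time.
for (my $i = 0; $i + 1 < length($bytes); $i += 2) {
    my $unit = substr($bytes, $i, 2);
    push @offsets, $i if $unit eq "\xFF\xFE" or $unit eq "\xFE\xFF";
}

printf "BOM-like pair at byte offset %d\n", $_ for @offsets;
```

A hit at offset 0 is the expected BOM; any later hit marks a spot worth inspecting. This assumes the text is aligned on even byte offsets, which may not hold for every part of a .msg file.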
