From: Bart Van der Donck on
Hello,

I have been assigned a task to filter out an email address from the
body of a (.msg) source file.

The source file looks odd and displays differently in various
plaintext readers. It looks like some sort of half binary / half ascii
format (including the headers). The body of the file is more-or-less
consistent. The address to be extracted is in the following format:

"- n a m e @ h o s t . c o m "

All text in the source file is with such spaces between.

Spaces can be displayed like EOL, space or nothing. Binary characters
seem to be inserted randomly; sometimes I can recognize a pattern of a
repeated string. Maybe someone is familiar with this format ? The
messages were saved from MS Outlook.

I tried many variants, my best shot goes to:

if (/(-)(\s\s\s)(.+)(@)(.+)(\.)(.+)(\s\s\s)/gs) { ...

But still no success. I was thinking of an encoding issue (Unicode/
UTF?), but the source file seems too different for that.

Thanks

--
Bart
From: Peter J. Holzer on
On 2010-01-07 17:19, Bart Van der Donck <bart(a)nijlen.com> wrote:
> I have been assigned a task to filter out an email address from the
> body of a (.msg) source file.
>
> The source file looks odd and displays differently in various
> plaintext readers. It looks like some sort of half binary / half ascii
> format (including the headers). The body of the file is more-or-less
> consistent. The address to be extracted is in the following format:
>
> "- n a m e @ h o s t . c o m "
>
> All text in the source file is with such spaces between.
>
> Spaces can be displayed like EOL, space or nothing.

Then they probably aren't spaces. Most likely they are nul characters.
If you have Linux, use hd (or od) to look at the file. If you use
Windows, there's probably some freeware hex editor/viewer you can use.

> But still no success. I was thinking of an encoding issue (Unicode/
> UTF?), but the source file seems too different for that.

Most likely UTF-16, but there may be some additional markup.
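
If no hex viewer is at hand, a few lines of Perl can stand in for hd or od. The following is only a minimal sketch: the helper name hex_dump and the sample bytes are made up for illustration and are not taken from the real .msg file.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hex dump of a byte string, 16 bytes per row.
sub hex_dump {
    my ($bytes) = @_;
    my $out = '';
    for (my $off = 0; $off < length $bytes; $off += 16) {
        my $chunk = substr $bytes, $off, 16;
        my $hex   = join ' ', map { sprintf '%02x', ord } split //, $chunk;
        (my $text = $chunk) =~ tr/\x20-\x7e/./c;   # show non-printable bytes as '.'
        $out .= sprintf "%08x  %-47s  |%s|\n", $off, $hex, $text;
    }
    return $out;
}

# Made-up sample: "-name" with a NUL byte after each character.
my $sample = "-\0n\0a\0m\0e\0";
print hex_dump($sample);
```

For a real file, the bytes would be read through a filehandle opened with the '<:raw' layer so that nothing is translated on the way in. If every other byte in the dump is 00, the "spaces" are NULs.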

hp
From: Steve C on
Bart Van der Donck wrote:
> Hello,
>
> I have been assigned a task to filter out an email address from the
> body of a (.msg) source file.
>
> The source file looks odd and displays differently in various
> plaintext readers. It looks like some sort of half binary / half ascii
> format (including the headers). The body of the file is more-or-less
> consistent. The address to be extracted is in the following format:
>
> "- n a m e @ h o s t . c o m "
>
> All text in the source file is with such spaces between.
>
> Spaces can be displayed like EOL, space or nothing. Binary characters
> seem to be inserted randomly; sometimes I can recognize a pattern of a
> repeated string. Maybe someone is familiar with this format ? The
> messages were saved from MS Outlook.
>
> I tried many variants, my best shot goes to:
>
> if (/(-)(\s\s\s)(.+)(@)(.+)(\.)(.+)(\s\s\s)/gs) { ...
>
> But still no success. I was thinking of an encoding issue (Unicode/
> UTF?), but the source file seems too different for that.
>
> Thanks
>

Are you sure they are spaces and not NULs? Windows text files
frequently use 16-bit wide character format, which looks like
0x0 in the high byte and ASCII in the low byte for English
characters.

http://www.microsoft.com/opentype/unicode/cs.htm
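
To illustrate the byte layout described above, here is a small sketch that encodes a plain ASCII string as UTF-16LE with the core Encode module and prints the resulting bytes in hex. The sample address is the placeholder from the original post; little-endian byte order is an assumption.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text  = 'name@host.com';
my $bytes = encode('UTF-16LE', $text);   # two bytes per character, low byte first

printf "%d characters became %d bytes\n", length $text, length $bytes;
print join(' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes), "\n";
```

Each ASCII character is followed by a 00 byte, which is what many viewers render as a space, a line break, or nothing at all.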

From: Wanna-Be Sys Admin on
Bart Van der Donck wrote:

> Hello,
>
> I have been assigned a task to filter out an email address from the
> body of a (.msg) source file.
>
> The source file looks odd and displays differently in various
> plaintext readers. It looks like some sort of half binary / half ascii
> format (including the headers). The body of the file is more-or-less
> consistent. The address to be extracted is in the following format:
>
> "- n a m e @ h o s t . c o m "
>
> All text in the source file is with such spaces between.
>
> Spaces can be displayed like EOL, space or nothing. Binary characters
> seem to be inserted randomly; sometimes I can recognize a pattern of a
> repeated string. Maybe someone is familiar with this format ? The
> messages were saved from MS Outlook.
>
> I tried many variants, my best shot goes to:
>
> if (/(-)(\s\s\s)(.+)(@)(.+)(\.)(.+)(\s\s\s)/gs) { ...

Maybe try stripping the hidden characters from the file first? Guessing
exactly how many spaces or other characters sit between the letters can
be a hassle. The problem is that one address could be formatted quite
differently from another; I assume they aren't all the same? If they
are, and if the separators really are white space, maybe \s+ in place of
\s\s\s would be better? Also, why are you capturing \s\s\s? Is that
intentional, and is it what you want? Anyway, you probably need to
convert the file/data to strip out the junk first, so that you can get
at the actual data you want instead of trying to work around the junk
in the regex.
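
The "strip the junk first" idea could be sketched roughly as follows, assuming the hidden characters are NUL bytes as suggested elsewhere in the thread. The helper name extract_address, the sample bytes, and the deliberately simple address pattern are all assumptions for illustration; the pattern is not a full address validator.

```perl
use strict;
use warnings;

# Hypothetical helper: delete NUL bytes, then look for something address-like.
sub extract_address {
    my ($raw) = @_;
    (my $clean = $raw) =~ tr/\0//d;   # remove every NUL byte
    return $clean =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/ ? $1 : undef;
}

# Made-up sample: the text from the post with a NUL after each character,
# surrounded by a few unrelated binary bytes.
my $raw = "\x01\x02"
        . join('', map { "$_\0" } split //, '- name@host.com ')
        . "\xff";

my $addr = extract_address($raw);
print defined $addr ? "found: $addr\n" : "no address found\n";
```

Deleting NULs blindly only works while the text is pure ASCII; any character outside that range would be damaged, so decoding with the proper encoding is the more robust route.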

--
Not really a wanna-be, but I don't know everything.
From: Jürgen Exner on
Bart Van der Donck <bart(a)nijlen.com> wrote:
>The source file looks odd and displays differently in various
>plaintext readers. It looks like some sort of half binary / half ascii
>format (including the headers). The body of the file is more-or-less
>consistent. The address to be extracted is in the following format:
>
> "- n a m e @ h o s t . c o m "
>
>All text in the source file is with such spaces between.
>
>Spaces can be displayed like EOL, space or nothing. Binary characters
>seem to be inserted randomly; sometimes I can recognize a pattern of a
>repeated string. Maybe someone is familiar with this format ? The
>messages were saved from MS Outlook.

This file is likely in UTF-16 or UCS-2. Did you look at it in a
hex/binary editor? Those spaces are probably 0x00 bytes and not really
spaces at all.

Use the proper encoding and Perl should be able to read the file just
fine.
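
A minimal sketch of that approach, assuming the text really is UTF-16LE (the byte order has not been confirmed in this thread). The helper name address_from_utf16le, the sample bytes, and the simple address pattern are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode bytes as UTF-16LE, then search the decoded text.
sub address_from_utf16le {
    my ($bytes) = @_;
    my $text = decode('UTF-16LE', $bytes);
    return $text =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/ ? $1 : undef;
}

# Made-up sample: the text from the post with a NUL after each character.
my $bytes = join '', map { "$_\0" } split //, '- name@host.com ';

my $addr = address_from_utf16le($bytes);
print defined $addr ? "found: $addr\n" : "no address found\n";
```

For a file that is UTF-16 text from start to finish, opening it with the '<:encoding(UTF-16LE)' layer does the decoding on read. If the file mixes binary data with the text, as the original description suggests, it may be necessary to read it raw and decode only the relevant part.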

jue