From: Bart Van der Donck on 7 Jan 2010 12:19

Hello,

I have been assigned a task to filter out an email address from the body of a (.msg) source file.

The source file looks odd and displays differently in various plaintext readers. It looks like some sort of half binary / half ascii format (including the headers). The body of the file is more-or-less consistent. The address to be extracted is in the following format:

"- n a m e @ h o s t . c o m "

All text in the source file is with such spaces between.

Spaces can be displayed like EOL, space or nothing. Binary characters seem to be inserted randomly; sometimes I can recognize a pattern of a repeated string. Maybe someone is familiar with this format? The messages were saved from MS Outlook.

I tried many variants, my best shot goes to:

    if (/(-)(\s\s\s)(.+)(@)(.+)(\.)(.+)(\s\s\s)/gs) { ...

But still no success. I was thinking of an encoding issue (Unicode/UTF?), but the source file seems too different for that.

Thanks

--
Bart

From: Peter J. Holzer on 7 Jan 2010 14:00

On 2010-01-07 17:19, Bart Van der Donck <bart(a)nijlen.com> wrote:
> I have been assigned a task to filter out an email address from the
> body of a (.msg) source file.
>
> The source file looks odd and displays differently in various
> plaintext readers. It looks like some sort of half binary / half ascii
> format (including the headers). The body of the file is more-or-less
> consistent. The address to be extracted is in the following format:
>
> "- n a m e @ h o s t . c o m "
>
> All text in the source file is with such spaces between.
>
> Spaces can be displayed like EOL, space or nothing.

Then they probably aren't spaces. Most likely they are nul characters. If you have Linux, use hd (or od) to look at the file. If you use Windows, there's probably some freeware hex editor/viewer you can use.

> But still no success. I was thinking of an encoding issue (Unicode/
> UTF?), but the source file seems too different for that.

Most likely UTF-16, but there may be some additional markup.

hp
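
Where hd or od is not at hand, the same inspection can be done from Perl. The following is a minimal sketch, not a replacement for a real hex viewer: the `hexdump` helper and the sample bytes are hypothetical, and the sample assumes the text is UTF-16LE, as the replies in this thread suspect.

```perl
use strict;
use warnings;

# Format a byte string as rows of 16 hex values plus a printable-ASCII column.
# Non-printable bytes (including NUL) are shown as '.' in the text column.
sub hexdump {
    my ($bytes) = @_;
    my $out = '';
    for (my $off = 0; $off < length($bytes); $off += 16) {
        my $chunk = substr($bytes, $off, 16);
        my $hex   = join ' ', map { sprintf('%02x', ord($_)) } split //, $chunk;
        (my $text = $chunk) =~ tr/\x20-\x7e/./c;
        $out .= sprintf("%08x  %-47s  |%s|\n", $off, $hex, $text);
    }
    return $out;
}

# Hypothetical sample: "- name" as it would be stored in UTF-16LE.
print hexdump("-\x00 \x00n\x00a\x00m\x00e\x00");
```

If every second byte in the dump is 00, the "spaces" are NUL bytes. For a real file, the bytes should be read unmodified, for example through a handle opened with the `<:raw` layer.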

From: Steve C on 7 Jan 2010 14:14

Bart Van der Donck wrote:
> Hello,
>
> I have been assigned a task to filter out an email address from the
> body of a (.msg) source file.
>
> The source file looks odd and displays differently in various
> plaintext readers. It looks like some sort of half binary / half ascii
> format (including the headers). The body of the file is more-or-less
> consistent. The address to be extracted is in the following format:
>
> "- n a m e @ h o s t . c o m "
>
> All text in the source file is with such spaces between.
>
> Spaces can be displayed like EOL, space or nothing. Binary characters
> seem to be inserted randomly; sometimes I can recognize a pattern of a
> repeated string. Maybe someone is familiar with this format ? The
> messages were saved from MS Outlook.
>
> I tried many variants, my best shot goes to:
>
> if (/(-)(\s\s\s)(.+)(@)(.+)(\.)(.+)(\s\s\s)/gs) { ...
>
> But still no success. I was thinking of an encoding issue (Unicode/
> UTF?), but the source file seems too different for that.
>
> Thanks
>

Are you sure they are spaces and not NULs? Windows text files frequently use 16-bit wide character format, which looks like 0x0 in the high byte and ASCII in the low byte for English characters.

http://www.microsoft.com/opentype/unicode/cs.htm

From: Wanna-Be Sys Admin on 7 Jan 2010 19:34

Bart Van der Donck wrote:
> Hello,
>
> I have been assigned a task to filter out an email address from the
> body of a (.msg) source file.
>
> The source file looks odd and displays differently in various
> plaintext readers. It looks like some sort of half binary / half ascii
> format (including the headers). The body of the file is more-or-less
> consistent. The address to be extracted is in the following format:
>
> "- n a m e @ h o s t . c o m "
>
> All text in the source file is with such spaces between.
>
> Spaces can be displayed like EOL, space or nothing. Binary characters
> seem to be inserted randomly; sometimes I can recognize a pattern of a
> repeated string. Maybe someone is familiar with this format ? The
> messages were saved from MS Outlook.
>
> I tried many variants, my best shot goes to:
>
> if (/(-)(\s\s\s)(.+)(@)(.+)(\.)(.+)(\s\s\s)/gs) { ...

Maybe try stripping hidden characters from the file first? Trying to guess exactly how many spaces or other characters there are can be a hassle. The problem is that one address could be formatted completely differently from another; I assume they aren't all the same? If so, and if they really are white space, maybe \s+ in place of \s\s\s would be better? Also, why are you capturing \s\s\s? Is that intentional, and is it what you want? Anyway, you probably need to convert the file/data to strip out the junk, so that you can get at the actual data you want rather than trying to work around ignoring or grabbing that junk.

--
Not really a wanna-be, but I don't know everything.

From: Jürgen Exner on 7 Jan 2010 19:35

Bart Van der Donck <bart(a)nijlen.com> wrote:
>The source file looks odd and displays differently in various
>plaintext readers. It looks like some sort of half binary / half ascii
>format (including the headers). The body of the file is more-or-less
>consistent. The address to be extracted is in the following format:
>
> "- n a m e @ h o s t . c o m "
>
>All text in the source file is with such spaces between.
>
>Spaces can be displayed like EOL, space or nothing. Binary characters
>seem to be inserted randomly; sometimes I can recognize a pattern of a
>repeated string. Maybe someone is familiar with this format ? The
>messages were saved from MS Outlook.

This file is likely in UTF-16 or UCS-2. Did you look at it in a hex/binary editor? Those spaces are probably 0x00 bytes and not really spaces at all.

Use the proper encoding and Perl should be able to read the file just fine.

jue
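
The advice in this thread can be sketched as follows. This is only an illustration under stated assumptions: the text is taken to be UTF-16LE (low byte first, NUL high byte for ASCII), the address is taken to consist of plain ASCII characters, and `extract_addresses` and the sample data are hypothetical. Because the file reportedly also contains binary data, the sketch matches "ASCII byte plus NUL" pairs on the raw bytes and decodes only the matched run, instead of decoding the whole file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for the file contents: UTF-16LE text with a few
# unrelated binary bytes before and after it.
my $raw = "\x01\x02" . encode('UTF-16LE', "- name\@host.com \r\n") . "\xFF\xFE";

# One or more address characters, each followed by its NUL high byte.
# The character class is a simplification, not a full address grammar.
my $atom = qr/(?:[A-Za-z0-9._%+-]\x00)+/;

# Return every address-like run found in the raw bytes, decoded to text.
sub extract_addresses {
    my ($bytes) = @_;
    my @found;
    while ($bytes =~ /($atom\@\x00$atom)/g) {
        my $hit = $1;                       # copy before decoding
        push @found, decode('UTF-16LE', $hit);
    }
    return @found;
}

print "$_\n" for extract_addresses($raw);
```

If the file turns out to be pure UTF-16 text with no binary parts, opening it with an `:encoding(UTF-16)` layer and applying an ordinary regex to the decoded text is the simpler route.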