From: Bart Van der Donck on 8 Jan 2010 04:58

Jürgen Exner wrote:
> Bart Van der Donck <b...(a)nijlen.com> wrote:
>
>>The source file looks odd and displays differently in various
>>plaintext readers. It looks like some sort of half binary / half ascii
>>format (including the headers). The body of the file is more-or-less
>>consistent. The address to be extracted is in the following format:
>>
>> "- n a m e @ h o s t . c o m "
>>
>>All text in the source file is with such spaces between.
>>
>>Spaces can be displayed like EOL, space or nothing. Binary characters
>>seem to be inserted randomly; sometimes I can recognize a pattern of a
>>repeated string. Maybe someone is familiar with this format ? The
>>messages were saved from MS Outlook.
>
> This file is likely in UTF-16 or UCS-2. Did you look at it in a
> hex/binary editor? Those spaces are probably 0x00 bytes and not really
> spaces at all.
>
> Use the proper encoding and Perl should be able to read the file just
> fine,

Yes - it appeared to be a UTF-16 issue indeed. I tried about all
possible byte order encoding schemes... and the following finally did
the trick:

  use Encode;
  open(my $in, '<:raw', $mypath) || die "Couldn't open file: $!";
  my $txt = do { local $/; <$in> };
  close $in;
  my @lines = split /\n/, decode('UTF-16LE', $txt);

Thanks all for your help!

--
Bart
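For readers who want to try the decode step without an Outlook file at hand, here is a minimal sketch of the same idea applied to an in-memory byte string. The helper name `extract_address`, the sample input, and the deliberately simple address pattern are all assumptions made for illustration; they are not taken from the thread, and the pattern is not a complete e-mail matcher.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-16LE bytes and return the first
# substring that looks like an e-mail address, or undef if none is found.
# The pattern is an illustrative assumption, not a full RFC 5322 matcher.
sub extract_address {
    my ($bytes) = @_;
    my $text = decode('UTF-16LE', $bytes);
    return $text =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/ ? $1 : undef;
}

# Invented sample: every ASCII character is followed by a 0x00 byte,
# which is how plain ASCII text looks when stored as UTF-16LE.
my $sample = join '', map { "$_\x00" } split //, "- name\@host.com \n";

print extract_address($sample), "\n";   # prints name@host.com
```

Once the bytes are decoded, the "spaces" between characters are gone and an ordinary pattern works, so the `\s\s\s` workarounds from the original question should no longer be needed.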
From: Permostat on 8 Jan 2010 09:08

On Jan 7, 11:19 am, Bart Van der Donck <b...(a)nijlen.com> wrote:
> Hello,
>
> I have been assigned a task to filter out an email address from the
> body of a (.msg) source file.
>
> The source file looks odd and displays differently in various
> plaintext readers. It looks like some sort of half binary / half ascii
> format (including the headers). The body of the file is more-or-less
> consistent. The address to be extracted is in the following format:
>
> "- n a m e @ h o s t . c o m "
>
> All text in the source file is with such spaces between.
>
> Spaces can be displayed like EOL, space or nothing. Binary characters
> seem to be inserted randomly; sometimes I can recognize a pattern of a
> repeated string. Maybe someone is familiar with this format ? The
> messages were saved from MS Outlook.
>
> I tried many variants, my best shot goes to:
>
> if (/(-)(\s\s\s)(.+)(@)(.+)(\.)(.+)(\s\s\s)/gs) { ...
>
> But still no success. I was thinking of an encoding issue (Unicode/
> UTF?), but the source file seems too different for that.
>
> Thanks
>
> --
> Bart

Do little tiny minor jobs usually make you break out in a sweat like
this??

PRONTOR
From: Peter J. Holzer on 8 Jan 2010 09:11

On 2010-01-08 09:58, Bart Van der Donck <bart(a)nijlen.com> wrote:
> use Encode;
> open(my $in, '<:raw', $mypath) || die "Couldn't open file: $!";
> my $txt = do { local $/; <$in> };
> close $in;
> my @lines = split /\n/, decode('UTF-16LE', $txt);

Shorter:

  open(my $in, '<:encoding(UTF-16LE)', $mypath) || die "Couldn't open file: $!";
  my @lines = <$in>;
  chomp @lines;

(untested)

hp
From: Bart Van der Donck on 9 Jan 2010 03:50

Peter J. Holzer wrote:
> On 2010-01-08 09:58, Bart Van der Donck <b...(a)nijlen.com> wrote:
>
>> use Encode;
>> open(my $in, '<:raw', $mypath) || die "Couldn't open file: $!";
>> my $txt = do { local $/; <$in> };
>> close $in;
>> my @lines = split /\n/, decode('UTF-16LE', $txt);
>
> Shorter:
>
> open(my $in, '<:encoding(UTF-16LE)', $mypath) || die "Couldn't open file: $!";
> my @lines = <$in>;
> chomp @lines;

For my particular situation, it appears that I need the raw method
anyhow. When I read directly with '<:encoding(UTF-16LE)', it says:

  "UTF-16LE:Unicode character fffe is illegal at script.pl line 32."

(32 is the line with the 'open'-call)

--
Bart
From: Jürgen Exner on 9 Jan 2010 04:20

Bart Van der Donck <bart(a)nijlen.com> wrote:
>For my particular situation, it appears that I need the raw method
>anyhow. When I read directly with '<:encoding(UTF-16LE)', it says:
>
> "UTF-16LE:Unicode character fffe is illegal at script.pl line 32."

The only place where 0xFFFE could possibly show up is the byte order
mark (BOM), and I would be very surprised if Perl couldn't handle the
BOM. I would suggest checking the file with a hex editor to make sure
it does not contain an additional rogue BOM somewhere in the middle of
the file.

jue
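The hex-editor check suggested here can also be scripted. The sketch below is only one possible way to do it, and the function name `find_bom_units` and the sample bytes are invented for illustration. It reads a byte string as 16-bit little-endian units and reports the byte offset of every unit equal to 0xFEFF (the BOM) or 0xFFFE (what the bytes FE FF become when read as little-endian). It does not diagnose the actual file from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return [byte_offset, value] pairs for every 16-bit
# little-endian unit in $bytes that equals one of @wanted.
# Defaults to 0xFEFF (BOM) and 0xFFFE (byte-swapped BOM).
sub find_bom_units {
    my ($bytes, @wanted) = @_;
    @wanted = (0xFEFF, 0xFFFE) unless @wanted;
    my %want  = map { $_ => 1 } @wanted;
    my @units = unpack 'v*', $bytes;   # 'v' = unsigned 16-bit, little-endian
    my @hits;
    for my $i (0 .. $#units) {
        push @hits, [ $i * 2, $units[$i] ] if $want{ $units[$i] };
    }
    return @hits;
}

# Invented sample: a little-endian BOM, the letter "a", then the bytes
# FE FF, which a little-endian reader sees as U+FFFE.
my $sample = "\xFF\xFE" . "a\x00" . "\xFE\xFF" . "b\x00";

for my $hit (find_bom_units($sample)) {
    printf "offset %d: U+%04X\n", @$hit;
}
```

For a real file, the bytes would first be read through a `'<:raw'` handle, as in Bart's original snippet, and then passed to the helper. The scan assumes units aligned on even byte offsets, so a hit in a file that mixes binary data with text is only a hint about where to look.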