Appropriate technique for altering a text file? [Perl]


From: ccc31807 on 13 Aug 2010 13:08

During the discussion of the 9-11 mosque in NYC, several commentators
mentioned Milestones by Sayed Qutb. I decided to read it to see what
the fuss was about, and ended up with an ASCII text copy generated
from a PDF original.

I could have printed the text directly, but it was pretty mangled, and
after attempting and failing to reformat the document using vi, I
decided to write a simple Perl script to reformat it. I wanted to do
several things: join paragraphs together (every line in the file was
terminated by a "\n"), separate paragraphs by a blank line (block
style), remove repeated periods (dots), remove form feeds (which
marked pages in the original), etc.

I first attempted to munge the file line by line, like this:
#FIRST ATTEMPT
open MS, '<', $file or die "Cannot read $file: $!";
open OUT, '>', $out or die "Cannot write $out: $!";
while (<MS>)
{
    #do stuff to $_
    print OUT;
}
close MS;
close OUT;

It mostly worked, but I couldn't fine tune it. I then attempted to
munge two lines together, like this:
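#SECOND ATTEMPT
open MS, '<', $file or die "Cannot read $file: $!";
open OUT, '>', $out or die "Cannot write $out: $!";
$line1 = <MS>;
print OUT $line1 if defined $line1;
while (<MS>)
{
    $line2 = $_;
    #do stuff to $line2, depending on the preceding line in $line1
    print OUT $line2;
    $line1 = $line2;    #the current line becomes the preceding line
}
close MS;
close OUT;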

This worked a little better, but it wasn't perfect. I then tried this
and got perfect formatting:
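#THIRD ATTEMPT
{
    local $/ = undef;    #no record separator: read the whole file at once
    open MS, '<', $file or die "Cannot read $file: $!";
    $document = <MS>;
    close MS;
}
#series of transformations like this
$document =~ s/\r//g;    #/g removes every carriage return, not just the first
open OUT, '>', $out or die "Cannot write $out: $!";
print OUT $document;
close OUT;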
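
As a rough illustration of what such a series of transformations might
look like for the goals listed earlier (joining lines into paragraphs,
blank-line paragraph separation, collapsing repeated dots, dropping form
feeds), here is a minimal sketch. The helper name reformat_text and the
exact rules are assumptions made for this example, not the actual script:
it assumes paragraphs in the input are separated by blank lines, and that
a run of dots should collapse to a single period (which would also
shorten a genuine ellipsis).

```perl
use strict;
use warnings;

# Hypothetical helper: the name and the rules below are assumptions
# for illustration only.
sub reformat_text {
    my ($text) = @_;
    $text =~ s/\r//g;           # drop carriage returns
    $text =~ s/\f//g;           # drop form feeds (page markers)
    $text =~ s/\.{2,}/./g;      # collapse runs of dots to one period

    # Assumes paragraphs are separated by one or more blank lines.
    my @paragraphs = split /\n{2,}/, $text;
    for (@paragraphs) {
        s/\s*\n\s*/ /g;         # join the lines of a paragraph
        s/^\s+|\s+$//g;         # trim leading and trailing whitespace
    }
    # Block style: one blank line between paragraphs.
    return join("\n\n", grep { length } @paragraphs) . "\n";
}

my $sample = "It was mangled...\nby the PDF export.\f\n\nSecond paragraph\nhere.\n";
print reformat_text($sample);
```

Because the function only takes a string and returns a string, it can be
dropped between the read and the write of the third attempt without
touching any filehandle code.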

All of the work I have done in the past has munged the lines one by
one, as in the first example. Occasionally, I have had to use the
second style (e.g., where the formatting of each line depends on the
content of the preceding line.) I've never used the third style at
all.

I liked the third way a lot. It seemed quick, easy, and worked
perfectly. I was actually able to open the resulting document in Word,
fancify it a little, and print a nice finished copy. However, I can't
think of any actual uses of the third style in my day to day work.

My question is this: Is the third attempt, slurping the entire
document into memory and transforming the text by regexes, very common,
or is it considered a last resort when nothing else would work?

CC.

From: Uri Guttman on 13 Aug 2010 13:29

>>>>> "c" == ccc31807 <cartercc(a)gmail.com> writes:

c> This worked a little better, but it wasn't perfect. I then tried this
c> and got perfect formatting:
c> #THIRD ATTEMPT
c> {
c> local $/ = undef;
c> open MS, '<', $file;
c> $document = <MS>;
c> close MS;
c> }

c> All of the work I have done in the past has munged the lines one by
c> one, as in the first example. Occasionally, I have had to use the
c> second style (e.g., where the formatting of each line depends on the
c> content of the preceding line.) I've never used the third style at
c> all.

it isn't as common as it should be IMNSHO. in the old days reading files
line by line was almost required due to small memory machines. today,
megabyte files can be slurped without fear at all but line by line is
still taught as standard. it takes time to change views.

c> I liked the third way a lot. It seemed quick, easy, and worked
c> perfectly. I was actually able to open the resulting document in
c> Word, fancify it a little, and print a nice finished copy. However,
c> I can't think of any actual uses of the third style in my day to
c> day work.

parsing and text munging is much easier when the entire file is in
ram. there is no need to mix i/o with logic, the i/o is much faster, you
can send/receive whole documents to servers (which could format things
or whatever), etc. slurping whole files makes a lot of sense in many
areas.

c> My question is this: Is the third attempt, slurping the entire
c> document into memory and transforming the text by regexs, very common,
c> or is it considered a last resort when nothing else would work?

it is not a last resort by any stretch of the imagination today. and use File::Slurp
instead for both reading and writing the file. it is cleaner and faster
than the methods you used.
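
A minimal sketch of the File::Slurp approach, for comparison with the
third attempt. File::Slurp is a CPAN module (not part of core Perl), so
it has to be installed first; read_file and write_file are its documented
functions. The temporary directory and file names are placeholders made
up for this example so that it runs on its own.

```perl
use strict;
use warnings;
use File::Slurp qw(read_file write_file);
use File::Temp qw(tempdir);

# Placeholder input and output paths, created only for this sketch.
my $dir = tempdir(CLEANUP => 1);
my $in  = "$dir/in.txt";
my $out = "$dir/out.txt";
write_file($in, "line one\r\nline two\r\n");

my $document = read_file($in);      # scalar context: whole file in one string
$document =~ s/\r//g;               # same kind of transformation as before
write_file($out, $document);        # write the result back out in one call
```

No explicit open, close, or local $/ is needed, which is where most of
the boilerplate in the hand-rolled version goes.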

uri

--
Uri Guttman ------ uri(a)stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

From: Peter J. Holzer on 13 Aug 2010 14:14

On 2010-08-13 17:08, ccc31807 <cartercc(a)gmail.com> wrote:
[ 3 ways of munging a text file: line by line, pairs of lines,
and whole file at once
]

> I liked the third way a lot. It seemed quick, easy, and worked
> perfectly. I was actually able to open the resulting document in Word,
> fancify it a little, and print a nice finished copy. However, I can't
> think of any actual uses of the third style in my day to day work.
>
> My question is this: Is the third attempt, slurping the entire
> document into memory and transforming the text by regexs, very common,
> or is it considered a last resort when nothing else would work?

Uri would probably tell you that's what you always should do unless the
file is too big to fit into memory (and you should use File::Slurp for
it) :-).

I do whatever allows the most straightforward implementation. Very
often that means reading the whole data into memory, although not
necessarily as a single scalar.
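
One possible reading of "not necessarily as a single scalar", sketched
here as an assumption rather than as the poster's own method, is to read
the file into an array of paragraphs using Perl's paragraph mode
($/ set to the empty string). The in-memory sample text stands in for a
real file so the example is self-contained.

```perl
use strict;
use warnings;

# In-memory sample standing in for a real file (an assumption of this sketch).
my $sample = "first line\nsecond line\n\n\nnext paragraph\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my @paragraphs;
{
    local $/ = '';              # paragraph mode: records end at blank lines
    @paragraphs = <$fh>;        # list context: one element per paragraph
}
close $fh;

for my $para (@paragraphs) {
    $para =~ s/\n+\z//;         # drop the record's trailing newlines
    $para =~ s/\n/ /g;          # join the lines within the paragraph
}
print join("\n\n", @paragraphs), "\n";
```

The whole file is still in memory, but each paragraph can be processed
on its own, which suits tasks like rejoining hard-wrapped lines.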

hp

From: Peter J. Holzer on 13 Aug 2010 14:42

On 2010-08-13 18:14, Peter J. Holzer <hjp-usenet2(a)hjp.at> wrote:
> Uri would probably tell you [...]

I didn't see Uri's answer before I posted this. I swear! :-)

hp

From: Uri Guttman on 13 Aug 2010 14:48

>>>>> "PJH" == Peter J Holzer <hjp-usenet2(a)hjp.at> writes:

PJH> On 2010-08-13 18:14, Peter J. Holzer <hjp-usenet2(a)hjp.at> wrote:
>> Uri would probably tell you [...]

PJH> I didn't see Uri's answer before I posted this. I swear! :-)

great minds. :)

uri

