From: Tad McClellan on
Uri Guttman <uri(a)StemSystems.com> wrote:
>>>>>> "PJH" == Peter J Holzer <hjp-usenet2(a)hjp.at> writes:
>
> PJH> On 2010-08-13 18:14, Peter J. Holzer <hjp-usenet2(a)hjp.at> wrote:
> >> Uri would probably tell you [...]
>
> PJH> I didn't see Uri's answer before I posted this. I swear! :-)
>
> great minds. :)


yes, but why were you and Peter both thinking the same thing?

:-) :-)


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.
From: ccc31807 on
On Aug 13, 1:29 pm, "Uri Guttman" <u...(a)StemSystems.com> wrote:
> parsing and text munging is much easier when the entire file is in
> ram. there is no need to mix i/o with logic, the i/o is much faster, you
> can send/receive whole documents to servers (which could format things
> or whatever), etc. slurping whole files makes a lot of sense in many
> areas.
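
For context, the slurp described above can be done just by clearing $/.
A minimal sketch, with the filename assumed for illustration:

#!/usr/bin/perl
use strict;
use warnings;

my $file = 'document.txt';             # illustrative name only
my $text = do {
    local $/;                          # undef the input record separator
    open my $fh, '<', $file or die "open $file: $!";
    <$fh>;                             # one read returns the whole file
};

# the whole document can now be transformed with regexes in one pass
$text =~ s/\r\n/\n/g;                  # e.g. normalize line endings
print length($text), " bytes slurped\n";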

Most of what I do requires me to treat each record as a separate
'document.' In many cases, this even extends to the output, where one
input document results in hundreds of separate output documents, each
of which must be opened, written to, and closed.
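
A minimal sketch of that fan-out, with the filenames, the tab separator,
and the per-record naming scheme all assumed for illustration:

#!/usr/bin/perl
use strict;
use warnings;

open my $in, '<', 'records.txt' or die "open records.txt: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($id, @fields) = split /\t/, $line;      # assumed tab-separated input
    open my $out, '>', "$id.csv" or die "open $id.csv: $!";
    print {$out} join(',', map { qq("$_") } @fields), "\n";
    close $out or die "close $id.csv: $!";
}
close $in;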

I'm not being difficult (or maybe I am), but I'm having a hard time
seeing how this kind of logic, which treats each record separately:

while (<IN>)
{
    chomp;
    my ($var1, $var2, ... $varn) = split;
    # do stuff
    print OUT qq("$field1","$field2",..."$fieldn"\n);
}

or this:

foreach my $key (sort keys %{$hashref})
{
    # do stuff using $hashref->{$key}{var1}, $hashref->{$key}{var2}, etc.
    print OUT qq("$field1","$field2",..."$fieldn"\n);
}

could be made easier by dealing with the entire file at once.
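
For what it's worth, here is a sketch of that same record-at-a-time shape
run over a slurped file; the filenames, separators, and field handling are
assumptions rather than anything from the loops above:

#!/usr/bin/perl
use strict;
use warnings;

open my $in, '<', 'input.txt' or die "open input.txt: $!";
my $text = do { local $/; <$in> };             # slurp the whole file
close $in;

open my $out, '>', 'output.csv' or die "open output.csv: $!";
for my $record (split /\n/, $text) {           # records are still lines
    my @fields = split ' ', $record;           # whitespace-separated fields
    # do stuff with @fields
    print {$out} join(',', map { qq("$_") } @fields), "\n";
}
close $out or die "close output.csv: $!";

The per-record structure does not change; the records simply come from a
string in memory instead of the filehandle, which mainly pays off when the
processing needs to look across records or at the document as a whole.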

Okay, this is the first time I have had to treat a single file as a
unit, and to be honest the experience was positive. Still, my
worldview consists of record-oriented datasets, so I put this in my
nice-to-know-but-not-particularly-useful category.

CC.
From: sln on
On Fri, 13 Aug 2010 10:08:48 -0700 (PDT), ccc31807 <cartercc(a)gmail.com> wrote:

>My question is this: Is the third attempt, slurping the entire
>document into memory and transforming the text by regexs, very common,
>or is it considered a last resort when nothing else would work?
>

The answer is no to slurping, and no to using regexes on large
documents that don't need to be all in memory.

There is usually a single drive (or a RAID array), and only one
I/O operation is performed at a time. If one process hogs it, the
other processes wait until the hog is done and their I/O is
dequeued, performed, and returned.
Modern SATA2, RAID-configured drives perform well when reading and
writing incremental data, so large data that can be worked on
incrementally should always be handled that way. The default read
buffer between the API and the device is usually small, so as not
to clog up device I/O and spin locks; the reads are therefore still
going to be incremental.
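
A sketch of that kind of incremental processing, reading fixed-size
chunks instead of slurping; the 64 KB buffer size and the filename are
illustrative:

#!/usr/bin/perl
use strict;
use warnings;

my $file = 'large.dat';                        # illustrative name only
open my $fh, '<:raw', $file or die "open $file: $!";

my ($buf, $total) = ('', 0);
while (my $n = read($fh, $buf, 64 * 1024)) {   # one 64 KB chunk at a time
    # work on $buf here; only one buffer's worth is ever in memory
    $total += $n;
}
close $fh;
print "processed $total bytes in 64 KB chunks\n";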

A complex regex will perform more backtracking on large data than
on small data, so it depends on the type and complexity of the
pattern.

The third reason is always memory. Sure, there is a lot of memory,
but hogging it all bogs down background file caching and other
processing.