From: Willem on
Uri Guttman wrote:
)>>>>> "W" == Willem <willem(a)turtle.stack.nl> writes:
)
) W> my $spos = rindex($block, "\n");
)
) ahh, here is your bottleneck. use tr/// to count the newlines of each
) block. if you haven't read enough then read another. you don't need to
) use rindex for each newline. also when you find the block which has the
) desired ending, you can use a forward regex or something else to find
) the nth newline in one call. perl is slow doing ops in a loop but fast
) doing loops internally. so always use perl ops which do more work for you.

So you're thinking it could be even faster ?
Okay, I tried it with tr, and indeed it goes almost twice as fast.
About five times as fast as ReadBackwards.

I added:

my $nnl = ($block =~ tr/\n/\n/);
if ($lines >= $nnl) {
    $lines -= $nnl;
    next;
}
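
For reference, the whole loop now looks roughly like this (a from-memory
sketch, untested as posted; $file and $lines are assumed to be set up
already, and $blocksize is just a value that worked well for me):

my $blocksize = 1 << 20;
open(my $fh, '+<', $file) or die "Failed to open '$file': $!";
binmode($fh);                           # byte offsets, not characters
my $pos = -s $fh;

while ($pos > 0) {
    my $len = $pos < $blocksize ? $pos : $blocksize;
    $pos -= $len;
    seek($fh, $pos, 0) or die "Failed to seek in '$file': $!";
    read($fh, my $block, $len) == $len
        or die "Failed to read from '$file': $!";

    # cheap newline count for the whole block; if all of its
    # lines are to be removed, skip straight to the next block
    my $nnl = ($block =~ tr/\n/\n/);
    if ($lines >= $nnl) {
        $lines -= $nnl;
        next;
    }

    # this block holds the cut point: walk back newline by newline
    my $spos = rindex($block, "\n");
    while ($spos >= 0) {
        if (--$lines < 0) {
            truncate($fh, $pos + $spos)
                or die "Failed to truncate '$file': $!";
            exit(0);
        }
        $spos = rindex($block, "\n", $spos - 1);
    }
}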

At this time, I'm beginning to see significant fluctuations (50%),
caused by disk caching effects, most likely.

'user' time is about a factor 10:1 against ReadBackwards, while
'real' time is only about 8:1, so disk I/O is definitely a factor here.

) W> while ($spos >= 0) {
) W>     if (--$lines < 0) {
) W>         truncate($fh, $pos + $spos)
) W>             or die "Failed to truncate '$file': $!";
) W>         exit(0);
) W>     }
) W>     $spos = rindex($block, "\n", $spos - 1);
)
) that is a slow perl loop calling rindex over and over.

I know, I was going for correctness first.

What regex or other perl-internal would you use to find the Nth newline
from the rear ?

PS: I don't really think that will make much difference, as with the tr///
optimization it will only be used on the final $blocksize bytes.

PPS: I might try a C version, to see what that does, but not today.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
From: Willem on
Uri Guttman wrote:
) did you try my suggested algorithm? it isn't too much work reading large
) blocks from the end, counting newlines and then doing a truncate at the
) desired point. i see it at about 30 lines of code or so.

Which algorithm would that be ? I posted my code cross-thread, you already
commented on it (especially the rindex bit), and I dropped in the suggested
tr/// optimization. Removing a hundred thousand lines from ten million takes
between 0.02 and 0.05 seconds; removing a million from ten million takes
between 0.19 and 0.35 seconds. The fluctuations are probably due to disk
caching effects.

File::ReadBackwards takes between 1.8 and 2 seconds, so I guess the
overhead in splitting the blocks into lines and such is significant.

Or did you put a function in that module that I'm missing ?
I used the code posted in the FAQ that opened this thread.
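
For completeness, that FAQ code is essentially the following (quoted from
memory, with $file and $lines_to_truncate standing in for the hardcoded
values):

use File::ReadBackwards;

my $bw = File::ReadBackwards->new($file)
    or die "could not read backwards in [$file]: $!";

my $lines_from_end = 0;
until ($bw->eof or $lines_from_end == $lines_to_truncate) {
    $bw->readline;              # read (and discard) one line from the end
    $lines_from_end++;
}
truncate($file, $bw->tell)      # tell() is the offset where the kept part ends
    or die "could not truncate [$file]: $!";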


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
From: Dr.Ruud on
Willem wrote:

> a ten-million line file
> that is over 600mb large.

Anything is over 600 mb large.
With the exception of the 600 mb themselves of course.

--
Ruud
From: Uri Guttman on
>>>>> "W" == Willem <willem(a)turtle.stack.nl> writes:

W> Uri Guttman wrote:
W> )>>>>> "W" == Willem <willem(a)turtle.stack.nl> writes:
W> )
W> ) W> my $spos = rindex($block, "\n");
W> )
W> ) ahh, here is your bottleneck. use tr/// to count the newlines of each
W> ) block. if you haven't read enough then read another. you don't need to
W> ) use rindex for each newline. also when you find the block which has the
W> ) desired ending, you can use a forward regex or something else to find
W> ) the nth newline in one call. perl is slow doing ops in a loop but fast
W> ) doing loops internally. so always use perl ops which do more work for you.

W> So you're thinking it could be even faster ?

i know it will be faster. it does less work by far since it doesn't
split lines and loop over each line. that is a ton of perl code being
skipped. i never claimed readbackwards was the fastest way to truncate a
file. it is faster than simpler methods, but code that does just this one
job has to be faster still, as it does so much less work.

W> Okay, I tried it with tr, and indeed it goes almost twice as fast.
W> About five times as fast as ReadBackwards.

sounds reasonable.

W> I added:

W> my $nnl = ($block =~ tr/\n/\n/);

you don't need the replacement part if you are just counting chars. when
tr/// is given an empty replacement list it reuses the search list, so it
effectively doesn't change anything (unless you use the d modifier).
counting newlines is just tr/\n//.
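so that line becomes just:

my $nnl = ($block =~ tr/\n//);  # tr/// in scalar context returns the count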

W> if ($lines >= $nnl) {
W> $lines -= $nnl;
W> next;
W> }

W> At this time, I'm beginning to see significant fluctuations (50%),
W> caused by disk caching effects, most likely.

probably. those are hard to avoid when testing with large files.

W> 'user' time is about a factor 10:1 against ReadBackwards, while
W> 'real' time is only about 8:1, so disk I/O is definitely a factor here.

W> ) W> while ($spos >= 0) {
W> ) W>     if (--$lines < 0) {
W> ) W>         truncate($fh, $pos + $spos)
W> ) W>             or die "Failed to truncate '$file': $!";
W> ) W>         exit(0);
W> ) W>     }
W> ) W>     $spos = rindex($block, "\n", $spos - 1);
W> )
W> ) that is a slow perl loop calling rindex over and over.

W> I know, I was going for correctness first.

W> What regex or other perl-internal would you use to find the Nth newline
W> from the rear ?

within a block, i would count from the front with a single regex match and
then use pos() to find where the match ended. you can calculate the
truncate point from that. be aware of off by one issues as they abound
in this type of coding.
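
something like this (untested; $want would be the number of newlines to
keep in this block, i.e. your tr/// count minus the lines still to be
dropped, give or take the off by ones; $pos is the block's file offset):

# a /g match in scalar context sets pos() to just past the match
if ($block =~ /(?:[^\n]*\n){$want}/g) {
    my $cut = $pos + pos($block) - 1;   # -1 steps back onto the newline,
                                        # matching your rindex version
    truncate($fh, $cut) or die "Failed to truncate '$file': $!";
}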

W> PS: I don't really think that will make much difference, as with the tr///
W> optimization it will only be used on the final $blocksize bytes.

it will still be faster if you have large block sizes. and larger blocks
will generally be faster as well, up to the point where ram runs out.

W> PPS: I might try a C version, to see what that does, but not today.

given that the work is now down to reading large blocks and scanning them
quickly for newlines, i suspect a c version won't be that much faster, as
it still needs to do the same work. the key is to stay inside perl's guts,
which are very fast, and to stay away from perl ops, which are slow (the
interpreter's op loop is the major overhead in perl vs c).

uri

--
Uri Guttman ------ uri(a)stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
From: Peter Makholm on
Ralph Malph <ralph(a)happydays.com> writes:

> $ time perl faq.pl > top_n-10000
>
> real 0m0.219s
> user 0m0.093s
> sys 0m0.061s
>
> $ time cat puke | wc -l | xargs echo -10000 + | bc \
> | xargs echo head puke -n | sh > top_n-10000
>
> real 0m0.312s
> user 0m0.090s
> sys 0m0.121s

On a GNU system, which I believe includes cygwin, you should be able
to just say

$ time head -n -10000 puke > top_n-10000

No idea how it would compare to the other solutions.

//Makholm