From: Willem on
Uri Guttman wrote:
)>>>>> "W" == Willem <willem(a)turtle.stack.nl> writes:
)
) W> Have you also tried removing 10 lines from a million-line file ?
) W> And for giggles, you could try a hand-rolled one that uses the functions
) W> seek(), sysread() and truncate() to accomplish the job.
)
) ahem. that is what File::ReadBackwards does!

I know.

) it may be possible to hand-roll an optimized version by removing some
) overhead, etc. but it was designed to
) be very fast. your earlier point about how much to remove or skip is the
) important one. truncating most of a large file will be slower but you
) still need to count lines from the end. since you don't need to read
) each line for this you could read large blocks, scan for newlines and
) count them and then truncate to the desired point. readbackwards has the
) overhead of splitting the blocks into lines and returning each one for
) counting. but you always need to read the part you are truncating if you
) are counting lines from the end.

I just tried a hand-rolled sysseek/sysread/truncate version, and either
that overhead is very significant, or my machine is a lot faster.

> time perl trunc.pl tenmillion.txt 10000

real 0m0.013s
user 0m0.011s
sys 0m0.003s

That's removing ten thousand lines from a ten-million-line file
that is over 600 MB in size.

I'll try a ReadBackwards version now.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
From: Willem on
Willem wrote:
)> time perl trunc.pl tenmillion.txt 10000
)
) real 0m0.013s
) user 0m0.011s
) sys 0m0.003s
)
) That's removing ten thousand lines from a ten-million-line file
) that is over 600 MB in size.
)
) I'll try a ReadBackwards version now.


> time perl readb.pl tenmillion.txt 10000

real 0m0.036s
user 0m0.035s
sys 0m0.002s

Almost a factor of 3.
Removing a hundred thousand lines shows the same roughly 3:1 ratio.

readb.pl:

use warnings;
use strict;

use File::ReadBackwards;

my $filename = shift;
my $Lines_to_truncate = shift;

my $bw = File::ReadBackwards->new( $filename )
    or die "Could not read backwards in [$filename]: $!";

# Read lines from the end until the requested number has been skipped.
my $lines_from_end = 0;
until ( $bw->eof or $lines_from_end == $Lines_to_truncate ) {
    $bw->readline;
    $lines_from_end++;
}

# tell() is now the offset of the start of the last line read,
# which is exactly where the file should end.
truncate( $filename, $bw->tell )
    or die "Failed to truncate '$filename': $!";


trunc.pl:

use warnings;
use strict;

use Fcntl qw(SEEK_SET SEEK_END);

my $blocksize = 4096;

my ($file, $lines) = @ARGV;
open (my $fh, '+<', $file)
    or die "Failed to open '$file' for r/w: $!";
my $pos = sysseek($fh, 0, SEEK_END)
    or die "Failed to seek to EOF of '$file': $!";

# Step backwards a block at a time; the first step covers the
# (possibly partial) block at the end of the file.
while (($pos -= ($pos - 1) % $blocksize + 1) >= 0) {
    sysseek($fh, $pos, SEEK_SET)
        or die "Failed to seek backwards in '$file': $!";
    sysread($fh, my $block, $blocksize)
        or die "Failed to read from '$file': $!";
    # Walk the newlines in this block, last one first.
    my $spos = rindex($block, "\n");
    while ($spos >= 0) {
        if (--$lines < 0) {
            # Keep everything up to and including this newline,
            # so the last kept line stays terminated.
            truncate($fh, $pos + $spos + 1)
                or die "Failed to truncate '$file': $!";
            exit(0);
        }
        $spos = rindex($block, "\n", $spos - 1);
    }
}
die "Not enough lines in file";



SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
From: Ralph Malph on
On 5/17/2010 12:43 PM, Ralph Malph wrote:
[snip]
> $ time cat puke | wc -l | xargs echo -10000 + | bc \
> | xargs echo head puke -n | sh > top_n-10000
>
> real 0m0.312s
> user 0m0.090s
> sys 0m0.121s

a non-pipelined shell script does better:
$ time ./temp.sh > top_n-10000
real 0m0.266s
user 0m0.015s
sys 0m0.076s

------------
temp.sh
------------
#!/bin/sh
lines=$(wc -l < puke)        # redirect so wc prints just the count
num_lines=$((lines - 10000))
head -n "$num_lines" puke
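
fwiw, GNU head can also do the subtraction itself with a negative
count (GNU coreutils only, not POSIX):

$ time head -n -10000 puke > top_n-10000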
From: Uri Guttman on
>>>>> "W" == Willem <willem(a)turtle.stack.nl> writes:

W> Uri Guttman wrote:

W> ) it may be possible to hand-roll an optimized version by removing some
W> ) overhead, etc. but it was designed to
W> ) be very fast. your earlier point about how much to remove or skip is the
W> ) important one. truncating most of a large file will be slower but you
W> ) still need to count lines from the end. since you don't need to read
W> ) each line for this you could read large blocks, scan for newlines and
W> ) count them and then truncate to the desired point. readbackwards has the
W> ) overhead of splitting the blocks into lines and returning each one for
W> ) counting. but you always need to read the part you are truncating if you
W> ) are counting lines from the end.

W> I just tried a hand-rolled sysseek/sysread/truncate version, and either
W> that overhead is very significant, or my machine is a lot faster.

W> I'll try a ReadBackwards version now.

did you try my suggested algorithm? it isn't too much work reading large
blocks from the end, counting newlines and then doing a truncate at the
desired point. i see it at about 30 lines of code or so.

uri

--
Uri Guttman ------ uri(a)stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
From: Uri Guttman on
>>>>> "W" == Willem <willem(a)turtle.stack.nl> writes:

W> my $spos = rindex($block, "\n");

ahh, here is your bottleneck. use tr/// to count the newlines in each
block. if you haven't seen enough yet, read another block. you don't
need to call rindex for each newline. also, when you find the block
that contains the desired cut point, you can use a forward regex or
something else to find the nth newline in one call. perl is slow doing
ops in a loop but fast doing loops internally, so always use perl ops
which do more work for you.

W>     while ($spos >= 0) {
W>         if (--$lines < 0) {
W>             truncate($fh, $pos + $spos + 1)
W>                 or die "Failed to truncate '$file': $!";
W>             exit(0);
W>         }
W>         $spos = rindex($block, "\n", $spos - 1);

that is a slow perl loop calling rindex over and over.
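
something like this rough, untested sketch of what i mean (same command
line as your trunc.pl, and like it, it assumes the file ends with a
newline):

use warnings;
use strict;

use Fcntl qw(SEEK_SET SEEK_END);

my $blocksize = 64 * 1024;

my ($file, $lines) = @ARGV;
open (my $fh, '+<', $file)
    or die "Failed to open '$file' for r/w: $!";
my $pos = sysseek($fh, 0, SEEK_END)
    or die "Failed to seek to EOF of '$file': $!";

# newlines seen so far, counting back from the end of the file
my $seen = 0;
while (($pos -= ($pos - 1) % $blocksize + 1) >= 0) {
    sysseek($fh, $pos, SEEK_SET)
        or die "Failed to seek backwards in '$file': $!";
    sysread($fh, my $block, $blocksize)
        or die "Failed to read from '$file': $!";

    # count every newline in the block in one op
    my $count = ($block =~ tr/\n//);

    if ($seen + $count > $lines) {
        # the cut point is in this block: the newline that has exactly
        # $lines newlines after it, i.e. the nth newline of this block
        my $nth = $count - ($lines - $seen);
        $block =~ /\A(?:[^\n]*\n){$nth}/
            or die "Miscounted newlines in '$file'";
        # $+[0] is the offset just past that nth newline
        truncate($fh, $pos + $+[0])
            or die "Failed to truncate '$file': $!";
        exit(0);
    }
    $seen += $count;
}
die "Not enough lines in file";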

uri

--
Uri Guttman ------ uri(a)stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------