Speed of reading some MB of data using qx(...) [Perl]

Prev: Extracting table in html page
Next: Use of uninitialised value, how to avoid?

From: Ilya Zakharevich on 27 Jul 2010 20:50

On 2010-07-27, Wolfram Humann <w.c.humann(a)arcor.de> wrote:
> sv.c). Perl_sv_grow then needs to decide if the string's memory is
> already sufficient or really needs to grow. In the latter case,
> safesysrealloc -> Perl_safesysrealloc -> realloc is called. The
> interesting point is: how much memory does it request? The answer is:
>
> newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */
>
> I.e. it requests 10 times as much memory as is required for the
> current append operation. So when I loop 10000 times and each time
> append 100 chars to an initial string size of 10 million, the memory
> grows from 10.000e6 to 10.001e6 to 10.002e6 and so on 1000 times till
> it ends at 11.000e6.

Good l*rd!

The current algorithm is optimized to work in tandem with "my"
malloc(), which would round up to a certain geometric progression
anyway. So if one use as different malloc()s, one should better use

newlen += (newlen >> 4) + 10; /* avoid copy each time */

Thanks for an analysis (and report this on p5p)!

Yours,
Ilya

P.S. And the current algorithm is a disaster if you try to append a
string of length 200e6 to a short string...

From: Ilya Zakharevich on 27 Jul 2010 20:59

On 2010-07-27, Ben Morrow <ben(a)morrow.me.uk> wrote:
>> By the way: I estimate the time required for each realloc to be around
>> 3 ms for 10e6 chars, growing linearly with the amount of data -- I
>> consider that a fair speed and no reason to blame win32.
>
> If you timed perl's own realloc, you would (I believe) find it does much
> better than this. AFAICS from the code, it has a fixed set of block
> sizes it actually allocates. Enlarging a block such that it doesn't go
> over the block size actually allocated is *free*, not even linear in the
> size of the block, since all it does is adjust the end marker. This is
> the logic you are expecting sv_grow to implement, but perl has decided
> that this is the allocator's responsibility.

Wrong analysis. *In this particular situation* one would find Perl's
realloc() as slow as M$'s one - but it would be called MUCH more
rarely. With my malloc(), I implemented an extra call:
malloced_size(). So on the 1st iteration, things would go similarly
to "their-malloc", and realloc() would be called - and your analysis
would apply.

But on the 2nd iteration of appending, Perl would discover the new
*really* allocated size for the string, and would update the buffer
size stored in SV. After this, for many iteractions the realloc() is
out-of-the-loop altogether.

BTW, IIRC, my malloc() does not store the "end marker" at all (unless
a debugging version is used). (The bucket size of short strings
depends only on which page of memory they belong to - well, actually,
half-page.)

Hope this helps,
Ilya

From: Ilya Zakharevich on 27 Jul 2010 21:02

On 2010-07-27, Peter J. Holzer <hjp-usenet2(a)hjp.at> wrote:
> I'm pretty sure that GNU malloc doesn't round up to powers of two or
> something like that. However, the performance difference between GNU
> malloc and Perl malloc is rather small:

Yeah, right. Not more than 3 times. ;-)

> perl 5.12.1, default config, EGLIBC 2.11.2-2:
>
> 1E5 chars + 1E4 x 1E2 chars: 3.9 ms
> 1E6 chars + 1E4 x 1E2 chars: 3.8 ms
> 1E7 chars + 1E4 x 1E2 chars: 4.4 ms
>
> 1E7 chars + 1E5 x 1E1 chars: 28.4 ms
> 1E7 chars + 1E4 x 1E2 chars: 4.5 ms
> 1E7 chars + 1E3 x 1E3 chars: 2.6 ms
> 1E7 chars + 1E2 x 1E4 chars: 2.0 ms
> 1E7 chars + 1E1 x 1E5 chars: 1.9 ms
>
> 1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 2.0 ms
> 1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 4.4 ms
>
> perl 5.12.1, usemymalloc=y, EGLIBC 2.11.2-2:
>
> 1E5 chars + 1E4 x 1E2 chars: 2.6 ms
> 1E6 chars + 1E4 x 1E2 chars: 3.8 ms
> 1E7 chars + 1E4 x 1E2 chars: 2.5 ms
>
> 1E7 chars + 1E5 x 1E1 chars: 18.8 ms
> 1E7 chars + 1E4 x 1E2 chars: 2.5 ms
> 1E7 chars + 1E3 x 1E3 chars: 0.9 ms
> 1E7 chars + 1E2 x 1E4 chars: 0.9 ms
> 1E7 chars + 1E1 x 1E5 chars: 1.1 ms
>
> 1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.9 ms
> 1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 3.4 ms

Yours,
Ilya

From: Peter J. Holzer on 28 Jul 2010 04:15

On 2010-07-28 01:02, Ilya Zakharevich <nospam-abuse(a)ilyaz.org> wrote:
> On 2010-07-27, Peter J. Holzer <hjp-usenet2(a)hjp.at> wrote:
>> I'm pretty sure that GNU malloc doesn't round up to powers of two or
>> something like that. However, the performance difference between GNU
>> malloc and Perl malloc is rather small:
>
> Yeah, right. Not more than 3 times. ;-)

Compared to the factor of about 1000 that Wolfram and Ben observed this
is small. More importantly, growing a string to length n in constant
increments appears to be an O(n) operation with both your malloc and GNU
malloc, but O(n�) with Win32 malloc and BSD malloc.

Your malloc is certainly faster than GNU malloc. OTOH, AFAICS from
quickly scanning the comments in malloc.c, it always pads allocation to
the next power of two (minus 4) and it doesn't return unused memory back
to the OS. So it might use quite a bit more memory. Which of these
effects is more important depends on the program and is hard to
determine analytically. Which means that I should benchmark some of my
more performance-critical scripts with both allocators ;-).

hp

From: Wolfram Humann on 28 Jul 2010 11:09

On Jul 28, 2:50 am, Ilya Zakharevich <nospam-ab...(a)ilyaz.org> wrote:
> On 2010-07-27, Wolfram Humann <w.c.hum...(a)arcor.de> wrote:
>
> > sv.c). Perl_sv_grow then needs to decide if the string's memory is
> > already sufficient or really needs to grow. In the latter case,
> > safesysrealloc -> Perl_safesysrealloc -> realloc is called. The
> > interesting point is: how much memory does it request? The answer is:
>
> > newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */
>
> > I.e. it requests 10 times as much memory as is required for the
> > current append operation. So when I loop 10000 times and each time
> > append 100 chars to an initial string size of 10 million, the memory
> > grows from 10.000e6 to 10.001e6 to 10.002e6 and so on 1000 times till
> > it ends at 11.000e6.
>
> Good l*rd!
>
> The current algorithm is optimized to work in tandem with "my"
> malloc(), which would round up to a certain geometric progression
> anyway. So if one use as different malloc()s, one should better use
>
> newlen += (newlen >> 4) + 10; /* avoid copy each time */

I finally managed to compile my own win32 perl. (Actually it was quite
easy once I refrained from doing mistakes so stupid I do not dare to
talk about them...)
Now I could modify Perl_sv_grow() and insert debugging prints and I
found good and bad news.

The bad news: Looks like I was *overly optimistic* (LOL!) concerning
the efficiency of the current string memory allocation on win32. The
"newlen += 10 * (newlen - SvCUR(sv))" line is only executed if
SvOOK(sv) -- i.e. in most cases it is *not* executed. Therefore win32
system realloc is not called every tenth string-append operation but
*every* time something gets appended to a string.

The good news: A single additional line of code makes win32 perl
100...1000 times faster!
(for code that appends to strings very frequently)

I went with Ilya's proposal but inserted the line a little further
down, just after
if (newlen > SvLEN(sv)) { /* need more room? */

So now we have:
if (newlen > SvLEN(sv)) { /* need more room? */
newlen += (newlen >> 2) + 10;
#ifndef Perl_safesysmalloc_size
newlen = PERL_STRLEN_ROUNDUP(newlen);
#endif
if (SvLEN(sv) && s) {
s = (char*)saferealloc(s, newlen);
}

The remaining question is by what ratio a string's memory should grow.
I tried several values from (newlen >> 0) to (newlen >> 6) for the
best compromise between execution time and memory usage and my
personal favorite is (newlen >> 2). What do others here think? At the
end of this post I will attach the results for my benchmark script
starting with Cygwin Perl followed by several versions of (newlen >>
x) and finally the unpatched Strawberry Perl. These reports now also
include memory footprint info (courtesy of pslist from the
Sysinternals suite). I also went back to my original task of reading a
12 MB postscript file using qx(cat ...) and in some cases I also
report times for that -- here Cygwin (70 ms) still beats my modified
perl (210 ms), but that's still waaaaay better than the original 18000
ms :-)

I will also report to p5p.

Wolfram

###########################################################

c:\cygwin\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 1.5 ms
1E6 chars + 1E4 x 1E2 chars: 2.3 ms
1E7 chars + 1E4 x 1E2 chars: 1.5 ms

1E7 chars + 1E5 x 1E1 chars: 12.2 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms
1E7 chars + 1E3 x 1E3 chars: 0.6 ms
1E7 chars + 1E2 x 1E4 chars: 0.6 ms
1E7 chars + 1E1 x 1E5 chars: 0.8 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.9 ms

Private MB: 326.5
Peak Private MB: 326.5

--------------

qx(cat postscriptfile.ps): 68.7 ms
Private MB: 38.5
Peak Private MB: 38.5

###########################################################

newlen += (newlen >> 0) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 2.2 ms
1E6 chars + 1E4 x 1E2 chars: 1.4 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms

1E7 chars + 1E5 x 1E1 chars: 10.4 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms
1E7 chars + 1E3 x 1E3 chars: 0.6 ms
1E7 chars + 1E2 x 1E4 chars: 0.6 ms
1E7 chars + 1E1 x 1E5 chars: 0.6 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.0 ms

Private MB: 378.3
Peak Private MB: 418.0

--------------

qx(cat postscriptfile.ps): 181.2 ms
Private MB: 25.1
Peak Private MB: 40.3

###########################################################

newlen += (newlen >> 1) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 2.5 ms
1E6 chars + 1E4 x 1E2 chars: 2.4 ms
1E7 chars + 1E4 x 1E2 chars: 1.3 ms

1E7 chars + 1E5 x 1E1 chars: 9.6 ms
1E7 chars + 1E4 x 1E2 chars: 1.3 ms
1E7 chars + 1E3 x 1E3 chars: 0.7 ms
1E7 chars + 1E2 x 1E4 chars: 0.7 ms
1E7 chars + 1E1 x 1E5 chars: 0.6 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.4 ms

Private MB: 290.2
Peak Private MB: 319.5

###########################################################

newlen += (newlen >> 2) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 9.2 ms
1E6 chars + 1E4 x 1E2 chars: 5.3 ms
1E7 chars + 1E4 x 1E2 chars: 1.5 ms

1E7 chars + 1E5 x 1E1 chars: 9.9 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms
1E7 chars + 1E3 x 1E3 chars: 0.5 ms
1E7 chars + 1E2 x 1E4 chars: 0.5 ms
1E7 chars + 1E1 x 1E5 chars: 0.6 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.4 ms

Private MB: 244.9
Peak Private MB: 270.1

--------------

qx(cat postscriptfile.ps): 209.8 ms
Private MB: 16.2
Peak Private MB: 29.0

###########################################################

newlen += (newlen >> 3) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 12.1 ms
1E6 chars + 1E4 x 1E2 chars: 6.9 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms

1E7 chars + 1E5 x 1E1 chars: 10.3 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms
1E7 chars + 1E3 x 1E3 chars: 0.5 ms
1E7 chars + 1E2 x 1E4 chars: 0.5 ms
1E7 chars + 1E1 x 1E5 chars: 0.5 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.6 ms

Private MB: 221.9
Peak Private MB: 244.3

###########################################################

newlen += (newlen >> 4) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 17.0 ms
1E6 chars + 1E4 x 1E2 chars: 13.8 ms
1E7 chars + 1E4 x 1E2 chars: 11.2 ms

1E7 chars + 1E5 x 1E1 chars: 19.4 ms
1E7 chars + 1E4 x 1E2 chars: 10.1 ms
1E7 chars + 1E3 x 1E3 chars: 10.9 ms
1E7 chars + 1E2 x 1E4 chars: 11.1 ms
1E7 chars + 1E1 x 1E5 chars: 11.0 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.3 ms

Private MB: 219.4
Peak Private MB: 233.8

--------------

qx(cat postscriptfile.ps): 312.0 ms
Private MB: 14.0
Peak Private MB: 25.8

###########################################################

newlen += (newlen >> 6) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 57.7 ms
1E6 chars + 1E4 x 1E2 chars: 59.8 ms
1E7 chars + 1E4 x 1E2 chars: 67.9 ms

1E7 chars + 1E5 x 1E1 chars: 69.4 ms
1E7 chars + 1E4 x 1E2 chars: 71.6 ms
1E7 chars + 1E3 x 1E3 chars: 69.6 ms
1E7 chars + 1E2 x 1E4 chars: 64.8 ms
1E7 chars + 1E1 x 1E5 chars: 53.8 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.7 ms

Private MB: 219.8
Peak Private MB: 230.0

###########################################################

unpatched Strawberry Perl

c:\strawberry\perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 96.2 ms
1E6 chars + 1E4 x 1E2 chars: 325.7 ms
1E7 chars + 1E4 x 1E2 chars: 2655.9 ms

1E7 chars + 1E5 x 1E1 chars: 2687.3 ms
1E7 chars + 1E4 x 1E2 chars: 2687.4 ms
1E7 chars + 1E3 x 1E3 chars: 2656.1 ms
1E7 chars + 1E2 x 1E4 chars: 1093.6 ms
1E7 chars + 1E1 x 1E5 chars: 108.3 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.1 ms

Private MB: 200.4
Peak Private MB: 210.2

--------------

qx(cat postscriptfile.ps): 18187.5 ms
Private MB: 13.2
Peak Private MB: 24.9

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7
Prev: Extracting table in html page
Next: Use of uninitialised value, how to avoid?