From: Ilya Zakharevich on 27 Jul 2010 20:50 On 2010-07-27, Wolfram Humann <w.c.humann(a)arcor.de> wrote: > sv.c). Perl_sv_grow then needs to decide if the string's memory is > already sufficient or really needs to grow. In the latter case, > safesysrealloc -> Perl_safesysrealloc -> realloc is called. The > interesting point is: how much memory does it request? The answer is: > > newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */ > > I.e. it requests 10 times as much memory as is required for the > current append operation. So when I loop 10000 times and each time > append 100 chars to an initial string size of 10 million, the memory > grows from 10.000e6 to 10.001e6 to 10.002e6 and so on 1000 times till > it ends at 11.000e6. Good l*rd! The current algorithm is optimized to work in tandem with "my" malloc(), which would round up to a certain geometric progression anyway. So if one use as different malloc()s, one should better use newlen += (newlen >> 4) + 10; /* avoid copy each time */ Thanks for an analysis (and report this on p5p)! Yours, Ilya P.S. And the current algorithm is a disaster if you try to append a string of length 200e6 to a short string...
From: Ilya Zakharevich on 27 Jul 2010 20:59 On 2010-07-27, Ben Morrow <ben(a)morrow.me.uk> wrote: >> By the way: I estimate the time required for each realloc to be around >> 3 ms for 10e6 chars, growing linearly with the amount of data -- I >> consider that a fair speed and no reason to blame win32. > > If you timed perl's own realloc, you would (I believe) find it does much > better than this. AFAICS from the code, it has a fixed set of block > sizes it actually allocates. Enlarging a block such that it doesn't go > over the block size actually allocated is *free*, not even linear in the > size of the block, since all it does is adjust the end marker. This is > the logic you are expecting sv_grow to implement, but perl has decided > that this is the allocator's responsibility. Wrong analysis. *In this particular situation* one would find Perl's realloc() as slow as M$'s one - but it would be called MUCH more rarely. With my malloc(), I implemented an extra call: malloced_size(). So on the 1st iteration, things would go similarly to "their-malloc", and realloc() would be called - and your analysis would apply. But on the 2nd iteration of appending, Perl would discover the new *really* allocated size for the string, and would update the buffer size stored in SV. After this, for many iteractions the realloc() is out-of-the-loop altogether. BTW, IIRC, my malloc() does not store the "end marker" at all (unless a debugging version is used). (The bucket size of short strings depends only on which page of memory they belong to - well, actually, half-page.) Hope this helps, Ilya
From: Ilya Zakharevich on 27 Jul 2010 21:02 On 2010-07-27, Peter J. Holzer <hjp-usenet2(a)hjp.at> wrote: > I'm pretty sure that GNU malloc doesn't round up to powers of two or > something like that. However, the performance difference between GNU > malloc and Perl malloc is rather small: Yeah, right. Not more than 3 times. ;-) > perl 5.12.1, default config, EGLIBC 2.11.2-2: > > 1E5 chars + 1E4 x 1E2 chars: 3.9 ms > 1E6 chars + 1E4 x 1E2 chars: 3.8 ms > 1E7 chars + 1E4 x 1E2 chars: 4.4 ms > > 1E7 chars + 1E5 x 1E1 chars: 28.4 ms > 1E7 chars + 1E4 x 1E2 chars: 4.5 ms > 1E7 chars + 1E3 x 1E3 chars: 2.6 ms > 1E7 chars + 1E2 x 1E4 chars: 2.0 ms > 1E7 chars + 1E1 x 1E5 chars: 1.9 ms > > 1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 2.0 ms > 1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 4.4 ms > > perl 5.12.1, usemymalloc=y, EGLIBC 2.11.2-2: > > 1E5 chars + 1E4 x 1E2 chars: 2.6 ms > 1E6 chars + 1E4 x 1E2 chars: 3.8 ms > 1E7 chars + 1E4 x 1E2 chars: 2.5 ms > > 1E7 chars + 1E5 x 1E1 chars: 18.8 ms > 1E7 chars + 1E4 x 1E2 chars: 2.5 ms > 1E7 chars + 1E3 x 1E3 chars: 0.9 ms > 1E7 chars + 1E2 x 1E4 chars: 0.9 ms > 1E7 chars + 1E1 x 1E5 chars: 1.1 ms > > 1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.9 ms > 1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 3.4 ms Yours, Ilya
From: Peter J. Holzer on 28 Jul 2010 04:15 On 2010-07-28 01:02, Ilya Zakharevich <nospam-abuse(a)ilyaz.org> wrote: > On 2010-07-27, Peter J. Holzer <hjp-usenet2(a)hjp.at> wrote: >> I'm pretty sure that GNU malloc doesn't round up to powers of two or >> something like that. However, the performance difference between GNU >> malloc and Perl malloc is rather small: > > Yeah, right. Not more than 3 times. ;-) Compared to the factor of about 1000 that Wolfram and Ben observed this is small. More importantly, growing a string to length n in constant increments appears to be an O(n) operation with both your malloc and GNU malloc, but O(n�) with Win32 malloc and BSD malloc. Your malloc is certainly faster than GNU malloc. OTOH, AFAICS from quickly scanning the comments in malloc.c, it always pads allocation to the next power of two (minus 4) and it doesn't return unused memory back to the OS. So it might use quite a bit more memory. Which of these effects is more important depends on the program and is hard to determine analytically. Which means that I should benchmark some of my more performance-critical scripts with both allocators ;-). hp
From: Wolfram Humann on 28 Jul 2010 11:09
On Jul 28, 2:50 am, Ilya Zakharevich <nospam-ab...(a)ilyaz.org> wrote: > On 2010-07-27, Wolfram Humann <w.c.hum...(a)arcor.de> wrote: > > > sv.c). Perl_sv_grow then needs to decide if the string's memory is > > already sufficient or really needs to grow. In the latter case, > > safesysrealloc -> Perl_safesysrealloc -> realloc is called. The > > interesting point is: how much memory does it request? The answer is: > > > newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */ > > > I.e. it requests 10 times as much memory as is required for the > > current append operation. So when I loop 10000 times and each time > > append 100 chars to an initial string size of 10 million, the memory > > grows from 10.000e6 to 10.001e6 to 10.002e6 and so on 1000 times till > > it ends at 11.000e6. > > Good l*rd! > > The current algorithm is optimized to work in tandem with "my" > malloc(), which would round up to a certain geometric progression > anyway. So if one use as different malloc()s, one should better use > > newlen += (newlen >> 4) + 10; /* avoid copy each time */ I finally managed to compile my own win32 perl. (Actually it was quite easy once I refrained from doing mistakes so stupid I do not dare to talk about them...) Now I could modify Perl_sv_grow() and insert debugging prints and I found good and bad news. The bad news: Looks like I was *overly optimistic* (LOL!) concerning the efficiency of the current string memory allocation on win32. The "newlen += 10 * (newlen - SvCUR(sv))" line is only executed if SvOOK(sv) -- i.e. in most cases it is *not* executed. Therefore win32 system realloc is not called every tenth string-append operation but *every* time something gets appended to a string. The good news: A single additional line of code makes win32 perl 100...1000 times faster! (for code that appends to strings very frequently) I went with Ilya's proposal but inserted the line a little further down, just after if (newlen > SvLEN(sv)) { /* need more room? */ So now we have: if (newlen > SvLEN(sv)) { /* need more room? */ newlen += (newlen >> 2) + 10; #ifndef Perl_safesysmalloc_size newlen = PERL_STRLEN_ROUNDUP(newlen); #endif if (SvLEN(sv) && s) { s = (char*)saferealloc(s, newlen); } The remaining question is by what ratio a string's memory should grow. I tried several values from (newlen >> 0) to (newlen >> 6) for the best compromise between execution time and memory usage and my personal favorite is (newlen >> 2). What do others here think? At the end of this post I will attach the results for my benchmark script starting with Cygwin Perl followed by several versions of (newlen >> x) and finally the unpatched Strawberry Perl. These reports now also include memory footprint info (courtesy of pslist from the Sysinternals suite). I also went back to my original task of reading a 12 MB postscript file using qx(cat ...) and in some cases I also report times for that -- here Cygwin (70 ms) still beats my modified perl (210 ms), but that's still waaaaay better than the original 18000 ms :-) I will also report to p5p. Wolfram ########################################################### c:\cygwin\bin\perl d:\exe\LongStrings.pl 1E5 chars + 1E4 x 1E2 chars: 1.5 ms 1E6 chars + 1E4 x 1E2 chars: 2.3 ms 1E7 chars + 1E4 x 1E2 chars: 1.5 ms 1E7 chars + 1E5 x 1E1 chars: 12.2 ms 1E7 chars + 1E4 x 1E2 chars: 1.4 ms 1E7 chars + 1E3 x 1E3 chars: 0.6 ms 1E7 chars + 1E2 x 1E4 chars: 0.6 ms 1E7 chars + 1E1 x 1E5 chars: 0.8 ms 1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms 1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.9 ms Private MB: 326.5 Peak Private MB: 326.5 -------------- qx(cat postscriptfile.ps): 68.7 ms Private MB: 38.5 Peak Private MB: 38.5 ########################################################### newlen += (newlen >> 0) + 10; C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl 1E5 chars + 1E4 x 1E2 chars: 2.2 ms 1E6 chars + 1E4 x 1E2 chars: 1.4 ms 1E7 chars + 1E4 x 1E2 chars: 1.4 ms 1E7 chars + 1E5 x 1E1 chars: 10.4 ms 1E7 chars + 1E4 x 1E2 chars: 1.4 ms 1E7 chars + 1E3 x 1E3 chars: 0.6 ms 1E7 chars + 1E2 x 1E4 chars: 0.6 ms 1E7 chars + 1E1 x 1E5 chars: 0.6 ms 1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms 1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.0 ms Private MB: 378.3 Peak Private MB: 418.0 -------------- qx(cat postscriptfile.ps): 181.2 ms Private MB: 25.1 Peak Private MB: 40.3 ########################################################### newlen += (newlen >> 1) + 10; C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl 1E5 chars + 1E4 x 1E2 chars: 2.5 ms 1E6 chars + 1E4 x 1E2 chars: 2.4 ms 1E7 chars + 1E4 x 1E2 chars: 1.3 ms 1E7 chars + 1E5 x 1E1 chars: 9.6 ms 1E7 chars + 1E4 x 1E2 chars: 1.3 ms 1E7 chars + 1E3 x 1E3 chars: 0.7 ms 1E7 chars + 1E2 x 1E4 chars: 0.7 ms 1E7 chars + 1E1 x 1E5 chars: 0.6 ms 1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms 1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.4 ms Private MB: 290.2 Peak Private MB: 319.5 ########################################################### newlen += (newlen >> 2) + 10; C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl 1E5 chars + 1E4 x 1E2 chars: 9.2 ms 1E6 chars + 1E4 x 1E2 chars: 5.3 ms 1E7 chars + 1E4 x 1E2 chars: 1.5 ms 1E7 chars + 1E5 x 1E1 chars: 9.9 ms 1E7 chars + 1E4 x 1E2 chars: 1.4 ms 1E7 chars + 1E3 x 1E3 chars: 0.5 ms 1E7 chars + 1E2 x 1E4 chars: 0.5 ms 1E7 chars + 1E1 x 1E5 chars: 0.6 ms 1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms 1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.4 ms Private MB: 244.9 Peak Private MB: 270.1 -------------- qx(cat postscriptfile.ps): 209.8 ms Private MB: 16.2 Peak Private MB: 29.0 ########################################################### newlen += (newlen >> 3) + 10; C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl 1E5 chars + 1E4 x 1E2 chars: 12.1 ms 1E6 chars + 1E4 x 1E2 chars: 6.9 ms 1E7 chars + 1E4 x 1E2 chars: 1.4 ms 1E7 chars + 1E5 x 1E1 chars: 10.3 ms 1E7 chars + 1E4 x 1E2 chars: 1.4 ms 1E7 chars + 1E3 x 1E3 chars: 0.5 ms 1E7 chars + 1E2 x 1E4 chars: 0.5 ms 1E7 chars + 1E1 x 1E5 chars: 0.5 ms 1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms 1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.6 ms Private MB: 221.9 Peak Private MB: 244.3 ########################################################### newlen += (newlen >> 4) + 10; C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl 1E5 chars + 1E4 x 1E2 chars: 17.0 ms 1E6 chars + 1E4 x 1E2 chars: 13.8 ms 1E7 chars + 1E4 x 1E2 chars: 11.2 ms 1E7 chars + 1E5 x 1E1 chars: 19.4 ms 1E7 chars + 1E4 x 1E2 chars: 10.1 ms 1E7 chars + 1E3 x 1E3 chars: 10.9 ms 1E7 chars + 1E2 x 1E4 chars: 11.1 ms 1E7 chars + 1E1 x 1E5 chars: 11.0 ms 1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms 1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.3 ms Private MB: 219.4 Peak Private MB: 233.8 -------------- qx(cat postscriptfile.ps): 312.0 ms Private MB: 14.0 Peak Private MB: 25.8 ########################################################### newlen += (newlen >> 6) + 10; C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl 1E5 chars + 1E4 x 1E2 chars: 57.7 ms 1E6 chars + 1E4 x 1E2 chars: 59.8 ms 1E7 chars + 1E4 x 1E2 chars: 67.9 ms 1E7 chars + 1E5 x 1E1 chars: 69.4 ms 1E7 chars + 1E4 x 1E2 chars: 71.6 ms 1E7 chars + 1E3 x 1E3 chars: 69.6 ms 1E7 chars + 1E2 x 1E4 chars: 64.8 ms 1E7 chars + 1E1 x 1E5 chars: 53.8 ms 1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms 1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.7 ms Private MB: 219.8 Peak Private MB: 230.0 ########################################################### unpatched Strawberry Perl c:\strawberry\perl\bin\perl d:\exe\LongStrings.pl 1E5 chars + 1E4 x 1E2 chars: 96.2 ms 1E6 chars + 1E4 x 1E2 chars: 325.7 ms 1E7 chars + 1E4 x 1E2 chars: 2655.9 ms 1E7 chars + 1E5 x 1E1 chars: 2687.3 ms 1E7 chars + 1E4 x 1E2 chars: 2687.4 ms 1E7 chars + 1E3 x 1E3 chars: 2656.1 ms 1E7 chars + 1E2 x 1E4 chars: 1093.6 ms 1E7 chars + 1E1 x 1E5 chars: 108.3 ms 1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms 1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.1 ms Private MB: 200.4 Peak Private MB: 210.2 -------------- qx(cat postscriptfile.ps): 18187.5 ms Private MB: 13.2 Peak Private MB: 24.9 |