From: Ben Morrow on

Quoth Wolfram Humann <w.c.humann(a)arcor.de>:
> Alright, I did some profiling and code-reading and what I found is
> something that I would consider a bug or at least fairly poor coding
> practice in the core.
> Opinions very welcome!
>
> In Strawberry almost the entire time is spent in the following call
> sequence:
>
> Perl_sv_catpvn_flags -> Perl_sv_grow -> Perl_safesysrealloc -> realloc
> (msvcrt)
>
> Perl_sv_catpvn_flags (in sv.c) is documented as "Concatenates the
> string onto the end of the string which is in the SV". That's what my
> code does all the time. So far so good.
> Perl_sv_catpvn_flags *always* calls SvGrow -> Perl_sv_grow (also in
> sv.c). Perl_sv_grow then needs to decide if the string's memory is
> already sufficient or really needs to grow. In the latter case,
> safesysrealloc -> Perl_safesysrealloc -> realloc is called. The
> interesting point is: how much memory does it request? The answer is:
>
> newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */
>
> I.e. it requests 10 times as much memory as is required for the
> current append operation. So when I loop 10000 times and each time
> append 100 chars to an initial string size of 10 million, the memory
> grows from 10.000e6 to 10.001e6 to 10.002e6 and so on 1000 times till
> it ends at 11.000e6. I can sort of confirm this to be true if I look
> at the memory graph in Process Explorer: it grows smoothly (no
> discernible steps), becoming incrementally slower towards the end
> (because the amount of memory that needs to be copied for each realloc
> increases).
>
> Growing memory in such tiny increments is what I consider bad
> practice.

Possibly; I don't know what the rationale behind that choice was.
Certainly Perl seems to expect whatever malloc it's using to be smart
about pre-allocating extra memory and using that to satisfy reallocs.

> By the way: I estimate the time required for each realloc to be around
> 3 ms for 10e6 chars, growing linearly with the amount of data -- I
> consider that a fair speed and no reason to blame win32.

If you timed perl's own realloc, you would (I believe) find it does much
better than this. AFAICS from the code, it has a fixed set of block
sizes it actually allocates. Enlarging a block such that it doesn't go
over the block size actually allocated is *free*, not even linear in the
size of the block, since all it does is adjust the end marker. This is
the logic you are expecting sv_grow to implement, but perl has decided
that this is the allocator's responsibility.

> What happens in Cygwin? The stack-sampling profiler is of little help
> because it easily misses infrequent events. I would expect that
> Perl_sv_grow is called just as often as in Strawberry Perl. The
> difference is that safesysrealloc does not call Perl_safesysrealloc ->
> realloc, it calls Perl_realloc.

Right, so your Cygwin perl is built with -Dusemymalloc.

> And Perl_realloc (in malloc.c) seems
> to have it's own logic (something with '<<' and 'LOG' and 'pow' which
> I did not try to fully understand) to determine what amount of memory
> it finally allocates. When I add some sleep() to the string append
> process, I can see how memory grows in Process Explorer: There are 5
> steps (probably corresponding to 5 calls to Perl_realloc

No, I think not. I think Perl_realloc gets called just as often as
realloc got called with Strawberry, it's just that most of the time
realloc can return the new block without having to call sbrk (or
whatever Cygwin uses instead) and without having to do any copying.

> ) of growing
> size when I start with 0.1e6 chars and then grow to 1.1e6 chars. When
> I start with 10e6 chars and grow to 11e6 chars, there is just 1 step
> in memory size. This looks like a sensible memory growth strategy to
> me. It explains why Cygwin is several 100 times faster than Strawberry
> Perl. It also explains why I observed during my experiments that
> Cygwin Perl consistently needs more memory than Strawberry Perl -- but
> that's a small price to pay for such a dramatic speedup.

OK, I just ran your benchmark with the following perls:

5.8.8-vanilla i386-freebsd, default build options
5.8.8-malloc i386-freebsd, -Dusemymalloc

(chosen solely because this is the only matched pair of mymalloc/not
perls I have lying around) and got these results:

~/src/perl% runperl 5.8.8-vanilla realloc
1E5 chars + 1E4 x 1E2 chars: 420.3 ms
1E6 chars + 1E4 x 1E2 chars: 1043.1 ms
1E7 chars + 1E4 x 1E2 chars: 7159.0 ms

1E7 chars + 1E5 x 1E1 chars: 7590.6 ms
1E7 chars + 1E4 x 1E2 chars: 7148.9 ms
1E7 chars + 1E3 x 1E3 chars: 7158.1 ms
1E7 chars + 1E2 x 1E4 chars: 2948.6 ms
1E7 chars + 1E1 x 1E5 chars: 326.1 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 5.1 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 15.4 ms
~/src/perl% runperl 5.8.8-malloc realloc
1E5 chars + 1E4 x 1E2 chars: 18.6 ms
1E6 chars + 1E4 x 1E2 chars: 18.5 ms
1E7 chars + 1E4 x 1E2 chars: 45.3 ms

1E7 chars + 1E5 x 1E1 chars: 86.1 ms
1E7 chars + 1E4 x 1E2 chars: 7.4 ms
1E7 chars + 1E3 x 1E3 chars: 6.5 ms
1E7 chars + 1E2 x 1E4 chars: 3.1 ms
1E7 chars + 1E1 x 1E5 chars: 3.4 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 39.6 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 8.7 ms
~/src/perl%

So the difference you are seeing is precisely the difference between
using the system malloc and using perl's. (FreeBSD's malloc, unlike
Win32's, has a reputation for being rather efficient, so this lets
Microsoft off the hook.)

Would you be able to repeat these tests with 5.12.0: that is, build
(under Cygwin, if you don't have access to a Unix system) matched perls
configured with and without -Dusemymalloc, and run the test on both?
I'll try and do the same here, but I can't promise I'll have time. If
the slowdown still exists in 5.12, I think you have a good case for a
bug report. I'm not sure how possible it would be to fix, but it would
clearly be a big win under some circumstances to be able to build Win32
perl with perl's malloc.

Ben

From: Wolfram Humann on
On Jul 27, 4:13 pm, Ben Morrow <b...(a)morrow.me.uk> wrote:
> Would you be able to repeat these tests with 5.12.0: that is, build
> (under Cygwin, if you don't have access to a Unix system) matched perls
> configured with and without -Dusemymalloc, and run the test on both?
> I'll try and do the same here, but I can't promise I'll have time. If
> the slowdown still exists in 5.12, I think you have a good case for a
> bug report. I'm not sure how possible it would be to fix, but it would
> clearly be a big win under some circumstances to be able to build Win32
> perl with perl's malloc.

I do have a linux machine and I did comile my own perl there so I
think a could redo that (possibly easier than recompiling Perl for
Cygwin). The strange thing is that on Cygwin perl -V says
'usemymalloc=y' while the one on Linux says 'usemymalloc=n'. And on
Linux my bechmark runs everything under 12 ms. Are you certain
changing usemymalloc would have much effect there?


What I would much more *like* to try is recompile a perl (e.g.
strawberry perl) on win32 and replace

newlen += 10 * (newlen - SvCUR(sv));

with something like

newlen += 10 * (newlen - SvCUR(sv)) + 0.5 * SvCUR(sv);

(with the factor reasonably somewhere between 0.2 and 1)
but a quick attempt to follow http://perldoc.perl.org/perlwin32.html
was not successful :(

Wolfram
From: Ben Morrow on

Quoth Wolfram Humann <w.c.humann(a)arcor.de>:
> On Jul 27, 4:13�pm, Ben Morrow <b...(a)morrow.me.uk> wrote:
> > Would you be able to repeat these tests with 5.12.0: that is, build
> > (under Cygwin, if you don't have access to a Unix system) matched perls
> > configured with and without -Dusemymalloc, and run the test on both?
> > I'll try and do the same here, but I can't promise I'll have time. If
> > the slowdown still exists in 5.12, I think you have a good case for a
> > bug report. I'm not sure how possible it would be to fix, but it would
> > clearly be a big win under some circumstances to be able to build Win32
> > perl with perl's malloc.
>
> I do have a linux machine and I did comile my own perl there so I
> think a could redo that (possibly easier than recompiling Perl for
> Cygwin). The strange thing is that on Cygwin perl -V says
> 'usemymalloc=y' while the one on Linux says 'usemymalloc=n'. And on
> Linux my bechmark runs everything under 12 ms. Are you certain
> changing usemymalloc would have much effect there?

No. It's possible that glibc's malloc already behaves the way perl is
expecting it to, so using perl's malloc doesn't change the performance
much.

Given that we know Win32's malloc behaves badly, one thing to try would
be building Win32 perls without USE_IMP_SYS, but with and without
PERL_MALLOC. I will try to repeat the FreeBSD tests with 5.12, since
that seems to show the symptoms.

> What I would much more *like* to try is recompile a perl (e.g.
> strawberry perl) on win32 and replace
>
> newlen += 10 * (newlen - SvCUR(sv));
>
> with something like
>
> newlen += 10 * (newlen - SvCUR(sv)) + 0.5 * SvCUR(sv);
>
> (with the factor reasonably somewhere between 0.2 and 1)
> but a quick attempt to follow http://perldoc.perl.org/perlwin32.html
> was not successful :(

Last time I built perl on Win32 I started by installing Strawberry and
putting c:\strawberry\c\bin in %PATH%, and setting INCLUDE and LIB to
c:\strawberry\c\include and \lib respectively. That gives you a
known-good toolchain to start with. It's best to make sure you don't
have anything unnecessary in %PATH%; in particular, you mustn't have
some other copy of perl. Also remember that your build directory must
not have any spaces in its name.

Ben

From: jl_post on
On Jul 22, 7:02 am, Wolfram Humann <w.c.hum...(a)arcor.de> wrote:
> I have a program that processes PDF files by converting them to
> Postscript, read the ps and do something with it. I use pdftops (from
> xpdf) for the pdf->ps conversion and retrieve the result like this:
>
> $ps_text = qx( pdftops $infile - );
>
> On win32 using strawberry perl (tried 5.10 and 5.12) this takes much
> more time than I expected so I did a test and first converted the PDF
> to Postscript, then read the Postscript (about 12 MB) like this (cat
> on win32 provided by cygwin):
>
> perl -E" $t = qx(cat psfile.ps); say length $t "
>
> This takes about 16 seconds on win32 but only <1 seconds on Linux.


Dear Wolfram,

I've encountered a similar problem on Strawberry Perl before.

I'm curious: Could you try "pre-allocating" the needed space to
$ps_text (or $t) before you set it? For example, try this:

perl -E "$t = ' ' x (-s 'psfile.ps'); $t = qx(cat psfile.ps); say
length $t"

See if that helps. I've found that setting my variable to the
target length BEFORE it's set to the proper string can reduce time
significantly (when it is eventually being set to its target value).
I'm not sure why this is so, but I can guess that it's because it can
avoid the time-consuming process of "growing" the string a little at a
time.

I hope this helps,

-- Jean-Luc
From: Peter J. Holzer on
On 2010-07-27 17:07, Ben Morrow <ben(a)morrow.me.uk> wrote:
>
> Quoth Wolfram Humann <w.c.humann(a)arcor.de>:
>> On Jul 27, 4:13�pm, Ben Morrow <b...(a)morrow.me.uk> wrote:
>> > Would you be able to repeat these tests with 5.12.0: that is, build
>> > (under Cygwin, if you don't have access to a Unix system) matched perls
>> > configured with and without -Dusemymalloc, and run the test on both?
>> > I'll try and do the same here, but I can't promise I'll have time. If
>> > the slowdown still exists in 5.12, I think you have a good case for a
>> > bug report. I'm not sure how possible it would be to fix, but it would
>> > clearly be a big win under some circumstances to be able to build Win32
>> > perl with perl's malloc.
>>
>> I do have a linux machine and I did comile my own perl there so I
>> think a could redo that (possibly easier than recompiling Perl for
>> Cygwin). The strange thing is that on Cygwin perl -V says
>> 'usemymalloc=y' while the one on Linux says 'usemymalloc=n'. And on
>> Linux my bechmark runs everything under 12 ms. Are you certain
>> changing usemymalloc would have much effect there?
>
> No. It's possible that glibc's malloc already behaves the way perl is
> expecting it to, so using perl's malloc doesn't change the performance
> much.

I'm pretty sure that GNU malloc doesn't round up to powers of two or
something like that. However, the performance difference between GNU
malloc and Perl malloc is rather small:

perl 5.12.1, default config, EGLIBC 2.11.2-2:

1E5 chars + 1E4 x 1E2 chars: 3.9 ms
1E6 chars + 1E4 x 1E2 chars: 3.8 ms
1E7 chars + 1E4 x 1E2 chars: 4.4 ms

1E7 chars + 1E5 x 1E1 chars: 28.4 ms
1E7 chars + 1E4 x 1E2 chars: 4.5 ms
1E7 chars + 1E3 x 1E3 chars: 2.6 ms
1E7 chars + 1E2 x 1E4 chars: 2.0 ms
1E7 chars + 1E1 x 1E5 chars: 1.9 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 2.0 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 4.4 ms

perl 5.12.1, usemymalloc=y, EGLIBC 2.11.2-2:

1E5 chars + 1E4 x 1E2 chars: 2.6 ms
1E6 chars + 1E4 x 1E2 chars: 3.8 ms
1E7 chars + 1E4 x 1E2 chars: 2.5 ms

1E7 chars + 1E5 x 1E1 chars: 18.8 ms
1E7 chars + 1E4 x 1E2 chars: 2.5 ms
1E7 chars + 1E3 x 1E3 chars: 0.9 ms
1E7 chars + 1E2 x 1E4 chars: 0.9 ms
1E7 chars + 1E1 x 1E5 chars: 1.1 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.9 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 3.4 ms

That may be accidental, though: An strace output for GNU malloc) shows
that very few reallocations actually result in a different address -
mostly the allocated area can be grown because there is nothing after
it. This is even true if two strings grow in parallel - each time one of
the strings moves it leaves a hole which the other string can grow into,
so in practice this works like a binary backoff. I guess there are
allocation patterns which spoil this effect (e.g. if you allocate lots
of small objects while growing large strings) but I haven't tried to
find them.

hp