From: Ben Morrow on

Quoth Ilya Zakharevich <nospam-abuse(a)ilyaz.org>:
> On 2010-07-26, Peter J. Holzer <hjp-usenet2(a)hjp.at> wrote:
> > On 2010-07-26 13:11, Ben Morrow <ben(a)morrow.me.uk> wrote:
> >> If the issue simply turns out to be 'Microsoft don't know how to write a
> >> decent malloc', there is very little p5p can do about it, of course. On
> >> most platforms perl can, and often does, use its own malloc
> >> implementation which is optimised for perl's use (lots of tiny
> >> allocations and deallocations all the time). This isn't possible on
> >> Win32 unless you make a custom build of perl that doesn't support the
> >> fork emulation.
>
> > Since the fork emulation works with Win32 malloc, I think it should be
> > possible to write a custom malloc based on Win32 malloc (or the
> > underlying API calls) which still works with the fork emulation but is
> > faster. But it's probably not easy or somebody would have done it
> > already (I don't pretend to understand either memory allocation or the
> > fork emulation on windows).
>
> "My" malloc() (one shipped with Perl) needs only 1 syscall implemented
> on a particular architecture: get some pages from the system. So the
> macro should be defined on the command line of $(CC) -c malloc.c.

There is an option in the Win32 makefile to do exactly that, with a note
that it doesn't work with USE_IMP_SYS. I believe this has something to
do with needing a separate malloc pool for each thread (pseudo-process),
but I don't know. I would have thought, on the face of it, that it
should be possible to simply replace malloc with mymalloc wherever it
ends up being called, but I don't pretend to understand anything in
perlhost.h.

Ben

From: Wolfram Humann on
I do understand how an unefficient malloc would explain the chunk-size
dependency in the code. What I do not understand is the dependency on
the initial size of the string that I append to. Does Perl re-allocate
the complete string (at least from time to time) when I keep appending
to it? Or could there be other effects involved besides malloc speed ?
(Ideally I would profile Strawberry Perl on the C level but I don't
know how to do that...)

Wolfram
From: Uri Guttman on
>>>>> "WH" == Wolfram Humann <w.c.humann(a)arcor.de> writes:

WH> I do understand how an unefficient malloc would explain the chunk-size
WH> dependency in the code. What I do not understand is the dependency on
WH> the initial size of the string that I append to. Does Perl re-allocate
WH> the complete string (at least from time to time) when I keep appending
WH> to it? Or could there be other effects involved besides malloc speed ?
WH> (Ideally I would profile Strawberry Perl on the C level but I don't
WH> know how to do that...)

perl does a realloc (or its equiv) on strings which grow too big. i
believe it does a doubling each time (same as for hash buckets). if so,
a large initial size will affect things as it will mean fewer malloc
calls as it grows. if malloc is the bottleneck, this will show up.

uri

--
Uri Guttman ------ uri(a)stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
From: Wolfram Humann on
On 27 Jul., 08:20, "Uri Guttman" <u...(a)StemSystems.com> wrote:
> >>>>> "WH" == Wolfram Humann <w.c.hum...(a)arcor.de> writes:
>
>   WH> I do understand how an unefficient malloc would explain the chunk-size
>   WH> dependency in the code. What I do not understand is the dependency on
>   WH> the initial size of the string that I append to. Does Perl re-allocate
>   WH> the complete string (at least from time to time) when I keep appending
>   WH> to it? Or could there be other effects involved besides malloc speed ?
>   WH> (Ideally I would profile Strawberry Perl on the C level but I don't
>   WH> know how to do that...)
>
> perl does a realloc (or its equiv) on strings which grow too big. i
> believe it does a doubling each time (same as for hash buckets). if so,
> a large initial size will affect things as it will mean fewer malloc
> calls as it grows. if malloc is the bottleneck, this will show up.

Unfortunately, that makes thing just more mysterious (at least for my
understanding):
> 1E5 chars + 1E4 x 1E2 chars: 94.4 ms
> 1E6 chars + 1E4 x 1E2 chars: 319.9 ms
> 1E7 chars + 1E4 x 1E2 chars: 2710.4 ms
While appending 1E6 chars (= 1E4 chunks of 1E2 chars each) to an
initial string size of 1E5 might require 3-4 reallocs, appending 1E6
chars to an initial string size of 1E7 chars involves at most 1
realloc. And (even though not measured in my sample code above) the
from-scratch en-bloc allocation of a 2E7 chars string is reasonably
fast in Strawberry Perl.

Wolfram
From: Wolfram Humann on
Alright, I did some profiling and code-reading and what I found is
something that I would consider a bug or at least fairly poor coding
practice in the core.
Opinions very welcome!

In Strawberry almost the entire time is spent in the following call
sequence:

Perl_sv_catpvn_flags -> Perl_sv_grow -> Perl_safesysrealloc -> realloc
(msvcrt)

Perl_sv_catpvn_flags (in sv.c) is documented as "Concatenates the
string onto the end of the string which is in the SV". That's what my
code does all the time. So far so good.
Perl_sv_catpvn_flags *always* calls SvGrow -> Perl_sv_grow (also in
sv.c). Perl_sv_grow then needs to decide if the string's memory is
already sufficient or really needs to grow. In the latter case,
safesysrealloc -> Perl_safesysrealloc -> realloc is called. The
interesting point is: how much memory does it request? The answer is:

newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */

I.e. it requests 10 times as much memory as is required for the
current append operation. So when I loop 10000 times and each time
append 100 chars to an initial string size of 10 million, the memory
grows from 10.000e6 to 10.001e6 to 10.002e6 and so on 1000 times till
it ends at 11.000e6. I can sort of confirm this to be true if I look
at the memory graph in Process Explorer: it grows smoothly (no
discernible steps), becoming incrementally slower towards the end
(because the amount of memory that needs to be copied for each realloc
increases).

Growing memory in such tiny increments is what I consider bad
practice.

By the way: I estimate the time required for each realloc to be around
3 ms for 10e6 chars, growing linearly with the amount of data -- I
consider that a fair speed and no reason to blame win32.

What happens in Cygwin? The stack-sampling profiler is of little help
because it easily misses infrequent events. I would expect that
Perl_sv_grow is called just as often as in Strawberry Perl. The
difference is that safesysrealloc does not call Perl_safesysrealloc ->
realloc, it calls Perl_realloc. And Perl_realloc (in malloc.c) seems
to have it's own logic (something with '<<' and 'LOG' and 'pow' which
I did not try to fully understand) to determine what amount of memory
it finally allocates. When I add some sleep() to the string append
process, I can see how memory grows in Process Explorer: There are 5
steps (probably corresponding to 5 calls to Perl_realloc) of growing
size when I start with 0.1e6 chars and then grow to 1.1e6 chars. When
I start with 10e6 chars and grow to 11e6 chars, there is just 1 step
in memory size. This looks like a sensible memory growth strategy to
me. It explains why Cygwin is several 100 times faster than Strawberry
Perl. It also explains why I observed during my experiments that
Cygwin Perl consistently needs more memory than Strawberry Perl -- but
that's a small price to pay for such a dramatic speedup.

Wolfram