From: Zooko O'Whielacronx
Folks:

Every couple of years I run into a problem where some Python code that
worked well at small scales starts burning up my CPU at larger scales,
and the underlying issue turns out to be the idiom of accumulating
data by string concatenation. It just happened again
(http://foolscap.lothar.com/trac/ticket/149), and as usual it is hard
to make the data accumulator efficient without introducing a bunch of
bugs into the surrounding code. So this time around I decided to try
encapsulating my preferred more efficient idiom into a reusable class.
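For concreteness, here is a minimal sketch of the idiom I mean (the names are hypothetical, not foolscap's actual code): every append copies the whole accumulated buffer and every trim copies the remainder, so total work grows quadratically with the amount of data.

```python
class NaiveBuffer:
    """Sketch of the string-concatenation idiom (hypothetical name).

    Every += copies all bytes buffered so far, and every prefix
    consumption copies the remainder, so total work is O(n**2).
    """
    def __init__(self):
        self.buf = b""

    def append(self, chunk):
        self.buf += chunk  # copies all of self.buf each time

    def popleft(self, n):
        # Copies the remainder every time a prefix is consumed.
        prefix, self.buf = self.buf[:n], self.buf[n:]
        return prefix
```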

So I present to you StringChain, which is an efficient way to
accumulate and process data in many chunks:

http://tahoe-lafs.org/trac/stringchain

Here are some benchmarks generated by running python -OOu -c 'from
stringchain.bench import bench; bench.quick_bench()' as instructed by
the README.txt file.

The N: in the left-hand column is how many bytes were in the test
dataset. The ave rate: number in the right-hand column is how many
bytes per second were processed. "naive" means the string-based idiom
sketched above and "strch" means using the StringChain class.

_buildup init_naive
N:  65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 890, ave rate: 58350579
N: 131072, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 265, ave rate: 34800398
N: 262144, time: best: 0.01, 2th-best: 0.01, ave: 0.01, 2th-worst: 0.01, worst: 0.01 (of 5), reps/s: 79, ave rate: 20745346
N: 524288, time: best: 0.05, 2th-best: 0.05, ave: 0.05, 2th-worst: 0.05, worst: 0.05 (of 5), reps/s: 20, ave rate: 10823850
_buildup init_strch
N:  65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 25543, ave rate: 1674043282
N: 131072, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 14179, ave rate: 1858538925
N: 262144, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 8016, ave rate: 2101513050
N: 524288, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 4108, ave rate: 2154215572
_consume init_naive_loaded
N:  65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 931, ave rate: 61037862
N: 131072, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 270, ave rate: 35454393
N: 262144, time: best: 0.01, 2th-best: 0.01, ave: 0.01, 2th-worst: 0.01, worst: 0.01 (of 5), reps/s: 74, ave rate: 19471963
N: 524288, time: best: 0.05, 2th-best: 0.05, ave: 0.05, 2th-worst: 0.05, worst: 0.06 (of 5), reps/s: 19, ave rate: 10146747
_consume init_strch_loaded
N:  65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 4309, ave rate: 282447500
N: 131072, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 2313, ave rate: 303263357
N: 262144, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 1186, ave rate: 311159052
N: 524288, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 606, ave rate: 317814669
_randomy init_naive
N:  65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 479, ave rate: 31450561
N: 131072, time: best: 0.01, 2th-best: 0.01, ave: 0.01, 2th-worst: 0.01, worst: 0.01 (of 5), reps/s: 140, ave rate: 18461191
N: 262144, time: best: 0.02, 2th-best: 0.02, ave: 0.02, 2th-worst: 0.03, worst: 0.03 (of 5), reps/s: 42, ave rate: 11127714
N: 524288, time: best: 0.06, 2th-best: 0.07, ave: 0.08, 2th-worst: 0.08, worst: 0.09 (of 5), reps/s: 13, ave rate: 6906341
_randomy init_strch
N:  65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 973, ave rate: 63827127
N: 131072, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 495, ave rate: 64970669
N: 262144, time: best: 0.00, 2th-best: 0.00, ave: 0.00, 2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 239, ave rate: 62913360
N: 524288, time: best: 0.01, 2th-best: 0.01, ave: 0.01, 2th-worst: 0.01, worst: 0.01 (of 5), reps/s: 121, ave rate: 63811569

The naive approach is slower than the StringChain class, and the
bigger the dataset, the slower it goes. The StringChain class is fast,
and it also scales (with regard to these benchmarks, at least...).

Thanks!

Regards,

Zooko
From: Steven D'Aprano
On Fri, 12 Mar 2010 00:11:37 -0700, Zooko O'Whielacronx wrote:

> Folks:
>
> Every couple of years I run into a problem where some Python code that
> worked well at small scales starts burning up my CPU at larger scales,
> and the underlying issue turns out to be the idiom of accumulating data
> by string concatenation.

I don't mean to discourage you, but the simple way to avoid that is not
to accumulate data by string concatenation.

The usual Python idiom is to append substrings to a list, then once, at
the very end, combine into a single string:


accumulator = []
for item in sequence:
    accumulator.append(process(item))
string = ''.join(accumulator)


> It just happened again
> (http://foolscap.lothar.com/trac/ticket/149), and as usual it is hard
> to make the data accumulator efficient without introducing a bunch of
> bugs into the surrounding code.

I'm sorry, I don't agree about that at all. I've never come across a
situation where I wanted to use string concatenation and couldn't easily
modify it to use the list idiom above.

[...]
> Here are some benchmarks generated by running python -OOu -c 'from
> stringchain.bench import bench; bench.quick_bench()' as instructed by
> the README.txt file.

To be taken seriously, I think you need to compare stringchain to the
list idiom. If your benchmarks compare favourably to that, then it
might be worthwhile.



--
Steven
From: Zooko O'Whielacronx
Folks:

I failed to make something sufficiently clear in my original message
about StringChain. The use case that I am talking about is not simply
that you need to accumulate a sequence of incoming chunks of data,
concatenate them together, and then process the entire result. If that
is all you need to do then indeed you can accumulate the incoming
strings in a list and then run ''.join(thelist) when you have received
all of them and are ready to process them.

But the use case that I am talking about is where you need to
accumulate new incoming strings into your buffer while alternately
processing leading prefixes of the buffer. The typical example is that
sequences of bytes are arriving on your TCP socket and you are
processing the stream incrementally. You need to process the leading
prefixes of the stream as soon as you can (e.g. as soon as a line
followed by a newline has accumulated, or as soon as another complete
element in your data format has accumulated, etc.); you can't simply
wait until the bytes stop coming and then concatenate the entire
collection together at once.

This is exactly the current situation in foolscap [1] which is causing
a performance problem in Tahoe-LAFS [2].

To illustrate what I mean I cooked up some benchmarks showing the task
of "accumulate a bunch of things then consume them in a single gulp"
versus the task of "alternate between accumulating and consuming"
(with accumulation going faster than consumption until the input
stream runs dry).

I implemented a few data structures for benchmarking: StringChain; the
naive "accumulatorstr += newstr" approach (named "Stringy" in the
benchmarks); one based on cStringIO (named "StringIOy"); and one that
accumulates the new strings into a list and does
''.join(accumulatorlist) whenever a leading prefix needs to be
consumed (named "SimplerStringChain").
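To show why the list-plus-join approach fails to scale under the alternating load, here is my guess at the shape of "SimplerStringChain" (a sketch, not the benchmarked code): appends are cheap, but every prefix consumption re-joins the entire buffer, touching every buffered byte again.

```python
class SimplerStringChainSketch:
    """Guess at the "SimplerStringChain" shape (not the benchmarked code):
    O(1) appends, but each consume joins the whole buffer, so alternating
    append/consume degenerates to quadratic total work."""
    def __init__(self):
        self.chunks = []

    def append(self, s):
        self.chunks.append(s)  # cheap: no copying yet

    def popleft(self, n):
        whole = b"".join(self.chunks)  # touches every buffered byte
        prefix, rest = whole[:n], whole[n:]
        self.chunks = [rest] if rest else []
        return prefix
```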

Below are the abbreviated results of the benchmark. You can easily run
this benchmark yourself by following the README.txt in the StringChain
source package [3]. These results show that for the load imposed by
this benchmark StringChain is the only one of the four that scales and
that it is also nice and fast.

The left-hand column is the total number of bytes that were run
through each data structure. The results are shown in nanoseconds per
byte, in exponential notation ("e+01" means times 10, "e+02" means
times 100, "e+03" means times 1000, and so on). Each result is the
best of 10 tries, or of 5 tries for the tasks that were taking too
long to run 10 times.

Regards,

Zooko

[1] http://foolscap.lothar.com/trac/ticket/149
[2] http://allmydata.org/pipermail/tahoe-dev/2010-March/004181.html
[3] http://tahoe-lafs.org/trac/stringchain

impl: StringChain
task: _accumulate_then_one_gulp
10000 best: 2.694e+00
50000 best: 2.742e+00
100000 best: 2.310e+00
500000 best: 2.040e+00
1000000 best: 1.988e+00
5000000 best: 2.193e+00

task: _alternate_str
10000 best: 6.509e+00
50000 best: 4.559e+00
100000 best: 4.308e+00
500000 best: 4.070e+00
1000000 best: 3.991e+00
5000000 best: 4.000e+00

impl: SimplerStringChain
task: _accumulate_then_one_gulp
10000 best: 1.407e+00
50000 best: 2.317e+00
100000 best: 2.012e+00
500000 best: 1.902e+00
1000000 best: 1.897e+00
5000000 best: 2.104e+00

task: _alternate_str
10000 best: 4.888e+00
50000 best: 5.198e+00
100000 best: 1.750e+01
500000 best: 6.233e+01
1000000 best: 1.134e+02
5000000 best: 7.599e+02

impl: StringIOy
task: _accumulate_then_one_gulp
10000 best: 4.196e+00
50000 best: 5.522e+00
100000 best: 4.499e+00
500000 best: 3.756e+00
1000000 best: 4.176e+00
5000000 best: 5.414e+00

task: _alternate_str
10000 best: 5.484e+00
50000 best: 7.863e+00
100000 best: 2.126e+01
500000 best: 6.972e+01
1000000 best: 1.219e+02
5000000 best: 9.463e+02

impl: Stringy
task: _accumulate_then_one_gulp
10000 best: 1.502e+00
50000 best: 1.420e+01
100000 best: 2.245e+01
500000 best: 8.577e+01
1000000 best: 2.295e+02
5000000 best: 1.326e+03

task: _alternate_str
10000 best: 3.290e+00
50000 best: 4.220e+00
100000 best: 1.665e+01
500000 best: 6.281e+01
1000000 best: 1.127e+02
5000000 best: 7.626e+02
From: Steven D'Aprano
On Sun, 21 Mar 2010 23:09:46 -0600, Zooko O'Whielacronx wrote:

> But the use case that I am talking about is where you need to accumulate
> new incoming strings into your buffer while alternately processing
> leading prefixes of the buffer.
[...]
> Below are the abbreviated results of the benchmark. You can easily run
> this benchmark yourself by following the README.txt in the StringChain
> source package [3]. These results show that for the load imposed by this
> benchmark StringChain is the only one of the four that scales and that
> it is also nice and fast.

I was reading this email, and thinking "Why do you need this StringChain
data structure, from the description it sounds like a job for
collections.deque?"

And funnily enough, following the links you supplied I came across this:

"You can get the package from http://pypi.python.org/pypi/stringchain or
with darcs get http://tahoe-lafs.org/source/stringchain/trunk. It has
unit tests. It is in pure Python (it uses collections.deque and string)."

http://tahoe-lafs.org/trac/stringchain


Perhaps you should have said that it was a wrapper around deque giving
richer functionality, rather than giving the impression that it was a
brand new data structure invented by you. People are naturally going to
be more skeptical about a newly invented data structure than one based on
a reliable component like deque.

In any case, it looks to me that the StringChain data structure itself is
a little too application specific for me to be personally interested in
it. Maybe you should consider linking to it on PyPI and seeing if there
is any interest from others?

Regards,

Steven
From: Zooko O'Whielacronx
On Mon, Mar 22, 2010 at 2:07 AM, Steven D'Aprano
<steven(a)remove.this.cybersource.com.au> wrote:
>
> Perhaps you should have said that it was a wrapper around deque giving
> richer functionality, rather than giving the impression that it was a
> brand new data structure invented by you. People are naturally going to
> be more skeptical about a newly invented data structure than one based on
> a reliable, component like deque.

The fact that StringChain uses deque to hold the queue of strings
isn't that important. I just benchmarked it with a list swapped in for
the deque: using the list costs about one third of a nanosecond more
per byte at the scales that the benchmark exercises (namely 10,000,000
bytes in about 10,000 strings). A third of a nanosecond per byte is
about 4% of the runtime.

I also implemented and benchmarked a simpler deque-based scheme which
puts all the actual bytes from the strings into a deque with
self.d.extend(newstr). As you would expect, this shows good asymptotic
performance, but the constants are relatively bad, so at all of the
loads actually measured it was three orders of magnitude worse than
both StringChain and StringChain-with-a-list-instead-of-a-deque.
Moral: the constants matter!
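For reference, my guess at the shape of that byte-granular scheme (a hypothetical reconstruction, not the benchmarked Dequey code): the asymptotics are fine, but every byte becomes its own deque cell, so the per-byte constant is enormous compared to moving whole strings around.

```python
from collections import deque

class DequeySketch:
    """Guess at the byte-per-element scheme: O(1) amortized per byte on
    both ends, but each byte occupies its own deque cell, so the
    constant factor dwarfs approaches that shuffle whole strings."""
    def __init__(self):
        self.d = deque()

    def append(self, s):
        self.d.extend(s)  # one deque element per byte

    def popleft(self, n):
        # Rebuild the prefix one byte at a time.
        return bytes(self.d.popleft() for _ in range(n))
```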

Those benchmarks are appended. You can run the benchmarks yourself per
the README.txt.

But anyway, I take your point and I updated the StringChain README to
explain that it is a pretty simple data structure that holds a list
(actually a deque) of strings and isn't anything too clever or novel.

By the way, we could further micro-optimize this kind of thing if
''.join() would accept either strings or buffers instead of requiring
strings:

>>> ''.join([buffer("abc"), "def"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 0: expected string, buffer found

Then whenever StringChain needs to slice a string into leading and
trailing parts, it could construct a buffer() viewing each part
instead of making a copy of each part.
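For what it's worth, Python 3 grants this wish: bytes.join accepts any bytes-like object, including memoryview (the successor to buffer), so the leading and trailing parts can be viewed without copying. A minimal sketch of the idea (not StringChain's actual code):

```python
# memoryview slices are zero-copy views of the underlying bytes,
# and Python 3's bytes.join accepts them alongside plain bytes.
data = b"abcdef"
head = memoryview(data)[:3]  # views b"abc" without copying
tail = memoryview(data)[3:]  # views b"def" without copying
joined = b"".join([head, b"XYZ", tail])
```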


> it. Maybe you should consider linking to it on PyPI and seeing if there
> is any interest from others?

http://pypi.python.org/pypi/stringchain

Regards,

Zooko

impl: StringChain
task: _accumulate_then_one_gulp
10000 best: 5.698e+00, 3th-best: 7.486e+00, mean: 7.758e+00,
100000 best: 4.640e+00, 3th-best: 4.690e+00, mean: 7.674e+00,
1000000 best: 3.789e+00, 3th-best: 3.806e+00, mean: 3.995e+00,
10000000 best: 4.095e+00, 1th-best: 4.095e+00, mean: 4.095e+00,
task: _alternate_str
10000 best: 1.378e+01, 3th-best: 1.390e+01, mean: 1.500e+01,
100000 best: 9.198e+00, 3th-best: 9.248e+00, mean: 9.385e+00,
1000000 best: 8.715e+00, 3th-best: 8.755e+00, mean: 8.808e+00,
10000000 best: 8.738e+00, 1th-best: 8.738e+00, mean: 8.738e+00,
impl: StringChainWithList
task: _accumulate_then_one_gulp
10000 best: 3.600e+00, 3th-best: 3.695e+00, mean: 4.129e+00,
100000 best: 4.070e+00, 3th-best: 4.079e+00, mean: 4.162e+00,
1000000 best: 3.662e+00, 3th-best: 3.688e+00, mean: 3.721e+00,
10000000 best: 3.902e+00, 1th-best: 3.902e+00, mean: 3.902e+00, 1th-worst: 3.902e+00, worst: 3.902e+00 (of 1)
task: _alternate_str
10000 best: 1.369e+01, 3th-best: 1.380e+01, mean: 1.442e+01,
100000 best: 9.251e+00, 3th-best: 9.289e+00, mean: 9.416e+00,
1000000 best: 8.809e+00, 3th-best: 8.876e+00, mean: 8.943e+00,
10000000 best: 9.095e+00, 1th-best: 9.095e+00, mean: 9.095e+00,
impl: Dequey
task: _accumulate_then_one_gulp
10000 best: 2.772e+02, 3th-best: 2.785e+02, mean: 2.911e+02,
100000 best: 2.314e+02, 3th-best: 2.334e+02, mean: 2.422e+02,
1000000 best: 2.282e+02, 3th-best: 2.288e+02, mean: 2.370e+02,
10000000 best: 2.587e+02, 1th-best: 2.587e+02, mean: 2.587e+02,
task: _alternate_str
10000 best: 1.576e+03, 3th-best: 1.580e+03, mean: 1.608e+03,
100000 best: 1.301e+03, 3th-best: 1.303e+03, mean: 1.306e+03,
1000000 best: 1.275e+03, 3th-best: 1.276e+03, mean: 1.278e+03,
10000000 best: 1.280e+03, 1th-best: 1.280e+03, mean: 1.280e+03,