From: Steve Howell on
On Mar 28, 8:17 am, Duncan Booth <duncan.bo...(a)invalid.invalid> wrote:
> Steve Howell <showel...(a)yahoo.com> wrote:
> > The mildly surprising part of sum() is that it does add vs. add-in-
> > place, which leads to O(N) vs. O(1) for the inner loop calls, for
> > certain data structures, notably lists, even though none of the
> > intermediate results get used by the caller.  For lists, you could
> > make a more efficient variant of sum() that clones the start value and
> > does add-in-place.
>
> Doing add-in-place isn't the only way to make sum more efficient: if you
> assume that addition is associative (which of course the builtin sum can't)
> then you can form partial sums. e.g. instead of calculating:
>
>   (((((((a + b) + c) + d) + e) + f) + g) + h)
>
> you calculate:
>
>   (((a + b) + (c + d)) + ((e + f) + (g + h)))
>
> Obviously this requires more space than the naive sum, but not as much as
> you might at first expect: you only need to hold log(n) intermediate
> values at any time.
>

Yep, I did mention in my post that the outer loop does not *have* to
be O(N), if you can parallelize it. Apart from reducing
intermediates, the divide-and-conquer method does not reduce overall
computation time unless you have multiple processors, correct?

> > I could guess pretty confidently that the reason this optimization was
> > never tried is that sum() has always been intended to be used on
> > numerics, since other alternatives exist for strings (join), lists
> > (chain), and hand-coded data classes that support add-in-place (roll-
> > your-own loop).
>
> Doing it this way helps summing lists or strings (though not as much as
> str.join), but it also helps if you need to sum a long list of similarly
> sized floats as you'll get a more accurate answer.
>

Interesting! That makes sense. The docs for math.fsum() suggest that
partial sums are used to maintain precision.

http://docs.python.org/library/math.html#math.fsum
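
A quick sanity check of that, using the standard repeated-0.1 example
(just a sketch, not taken from the fsum docs):

import math

values = [0.1] * 10
print sum(values) == 1.0        # False: rounding error accumulates term by term
print math.fsum(values) == 1.0  # True: fsum tracks exact partial sums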

> See http://groups.google.com/group/comp.lang.python/browse_thread/thread/....
> faf92f532e/027cef7d4429aa3a
> for an earlier discussion of this, or just Google comp.lang.python for
> 'pairwise sum'.
>
> Here's the code I posted in that thread:
>
> def sumpairs(seq):
>     tmp = []
>     for i,v in enumerate(seq):
>         if i&1:
>             tmp[-1] = tmp[-1] + v
>             i = i + 1
>             n = i & -i
>             while n > 2:
>                 t = tmp.pop(-1)
>                 tmp[-1] = tmp[-1] + t
>                 n >>= 1
>         else:
>             tmp.append(v)
>     while len(tmp) > 1:
>         t = tmp.pop(-1)
>         tmp[-1] = tmp[-1] + t
>     return tmp[0]
>
> and I claimed that my Python coded sumpairs function was faster than the
> builtin sum on a list of lists once you had more than about 210 items.
> I never did get round to rewriting it in C for a more realistic speed
> comparison: summing integers my Python version is about 60 times slower
> than the builtin.
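
For anyone who wants to check that crossover on their own machine,
something like this should do it (assuming sumpairs from the quoted
post is pasted into the same script; the sizes and repeat counts are
arbitrary):

import timeit

data = [[n] for n in range(500)]  # 500 one-element lists, past the ~210 mark

t_builtin = timeit.timeit(lambda: sum(data, []), number=100)
t_pairs = timeit.timeit(lambda: sumpairs(data), number=100)
print t_builtin, t_pairs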

From: Steve Howell on
On Mar 28, 7:57 am, Steven D'Aprano <st...(a)REMOVE-THIS-
cybersource.com.au> wrote:
> On Sun, 28 Mar 2010 07:16:10 -0700, Steve Howell wrote:
> > The mildly surprising part of sum() is that it does add vs. add-in-
> > place, which leads to O(N) vs. O(1) for the inner loop calls, for
> > certain data structures, notably lists, even though none of the
> > intermediate results get used by the caller.  For lists, you could make
> > a more efficient variant of sum() that clones the start value and does
> > add-in-place.
>
> I have no doubt that you could make a version of sum for lists which is
> more efficient than the existing one. After all, join more or less does
> the same sort of thing, and it's very efficient. But don't think that add-
> in-place is necessarily cheap. List appends are amortized O(1) each; if
> you are adding M lists of N items each, that gives you O(M*N).
>

O(M*N) is still cheaper than the roughly O(M*M*N) you pay when every
step builds a brand new intermediate list.

> It's possible to improve the performance a tad if you can make multiple
> appends in roughly constant time, which is what list.extend (probably?)
> does, but only up to a point. Lists are over-allocated, but if you try to
> add more items than there is room for, you need to make the list bigger,
> and that means reallocating memory, which could easily be O(N**2) or
> worse, depending on how good your OS's memory management is. Under Linux,
> at least by default, malloc will never fail, but there's no guarantee how
> long it will take to return. If the OOM killer has to start shutting down
> other applications, and paging more and more memory to disk, eventually
> malloc will return (or the system will core dump), but it could take a
> while...
>

Even though extend() obviously has to do memory allocations of its own
along the way, it is still more efficient than the alternative. No need
to speculate; you can measure these methods on your platform:

M = 10
N = 1000

def in_place(
    start = [],
    sublists = ([[None] * M]) * N
):
    accum = start[:]
    for sublist in sublists:
        accum.extend(sublist)
    return accum

def with_intermediates(
    start = [],
    sublists = ([[None] * M]) * N
):
    accum = start
    for sublist in sublists:
        accum = accum + sublist
    return accum
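
One way to time them (a rough sketch; the 'from __main__ import ...'
setup assumes the two functions above are defined in the script you run
directly):

import timeit

t1 = timeit.timeit('in_place()', 'from __main__ import in_place', number=10)
t2 = timeit.timeit('with_intermediates()',
                   'from __main__ import with_intermediates', number=10)
print M, N, t1, t2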
From: Patrick Maupin on
On Mar 28, 10:17 am, Duncan Booth <duncan.bo...(a)invalid.invalid>
wrote:
> Doing add-in-place isn't the only way to make sum more efficient: if you
> assume that addition is associative (which of course the builtin sum can't)
> then you can form partial sums. e.g. instead of calculating:
....
>
> Doing it this way helps summing lists or strings (though not as much as
> str.join), but it also helps if you need to sum a long list of similarly
> sized floats as you'll get a more accurate answer.

Also, partial sums would be a clear winner over add-in-place if
someone were dumb^H^H^H^Hnaive enough to use sum() on a long list of
tuples :-)
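
(A rough illustration of that, borrowing Duncan's sumpairs from earlier
in the thread; the sizes are picked out of a hat:)

import timeit

tuples = [(n,) for n in range(2000)]

t_builtin = timeit.timeit(lambda: sum(tuples, ()), number=10)
t_pairs = timeit.timeit(lambda: sumpairs(tuples), number=10)
# the builtin copies O(N**2) tuple elements overall; pairwise is O(N log N)
print t_builtin, t_pairs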
From: Alf P. Steinbach on
* Steve Howell:
> On Mar 28, 8:17 am, Duncan Booth <duncan.bo...(a)invalid.invalid> wrote:
>> Steve Howell <showel...(a)yahoo.com> wrote:
>>> The mildly surprising part of sum() is that it does add vs. add-in-
>>> place, which leads to O(N) vs. O(1) for the inner loop calls, for
>>> certain data structures, notably lists, even though none of the
>>> intermediate results get used by the caller. For lists, you could
>>> make a more efficient variant of sum() that clones the start value and
>>> does add-in-place.
>> Doing add-in-place isn't the only way to make sum more efficient: if you
>> assume that addition is associative (which of course the builtin sum can't)
>> then you can form partial sums. e.g. instead of calculating:
>>
>> (((((((a + b) + c) + d) + e) + f) + g) + h)
>>
>> you calculate:
>>
>> (((a + b) + (c + d)) + ((e + f) + (g + h)))
>>
>> Obviously this requires more space than the naive sum, but not as much as
>> you might at first expect: you only need to hold log(n) intermediate
>> values at any time.
>>
>
> Yep, I did mention in my post that the outer loop does not *have* to
> be O(N), if you can parallelize it. Apart from reducing
> intermediates, the divide-and-conquer method does not reduce overall
> computation time unless you have multiple processors, correct?
>

With a single thread of execution it can reduce the time for large integers and
custom numerical types from O(n^2) to O(n log n), where n is the number of bits
in the final result.

I don't think the parallelism here is very relevant for Python.

From a more practical point of view, the sum efficiency could be improved by
doing the first addition using '+' and the rest using '+=', without changing the
behavior.
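
Something like this, I suppose (my own sketch of the idea, the name is
made up; since the first '+' already builds a fresh object, the later
'+=' never touches the caller's start value):

def sum2(iterable, start=0):
    it = iter(iterable)
    try:
        first = next(it)
    except StopIteration:
        return start           # empty input: same result as the builtin
    total = start + first      # one copying '+'...
    for item in it:
        total += item          # ...then '+=' can work in place where supported
    return total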


Cheers & hth.,

- Alf
From: Steve Howell on
On Mar 28, 9:56 am, "Alf P. Steinbach" <al...(a)start.no> wrote:
> * Steve Howell:
>> (talking about summing a list of lists)
>
>  From a more practical point of view, the sum efficiency could be improved by
> doing the first addition using '+' and the rest using '+=', without changing the
> behavior.

I like that approach, because you don't have to assume that the start
object is clonable, although perhaps you need to make that assumption
anyway for the case where the list is empty.

Here is code that might shed light on Alf's suggested approach:

import timeit
M = 10
N = 8000

def in_place(
    start = [],
    sublists = ([[None] * M]) * N
):
    # only macro-optimized
    i = 0
    for sublist in sublists:
        if i == 0:
            accum = start + sublist
            i += 1
        else:
            accum.extend(sublist)
    if i == 0:
        return 'whatever' # semantics here?
    return accum

def with_intermediates(
    start = [],
    sublists = ([[None] * M]) * N
):
    accum = start
    for sublist in sublists:
        accum = accum + sublist
    return accum

t1 = timeit.timeit('in_place()', 'from __main__ import in_place',
                   number=10)
t2 = timeit.timeit('with_intermediates()',
                   'from __main__ import with_intermediates', number=10)
print M, N, t1, t2, int(t2 / t1)

For N=1000, you get a significant speedup (x84):

10 1000 0.00102496147156 0.0867249965668 84

More data:

10 1000 0.00102496147156 0.0867249965668 84
10 2000 0.00211381912231 0.172616004944 81
10 4000 0.00491786003113 0.561800956726 114
10 8000 0.0706541538239 19.9019489288 281
10 16000 0.0161759853363 9.64171791077 596
10 32000 0.0351569652557 43.98346591 1251