From: Steven D'Aprano on
On Mon, 29 Mar 2010 07:40:54 -0700, Patrick Maupin wrote:

> On Mar 28, 9:45 pm, Steven D'Aprano
> <ste...(a)REMOVE.THIS.cybersource.com.au> wrote:
>> And what about tuples? And subclasses of list/tuples? How many
>> different types need to be optimized?
>
> One of the beautiful things about Python is that, for most things, there
> are few surprises for even new users. "There should be one obvious way
> to do it" for the user means that, sometimes, under the hood, there are
> a lot of special cases for the implementers.

It never ceases to amaze me how often people simply don't understand this.

"There should be one obvious way to do it" is the opposite of "NO obvious
way", not of "many ways which may or may not be obvious". The complete
quote from the Zen makes that clear:

There should be one-- and preferably ONLY one --obvious way to do it.
[Emphasis added]

And don't forget the next line:

Although that way may not be obvious at first unless you're Dutch.

Python is under no compulsion to make "the obvious way" obvious to anyone
except Guido. It's a bonus if it happens to be obvious to newbies, not a
requirement.

And besides, what is "it" you're talking about?

* Adding integers, decimals or fractions, or floats with a low
requirement for precision and accuracy? Use sum.

* Adding floats with a high requirement for precision and accuracy?
Use math.fsum.

* Concatenating strings? Use ''.join.

* Joining lists? Use [].extend.

* Iterating over an arbitrary sequence of arbitrary sequences?
Use itertools.chain.

That's five different "its", and five obvious ways to do them.
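Sketched in code (my own quick illustrations, not canonical; math.fsum needs Python 2.6+):

```python
import math
import itertools

# adding numbers with ordinary precision: builtin sum
assert sum([1, 2, 3, 4]) == 10

# adding floats with high accuracy: math.fsum
# (builtin sum([0.1] * 10) gives 0.9999999999999999)
assert math.fsum([0.1] * 10) == 1.0

# concatenating strings: str.join
assert ''.join(['a', 'b', 'c']) == 'abc'

# joining lists: extend in a loop
result = []
for sub in [[1, 2], [3, 4]]:
    result.extend(sub)
assert result == [1, 2, 3, 4]

# iterating over a sequence of sequences: itertools.chain
assert list(itertools.chain([1, 2], (3, 4))) == [1, 2, 3, 4]
```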


>> In practical terms, does anyone actually ever use sum on more than a
>> handful of lists? I don't believe this is more than a hypothetical
>> problem.
>
> Right now, it's probably not, because when somebody sums a large list
> and gets thwacked on the head by the lack of efficiency, they then come
here and get thwacked because "everybody knows" they should use
> itertools or something else; not sum().

Exactly. If you're adding a large number of large lists, you're already
doing it wrong. Making sum more efficient is just a band-aid.


>> The primary use case for sum is adding numbers when floating point
>> accuracy is not critical. If you need float accuracy, use math.fsum.
>
> See, I think the very existence of math.fsum() already violates "there
> should be one obvious way to do it."

How does the existence of math.fsum contradict the existence of sum?




--
Steven
From: Steven D'Aprano on
On Mon, 29 Mar 2010 08:12:00 -0700, Steve Howell wrote:

> On Mar 29, 7:40 am, Patrick Maupin <pmau...(a)gmail.com> wrote:
>> On Mar 28, 9:45 pm, Steven D'Aprano
>>
>> <ste...(a)REMOVE.THIS.cybersource.com.au> wrote:
>> > And what about tuples? And subclasses of list/tuples? How many
>> > different types need to be optimized?
>>
>> One of the beautiful things about Python is that, for most things,
>> there are few surprises for even new users.  "There should be one
>> obvious way to do it" for the user means that, sometimes, under the
>> hood, there are a lot of special cases for the implementers.
>>
>>
> If nothing else, I think it's reasonable for users to expect symmetry.

Why? What is symmetry in programming?

Since the + operator takes both numbers and lists, and the - operator
doesn't, does "symmetry" require that we make up some list operation so
that integers and lists are symmetrical?


> If you can use "+" to concatenate lists, then it seems reasonable that
> something spelled "sum" would concatenate lists as well, and in
> reasonable time.

Where do you get the "reasonable time" from? A *single* + on lists can be
slooooow. Try this:

L = [None]*10**9
result = L+L

(assuming you've even got enough memory to do so)

With a very few exceptions (e.g. dict lookup being "usually" O(1), list
append being amortized O(1)), Python makes no promises about performance.
It's not part of the language. If you, the programmer, are making any
assumptions about performance that aren't clearly and explicitly
documented in the official docs, then YOU are at fault, not Python.


>> > In practical terms, does anyone actually ever use sum on more than a
>> > handful of lists? I don't believe this is more than a hypothetical
>> > problem.
>>
>> Right now, it's probably not, because when somebody sums a large list
>> and gets thwacked on the head by the lack of efficiency, they then come
> here and get thwacked because "everybody knows" they should use
>> itertools or something else; not sum().
>>
>>
> Indeed. It would be nice if the docs for sum() at least pointed to
> list(itertools.chain(*list_of_lists)), or whatever the most kosher
> alternative is supposed to be.
>
> It only takes a handful of sublists, about ten on my box, to expose the
> limitation of the Shlemiel-the-Painter O(M*N*N) algorithm that's under
> the hood. It only takes 200 sublists to start getting a 10x degradation
> in performance.

And how many times have you needed to add 200 sublists?

How many times have you needed to add 10 sublists?

Nobody has demonstrated that this is anything more than a hypothetical
problem.


>> > The primary use case for sum is adding numbers when floating point
>> > accuracy is not critical. If you need float accuracy, use math.fsum.
>>
>> See, I think the very existence of math.fsum() already violates "there
>> should be one obvious way to do it."
>>
>>
> The nice thing about math.fsum() is that it is at least documented from
> sum(), although I suspect some users try sum() without even consulting
> the docs.
>
> You could appease all users with an API where the most obvious choice,
> sum(), never behaves badly, and where users can still call more
> specialized versions (math.fsum() and friends) directly if they know
> what they are doing. This goes back to the statement that Patrick
> makes--under the hood, this means more special cases for implementers,
> but fewer pitfalls for users.

More special cases for implementers means more bugs in the language,
which means I end up writing my own code and ignoring the built-in
version anyway.

More special cases means I have to pay the cost of high accuracy float
summation even when I don't need, or want, it.

More special cases means I'm fooled into paying the cost of summing lists
when I don't need to, because it's easier than importing itertools:

for item in sum(lots_of_lists, []):
    pass

needlessly creates a large list out of the smaller ones. Even if I don't
fall for the temptation, and write bad code, I still pay the cost in the
libraries and applications written by others.


More special cases isn't free. It's MORE costly than teaching users to
use list.extend or itertools.chain.


--
Steven
From: Paul Rubin on
Steven D'Aprano <steve(a)REMOVE-THIS-cybersource.com.au> writes:
> * Iterating over an arbitrary sequence of arbitrary sequences?
> Use itertools.chain.

chain is only for finite sequences. For arbitrary sequences,
use chain.from_iterable.
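(A sketch of the distinction: chain(*seqs) must unpack every sub-sequence up front, while chain.from_iterable consumes the outer sequence lazily, so it also works when that sequence is unbounded.)

```python
import itertools

def numbered_pairs():
    # an unbounded sequence of sequences
    n = 0
    while True:
        yield [n, n + 1]
        n += 2

# chain.from_iterable handles the unbounded outer sequence lazily
flat = itertools.chain.from_iterable(numbered_pairs())
first_six = list(itertools.islice(flat, 6))
assert first_six == [0, 1, 2, 3, 4, 5]

# itertools.chain(*numbered_pairs()) would never return:
# the * unpacking tries to exhaust the infinite generator first.
```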
From: Steve Howell on
On Mar 29, 4:38 pm, Steven D'Aprano <st...(a)REMOVE-THIS-
cybersource.com.au> wrote:
> On Mon, 29 Mar 2010 08:12:00 -0700, Steve Howell wrote:
> > On Mar 29, 7:40 am, Patrick Maupin <pmau...(a)gmail.com> wrote:
> >> On Mar 28, 9:45 pm, Steven D'Aprano
>
> >> <ste...(a)REMOVE.THIS.cybersource.com.au> wrote:
> >> > And what about tuples? And subclasses of list/tuples? How many
> >> > different types need to be optimized?
>
> >> One of the beautiful things about Python is that, for most things,
> >> there are few surprises for even new users.  "There should be one
> >> obvious way to do it" for the user means that, sometimes, under the
> >> hood, there are a lot of special cases for the implementers.
>
> > If nothing else, I think it's reasonable for users to expect symmetry.
>
> Why? What is symmetry in programming?

"Symmetry" is best shown by example.

>>> 3 - 2
1
>>> set([1,2,3]) - set([2,3])
set([1])

>>> 5 * 3
15
>>> [1, 2, 3, 4, 5] * 3
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

>>> 1 + 2 + 3 + 4
10
>>> sum([1,2,3,4])
10

>>> [1,2] + [3,4]
[1, 2, 3, 4]
>>> sum([[1,2], [3,4]], [])
[1, 2, 3, 4]

> Since the + operator takes both numbers and lists, and the - operator
> doesn't, does "symmetry" require that we make up some list operation so
> that integers and lists are symmetrical?
>

No. Nobody is advocating for list subtraction.

> > If you can use "+" to concatenate lists, then it seems reasonable that
> > something spelled "sum" would concatenate lists as well, and in
> > reasonable time.
>
> Where do you get the "reasonable time" from?
>

From common sense. Concatenation of lists should be in O(M*N) time,
not O(M*N*N), because there is no need to build intermediate lists.
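The difference is easy to see in a rough timing sketch (illustrative only; absolute numbers will vary by machine, so I show none):

```python
import timeit

def sum_concat(lists):
    # builtin sum: each + builds a brand-new intermediate list,
    # so total work is O(M*N*N) for N sublists of length M
    return sum(lists, [])

def extend_concat(lists):
    # extend appends in place, so total work is O(M*N)
    result = []
    for sub in lists:
        result.extend(sub)
    return result

lists = [[0] * 10 for _ in range(200)]
assert sum_concat(lists) == extend_concat(lists)

# timeit.timeit(lambda: sum_concat(lists), number=100) grows roughly
# quadratically as the number of sublists grows; the extend version
# grows linearly.
```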

>
> With a very few exceptions (e.g. dict lookup being "usually" O(1), list
> append being amortized O(1)), Python makes no promises about performance.
> It's not part of the language. If you, the programmer, are making any
> assumptions about performance that aren't clearly and explicitly
> documented in the official docs, then YOU are at fault, not Python.

I'm not making any assumptions here. I know that sum() is needlessly
slow for lists, even though it is not explicitly documented.

>
> > It only takes a handful of sublists, about ten on my box, to expose the
> > limitation of the Shlemiel-the-Painter O(M*N*N) algorithm that's under
> > the hood.  It only takes 200 sublists to start getting a 10x degradation
> > in performance.
>
> And how many times have you needed to add 200 sublists?
>
> How many times have you needed to add 10 sublists?
>

Aggregating lists is a very common task in programming. An example of
a list of lists would be a list of departments, where each department
is a list of employees. If you want to sort the employees, you will
want to aggregate to a list.
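(For instance, with made-up department data:)

```python
from itertools import chain

# hypothetical data: each department is a list of employee names
departments = [
    ['carol', 'alice'],
    ['bob'],
    ['eve', 'dave'],
]

# aggregate once, then sort the flat list
employees = sorted(chain.from_iterable(departments))
assert employees == ['alice', 'bob', 'carol', 'dave', 'eve']
```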

> More special cases for implementers means more bugs in the language,
> which means I end up writing my own code and ignoring the built-in
> version anyway.
>

Special cases do not have to introduce bugs in the language.

> More special cases means I have to pay the cost of high accuracy float
> summation even when I don't need, or want, it.
>

Nothing about making sum() work for the general cases precludes more
specific, optimized functions.

> More special cases means I'm fooled into paying the cost of summing lists
> when I don't need to, because it's easier than importing itertools:

You don't have to be fooled.

> for item in sum(lots_of_lists, []):
>     pass
>
> needlessly creates a large list out of the smaller ones.

Although this is a mostly harmless example, developers with common
sense would not sum lots of lists unless they expected to keep the
resulting list around for multiple operations, or for one operation,
like sort(), where you need to create a list for the subsequent
operation.

> Even if I don't
> fall for the temptation, and write bad code, I still pay the cost in the
> libraries and applications written by others.

You already pay the small price for polymorphism when you use Python.

> More special cases isn't free.

Nobody said they were.

> It's MORE costly than teaching users to
> use list.extend or itertools.chain.
>

Which the docs for sum() don't do.

From: Patrick Maupin on
On Mar 29, 6:19 pm, Steven D'Aprano <st...(a)REMOVE-THIS-
cybersource.com.au> wrote:
> How does the existence of math.fsum contradict the existence of sum?

You're exceptionally good at (probably deliberately) mis-interpreting
what people write.

Regards,
Pat