sum for sequences? [Python]

Prev: Advice needed on parallel processing in python
Next: coding contest

From: Patrick Maupin on 29 Mar 2010 22:31

On Mar 29, 6:38 pm, Steven D'Aprano <st...(a)REMOVE-THIS-
cybersource.com.au> wrote:
> With a very few exceptions (e.g. dict lookup being "usually" O(1), list
> append being amortized O(1)), Python makes no promises about performance.
> It's not part of the language. If you, the programmer, are making any
> assumptions about performance that aren't clearly and explicitly
> documented in the official docs, then YOU are at fault, not Python.

It's not about promises, guarantees, quid-pro-quo, etc.

It's about a lack of surprises. Which, 99% of the time, Python excels
at. This is why many of us program in Python. This is why some of us
who would never use sum() on lists, EVEN IF IT WERE FIXED TO NOT BE SO
OBNOXIOUSLY SLOW, advocate that it, in fact, be fixed to not be so
obnoxiously slow.

BTW, it's also not about "fault". There is no shame in writing a
Python program, seeing that it doesn't go fast enough, and then
hammering on it until it does. There is also no shame in not reading
the docs before you write the program, although arguably (and you
obviously work very hard to help see to this) a great deal of shame
attaches to posting to the newsgroup before reading the docs.

Regards,
Pat

From: Steve Howell on 29 Mar 2010 22:53

On Mar 29, 4:19 pm, Steven D'Aprano <st...(a)REMOVE-THIS-
cybersource.com.au> wrote:
> On Mon, 29 Mar 2010 07:40:54 -0700, Patrick Maupin wrote:
> > On Mar 28, 9:45 pm, Steven D'Aprano
> > <ste...(a)REMOVE.THIS.cybersource.com.au> wrote:
> >> And what about tuples? And subclasses of list/tuples? How many
> >> different types need to be optimized?
>
> > One of the beautiful things about Python is that, for most things, there
> > are few surprises for even new users. "There should be one obvious way
> > to do it" for the user means that, sometimes, under the hood, there are
> > a lot of special cases for the implementers.
>
> It never ceases to amaze me how often people simply don't understand this..
>
> "There should be one obvious way to do it" is the opposite of "NO obvious
> way", not of "many ways which may or may not be obvious". The complete
> quote from the Zen makes that clear:
>
> There should be one-- and preferably ONLY one --obvious way to do it.
> [Emphasis added]
>
> And don't forget the next line:
>
> Although that way may not be obvious at first unless you're Dutch.
>
> Python is under no compulsion to make "the obvious way" obvious to anyone
> except Guido. It's a bonus if it happens to be obvious to newbies, not a
> requirement.
>
> And besides, what is "it" you're talking about?
>
> * Adding integers, decimals or fractions, or floats with a low
> requirement for precision and accuracy? Use sum.
>
> * Adding floats with a high requirement for precision and accuracy?
> Use math.fsum.
>
> * Concatenating strings? Use ''.join.
>
> * Joining lists? Use [].extend.
>
> * Iterating over an arbitrary sequence of arbitrary sequences?
> Use itertools.chain.
>
> That's five different "its", and five obvious ways to do them.
>

Let's go through them...

> * Adding integers, decimals or fractions, or floats with a low
> requirement for precision and accuracy? Use sum.
>

Pretty obvious.

>>> sum([1, 2, 3])
6

> * Adding floats with a high requirement for precision and accuracy?
> Use math.fsum.
>

Mostly obvious.

>>> fsum([1.234534665989, 2.987, 3])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'fsum' is not defined
>>> import math
>>> math.fsum([1.234534665989, 2.987, 3])
7.2215346659890001

> * Concatenating strings? Use ''.join.

Common pitfall:

>>> ['abc', 'def', 'ghi'].join()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'join'
>>> ''.join(['abc', 'def', 'ghi'])
'abcdefghi'

> * Joining lists? Use [].extend.

Obvious, yes. Convenient? Not really.

>>> start = []
>>> for list in [[1, 2], [3, 4]]:
... start.extend(list)
...
>>> start
[1, 2, 3, 4]

> * Iterating over an arbitrary sequence of arbitrary sequences?
> Use itertools.chain.

>>> group1 = ['al', 'bob']
>>> group2 = ['cal']
>>> groups = [group1, group2]

Obvious if you are Dutch...

>>> itertools.chain(groups)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'itertools' is not defined
>>> import itertools
>>> itertools.chain(groups)
<itertools.chain object at 0x9a9658c>
>>> len(itertools.chain(groups))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: object of type 'itertools.chain' has no len()
>>> len(list(itertools.chain(groups)))
2
>>> len(list(itertools.chain(*groups)))
3

So you have sum, fsum, join, extend, and chain.

Sum is builtin, but you have to import fsum from math and chain from
itertools.

Join is actually a method on strings, not sequences.

Chain can take multiple lists, or you can use the splat operator on a
list of lists.

Extend is actually a method on lists, and it only takes one list, not
multiple ones.

>>> [].extend(*groups)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: extend() takes exactly one argument (2 given)

Just commit all that to memory, and enjoy the productivity of using a
high level language! ;)

From: Steven D'Aprano on 29 Mar 2010 23:01

On Mon, 29 Mar 2010 19:05:30 -0700, Steve Howell wrote:

>> > If nothing else, I think it's reasonably for users to expect
>> > symmetry.
>>
>> Why? What is symmetry in programming?
>
> "Symmetry" is best shown by example.
>
> >>> 3 - 2
> 1
> >>> set([1,2,3]) - set([2,3])
> set([1])

That's a useless example.

>>> 42 - 10
32
>>> set([42]) - set([10])
set([42])

You don't define symmetry. You don't even give a sensible example of
symmetry. Consequently I reject your argument that because sum is the
obvious way to sum a lot of integers, "symmetry" implies that it should
be the obvious way to concatenate a lot of lists.

>>> 1+2
3
>>> [1] + [2]
[1, 2]
>>> set([1]) + set([2])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'set' and 'set'

>> > If you can use "+" to concatentate lists, then it seems reasonable
>> > that something spelled "sum" would concatenate lists as well, and in
>> > reasonable time.
>>
>> Where do you get the "reasonable time" from?
>>
>>
> From common sense. Concatenation of lists should be in O(M*N) time, not
> O(M*N*N), because there is no need to build intermediate lists.

You are correct that building intermediate lists isn't *compulsory*,
there are alternatives, but the alternatives themselves have costs.
Complexity itself is a cost. sum currently has nice simple semantics,
which means you can reason about it: sum(sequence, start) is the same as

total = start
for item in sequence:
total = total + start
return total

You don't have to care what the items in sequence are, you don't have to
make assumptions about what methods sequence and start have (beyond
supporting iteration and addition). Adding special cases to sum means it
becomes more complex and harder to reason about. If you pass some other
sequence type in the middle of a bunch of lists, what will happen? Will
sum suddenly break, or perhaps continue to work but inefficiently?

You still need to ask these questions with existing sum, but it is
comparatively easy to answer them: you only need to consider how the
alternative behaves when added to a list. You don't have to think about
the technicalities of the sum algorithm itself -- sometimes it calls +,
sometimes extend, sometimes +=, sometimes something else... which of the
various different optimized branches will I fall into this time? Who
knows? sum already has two branches. In my opinion, three branches is one
too many.

[...]
>> > It only takes a handful of sublists, about ten on my box, to expose
>> > the limitation of the Shlemeil-the-Painter O(M*N*N) algorithm that's
>> > under the hood. It only takes 200 sublists to start getting a 10x
>> > degradation in performance.
>>
>> And how many times have you needed to add 200 sublists?
>>
>> How many times have you needed to add 10 sublists?
>>
>>
> Aggregating lists is a very common task in programming.

"Aggregating" lists? Not summing them? I think you've just undercut your
argument that sum is the "obvious" way of concatenating lists.

In natural language, we don't talk about "summing" lists, we talk about
joining, concatenating or aggregating them. You have just done it
yourself, and made my point for me. And this very thread started because
somebody wanted to know what the equivalent to sum for sequences.

If sum was the obvious way to concatenate sequences, this thread wouldn't
even exist.

> An example of a
> list of lists would be a list of departments, where each department is a
> list of employees. If you want to sort the employees, you will want to
> aggregate to a list.

I grant you that 10 departments is likely to be a borderline case, but if
you have 200 departments, then you will most likely be querying a
database and getting back a single list of employees. And if you're not,
you probably should be.

>> More special cases for implementers means more bugs in the language,
>> which means I end up writing my own code and ignoring the built-in
>> version anyway.
>>
>>
> Special cases do not have to introduce bugs in the language.

They don't *have* to, but since even Donald Knuth has written code with
bugs in it, it is a safe bet that the number of bugs in code written by
mere mortals will be roughly proportional to the complexity: more complex
means more bugs. You can go to the bank with that.

>> More special cases means I have to pay the cost of high accuracy float
>> summation even when I don't need, or want, it.
>>
>>
> Nothing about making sum() work for the general cases precludes more
> specific, optimized functions.

You have missed my point. If I call your version of sum which all the
special cases, I pay the overhead of the complexity regardless of whether
I need it or not. The existence of other functions which duplicate the
special cases in sum is irrelevant.

[...]
>> It's MORE costly than teaching users to use list.extend or
>> itertools.chain.
>>
>>
> Which the docs for sum() don't do.

Patches are always welcome.

--
Steven

From: Steven D'Aprano on 29 Mar 2010 23:02

On Mon, 29 Mar 2010 16:36:17 -0700, Paul Rubin wrote:

> Steven D'Aprano <steve(a)REMOVE-THIS-cybersource.com.au> writes:
>> * Iterating over an arbitrary sequence of arbitrary sequences?
>> Use itertools.chain.
>
> chain is only for finite sequences. For arbitrary sequences, use
> chain.from_iterable.

Correction noted. Obviously I'm not Dutch.

--
Steven

From: Steven D'Aprano on 29 Mar 2010 23:29

On Mon, 29 Mar 2010 19:24:42 -0700, Patrick Maupin wrote:

> On Mar 29, 6:19 pm, Steven D'Aprano <st...(a)REMOVE-THIS-
> cybersource.com.au> wrote:
>> How does the existence of math.fsum contradict the existence of sum?
>
> You're exceptionally good at (probably deliberately) mis-interpreting
> what people write.

I cannot read your mind, I can only interpret the words you choose to
write. You said

[quote]
See, I think the very existence of math.fsum() already violates "there
should be one obvious way to do it."
[end quote]

If sum satisfies the existence of one obvious way, how does math.fsum
violate it? sum exists, and is obvious, regardless of whatever other
solutions exist as well.

--
Steven

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10 11
Prev: Advice needed on parallel processing in python
Next: coding contest