From: Arnaud Delobelle on
Dave Angel <davea(a)ieee.org> writes:

> Arnaud Delobelle wrote:
>> Steve Howell <showell30(a)yahoo.com> writes:
>>
>>
>>> On Jan 22, 12:14 pm, Chris Rebert <c...(a)rebertia.com> wrote:
>>> <snip>
>> I made the comment you quoted. In CPython, it is O(n) to delete/insert
>> an element at the start of the list - I know it because I looked at the
>> implementation a while ago. This is why collections.deque exists I
>> guess. I don't know how they are implemented but insertion/deletion at
>> either end are O(1) and so is random access. So they are the data
>> structure that you want.
>>
>>
> Not according to the 2.6 docs.
>
> Indexed access is O(1) at both ends but slows to O(n) in the
> middle. For fast random access, use lists instead.

Yes you are correct. This will teach me (again!) to check my facts.

>
> That sounds to me like a doubly-linked list implementation.

I've just looked and it is a doubly-linked list of 'blocks' of size
BLOCKLEN, which is 62 in the source I have (I guess it's 62 so that the
whole block structure is exactly 64 pointers wide: 62 slots plus 1 for
each link). So a small deque will have near-constant random access, in
a way.
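For what it's worth, here is a quick timing sketch of the difference at
the left end (the variable names and sizes are my own choices, and the
absolute numbers are machine-dependent; only the ratio matters):

```python
# Compare removing from the left end of a list (O(n): every remaining
# element is shifted down) with a deque's popleft (O(1): pop from the
# first block of the doubly-linked chain of blocks).
from collections import deque
from timeit import timeit

n = 100_000
list_time = timeit('xs.pop(0)',
                   setup='xs = list(range(%d))' % n,
                   number=1000)
deque_time = timeit('xs.popleft()',
                    setup='from collections import deque; '
                          'xs = deque(range(%d))' % n,
                    number=1000)
print('list.pop(0):   %f' % list_time)
print('deque.popleft: %f' % deque_time)
```

On any CPython I've tried, the deque wins by a couple of orders of
magnitude at this size.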

--
Arnaud
From: Christian Heimes on
Steve Howell wrote:
> Another benchmark is that deques are slower than lists for accessing
> elements.

deques are optimized for accessing, inserting and removing data at
both ends. For anything else they are slower than the list type. This
fact was explained in this very thread yesterday.
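A rough sketch of that trade-off (again, my own sizes; timings vary by
machine): indexing the middle of a deque is O(n), since it has to walk
its chain of blocks, while a list is O(1) anywhere.

```python
# Indexing the middle element: O(1) for a list (pointer arithmetic into
# one contiguous array), O(n) for a deque (walk the block chain).
from collections import deque
from timeit import timeit

n = 100_000
stmt = 'xs[%d]' % (n // 2)
list_time = timeit(stmt,
                   setup='xs = list(range(%d))' % n,
                   number=10_000)
deque_time = timeit(stmt,
                    setup='from collections import deque; '
                          'xs = deque(range(%d))' % n,
                    number=10_000)
print('list  middle index: %f' % list_time)
print('deque middle index: %f' % deque_time)
```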

Christian
From: Roy Smith on
In article <hje979$kc9$1(a)news.eternal-september.org>,
"Alf P. Steinbach" <alfps(a)start.no> wrote:

> But it would IMHO have been better if it wasn't called "list", which brings
> in the wrong associations for someone used to other languages.

+1.

When I first started using Python (back in the 1.4 days), I assumed a list
was a singly-linked list. Which, of course, leads to the assumption that
pop(0) is O(1), and lots of other wrong thinking(*).

Oddly enough, I was going to write in the above paragraph, "like a C++ STL
list", until I happened to glance at the STL docs and refreshed my memory
that an STL list is doubly-linked. Which just goes to show that making
assumptions based on names is a bad idea.

So, we're right back to my statement earlier in this thread that the docs
are deficient in that they describe behavior with no hint about cost.
Given that, it should be no surprise that users make incorrect assumptions
about cost.

(*) I suspect somebody is going to point out that list.pop was added in
some version later than 1.4, but that's a detail.
From: Steven D'Aprano on
On Sat, 23 Jan 2010 09:57:04 -0500, Roy Smith wrote:

> In article <hje979$kc9$1(a)news.eternal-september.org>,
> "Alf P. Steinbach" <alfps(a)start.no> wrote:
>
>> But it would IMHO have been better if it wasn't called "list", which
>> brings in the wrong associations for someone used to other languages.
>
> +1.
>
> When I first started using Python (back in the 1.4 days), I assumed a
> list was a singly-linked list.

Why would you do that? I can think of at least eight different
implementations of the abstract list data structure:

constant-size array
variable-size array
variable-size array with amortised O(1) appends
singly-linked list
singly-linked list with CDR coding
doubly-linked list
skip list
indexable skip list

One can reasonably disregard constant-sized arrays as a possibility,
given that Python lists aren't fixed size (pity the poor Pascal and
Fortran coders who are stuck with static arrays!), but the rest are all
reasonable possibilities. Why assume one specific implementation in the
absence of documentation promising certain performance characteristics?


> Oddly enough, I was going to write in the above paragraph, "like a C++
> STL list", until I happened to glance at the STL docs and refreshed my
> memory that an STL list is doubly-linked. Which just goes to show that
> making assumptions based on names is a bad idea.

Exactly :)



> So, we're right back to my statement earlier in this thread that the
> docs are deficient in that they describe behavior with no hint about
> cost. Given that, it should be no surprise that users make incorrect
> assumptions about cost.

There are quite a few problems with having the documentation specify cost:

(1) Who is going to do it? Any volunteers?

(2) Big-oh notation can be misleading, especially for naive users, or
those whose intuition for what's fast has been shaped by other languages.
Big-oh doesn't tell you whether something is fast or slow, only how it
scales -- and sometimes not even then.

(3) Documenting a particular performance discourages implementation
changes. Any would-be patch or new implementation not only has to
ensure that the functional behaviour doesn't change, but that the
performance doesn't either.

In practice the Python developers are unlikely to make an implementation
change which leads to radically worse performance, particularly for
critical types like list and dict. But in other cases, they might choose
to change big-oh behaviour, and not wish to be tied down by documentation
of the cost of operations.

(4) How much detail is necessary? What about degenerate cases? E.g. dict
lookup in CPython is typically O(1) amortised, but if all the keys hash
to the same value, it falls to O(N).
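That degenerate case is easy to provoke; here's a sketch (the class
name is my own invention) where every key hashes to the same value:

```python
# Degenerate dict behaviour: if every key collides, each lookup falls
# back to comparing against the other entries one by one -- O(N).
class BadHash:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 42  # every instance lands in the same hash bucket

    def __eq__(self, other):
        return isinstance(other, BadHash) and self.value == other.value

d = {BadHash(i): i for i in range(1000)}
# Still works, but each lookup now scans colliding entries linearly.
print(d[BadHash(999)])  # → 999
```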

(5) Should the language guarantee such degenerate behaviour? Who decides
which costs are guaranteed and which are not?

(6) Such performance guarantees should be implementation specific, not
language specific. CPython is only one implementation of the language out
of many.



--
Steven
From: Alf P. Steinbach on
* Steven D'Aprano:
> On Sat, 23 Jan 2010 09:57:04 -0500, Roy Smith wrote:
>
>> In article <hje979$kc9$1(a)news.eternal-september.org>,
>> "Alf P. Steinbach" <alfps(a)start.no> wrote:
>>
>>> But it would IMHO have been better if it wasn't called "list", which
>>> brings in the wrong associations for someone used to other languages.
>> +1.
>>
>> When I first started using Python (back in the 1.4 days), I assumed a
>> list was a singly-linked list.
>
> Why would you do that? I can think of at least eight different
> implementations of the abstract list data structure:
>
> constant-size array
> variable-size array
> variable-size array with amortised O(1) appends
> singly-linked list
> singly-linked list with CDR coding
> doubly-linked list
> skip list
> indexable skip list
>
> One can reasonably disregard constant-sized arrays as a possibility,
> given that Python lists aren't fixed size (pity the poor Pascal and
> Fortran coders who are stuck with static arrays!), but the rest are all
> reasonable possibilities.

A linked list implementation would yield O(n) indexing. A great many
loops in e.g. Python library code that now run in linear time would
then become quadratic, O(n^2). Those libraries would be effectively
unusable without extensive rewriting: one version for ordinary Python
and one for 'list-as-list' Pythons...

Thus, the linked list implementations are IMO *not* reasonable.

And the reason is precisely the implied complexity guarantees, especially on
indexing -- which could reasonably be O(log n), but not worse than that.
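To make the blow-up concrete, here is a toy singly-linked list of my
own (nothing from the stdlib) showing why an idiomatic indexing loop
turns quadratic:

```python
# A minimal singly-linked list. Indexing must walk from the head, so
# __getitem__(i) costs O(i) -- and a loop over all indices costs
# 0 + 1 + ... + (n-1) link hops in total: O(n^2).
class Node:
    __slots__ = ('value', 'next')

    def __init__(self, value, next=None):
        self.value = value
        self.next = next

class LinkedList:
    def __init__(self, iterable=()):
        self.head = None
        for value in reversed(list(iterable)):
            self.head = Node(value, self.head)

    def __getitem__(self, index):  # O(index): walk from the head
        node = self.head
        for _ in range(index):
            node = node.next
        return node.value

xs = LinkedList(range(5))
# Looks linear, but does 0+1+2+3+4 link hops: quadratic overall.
total = sum(xs[i] for i in range(5))
print(total)  # → 10
```

The same loop over a CPython list does n constant-time index
operations, which is why so much existing code silently depends on
O(1) indexing.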


> Why assume one specific implementation in the
> absence of documentation promising certain performance characteristics?
>
>
>> Oddly enough, I was going to write in the above paragraph, "like a C++
>> STL list", until I happened to glance at the STL docs and refreshed my
>> memory that an STL list is doubly-linked. Which just goes to show that
>> making assumptions based on names is a bad idea.
>
> Exactly :)
>
>
>
>> So, we're right back to my statement earlier in this thread that the
>> docs are deficient in that they describe behavior with no hint about
>> cost. Given that, it should be no surprise that users make incorrect
>> assumptions about cost.
>
> There are quite a few problems with having the documentation specify cost:
>
> (1) Who is going to do it? Any volunteers?

This problem must have been addressed each time the documentation for
some version of Python was written or updated.


> (2) Big-oh notation can be misleading, especially for naive users, or
> those whose intuition for what's fast has been shaped by other languages.
> Big-oh doesn't tell you whether something is fast or slow, only how it
> scales -- and sometimes not even then.

It's how things scale that is of interest. :-)

Big-oh notation gives you an asymptotic upper bound.

That's sufficient for e.g. the C++ standard -- which, by the way, constitutes
a concrete example of the practicality of specifying complexity.


> (3) Having documented a particular performance, that discourages
> implementation changes. Any would-be patch or new implementation not only
> has to consider that the functional behaviour doesn't change, but that
> the performance doesn't either.
>
> In practice the Python developers are unlikely to make an implementation
> change which leads to radically worse performance, particularly for
> critical types like list and dict. But in other cases, they might choose
> to change big-oh behaviour, and not wish to be tied down by documentation
> of the cost of operations.

Say there was a documented O(log n) worst-case complexity for 'list'
indexing. Above you have described it as "reasonable" to break that by
having O(n) complexity... But in light of my comments on that, and
especially thinking a bit about maintenance of two (or more!) versions
of various libraries, don't you agree that it would be Just Bad(TM)?


> (4) How much detail is necessary? What about degenerate cases? E.g. dict
> lookup in CPython is typically O(1) amortised, but if all the keys hash
> to the same value, it falls to O(N).

From N1745, the Technical Report 1 on C++ library extensions (which
will become part of the C++0x standard), table 21, specifying general
requirements of unordered associative containers:

expression: b.find(k)
return type: iterator;
assertion: Returns an iterator pointing to an element with key equivalent
to k, or b.end() if no such element exists.
complexity: Average case O(1), worst case O(b.size()).


> (5) Should the language guarantee such degenerate behaviour? Who decides
> which costs are guaranteed and which are not?

I think the C++ standard (the latest draft of C++0x is freely available
as a PDF from the committee pages) provides good guidance in this
regard. :-)


> (6) Such performance guarantees should be implementation specific, not
> language specific. CPython is only one implementation of the language out
> of many.

Disagree Very Strongly. An implementation may offer stricter
guarantees. But what matters for e.g. avoiding having to maintain two
or three or umpteen versions of a library is the set of language-level
complexity guarantees.


Cheers,

- Alf