From: Chris Rebert on
2010/4/22 Jo Chan <csjcg2(a)gmail.com>:
> Hi,friends.
>  I wanna ask if there is a function which is able to take a list as argument
> and then return its top-k maximums?
> I only know about max which is poorly a top-1 maximum function, now I want
> more yet I am lazy enough that don't want to write one by myself.

http://docs.python.org/library/heapq.html#heapq.nlargest

Cheers,
Chris
--
http://blog.rebertia.com
From: Bryan on
Steven D'Aprano wrote:
> John Nagle wrote:
> >    Is "nlargest" smart enough to decide when it's cheaper to track the
> > N largest entries on a linear pass through the list than to sort?

It *always* does a linear pass through the list (linear, that is in
the length of the entire list). It tracks the n largest elements seen
so far in a heap.

> Doesn't appear to do so. From Python 3.1:

I think you're mis-reading it.

> def nlargest(n, iterable):
>     """Find the n largest elements in a dataset.
>
>     Equivalent to:  sorted(iterable, reverse=True)[:n]
>     """
>     it = iter(iterable)
>     result = list(islice(it, n))
>     if not result:
>         return result
>     heapify(result)
>     _heappushpop = heappushpop
>     for elem in it:
>         _heappushpop(result, elem)
>     result.sort(reverse=True)
>     return result

That doesn't sort, or even heapify, the whole list. It keeps the
largest n elements seen so far in the 'result' heap, smallest on top.

Note the doc for heappushpop: "Push item on the heap, then pop and
return the smallest item from the heap." Thus the 'result' heap stays
of size n.

The final result.sort() is just so the returned list is sorted, and
assuming that's a requirement, I think the algorithm is asymptotically
the best a comparison-based method can do, regardless of whether the
length of the entire sequence dominates n. I figure the worst-case run
time is Theta(s lg(n)) where s in the length of the sequence.

> Interestingly, nsmallest does use two different algorithms,
> depending on how many items you ask for. See the source code.

That is interesting. The above algorithm for nlargest is better, but
to use it for nsmallest requires a largest-on-top heap, which the
module does not bother to implement.


--
--Bryan
From: Raymond Hettinger on
On Apr 22, 10:49 am, John Nagle <na...(a)animats.com> wrote:
> Chris Rebert wrote:
> > 2010/4/22 Jo Chan <csj...(a)gmail.com>:
> >> Hi,friends.
> >>  I wanna ask if there is a function which is able to take a list as argument
> >> and then return its top-k maximums?
> >> I only know about max which is poorly a top-1 maximum function, now I want
> >> more yet I am lazy enough that don't want to write one by myself.
>
> >http://docs.python.org/library/heapq.html#heapq.nlargest
>
> > Cheers,
> > Chris
> > --
> >http://blog.rebertia.com
>
>    Is "nlargest" smart enough to decide when it's cheaper to track the
> N largest entries on a linear pass through the list than to sort?

nlargest() has gotten smarter over time. So, the answer depends on
which version you're using.

Here's a snippet from http://svn.python.org/view/python/branches/py3k/Lib/heapq.py?revision=70695&view=markup

def nlargest(n, iterable, key=None):
"""Find the n largest elements in a dataset.

Equivalent to: sorted(iterable, key=key, reverse=True)[:n]
"""

# Short-cut for n==1 is to use max() when len(iterable)>0
if n == 1:
it = iter(iterable)
head = list(islice(it, 1))
if not head:
return []
if key is None:
return [max(chain(head, it))]
return [max(chain(head, it), key=key)]

# When n>=size, it's faster to use sort()
try:
size = len(iterable)
except (TypeError, AttributeError):
pass
else:
if n >= size:
return sorted(iterable, key=key, reverse=True)[:n]

# When key is none, use simpler decoration
if key is None:
it = zip(iterable, count(0,-1)) # decorate
result = _nlargest(n, it)
return [r[0] for r in result] #
undecorate

# General case, slowest method
in1, in2 = tee(iterable)
it = zip(map(key, in1), count(0,-1), in2) # decorate
result = _nlargest(n, it)
return [r[2] for r in result] #
undecorate
From: Raymond Hettinger on
On Apr 23, 1:24 am, Bryan <bryanjugglercryptograp...(a)yahoo.com> wrote:
> That is interesting. The above algorithm for nlargest is better, but
> to use it for nsmallest requires a largest-on-top heap, which the
> module does not bother to implement.

FWIW, the pure python versions differ because they are both
implemented in terms of the basic minheap functions, but the
underlying C code uses the same algorithm for both (using an
underlying maxheap when necessary):

http://svn.python.org/view/python/branches/py3k/Modules/_heapqmodule.c?revision=64351&view=markup


Raymond