From: Antoine Pitrou on
On Wed, 21 Jul 2010 09:47:08 -0500
Daniel Stutzbach <daniel(a)stutzbachenterprises.com> wrote:
>
> What's new?
> -----------
>
> - blist.sort() is now *substantially* faster than list.sort() when using int
> or float keys (O(n) vs. O(n log n))

Are you using some kind of radix sort?
Could it be contributed back into the standard list type?

Regards

Antoine.


From: Terry Reedy on
On 7/21/2010 12:15 PM, Daniel Stutzbach wrote:

> Here's a performance comparison of sorting with blist versus 3.1's list:
> http://stutzbachenterprises.com/performance-blist/sort-random-list

Question related to possible addition of radix sort to lists:

These tests use random numbers with a constant, relatively high density
of 25%, which is favorable to radix sort. What if you do the same test
with a constant range of, say, 1,000,000,000 (one billion), or even a
trillion or a quadrillion? Or how about sorting a thousand 128-bit IPv6
addresses? Which wins for that?
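
For concreteness, a lower-density run might look something like this
(just a sketch; 10**15 stands in for "a quadrillion", and the sizes are
arbitrary):

# Sketch: 10,000 keys drawn from a range of a quadrillion, so nearly
# every key is unique and the key density is tiny.
import timeit

setup = ("import random\n"
         "x = [random.randrange(10**15) for i in range(10000)]")
best = min(timeit.repeat("y = list(x); y.sort()", setup,
                         number=100, repeat=3))
print("best of 3: %.2f msec per loop" % (best / 100 * 1000))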

list.sort is (near) linear for lists that are mostly ordered. I think
Tim's writeup about this is in the source. For instance, if one extends
a sorted list with 1% random additions and resorts, list.sort will skip
the sorted part, sort the additions, and merge them back in. Radix sort
ignores pre-existing order. So either a lot of comparative profiling
would be needed for auto-selection of the sort method, or it should be
user selectable.
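
A throwaway version of that scenario (sizes picked arbitrarily) would be
something like:

# Sketch: a sorted list extended by 1% random additions, then re-sorted.
# list.sort (timsort) handles this in close to linear time; a plain
# radix sort takes no advantage of the existing order.
import timeit

setup = ("import random\n"
         "n = 100000\n"
         "x = list(range(n))\n"
         "x += [random.randrange(n) for i in range(n // 100)]")
t = min(timeit.repeat("y = list(x); y.sort()", setup, number=10, repeat=3))
print("mostly-ordered sort: %.1f msec per loop" % (t / 10 * 1000))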

Does your radix sort meet the stability guarantee of list.sort?
If not, a separate .rsort method would be needed anyway.

--
Terry Jan Reedy

From: Terry Reedy on
On 7/21/2010 6:56 PM, Daniel Stutzbach wrote:
> On Wed, Jul 21, 2010 at 5:27 PM, Terry Reedy <tjreedy(a)udel.edu> wrote:
>
>> These tests use random numbers with a constant, relatively high
>> density of 25%, which is favorable to radix sort. What if you do the
>> same test with a constant range of, say, 1,000,000,000 (one billion),
>> or even a trillion or a quadrillion? Or how about sorting a thousand
>> 128-bit IPv6 addresses? Which wins for that?

> blist switches to radix sort only if the keys contain only floats or
> only integers that fit into a C long. If the integers get too big, it
> reverts to timsort.
>
> When using a sort key, the decorate-sort-undecorate pattern requires
> touching every object once before the sort. The check for a consistent
> type is done inexpensively during the decorate phase.

And it is pretty cheap even when there is no decorate phase.
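
In Python terms, the check being described amounts to roughly the
following (a sketch only; the real code does this in C while building
the key/value pairs, and I am assuming a 64-bit C long):

# Sketch: radix sort is only safe if every key is a float, or every key
# is an int that fits in a C long; anything else falls back to timsort.
def choose_sort(keys, c_long_min=-2**63, c_long_max=2**63 - 1):
    all_floats = all(type(k) is float for k in keys)
    all_small_ints = all(type(k) is int and
                         c_long_min <= k <= c_long_max for k in keys)
    return "radix" if (all_floats or all_small_ints) else "timsort"

print(choose_sort([3.0, 1.5, 2.25]))   # radix
print(choose_sort([1, 2, 10**30]))     # timsort (too big for a C long)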

> Here's an example result with a density of only 1%, where the radix sort
> is around 4 times as fast:
>
> otto:~/src/blist$ python3.1 -m timeit -s 'import random' \
>     -s 'x = [random.randrange(10000*100) for i in range(10000)]' \
>     'y = list(x)' 'y.sort(key=float)'
> 100 loops, best of 3: 12.4 msec per loop
>
> otto:~/src/blist$ python3.1 -m timeit -s 'from blist import blist' \
>     -s 'import random' \
>     -s 'x = [random.randrange(10000*100) for i in range(10000)]' \
>     'y = blist(x)' 'y.sort(key=float)'
> 100 loops, best of 3: 3.4 msec per loop

> And a density of 0.01%:
>
> otto:~/src/blist$ python3.1 -m timeit -s 'import random' \
>     -s 'x = [random.randrange(10000*10000) for i in range(10000)]' \
>     'y = list(x)' 'y.sort(key=float)'
> 100 loops, best of 3: 12 msec per loop
>
> otto:~/src/blist$ python3.1 -m timeit -s 'from blist import blist' \
>     -s 'import random' \
>     -s 'x = [random.randrange(10000*10000) for i in range(10000)]' \
>     'y = blist(x)' 'y.sort(key=float)'
> 100 loops, best of 3: 3.52 msec per loop

Looks good so far. I would like to see that repeated all the way down to
range(10) to make sure people doing millions of small sorts were not
getting screwed.
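
Something along these lines would cover the small end (a sketch; the
sizes and key range are arbitrary, and swapping list(x) for blist(x)
with the appropriate import gives the comparison):

# Sketch: time the same sort at list lengths from 10 up to 10,000 to
# check that the radix path does not penalize very small sorts.
import timeit

for n in (10, 100, 1000, 10000):
    setup = ("import random\n"
             "x = [random.randrange(%d) for i in range(%d)]" % (n * 100, n))
    number = max(10, 100000 // n)
    t = min(timeit.repeat("y = list(x); y.sort(key=float)", setup,
                          number=number, repeat=3))
    print("n=%5d: %8.1f usec per sort" % (n, t / number * 1e6))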

>> list.sort is (near) linear for lists that are mostly ordered. I
>> think Tim's writeup about this is in the source. For instance, if
>> one extends a sorted list with 1% random additions and resorts,
>> list.sort will skip the sorted part, sort the additions, and merge
>> them back in. Radix sort ignores pre-existing order. So either a
>> lot of comparative profiling would be needed for auto-selection of
>> the sort method, or it should be user selectable.
> I've tested exactly that scenario. For already-ordered lists, radix
> sort and timsort tie.

Tim tested about 7 list structure scenarios with a range of lengths. If
you post a patch, I would request that you do the same.
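
For reference, the scenarios I mean are roughly the ones generated by
Tools/scripts/sortperf.py in the CPython source; a simplified sketch of
a few of them:

# Sketch (loosely modeled on sortperf.py): random, ascending, descending,
# ascending-plus-random-tail, many-duplicates, and all-equal inputs.
import random

def scenarios(n):
    asc = list(range(n))
    return {
        "random":     [random.random() for _ in range(n)],
        "ascending":  list(asc),
        "descending": list(reversed(asc)),
        "asc + tail": asc[:-10] + [random.randrange(n) for _ in range(10)],
        "duplicates": [random.randrange(4) for _ in range(n)],
        "all equal":  [0] * n,
    }

for name, data in scenarios(10000).items():
    data.sort()        # or sort a blist(data) for the comparison
    print(name, "sorted")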

Have you run a patched version against test_sort.py? I believe it mostly
tests lists of small ints, so radix methods would mostly be switched in.

If it were added and the switching were internal, new test cases would
be needed to test timsort.

>> Does your radix sort meet the stability guarantee of list.sort?
> Yes.

Great. There is, of course, a test for that in the suite.
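
For anyone unfamiliar with the guarantee: stability means that items
with equal keys keep their original relative order, which a test can
check along these lines (a sketch, not the actual test_sort.py code):

# Sketch of a stability check: sort (index, key) pairs by key only and
# verify that equal keys stay in their original order.
import random

data = [(i, random.randrange(10)) for i in range(1000)]   # many equal keys
out = sorted(data, key=lambda pair: pair[1])
for a, b in zip(out, out[1:]):
    if a[1] == b[1]:
        assert a[0] < b[0], "stability violated"
print("stable")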

--
Terry Jan Reedy

From: Mark Lawrence on
On 22/07/2010 23:25, Terry Reedy wrote:
> [snip]

Can I please book front-row tickets for the shoot-out between the
reigning heavyweight champ Timbot and the challenger, the up-and-coming
Danbot? :)

Kindest regards.

Mark Lawrence.

From: Terry Reedy on
On 7/22/2010 7:22 PM, Daniel Stutzbach wrote:
> On Thu, Jul 22, 2010 at 5:25 PM, Terry Reedy <tjreedy(a)udel.edu> wrote:

> That's a good point. It's tempting to add an undocumented parameter to
> blist.sort that selects the sorting algorithm to use, to make it
> easier to test multiple algorithms. There are probably several
> different ways to achieve a similar effect. Do you mind if we table
> that discussion until I actually have a patch?

Definitely. That will not be my decision. Since you seem serious about
this, I decided to give you a preview of the questions to expect, and
suggest some to answer in your initial filing.

Another sort of issue will be code maintainability. Two algorithms are
potentially more trouble than one. To me, that partly depends on how
well factored the current code is. It would be best if rsort could be
switched in for timsort after all preparations (such as decorate) are
done. I will leave that for you to address when you file.
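
In outline, the factoring I have in mind might look like this (the
names are made up for illustration, and the radix sort here is just a
toy stand-in for the real C code):

# Hypothetical factoring: after the decorate step has built
# (key, original_index, value) tuples and established the key type,
# a single dispatch point picks the algorithm.  Both branches must be
# stable so the existing tests can exercise either path.
def lsd_radix_sort(pairs):
    # Toy stable LSD radix sort on non-negative integer keys, 8 bits per
    # pass; real code must also handle negative ints and floats.
    if not pairs:
        return pairs
    max_key = max(p[0] for p in pairs)
    shift = 0
    while shift == 0 or (max_key >> shift) > 0:
        buckets = [[] for _ in range(256)]
        for item in pairs:
            buckets[(item[0] >> shift) & 0xFF].append(item)
        pairs = [item for bucket in buckets for item in bucket]
        shift += 8
    return pairs

def sort_decorated(pairs, keys_are_small_ints):
    if keys_are_small_ints:
        return lsd_radix_sort(pairs)
    return sorted(pairs, key=lambda p: p[0])   # the timsort path

data = [(3, "c"), (1, "a"), (3, "b")]
pairs = [(k, i, v) for i, (k, v) in enumerate(data)]
print(sort_decorated(pairs, True))   # [(1, 1, 'a'), (3, 0, 'c'), (3, 2, 'b')]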

And that is about all I can think of.

--
Terry Jan Reedy