From: Laszlo Nagy

>> Maybe this should be implemented in C. But I believe that the
>> algorithm itself must be wrong (regardless of the language). I really
>> think that I'm doing something wrong. It looks like my algorithm's
>> processing time is not linear in the number of rows - not even
>> n*log(n). There should be a more efficient way to do this. But how?
>> I had the idea of sorting the rows by a given dimension. Then it
>> would be easy to speed up the indexing part - for that dimension.
>> PROBABLY sorting all rows would be faster than calling list.index()
>> for each row. But... it does not seem very efficient either. What if
>> I have 1 million rows and 10 dimensions? Do I sort 1 million rows on
>> disk 10 times? Some of you might have run into the same problem
>> before and can tell me the most efficient way to do this.
> The .index method does a linear search, checking on average 1/2 of the
> items in the list. That's why it's so slow.
> To avoid that, you could build, in a single pass, a dict mapping each
> value in dimension_values[col_idx] to its index, so that each lookup
> becomes a constant-time hash lookup.
Changed to dicts and hashed lookups. The processing time is now O(n),
and throughput went up to 8000 rows/sec.
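
Roughly, the change looks like this (a minimal sketch; dimension_values
and col_idx come from the earlier message, while the sample data and
row layout are made up for illustration):

    # dimension_values and col_idx are from the thread; the sample
    # data and row layout here are assumptions for illustration.
    dimension_values = [["red", "green", "blue"], ["small", "large"]]
    rows = [("green", "small"), ("blue", "large"), ("red", "small")]

    # One value -> index dict per dimension, each built in a single pass.
    value_index = [
        {value: idx for idx, value in enumerate(values)}
        for values in dimension_values
    ]

    # Indexing a row is now a constant-time hash lookup per dimension
    # instead of an O(n) scan with list.index().
    indexed_rows = [
        tuple(value_index[col_idx][value] for col_idx, value in enumerate(row))
        for row in rows
    ]

    print(indexed_rows)   # [(1, 0), (2, 1), (0, 0)]

Building the dicts costs one pass per dimension, and every row lookup
after that is constant time, which is where the O(n) total comes from.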

I'll probably never want to process more than 1M rows; at 8000 rows/sec
that will take about 125 seconds. Fair enough.

Thank you very much!