From: Laszlo Nagy

>> Maybe this should be implemented in C. But I believe that the
>> algorithm itself must be wrong (regardless of the language). I really
>> think that I'm doing something wrong. It looks like my algorithm's
>> processing time is not linear in the number of rows - not even
>> n*log(n). There should be a more efficient way to do this. But how?
>> I had the idea of sorting the rows by a given dimension. Then it
>> would be easy to speed up the indexing part - for that dimension.
>> PROBABLY sorting all rows would be faster than calling list.index()
>> for each row. But... it does not seem very efficient either. What if
>> I have 1 million rows and 10 dimensions? Do I sort 1 million rows on
>> disk 10 times? Some of you might have run into the same problem
>> before and can tell me the most efficient way to do this.
> The .index method does a linear search, checking on average 1/2 of the
> items in the list. That's why it's so slow.
> To avoid that, you could build, in a single pass, a dict mapping each
> value in dimension_values[col_idx] to its index, so that each lookup
> becomes a constant-time hash lookup.
Changed to dicts and hashed lookups. The processing time is now O(n),
and throughput went up to 8000 rows/sec.
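
Roughly, the change looks like this (a minimal sketch; dimension_values
and col_idx come from the earlier message, while the sample data and
row layout are made up for illustration):

    # dimension_values and col_idx are from the thread; the sample
    # data and row layout here are assumptions for illustration.
    dimension_values = [["red", "green", "blue"], ["small", "large"]]
    rows = [("green", "small"), ("blue", "large"), ("red", "small")]

    # One value -> index dict per dimension, each built in a single pass.
    value_index = [
        {value: idx for idx, value in enumerate(values)}
        for values in dimension_values
    ]

    # Indexing a row is now a constant-time hash lookup per dimension
    # instead of an O(n) scan with list.index().
    indexed_rows = [
        tuple(value_index[col_idx][value] for col_idx, value in enumerate(row))
        for row in rows
    ]

    print(indexed_rows)   # [(1, 0), (2, 1), (0, 0)]

Building the dicts costs one pass per dimension, and every row lookup
after that is constant time, which is where the O(n) total comes from.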

I'll probably never want to process more than 1M rows; at 8000 rows/sec
that will take about 125 seconds. Fair enough.

Thank you very much!