From: M.-A. Lemburg on
Dennis Lee Bieber wrote:
> On Thu, 29 Apr 2010 11:38:28 +0200, "Karin Lagesen"
> <karin.lagesen(a)> declaimed the following in comp.lang.python:
>> Hello.
>> I have approx 83 million strings, all 14 characters long. I need to be
>> able to take another string and find out whether this one is present
>> within the 83 million strings.
> Is this "another string" also exactly 14 characters long?
>> Now, I have tried storing these strings as a list, a set and a dictionary.
>> I know that finding things in a set and a dictionary is very much faster
>> than working with a list, so I tried those first. However, I run out of
>> memory building both the set and the dictionary, so what I seem to be left
>> with is the list, and using the in method.
> So don't load them into memory... First use a file-based (not memory
> limited) sort routine on the 80M strings (if position in the file is
> important, make a copy with the line number on the end of each line
> before sorting -- but since you've tried using Sets which do not retain
> order... probably not a matter).
> Then, since you know the length of the file, and the length of each
> line is fixed, you can do direct access I/O (well, since the C-stream
> style I/O doesn't do "record" access, you'll need to calculate offsets
> from the start of the file as (record# - 1) * recordlen)...
> That lets you do a binary search on the file. Much faster than a
> linear search (linear search will average out to 41.5M read operations;
> binary should be around 10000 reads)
> Or load the data into some relational database and let the compiled
> code do the searching -- probably faster than byte-code interpreted
> Python for the same algorithm...

.... or put the data into a simple on-disc dictionary such as
what mxBeeBase implements:

Once you have that dict set up, lookups are very fast.

Marc-Andre Lemburg

Professional Python Services directly from the Source (#1, May 06 2010)
>>> Python/Zope Consulting and Support ...
>>> mxODBC.Zope.Database.Adapter ...
>>> mxODBC, mxDateTime, mxTextTools ...

::: Try our new mxODBC.Connect Python Database Interface for free ! :::: Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611