From: M.-A. Lemburg on
Dennis Lee Bieber wrote:
> On Thu, 29 Apr 2010 11:38:28 +0200, "Karin Lagesen"
> <karin.lagesen(a)bio.uio.no> declaimed the following in comp.lang.python:
>
>> Hello.
>>
>> I have approx 83 million strings, all 14 characters long. I need to be
>> able to take another string and find out whether this one is present
>> within the 83 million strings.
>>
> Is this "another string" also exactly 14 characters long?
>
>> Now, I have tried storing these strings as a list, a set and a dictionary.
>> I know that finding things in a set and a dictionary is very much faster
>> than working with a list, so I tried those first. However, I run out of
>> memory building both the set and the dictionary, so what I seem to be left
>> with is the list, and using the in method.
>>
> So don't load them into memory... First use a file-based (not memory
> limited) sort routine on the 80M strings (if position in the file is
> important, make a copy with the line number on the end of each line
> before sorting -- but since you've tried using Sets which do not retain
> order... probably not a matter).
>
> Then, since you know the length of the file, and the length of each
> line is fixed, you can do direct access I/O (well, since the C-stream
> style I/O doesn't do "record" access, you'll need to calculate offsets
> from the start of the file as (record# - 1) * recordlen)...
>
> That lets you do a binary search on the file. Much faster than a
> linear search (linear search will average out to 41.5M read operations;
> binary should be around 10000 reads)
>
> Or load the data into some relational database and let the compiled
> code do the searching -- probably faster than byte-code interpreted
> Python for the same algorithm...

.... or put the data into a simple on-disc dictionary such as
what mxBeeBase implements:

http://www.egenix.com/products/python/mxBase/mxBeeBase/

Once you have that dict set up, lookups are very fast.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, May 06 2010)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::


eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/