From: News123 on
Dennis Lee Bieber wrote:
> On Thu, 29 Apr 2010 11:38:28 +0200, "Karin Lagesen"
> <karin.lagesen(a)bio.uio.no> declaimed the following in comp.lang.python:
>
>> Hello.
>>
>> I have approx 83 million strings, all 14 characters long. I need to be
>> able to take another string and find out whether this one is present
>> within the 83 million strings.
>>

>>
> So don't load them into memory... First use a file-based (not memory
>
>
> That lets you do a binary search on the file. Much faster than a
> linear search (linear search will average out to 41.5M read operations;
> binary should be around 10000 reads)

Don't you mean 27 reads instead of 41.5M reads?

>>> from math import log
>>> log(83e6)/log(2)
26.306608000671101
>>>
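
A minimal sketch of that file-based binary search, assuming the strings
have been sorted and written to disk as fixed-width 15-byte records (14
characters plus a newline separator); the file name is made up:

import os

RECLEN = 15   # 14 characters per string plus a newline separator

def file_contains(path, target, reclen=RECLEN):
    """Binary-search a sorted file of fixed-width records for target."""
    nrecords = os.path.getsize(path) // reclen
    with open(path, 'rb') as f:
        lo, hi = 0, nrecords
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * reclen)                 # one seek+read per probe
            record = f.read(reclen)[:reclen - 1]
            if record == target:
                return True
            elif record < target:
                lo = mid + 1
            else:
                hi = mid
    return False

# Roughly log2(83e6), i.e. about 27 probes per lookup:
# file_contains('strings.sorted', b'ACGTACGTACGTAC')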


N
From: Stefan Behnel on
Duncan Booth, 30.04.2010 10:20:
> So more than 3GB just for the strings (and that's for Python 2.x; on
> Python 3.x you'll need nearly 5GB).
>
> Running on a 64 bit version of Python should be fine, but for a 32 bit
> system a naive approach just isn't going to work.
>
> Option 1: use a trie. That should reduce storage, maybe it will reduce
> it enough, maybe not. It depends on the data.

Depending on the implementation and the data, a trie can also use a lot
/more/ space than the set of strings that it contains. The 83 million
14-character strings may well include a relatively large subset of the
possible permutations (imagine purely numeric, hex-numeric or lower-case
alphanumeric strings, for example), so the trie will likely need to branch
very often with very short intermediate runs. If you use pointers for
trie branches, that's at least 8 bytes per branch on a 64-bit system, versus
1 byte per character in a byte string list. Depending on the ratio of
branches to characters, one or the other may win. So a "naive approach"
likely won't work for tries either.
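
To make the branching overhead concrete, here is a deliberately naive
dict-of-dicts trie sketch; in CPython every node is a full dict object,
so each branch costs far more than a bare 8-byte pointer, which is why
this naive form tends to lose against a flat list of the strings. The
sample strings are made up:

def trie_add(root, s):
    # Insert s into a dict-of-dicts trie; a None key marks end-of-string.
    node = root
    for ch in s:
        node = node.setdefault(ch, {})
    node[None] = True

def trie_contains(root, s):
    node = root
    for ch in s:
        if ch not in node:
            return False
        node = node[ch]
    return None in node

root = {}
trie_add(root, 'ACGTACGTACGTAC')
print(trie_contains(root, 'ACGTACGTACGTAC'))   # True
print(trie_contains(root, 'ACGTACGTACGTAA'))   # False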

Stefan

From: News123 on
Dennis Lee Bieber wrote:
> On Sat, 01 May 2010 13:48:02 +0200, News123 <news1234(a)free.fr> declaimed
> the following in gmane.comp.python.general:
>
>> Dennis Lee Bieber wrote:
>>> That lets you do a binary search on the file. Much faster than a
>>> linear search (linear search will average out to 41.5M read operations;
>>> binary should be around 10000 reads)
>> Don't you mean 27 reads instead of 41.5M reads?
>>
> Don't you mean my 10000 reads? The 41.5M is the average for the
> linear search.
>
Indeed, this is what I meant. Or, phrased differently:
"about 27 reads with a binary search instead of 41.5M reads on average
with a linear search."




>>>>> from math import log
>>>>> log(83e6)/log(2)
>> 26.306608000671101
> Probably -- it was late at night and I was working things out in my
> head...

I know about late nights. I just wondered whether I overlooked something.
From: Albert van der Horst on
In article <877hnpjtdw.fsf(a)rudin.co.uk>,
Paul Rudin <paul.nospam(a)rudin.co.uk> wrote:
>"Karin Lagesen" <karin.lagesen(a)bio.uio.no> writes:
>
>> Hello.
>>
>> I have approx 83 million strings, all 14 characters long. I need to be
>> able to take another string and find out whether this one is present
>> within the 83 million strings.
>>
>> Now, I have tried storing these strings as a list, a set and a dictionary.
>> I know that finding things in a set and a dictionary is very much faster
>> than working with a list, so I tried those first. However, I run out of
>> memory building both the set and the dictionary, so what I seem to be left
>> with is the list, and using the in method.
>>
>> I imagine that there should be a faster/better way than this?
>>
>
>Shouldn't a set with 83 million 14 character strings be fine in memory
>on a stock PC these days? I suppose if it's low on ram you might start
>swapping which will kill performance. Perhaps the method you're using to
>build the data structures creates lots of garbage? How much ram do you
>have and how much memory does the python process use as it builds your
>data structures?

And if this is practical, there should be no swapping problems,
as the working set will be a small fraction of the data used.

<SNIP>
>
>There are other algorithms you can use that have better theoretical
>performance than using bisect for this particular problem, but you get
>into trade offs between executing things in python as opposed to C if
>you start to hand code things in python.

There are a lot of possibilities, but they depend a great deal on
secondary conditions, like how often the 83M dataset changes.
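
For reference, the bisect approach mentioned in the quoted text is only a
few lines once the sorted strings fit in memory; a minimal sketch with
made-up sample strings:

import bisect

def make_index(strings):
    # A sorted list is all bisect needs.
    return sorted(strings)

def contains(index, s):
    i = bisect.bisect_left(index, s)
    return i < len(index) and index[i] == s

index = make_index(['AAAAAAAAAAAAAA', 'ACGTACGTACGTAC', 'TTTTTTTTTTTTTT'])
print(contains(index, 'ACGTACGTACGTAC'))   # True
print(contains(index, 'GGGGGGGGGGGGGG'))   # False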

Groetjes Albert

--
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert(a)spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

From: Bryan on
Karin Lagesen wrote:
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
[...]
> I run out of memory building both the set and the dictionary, so
> what I seem to be left with is the list

I agree with the recommendations to try modules that maintain the set
on disk (sqlite, dbm, shelve), but here's an in-memory solution: Hash
to about 8 million buckets, and use a compact representation for the
several strings in each bucket. That gives you most of the speed and
avoids most of the memory overhead. Python makes it easy:


class ConstLenStrSet:

    """ Keep a set of strings all of the same length.
        Support set.add() and Python's 'in' test.
    """

    def __init__(self, strlen=14, nbuckets=8000000):
        assert strlen > 0 and nbuckets > 0
        self.strlen = strlen
        self.nbuckets = nbuckets | 1   # force an odd bucket count
        self.table = [''] * self.nbuckets

    def _hashstr(self, s):
        return hash(s) % self.nbuckets

    def add(self, s):
        assert len(s) == self.strlen
        if s not in self:
            # Append the string to its bucket's concatenated storage.
            self.table[self._hashstr(s)] += s

    def __contains__(self, s):
        # Scan the bucket in fixed-size slices of strlen characters.
        data = self.table[self._hashstr(s)]
        for i in range(0, len(data), self.strlen):
            if data[i : i + self.strlen] == s:
                return True
        return False


# Rudimentary demo-test:

from random import choice
from string import letters

def rnd_str(slen=14):
    return ''.join(choice(letters) for _ in range(slen))

# Test against Python's built-in set
sref = set()
for i in range(830000):
    sref.add(rnd_str())

print 'Strings generated.'

sset = ConstLenStrSet(14, 80000)
for s in sref:
    sset.add(s)

print 'ConstLenStrSet loaded.'

for s in sref:
    assert s in sset

for i in range(1000):
    s = rnd_str()
    if s in sref:
        print 'That was unlucky.'
    else:
        assert s not in sset



If building the set is too slow, and you know you don't have a lot of
duplicate strings, you can use a faster insert method that doesn't
check whether the string is already in the set:


    def add_quick(self, s):
        assert len(s) == self.strlen
        self.table[self._hashstr(s)] += s
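
For completeness, the disk-based route mentioned at the top of this post
(sqlite, dbm, shelve) is also compact with the standard sqlite3 module;
a rough sketch, with the database file name made up:

import sqlite3

def build_db(path, strings):
    # Store the strings in an on-disk table; the PRIMARY KEY gives an index.
    conn = sqlite3.connect(path)
    conn.execute('CREATE TABLE IF NOT EXISTS strings (s TEXT PRIMARY KEY)')
    conn.executemany('INSERT OR IGNORE INTO strings VALUES (?)',
                     ((s,) for s in strings))
    conn.commit()
    return conn

def contains(conn, s):
    cur = conn.execute('SELECT 1 FROM strings WHERE s = ? LIMIT 1', (s,))
    return cur.fetchone() is not None

conn = build_db('strings.db', ['AAAAAAAAAAAAAA', 'ACGTACGTACGTAC'])
print(contains(conn, 'ACGTACGTACGTAC'))   # True
print(contains(conn, 'GGGGGGGGGGGGGG'))   # False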


--
--Bryan