From: Terry Reedy on
On 4/29/2010 5:38 AM, Karin Lagesen wrote:
> Hello.
>
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.

If the 'other string' is also 14 chars, so that you are looking for
exact matches, and you are doing this repeatedly, you might look to
match hash values first.
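
A minimal sketch of one way to act on that (not necessarily what Terry
has in mind): keep only the hash values of the 83 million strings in a
compact sorted array, use that as a prefilter, and confirm real hits
against the data on disk. The file name 'strings.txt' (one string per
line) is made up for the example.

import bisect
from array import array

# array('l') stores platform C longs, the same type CPython 2 uses for
# hash() results, so each entry costs 4-8 bytes instead of ~38 per string.
hashes = array('l')
with open('strings.txt') as f:
    for line in f:
        hashes.append(hash(line.rstrip('\n')))
# Sorting goes through a temporary list, so do this once up front.
hashes = array('l', sorted(hashes))

def maybe_present(query):
    # False means definitely absent; True means "hash matched", so the
    # real data still has to be checked (hash collisions are possible).
    h = hash(query)
    i = bisect.bisect_left(hashes, h)
    return i < len(hashes) and hashes[i] == h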

From: Paul Rudin on
"Karin Lagesen" <karin.lagesen(a)bio.uio.no> writes:

> Hello.
>
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?
>

Shouldn't a set with 83 million 14 character strings be fine in memory
on a stock PC these days? I suppose if it's low on RAM you might start
swapping, which will kill performance. Perhaps the method you're using
to build the data structures creates lots of garbage? How much RAM do
you have, and how much memory does the Python process use as it builds
your data structures?

A set should give good performance if the target string is also 14
characters.

If you do end up with the list then try the bisect module
<http://docs.python.org/library/bisect.html> on a sorted copy of it; a
bisection lookup should be quicker than just using "in", which scans
linearly through the list testing for equality.
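
A minimal sketch of that approach (the sample strings here are made up;
in practice the full list would be sorted once up front):

import bisect

# Stand-in data; in practice this would be the full list of strings.
strings = sorted(["AAAAAAAAAAAAAA", "ACGTACGTACGTAC", "TTTTTTTTTTTTTT"])

def present(target):
    # bisect_left returns the index where target would be inserted to
    # keep the list sorted; if the string already at that index equals
    # target, it is present. Each lookup is O(log n) instead of O(n).
    i = bisect.bisect_left(strings, target)
    return i < len(strings) and strings[i] == target

print(present("ACGTACGTACGTAC"))   # True
print(present("ACGTACGTACGTAA"))   # False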

There are other algorithms with better theoretical performance than
bisection for this particular problem, but you get into trade-offs
between executing things in Python as opposed to C if you start
hand-coding things in Python.
From: Steven D'Aprano on
On Fri, 30 Apr 2010 08:23:39 +0100, Paul Rudin wrote:

> "Karin Lagesen" <karin.lagesen(a)bio.uio.no> writes:
>
>> Hello.
>>
>> I have approx 83 million strings, all 14 characters long. I need to be
>> able to take another string and find out whether this one is present
>> within the 83 million strings.
>>
>> Now, I have tried storing these strings as a list, a set and a
>> dictionary. I know that finding things in a set and a dictionary is
>> very much faster than working with a list, so I tried those first.
>> However, I run out of memory building both the set and the dictionary,
>> so what I seem to be left with is the list, and using the in method.
>>
>> I imagine that there should be a faster/better way than this?
>>
>>
> Shouldn't a set with 83 million 14 character strings be fine in memory
> on a stock PC these days?

Not even close. Using Python 2.6:

>>> s = "12345678901234"
>>> assert len(s) == 14
>>> import sys
>>> sys.getsizeof(s)
38

So a single 14 char string takes 38 bytes.


>>> import random, string
>>> chars = list(string.letters + string.digits)*4
>>> def rnd_str():
...     random.shuffle(chars)
...     return ''.join(chars[:14])
...
>>> s = set()
>>> while len(s) < 83000:
...     s.add(rnd_str())
...
>>> sys.getsizeof(s)
1048688


So a set with 83,000 such strings takes approximately 1 MB. So far
fairly trivial. But that's just the memory used by the container (the
set), not the contents: 38 bytes * 83,000 strings = another ~3 MB, which
of course is trivial for a modern PC. But the OP is using 83 million
such strings, not 83 thousand, which gives us a grand total of at least
3 gigabytes. An entry-level desktop PC these days generally has 2GB, and
entry-level notebooks might have half a gig.
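
A back-of-envelope check of that total, using the sizes measured above
(scaling the 83,000-entry set up by a factor of 1000 is only a rough
approximation, since sets resize in powers of two):

string_objects = 38 * 83 * 10**6   # 38 bytes per string, 83 million strings
set_container = 1048688 * 1000     # the 83,000-entry set, scaled up 1000x
print((string_objects + set_container) / float(2**30))   # roughly 3.9 (GB)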

If the OP is on a 64-bit system, every pointer will be twice the size,
leading to even larger memory requirements. And if the strings are
Unicode, then potentially they could be anything up to four times larger
each. Worst case, the OP could need something of the order of 24 GB to
store the strings all in memory.


--
Steven
From: Paul Rudin on
Duncan Booth <duncan.booth(a)invalid.invalid> writes:

> Paul Rudin <paul.nospam(a)rudin.co.uk> wrote:
>
>> Shouldn't a set with 83 million 14 character strings be fine in memory
>> on a stock PC these days? I suppose if it's low on RAM you might start
>> swapping, which will kill performance. Perhaps the method you're using
>> to build the data structures creates lots of garbage? How much RAM do
>> you have, and how much memory does the Python process use as it builds
>> your data structures?
>
> Some simple experiments should show you that a stock PC running a 32 bit
> Python will struggle:
>
>>>> s = "12345678901234"
>>>> sys.getsizeof(s)
> 38
>>>> 83*38
> 3154
>
> That's 83 million strings times 38 bytes, i.e. 3154 million bytes: more
> than 3GB just for the strings (and that's for Python 2.x; on Python 3.x
> you'll need nearly 5GB).
>
> Running on a 64-bit version of Python should be fine, but for a 32-bit
> system a naive approach just isn't going to work.

It depends: a few gigs of RAM can be cheap compared with programmer
time. If you know you can solve a problem by spending a few euros on
some extra RAM, that can be a good solution! It depends, of course,
where the code is being deployed: if it's something that's intended to
be deployed widely, then you can't expect thousands of people to go out
and buy more RAM; but if it's a one-off deployment for a particular
environment, then it can be the best way to go.

From: Christian Heimes on
>>>> s = "12345678901234"
>>>> assert len(s) == 14
>>>> import sys
>>>> sys.getsizeof(s)
> 38
>
> So a single 14 char string takes 38 bytes.

Make that at least 40 bytes. You have to take memory alignment into account.

> So a set with 83,000 such strings takes approximately 1 MB. So far
> fairly trivial. But that's just the memory used by the container (the
> set), not the contents: 38 bytes * 83,000 strings = another ~3 MB, which
> of course is trivial for a modern PC. But the OP is using 83 million
> such strings, not 83 thousand, which gives us a grand total of at least
> 3 gigabytes. An entry-level desktop PC these days generally has 2GB, and
> entry-level notebooks might have half a gig.

You are pretty much screwed on a 32-bit system here. In my experience a
32-bit system can't store more than 2.5 to 2.8 GB on the heap.
Eventually malloc() will fail, since large parts of the 4 GB address
space are reserved for other things like the stack, the entry point,
shared library mappings, error detection etc. Memory fragmentation isn't
an issue here.

Other ideas:

* Use a SQL database with an index on the data column. The index could
also optimize the "starting with" case (a small sqlite3 sketch follows
below).
* You need to search for a string inside a large set of texts? Sounds
like a job for a full-text search engine! Grab PyLucene and index your
data in a Lucene index. An SSD helps a lot here.
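
A minimal sketch of the sqlite3 idea (the table name and sample data are
made up; a real run would use an on-disk file and load the 83 million
strings once):

import sqlite3

# ':memory:' keeps the sketch self-contained; use a file such as
# 'strings.db' so the load is only done once.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE seqs (s TEXT PRIMARY KEY)')

sample = ['AAAAAAAAAAAAAA', 'ACGTACGTACGTAC', 'TTTTTTTTTTTTTT']
conn.executemany('INSERT OR IGNORE INTO seqs VALUES (?)',
                 ((s,) for s in sample))
conn.commit()

def present(query):
    # The PRIMARY KEY index gives a logarithmic lookup on disk, so the
    # strings never have to be held in RAM all at once.
    cur = conn.execute('SELECT 1 FROM seqs WHERE s = ? LIMIT 1', (query,))
    return cur.fetchone() is not None

print(present('ACGTACGTACGTAC'))   # True
print(present('GGGGGGGGGGGGGG'))   # False

# Depending on collation settings, the same index may also serve the
# "starting with" case via a prefix query such as LIKE 'ACGT%'.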

Christian