From: Terry Reedy on 29 Apr 2010 16:29

On 4/29/2010 5:38 AM, Karin Lagesen wrote:
> Hello.
>
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.

If the 'other string' is also 14 chars, so that you are looking for
exact matches, and you are doing this repeatedly, you might look to
match hash values first.
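A minimal sketch of that hash-first idea, assuming the 83 million strings
live in a plain text file, one per line (the file layout and helper names
here are assumptions, not something from this thread): keep only the
integer hash values in memory as a pre-filter, and fall back to a slow
exact scan only on a hit, to guard against the rare collision.

    def build_hash_set(path):
        # One integer per string instead of one string object per string.
        hashes = set()
        with open(path) as f:
            for line in f:
                hashes.add(hash(line.rstrip("\n")))
        return hashes

    def probably_present(query, hashes):
        # A miss here is definitive; a hit may (rarely) be a hash collision.
        return hash(query) in hashes

    def definitely_present(query, path):
        # Slow confirmation scan; only needed when probably_present() is True.
        with open(path) as f:
            return any(line.rstrip("\n") == query for line in f)

Whether the hash set itself fits in memory depends on the build, but a set
of ints is smaller than a set of the string objects, and the disk scan only
happens for candidate hits.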
From: Paul Rudin on 30 Apr 2010 03:23

"Karin Lagesen" <karin.lagesen(a)bio.uio.no> writes:
> Hello.
>
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?

Shouldn't a set with 83 million 14 character strings be fine in memory
on a stock PC these days? I suppose if it's low on ram you might start
swapping which will kill performance. Perhaps the method you're using
to build the data structures creates lots of garbage? How much ram do
you have and how much memory does the python process use as it builds
your data structures?

A set should give good performance if the target string is also 14
characters.

If you do end up with the list, then try using bisect
<http://docs.python.org/library/bisect.html>, which should be quicker
than just using "in" (which I think just scans linearly through the
list testing for equality).

There are other algorithms with better theoretical performance than
bisect for this particular problem, but you get into trade-offs between
executing things in Python as opposed to C once you start hand-coding
things in Python.
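For that sorted-list-plus-bisect route, a minimal sketch using the stdlib
bisect module (the variable and function names are assumptions, not from
this thread; "strings" stands for the existing list of 14-character
strings):

    import bisect

    # Assumes `strings` is the existing list of 14-character strings.
    strings.sort()  # one-time cost; bisect requires sorted input

    def contains(sorted_strings, target):
        # Binary search: O(log n) comparisons instead of the linear scan of "in".
        i = bisect.bisect_left(sorted_strings, target)
        return i < len(sorted_strings) and sorted_strings[i] == target

    # Usage:
    # contains(strings, "ACGTACGTACGTAC")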
From: Steven D'Aprano on 30 Apr 2010 04:41

On Fri, 30 Apr 2010 08:23:39 +0100, Paul Rudin wrote:
> "Karin Lagesen" <karin.lagesen(a)bio.uio.no> writes:
>
>> Hello.
>>
>> I have approx 83 million strings, all 14 characters long. I need to be
>> able to take another string and find out whether this one is present
>> within the 83 million strings.
>>
>> Now, I have tried storing these strings as a list, a set and a
>> dictionary. I know that finding things in a set and a dictionary is
>> very much faster than working with a list, so I tried those first.
>> However, I run out of memory building both the set and the dictionary,
>> so what I seem to be left with is the list, and using the in method.
>>
>> I imagine that there should be a faster/better way than this?
>
> Shouldn't a set with 83 million 14 character strings be fine in memory
> on a stock PC these days?

Not even close. Using Python 2.6:

>>> s = "12345678901234"
>>> assert len(s) == 14
>>> import sys
>>> sys.getsizeof(s)
38

So a single 14 char string takes 38 bytes.

>>> import random, string
>>> chars = list(string.letters + string.digits)*4
>>> def rnd_str():
...     random.shuffle(chars)
...     return ''.join(chars[:14])
...
>>> s = set()
>>> while len(s) < 83000:
...     s.add(rnd_str())
...
>>> sys.getsizeof(s)
1048688

So a set with 83000 such strings takes approximately 1 MB. So far fairly
trivial. But that's just the memory used by the container (the set), not
the contents. 38 bytes * 83,000 strings = another 3 MB. Which of course
is trivial for a modern PC, but the OP is using 83 million such strings,
not 83 thousand, which gives us a grand total of at least 3 gigabytes.
An entry level desktop PC these days is generally 2GB, and entry level
notebooks might have half a gig.

If the OP is on a 64 bit system, every pointer will be twice the size,
leading to even larger memory requirements. And if the strings are
unicode, then potentially they could be anything up to four times larger
each. Worst case, the OP could need something of the order of 24 GB to
store the strings all in memory.

--
Steven
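Making that extrapolation explicit, a rough back-of-envelope sketch (an
assumption-laden estimate, not code from the thread), scaling the
per-object sizes measured above up to 83 million entries; the 38-byte
figure is the 32-bit CPython 2.x value and will be larger on 64-bit builds:

    import sys

    N = 83 * 10**6                                 # 83 million strings
    per_string = sys.getsizeof("12345678901234")   # ~38 bytes on 32-bit CPython 2.x
    per_slot = 1048688 / 83000.0                   # ~12.6 bytes of set overhead per entry,
                                                   # from the 83,000-entry measurement above
    total = N * (per_string + per_slot)
    print("roughly %.1f GB" % (total / 1024.0 ** 3))   # about 3.9 GB under these assumptions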
From: Paul Rudin on 30 Apr 2010 04:50

Duncan Booth <duncan.booth(a)invalid.invalid> writes:
> Paul Rudin <paul.nospam(a)rudin.co.uk> wrote:
>
>> Shouldn't a set with 83 million 14 character strings be fine in memory
>> on a stock PC these days? I suppose if it's low on ram you might start
>> swapping which will kill performance. Perhaps the method you're using
>> to build the data structures creates lots of garbage? How much ram do
>> you have and how much memory does the python process use as it builds
>> your data structures?
>
> Some simple experiments should show you that a stock PC running a 32 bit
> Python will struggle:
>
>>>> s = "12345678901234"
>>>> sys.getsizeof(s)
> 38
>>>> 83*38
> 3154
>
> So more than 3GB just for the strings (and that's for Python 2.x; on
> Python 3.x you'll need nearly 5GB).
>
> Running on a 64 bit version of Python should be fine, but for a 32 bit
> system a naive approach just isn't going to work.

It depends - a few gig of RAM can be cheap compared with programmer time.
If you know you can solve a problem by spending a few euros on some extra
RAM, it can be a good solution!

It depends, of course, on where the code is being deployed - if it's
something that's intended to be deployed widely then you can't expect
thousands of people to go out and buy more RAM, but if it's a one-off
deployment for a particular environment then it can be the best way to go.
From: Christian Heimes on 30 Apr 2010 05:10

>>>> s = "12345678901234"
>>>> assert len(s) == 14
>>>> import sys
>>>> sys.getsizeof(s)
> 38
>
> So a single 14 char string takes 38 bytes.

Make that at least 40 bytes. You have to take memory alignment into
account.

> So a set with 83000 such strings takes approximately 1 MB. So far fairly
> trivial. But that's just the memory used by the container (the set), not
> the contents. 38 bytes * 83,000 strings = another 3 MB. Which of course
> is trivial for a modern PC, but the OP is using 83 million such strings,
> not 83 thousand, which gives us a grand total of at least 3 gigabytes. An
> entry level desktop PC these days is generally 2GB, and entry level
> notebooks might have half a gig.

You are pretty much screwed on a 32bit system here. In my experience, 32bit
systems can't store more than 2.5 to 2.8 GB on the heap. Eventually
malloc() will fail, since large parts of the 4 GB address space are
reserved for other things like the stack, the entry point, shared library
mappings, error detection etc. Memory fragmentation isn't an issue here.

Other ideas:

* Use a SQL database with an index on the data column. The index could
  optimize the "starting with" case.

* You need to search for a string inside a large set of texts? Sounds like
  a job for a fulltext search engine! Grab PyLucene and index your data
  inside a Lucene database. An SSD helps a lot here.

Christian
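For the SQL-database idea, a minimal sketch with the stdlib sqlite3 module
(the file name, table name and helper names are assumptions, not from this
thread): the strings live on disk, the PRIMARY KEY provides the index, and
both exact and "starting with" lookups become index probes instead of
in-memory scans.

    import sqlite3

    conn = sqlite3.connect("strings.db")
    conn.execute("CREATE TABLE IF NOT EXISTS seqs (s TEXT PRIMARY KEY)")

    def load(strings):
        # executemany plus a single commit keeps the one-off bulk load fast
        conn.executemany("INSERT OR IGNORE INTO seqs VALUES (?)",
                         ((s,) for s in strings))
        conn.commit()

    def present(target):
        # Exact match resolved via the primary-key index
        row = conn.execute("SELECT 1 FROM seqs WHERE s = ?", (target,)).fetchone()
        return row is not None

    def starting_with(prefix):
        # A range query on the index covers the "starting with" case
        cur = conn.execute("SELECT s FROM seqs WHERE s >= ? AND s < ?",
                           (prefix, prefix + u"\uffff"))
        return [row[0] for row in cur]

The range trick in starting_with() avoids LIKE, which SQLite only maps onto
the index under specific collation settings.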