From: News123 on 1 May 2010 07:48

Dennis Lee Bieber wrote:
> On Thu, 29 Apr 2010 11:38:28 +0200, "Karin Lagesen"
> <karin.lagesen(a)bio.uio.no> declaimed the following in comp.lang.python:
>
>> Hello.
>>
>> I have approx 83 million strings, all 14 characters long. I need to be
>> able to take another string and find out whether this one is present
>> within the 83 million strings.
>>
> So don't load them into memory... First use a file-based (not memory
>
> That lets you do a binary search on the file. Much faster than a
> linear search (linear search will average out to 41.5M read operations;
> binary should be around 10000 reads)

Don't you mean 27 reads instead of 41.5M reads?

>>> from math import log
>>> log(83e6)/log(2)
26.306608000671101
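For anyone who wants to try that route, below is a minimal sketch of the
file-based binary search, assuming the 83M strings have been sorted and
written to disk one per line as fixed-width records (14 characters plus
a newline). The file name is made up for the example.

import os

RECLEN = 15  # 14 characters plus the newline

def file_contains(path, target):
    # Binary search over the sorted fixed-width file; roughly
    # log2(83e6) ~ 27 seek/read pairs per lookup.
    size = os.path.getsize(path)
    lo, hi = 0, size // RECLEN   # record indices, hi exclusive
    f = open(path, 'rb')
    try:
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECLEN)
            rec = f.read(RECLEN - 1)  # the 14-character key
            if rec == target:
                return True
            elif rec < target:
                lo = mid + 1
            else:
                hi = mid
        return False
    finally:
        f.close()

print file_contains('strings.sorted', 'AAAACCCCGGGGTT')  # hypothetical file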
From: Stefan Behnel on 1 May 2010 09:05

Duncan Booth, 30.04.2010 10:20:
> So more than 3GB just for the strings (and that's for Python 2.x; on
> Python 3.x you'll need nearly 5GB).
>
> Running on a 64 bit version of Python should be fine, but for a 32 bit
> system a naive approach just isn't going to work.
>
> Option 1: use a trie. That should reduce storage, maybe it will reduce
> it enough, maybe not. It depends on the data.

Depending on the implementation and the data, a trie can also use a lot
/more/ space than the set of strings that it contains. The 83 million
14-character strings may well include a relatively large subset of the
possible permutations (imagine purely numeric, hex-numeric or lower-case
alpha-numeric strings, for example), so the trie will likely need to
branch very often with very low intermediate run length. If you use
pointers for trie branches, that's at least 8 bytes per branch on a
64-bit system, versus 1 byte per character in a byte string list.
Depending on the ratio of branches to characters, one or the other may
win.

So a "naive approach" likely won't work for tries either.

Stefan
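To make the branching concern concrete, here is a minimal dict-of-dicts
trie sketch. It is not a compact C-style trie: every branch here is a
Python dict entry, which costs far more than the 8 bytes per pointer
mentioned above, so if anything it understates how badly a naive trie
can compare to flat string storage.

class Trie(object):
    def __init__(self):
        self.root = {}

    def add(self, s):
        # One dict per node; every distinct character at a given
        # depth adds a branch (a new key/value pair).
        node = self.root
        for ch in s:
            node = node.setdefault(ch, {})
        node[''] = True   # end-of-string marker

    def __contains__(self, s):
        node = self.root
        for ch in s:
            node = node.get(ch)
            if node is None:
                return False
        return '' in node

t = Trie()
t.add('abcdefghijklmn')
print 'abcdefghijklmn' in t   # True
print 'abcdefghijklxx' in t   # False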
From: News123 on 1 May 2010 17:46

Dennis Lee Bieber wrote:
> On Sat, 01 May 2010 13:48:02 +0200, News123 <news1234(a)free.fr> declaimed
> the following in gmane.comp.python.general:
>
>> Dennis Lee Bieber wrote:
>>> That lets you do a binary search on the file. Much faster than a
>>> linear search (linear search will average out to 41.5M read operations;
>>> binary should be around 10000 reads)
>> Don't you mean 27 reads instead of 41.5M reads?
>>
> Don't you mean my 10000 reads? The 41.5M is the average for the
> linear search.

Indeed, this is what I meant. Or phrased differently: "about 27 reads
with a binary search instead of 41.5M reads average with a linear
search."

>>>>> from math import log
>>>>> log(83e6)/log(2)
>> 26.306608000671101

> Probably -- it was late at night and I was working things out in my
> head...

I know about late nights. I just wondered whether I overlooked something.
From: Albert van der Horst on 2 May 2010 08:23

In article <877hnpjtdw.fsf(a)rudin.co.uk>,
Paul Rudin <paul.nospam(a)rudin.co.uk> wrote:
>"Karin Lagesen" <karin.lagesen(a)bio.uio.no> writes:
>
>> Hello.
>>
>> I have approx 83 million strings, all 14 characters long. I need to be
>> able to take another string and find out whether this one is present
>> within the 83 million strings.
>>
>> Now, I have tried storing these strings as a list, a set and a dictionary.
>> I know that finding things in a set and a dictionary is very much faster
>> than working with a list, so I tried those first. However, I run out of
>> memory building both the set and the dictionary, so what I seem to be left
>> with is the list, and using the in method.
>>
>> I imagine that there should be a faster/better way than this?
>>
>
>Shouldn't a set with 83 million 14 character strings be fine in memory
>on a stock PC these days? I suppose if it's low on ram you might start
>swapping which will kill performance. Perhaps the method you're using to
>build the data structures creates lots of garbage? How much ram do you
>have and how much memory does the python process use as it builds your
>data structures?

And if this is practical there should be no swapping problems, as the
working set will be a small fraction of the data used.

<SNIP>

>There are other algorithms you can use that have better theoretical
>performance than using bisect for this particular problem, but you get
>into trade-offs between executing things in python as opposed to C if
>you start to hand-code things in python.

There are a lot of possibilities, but they depend a great deal on
secondary conditions, like how often the 83M dataset changes.

Groetjes Albert
--
Albert van der Horst, UTRECHT, THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert(a)spe&ar&c.xs4all.nl &=n
http://home.hccnet.nl/a.w.m.van.der.horst
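For reference, the bisect approach mentioned above is only a few lines.
A sketch, assuming the strings fit in memory as one sorted list (which
avoids the per-entry hashing overhead of a set or dict):

import bisect

def contains(sorted_strings, s):
    # Binary search in a sorted list; O(log n) comparisons.
    i = bisect.bisect_left(sorted_strings, s)
    return i < len(sorted_strings) and sorted_strings[i] == s

data = sorted(['AAAAAAAAAAAAAA', 'CCCCCCCCCCCCCC', 'GGGGGGGGGGGGGG'])
print contains(data, 'CCCCCCCCCCCCCC')   # True
print contains(data, 'TTTTTTTTTTTTTT')   # False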
From: Bryan on 3 May 2010 11:53

Karin Lagesen wrote:
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
[...]
> I run out of memory building both the set and the dictionary, so
> what I seem to be left with is the list

I agree with the recommendations to try modules that maintain the set
on disk (sqlite, dbm, shelve), but here's an in-memory solution: hash
to about 8 million buckets, and use a compact representation for the
several strings in each bucket. That gives you most of the speed and
avoids most of the memory overhead. Python makes it easy:


class ConstLenStrSet:
    """ Keep a set of strings all of the same length.
        Support set.add() and Python's 'in' test.
    """
    def __init__(self, strlen=14, nbuckets=8000000):
        assert strlen > 0 and nbuckets > 0
        self.strlen = strlen
        self.nbuckets = nbuckets | 1  # force an odd bucket count
        self.table = [''] * self.nbuckets

    def _hashstr(self, s):
        return hash(s) % self.nbuckets

    def add(self, s):
        assert len(s) == self.strlen
        if s not in self:
            # Each bucket is one string holding the concatenated members.
            self.table[self._hashstr(s)] += s

    def __contains__(self, s):
        data = self.table[self._hashstr(s)]
        for i in range(0, len(data), self.strlen):
            if data[i : i + self.strlen] == s:
                return True
        return False


# Rudimentary demo-test:

from random import choice
from string import letters

def rnd_str(slen=14):
    return ''.join(choice(letters) for _ in range(slen))

# Test against Python's built-in set
sref = set()
for i in range(830000):
    sref.add(rnd_str())
print 'Strings generated.'

sset = ConstLenStrSet(14, 80000)
for s in sref:
    sset.add(s)
print 'ConstLenStrSet loaded.'

for s in sref:
    assert s in sset

for i in range(1000):
    s = rnd_str()
    if s in sref:
        print 'That was unlucky.'
    else:
        assert s not in sset


If building the set is too slow, and you know you don't have a lot of
duplicate strings, you can use a faster insert method that doesn't
check whether the string is already in the set:

    def add_quick(self, s):
        assert len(s) == self.strlen
        self.table[self._hashstr(s)] += s

--
--Bryan
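And since Bryan's first line points at sqlite/dbm/shelve, here is a
minimal sqlite3 sketch of that disk-backed route, letting an indexed
table answer the membership test instead of an in-memory structure. The
table and database file names are made up for the example.

import sqlite3

conn = sqlite3.connect('strings.db')   # hypothetical file name
conn.execute('CREATE TABLE IF NOT EXISTS strings (s TEXT PRIMARY KEY)')

def add(s):
    # PRIMARY KEY gives us both the index and duplicate handling.
    conn.execute('INSERT OR IGNORE INTO strings VALUES (?)', (s,))

def contains(s):
    cur = conn.execute('SELECT 1 FROM strings WHERE s = ?', (s,))
    return cur.fetchone() is not None

add('AAAAAAAAAAAAAA')
conn.commit()
print contains('AAAAAAAAAAAAAA')   # True
print contains('TTTTTTTTTTTTTT')   # False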