From: dmtr on
On Aug 6, 10:56 pm, Michael Torrie <torr...(a)gmail.com> wrote:
> On 08/06/2010 07:56 PM, dmtr wrote:
>
> > Ultimately a dict that can store ~20,000,000 entries: (u'short
> > string' : (int, int, int, int, int, int, int)).
>
> I think you really need a real database engine.  With the proper
> indexes, MySQL could be very fast storing and retrieving this
> information for you.  And it will use your RAM to cache as it sees fit.
>  Don't try to reinvent the wheel here.

No, I've tried. DB solutions are not even close in terms of speed;
processing would take weeks :( Memcached or Redis sort of work, but
they are still too slow to be a pleasure to work with. The standard
dict() container is *a lot* faster, and it is hassle free (it accepts
unicode keys, etc.). I just wish there were a more compact dict
container, optimized for large datasets and memory rather than speed.
With the default dict() I'm also running into some kind of nonlinear
performance degradation, apparently somewhere past
10,000,000-13,000,000 keys, but I can't recreate it with a solid
test case (see http://bugs.python.org/issue9520 ) :(
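
A minimal way to look for that kind of degradation (sketch only, not
the test case from the bug report: plain CPython 2, batched inserts,
nothing else):

import time

d = {}
batch = 1000000
for n in xrange(20):
    start = time.time()
    for i in xrange(n * batch, (n + 1) * batch):
        d[unicode(i)] = (i, i, i, i, i, i, i)
    print "%d keys: %.2f s for the last %d inserts" % (len(d),
        time.time() - start, batch)

If the slowdown is real, the per-batch times should start growing
noticeably somewhere past the 10,000,000-key mark.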

-- Dmitry
From: Peter Otten on
dmtr wrote:

>> > Well... 63 bytes per item for very short unicode strings... Is there
>> > any way to do better than that? Perhaps some compact unicode objects?
>>
>> There is a certain price you pay for having full-feature Python objects.
>
> Are there any *compact* Python objects? Optimized for compactness?
>
>> What are you trying to accomplish anyway? Maybe the array module can be
>> of some help. Or numpy?
>
> Ultimately a dict that can store ~20,000,000 entries: (u'short
> string' : (int, int, int, int, int, int, int)).

I don't know to what extent it still applies, but switching off cyclic
garbage collection with

import gc
gc.disable()

while building large data structures used to speed up things significantly.
That's what I would try first with your real data.
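
The usual pattern is to turn collection back on once the structure has
been built, e.g. (a sketch; load_records() is just a stand-in for
whatever produces your key/value pairs):

import gc

gc.disable()
try:
    d = {}
    for key, values in load_records():
        d[key] = values
finally:
    gc.enable()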

Encoding your unicode strings as UTF-8 could save some memory.

When your integers fit into two bytes, say, you can use an array.array()
instead of the tuple.
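
Combining the two could look roughly like this (a sketch; 'h' is the
signed 16-bit typecode, use 'i' if your values can exceed +/-32767):

from array import array

d = {}
key = u'short string'
d[key.encode('utf-8')] = array('h', (1, 2, 3, 4, 5, 6, 7))
print tuple(d[key.encode('utf-8')])    # back to a plain tuple when needed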

Peter

From: dmtr on
On Aug 6, 11:50 pm, Peter Otten <__pete...(a)web.de> wrote:
> I don't know to what extent it still applies, but switching off cyclic
> garbage collection with
>
> import gc
> gc.disable()


Haven't tried it on the real dataset. On the synthetic test it (and
sys.setcheckinterval(100000)) gave ~2% speedup and no change in memory
usage. Not significant. I'll try it on the real dataset though.


> while building large data structures used to speed up things
> significantly. That's what I would try first with your real data.
>
> Encoding your unicode strings as UTF-8 could save some memory.

Yes... In fact that's what I'm trying now... .encode('utf-8')
definitely creates some clutter in the code, but I guess I can
subclass dict... And it does save memory! A lot of it. Seems to be a
bit faster too...
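
The subclass could look roughly like this (a sketch only; a complete
version would also have to cover __contains__(), get(), setdefault(),
update() and friends):

class UTF8Dict(dict):
    # Encodes unicode keys to UTF-8 before storing or looking them up.
    # Iteration still yields the encoded str keys.
    def __setitem__(self, key, value):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        dict.__setitem__(self, key, value)
    def __getitem__(self, key):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        return dict.__getitem__(self, key)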

> When your integers fit into two bytes, say, you can use an array.array()
> instead of the tuple.

Excellent idea. Thanks! And it seems to work too, at least for the
test code. Here are some benchmarks (x86 desktop):

Unicode key / tuple:
>>> for i in xrange(0, 1000000): d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t 224704 kB', 'VmSize:\t 224704 kB'],
4.079240 seconds, 245143.698209 keys per second

>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
1000000 keys, ['VmPeak:\t 201440 kB', 'VmSize:\t 201440 kB'],
4.985136 seconds, 200596.331486 keys per second

>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t 125652 kB', 'VmSize:\t 125652 kB'],
3.572301 seconds, 279931.625282 keys per second

Almost halved the memory usage. And faster too. Nice.
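
(For reference, the VmPeak/VmSize lines above are just read from
/proc/<pid>/status after each run, roughly like the sketch below;
Linux-only.)

import os, re, time

def report(n_keys, start):
    dt = time.time() - start
    vm = re.findall("(VmPeak.*|VmSize.*)",
                    open('/proc/%d/status' % os.getpid()).read())
    print "%d keys, %s, %f seconds, %f keys per second" % (
        n_keys, vm, dt, n_keys / dt)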

-- Dmitry
From: dmtr on
Correction. I've copy-pasted it wrong! array.array('i', (i, i+1, i+2,
i+3, i+4, i+5, i+6)) was the best.

>>> for i in xrange(0, 1000000): d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t 224704 kB', 'VmSize:\t 224704 kB'],
4.079240 seconds, 245143.698209 keys per second

>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t 201440 kB', 'VmSize:\t 201440 kB'],
4.985136 seconds, 200596.331486 keys per second

>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
1000000 keys, ['VmPeak:\t 125652 kB', 'VmSize:\t 125652 kB'],
3.572301 seconds, 279931.625282 keys per second

-- Dmitry

From: Peter Otten on
dmtr wrote:

> On Aug 6, 11:50 pm, Peter Otten <__pete...(a)web.de> wrote:
>> I don't know to what extent it still applies, but switching off cyclic
>> garbage collection with
>>
>> import gc
>> gc.disable()
>
>
> Haven't tried it on the real dataset. On the synthetic test it (and
> sys.setcheckinterval(100000)) gave ~2% speedup and no change in memory
> usage. Not significant. I'll try it on the real dataset though.
>
>
>> while building large data structures used to speed up things
>> significantly. That's what I would try first with your real data.
>>
>> Encoding your unicode strings as UTF-8 could save some memory.
>
> Yes... In fact that's what I'm trying now... .encode('utf-8')
> definitely creates some clutter in the code, but I guess I can
> subclass dict... And it does save memory! A lot of it. Seems to be a
> bit faster too...
>
>> When your integers fit into two bytes, say, you can use an array.array()
>> instead of the tuple.
>
> Excellent idea. Thanks! And it seems to work too, at least for the
> test code. Here are some benchmarks (x86 desktop):
>
> Unicode key / tuple:
>>>> for i in xrange(0, 1000000): d[unicode(i)] = (i, i+1, i+2, i+3, i+4,
>>>> i+5, i+6)
> 1000000 keys, ['VmPeak:\t 224704 kB', 'VmSize:\t 224704 kB'],
> 4.079240 seconds, 245143.698209 keys per second
>
>>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] =
>>>> array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
> 1000000 keys, ['VmPeak:\t 201440 kB', 'VmSize:\t 201440 kB'],
> 4.985136 seconds, 200596.331486 keys per second
>
>>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = (i, i+1,
>>>> i+2, i+3, i+4, i+5, i+6)
> 1000000 keys, ['VmPeak:\t 125652 kB', 'VmSize:\t 125652 kB'],
> 3.572301 seconds, 279931.625282 keys per second
>
> Almost halved the memory usage. And faster too. Nice.

> def benchmark_dict(d, N):
>     start = time.time()
>
>     for i in xrange(N):
>         length = lengths[random.randint(0, 255)]
>         word = ''.join([letters[random.randint(0, 255)] for i in xrange(length)])
>         d[word] += 1
>
>     dt = time.time() - start
>     vm = re.findall("(VmPeak.*|VmSize.*)",
>                     open('/proc/%d/status' % os.getpid()).read())
>     print "%d keys (%d unique), %s, %f seconds, %f keys per second" % (
>         N, len(d), vm, dt, N / dt)
>

Looking at your benchmark, random.choice(letters) probably has less
overhead than letters[random.randint(...)]. You might even try to
inline it as

letters[int(random.random() * 256)]
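
A quick way to compare the three (a timeit sketch; the 256-entry
letters table here is made up just for the measurement):

import timeit

setup = "import random, string; letters = (string.ascii_lowercase * 10)[:256]"
for stmt in ("letters[random.randint(0, 255)]",
             "random.choice(letters)",
             "letters[int(random.random() * 256)]"):
    print stmt, timeit.timeit(stmt, setup=setup, number=100000)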

Peter