From: dmtr on
I'm running into some performance / memory bottlenecks on large lists.
Is there any easy way to minimize/optimize memory usage?

Sizes of simple str() and unicode() objects [Python 2.6.4/Linux/x86]:
>>> sys.getsizeof('')     # 24 bytes
>>> sys.getsizeof('0')    # 25 bytes
>>> sys.getsizeof(u'')    # 28 bytes
>>> sys.getsizeof(u'0')   # 32 bytes

Lists of str() and unicode() objects (see ref. code below):
>>> [str(i) for i in xrange(0, 10000000)]       # 370 MB (37 bytes/item)
>>> [unicode(i) for i in xrange(0, 10000000)]   # 613 MB (63 bytes/item)

Well... 63 bytes per item for very short unicode strings... Is there
any way to do better than that? Perhaps some compact unicode objects?

-- Regards, Dmitry

----
import os, time, re

start = time.time()
l = [unicode(i) for i in xrange(0, 10000000)]
dt = time.time() - start

# Current/peak virtual memory of this process, from /proc on Linux.
vm = re.findall("(VmPeak.*|VmSize.*)",
                open('/proc/%d/status' % os.getpid()).read())

print "%d keys, %s, %f seconds, %f keys per second" % (
    len(l), vm, dt, len(l) / dt)
From: dmtr on
Steven, thank you for answering. See my comments inline. Perhaps I
should have formulated my question a bit differently: are there any
*compact*, high-performance containers for unicode()/str() objects in
Python? By *compact* I don't mean compressed -- just optimized for
memory usage rather than speed.

What I'm really looking for is a dict() that maps short unicode
strings to tuples of integers. But just having a *compact* list
container for unicode strings would help a lot (because I could add a
__dict__ and go from there).


> Yes, lots of ways. For example, do you *need* large lists? Often a better
> design is to use generators and iterators to lazily generate data when
> you need it, rather than creating a large list all at once.

Yes. I do need to be able to process large data sets.
No, there is no way I can use an iterator or lazily generate data when
I need it.


> An optimization that sometimes may help is to intern strings, so that
> there's only a single copy of common strings rather than multiple copies
> of the same one.

Unfortunately the strings are unique (think usernames on Facebook or
Wikipedia). And I can't afford to store them in db/memcached/redis/etc...
Too slow.


> Can you compress the data and use that? Without knowing what you are
> trying to do, and why, it's really difficult to advise a better way to do
> it (other than vague suggestions like "use generators instead of lists").

Yes, I've tried. But I was unable to find a good, unobtrusive way to
do that. Every attempt either adds unnecessary, pesky code, or is too
slow, or something like that. See more at: http://bugs.python.org/issue9520


> Very often, it is cheaper and faster to just put more memory in the
> machine than to try optimizing memory use. Memory is cheap, your time and
> effort is not.

Well... I'd really prefer to use, say, 16 bytes for 10-character strings
and fit the data into 8 GB, rather than pay an extra $1k for 32 GB.

> > Well...  63 bytes per item for very short unicode strings... Is there
> > any way to do better than that? Perhaps some compact unicode objects?
>
> If you think that unicode objects are going to be *smaller* than byte
> strings, I think you're badly informed about the nature of unicode.

I don't think that unicode objects are going to be *smaller*!
But AFAIK CPython uses UTF-8 internally? No? And 63 bytes per item
seems a bit excessive.
My question was: is there any way to do better than that...


> Python is not a low-level language, and it trades off memory compactness
> for ease of use. Python strings are high-level rich objects, not merely a
> contiguous series of bytes. If all else fails, you might have to use
> something like the array module, or even implement your own data type in
> C.

Are there any *compact*, high-performance containers (with a dict or
list interface) in Python?

-- Regards, Dmitry
From: dmtr on
> > Well...  63 bytes per item for very short unicode strings... Is there
> > any way to do better than that? Perhaps some compact unicode objects?
>
> There is a certain price you pay for having full-featured Python objects.

Are there any *compact* Python objects? Optimized for compactness?

> What are you trying to accomplish anyway? Maybe the array module can be
> of some help. Or numpy?

Ultimately a dict that can store ~20,000,000 entries:
(u'short string' : (int, int, int, int, int, int, int)).

-- Regards, Dmitry
From: Neil Hodgson on
dmtr:

> What I'm really looking for is a dict() that maps short unicode
> strings into tuples with integers. But just having a *compact* list
> container for unicode strings would help a lot (because I could add a
> __dict__ and go from it).

Add them all into one string or array and use indexes into that string.
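
A rough, untested sketch of that idea (the sample data and helper names
are just illustrative): keep all the characters in one shared unicode
buffer, keep cumulative end offsets in an array.array, and rebuild
individual strings on demand instead of holding millions of separate
string objects.

from array import array

words = [u'alice', u'bob', u'carol']   # illustrative data
buf = u''.join(words)                  # all characters, back to back
ends = array('l', [0])                 # cumulative end offsets
for w in words:
    ends.append(ends[-1] + len(w))

def get(i):
    # i-th string, rebuilt on demand from the shared buffer
    return buf[ends[i]:ends[i + 1]]

print get(1)   # bob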

Neil
From: Carl Banks on
On Aug 6, 6:56 pm, dmtr <dchich...(a)gmail.com> wrote:
> > > Well...  63 bytes per item for very short unicode strings... Is there
> > > any way to do better than that? Perhaps some compact unicode objects?
>
> > There is a certain price you pay for having full-featured Python objects.
>
> Are there any *compact* Python objects? Optimized for compactness?

Yes, but probably not in the way that'd be useful to you.

Look at the array module, and also consider the third-party numpy
library. They store compact arrays of (mostly) numeric types, but they
have character-type storage as well. That probably won't help you,
though, since you have variable-length strings.
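
For example (assuming numpy is available; this is just a quick
illustration), a fixed-width unicode dtype stores the characters with no
per-object overhead, at the cost of padding every entry to the same width:

import numpy as np

a = np.array([u'alice', u'bob', u'carol'], dtype='U10')   # 10 chars per slot
print a.nbytes   # 3 slots * 10 chars * 4 bytes/char = 120 bytes of data
print a[1]       # bob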

I don't know of any third-party types that can do what you want, but
there might be some. Search PyPI.


> > What are you trying to accomplish anyway? Maybe the array module can be
> > of some help. Or numpy?
>
> Ultimately a dict that can store ~20,000,000 entries: (u'short
> string' : (int, int, int, int, int, int, int)).

My recommendation would be to use sqlite3. Only if you know for sure
that it's too slow--meaning that you've actually tried it and it was
too slow, and nothing else--should you bother with a custom in-memory
data structure.
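
Something along these lines (a minimal, untested sketch; the schema and
the in-memory database are just for illustration):

import sqlite3

conn = sqlite3.connect(':memory:')   # or a file on disk
conn.execute("""CREATE TABLE words
                (word TEXT PRIMARY KEY,
                 a INT, b INT, c INT, d INT, e INT, f INT, g INT)""")
conn.executemany("INSERT INTO words VALUES (?,?,?,?,?,?,?,?)",
                 ((unicode(i), i, i, i, i, i, i, i) for i in xrange(1000)))
conn.commit()

row = conn.execute("SELECT * FROM words WHERE word = ?", (u'42',)).fetchone()
print row   # (u'42', 42, 42, 42, 42, 42, 42, 42)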

For that I'd probably go with a binary tree rather than a hash. So
you have a huge numpy character array that stores all 20 million short
strings end-to-end (in lexical order, so that you can look up the
strings with a binary search), then you have a numpy integer array
that stores the indices into this string where the word boundaries
are, and then an Nx7 numpy integer array storing the int return
values. That's three compact arrays.
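
Roughly (an untested sketch with tiny sample data; for simplicity the
concatenated characters are kept in a plain unicode string rather than a
numpy character array, which doesn't change the idea):

import numpy as np

words = sorted([u'alice', u'bob', u'carol'])       # lexical order
data = [(3, 1, 4, 1, 5, 9, 2), (2, 7, 1, 8, 2, 8, 1), (6, 0, 2, 2, 1, 4, 0)]

text = u''.join(words)                             # all strings end-to-end
bounds = np.cumsum([0] + [len(w) for w in words])  # word boundary indices
values = np.array(data, dtype=np.int32)            # N x 7 table of ints

def lookup(word):
    # plain binary search over the sorted, concatenated strings
    lo, hi = 0, len(bounds) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        w = text[bounds[mid]:bounds[mid + 1]]
        if w < word:
            lo = mid + 1
        elif w > word:
            hi = mid
        else:
            return values[mid]
    return None

print lookup(u'bob')   # [2 7 1 8 2 8 1]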


Carl Banks