From: Thomas Jollans on
On 08/07/2010 02:45 AM, dmtr wrote:
> I'm running into some performance / memory bottlenecks on large lists.
> Is there any easy way to minimize/optimize memory usage?
>
> Simple str() and unicode() objects [Python 2.6.4/Linux/x86]:
>>>> sys.getsizeof('') 24 bytes
>>>> sys.getsizeof('0') 25 bytes
>>>> sys.getsizeof(u'') 28 bytes
>>>> sys.getsizeof(u'0') 32 bytes
>
> Lists of str() and unicode() objects (see ref. code below):
>>>> [str(i) for i in xrange(0, 10000000)] 370 MB (37 bytes/item)
>>>> [unicode(i) for i in xrange(0, 10000000)] 613 MB (63 bytes/item)
>
> Well... 63 bytes per item for very short unicode strings... Is there
> any way to do better than that? Perhaps some compact unicode objects?

There is a certain price you pay for having full-featured Python objects.

What are you trying to accomplish anyway? Maybe the array module can be
of some help. Or numpy?
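For the numeric side of the data, for example, the array module keeps the values in one flat C buffer instead of a list of separate Python int objects. A minimal sketch, assuming the same 10,000,000 integers as in your test (it only helps for numbers, not for the strings themselves):

import array

# Pack the integers into a single C buffer, one signed long per item,
# instead of a list of full Python int objects.
a = array.array('l', xrange(10000000))
addr, nitems = a.buffer_info()
print "%d items, %d bytes in the raw buffer" % (nitems, nitems * a.itemsize)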



>
> -- Regards, Dmitry
>
> ----
> import os, time, re
> start = time.time()
> l = [unicode(i) for i in xrange(0, 10000000)]
> dt = time.time() - start
> vm = re.findall("(VmPeak.*|VmSize.*)",
>                 open('/proc/%d/status' % os.getpid()).read())
> print "%d keys, %s, %f seconds, %f keys per second" % (
>     len(l), vm, dt, len(l) / dt)

From: Chris Rebert on
On Fri, Aug 6, 2010 at 6:39 PM, dmtr <dchichkov(a)gmail.com> wrote:
<snip>
>> > Well...  63 bytes per item for very short unicode strings... Is there
>> > any way to do better than that? Perhaps some compact unicode objects?
>>
>> If you think that unicode objects are going to be *smaller* than byte
>> strings, I think you're badly informed about the nature of unicode.
>
> I don't think that unicode objects are going to be *smaller*!
> But AFAIK internally CPython uses UTF-8?

Nope. unicode objects internally use UCS-2 or UCS-4, depending on how
CPython was ./configure-d; the former is the default.
See PEP 261.
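You can check which build you have from sys.maxunicode (a quick illustration, not part of the original exchange): it is 65535 on a narrow (UCS-2) build and 1114111 on a wide (UCS-4) build, and each character then costs 2 or 4 bytes on top of the fixed object header.

import sys

if sys.maxunicode == 0xFFFF:
    print "narrow (UCS-2) build: 2 bytes per code unit"
else:
    print "wide (UCS-4) build: 4 bytes per code unit"

# Each extra character should add 2 or 4 bytes on top of the fixed header:
print sys.getsizeof(u''), sys.getsizeof(u'0'), sys.getsizeof(u'01')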

Cheers,
Chris
--
http://blog.rebertia.com
From: Christian Heimes on
> I'm running into some performance / memory bottlenecks on large lists.
> Is there any easy way to minimize/optimize memory usage?
>
> Simple str() and unicode() objects [Python 2.6.4/Linux/x86]:
>>>> sys.getsizeof('') 24 bytes
>>>> sys.getsizeof('0') 25 bytes
>>>> sys.getsizeof(u'') 28 bytes
>>>> sys.getsizeof(u'0') 32 bytes

A Python str object contains much more than just the raw string. On a
32-bit system it contains:

* a pointer to its type (ptr, 4 bytes)
* a reference counter (ssize_t, 4 bytes)
* the length of the string (ssize_t, 4 bytes)
* the cached hash of the string (long, 4 bytes)
* the interning state (int, 4 bytes)
* a null-terminated char array for its data.

Those 20 bytes of header, plus the trailing NUL byte and struct padding,
account for the 24 bytes that sys.getsizeof('') reports.
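You can see that fixed cost directly with sys.getsizeof -- a quick check (not part of the original post) showing that each extra character then adds only one byte on top of the header:

import sys

# The header cost is constant; the data part grows by one byte per character.
for s in ('', '0', '0123456789'):
    print repr(s), sys.getsizeof(s)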

Christian


From: Michael Torrie on
On 08/06/2010 07:56 PM, dmtr wrote:
> Ultimately a dict that can store ~20,000,000 entries: (u'short
> string' : (int, int, int, int, int, int, int)).

I think you really need a real database engine. With the proper
indexes, MySQL could be very fast storing and retrieving this
information for you. And it will use your RAM to cache as it sees fit.
Don't try to reinvent the wheel here.
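As a hedged sketch of that idea without running a separate server (using the standard-library sqlite3 module rather than the MySQL suggested above; the file, table and column names are made up), the u'short string' -> (7 ints) mapping could live on disk with an index on the key:

import sqlite3

# Sketch only: keep the mapping in an indexed on-disk table instead of a
# ~20,000,000-entry in-memory dict.  Schema and file name are invented.
conn = sqlite3.connect('data.db')
conn.execute("""CREATE TABLE IF NOT EXISTS entries (
                    key TEXT PRIMARY KEY,
                    a INTEGER, b INTEGER, c INTEGER, d INTEGER,
                    e INTEGER, f INTEGER, g INTEGER)""")

# Insert some sample rows; the PRIMARY KEY gives an index on the key column.
conn.executemany(
    "INSERT OR REPLACE INTO entries VALUES (?,?,?,?,?,?,?,?)",
    ((unicode(i), i, i, i, i, i, i, i) for i in xrange(1000)))
conn.commit()

print conn.execute("SELECT * FROM entries WHERE key = ?",
                   (u'42',)).fetchone()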