From: GZ on
Hi All,

I need to store a large number of large objects to a file and then
access them sequentially. I am talking about a few thousand objects,
each a few hundred kilobytes in size, with a total file size of a few
gigabytes. I tried shelve, but it is not good at accessing the data
sequentially. In essence, shelve.keys() takes forever.

I am wondering if there is a module that can persist a stream of
objects without having to load everything into memory. (For this
reason, I think Pickle is out, too, because it needs everything to be
in memory.)

Thanks,
GZ
From: Alex Willmer on
On Aug 7, 5:26 pm, GZ <zyzhu2...(a)gmail.com> wrote:
> I am wondering if there is a module that can persist a stream of
> objects without having to load everything into memory. (For this
> reason, I think Pickle is out, too, because it needs everything to be
> in memory.)

From the pickle docs it looks like you could do something like:

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle

file_obj = open('whatever', 'wb')
p = pickle.Pickler(file_obj)

for x in stream_of_objects:
    p.dump(x)
    p.memo.clear()  # keep the memo from growing across dumps

del p
file_obj.close()

then later

file_obj = open('whatever', 'rb')
p = pickle.Unpickler(file_obj)

while True:
    try:
        x = p.load()
        do_something_with(x)
    except EOFError:
        break

Your loading loop could be wrapped in a generator function, so only
one object should be held in memory at once.
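A minimal sketch of that generator idea (the function name and the sample
objects here are just illustrative):

```python
import os
import tempfile

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle             # Python 3

def iter_pickled(path):
    """Yield objects one at a time from a file of back-to-back pickles."""
    with open(path, 'rb') as f:
        unpickler = pickle.Unpickler(f)
        while True:
            try:
                yield unpickler.load()
            except EOFError:
                return

# Example: dump a few objects, then stream them back lazily.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    pickler = pickle.Pickler(f)
    for i in range(3):
        pickler.dump({'id': i, 'payload': 'x' * 100})
        pickler.memo.clear()  # drop memoized refs so dumped objects can be freed

ids = [obj['id'] for obj in iter_pickled(path)]
os.remove(path)
```

Since the generator holds only the object it just loaded, memory use stays
proportional to one object, not the whole file.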
From: GZ on
Hi Alex,

On Aug 7, 6:54 pm, Alex Willmer <a...(a)moreati.org.uk> wrote:
> On Aug 7, 5:26 pm, GZ <zyzhu2...(a)gmail.com> wrote:
>
> > I am wondering if there is a module that can persist a stream of
> > objects without having to load everything into memory. (For this
> > reason, I think Pickle is out, too, because it needs everything to be
> > in memory.)
>
> From the pickle docs it looks like you could do something like:
>
> try:
>     import cPickle as pickle
> except ImportError:
>     import pickle
>
> file_obj = open('whatever', 'wb')
> p = pickle.Pickler(file_obj)
>
> for x in stream_of_objects:
>     p.dump(x)
>     p.memo.clear()
>
> del p
> file_obj.close()
>
> then later
>
> file_obj = open('whatever', 'rb')
> p = pickle.Unpickler(file_obj)
>
> while True:
>     try:
>         x = p.load()
>         do_something_with(x)
>     except EOFError:
>         break
>
> Your loading loop could be wrapped in a generator function, so only
> one object should be held in memory at once.

This totally works!

Thanks!