Python parsing XML file problem with SAX [Python]

Prev: Which multiprocessing methods use shared memory?
Next: Nice way to cast a homogeneous tuple

From: Stefan Behnel on 10 Aug 2010 02:35

Christian Heimes, 10.08.2010 01:39:
> Am 10.08.2010 01:20, schrieb Aahz:
>> The docs say, "Parses an XML section into an element tree incrementally".
>> Sure sounds like it retains the entire parsed tree in RAM. Not good.
>> Again, how do you parse an XML file larger than your available memory
>> using something other than SAX?
>
> The document at
> http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ explains it
> one way.
>
> The iterparser approach is ingenious but it doesn't work for every XML
> format. Let's say you have a 10 GB XML file with one million<part/>
> tags. An iterparser doesn't load the entire document. Instead it
> iterates over the file and yields (for example) one million ElementTrees
> for each<part/> tag and its children. You can get the nice API of
> ElementTree with the memory efficiency of a SAX parser if you obey
> "Listing 4".

In the very common case that you are interested in all children of the root
element, it's even enough to intercept on the specific tag name (lxml.etree
has an option for that, but an 'if' block will do just fine in ET) and just
".clear()" the child element at the end of the loop body. That results in
very fast and simple code, but will leave the tags in the tree while only
removing their content and attributes. Usually works well enough for
several ten thousand elements, especially when using cElementTree.

As usual, a bit of benchmarking will uncover the right way to do it in your
case. That's also a huge advantage over SAX: iterparse code is much easier
to tune into a streamlined loop body when you need it.

Stefan

First | Prev |
Pages: 1 2
Prev: Which multiprocessing methods use shared memory?
Next: Nice way to cast a homogeneous tuple