From: Stefan Behnel on
jia li, 28.07.2010 12:10:
> I have an XML file with hundreds of <error> elements.
>
> What's strange is that only one of these elements could not be parsed correctly:
> <error>
> <checker>REVERSE_INULL</checker>
> <function>Dispose_ParameterList</function>
> <unmangled_function>Dispose_ParameterList</unmangled_function>
> <status>UNINSPECTED</status>
> <num>146</num>
> <home>1/146MMSLib_LinkedList.c</home>
> </error>
>
> I printed the data in "characters(self, data)" and after parsing. The result
> is that one "\r\n" is inserted after "1/" and "146MMSLib_LinkedList.c" for the
> latter.
>
> But if I reduce my XML file to only this element, it parses correctly.

First of all: don't use SAX. Use ElementTree's iterparse() function. That
will shrink your code down to a simple loop in a few lines.
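
A minimal sketch of such a loop, assuming Python 3's
xml.etree.ElementTree. The <errors> wrapper element, the sample data and
the read_errors() name are made up for illustration, since the real
file's root element is not shown in this thread:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the <errors> root element is an assumption.
SAMPLE = b"""<errors>
<error>
<checker>REVERSE_INULL</checker>
<function>Dispose_ParameterList</function>
<home>1/146MMSLib_LinkedList.c</home>
</error>
</errors>"""


def read_errors(source):
    """Yield one dict (child tag -> text) per <error> element."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "error":
            yield {child.tag: child.text for child in elem}


records = list(read_errors(io.BytesIO(SAMPLE)))
print(records[0]["home"])
```

ElementTree joins the text of an element before handing it over, so
child.text is one string however the underlying parser chunked it.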

Then, the problem is likely that you are getting separate events for text
nodes: SAX may deliver the text of a single element in more than one
characters() call. The "\r\n" most likely only occurs due to your print
statement; I doubt that it's really in the data returned from SAX. Again:
using ElementTree instead of SAX will avoid this kind of problem.
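
For anyone who has to stay with SAX, the usual remedy is to collect the
chunks and join them in endElement(). A sketch, assuming Python 3's
xml.sax; the HomeHandler class is a made-up name, and the second half
calls the handler by hand only to simulate a parser that splits the text:

```python
import xml.sax


class HomeHandler(xml.sax.ContentHandler):
    """Collect the complete text of each <home> element."""

    def __init__(self):
        super().__init__()
        self.homes = []      # finished <home> texts
        self._chunks = None  # a list while inside <home>, else None

    def startElement(self, name, attrs):
        if name == "home":
            self._chunks = []

    def characters(self, data):
        # May be called several times for a single text node.
        if self._chunks is not None:
            self._chunks.append(data)

    def endElement(self, name):
        if name == "home":
            self.homes.append("".join(self._chunks))
            self._chunks = None


# Real parse of a one-element document.
handler = HomeHandler()
xml.sax.parseString(
    b"<error><home>1/146MMSLib_LinkedList.c</home></error>", handler
)

# Simulated split delivery, as a parser is allowed to do.
split = HomeHandler()
split.startElement("home", {})
split.characters("1/")
split.characters("146MMSLib_LinkedList.c")
split.endElement("home")

print(handler.homes, split.homes)
```

Printing inside characters() shows each chunk on its own line, which
would explain a line break that is not actually in the data.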

Stefan

From: Aahz on
In article <mailman.1250.1280314148.1673.python-list(a)python.org>,
Stefan Behnel <stefan_ml(a)behnel.de> wrote:
>
>First of all: don't use SAX. Use ElementTree's iterparse() function. That
>will shrink your code down to a simple loop in a few lines.

Unless I'm missing something, that only helps if the final tree fits into
memory. What do you suggest other than SAX if your XML file may be
hundreds of megabytes?
--
Aahz (aahz(a)pythoncraft.com) <*> http://www.pythoncraft.com/

"...if I were on life-support, I'd rather have it run by a Gameboy than a
Windows box." --Cliff Wells

From: Stefan Behnel on
Aahz, 09.08.2010 18:52:
> In article <mailman.1250.1280314148.1673.python-list(a)python.org>,
> Stefan Behnel wrote:
>>
>> First of all: don't use SAX. Use ElementTree's iterparse() function. That
>> will shrink your code down to a simple loop in a few lines.
>
> Unless I'm missing something, that only helps if the final tree fits into
> memory. What do you suggest other than SAX if your XML file may be
> hundreds of megabytes?

Well, what about using ElementTree's iterparse() function in that case?
That's what it's good at, and its cElementTree version is extremely fast.

Stefan

From: Aahz on
In article <mailman.1860.1281375095.1673.python-list(a)python.org>,
Stefan Behnel <stefan_ml(a)behnel.de> wrote:
>Aahz, 09.08.2010 18:52:
>> In article <mailman.1250.1280314148.1673.python-list(a)python.org>,
>> Stefan Behnel wrote:
>>>
>>> First of all: don't use SAX. Use ElementTree's iterparse() function. That
>>> will shrink your code down to a simple loop in a few lines.
>>
>> Unless I'm missing something, that only helps if the final tree fits into
>> memory. What do you suggest other than SAX if your XML file may be
>> hundreds of megabytes?
>
>Well, what about using ElementTree's iterparse() function in that case?
>That's what it's good at, and its cElementTree version is extremely fast.

The docs say, "Parses an XML section into an element tree incrementally".
Sure sounds like it retains the entire parsed tree in RAM. Not good.
Again, how do you parse an XML file larger than your available memory
using something other than SAX?

From: Christian Heimes on
On 10.08.2010 01:20, Aahz wrote:
> The docs say, "Parses an XML section into an element tree incrementally".
> Sure sounds like it retains the entire parsed tree in RAM. Not good.
> Again, how do you parse an XML file larger than your available memory
> using something other than SAX?

The document at
http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ explains
one way to do it.

The iterparse approach is ingenious, but it doesn't work for every XML
format. Let's say you have a 10 GB XML file with one million <part/>
tags. An iterparser doesn't need to keep the entire document in memory.
Instead, it iterates over the file and yields (for example) one million
element subtrees, one for each <part/> tag and its children. You can get
the nice API of ElementTree with the memory efficiency of a SAX parser if
you follow "Listing 4" of that article.

Christian
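
A rough sketch of the pattern Christian describes, written against the
standard library's xml.etree.ElementTree in Python 3 rather than copied
from the article. It assumes the <part> elements are direct children of
the root; iter_parts() and the tiny sample document are made up for
illustration:

```python
import io
import xml.etree.ElementTree as ET


def iter_parts(source, tag="part"):
    """Yield each <part> element, then discard it to keep memory bounded."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem    # the caller must finish with elem before resuming
            root.clear()  # drop the processed subtree from the root


# Hypothetical stand-in for a very large file.
doc = b"<parts>" + b"".join(
    b'<part id="%d"><name>p%d</name></part>' % (i, i) for i in range(3)
) + b"</parts>"

names = [part.findtext("name") for part in iter_parts(io.BytesIO(doc))]
print(names)
```

Without the root.clear() call the loop still works, but the finished
elements stay attached to the root and the whole tree ends up in memory,
which is the concern Aahz raised.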