From: Brian Candler on
I plan to parse a huge XML document (too big to fit into RAM) using a
stream parser. I can divide the stream into logical chunks which can be
processed individually. If a particular chunk fails, I want to append it
to an output XML file, which will contain all the failed chunks, and can
be patched up and retried.

To do this, I want to be able to regenerate the XML of the failed chunk,
preferably identical to how it was seen.

The options I can think of are:

1. A stream parser which gives me the raw XML alongside each parsed
item; I can concatenate the raw XML into a string.

2. A stream parser which gives me the byte position of the current node,
so I can seek back within the file to fetch the XML again.

3. A stream parser which gives me events to identify the different parts
of XML, together with an inverse process through which I can replay the
events and get the XML back again.

Playing with REXML StreamListener, I can get a series of method calls
like tag_start(...) and tag_end(...), and I can collect these into an
array; is there existing code which would let me replay that array and
recreate the XML? Any other options I should be looking at?

Thanks,

Brian.
--
Posted via http://www.ruby-forum.com/.

From: Caleb Clausen on
On 4/30/10, Brian Candler <b.candler(a)pobox.com> wrote:
> I plan to parse a huge XML document (too big to fit into RAM) using a
> stream parser. I can divide the stream into logical chunks which can be
> processed individually. If a particular chunk fails, I want to append it
> to an output XML file, which will contain all the failed chunks, and can
> be patched up and retried.
>
> To do this, I want to be able to regenerate the XML of the failed chunk,
> preferably identical to how it was seen.
>
> The options I can think of are:
>
> 1. A stream parser which gives me the raw XML alongside each parsed
> item; I can concatenate the raw XML into a string.
>
> 2. A stream parser which gives me the byte position of the current node,
> so I can seek back within the file to fetch the XML again.
>
> 3. A stream parser which gives me events to identify the different parts
> of XML, together with an inverse process through which I can replay the
> events and get the XML back again.
>
> Playing with REXML StreamListener, I can get a series of method calls
> like tag_start(...) and tag_end(...), and I can collect these into an
> array; is there existing code which would let me replay that array and
> recreate the XML? Any other options I should be looking at?

In my experience, REXML is far too wimpy to deal with data on this
scale. (Among other things, it was too slow.) I suggest using the
'stream parser' (a misnomer; it is really a lexer) in libxml
instead. I don't know for sure whether it can reconstruct the original
text the way you want, but that should be possible.

I think the class you'd want is LibXML::XML::SaxParser. See
http://libxml.rubyforge.org/.

From: John W Higgins on

Morning,

On Fri, Apr 30, 2010 at 9:26 AM, Brian Candler <b.candler(a)pobox.com> wrote:

> I plan to parse a huge XML document (too big to fit into RAM) using a
> stream parser. I can divide the stream into logical chunks which can be
> processed individually. If a particular chunk fails, I want to append it
> to an output XML file, which will contain all the failed chunks, and can
> be patched up and retried.
>

If you aren't completely against Perl, XML-Twig [1] has a tool called
xml_split [2] which works rather well at splitting XML files. You might
wish to split your file into smaller files before even beginning the
processing; then if a file fails to process, you already have it in
hand. When finished, you could merge the failed files back together
using xml_merge [3] from the same Perl package.

If there is some Ruby variant of this I couldn't locate it, but that
never means much :)

John

[1] - http://search.cpan.org/~mirod/XML-Twig-3.34/
[2] - http://search.cpan.org/~mirod/XML-Twig-3.34/tools/xml_split/xml_split
[3] - http://search.cpan.org/~mirod/XML-Twig-3.34/tools/xml_merge/xml_merge

From: Robert Dober on
On Fri, Apr 30, 2010 at 6:26 PM, Brian Candler <b.candler(a)pobox.com> wrote:

Would you care to use JRuby?
That would give you access to top XML Stream parsers IIRC ;)
Just as an example: org.apache.xerces.parsers.SAXParser seems very well
suited to your purpose; although it takes a little work to construct
your XML fragments, it should be rather easy.

HTH
R.

--
The best way to predict the future is to invent it.
-- Alan Kay

From: Brian Candler on
> Would you care to use JRuby?

I don't mind which stream parser, but Java is out :-)

Since this is a bit of disposable code, I've decided to cheat. I
pretty-print the XML, then read it a line at a time using gets into a
buffer, identify the range of lines which forms a chunk, and parse the
buffer. On error I write the buffer out again.
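The cheat looks roughly like this sketch. The chunk element name 'record', the each_chunk helper and the simulated failure are placeholders of my own, and the line matching assumes the pretty-printer puts each record's open and close tags on their own lines:

```ruby
require 'rexml/document'
require 'stringio'

# Accumulate lines between <tag> and </tag> and yield each chunk's raw XML.
def each_chunk(io, tag)
  buffer = nil
  io.each_line do |line|
    buffer = +'' if line.include?("<#{tag}>") || line.include?("<#{tag} ")
    buffer << line if buffer
    if buffer && line.include?("</#{tag}>")
      yield buffer
      buffer = nil
    end
  end
end

failed = +''
input = StringIO.new(<<~XML)
  <records>
    <record>
      <name>ok</name>
    </record>
    <record>
      <name>bad</name>
    </record>
  </records>
XML

each_chunk(input, 'record') do |chunk|
  begin
    doc = REXML::Document.new(chunk)   # parse just this chunk
    raise 'simulated failure' if doc.root.text('name') == 'bad'
  rescue
    failed << chunk                    # keep the raw XML for patching and retry
  end
end

puts failed
```

Because the buffer holds the original lines verbatim, the failed chunks come out byte-identical to the input, which was the whole point.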

Thanks for all your suggestions.
