From: jakecjacobson on
I need to take an XML web resource and split it up into smaller XML
files. I am able to retrieve the web resource, but I can't find any
good XML examples. I am just learning Python, so forgive me if this
question has been answered many times in the past.

My resource is like:

<document>
...
...
</document>
<document>
...
...
</document>

So in this example, I would need to output 2 files, each containing
what is between one pair of opening and closing document tags.
From: Adam Tauno Williams on
On Fri, 2010-01-29 at 09:25 -0800, jakecjacobson wrote:
> I need to take a XML web resource and split it up into smaller XML
> files. I am able to retrieve the web resource but I can't find any
> good XML examples. I am just learning Python so forgive me if this
> question has been answered many times in the past.
> My resource is like:
> <document>
> ...
> ...
> </document>
> <document>
> ...
> ...
> </document>
> So in this example, I would need to output 2 files with the contents
> of each file what is between the open and close document tag.

Do you want to parse the whole document into a tree, or process it as
a stream with SAX?

I have a SAX example at
<http://coils.hg.sourceforge.net/hgweb/coils/coils/file/99b227b08f7f/src/coils/logic/workflow/xml/bpml.py>
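
For readers who have not seen SAX before, here is a minimal sketch of the
streaming style, using only the standard library's xml.sax module. It is
not taken from the linked file. The class name DocumentCollector, the
<root> wrapper, and the sample data are made up for illustration, and the
handler assumes <document> elements are never nested inside each other.

```python
import xml.sax


class DocumentCollector(xml.sax.ContentHandler):
    """Collect the text of each <document> element as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.documents = []   # finished pieces of text, one per <document>
        self._buffer = None   # None means "not inside a <document> right now"

    def startElement(self, name, attrs):
        if name == "document":
            self._buffer = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "document":
            self.documents.append("".join(self._buffer).strip())
            self._buffer = None


# Hypothetical sample input with a single root element around the documents.
sample = b"<root><document>first</document><document>second</document></root>"

handler = DocumentCollector()
xml.sax.parseString(sample, handler)
```

After parsing, handler.documents holds one string per <document>. The
trade-off is that SAX never builds a tree, so it copes with very large
inputs, but you have to track state yourself in the callbacks.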

From: jakecjacobson on
On Jan 29, 1:04 pm, Adam Tauno Williams <awill...(a)opengroupware.us>
wrote:
> On Fri, 2010-01-29 at 09:25 -0800, jakecjacobson wrote:
> > I need to take a XML web resource and split it up into smaller XML
> > files.  I am able to retrieve the web resource but I can't find any
> > good XML examples.  I am just learning Python so forgive me if this
> > question has been answered many times in the past.
> > My resource is like:
> > <document>
> >      ...
> >      ...
> > </document>
> > <document>
> >      ...
> >      ...
> > </document>
> > So in this example, I would need to output 2 files with the contents
> > of each file what is between the open and close document tag.
>
> Do you want to parse the document or SaX?
>
> I have a SaX example at
> <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/99b227b08f7f/s...>

Thanks, but I am in way over my head with XML and Python. I am working
with DDMS and need to output each individual resource node to its own
file. I hope this helps. What I need is a good example and an
explanation of how to use it.

Here is what a resource node looks like:
<ddms:Resource
    xsi:schemaLocation="https://metadata.dod.mil/mdr/ns/DDMS/1.4/
        https://metadata.dod.mil/mdr/ns/DDMS/1.4/"
    xmlns:ddms="https://metadata.dod.mil/mdr/ns/DDMS/1.4/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:ICISM="urn:us:gov:ic:ism:v2">
  <ddms:identifier ddms:qualifier="URL"
      ddms:value="https://metadata.dod.mil/mdr/ns/TBD/1.0/SampleTaxonomy.owl"/>
  <ddms:identifier
      ddms:qualifier="https://metadata.dod.mil/mdr/ns/MDR/1.0/MDR.owl#GovernanceNamespace"
      ddms:value="TBD"/>
  <ddms:identifier ddms:qualifier="Version" ddms:value="1.0"/>
  <ddms:title ICISM:ownerProducer="USA"
      ICISM:classification="U">Sample Taxonomy</ddms:title>
  <ddms:description ICISM:ownerProducer="USA"
      ICISM:classification="U">
    This is a sample taxonomy created for the Help page.
  </ddms:description>
  <ddms:dates ddms:posted="2007-11-24"/>
  <ddms:creator ICISM:ownerProducer="USA"
      ICISM:classification="U">
    <ddms:Person>
      <ddms:name>Sample</ddms:name>
      <ddms:surname>Developer</ddms:surname>
      <ddms:affiliation>FGM, Inc.</ddms:affiliation>
      <ddms:phone>703-885-1000</ddms:phone>
      <ddms:email>sampleDeveloper(a)fgm.com</ddms:email>
    </ddms:Person>
  </ddms:creator>
  <ddms:security ICISM:ownerProducer="USA"
      ICISM:classification="U" ICISM:nonICmarkings="DIST_STMT_A" />
  <!-- Other DDMS elements may appear here. -->
</ddms:Resource>

You can see the DDMS site at https://metadata.dod.mil/.
From: Stefan Behnel on
jakecjacobson, 29.01.2010 18:25:
> I need to take a XML web resource and split it up into smaller XML
> files. I am able to retrieve the web resource but I can't find any
> good XML examples. I am just learning Python so forgive me if this
> question has been answered many times in the past.
>
> My resource is like:
>
> <document>
> ...
> ...
> </document>
> <document>
> ...
> ...
> </document>

Is this what you get as a document or is this just /contained/ in the document?

Note that XML does not allow more than one root element, so the above is
not well-formed XML. Each of the two <document>...</document> parts forms
an XML document by itself, though.
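
A small sketch of what the single-root rule means in practice, assuming
Python 3 and the standard xml.etree.ElementTree module. The sample string
and the <root> wrapper name are made up for illustration. Wrapping only
works if the data does not start with an XML declaration (<?xml ...?>),
since that would no longer be at the very beginning.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: two top-level elements, i.e. not a well-formed document.
raw = "<document>one</document>\n<document>two</document>"

# Parsing it directly fails because of the second root element.
try:
    ET.fromstring(raw)
    parsed_directly = True
except ET.ParseError:
    parsed_directly = False

# Wrapping the data in an artificial root element makes it parseable.
wrapped = ET.fromstring("<root>" + raw + "</root>")
documents = wrapped.findall("document")
```

Here parsed_directly ends up False, while documents holds one element per
<document> in the input.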


> So in this example, I would need to output 2 files with the contents
> of each file what is between the open and close document tag.

Are the two documents formatted as you show above? In that case, you can
simply iterate over the lines and cut the document when you see
"<document>". Or, if you are sure that "<document>" only appears as a
top-level element and not inside of the documents, you can search for
"<document>" in the content (a string, I guess) and split it there.

As was pointed out before, once you have these two documents, use the
xml.etree package to work with them.

Something like this might work:

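import urllib.request  # was urllib2 on Python 2
import xml.etree.ElementTree as ET

# url is the address of the web resource; UTF-8 is an assumption
data = urllib.request.urlopen(url).read().decode('utf-8')

for part in data.split('<document>'):
    if not part.strip():
        continue  # skip the empty piece before the first <document>
    document = ET.fromstring('<document>' + part)
    print(document.tag)
    # ... do other stuff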
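
To go one step further and actually write each piece to its own file,
something along these lines might do. It is only a sketch under a few
assumptions: Python 3, data that has no XML declaration at the top, and
"<document>" appearing only at the top level. The function name
write_documents, the <wrapper> element, and the document_N.xml file names
are invented for this example.

```python
import os
import xml.etree.ElementTree as ET


def write_documents(data, directory, tag="document"):
    """Write every top-level <tag> element in data to its own file.

    Returns the list of file paths that were written.
    """
    # Wrap the data in an artificial root so it parses as one document.
    root = ET.fromstring("<wrapper>" + data + "</wrapper>")
    paths = []
    for index, child in enumerate(root.findall(tag), start=1):
        # The tail is the text between this element and the next sibling
        # (usually just a newline); drop it so it is not written out.
        child.tail = None
        path = os.path.join(directory, "document_%d.xml" % index)
        ET.ElementTree(child).write(path, encoding="utf-8",
                                    xml_declaration=True)
        paths.append(path)
    return paths
```

Calling write_documents(data, "some_directory") with the downloaded string
would then produce one file per <document>, each of which is a complete
XML document on its own.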

Stefan
From: Sells, Fred on
Google is your friend. ElementTree is one of the better-documented
modules, IMHO, but there are many others that can do this.
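
As a rough illustration of ElementTree with namespaced data like the DDMS
sample posted earlier in the thread, here is a sketch assuming Python 3.
The <results> container element and the shortened sample are made up.
Only the namespace URI comes from the posted resource node. How the real
feed wraps its ddms:Resource nodes is not known from the thread.

```python
import xml.etree.ElementTree as ET

DDMS_NS = "https://metadata.dod.mil/mdr/ns/DDMS/1.4/"

# Keep the familiar "ddms:" prefix when elements are serialized again;
# without this ElementTree would generate prefixes such as "ns0:".
ET.register_namespace("ddms", DDMS_NS)

# Hypothetical container holding two heavily shortened resource nodes.
sample = (
    '<results xmlns:ddms="%s">'
    '<ddms:Resource><ddms:title>First</ddms:title></ddms:Resource>'
    '<ddms:Resource><ddms:title>Second</ddms:title></ddms:Resource>'
    '</results>' % DDMS_NS
)

root = ET.fromstring(sample)

# ElementTree addresses namespaced tags as "{namespace-uri}localname".
resources = root.findall("{%s}Resource" % DDMS_NS)
titles = [r.findtext("{%s}title" % DDMS_NS) for r in resources]

# Each resource serialized on its own, ready to be written to a file.
serialized = [ET.tostring(r, encoding="unicode") for r in resources]
```

Each string in serialized carries its own xmlns:ddms declaration, so it
can stand alone as a separate XML file.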

> -----Original Message-----
> From: python-list-bounces+frsells=adventistcare.org(a)python.org
> [mailto:python-list-bounces+frsells=adventistcare.org(a)python.org] On
> Behalf Of Stefan Behnel
> Sent: Friday, January 29, 2010 2:25 PM
> To: python-list(a)python.org
> Subject: Re: Processing XML File