Issue with xml iterparse [Python]

From: bfrederi on 3 Jun 2010 16:44

I am using lxml iterparse and running into a very obscure error. When
I run iterparse on a file, it will occasionally return an element
whose element.text is None even though the element clearly has text in it.

I copied and pasted the problem XML into a Python string, used StringIO
to create a file-like object out of it, and ran a test using iterparse
with the expected output, and it ran perfectly fine. So it only happens
when I run iterparse on the actual file.

So then I tried opening the file, reading the data, turning that data
into a file-like object using StringIO, then running iterparse on it,
and the same problem (element.text == None) occurred.

I even tried this:
f = codecs.open(abbyy_filename, 'r', encoding='utf-8')
file_data = f.read()
file_like_object = StringIO.StringIO(file_data)
for event, element in iterparse(file_like_object, events=("start", "end")):

And I got this Traceback:
Traceback (most recent call last):
  File "abbyyParser/parseAbbyy.py", line 391, in <module>
    extension=options.extension,
  File "abbyyParser/parseAbbyy.py", line 103, in __init__
    self.generate_output_files()
  File "abbyyParser/parseAbbyy.py", line 164, in generate_output_files
    AbbyyDocParse(abby_filename, self.extension, self.output_types)
  File "abbyyParser/parseAbbyy.py", line 239, in __init__
    self.parse_doc(abbyy_filename)
  File "abbyyParser/parseAbbyy.py", line 281, in parse_doc
    for event, element in iterparse(file_like_object, events=("start", "end")):
  File "iterparse.pxi", line 484, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:86333)
TypeError: reading file objects must return plain strings

If I do this:
file_data = f.read().encode("utf-8")

iterparse will run on it, but I still get element.text with a value
of None when I should not.

My XML file does have diacritics in it, but I've put the proper
encoding declaration at the head of the XML file (<?xml version="1.0"
encoding="UTF-8"?>). I've also tried using ElementTree's iterparse, and
I get the same problem even more often with the same files. Any idea
what the problem might be?

From: Chris Rebert on 3 Jun 2010 16:59

On Thu, Jun 3, 2010 at 1:44 PM, bfrederi <brfredericks(a)gmail.com> wrote:
> I am using lxml iterparse and running into a very obscure error. When
> I run iterparse on a file, it will occasionally return an element that
> has a element.text == None when the element clearly has text in it.
>
> I copy and pasted the problem xml into a python string, used StringIO
> to create a file-like object out of it, and ran a test using iterparse
> with expected output, and it ran perfectly fine. So it only happens
> when I try to run iterparse on the actual file.
>
> So then I tried opening the file, reading the data, turning that data
> into a file-like object using StringIO, then running iterparse on it,
> and the same problem (element.text == None) occurred.
>
> I even tried this:
> f = codecs.open(abbyy_filename, 'r', encoding='utf-8')
> file_data = f.read()
> file_like_object = StringIO.StringIO(file_data)
> for event, element in iterparse(file_like_object, events=("start", "end")):

IIRC, XML parsers operate on bytes directly (since they have to
determine the encoding themselves anyway), not pre-decoded Unicode
characters, so I think your manual UTF-8 decoding could be the
problem.
Have you tried simply:

f = open(abbyy_filename, 'rb')
for event, element in iterparse(f, events=("start", "end")):
    pass  #whatever

?

Apologies if you already have, but since you didn't include the
original, albeit probably trivial, error-causing code, this relatively
simple error couldn't be ruled out.
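
To illustrate the point about bytes, here is a minimal sketch of handing the parser an undecoded file. It is written for Python 3 and, as an assumption made so that it runs without extra installs, uses the standard library's xml.etree.ElementTree instead of lxml (lxml.etree.iterparse takes the same source and events arguments). The sample document and the tag name "w" are made up for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document containing a diacritic, stored as UTF-8 bytes.
xml_bytes = '<?xml version="1.0" encoding="UTF-8"?><doc><w>café</w></doc>'.encode("utf-8")

# Write the bytes to a temporary file so there is a real file to open.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(xml_bytes)
    path = tmp.name

try:
    # Open in binary mode: the parser reads the encoding declaration
    # and decodes the bytes itself.
    with open(path, "rb") as f:
        words = [el.text
                 for event, el in ET.iterparse(f, events=("end",))
                 if el.tag == "w"]
finally:
    os.remove(path)
```

The text is read on the "end" event only, so the element is complete by the time it is inspected.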

Cheers,
Chris
--
http://blog.rebertia.com

From: bfrederi on 3 Jun 2010 17:13

Sorry for not mentioning it, but I tried that as well and it failed.
Here is the relevant class. AbbyyLine and AbbyyWord just take the
element's text and write it to a file/file-like object. parse_doc is
where I use iterparse. The relevant part is very minimal and there is
a lot of fluff to ignore, so I didn't initially post it:

class AbbyyDocParse(object):
    """Takes an abbyy filename and parses the contents"""

    def __init__(self, abbyy_filename, extension=DEFAULT_ABBYY_EXT,
                 format_list=OUTPUT_TYPES, string_only=False):
        self.extension = extension
        self.format_list = format_list
        #Create the file handles for the output files
        self.create_filehandles(abbyy_filename, string_only)
        #Parse the document
        self.parse_doc(abbyy_filename)
        #Close the output filehandles
        self.close_filehandles(abbyy_filename, string_only)

    def create_filehandles(self, abbyy_filename, string_only):
        """Create output filehandles"""
        #if output goes to a file
        if not string_only:
            #Make sure the file is an abbyy file
            if not abbyy_filename.endswith(self.extension):
                raise ParserException("Bad abbyy filename given: %s"
                                      % (abbyy_filename))
            #get the base path and filename for output files
            filename = abbyy_filename.replace(self.extension, '')
        #Loop through the different formats
        for format_type in self.format_list:
            #if output goes to a file
            if not string_only:
                #Create output filename
                out_file = "%s%s" % (filename,
                                     OUTPUT_EXTENSIONS.get(format_type))
                #Opens the format type filehandle
                try:
                    setattr(self, "%s_handle" % (format_type),
                            open(out_file, 'w'))
                except:
                    raise IOError("Could not open file: %s" % (out_file))
            #if output goes to a string
            else:
                #Opens the format type StringIO
                try:
                    setattr(self, "%s_handle" % (format_type),
                            StringIO.StringIO())
                except:
                    #out_file is not defined in this branch, so report
                    #the format type instead
                    raise IOError("Could not open string output: %s"
                                  % (format_type))

    def parse_doc(self, abbyy_filename):
        """Parses the abbyy document"""
        #Write the first line of the xml doc, if specified
        if getattr(self, 'xml_handle', None):
            self.xml_handle.write('<?xml version="1.0" encoding="utf-8"?>\n')
        #Memory efficient iterparse opens file and loops through content
        for event, element in iterparse(abbyy_filename,
                                        events=("start", "end")):
            #ignore the namespace, if it has one
            if NAMESPACE_REGEX.search(element.tag, 0):
                element_tag = NAMESPACE_REGEX.search(element.tag,
                                                     0).group(1)
            else:
                element_tag = element.tag
            #if this is the page element
            if element_tag == 'page':
                self.write_page(event, element)
            #If at the beginning of the line
            elif element_tag == 'line' and event == 'start':
                #Create the line
                line = AbbyyLine(element)
                #Instantiate first word
                word = AbbyyWord(line)
            #If at the end of the line, and an output text file exists
            if element_tag == 'line' and event == 'end' and \
                    getattr(self, 'text_handle', None):
                #output line data to text file
                line.write_line(self.text_handle)
            #If at the end of the line, and an output xml file exists
            if element_tag == 'line' and event == 'end' and \
                    getattr(self, 'xml_handle', None):
                #output word data to xml file
                word.write_word(self.xml_handle)
            #if outputting to an xml file, create word data
            if getattr(self, 'xml_handle', None) and \
                    element_tag == 'charParams' and event == 'start':
                #Insert character into word
                word.insert_char(element, self.xml_handle)
            #if outputting to a text file, create line data
            if getattr(self, 'text_handle', None) and \
                    element_tag == 'charParams' and event == 'start':
                #Insert character into line
                line.insert_char(element)

    def write_page(self, event, element):
        """Parse the page contents"""
        #page open tag event
        if event == 'start':
            #Write page info to xml file
            if getattr(self, 'xml_handle', None):
                #Get the page info
                x_dim = element.get('width')
                y_dim = element.get('height')
                resolution = element.get('resolution')
                #Write the page info to the file
                self.xml_handle.write('<page>\n')
                self.xml_handle.write('<filename/>\n')
                self.xml_handle.write('<confidence/>\n')
                self.xml_handle.write("<xDim>%s</xDim>\n" % (x_dim))
                self.xml_handle.write("<yDim>%s</yDim>\n" % (y_dim))
                self.xml_handle.write("<resolution>%s</resolution>\n"
                                      % (resolution))
                self.xml_handle.write('<zone/>\n')
                self.xml_handle.write('<wordsboundingboxes>\n')
        #page close tag event
        elif event == 'end':
            #Write page info to xml file
            if getattr(self, 'xml_handle', None):
                #Write closing tags to file
                self.xml_handle.write('</wordsboundingboxes>\n')
                self.xml_handle.write('</page>')

    def write_line(self, event, element):
        """Parse the line contents"""
        #line open tag event
        if event == 'start':
            pass
        #line close tag event
        elif event == 'end':
            pass

    def write_word(self, event, element):
        """Parse the charParams contents"""
        pass

    def close_filehandles(self, abbyy_filename, string_only):
        """Close the open filehandles"""
        #if the files exist
        if not string_only:
            #Loop through the different formats
            for format_type in self.format_list:
                #Closes the format type filehandle
                try:
                    getattr(self, "%s_handle" % (format_type)).close()
                except:
                    raise IOError("Could not close format type: %s for file: %s"
                                  % (format_type, abbyy_filename))
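
As an aside on the namespace handling in parse_doc: NAMESPACE_REGEX is not shown in the post, so the following is only a sketch of one way the "ignore the namespace" step could be done without a regex. The helper name local_name is hypothetical, and it assumes tags arrive in the usual "{namespace-uri}name" form and are always strings (comment and processing-instruction nodes would need separate handling).

```python
def local_name(tag):
    """Return the tag without a leading '{namespace-uri}' prefix.

    Tags without a namespace are returned unchanged.
    """
    # Everything after the last '}' is the local name; if there is no '}',
    # rsplit returns the whole string as its only item.
    return tag.rsplit("}", 1)[-1]


# Example tags (the namespace URI here is made up for illustration).
namespaced = local_name("{http://example.com/ns}page")
plain = local_name("line")
```

This does the search once per element instead of twice, which is the main difference from the loop above.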

From: bfrederi on 4 Jun 2010 12:54

I think this is a bug with iterparse. I switched to using regular
parse for the parse_doc function, and it worked just fine:

def parse_doc(self, abbyy_filename):
    """Parses the abbyy document"""
    #Write the first line of the xml doc, if specified
    if getattr(self, 'xml_handle', None):
        self.xml_handle.write('<?xml version="1.0" encoding="utf-8"?>\n')
    #Try to open the abbyy file
    try:
        f = open(abbyy_filename, "r")
    #abbyy_filename is already an instance of a file-like object
    except:
        #parse the abbyy file
        tree = parse(abbyy_filename)
    #parse the open abbyy file
    else:
        tree = parse(f)
        f.close()
    root = tree.getroot()
    line = None
    for element in root.iter("*"):
        #ignore the namespace, if it has one
        if NAMESPACE_REGEX.search(element.tag, 0):
            element_tag = NAMESPACE_REGEX.search(element.tag,
                                                 0).group(1)
        else:
            element_tag = element.tag

        #if this is the page element
        if element_tag == 'page':
            self.write_page('start', element)
        #If at the beginning of the new line
        elif element_tag == 'line':
            #if a line already existed, and there is an output text file
            if line is not None:
                if getattr(self, 'text_handle', None):
                    #output line data to text file
                    line.write_line(self.text_handle)
                elif getattr(self, 'xml_handle', None):
                    #output line data to xml file
                    word.write_word(self.xml_handle)
            #Create the line
            line = AbbyyLine(element)
            #Instantiate first word
            word = AbbyyWord(line)

        #if outputting to an xml file, create word data
        if getattr(self, 'xml_handle', None) and element_tag == 'charParams':
            #Insert character into word
            word.insert_char(element, self.xml_handle)
        #if outputting to a text file, create line data
        if getattr(self, 'text_handle', None) and element_tag == 'charParams':
            #Insert character into line
            line.insert_char(element)
    #if a line already existed, and there is an output text file
    if line is not None:
        if getattr(self, 'text_handle', None):
            #output line data to text file
            line.write_line(self.text_handle)
        elif getattr(self, 'xml_handle', None):
            #output line data to xml file
            word.write_word(self.xml_handle)
    self.write_page('end', element)

From: Stefan Behnel on 13 Jun 2010 08:13

bfrederi, 03.06.2010 22:44:
> I am using lxml iterparse and running into a very obscure error. When
> I run iterparse on a file, it will occasionally return an element that
> has a element.text == None when the element clearly has text in it.

I assume you are referring to the 'start' event here, right? Tag content is
not guaranteed to be parsed at this point, so containing text may or may
not be available. Only the 'end' event guarantees that it has been parsed
(well, or the 'start' event of a child element).
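
The difference between the two events can be made visible with a small sketch. This is Python 3 and uses the standard library's XMLPullParser rather than lxml (an assumption, so that it runs without extra installs); feeding the document in two chunks stands in for the parser reaching the end of a read buffer in the middle of an element. The tag names are made up.

```python
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(events=("start", "end"))

# First chunk ends right after the opening tag, before any text arrives.
parser.feed("<root><word>")
text_at_start = {el.tag: el.text
                 for event, el in parser.read_events()
                 if event == "start"}

# Second chunk delivers the text and the closing tags.
parser.feed("hello</word></root>")
text_at_end = {el.tag: el.text
               for event, el in parser.read_events()
               if event == "end"}
parser.close()
```

With a small test string the whole document usually fits in one buffer, which would explain why the copy-and-paste test appeared to work while the real file did not.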

> I copy and pasted the problem xml into a python string, used StringIO
> to create a file-like object out of it

Note that the right thing to use in Py2.6 and later is "BytesIO".

Stefan
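
Putting both remarks together, the iterparse loop could be restructured roughly as follows. This is a sketch under assumptions, not the poster's actual fix: it is Python 3, uses the standard library's ElementTree in place of lxml, and the sample document only imitates the page/line/charParams nesting mentioned in the thread. The function name iter_lines is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample in the shape described in the thread; \xc3\xa9 is UTF-8 for é.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<page width="100" height="50">
  <line><charParams>H</charParams><charParams>i</charParams></line>
  <line><charParams>\xc3\xa9</charParams></line>
</page>"""


def iter_lines(source):
    """Yield the text of each <line>, reading element text on 'end' only."""
    chars = []
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Only use 'start' for bookkeeping; the text may not be parsed yet.
            if element.tag == "line":
                chars = []
            continue
        if element.tag == "charParams":
            # On 'end' the element is complete, so .text is reliable.
            chars.append(element.text or "")
            element.clear()  # release content to keep memory use low
        elif element.tag == "line":
            yield "".join(chars)
            element.clear()


# BytesIO gives the parser undecoded bytes, as a file opened in 'rb' would.
lines = list(iter_lines(io.BytesIO(SAMPLE)))
```

Attributes such as the page width are already available on "start", so only the text handling needs to move to "end".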
