From: C. Benson Manica on
I have the following simple script running on 2.5.2 on a machine where
the default character encoding is "ascii":

#!/usr/bin/env python
#coding: utf-8

import xml.dom.minidom
import codecs

str=u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
doc=xml.dom.minidom.parseString( str )
xml=doc.toxml( encoding="utf-8" )
file=codecs.open( "foo.xml", "w", "utf-8" )
file.write( xml )
file.close()

I've specified utf-8 every place I can find that the documentation
allows me to, and yet this doesn't even come close to working without
UnicodeEncodeErrors. What on Earth do I have to do to please the
character encoding gods?
From: Peter Otten on
C. Benson Manica wrote:

> I have the following simple script running on 2.5.2 on a machine where
> the default character encoding is "ascii":
>
> #!/usr/bin/env python
> #coding: utf-8
>
> import xml.dom.minidom
> import codecs
>
> str=u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
> doc=xml.dom.minidom.parseString( str )
> xml=doc.toxml( encoding="utf-8" )
> file=codecs.open( "foo.xml", "w", "utf-8" )
> file.write( xml )
> file.close()
>
> I've specified utf-8 every place I can find that the documentation
> allows me to, and yet this doesn't even come close to working without
> UnicodeEncodeErrors. What on Earth do I have to do to please the
> character encoding gods?

Verify every step as you proceed?

>>> import xml.dom.minidom
>>> s = u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
>>> doc = xml.dom.minidom.parseString(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/xml/dom/minidom.py", line 1925, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python2.5/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/usr/lib/python2.5/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 62: ordinal not in range(128)

It seems that parseString() doesn't like unicode -- let's try a byte string
then:

>>> doc = xml.dom.minidom.parseString(s.encode("utf-8"))
>>> xml = doc.toxml(encoding="utf-8")

No complaints -- let's have a look at the result:

>>> xml
'<?xml version="1.0" encoding="utf-8"?><elements><elem attrib="\xc3\xb3"/></elements>'

That's a byte string, no need for codecs.open() then:

>>> f = open("foo.xml", "w")
>>> f.write(xml)
>>> f.close()
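
Put together, the interactive steps above can be sketched as one script. This is only a minimal sketch: the names source, data and path are illustrative, the temporary directory is an assumption so the example does not overwrite anything, and "\xf3" is simply the escape for "ó". The point is that bytes go into parseString(), bytes come out of toxml(encoding=...), and the file is written in binary mode.

```python
import os
import tempfile
import xml.dom.minidom

# Text (unicode) form of the document from the original post; \xf3 is "ó".
source = u'<?xml version="1.0" encoding="utf-8"?><elements><elem attrib="\xf3"/></elements>'

# Hand the parser UTF-8 bytes, matching the encoding named in the XML declaration.
doc = xml.dom.minidom.parseString(source.encode("utf-8"))

# With an encoding argument, toxml() returns an already-encoded byte string.
data = doc.toxml(encoding="utf-8")

# Write the bytes unchanged in binary mode; no codecs.open() layer is involved.
# The temporary directory is an assumption made for this sketch.
path = os.path.join(tempfile.mkdtemp(), "foo.xml")
f = open(path, "wb")
try:
    f.write(data)
finally:
    f.close()
```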

Peter
From: C. Benson Manica on
On Apr 21, 1:58 pm, Peter Otten <__pete...(a)web.de> wrote:
> C. Benson Manica wrote:
>> (snip)
>
> It seems that parseString() doesn't like unicode

Yes, I noticed that, and I already tried...

> -- let's try a byte string
> then:
>
> >>> doc = xml.dom.minidom.parseString(s.encode("utf-8"))
> >>> xml = doc.toxml(encoding="utf-8")

...except that it didn't work:

  File "./demo.py", line 8, in <module>
    doc=xml.dom.minidom.parseString( str.encode("utf-8") )
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)
From: Peter Otten on
C. Benson Manica wrote:

> On Apr 21, 1:58 pm, Peter Otten <__pete...(a)web.de> wrote:
>> C. Benson Manica wrote:
>>> (snip)
>>
>> It seems that parseString() doesn't like unicode
>
> Yes, I noticed that, and I already tried...
>
>> -- let's try a byte string
>> then:
>>
>> >>> doc = xml.dom.minidom.parseString(s.encode("utf-8"))
>> >>> xml = doc.toxml(encoding="utf-8")
>
> ...except that it didn't work:
>
>   File "./demo.py", line 8, in <module>
>     doc=xml.dom.minidom.parseString( str.encode("utf-8") )
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)

Are you sure that your script has

str = u"..."

like in your post and not just

str = "..."

?
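
The reason for asking: in Python 2, calling .encode("utf-8") on a plain byte string makes the interpreter decode it to unicode first, using the default codec ("ascii" here), and that hidden decode step is what raises a UnicodeDecodeError at byte 0xc3. A small sketch that performs the hidden step explicitly (the names raw, failed and text are illustrative only):

```python
# The UTF-8 encoding of u"\xf3" ("ó") is the two bytes 0xc3 0xb3.
raw = u"\xf3".encode("utf-8")

# Decoding those bytes as ASCII fails, because 0xc3 is outside range(128).
# Python 2 performs this decode implicitly when .encode() is called on a byte string.
try:
    raw.decode("ascii")
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding with the codec that actually produced the bytes succeeds.
text = raw.decode("utf-8")
```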

Peter


From: C. Benson Manica on
On Apr 21, 2:25 pm, Peter Otten <__pete...(a)web.de> wrote:

> Are you sure that your script has
>
> str = u"..."
>
> like in your post and not just
>
> str = "..."

No :-)

str=u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
doc=xml.dom.minidom.parseString( str.encode("utf-8") )
xml=doc.toxml( encoding="utf-8")
file=codecs.open( "foo.xml", "w", "utf-8" )
file.write( xml )
file.close()

fails:

  File "./demo.py", line 12, in <module>
    file.write( xml )
  File "/usr/lib/python2.5/codecs.py", line 638, in write
    return self.writer.write(data)
  File "/usr/lib/python2.5/codecs.py", line 303, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)

but dropping the encoding argument to doc.toxml() seems to finally
work. I'd be curious to know why the code you posted (that worked for
you) didn't for me, but at this point I'm just happy with something
functional. Thank you very kindly!
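
A likely explanation for the difference: the code that worked wrote the result of toxml(encoding="utf-8") with a plain open(), while the failing script passed that same byte string to a file from codecs.open(). A codecs.open() file expects unicode text and encodes it itself; given a byte string, Python 2 first decodes it with the default "ascii" codec, which fails at byte 0xc3. Dropping the encoding argument makes toxml() return unicode, so the two sides match again. A minimal sketch of the two consistent pairings (the file names, the temporary directory and the variable names are assumptions for illustration):

```python
import codecs
import os
import tempfile
import xml.dom.minidom

source = u'<?xml version="1.0" encoding="utf-8"?><elements><elem attrib="\xf3"/></elements>'
doc = xml.dom.minidom.parseString(source.encode("utf-8"))
out_dir = tempfile.mkdtemp()  # assumed scratch location for this sketch

# Pairing 1: already-encoded bytes go to a plain binary file.
encoded = doc.toxml(encoding="utf-8")
bytes_path = os.path.join(out_dir, "bytes.xml")
f = open(bytes_path, "wb")
f.write(encoded)
f.close()

# Pairing 2: unicode text goes to a codecs.open() file, which does the encoding.
text = doc.toxml()
text_path = os.path.join(out_dir, "text.xml")
f = codecs.open(text_path, "w", "utf-8")
f.write(text)
f.close()
```

Note that without the encoding argument the XML declaration written by toxml() carries no encoding attribute; since UTF-8 is the XML default, the second file should still be read correctly by a conforming parser.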