From: Diez B. Roggisch on
StarWing schrieb:
> On Oct 18, 12:50 am, "Diez B. Roggisch" <de...(a)nospam.web.de> wrote:
>> StarWing schrieb:
>>
>>
>>
>>> On Oct 17, 9:54 pm, Arian Kuschki <arian.kusc...(a)googlemail.com>
>>> wrote:
>>>> Hi all
>>>> this has been bugging me for a long time and I do not seem to be able to
>>>> understand what to do. I always have problems when dealing with input text that
>>>> contains umlauts. Consider the following:
>>>> In [1]: import urllib
>>>> In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
>>>> In [3]: xml = f.read()
>>>> In [4]: f.close()
>>>> In [5]: print xml
>>>> ------> print(xml)
>>>> <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
>>>> tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0"><forecast_information><cit
>>>> y data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6
>>>> data=""/><longitude_e6 data=""/><forecast_date
>>>> data="2009-10-17"/><current_date_time data="2009-10
>>>> -17 14:20:00 +0000"/><unit_system
>>>> data="SI"/></forecast_information><current_conditions><condition data="Meistens
>>>> bew kt"/><temp_f data="43"/><temp_c data="6"/><h
>>>> umidity data="Feuchtigkeit: 87 %"/><icon
>>>> data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition data="Wind: W mit
>>>> Windgeschwindigkeiten von 13 km/h"/></curr
>>>> ent_conditions><forecast_conditions><day_of_week data="Sa."/><low
>>>> data="1"/><high data="7"/><icon
>>>> data="/ig/images/weather/chance_of_rain.gif"/><condition data="V
>>>> ereinzelt Regen"/></forecast_conditions><forecast_conditions><day_of_week
>>>> data="So."/><low data="-1"/><high data="8"/><icon
>>>> data="/ig/images/weather/chance_of_sno
>>>> w.gif"/><condition data="Vereinzelt
>>>> Schnee"/></forecast_conditions><forecast_conditions><day_of_week
>>>> data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/i
>>>> mages/weather/mostly_sunny.gif"/><condition data="Teils
>>>> sonnig"/></forecast_conditions><forecast_conditions><day_of_week
>>>> data="Di."/><low data="0"/><high data="8"
>>>> /><icon data="/ig/images/weather/sunny.gif"/><condition
>>>> data="Klar"/></forecast_conditions></weather></xml_api_reply>
>>>> As you can see the umlauts in the XML are not displayed properly. When I want
>>>> to process this text (for example with xml.sax), I get error messages because
>>>> the parser can't read this.
>>>> I've tried to read up on this and there is a lot of information on the web, but
>>>> nothing seems to work for me. For example setting the coding to UTF like this:
>>>> # -*- coding: utf-8 -*- or using the decode() string method.
>>>> I always have this kind of problem when input contains umlauts, not just in
>>>> this case. My locale (on Ubuntu) is en_GB.UTF-8.
>>>> Cheers
>>>> Arian
>>> try this?
>>> # vim: set fencoding=utf-8:
>>> import urllib
>>> import xml.sax as sax, xml.sax.handler as handler
>>> f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
>>> xml = f.read()
>>> xml = xml.decode("cp1252")
>>> f.close()
>>> class my_handler(handler.ContentHandler):
>>>     def startElement(self, name, attrs):
>>>         print "begin:", name, attrs
>>>     def endElement(self, name):
>>>         print "end:", name
>>> sax.parseString(xml, my_handler())
>> This is wrong. XML is a *byte*-based format that explicitly declares
>> its encoding. So decoding a byte-string to a unicode-object and then
>> passing it to a parser stops working the moment
>>
>> - the data contains characters outside your default system encoding
>>   (usually ascii)
>> - the system encoding and the declared encoding differ
>>
>> Besides, I don't see where the whole SAX-stuff is supposed to do
>> anything the direct print and the decode() don't do - smells like
>> cargo-cult to me.
>>
>> Diez
>
> Yes, XML is a *byte*-based format, and so are utf-8 and the code pages
> (cp936, cp1252, etc.). So XML usually declares its encoding at the head,
> but that didn't work here.
>
> In Python 2.6, sys.getdefaultencoding() returns 'ascii', I can't use
> sys.setdefaultencoding(), and f.read() returns a str. So it must be an
> undecoded, byte-based format (i.e. raw XML data), and decoding it with
> the right code page is safe (notice the webpage is google.de).
>
> In Python 3.1, read() returns a bytes object, so we *must* decode it,
> or we can't pass it into a parser.

You didn't get my point. An XML parser only *takes* a byte-string.
Decoding is its business. So your last sentence above is wrong.

Because regardless of the Python version, if you feed the parser a
unicode-object, Python will first encode that to a byte-string, possibly
giving a UnicodeError (maybe this automatic conversion is gone in Py3K,
but then you get a type-error instead).

So to make the above work (if one wants to parse the xml), the proper
thing to do would be

xml = xml.decode("cp1252").encode("utf-8")

and then feed that. Of course the really good thing would be to fix the
webpage, but that's beyond our capabilities I fear...

Diez
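
A minimal sketch of the transcoding step above, written for Python 3 (the thread itself uses Python 2). The sample bytes and the names `AttrCollector` and `parse_values` are made up for illustration. The sample stands in for the server response: Latin-1 bytes with no encoding declaration, so the parser assumes UTF-8.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical stand-in for the server response: the o-umlaut is the single
# Latin-1 byte 0xF6, and the XML declaration names no encoding.
raw = b'<?xml version="1.0"?><condition data="Meistens bew\xf6lkt"/>'


class AttrCollector(ContentHandler):
    """Collect every attribute value the parser reports."""

    def __init__(self):
        super().__init__()
        self.values = []

    def startElement(self, name, attrs):
        self.values.extend(attrs.values())


def parse_values(data):
    """Parse a byte-string and return the attribute values found in it."""
    handler = AttrCollector()
    xml.sax.parseString(data, handler)
    return handler.values


# The raw bytes are not valid UTF-8, so the parser rejects them.
try:
    parse_values(raw)
    failed = False
except xml.sax.SAXParseException:
    failed = True

# Decode with the real encoding, re-encode as UTF-8, then hand over bytes.
fixed = raw.decode("cp1252").encode("utf-8")
values = parse_values(fixed)
```

The parser still receives a byte-string. The only change is that the bytes now match the encoding the document implicitly claims.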
From: I V on
On Sat, 17 Oct 2009 18:54:10 +0200, Diez B. Roggisch wrote:

> This is weird. I looked at the site in Firefox - and it was displayed
> correctly, including umlauts. Bringing up the info-dialog claims the
> page is UTF-8, the XML itself says so as well (implicit, through the
> missing declaration of an encoding) - but it clearly is *not* utf-8.

The headers correctly identify it as ISO-8859-1, which overrides the
implicit specification of UTF-8. I'm not sure why Firefox is reporting it
as UTF-8 (it does that for me, too); I can see the umlauts, so it's
clearly processing it as ISO-8859-1.
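
A small sketch of how the charset from a Content-Type header could drive the decoding, assuming Python 3 and the standard library's `email.message.Message`. The function names and the sample body are hypothetical. The fallback to UTF-8 mirrors the XML default for a document without an encoding declaration.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


def to_utf8(body, header_value):
    """Decode bytes with the header's charset and re-encode them as UTF-8."""
    charset = charset_from_content_type(header_value)
    return body.decode(charset).encode("utf-8")


# Hypothetical Latin-1 body, paired with the header value reported above.
latin1_body = b'<condition data="bew\xf6lkt"/>'
utf8_body = to_utf8(latin1_body, "text/xml; charset=ISO-8859-1")
```

This only helps when the header is truthful. If the header and the bytes disagree, decoding will either fail or produce the wrong characters.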
From: Arian Kuschki on
Whoa, that was quick! Thanks for all the answers, I'll try to recapitulate.

>What does this show you in your interactive interpreter?
>
>>>> print "\xc3\xb6"

>
>For me, it's o-umlaut, ö. This is because the above bytes are the
>sequence for ö in utf-8.
>
>If this shows something else, you need to adjust your terminal settings.

For me it also prints the correct o-umlaut (ö), so that was not the problem.


All of the below result in xml that shows all umlauts correctly when printed:

xml.decode("cp1252")
xml.decode("cp1252").encode("utf-8")
xml.decode("iso-8859-1")
xml.decode("iso-8859-1").encode("utf-8")

But when I want to parse the xml then, it only works if I
do both decode and encode. If I only decode, I get the following error:
SAXParseException: <unknown>:1:1: not well-formed (invalid token)

Do I understand correctly that since the encoding was not specified in the xml
response, it should have been utf-8 by default? And that if it had indeed been
utf-8, I would not have had the encoding problem in the first place?
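
A byte-level sketch of the question above, assuming Python 3. It shows that the o-umlaut is one byte in ISO-8859-1 and cp1252 but two bytes in UTF-8, and that the one-byte form is not valid UTF-8. The sample word is taken from the garbled "bew kt" in the output.

```python
# The same character in the two encodings discussed in this thread.
latin1 = "\u00f6".encode("iso-8859-1")  # o-umlaut as a single byte
utf8 = "\u00f6".encode("utf-8")         # o-umlaut as two bytes

# For this byte, cp1252 and ISO-8859-1 agree, which is why both decode
# calls listed above give the same result.
same_in_cp1252 = b"\xf6".decode("cp1252") == b"\xf6".decode("iso-8859-1")

# Reading the Latin-1 bytes as UTF-8 fails, as it does inside the parser.
try:
    b"bew\xf6lkt".decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```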

Anyway, thanks everybody, this has helped me a lot.

Arian


From: Arian Kuschki on
I just checked and I see the following in the headers:
Content-Type text/xml; charset=UTF-8

Where does it say ISO-8859-1?

From: I V on
On Sat, 17 Oct 2009 21:24:59 +0330, Arian Kuschki wrote:
> I just checked and I see the following in the headers: Content-Type
> text/xml; charset=UTF-8
>
> Where does it say ISO-8859-1?

In the headers returned via urllib (and via wget). But checking in
Firefox, it does indeed specify UTF-8 in the content type. Using wget,
but specifying the same User-Agent header that Firefox uses, I get the
same UTF-8 Content-Type that I see in Firefox. How bizarre.
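
A rough sketch of how the comparison above could be reproduced with Python 3's `urllib.request`. The function names are hypothetical, and the User-Agent strings in the usage note are placeholders rather than the exact values any browser sends. The URL from this thread may no longer respond.

```python
import urllib.request


def build_request(url, user_agent):
    """Create a request that announces the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_declared_charset(url, user_agent):
    """Return the charset named in the response's Content-Type header.

    Returns None when the header carries no charset parameter.
    """
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.headers.get_content_charset()
```

Calling `fetch_declared_charset(url, "Python-urllib/3")` and `fetch_declared_charset(url, "Mozilla/5.0")` against the same URL would show whether the server varies its declared charset by client.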