From: Diez B. Roggisch on
StarWing wrote:
> On Oct 17, 9:54 pm, Arian Kuschki <arian.kusc...(a)googlemail.com>
> wrote:
>> Hi all
>>
>> this has been bugging me for a long time and I do not seem to be able to
>> understand what to do. I always have problems when dealing with input text that
>> contains umlauts. Consider the following:
>>
>> In [1]: import urllib
>>
>> In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
>>
>> In [3]: xml = f.read()
>>
>> In [4]: f.close()
>>
>> In [5]: print xml
>> ------> print(xml)
>> <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
>> tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0"><forecast_information><cit
>>
>> y data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6
>> data=""/><longitude_e6 data=""/><forecast_date
>> data="2009-10-17"/><current_date_time data="2009-10
>> -17 14:20:00 +0000"/><unit_system
>> data="SI"/></forecast_information><current_conditions><condition data="Meistens
>> bew kt"/><temp_f data="43"/><temp_c data="6"/><h
>> umidity data="Feuchtigkeit: 87 %"/><icon
>> data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition data="Wind: W mit
>> Windgeschwindigkeiten von 13 km/h"/></curr
>> ent_conditions><forecast_conditions><day_of_week data="Sa."/><low
>> data="1"/><high data="7"/><icon
>> data="/ig/images/weather/chance_of_rain.gif"/><condition data="V
>> ereinzelt Regen"/></forecast_conditions><forecast_conditions><day_of_week
>> data="So."/><low data="-1"/><high data="8"/><icon
>> data="/ig/images/weather/chance_of_sno
>> w.gif"/><condition data="Vereinzelt
>> Schnee"/></forecast_conditions><forecast_conditions><day_of_week
>> data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/i
>> mages/weather/mostly_sunny.gif"/><condition data="Teils
>> sonnig"/></forecast_conditions><forecast_conditions><day_of_week
>> data="Di."/><low data="0"/><high data="8"
>> /><icon data="/ig/images/weather/sunny.gif"/><condition
>> data="Klar"/></forecast_conditions></weather></xml_api_reply>
>>
>> As you can see the umlauts in the XML are not displayed properly. When I want
>> to process this text (for example with xml.sax), I get error messages because
>> the parser can't read this.
>>
>> I've tried to read up on this and there is a lot of information on the web, but
>> nothing seems to work for me. For example setting the coding to UTF like this:
>> # -*- coding: utf-8 -*- or using the decode() string method.
>>
>> I always have this kind of problem when input contains umlauts, not just in
>> this case. My locale (on Ubuntu) is en_GB.UTF-8.
>>
>> Cheers
>> Arian
>
> try this?
>
> # vim: set fencoding=utf-8:
> import urllib
> import xml.sax as sax, xml.sax.handler as handler
>
> f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
> xml = f.read()
> xml = xml.decode("cp1252")
> f.close()
>
> class my_handler(handler.ContentHandler):
>     def startElement(self, name, attrs):
>         print "begin:", name, attrs
>
>     def endElement(self, name):
>         print "end:", name
>
> sax.parseString(xml, my_handler())

This is wrong. XML is a *byte*-based format that explicitly declares its
encoding. So decoding a byte-string to a unicode-object and then
passing it to a parser breaks the very moment you have data that

- is outside your default-system-encoding (usually ascii)
- has a declared encoding that differs from the system-encoding

Besides, I don't see what the whole SAX-stuff is supposed to do
that the direct print and the decode() don't do - smells like
cargo-cult to me.
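
To illustrate the byte-based point, here is a minimal sketch. It assumes Python 3 (the thread itself uses Python 2), and the sample document, `ConditionHandler` and `parse_conditions` are made up for illustration, not taken from the real feed. The raw bytes go to the parser unchanged, and the parser decodes them according to the encoding named in the XML declaration.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical sample (not the real feed): Latin-1 bytes whose XML
# declaration names the encoding explicitly.
RAW = (b'<?xml version="1.0" encoding="ISO-8859-1"?>'
       b'<condition data="Meistens bew\xf6lkt"/>')


class ConditionHandler(ContentHandler):
    """Collect the data attribute of every <condition> element."""

    def __init__(self):
        super().__init__()
        self.conditions = []

    def startElement(self, name, attrs):
        if name == "condition":
            self.conditions.append(attrs.get("data"))


def parse_conditions(raw_bytes):
    # The bytes are handed over untouched; the parser reads the
    # declaration and does the decoding itself.
    handler = ConditionHandler()
    xml.sax.parseString(raw_bytes, handler)
    return handler.conditions
```

Calling `parse_conditions(RAW)` should hand back the attribute value as text with the umlaut intact, without any decode() call in user code.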

Diez
From: Diez B. Roggisch on
MRAB wrote:
> Arian Kuschki wrote:
>> Hi all
>>
>> this has been bugging me for a long time and I do not seem to be able
>> to understand what to do. I always have problems when dealing with input
>> text that contains umlauts. Consider the following:
>>
>> In [1]: import urllib
>>
>> In [2]: f =
>> urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
>>
>> In [3]: xml = f.read()
>>
>> In [4]: f.close()
>>
>> In [5]: print xml
>> ------> print(xml)
>> <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
>> tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0"
>>> <forecast_information><cit
>> y data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6
>> data=""/><longitude_e6 data=""/><forecast_date
>> data="2009-10-17"/><current_date_time data="2009-10
>> -17 14:20:00 +0000"/><unit_system
>> data="SI"/></forecast_information><current_conditions><condition
>> data="Meistens bew�kt"/><temp_f data="43"/><temp_c data="6"/><h
>> umidity data="Feuchtigkeit: 87�%"/><icon
>> data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition
>> data="Wind: W mit Windgeschwindigkeiten von 13 km/h"/></curr
>> ent_conditions><forecast_conditions><day_of_week data="Sa."/><low
>> data="1"/><high data="7"/><icon
>> data="/ig/images/weather/chance_of_rain.gif"/><condition data="V
>> ereinzelt
>> Regen"/></forecast_conditions><forecast_conditions><day_of_week
>> data="So."/><low data="-1"/><high data="8"/><icon
>> data="/ig/images/weather/chance_of_sno
>> w.gif"/><condition data="Vereinzelt
>> Schnee"/></forecast_conditions><forecast_conditions><day_of_week
>> data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/i
>> mages/weather/mostly_sunny.gif"/><condition data="Teils
>> sonnig"/></forecast_conditions><forecast_conditions><day_of_week
>> data="Di."/><low data="0"/><high data="8"
>> /><icon data="/ig/images/weather/sunny.gif"/><condition
>> data="Klar"/></forecast_conditions></weather></xml_api_reply>
>>
>> As you can see the umlauts in the XML are not displayed properly. When
>> I want to process this text (for example with xml.sax), I get error
>> messages because the parser can't read this.
>>
>> I've tried to read up on this and there is a lot of information on the
>> web, but nothing seems to work for me. For example setting the coding
>> to UTF like this: # -*- coding: utf-8 -*- or using the decode() string
>> method.
>>
>> I always have this kind of problem when input contains umlauts, not
>> just in this case. My locale (on Ubuntu) is en_GB.UTF-8.
>>
> The string you received from the website is a bytestring and you're just
> printing it to your console, which is configured for UTF-8. However, the
> bytestring isn't valid UTF-8, so the console is replacing the invalid
> parts with the funny characters.

This is weird. I looked at the site in Firefox - and it was displayed
correctly, including umlauts. The page-info dialog claims the
page is UTF-8, and the XML itself says so as well (implicitly, through the
missing declaration of an encoding) - but it clearly is *not* utf-8.
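
A quick way to check such a claim is to attempt a strict UTF-8 decode. This is only a sketch, assuming Python 3; `SAMPLE` is a hypothetical stand-in for the bytes the server sent (a lone 0xF6 byte where the umlaut should be), and `is_valid_utf8` is a helper name invented here.

```python
# Hypothetical stand-in for the response body: no encoding is declared,
# so an XML parser would assume UTF-8, but the umlaut is one Latin-1 byte.
SAMPLE = b'<?xml version="1.0"?><condition data="Meistens bew\xf6lkt"/>'


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same text, correctly encoded as UTF-8, for comparison.
GOOD = "Meistens bew\u00f6lkt".encode("utf-8")

print(is_valid_utf8(SAMPLE))  # False
print(is_valid_utf8(GOOD))    # True
```

A failed check only says the data is not UTF-8. It does not say which encoding was actually used, so ISO-8859-1 or cp1252 remains a guess.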

One would expect google to be better at this...

Diez
From: StarWing on
On Oct 18, 12:14 am, MRAB <pyt...(a)mrabarnett.plus.com> wrote:
> Arian Kuschki wrote:
> > Hi all
>
> > this has been bugging me for a long time and I do not seem to be able to
> > understand what to do. I always have problems when dealing with input text that
> > contains umlauts. Consider the following:
>
> > In [1]: import urllib
>
> > In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
>
> > In [3]: xml = f.read()
>
> > In [4]: f.close()
>
> > In [5]: print xml
> > ------> print(xml)
> > <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
> > tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0"
> >> <forecast_information><cit
> > y data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6
> > data=""/><longitude_e6 data=""/><forecast_date
> > data="2009-10-17"/><current_date_time data="2009-10
> > -17 14:20:00 +0000"/><unit_system
> > data="SI"/></forecast_information><current_conditions><condition data="Meistens
> > bew kt"/><temp_f data="43"/><temp_c data="6"/><h
> > umidity data="Feuchtigkeit: 87 %"/><icon
> > data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition data="Wind: W mit
> > Windgeschwindigkeiten von 13 km/h"/></curr
> > ent_conditions><forecast_conditions><day_of_week data="Sa."/><low
> > data="1"/><high data="7"/><icon
> > data="/ig/images/weather/chance_of_rain.gif"/><condition data="V
> > ereinzelt Regen"/></forecast_conditions><forecast_conditions><day_of_week
> > data="So."/><low data="-1"/><high data="8"/><icon
> > data="/ig/images/weather/chance_of_sno
> > w.gif"/><condition data="Vereinzelt
> > Schnee"/></forecast_conditions><forecast_conditions><day_of_week
> > data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/i
> > mages/weather/mostly_sunny.gif"/><condition data="Teils
> > sonnig"/></forecast_conditions><forecast_conditions><day_of_week
> > data="Di."/><low data="0"/><high data="8"
> > /><icon data="/ig/images/weather/sunny.gif"/><condition
> > data="Klar"/></forecast_conditions></weather></xml_api_reply>
>
> > As you can see the umlauts in the XML are not displayed properly. When I want
> > to process this text (for example with xml.sax), I get error messages because
> > the parser can't read this.
>
> > I've tried to read up on this and there is a lot of information on the web, but
> > nothing seems to work for me. For example setting the coding to UTF like this:
> > # -*- coding: utf-8 -*- or using the decode() string method.
>
> > I always have this kind of problem when input contains umlauts, not just in
> > this case. My locale (on Ubuntu) is en_GB.UTF-8.
>
> The string you received from the website is a bytestring and you're just
> printing it to your console, which is configured for UTF-8. However, the
> bytestring isn't valid UTF-8, so the console is replacing the invalid
> parts with the funny characters.
>
> You should decode the bytestring to Unicode and then re-encode it to
> UTF-8. I don't know what encoding the website is actually using; here
> I'm assuming ISO-8859-1:
>
> print xml.decode("iso-8859-1").encode("utf-8")

In 2.6, str.decode returns unicode, so you can print it directly.
In 3.1, bytes.decode returns str, so you can also print it directly.

So just decode("cp1252"), that's enough.
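
A small sketch of the two variants discussed here, assuming Python 3 and assuming the data really is cp1252 (nobody in the thread has confirmed the server's encoding). `RAW`, `to_text` and `to_utf8_bytes` are names made up for this example.

```python
# Hypothetical cp1252 bytes, as they might come back from f.read().
RAW = b"Meistens bew\xf6lkt"


def to_text(raw, encoding="cp1252"):
    """Decode raw bytes to text; the result can be passed to print()."""
    return raw.decode(encoding)


def to_utf8_bytes(raw, encoding="cp1252"):
    """Transcode to UTF-8 bytes, like decode("iso-8859-1").encode("utf-8")."""
    return raw.decode(encoding).encode("utf-8")
```

`to_text(RAW)` is what you would print. `to_utf8_bytes(RAW)` is what you would write to a file or pipe that expects UTF-8.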
From: StarWing on
On Oct 18, 12:50 am, "Diez B. Roggisch" <de...(a)nospam.web.de> wrote:
> StarWing wrote:
>
>
>
> > On Oct 17, 9:54 pm, Arian Kuschki <arian.kusc...(a)googlemail.com>
> > wrote:
> >> Hi all
>
> >> this has been bugging me for a long time and I do not seem to be able to
> >> understand what to do. I always have problems when dealing with input text that
> >> contains umlauts. Consider the following:
>
> >> In [1]: import urllib
>
> >> In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
>
> >> In [3]: xml = f.read()
>
> >> In [4]: f.close()
>
> >> In [5]: print xml
> >> ------> print(xml)
> >> <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
> >> tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0"><forecast_information><cit
>
> >> y data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6
> >> data=""/><longitude_e6 data=""/><forecast_date
> >> data="2009-10-17"/><current_date_time data="2009-10
> >> -17 14:20:00 +0000"/><unit_system
> >> data="SI"/></forecast_information><current_conditions><condition data="Meistens
> >> bew kt"/><temp_f data="43"/><temp_c data="6"/><h
> >> umidity data="Feuchtigkeit: 87 %"/><icon
> >> data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition data="Wind: W mit
> >> Windgeschwindigkeiten von 13 km/h"/></curr
> >> ent_conditions><forecast_conditions><day_of_week data="Sa."/><low
> >> data="1"/><high data="7"/><icon
> >> data="/ig/images/weather/chance_of_rain.gif"/><condition data="V
> >> ereinzelt Regen"/></forecast_conditions><forecast_conditions><day_of_week
> >> data="So."/><low data="-1"/><high data="8"/><icon
> >> data="/ig/images/weather/chance_of_sno
> >> w.gif"/><condition data="Vereinzelt
> >> Schnee"/></forecast_conditions><forecast_conditions><day_of_week
> >> data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/i
> >> mages/weather/mostly_sunny.gif"/><condition data="Teils
> >> sonnig"/></forecast_conditions><forecast_conditions><day_of_week
> >> data="Di."/><low data="0"/><high data="8"
> >> /><icon data="/ig/images/weather/sunny.gif"/><condition
> >> data="Klar"/></forecast_conditions></weather></xml_api_reply>
>
> >> As you can see the umlauts in the XML are not displayed properly. When I want
> >> to process this text (for example with xml.sax), I get error messages because
> >> the parser can't read this.
>
> >> I've tried to read up on this and there is a lot of information on the web, but
> >> nothing seems to work for me. For example setting the coding to UTF like this:
> >> # -*- coding: utf-8 -*- or using the decode() string method.
>
> >> I always have this kind of problem when input contains umlauts, not just in
> >> this case. My locale (on Ubuntu) is en_GB.UTF-8.
>
> >> Cheers
> >> Arian
>
> > try this?
>
> > # vim: set fencoding=utf-8:
> > import urllib
> > import xml.sax as sax, xml.sax.handler as handler
>
> > f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
> > xml = f.read()
> > xml = xml.decode("cp1252")
> > f.close()
>
> > class my_handler(handler.ContentHandler):
> >     def startElement(self, name, attrs):
> >         print "begin:", name, attrs
>
> >     def endElement(self, name):
> >         print "end:", name
>
> > sax.parseString(xml, my_handler())
>
> This is wrong. XML is a *byte*-based format, which explicitly states
> encodings. So decoding a byte-string to a unicode-object and then
> passing it to a parser is not working in the very moment you have data that
>
>   - is outside your default-system-encoding (usually ascii)
>   - has a declared encoding that differs from the system-encoding
>
> Besides, I don't see where the whole SAX-stuff is supposed to do
> anything the direct print  and the decode() don't do - smells like
> cargo-cult to me.
>
> Diez

Yes, XML is a *byte*-based format, and so are UTF-8 and the code pages
(cp936, cp1252, etc.), so XML usually declares its encoding in its header.
But that didn't work here.

In Python 2.6, sys.getdefaultencoding() returns 'ascii', I can't use
sys.setdefaultencoding(), and f.read() returns a str. So it must be
undecoded, byte-based data (i.e. raw XML data), and decoding it with the
right code page is safe (notice the webpage is google.de).

In Python 3.1, read() returns a bytes object, so we *must* decode it,
or else we can't pass it into a parser.
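
As a sketch of the decode-first approach: this assumes Python 3.5 or later, where xml.sax.parseString() also accepts a str, and it assumes cp1252 is the right code page. The sample bytes, `AttributeCollector` and `parse_mislabelled` are invented for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical stand-in for the feed: cp1252 bytes, no encoding declared.
RAW = b'<?xml version="1.0"?><condition data="Meistens bew\xf6lkt"/>'


class AttributeCollector(ContentHandler):
    """Collect every data attribute seen while parsing."""

    def __init__(self):
        super().__init__()
        self.values = []

    def startElement(self, name, attrs):
        value = attrs.get("data")
        if value is not None:
            self.values.append(value)


def parse_mislabelled(raw_bytes, actual_encoding="cp1252"):
    # Decode with the encoding we believe the server really used,
    # then hand the resulting text (not bytes) to the parser.
    text = raw_bytes.decode(actual_encoding)
    handler = AttributeCollector()
    xml.sax.parseString(text, handler)
    return handler.values
```

This only helps when the guess about the real encoding is right; with a wrong guess the parse may still succeed but yield the wrong characters.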