From: Mark Tolonen on

"Diez B. Roggisch" <deets(a)nospam.web.de> wrote in message
news:7jub5rF37divlU4(a)mid.uni-berlin.de...
[snip]
> This is weird. I looked at the site in Firefox - and it was displayed
> correctly, including umlauts. Bringing up the info dialog claims the page
> is UTF-8, and the XML itself says so as well (implicitly, through the missing
> declaration of an encoding) - but it clearly is *not* UTF-8.
>
> One would expect google to be better at this...
>
> Diez

According to the XML 1.0 specification:

"Although an XML processor is required to read only entities in the UTF-8
and UTF-16 encodings, it is recognized that other encodings are used around
the world, and it may be desired for XML processors to read entities that
use them. In the absence of external character encoding information (such as
MIME headers), parsed entities which are stored in an encoding other than
UTF-8 or UTF-16 must begin with a text declaration..."

So UTF-8 and UTF-16 are the defaults supported without an xml declaration in
the absence of external encoding information. But we have external
character encoding information:

>>> import urllib
>>> f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
>>> f.headers.dict['content-type']
'text/xml; charset=ISO-8859-1'

So the page seems correct.
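A quick sketch of what that header implies (with a made-up byte string standing in for the live response body, which isn't reproduced here): under ISO-8859-1, bytes like 0xFC are single-byte umlauts, so decoding with the declared charset yields the expected text.

```python
# Hypothetical response fragment - the real body isn't shown in this thread.
body = b'<forecast_information><city data="M\xfcnchen"/></forecast_information>'

# Decode per the Content-Type charset the server actually sent.
text = body.decode('iso-8859-1')
assert u'M\xfcnchen' in text  # 'Muenchen' with a real u-umlaut
```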

-Mark


From: Arian Kuschki on
Hm yes, that is true. In Firefox on the other hand, the response header is
"Content-Type text/xml; charset=UTF-8"

On Sat 17, 13:16 -0700, Mark Tolonen wrote:

>
> "Diez B. Roggisch" <deets(a)nospam.web.de> wrote in message
> news:7jub5rF37divlU4(a)mid.uni-berlin.de...
> [snip]
> >This is weird. I looked at the site in Firefox - and it was
> >displayed correctly, including umlauts. Bringing up the
> >info dialog claims the page is UTF-8, and the XML itself says so as
> >well (implicitly, through the missing declaration of an encoding) -
> >but it clearly is *not* UTF-8.
> >
> >One would expect google to be better at this...
> >
> >Diez
>
> According to the XML 1.0 specification:
>
> "Although an XML processor is required to read only entities in the
> UTF-8 and UTF-16 encodings, it is recognized that other encodings
> are used around the world, and it may be desired for XML processors
> to read entities that use them. In the absence of external character
> encoding information (such as MIME headers), parsed entities which
> are stored in an encoding other than UTF-8 or UTF-16 must begin with
> a text declaration..."
>
> So UTF-8 and UTF-16 are the defaults supported without an xml
> declaration in the absence of external encoding information. But we
> have external character encoding information:
>
> >>> import urllib
> >>> f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
> >>> f.headers.dict['content-type']
> 'text/xml; charset=ISO-8859-1'
>
> So the page seems correct.
>
> -Mark
>
>
> --
> http://mail.python.org/mailman/listinfo/python-list

From: Diez B. Roggisch on
Arian Kuschki schrieb:
> Whoa, that was quick! Thanks for all the answers, I'll try to recapitulate
>
>> What does this show you in your interactive interpreter?
>>
>>>>> print "\xc3\xb6"
>> ö
>>
>> For me, it's o-umlaut, ö. This is because the above bytes are the
>> sequence for ö in utf-8.
>>
>> If this shows something else, you need to adjust your terminal settings.
>
> for me it also prints the correct o-umlaut (ö), so that was not the problem.
>
>
> All of the below result in xml that shows all umlauts correctly when printed:
>
> xml.decode("cp1252")
> xml.decode("cp1252").encode("utf-8")
> xml.decode("iso-8859-1")
> xml.decode("iso-8859-1").encode("utf-8")
>
> But when I then want to parse the xml, it only works if I
> both decode and encode. If I only decode, I get the following error:
> SAXParseException: <unknown>:1:1: not well-formed (invalid token)
>
> Do I understand right that since the encoding was not specified in the xml
> response, it should have been utf-8 by default? And that if it had indeed been utf-8 I
> would not have had the encoding problem in the first place?

Yes. XML without an explicit encoding is implicitly UTF-8, and the page is
borked: it uses cp* or latin* without saying so.
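To make that concrete, here is a minimal sketch (with a made-up latin-1 fragment standing in for the Google response) of why the parser chokes on the raw bytes but accepts them after the decode/encode round-trip:

```python
import xml.sax

# Made-up latin-1 bytes with no XML declaration, standing in for the response.
raw = b'<forecast city="M\xfcnchen"/>'

handler = xml.sax.ContentHandler()

# Without a declaration the parser assumes UTF-8, and 0xfc does not start a
# valid UTF-8 sequence, so parsing the raw bytes fails.
try:
    xml.sax.parseString(raw, handler)
    raw_parsed = True
except xml.sax.SAXParseException:
    raw_parsed = False

# Decoding with the real charset and re-encoding as UTF-8 makes the bytes
# match the parser's default assumption, so this parse succeeds.
xml.sax.parseString(raw.decode('iso-8859-1').encode('utf-8'), handler)
```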


Diez
From: Diez B. Roggisch on
Diez B. Roggisch schrieb:
> Arian Kuschki schrieb:
>> Whoa, that was quick! Thanks for all the answers, I'll try to
>> recapitulate
>>
>>> What does this show you in your interactive interpreter?
>>>
>>>>>> print "\xc3\xb6"
>>> ö
>>>
>>> For me, it's o-umlaut, ö. This is because the above bytes are the
>>> sequence for ö in utf-8.
>>>
>>> If this shows something else, you need to adjust your terminal settings.
>>
>> for me it also prints the correct o-umlaut (ö), so that was not the
>> problem.
>>
>>
>> All of the below result in xml that shows all umlauts correctly when
>> printed:
>>
>> xml.decode("cp1252")
>> xml.decode("cp1252").encode("utf-8")
>> xml.decode("iso-8859-1")
>> xml.decode("iso-8859-1").encode("utf-8")
>>
>> But when I then want to parse the xml, it only works if I
>> both decode and encode. If I only decode, I get the following error:
>> SAXParseException: <unknown>:1:1: not well-formed (invalid token)
>>
>> Do I understand right that since the encoding was not specified in the
>> xml response, it should have been utf-8 by default? And that if it had
>> indeed been utf-8 I would not have had the encoding problem in the
>> first place?
>
> Yes. XML without an explicit encoding is implicitly UTF-8, and the page is
> borked: it uses cp* or latin* without saying so.

Ok, after reading some other posts in this thread, this assumption seems
not to hold: the HTTP protocol allows other encodings to be given
implicitly, via transport metadata rather than the document itself. Which
I think is an atrocity.
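If one has to live with that, the charset at least has to be pulled out of the Content-Type header before decoding. A minimal sketch (using the stdlib email machinery to parse the header value, and falling back to UTF-8, XML's own default, when the header stays silent):

```python
from email.message import Message

def charset_from_content_type(content_type, default='utf-8'):
    # Let the stdlib email machinery handle the parameter syntax
    # (quoting, case-insensitivity, etc.).
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset() or default

assert charset_from_content_type('text/xml; charset=ISO-8859-1') == 'iso-8859-1'
assert charset_from_content_type('text/xml') == 'utf-8'
```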

Diez