UnicodeDecodeError when fetching a web page [Python]

From: John Machin on 26 May 2010 03:04

Rob Williscroft <rtw <at> rtw.me.uk> writes:

>
> Barry wrote in news:83dc485a-5a20-403b-99ee-c8c627bdbab3
> @m21g2000vbr.googlegroups.com in gmane.comp.python.general:
>

> > UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1:
> > unexpected code byte
>
> It may not be you, en.wiktionary.org is sending gzip
> encoded content back,

It sure is; here's where the offending 0x8b comes from:

"""ID1 (IDentification 1)
ID2 (IDentification 2)
These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139
(0x8b, \213), to identify the file as being in gzip format."""

(from http://www.faqs.org/rfcs/rfc1952.html)
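The two ID bytes quoted above also explain why the error is reported at position 1: 0x1f is a valid ASCII control character, so the UTF-8 decoder only fails on the second byte, 0x8b. A minimal sketch of this, assuming Python 3.2 or later (for `gzip.compress`); `looks_gzipped` is a hypothetical helper name, not something from the thread:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # ID1, ID2 as defined in RFC 1952


def looks_gzipped(data: bytes) -> bool:
    """Return True if data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Stand-in for the compressed body a server might send back.
payload = gzip.compress(b"<html>hello</html>")

try:
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    # Byte 0 (0x1f) decodes fine; the decoder gives up at exc.start.
    failing_position = exc.start
    failing_byte = payload[exc.start]
```

Checking the first two bytes like this is a cheap way to tell whether a response body needs decompressing before it is decoded as text.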

From: Kushal Kumaran on 26 May 2010 11:59

On Tue, 2010-05-25 at 20:12 +0000, Rob Williscroft wrote:
> Barry wrote in news:83dc485a-5a20-403b-99ee-c8c627bdbab3
> @m21g2000vbr.googlegroups.com in gmane.comp.python.general:
>
> > Hi,
> >
> > The code below is giving me the error:
> >
> > Traceback (most recent call last):
> > File "C:\Users\Administratör\Desktop\test.py", line 4, in <module>
> > UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1:
> > unexpected code byte
> >
> >
> > What am i doing wrong?
>
> It may not be you, en.wiktionary.org is sending gzip
> encoded content back, it seems to do this even if you set
> the Accept header as in:
>
> request.add_header( "Accept", "text/html" )
>
> But maybe I'm not doing it correctly.
>

You need the Accept-Encoding: identity header.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html

<snip>

--
regards,
kushal

From: Rob Williscroft on 26 May 2010 14:10

Kushal Kumaran wrote in news:1274889564.2339.16.camel(a)nitrogen in
gmane.comp.python.general:

> On Tue, 2010-05-25 at 20:12 +0000, Rob Williscroft wrote:
>> Barry wrote in news:83dc485a-5a20-403b-99ee-c8c627bdbab3
>> @m21g2000vbr.googlegroups.com in gmane.comp.python.general:
>>
>> > Hi,
>> >
>> > The code below is giving me the error:
>> >
>> > Traceback (most recent call last):
>> > File "C:\Users\Administratör\Desktop\test.py", line 4, in
>> > <module>
>> > UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position
>> > 1: unexpected code byte
>> >
>> >
>> > What am i doing wrong?
>>
>> It may not be you, en.wiktionary.org is sending gzip
>> encoded content back, it seems to do this even if you set
>> the Accept header as in:
>>
>> request.add_header( "Accept", "text/html" )
>>
>> But maybe I'm not doing it correctly.
>>
> You need the Accept-Encoding: identity header.
> http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html

Thanks, following this I did change the line to be:

request.add_header( "Accept-Encoding", "identity" )

but it made no difference to en.wiktionary.org; it just sent
back a gzip-encoded response.
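
For reference, this is roughly how that header is set with Python 3's `urllib.request` (an assumption, since the original script is not shown in this thread; the URL below is only illustrative). A server may still ignore the header, so this alone does not guarantee an uncompressed body:

```python
import urllib.request

# Illustrative URL; the page actually requested is not shown in the thread.
url = "http://en.wiktionary.org/wiki/Main_Page"

request = urllib.request.Request(url)
# Ask for the body without any content coding (no gzip or deflate).
request.add_header("Accept-Encoding", "identity")

# urllib stores header names capitalized, i.e. as "Accept-encoding".
sent_value = request.get_header("Accept-encoding")
```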

Rob.

From: Kushal Kumaran on 27 May 2010 01:00

On Wed, May 26, 2010 at 11:40 PM, Rob Williscroft <rtw(a)rtw.me.uk> wrote:
> Kushal Kumaran wrote in news:1274889564.2339.16.camel(a)nitrogen in
> gmane.comp.python.general:
>
>> On Tue, 2010-05-25 at 20:12 +0000, Rob Williscroft wrote:
>>> Barry wrote in news:83dc485a-5a20-403b-99ee-c8c627bdbab3
>>> @m21g2000vbr.googlegroups.com in gmane.comp.python.general:
>>>
>>> > Hi,
>>> >
>>> > The code below is giving me the error:
>>> >
>>> > Traceback (most recent call last):
>>> > File "C:\Users\Administratör\Desktop\test.py", line 4, in
>>> > <module>
>>> > UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position
>>> > 1: unexpected code byte
>>> >
>>> >
>>> > What am i doing wrong?
>>>
>>> It may not be you, en.wiktionary.org is sending gzip
>>> encoded content back, it seems to do this even if you set
>>> the Accept header as in:
>>>
>>> request.add_header( "Accept", "text/html" )
>>>
>>> But maybe I'm not doing it correctly.
>>>
>> You need the Accept-Encoding: identity header.
>> http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html
>
> Thanks, following this I did change the line to be:
>
> request.add_header( "Accept-Encoding", "identity" )
>
> but it made no difference to en.wiktionary.org; it just sent
> back a gzip-encoded response.
>

A known problem, I guess... https://bugzilla.wikimedia.org/show_bug.cgi?id=7098

You'll just have to handle the gzipped data.
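
A minimal sketch of handling it, assuming Python 3.2 or later (`gzip.decompress`) and `urllib.request`. The names `decode_body` and `fetch_text` are hypothetical helpers made up for illustration, and falling back to UTF-8 when the server declares no charset is also an assumption:

```python
import gzip
import urllib.request
from typing import Optional


def decode_body(raw: bytes, content_encoding: Optional[str],
                charset: str = "utf-8") -> str:
    """Undo a gzip content coding if present, then decode bytes to text."""
    is_gzip = (content_encoding or "").strip().lower() in ("gzip", "x-gzip")
    # Also sniff the gzip magic number in case the header is missing.
    if is_gzip or raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


def fetch_text(url: str) -> str:
    """Fetch a URL and return its body as text (hypothetical helper)."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        encoding = response.headers.get("Content-Encoding")
        # Assumption: fall back to UTF-8 if no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
    return decode_body(raw, encoding, charset)
```

Keeping the decompression step in `decode_body`, separate from the network call, means it can be tried on bytes built locally with `gzip.compress` without contacting any server.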

--
regards,
kushal
