From: Barry on 25 May 2010 15:13

Hi,

The code below is giving me the error:

Traceback (most recent call last):
  File "C:\Users\Administratör\Desktop\test.py", line 4, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte

What am I doing wrong?

Thanks,

Barry

import urllib.request
request = urllib.request.Request(url='http://en.wiktionary.org/wiki/baby', headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'})
response = urllib.request.urlopen(request)
html = response.read().decode('utf-8')  # raises the UnicodeDecodeError above
From: Philip Semanchuk on 25 May 2010 15:39

On May 25, 2010, at 3:13 PM, Barry wrote:

> response = urllib.request.urlopen(request)
> html = response.read().decode('utf-8')

Well, for starters you're assuming that the response content is in UTF-8. You need to examine the Content-Type header to see what the encoding is. If it's not UTF-8, there's your problem.

HTH
P
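A minimal sketch of the header check suggested above, assuming Python 3, where `response.headers` from `urllib.request.urlopen` behaves like an `email.message.Message`. The helper name `declared_charset` and the hand-built headers are illustrative only; they stand in for a real response so the example runs offline.

```python
from email.message import Message


def declared_charset(headers, default="utf-8"):
    """Return the charset named in the Content-Type header, or `default`."""
    # get_content_charset() parses e.g. "text/html; charset=utf-8".
    return headers.get_content_charset() or default


# Hypothetical stand-in for response.headers, built by hand.
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"
headers["Content-Encoding"] = "gzip"

print(declared_charset(headers))        # charset to pass to bytes.decode()
print(headers.get("Content-Encoding"))  # compression applied to the body, if any
```

Printing the Content-Encoding next to the charset is deliberate: the charset only describes the text after any compression has been undone.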
From: Barry on 25 May 2010 16:00

On 25 May, 21:39, Philip Semanchuk <phi...(a)semanchuk.com> wrote:

> Well, for starters you're assuming that the response content is in
> UTF-8. You need to examine the Content-Type header to see what the
> encoding is. If it's not UTF-8, there's your problem.

The content type is utf-8:

Date: Wed, 19 May 2010 19:17:39 GMT
Server: Apache
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Content-Language: en
Vary: Accept-Encoding,Cookie
Last-Modified: Wed, 19 May 2010 10:10:34 GMT
Content-Encoding: gzip
Content-Length: 25247
Content-Type: text/html; charset=utf-8
X-Cache: HIT from sq61.wikimedia.org
X-Cache-Lookup: HIT from sq61.wikimedia.org:3128
Age: 520549
X-Cache: HIT from amssq32.esams.wikimedia.org
X-Cache-Lookup: HIT from amssq32.esams.wikimedia.org:3128
X-Cache: MISS from amssq37.esams.wikimedia.org
X-Cache-Lookup: MISS from amssq37.esams.wikimedia.org:80
Connection: close

Can it be that the page is corrupt? If so, how can I make the best of the situation? Many other pages from this server work without problems.

Thanks!

Barry
From: Peter Otten on 25 May 2010 16:10

Barry wrote:

> The content type is utf-8:
>
> Date: Wed, 19 May 2010 19:17:39 GMT
> Server: Apache
> Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
> Content-Language: en
> Vary: Accept-Encoding,Cookie
> Last-Modified: Wed, 19 May 2010 10:10:34 GMT
> Content-Encoding: gzip

But the data is gzipped. You have to uncompress it before decoding.

Peter
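The "uncompress, then decode" order can be sketched as follows, assuming Python 3.2 or later (for `gzip.decompress`). The helper `decode_body` is a hypothetical name, and the gzipped bytes are generated locally instead of being fetched, so nothing here touches the network. The first two bytes of any gzip stream are 0x1f 0x8b, which matches the "byte 0x8b in position 1" in the traceback.

```python
import gzip


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Undo gzip compression if the header asks for it, then decode to str."""
    if (content_encoding or "").strip().lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Offline stand-in for response.read(): a gzipped UTF-8 page.
page = "<html>baby</html>"
raw = gzip.compress(page.encode("utf-8"))

print(raw[:2])                   # the gzip magic bytes
print(decode_body(raw, "gzip"))  # the original text
```

With a real response, the call would presumably look like `decode_body(response.read(), response.headers.get("Content-Encoding"))`, with the charset taken from the Content-Type header.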
From: Philip Semanchuk on 25 May 2010 16:23
On May 25, 2010, at 4:00 PM, Barry wrote:

> The content type is utf-8:
>
> Content-Encoding: gzip
> Content-Length: 25247
> Content-Type: text/html; charset=utf-8

Looks like the content is gzipped. Have you unzipped it?

Also, from where are you getting those headers? The server might well send different headers to your browser than to a urllib request.

Have you examined the raw content in a hex editor or in the debugger? That would probably answer a lot of questions.
> Can it be that the page is corrupt?

Of course that's always possible, but personally whenever I have to decide whether bits are being flipped at random or my code is buggy, it's almost always the latter.

> If so, how can I make the best of the situation?

Depends on what you're trying to accomplish.

bye
Philip
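For the suggestion to look at the raw content, a hex editor is not strictly needed; a few lines of Python can dump the first bytes. This is only a sketch: `describe_prefix` and `GZIP_MAGIC` are made-up names, and the sample data is built locally in place of a real `response.read()`.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def describe_prefix(raw, count=8):
    """Return the first `count` bytes as hex text and whether they look like gzip."""
    prefix = raw[:count]
    hex_text = " ".join("{:02x}".format(b) for b in prefix)
    return hex_text, prefix.startswith(GZIP_MAGIC)


plain = "<html>".encode("utf-8")
zipped = gzip.compress(plain)

print(describe_prefix(plain))            # readable markup, not gzip
print(describe_prefix(zipped, count=2))  # starts with the gzip magic bytes
```

If the body of a real response starts with 1f 8b, it is compressed and needs to be unzipped before `.decode()` is called on it.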