From: Stefan Behnel on
Mister Yu, 01.04.2010 13:38:
> i m still not very sure how to convert a unicode object **
> u'\xd6\xd0\xce\xc4 ** back to "中文" the string it supposed to be?

You are confused. '\xd6\xd0\xce\xc4' is an encoded byte string, not a
unicode string. The fact that you have it stored in a unicode string
implies that something in your code (or in a library) has done an incorrect
conversion from bytes to unicode that did not take into account the real
character set in use. So you end up with a completely meaningless unicode
string.
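To make the diagnosis above concrete, here is a minimal sketch of how such a string can arise. It is written for Python 3 (an assumption; the thread itself uses Python 2), and the variable names are purely illustrative. It also assumes the wrong codec was ISO-8859-1, which maps every byte to the code point with the same value.

```python
# The two characters from the question, written with escapes so the
# example does not depend on the source file's encoding.
text = "\u4e2d\u6587"

raw = text.encode("gb2312")          # the bytes a GB2312 page would contain
broken = raw.decode("iso-8859-1")    # wrong codec: one code point per byte
correct = raw.decode("gb2312")       # right codec: two characters

print(len(broken), len(correct))
```

The broken string has four code points (0xd6, 0xd0, 0xce, 0xc4), one per byte, which matches the object shown in the question.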

Please show us the code that does the conversion to a unicode string.

Stefan

From: Mister Yu on
On Apr 1, 8:13 pm, Chris Rebert <c...(a)rebertia.com> wrote:
> On Thu, Apr 1, 2010 at 4:38 AM, Mister Yu <eryan...(a)gmail.com> wrote:
> > On Apr 1, 7:22 pm, Chris Rebert <c...(a)rebertia.com> wrote:
> >> 2010/4/1 Mister Yu <eryan...(a)gmail.com>:
> >> > hi experts,
>
> >> > i m new to python, i m writing crawlers to extract data from some
> >> > chinese websites, and i run into a encoding problem.
>
> >> > i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
> >> > which is encoded in "gb2312",
> <snip>
> > hi, thanks for the tips.
>
> > but i m still not very sure how to convert a unicode object  **
> > u'\xd6\xd0\xce\xc4 ** back to "中文" the string it supposed to be?
>
> Ah, my apologies! I overlooked something (sorry, it's early in the
> morning where I am).
> What you have there is ***really*** screwy. It's the 2 Chinese
> characters, encoded in gb2312, and then somehow cast *directly* into a
> 'unicode' string (which ought never to be done).
>
> In answer to your original question (after some experimentation):
> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
> unicode_string = gb2312_bytes.decode('gb2312')
> utf8_bytes = unicode_string.encode('utf-8') #as you wanted
>
> If possible, I'd look at the code that's giving you that funky
> "string" in the first place and see if it can be fixed to give you
> either a proper bytestring or proper unicode string rather than the
> bastardized mess you're currently having to deal with.
>
> Apologies again and Cheers,
> Chris
> --http://blog.rebertia.com

Hi Chris,

thanks for the great tips! it works like a charm.

i m using the Scrapy project (http://doc.scrapy.org/intro/tutorial.html)
to write my crawler. when it extracts data with xpath, it puts the
chinese characters directly into the unicode object.

thanks again chris, and have a good april fool day.

Cheers,
Yu

From: Stefan Behnel on
Mister Yu, 01.04.2010 14:26:
> On Apr 1, 8:13 pm, Chris Rebert wrote:
>> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
>> unicode_string = gb2312_bytes.decode('gb2312')
>> utf8_bytes = unicode_string.encode('utf-8') #as you wanted

Simplifying this hack a bit:

gb2312_bytes = u'\xd6\xd0\xce\xc4'.encode('ISO-8859-1')
unicode_string = gb2312_bytes.decode('gb2312')
utf8_bytes = unicode_string.encode('utf-8')

Although I have to wonder why you want a UTF-8 encoded byte string as
output instead of Unicode.
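As a rough illustration of why the shorter form is equivalent, the sketch below compares both ways of recovering the bytes. It targets Python 3 (an assumption, since the snippets in this thread are Python 2), where `bytes(...)` takes the place of the `''.join(chr(...))` idiom; the variable names are illustrative only.

```python
broken = "\xd6\xd0\xce\xc4"

# Original approach: rebuild the byte string one code point at a time.
bytes_by_hand = bytes(ord(c) for c in broken)

# Simplified approach: ISO-8859-1 maps code points 0-255 to the
# byte with the same value, so encoding does the same job.
bytes_by_codec = broken.encode("iso-8859-1")

# Either way, the recovered bytes can now be decoded with the real charset.
unicode_string = bytes_by_codec.decode("gb2312")
print(unicode_string)
```

Both variants fail in the same situation: as soon as the input contains a code point above 255, it can no longer be mapped back to a single byte.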


>> If possible, I'd look at the code that's giving you that funky
>> "string" in the first place and see if it can be fixed to give you
>> either a proper bytestring or proper unicode string rather than the
>> bastardized mess you're currently having to deal with.
>
> thanks for the great tips! it works like a charm.

I hope you're aware that it's a big ugly hack, though. You should really
try to fix your input instead.


> i m using the Scrapy project (http://doc.scrapy.org/intro/tutorial.html)
> to write my crawler. when it extracts data with xpath, it puts the
> chinese characters directly into the unicode object.

My guess is that the HTML page you are parsing is broken and doesn't
specify its encoding. In that case, all that scrapy can do is guess, and it
seems to have guessed incorrectly.

You should check if there is a way to tell scrapy about the expected page
encoding, so that it can return correctly decoded unicode strings directly,
instead of resorting to dirty hacks that may or may not work depending on
the page you are parsing.

Stefan

From: Mister Yu on
On Apr 1, 9:31 pm, Stefan Behnel <stefan...(a)behnel.de> wrote:
> My guess is that the HTML page you are parsing is broken and doesn't
> specify its encoding. In that case, all that scrapy can do is guess, and it
> seems to have guessed incorrectly.
>
> You should check if there is a way to tell scrapy about the expected page
> encoding, so that it can return correctly decoded unicode strings directly,
> instead of resorting to dirty hacks that may or may not work depending on
> the page you are parsing.
>
> Stefan

Hi Stefan,

i don't think the page is broken. you can take a look at it here:
http://www.7176.com/Sections/Genre/Comedy
it's kind of a chinese IMDB rip-off.

from what i can see in the source of the page header, it does contain
the encoding info:
<HTML><head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<meta http-equiv="Content-Language" content="zh-CN" />
<meta content="all" name="robots" />
<meta name="author" content="admin(at)7176.com" />
<meta name="Copyright" content="www.7176.com" />
<meta content="类别为 剧情 的电影列表 第1页" name="keywords" />
<TITLE>类别为 剧情 的电影列表 第1页</TITLE>
<LINK href="http://www.7176.com/images/pro.css" rel=stylesheet>
</HEAD>
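Given a header like the one above, the declared charset can be read from the raw bytes before any decoding takes place. The following is a minimal sketch using only the standard library and Python 3 (an assumption). It is not how Scrapy resolves encodings; `decode_html`, `CHARSET_RE`, and the sample page are hypothetical names and data made up for illustration.

```python
import re

# Look for "charset=<name>" in the raw, still undecoded bytes.
CHARSET_RE = re.compile(rb"charset\s*=\s*['\"]?([\w-]+)", re.IGNORECASE)


def decode_html(raw_bytes, fallback="utf-8"):
    """Decode a page using its declared charset, or a fallback if none is found."""
    match = CHARSET_RE.search(raw_bytes[:2048])  # the declaration sits near the top
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw_bytes.decode(encoding, errors="replace")
    except LookupError:
        # The page declared a charset name that Python does not know.
        return raw_bytes.decode(fallback, errors="replace")


# Hypothetical sample page, built the way a GB2312 server would send it.
sample_page = (
    b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
    b"<TITLE>" + "\u4e2d\u6587".encode("gb2312") + b"</TITLE>"
)
decoded = decode_html(sample_page)
print(decoded)
```

The point of the sketch is only that the bytes must be decoded with the charset the page declares; whether and how a crawler framework exposes that setting has to be checked in its own documentation.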

maybe i should take a look at the source code of Scrapy, but i have
been using python for less than a week, so i m not sure i can
understand the source.

earlier, Chris's workaround was working pretty well until it met a
string like this:
>>> su = u'一二三四 12345 一二三四'
>>> su
u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'
>>> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(256)

the digits don't get encoded, so it breaks the code.
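One observation on the traceback above: every Chinese character in `su` is already a code point above 255 (for example `\u4e00`), so this string is properly decoded Unicode and `chr()` cannot turn it back into single bytes. A guarded variant of the workaround might look like the sketch below. It is written for Python 3 (an assumption), `repair_if_misdecoded` is a hypothetical helper name, and the check is only a heuristic.

```python
def repair_if_misdecoded(text, real_encoding="gb2312"):
    """Undo a wrong ISO-8859-1 decoding, leaving proper Unicode untouched."""
    try:
        # Only succeeds if every code point fits into one byte (0-255).
        raw = text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return text  # contains code points above 255: already real Unicode
    try:
        return raw.decode(real_encoding)
    except UnicodeDecodeError:
        return text  # bytes are not valid in the target charset: leave as-is


good = "\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db"
broken = "\xd6\xd0\xce\xc4"

print(repair_if_misdecoded(good) == good)
print(repair_if_misdecoded(broken))
```

This only hides the symptom. If some values come back correctly decoded and others do not, the decoding step upstream is still inconsistent and is the part worth fixing.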

any ideas?

once again, thanks for everybody's help!