From: Mister Yu on
hi experts,

I'm new to Python. I'm writing crawlers to extract data from some
Chinese websites, and I've run into an encoding problem.

I have a unicode object, which looks like this: u'\xd6\xd0\xce\xc4',
which is encoded in "gb2312", but I have no idea how to convert it
back to UTF-8.

to re-create this one is easy:

this will work
============================
>>> su = u"中文".encode('gb2312')
>>> su
'\xd6\xd0\xce\xc4'
>>> print su.decode('gb2312')
中文 -> (same as the original string)

============================
but this doesn't, why?
===========================
>>> su = u'\xd6\xd0\xce\xc4'
>>> su
u'\xd6\xd0\xce\xc4'
>>> print su.decode('gb2312')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
===========================

thank you
From: Chris Rebert on
2010/4/1 Mister Yu <eryan.yu(a)gmail.com>:
> hi experts,
>
> I'm new to Python. I'm writing crawlers to extract data from some
> Chinese websites, and I've run into an encoding problem.
>
> I have a unicode object, which looks like this: u'\xd6\xd0\xce\xc4',
> which is encoded in "gb2312",

No! Instances of type 'unicode' (i.e. strings with a leading 'u')
***aren't encoded at all***.

> but I have no idea how to convert it
> back to UTF-8

To convert u'\xd6\xd0\xce\xc4' to UTF-8, do u'\xd6\xd0\xce\xc4'.encode('utf-8')

> to re-create this one is easy:
>
> this will work
> ============================
> >>> su = u"中文".encode('gb2312')
> >>> su
> '\xd6\xd0\xce\xc4'
> >>> print su.decode('gb2312')
> 中文    -> (same as the original string)
>
> ============================
> but this doesn't, why?
> ===========================
> >>> su = u'\xd6\xd0\xce\xc4'
> >>> su
> u'\xd6\xd0\xce\xc4'
> >>> print su.decode('gb2312')

You can't decode a unicode string, it's already been decoded!

One decodes a bytestring to get a unicode string.
One **encodes** a unicode string to get a bytestring.
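
As an aside on why the traceback says 'ascii' and "encode" even though you called .decode(): as far as I know, Python 2 first implicitly encodes a unicode object to a bytestring with the default codec (normally ASCII) before it tries to decode it, and that hidden step is what blows up. The sketch below just performs that hidden step explicitly; the variable names are mine, purely for illustration.

```python
# The same "unicode" object as in the example above.
su = u'\xd6\xd0\xce\xc4'

error_name = None
failed_span = None

# Explicitly perform the encode step that Python 2 does implicitly
# before decoding a unicode object.
try:
    su.encode('ascii')
except UnicodeEncodeError as exc:
    # None of the four characters is in the ASCII range (0-127).
    error_name = type(exc).__name__
    failed_span = (exc.start, exc.end)  # end is exclusive
```

So the failure comes from the implicit ASCII encode, before gb2312 is ever consulted.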

So the last line of your example should be:
print su.encode('gb2312')

Only call .encode() on things of type 'unicode'.
Only call .decode() on things of type 'str'.
[When using Python 2.x that is. Python 3.x renames the types in question.]
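
A small sketch of that rule, written so that it should behave the same on Python 2 and on Python 3 (where the two types are called 'bytes' and 'str'). The variable names are only illustrative.

```python
# Text (unicode in 2.x, str in 3.x): the two Chinese characters from the thread.
text = u'\u4e2d\u6587'

# text -> bytes: the same text has a different byte representation per codec.
gb_bytes = text.encode('gb2312')
utf8_bytes = text.encode('utf-8')

# bytes -> text: you have to name the codec the bytes were written in.
from_gb = gb_bytes.decode('gb2312')
from_utf8 = utf8_bytes.decode('utf-8')
```

Both round trips give back the original text; only the bytes in between differ.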

Cheers,
Chris
--
http://blog.rebertia.com
From: Mister Yu on
On Apr 1, 7:22 pm, Chris Rebert <c...(a)rebertia.com> wrote:
> <snip>

Hi, thanks for the tips.

But I'm still not sure how to convert a unicode object like
u'\xd6\xd0\xce\xc4' back to "中文", the string it's supposed to be.

Thanks.

Sorry, I'm really new to Python.
From: Mister Yu on
===========================================
print u'\xd6\xd0\xce\xc4'.encode('utf-8')
ÖÐÎÄ (the result is supposed to be "中文", not something like this)
===========================================

>>> su = u"中文".encode('gb2312')
>>> su
'\xd6\xd0\xce\xc4'
===========================================
From: Chris Rebert on
On Thu, Apr 1, 2010 at 4:38 AM, Mister Yu <eryan.yu(a)gmail.com> wrote:
> On Apr 1, 7:22 pm, Chris Rebert <c...(a)rebertia.com> wrote:
>> 2010/4/1 Mister Yu <eryan...(a)gmail.com>:
>> > hi experts,
>>
>> > I'm new to Python. I'm writing crawlers to extract data from some
>> > Chinese websites, and I've run into an encoding problem.
>>
>> > I have a unicode object, which looks like this: u'\xd6\xd0\xce\xc4',
>> > which is encoded in "gb2312",
<snip>
> hi, thanks for the tips.
>
> But I'm still not sure how to convert a unicode object like
> u'\xd6\xd0\xce\xc4' back to "中文", the string it's supposed to be.

Ah, my apologies! I overlooked something (sorry, it's early in the
morning where I am).
What you have there is ***really*** screwy. It's the 2 Chinese
characters, encoded in gb2312 (4 bytes), with each byte then somehow
cast *directly* into a 'unicode' character of the same numeric value
(which ought never to be done).

In answer to your original question (after some experimentation):
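# Each character's code point (0-255) is really one gb2312 byte; rebuild the bytestring.
gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
unicode_string = gb2312_bytes.decode('gb2312') # now u'\u4e2d\u6587', i.e. 中文
utf8_bytes = unicode_string.encode('utf-8') #as you wanted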
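
For what it's worth, the chr(ord(c)) loop can probably be replaced by a single .encode('latin-1'), since Latin-1 maps code points 0-255 to the byte with the same value. A sketch, where fix_mojibake is a made-up helper name and gb2312 as the real encoding is an assumption based on your description:

```python
def fix_mojibake(broken, real_encoding='gb2312'):
    """Recover text whose raw bytes were wrongly turned into code points 0-255.

    `broken` is a unicode string such as u'\\xd6\\xd0\\xce\\xc4'.
    Raises UnicodeEncodeError if it contains anything above U+00FF.
    """
    # Code point N -> byte N, which undoes the bad cast.
    raw_bytes = broken.encode('latin-1')
    # Decode with the codec the bytes were really written in.
    return raw_bytes.decode(real_encoding)


fixed = fix_mojibake(u'\xd6\xd0\xce\xc4')
utf8_bytes = fixed.encode('utf-8')  # UTF-8 bytes, as originally asked for
```

This form also runs unchanged on Python 3, where chr() returns text rather than a byte.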

If possible, I'd look at the code that's giving you that funky
"string" in the first place and see if it can be fixed to give you
either a proper bytestring or proper unicode string rather than the
bastardized mess you're currently having to deal with.
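
As a rough illustration of what I mean by fixing it at the source. The assumptions here are that your crawler can hand you the raw bytes of the page and that the page really is gb2312. decode_page and page_bytes are made-up names, and the bytes are hard-coded so the snippet stands alone:

```python
def decode_page(page_bytes, encoding='gb2312'):
    """Decode raw page bytes exactly once, using the page's own encoding."""
    # 'replace' keeps the crawler going if a page contains stray bytes.
    return page_bytes.decode(encoding, 'replace')


# Stand-in for bytes read from the network (hypothetical data).
page_bytes = b'\xd6\xd0\xce\xc4'

page_text = decode_page(page_bytes)   # proper text from here on
utf8_out = page_text.encode('utf-8')  # encode only when writing it back out
```

The idea is to decode once when the bytes come in and encode once when they go out, so a half-decoded string never shows up in between.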

Apologies again and Cheers,
Chris
--
http://blog.rebertia.com