A Unicode problem -HELP [Python]


From: manstey on 11 May 2006 23:34

I am writing a program to translate a list of ascii letters into a
different language that requires unicode encoding. This is what I have
done so far:

1. I have # -*- coding: UTF-8 -*- as my first line.
2. In Wing IDE I have set Default Encoding to UTF-8
3. I have imported codecs and opened and written my file, which doesn't
have a BOM, as encoding=UTF-8
4. I have written a dictionary for translation, with entries such as
{'F':u'\u0254'} and a function to do the translation

Everything works fine, except that my output file, when loaded in
unicode aware emeditor has
(u'F', u'\u0254')

But I want to display it as:
('F', 'ɔ') # where the ɔ is a back-to-front 'c'

So my questions are:
1. How do I do this?
2. Do I need to change any of my steps above?

From: Martin v. Löwis on 12 May 2006 01:22

manstey wrote:
> 1. I have # -*- coding: UTF-8 -*- as my first line.
> 2. In Wing IDE I have set Default Encoding to UTF-8
> 3. I have imported codecs and opened and written my file, which doesn't
> have a BOM, as encoding=UTF-8
> 4. I have written a dictionary for translation, with entries such as
> {'F':u'\u0254'} and a function to do the translation
>
> Everything works fine, except that my output file, when loaded in
> unicode aware emeditor has
> (u'F', u'\u0254')

I couldn't quite follow this description: what is "your output file"
(in what step is it created?), and how does

(u'F', u'\u0254')

get into this file? What is the precise Python statement that
produces that line of output?

> So my questions are:
> 1. How do I do this?

Most likely, you use (directly or indirectly) the repr() function
to convert a tuple into that string. You shouldn't do that;
instead, you should format the elements of the tuple yourself, e.g.
through

print >>f, u"('%s', '%s')" % value

Regards,
Martin
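
To illustrate the point about repr(): a minimal sketch, assuming the tuple holds exactly two unicode strings and that the output file is opened with an explicit UTF-8 encoding. The helper name format_pair, the temporary output path, and the use of io.open are illustrative choices that do not appear in the posts above; io.open is used so that the same code runs on Python 2.6+ and Python 3.

```python
import io
import os
import tempfile


def format_pair(pair):
    """Format a 2-tuple of unicode strings without going through repr()."""
    return u"('%s', '%s')" % pair


pair = (u'F', u'\u0254')
line = format_pair(pair)

# Write the formatted text through an encoding-aware file object.
out_path = os.path.join(tempfile.mkdtemp(), 'pair.txt')  # hypothetical location
with io.open(out_path, 'w', encoding='utf-8') as f:
    f.write(line + u'\n')

# Read the raw bytes back: U+0254 is stored as the two UTF-8 bytes C9 94,
# not as the six ASCII characters of the escape sequence.
with io.open(out_path, 'rb') as f:
    raw = f.read()
```

Because the elements are formatted individually, the file receives the character itself rather than the escaped form that repr() of a tuple produces in Python 2.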

From: manstey on 16 May 2006 22:19

Hi Martin,

Here is how I write:

input_file = open(input_file_loc, 'r')
output_file = open(output_file_loc, 'w')
for line in input_file:
    output_file.write(str(word_info + parse + gloss))  # = three functions that return tuples

(u'F', u'\u0254') are two of the many unicode tuple elements returned
by the three functions.

What am I doing wrong?

From: Ben Finney on 16 May 2006 22:38

"manstey" <manstey(a)csu.edu.au> writes:

> input_file = open(input_file_loc, 'r')
> output_file = open(output_file_loc, 'w')
> for line in input_file:
>     output_file.write(str(word_info + parse + gloss)) # = three functions that return tuples

If you mean that 'word_info', 'parse' and 'gloss' are three functions
that return tuples, then you get that return value by calling them.

>>> def foo():
...     return "foo's return value"
...
>>> def bar(baz):
...     return "bar's return value (including '%s')" % baz
...
>>> print foo()
foo's return value
>>> print bar
<function bar at 0x401fe80c>
>>> print bar("orange")
bar's return value (including 'orange')

--
\ "A man must consider what a rich realm he abdicates when he |
`\ becomes a conformist." -- Ralph Waldo Emerson |
_o__) |
Ben Finney

From: manstey on 17 May 2006 00:20

I'm a newbie at python, so I don't really understand how your answer
solves my unicode problem.

I have done more reading on unicode and then tried my code in IDLE
rather than WING IDE, and discovered that it works fine in IDLE, so I
think WING has a problem with unicode. For example, in WING this code
returns an error:

a={'a':u'\u0254'}
print a['a']

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0254' in
position 0: ordinal not in range(128)

but in IDLE it correctly prints open o
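
The error message names the 'ascii' codec, which suggests that the stream print writes to in Wing was using ASCII while the one in IDLE was not. This is an assumption, since the posts do not show how Wing was configured. A small sketch that reproduces the same failure without any IDE, by encoding the character explicitly:

```python
ch = u'\u0254'  # LATIN SMALL LETTER OPEN O

# Encoding to ASCII fails; this is the same failure print runs into
# when the encoding of the output stream is ASCII.
try:
    ch.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds and yields two bytes.
utf8_bytes = ch.encode('utf-8')
```

If this reading is right, the dictionary lookup itself is fine in both environments, and only the final conversion to bytes for display differs.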

So, assuming I now work in IDLE, all I want help with is how to read in
an ascii string and convert its letters to various unicode values and
save the resulting 'string' to a utf-8 text file. Is this clear?

so in pseudo code
1. F is converted to \u0254, $ is converted to \u0283, C is converted
to \u02A6\u02C1, etc.
(i want to do this using a dictionary TRANSLATE={'F':u'\u0254', etc)
2. I read in a file with lines like:
F$
FCF$
$$C$ etc
3. I convert this to
\u0254\u0283
\u0254\u02A6\u02C1\u0254 etc
4. i save the results in a new file

When I read the new file in a Unicode editor (EmEditor), I should not see
\u0254\u02A6\u02C1\u0254, but the actual characters (open o, esh,
ts digraph, modifier letter reversed glottal stop, etc.).

I'm sure this is straightforward but I can't get it to work.

All help appreciated!
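
The four steps above can be sketched as follows. This is one possible approach under stated assumptions, not a tested fix for the original program: the TRANSLATE table holds only the three mappings mentioned in the post, the names translate_line and translate_file and the temporary file paths are hypothetical, characters missing from the table are passed through unchanged, and io.open is used so that the sketch runs on both Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

# Mapping from ASCII placeholder letters to the target characters.
TRANSLATE = {
    u'F': u'\u0254',        # open o
    u'$': u'\u0283',        # esh
    u'C': u'\u02A6\u02C1',  # ts digraph + modifier letter reversed glottal stop
}


def translate_line(line):
    """Replace each character found in TRANSLATE; keep the rest unchanged."""
    return u''.join(TRANSLATE.get(ch, ch) for ch in line)


def translate_file(in_path, out_path):
    """Read ASCII text, translate it, and save the result as UTF-8."""
    with io.open(in_path, 'r', encoding='ascii') as src:
        with io.open(out_path, 'w', encoding='utf-8') as dst:
            for line in src:
                dst.write(translate_line(line))


# Small demonstration using temporary files (hypothetical paths).
work_dir = tempfile.mkdtemp()
in_path = os.path.join(work_dir, 'input.txt')
out_path = os.path.join(work_dir, 'output.txt')

with io.open(in_path, 'w', encoding='ascii') as f:
    f.write(u'F$\nFCF$\n')

translate_file(in_path, out_path)

with io.open(out_path, 'r', encoding='utf-8') as f:
    result = f.read()
```

The key design choice is that unicode strings are written directly through a file object that knows its encoding, so no str() or repr() call ever turns the characters into backslash escapes.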
