From: Ben Finney on
"manstey" <manstey(a)csu.edu.au> writes:

> 1. Here is my input data file, line 2:
> gn1:1,1.2 R")$I73YT R")$IYT@ncfsa

Your program is reading this using the 'utf-8' encoding. When it does
so, all the characters you show above will be read in happily as you
see them (so long as you view them with the 'utf-8' encoding), and
converted to Unicode characters representing the same thing.
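
A minimal sketch of that point, written in Python 3 notation (an
assumption; this thread uses Python 2, where the decoded result would be
a unicode object). It takes the first two fields of the sample line,
which are all ASCII, and shows that decoding them as UTF-8 gives the
same characters back. The variable names are illustrative only.

```python
# Part of the sample line from the input file, as raw bytes (ASCII only).
raw = b'gn1:1,1.2 R")$I73YT'

# Decoding as UTF-8 yields text with exactly the same characters,
# because UTF-8 encodes the ASCII range byte-for-byte like ASCII.
text = raw.decode("utf-8")

# Encoding back to UTF-8 round-trips to the original bytes.
round_trip = text.encode("utf-8")
```

So for ASCII-only data like this, 'utf-8' and 'ascii' decoding cannot be
told apart; a wrong encoding guess only shows up on non-ASCII bytes.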

Do you have any other information that might indicate this is *not*
utf-8 encoded data?

> 2. Here is my output data file, line 2:
> u'gn', u'1', u'1', u'1', u'2', u'-', u'R")$I73YT', u'R")$IYT',
> u'R")$IYT', u'@', u'ncfsa', u'nc', '', '', '', u'f', u's', u'a', '',
> '', '', '', '', '', '', '', u'B.:R")$I^YT', u'b.:cv)cv^yc', '\xc9\x94'

As you can see, after the file was read with the 'utf-8' encoding and
written out again with the 'utf-8' encoding, the characters (as you
posted them in the message) have been faithfully preserved by Unicode
processing and encoding.


Bear in mind that when you present the "input data file, line 2" to
us, your message is itself encoded using a particular character
encoding. (In the case of the message where you wrote the above, it's
'utf-8'.) This means we may or may not be seeing the exact same bytes
you see in the input file; we're seeing characters in the encoding you
used to post the message.
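
To illustrate why the bytes matter, here is a small sketch in Python 3
notation (an assumption; the thread uses Python 2). It takes U+0254,
whose UTF-8 bytes '\xc9\x94' show up at the end of the output quoted
above, and shows that the same character is stored as different bytes
under different encodings, and that the same bytes read with another
encoding give different characters. The variable names are illustrative.

```python
# U+0254 (LATIN SMALL LETTER OPEN O).
char = "\u0254"

utf8_bytes = char.encode("utf-8")        # two bytes: C9 94
utf16_bytes = char.encode("utf-16-le")   # two different bytes: 54 02

# Reading the UTF-8 bytes with the wrong encoding does not fail;
# it silently yields two different characters instead of one.
misread = utf8_bytes.decode("latin-1")
```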

You need to know what encoding was used when the data in that file was
written. You can then read the file using that encoding, which decodes
its bytes to Unicode for processing inside your program. When you
write the text out again, you can choose the 'utf-8' encoding as you
have done.
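
A rough sketch of that read/process/write pattern follows. The function
name transcode, its parameters and the file paths are hypothetical;
io.open with an encoding argument is the standard-library call assumed
here (it exists in Python 2.6+ and Python 3). The source encoding must
still be supplied by whoever knows how the file was written.

```python
import io

def transcode(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Read src_path as src_encoding and write it out as dst_encoding."""
    # Decode bytes to Unicode text on the way in.
    # newline="" leaves line endings untouched.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    # Any processing would happen here, on Unicode text.
    # Encode the text back to bytes on the way out.
    with io.open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
    return text
```

For example, transcode("in.txt", "out.txt", "latin-1") would rewrite a
Latin-1 file as UTF-8, assuming the file really is Latin-1.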

Have you read this excellent article on understanding the programming
implications of character sets and Unicode?

"The Absolute Minimum Every Software Developer Absolutely,
Positively Must Know About Unicode and Character Sets (No
Excuses!)"
<URL:http://www.joelonsoftware.com/articles/Unicode.html>

--
Ben Finney

From: manstey on
Hi Martin,

Thanks very much. Your def comma_separated_utf8(items): approach raises
an exception in codecs.py, so I tried = u", ".join(word_info + parse +
gloss), which works perfectly. So I want to understand exactly why this
works. word_info and parse and gloss are all tuples. Does str convert
the three into an ASCII string? But the join method retains their
Unicode status.

In the text file, the unicode characters appear perfectly, so I'm very
happy.

cheers
matthew

From: Martin v. Löwis on
manstey wrote:
> Thanks very much. Your def comma_separated_utf8(items): approach raises
> an exception in codecs.py, so I tried = u", ".join(word_info + parse +
> gloss), which works perfectly. So I want to understand exactly why this
> works. word_info and parse and gloss are all tuples. Does str convert
> the three into an ASCII string?

Correct: str() converts a tuple into a string of the form (contents),
where contents is the comma-separated repr() of each tuple element.
repr(a_unicode_string) produces an ASCII-only string that uses \x or \u
escapes for the non-ASCII characters.
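
A small sketch of this, in Python 3 notation (an assumption; under
Python 2, repr() of a unicode object also carries a u prefix and
escapes non-ASCII characters itself). The tuple contents are made up
for illustration.

```python
# A tuple mixing ASCII and non-ASCII text (U+0254 is outside ASCII).
parse = (u"nc", u"f", u"\u0254")

# str() of a tuple is "(" + comma-separated repr() of each element + ")".
# (A one-element tuple additionally gets a trailing comma.)
as_string = str(parse)
rebuilt = "(" + ", ".join(repr(item) for item in parse) + ")"

# ascii() is Python 3's escaped form, close to what Python 2's repr()
# produced for unicode objects (minus the u prefix).
escaped = ascii(parse[2])
```

Either way, the result describes the tuple rather than containing its
text, which is why writing str(some_tuple) to a file does not give the
original characters.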

> But the join method retains their Unicode status.

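Correct. The result is a Unicode string if the joiner is a Unicode
string and all tuple elements are Unicode strings. If an element is a
byte string instead, a conversion to Unicode is attempted using the
default encoding (normally ASCII), which fails for non-ASCII bytes.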
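
One way to avoid relying on that implicit conversion is to decode any
byte strings explicitly before joining. The sketch below is in Python 3
notation (an assumption; Python 3 attempts no conversion at all and
raises TypeError when bytes are mixed in). The helper to_text and the
tuple contents are hypothetical, and the choice of 'utf-8' for the
byte-string elements is likewise an assumption.

```python
def to_text(item, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass text through."""
    if isinstance(item, bytes):
        return item.decode(encoding)
    return item

word_info = (u"gn", u"1")
parse = (u"nc", u"f")
gloss = (u"b.:cv", b"\xc9\x94")   # last element is UTF-8 bytes, not text

# "+" concatenates the tuples; join then works on the individual elements.
items = word_info + parse + gloss
line = u", ".join(to_text(item) for item in items)
```

The joined line is then plain Unicode text that can be written through a
file object opened with the 'utf-8' encoding.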

> In the text file, the unicode characters appear perfectly, so I'm very
> happy.

Glad it works.

Regards,
Martin