From: Richard Schulman
On 10 Sep 2006 15:27:17 -0700, "John Machin" <sjmachin(a)lexicon.net>
wrote:

>...
>Encode each Unicode text field in UTF-8. Write the file as a CSV file
>using Python's csv module. Read the CSV file using the same module.
>Decode the text fields from UTF-8.
>
>You need to parse the incoming line into column values (the csv module
>does this for you) and then convert each column value from
>string/Unicode to a Python type that is compatible with the Oracle type
>for that column.
>...

John, how am I to reconcile your suggestions above with my
ActivePython 2.4 documentation, which states:

<<12.20 csv -- CSV File Reading and Writing
<<New in version 2.3.
...
<<Note: This version of the csv module doesn't support Unicode input.
Also, there are currently some issues regarding ASCII NUL characters.
Accordingly, all input should generally be printable ASCII to be safe.
These restrictions will be removed in the future.>>

Regards,
Richard Schulman
From: John Machin
Richard Schulman wrote:
> On 10 Sep 2006 15:27:17 -0700, "John Machin" <sjmachin(a)lexicon.net>
> wrote:
>
> >...
> >Encode each Unicode text field in UTF-8. Write the file as a CSV file
> >using Python's csv module. Read the CSV file using the same module.
> >Decode the text fields from UTF-8.
> >
> >You need to parse the incoming line into column values (the csv module
> >does this for you) and then convert each column value from
> >string/Unicode to a Python type that is compatible with the Oracle type
> >for that column.
> >...
>
> John, how am I to reconcile your suggestions above with my
> ActivePython 2.4 documentation, which states:
>
> <<12.20 csv -- CSV File Reading and Writing
> <<New in version 2.3.
> ...
> <<Note: This version of the csv module doesn't support Unicode input.
> Also, there are currently some issues regarding ASCII NUL characters.
> Accordingly, all input should generally be printable ASCII to be safe.
> These restrictions will be removed in the future.>>
>

1. For "Unicode" read "UTF-16" -- that's the encoding the note is
really warning about, since it's full of NUL bytes; UTF-8-encoded byte
strings go through the module without trouble.

2. Unless you have \u0000 in your Unicode data, encoding it into UTF-8
won't cause any ASCII NUL bytes to appear. Ensuring that you don't have
NULs in your data is a good idea in general.
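
To make the recipe concrete, here is a minimal sketch of the
encode-write / read-decode round trip (file name and sample data are
made up; Python 2.4 syntax):

import csv

rows = [[u'caf\xe9', u'na\xefve'], [u'\u65e5\u672c\u8a9e', u'plain ascii']]

f = open('demo.csv', 'wb')  # binary mode, as the csv docs recommend
writer = csv.writer(f)
for row in rows:
    # encode each Unicode field to UTF-8 so csv sees only byte strings
    writer.writerow([field.encode('utf-8') for field in row])
f.close()

f = open('demo.csv', 'rb')
for row in csv.reader(f):
    # decode each field back from UTF-8 to Unicode
    print [field.decode('utf-8') for field in row]
f.close()

None of the UTF-8 bytes is a NUL (given no \u0000 in the data), so the
module's NUL caveat doesn't bite.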

3. There are also evidently some issues regarding ASCII LF characters
embedded in fields (as when Windows Excel users press Alt-Enter to put
a hard line break in a heading); see
http://docs.python.org/dev/whatsnew/modules.html, from which the
following is an extract (a reading sketch follows it):
"""
The CSV parser is now stricter about multi-line quoted fields.
Previously, if a line ended within a quoted field without a terminating
newline character, a newline would be inserted into the returned field.
This behavior caused problems when reading files that contained
carriage return characters within fields, so the code was changed to
return the field without inserting newlines. As a consequence, if
newlines embedded within fields are important, the input should be
split into lines in a manner that preserves the newline characters.
"""

4. Provided your fields don't contain any of CR, LF, ctrl-Z (maybe),
and NUL, you should be OK. I can't understand the sentence
"Accordingly, all input should generally be printable ASCII to be
safe." -- especially the "accordingly". If the module were running
amok on 8-bit characters (ord(c) >= 128), there would have been loud
shrieks from several corners of the globe by now.

5. However, to be extra safe, you could go one step further and
convert the UTF-8 to base64 -- see
http://docs.python.org/dev/lib/module-base64.html -- BUT once you've
done that, your encoded data doesn't even contain commas or quotes, so
you can bypass the maybe-unsafe csv module and just write each record
as ",".join(base64_encoded_fields).

HTH,
John


