From: Baz Walter on
i am using python 2.6 on a linux box and i have some utf-16 encoded
files with crlf line-endings which i would like to open with universal
newlines.

so far, i have been unable to get this to work correctly.

for example:

>>> open('test.txt', 'w').write(u'a\r\nb\r\n'.encode('utf-16'))
>>> repr(open('test.txt', 'rbU').read().decode('utf-16'))
"u'a\\n\\nb\\n\\n'"
>>> import codecs
>>> repr(codecs.open('test.txt', 'rbU', 'utf-16').read())
"u'a\\n\\nb\\n\\n'"

of course, the output i want is:

"u'a\\nb\\n'"

i suppose it's not too surprising that the built-in open converts the
line endings before decoding, but it surprised me that codecs.open does
this as well.

is there a way to get universal newlines to work properly with utf-16 files?

(nb: i'm not interested in other methods of converting line endings -
just whether universal newlines can be made to work correctly).
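
A minimal sketch of why the doubled newlines appear. It assumes little-endian UTF-16 and approximates the byte-level 'U' translation with two replace() calls. That approximation and the names raw, translated and result are illustrative only, not how the interpreter implements it:

```python
# In UTF-16-LE, u'\r\n' is stored as the bytes 0d 00 0a 00, so the
# CR and LF bytes are never adjacent.
raw = u'a\r\nb\r\n'.encode('utf-16-le')

# Rough stand-in for universal newlines applied to raw bytes:
# CRLF -> LF first (matches nothing here), then any lone CR -> LF.
translated = raw.replace(b'\r\n', b'\n').replace(b'\r', b'\n')

# Each former CR is now a second LF once the bytes are decoded.
result = translated.decode('utf-16-le')
print(repr(result))
```

Because the translation sees a lone CR byte followed by a NUL, it rewrites every CR as LF, and decoding afterwards yields two newlines per line ending.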
From: Stefan Behnel on
Baz Walter, 11.04.2010 16:12:
> i am using python 2.6 on a linux box and i have some utf-16 encoded
> files with crlf line-endings which i would like to open with universal
> newlines.
>
> so far, i have been unable to get this to work correctly.
>
> for example:
>
> >>> open('test.txt', 'w').write(u'a\r\nb\r\n'.encode('utf-16'))
> >>> repr(open('test.txt', 'rbU').read().decode('utf-16'))
> "u'a\\n\\nb\\n\\n'"
> >>> import codecs
> >>> repr(codecs.open('test.txt', 'rbU', 'utf-16').read())
> "u'a\\n\\nb\\n\\n'"
>
> of course, the output i want is:
>
> "u'a\\nb\\n'"
>
> i suppose it's not too surprising that the built-in open converts the
> line endings before decoding, but it surprised me that codecs.open does
> this as well.

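The codecs module does not support universal newline parsing (see the
docs): codecs.open() always opens the underlying file in binary mode, so
in 2.6 the 'U' flag is applied to the raw bytes before they are decoded.
You need to use the new io module instead; io.open() decodes first and
then translates newlines in the resulting text.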

Stefan
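
A short sketch of the io-based approach suggested above, assuming Python 2.6 or later. The helper name read_text_universal and the temporary file are hypothetical, introduced only for illustration. io.open() decodes the bytes first and then applies universal newlines to the text; newline=None is the default and is spelled out here only for clarity:

```python
import io
import os
import tempfile


def read_text_universal(path, encoding='utf-16'):
    """Hypothetical helper: decode first, then translate newlines."""
    # newline=None turns on universal newlines for the decoded text.
    with io.open(path, 'r', encoding=encoding, newline=None) as f:
        return f.read()


# Write a UTF-16 file with CRLF line endings, in binary mode so the
# bytes reach the disk untouched.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'wb') as f:
    f.write(u'a\r\nb\r\n'.encode('utf-16'))

text = read_text_universal(path)
os.remove(path)
print(repr(text))
```

Passing newline='' instead would leave the CRLF pairs untranslated in the returned text.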

From: Baz Walter on
On 11/04/10 15:37, Stefan Behnel wrote:
> The codecs module does not support universal newline parsing (see the
> docs). You need to use the new io module instead.

thanks.

i'd completely overlooked the io module - i thought it was only in
python 2.7/3.x.

From: Antoine Pitrou on
Le Sun, 11 Apr 2010 16:16:45 +0100, Baz Walter a écrit :
> On 11/04/10 15:37, Stefan Behnel wrote:
>> The codecs module does not support universal newline parsing (see the
>> docs). You need to use the new io module instead.
>
> thanks.
>
> i'd completely overlooked the io module - i thought it was only in
> python 2.7/3.x.

To be precise, the 2.6 version is a slow one, written in pure Python (and
it might be a bit less debugged too). But codecs.open() is slow, too.