From: Guillermo on
Hi,

I would appreciate it if someone could point out what I am doing wrong
here.

Basically, I need to save a string containing non-ascii characters to
a file encoded in utf-8.

If I stay in python, everything seems to work fine, but the moment I
try to read the file with another Windows program, everything goes to
hell.

So here's the script unicode2file.py:
===================================================================
# encoding=utf-8
import codecs

f = codecs.open("m.txt",mode="w", encoding="utf8")
a = u"mañana"
print repr(a)
f.write(a)
f.close()

f = codecs.open("m.txt", mode="r", encoding="utf8")
a = f.read()
print repr(a)
f.close()
===================================================================

That gives the expected output: both calls to repr() yield the same
result.

But now, if I do type m.txt in cmd.exe, I get garbled characters
instead of "ñ".

I then open the file with my editor (Sublime Text), and I see "mañana"
normally. I save (nothing to be saved, really), go back to the dos
prompt, do type m.txt and I get again the same garbled characters.

I then open the file m.txt with notepad, and I see "mañana" normally.
I save (again, no actual modifications), go back to the dos prompt, do
type m.txt and this time it works! I get "mañana". When notepad opens
the file, the encoding is already UTF-8, so short of a UTF-8 bom being
added to the file, I don't know what happens when I save the
unmodified file. Also, I would think that the python script should
save a valid utf-8 file in the first place...

What's going on here?

Regards,
Guillermo
From: Neil Hodgson on
Guillermo:

> I then open the file m.txt with notepad, and I see "mañana" normally.
> I save (again, no actual modifications), go back to the dos prompt, do
> type m.txt and this time it works! I get "mañana". When notepad opens
> the file, the encoding is already UTF-8, so short of a UTF-8 bom being
> added to the file,

That is what happens: the file now starts with a BOM \xEF\xBB\xBF as
you can see with a hex editor.
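
If no hex editor is at hand, the same check can be sketched in Python. This is only an illustration: the helper name starts_with_utf8_bom and the temporary file are made up for the example, and io.open stands in for the codecs.open call in the original script (both write plain UTF-8 without a BOM).

```python
import codecs
import io
import os
import tempfile

def starts_with_utf8_bom(path):
    """Return True if the file begins with the UTF-8 BOM bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8

# Hypothetical demo file, written without a BOM like the original script does.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "m.txt")
with io.open(demo_path, mode="w", encoding="utf-8") as f:
    f.write(u"ma\u00f1ana")

print(starts_with_utf8_bom(demo_path))  # False: the plain utf-8 codec adds no BOM
```

Running the same helper on the file after Notepad has saved it should report True.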

> I don't know what happens when I save the
> unmodified file. Also, I would think that the python script should
> save a valid utf-8 file in the first place...

It's just as valid UTF-8 without a BOM. People have different opinions
on this but for compatibility, I think it is best to always start UTF-8
files with a BOM.
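
If you do want the BOM, one way to get it from Python is the "utf-8-sig" codec, which writes the BOM on output and strips it on input. A minimal sketch, assuming that codec is available (it is in the standard library from Python 2.5 on); the function names and the temporary path are invented for the example:

```python
import codecs
import io
import os
import tempfile

def write_utf8_with_bom(path, text):
    """Write text as UTF-8 preceded by a BOM, via the 'utf-8-sig' codec."""
    with io.open(path, mode="w", encoding="utf-8-sig") as f:
        f.write(text)

def read_utf8_any(path):
    """Read UTF-8 text; 'utf-8-sig' drops a leading BOM if one is present."""
    with io.open(path, mode="r", encoding="utf-8-sig") as f:
        return f.read()

# Hypothetical demo path, not the m.txt from the original script.
bom_dir = tempfile.mkdtemp()
bom_path = os.path.join(bom_dir, "m_bom.txt")
write_utf8_with_bom(bom_path, u"ma\u00f1ana")

with open(bom_path, "rb") as f:
    raw = f.read()

print(raw.startswith(codecs.BOM_UTF8))            # True: file starts with EF BB BF
print(read_utf8_any(bom_path) == u"ma\u00f1ana")  # True: BOM is not part of the text
```

Reading with "utf-8-sig" also works on BOM-less files, so it is a reasonable default when you do not know which kind you will be given.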

Neil
From: Guillermo on
>    That is what happens: the file now starts with a BOM \xEF\xBB\xBF as
> you can see with a hex editor.

Is this an enforced convention under Windows, then? My head's aching
after so much pulling at my hair, but I have the feeling that the
problem only arises when text travels through the dos console...

Cheers,
Guillermo
From: Joaquin Abian on
On 14 Mar, 22:22, Guillermo <guillermo.lis...(a)googlemail.com> wrote:
> >    That is what happens: the file now starts with a BOM \xEF\xBB\xBF as
> > you can see with a hex editor.
>
> Is this an enforced convention under Windows, then? My head's aching
> after so much pulling at my hair, but I have the feeling that the
> problem only arises when text travels through the dos console...
>
> Cheers,
> Guillermo

Search for BOM on Wikipedia.
The article there talks about Notepad's behavior.

ja
From: Neil Hodgson on
Guillermo:

> Is this an enforced convention under Windows, then? My head's aching
> after so much pulling at my hair, but I have the feeling that the
> problem only arises when text travels through the dos console...

The console commonly uses code page 437, which is the most compatible
with old DOS programs since it can display line drawing characters. You
can change the code page to UTF-8 with
chcp 65001
Now, "type m.txt" with the original BOM-less file and it should be
OK. You may also need to change the console font to one that is Unicode
compatible like Lucida Console.
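
To see why the BOM-less file looks garbled under code page 437, you can reproduce the console's misreading in Python. This is just a sketch of the mechanism, assuming the console really is on cp437 (other OEM code pages give different junk); the variable names are made up for the example.

```python
# "manana" with n-tilde, written with an escape so the source file encoding does not matter.
text = u"ma\u00f1ana"

# The bytes the script puts in m.txt: n-tilde becomes the two bytes C3 B1.
utf8_bytes = text.encode("utf-8")

# What a cp437 console makes of those bytes: one glyph per byte.
as_cp437 = utf8_bytes.decode("cp437")

# What a UTF-8 console (chcp 65001) makes of the same bytes.
as_utf8 = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes))
print(as_cp437 == u"ma\u251c\u2592ana")  # True: C3 B1 shows as two box/shade glyphs
print(as_utf8 == text)                   # True: decoded as UTF-8, the text is intact
```

So the file itself is fine; only the code page used to interpret it differs.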

Neil