what encoding is the encoding named "unicode"? [TCL]

From: Zhang Weiwu on 21 Jul 2010 00:32

Hello. Am I the first one to be confused by the encoding named "unicode"
in the output of this statement?

% encoding names

I first guessed that the name "unicode" was chosen because Windows uses
"Unicode" to mean "UTF-16LE with BOM": that is what you get when you
save a text file in Windows Notepad and choose "Unicode", or when you
save a spreadsheet as "Unicode Text" in Microsoft Excel. This
understanding seemed to be correct when I ran tclkit on Windows. I have
many data files in UTF-16LE, and in my Tcl script I do "fconfigure
-encoding unicode" before reading them, which works fine.

Today I downloaded tclkit 8.5.1 for Mac OS on X11 and ran my application
on Mac OS, and realized that the encoding name "unicode" has to be
interpreted differently there. In fact I had to convert my data files
from UTF-16LE to UTF-8 to make the same Tcl script run correctly.

In order to make the script run on both Windows and Mac OS, it seems the
only choice I have is to prepare the data in UTF-8 only and change the
script to always read files as UTF-8, which is not ambiguous. This is a
bit difficult because the data is produced on MS Windows, always
resulting in UTF-16LE. Adding a step to the already repetitive data
preparation workflow isn't nice. Does anyone have a better idea?

From: Donal K. Fellows on 21 Jul 2010 05:27

On Jul 21, 5:32 am, Zhang Weiwu <zhangweiwu+J...(a)realss.com> wrote:
> Hello. Am I the first one to be confused by the encoding named "unicode"
> in the output of this statement?
>
> % encoding names
>
> I first guessed that the name "unicode" was chosen because Windows uses
> "Unicode" to mean "UTF-16LE with BOM": that is what you get when you
> save a text file in Windows Notepad and choose "Unicode", or when you
> save a spreadsheet as "Unicode Text" in Microsoft Excel.

Tcl doesn't add a BOM (that's a feature of a file, not of a data
stream; a subtle difference I know) and produces characters in *host*
endianness; it also doesn't parse a BOM on input for you. It also only
handles characters in the BMP, but that's a general Tcl issue. (It's
also really ugly to fix properly since it requires deep changes to the
RE engine - the problems are character sets and what constitutes a
single character - and the addition of a normalization engine, and
there are licensing issues with some of the solutions people have
suggested in the past.)

I wish I had something better to report.

Donal.

From: Alexandre Ferrieux on 21 Jul 2010 06:20

On Jul 21, 11:27 am, "Donal K. Fellows"
<donal.k.fell...(a)manchester.ac.uk> wrote:
> On Jul 21, 5:32 am, Zhang Weiwu <zhangweiwu+J...(a)realss.com> wrote:
>
> > Hello. Am I the first one to be confused by the encoding named "unicode"
> > in the output of this statement?
>
> > % encoding names
>
> > I first guessed that the name "unicode" was chosen because Windows uses
> > "Unicode" to mean "UTF-16LE with BOM": that is what you get when you
> > save a text file in Windows Notepad and choose "Unicode", or when you
> > save a spreadsheet as "Unicode Text" in Microsoft Excel.
>
> Tcl doesn't add a BOM (that's a feature of a file, not of a data
> stream; a subtle difference I know) and produces characters in *host*
> endianness; it also doesn't parse a BOM on input for you. It also only
> handles characters in the BMP, but that's a general Tcl issue. (It's
> also really ugly to fix properly since it requires deep changes to the
> RE engine - the problems are character sets and what constitutes a
> single character - and the addition of a normalization engine, and
> there are licensing issues with some of the solutions people have
> suggested in the past.)
>
> I wish I had something better to report.

Donal, you do have an OSX Tcl at hand, don't you ?

On that platform, does [fconfigure -encoding unicode] read a UTF-16LE
file properly (assuming an x86 Mac, not an mc68k dinosaur ;-) or not,
when the characters are "not risky" (say ASCII)?

(The OP's wording makes it unclear whether there is really a platform-
specific issue, or just a few warts in a specific file with strange
characters or reversed byte order...)

-Alex

From: Joe English on 21 Jul 2010 20:15

Zhang Weiwu wrote:
>
> Hello. Am I the first one to be confused by the encoding named "unicode"
> in the output of this statement?
>
> % encoding names

No, you are not. The *first* one to be confused by the encoding
mistakenly called "unicode" in Tcl is the person who wrote the
code in the first place :-)

> I first guessed that the name "unicode" was chosen because Windows uses
> "Unicode" to mean "UTF-16LE with BOM": that is what you get when you
> save a text file in Windows Notepad and choose "Unicode", or when you
> save a spreadsheet as "Unicode Text" in Microsoft Excel. This
> understanding seemed to be correct when I ran tclkit on Windows. I have
> many data files in UTF-16LE, and in my Tcl script I do "fconfigure
> -encoding unicode" before reading them, which works fine.

Tcl's "unicode" encoding is actually UCS-2, which uses 16-bit
codepoints.

When serialized to octets, it'll either be UCS-2LE or UCS-2BE,
depending on the native byte order of the host computer.

(UCS-2[BE/LE] is a strict subset of UTF-16[BE/LE]. Since Tcl
doesn't recognize characters outside the BMP, the distinction
makes no real difference as far as Tcl is concerned.)
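
A quick way to see this on a given machine is to look at the raw
bytes. The following is only a sketch, assuming a Tcl 8.5-era
interpreter like the tclkit discussed in this thread; it uses nothing
beyond the built-in tcl_platform array, [encoding convertto] and
[binary scan]:

```tcl
# Ask Tcl which byte order the host uses.
puts "host byte order: $tcl_platform(byteOrder)"

# Encode a single ASCII character with the "unicode" encoding and
# dump the resulting bytes as hexadecimal.
set bytes [encoding convertto unicode "A"]
binary scan $bytes H* hex
puts "\"A\" in encoding unicode: $hex"
```

On a little-endian host (x86) I would expect the dump to read 4100,
and 0041 on a big-endian one, so the same script writes different
bytes on different machines.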

> Today I downloaded tclkit 8.5.1 for Mac OS on X11 and ran my application
> on Mac OS, and realized that the encoding name "unicode" has to be
> interpreted differently there. In fact I had to convert my data files
> from UTF-16LE to UTF-8 to make the same Tcl script run correctly.

That's consistent with what I'd expect. In Tcl, "unicode" is
compatible with UTF-16LE on Intel boxes, or with UTF-16BE everywhere
else. (Or maybe it's the other way around. I never remember.)

> In order to make the script run on both Windows and Mac OS, it seems the
> only choice I have is to prepare the data in UTF-8 only and change the
> script to always read files as UTF-8, which is not ambiguous.

That's the most sensible thing to do, if it's practical.

> This is a bit difficult because the data is produced on MS Windows,
> always resulting in UTF-16LE. Adding a step to the already repetitive
> data preparation workflow isn't nice. Does anyone have a better idea?

A better idea would be to add explicit "utf16le"/"utf16be" and/or
"ucs2le/be" encodings to Tcl.

(I'm somewhat surprised that that hasn't happened yet -- it's eminently
sensible -- probably just that nobody's gotten around to it yet.)
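
Until such encodings exist, a script can pin the byte order itself.
This is a minimal sketch, not a tested library routine: readUtf16le is
a hypothetical helper name, and it assumes the input really is UTF-16LE
with only BMP characters, optionally starting with a BOM. It reads the
file as binary, drops the BOM, swaps the byte pairs on big-endian
hosts, and only then hands the data to the host-order "unicode"
encoding:

```tcl
# Hypothetical helper: decode a UTF-16LE file the same way on any host.
proc readUtf16le {path} {
    global tcl_platform

    # Read the raw bytes; no encoding or end-of-line translation yet.
    set f [open $path r]
    fconfigure $f -translation binary
    set data [read $f]
    close $f

    # Drop a leading UTF-16LE byte order mark (bytes FF FE), if present.
    if {[binary scan $data H4 bom] == 1 && $bom eq "fffe"} {
        set data [string range $data 2 end]
    }

    # "unicode" decodes in host byte order, so on a big-endian host
    # re-pack the little-endian 16-bit units as big-endian ones.
    if {$tcl_platform(byteOrder) eq "bigEndian"} {
        binary scan $data s* units
        set data [binary format S* $units]
    }

    # CRLF line endings from Windows are left as they are.
    return [encoding convertfrom unicode $data]
}
```

Usage would look like [set text [readUtf16le data.txt]], where data.txt
stands for one of the files produced on Windows. The Windows workflow
stays untouched, at the cost of reading each file into memory in one
piece.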

--Joe English

From: Zhang Weiwu on 21 Jul 2010 21:50

>
> Tcl doesn't add a BOM (that's a feature of a file, not of a data
> stream; a subtle difference I know) and produces characters in *host*
> endianness
>
Do you mean that "unicode" as an encoding name means UTF-16LE on
Microsoft Windows, means something different (as I tested, UTF-8) on
Mac OS, and if I use it on a big-endian system, say Linux on MIPS, it
might mean UTF-16BE?

In that case, there should be a comment on the wiki or somewhere to warn
against using "unicode" as an encoding name in scripts, because most
developers would do as I did: test on Windows only (or only on their own
working system), decide that "unicode" must mean the same thing (in my
case, UTF-16LE) on other OSs and systems too, and write applications
that break on other OSs (I just did!).
