From: tom.rmadilo on
On Aug 4, 2:45 am, Uwe Klein <uwe_klein_habertw...(a)t-online.de> wrote:
> Fredrik Karlsson wrote:
> > Hi!
>
> > Thank you very much for your answer. However, my question is simpler
> > than that. I understand that UTF-16 is generally tricky because
> > there may not be a BOM.
> > In my application, however, I know that the data to be processed is
> > most likely UTF-8, but may also be UTF-16 with a BOM. So, what I need
> > is just a safe and robust way of checking the first two elements in
> > the file. This is basically what I have come up with:
>
> > ---
> > set infile [open utf16.TextGrid]
>
> > fconfigure $infile -encoding utf-8
>
> > set cont  [read $infile]
>
> > if {[string equal -length 2 $cont "þÿ"] || [string equal -length 2
> > $cont "ÿþ"]} {
> >    puts "UTF-16"
> > } else {
> >        puts "UTF-8"
> > }
>
> > close $infile
> > ---
>
> > Is this safe? What else can I do?
>
> > /Fredrik
>
> proc determine_encoding file {
>         set infile [open $file]
>         fconfigure $infile -translation binary
>
>         set head [ read $infile 4 ]
>         close $infile
>         binary scan $head H8 hhex
>
>         # ref from:http://en.wikipedia.org/wiki/Byte_Order_Mark
>
>         # binary scan H yields lowercase hex digits, so normalize the case
>         switch -glob -- [string toupper $hhex] \
>                   FFFE* {
>                         return utf-16-LE
>                 } FEFF* {
>                         return utf-16-BE
>                 } 0000FEFF {
>                         return utf-32-BE
>                 } EFBBBF* {
>                         return utf-8
>                 } .... {
>                         # insert other encodings/filetypes...
>                 } default {
>                         return utf-8 ;# ??
>                 }
>
> }

Looks good to me. Note that Uwe has configured the channel in binary
mode, which is critical: you need the raw bytes, before any encoding
conversion has touched them. Depending on the application, you may
need or want to remove the BOM. UTF-8, for instance, doesn't need one,
since its byte order is fixed, although some tools write EF BB BF as a
signature anyway. Also, see the reference cited in the proc above; it
gives some hints and links to other resources that explain the many
problems associated with the BOM. It is basically an application-level
issue and probably shouldn't be hard-coded into your channel code.
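
If you do decide to strip the BOM at the application level, something
along these lines might work. Treat it as a sketch only: bom_info and
read_without_bom are names I made up, the labels are plain strings
rather than Tcl encoding names, and the signature table is the one
from the Wikipedia page referenced in the proc. A file starting with
FF FE 00 00 is ambiguous (UTF-32-LE, or UTF-16-LE followed by a NUL
character); the sketch assumes UTF-32-LE.

```tcl
# Hypothetical helper: given the first bytes of a file, return a
# two-element list {label bomLength}. bomLength is 0 if no BOM is found.
proc bom_info {head} {
    set hex ""
    binary scan $head H* hex   ;# lowercase hex digits
    # Longer signatures come first so that FF FE 00 00 is not
    # mistaken for the two-byte UTF-16-LE signature.
    foreach {prefix label len} {
        0000feff utf-32-BE 4
        fffe0000 utf-32-LE 4
        efbbbf   utf-8     3
        feff     utf-16-BE 2
        fffe     utf-16-LE 2
    } {
        if {[string match ${prefix}* $hex]} {
            return [list $label $len]
        }
    }
    return [list unknown 0]
}

# Hypothetical helper: read a file as raw bytes and drop the BOM, if any.
# Returns {label data}; decoding the data is left to the caller.
proc read_without_bom {file} {
    set f [open $file r]
    fconfigure $f -translation binary
    set head [read $f 4]
    lassign [bom_info $head] label len
    seek $f $len start         ;# reposition just past the BOM
    set data [read $f]
    close $f
    return [list $label $data]
}
```

Usage would be something like
lassign [read_without_bom utf16.TextGrid] label data
after which you can pass the bytes to encoding convertfrom with
whatever encoding name your Tcl version provides for that label. If
the label comes back as unknown, falling back to UTF-8 is a guess,
just as in the default branch of the proc.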