From: Alf P. Steinbach on
* Mark Tolonen:
>
> "Terry Reedy" <tjreedy(a)udel.edu> wrote in message
> news:hnjkuo$n16$1(a)dough.gmane.org...
> On 3/14/2010 4:40 PM, Guillermo wrote:
>> Adding the byte that some call a 'utf-8 bom' makes the file an invalid
>> utf-8 file.
>
> Not true. From http://unicode.org/faq/utf_bom.html:
>
> Q: When a BOM is used, is it only in 16-bit Unicode text?
> A: No, a BOM can be used as a signature no matter how the Unicode text
> is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising
> the BOM will be whatever the Unicode character FEFF is converted into by
> that transformation format. In that form, the BOM serves to indicate
> both that it is a Unicode file, and which of the formats it is in.
> Examples:
> Bytes         Encoding Form
> 00 00 FE FF   UTF-32, big-endian
> FF FE 00 00   UTF-32, little-endian
> FE FF         UTF-16, big-endian
> FF FE         UTF-16, little-endian
> EF BB BF      UTF-8

Technically true, and Terry was wrong to say "There is no such thing as a
utf-8 'byte order mark'. The concept is an oxymoron.". Read as a description,
"byte order mark" is indeed an oxymoron for UTF-8, which has no byte order to
mark. But in this context it's a name, not a description, and the BOM is not
only technically allowed, as you point out, but sometimes required.
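
For what it's worth, the table in the FAQ can be checked from Python itself. This
is only a quick sketch in Python 3 syntax, using the standard codecs module; the
sample string and the variable names are made up for illustration.

```python
import codecs

# Encoding U+FEFF in each form yields the signature bytes from the FAQ table.
assert "\ufeff".encode("utf-8") == codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert "\ufeff".encode("utf-16-le") == codecs.BOM_UTF16_LE == b"\xff\xfe"
assert "\ufeff".encode("utf-16-be") == codecs.BOM_UTF16_BE == b"\xfe\xff"
assert "\ufeff".encode("utf-32-le") == codecs.BOM_UTF32_LE == b"\xff\xfe\x00\x00"
assert "\ufeff".encode("utf-32-be") == codecs.BOM_UTF32_BE == b"\x00\x00\xfe\xff"

# A file with the signature is still valid UTF-8 (illustrative sample data).
data = codecs.BOM_UTF8 + "héllo".encode("utf-8")

plain = data.decode("utf-8")         # decodes fine; U+FEFF stays as first char
stripped = data.decode("utf-8-sig")  # same bytes, signature removed on decode
```

So the plain "utf-8" codec accepts the bytes and just hands back U+FEFF as the
first character, while "utf-8-sig" drops it.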

However, some tools are unable to process UTF-8 files with BOM.

The most annoying example is the GCC compiler suite, in particular g++, which in
its Windows MinGW manifestation insists on UTF-8 source code without BOM, while
Microsoft's compiler needs the BOM to recognize the file as UTF-8. The only way
I found to satisfy both compilers is to preprocess the source code, apart from
restricting the source to ASCII, or perhaps to Windows ANSI with wide character
literals restricted to ASCII (exploiting a bug in g++ that lets it handle narrow
character literals with non-ASCII chars). But preprocessing is not a general
solution either: the g++ preprocessor, via another bug, accepts some constructs
(which then compile nicely) that the compiler rejects when explicit
preprocessing isn't used. So it's a mess.
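
One could also imagine a build step that keeps a single source file and hands
each compiler a copy with or without the signature. To be clear, this is not
what I did, just a sketch of the idea in Python 3 syntax; the helper names
strip_bom and add_bom and the sample source line are hypothetical, and I have
not tried this in a real build.

```python
import codecs

def strip_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature, if it has one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def add_bom(data: bytes) -> bytes:
    """Return data with exactly one leading UTF-8 signature."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data

# Illustrative source text with a non-ASCII wide character literal.
source = 'const wchar_t* s = L"blåbær";\n'.encode("utf-8")

for_gcc = strip_bom(source)   # copy without signature, for a tool that rejects it
for_msvc = add_bom(source)    # copy with signature, for a tool that requires it
```

Both helpers work on raw bytes, so the rest of the file is passed through
untouched.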


Cheers,

- Alf