From: Mihai N. on

> the fastest and simplest possible way
> to validate and divide any UTF-8 sequence into its constituent code
> point parts is a regular expression implemented as a finite state
> machine

Sorry, where did you get this one from?

Mihai Nita [Microsoft MVP, Visual C++]
Replace _year_ with _ to get the real email

From: Joshua Maurice on
On May 18, 12:38 am, "Mihai N." <nmihai_year_2...(a)> wrote:
> Then most likely wrong :-)

Yes. It was there just for demonstration purposes on how easy the code
is, and how I might consider "regex" and "state machine libraries" or
whatever to be overkill. I will wait patiently for his code and
compare it to what I whipped up off the top of my head.

From: James Kanze on
On 16 May, 14:51, Öö Tiib <oot...(a)> wrote:
> On 16 mai, 15:34, "Peter Olcott" <NoS...(a)> wrote:

> I suspect UTF-8 will gradually fade into history, for reasons
> similar to why 256-color video modes and raster-graphic formats
> went away. GUIs are already often written in Java or C# (for lack
> of C++ devs), and these use UTF-16 internally. Notice that
> modern processor architectures are already optimized in such a
> way that byte-level operations are often slower.

The network is still 8-bit UTF-8, as are the disks; using
UTF-16 on an external medium simply doesn't work.

Also, UTF-8 may result in less memory use, and thus less paging.

If all you're doing are simple operations, searching for a few
ASCII delimiters and copying the delimited substrings, for
example, UTF-8 will probably be significantly faster: the CPU
will always read a word at a time, even if you access it byte by
byte, and you'll usually get more characters per word using UTF-8.
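
[A minimal sketch of the kind of delimiter search described above, not code from the original post. The function name split_utf8 is hypothetical. It assumes the delimiter is an ASCII character; the byte-wise scan is safe because every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte below 0x80 can only be a complete ASCII character.]

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split UTF-8 text on an ASCII delimiter by scanning raw bytes.
// No decoding is needed: lead and continuation bytes of multi-byte
// sequences are all >= 0x80, so they can never compare equal to an
// ASCII delimiter. 'delim' is assumed to be ASCII (below 0x80).
std::vector<std::string> split_utf8(const std::string& text, char delim)
{
    std::vector<std::string> fields;
    std::size_t start = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == delim) {
            // Copy the bytes between the previous delimiter and this one.
            fields.push_back(text.substr(start, i - start));
            start = i + 1;
        }
    }
    // The last field runs to the end of the input (it may be empty).
    fields.push_back(text.substr(start));
    return fields;
}
```

[Splitting the UTF-8 bytes of "naïve,café,€" on ',' gives three fields, each still valid UTF-8, without the code ever looking at code point boundaries.]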

If you need full and complete support, as in an editor, for
example, UTF-32 is the best general solution. For a lot of
things in between, UTF-16 is a good compromise.

But the trade-offs only concern internal representation.
Externally, the world is 8 bits, and UTF-8 is the only solution.

James Kanze

From: Oliver Regenfelder on

Peter Olcott wrote:
> I completed the detailed design on the DFA that would validate and
> translate any valid UTF-8 byte sequence into UTF-32. It can not be done
> faster or simpler. The state transition matrix only takes exactly 2 KB.

Who cares about DFAs and state transition matrix sizes when all you want
to do is convert UTF-8 to UTF-32? That is just a few if/else and switch
statements in your programming language of choice, plus error handling.
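
[A minimal sketch of what such an if/else converter might look like, not code posted by anyone in this thread. The name utf8_to_utf32 and the choice to stop at the first error are assumptions made for illustration. It rejects truncated sequences, bad continuation bytes, overlong forms, surrogates, and values above U+10FFFF.]

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-32 using plain branches.
// Returns false at the first malformed sequence; 'out' then holds the
// code points decoded before the error.
bool utf8_to_utf32(const std::string& in, std::u32string& out)
{
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp;      // code point being assembled
        std::size_t len;  // total bytes in this sequence
        char32_t min;     // smallest value allowed for this length

        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (in.size() - i < len) return false;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min) return false;                      // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate

        out.push_back(cp);
        i += len;
    }
    return true;
}
```

[Whether this is faster or slower than a table-driven DFA is something only measurement can settle; the sketch is only meant to show how little code the conversion needs.]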

Best regards,

From: Oliver Regenfelder on

Peter Olcott wrote:
> Maybe it is much simpler for me than it would be for others because of
> my strong bias towards DFA recognizers.

I would say it is exactly the opposite. Your strong bias towards DFA
recognizers lets you completely forget about the current abstraction
level you are dealing with.

> I bet my DFA recognizer is at
> least twice as fast as any other method for validating UTF-8 and
> converting it to code points.
> I am estimating about 20 machine clocks
> per code point.

You might want to reread some of the postings regarding optimization
from the earlier threads.

Have you been a hardware engineer before by any chance?

Best regards,