From: Mihai N. on 18 May 2010 03:44
> the fastest and simplest possible way
> to validate and divide any UTF-8 sequence into its constituent code
> point parts is a regular expression implemented as a finite state
Sorry, where did you get this one from?
Mihai Nita [Microsoft MVP, Visual C++]
Replace _year_ with _ to get the real email
From: Joshua Maurice on 18 May 2010 05:51
On May 18, 12:38 am, "Mihai N." <nmihai_year_2...(a)yahoo.com> wrote:
> > //COMPLETELY UNTESTED
> Then most likely wrong :-)
Yes. It was there just for demonstration purposes, to show how easy the
code is, and why I might consider "regex" and "state machine libraries"
or whatever to be overkill. I will wait patiently for his code and
compare it to what I whipped off the top of my head.
From: James Kanze on 18 May 2010 10:18
On 16 May, 14:51, Öö Tiib <oot...(a)hot.ee> wrote:
> On 16 mai, 15:34, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:
> I suspect UTF-8 will gradually fade into history, for reasons
> similar to why 256-color video modes and raster-graphics formats
> went away. GUIs are already often made with Java or C# (for lack
> of C++ devs), and these use UTF-16 internally. Notice that
> modern processor architectures are already optimized in such a
> way that byte-level operations are often slower.
The network is still 8 bits, and UTF-8. As are the disks; using
UTF-16 on an external medium simply doesn't work.
Also, UTF-8 may result in less memory use, and thus less paging.
If all you're doing is simple operations, searching for a few
ASCII delimiters and copying the delimited substrings, for
example, UTF-8 will probably be significantly faster: the CPU
will always read a word at a time, even if you access it byte by
byte, and you'll usually get more characters per word with UTF-8.
If you need full and complete support, as in an editor, for
example, UTF-32 is the best general solution. For a lot of
things in between, UTF-16 is a good compromise.
But the trade-offs only concern internal representation.
Externally, the world is 8 bits, and UTF-8 is the only solution.
From: Oliver Regenfelder on 18 May 2010 10:26
Peter Olcott wrote:
> I completed the detailed design on the DFA that would validate and
> translate any valid UTF-8 byte sequence into UTF-32. It can not be done
> faster or simpler. The state transition matrix only takes exactly 2 KB.
Who cares about DFAs and state transition matrix sizes when all you want
to do is convert UTF-8 to UTF-32? That's just some if/else and switch
statements in your programming language of choice, plus error handling.
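(For what it's worth, a hedged sketch of the plain if/else approach, decoding one code point with full validation. Not from the thread; the name and signature are illustrative.)

```cpp
#include <cstdint>
#include <cstddef>

// Decode one UTF-8 sequence starting at p (n bytes available) into cp.
// Returns the sequence length in bytes, or 0 on invalid input.
// Rejects bad continuation bytes, overlong forms, surrogates, and
// values above U+10FFFF.
std::size_t decode_utf8(const unsigned char* p, std::size_t n,
                        std::uint32_t& cp) {
    if (n == 0) return 0;
    if (p[0] < 0x80) { cp = p[0]; return 1; }   // ASCII fast path
    std::size_t len;
    if      ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; len = 4; }
    else return 0;                              // stray continuation / bad lead
    if (n < len) return 0;                      // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        if ((p[i] & 0xC0) != 0x80) return 0;    // not a continuation byte
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[len]) return 0;             // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; // UTF-16 surrogate
    if (cp > 0x10FFFF) return 0;                // beyond Unicode range
    return len;
}
```

Looping this over a buffer and appending each `cp` to an output array is the whole UTF-8-to-UTF-32 conversion.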
From: Oliver Regenfelder on 18 May 2010 10:29
Peter Olcott wrote:
> Maybe it is much simpler for me than it would be for others because of
> my strong bias towards DFA recognizers.
I would say it is exactly the opposite. Your strong bias towards DFA
recognizers lets you completely forget about the current abstraction
level you are dealing with.
> I bet my DFA recognizer is at
> least twice as fast as any other method for validating UTF-8 and
> converting it to code points.
> I am estimating about 20 machine clocks
> per code point.
You might want to reread some of the postings regarding optimization
from the earlier threads.
Have you been a hardware engineer before by any chance?
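(For reference, table-driven DFA validators of the kind Peter describes do exist; Björn Höhrmann published a well-known one. A sketch along those lines, reconstructed for this note rather than taken from the thread, using a character-class function plus a small state table:)

```cpp
#include <cstdint>
#include <cstddef>

// Map a byte to its character class (0..11). Classes distinguish the
// lead bytes whose following continuation byte is range-restricted
// (E0, ED, F0, F4), which is how overlongs and surrogates are caught.
static int u8class(unsigned char b) {
    if (b < 0x80) return 0;     // ASCII
    if (b < 0x90) return 1;     // continuation 80-8F
    if (b < 0xA0) return 2;     // continuation 90-9F
    if (b < 0xC0) return 3;     // continuation A0-BF
    if (b < 0xC2) return 4;     // C0-C1: always overlong, invalid
    if (b < 0xE0) return 5;     // C2-DF: 2-byte lead
    if (b == 0xE0) return 6;    // next byte must be A0-BF
    if (b == 0xED) return 8;    // next byte must be 80-9F (no surrogates)
    if (b < 0xF0) return 7;     // E1-EC, EE-EF: normal 3-byte lead
    if (b == 0xF0) return 9;    // next byte must be 90-BF
    if (b < 0xF4) return 10;    // F1-F3: normal 4-byte lead
    if (b == 0xF4) return 11;   // next byte must be 80-8F (cap 10FFFF)
    return 4;                   // F5-FF: invalid
}

// States: 0 accept, 1 need one continuation, 2 need two continuations,
// 3 after E0, 4 after ED, 5 after F0, 6 after F1-F3, 7 after F4, 8 error.
static const unsigned char dfa[9][12] = {
    // ascii 80-8F 90-9F A0-BF bad C2-DF E0 E1.. ED F0 F1.. F4
    {  0,    8,    8,    8,    8,  1,    3, 2,   4, 5, 6,   7 }, // 0
    {  8,    0,    0,    0,    8,  8,    8, 8,   8, 8, 8,   8 }, // 1
    {  8,    1,    1,    1,    8,  8,    8, 8,   8, 8, 8,   8 }, // 2
    {  8,    8,    8,    1,    8,  8,    8, 8,   8, 8, 8,   8 }, // 3
    {  8,    1,    1,    8,    8,  8,    8, 8,   8, 8, 8,   8 }, // 4
    {  8,    8,    2,    2,    8,  8,    8, 8,   8, 8, 8,   8 }, // 5
    {  8,    2,    2,    2,    8,  8,    8, 8,   8, 8, 8,   8 }, // 6
    {  8,    2,    8,    8,    8,  8,    8, 8,   8, 8, 8,   8 }, // 7
    {  8,    8,    8,    8,    8,  8,    8, 8,   8, 8, 8,   8 }, // 8
};

// Validate a whole buffer: true iff it is well-formed UTF-8.
bool valid_utf8(const unsigned char* p, std::size_t n) {
    int state = 0;
    for (std::size_t i = 0; i < n; ++i)
        state = dfa[state][u8class(p[i])];
    return state == 0;
}
```

Whether this beats the if/else version by the claimed factor is exactly the kind of thing that needs measuring, not estimating; the inner loop is one table lookup per byte either way.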