New utf8string design may make UTF-8 the superior encoding [MFC]

Prev: UTF-8 string in MBCS project
Next: Love Potion for Miss Blandish

From: Mihai N. on 18 May 2010 03:44

> the fastest and simplest possible way
> to validate and divide any UTF-8 sequence into its constituent code
> point parts is a regular expression implemented as a finite state
> machine

Sorry, where did you get this one from?

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Joshua Maurice on 18 May 2010 05:51

On May 18, 12:38 am, "Mihai N." <nmihai_year_2...(a)yahoo.com> wrote:
> > //COMPLETELY UNTESTED
>
> Then most likely wrong :-)

Yes. It was there just to demonstrate how easy the code is, and why I
might consider "regex" and "state machine libraries" or whatever to be
overkill. I will wait patiently for his code and compare it to what I
whipped up off the top of my head.

From: James Kanze on 18 May 2010 10:18

On 16 May, 14:51, Öö Tiib <oot...(a)hot.ee> wrote:
> On 16 mai, 15:34, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:

> I suspect UTF-8 will gradually fade into history, for much the
> same reasons that 256-color video modes and raster-graphic
> formats did. GUIs are already often written in Java or C# (for
> lack of C++ devs), and these use UTF-16 internally. Note that
> modern processor architectures are already optimized in such a
> way that byte-level operations are often slower.

The network is still 8 bits, and so UTF-8. So are the disks; using
UTF-16 on an external medium simply doesn't work.

Also, UTF-8 may result in less memory use, and thus less paging.

If all you're doing are simple operations, searching for a few
ASCII delimiters and copying the delimited substrings, for
example, UTF-8 will probably be significantly faster: the CPU
will always read a word at a time, even if you access it byte by
byte, and you'll usually get more characters per word using
UTF-8.
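
To illustrate the kind of simple operation meant here, the following is a minimal sketch (not code from this thread) of splitting a UTF-8 string on an ASCII delimiter. The function name split_utf8 is hypothetical. It relies on one property of UTF-8: every byte of a multi-byte sequence has its high bit set, so a plain byte search for an ASCII character can never match in the middle of an encoded character.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits UTF-8 text on a single ASCII delimiter
// (a byte below 0x80). No decoding is needed, because bytes belonging
// to multi-byte sequences are all >= 0x80 and cannot equal the delimiter.
std::vector<std::string> split_utf8(const std::string& text, char delim)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type pos = text.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(text.substr(start));  // last (or only) field
            return fields;
        }
        fields.push_back(text.substr(start, pos - start));
        start = pos + 1;  // skip the one-byte delimiter
    }
}
```

The substrings come out as valid UTF-8 as long as the input was, and the same loop on UTF-16 or UTF-32 would touch two or four times as many bytes for mostly-ASCII text.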

If you need full and complete support, as in an editor, for
example, UTF-32 is the best general solution. For a lot of
things in between, UTF-16 is a good compromise.

But the trade-offs only concern internal representation.
Externally, the world is 8 bits, and UTF-8 is the only solution.

--
James Kanze

From: Oliver Regenfelder on 18 May 2010 10:26

Hello,

Peter Olcott wrote:
> I completed the detailed design on the DFA that would validate and
> translate any valid UTF-8 byte sequence into UTF-32. It can not be done
> faster or simpler. The state transition matrix only takes exactly 2 KB.

Who cares about DFAs and state transition matrix sizes when all you want
to do is convert UTF-8 to UTF-32? That is just a few if/else and switch
statements in your programming language of choice, plus error handling.
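
As a rough sketch of what such an if/else conversion might look like (this is not code posted in the thread, and the name utf8_to_utf32 is made up for illustration), the function below decodes UTF-8 into UTF-32 and rejects malformed input: invalid lead bytes, truncated sequences, overlong forms, surrogates, and values above U+10FFFF. It makes no claim about speed relative to a table-driven DFA.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 into UTF-32 code points.
// Returns false at the first malformed sequence; in that case `out`
// holds only the code points decoded before the error.
bool utf8_to_utf32(const std::string& in, std::vector<std::uint32_t>& out)
{
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i++]);
        std::uint32_t cp;      // code point being assembled
        int extra;             // continuation bytes still expected
        std::uint32_t lowest;  // smallest value legal for this length
        if (b < 0x80)                { cp = b;        extra = 0; lowest = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; lowest = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; lowest = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; lowest = 0x10000; }
        else return false;     // stray continuation or invalid lead byte

        for (; extra > 0; --extra) {
            if (i >= in.size()) return false;           // truncated input
            unsigned char c = static_cast<unsigned char>(in[i++]);
            if ((c & 0xC0) != 0x80) return false;       // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < lowest) return false;                  // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                // beyond Unicode
        out.push_back(cp);
    }
    return true;
}
```

The error handling is deliberately the simplest possible (stop and report failure); a real converter might instead substitute U+FFFD and continue.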

Best regards,

Oliver

From: Oliver Regenfelder on 18 May 2010 10:29

Hello,

Peter Olcott wrote:
> Maybe it is much simpler for me than it would be for others because of
> my strong bias towards DFA recognizers.

I would say it is exactly the opposite. Your strong bias towards DFA
recognizers lets you completely forget about the current abstraction
level you are dealing with.

> I bet my DFA recognizer is at
> least twice as fast as any other method for validating UTF-8 and
> converting it to code points.
> I am estimating about 20 machine clocks
> per code point.

You might want to reread some of the postings regarding optimization
from the earlier threads.

Have you been a hardware engineer before by any chance?

Best regards,

Oliver
