From: Joseph M. Newcomer on
See below...
On Tue, 18 May 2010 07:18:42 -0700 (PDT), James Kanze <james.kanze(a)gmail.com> wrote:

>On 16 May, 14:51, Öö Tiib <oot...(a)hot.ee> wrote:
>> On 16 mai, 15:34, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:
>
>> I suspect UTF-8 will gradually fade into history, for reasons similar
>> to why 256-color video modes and raster graphics formats went away.
>> GUIs are already often written in Java or C# (for lack of C++ devs),
>> and those use UTF-16 internally. Notice that modern processor
>> architectures are already optimized in ways that often make
>> byte-level operations slower.
>
>The network is still 8-bit UTF-8, as are the disks; using
>UTF-16 on an external medium simply doesn't work.
****
UTF-8 is an endian-independent encoding of Unicode (UTF-32). Disks, however, are clueless
about such representations; they store 512-byte sectors and don't care what is in them. It
is quite common to store files as UTF-16 or UTF-32, ideally with a BOM in either case.
File systems don't care what data you store; only your program cares. In some situations
using UTF-8 for files is convenient, but nothing about disks or file systems dictates it.

So while the argument about network transport, and the desire for an endian-independent
encoding, have some credibility, disks are irrelevant to this discussion.
****
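[To make the point concrete, here is a minimal sketch; it is an editorial addition, not code from the thread. It writes a BOM-prefixed UTF-16LE file; the file system neither knows nor cares that the payload is UTF-16.]

#include <cstddef>
#include <cstdint>
#include <fstream>

// Writes 'count' UTF-16 code units to 'path' as UTF-16LE, preceded by a BOM
// (U+FEFF) so that readers can detect the byte order.
void writeUtf16LeFile(const char* path, const std::uint16_t* units, std::size_t count)
{
    std::ofstream out(path, std::ios::binary);
    const unsigned char bom[2] = { 0xFF, 0xFE };            // U+FEFF in little-endian order
    out.write(reinterpret_cast<const char*>(bom), 2);
    for (std::size_t i = 0; i != count; ++i) {
        const unsigned char b[2] = {
            static_cast<unsigned char>(units[i] & 0xFFu),   // low byte first
            static_cast<unsigned char>(units[i] >> 8)       // then high byte
        };
        out.write(reinterpret_cast<const char*>(b), 2);
    }
}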
>
>Also, UTF-8 may result in less memory use, and thus less paging.
****
Generally, this argument is deemed "silly". It carries no weight except in very rare and
exotic situations, and is hard to credit for any but the most unusual programs.
****
>
>If all you're doing are simple operations, searching for a few
>ASCII delimiters and copying the delimited substrings, for
>example, UTF-8 will probably be significantly faster: the CPU
>will always read a word at a time, even if you access it byte by
>byte, and you'll usually get more characters per word using
>UTF-8.
****
This logic doesn't even make sense. The computer does NOT read a word at a time; the
memory manager reads a cache line at a time, which can be 16, 32, or 64 bytes, depending
on the chip set. And it really is marginal in terms of overall performance, as anyone who
has spent time doing performance measurement realizes.
****
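[James's point can be made concrete with a small sketch of my own, not code from the thread: splitting a UTF-8 buffer on an ASCII delimiter can safely be done byte by byte, because every byte of a multi-byte UTF-8 sequence has its high bit set and so can never be mistaken for an ASCII delimiter.]

#include <string>
#include <vector>

// Splits a UTF-8 string on an ASCII delimiter (e.g. ',').  No decoding is
// needed: lead and continuation bytes of multi-byte sequences are all >= 0x80,
// so comparing raw bytes against an ASCII character is safe.
std::vector<std::string> splitUtf8(const std::string& s, char delim)
{
    std::vector<std::string> out;
    std::string::size_type start = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (s[i] == delim) {
            out.push_back(s.substr(start, i - start));
            start = i + 1;
        }
    }
    out.push_back(s.substr(start));
    return out;
}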
>
>If you need full and complete support, as in an editor, for
>example, UTF-32 is the best general solution. For a lot of
>things in between, UTF-16 is a good compromise.
****
No argument there. UTF-16 with surrogates is a real pain to deal with.
****
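[For readers unfamiliar with that pain, here is a minimal sketch, again an editorial addition rather than code from the thread, of the extra step UTF-16 requires: recombining a surrogate pair into a code point. Handling of unpaired surrogates is deliberately omitted.]

#include <cstddef>
#include <cstdint>

// Returns the code point starting at s[i] and advances i past it (1 or 2
// UTF-16 code units).  Assumes well-formed input; a real implementation must
// also cope with unpaired surrogates.
std::uint32_t nextCodePoint(const std::uint16_t* s, std::size_t& i)
{
    const std::uint16_t hi = s[i++];
    if (hi >= 0xD800 && hi <= 0xDBFF) {                 // high (lead) surrogate
        const std::uint16_t lo = s[i++];                // low (trail) surrogate
        return 0x10000u
             + ((std::uint32_t(hi) - 0xD800u) << 10)
             +  (std::uint32_t(lo) - 0xDC00u);
    }
    return hi;                                          // BMP code point: one unit
}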
>
>But the trade-offs only concern internal representation.
>Externally, the world is 8 bits, and UTF-8 is the only solution.
****
No, this does not follow. UTF-8 is a *convenient* solution in some contexts, but calling
it the *only* solution is presumptuous. UTF-32 is also a convenient solution in some
contexts, as is UTF-16. All decisions have to be made in terms of the context!
joe
****
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on
On 5/18/2010 12:40 AM, Joshua Maurice wrote:
> On May 17, 8:35 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>> The finite state machine's detailed design is now completed. Its state
>> transition matrix only takes 2048 bytes. It will be faster than any
>> other possible method.
>>
>> The finite state machine is semantically identical to the regular
>> expression. I am somewhat enamored with DFA recognizers. I love them. I
>> will post the source code when it is completed.
>
> I'd love to see it (and test it). I'd be a little surprised that
> anything using something called "regular expressions" could beat the
> code I just posted in terms of speed or executable size. (Unless
> perhaps the regex library itself output C code, which was then
> compiled down. I could probably buy it then.)

That is how Lex works. I hand-coded my finite state machine, though.

>
>> A few years ago I beat both Microsoft's and Borland's std::string by
>> as much as 100-fold or more. I posted the code to this group. It was
>> called FastString, if anyone cares to look it up.
>
> Beat them in what? What operations? I hope you aren't comparing copy-
> on-write string to a copy-on-copy implementation. (Is there a better
> name for not "copy on write"?)

They were both much slower for two reasons:
(1) Memory allocations
(2) Function call overhead
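
[On point (1), the usual culprit is repeated reallocation as a string grows. A small sketch of my own, not the FastString code, of the difference reserving capacity makes:]

#include <string>

// Appending in a loop without reserving may reallocate several times as the
// string grows; reserving up front reduces that to a single allocation.
std::string buildWithoutReserve(int n)
{
    std::string s;
    for (int i = 0; i < n; ++i)
        s += 'x';            // may trigger reallocation as capacity is exceeded
    return s;
}

std::string buildWithReserve(int n)
{
    std::string s;
    s.reserve(n);            // one allocation up front
    for (int i = 0; i < n; ++i)
        s += 'x';            // no further reallocations needed
    return s;
}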

From: Peter Olcott on
On 5/18/2010 2:36 AM, Mihai N. wrote:
>
>> It will be faster than any other possible method.
>
> Not only "the fastest method today" but faster than any possible method.
> Really?
> Never make such absolute statements.
> Looks like a sure sign of delusion.
> No matter how fast the method, someone, at some point, will have something
> faster. It's just how things work.

In this case it is a sure sign of very extensive knowledge of DFA
recognizers. I have a patent on a DFA recognizer: 7,046,848.

>
>
>> I will post the source code when it is completed.
> You do that.
> That way it is easier to measure and compare with other methods.
> Estimates mean nothing.
>
>
If one achieves the absolute minimum number of clock cycles, then one
achieves the fastest possible code. The code that I will publish will be
in the ballpark of the minimum number of operations that can be encoded
in C/C++. Hand-tweaking it in assembly language might provide further
improvements.

It will not produce smaller code than the alternatives. The code will
probably be about 2048 bytes larger, because the state transition matrix
requires that much space.
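
[For readers who want to see concretely what such a table-driven recognizer can look like, here is a minimal, untested sketch of my own, not Peter's code: 8 live states times 256 input bytes gives exactly a 2048-byte transition matrix, and the inner loop is one table lookup per input byte. It accepts only the well-formed sequences of Unicode Table 3-7, rejecting overlong forms, surrogates, and code points above U+10FFFF.]

#include <cstddef>
#include <cstdint>

namespace {

enum State : std::uint8_t {
    S_ACCEPT = 0, // between code points
    S_CB1    = 1, // expect 1 more continuation byte (0x80..0xBF)
    S_CB2    = 2, // expect 2 more continuation bytes
    S_CB3    = 3, // expect 3 more continuation bytes
    S_E0     = 4, // after 0xE0: next must be 0xA0..0xBF (no overlongs)
    S_ED     = 5, // after 0xED: next must be 0x80..0x9F (no surrogates)
    S_F0     = 6, // after 0xF0: next must be 0x90..0xBF (no overlongs)
    S_F4     = 7, // after 0xF4: next must be 0x80..0x8F (<= U+10FFFF)
    S_REJECT = 8  // dead state; not a row of the table
};

struct Utf8Dfa {
    std::uint8_t next[8][256]; // the 2048-byte state transition matrix

    Utf8Dfa() {
        for (int s = 0; s < 8; ++s)
            for (int b = 0; b < 256; ++b)
                next[s][b] = S_REJECT;

        for (int b = 0x00; b <= 0x7F; ++b) next[S_ACCEPT][b] = S_ACCEPT;
        for (int b = 0xC2; b <= 0xDF; ++b) next[S_ACCEPT][b] = S_CB1;
        next[S_ACCEPT][0xE0] = S_E0;
        for (int b = 0xE1; b <= 0xEC; ++b) next[S_ACCEPT][b] = S_CB2;
        next[S_ACCEPT][0xED] = S_ED;
        for (int b = 0xEE; b <= 0xEF; ++b) next[S_ACCEPT][b] = S_CB2;
        next[S_ACCEPT][0xF0] = S_F0;
        for (int b = 0xF1; b <= 0xF3; ++b) next[S_ACCEPT][b] = S_CB3;
        next[S_ACCEPT][0xF4] = S_F4;

        for (int b = 0x80; b <= 0xBF; ++b) {
            next[S_CB1][b] = S_ACCEPT;
            next[S_CB2][b] = S_CB1;
            next[S_CB3][b] = S_CB2;
        }
        for (int b = 0xA0; b <= 0xBF; ++b) next[S_E0][b] = S_CB1;
        for (int b = 0x80; b <= 0x9F; ++b) next[S_ED][b] = S_CB1;
        for (int b = 0x90; b <= 0xBF; ++b) next[S_F0][b] = S_CB2;
        for (int b = 0x80; b <= 0x8F; ++b) next[S_F4][b] = S_CB2;
    }
};

const Utf8Dfa dfa;

} // namespace

bool validUtf8(const unsigned char* p, std::size_t n)
{
    std::uint8_t state = S_ACCEPT;
    for (std::size_t i = 0; i < n; ++i) {
        state = dfa.next[state][p[i]];   // one table lookup per byte
        if (state == S_REJECT)
            return false;
    }
    return state == S_ACCEPT;            // must not end mid-sequence
}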

From: Peter Olcott on
On 5/18/2010 2:44 AM, Mihai N. wrote:
>
>> the fastest and simplest possible way
>> to validate and divide any UTF-8 sequence into its constituent code
>> point parts is a regular expression implemented as a finite state
>> machine
>
> Sorry, where did you get this one from?
>
>

The intersection of my knowledge of UTF-8 encoding and DFA recognizers.
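
[For reference, and as an editorial addition rather than part of the original posts: the "regular expression" in question is essentially the byte-level grammar of well-formed UTF-8 given in RFC 3629, which is what any such DFA must encode:]

UTF8-octets = *( UTF8-char )
UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1      = %x00-7F
UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 3( UTF8-tail )
UTF8-tail   = %x80-BF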
From: Peter Olcott on
On 5/18/2010 4:51 AM, Joshua Maurice wrote:
> On May 18, 12:38 am, "Mihai N."<nmihai_year_2...(a)yahoo.com> wrote:
>>> //COMPLETELY UNTESTED
>>
>> Then most likely wrong :-)
>
> Yes. It was there just to demonstrate how easy the code is, and why I
> might consider "regex" and "state machine libraries" or whatever to be
> overkill. I will wait patiently for his code and compare it to what I
> whipped off the top of my head.

And it clearly shows that the typical way to do this costs many more
machine cycles than the way that I am proposing.