From: Öö Tiib on
On May 19, 5:14 pm, Peter Olcott <NoS...(a)OCR4Screen.com> wrote:
> On 5/19/2010 12:39 AM, Pete Delgado wrote:
>
> I would think that it would be self-evident to all who really understand
> deterministic finite automatons that nothing can beat the speed of a
> state transition matrix.

Rest assured, there are no silver bullets in the software industry.
Fortunately; otherwise there would soon be nothing left to do.

Large state transition matrices mean large static data. Large static
data often causes cache misses. A few additional comparisons in the
state transition algorithm are a LOT better than frequent cache misses.
From: Öö Tiib on
On May 19, 5:52 pm, Öö Tiib <oot...(a)hot.ee> wrote:
> On May 19, 5:14 pm, Peter Olcott <NoS...(a)OCR4Screen.com> wrote:
>
> > On 5/19/2010 12:39 AM, Pete Delgado wrote:
>
> > I would think that it would be self-evident to all who really understand
> > deterministic finite automatons that nothing can beat the speed of a
> > state transition matrix.
>
> Rest assured, there are no silver bullets in the software industry.
> Fortunately; otherwise there would soon be nothing left to do.
>
> Large state transition matrices mean large static data. Large static
> data often causes cache misses. A few additional comparisons in the
> state transition algorithm are a LOT better than frequent cache misses.

Also, JK has already said elsewhere in this thread that UTF-8 is
designed to be simple to handle without such matrices. Branch
prediction on most processors will be correct most of the time,
since people do not mix Hebrew, Chinese, and Greek in the same
sentence, so the matrices are quite likely just a waste of memory
space and time.
From: Joseph M. Newcomer on
A classic error in predicting performance is the myth that instruction count is an
important parameter in determining timing.

In a modern pipelined superscalar architecture, with asynchronous, opportunistic
sequencing of instructions, speculative execution, multilevel caches, etc., there is NO
WAY to look at code and predict its behavior unless you are mentally simulating the
asynchronous nature of the entire architecture, from memory access through write pipes,
caches, and the entire asynchronous execution engine. Therefore, "counting" memory
accesses does not work (speculative prefetch may have made the data available at the point
where it is needed, with none of the "memory delay" that a naive simulation would suggest). Counting
instructions is pointless, because such models suggest the instructions are executed
sequentially (this has not been true since the Pentium II).

In principle, the only way to know how long an algorithm takes is to use a high-resolution
(e.g., RDTSC) timer. But the way Windows is implemented, this also has problems.

The chip itself does not guarantee the RDTSC instruction is executed when you think it is;
in fact, it is *documented* that it might be executed BEFORE instructions that precede it
and when executed, an arbitrary number of instructions that FOLLOW it might have *already*
been executed (and this can happen even if the preceding instructions have NOT YET been
executed), so to use the RDTSC instruction directly, it must always be preceded by a
"serializing instruction". The only valid user-mode serializing instruction is CPUID.
Note that QueryPerformanceCounter will do the right thing.
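
As a concrete illustration of the two options, here is a rough sketch
of mine (not from Joe's post): the serialized RDTSC read uses the MSVC
intrinsics __cpuid and __rdtsc, and the simpler path just brackets the
work with QueryPerformanceCounter.

#include <windows.h>
#include <intrin.h>

// Serialized read of the time-stamp counter: CPUID drains the
// out-of-order machine, so RDTSC cannot float across the code
// being measured.
unsigned __int64 SerializedRdtsc()
{
    int regs[4];
    __cpuid(regs, 0);        // user-mode serializing instruction
    return __rdtsc();
}

// The simpler alternative: bracket the work with
// QueryPerformanceCounter and convert ticks to seconds.
double ElapsedSeconds(void (*work)())
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    work();
    QueryPerformanceCounter(&t1);
    return double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
}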

However, in a multicore/multiprocessor environment, a thread may read the high-resolution
timer on one CPU, then be suspended, and read the timer again on another CPU. Note there
can be a "clock skew" between the two CPUs, so it *is* possible for the timer to give
NEGATIVE time for small intervals. There are many other problems with getting good
performance data on the x86 chipset under Windows, so at best all timings are
approximations, and you have to do a sufficient number of experiments that your collected
data has statistical significance (and if you make a claim about performance and cannot
give the mean and standard deviation of your measurements, you don't HAVE measurements,
you have random numbers which might approximate measurements but cannot be trusted).
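
To make that last point concrete, a measurement harness along these
lines (my own sketch; the run count is whatever you choose, and must be
at least 2) reports a mean and a sample standard deviation instead of a
single number:

#include <cmath>
#include <vector>

struct Stats { double mean; double stddev; };

// Repeat a timed experiment and report mean and sample standard
// deviation, rather than trusting any single timing.
Stats Measure(double (*timed_run)(), int runs)
{
    std::vector<double> samples(runs);
    for (int i = 0; i < runs; ++i)
        samples[i] = timed_run();            // e.g. ElapsedSeconds(...)

    double sum = 0.0;
    for (int i = 0; i < runs; ++i) sum += samples[i];
    const double mean = sum / runs;

    double var = 0.0;
    for (int i = 0; i < runs; ++i)
        var += (samples[i] - mean) * (samples[i] - mean);
    var /= (runs - 1);

    Stats result = { mean, std::sqrt(var) };
    return result;
}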

So trying to compare models by counting the number of potential memory accesses or the
potential instruction trace will NOT give correct information. You can no longer do
static analysis of code to understand what its performance is going to be. Well, you can,
but it is not practical, and in any case, if you can work it out for a particular model of
CPU with a particular motherboard chipset, it is ONLY valid for that instance of a
"computer", and will not necessarily hold for other instances of "computer" with different
CPUs or different chip sets on the motherboard. So performance metrics that say "I looked
at the instruction set and X is true" have no meaning.
joe
***
On Wed, 19 May 2010 03:39:00 -0700 (PDT), James Kanze <james.kanze(a)gmail.com> wrote:

>On May 18, 8:17 pm, Peter Olcott <NoS...(a)OCR4Screen.com> wrote:
>> On 5/18/2010 9:34 AM, James Kanze wrote:
>> > On 17 May, 14:08, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>> >> On 5/17/2010 1:35 AM, Mihai N. wrote:
>
>> >> a regular expression implemented as a finite state machine
>> >> is the fastest and simplest possible way of every way that
>> >> can possibly exist to validate a UTF-8 sequence and divide
>> >> it into its constituent parts.
>
>> > It all depends on the formal specification; one of the
>> > characteristics of UTF-8 is that you don't have to look at
>> > every character to find the length of a sequence. And
>> > a regular expression generally will have to look at every
>> > character.
>
>> Validation and translation to UTF-32 concurrently can not be
>> done faster than a DFA recognizer, in fact it must always be
>> slower.
>
>UTF-8 was designed intentionally in a way that it doesn't
>require a complete DFA to handle, but can be handled faster.
>Complete DFAs are usually slower than calculations on modern
>processors, since they require memory accesses, and memory is
>often the limiting factor.
>
>In fact, there is no "must always be slower". There are too
>many variables involved to be able to make such statements.
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on
On 5/19/2010 5:30 AM, Hector Santos wrote:
> On May 19, 1:39 am, "Pete Delgado"<Peter.Delg...(a)NoSpam.com> wrote:
>> "Peter Olcott"<NoS...(a)OCR4Screen.com> wrote in message
>>
>> news:hcednaq6ks5ml2_WnZ2dnUVZ_tqdnZ2d(a)giganews.com...
>>
>>
>>
>>> The finite state machine's detailed design is now completed. Its state
>>> transition matrix only takes 2048 bytes. It will be faster than any other
>>> possible method.
>>
>> So once again you find yourself with a *design* that is complete but you
>> have not done any *coding*? Yet you claim that it will be faster than any
>> other possible method?
>>
>> Is anyone else noticing a pattern here??? (no pun intended...)
>>
>> -Pete
>
> Yup, same old pathetic fantasy rhetorical claims by Peter "The Troll"
> Olcott.
>
> --
> HLS
>

This is the essence of my optimal design; try to show one that is
faster for matching 50 keywords.

The design looks up an ActionCode for a switch statement in a
state_transition_matrix indexed by current_state and
current_input_byte.

unsigned char States[8][256];

The above is the state transition matrix for validating and translating
UTF-8 to UTF-32.

The above design completely proves my point to everyone with sufficient
knowledge of UTF-8 and DFA state transition matrices.

I will also add that I have only twelve ActionCodes, including
InvalidByteError and the OutOfdata sentinel.
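
For readers who want to see the shape of such a loop, here is a
minimal sketch. It is my own reconstruction, not Olcott's code: the
action names are placeholders, I use an end pointer rather than his
OutOfdata sentinel, I split the actions into a second parallel matrix
for readability (his single States matrix presumably packs both), and
the matrices are assumed to be filled in by a table generator.

#include <vector>

// Placeholder action codes; the real design reportedly has twelve.
enum ActionCode {
    AppendAscii, Start2, Start3, Start4, Continue, Finish,
    InvalidByteError
};

unsigned char States [8][256];   // next state, filled by a generator
unsigned char Actions[8][256];   // ActionCode, filled by a generator

// Skeleton of the table-driven loop: two table loads per input byte,
// then a switch on the action code.
bool Utf8ToUtf32(const unsigned char* in, const unsigned char* end,
                 std::vector<unsigned int>& out)
{
    unsigned current_state = 0;
    unsigned int code_point = 0;
    while (in < end) {
        const unsigned char current_input_byte = *in++;
        const ActionCode action = static_cast<ActionCode>(
            Actions[current_state][current_input_byte]);
        current_state = States[current_state][current_input_byte];
        switch (action) {
        case AppendAscii: out.push_back(current_input_byte);      break;
        case Start2:      code_point = current_input_byte & 0x1F; break;
        case Start3:      code_point = current_input_byte & 0x0F; break;
        case Start4:      code_point = current_input_byte & 0x07; break;
        case Continue:
            code_point = (code_point << 6) | (current_input_byte & 0x3F);
            break;
        case Finish:
            code_point = (code_point << 6) | (current_input_byte & 0x3F);
            out.push_back(code_point);
            break;
        case InvalidByteError:
        default:
            return false;
        }
    }
    return current_state == 0;   // input must end between sequences
}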
From: Peter Olcott on
On 5/19/2010 5:39 AM, James Kanze wrote:
> On May 18, 8:17 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>> On 5/18/2010 9:34 AM, James Kanze wrote:
>>> On 17 May, 14:08, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>>> On 5/17/2010 1:35 AM, Mihai N. wrote:
>
>>>> a regular expression implemented as a finite state machine
>>>> is the fastest and simplest possible way of every way that
>>>> can possibly exist to validate a UTF-8 sequence and divide
>>>> it into its constituent parts.
>
>>> It all depends on the formal specification; one of the
>>> characteristics of UTF-8 is that you don't have to look at
>>> every character to find the length of a sequence. And
>>> a regular expression generally will have to look at every
>>> character.
>
>> Validation and translation to UTF-32 concurrently can not be
>> done faster than a DFA recognizer, in fact it must always be
>> slower.
>
> UTF-8 was designed intentionally in a way that it doesn't
> require a complete DFA to handle, but can be handled faster.
> Complete DFAs are usually slower than calculations on modern
> processors, since they require memory accesses, and memory is
> often the limiting factor.
>
> In fact, there is no "must always be slower". There are too
> many variables involved to be able to make such statements.
>
> --
> James Kanze

This is the essence of my optimal design; try to show one that is
faster for matching 50 keywords.

The design looks up an ActionCode for a switch statement in a
state_transition_matrix indexed by current_state and
current_input_byte.

unsigned char States[8][256];

The above is the state transition matrix for validating and translating
UTF-8 to UTF-32.

The above design completely proves my point to everyone with sufficient
knowledge of UTF-8 and DFA state transition matrices.

I will also add that I have only twelve ActionCodes, including
InvalidByteError and the OutOfdata sentinel.