From: Peter Olcott on
On 5/19/2010 10:15 AM, Joseph M. Newcomer wrote:
> You just don't get it, do you? NOTHING can beat the performance of a state transition
> matrix, because Peter says this must be true! And who are the rest of us poor people to
> question him?
> joe

This is the essence of my optimal design, try and show one that is
faster for matching 50 keywords.

Looking up an ActionCode switch statement value based on a
state_transition_matrix that is indexed by current_state and
current_input_byte.

unsigned char States[8][256];

The above is the state transition matrix for validating and translating
UTF-8 to UTF-32.

The above design completely proves my point to everyone with sufficent
knowledge of UTF-8 and DFA state transition matrices.

I will also add that I only have twelve ActionCodes including
InvalidByteError and the OutOfdata sentinel.

>
> On Wed, 19 May 2010 15:32:50 +0100, "Leigh Johnston"<leigh(a)i42.co.uk> wrote:
>
>>
>>
>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message
>> news:o86dnTxKTd3Xb27WnZ2dnUVZ_g6dnZ2d(a)giganews.com...
>>> On 5/19/2010 12:39 AM, Pete Delgado wrote:
>>>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message
>>>> news:hcednaq6ks5ml2_WnZ2dnUVZ_tqdnZ2d(a)giganews.com...
>>>>>
>>>>> The finite state machine's detailed design is now completed. Its state
>>>>> transition matrix only takes 2048 bytes. It will be faster than any
>>>>> other
>>>>> possible method.
>>>>
>>>> So once again you find yourself with a *design* that is complete but you
>>>> have not done any *coding*? Yet you claim that it will be faster than any
>>>> other possible method?
>>>>
>>>> Is anyone else noticing a pattern here??? (no pun intended...)
>>>>
>>>> -Pete
>>>>
>>>>
>>>
>>> The code will be complete within a week. Also most of the coding is
>>> complete for my major components.
>>>
>>> I would think that it would be self-evident to all who really understand
>>> deterministic finite automatons that nothing can beat the speed of a state
>>> transition matrix.
>>
>> What if such a matrix does not fit into L1 cache?
>>
>> /Leigh
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Peter Olcott on
On 5/19/2010 10:12 AM, Joseph M. Newcomer wrote:
> A classic error in predicting performance deals with the myth that instruction count is an
> important parameter in determining timing.
>
> In a modern pipelined superscalar architecture, with asynchronous, opportunistic
> sequencing of instructions, speculative execution, multilevel caches, etc., there is NO
> WAY to look at code and predict its behavior unless you are mentally simulating the
> asynchronous nature of the entire architecture, from memory access through write pipes,
> caches, and the entire asynchronous execution engine. Therefore, "counting" memory
> accesses does not work (speculative prefetch may have made the data available at the point
> where it is needed, with no "memory delay" a naive simulation may suggest). Counting
> instructions is pointless, because such models suggest the instructions are executed
> sequentially (this has not been true since the Pentium II).

The above statement has at least one obvious exception. When the
algorithm is such that all of the above re-sequencing of instructions
can't possibly help because there is only a single sequence that can
produce the correct result.

>
> In principle, the only way to know how long an algorithm takes is to use a high-resolution
> (e.g., RDTSC) timer. But the way Windows is implemented, this also has problem.
>
> The chip itself does not guarantee the RDTSC instruction is executed when you think it is;
> in fact, it is *documented* that it might be executed BEFORE instructions that precede it
> and when executed, an arbitrary number of instructions that FOLLOW it might have *already*
> been executed (and this can happen even if the preceding instructions have NOT YET been
> executed), so to use the RDTSC instruction directly, it must always be preceded by a
> "serializing instruction". The only valid user-mode serializing instruction is CPUID.
> Note that QueryPerformanceCounter will do the right thing.
>
> However, in a multicore/multiprocessor environment, a thread may read the high-resolution
> timer on one CPU, then be suspended, and read the timer again on another CPU. Note there
> can be a "clock skew" between the two CPUs, so it *is* possible for the timer to give
> NEGATIVE time for small intervals. There are many other problems with getting good
> performance data on the x86 chipset under Windows, so at best all timings are
> approximations, and you have to do a sufficient number of experiments that your collected
> data has statistical significance (and if you make a claim about performance and cannot
> give the mean and standard deviation of your measurements, you don't HAVE measurements,
> you have random numbers which might approximate measurements but cannot be trusted).
>
> So trying to compare models by counting the number of potential memory accesses or the
> potential instruction trace will NOT give correct information. You can no longer do
> static analysis of code to understand what its performance is going to be. Well, you can,
> but it is not practical, and in any case, if you can work it out for a particular model of
> CPU with a particular motherboard chipset, it is ONLY valid for that instance of a
> "computer", and will not necessarily hold for other instances of "computer" with different
> CPUs or different chip sets on the motherboard. So performance metrics that say "I looked
> at the instruction set and X is true" have no meaning.
> joe

Certainly head to head direct benchmarking will provide much more
accurate and reliable information. This is not to say that instruction
counts are always useless. When designing a fast algorithm it is too
expensive to encode every alternative, thus some reasonable heuristic
must be used to narrow down the number of alternatives.

> ***
> On Wed, 19 May 2010 03:39:00 -0700 (PDT), James Kanze<james.kanze(a)gmail.com> wrote:
>
>> On May 18, 8:17 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>> On 5/18/2010 9:34 AM, James Kanze wrote:
>>>> On 17 May, 14:08, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>>>> On 5/17/2010 1:35 AM, Mihai N. wrote:
>>
>>>>> a regular expression implemented as a finite state machine
>>>>> is the fastest and simplest possible way of every way that
>>>>> can possibly exist to validate a UTF-8 sequence and divide
>>>>> it into its constituent parts.
>>
>>>> It all depends on the formal specification; one of the
>>>> characteristics of UTF-8 is that you don't have to look at
>>>> every character to find the length of a sequence. And
>>>> a regular expression generally will have to look at every
>>>> character.
>>
>>> Validation and translation to UTF-32 concurrently can not be
>>> done faster than a DFA recognizer, in fact it must always be
>>> slower.
>>
>> UTF-8 was designed intentionally in a way that it doesn't
>> require a complete DFA to handle, but can be handled faster.
>> Complete DFA's are usually slower than caluculations on modern
>> processors, since they require memory accesses, and memory is
>> often the limiting factor.
>>
>> In fact, there is no "must always be slower". There are too
>> many variables involved to be able to make such statements.
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

From: James Kanze on
On May 19, 3:32 pm, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
> "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote in message
> news:o86dnTxKTd3Xb27WnZ2dnUVZ_g6dnZ2d(a)giganews.com...
> > On 5/19/2010 12:39 AM, Pete Delgado wrote:
> >> "Peter Olcott"<NoS...(a)OCR4Screen.com> wrote in message
> >>news:hcednaq6ks5ml2_WnZ2dnUVZ_tqdnZ2d(a)giganews.com...

> >>> The finite state machine's detailed design is now
> >>> completed. Its state transition matrix only takes 2048
> >>> bytes. It will be faster than any other possible method.

> >> So once again you find yourself with a *design* that is
> >> complete but you have not done any *coding*? Yet you claim
> >> that it will be faster than any other possible method?

> >> Is anyone else noticing a pattern here??? (no pun intended...)

> > The code will be complete within a week. Also most of the
> > coding is complete for my major components.

> > I would think that it would be self-evident to all who
> > really understand deterministic finite automatons that
> > nothing can beat the speed of a state transition matrix.

> What if such a matrix does not fit into L1 cache?

What if any number of things. The only thing that is
self-evident to those who understand modern computers is that
your really can't tell until you've measured (within certain
limits, of course). There are too many variables involved. In
general, however, avoiding memory accesses tends to help (as
does avoiding conditions on some machines).

--
James Kanze
From: Leigh Johnston on


"James Kanze" <james.kanze(a)gmail.com> wrote in message
news:1419a152-9e2f-4cd1-b1e3-b734fc7cf64c(a)y21g2000vba.googlegroups.com...
>
>> What if such a matrix does not fit into L1 cache?
>
> What if any number of things. The only thing that is
> self-evident to those who understand modern computers is that
> your really can't tell until you've measured (within certain
> limits, of course). There are too many variables involved. In
> general, however, avoiding memory accesses tends to help (as
> does avoiding conditions on some machines).
>

It was mostly a rhetorical question.

/Leigh

From: James Kanze on
On May 19, 4:14 pm, Peter Olcott <NoS...(a)OCR4Screen.com> wrote:
> On 5/19/2010 5:39 AM, James Kanze wrote:
> > On May 18, 8:17 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
> >> On 5/18/2010 9:34 AM, James Kanze wrote:
> >>> On 17 May, 14:08, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
> >>>> On 5/17/2010 1:35 AM, Mihai N. wrote:

> >>>> a regular expression implemented as a finite state
> >>>> machine is the fastest and simplest possible way of every
> >>>> way that can possibly exist to validate a UTF-8 sequence
> >>>> and divide it into its constituent parts.

> >>> It all depends on the formal specification; one of the
> >>> characteristics of UTF-8 is that you don't have to look at
> >>> every character to find the length of a sequence. And
> >>> a regular expression generally will have to look at every
> >>> character.

> >> Validation and translation to UTF-32 concurrently can not be
> >> done faster than a DFA recognizer, in fact it must always be
> >> slower.

> > UTF-8 was designed intentionally in a way that it doesn't
> > require a complete DFA to handle, but can be handled faster.
> > Complete DFA's are usually slower than caluculations on modern
> > processors, since they require memory accesses, and memory is
> > often the limiting factor.

> > In fact, there is no "must always be slower". There are too
> > many variables involved to be able to make such statements.

> This is the essence of my optimal design, try and show one
> that is faster for matching 50 keywords.

I have to see yours, first. (I posted mine. I won't claim that
nothing is faster, but it does get the job done pretty quickly.)

> Looking up an ActionCode switch statement value based on
> a state_transition_matrix that is indexed by current_state and
> current_input_byte.

> unsigned char States[8][256];

> The above is the state transition matrix for validating and
> translating UTF-8 to UTF-32.

Plus the action table. And executing the action. And... Hand
waving is always easier (and faster) than real code.

And I've implemented DFA's before. I've actually written code
which generates a DFA from a regular expression. You don't have
to explain the obvious to me.

> The above design completely proves my point to everyone with
> sufficent knowledge of UTF-8 and DFA state transition
> matrices.

Design never proves anything with regards to speed, at least as
long as the big-O is the same.

> I will also add that I only have twelve ActionCodes including
> InvalidByteError and the OutOfdata sentinel.

And how do you act on an ActionCode? Switch statements and
indirect jumps are very, very slow on some machines (but not on
others).

--
James Kanze