From: Peter Olcott on

"Chris F Clark" <cfc(a)shell01.TheWorld.com> wrote in message
news:sddk4qx95rv.fsf(a)shell01.TheWorld.com...
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> writes:
>
>> I was wondering if there is a generally faster way to
>> implement a DFA recognizer than by using a state
>> transition
>> matrix.
>>
>> I already asked this question on several other groups and
>> never received any definitive answer.
>
> Did you ask on comp.compilers?
>
> In any case, looking at the answers here, it looks like one is just
> fiddling with the proportionality constant. A transition matrix and
> nested switch statements are exactly equivalent in the sense that they
> are both linear (with slightly different constants); which constant is
> better depends upon how big your DFA is, how clever the compiler is,
> and what size of caches your machine has. For small enough DFAs the
> case statement will generally be faster. For most compilers, though,
> eventually, when one gets a large enough DFA, the optimizer and code
> generator "break", sometimes literally, but more often simply cannot
> track enough information to generate optimal code, and the table
> method actually gets faster.
>
> Note, case statements aren't your only option. Most DFAs can be
> converted into what looks like hand-coded programs. That takes *a lot*
> more work. That is in essence what people do when writing hand-written
> recursive descent parsers.
>
> ab* could be translated into something like:
>
>     if ( token() == 'a' ) {
>         while ( token() == 'b' ) ;
>     }
>     accept();
>

This would likely degrade into long "if / else if" sequences for the
real-world use of creating a lexical analyzer that can recognize 50
keywords. That seems like it would be slower than a state transition
matrix / case statement combination.
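
For concreteness, here is a minimal sketch of that transition matrix /
driver loop combination. The two-keyword toy DFA and every name in it
are purely illustrative, not anything from the thread:

    /* toy table-driven DFA recognizing just the keywords "if" and "in" */
    #include <stdio.h>
    #include <string.h>

    enum { START, SAW_I, ACC_IF, ACC_IN, DEAD, NSTATES };

    static unsigned char delta[NSTATES][256];  /* state transition matrix */
    static unsigned char accepting[NSTATES];   /* 1 if the state accepts  */

    static void build_tables(void)
    {
        memset(delta, DEAD, sizeof delta);     /* default transition: reject */
        delta[START]['i'] = SAW_I;
        delta[SAW_I]['f'] = ACC_IF;
        delta[SAW_I]['n'] = ACC_IN;
        accepting[ACC_IF] = accepting[ACC_IN] = 1;
    }

    /* the driver loop: one table lookup per input byte */
    static int matches(const char *s)
    {
        int state = START;
        for (; *s != '\0'; s++)
            state = delta[state][(unsigned char)*s];
        return accepting[state];
    }

    int main(void)
    {
        build_tables();
        printf("%d %d %d\n", matches("if"), matches("in"), matches("ix"));
        return 0;                              /* prints 1 1 0 */
    }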

> Note, programs like this (which are hard to write correctly) *can*
> produce even faster lexers than case-statement ones, because this code
> is more similar to what the optimizer and code generator were
> originally written and tuned for.

Yeah, the case-statement one has non-trivial overhead.

> That is still not the fastest one can do, depending on the situation.
> It is probably the fastest if one is only feeding in inputs that
> match, but it might be slower at rejecting non-matching inputs than
> some non-obvious other methods. For example, the methodology of
> Boyer-Moore string searches can be applied to certain DFAs to quickly
> look for places where the DFA might fail. For accepting one has to
> look at every input byte, but for rejecting there is no such
> requirement, and the BM technique is suggestive of how one can quickly
> find rejectable bytes in the input (if you don't have to process it
> linearly).
>
> However, since I assume you are looking for a simple DFA recognizer
> that processes its input left-to-right, and probably want one in a
> high-level language, you have already "found" your answer. If you are
> willing to use assembly language, find the paper by Thomas Pennello on
> "the Fastest LR Parser". The heart of an LR parser is essentially a
> DFA recognizer. They went from the state matrix to optimal VAX
> assembly language. Still, I suspect that for a large enough DFA a
> table interpreter could still have been faster due to cache issues.

I don't understand this last part. It seems like the first
one used a state transition matrix which is a table (and
then recoded this in assembly language), and an alternative
faster one also used a table. This looks like they are both
using the same method, so how is one faster than the other?

>
> Hope this helps,
> -Chris
>

From: Tim Little on
On 2010-05-21, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:
> I was thinking of the case of a lexical analyzer that needs
> to recognize fifty keywords. In this case most of the
> alphabet does not need to be explicitly represented, only
> those letters that are part of keywords that are to be
> matched need be represented.

Ah yes, with the default case taking the rest. Yes, that could be
quite compact. A special case making up a tiny fraction of all DFAs,
but a rather important one in practice :)
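
A minimal sketch of one common way to get that compactness: map every
byte to a small equivalence class first, so each row of the matrix
needs one entry per class rather than one per byte of the alphabet.
The classes and the toy two-keyword DFA below are only illustrative:

    #include <stdio.h>

    enum { C_OTHER, C_I, C_F, C_N, NCLASSES };        /* character classes */
    enum { START, SAW_I, ACC_IF, ACC_IN, DEAD, NST };

    static unsigned char cls[256];        /* byte -> class, 0 = "the rest" */

    /* each row needs NCLASSES entries instead of 256 */
    static const unsigned char delta[NST][NCLASSES] = {
        /*            C_OTHER  C_I     C_F     C_N    */
        /* START  */ { DEAD,   SAW_I,  DEAD,   DEAD   },
        /* SAW_I  */ { DEAD,   DEAD,   ACC_IF, ACC_IN },
        /* ACC_IF */ { DEAD,   DEAD,   DEAD,   DEAD   },
        /* ACC_IN */ { DEAD,   DEAD,   DEAD,   DEAD   },
        /* DEAD   */ { DEAD,   DEAD,   DEAD,   DEAD   },
    };

    int main(void)
    {
        cls['i'] = C_I;  cls['f'] = C_F;  cls['n'] = C_N; /* rest stays 0 */

        int state = START;
        for (const char *p = "in"; *p; p++)
            state = delta[state][cls[(unsigned char)*p]];
        printf("accepted: %d\n", state == ACC_IF || state == ACC_IN);
        return 0;                                         /* prints 1 */
    }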


- Tim
From: Chris F Clark on
"Peter Olcott" <NoSpam(a)OCR4Screen.com> writes:

> I don't understand this last part. It seems like the first
> one used a state transition matrix which is a table (and
> then recoded this in assembly language), and an alternative
> faster one also used a table. This looks like they are both
> using the same method, so how is one faster than the other?

In the assembly language version the table is encoded in the code--the
if-then-else sequences you were mentioning when I covered hand-written
recursive descent. There is essentially no table. The table is
completely folded into the code that interprets it (i.e. one has
applied partial evaluation). There is just the large assembly
language program that looks at bytes and decides what to do. If your
lexer is small enough (your 50-keyword example is likely to be), then
this assembly language program will fit in the cache of the processor
and it will run at maximal speed, because the cache will keep up with
the processor and the processor will deal with the bytes in the
minimum number of cycles.
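
A rough C sketch of what "the table is completely folded into the code"
means; the real Pennello-style version is hand-tuned assembly, and this
toy two-keyword DFA ("if", "in") is only illustrative:

    #include <stdio.h>

    /* each state is a stretch of code; each transition is a branch */
    static int matches(const char *p)
    {
        /* state START */
        if (*p++ != 'i') return 0;          /* any other byte is a dead end */
        /* state SAW_I */
        if (*p == 'f' || *p == 'n') p++;    /* 'f' -> ACC_IF, 'n' -> ACC_IN */
        else return 0;
        /* states ACC_IF / ACC_IN: accept only if the input ends here */
        return *p == '\0';
    }

    int main(void)
    {
        printf("%d %d %d\n", matches("if"), matches("in"), matches("iff"));
        return 0;                           /* prints 1 1 0 */
    }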

However, if the lexer is not small enough (when investigating this
stuff we are looking at "lexers" with hundreds of thousands of
"keywords"), the assembly language program grows to be multiple
gigabytes and doesn't fit into cache at all. Then writing a small
interpreter that does fit into cache and building a cache-friendly
data structure for the table makes a difference. You are going to
have cache misses, but you pick a representation that minimizes the
number of them, because a cache miss takes longer than the inner loop
of the interpreter.
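
One possible cache-friendlier layout, sketched purely for illustration
(not necessarily the representation described here): give each state a
short list of (byte, next-state) edges plus a default, instead of a
full 256-entry row.

    #include <stdio.h>

    struct edge  { unsigned char byte; unsigned short next; };
    struct state { unsigned short first, count, dflt; }; /* slice of edges[] */

    /* toy DFA: 0 -'i'-> 1, 1 -'f'/'n'-> 2 (accept), everything else -> 3 */
    static const struct edge  edges[]  = { {'i', 1}, {'f', 2}, {'n', 2} };
    static const struct state states[] = {
        { 0, 1, 3 },   /* state 0: one edge at edges[0], default = dead */
        { 1, 2, 3 },   /* state 1: two edges at edges[1..2]             */
        { 0, 0, 3 },   /* state 2 (accepting): no out edges             */
        { 0, 0, 3 },   /* state 3 (dead): no out edges                  */
    };

    static unsigned step(unsigned s, unsigned char c)
    {
        const struct state *st = &states[s];
        for (unsigned i = 0; i < st->count; i++)  /* short, cache-local scan */
            if (edges[st->first + i].byte == c)
                return edges[st->first + i].next;
        return st->dflt;
    }

    int main(void)
    {
        unsigned s = 0;
        for (const char *p = "in"; *p; p++)
            s = step(s, (unsigned char)*p);
        printf("accepted: %d\n", s == 2);         /* prints 1 */
        return 0;
    }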

Do those two paragraphs make the distinction clear?

I hope so, because I'm going to throw another wrinkle into the problem.
Note, if your problem is small enough that cache isn't the issue, the
next level to look at is branch prediction, as you want the processor
to "guess" the right way on the branch instructions in your program.

The real advantage of merging your table into your code isn't simply
eliminating the array reads that fetch table entries. You actually want
to optimize that code. This is the big plus of partial evaluation. You
look at which branches are actually taken and optimize the code for the
most frequent of those sequences (and eliminate any code for branches
that can never be taken).

This may cause you to unroll loops and do other things that change the
"shape" of the program, which is why the hand-written recursive descent
parser isn't simply one big loop with a case statement inside. If
you've seen "the", looking to see whether the next character is "n" is
fairly important (and likely to hit), but not so if you've seen "whil".
Again, this kind of optimization is relatively hard to automate, and
you are into diminishing-returns territory.
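
A small illustration of that kind of specialization, with a
hypothetical three-keyword set; after "the" the code immediately checks
for 'n', while after "whil" only 'e' is worth testing:

    #include <stdio.h>
    #include <ctype.h>

    enum kw { KW_NONE, KW_THE, KW_THEN, KW_WHILE };

    static enum kw classify(const char *p)
    {
        if (p[0] == 't' && p[1] == 'h' && p[2] == 'e') {
            if (p[3] == 'n' && !isalnum((unsigned char)p[4]))
                return KW_THEN;        /* after "the", 'n' is the hot check */
            if (!isalnum((unsigned char)p[3]))
                return KW_THE;
        } else if (p[0] == 'w' && p[1] == 'h' && p[2] == 'i' &&
                   p[3] == 'l' && p[4] == 'e' &&
                   !isalnum((unsigned char)p[5])) {
            return KW_WHILE;           /* after "whil", only 'e' matters */
        }
        return KW_NONE;                /* not one of these keywords */
    }

    int main(void)
    {
        printf("%d %d %d %d\n", classify("then "), classify("the "),
               classify("while("), classify("them"));
        return 0;                      /* prints 2 1 3 0 */
    }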

(That said, we did some of that in Yacc++, if you ask for the tables to
be laid out as code. You can tell it that you want it to unroll the
code, and if you do, it sees not only how one character of the token
influences what it should do, but looks at character sequences up to
the unrolling limit you have set. It does fairly minimal optimizations
on that, but it does at least eliminate the dead branches. A fairly
typical one is eliminating spurious checks for EOF, which tend to
dominate lexer time if you do them on every character. That alone made
investigating laying out the code and optimizing it worthwhile. We
figured out how to fold that optimization back into the default
interpreter and sped up average lexing time by 10-25% even when one was
just using tables.)
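
The usual trick behind dropping the per-character EOF check is a
sentinel byte at the end of the buffer, so the character-class test in
the scanning loop does double duty. A generic sketch only, not Yacc++'s
actual table layout:

    #include <stdio.h>
    #include <string.h>

    /* scan an identifier without a separate end-of-buffer test: the '\0'
       sentinel fails the character-class test and stops the loop, so
       there is no per-character "are we at EOF?" branch */
    static size_t scan_identifier(const char *p)
    {
        const char *start = p;
        while ((*p >= 'a' && *p <= 'z') || (*p >= 'A' && *p <= 'Z'))
            p++;
        return (size_t)(p - start);    /* length of the identifier */
    }

    int main(void)
    {
        char buf[64];
        strcpy(buf, "while");          /* strcpy supplies the '\0' sentinel */
        printf("%zu\n", scan_identifier(buf));  /* prints 5 */
        return 0;
    }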

Hope this helps,
-Chris

******************************************************************************
Chris Clark email: christopher.f.clark(a)compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------