From: Peter Olcott on
On 5/19/2010 12:34 AM, Mihai N. wrote:
>
>> I can't accept this without direct proof.
>
> Yet, we are all supposed to take everything *you*
> say without any proof.
>
>
Please cite examples of what I need to prove. So far no one has asked
for any proof.
From: Peter Olcott on
On 5/19/2010 12:42 AM, Mihai N. wrote:
>> You code in Word? Word is essentially user-hostile for writing code!
>
> Actually, it is very nice.
>
> You set the spelling language to "C" and get nice squiggles for
> all syntax errors.
> ;-)
>
> Also, it has the great benefit that it does not compile and run your code, which
> allows you to write perfect code and create super fast algorithms.
> Compiling the code and running it destroy the perfection :-)
> It's called "reality" and it is a bad thing.
>
>

It has the benefit of making code easier to read for extensive desk
checking: I can enlarge the font, add bolding, and add color highlighting.
Also, this code can be directly embedded in my design notes. This process
works very well for me.

How much time is typically spent on debugging and testing? From what I
have heard, many badly written systems spend almost all of their time on
debugging. I spend only about 5% of my time on debugging and testing
combined. The code is designed to make automated testing easy.
From: Peter Olcott on
On 5/19/2010 1:18 AM, Joseph M. Newcomer wrote:
> See below...
> On Tue, 18 May 2010 20:42:55 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> On 5/18/2010 8:07 PM, Joseph M. Newcomer wrote:
>>> The counterexample is that he DID it. You'll have to ask him for proof. But I do lexical
>>> analysis all the time using DFAs and never once do I have a table. Years of experience
>>> taught me that they are faster than anything lex or analogous lexer generators can do. But
>>> I'll reserve judgment.
>>> joe
>>
>> How can you scan for dozens of keywords efficiently without a state
>> transition matrix? Without a state transition matrix its gotta be
>> something like if, else if, else if, et cetera.
> ****
> Your assumption that a "state transition matrix" is required shows that you have never
> actually written a hard-coded lexer. I've written so many that I've lost count. I'll
> knock one off as fast as I can type; in fact, typing speed is the limit to how long it
> takes to create one.
>
> How did you develop this fixation that a "state transition matrix" is required?

Because it is intuitively obvious that there can be no other way that
requires fewer machine cycles.

> That is
> only one way to implement a DFA. A switch statement works well,

This is a fundamental part of my DFA recognizer.

> too, and if-statements
> are considered a viable technique.

Several comparisons per byte must take longer than a single lookup of the
ActionCode switch-statement value in a state_transition_matrix that is
indexed by current_state and current_input_byte.

This is especially true when the state_transition_matrix is tiny enough
to always fit in cache. (Cache locality of reference is the most
significant thing that I learned from you).

With all of the experience that you cited below, the above sentences
should completely explain every relevant detail of my entire DFA design.

I see no possible way that any lexer that can recognize 50 different
keywords can be made any faster than the above design. This certainly
cannot be done with sequences of if, else if, ... statements.
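
To make the design concrete, here is a minimal sketch of the inner loop I
have in mind (illustrative only; the names ActionCode,
state_transition_matrix, current_state, and current_input_byte follow the
description above, and my actual code may differ in detail):

    // Minimal sketch of a table-driven DFA recognizer loop (illustrative only).
    // state_transition_matrix[state][byte] yields the next state; each state
    // carries an ActionCode that a switch statement dispatches on, so the
    // input byte itself is never compared against keyword text.

    #include <cstddef>
    #include <vector>

    enum ActionCode { CONTINUE, KEYWORD_FOUND, REJECT };

    struct Dfa {
        std::vector<std::vector<unsigned short> > state_transition_matrix; // [state][byte]
        std::vector<ActionCode> action;                                    // [state]
    };

    // Returns true if the input is exactly one of the keywords the DFA encodes
    // (this simple three-action model assumes no keyword is a prefix of another).
    bool Recognize(const Dfa& dfa, const unsigned char* input, std::size_t length) {
        unsigned short current_state = 0;
        for (std::size_t i = 0; i < length; ++i) {
            unsigned char current_input_byte = input[i];
            current_state =
                dfa.state_transition_matrix[current_state][current_input_byte];
            switch (dfa.action[current_state]) {
                case REJECT:        return false;
                case KEYWORD_FOUND: return i + 1 == length;
                case CONTINUE:      break;
            }
        }
        return false;
    }

With 50 keywords the matrix is small enough that I expect it to stay
resident in cache, which is the locality point above.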

> While a table-driven method is known to be one of the
> implementations, it is far and away not the ONLY known implementation. Nigel's experience
> was that a hard-coded recognizer without a table is blindingly fast, MUCH faster than a
> lexer that uses a transition matrix to drive the recognition. As these things go, using
> transition matrices is a rather "modern" invention in lexing. The compiler for the CDC
> Star computer used pipelined floating point operations to create a character class vector
> (because the Star was LOUSY at character manipulation but blindingly fast at pipelined
> floating point). So don't try to explain to me that table-driven lexers are the only
> possible implementation; I was writing hard-coded lexers in 1966, and using table-driven
> lexers in 1967 that were created from a lexer generator written by Art Evans for his PhD
> dissertation in 1961. We wrote a hand-coded lexer for the Bliss compiler. I've been
> writing compilers of one sort or another for 45 years, with table-driven components,
> hard-coded components, and everything in between. I've written sparse matrix packers
> using "comb vectors" to compress parse tables (an invention of people at Karlsruhe
> University in Germany). And I can write programs that can convert NDFA machines to DFA
> machines, and then optimize the DFA to minimum states, and used to give this as a freshman
> programming exercise. So please stop trying to tell me how much I know or don't know
> about lexers, parsers, and compiler implementations in general. I also worked closely
> with one of the world's best language designers, and I learned a lot about language design
> from him. Not enough to be as good as he is, but enough to know when I see a bad design.
> Do you know what a "perfect hash" is?
Yes.
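As I understand it, it is a hash function constructed for a fixed set of
keys known in advance, chosen so that every key lands in a distinct slot;
membership can then be decided with one hash computation and one string
comparison. A toy illustration I just made up (the hash below happens to
be collision-free only for these four strings):

    #include <cstring>
    #include <cstddef>

    // Toy perfect hash over a fixed four-string set. (first char + last char) % 7
    // gives each of these strings a distinct slot; a different key set would
    // generally need a different function, which is what tools like gperf
    // derive automatically.
    static const char* const kSlots[7] = {
        0,          // 0
        "delta",    // 1
        0, 0,       // 2, 3
        "charlie",  // 4
        "alpha",    // 5
        "bravo"     // 6
    };

    static unsigned Hash(const char* s, std::size_t n) {
        return (static_cast<unsigned char>(s[0]) +
                static_cast<unsigned char>(s[n - 1])) % 7;
    }

    // True if s is one of the four strings in kSlots.
    bool IsMember(const char* s) {
        std::size_t n = std::strlen(s);
        if (n == 0) return false;
        const char* candidate = kSlots[Hash(s, n)];
        return candidate != 0 && std::strcmp(candidate, s) == 0;
    }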

> I've built perfect hash algorithms. Do you know
> where they are used, and why?

No.

>
> I'm not some newbie at this business. I worked with optimizing compiler technology from
> 1970 through 1983, and my Ph.D. was in optimal code generation from table-driven code
> generators. I've built debuggers, byte code interpreters, tree interpreters, threaded
> code interpreters, and machine code generators. I wrote my first compiler in 1967, which
> took an Algol-like language and generated hard code. I was one of the key developers of
> an interactive language from 1967-1969. My first optimizing compiler was a student
> project in 1967 (I can date that exactly, because the machine I wrote it for was
> decommissioned in 1968; our grade was inversely proportional to code size, and I got an
> A+). I've forgotten more about compilers than most people have ever learned. The last
> compiler I wrote had not a trace of tables anywhere in it, and I wrote it three years ago
> (I just checked the file dates). My first Windows project was a compiler which had not a
> single table for either its lexer or parser, back in 1990. So I have no idea why you
> think these representations are essential.
>
> And my skills pale beside those of deep experts like Nigel Horspool; see, for example,
>
> http://webhome.cs.uvic.ca/~nigelh/pubs.html
>
> and I recommend the paper
>
> http://webhome.cs.uvic.ca/~nigelh/Publications/fastparse.pdf
>
> A 1990 paper wherein he shows why performance of table-driven parsers can be improved by
> very large factors (like 8x) by hard-coding the transitions in pure C code without a table
> anywhere in the resulting compiler. Don't try to explain why table-driven must be faster
> than any alternative when it was proven 20 years ago that this is not true. I suggest
> looking at Table 2 on page 19. You asked for proof; here it is.
> ****

The paper is named "fastparse", and I already know that parsing might be
done faster without a table. What about lexing? I did a keyword search
for "lex" (which would also match within "lexical") and it was nowhere in
the paper. I suggest that you might be incorrect in your assertion.
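
For reference, my understanding of the hard-coded, table-free style you
are advocating is something like the following sketch (hypothetical; this
is not code from the paper, and it handles only two keywords):

    // Hypothetical sketch of a hard-coded recognizer: the DFA states are
    // encoded directly in the control flow (nested switches), with no
    // transition matrix anywhere. Recognizes only "if" and "int", and
    // ignores the question of what character follows the keyword.

    #include <cstddef>

    // Returns the length of the keyword matched at p, or 0 if none matches.
    std::size_t MatchKeyword(const char* p) {
        switch (p[0]) {
            case 'i':
                switch (p[1]) {
                    case 'f': return 2;                      // "if"
                    case 'n': return (p[2] == 't') ? 3 : 0;  // "int"
                }
                break;
        }
        return 0;   // no keyword starts here
    }

My question remains whether, for 50 keywords, this style really beats one
indexed table lookup per byte.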

>>
>> With a state transition matrix you can simultaneously look for an
>> unlimited number of different character strings with zero comparisons
>> per byte.
> ****
> So what's your complaint about adding lexer rules for localized digit sequences? You just
> said it won't add any time...

It adds development time that I cannot afford to spend right now.

> joe
>>
>>>
>>> On Tue, 18 May 2010 17:36:48 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>>
>>>> On 5/18/2010 5:26 PM, Joseph M. Newcomer wrote:
>>>>> It is important to distinguish between the *concept* of DFA and the implementation of "a
>>>>> table-driven DFA" and the special case of "a table-driven DFA generated by the lex tool"
>>>>>
>>>>> All lexical analysis is done using the concept of DFAs, but the table-driven DFA output by
>>>>> lex is not very fast. Again, I point to the work of Nigel Horspool, which is very old
>>>>> work at this point, of generating tight C code, and no table appears anywhere in his
>>>>> implementation.
>>>>> joe
>>>>>
>>>> I really, really doubt that no table is used in a lexical analyzer that
>>>> is very very fast. Some little include of an include that you missed is
>>>> more likely. I can't accept this without direct proof.
>>>>
>>>> I can accept that he may have derived a parser that is very very fast
>>>> without the need for a table. This is more plausible.
>>> Joseph M. Newcomer [MVP]
>>> email: newcomer(a)flounder.com
>>> Web: http://www.flounder.com
>>> MVP Tips: http://www.flounder.com/mvp_tips.htm
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on
Actually, I state *opinions* that nobody has to accept.
joe

On Tue, 18 May 2010 22:34:59 -0700, "Mihai N." <nmihai_year_2000(a)yahoo.com> wrote:

>
>> I can't accept this without direct proof.
>
>Yet, we are all supposed to take everything *you*
>say without any proof.
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on
On 5/19/2010 10:29 AM, Joseph M. Newcomer wrote:
> Actually, I state *opinions* that nobody has to accept.
> joe
>
> On Tue, 18 May 2010 22:34:59 -0700, "Mihai N."<nmihai_year_2000(a)yahoo.com> wrote:
>
>>
>>> I can't accept this without direct proof.
>>
>> Yet, we are all supposed to take everything *you*
>> say without any proof.
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

The evidence that you provided to show that a faster alternative exists
to a state-transition-matrix-based lexer did no lexing, only parsing. The
title of the paper pertained to parsing, not lexing.