From: Peter Olcott on
On 5/18/2010 2:58 AM, Mihai N. wrote:
>
>> Ultimately all UTF-8 validators must be regular expressions implemented
>> as finite state machines.
>
>> I can't imagine a better way.
> That, right here, is your problem. This is where you are wrong.
>
>
My method can completely validate any UTF-8 sequence of bytes and decode
it into its corresponding code point values in fewer machine clock
cycles than any possible alternative.** I am not wrong and will post the
code within about a week.

** Not counting hand tweaking the corresponding assembly language
equivalent of my method. Also not counting slight improvements to the
basic method that may be suggested.

What I am claiming is that my basic method is the fastest possible
method; I am not claiming that the code perfectly implements this method.
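
For concreteness, here is a minimal sketch of what one table-driven UTF-8 validation DFA
can look like. It is illustrative only and is not the code being claimed above; the byte
classes and state names are arbitrary, and a tuned implementation would bake both tables
in as static const data rather than building them at startup.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte classes: each input byte is mapped to one of these before the
   state transition, so the transition table stays small. */
enum { C_ASCII, C_80_8F, C_90_9F, C_A0_BF, C_BAD, C_C2_DF,
       C_E0, C_E1_EC_EE_EF, C_ED, C_F0, C_F1_F3, C_F4, NCLASS };

/* States: S_ACC is both the start and the accept state. */
enum { S_ACC, S_1MORE, S_2MORE, S_E0, S_ED, S_3MORE, S_F0, S_F4, S_REJ, NSTATE };

static uint8_t byte_class[256];
static uint8_t next_state[NSTATE][NCLASS];

static void build_tables(void)
{
    int b;
    memset(next_state, S_REJ, sizeof next_state);        /* default: reject */

    for (b = 0x00; b <= 0x7F; b++) byte_class[b] = C_ASCII;
    for (b = 0x80; b <= 0x8F; b++) byte_class[b] = C_80_8F;
    for (b = 0x90; b <= 0x9F; b++) byte_class[b] = C_90_9F;
    for (b = 0xA0; b <= 0xBF; b++) byte_class[b] = C_A0_BF;
    for (b = 0xC0; b <= 0xC1; b++) byte_class[b] = C_BAD;         /* overlong leads */
    for (b = 0xC2; b <= 0xDF; b++) byte_class[b] = C_C2_DF;
    byte_class[0xE0] = C_E0;
    for (b = 0xE1; b <= 0xEC; b++) byte_class[b] = C_E1_EC_EE_EF;
    byte_class[0xED] = C_ED;
    byte_class[0xEE] = C_E1_EC_EE_EF;
    byte_class[0xEF] = C_E1_EC_EE_EF;
    byte_class[0xF0] = C_F0;
    for (b = 0xF1; b <= 0xF3; b++) byte_class[b] = C_F1_F3;
    byte_class[0xF4] = C_F4;
    for (b = 0xF5; b <= 0xFF; b++) byte_class[b] = C_BAD;         /* beyond U+10FFFF */

    /* Start state: dispatch on the lead byte. */
    next_state[S_ACC][C_ASCII]       = S_ACC;
    next_state[S_ACC][C_C2_DF]       = S_1MORE;
    next_state[S_ACC][C_E0]          = S_E0;
    next_state[S_ACC][C_E1_EC_EE_EF] = S_2MORE;
    next_state[S_ACC][C_ED]          = S_ED;
    next_state[S_ACC][C_F0]          = S_F0;
    next_state[S_ACC][C_F1_F3]       = S_3MORE;
    next_state[S_ACC][C_F4]          = S_F4;

    /* Unrestricted continuation bytes. */
    next_state[S_1MORE][C_80_8F] = S_ACC;
    next_state[S_1MORE][C_90_9F] = S_ACC;
    next_state[S_1MORE][C_A0_BF] = S_ACC;
    next_state[S_2MORE][C_80_8F] = S_1MORE;
    next_state[S_2MORE][C_90_9F] = S_1MORE;
    next_state[S_2MORE][C_A0_BF] = S_1MORE;
    next_state[S_3MORE][C_80_8F] = S_2MORE;
    next_state[S_3MORE][C_90_9F] = S_2MORE;
    next_state[S_3MORE][C_A0_BF] = S_2MORE;

    /* Restricted second bytes. */
    next_state[S_E0][C_A0_BF] = S_1MORE;    /* E0: A0..BF only (no overlongs)  */
    next_state[S_ED][C_80_8F] = S_1MORE;    /* ED: 80..9F only (no surrogates) */
    next_state[S_ED][C_90_9F] = S_1MORE;
    next_state[S_F0][C_90_9F] = S_2MORE;    /* F0: 90..BF only (no overlongs)  */
    next_state[S_F0][C_A0_BF] = S_2MORE;
    next_state[S_F4][C_80_8F] = S_2MORE;    /* F4: 80..8F only (<= U+10FFFF)   */
}

/* Returns 1 if s[0..len) is well-formed UTF-8, else 0.
   build_tables() must have been called once beforehand. */
int utf8_valid(const uint8_t *s, size_t len)
{
    size_t i;
    int state = S_ACC;
    for (i = 0; i < len; i++) {
        state = next_state[state][byte_class[s[i]]];
        if (state == S_REJ) return 0;
    }
    return state == S_ACC;                  /* must not end mid-sequence */
}

The inner loop is two table lookups and a compare per byte; decoding to code points can be
folded into the same loop by accumulating the payload bits of each byte as the states
advance.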
From: Peter Olcott on
On 5/18/2010 11:07 AM, Joseph M. Newcomer wrote:
> See below...
> On Tue, 18 May 2010 00:58:16 -0700, "Mihai N."<nmihai_year_2000(a)yahoo.com> wrote:
>
>>
>>> Ultimately all UTF-8 validators must be regular expressions implemented
>>> as finite state machines.
>>
>>> I can't imagine a better way.
>> That, right here, is your problem. This is where you are wrong.
> ****
> Not if you live in the deep center of a Reality Distortion Field.
>
> Nigel Horspool proved, many years ago (close to 20), that table-driven DFAs and parsers were
> far too slow compared to custom C code, but what he did was generate the C code from the
> lex and yacc tables. His comment was "this code embodies some of the worst coding
> practices you can imagine, but because it is generated automatically, no one ever has to
> actually look at it or try to maintain it. So we can use a goto into the middle of
> another case of a switch and not hesitate"
>
> He could parse a million lines a minute of C code and pretty much every other language.
> The critique was "but you have lousy error recovery on the parse" and his response was
> "but at a million lines a minute, I can parse any reasonable-sized program completely on
> every keystroke, and you won't even notice the delay, because the parse is finished before
> your finger comes up from the key". So he only needed to flag the first error and not
> have to restart the parse at a "clean" point, the nightmare of all parsing technologies.
>
> He is a professor at the University of Victoria, British Columbia.

My UTF-8 DFA is hand-tweaked. Lex may not produce the fastest possible
DFA, but a DFA is still the fastest possible approach. I doubt that you
can find any lexical analyzer that is not DFA-based that can beat the
fastest DFA-based lexical analyzer.

If you can provide one, I would be interested in seeing a valid
counterexample; it will not take me very long to confirm your findings.

> joe
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on
It is important to distinguish between the *concept* of a DFA, the implementation of "a
table-driven DFA", and the special case of "a table-driven DFA generated by the lex tool".

All lexical analysis is done using the concept of DFAs, but the table-driven DFA output by
lex is not very fast. Again, I point to the work of Nigel Horspool (very old work at this
point) on generating tight C code; no table appears anywhere in his implementation.
joe

On Tue, 18 May 2010 14:30:39 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/18/2010 11:07 AM, Joseph M. Newcomer wrote:
>> See below...
>> On Tue, 18 May 2010 00:58:16 -0700, "Mihai N."<nmihai_year_2000(a)yahoo.com> wrote:
>>
>>>
>>>> Ultimately all UTF-8 validators must be regular expressions implemented
>>>> as finite state machines.
>>>
>>>> I can't imagine a better way.
>>> That, right here, is your problem. This is where you are wrong.
>> ****
>> Not if you live in the deep center of a Reality Distortion Field.
>>
>> Nigel Horspool proved, many years ago (close to 20), that table-driven DFAs and parsers were
>> far too slow compared to custom C code, but what he did was generate the C code from the
>> lex and yacc tables. His comment was "this code embodies some of the worst coding
>> practices you can imagine, but because it is generated automatically, no one ever has to
>> actually look at it or try to maintain it. So we can use a goto into the middle of
>> another case of a switch and not hesitate"
>>
>> He could parse a million lines a minute of C code and pretty much every other language.
>> The critique was "but you have lousy error recovery on the parse" and his response was
>> "but at a million lines a minute, I can parse any reasonable-sized program completely on
>> every keystroke, and you won't even notice the delay, because the parse is finished before
>> your finger comes up from the key". So he only needed to flag the first error and not
>> have to restart the parse at a "clean" point, the nightmare of all parsing technologies.
>>
>> He is a professor at the University of Victoria, British Columbia.
>
>My UTF-8 DFA is hand-tweaked. Lex may not produce the fastest possible
>DFA, but a DFA is still the fastest possible approach. I doubt that you
>can find any lexical analyzer that is not DFA-based that can beat the
>fastest DFA-based lexical analyzer.
>
>If you can provide one, I would be interested in seeing a valid
>counterexample; it will not take me very long to confirm your findings.
>
>> joe
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on
On 5/18/2010 5:26 PM, Joseph M. Newcomer wrote:
> It is important to distinguish between the *concept* of a DFA, the implementation of "a
> table-driven DFA", and the special case of "a table-driven DFA generated by the lex tool".
>
> All lexical analysis is done using the concept of DFAs, but the table-driven DFA output by
> lex is not very fast. Again, I point to the work of Nigel Horspool (very old work at this
> point) on generating tight C code; no table appears anywhere in his implementation.
> joe
>
I really, really doubt that no table is used in a lexical analyzer that
is very, very fast. Some little include of an include that you missed is
more likely. I can't accept this without direct proof.

I can accept that he may have derived a parser that is very, very fast
without the need for a table. This is more plausible.
From: Joseph M. Newcomer on
See below...
On Tue, 18 May 2010 10:39:09 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>>>
>>> That is just as true as the statement that no one is more Jewish than
>>> the pope.
>>>
>>> I started with a known good Yacc and Lex specification for "C"
>>> http://www.lysator.liu.se/c/ANSI-C-grammar-y.html
>>> http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
>> ****
>> I presume these are Yacc and Lex specifications, but why do you claim they are "known
>> good" specifications? And what do they have to do with UTF-8 encodings?
>>
>> You don't have a good grasp of what constitutes a "letter" beyond the ASCII-7 spec, so how
>> can you extend it to support UTF-8?
>> ****
>
>I already explained this several times, I will not repeat myself again.
****
But you always give the same silly explanation, that "it is easier". It doesn't justify
why you have decided that punctuation marks and digits are "letters". So if you gave the
same meaningless explanation, the question still remains.
****
>
>>>
>>> I carefully constructed code to build a very clean abstract syntax tree.
>>> I made absolute minimal transformations to the original Yacc and Lex
>>> spec to attain the subset of C++ that I will be implementing.
>>> (Basically the "C" language plus classes and minus pointers).
>>>
>>> I used this abstract syntax tree to generate jump code with an absolute
>>> minimum number of branches for the control flow statements that this
>>> C++ subset will be providing. Testing has indicated that these results
>>> have been correctly achieved.
>> ****
>> What is "jump code"? I've been in the compiler business since 1970 and I've never heard
>> of this code style. We assign tasks of generating code to first-semester compiler course
>> students; it isn't all that challenging. I did a google of "jump code" and got a lot of
>> useless stuff about advertising links in Web pages. Somehow, I don't believe you are
>> generating these from your compiler.
>> ****
>
>Take a look at the Dragon book under intermediate code generation. It is
>pretty easy to generate control flow code once you have derived an
>abstract syntax tree. Trying to do this without an abstract syntax tree
>is a little too difficult. The next-level challenge is to generate this
>control flow with an absolute minimum number of jumps (gotos).
****
This is really trivial, but I never heard it called "jump code". We have referred to it
as "byte codes", "quads", "threaded interpretive code", and other designations. The fact
that google did not uncover any compiler references for the phrase "jump code" puts it
definitely off the mainstream.

The minimum-jump solution is often best handled by peephole optimization on the quads that
represent the symbolic code; it is a lot more interesting when you have architectures that
have short-jump and long-jump instructions, particularly when the conditional jumps are
all short-jumps. For example, the optimizations the Bliss-11 compiler was doing in 1972.
The basic blocks were re-sorted by the peephole optimizer to minimize the number of jump
instructions, given the nature of the short conditional jumps. Ho hum. Did you think this
was a *new* problem? It is a very, very old problem, and was first addressed in the
FORTRAN I compiler. The problem there was the host machine (IBM 704/709/709x) didn't have
enough memory to do jump-sorting of basic blocks. By the time we did Bliss-11 we had
approximately a megabyte of memory to hold each function's intermediate representation.
Well, half a megabyte, given the code size. Steve Hobbs was the expert on jump
optimization in that compiler.
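
For concreteness, here is a minimal sketch of the Dragon-book style translation for a
single if/else node. The AST shape, gen_stmt(), and the textual quads are hypothetical,
chosen only to show the basic trick: jump on the *false* outcome so the then-branch falls
through, costing one conditional jump for a plain "if" and one conditional jump plus one
goto for "if/else".

#include <stdio.h>

struct If {
    const char *cond_lhs;   /* condition is "cond_lhs < cond_rhs"     */
    int         cond_rhs;
    const char *then_text;  /* stand-ins for the real statement       */
    const char *else_text;  /* subtrees; NULL means there is no else  */
};

static int label_counter;
static int new_label(void) { return ++label_counter; }

static void gen_stmt(const char *text) { printf("  %s\n", text); }

static void gen_if(const struct If *n)
{
    int else_lbl = new_label();
    printf("  if %s >= %d goto L%d\n", n->cond_lhs, n->cond_rhs, else_lbl);
    gen_stmt(n->then_text);                 /* then-branch falls through */
    if (n->else_text == NULL) {
        printf("L%d:\n", else_lbl);
    } else {
        int end_lbl = new_label();
        printf("  goto L%d\n", end_lbl);    /* skip over the else-branch */
        printf("L%d:\n", else_lbl);
        gen_stmt(n->else_text);
        printf("L%d:\n", end_lbl);
    }
}

int main(void)
{
    struct If n = { "x", 10, "y = 1", "y = 2" };
    gen_if(&n);     /* prints the three-address "jump code" for the node */
    return 0;
}

A peephole pass over quads like these then deletes any goto Ln that immediately precedes
the label Ln:, merges chained jumps, and can re-sort basic blocks, which is the kind of
jump minimization described above.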
****
>
>>>
>>> Then I designed the whole infrastucture to handle every type of
>>> elemental operation upon the fundamental data types, as well the
>>> aggregate data types.
>> ****
>> How do you know how much code is required, so that you know you are 75% done? I was once
>
>This can be reasonably accurately estimated once detailed design is
>completed. I can know that I am about 75% complete because most of the
>first draft of all of the code is complete, and the crucial
>infrastructure components are working and tested.
>
>I break down the design of any system into a hierarchy of increasing
>specificity. With my method the last step of detailed design is coding,
>and I do this in Microsoft Word.
****
You code in Word? Word is essentially user-hostile for writing code!
****
>
>The primary measure of code quality is the degree to which nonessential
>complexity has been eliminated. I spent a lot of time making the code as
>simple and easy to read as possible. Because of this my desk checking is
>very effective at getting the code nearly completely correct the first
>time, without any help from the compiler.
****
There are many measures of code quality. Mine is that the code is robust under
maintenance, that is, it can be easily modified by someone else because there are no
orthogonal dependencies. "Simple" is not always the best metric for such cases.

I still can't figure out how you can estimate the amount of code required in the future.
There is rarely a linear correlation between specification and lines of code.

But my experience in managing projects really soured me on the ability of a programmer to
estimate how much work is left to do.
joe
****
>
>After I have reviewed this code many times, I drop it into the compiler
>to correct the syntax errors. Then a little more time is required to
>test and debug it. Since it was designed to make testing easy, both
>testing and debugging combined typically take about 5% of the total
>development time. There are never any serious logic errors ever detected
>in debugging.
>
>> consulting, and the company said "Yes, you've identified a serious bug, but we can't fix
>> it now during the coding phase, because we are 37.247% done with the coding" (they were
>> coding something the like of which had never existed before, but they could tell to three
>> decimal places how far they were into the coding) "...So we'll let it go to the testing
>> phase, and fix it there" (never mind it would cost several times as much to fix it; it was
>> somebody else's schedule). So I am always deeply suspect of any "n% done" statements
>> under any conditions (let alone something like "75%", which means "more than 74% and less
>> than 76%"). An estimate of "3/4 done" (which means you know to within one part in 4 how
>> much might be left) makes sense. But as a physicist I was taught to never add a decimal
>> point or digit of precision beyond what is actually known. "7/10" is not the same as
>> "70%" except in abstract math. "70%" means "not 69%, not 71%, I know to one part in 100
>> how accurate I am", but "7/10" means "more than 6/10 and less than 8/10", which would
>> translate to "70%±10%". So you have to not just quote a value, but your confidence in the
>> value.
>>
>> In Pennsylvania, it was announced we have "75% compliance" with the 2010 census. This
>> leads to the question, if we are attempting to find out how many people we have, how do we
>> know that 75% of them have responded? We don't actually know how many people we have,
>> because we have a partial response, but how do we know it is 75%? 75% of how many? Well,
>> we don't know how many, that's what we're trying to figure out...
>> joe
>> ****
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm