Is this Regular Expression for UTF-8 Correct?? [MFC]

Prev: Where to handle CSliderCtl messages in
Next: Blocking mouse clicks

From: Mihai N. on 18 May 2010 03:58

> Ultimately all UTF-8 validators must be regular expressions implemented
> as finite state machines.

> I can't imagine a better way.
That, right here, is your problem. This is where you are wrong.

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Peter Olcott on 18 May 2010 11:39

>>
>> That is just as true as the statement that no one is more Jewish than
>> the pope.
>>
>> I started with a known good Yacc and Lex specification for "C"
>> http://www.lysator.liu.se/c/ANSI-C-grammar-y.html
>> http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
> ****
> I presume these are Yacc and Lex specifications, but why do you claim they are "known
> good" specifications? And what do they have to do with UTF-8 encodings?
>
> You don't have a good grasp of what constitutes a "letter" beyond the ASCII-7 spec so how
> can you extend it to support UTF-8?
> ****

I already explained this several times, I will not repeat myself again.

>>
>> I carefully constructed code to build a very clean abstract syntax tree.
>> I made absolute minimal transformations to the original Yacc and Lex
>> spec to attain the subset of C++ that I will be implementing.
>> (Basically the "C" language plus classes and minus pointers).
>>
>> I used this abstract syntax tree to generate jump code with absolute
>> minimum number of of branches for the control flow statements that this
>> C++ subset will be providing. Testing has indicated that these results
>> have been correctly achieved.
> ****
> What is "jump code"? I've been in the compiler business since 1970 and I've never heard
> of this code style. We assign tasks of generating code to first-semester compiler course
> students; it isn't all that challenging. I did a google of "jump code" and got a lot of
> useless stuff about advertising links in Web pages. Somehow, I don't believe you are
> generating these from your compiler.
> ****

Take a look at the Dragon book under intermediate code generation. It is
pretty east to generate control flow code once you have derived an
abstract syntax tree. Trying to do this without an abstract syntax tree
is a little too difficult. The next level challenge is to generate this
control flow with with an absolute minimum number of jumps (gotos).

>>
>> Then I designed the whole infrastucture to handle every type of
>> elemental operation upon the fundamental data types, as well the
>> aggregate data types.
> ****
> How do you know how much code is required, so that you know you are 75% done? I was once

This can be reasonably accurately estimated once detailed design is
completed. I can know that I am about 75% complete because most of the
first draft of all of the code is complete, and the crucial
infrastructure components are working and tested.

I break down the design of any system into a hierarchy of increasing
specificity. With my method the last step of detailed design is coding,
and I do this in Microsoft Word.

The primary measure of code quality is the degree that non essential
complexity has been eliminated. I spent a lot of time making the code as
simple and easy to read as possible. Because of this my desk checking is
very effective at getting the code nearly completely correct the first
time, without any help from the compiler.

After I have reviewed this code many times, I drop it into the compiler
to correct the syntax errors. Then a little more time is required to
test and debug it. Since it was designed to make testing easy both
testing and debugging combined typically take about 5% of the total
development time. There are never any serious logic errors ever detected
in debugging.

> consulting, and the company said "Yes, you've identified a serious bug, but we can't fix
> it now during the coding phase, because we are 37.247% done with the coding" (they were
> coding something the like of which had never existed before, but they could tell to three
> decimal places how far they were into the coding) "...So we'll let it go to the testing
> phase, and fix it there" (never mind it would cost several times as much to fix it; it was
> somebody elses's schedule). So I am always deeply suspect of any "n% done" statements
> under any conditions (let alonesomething like "75%" which means "more than 74% and less
> than 76%". An estimate of "3/4 done" (which means you know to within one part in 4 how
> much might be left) makes sense. But as a physicist I was taught to never add a decimal
> point or digit of precision beyond what is actually known. "7/10" is not the same as
> "70%" except in abstract math. "70%" means "not 69%, not 71%, I know to one part in 100
> how accurate I am" but "7/10" means "more than 6/10 and less than 8/10" which would
> translate to "70%�10%". So you have to not just quote a value, but your confidence in the
> value.
>
> In Pennsylvania, it was announced we have "75% compliance" with the 2010 census. This
> leads to the question, if we are attempting to find out how many people we have, how do we
> know that 75% of them have responded? We don't actually know how many people we have,
> because we have a partial response, but how do we know it is 75%? 75% of how many? Well,
> we don't know how many, that's what we're trying to figure out...
> joe
> ****
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on 18 May 2010 12:01

See below...
On Mon, 17 May 2010 22:29:15 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/17/2010 4:38 PM, Joseph M. Newcomer wrote:
>> See below...
>> On Mon, 17 May 2010 14:05:54 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>
>>> On 5/17/2010 11:24 AM, Joseph M. Newcomer wrote:
>>>> See below...
>>>> On Mon, 17 May 2010 09:04:07 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>>>
>>>>>> Huh? What's this got to do with the encoding?
>>>>> (1) Lex requires a RegEx
>>>> ***
>>>> Your regexp should be in terms of UTF32, not UTF8.
>>>
>>> Wrong, Lex can not handle data larger than bytes.
>> ***
>> The wrong choice of tooling. Just because the tooling is stupid is no reason to cater to
>> its failures.
>>
>> It is too bad, but lex was done in a different era when everyone used 7-bit ASCII. Using
>> antiquated tools generates problems that in a sane world would not have to exist. I'd
>> think that converting lex to 32-bit would be a simpler solution.
>> ****
>>>
>>>> ****
>>>>>
>>>>> (2) I still must convert from UTF-8 to UTF-32, and I don't think that a
>>>>> faster or simpler way to do this besides a regular expression
>>>>> implemented as a finite state machine can possibly exist.
>>>> ****
>>>> Actually, there is; you obviously know nothing about UTF-8, or you would know that the
>>>> high-order bits of the first byte tell you the length of the encoding, and the FSM is
>>>> written entirely in terms of the actual encoding, and is never written as a regexp.
>>>
>>> Ah I see, if I don't know everything that I must know nothing, I think
>>> that this logic is flawed. None of the docs that I read mentioned this
>>> nuance. It may prove to be useful. It looks like this will be most
>>> helpful when translating from UTF-32 to UTF-8, and not the other way
>>> around.
>> ****
>> You are asking questions based on such elementary properties to the encoding that there is
>> little excuse for defending your position. You did not start out by saying "I know
>> nothing about UTF-8 at all..." in which case the simple answer is
>> (a) read about UTF-8 encoding (I've given you the citation, including
>> the page number)
>> (b) realize that your base approach is almost certainly wrong.
>> ****
>>>
>>> It would still seem to be slower and more complex than a DFA based
>>> finite state machine for validating a UTF-8 byte sequence. It also look
>>> like it would be slower for translating from UTF-8 to UTF-32.
>> ****
>> If I didn't have to leave in 10 minutes, I'd write the UTF-18 to UT32 conversion. But I
>> have to put together some items for my presentation (find the extension cord, among other
>> tasks) which doesn't quite leave enough time to write the code, which will take almost the
>> entire 10 minutes to write. But when I get back tonight, I'll write it and time myself.
>>
>> But speed is not an issue here, and lex was never intended to support UTF-8. It sounds
>> like you are complaining that punched cards don't support full Unicode. Wrong tool for
>> the job.
>> joe
>
>I have made Lex handle UTF-8 in an optimal way, thus you are wrong yet
>again. I do now agree with you that UTF-32 is generally much better than
>UTF-8 for internal representation.
****
Prove it.
****
>
>I want my GUI scripting language to be like the original Turbo Pascal
>2.0, at least 100-fold faster compiles than the next best alternative.
****
Look into the work done by Nigel Horspool (University of Victoria, BC) who is possibly the
world's expert on creating fast parsers from lex and yacc output. He uses NO tables at
all, ever, nothing but fast code generated from the tables. The tables are too slow. So
if you are doing anything table-driven, you are already far behind the optimal performance
point.
joe
****
>
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on 18 May 2010 12:07

See below...
On Tue, 18 May 2010 00:58:16 -0700, "Mihai N." <nmihai_year_2000(a)yahoo.com> wrote:

>
>> Ultimately all UTF-8 validators must be regular expressions implemented
>> as finite state machines.
>
>> I can't imagine a better way.
>That, right here, is your problem. This is where you are wrong.
****
Not if you live in the deep center of a Reality Distortion Field.

Nigel Horspool provde, many years ago (close to 20) that table-drive DFAs and parsers were
far too slow compared to custom C code, but what he did was generate the C code from the
lex and yacc tables. His comment was "this code embodies some of the worst coding
practices you can imagine, but because it is generated automatically, no one ever has to
actually look at it or try to maintain it. So we can use a goto into the middle of
another case of a switch and not hesitate"

He could parse a million lines a minute of C code and pretty much every other language.
The critique was "but you have lousy error recovery on the parse" and his response was
"but at a million lines a minute, I can parse any reasonable-sized program completely on
every keystroke, and you won't even notice the delay, because the parse is finished before
your finger comes up from the key". So he only needed to flag the first error and not
have to restart the parse at a "clean" point, the nightmare of all parsing technologies.

He is a professor at the University of Victoria, British Columbia.
joe
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Peter Olcott on 18 May 2010 14:38

On 5/17/2010 11:23 PM, Joseph M. Newcomer wrote:
> See below...
> On Mon, 17 May 2010 14:53:39 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> On 5/17/2010 11:51 AM, Joseph M. Newcomer wrote:
>>> See below...
>>> On Mon, 17 May 2010 08:57:58 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>>>
>>>>
>>>> I still MUST have a correct UTF-8 RegEx because my interpreter is 75%
>>>> completed using Lex and Yacc. Besides this I need a good way to parse
>>>> UTF-8 to convert it to UTF-32.
>>> ****
>>> No, it sucks. For reasons I have pointed out. You can easily write a UTF-32 converter
>>> just based on the table in the Unicode 5.0 manual!
>>
>> Lex can ONLY handle bytes. Lex apparently can handle the RegEx that I
>> posted. I am basically defining every UTF-8 byte sequence above the
>> ASCII range as a valid Letter that can be used in an Identifier.
>> [A-Za-z_] can also be used as a Letter.
> ****
> No, you are saying "because I chose to use an obsolete piece of software, I am constrained
> to think in obsolete terms".
>
> Note that it makes no sense to declare every UTF-8 codepoint that translates UTF-32> 0xFF
> as a letter. Only in some fantasy world could this possibly make sense, since you are
> declaring that all localized digits, localized punctuation marks, and math symbols must be
> letters! I find it hard to imagine how you can say this with your stated goal of letting
> people code in their native language. Imagine if I allowed ASCII-7 identifiers like
> "A+B"
> "$%^&"
> but that's what you are doing if you recognize international punctuation marks and digits
> as "letters"
>
> So I don't believe you have a clue about how to handle UTF-8.

I really don't care what you believe.

>
> I'd first look into fixing lex to handle modern coding practice. Otherwise, you have a
> tool that doesn't solve your problem. You have to write lexical rules that recognize ONLY
> letters about U00FF, and for digits, you have to allow digit sequences that allow
> localized digits. And if I use the symbols ABCDEFGHIJ to represent some sequence of
> localized digits, then you should write rules that allow numbers to be written like "123"
> and "ABC" (try to remember that by "A" I mean the localized value of the digit "1"), but
> then how do you handle a sequence like A1A? Does it translate to tne number we would
> write as "111" or is it syntactically illegal? Essentially, doing this in UTF-8 is pretty
> miserable, but if you don't do it, you are parsing nonsense phrases.
> joe
> ****
>>
>>>
>>> I realized that I have this information on a slide in my course, which is on my laptop, so
>>> with a little copy-and-paste-and-reformat, here's the table. Note that no massive FSM
>>> recognition is required to do the conversion, and it is even questionable as to whether an
>>> FSM is required at all!
>>>
>>> All symbols represent bits, and x, y, u, z and w are metasymbols for bits that can be
>>> either 0 or 1
>>>
>>> UTF-32 00000000 00000000 00000000 0xxxxxxx
>>> UTF-16 00000000 0xxxxxxx
>>> UTF-8 0xxxxxx
>>>
>>> UTF-32 00000000 00000000 00000yyy yyxxxxxx
>>> UTF-16 00000yyy yyxxxxxx
>>> UTF-8 110yyyyy 10xxxxxx
>>>
>>> UTF-32 00000000 00000000 zzzzyyyy yyxxxxxx
>>> UTF-16 zzzzyyyy yyxxxxxx
>>> UTF-8 1110zzzz 10yyyyyy 10xxxxxx
>>>
>>> UTF-32 00000000 000uuuuu zzzzyyyy yyzzzzzz
>>> UTF-16 110110ww wwzzzzyy 110111yy yyxxxxxx*
>>> UTF-8 11110uuu 10uuzzzzz 10yyyyyy 10xxxxxx
>>>
>>> uuuuu = wwww + 1
>>> Joseph M. Newcomer [MVP]
>>> email: newcomer(a)flounder.com
>>> Web: http://www.flounder.com
>>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>>
>> I was aware of the encodings between UTF-8 and UTF-32, the encoding to
>> UTF-16 looks a little clumsy when we get to four UTF-8 bytes.
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

First | Prev | Next | Last
Pages: 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Prev: Where to handle CSliderCtl messages in
Next: Blocking mouse clicks