From: Peter Olcott on
On 5/17/2010 11:24 AM, Joseph M. Newcomer wrote:
> See below...
> On Mon, 17 May 2010 09:04:07 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>>> Huh? What's this got to do with the encoding?
>> (1) Lex requires a RegEx
> ***
> Your regexp should be in terms of UTF32, not UTF8.

Wrong. Lex cannot handle input units larger than a byte.

> ****
>>
>> (2) I still must convert from UTF-8 to UTF-32, and I don't think that a
>> faster or simpler way to do this besides a regular expression
>> implemented as a finite state machine can possibly exist.
> ****
> Actually, there is; you obviously know nothing about UTF-8, or you would know that the
> high-order bits of the first byte tell you the length of the encoding, and the FSM is
> written entirely in terms of the actual encoding, and is never written as a regexp.

Ah, I see: if I don't know everything then I must know nothing. I think
that logic is flawed. None of the docs that I read mentioned this nuance.
It may prove to be useful. It looks like it will be most helpful when
translating from UTF-32 to UTF-8, and not the other way around.

It would still seem to be slower and more complex than a DFA-based finite
state machine for validating a UTF-8 byte sequence. It also looks like it
would be slower for translating from UTF-8 to UTF-32.
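
For reference, here is roughly how I understand that first-byte rule,
written as a small C sketch (not my actual code):

#include <stdint.h>

/* A minimal sketch of the point made above: the count of leading 1-bits
 * in the first byte of a UTF-8 sequence gives its total length.
 * Returns 1..4 for a valid lead byte, 0 for a continuation byte or an
 * invalid lead byte (0xFE/0xFF, or the old 5/6-byte forms). */
static int utf8_sequence_length(uint8_t lead)
{
    if ((lead & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;                              /* 10xxxxxx or invalid */
}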

From: Peter Olcott on
On 5/17/2010 11:31 AM, Joseph M. Newcomer wrote:
> See below...
> On Mon, 17 May 2010 08:49:14 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> Ultimately all UTF-8 validators must be regular expressions implemented
>> as finite state machines. I can't imagine a better way.
> ***
> Perhaps. But since you obviously never read the documentation on UTF-8, you did not
> realize that it is based entirely on the number of high-order 1-bits in the first byte of
> the sequence. This does not require a regexp written in terms of byte ranges to handle
> correctly, and nobody writing a UTF-8 validator would consider this approach. The set of
> rules you are citing is essentially "if my only tool is a regexp recognizer, how can I
> solve a trivial problem using that tool?". It doesn't say the solution is a *good*
> solution, only that it is *a* solution in the artificially constrained solution space. If
> you remove the constraint (as most practical programmers would) then the complexity
> evaporates.
>
> So not only is it *possible* to imagine a better way, the better way is *actually
> documented* in the Unicode spec!

My GUI scripting language is 75% complete and written in Lex and Yacc.
Lex cannot handle input units larger than a byte.

> I'm sorry you have such a limited imagination, but it is
> one of your weaknesses. You start plunging down a garden path and never once ask "is this
> the right or best path to my goal?" It may not even lead to the goal, but you love to
> fasten on worst-possible-paths and then berate the rest of us for telling you that your
> choice is wrong.
From: Peter Olcott on
On 5/17/2010 12:42 PM, Pete Delgado wrote:
> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message
> news:lI2dnTWeE-MF0GzWnZ2dnUVZ_gadnZ2d(a)giganews.com...
>> My source code encoding will be UTF-8. My interpreter is written in Yacc
>> and Lex, and 75% complete. This makes a UTF-8 regular expression
>> mandatory.
>
> Peter,
> You keep mentioning that your interpreter is 75% complete. Forgive my morbid
> curiosity but what exactly does that mean?
>
> -Pete
>
>

The Yacc and Lex are done and working and correctly translate all input
into a corresponding abstract syntax tree. The control flow portion of
the code generator is done and correctly translates control flow
statements into corresponding jump code with minimum branches. The
detailed design for everything else is complete. That remaining work
mostly involves handling all of the data types, including objects, and the
elemental operations on those data types and objects.
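
To give a rough idea of what I mean by jump code with minimum branches,
here is a small illustrative C sketch of lowering a while loop; the
instruction names (eval, jfalse, jump) are placeholders, not my
interpreter's actual opcodes:

#include <stdio.h>

/* Purely illustrative: one way a "while (cond) body" statement can be
 * lowered into jump code with a single conditional branch per iteration
 * plus one unconditional back edge:
 *
 *   L0:   eval   <cond>
 *         jfalse L1
 *         <body>
 *         jump   L0
 *   L1:
 */
static int next_label = 0;

static void gen_while(const char *cond, const char *body)
{
    int l_top  = next_label++;
    int l_exit = next_label++;

    printf("L%d:\n", l_top);
    printf("    eval   %s\n", cond);     /* leaves a boolean result */
    printf("    jfalse L%d\n", l_exit);  /* exit the loop when false */
    printf("    %s\n", body);            /* loop body */
    printf("    jump   L%d\n", l_top);   /* back edge */
    printf("L%d:\n", l_exit);
}

int main(void)
{
    gen_while("i < 10", "i = i + 1");
    return 0;
}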
From: Pete Delgado on

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:a_WdndCS8YLmBGzWnZ2dnUVZ_o6dnZ2d(a)giganews.com...
> On 5/17/2010 12:42 PM, Pete Delgado wrote:
>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message
>> news:lI2dnTWeE-MF0GzWnZ2dnUVZ_gadnZ2d(a)giganews.com...
>>> My source code encoding will be UTF-8. My interpreter is written in Yacc
>>> and Lex, and 75% complete. This makes a UTF-8 regular expression
>>> mandatory.
>>
>> Peter,
>> You keep mentioning that your interpreter is 75% complete. Forgive my
>> morbid
>> curiosity but what exactly does that mean?
>>
>> -Pete
>>
>>
>
> The Yacc and Lex are done and working and correctly translate all input
> into a corresponding abstract syntax tree. The control flow portion of the
> code generator is done and correctly translates control flow statements
> into corresponding jump code with minimum branches. The detailed design
> for everything else is complete. That remaining work mostly involves
> handling all of the data types, including objects, and the elemental
> operations on those data types and objects.

Peter,
Correct me if I'm wrong, but it sounds to me as if you are throwing
something at Yacc and Lex and are getting something you think is
"reasonable" out of them but you haven't been able to test the validity of
the output yet since the remainder of the coding to your spec is yet to be
done. Is that a true statement?


-Pete


From: Peter Olcott on
On 5/17/2010 11:51 AM, Joseph M. Newcomer wrote:
> See below...
> On Mon, 17 May 2010 08:57:58 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>
>>
>> I still MUST have a correct UTF-8 RegEx because my interpreter is 75%
>> completed using Lex and Yacc. Besides this I need a good way to parse
>> UTF-8 to convert it to UTF-32.
> ****
> No, it sucks. For reasons I have pointed out. You can easily write a UTF-32 converter
> just based on the table in the Unicode 5.0 manual!

Lex can ONLY handle bytes. Lex apparently can handle the RegEx that I
posted. I am basically defining every UTF-8 byte sequence above the
ASCII range as a valid Letter that can be used in an Identifier.
[A-Za-z_] can also be used as a Letter.
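
Written out in C instead of as a Lex pattern, the byte classes I am
treating as a Letter look roughly like this (a loose sketch only, not my
actual pattern, and not a full well-formedness check; it does not reject
every ill-formed sequence, e.g. E0 80.. or ED A0..):

#include <stdint.h>
#include <stddef.h>

/* Treat [A-Za-z_] or any plausible non-ASCII UTF-8 sequence as a single
 * "Letter".  Returns the number of bytes consumed, or 0 if the bytes at
 * p do not begin a Letter. */
static size_t letter_length(const uint8_t *p, size_t avail)
{
    if (avail == 0) return 0;
    uint8_t b = p[0];

    /* ASCII letters and underscore */
    if ((b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z') || b == '_')
        return 1;

    /* Non-ASCII: classify the lead byte, then require continuation bytes. */
    size_t len;
    if      ((b & 0xE0) == 0xC0 && b >= 0xC2) len = 2;  /* reject over-long C0/C1 */
    else if ((b & 0xF0) == 0xE0)              len = 3;
    else if ((b & 0xF8) == 0xF0 && b <= 0xF4) len = 4;  /* reject > U+10FFFF leads */
    else return 0;

    if (avail < len) return 0;
    for (size_t i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80) return 0;            /* must be 10xxxxxx */
    return len;
}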

>
> I realized that I have this information on a slide in my course, which is on my laptop, so
> with a little copy-and-paste-and-reformat, here's the table. Note that no massive FSM
> recognition is required to do the conversion, and it is even questionable as to whether an
> FSM is required at all!
>
> All symbols represent bits, and x, y, u, z and w are metasymbols for bits that can be
> either 0 or 1
>
> UTF-32  00000000 00000000 00000000 0xxxxxxx
> UTF-16  00000000 0xxxxxxx
> UTF-8   0xxxxxxx
>
> UTF-32  00000000 00000000 00000yyy yyxxxxxx
> UTF-16  00000yyy yyxxxxxx
> UTF-8   110yyyyy 10xxxxxx
>
> UTF-32  00000000 00000000 zzzzyyyy yyxxxxxx
> UTF-16  zzzzyyyy yyxxxxxx
> UTF-8   1110zzzz 10yyyyyy 10xxxxxx
>
> UTF-32  00000000 000uuuuu zzzzyyyy yyxxxxxx
> UTF-16  110110ww wwzzzzyy 110111yy yyxxxxxx*
> UTF-8   11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
>
> uuuuu = wwww + 1
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

I was aware of the encodings between UTF-8 and UTF-32; the encoding to
UTF-16 looks a little clumsy when we get to four UTF-8 bytes.
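
For my own notes, here is that table written out as a small C sketch (a
plain conversion, not my Lex-based approach, and without full validation),
including the surrogate split, where uuuuu = wwww + 1 amounts to
subtracting 0x10000 before splitting:

#include <stdint.h>
#include <stddef.h>

/* Decode the UTF-8 sequence starting at p; store the code point in *out
 * and return the number of bytes consumed, or 0 on a malformed sequence. */
static size_t utf8_to_utf32(const uint8_t *p, size_t avail, uint32_t *out)
{
    if (avail == 0) return 0;
    uint8_t b = p[0];
    size_t len;
    uint32_t cp;

    if      ((b & 0x80) == 0x00) { len = 1; cp = b;        }
    else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
    else return 0;                      /* continuation or invalid lead byte */

    if (avail < len) return 0;
    for (size_t i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;   /* must be 10xxxxxx */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    /* Note: over-long forms, surrogates, and values above U+10FFFF are not
     * rejected here; a real validator would check them. */
    *out = cp;
    return len;
}

/* Encode a code point as UTF-16: one unit below U+10000, otherwise a
 * surrogate pair.  Returns the number of units written (1 or 2). */
static int utf32_to_utf16(uint32_t cp, uint16_t units[2])
{
    if (cp < 0x10000) {
        units[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* wwww = uuuuu - 1 */
    units[0] = (uint16_t)(0xD800 | (cp >> 10));  /* 110110ww wwzzzzyy */
    units[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));/* 110111yy yyxxxxxx */
    return 2;
}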