Is this Regular Expression for UTF-8 Correct?? [MFC]


From: Peter Olcott on 17 May 2010 10:07

On 5/16/2010 11:42 PM, Joseph M. Newcomer wrote:
> See below...
> On Sat, 15 May 2010 09:12:09 -0500, "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote:
>
>> Joe also said that UTF-8 was designed for data interchange
>> which is how I will be using it. Joe also falsely assumed
>> that I would be using UTF-8 for my internal representation.
>> I will be using UTF-32 for my internal representation.
> ****
> But then, you would not need the UTF-8 regexps! You would only need those if you were
> storing the data in UTF-8. To give an external grammar to your language, you should give
> the UTF-32 regexps, and if necessary, you can TRANSLATE those to UTF-8, but you don't
> start with UTF-8. The lex input would need to be in terms of UTF-32, so you would not be
> using UTF-8 there, either.

My source code encoding will be UTF-8. My interpreter is written in Yacc
and Lex, and 75% complete. This makes a UTF-8 regular expression mandatory.

> ****
>>
>> I will be using UTF-8 as the source code for my language
>> interpreter, which has the advantage of simply being ASCII
>> for the English language, and working across every platform
>> without requiring adaptations such as Little Endian and Big
>> Endian. UTF-8 will also be the output of my OCR4Screen DFA
>> recognizer.
>>
>>>
>>> --
>>> Mihai Nita [Microsoft MVP, Visual C++]
>>> http://www.mihai-nita.net
>>> ------------------------------------------
>>> Replace _year_ with _ to get the real email
>>>
>>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on 17 May 2010 12:24

See below...
On Mon, 17 May 2010 09:04:07 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/16/2010 11:39 PM, Joseph M. Newcomer wrote:
>>>
>>> That is how I intend to use it. To internationalize my GUI
>>> scripting language the interpreter will accept UTF-8 input
>>> as its source code files. It is substantially implemented
>>> using Lex and Yacc specifications for "C" that have been
>>> adapted to implement a subset of C++.
>> *****
>> So why does the question matter? Accepting UTF-8 input makes perfect sense, but the first
>> thing you should do with it is convert it to UTF-16, or better still UTF-32.
>> ****
>>>
>>> It was far easier (and far less error prone) to add the C++
>>> that I needed to the "C" specification than it would have
>>> been to remove what I do not need from the C++
>>> specification.
>> ***
>> Huh? What's this got to do with the encoding?
>(1) Lex requires a RegEx
***
Your regexp should be in terms of UTF32, not UTF8.
****
>
>(2) I still must convert from UTF-8 to UTF-32, and I don't think that a
>faster or simpler way to do this besides a regular expression
>implemented as a finite state machine can possibly exist.
****
Actually, there is; you obviously know nothing about UTF-8, or you would know that the
high-order bits of the first byte tell you the length of the encoding, and the FSM is
written entirely in terms of the actual encoding, and is never written as a regexp.

RTFM.

You are expected to have spent a LITTLE time reading about a subject before asking a
question.
****
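As a rough illustration of the point about the high-order bits of the first byte, here is a minimal C++ sketch. The function name `utf8_sequence_length` is made up for this example and does not come from the thread or from any library; it only classifies the lead byte by its bit pattern and does not validate the rest of the sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Length in bytes (1..4) of the UTF-8 sequence introduced by `lead`, read off
// the high-order bits of that byte. Returns 0 for a byte that cannot start a
// sequence: a continuation byte (10xxxxxx) or the unused pattern 11111xxx.
// Only the bit pattern is inspected; bytes such as 0xC0 or 0xF5 match a
// pattern here but would still be rejected by a full validator.
inline std::size_t utf8_sequence_length(std::uint8_t lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Usage example: lead bytes of "A", U+00E9, U+20AC and U+1F600.
inline void utf8_sequence_length_example() {
    assert(utf8_sequence_length(0x41) == 1);
    assert(utf8_sequence_length(0xC3) == 2);
    assert(utf8_sequence_length(0xE2) == 3);
    assert(utf8_sequence_length(0xF0) == 4);
    assert(utf8_sequence_length(0x80) == 0);  // continuation byte, not a lead
}
```

A caller would typically read one byte, ask for the length, and then consume that many bytes, which is why no table-driven regexp is needed just to find sequence boundaries.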
>
>
>>> The actual language itself will store its strings as 32-bit
>>> codepoints. The SymbolTable will not bother to convert its
>>> strings from UTF-8. It turns out that UTF-8 byte sort order
>>> is identical to Unicode code point sort order.
>> ****
>> Strange. I thought sort order was locale-specific and independent of code points. But
>> then, maybe I just understand what is going on.
>
>The SymbolTable only needs to be able to find its symbols in a std::map.
>Accounting for locale specific sort order is a waste of time in this case.
****
OK, then it is not sort order, and the fact that the byte-encoded sort is in the same
order is irrelevant, so why did you mention it as if it had meaning? std::map doesn't
care about what YOU mean by "sort order", it only requires byte sequences for keys, where
the interpretation of the byte sequence is a function of the data type.
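For readers who want to check the claim under discussion (that byte-wise order of well-formed UTF-8 matches code point order), the following C++ sketch may help. `encode_utf8` and `same_order` are hypothetical helpers written for this example only. The comparison relies on `std::string` comparing `char` elements as unsigned bytes, which is also what a `std::map<std::string, T>` with the default comparator does.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one Unicode scalar value as UTF-8.
// Precondition (assumed, not checked beyond the assert): cp is at most
// U+10FFFF and is not a surrogate.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// True when comparing the two UTF-8 encodings byte by byte gives the same
// answer as comparing the two code points numerically.
bool same_order(char32_t a, char32_t b) {
    return (a < b) == (encode_utf8(a) < encode_utf8(b));
}
```

This only shows that the two orderings agree; it says nothing about locale-aware collation, which is the separate issue raised in the reply above.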

But the text should already be in UTF-32! Why are you wasting time worrying about UTF-8?
joe
****
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on 17 May 2010 12:31

See below...
On Mon, 17 May 2010 08:49:14 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/16/2010 11:28 PM, Joseph M. Newcomer wrote:
>> See below...
>> On Fri, 14 May 2010 13:44:56 -0500, "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote:
>>
>>>
>>> "Pete Delgado"<Peter.Delgado(a)NoSpam.com> wrote in message
>>> news:O8vhKE58KHA.980(a)TK2MSFTNGP04.phx.gbl...
>>>>
>>>>> Most often I am not looking for "input from
>>>>> professionals", I am looking for answers to specific
>>>>> questions.
>>>>
>>>> Which is one reason why your projects consistently fail.
>>>> If you have a few
>>>
>>> None of my projects have ever failed. Some of my projects
>>> inherently take an enormous amount of time to complete.
>> ****
>> Not something you should brag about. Going back to my original comments, you are creating
>> an artificially complex solution to what should be a simple problem, by making bad design
>> choices and then warping reality to support them, when the correct answer is "Don't do it
>> that way". If you simplify the problem, you get to make decisions which can be
>> implemented more readily, thus decreasing the amount of time required to complete them.
>> ****
>>>
>>>> days, take a look at the book "Programming Pearls" by Jon
>>>> Bentley -specifically the first chapter. Sometimes making
>>>> sure you are asking the *right* question is more important
>>>> than getting an answer to a question. You seem to have a
>>>> problem with that particular concept.
>>>
>>> Yes especially on those cases where I have already thought
>>> the problem through completely using categorically
>>> exhaustively complete reasoning.
>> *****
>> There is no such thing in the world we live in. You have made a number of false
>> assumptions (for example, that conversion time is statistically significant relative to
>> other performance issues) and used that set of false assumptions to drive a set of design
>> decisions which make no sense if you take reality into consideration. For example, there
>> is no possible way the UTF-8-UTF-16 conversion could possibly take longer to handle than a
>> single page fault, but you are optimizing it out of existence without realizing that
>> simply loading the program will have orders of magnitude greater variance than this cost.
>> This is because you are working with the assumptions that (a) loading a program takes
>> either zero time or a fixed time each time it is loaded and (b) opening the file you are
>> reading takes either zero time or a fixed time each time it is opened. Sadly, neither of
>> these assumptions is valid, and consequently if you run 100 experiments of loading and
>> executing the program, these two parameters will dominate the total performance by orders
>> of magnitude more than the cost of the conversion! So you are trying to optimize
>> something that is statistically insignificant!
>> ****
>>>
>>> In those rare instances anything at all besides a direct
>>> answer to a direct question can only be a waste of time for
>>> me.
>> *****
>> You want a direct answer: the design to use UTF-8 internally is a Really Stupid Idea!
>> DON'T WASTE YOUR TIME TRYING TO DO IT! That's the DIRECT answer. Everything else is
>> wasting our time trying to tell you in simple words that even you might understand just
>> WHY it is a Really Stupid Idea.
>>
>> There is no point in trying to analyze the regexp because I cannot see why any
>> intelligent programmer would WANT to use such a bad design! Therefore, it was a bad
>> question and does not deserve getting an answer;
>
>Ultimately all UTF-8 validators must be regular expressions implemented
>as finite state machines. I can't imagine a better way.
***
Perhaps. But since you obviously never read the documentation on UTF-8, you did not
realize that it is based entirely on the number of high-order 1-bits in the first byte of
the sequence. This does not require a regexp written in terms of byte ranges to handle
correctly, and nobody writing a UTF-8 validator would consider this approach. The set of
rules you are citing are essentially "if my only tool is a regexp recognizer, how can I
solve a trivial problem using that tool?". It doesn't say the solution is a *good*
solution, only that it is *a* solution in the artificially constrained solution space. If
you remove the constraint (as most practical programmers would) then the complexity
evaporates.
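To make the "no regexp required" argument concrete, here is one possible sketch of a validator driven by the lead byte, in C++. It is not taken from the thread or from the Unicode spec text; the name `utf8_to_utf32` and its interface are assumptions made for this example. It reads the length from the lead byte, checks that each following byte is a continuation byte, and then rejects overlong forms, surrogates, and values above U+10FFFF. As a side effect it collects the decoded code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: validates `in` as UTF-8 and appends the decoded code
// points to `out`. Returns false at the first ill-formed sequence; in that
// case `out` holds whatever was decoded before the error.
bool utf8_to_utf32(const std::string& in, std::u32string& out) {
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len;   // total bytes in this sequence
        char32_t cp;       // payload bits collected so far
        char32_t min_cp;   // smallest value allowed for this length
        if (lead < 0x80)                { len = 1; cp = lead;        min_cp = 0x0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;                         // 10xxxxxx or 11111xxx as lead

        if (in.size() - i < len) return false;     // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                     // overlong encoding
        if (cp > 0x10FFFF) return false;                   // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // surrogate
        out.push_back(cp);
        i += len;
    }
    return true;
}

// Usage example: the euro sign (E2 82 AC) decodes to U+20AC, while the
// overlong form C0 80 is rejected.
inline void utf8_to_utf32_example() {
    std::u32string out;
    assert(utf8_to_utf32("\xE2\x82\xAC", out) && out.size() == 1 && out[0] == 0x20AC);
    std::u32string ignored;
    assert(!utf8_to_utf32("\xC0\x80", ignored));
}
```

The loop is still a small state machine in the informal sense, but it is written directly against the encoding rather than generated from byte-range regular expressions.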

So not only is it *possible* to imagine a better way, the better way is *actually
documented* in the Unicode spec! I'm sorry you have such a limited imagination, but it is
one of your weaknesses. You start plunging down a garden path and never once ask "is this
the right or best path to my goal?" It may not even lead to the goal, but you love to
fasten on worst-possible-paths and then berate the rest of us for telling you that your
choice is wrong.
joe
****
>
> > the correct answer is to do the job
>> right. You have this fixation that if you pose what is clearly a bad design, we experts
>> are supposed to sit back and encourage bad design decisions? That is not what we do.
>>
>> We feel a little bit like Calvin's dad from the old "Calvin and Hobbes" cartoons. Calvin
>> comes over to his father and says "Dad, can I have a chain saw" and his father says "no".
>> Calvin goes away feeling unhappy, and in the last of the four panels says "but now how am
>> I going to learn how to juggle?"
>>
>> If you want to juggle chain saws, we aren't going to answer your questions on how to do
>> it. We will try to advise you that juggling running chain saws is probably a Really
>> Stupid Idea. If you were an experienced knife juggler, and could juggle flaming torches,
>> we might suggest that there are approaches to this, but your idea that you can apply
>> categorical reasoning to the problem of chain-saw juggling when you have clearly
>> demonstrated by your question that you have never once juggled anything, makes us leery of
>> encouraging you to continue this practice.
>>
>> Note that "categorical reasoning" does not turn into a deep understanding of fundamentally
>> stochastic processes. Las Vegas casinos would love you, because you would try to apply
>> this technique to, say, roulette wheels and dice, and guess who wins?
>>
>> Prove, by exhaustive categorical reasoning, that loading a program takes a fixed amount of
>> time. Then I'll credit its power.
>> joe
>> ****
>>>
>>>
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on 17 May 2010 12:51

See below...
On Mon, 17 May 2010 08:57:58 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/16/2010 11:33 PM, Joseph M. Newcomer wrote:
>> See below...
>> On Fri, 14 May 2010 08:27:45 -0500, "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote:
>>
>>>
>>> "Joseph M. Newcomer"<newcomer(a)flounder.com> wrote in
>>> message news:68gpu599cjcsm3rjh1ptc6e9qu977smdph(a)4ax.com...
>>>> No, an extremely verbose "You are going about this
>>>> completely wrong".
>>>> joe
>>>
>>> Which still avoids rather than answers my question. This was
>>> at one time a very effective ruse to hide the fact that you
>>> don't know the answer. I can see through this ruse now, so
>>> there is no sense in my attempting to justify my design
>>> decision to you. That would simply be a waste of time.
>> ****
>> I think I answered part of it. The part that matters. The part that says "this is
>> wrong". I did this by pointing out some counterexamples.
>>
>> I know the answer: Don't Do It That Way. You are asking for a specific answer that will
>> allow you to pursue a Really Bad Design Decision. I'm not going to answer a bad question;
>> I'm going to tell you what the correct solution is. I'm avoiding the question because it
>> is a really bad question, because you should be able to answer it yourself, and because
>> giving an answer simply justifies a poor design. I don't justify poor designs, I try to
>> kill them.
>>
>> Only you could make a bad design decision and feel you have to justify it. Particularly
>> when the experts have already all told you it is a bad design decision, and you should not
>> go that way.
>> joe
>
>If a decision is truly bad, then there must be dysfunctional results
>that make the decision a bad one. If dysfunctional results can not be
>provided, then the statement that it is a bad decision lacks sufficient
>support. My original intention was to use UTF-32 as my internal
>representation. I have not yet decided to alter this original decision.
****
Dysfunctional results:
Horrible costs to do regexp manipulation when none is needed
Added complexity distributed uniformly over the entire implementation
Actually not correct because it ignores
-localized punctuation
-localized numbers
-bidirectional text
Other than it is needlessly complex, horribly inefficient, and wrong, what more do you
need to know?
***
>
>The fact that someone provided an example where UTF-8 strings would
>often substantially vary in length provides the best counter example
>showing that your view is likely correct about internal representation.
****
But is that not obvious at the beginning? You should have realized that!
****
>
>In fact I will simply state that I am now convinced that UTF-32 is the
>best way to go.
>
>I still MUST have a correct UTF-8 RegEx because my interpreter is 75%
>completed using Lex and Yacc. Besides this I need a good way to parse
>UTF-8 to convert it to UTF-32.
****
No, it sucks. For reasons I have pointed out. You can easily write a UTF-8 to UTF-32
converter just based on the table in the Unicode 5.0 manual!

I realized that I have this information on a slide in my course, which is on my laptop, so
with a little copy-and-paste-and-reformat, here's the table. Note that no massive FSM
recognition is required to do the conversion, and it is even questionable as to whether an
FSM is required at all!

All symbols represent bits, and x, y, u, z and w are metasymbols for bits that can be
either 0 or 1

UTF-32  00000000 00000000 00000000 0xxxxxxx
UTF-16  00000000 0xxxxxxx
UTF-8   0xxxxxxx

UTF-32  00000000 00000000 00000yyy yyxxxxxx
UTF-16  00000yyy yyxxxxxx
UTF-8   110yyyyy 10xxxxxx

UTF-32  00000000 00000000 zzzzyyyy yyxxxxxx
UTF-16  zzzzyyyy yyxxxxxx
UTF-8   1110zzzz 10yyyyyy 10xxxxxx

UTF-32  00000000 000uuuuu zzzzyyyy yyxxxxxx
UTF-16  110110ww wwzzzzyy 110111yy yyxxxxxx*
UTF-8   11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

* uuuuu = wwww + 1
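
The fourth row and its footnote are the least obvious part of the table, so here is a small C++ sketch of just that row: splitting a supplementary code point into a UTF-16 surrogate pair and joining it back. The type `SurrogatePair` and the functions `to_surrogates` and `from_surrogates` are names invented for this illustration; the bit-field names in the comments follow the table.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type for this example: one UTF-16 surrogate pair.
struct SurrogatePair {
    std::uint16_t high;  // 110110ww wwzzzzyy
    std::uint16_t low;   // 110111yy yyxxxxxx
};

// UTF-32 -> UTF-16 for a code point in U+10000..U+10FFFF.
SurrogatePair to_surrogates(char32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t uuuuu  = (cp >> 16) & 0x1F;
    const std::uint32_t wwww   = uuuuu - 1;          // footnote: uuuuu = wwww + 1
    const std::uint32_t zzzzyy = (cp >> 10) & 0x3F;  // zzzz plus the top two y bits
    const std::uint32_t low10  = cp & 0x3FF;         // remaining yyyy plus xxxxxx
    SurrogatePair p;
    p.high = static_cast<std::uint16_t>(0xD800 | (wwww << 6) | zzzzyy);
    p.low  = static_cast<std::uint16_t>(0xDC00 | low10);
    return p;
}

// UTF-16 -> UTF-32, the same row read in the other direction.
char32_t from_surrogates(SurrogatePair p) {
    assert(p.high >= 0xD800 && p.high <= 0xDBFF);
    assert(p.low >= 0xDC00 && p.low <= 0xDFFF);
    const std::uint32_t wwww  = (p.high >> 6) & 0xF;
    const std::uint32_t uuuuu = wwww + 1;
    const std::uint32_t mid6  = p.high & 0x3F;   // zzzzyy
    const std::uint32_t low10 = p.low & 0x3FF;   // yyyyxxxxxx
    return static_cast<char32_t>((uuuuu << 16) | (mid6 << 10) | low10);
}
```

For example, U+1F600 has uuuuu = 1, so wwww = 0 and the pair comes out as D83D DE00; feeding that pair to `from_surrogates` gives back U+1F600.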
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Pete Delgado on 17 May 2010 13:42

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:lI2dnTWeE-MF0GzWnZ2dnUVZ_gadnZ2d(a)giganews.com...
> My source code encoding will be UTF-8. My interpreter is written in Yacc
> and Lex, and 75% complete. This makes a UTF-8 regular expression
> mandatory.

Peter,
You keep mentioning that your interpreter is 75% complete. Forgive my morbid
curiosity but what exactly does that mean?

-Pete
