From: Joseph M. Newcomer on
Yes, you have to accept that there are questions that should not be asked.

There are two right now: (a) how to write a parser that works on UTF-8 input, and (b) how
to disable mouse clicks during a lengthy computation. The correct answer to both questions
is "don't even try to do it that way! Redesign it so these are no longer problems".
Alternatively, think of it as "do not ask how to solve problems which are the direct
result of incorrect design choices; change the design so the problem no longer exists,
then the question does not need to be asked".

Using UTF-8 is a particularly poor design choice. Doing long computations in the main GUI
thread is a particularly poor design choice. The questions would not arise if the poor
design choices had not been made. This is reality. The questions should not be asked,
because they indicate that poor choices have been made which make the questions necessary.

Answering the question as asked is not a service to the person asking it; it lets a bad
design decision stand, which will in turn lead to more problems, which produce more
questions. By doing the redesign and getting rid of the bad decisions, the problem goes
away and cannot return in the foreseeable future. The problem of using an MBCS like UTF-8
is not
limited to something like an r.e.; the problems of handling the character set are
pervasive and very complex, and will continue to plague the implementation. The problems
of doing long computations in the main GUI thread merely introduce more and more failure
modes that will have to be kludged around, so the correct answer is "redesign it". Poor
design does not go away by making one patch. The patches eventually grow scar tissue, as
one kludge leads to another which leads to another, and so on.

So get rid of the UTF-8 internally, write it for Unicode, and then the problem goes away.
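
As a minimal sketch of what "UTF-8 only at the edges" could look like (the function
name utf8_to_utf32 is hypothetical, not something from this thread), the input side
can decode the bytes into 32-bit code points once and reject anything malformed. The
lead-byte and continuation-byte ranges used here are the standard well-formed UTF-8
ranges, the same ones the regular expression quoted below tries to capture:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (an illustration, not code from the thread):
// decode well-formed UTF-8 into UTF-32 code points at the input edge.
// Throws std::runtime_error on malformed input.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;     // continuation bytes expected
        unsigned char lo = 0x80;   // allowed range of the FIRST continuation byte
        unsigned char hi = 0xBF;

        if (b0 <= 0x7F) {
            cp = b0;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            cp = b0 & 0x1F; extra = 1;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            cp = b0 & 0x0F; extra = 2;
            if (b0 == 0xE0) lo = 0xA0;   // rejects overlong encodings
            if (b0 == 0xED) hi = 0x9F;   // rejects surrogates D800..DFFF
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            cp = b0 & 0x07; extra = 3;
            if (b0 == 0xF0) lo = 0x90;   // rejects overlong encodings
            if (b0 == 0xF4) hi = 0x8F;   // rejects values above 10FFFF
        } else {
            throw std::runtime_error("invalid UTF-8 lead byte");
        }

        if (extra > n - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if (b < lo || b > hi)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
            lo = 0x80;                   // later continuation bytes use the
            hi = 0xBF;                   // full 80..BF range
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Everything after this call works on one code point per element; the parser never has
to know that the file on disk was UTF-8.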
joe

On Thu, 13 May 2010 16:59:12 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Leigh Johnston" <leigh(a)i42.co.uk> wrote in message
>news:TNydnVD6sciN9nHWnZ2dnUVZ7oWdnZ2d(a)giganews.com...
>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>> news:L76dnYQBH-5s-3HWnZ2dnUVZ_sqdnZ2d(a)giganews.com...
>>>
>>> "Leigh Johnston" <leigh(a)i42.co.uk> wrote in message
>>> news:v7CdnY8dPrNy_nHWnZ2dnUVZ8t-dnZ2d(a)giganews.com...
>>>>
>>>>
>>>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>>>> news:xMOdnahxZJNX_3HWnZ2dnUVZ_hCdnZ2d(a)giganews.com...
>>>>>>
>>>>>> What has this got to do with C++? What is your C++
>>>>>> language question?
>>>>>>
>>>>>> /Leigh
>>>>>
>>>>> I will be implementing a utf8string to supplement
>>>>> std::string and will be using a regular expression to
>>>>> quickly divide up UTF-8 bytes into Unicode CodePoints.
>>>>>
>>>>> Since there are no UTF-8 groups, or even Unicode groups
>>>>> I must post these questions to groups that are at most
>>>>> indirectly related to this subject matter.
>>>>
>>>> Wrong: off-topic is off-topic. If I chose to write a
>>>> Tetris game in C++ it would be inappropriate to ask
>>>> about the rules of Tetris in this newsgroup even if
>>>> there was not a more appropriate newsgroup.
>>>>
>>>> /Leigh
>>>
>>> I think that posting to the next most relevant group(s)
>>> where a directly relevant group does not exist is right,
>>> and thus you are simply wrong.
>>
>> From this newsgroup's FAQ:
>>
>> "Only post to comp.lang.c++ if your question is about the
>> C++ language itself."
>>
>> Thus I am simply correct?
>>
>> /Leigh
>
>I can not accept that the "correct" answer is that some
>questions can not be asked.
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
No, an extremely verbose "You are going about this completely wrong".
joe

On Thu, 13 May 2010 18:14:47 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>Ah so in other words an extremely verbose, "I don't know".
>Let me take a different approach. Can postings on www.w3.org
>generally be relied upon?
>
>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>message news:gprou55bvl3rgp2qmp6v3euk20ucf865mi(a)4ax.com...
>> See below...
>> On Thu, 13 May 2010 15:36:24 -0500, "Peter Olcott"
>> <NoSpam(a)OCR4Screen.com> wrote:
>>
>>>
>>>"Leigh Johnston" <leigh(a)i42.co.uk> wrote in message
>>>news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d(a)giganews.com...
>>>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>>>> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d(a)giganews.com...
>>>>> Is this Regular Expression for UTF-8 Correct??
>>>>>
>>>>> The solution is based on the GREEN portions of the first
>>>>> chart shown on this link:
>>>>> http://www.w3.org/2005/03/23-lex-U
>> ****
>> Note that in the "green" areas, we find
>>
>> U0482 Cyrillic thousands sign
>> U055A Armenian apostrophe
>> U055C Armenian exclamation mark
>> U05C3 Hebrew punctuation SOF Pasuq
>> U060C Arabic comma
>> U066B Arabic decimal separator
>> U0700-U0709 Assorted Syriac punctuation marks
>> U0966-U096F Devanagari digits 0..9
>> U09E6-U09EF Bengali digits 0..9
>> U09F2-U09F3 Bengali rupee marks
>> U0A66-U0A6F Gurmukhi digits 0..9
>> U0AE6-U0AEF Gujarati digits 0..9
>> U0B66-U0B6F Oriya digits 0..9
>> U0BE6-U0BEF Tamil digits 0..9
>> U0BF0-U0BF2 Tamil indicators for 10, 100, 1000
>> U0BF3-U0BFA Tamil punctuation marks
>> U0C66-U0C6F Telugu digits 0..9
>> U0CE6-U0CEF Kannada digits 0..9
>> U0D66-U0D6F Malayalam digits 0..9
>> U0E50-U0E59 Thai digits 0..9
>> U0ED0-U0ED9 Lao digits 0..9
>> U0F20-U0F29 Tibetan digits 0..9
>> U0F2A-U0F33 Miscellaneous Tibetan numeric symbols
>> U1040-U1049 Myanmar digits 0..9
>> U1360-U1368 Ethiopic punctuation marks
>> U1369-U137C Ethiopic numeric values (digits, tens of digits, etc.)
>> U17E0-U17E9 Khmer digits 0..9
>> U1800-U180E Mongolian punctuation marks
>> U1810-U1819 Mongolian digits 0..9
>> U1946-U194F Limbu digits 0..9
>> U19D0-U19D9 New Tai Lue digits 0..9
>> ...at which point I realized I was wasting my time, because I was
>> attempting to disprove what is a Really Dumb Idea, which is to write
>> applications that actually work on UTF-8 encoded text.
>>
>> You are free to convert these to UTF-8, but in addition, if I've read
>> some of the encodings correctly, the non-green areas preclude what are
>> clearly "letters" in other languages.
>>
>> Forget UTF-8. It is a transport mechanism used at input and output
>> edges. Use Unicode internally.
>> ****
>>>>>
>>>>> A semantically identical regular expression is also found
>>>>> on the above link under Validating lex Template
>>>>>
>>>>>   ['\u0000'-'\u007F']
>>>>> | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>>>>> | ( '\u00E0' ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>> | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>> | ( '\u00ED' ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
>>>>> | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>> | ( '\u00F0' ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>> | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>> | ( '\u00F4' ['\u0080'-'\u008F'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>
>>>>> Here is my version, the syntax is different, but the UTF8
>>>>> portion should be semantically identical.
>>>>>
>>>>> UTF8_BYTE_ORDER_MARK [\xEF][\xBB][\xBF]
>>>>>
>>>>> ASCII [\x0-\x7F]
>>>>>
>>>>> U1 [a-zA-Z_]
>>>>> U2 [\xC2-\xDF][\x80-\xBF]
>>>>> U3 [\xE0][\xA0-\xBF][\x80-\xBF]
>>>>> U4 [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
>>>>> U5 [\xED][\x80-\x9F][\x80-\xBF]
>>>>> U6 [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
>>>>> U7 [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
>>>>> U8 [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
>>>>> U9 [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
>>>>>
>>>>> UTF8 {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>>
>>>>> // This identifies the "Letter" portion of an Identifier.
>>>>> L {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>>
>>>>> I guess that most of the analysis may simply boil down to
>>>>> whether or not the original source from the link is
>>>>> considered reliable. I had forgotten this original source
>>>>> when I first asked this question, that is why I am
>>>>> reposting this same question again.
>>>>
>>>> What has this got to do with C++? What is your C++
>>>> language question?
>>>>
>>>> /Leigh
>>>
>>>I will be implementing a utf8string to supplement
>>>std::string and will be using a regular expression to
>>>quickly divide up UTF-8 bytes into Unicode CodePoints.
>> ***
>> For someone who had an unholy fixation on "performance", why would you
>> choose such a slow mechanism for doing recognition?
>>
>> I can imagine a lot of alternative approaches, including having a table
>> of 65,536 "character masks" for Unicode characters, including on-the-fly
>> updating of the table, and extensions to support surrogates, which would
>> outperform any regular expression based approach.
>>
>> What is your criterion for what constitutes a "letter"? Frankly, I have
>> no interest in decoding something as bizarre as UTF-8 encodings to see
>> if you covered the foreign delimiters, numbers, punctuation marks, etc.
>> properly, and it makes no sense to do so. So there is no way I would
>> waste my time trying to understand an example that should not exist at
>> all.
>>
>> Why do you seem to choose the worst possible choice when there is more
>> than one way to do something? The choices are (a) work in 8-bit ANSI
>> (b) work in UTF-8 (c) work in Unicode. Of these, the worst possible
>> choice is (b), followed by (a). (c) is clearly the winner.
>>
>> So why are you using something as bizarre as UTF-8 internally? UTF-8 has
>> ONE role, which is to write Unicode out in an 8-bit encoding, and read
>> Unicode in an 8-bit encoding. You do NOT want to write the program in
>> terms of UTF-8!
>> joe
>> ****
>>>
>>>Since there are no UTF-8 groups, or even Unicode groups I
>>>must post these questions to groups that are at most
>>>indirectly related to this subject matter.
>>>
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Mihai N. on

> I can imagine a lot of alternative approaches, including having a table of
> 65,536 "character masks" for Unicode characters

As we know, 65,536 entries (0000 to FFFF) are not enough; Unicode code points go
up to 10FFFF :-)
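
To make the arithmetic concrete, here is a small sketch (append_utf16 is a
hypothetical name, used only for illustration) of how a code point above FFFF is
represented in UTF-16. It does not fit in one 16-bit unit, so it is split into a
surrogate pair, which is why a table indexed by a single 16-bit value cannot
describe it directly:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (illustration only): append one Unicode code point
// to a UTF-16 string, using a surrogate pair for anything above U+FFFF.
void append_utf16(std::u16string& out, char32_t cp)
{
    // Precondition: cp is a Unicode scalar value (not a surrogate, <= 10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp <= 0xFFFF) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;   // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
}
```

For example, U+1F600 becomes the two units D83D DE00.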



> What is your criterion for what constitutes a "letter"?

The best way to attack the identification is by using Unicode properties.
Each code point has attributes indicating whether it is a letter
(General Category).

A good starting point is this:
http://unicode.org/reports/tr31/tr31-1.html

But this only shows that basing that on some UTF-8 kind of thing is not
the way. And how are you going to deal with combining characters?
Normalization?

There are very good reasons why the rule of thumb is:
- UTF-16 or UTF-32 for processing
- UTF-8 for storage/exchange
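
A rough sketch of the "storage/exchange" half of that rule, assuming the program
keeps its text as UTF-32 internally (utf32_to_utf8 is a hypothetical name, not an
existing library function): the encoding to UTF-8 happens once, at the output edge.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (illustration only): encode UTF-32 code points as
// UTF-8 bytes when writing text out. Throws std::invalid_argument for
// surrogates and for values above U+10FFFF.
std::string utf32_to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");

        if (cp <= 0x7F) {                      // 1 byte:  0xxxxxxx
            out.push_back(static_cast<char>(cp));
        } else if (cp <= 0x7FF) {              // 2 bytes: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp <= 0xFFFF) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Property lookups, normalization and combining-character handling would all sit
between the input and output edges and operate on code points, not on bytes.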


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Mihai N. on
> Can postings on www.w3.org generally be relied upon?

For official documents, in general yes.
Unless it is some private post that says something like:
"It is not endorsed by the W3C members, team, or any working group."
(see http://www.w3.org/2005/03/23-lex-U)

It also does not mean that a solution that is enough to do some basic
UTF-8 validation for HTML is the right tool for writing a compiler.


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Jasen Betts on
On 2010-05-13, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:
>
> "Ian Collins" <ian-news(a)hotmail.com> wrote in message
> news:8539h9F7f1U1(a)mid.individual.net...
>> On 05/14/10 08:06 AM, Peter Olcott wrote:
>>> Is this Regular Expression for UTF-8 Correct??
>>
>> It's a fair bet you are off-topic in all the groups you
>> have cross posted to. Why don't you pick a group for a
>> language with built in UTF8 and regexp support (PHP?) and
>> badger them?
>>
>> --
>> Ian Collins
>
> What does this question have to do with the C++ language?
>
> At least my question is indirectly related to C++ by making
> a utf8string for the C++ language from the regular
> expression.

Just use iconv.

And don't cross-post off-topic.
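
For reference, a minimal sketch of what "just use iconv" might look like from C++.
The wrapper name utf8_to_utf32_iconv is made up for this example. It assumes a
POSIX system whose iconv accepts the encoding names "UTF-8" and "UTF-32LE" (glibc
and GNU libiconv do), and the glibc prototype where the input buffer is a
non-const char**:

```cpp
#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical wrapper (illustration only) around POSIX iconv:
// convert UTF-8 bytes to UTF-32 code points. Throws std::runtime_error if
// the conversion cannot be opened or the input is not valid UTF-8.
std::u32string utf8_to_utf32_iconv(const std::string& in)
{
    if (in.empty()) return std::u32string();

    // "UTF-32LE" fixes the byte order and avoids a byte order mark in the output.
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");
    if (cd == (iconv_t)(-1))
        throw std::runtime_error("iconv_open failed");

    std::vector<char> src(in.begin(), in.end());  // iconv wants a mutable buffer
    std::vector<char> dst(in.size() * 4);         // at most one code point per input byte

    char* inptr = src.data();
    std::size_t inleft = src.size();
    char* outptr = dst.data();
    std::size_t outleft = dst.size();

    const std::size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);
    if (rc == (std::size_t)(-1))
        throw std::runtime_error("iconv failed: invalid or incomplete UTF-8");

    // Reassemble the little-endian output bytes into code points.
    std::u32string out;
    const std::size_t produced = dst.size() - outleft;
    for (std::size_t i = 0; i + 3 < produced; i += 4) {
        const char32_t cp =
            static_cast<char32_t>(static_cast<unsigned char>(dst[i])) |
            (static_cast<char32_t>(static_cast<unsigned char>(dst[i + 1])) << 8) |
            (static_cast<char32_t>(static_cast<unsigned char>(dst[i + 2])) << 16) |
            (static_cast<char32_t>(static_cast<unsigned char>(dst[i + 3])) << 24);
        out.push_back(cp);
    }
    return out;
}
```

On platforms where iconv lives in a separate library, linking with -liconv may be
needed.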

