Is this Regular Expression for UTF-8 Correct?? [MFC]


From: Joseph M. Newcomer on 13 May 2010 18:31

See below...
On Thu, 13 May 2010 15:36:24 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Leigh Johnston" <leigh(a)i42.co.uk> wrote in message
>news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d(a)giganews.com...
>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d(a)giganews.com...
>>> Is this Regular Expression for UTF-8 Correct??
>>>
>>> The solution is based on the GREEN portions of the first
>>> chart shown
>>> on this link:
>>> http://www.w3.org/2005/03/23-lex-U
****
Note that in the "green" areas, we find

U0482 Cyrillic thousands sign
U055A Armenian apostrophe
U055C Armenian exclamation mark
U05C3 Hebrew punctuation SOF Pasuq
U060C Arabic comma
U066B Arabic decimal separator
U0700-U0709 Assorted Syriac punctuation marks
U0966-U096F Devanagari digits 0..9
U09E6-U09EF Bengali digits 0..9
U09F2-U09F3 Bengali rupee marks
U0A66-U0A6F Gurmukhi digits 0..9
U0AE6-U0AEF Gujarati digits 0..9
U0B66-U0B6F Oriya digits 0..9
U0BE6-U0BEF Tamil digits 0..9
U0BF0-U0BF2 Tamil indicators for 10, 100, 1000
U0BF3-U0BFA Tamil punctuation marks
U0C66-U0C6F Telugu digits 0..9
U0CE6-U0CEF Kannada digits 0..9
U0D66-U0D6F Malayalam digits 0..9
U0E50-U0E59 Thai digits 0..9
U0ED0-U0ED9 Lao digits 0..9
U0F20-U0F29 Tibetan digits 0..9
U0F2A-U0F33 Miscellaneous Tibetan numeric symbols
U1040-U1049 Myanmar digits 0..9
U1360-U1368 Ethiopic punctuation marks
U1369-U137C Ethiopic numeric values (digits, tens of digits, etc.)
U17E0-U17E9 Khmer digits 0..9
U1800-U180E Mongolian punctuation marks
U1810-U1819 Mongolian digits 0..9
U1946-U194F Limbu digits 0..9
U19D0-U19D9 New Tai Lue digits 0..9
....at which point I realized I was wasting my time, because I was attempting to disprove
what is a Really Dumb Idea, which is to write applications that actually work on UTF-8
encoded text.

You are free to convert these to UTF-8, but in addition, if I've read some of the
encodings correctly, the non-green areas preclude what are clearly "letters" in other
languages.

Forget UTF-8. It is a transport mechanism used at input and output edges. Use Unicode
internally.
****
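
[Editorial sketch, not part of the original post.] One way to read "transport mechanism at the edges" is to decode UTF-8 into code points once, at input, and work on a std::u32string after that. The following is a minimal sketch under stated assumptions: it needs C++17, the name decode_utf8 is hypothetical, and it checks well-formedness after decoding (overlong forms, surrogates, values above U+10FFFF) rather than by per-byte range tables.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not from the thread): decode a whole UTF-8 buffer into
// UTF-32 code points, or return std::nullopt if any sequence is malformed.
std::optional<std::u32string> decode_utf8(std::string_view in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        // The lead byte determines the sequence length and the first payload bits.
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return std::nullopt;  // stray continuation byte or invalid lead byte

        if (in.size() - i < len) return std::nullopt;  // truncated sequence

        // Every following byte must be a continuation byte (10xxxxxx).
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;
            cp = (cp << 6) | (b & 0x3F);
        }

        // Smallest code point that legitimately needs `len` bytes; anything
        // below it is an overlong encoding.
        static constexpr char32_t kMin[5] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < kMin[len]) return std::nullopt;
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;  // surrogates
        if (cp > 0x10FFFF) return std::nullopt;                 // beyond Unicode

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After this step the rest of the program can index and classify code points directly; the UTF-8 bytes are no longer needed until output.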
>>>
>>> A semantically identical regular expression is also found
>>> on the above link under Validating lex Template
>>>
>>> 1 ['\u0000'-'\u007F']
>>> 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>>> 3 | ( '\u00E0' ['\u00A0'-'\u00BF']
>>> ['\u0080'-'\u00BF'])
>>> 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF']
>>> ['\u0080'-'\u00BF'])
>>> 5 | ( '\u00ED' ['\u0080'-'\u009F']
>>> ['\u0080'-'\u00BF'])
>>> 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF']
>>> ['\u0080'-'\u00BF'])
>>> 7 | ( '\u00F0' ['\u0090'-'\u00BF']
>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>> 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF']
>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>> 9 | ( '\u00F4' ['\u0080'-'\u008F']
>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>
>>> Here is my version, the syntax is different, but the UTF8
>>> portion should be semantically identical.
>>>
>>> UTF8_BYTE_ORDER_MARK [\xEF][\xBB][\xBF]
>>>
>>> ASCII [\x0-\x7F]
>>>
>>> U1 [a-zA-Z_]
>>> U2 [\xC2-\xDF][\x80-\xBF]
>>> U3 [\xE0][\xA0-\xBF][\x80-\xBF]
>>> U4 [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
>>> U5 [\xED][\x80-\x9F][\x80-\xBF]
>>> U6 [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
>>> U7 [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
>>> U8 [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
>>> U9 [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
>>>
>>> UTF8 {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>
>>> // This identifies the "Letter" portion of an Identifier.
>>> L {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>
>>> I guess that most of the analysis may simply boil down to
>>> whether or not the original source from the link is
>>> considered reliable. I had forgotten this original source
>>> when I first asked this question, that is why I am
>>> reposting this same question again.
>>
>> What has this got to do with C++? What is your C++
>> language question?
>>
>> /Leigh
>
>I will be implementing a utf8string to supplement
>std::string and will be using a regular expression to
>quickly divide up UTF-8 bytes into Unicode CodePoints.
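
[Editorial sketch, not part of the original post.] The byte ranges U2 through U9 quoted above can also be checked without a regular expression engine. The following is a minimal sketch, assuming C++17; the names utf8_sequence_length and is_well_formed_utf8 are hypothetical. Each branch mirrors one of the quoted definitions, so it accepts exactly what those ranges accept.

```cpp
#include <cstddef>
#include <string_view>

// Returns the length in bytes (1..4) of the well-formed UTF-8 sequence at the
// start of s, or 0 if s is empty or does not start with a well-formed sequence.
std::size_t utf8_sequence_length(std::string_view s) {
    // Byte at index i as 0..255, or -1 when i is past the end of the input.
    const auto byte = [&](std::size_t i) -> int {
        return i < s.size() ? static_cast<unsigned char>(s[i]) : -1;
    };
    // True if the byte at index i exists and lies in [lo, hi].
    const auto between = [&](std::size_t i, int lo, int hi) {
        const int b = byte(i);
        return b >= lo && b <= hi;
    };
    // True if the byte at index i is a continuation byte, [\x80-\xBF].
    const auto tail = [&](std::size_t i) { return between(i, 0x80, 0xBF); };

    if (between(0, 0x00, 0x7F)) return 1;                                        // ASCII
    if (between(0, 0xC2, 0xDF)) return tail(1) ? 2 : 0;                          // U2
    if (between(0, 0xE0, 0xE0)) return between(1, 0xA0, 0xBF) && tail(2) ? 3 : 0; // U3
    if (between(0, 0xE1, 0xEC)) return tail(1) && tail(2) ? 3 : 0;               // U4
    if (between(0, 0xED, 0xED)) return between(1, 0x80, 0x9F) && tail(2) ? 3 : 0; // U5
    if (between(0, 0xEE, 0xEF)) return tail(1) && tail(2) ? 3 : 0;               // U6
    if (between(0, 0xF0, 0xF0))
        return between(1, 0x90, 0xBF) && tail(2) && tail(3) ? 4 : 0;             // U7
    if (between(0, 0xF1, 0xF3)) return tail(1) && tail(2) && tail(3) ? 4 : 0;    // U8
    if (between(0, 0xF4, 0xF4))
        return between(1, 0x80, 0x8F) && tail(2) && tail(3) ? 4 : 0;             // U9
    return 0;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
}

// Walks the buffer one code point at a time; true if every sequence is well formed.
bool is_well_formed_utf8(std::string_view s) {
    while (!s.empty()) {
        const std::size_t n = utf8_sequence_length(s);
        if (n == 0) return false;
        s.remove_prefix(n);
    }
    return true;
}
```

Repeated calls to utf8_sequence_length give the code point boundaries, which is the "divide up UTF-8 bytes" step described above.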
***
For someone who had an unholy fixation on "performance", why would you choose such a slow
mechanism for doing recognition?

I can imagine a lot of alternative approaches, including having a table of 65,536
"character masks" for Unicode characters, including on-the-fly updating of the table, and
extensions to support surrogates, which would outperform any regular expression based
approach.
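
[Editorial sketch, not part of the original post.] The table of "character masks" could look roughly like this. It is only an illustration under stated assumptions: the class name BmpClassTable, the mask bits, and make_sample_table are all hypothetical, and the few ranges filled in are taken from the list earlier in this post plus plain ASCII. A real table would be populated from the Unicode Character Database, and code points above U+FFFF would need a separate extension.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical class bits; one byte of flags per code point.
enum CharClass : std::uint8_t { kLetter = 1, kDigit = 2, kPunct = 4 };

// One mask byte for each of the 65,536 code points U+0000..U+FFFF.
class BmpClassTable {
public:
    BmpClassTable() : masks_(0x10000, 0) {}

    // "On-the-fly updating": OR a class bit into every code point in [first, last].
    // Anything above U+FFFF is ignored by this sketch.
    void set_range(char32_t first, char32_t last, std::uint8_t mask) {
        for (char32_t cp = first; cp <= last && cp <= 0xFFFF; ++cp) {
            masks_[cp] |= mask;
        }
    }

    // Classification is a bounds check, one array load, and one AND.
    bool has(char32_t cp, std::uint8_t mask) const {
        return cp <= 0xFFFF && (masks_[cp] & mask) != 0;
    }

private:
    std::vector<std::uint8_t> masks_;
};

// Sample contents only; far from a complete classification.
BmpClassTable make_sample_table() {
    BmpClassTable t;
    t.set_range(U'a', U'z', kLetter);
    t.set_range(U'A', U'Z', kLetter);
    t.set_range(U'_', U'_', kLetter);
    t.set_range(U'0', U'9', kDigit);
    t.set_range(0x0966, 0x096F, kDigit);  // Devanagari digits 0..9
    t.set_range(0x0E50, 0x0E59, kDigit);  // Thai digits 0..9
    t.set_range(0x060C, 0x060C, kPunct);  // Arabic comma
    return t;
}
```

With such a table, the question "is this a letter?" is answered per code point, so a digit such as U+0966 is never mistaken for a letter merely because its UTF-8 form has three bytes.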

What is your criterion for what constitutes a "letter"? Frankly, I have no interest in
decoding something as bizarre as UTF-8 encodings to see if you covered the foreign
delimiters, numbers, punctuation marks, etc. properly, and it makes no sense to do so. So
there is no way I would waste my time trying to understand an example that should not
exist at all.

Why do you seem to choose the worst possible choice when there is more than one way to do
something? The choices are (a) work in 8-bit ANSI (b) work in UTF-8 (c) work in Unicode.
Of these, the worst possible choice is (b), followed by (a). (c) is clearly the winner.

So why are you using something as bizarre as UTF-8 internally? UTF-8 has ONE role, which
is to write Unicode out in an 8-bit encoding, and read Unicode in an 8-bit encoding. You
do NOT want to write the program in terms of UTF-8!
joe
****
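
[Editorial sketch, not part of the original post.] The output edge is the mirror image of the input edge: code points held internally are turned back into UTF-8 bytes only when they are written out. The following is a minimal sketch with hypothetical names (append_utf8, encode_utf8); it refuses surrogates and values above U+10FFFF, which have no UTF-8 form.

```cpp
#include <string>

// Appends the UTF-8 encoding of cp to out. Returns false, appending nothing,
// if cp is a surrogate or lies above U+10FFFF.
bool append_utf8(std::string& out, char32_t cp) {
    if (cp <= 0x7F) {
        out.push_back(static_cast<char>(cp));
    } else if (cp <= 0x7FF) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate range
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        return false;  // not a Unicode code point
    }
    return true;
}

// Encodes a whole string, appending to out. Stops and returns false at the
// first code point that cannot be encoded; out keeps what was written so far.
bool encode_utf8(const std::u32string& in, std::string& out) {
    for (const char32_t cp : in) {
        if (!append_utf8(out, cp)) return false;
    }
    return true;
}
```

The internal representation here is std::u32string; on Windows the same idea would usually be applied to UTF-16 via the platform conversion routines instead.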
>
>Since there are no UTF-8 groups, or even Unicode groups I
>must post these questions to groups that are at most
>indirectly related to this subject matter.
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on 13 May 2010 18:33

Actually, what it does is give us another opportunity to point out how really bad this
design choice is, and thus Peter can tell us all we are fools for not answering a question
that should never have been asked, not because it is inappropriate for the group, but
because it represents the worst-possible-design decision that could be made.
joe

On Thu, 13 May 2010 22:11:40 +0100, "Leigh Johnston" <leigh(a)i42.co.uk> wrote:

>"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>news:L76dnYQBH-5s-3HWnZ2dnUVZ_sqdnZ2d(a)giganews.com...
>>
>> "Leigh Johnston" <leigh(a)i42.co.uk> wrote in message
>> news:v7CdnY8dPrNy_nHWnZ2dnUVZ8t-dnZ2d(a)giganews.com...
>>>
>>>
>>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>>> news:xMOdnahxZJNX_3HWnZ2dnUVZ_hCdnZ2d(a)giganews.com...
>>>>>
>>>>> What has this got to do with C++? What is your C++ language question?
>>>>>
>>>>> /Leigh
>>>>
>>>> I will be implementing a utf8string to supplement std::string and will
>>>> be using a regular expression to quickly divide up UTF-8 bytes into
>>>> Unicode CodePoints.
>>>>
>>>> Since there are no UTF-8 groups, or even Unicode groups I must post
>>>> these questions to groups that are at most indirectly related to this
>>>> subject matter.
>>>
>>> Wrong: off-topic is off-topic. If I chose to write a Tetris game in C++
>>> it would be inappropriate to ask about the rules of Tetris in this
>>> newsgroup even if there was not a more appropriate newsgroup.
>>>
>>> /Leigh
>>
>> I think that posting to the next most relevant group(s) where a directly
>> relevant group does not exist is right, and thus you are simply wrong.
>
>From this newsgroup's FAQ:
>
>"Only post to comp.lang.c++ if your question is about the C++ language
>itself."
>
>Thus I am simply correct?
>
>/Leigh
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Victor Bazarov on 13 May 2010 18:41

On 5/13/2010 6:04 PM, Peter Olcott wrote:
> "Ian Collins"<ian-news(a)hotmail.com> wrote in message
> news:8539h9F7f1U1(a)mid.individual.net...
>> On 05/14/10 08:06 AM, Peter Olcott wrote:
>>> Is this Regular Expression for UTF-8 Correct??
>>
>> It's a fair bet you are off-topic in all the groups you
>> have cross posted to. Why don't you pick a group for a
>> language with built in UTF8 and regexp support (PHP?) and
>> badger them?
>>
>> --
>> Ian Collins
>
> What does this question have to do with the C++ language?

It does not have to have anything to do with C++. A post on the
topicality of another post is *always on topic*.

> At least my question is indirectly related to C++ by making
> a utf8string for the C++ language from the regular
> expression.

<sarcasm>
I am about to hold a party where I expect my colleagues to show up.
They are all C++ programmers. Would the question on what to feed them,
or whether 1970s pop music is going to be appropriate, be on topic in
comp.lang.c++? It's *indirectly related* to C++, isn't it?
</sarcasm>

> Your question is not even indirectly related to the C++
> language.

See above.

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask

From: Sam on 13 May 2010 19:12

Victor Bazarov writes:

> On 5/13/2010 6:04 PM, Peter Olcott wrote:
>> "Ian Collins"<ian-news(a)hotmail.com> wrote in message
>> news:8539h9F7f1U1(a)mid.individual.net...
>>> On 05/14/10 08:06 AM, Peter Olcott wrote:
>>>> Is this Regular Expression for UTF-8 Correct??
>>>
>>> It's a fair bet you are off-topic in all the groups you
>>> have cross posted to. Why don't you pick a group for a
>>> language with built in UTF8 and regexp support (PHP?) and
>>> badger them?
>>>
>>> --
>>> Ian Collins
>>
>> What does this question have to do with the C++ language?
>
> It does not have to have anything to do with C++. A post on the
> topicality of another post is *always on topic*.
>
>> At least my question is indirectly related to C++ by making
>> a utf8string for the C++ language from the regular
>> expression.
>
> <sarcasm>
> I am about to hold a party where I expect my colleagues to show up.
> They are all C++ programmers. Would the question on what to feed them,
> or whether 1970s pop music is going to be appropriate, be on topic in
> comp.lang.c++? It's *indirectly related* to C++, isn't it?
> </sarcasm>
>
>> Your question is not even indirectly related to the C++
>> language.
>
> See above.

This guy is a tool. He re-posted this question a second time because when he
first posted that snippet nobody cared either. But after watching the
struggle in the original thread, the ugly carnage appealed to the
infinitesimally small humanitarian aspect of my psyche sufficiently enough
to motivate myself into actually looking at the regexp monstrosity. But
after I explained why that spaghetti of a regexp does not jibe with RFC
2279, he got all huffy about it. He was confident that I was wrong, and that
the regular expression was right. But I was able to explain my reasoning, by
referencing directly to the contents of RFC 2279, and he was unable to
explain why he thought I was wrong, instead sprinkling more URLs to some
apparently orphaned web pages that said something else.

Which raised an obvious question: if he was so sure that his regular
expressions were correct, why was he asking? What exactly is the part of RFC
2279 that he didn't understand?

It seems to be his personality trait: when he asks a question, he thinks he
knows what the answer is, and every other answer is wrong. I can't figure
out what the real reason for asking the question must be, but I think I
really don't want to know the answer.

It remains to be seen how long it will take him to figure out that the
difficulty he has in getting someone to answer this might be, just might be,
due to the simple fact that this is one of these things that can be answered
simply by RTFMing. Really, UTF-8 is not some patented trade secret. Its
specifications are openly available, to anyone who wants to read them. And
anyone who reads them should be able to figure out the correct regexp for
themselves. It's not rocket science.

Amusingly, he's been trying to find the answer to this question longer than
it took myself, originally, to read RFC 2279, and implement encoding and
decoding of Unicode using UTF-8. In C++. Well, in C actually, but it's still
technically valid C++. Which, I guess, makes this on-topic, under the new
rules that just came down, by fiat.

From: Peter Olcott on 13 May 2010 19:14

Ah so in other words an extremely verbose, "I don't know".
Let me take a different approach. Can postings on www.w3.org
generally be relied upon?

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
message news:gprou55bvl3rgp2qmp6v3euk20ucf865mi(a)4ax.com...
> See below...
> On Thu, 13 May 2010 15:36:24 -0500, "Peter Olcott"
> <NoSpam(a)OCR4Screen.com> wrote:
>
>>
>>"Leigh Johnston" <leigh(a)i42.co.uk> wrote in message
>>news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d(a)giganews.com...
>>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>>> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d(a)giganews.com...
>>>> Is this Regular Expression for UTF-8 Correct??
>>>>
>>>> The solution is based on the GREEN portions of the
>>>> first
>>>> chart shown
>>>> on this link:
>>>> http://www.w3.org/2005/03/23-lex-U
> ****
> Note that in the "green" areas, we find
>
> U0482 Cyrillic thousands sign
> U055A Armenian apostrophe
> U055C Armenian exclamation mark
> U05C3 Hebrew punctuation SOF Pasuq
> U060C Arabic comma
> U066B Arabic decimal separator
> U0700-U0709 Assorted Syriac punctuation marks
> U0966-U096F Devanagari digits 0..9
> U09E6-U09EF Bengali digits 0..9
> U09F2-U09F3 Bengali rupee marks
> U0A66-U0A6F Gurmukhi digits 0..9
> U0AE6-U0AEF Gujarati digits 0..9
> U0B66-U0B6F Oriya digits 0..9
> U0BE6-U0BEF Tamil digits 0..9
> U0BF0-U0BF2 Tamil indicators for 10, 100, 1000
> U0BF3-U0BFA Tamil punctuation marks
> U0C66-U0C6F Telugu digits 0..9
> U0CE6-U0CEF Kannada digits 0..9
> U0D66-U0D6F Malayalam digits 0..9
> U0E50-U0E59 Thai digits 0..9
> U0ED0-U0ED9 Lao digits 0..9
> U0F20-U0F29 Tibetan digits 0..9
> U0F2A-U0F33 Miscellaneous Tibetan numeric symbols
> U1040-U1049 Myanmar digits 0..9
> U1360-U1368 Ethiopic punctuation marks
> U1369-U137C Ethiopic numeric values (digits, tens of
> digits, etc.)
> U17E0-U17E9 Khmer digits 0..9
> U1800-U180E Mongolian punctuation marks
> U1810-U1819 Mongolian digits 0..9
> U1946-U194F Limbu digits 0..9
> U19D0-U19D9 New Tai Lue digits 0..9
> ...at which point I realized I was wasting my time,
> because I was attempting to disprove
> what is a Really Dumb Idea, which is to write applications
> that actually work on UTF-8
> encoded text.
>
> You are free to convert these to UTF-8, but in addition,
> if I've read some of the
> encodings correctly, the non-green areas preclude what are
> clearly "letters" in other
> languages.
>
> Forget UTF-8. It is a transport mechanism used at input
> and output edges. Use Unicode
> internally.
> ****
>>>>
>>>> A semantically identical regular expression is also
>>>> found
>>>> on the above link under Validating lex Template
>>>>
>>>> 1 ['\u0000'-'\u007F']
>>>> 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>>>> 3 | ( '\u00E0' ['\u00A0'-'\u00BF']
>>>> ['\u0080'-'\u00BF'])
>>>> 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF']
>>>> ['\u0080'-'\u00BF'])
>>>> 5 | ( '\u00ED' ['\u0080'-'\u009F']
>>>> ['\u0080'-'\u00BF'])
>>>> 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF']
>>>> ['\u0080'-'\u00BF'])
>>>> 7 | ( '\u00F0' ['\u0090'-'\u00BF']
>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>> 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF']
>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>> 9 | ( '\u00F4' ['\u0080'-'\u008F']
>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>
>>>> Here is my version, the syntax is different, but the
>>>> UTF8
>>>> portion should be semantically identical.
>>>>
>>>> UTF8_BYTE_ORDER_MARK [\xEF][\xBB][\xBF]
>>>>
>>>> ASCII [\x0-\x7F]
>>>>
>>>> U1 [a-zA-Z_]
>>>> U2 [\xC2-\xDF][\x80-\xBF]
>>>> U3 [\xE0][\xA0-\xBF][\x80-\xBF]
>>>> U4 [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
>>>> U5 [\xED][\x80-\x9F][\x80-\xBF]
>>>> U6 [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
>>>> U7 [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
>>>> U8
>>>> [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
>>>> U9 [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
>>>>
>>>> UTF8 {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>
>>>> // This identifies the "Letter" portion of an
>>>> Identifier.
>>>> L {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>
>>>> I guess that most of the analysis may simply boil down
>>>> to
>>>> whether or not the original source from the link is
>>>> considered reliable. I had forgotten this original
>>>> source
>>>> when I first asked this question, that is why I am
>>>> reposting this same question again.
>>>
>>> What has this got to do with C++? What is your C++
>>> language question?
>>>
>>> /Leigh
>>
>>I will be implementing a utf8string to supplement
>>std::string and will be using a regular expression to
>>quickly divide up UTF-8 bytes into Unicode CodePoints.
> ***
> For someone who had an unholy fixation on "performance",
> why would you choose such a slow
> mechanism for doing recognition?
>
> I can imagine a lot of alternative approaches, including
> having a table of 65,536
> "character masks" for Unicode characters, including
> on-the-fly updating of the table, and
> extensions to support surrogates, which would outperform
> any regular expression based
> approach.
>
> What is your criterion for what constitutes a "letter"?
> Frankly, I have no interest in
> decoding something as bizarre as UTF-8 encodings to see if
> you covered the foreign
> delimiters, numbers, punctuation marks, etc. properly, and
> it makes no sense to do so. So
> there is no way I would waste my time trying to understand
> an example that should not
> exist at all.
>
> Why do you seem to choose the worst possible choice when
> there is more than one way to do
> something? The choices are (a) work in 8-bit ANSI (b)
> work in UTF-8 (c) work in Unicode.
> Of these, the worst possible choice is (b), followed by
> (a). (c) is clearly the winner.
>
> So why are you using something as bizarre as UTF-8
> internally? UTF-8 has ONE role, which
> is to write Unicode out in an 8-bit encoding, and read
> Unicode in an 8-bit encoding. You
> do NOT want to write the program in terms of UTF-8!
> joe
> ****
>>
>>Since there are no UTF-8 groups, or even Unicode groups I
>>must post these questions to groups that are at most
>>indirectly related to this subject matter.
>>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm
