From: Joseph M. Newcomer on
See below...
On Fri, 14 May 2010 13:44:56 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Pete Delgado" <Peter.Delgado(a)NoSpam.com> wrote in message
>news:O8vhKE58KHA.980(a)TK2MSFTNGP04.phx.gbl...
>>
>>> Most often I am not looking for "input from
>>> professionals", I am looking for answers to specific
>>> questions.
>>
>> Which is one reason why your projects consistently fail.
>> If you have a few
>
>None of my projects have ever failed. Some of my projects
>inherently take an enormous amount of time to complete.
****
Not something you should brag about. Going back to my original comments, you are creating
an artificially complex solution to what should be a simple problem, by making bad design
choices and then warping reality to support them, when the correct answer is "Don't do it
that way". If you simplify the problem, you get to make decisions which can be
implemented more readily, thus decreasing the amount of time required to complete them.
****
>
>> days, take a look at the book "Programming Pearls" by Jon
>> Bentley -specifically the first chapter. Sometimes making
>> sure you are asking the *right* question is more important
>> than getting an answer to a question. You seem to have a
>> problem with that particular concept.
>
>Yes especially on those cases where I have already thought
>the problem through completely using categorically
>exhaustively complete reasoning.
*****
There is no such thing in the world we live in. You have made a number of false
assumptions (for example, that conversion time is statistically significant relative to
other performance issues) and used that set of false assumptions to drive a set of design
decisions which make no sense if you take reality into consideration. For example, there
is no possible way the UTF-8-to-UTF-16 conversion could take longer to handle than a
single page fault, but you are optimizing it out of existence without realizing that
simply loading the program will have orders of magnitude greater variance than this cost.
This is because you are working with the assumptions that (a) loading a program takes
either zero time or a fixed time each time it is loaded and (b) opening the file you are
reading takes either zero time or a fixed time each time it is opened. Sadly, neither of
these assumptions is valid, and consequently if you run 100 experiments of loading and
executing the program, these two parameters will dominate the total performance by orders
of magnitude more than the cost of the conversion! So you are trying to optimize
something that is statistically insignificant!
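
A minimal sketch of the kind of measurement that settles this (the buffer size and
workload here are illustrative, not taken from any actual test):

    #include <windows.h>
    #include <chrono>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main()
    {
        std::string utf8(1 << 20, 'x');   // 1 MB of ASCII input (illustrative)

        auto t0 = std::chrono::steady_clock::now();
        // Standard two-call pattern: ask for the size, then convert.
        int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
        std::vector<wchar_t> utf16(n);
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &utf16[0], n);
        auto t1 = std::chrono::steady_clock::now();

        // One-shot conversion cost; compare against the run-to-run variance
        // of merely launching the process and opening the input file.
        printf("conversion: %lld us\n", (long long)
            std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
        return 0;
    }

Run that whole program a hundred times and the startup jitter alone swamps the printed
number, which is the point.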
****
>
>In those rare instances anything at all besides a direct
>answer to a direct question can only be a waste of time for
>me.
*****
You want a direct answer: the design to use UTF-8 internally is a Really Stupid Idea!
DON'T WASTE YOUR TIME TRYING TO DO IT! That's the DIRECT answer. Everything else is
wasting our time trying to tell you in simple words that even you might understand just
WHY it is a Really Stupid Idea.

There is no point in trying to analyze the regexp because I cannot imagine why any
intelligent programmer would WANT to use such a bad design! Therefore, it was a bad
question and does not deserve an answer; the correct answer is to do the job
right. You seem to have a fixation that if you pose what is clearly a bad design, we experts
are supposed to sit back and encourage it. That is not what we do.

We feel a little bit like Calvin's dad from the old "Calvin and Hobbes" cartoons. Calvin
comes over to his father and says "Dad, can I have a chain saw" and his father says "no".
Calvin goes away feeling unhappy, and in the last of the four panels says "but now how am
I going to learn how to juggle?"

If you want to juggle chain saws, we aren't going to answer your questions on how to do
it. We will try to advise you that juggling running chain saws is probably a Really
Stupid Idea. If you were an experienced knife juggler who could juggle flaming torches,
we might suggest that there are approaches to this; but your idea that you can apply
categorical reasoning to the problem of chain-saw juggling, when your question clearly
demonstrates that you have never once juggled anything, makes us leery of encouraging
you to continue this practice.

Note that "categorical reasoning" does not turn into a deep understanding of fundamentally
stochastic processes. Las Vegas casinos would love you, because you would try to apply
this technique to, say, roulette wheels and dice, and guess who wins?

Prove, by exhaustive categorical reasoning, that loading a program takes a fixed amount of
time. Then I'll credit its power.
joe
****
>
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
See below...
On Fri, 14 May 2010 08:27:45 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>message news:68gpu599cjcsm3rjh1ptc6e9qu977smdph(a)4ax.com...
>> No, an extremely verbose "You are going about this
>> completely wrong".
>> joe
>
>Which still avoids rather than answers my question. This was
>at one time a very effective ruse to hide the fact that you
>don't know the answer. I can see through this ruse now, so
>there is no sense in my attempting to justify my design
>decision to you. That would simply be a waste of time.
****
I think I answered part of it. The part that matters. The part that says "this is
wrong". I did this by pointing out some counterexamples.

I know the answer: Don't Do It That Way. You are asking for a specific answer that will
allow you to pursue a Really Bad Design Decision. I'm not going to answer a bad question;
I'm going to tell you what the correct solution is. I'm avoiding the question because it
is a really bad question, because you should be able to answer it yourself, and because
giving an answer simply justifies a poor design. I don't justify poor designs, I try to
kill them.

Only you could make a bad design decision and feel you have to justify it. Particularly
when the experts have already all told you it is a bad design decision, and you should not
go that way.
joe

>
>>
>> On Thu, 13 May 2010 18:14:47 -0500, "Peter Olcott"
>> <NoSpam(a)OCR4Screen.com> wrote:
>>
>>>Ah so in other words an extremely verbose, "I don't know".
>>>Let me take a different approach. Can postings on
>>>www.w3.org
>>>generally be relied upon?
>>>
>>>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>>>message news:gprou55bvl3rgp2qmp6v3euk20ucf865mi(a)4ax.com...
>>>> See below...
>>>> On Thu, 13 May 2010 15:36:24 -0500, "Peter Olcott"
>>>> <NoSpam(a)OCR4Screen.com> wrote:
>>>>
>>>>>
>>>>>"Leigh Johnston" <leigh(a)i42.co.uk> wrote in message
>>>>>news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d(a)giganews.com...
>>>>>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in
>>>>>> message
>>>>>> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d(a)giganews.com...
>>>>>>> Is this Regular Expression for UTF-8 Correct??
>>>>>>>
>>>>>>> The solution is based on the GREEN portions of the
>>>>>>> first
>>>>>>> chart shown
>>>>>>> on this link:
>>>>>>> http://www.w3.org/2005/03/23-lex-U
>>>> ****
>>>> Note that in the "green" areas, we find
>>>>
>>>> U0482 Cyrillic thousands sign
>>>> U055A Armenian apostrophe
>>>> U055C Armenian exclamation mark
>>>> U05C3 Hebrew punctuation SOF Pasuq
>>>> U060C Arabic comma
>>>> U066B Arabic decimal separator
>>>> U0700-U0709 Assorted Syriac punctuation marks
>>>> U0966-U096F Devanagari digits 0..9
>>>> U09E6-U09EF Bengali digits 0..9
>>>> U09F2-U09F3 Bengali rupee marks
>>>> U0A66-U0A6F Gurmukhi digits 0..9
>>>> U0AE6-U0AEF Gujarati digits 0..9
>>>> U0B66-U0B6F Oriya digits 0..9
>>>> U0BE6-U0BEF Tamil digits 0..9
>>>> U0BF0-U0BF2 Tamil indicators for 10, 100, 1000
>>>> U0BF3-U0BFA Tamil punctuation marks
>>>> U0C66-U0C6F Telugu digits 0..9
>>>> U0CE6-U0CEF Kannada digits 0..9
>>>> U0D66-U0D6F Malayalam digits 0..9
>>>> U0E50-U0E59 Thai digits 0..9
>>>> U0ED0-U0ED9 Lao digits 0..9
>>>> U0F20-U0F29 Tibetan digits 0..9
>>>> U0F2A-U0F33 Miscellaneous Tibetan numeric symbols
>>>> U1040-U1049 - Myanmar digits 0..9
>>>> U1360-U1368 Ethiopic punctuation marks
>>>> U1369-U137C Ethiopic numeric values (digits, tens of
>>>> digits, etc.)
>>>> U17E0-U17E9 Khmer digits 0..9
>>>> U1800-U180E Mongolian punctuation marks
>>>> U1810-U1819 Mongolian digits 0..9
>>>> U1946-U194F Limbu digits 0..9
>>>> U19D0-U19D9 New Tai Lue digits 0..9
>>>> ...at which point I realized I was wasting my time,
>>>> because I was attempting to disprove
>>>> what is a Really Dumb Idea, which is to write
>>>> applications
>>>> that actually work on UTF-8
>>>> encoded text.
>>>>
>>>> You are free to convert these to UTF-8, but in addition,
>>>> if I've read some of the
>>>> encodings correctly, the non-green areas preclude what
>>>> are
>>>> clearly "letters" in other
>>>> languages.
>>>>
>>>> Forget UTF-8. It is a transport mechanism used at input
>>>> and output edges. Use Unicode
>>>> internally.
>>>> ****
>>>>>>>
>>>>>>> A semantically identical regular expression is also
>>>>>>> found
>>>>>>> on the above link under Validating lex Template
>>>>>>>
>>>>>>> 1 ['\u0000'-'\u007F']
>>>>>>> 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>>>>>>> 3 | ( '\u00E0' ['\u00A0'-'\u00BF']
>>>>>>> ['\u0080'-'\u00BF'])
>>>>>>> 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF']
>>>>>>> ['\u0080'-'\u00BF'])
>>>>>>> 5 | ( '\u00ED' ['\u0080'-'\u009F']
>>>>>>> ['\u0080'-'\u00BF'])
>>>>>>> 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF']
>>>>>>> ['\u0080'-'\u00BF'])
>>>>>>> 7 | ( '\u00F0' ['\u0090'-'\u00BF']
>>>>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>>> 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF']
>>>>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>>> 9 | ( '\u00F4' ['\u0080'-'\u008F']
>>>>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>>>
>>>>>>> Here is my version, the syntax is different, but the
>>>>>>> UTF8
>>>>>>> portion should be semantically identical.
>>>>>>>
>>>>>>> UTF8_BYTE_ORDER_MARK [\xEF][\xBB][\xBF]
>>>>>>>
>>>>>>> ASCII [\x0-\x7F]
>>>>>>>
>>>>>>> U1 [a-zA-Z_]
>>>>>>> U2 [\xC2-\xDF][\x80-\xBF]
>>>>>>> U3 [\xE0][\xA0-\xBF][\x80-\xBF]
>>>>>>> U4 [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
>>>>>>> U5 [\xED][\x80-\x9F][\x80-\xBF]
>>>>>>> U6 [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
>>>>>>> U7 [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
>>>>>>> U8
>>>>>>> [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
>>>>>>> U9 [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
>>>>>>>
>>>>>>> UTF8
>>>>>>> {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>>>>
>>>>>>> // This identifies the "Letter" portion of an
>>>>>>> Identifier.
>>>>>>> L
>>>>>>> {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>>>>
>>>>>>> I guess that most of the analysis may simply boil
>>>>>>> down
>>>>>>> to
>>>>>>> whether or not the original source from the link is
>>>>>>> considered reliable. I had forgotten this original
>>>>>>> source
>>>>>>> when I first asked this question, that is why I am
>>>>>>> reposting this same question again.
>>>>>>
>>>>>> What has this got to do with C++? What is your C++
>>>>>> language question?
>>>>>>
>>>>>> /Leigh
>>>>>
>>>>>I will be implementing a utf8string to supplement
>>>>>std::string and will be using a regular expression to
>>>>>quickly divide up UTF-8 bytes into Unicode CodePoints.
>>>> ***
>>>> For someone who had an unholy fixation on "performance",
>>>> why would you choose such a slow
>>>> mechanism for doing recognition?
>>>>
>>>> I can imagine a lot of alternative approaches, including
>>>> having a table of 65,536
>>>> "character masks" for Unicode characters, including
>>>> on-the-fly updating of the table, and
>>>> extensions to support surrogates, which would outperform
>>>> any regular expression based
>>>> approach.
>>>>
>>>> What is your criterion for what constitutes a "letter"?
>>>> Frankly, I have no interest in
>>>> decoding something as bizarre as UTF-8 encodings to see
>>>> if
>>>> you covered the foreign
>>>> delimiters, numbers, punctuation marks, etc. properly,
>>>> and
>>>> it makes no sense to do so. So
>>>> there is no way I would waste my time trying to
>>>> understand
>>>> an example that should not
>>>> exist at all.
>>>>
>>>> Why do you seem to choose the worst possible choice when
>>>> there is more than one way to do
>>>> something? The choices are (a) work in 8-bit ANSI (b)
>>>> work in UTF-8 (c) work in Unicode.
>>>> Of these, the worst possible choice is (b), followed by
>>>> (a). (c) is clearly the winner.
>>>>
>>>> So why are you using something as bizarre as UTF-8
>>>> internally? UTF-8 has ONE role, which
>>>> is to write Unicode out in an 8-bit encoding, and read
>>>> Unicode in an 8-bit encoding. You
>>>> do NOT want to write the program in terms of UTF-8!
>>>> joe
>>>> ****
>>>>>
>>>>>Since there are no UTF-8 groups, or even Unicode groups
>>>>>I
>>>>>must post these questions to groups that are at most
>>>>>indirectly related to this subject matter.
>>>>>
>>>> Joseph M. Newcomer [MVP]
>>>> email: newcomer(a)flounder.com
>>>> Web: http://www.flounder.com
>>>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>>>
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on

See below...
On Fri, 14 May 2010 22:42:16 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>message news:68gpu599cjcsm3rjh1ptc6e9qu977smdph(a)4ax.com...
>> No, an extremely verbose "You are going about this
>> completely wrong".
>> joe
>>
>> On Thu, 13 May 2010 18:14:47 -0500, "Peter Olcott"
>> <NoSpam(a)OCR4Screen.com> wrote:
>>
>>>Ah so in other words an extremely verbose, "I don't know".
>>>Let me take a different approach. Can postings on
>>>www.w3.org
>>>generally be relied upon?
>>>
>>>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>>>message news:gprou55bvl3rgp2qmp6v3euk20ucf865mi(a)4ax.com...
>>>> See below...
>>>> On Thu, 13 May 2010 15:36:24 -0500, "Peter Olcott"
>>>> <NoSpam(a)OCR4Screen.com> wrote:
>>>>
>>>>>
>>>>>"Leigh Johnston" <leigh(a)i42.co.uk> wrote in message
>>>>>news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d(a)giganews.com...
>>>>>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in
>>>>>> message
>>>>>> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d(a)giganews.com...
>>>>>>> Is this Regular Expression for UTF-8 Correct??
>>>>>>>
>>>>>>> The solution is based on the GREEN portions of the
>>>>>>> first
>>>>>>> chart shown
>>>>>>> on this link:
>>>>>>> http://www.w3.org/2005/03/23-lex-U
>>>> ****
>>>> Note that in the "green" areas, we find
>>>>
>>>> U0482 Cyrillic thousands sign
>>>> U055A Armenian apostrophe
>>>> U055C Armenian exclamation mark
>>>> U05C3 Hebrew punctuation SOF Pasuq
>>>> U060C Arabic comma
>>>> U066B Arabic decimal separator
>>>> U0700-U0709 Assorted Syriac punctuation marks
>>>> U0966-U096F Devanagari digits 0..9
>>>> U09E6-U09EF Bengali digits 0..9
>>>> U09F2-U09F3 Bengali rupee marks
>>>> U0A66-U0A6F Gurmukhi digits 0..9
>>>> U0AE6-U0AEF Gujarati digits 0..9
>>>> U0B66-U0B6F Oriya digits 0..9
>>>> U0BE6-U0BEF Tamil digits 0..9
>>>> U0BF0-U0BF2 Tamil indicators for 10, 100, 1000
>>>> U0BF3-U0BFA Tamil punctuation marks
>>>> U0C66-U0C6F Telugu digits 0..9
>>>> U0CE6-U0CEF Kannada digits 0..9
>>>> U0D66-U0D6F Malayalam digits 0..9
>>>> U0E50-U0E59 Thai digits 0..9
>>>> U0ED0-U0ED9 Lao digits 0..9
>>>> U0F20-U0F29 Tibetan digits 0..9
>>>> U0F2A-U0F33 Miscellaneous Tibetan numeric symbols
>>>> U1040-U1049 - Myanmar digits 0..9
>>>> U1360-U1368 Ethiopic punctuation marks
>>>> U1369-U137C Ethiopic numeric values (digits, tens of
>>>> digits, etc.)
>>>> U17E0-U17E9 Khmer digits 0..9
>>>> U1800-U180E Mongolian punctuation marks
>>>> U1810-U1819 Mongolian digits 0..9
>>>> U1946-U194F Limbu digits 0..9
>>>> U19D0-U19D9 New Tai Lue digits 0..9
>
>Do you know anywhere where I can get a table that maps all
>of the code points to their category?
****
You don't need to. There's an API that does that. Go read the Unicode support. You can
also read the code for my Locale Explorer, or you can download the table from
www.unicode.org.
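
One such API on Windows is GetStringTypeW; a minimal sketch (BMP characters only;
surrogate pairs need the secondary-table treatment discussed elsewhere in this thread):

    #include <windows.h>
    #include <cstdio>

    // Ask Windows for the CT_CTYPE1 classification of one UTF-16 unit.
    bool IsLetter(wchar_t ch)
    {
        WORD type = 0;
        if (!GetStringTypeW(CT_CTYPE1, &ch, 1, &type))
            return false;
        return (type & C1_ALPHA) != 0;
    }

    int main()
    {
        printf("%d %d %d\n",
            IsLetter(L'A'),         // expect 1
            IsLetter(L'\x0966'),    // Devanagari digit zero: expect 0
            IsLetter(L'\x05D0'));   // Hebrew aleph: expect 1
        return 0;
    }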
joe
****
>
>>>> ...at which point I realized I was wasting my time,
>>>> because I was attempting to disprove
>>>> what is a Really Dumb Idea, which is to write
>>>> applications
>>>> that actually work on UTF-8
>>>> encoded text.
>>>>
>>>> You are free to convert these to UTF-8, but in addition,
>>>> if I've read some of the
>>>> encodings correctly, the non-green areas preclude what
>>>> are
>>>> clearly "letters" in other
>>>> languages.
>>>>
>>>> Forget UTF-8. It is a transport mechanism used at input
>>>> and output edges. Use Unicode
>>>> internally.
>
>That is how I intend to use it. To internationalize my GUI
>scripting language the interpreter will accept UTF-8 input
>as its source code files. It is substantially implemented
>using Lex and Yacc specifications for "C" that have been
>adapted to implement a subset of C++.
*****
So why does the question matter? Accepting UTF-8 input makes perfect sense, but the first
thing you should do with it is convert it to UTF-16, or better still UTF-32.
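
A minimal sketch of that edge conversion, hand-rolled so the validation mirrors the W3C
ranges quoted earlier (overlongs, UTF-16 surrogates, and values past U+10FFFF all map to
U+FFFD):

    #include <cstdint>
    #include <string>
    #include <vector>

    // Decode UTF-8 to code points at the input edge; everything downstream
    // works on char32_t. Invalid sequences become U+FFFD (replacement char).
    std::vector<char32_t> DecodeUtf8(const std::string& s)
    {
        std::vector<char32_t> out;
        for (size_t i = 0; i < s.size(); ) {
            unsigned char b = s[i];
            int len; char32_t cp;
            if      (b < 0x80) { len = 1; cp = b; }
            else if (b < 0xC2) { len = 1; cp = 0xFFFD; }   // stray trail / overlong lead
            else if (b < 0xE0) { len = 2; cp = b & 0x1F; }
            else if (b < 0xF0) { len = 3; cp = b & 0x0F; }
            else if (b < 0xF5) { len = 4; cp = b & 0x07; }
            else               { len = 1; cp = 0xFFFD; }
            for (int k = 1; k < len; ++k) {
                if (i + k >= s.size() || (s[i + k] & 0xC0) != 0x80) {
                    cp = 0xFFFD; len = k; break;           // truncated sequence
                }
                cp = (cp << 6) | (s[i + k] & 0x3F);
            }
            // Reject overlongs and UTF-16 surrogates, as the W3C ranges do.
            if ((len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000) ||
                (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
                cp = 0xFFFD;
            out.push_back(cp);
            i += len;
        }
        return out;
    }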
****
>
>It was far easier (and far less error prone) to add the C++
>that I needed to the "C" specification than it would have
>been to remove what I do not need from the C++
>specification.
***
Huh? What's this got to do with the encoding?
***
>
>The actual language itself will store its strings as 32-bit
>codepoints. The SymbolTable will not bother to convert its
>strings from UTF-8. It turns out that UTF-8 byte sort order
>is identical to Unicode code point sort order.
****
Strange. I thought sort order was locale-specific and independent of code points. But
then, maybe I just don't understand what is going on.
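
For what it's worth, the narrow claim is true as far as it goes: byte-wise comparison of
valid UTF-8 reproduces code-point order, which is exactly what locale-aware collation is
not. A minimal check:

    #include <cassert>
    #include <string>

    int main()
    {
        // U+00E9 (bytes C3 A9) vs U+0100 (bytes C4 80): the code points
        // order as 0xE9 < 0x100, and the raw UTF-8 bytes agree.
        std::string e_acute  = "\xC3\xA9";   // U+00E9
        std::string a_macron = "\xC4\x80";   // U+0100
        assert(e_acute < a_macron);          // byte-wise == code-point order
        // A locale-aware comparison (CompareStringW, strcoll) may order these
        // differently; byte order buys code-point order, nothing more.
        return 0;
    }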
*****
>
>I am implementing a utf8string that will provide the most
>useful subset of the std::string interface. I need the
>regular expression for Lex, and it also can be easily
>converted into a DFA to very quickly and completely
>correctly break up a UTF-8 string into its constituent
>code points.
>
>Do you know anywhere where I can get a table that maps all
>of the code points to their category?
*****
www.unicode.org

Also, there is an API call that does this, and you can check the source of my Locale
Explorer to find it.
joe
****
>
>It is a shame that Microsoft will be killing this group next
>month, where will we go?
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
See below...
On Sat, 15 May 2010 09:12:09 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
>news:Xns9D7922352F422MihaiN(a)207.46.248.16...
>>
>>> Do you know anywhere where I can get a table that maps
>>> all
>>> of the code points to their category?
>>
>> ftp://ftp.unicode.org/Public/5.2.0/ucd
>>
>> UnicodeData.txt
>> The main guide for that is
>> ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
>> (if you don't want to go thru the standard, which is the
>> advisable thing)
>>
>> And when you bump your head, remember that joe and I warned
>> you about utf-8.
>> It was not designed for this kind of usage.
>>
>>
>Joe also said that UTF-8 was designed for data interchange
>which is how I will be using it. Joe also falsely assumed
>that I would be using UTF-8 for my internal representation.
>I will be using UTF-32 for my internal representation.
****
But then, you would not need the UTF-8 regexps! You would only need those if you were
storing the data in UTF-8. To give an external grammar to your language, you should give
the UTF-32 regexps, and if necessary, you can TRANSLATE those to UTF-8, but you don't
start with UTF-8. The lex input would need to be in terms of UTF-32, so you would not be
using UTF-8 there, either.
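
For instance, the single UTF-32 range ['\u0080'-'\u07FF'] translates mechanically into
the two-byte pattern [\xC2-\xDF][\x80-\xBF], which is exactly line 2 of the W3C template
quoted earlier; a tool can perform that expansion, but the grammar itself should be
stated in code points.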
****
>
>I will be using UTF-8 as the source code for my language
>interpreter, which has the advantage of simply being ASCII
>for the English language, and working across every platform
>without requiring adaptations such as Little Endian and Big
>Endian. UTF-8 will also be the output of my OCR4Screen DFA
>recognizer.
>
>>
>> --
>> Mihai Nita [Microsoft MVP, Visual C++]
>> http://www.mihai-nita.net
>> ------------------------------------------
>> Replace _year_ with _ to get the real email
>>
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
See below...
On Fri, 14 May 2010 01:30:00 -0700, "Mihai N." <nmihai_year_2000(a)yahoo.com> wrote:

>
>> I can imagine a lot of alternative approaches, including having a table of
>> 65,536 "character masks" for Unicode characters
>
>As we know, 65,536 (FFFF) is not enough, Unicode codepoints go to 10FFFF :-)
****
Yes, but this leads to questions of how to build sparse encodings or handle surrogates
with secondary tables, and I did not want to confuse the issue. First-cut performance
would be to use a 64K table, and for values above FFFF decode to a secondary table.

But this would be too much reality to absorb at once.
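
A minimal sketch of that shape (the mask width is illustrative; a production table would
share identical pages):

    #include <cstdint>
    #include <vector>

    // One mask byte per BMP code point, with the 16 supplementary planes
    // behind on-demand secondary tables of 64K entries each.
    class CharMaskTable {
        std::vector<uint8_t> bmp_;
        std::vector<uint8_t> high_[17];      // planes 1..16; index 0 unused
    public:
        CharMaskTable() : bmp_(0x10000, 0) {}
        void Set(char32_t cp, uint8_t mask) {
            if (cp < 0x10000) { bmp_[cp] = mask; return; }
            unsigned plane = cp >> 16;
            if (plane > 16) return;          // beyond U+10FFFF: ignore
            if (high_[plane].empty()) high_[plane].assign(0x10000, 0);
            high_[plane][cp & 0xFFFF] = mask;
        }
        uint8_t Get(char32_t cp) const {
            if (cp < 0x10000) return bmp_[cp];
            unsigned plane = cp >> 16;
            if (plane > 16 || high_[plane].empty()) return 0;
            return high_[plane][cp & 0xFFFF];
        }
    };

Lookup stays a compare and an index either way, and only the planes actually touched
cost any memory.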
****
>
>
>
>> What is your criterion for what constitutes a "letter"?
>
>The best way to attack the identification is by using Unicode properties
>Each code point has attributes indicating if it is a letter
>(General Category)
>
>A good starting point is this:
> http://unicode.org/reports/tr31/tr31-1.html
>
>But this only shows that basing that on some UTF-8 kind of thing is not
>the way. And how are you going to deal with combining characters?
>Normalization?
****
Ahh, that old concept, "reality" again. This is what I meant by the question about what
constitutes a letter; for example, the trivial case of a byte sequence that encodes a
nonspacing accent mark with a letter that follows requires a separate lexical rule because
lex only works on actual input characters (even if modified to support UTF-32), and
therefore the overly-simplistic regexp shown is clearly untenable. But again, I did not
want to point out the subtleties because I would not have been using exhaustive
categorical reasoning to derive why the question was a stupid question. So I pointed out
just the most trivial of failure modes, and asked a fundamental question, for which, alas,
you gave the answer (thus cheating me out of further annoying Peter by forcing him to
actually think the problem through). Vowel marks in some languages (e.g., Hebrew) are
another counterexample.
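
One concrete illustration: U+0065 U+0301 ("e" plus combining acute) and U+00E9 are the
same text, yet no per-byte or per-code-point rule sees them as the same letter without
normalization. A minimal sketch using the Vista-era NormalizeString API (link against
Normaliz.lib):

    #include <windows.h>
    #include <cstdio>
    #pragma comment(lib, "normaliz.lib")

    int main()
    {
        // "e" followed by a combining acute accent: two code points.
        const wchar_t decomposed[] = { 0x0065, 0x0301, 0 };
        wchar_t composed[8] = { 0 };
        int n = NormalizeString(NormalizationC, decomposed, -1, composed, 8);
        // After NFC this should be the single code point U+00E9, so a
        // per-code-point "letter" rule sees different input than before.
        printf("n=%d first=U+%04X\n", n, (unsigned)composed[0]);
        return 0;
    }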

Even UTF-32 doesn't solve the "what is a letter" question! Which is why the regexp rules
are clearly bad!
joe
****
>
>There are very good reasons why the rule of thumb is:
> - UTF-16 or UTF-32 for processing
> - UTF-8 for storage/exchange
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm