Is this UTF-8 regular expression semantically correct? [MFC]


From: Peter Olcott on 20 May 2010 21:38

On 5/20/2010 8:05 PM, Joseph M. Newcomer wrote:
> See below...
> On Thu, 20 May 2010 13:23:23 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> I was under the false assumption that C++ required ASCII punctuation at
>> the lexical level. This is the key false assumption that prevented me
>> from understanding what you were saying.
> *****
> Actually, this is irrelevant to my point. In fact, it is completely irrelevant to the
> discussion. You had said "Allow a programmer to write code in his/her native language" or
> something close to that statement, by which I would interpret this to mean

Something entirely different than the way that I intended it because of
my lack of knowledge about how these sorts of things are typically
handled, hence the source of our failure to communicate.

> "this is an
> extension which permits localized variable names using the native language characters,
> localized numbers using the native language digits, and localized punctuation, using the
> native language punctuation marks", which seems to be a reasonable interpretation. Then

It does now that I have several gaps in my understanding filled in. I
specifically asked for you to fill in these gaps when I repeatedly asked
how C++ does it.

> you made the statement that all characters > U+007F would be "letters", which even the most
> superficial reading of the Unicode standard, or even looking at Character Map,
> demonstrates is nonsensical. It has NOTHING to do with what ANSI/ISO C requires, and
> everything to do with you creating a specification inconsistent with your claim. No deep
> understanding of C++ is required

No, but, a deep understanding about the ways that computer languages are
adapted to international use was required, and I lacked this.

> to see that allowing an Armenian to write an identifier
> which has an Armenian comma embedded is going to be confusing to an Armenian reading the
> source code, and that an Armenian who writes his/her own native comma between parameters

I have told you repeatedly that I thought that this was an impossible
scenario because I thought that everyone standardized on the ASCII
Comma. It is still absurd to me why there needs to be more than one comma.

> is writing something that makes PERFECT sense to an Armenian, but by your interpretation,
> if A and B can stand for Armenian letters, and , for an Armenian comma, a function call of
> two parameters f(A,B) written in Armenian script is something you interpret as a function
> call of one argument whose name is "A,B", and I fail to see how this could make sense.
>
> And at no point does the C++ language rule need to enter this discussion to demonstrate
> that your specification ("writing in the native language") and your implementation ("Not
> accepting native language digits or punctuation") are inconsistent.
>
> Note that ANSI/ISO C++ does not accept letters outside the set [A-Za-z_] but you did not
> say "I'm not going to let an Armenian write identifiers using Armenian letters because C++
> does not accept that"; instead, you fastened on one particular point, the specification of
> comma, and insisted that an Armenian is not entitled to use an Armenian comma, that is,
> write in their native language. Note that many languages have special symbols for colon,
> semicolon, comma, and period, and for that matter, most Europeans object to having to
> write floating point numbers as 1.23 when any child there knows the proper form is 1,23. C
> and C++ do not cater to this, either, so if you REALLY want to make the syntax localized,
> you have to allow for the correct lexical notation for primitive lexemes as recognized by
> native language users. And note that not all languages specify a negative number by putting
> a hyphen as the first character; in some languages, the symbol is not a hyphen, and the
> symbol appears as the LAST character of the sequence!
>
> One of the Great Debates of the Algol development era was what notation to use for
> reserved words and numbers, and the committee (all of whom spoke fluent English) agreed
> that the identifiers such as BEGIN, END, FOR, STEP, WHILE, UNTIL, PROCEDURE, INTEGER, and
> so on would be in English. Many other international programmers viewed this as the
> English-speaking world imposing their monoculture on everyone. In fact, one of the
> reasons that China fell behind in computing was that as part of the Cultural Revolution,
> technologies that required "Western" notations (such as FORTRAN, COBOL, ALGOL, and
> probably C--there was no C++ back then, and languages derived from these) were forbidden,
> which meant only assemblers and absolute octal/hex coding were permitted. Foreign numeric
> notations, like Arabic numbers, were also forbidden, because these compromised the
> "purity" of the authentic People's Culture. Of course, they also had to reject a lot that
> came down from the previous emperors, including a lot of technological work, so they were
> left with a lot of problems writing code. Some organizations (notably, those concerned
> with military computers) were exempted, but civilian (including academic) computing hit a
> real low at that time, and took decades to recover. [This is based on stories told to me
> by visiting Chinese computer specialists.]
>
> Many European compilers had workarounds that allowed the English reserved words to be
> overridden; for example, a set of directive cards (remember, this is the punched-card era)
> of the form
> $BEGIN=localizedword
> $FOR=localizedword
> which allowed a programmer who didn't know English to write a program he or she found
> intelligible. There was a desire to use [] to indicate subscripting, but those character
> codes had been taken over by Germans who used them to represent ÄÜ, so the German
> contingent insisted that () be used for subscripting. Hence, it was ambiguous if you saw
> an expression
> n = a(i);
> because you did not know, unless you found the definition of a in scope, whether a was a
> function or an array. And because Algol allows functions to be defined inside functions,
> and we lived in a punched-card universe (no
> hover-the-mouse-over-the-name-and-get-its-type) it could be a quite complex task for
> someone reading the code to figure out where a was defined and what it meant (the compiler
> had no problem, but it didn't always do what you expected, only what was right). In fact,
> a number of notable "hacks" could be done by declaring a variable name that superseded an
> outer-scope array name, or vice-versa, and you could drive people nuts looking for bugs by
> playing this game.
>
> The localization of a programming language is a VERY complex problem, but to go to the
> extreme of forbidding the use of localized numbers and punctuation by interpreting them as
> "letters" is, well, over the top.
>
> My objection had nothing to do with the use of the ASCII-7 comma in the C++ spec, and
> never did. And your response which raised that issue as the objection was ill-founded.
> joe
> ****
>>
>>> ****
>>>>
>>>> Ignorance squared means that one is even ignorant of their own
>>>> ignorance. (or at least the degree of this ignorance).
>>> ***
>>> So what do you call an insistence on remaining ignorant, even when others are supplying
>>> knowledge you didn't have?
>>> joe
>>
>> I explained this term here as applied to myself to help you understand
>> why it took me so long to understand what you were saying. What you were
>> saying made no sense at all within the context of my fundamental (and
>> incorrect) base assumptions.
>>
>> Correcting these fundamental base assumptions was the necessary
>> prerequisite for my understanding what you were saying.
>>
>>> ****
>>> Joseph M. Newcomer [MVP]
>>> email: newcomer(a)flounder.com
>>> Web: http://www.flounder.com
>>> MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Mihai N. on 21 May 2010 02:32

> Ah so my idea to allow UTF-8 encoded identifiers is really not all that
> bad.

It is not the idea that is bad, it is the intended implementation.
(anything above 0x80 (or was it 0xff?) is a letter).

I have pointed you to a relevant document
"Identifier and Pattern Syntax" (Unicode Technical Report #31,
http://unicode.org/reports/tr31/tr31-1.html)

You ignored it, but you keep saying "I am no expert."
Well, maybe reading documents written by experts might be a good start.
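
A minimal sketch (not part of the original post) of why the "anything above
0x7F is a letter" rule breaks down. It is written in Python only because its
built-in str.isidentifier() follows the XID_Start/XID_Continue identifier
definition from the Unicode identifier report. The helper naive_is_identifier
is a hypothetical name that models the rule under discussion; it is not
anybody's actual lexer.

```python
import unicodedata

ARMENIAN_AYB = "\u0531"    # ARMENIAN CAPITAL LETTER AYB
ARMENIAN_BEN = "\u0532"    # ARMENIAN CAPITAL LETTER BEN
ARMENIAN_COMMA = "\u055D"  # ARMENIAN COMMA


def naive_is_identifier(text):
    """Hypothetical rule: every code point above U+007F counts as a letter."""

    def is_letter(ch):
        return ch == "_" or "a" <= ch <= "z" or "A" <= ch <= "Z" or ord(ch) > 0x7F

    def is_letter_or_digit(ch):
        return is_letter(ch) or "0" <= ch <= "9"

    return (
        bool(text)
        and is_letter(text[0])
        and all(is_letter_or_digit(ch) for ch in text[1:])
    )


# The argument list "A,B" written with Armenian letters and an Armenian comma.
argument_text = ARMENIAN_AYB + ARMENIAN_COMMA + ARMENIAN_BEN

if __name__ == "__main__":
    # General category of each code point: letters versus punctuation.
    for ch in argument_text:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
    print("naive rule accepts as one identifier:", naive_is_identifier(argument_text))
    print("property-based rule accepts:", argument_text.isidentifier())
```

Under the naive rule the whole three-character string is accepted as a single
identifier, which is exactly the f(A,B) problem. The property-based check
rejects it, because the Armenian comma is punctuation (general category Po),
not a letter.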

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Peter Olcott on 21 May 2010 04:38

On 5/21/2010 1:32 AM, Mihai N. wrote:
>
>> Ah so my idea to allow UTF-8 encoded identifiers is really not all that
>> bad.
>
> It is not the idea that is bad, it is the intended implementation.
> (anything above 0x80 (or was it 0xff?) is a letter).
>
> I have pointed you to a relevant document
> "Identifier and Pattern Syntax" (Unicode Technical Report #31,
> http://unicode.org/reports/tr31/tr31-1.html)
>
> You ignored it, but you keep saying "I am no expert"
> Well, maybe reading documents written by experts might be a good start.
>
>
I was already familiar with the restriction that Java makes, and this
seems to be similar. Initially I saw no reason to restrict an identifier
to a letter because I had assumed that C++ was always inherently
restricted at the lexical level to ASCII. Since my language will be a
subset of C++, I would follow whatever C++ does.

Since it is possible to map local punctuation (why is there a need for
more than one comma in the world?) and local digits to this ASCII set,
then restricting identifiers from using local punctuation would be
required.

So from all of this what seems to make the most sense is to restrict the
set of code points used by identifiers only to the extent that the lexer
is (or will be) mapping non-ASCII code points to ASCII. This would seem
to be the choice that minimizes complexity.
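
A rough sketch of what that could look like (added for illustration, in Python
rather than C++ for brevity). Everything here is an assumption: the table
PUNCTUATION_TO_ASCII lists only a few example code points, and the names
fold_to_ascii and is_identifier_char are hypothetical. A real lexer would also
have to skip string literals and comments, which this sketch ignores.

```python
import unicodedata

# Illustrative, deliberately incomplete table of local punctuation marks.
PUNCTUATION_TO_ASCII = {
    "\u055D": ",",  # ARMENIAN COMMA
    "\u060C": ",",  # ARABIC COMMA
    "\u3001": ",",  # IDEOGRAPHIC COMMA
    "\uFF0C": ",",  # FULLWIDTH COMMA
    "\u061B": ";",  # ARABIC SEMICOLON
}


def fold_to_ascii(source):
    """Map known local punctuation and decimal digits to their ASCII forms."""
    out = []
    for ch in source:
        if ch in PUNCTUATION_TO_ASCII:
            out.append(PUNCTUATION_TO_ASCII[ch])
        elif ord(ch) > 0x7F and unicodedata.category(ch) == "Nd":
            # Any Unicode decimal digit carries a value 0..9.
            out.append(str(unicodedata.decimal(ch)))
        else:
            out.append(ch)
    return "".join(out)


def is_identifier_char(ch):
    """Reject in identifiers exactly those non-ASCII code points that get folded."""
    if ord(ch) <= 0x7F:
        return ch == "_" or ch.isalnum()
    folded = ch in PUNCTUATION_TO_ASCII or unicodedata.category(ch) == "Nd"
    return not folded
```

With this, a call written as f(A,B) with Armenian letters and an Armenian comma
folds to the same call with an ASCII comma, and the Armenian comma can no
longer appear inside an identifier. Note the limitation: any non-ASCII code
point missing from the table is still treated as an identifier character.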

From: Joseph M. Newcomer on 21 May 2010 08:11

See below...
On Thu, 20 May 2010 20:38:59 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/20/2010 8:05 PM, Joseph M. Newcomer wrote:
>> See below...
>> On Thu, 20 May 2010 13:23:23 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>
>>> I was under the false assumption that C++ required ASCII punctuation at
>>> the lexical level. This is the key false assumption that prevented me
>>> from understanding what you were saying.
>> *****
>> Actually, this is irrelevant to my point. In fact, it is completely irrelevant to the
>> discussion. You had said "Allow a programmer to write code in his/her native language" or
>> something close to that statement, by which I would interpret this to mean
>
>Something entirely different than the way that I intended it because of
>my lack of knowledge about how these sorts of things are typically
>handled, hence the source of our failure to communicate.
>
>
>> "this is an
>> extension which permits localized variable names using the native language characters,
>> localized numbers using the native language digits, and localized punctuation, using the
>> native language punctuation marks", which seems to be a reasonable interpretation. Then
>
>It does now that I have several gaps in my understanding filled in. I
>specifically asked for you to fill in these gaps when I repeatedly asked
>how C++ does it.
>
>> you made the statement that all characters > U+007F would be "letters", which even the most
>> superficial reading of the Unicode standard, or even looking at Character Map,
>> demonstrates is nonsensical. It has NOTHING to do with what ANSI/ISO C requires, and
>> everything to do with you creating a specification inconsistent with your claim. No deep
>> understanding of C++ is required
>
>No, but, a deep understanding about the ways that computer languages are
>adapted to international use was required, and I lacked this.
>
>> to see that allowing an Armenian to write an identifier
>> which has an Armenian comma embedded is going to be confusing to an Armenian reading the
>> source code, and that an Armenian who writes his/her own native comma between parameters
>
>I have told you repeatedly that I thought that this was an impossible
>scenario because I thought that everyone standardized on the ASCII
>Comma. It is still absurd to me why there needs to be more than one comma.
****
Then try to explain to someone from Israel, Arabia, Thailand, Viet Nam, India, China,
Japan, Korea, or for that matter a Native Canadian why they need their own alphabet. By
gum, A-Za-z was good enough for Jesus and it's good enough for me! (After all, the King
James edition of the New Testament was printed in English...and this is a takeoff on a
comment made by a Little Old Lady about why the Roman Catholic mass was in Latin: Why
didn't they say the Mass in English, the native language of Jesus?) Who are you to tell
someone from a different culture what their punctuation must look like? They may have
developed these punctuation marks centuries ago, and their question is "Why are you
forcing us to use an English punctuation mark in our native language?" (Which is a far
more valid question!) The use of native punctuation marks is not a choice you can make.
It is only a choice you can accommodate.
joe
****
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on 21 May 2010 08:28

See below...
On Fri, 21 May 2010 03:38:43 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/21/2010 1:32 AM, Mihai N. wrote:
>>
>>> Ah so my idea to allow UTF-8 encoded identifiers is really not all that
>>> bad.
>>
>> It is not the idea that is bad, it is the intended implementation.
>> (anything above 0x80 (or was it 0xff?) is a letter).
>>
>> I have pointed you to a relevant document
>> "Identifier and Pattern Syntax" (Unicode Technical Report #31,
>> http://unicode.org/reports/tr31/tr31-1.html)
>>
>> You ignored it, but you keep saying "I am no expert"
>> Well, maybe reading documents written by experts might be a good start.
>>
>>
>I was already familiar with the restriction that Java makes, and this
>seems to be similar. Initially I saw no reason to restrict an identifier
>to a letter because I had assumed that C++ was always inherently
>restricted at the lexical level to ASCII. Since my language will be a
>subset of C++, I would follow whatever C++ does.
>
>Since it is possible to map local punctuation (why is there a need for
>more than one comma in the world?)
****
That is a remarkably STUPID statement! Why is there a need for more than one alphabet in
the world? What's wrong with the Roman alphabet? Why do those annoying Greeks and
Russians have their own alphabet? Who gave permission for those Arabs to use "Arabian"
digits that don't look like the ones we write in the U.S. and England? They must be
culturally deprived to insist that they have their own language? Next, you'll be
insisting that the whole world should speak only English!

Talk about foaming-at-the-mouth ethnocentric bias!

Next, you'll be complaining about them furriners talkin' funny, using words we Good
Americans don't unnerstand! Perhaps it is the fact that my grandfather was an immigrant
and spoke 18 languages or dialects of languages (including Italian, German, French,
Russian, Hungarian, Polish and Greek, as well as his native Slovak, which are the only
languages I ever heard him speak; but my mother said he knew a lot more, and heard him
speak them when she was growing up in an ethnically diverse neighborhood) that makes me
aware of such issues. But you are making it sound like you want to know who some
furriner is to insist they have a comma that doesn't look like our good, native, home-grown
symbol! The audacity! Maybe it is White Man's Burden to convey the Rightness Of The One
True Comma to all those unwashed heathens!

Sheesh!

Are you that seriously damaged that you cannot recognize the validity of other human
beings in their own cultures? And their desire to use their own punctuation marks in
writing their programs?
****
> and local digits to this ASCII set,
>then restricting identifiers from using local punctuation would be
>required.
>
>So from all of this what seems to make the most sense is to restrict the
>set of code points used by identifiers only to the extent that the lexer
>is (or will be) mapping NonASCII code points to ASCII. This would seem
>to be the choice that minimizes complexity.
***
Isn't that what I told you in my first message, complete with all the necessary details?
You could have spent the necessary five minutes to do the research (I obviously had, and
it took less than 5 minutes) to verify that I was being very precise.
joe

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
