Is this UTF-8 regular expression semantically correct? [MFC]

Prev: Love Potion for Miss Blandish
Next: Newcomer's CAsyncSocket example: trouble connecting with other clients

From: Peter Olcott on 21 May 2010 10:59

On 5/21/2010 7:28 AM, Joseph M. Newcomer wrote:
> See below...
> On Fri, 21 May 2010 03:38:43 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> On 5/21/2010 1:32 AM, Mihai N. wrote:

>> I was already familiar with the restriction that Java makes, and this
>> seems to be similar. Initially I saw no reason to restrict an identifier
>> to a letter because I had assumed that C++ was always inherently
>> restricted at the lexical level to ASCII. Since my language will be a
>> subset of C++, I would follow whatever C++ does.
>>
>> Since it is possible to map local punctuation (why is there a need for
>> more than one comma in the world?)
> ****
> That is a remarkably STUPID statement! Why is there a need for more than one alphabet in
> the world?

According to the bible because God got angry and screwed up the
pre-existing universal language.

> What's wrong with the Roman alphabet? Why do those annoying Greeks and
> Russians have their own alphabet? Who gave permission for those Arabs to use "Arabian"
> digits that don't look like the ones we write in the U.S. and England? They must be
> culturally deprived to insist that they have their own language? Next, you'll be
> insisting that the whole world should speak only English!
>
> Talk about foaming-at-the-mouth ethnocentric bias!
>
> Next, you'll be complaining about them furriners talkin' funny, using words we Good
> Americans don't unnerstand! (Perhaps it is the fact that my grandfather was an immigrant,
> and spoke 18 languages or dialects of languages, including Italian, German, French,
> Russian, Hungarian, Polish and Greek, as well as his native Slovak, which are the only
> languages I ever heard him speak; but my mother said he knew a lot more, and heard him
> speak them when she was growing up in an ethnically diverse neighborhood) that makes me
> aware of such issues. But you are making it sound like you want to know who is some
> furriner to insist they have a comma that doesn't look like our good, native, home-grown
> symbol! The audacity! Maybe it is White Man's Burden to convey the Rightness Of The One
> True Comma to all those unwashed heathens!
>
> Sheesh!
>
> Are you that seriously damaged that you cannot recognize the validity of other human
> beings in their own cultures? And their desire to use their own punctuation marks in
> writing their programs?

In the case where there is a semantic difference, yes. In the case where
nothing is added besides unnecessary complexity, no.

> ****
>> and local digits to this ASCII set,
>> then restricting identifiers from using local punctuation would be
>> required.
>>
>> So from all of this what seems to make the most sense is to restrict the
>> set of code points used by identifiers only to the extent that the lexer
>> is (or will be) mapping NonASCII code points to ASCII. This would seem
>> to be the choice that minimizes complexity.
> ***
> Isn't that what I told you in my first message, complete with all the necessary details?
> You could have spent the necessary five minutes to do the research (I obviously had, and
> it took less than 5 minutes) to verify that I was being very precise.
> joe

Previously I thought that C++ explicitly required all text to be ASCII,
and thus would not allow any mapping. This being the case my original
idea would have been correct and all of your criticism would have been
incorrect. Until I could see that C++ allowed this mapping what you were
saying merely seemed to be useless argumentativeness.

C++ is apparently more restrictive than you thought because it requires
every input character to be mapped to the ASCII set. This would seem to
explicitly prohibit the flexibility that I provided of allowing UTF-8
identifiers.

So within the context of all this my original design has proven to be
reasonable within my design goals. Users of my language will initially
have the extra benefit of encoding identifiers in their native language
which is more than is possible in C++. Possibly I could screen out
NonASCII punctuation from the lexical definition of Letter to minimize
the fixes required to user specified code if I ever decide to add local
punctuation and local digits.

>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on 21 May 2010 15:30

See below....
On Fri, 21 May 2010 09:59:50 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/21/2010 7:28 AM, Joseph M. Newcomer wrote:
>> See below...
>> On Fri, 21 May 2010 03:38:43 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>
>>> On 5/21/2010 1:32 AM, Mihai N. wrote:
>
>>> I was already familiar with the restriction that Java makes, and this
>>> seems to be similar. Initially I saw no reason to restrict an identifier
>>> to a letter because I had assumed that C++ was always inherently
>>> restricted at the lexical level to ASCII. Since my language will be a
>>> subset of C++, I would follow whatever C++ does.
>>>
>>> Since it is possible to map local punctuation (why is there a need for
>>> more than one comma in the world?)
>> ****
>> That is a remarkably STUPID statement! Why is there a need for more than one alphabet in
>> the world?
>
>According to the bible because God got angry and screwed up the
>pre-existing universal language.
****
So we're making design decisions based on biblical tales now?

It was Jean Sammet who took one of the classic images of the Tower of Babel and wrote the
names of programming languages on the bricks. She maintained and updated this image every
couple years as languages came and went; at one point, a language I helped design and
implement was on it, but only for about two years.
****
>
>> What's wrong with the Roman alphabet? Why do those annoying Greeks and
>> Russians have their own alphabet? Who gave permission for those Arabs to use "Arabian"
>> digits that don't look like the ones we write in the U.S. and England? They must be
>> culturally deprived to insist that they have their own language? Next, you'll be
>> insisting that the whole world should speak only English!
>>
>> Talk about foaming-at-the-mouth ethnocentric bias!
>>
>> Next, you'll be complaining about them furriners talkin' funny, using words we Good
>> Americans don't unnerstand! (Perhaps it is the fact that my grandfather was an immigrant,
>> and spoke 18 languages or dialects of languages, including Italian, German, French,
>> Russian, Hungarian, Polish and Greek, as well as his native Slovak, which are the only
>> languages I ever heard him speak; but my mother said he knew a lot more, and heard him
>> speak them when she was growing up in an ethnically diverse neighborhood) that makes me
>> aware of such issues. But you are making it sound like you want to know who is some
>> furriner to insist they have a comma that doesn't look like our good, native, home-grown
>> symbol! The audacity! Maybe it is White Man's Burden to convey the Rightness Of The One
>> True Comma to all those unwashed heathens!
>>
>> Sheesh!
>>
>> Are you that seriously damaged that you cannot recognize the validity of other human
>> beings in their own cultures? And their desire to use their own punctuation marks in
>> writing their programs?
>
>In the case where there is a semantic difference, yes. In the case where
>nothing is added besides unnecessary complexity, no.
****
But if you declare their comma to be a letter, the parameter pair A,B becomes a single
identifier! So in fact it has a significant impact on the semantics!
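
To make the A,B point concrete, here is a minimal C++ sketch. It is not code from anyone's actual lexer. It assumes the source has already been decoded to code points, and the names isIdentifierChar, splitIdentifiers, and demoCommaAsLetter, as well as the short list of non-ASCII commas, are hypothetical choices for illustration only.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical classifier. ASCII letters, digits, and underscore are
// identifier characters. Code points above ASCII are identifier characters
// too, unless excludeLocalCommas is set and the code point is one of a few
// example commas (an illustrative list, far from exhaustive).
bool isIdentifierChar(char32_t c, bool excludeLocalCommas) {
    if (c < 0x80) {
        return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') ||
               (c >= U'0' && c <= U'9') || c == U'_';
    }
    if (excludeLocalCommas &&
        (c == U'\uFF0C' ||   // FULLWIDTH COMMA
         c == U'\u3001' ||   // IDEOGRAPHIC COMMA
         c == U'\u060C')) {  // ARABIC COMMA
        return false;
    }
    return true;
}

// Splits decoded source text into maximal runs of identifier characters.
// The "must not start with a digit" rule is omitted to keep the sketch short.
std::vector<std::u32string> splitIdentifiers(const std::u32string& src,
                                             bool excludeLocalCommas) {
    std::vector<std::u32string> out;
    std::u32string current;
    for (char32_t c : src) {
        if (isIdentifierChar(c, excludeLocalCommas)) {
            current.push_back(c);
        } else if (!current.empty()) {
            out.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) out.push_back(current);
    return out;
}

// Usage: A and B separated by a fullwidth comma.
void demoCommaAsLetter() {
    const std::u32string src = U"A\uFF0CB";
    // Comma treated as a letter: the whole text is one identifier.
    assert(splitIdentifiers(src, false).size() == 1);
    // Comma excluded from letters: A and B are separate identifiers.
    assert(splitIdentifiers(src, true).size() == 2);
}
```

With the comma counted as a letter, the lexer sees one three-character identifier where the author meant a parameter pair.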

And learning an alphabet other than the classic Roman alphabet (ASCII-7) is clearly
"unnecessary complexity" (unless, of course, you want to get around in Moscow, Athens,
Tokyo, Jerusalem, Cairo, Beijing, Seoul, Saigon [Ho Chi Minh City], Bangkok, ....). I
don't buy the "complexity" argument, and if I was wanting to build an international
language that resembled C++, I'd rope in REAL language experts to help me (and I'd pay
them!). If I wanted a naive approximation, I would only include letters. I've actually
seen grammars that have the declaration

end-of-statement = ';'

which allows implementors to define end-of-statement to be any collection of localized
semicolons. All productions that would reference semicolon use the nonterminal
end-of-statement, that is, in C this would mean you could write

assignment-statement: lvalue = rvalue end-of-statement

The C grammar doesn't work this way, but others do. And if you are making an *extension* to
the language to allow *native character support*, then you *must* do this!
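
One way the end-of-statement idea could look in a hand-written C++ lexer is sketched below. This is an assumption-laden illustration, not taken from any real grammar: the set kEndOfStatement and the functions isEndOfStatement, countStatements, and demoEndOfStatement are hypothetical, and the three code points listed are examples rather than a complete list of localized semicolons.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative, incomplete set of code points accepted as end-of-statement.
constexpr std::array<char32_t, 3> kEndOfStatement = {{
    U';',        // U+003B SEMICOLON
    U'\uFF1B',   // U+FF1B FULLWIDTH SEMICOLON
    U'\u061B',   // U+061B ARABIC SEMICOLON
}};

// The parser asks "is this an end-of-statement?" instead of "is this ';'?".
bool isEndOfStatement(char32_t c) {
    return std::find(kEndOfStatement.begin(), kEndOfStatement.end(), c) !=
           kEndOfStatement.end();
}

// Counts statements by counting terminators, as a stand-in for a real parser.
std::size_t countStatements(const std::u32string& src) {
    return static_cast<std::size_t>(
        std::count_if(src.begin(), src.end(), isEndOfStatement));
}

// Usage: one statement ends with ';', the other with a fullwidth semicolon.
void demoEndOfStatement() {
    assert(countStatements(U"a=b;c=d\uFF1B") == 2);
}
```

Because every production refers to the nonterminal rather than to a literal ';', adding another localized terminator is a one-line change to the set.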

You cannot misrepresent what you are doing and you should, out of consideration to the
cultures, at least make a pretense of supporting their localized symbols.
joe
****
>
>> ****
>>> and local digits to this ASCII set,
>>> then restricting identifiers from using local punctuation would be
>>> required.
>>>
>>> So from all of this what seems to make the most sense is to restrict the
>>> set of code points used by identifiers only to the extent that the lexer
>>> is (or will be) mapping NonASCII code points to ASCII. This would seem
>>> to be the choice that minimizes complexity.
>> ***
>> Isn't that what I told you in my first message, complete with all the necessary details?
>> You could have spent the necessary five minutes to do the research (I obviously had, and
>> it took less than 5 minutes) to verify that I was being very precise.
>> joe
>
>Previously I thought that C++ explicitly required all text to be ASCII,
>and thus would not allow any mapping. This being the case my original
>idea would have been correct and all of your criticism would have been
>incorrect. Until I could see that C++ allowed this mapping what you were
>saying merely seemed to be useless argumentativeness.
****
But the point is, it didn't take those of us who had not read the new C++ draft all that
long to discover the truth. You could have done it, too. A few of us were basing what we
were saying NOT on the C++ Draft Standard, but on your OWN assertion that you were extending the
syntax to support localization! So I didn't even HAVE to look at the C++ Draft Standard
to know that what you were saying in your requirements was completely inconsistent with
your specified implementation! And, as I said, the requirements of C++ were IRRELEVANT to
this discussion!
****
>
>C++ is apparently more restrictive than you thought because it requires
>every input character to be mapped to the ASCII set. This would seem to
>explicitly prohibit the flexibility that I provided of allowing UTF-8
>identifiers.
****
OK, are you implementing C++ or are you implementing an EXTENSION of C++ to allow native
characters? If you are implementing an EXTENSION, then you get to decide what identifiers
look like, but they should NOT look like sequences of arbitrary characters including
punctuation marks! That is not sensible, and it is inconsistent with the stated goal! You
don't need to read the C++ standard to know this!
****
>
>So within the context of all this my original design has proven to be
>reasonable within my design goals. Users of my language will initially
>have the extra benefit of encoding identifiers in their native language
>which is more than is possible in C++. Possibly I could screen out
>NonASCII punctuation from the lexical definition of Letter to minimize
>the fixes required to user specified code if I ever decide to add local
>punctuation and local digits.
****
Isn't that what I suggested in my first reply?
joe
****
>
>>
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Peter Olcott on 21 May 2010 15:43

On 5/21/2010 2:33 PM, Joseph M. Newcomer wrote:
> :-)!!!! And I can decode that even without looking up the actual codepoints! Yes, I've
> been seriously tempted, but as I said in the last tedious thread, I think I must suffer
> from OCD because I keep trying to educate him, in spite of his resistance to it!
> joe

I did acknowledge that you did make your point as soon as you provided
me with enough reasoning to make your point.

From: Peter Olcott on 21 May 2010 15:55

On 5/21/2010 2:30 PM, Joseph M. Newcomer wrote:
> See below....
> On Fri, 21 May 2010 09:59:50 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> C++ is apparently more restrictive than you thought because it requires
>> every input character to be mapped to the ASCII set. This would seem to
>> explicitly prohibit the flexibility that I provided of allowing UTF-8
>> identifiers.
> ****
> OK, are you implementing C++ or are you implementing an EXTENSION of C++ to allow native
> characters? If you are implementing an EXTENSION, then you get to decide what identifiers
> look like, but they should NOT look like sequences of arbitrary characters including
> punctuation marks! That is not sensible, and it is inconsistent with the stated goal! You
> don't need to read the C++ standard to know this!

How would you go about making a language as international as you can
within a 40 hour budget? Assume that you only have novice levels of
understanding of Unicode and any learning must also be included in this
40 hour budget.

Since my language would not treat any code point above ASCII as
lexically or syntactically significant, I still think that my approach
within my budget is optimal.
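
A rough C++ sketch of what "no code point above ASCII is lexically significant" means at the UTF-8 byte level follows. It is only an illustration under stated assumptions: the input is taken to be valid UTF-8 (nothing here validates it), and the function names isIdentStartByte, isIdentContinueByte, identifierLength, and demoByteLevelIdentifier are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Every byte >= 0x80 belongs to some multi-byte UTF-8 sequence, so this rule
// accepts every non-ASCII code point as an identifier character.
bool isIdentStartByte(unsigned char b) {
    return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || b == '_' ||
           b >= 0x80;
}

bool isIdentContinueByte(unsigned char b) {
    return isIdentStartByte(b) || (b >= '0' && b <= '9');
}

// Returns the length in bytes of the identifier at the start of utf8,
// or 0 if the text does not begin with an identifier.
std::size_t identifierLength(const std::string& utf8) {
    if (utf8.empty() || !isIdentStartByte(static_cast<unsigned char>(utf8[0]))) {
        return 0;
    }
    std::size_t n = 1;
    while (n < utf8.size() &&
           isIdentContinueByte(static_cast<unsigned char>(utf8[n]))) {
        ++n;
    }
    return n;
}

void demoByteLevelIdentifier() {
    // "café" is 5 bytes in UTF-8 (é is C3 A9); the identifier stops at the space.
    assert(identifierLength("caf\xC3\xA9 = 1") == 5);
    // A fullwidth comma (EF BC 8C) is swallowed too: "a，b" lexes as one name.
    assert(identifierLength("a\xEF\xBC\x8C" "b") == 5);
}
```

The second assertion shows the cost of the simple rule: non-ASCII punctuation is accepted inside identifiers exactly like a letter.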

What I learned from you is that if and when I do decide to map local
punctuation and digits to their corresponding ASCII equivalents, then I
would need to restrict the use of these remapped code points from being
used within identifiers. Until then it makes little difference.

I also learned from you that this next step of localization provides
much more functionality for relatively little cost.
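
The later localization step described above might be sketched in C++ roughly as follows. Everything here is hypothetical: asciiEquivalent, isNonAsciiIdentifierChar, and demoRemapping are illustrative names, and the mapping covers only a handful of example digits and punctuation marks, nowhere near the exhaustive list a real implementation would need.

```cpp
#include <cassert>

// Returns the ASCII equivalent of a remapped code point, or '\0' when the
// code point has no mapping in this (deliberately tiny) example table.
char asciiEquivalent(char32_t c) {
    if (c >= 0x0660 && c <= 0x0669) {  // ARABIC-INDIC DIGIT ZERO..NINE
        return static_cast<char>('0' + (c - 0x0660));
    }
    if (c >= 0xFF10 && c <= 0xFF19) {  // FULLWIDTH DIGIT ZERO..NINE
        return static_cast<char>('0' + (c - 0xFF10));
    }
    switch (c) {
        case 0xFF0C:  // FULLWIDTH COMMA
        case 0x060C:  // ARABIC COMMA
            return ',';
        case 0xFF1B:  // FULLWIDTH SEMICOLON
        case 0x061B:  // ARABIC SEMICOLON
            return ';';
        default:
            return '\0';
    }
}

// A non-ASCII code point may appear in an identifier only if the lexer does
// not remap it to an ASCII digit or punctuation mark.
bool isNonAsciiIdentifierChar(char32_t c) {
    return c >= 0x80 && asciiEquivalent(c) == '\0';
}

void demoRemapping() {
    assert(asciiEquivalent(0x0663) == '3');      // ARABIC-INDIC DIGIT THREE
    assert(asciiEquivalent(0xFF0C) == ',');      // FULLWIDTH COMMA
    assert(isNonAsciiIdentifierChar(0x00E9));    // é stays a letter
    assert(!isNonAsciiIdentifierChar(0xFF0C));   // remapped comma is excluded
}
```

Deriving the identifier rule from the mapping table keeps the two from drifting apart: any code point added to the table is automatically barred from identifiers.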

From: Peter Olcott on 21 May 2010 16:23

On 5/21/2010 2:55 PM, Peter Olcott wrote:
> On 5/21/2010 2:30 PM, Joseph M. Newcomer wrote:
>> See below....
>> On Fri, 21 May 2010 09:59:50 -0500, Peter
>> Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>
>>> C++ is apparently more restrictive than you thought because it requires
>>> every input character to be mapped to the ASCII set. This would seem to
>>> explicitly prohibit the flexibility that I provided of allowing UTF-8
>>> identifiers.
>> ****
>> OK, are you implementing C++ or are you implementing an EXTENSION of C++
>> to allow native
>> characters? If you are implementing an EXTENSION, then you get to
>> decide what identifiers
>> look like, but they should NOT look like sequences of arbitrary
>> characters including
>> punctuation marks! That is not sensible, and it is inconsistent with
>> the stated goal! You
>> don't need to read the C++ standard to know this!
>
> How would you go about making a language as international as you can
> within a 40 hour budget?

It would probably take me much longer than 40 hours just to find the
exhaustive list of every local code point that must be mapped to an
ASCII code point. The whole rest of this adaptation would be nearly
trivial.

> Assume that you only have novice levels of
> understanding of Unicode and any learning must also be included in this
> 40 hour budget.
>
> Since my language would not treat any code point above ASCII as
> lexically or syntactically significant, I still think that my approach
> within my budget is optimal.
>
> What I learned from you is that if and when I do decide to map local
> punctuation and digits to their corresponding ASCII equivalents, then I
> would need to restrict the use of these remapped code points from being
> used within identifiers. Until then it makes little difference.
>
> I also learned from you that this next step of localization provides
> much more functionality for relatively little cost.
