Is this UTF-8 regular expression semantically correct? [MFC]

Prev: Love Potion for Miss Blandish
Next: Newcomer's CAsyncSocket example: trouble connecting with other clients

From: Peter Olcott on 20 May 2010 14:23

On 5/20/2010 1:07 PM, Joseph M. Newcomer wrote:
> See below...
> On Thu, 20 May 2010 12:01:12 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> I coined a term long ago [ignorance squared]. What this means is that
>> there is no possible way for any person lacking knowledge to accurately
>> quantify the specific degree of this lack of knowledge because this
>> requires having the knowledge to measure the lack against.
> ****
> Hmm. But I had attempted to correct your ignorance, in particular, I remember
> specifically giving the Unicode code point for the Armenian Comma and several other
> localized punctuation marks, and you made the assumption I was "yanking your chain". Now
> THAT's a manifestation of ignorance squared! When someone corrects you by stating a fact,
> and you find the fact "inconvenient", it does not make you smarter; it only proves that
> you like remaining ignorant.

No. That proves that I lacked sufficient understanding of the underlying
infrastructure to make sense of what you were saying.

> ****
>>
>> To a person who lacks knowledge this lack can only appear to be
>> disagreement. Only the person who has the knowledge can accurately
>> quantify the degree of the lack.
> ****
> I had done that, by pointing out ranges of localized digits, and localized punctuation
> marks, and you chose to both ignore me and argue that such things didn't matter, which was
> inconsistent with your stated design goal (program in the localized language). I even
> pointed out that I had used my Locale Explorer to find these, and it is a free download
> (and the table I use in it is directly from the Unicode Web site, and is the official,
> sanctioned, data, at least as of the time I downloaded it; it is potentially obsolete, but
> it already contained enough information to show you were wrong)

I was under the false assumption that C++ required ASCII punctuation at
the lexical level. This is the key false assumption that prevented me
from understanding what you were saying.

> ****
>>
>> Ignorance squared means that one is even ignorant of their own
>> ignorance. (or at least the degree of this ignorance).
> ***
> So what do you call an insistence on remaining ignorant, even when others are supplying
> knowledge you didn't have?
> joe

I explained this term here as applied to myself to help you understand
why it took me so long to understand what you were saying. What you were
saying made no sense at all within the context of my fundamental (and
incorrect) base assumptions.

Correcting these fundamental base assumptions was the necessary
prerequisite for my understanding what you were saying.

> ****
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Pete Delgado on 20 May 2010 17:33

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
news:mln9v5d6ravla3g4ivr0nn6vil2qlk8ipt(a)4ax.com...
>
> Note that you should also cite section 2.3 and the footnotes on page 16.
> joe

Joe,
I simply wanted to provide a starting point for Peter Olcott should he
actually decide to read the document. I apologize if I did not provide
enough information. I guess I'd fail one of your courses! ;-)

-Pete

From: Joseph M. Newcomer on 20 May 2010 21:05

See below...
On Thu, 20 May 2010 13:23:23 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/20/2010 1:07 PM, Joseph M. Newcomer wrote:
>> See below...
>> On Thu, 20 May 2010 12:01:12 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>
>>> I coined a term long ago [ignorance squared]. What this means is that
>>> there is no possible way for any person lacking knowledge to accurately
>>> quantify the specific degree of this lack of knowledge because this
>>> requires having the knowledge to measure the lack against.
>> ****
>> Hmm. But I had attempted to correct your ignorance, in particular, I remember
>> specifically giving the Unicode code point for the Armenian Comma and several other
>> localized punctuation marks, and you made the assumption I was "yanking your chain". Now
>> THAT's a manifestation of ignorance squared! When someone corrects you by stating a fact,
>> and you find the fact "inconvenient", it does not make you smarter; it only proves that
>> you like remaining ignorant.
>
>No. That proves that I lacked sufficient understanding of the underlying
>infrastructure to make sense of what you were saying.
>
>> ****
>>>
>>> To a person who lacks knowledge this lack can only appear to be
>>> disagreement. Only the person who has the knowledge can accurately
>>> quantify the degree of the lack.
>> ****
>> I had done that, by pointing out ranges of localized digits, and localized punctuation
>> marks, and you chose to both ignore me and argue that such things didn't matter, which was
>> inconsistent with your stated design goal (program in the localized language). I even
>> pointed out that I had used my Locale Explorer to find these, and it is a free download
>> (and the table I use in it is directly from the Unicode Web site, and is the official,
>> sanctioned, data, at least as of the time I downloaded it; it is potentially obsolete, but
>> it already contained enough information to show you were wrong)
>
>I was under the false assumption that C++ required ASCII punctuation at
>the lexical level. This is the key false assumption that prevented me
>from understanding what you were saying.
*****
Actually, this is irrelevant to my point. In fact, it is completely irrelevant to the
discussion. You had said "Allow a programmer to write code in his/her native language" or
something close to that statement, by which I would interpret this to mean "this is an
extension which permits localized variable names using the native language characters,
localized numbers using the native language digits, and localized punctuation, using the
native language punctuation marks", which seems to be a reasonable interpretation. Then
you made the statement that all characters > U+007F would be "letters", which even the most
superficial reading of the Unicode standard, or even looking at Character Map,
demonstrates is nonsensical. It has NOTHING to do with what ANSI/ISO C requires, and
everything to do with you creating a specification inconsistent with your claim. No deep
understanding of C++ is required to see that allowing an Armenian to write an identifier
which has an Armenian comma embedded is going to be confusing to an Armenian reading the
source code, and that an Armenian who writes his/her own native comma between parameters
is writing something that makes PERFECT sense to an Armenian, but by your interpretation,
if A and B can stand for Armenian letters, and , for an Armenian comma, a function call of
two parameters f(A,B) written in Armenian script is something you interpret as a function
call of one argument whose name is "A,B", and I fail to see how this could make sense.
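
To make the f(A,B) case concrete, here is a minimal Python sketch, not anyone's actual lexer. The function names (naive_class, unicode_class, tokenize) are hypothetical, and the tokenizer is deliberately simplified (any run of letters and digits becomes one token). It contrasts the rule "everything above U+007F is a letter" with a classification taken from the Unicode general category, using ARMENIAN COMMA (U+055D) between two Armenian capital letters (U+0531, U+0532).

```python
import unicodedata

def naive_class(ch):
    """The disputed rule: every code point above U+007F is a letter."""
    if ord(ch) > 0x7F or ch.isalpha() or ch == "_":
        return "letter"
    if ch.isdigit():
        return "digit"
    return "punct"

def unicode_class(ch):
    """Classify by Unicode general category (L* = letter, Nd = decimal digit)."""
    cat = unicodedata.category(ch)
    if cat.startswith("L") or ch == "_":
        return "letter"
    if cat == "Nd":
        return "digit"
    return "punct"

def tokenize(text, classify):
    """Split text into name/number runs and single punctuation characters."""
    tokens, current = [], ""
    for ch in text:
        kind = "space" if ch.isspace() else classify(ch)
        if kind in ("letter", "digit"):
            current += ch
            continue
        if current:
            tokens.append(current)
            current = ""
        if kind == "punct":
            tokens.append(ch)
    if current:
        tokens.append(current)
    return tokens

# f(A,B) with Armenian letters and the Armenian comma U+055D.
call = "f(\u0531\u055D\u0532)"
print(tokenize(call, naive_class))
print(tokenize(call, unicode_class))
```

Under naive_class the call lexes as four tokens, with everything between the parentheses swallowed into a single name. Under unicode_class it lexes as six tokens, because U+055D has general category Po (punctuation) and so separates the two arguments.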

And at no point does the C++ language rule need to enter this discussion to demonstrate
that your specification ("writing in the native language") and your implementation ("Not
accepting native language digits or punctuation") are inconsistent.

Note that ANSI/ISO C++ does not accept letters outside the set [A-Za-z_] but you did not
say "I'm not going to let an Armenian write identifiers using Armenian letters because C++
does not accept that"; instead, you fastened on one particular point, the specification of
comma, and insisted that an Armenian is not entitled to use an Armenian comma, that is,
write in their native language. Note that many languages have special symbols for colon,
semicolon, comma, and period, and for that matter, most Europeans object to having to
write floating point numbers as 1.23 when any child there knows the proper form is 1,23. C
and C++ do not cater to this, either, so if you REALLY want to make the syntax localized,
you have to allow for the correct lexical notation for primitive lexemes as recognized by
native language users. And note that not all languages specify negative numbers by putting
a hyphen as the first character; in some languages, the symbol is not a hyphen, and the
symbol appears as the LAST character of the sequence!
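
As a rough illustration of what "correct lexical notation for primitive lexemes" would demand of a number scanner, the following Python sketch accepts any Unicode decimal digit (general category Nd), a configurable decimal separator, and a minus sign that may lead or trail. The function parse_localized_number and its parameters are hypothetical, and real locale rules (grouping separators, distinct minus characters, digit shaping) are much richer than this.

```python
import unicodedata

def parse_localized_number(text, decimal_sep=",", minus="-", minus_trails=False):
    """Parse a number written with localized digits, separator and sign position."""
    negative = False
    if minus_trails and text.endswith(minus):
        negative, text = True, text[:-len(minus)]
    elif not minus_trails and text.startswith(minus):
        negative, text = True, text[len(minus):]

    ascii_chars = []
    for ch in text:
        if ch == decimal_sep:
            ascii_chars.append(".")
        elif unicodedata.category(ch) == "Nd":
            # Map any script's decimal digit to its ASCII equivalent.
            ascii_chars.append(str(unicodedata.decimal(ch)))
        else:
            raise ValueError(f"unexpected character {ch!r}")

    value = float("".join(ascii_chars))  # rejects empty input or two separators
    return -value if negative else value

print(parse_localized_number("1,23"))                     # European decimal comma
print(parse_localized_number("\u0661\u0662\u0663"))       # Arabic-Indic digits 1 2 3
print(parse_localized_number("42-", minus_trails=True))   # sign written last
```

The point of the sketch is only that digits, separators, and sign placement each have to be treated as locale data; lumping them all into "letters" leaves nothing for a scanner like this to work with.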

One of the Great Debates of the Algol development era was what notation to use for
reserved words and numbers, and the committee (all of whom spoke fluent English) agreed
that the identifiers such as BEGIN, END, FOR, STEP, WHILE, UNTIL, PROCEDURE, INTEGER, and
so on would be in English. Many other international programmers viewed this as the
English-speaking world imposing their monoculture on everyone. In fact, one of the
reasons that China fell behind in computing was that as part of the Cultural Revolution,
technologies that required "Western" notations (such as FORTRAN, COBOL, ALGOL, and
probably C--there was no C++ back then, and languages derived from these) were forbidden,
which meant only assemblers and absolute octal/hex coding were permitted. Foreign numeric
notations, like Arabic numbers, were also forbidden, because these compromised the
"purity" of the authentic People's Culture. Of course, they also had to reject a lot that
came down from the previous emperors, including a lot of technological work, so they were
left with a lot of problems writing code. Some organizations (notably, those concerned
with military computers) were exempted, but civilian (including academic) computing hit a
real low at that time, and took decades to recover. [This is based on stories told to me
by visiting Chinese computer specialists].

Many European compilers had workarounds that allowed the English reserved words to be
overridden; for example, a set of directive cards (remember, this is the punched-card era)
of the form
$BEGIN=localizedword
$FOR=localizedword
which allowed a programmer who didn't know English to write a program he or she found
intelligible. There was a desire to use [] to indicate subscripting, but those character
codes had been taken over by Germans who used them to represent ÄÜ, so the German
contingent insisted that () be used for subscripting. Hence, it was ambiguous if you saw
an expression
n = a(i);
because you did not know, unless you found the definition of a in scope, whether a was a
function or an array. And because Algol allows functions to be defined inside functions,
and we lived in a punched-card universe (no
hover-the-mouse-over-the-name-and-get-its-type) it could be a quite complex task for
someone reading the code to figure out where a was defined and what it meant (the compiler
had no problem, but it didn't always do what you expected, only what was right). In fact,
a number of notable "hacks" could be done by declaring a variable name that superseded an
outer-scope array name, or vice-versa, and you could drive people nuts looking for bugs by
playing this game.
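
The directive-card workaround mentioned above ($BEGIN=localizedword) amounts to a keyword translation table consulted before parsing. A minimal Python sketch follows; the helper names load_directives and canonicalize are hypothetical, and the German words ANFANG and FUER are made-up stand-ins for "localizedword", not taken from any actual compiler.

```python
def load_directives(cards):
    """Build a table mapping each localized word to the keyword it overrides."""
    table = {}
    for card in cards:
        card = card.strip()
        if not card.startswith("$") or "=" not in card:
            raise ValueError(f"not a directive card: {card!r}")
        keyword, localized = card[1:].split("=", 1)
        table[localized.upper()] = keyword.upper()
    return table

def canonicalize(words, table):
    """Replace localized reserved words with their canonical English form."""
    return [table.get(word.upper(), word) for word in words]

# Hypothetical directive cards in the $KEYWORD=localizedword form.
table = load_directives(["$BEGIN=ANFANG", "$FOR=FUER"])
print(canonicalize(["ANFANG", "x", "FUER"], table))
```

Words that are not in the table pass through unchanged, so ordinary identifiers are untouched and only the overridden reserved words are rewritten.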

The localization of a programming language is a VERY complex problem, but to go to the
extreme of forbidding the use of localized numbers and punctuation by interpreting them as
"letters" is, well, over the top.

My objection had nothing to do with the use of the ASCII-7 comma in the C++ spec, and
never did. And your response which raised that issue as the objection was ill-founded.
joe
****
>
>> ****
>>>
>>> Ignorance squared means that one is even ignorant of their own
>>> ignorance. (or at least the degree of this ignorance).
>> ***
>> So what do you call an insistence on remaining ignorant, even when others are supplying
>> knowledge you didn't have?
>> joe
>
>I explained this term here as applied to myself to help you understand
>why it took me so long to understand what you were saying. What you were
>saying made no sense at all within the context of my fundamental (and
>incorrect) base assumptions.
>
>Correcting these fundamental base assumptions was the necessary
>prerequisite for my understanding what you were saying.
>
>> ****
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on 20 May 2010 21:06

No, I was just adding to the issue by pointing out that the manual really DOES say
something quite different about the character set.
joe

On Thu, 20 May 2010 17:33:29 -0400, "Pete Delgado" <Peter.Delgado(a)NoSpam.com> wrote:

>
>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
>news:mln9v5d6ravla3g4ivr0nn6vil2qlk8ipt(a)4ax.com...
>>
>> Note that you should also cite section 2.3 and the footnotes on page 16.
>> joe
>
>Joe,
>I simply wanted to provide a starting point for Peter Olcott should he
>actually decide to read the document. I apologize if I did not provide
>enough information. I guess I'd fail one of your courses! ;-)
>
>-Pete
>

From: Joseph M. Newcomer on 20 May 2010 21:07

One of the great joys of teaching commercial courses is that I don't have to give grades!
joe

On Thu, 20 May 2010 17:33:29 -0400, "Pete Delgado" <Peter.Delgado(a)NoSpam.com> wrote:

>
>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
>news:mln9v5d6ravla3g4ivr0nn6vil2qlk8ipt(a)4ax.com...
>>
>> Note that you should also cite section 2.3 and the footnotes on page 16.
>> joe
>
>Joe,
>I simply wanted to provide a starting point for Peter Olcott should he
>actually decide to read the document. I apologize if I did not provide
>enough information. I guess I'd fail one of your courses! ;-)
>
>-Pete
>
