Is this UTF-8 regular expression semantically correct? [MFC]


From: Joseph M. Newcomer on 22 May 2010 06:03

See below...
On Fri, 21 May 2010 14:43:07 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/21/2010 2:33 PM, Joseph M. Newcomer wrote:
>> :-)!!!! And I can decode that even without looking up the actual codepoints! Yes, I've
>> been seriously tempted, but as I said in the last tedious thread, I think I must suffer
>> from OCD because I keep trying to educate him, in spite of his resistance to it!
>> joe
>
>I did acknowledge that you did make your point as soon as you provided
>me with enough reasoning to make your point.
****
Sadly, all of this was so evident that I didn't see a need to keep drilling down when the
correct issues were screamingly obvious. You should have been able to determine all of
this on your own from my first responses.
joe
****
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on 22 May 2010 06:08

See below...
On Fri, 21 May 2010 14:55:27 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/21/2010 2:30 PM, Joseph M. Newcomer wrote:
>> See below....
>> On Fri, 21 May 2010 09:59:50 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>
>>> C++ is apparently more restrictive than you thought because it requires
>>> every input character to be mapped to the ASCII set. This would seem to
>>> explicitly prohibit the flexibility that I provided of allowing UTF-8
>>> identifiers.
>> ****
>> OK, are you implementing C++ or are you implementing an EXTENSION of C++ to allow native
>> characters? If you are implementing an EXTENSION, then you get to decide what identifiers
>> look like, but they should NOT look like sequences of arbitrary characters including
>> punctuation marks! That is not sensible, and it is inconsistent with the stated goal! You
>> don't need to read the C++ standard to know this!
>
>How would you go about making a language as international as you can
>within a 40 hour budget? Assume that you only have novice levels of
>understanding of Unicode and any learning must also be included in this
>40 hour budget.
****
What part of the budget is 40 hours? I could add the lexer rules in a few hours with
careful reading of the Unicode standard. This would probably leave me 35 hours to deal
with the finer points, the kind Mihai, Tom, and others who have worked deeply in other
languages, might be able to point out. It ain't Rocket Science!
****
>
>Since my language would not treat any code point above ASCII as
>lexically or syntactically significant, I still think that my approach
>within my budget is optimal.
****
Fine, but you misrepresented what you were doing. So either your implementation doesn't
meet your stated specification, or the stated specification was naively optimistic. But
the implementation clearly did not match it.
****
>
>What I learned from you is that if and when I do decide to map local
>punctuation and digits to their corresponding ASCII equivalents, then I
>would need to restrict the use of these remapped code points from being
>used within identifiers. Until then it makes little difference.
*****
Yes. But it makes a SIGNIFICANT difference if you tell me that I can use my native
character set, and then you don't do that.
****
>
>I also learned from you that this next step of localization provides
>much more functionality for relatively little cost.
*****
Well, it means your implementation and your specification are closer to each other...
joe
****

From: Joseph M. Newcomer on 22 May 2010 06:16

See below...
On Fri, 21 May 2010 15:23:25 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/21/2010 2:55 PM, Peter Olcott wrote:
>> On 5/21/2010 2:30 PM, Joseph M. Newcomer wrote:
>>> See below....
>>> On Fri, 21 May 2010 09:59:50 -0500, Peter
>>> Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>>
>>>> C++ is apparently more restrictive than you thought because it requires
>>>> every input character to be mapped to the ASCII set. This would seem to
>>>> explicitly prohibit the flexibility that I provided of allowing UTF-8
>>>> identifiers.
>>> ****
>>> OK, are you implementing C++ or are you implementing an EXTENSION of C++
>>> to allow native
>>> characters? If you are implementing an EXTENSION, then you get to
>>> decide what identifiers
>>> look like, but they should NOT look like sequences of arbitrary
>>> characters including
>>> punctuation marks! That is not sensible, and it is inconsistent with
>>> the stated goal! You
>>> don't need to read the C++ standard to know this!
>>
>> How would you go about making a language as international as you can
>> within a 40 hour budget?
>
>It would probably take me much longer than 40 hours just to find the
>exhaustive list of every local code point that must be mapped to an
>ASCII code point. The whole rest of this adaptation would be nearly
>trivial.
****
Why do you care about ASCII code points? You explicitly said you are implementing an
EXTENSION to C++ syntax, for a language which is NOT C++ but your private scripting
language! So what in the world does the C++ specification have to do with your EXTENSION
to the syntax????

If you say "I wish to ignore the limitations of the C++ language" and then you say "I am
forced to do a bad implementation because I have to adhere to the limitations of the C++
language", how can we resolve these two positions?
****
>
>> Assume that you only have novice levels of
>> understanding of Unicode and any learning must also be included in this
>> 40 hour budget.
*****
It does not take much experience to read the Unicode tables and see what are letters and
what are digits and what are punctuation marks! And it does not take hours of study to do
this!
****
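
To illustrate the point about reading the Unicode tables, here is a minimal sketch in Python that asks the Unicode Character Database for each code point's general category. The function name classify and the four coarse class names are assumptions made for this illustration; they are not part of anyone's lexer in this thread.

```python
import unicodedata

def classify(ch: str) -> str:
    """Return a coarse lexical class for a single code point.

    Hypothetical helper: it only groups Unicode general categories
    into the three classes mentioned above, plus "other".
    """
    cat = unicodedata.category(ch)
    if cat.startswith("L"):      # Lu, Ll, Lt, Lm, Lo: letters
        return "letter"
    if cat == "Nd":              # decimal digits of any script
        return "digit"
    if cat.startswith("P"):      # Pc, Pd, Ps, Pe, Pi, Pf, Po: punctuation
        return "punctuation"
    return "other"               # symbols, separators, marks, controls, ...
```

With this grouping, a Thai letter such as U+0E01 lands in "letter", the Thai digit U+0E51 in "digit", and the ideographic full stop U+3002 in "punctuation", so a lexer can keep punctuation out of identifiers without a hand-built list.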
>>
>> Since my language would not treat any code point above ASCII as
>> lexically or syntactically significant, I still think that my approach
>> within my budget is optimal.
*****
Oh, what happened to that stated specification of allowing people to program in their
native character set? Oh, that was just a Magic Morphing Requirement which is no longer
true. Never mind.
****
>>
>> What I learned from you is that if and when I do decide to map local
>> punctuation and digits to their corresponding ASCII equivalents, then I
>> would need to restrict the use of these remapped code points from being
>> used within identifiers. Until then it makes little difference.
*****
But it is so trivial to do the job right in the first place! You treat anything
recognizably called a "letter" as a letter, anything recognizably called a "digit" as a
digit, write lexical rules for a number which has productions of the form

thai_number = [0-9] (where 0-9 represent the code points for a Thai number)
chinese_number = [0-9] (where 0-9 represent the code points for a Chinese number)
english_number = [0-9] (where 0-9 represent the code points \u0030 to \u0039)

number = thai_number | chinese_number | english_number | ...lots of others...

Note that converting a Chinese number to a binary representation is a bit trickier,
because Chinese has a symbol for "ten", so you need to know the syntax for doing the
conversion, but that's a trivial detail. That's what you worry about in the other 35
hours.
joe
****
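
The per-script number productions above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names digit_zero and parse_number are hypothetical, the sketch covers positional decimal digits (general category Nd) only, and it does not attempt Chinese numerals, which need the separate conversion rules mentioned above. It relies on the Unicode guarantee that each script's decimal digits occupy ten consecutive code points in ascending order.

```python
import unicodedata

def digit_zero(ch: str) -> int:
    """Code point of the zero digit in the decimal block containing ch.

    Assumes ch has general category Nd; decimal digits of one script
    are encoded as ten consecutive code points, 0 through 9.
    """
    return ord(ch) - unicodedata.decimal(ch)

def parse_number(text: str) -> int:
    """Convert a run of decimal digits from ONE script to an int.

    Mirrors the rule number = thai_number | english_number | ...:
    every digit must come from the same block as the first one.
    """
    if not text:
        raise ValueError("empty number")
    if any(unicodedata.category(ch) != "Nd" for ch in text):
        raise ValueError("not a run of decimal digits")
    zero = digit_zero(text[0])
    value = 0
    for ch in text:
        if digit_zero(ch) != zero:
            raise ValueError("digits from different scripts are mixed")
        value = value * 10 + unicodedata.decimal(ch)
    return value
```

Rejecting mixed-script runs is a design choice of this sketch: it keeps each alternative of the number rule separate, so a token such as an ASCII 1 followed by a Thai 2 is an error rather than the value 12.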
>>
>> I also learned from you that this next step of localization provides
>> much more functionality for relatively little cost.

From: Oliver Regenfelder on 22 May 2010 06:47

Hello,

Peter Olcott wrote:
> I did acknowledge that you did make your point as soon as you provided
> me with enough reasoning to make your point.

This may be so from your point of view. But I would say for most other
people Joe made his point quite clear from the beginning!

Just because it takes you long to understand doesn't mean that
he didn't provide enough reason before.

Best regards,

Oliver

From: Peter Olcott on 22 May 2010 09:59

On 5/22/2010 5:03 AM, Joseph M. Newcomer wrote:
> See below...
> On Fri, 21 May 2010 14:43:07 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> On 5/21/2010 2:33 PM, Joseph M. Newcomer wrote:
>>> :-)!!!! And I can decode that even without looking up the actual codepoints! Yes, I've
>>> been seriously tempted, but as I said in the last tedious thread, I think I must suffer
>>> from OCD because I keep trying to educate him, in spite of his resistance to it!
>>> joe
>>
>> I did acknowledge that you did make your point as soon as you provided
>> me with enough reasoning to make your point.
> ****
> Sadly, all of this was so evident that I didn't see a need to keep drilling down when the
> correct issues were screamingly obvious. You should have been able to determine all of
> this on your own from my first responses.
> joe

Within the context of the basic assumption (and I have already said this
several times but you still don't get it) that C++ requires ASCII at the
lexical level, everything that you said about how I was treating
identifiers was utter nonsense gibberish.

ONLY after this incorrect assumption was corrected could anything that
you said about how I was treating identifiers make any sense at all.

The ONLY reason that C++ does not allow any character in an identifier
is that it would screw up the parser. If it would not screw up the
parser then any character at all could be used in an identifier. It took
you an enormous amount of time to explain why it would screw up the
parser. You kept insisting upon arbitrary historical convention as your
criterion for correct identifiers without pointing out how the parser
would be screwed up.

