From: Peter Olcott on
On 5/22/2010 5:16 AM, Joseph M. Newcomer wrote:
> See below...
> On Fri, 21 May 2010 15:23:25 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> It would probably take me much longer than 40 hours just to find the
>> exhaustive list of every local code point that must be mapped to an
>> ASCII code point. The whole rest of this adaptation would be nearly
>> trivial.
> ****
> Why do you care about ASCII code points? You explicitly said you are implementing an
> EXTENSION to C++ syntax, for a language which is NOT C++ but your private scripting
> language! So what in the world does the C++ specification have to do with your EXTENSION
> to the syntax????

C++ requires that every non ASCII character be mapped to the ASCII set.
I am extending this so that I only require semantically significant
non ASCII characters to be mapped to the ASCII set. This approach (as
you already know, if you have the compiler design experience that you
claim) is simple because it can reuse the same parser and requires
changes only to the lexer.

This design requires obtaining the exhaustively complete set of every
character in every language that must be mapped to those characters
within C++ that have semantic significance. This includes all C++
punctuation marks as well as the local sets of numeric digits. Finding
the local sets of numeric digits would be easy enough. Finding out
whether or not it is reasonable to map, say, a Chinese semicolon (if
one even exists) to the ASCII semicolon, for every punctuation mark in
every language, would take more time than I have unless someone else
has already done this.
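
Here is a rough sketch of the kind of lexer-level mapping I have in
mind. The handful of code points shown are only illustrative; building
the exhaustive table (and deciding whether each individual mapping is
semantically justified) is exactly the part that would take longer
than 40 hours:

#include <unordered_map>

// Sketch only: a few semantically significant local code points mapped
// to their ASCII equivalents before the tokens ever reach the parser.
static const std::unordered_map<char32_t, char32_t> kToAscii = {
    { U'\u0660', U'0' }, { U'\u0661', U'1' },  // Arabic-Indic digits 0,1 ...
    { U'\u0966', U'0' }, { U'\u0967', U'1' },  // Devanagari digits 0,1 ...
    { U'\u061B', U';' },                       // Arabic semicolon
    { U'\uFF1B', U';' },                       // fullwidth semicolon
    { U'\uFF0C', U',' },                       // fullwidth comma
};

// Applied to each code point of the source text; anything not in the
// table passes through unchanged and is later treated as a Letter.
char32_t MapToAscii(char32_t cp) {
    auto it = kToAscii.find(cp);
    return (it == kToAscii.end()) ? cp : it->second;
}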

>
> If you say "I wish to ignore the limitations of the C++ language" and then you say "I am
> forced to do a bad implementation because I have to adhere to the limitations of the C++
> language", how can we resolve these two positions?
> ****
>>
>>> Assume that you only have novice levels of
>>> understanding of Unicode and any learning must also be included in this
>>> 40 hour budget.
> *****
> It does not take much experience to read the Unicode tables and see what are letters and
> what are digits and what are punctuation marks! And it does not take hours of study to do
> this!
> ****

Determining which local punctuation mark can be mapped to which ASCII
punctuation mark, specifically taking into account all of the subtle
nuances of semantic distinction, will take longer than I have. A
concrete example is that the comma is used as the decimal point in
some countries.

>>>
>>> Since my language would not treat any code point above ASCII as
>>> lexically or syntactically significant, I still think that my approach
>>> within my budget is optimal.
> *****
> Oh, what happened to that stated specification of allowing people to program in their
> native character set? Oh, that was just a Magic Morphing Requirement which is no longer
> true. Never mind.
> ****
>>>
>>> What I learned from you is that if and when I do decide to map local
>>> punctuation and digits to their corresponding ASCII equivalents, then I
>>> would need to restrict the use of these remapped code points from being
>>> used within identifiers. Until then it makes little difference.
> *****
> But it is so trivial to do the job right in the first place! You treat anything
> recognizably called a "letter" as a letter, anything recognizably called a "digit" as a
> digit, write lexical rules for a number which has productions of the form

That would be wrong. Rejecting a combining mark as not a Letter, and
thus not valid in an identifier, would be incorrect. That is why I take
the opposite approach: anything that is used in ways that a Letter is
not used (C++ significant punctuation and numeric digits) is not a
Letter. Everything else is a Letter in terms of its use in any
identifier.
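
A minimal sketch of that inverse classification, reusing the
MapToAscii sketch from above (nothing here is final; it only shows the
shape of the rule):

#include <cctype>

char32_t MapToAscii(char32_t cp);  // as sketched earlier in this post

// A code point is a Letter for identifier purposes unless it is one of
// the semantically significant code points: C++ punctuation/operators
// or a numeric digit (local digits included, via the mapping table).
// Combining marks are not in the significant set, so they come out as
// Letters, which is exactly the behavior argued for above.
bool IsLetter(char32_t cp) {
    char32_t ascii = MapToAscii(cp);
    if (ascii >= U'0' && ascii <= U'9')
        return false;                          // a numeric digit
    if (ascii < 0x80 && ascii != U'_' &&
        !std::isalnum(static_cast<int>(ascii)))
        return false;                          // ASCII punctuation/operator
    return true;                               // everything else is a Letter
}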

The hard part is deriving the table that maps local punctuation marks
to their ASCII equivalents while specifically taking into account
possibly deep and subtle nuances in semantic meaning. Just last night
I looked in the Unicode table and found many code points that had a
letter with an implied comma embedded within its meaning. The comma was
being used as a diacritical mark.

>
> thai_number = [0-9] (where 0-9 represent the code points for a thai number)
> chinese_number = [0-9] (where 0-9 represent the code points for a chinese number)
> english_Number = [0-9] (where 0-9 represent the code points \u0030 to \u0039)
>
> number = thai_number | chinese_number | english_number | ...lots of others...
>
> Note that converting a Chinese number to a binary representation is a bit trickier,
> because Chinese has a symbol for "ten", so you need to know the syntax for doing the
> conversion, but that's a trivial detail. That's what you worry about in the other 35
> hours.
> joe
> ****
>>>
>>> I also learned from you that this next step of localization provides
>>> much more functionality for relatively little cost.
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm
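
Regarding Joe's note about the Chinese symbol for ten: the conversion
itself does look like a small detail. A rough sketch of that kind of
positional conversion (handling only the simple forms up to the
thousands, purely as an illustration of the idea):

#include <string>

// Sketch: convert a simple Chinese numeral, already lexed into code
// points, e.g. U+4E8C U+5341 U+4E09 ("two ten three" = 23), into a
// binary value. Only the digits and the small multipliers ten/hundred/
// thousand are handled here.
static int DigitValue(char32_t cp) {
    switch (cp) {
        case U'\u4E00': return 1;  // one
        case U'\u4E8C': return 2;  // two
        case U'\u4E09': return 3;  // three
        case U'\u56DB': return 4;  // four
        case U'\u4E94': return 5;  // five
        case U'\u516D': return 6;  // six
        case U'\u4E03': return 7;  // seven
        case U'\u516B': return 8;  // eight
        case U'\u4E5D': return 9;  // nine
        default:        return -1;
    }
}

static int MultiplierValue(char32_t cp) {
    switch (cp) {
        case U'\u5341': return 10;    // ten
        case U'\u767E': return 100;   // hundred
        case U'\u5343': return 1000;  // thousand
        default:        return 0;
    }
}

long ChineseNumeralToLong(const std::u32string& s) {
    long total = 0, pending = 0;
    for (char32_t cp : s) {
        int d = DigitValue(cp);
        if (d >= 0) { pending = d; continue; }
        int m = MultiplierValue(cp);
        if (m > 0) {                  // bare "ten" means 1 * 10
            total += (pending == 0 ? 1 : pending) * m;
            pending = 0;
        }
    }
    return total + pending;          // trailing units digit, if any
}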

From: Hector Santos on
Peter Olcott wrote:

> On 5/22/2010 5:03 AM, Joseph M. Newcomer wrote:
>> See below...
>> On Fri, 21 May 2010 14:43:07 -0500, Peter
>> Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>
>>> On 5/21/2010 2:33 PM, Joseph M. Newcomer wrote:
>>>> :-)!!!! And I can decode that even without looking up the actual
>>>> codepoints! Yes, I've
>>>> been seriously tempted, but as I said in the last tedious thread, I
>>>> think I must suffer
>>>> from OCD because I keep trying to educate him, in spite of his
>>>> resistance to it!
>>>> joe
>>>
>>> I did acknowledge that you did make your point as soon as you provided
>>> me with enough reasoning to make your point.
>> ****
>> Sadly, all of this was so evident that I didn't see a need to keep
>> drilling down when the
>> correct issues were screamingly obvious. You should have been able to
>> determine all of
>> this on your own from my first responses.
>> joe
>
> Within the context of the basic assumption (and I have already said this
> several times but you still don't get it) that C++ requires ASCII at the
> lexical level, everything that you said about how I was treating
> identifiers was utter nonsense gibberish.
>
> ONLY after this incorrect assumption was corrected could anything that
> you said about how I was treating identifiers make any sense at all.
>
> The ONLY reason that C++ does not allow any character in an identifier
> is that it would screw up the parser. If it would not screw up the
> parser then any character at all could be used in an identifier. It took
> you an enormous amount of time to explain why it would screw up the
> parser. You kept insisting upon arbitrary historical convention as your
> criterion for correct identifiers without pointing out how the parser
> would be screwed up.

It reminds me of the classic

"Press Any Key to continue."

and someone like you would respond:

"Where is the ANY key?"

Pedro, when people say "Eat Food" the term EAT implies many basic
ideas about the process of obtaining a food item, moving it towards
your mouth, putting it into your mouth, and beginning the chewing and
swallowing process.

Do you need this level of attention?

What you are not grasping is that when you begin to talk about
compiler (or translator) design, there is a natural presumption that
you have some basic level of understanding of the fundamental
requirements of the concept.

Besides, why are you in any of the COMP.* groups discussing compiler
design concepts?

Why the MFC group? Do you have that much disdain for everyone? Or do
you need to prove something about yourself we don't already know?

--
HLS
From: Mihai N. on

> C++ requires that every non ASCII character be mapped to the ASCII set.

Where did you get this from?


> I looked in the Unicode table and found many code points that had a
> letter with an implied comma embedded within its meaning. The comma was
> being used as a diacritical mark.

I am not sure what you are referring to.
I don't know of any comma used as a diacritical mark.
If you are talking about things like U+0219 (LATIN SMALL LETTER S WITH
COMMA BELOW), that is a stand-alone letter, not a letter with a
diacritical mark.
Like I would say "O with a small squiggly" when talking about Q.

The Unicode names describe the character using plain ASCII, but
do not imply anything about the meaning of the thing.



--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Peter Olcott on
On 5/23/2010 3:22 AM, Mihai N. wrote:
>
>
>> C++ requires that every non ASCII character be mapped to the ASCII set.
>
> Where did you get this from?
>
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3035.pdf

2.2
Physical source file characters are mapped, in an implementation-defined
manner, to the basic source character set. The set of physical
source file characters accepted is implementation-defined.

2.3
The basic source character set consists of 96 characters: the space
character, the control characters representing horizontal tab, vertical
tab, form feed, and new-line, plus the following 91 graphical characters:

>
>> I looked in the Unicode table and found many code points that had a
>> letter with an implied comma embedded within its meaning. The comma was
>> being used as a diacritical mark.
>
> I am not sure what you are referring to.
> I don't know of any comma used as a diacritical mark.
> If you are talking about things like U+0219 (LATIN SMALL LETTER S WITH
> COMMA BELOW), that is a stand-alone letter, not a letter with a
> diacritical mark.
> Like I would say "O with a small squiggly" when talking about Q.
>
> The Unicode names describe the character using plain ASCII, but
> do not imply anything about the meaning of the thing.
>
>
>

Since the above examples had the term "Comma" embedded within their
names, it was possible for them to contain a nuance of the semantic
meaning of the punctuation mark.

In any case it seems that Joe may have been wrong about this, if one
takes the Java language as an example of how computer languages are
internationalized:
http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.1

Instead of using the local symbol for the comma and translating it
into the ASCII comma, Java takes a different approach: the ASCII
character "," is used. It seems that native speakers of the Java
language think that this is perfectly reasonable. Java also requires
[0-9] digits.

It seems that Java takes essentially the same approach that I am
taking and allows non ASCII characters only within identifiers.

Here is another useful link:
http://en.wikipedia.org:80/wiki/Non-English-based_programming_languages

From: Mihai N. on

> 2.2
> Physical source file characters are mapped, in an implementation-defined
> manner, to the basic source character set. The set of physical
> source file characters accepted is implementation-defined.

Editing the quote to make a point is cheating. The quote is:
"Physical source file characters are mapped, in an implementation-defined
manner, to the basic source character set (introducing new-line characters
for end-of-line indicators) if necessary."

Note the "if necessary"?

This might mean that there can be an implementation-defined way to
map other commas (like Arabic or Mongolian) to the ASCII comma.



> Since the above examples had the term "Comma" embedded within their
> names, it was possible for them to contain a nuance of the semantic
> meaning of the punctuation mark.

That is not the case, believe me.
This might be true for other things (like accent grave, or acute), but
even then it would be locale-dependent. For some countries A with
acute (U+00C1) is a letter, for some it is an accent, for some it is
a tone mark.
But that's not true for the comma. And there is no way to tell how
something is used unless you know about it (it is not captured
in the Unicode tables).
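
For example, if you ask ICU for the character's general category,
U+0219 comes back as a plain lowercase letter; nothing in its
properties is comma-like (a quick sketch, assuming the ICU4C C API is
available):

#include <unicode/uchar.h>   // ICU4C character properties
#include <cstdio>

int main() {
    // U+0219 LATIN SMALL LETTER S WITH COMMA BELOW
    UChar32 c = 0x0219;
    // General category is U_LOWERCASE_LETTER (Ll), i.e. just a letter;
    // the "COMMA BELOW" in the name carries no comma semantics.
    std::printf("category = %d (Ll = %d)\n",
                (int)u_charType(c), (int)U_LOWERCASE_LETTER);
    return 0;
}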


> It seems that native speakers of the Java language think that this
> is perfectly reasonable.

Did you talk to them?
And does anyone claim that Java allows people to write code in their own
language?


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email