From: Hector Santos on
Mihai N. wrote:

>> Peter Olcott Wrote:
>>
>> 2.2
>> Physical source file characters are mapped, in an implementation-defined
>> manner, to the basic source character set. The set of physical
>> source file characters accepted is implementation-defined.
>
> Editing the quote to make a point is cheating.

Now you see what I have been saying about this unethical, pathetic
individual, who has great disdain for everyone in this forum and who
will lie, twist, turn, and rationalize whatever it takes to suit his
purpose, creating chaos and disturbance in the newsgroup.

I have filed an abuse complaint with his Giganews Usenet provider. His
behavior is worse than spam.

If anyone wishes to endorse this abuse complaint, please email me at
sant9442(a)gmail.com; the Giganews people will pay more attention if it
comes from more than one person.

--
HLS
From: Peter Olcott on
On 5/23/2010 1:52 PM, Mihai N. wrote:
>
>
>> 2.2
>> Physical source file characters are mapped, in an implementation-defined
>> manner, to the basic source character set. The set of physical
>> source file characters accepted is implementation-defined.
>
> Editing the quote to make a point is cheating. The quote is:
> "Physical source file characters are mapped, in an implementation-defined
> manner, to the basic source character set (introducing new-line characters
> for end-of-line indicators) if necessary."
>
> Note the "if necessary"?
>

I took that to apply to the extraneous detail of end-of-line indicators,
and I edited it out for that reason.

> This might mean that it can be an implementation-defined way to
> map other commas (like Arabic, or Mongolian) to the ASCII comma.

It allows this much leeway even without the "if necessary". I don't
think that this is the way it is actually done, though. See my notes below.
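
Just to make the idea concrete (and again, I don't believe real
implementations do this), here is a rough, purely hypothetical sketch in
Java of what such an implementation-defined folding of locale-specific
commas onto the ASCII comma might look like as a pre-pass before
tokenizing. The fullwidth comma (U+FF0C) and Arabic comma (U+060C) code
points are real; everything else here is just my illustration:

public final class CommaFolding {
    // Hypothetical pre-pass: fold a few locale-specific comma code points
    // onto the ASCII comma before the tokenizer sees the source text.
    public static String foldCommas(String source) {
        StringBuilder out = new StringBuilder(source.length());
        int i = 0;
        while (i < source.length()) {
            int cp = source.codePointAt(i);
            if (cp == 0xFF0C || cp == 0x060C) {   // FULLWIDTH COMMA, ARABIC COMMA
                out.append(',');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The argument contains a fullwidth comma and an Arabic comma;
        // the output is the plain ASCII "f(a,b,c)".
        System.out.println(foldCommas("f(a\uFF0Cb\u060Cc)"));
    }
}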

>
>
>
>> Since the above examples had the term "Comma" embedded within their name
>> it was possible for them to contain a nuance of the semantic meaning of
>> the punctuation mark.
>
> That is not the case, believe me.
> This might be true for other things (like accent grave, or acute), but
> even then it would be locale dependent. For some countries a with
> acute (U+00C1) is a letter, for some it is an accent, for some it is
> a tone mark.
> But that's not true for comma. And there is no way to tell how
> something is used, unless you know about it (it is not captured
> in the Unicode tables).
>
>
>> It seems that native speakers of the Java language think that this
>> is perfectly reasonable.
>
> Did you talk to them?
> And does anyone claim that Java allows people to write code in their own
> language?
>
>

We are talking about native speakers of native languages. Java and C++
could themselves be considered the native language of Java and C++
programmers. Because of this possibility, the ASCII comma might not be
considered an English comma; instead, it might be considered the way
that the native language of C++ and Java specifies a parameter delimiter.

This seems to be the way that Java does it, and it also makes good sense.
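
As a minimal sketch of my own (the Spanish-flavored names are just for
illustration), the following ordinary Java class shows the split: the
identifiers contain non-ASCII letters (written here as Unicode escapes,
which Java translates before tokenization), while the commas between
parameters and arguments, the digits, and the keywords are all plain
ASCII:

public class Ejemplo {
    // The identifier letters below are written as Unicode escapes
    // (\u00E1 = a with acute, \u00ED = i with acute); Java translates
    // Unicode escapes before tokenization, so these are legal identifiers.
    static int suma(int cantidadM\u00E1xima, int m\u00EDnimo) {
        return cantidadM\u00E1xima + m\u00EDnimo;      // ASCII '+' and ';'
    }

    public static void main(String[] args) {
        System.out.println(suma(40, 2));               // ASCII ',' between the arguments
    }
}
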
From: Hector Santos on
I'm a compiler designer

http://www.santronics.com/products/winserver/Code.php

And I am still trying to figure out what you are talking about.

Any (non-English) foreign-language developer can write code in his
native tongue if his editor and compiler allow for it. It's a
matter of having the right dictionary table for it.

But if you wish to allow a purely native spoken language, it becomes
more of an NLP (Natural Language Parsing) design issue than a
tokenization design issue. For example, what is a one-word command in
English can be two words in another language, and a meaning in English
can carry a different meaning in others. The only other way to avoid
the NLP is to build a lexical vocabulary for the specific language.
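
To make that concrete, here is a rough Java sketch of what such a
lexical vocabulary amounts to - a per-language dictionary table that
maps localized keyword spellings onto the canonical keywords the rest
of the compiler understands. The Spanish spellings here are just made
up for illustration:

import java.util.Map;

public final class KeywordTable {
    // Hypothetical Spanish keyword vocabulary; a real product would load
    // a table like this per locale.
    private static final Map<String, String> SPANISH = Map.of(
            "mientras", "while",
            "si",       "if",
            "sino",     "else",
            "devolver", "return");

    // Returns the canonical keyword, or the word unchanged if it is not a keyword.
    public static String canonicalize(String word) {
        return SPANISH.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("mientras"));  // while
        System.out.println(canonicalize("contador"));  // contador (just an identifier)
    }
}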

One of the beauties of the APL language was that it was 100% symbolic,
not one based on lexical translation. However, you did need a proper
keyboard. APL gurus were able to use control codes and learn to
recognize the non-APL code page display. It wasn't as pretty as
symbolic APL, though.

The same will be true of a Chinese C/C++ compiler. Well, you know we
doubt you are writing a compiler, but rather a pre-processor, more like
a lint. Nonetheless, even if you did borrow some compiler code, you will
probably need to create a special dictionary table anyway. See the
history of Chinese BASIC with its Cangjie input method:

http://en.wikipedia.org/wiki/Chinese_BASIC

I think that is the easy part. The hard part is the Chinese programmer
learning another language that will probably not be as natural as you
think. For that you really do need more of an NLP.

--
HLS

Peter Olcott wrote:

> On 5/23/2010 3:22 AM, Mihai N. wrote:
>>
>>
>>> C++ requires that every non ASCII character be mapped to the ASCII set.
>>
>> Where did you get this from?
>>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3035.pdf
>
> 2.2
> Physical source file characters are mapped, in an implementation-defined
> manner, to the basic source character set. The set of physical
> source file characters accepted is implementation-defined.
>
> 2.3
> The basic source character set consists of 96 characters: the space
> character, the control characters representing horizontal tab, vertical
> tab, form feed, and new-line, plus the following 91 graphical characters:
>
>>
>>> I looked in the Unicode table and found many code points that had a
>>> letter with an implied comma embedded within its meaning. The comma was
>>> being used as a diacritical mark.
>>
>> I am not sure what you are referring to.
>> I don't know of any comma used as a diacritical mark.
>> If you are talking about things like U+0219 (LATIN SMALL LETTER S WITH
>> COMMA BELOW), that is a stand-alone letter, not a letter with a
>> diacritical mark.
>> Like I would say "O with a small squiggly" when talking about Q.
>>
>> The Unicode names describe the character using plain ASCII, but
>> do not imply anything about the meaning of the thing.
>>
>>
>>
>
> Since the above examples had the term "Comma" embedded within their name
> it was possible for them to contain a nuance of the semantic meaning of
> the punctuation mark.
>
> In any case it seems that Joe may have been wrong about this, if one
> takes the Java language as an example of how computer languages are
> internationalized.
> http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.1
>
> Instead of using the local symbol for Comma and translating it into the
> ASCII comma, Java takes a different approach. The ASCII character "," is
> used. It seems that native speakers of the Java language think that this
> is perfectly reasonable. Java also requires [0-9] digits.
>
> It seems that Java takes essentially the same approach that I am taking
> and only allows non ASCII characters within identifiers.
>
> Here is another useful link:
> http://en.wikipedia.org:80/wiki/Non-English-based_programming_languages
>



--
HLS
From: Peter Olcott on
On 5/24/2010 1:24 AM, Hector Santos wrote:
> I'm a compiler designer
>
> http://www.santronics.com/products/winserver/Code.php
>
> And I am still trying to figure out what you are talking about.
>

I am trying to find out exactly how languages such as C++ and Java are
adapted for the international market. Although Joe's idea of permitting
a local comma to be mapped to the ASCII comma seemed plausible, it does
not look like this is the way that it is actually done.

In Java, the way it is actually done is that, although there is some
leeway in the specification of identifiers, everything else must be
pure ASCII.
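
The leeway for identifiers is easy to probe from Java itself, since the
JLS defines a "Java letter" and "Java letter-or-digit" in terms of the
Character classification methods. This throwaway sketch of mine just
prints which code points qualify, and shows that the comma does not:

public class IdentifierProbe {
    // The JLS defines "Java letter" and "Java letter-or-digit" in terms of
    // these Character methods; separators such as ',' are fixed ASCII tokens.
    static void probe(int cp) {
        System.out.printf("U+%04X start=%b part=%b%n",
                cp,
                Character.isJavaIdentifierStart(cp),
                Character.isJavaIdentifierPart(cp));
    }

    public static void main(String[] args) {
        probe('A');     // ASCII letter        -> start=true,  part=true
        probe(0x00E1);  // a with acute        -> start=true,  part=true
        probe(0x4E2D);  // a CJK ideograph     -> start=true,  part=true
        probe(',');     // ASCII comma         -> start=false, part=false
        probe(0xFF0C);  // FULLWIDTH COMMA     -> start=false, part=false
    }
}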


> Any (non-English) foreign-language developer can write code in his
> native tongue if his editor and compiler allow for it. It's a matter
> of having the right dictionary table for it.
>
> But if you wish to allow a purely native spoken language, it becomes
> more of an NLP (Natural Language Parsing) design issue than a
> tokenization design issue. For example, what is a one-word command in
> English can be two words in another language, and a meaning in English
> can carry a different meaning in others. The only other way to avoid
> the NLP is to build a lexical vocabulary for the specific language.
>
> One of the beauties of the APL language was that it was 100% symbolic,
> not one based on lexical translation. However, you did need a proper
> keyboard. APL gurus were able to use control codes and learn to recognize
> the non-APL code page display. It wasn't as pretty as symbolic APL, though.
>
> The same will be true of a Chinese C/C++ compiler. Well, you know we
> doubt you are writing a compiler, but rather a pre-processor, more like a
> lint. Nonetheless, even if you did borrow some compiler code, you will
> probably need to create a special dictionary table anyway. See the
> history of Chinese BASIC with its Cangjie input method:
>
> http://en.wikipedia.org/wiki/Chinese_BASIC
>
> I think that is the easy part. The hard part is the Chinese programmer
> learning another language that will probably not be as natural as you
> think. For that you really do need more of an NLP.
>
> --
> HLS
>
> Peter Olcott wrote:
>
>> On 5/23/2010 3:22 AM, Mihai N. wrote:
>>>
>>>
>>>> C++ requires that every non ASCII character be mapped to the ASCII set.
>>>
>>> Where did you get this from?
>>>
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3035.pdf
>>
>> 2.2
>> Physical source file characters are mapped, in an
>> implementation-defined manner, to the basic source character set. The
>> set of physical
>> source file characters accepted is implementation-defined.
>>
>> 2.3
>> The basic source character set consists of 96 characters: the space
>> character, the control characters representing horizontal tab,
>> vertical tab, form feed, and new-line, plus the following 91 graphical
>> characters:
>>
>>>
>>>> I looked in the Unicode table and found many code points that had a
>>>> letter with an implied comma embedded within its meaning. The comma was
>>>> being used as a diacritical mark.
>>>
>>> I am not sure what you are referring to.
>>> I don't know of any comma used as a diacritical mark.
>>> If you are talking about things like U+0219 (LATIN SMALL LETTER S WITH
>>> COMMA BELOW), that is a stand-alone letter, not a letter with a
>>> diacritical mark.
>>> Like I would say "O with a small squiggly" when talking about Q.
>>>
>>> The Unicode names describe the character using plain ASCII, but
>>> do not imply anything about the meaning of the thing.
>>>
>>>
>>>
>>
>> Since the above examples had the term "Comma" embedded within their
>> name it was possible for them to contain a nuance of the semantic
>> meaning of the punctuation mark.
>>
>> In any case it seems that Joe may have been wrong about this, if one
>> takes the Java language as an example of how computer languages are
>> internationalized.
>> http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.1
>>
>> Instead of using the local symbol for Comma and translating it into
>> the ASCII comma, Java takes a different approach. The ASCII character
>> "," is used. It seems that native speakers of the Java language think
>> that this is perfectly reasonable. Java also requires [0-9] digits.
>>
>> It seems that Java takes essentially the same approach that I am
>> taking and only allows non ASCII characters within identifiers.
>>
>> Here is another useful link:
>> http://en.wikipedia.org:80/wiki/Non-English-based_programming_languages
>>
>
>
>

From: Hector Santos on
Peter Olcott wrote:

> On 5/24/2010 1:24 AM, Hector Santos wrote:
>> I'm a compiler designer
>>
>> http://www.santronics.com/products/winserver/Code.php
>>
>> And I am still trying to figure out what you are talking about.
>>
>
> I am trying to find out exactly how languages such as C++ and Java are
> adapted for the international market. Although Joe's idea of permitting
> a local comma to be mapped to the ASCII comma seemed plausible, it does
> not look like this is the way that it is actually done.
>
> In Java, the way it is actually done is that, although there is some
> leeway in the specification of identifiers, everything else must be
> pure ASCII.

Well, it's hard to imagine what "deep thoughts" you guys got into, but
it does seem this is yet another case of over thinking (or under
thinking), or of just misreading again something that is a non-issue
and/or has already been addressed. You do have a bad habit of zooming
in on a nit or some subtle point that may not have been a main point,
and then rationalizing it into big proof of whatever concept you are
trying to express. Was this comma thing one of them?

Also, what do you call "pure ASCII"? Is it 7-bit or 8-bit? UTF-7?
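
It matters, because "pure ASCII" in the 7-bit sense simply means every
byte value is below 0x80 - a trivial check, as in this rough Java
sketch - whereas 8-bit text and transfer encodings like UTF-7 are a
different story:

public final class AsciiCheck {
    // "7-bit ASCII" simply means no byte has its high bit set.
    public static boolean isSevenBitAscii(byte[] data) {
        for (byte b : data) {
            if ((b & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isSevenBitAscii("f(a,b)".getBytes("US-ASCII"))); // true
        System.out.println(isSevenBitAscii("\u00E1".getBytes("UTF-8")));    // false
    }
}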

In any case, there are so many factors in all of this that I'm sure,
once again, you are just looking at a pebble in a pond of developer
compiler requirements, and more than likely the refraction is
providing a different view of the pebble.

You have the editor and what it accepts and interprets from the
keyboard - and HOW it is ENTERED. You have the editor and the display
and how the BYTE(S) or code points are rendered for readability. You
have the preprocessor, the translator, and the P-CODE and OP-CODE
generator. These languages do nothing ODDER than what is already
understood. The key difference is how it relates to the HUMAN. I can
write code in C/C++ that is totally "SYMBOLIC" to me and can display
and print it out in those symbols using 8-bit bytes.

The question is:

To the HUMAN, the DEVELOPER, does it make sense to him?

--
HLS