From: Pete Delgado on

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:gomdnfY9-INibm7WnZ2dnUVZ_sEAAAAA(a)giganews.com...
> On 5/19/2010 12:55 AM, Mihai N. wrote:
>>
>>> So C++ can take UTF-8 Identifiers?
>>
>> No, it can take Unicode identifiers.
>> The exact transformation format is not relevant.
>>
>>
> I thought that it choked on anything besides ASCII. So are you implying
> that it can take Unicode within any encoding?

*Read* the C++ standards documents. They explain *everything*. For
information about identifiers, see section 2.11. There are draft copies of
the current proposed standard available for free:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3092.pdf

There is no need to "imply" anything. As usual, Mihai is correct in matters
such as this and your "thought" was wrong.

-Pete


From: Mihai N. on

> I thought that it choked on anything besides ASCII. So are you implying
> that it can take Unicode within any encoding?

It can take Unicode in some Unicode form.
It can take the Unicode form accepted by the compiler.
Some compilers understand UTF-16, some understand UTF-8,
some understand none.
But even the ones that don't understand anything other than ASCII
should still accept the escaped form (\uXXXX).

int x\u6565rap = 123;
is a perfectly valid name.

(And if the compiler accepts UTF-8 or UTF-16, you can use a
human-readable form.)



--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Joseph M. Newcomer on
Note that an identifier is defined as incorporating "other implementation-defined
characters". If someone is claiming to extend C syntax to include localized letters, then
it should be philosophically consistent with the localized environment and define letters
to be consistent with that environment, or alternatively, be inclusive and include all
letters in all localized environments. Letters in a localized environment would not
include digits in a localized environment, punctuation marks of a localized environment,
etc.

Peter makes one of the common mistakes he is so fond of: he fastens on ONE implementation
by ONE vendor and claims that it is DEFINITIVE. You can't even argue that Intel's
C++ compiler or gcc "prove" that this is true for ALL compilers, since they are intended
to be clones of each other, and historically they all date back to the PDP-11 C compiler,
which only used 7-bit ASCII; they are clones of that, except for extending the syntax to
more modern constructs. So he comes along and says "I'm going to extend this", and as soon
as I point out that the extensions have serious problems, he says "but the regular C++
language does not work that way!", which dodges the question of what is meant by
creating an extension that meets the requirement of allowing "native language". There
are interesting questions about accent marks, vowel marks, combining characters, localized
punctuation, localized digits, etc., but when I raised these, I was informed that the
extensions to support "native language coding" did NOT mean "support native language
coding", but rather "support something that allows native-language programmers to write
identifiers that don't even make sense lexically in their native language". And while
making claims about how fast the recognizer is, he refuses to tighten the productions,
because the copy-and-paste lex rules would actually require WORK to make correct; so
he argues that it is not "convenient" to do it right.

I guess I don't respect doing a job wrong, or rationalizations that say "wrong is OK,
because whatever it is that I have defined is necessarily right, whether it is right or
not". There are some VERY interesting questions about combining accent marks and
combining characters, but even if we ignore those, there is ZERO excuse for not writing
productions based on localized letters or digits (other than that the copy-and-paste
solution no longer works!), because it cannot POSSIBLY affect the performance of the
lexer! He even says it can't, so the only remaining reason is the need to actually THINK
about the problem, instead of accepting an unsanctioned and unsupported regexp rule set.

Note that the lexical rules require that extended characters be mapped to the basic
character set, so a Thai digit character would map to the corresponding 0..9 value, and a
conforming compiler that allowed Thai input would do so because the C++ standard requires
that it do so. So his argument about why his extended C++ does not have to treat a
localized comma as a comma, or a localized semicolon as a semicolon, does not hold up:
the standard says that the input character set is implementation-specific but must map to
the basic character set. Consider a sequence like "A,B" written in some language with a
localized comma and localized letters: by his rules, it is necessarily a single
identifier, but under the mapping requirements it cannot be, and therefore his assertion
is (no big surprise here) gibberish.

But why are we arguing over this? We KNOW his design is wrong; only HE can defend his bad
decisions by rationalizing them to himself. The first time a customer programmer
complains "But you SAID your extensions supported UTF-8 input, and I wrote this code in my
native language and it is correct" he can explain to a PAYING CUSTOMER why his
implementation makes no sense.

Note that you should also cite section 2.3 and the footnotes on page 16.
joe

On Thu, 20 May 2010 00:59:44 -0400, "Pete Delgado" <Peter.Delgado(a)NoSpam.com> wrote:

>
>"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>news:gomdnfY9-INibm7WnZ2dnUVZ_sEAAAAA(a)giganews.com...
>> On 5/19/2010 12:55 AM, Mihai N. wrote:
>>>
>>>> So C++ can take UTF-8 Identifiers?
>>>
>>> No, it can take Unicode identifiers.
>>> The exact transformation format is not relevant.
>>>
>>>
>> I thought that it choked on anything besides ASCII. So are you implying
>> that it can take Unicode within any encoding?
>
>*Read* the C++ standards documents. They explain *everything*. For
>information about identifiers, see section 2.11. There are draft copies of
>the current proposed standard available for free:
>
>http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3092.pdf
>
>There is no need to "imply" anything. As usual, Mihai is correct in matters
>such as this and your "thought" was wrong.
>
>-Pete
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
See below...
On Wed, 19 May 2010 14:54:03 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/19/2010 2:31 PM, Joseph M. Newcomer wrote:
>> See below...
>> On Wed, 19 May 2010 10:01:36 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>
>>> On 5/19/2010 1:32 AM, Joseph M. Newcomer wrote:
>>>> See below...
>>>> On Tue, 18 May 2010 20:48:26 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>>>
>>>>>
>>>>> A church that I used to go to had an expression, "What would Jesus do?"
>>>>> as their measure of correct behavior. In my case I have an analogous
>>>>> measure, "What would C++ do?"
>>>>>
>>>>> Would C++ permit digits other than ASCII [0-9] ???
>>>> ***
>>>> How about
>>>>
>>>> "Would a person who makes a claim that his language allows programmers to program in their
>>>> native language create a compiler in which their native digits are considered letters be
>>>> lying in his teeth about his claim?"
>>>> ****
>>>
>>> The claim was merely imprecisely (thus incorrectly) stated. What I meant
>>> by this was that Identifiers can be written in the native language, and
>>> C++ language constraints must be otherwise maintained.
>> ****
>> This is a typical pattern:
>>
>> Peter: "X is true"
>> World: "X is false"
>> Peter: "I meant to say, X is true under the following conditions"
>> World: "X is false in two of those three conditions"
>> Peter: "No, I REALLY meant that X is true only under conditions when it is true,
>> and I'm going to ignore all the conditions where it is false, and
>> define them out of existence by stating they were not part of
>> my design"
>>>
>>> Will C++ allow anything other than an ASCII comma between parameters?
>>> What are the limits on exactly how much C++ is Unicode aware? I already
>>> know that std::wstring is totally clueless. I assumed based on this that
>>> all of C++ was generally clueless about Unicode.
>> ****
>> But you said you were EXTENDING the language to be a C-like language that supported
>> localization! Did you mean something different? (See Magic Morphing Requirements)
>
>I was not aware of any other issues pertaining to the localization of a
>language based on C++ than providing a way to write identifiers in the
>native language. I had thought that C++ required all users to use ASCII
>numeric digits.
***
What do you mean "I thought"? Does it mean "I once heard a rumor about this" or "I found
something on someone's Web page", or what? It certainly cannot mean "I read the C++
Standard" because that is very explicit about stating that it is the responsibility of the
input mechanism to map the input character set to the base character set, which clearly
suggests that localized digits are permissible for numbers!
****
>
>To exactly what extent does the current C++ provide for localization?
>Does the current C++ allow you to use anything other than an ASCII comma
>to separate parameters?
****
It explicitly states that the input mechanism is responsible for mapping from the input
character set to the base character set. And you have seriously confused the issue by
saying "I am extending the lexical rules to allow UTF-8" and then saying "But if you use
a character which is not in conformance with the ASCII-7 subset, I am not obligated to
honor it". Duh!

Since you have been given a citation to the C++ Standard, I suggest reading it. Why
should I copy-and-paste from the document to save you the effort? I just read it, and it
is pretty explicit about what constitutes correct behavior. And your implementation is
not even CLOSE to supporting correct behavior!

And note that if you say "I am extending the lexical rules" then you must actually do
this, and not say "What I meant by that was I am extending the lexical rules, except when
it would require that I do a little work beyond copy and paste to make them correct".
joe
****
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Hector Santos on
Mihai N. wrote:

>> I thought that it choked on anything besides ASCII. So are you implying
>> that it can take Unicode within any encoding?
>
> Can take Unicode in some Unicode form.
> It can take the Unicode form accepted by the compiler.
> Some compilers understand UTF-16, some understand UTF-8,
> some understand none.
> But even the ones that don't understand anything other than ASCII
> should still accept the escaped form (\uXXXX)
>
> int x\u6565rap = 123;
> is a perfectly valid name.
>
> (and if the compiler accepts utf-8 or utf-16, you can use some
> human readable form)

Half the problem with all of this is that there is no context for the
applicability.

Overall, you have the fundamental ergonomics or interfaces:

- text editor (creation)
- text compiling (translation)
- data transfer (heterogeneous networking)
- display rendering (old and new user devices)

What else?

He was clearly wrong about C/C++ only supporting ASCII, and that's only
because he is not enough of a programmer to know it isn't true. But even if
it were true, so what? Not everyone is using C/C++ only. If Unicode editing
is necessary for a developer, they will find the tools. There are other
languages, and the creation/translation/rendering side is pretty much
well defined (complex, but well defined).

OTOH, the data transfer is generally the problem, and for us that has been
our main focus for the past year or so because of the new IETF requirements
for mail transport protocols. UTF-8 encoding has made this easy.

Next are the other parts for us. I will have questions about this soon. :)

--
HLS