Is this UTF-8 regular expression semantically correct? [MFC]


From: Joseph M. Newcomer on 17 May 2010 13:04

See below..
On Mon, 17 May 2010 09:17:41 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/17/2010 12:40 AM, Joseph M. Newcomer wrote:
>> Because without the context it is not a valid question.
>>
>> For example, since this is a C++/MFC group, the question might have been in terms of a
>> regexp library, which suggests you are using UTF-8 internally, which would be wrong.
>>
>> But as stated, the question is wrong, because you are presuming an over-simplified concept
>> of "letter", for which I have already pointed out there are failures (numbers in other
>> languages). You would have to deal with all accent marks, and while some languages have
>> e-umlaut, i-umlaut and y-umlaut, these are not letters in German, and so you have to take
>> into account the localization context to determine if they really are "letters". And in
>> Chinese, a single glyph may be a "word" and thus two of these in sequence would be
>> syntactically illegal. So how do you define "letter"? And in some cases, the accent mark
>> is a separate codepoint, so a separate UTF-8 encoding, but you can't combine that accent
>> mark with any but a few letters, so the regexp does not account for these at all!
>>
>> What about RTL encodings? In Hebrew, which I will simplify for NG syntax, if I wanted to
>> write ABC it would appear as CBA because of the right-to-left nature of that language. But
>> if I wanted to write ABC 123 DEF it would appear as FED*123$CBA where the $ represents the
>> token that says "change to RTL" and the * represents the token that says "change to LTR".
>> Read the Unicode documentation! (RTFM!) So if you are parsing this into tokens, is it
>> "FED" "123" "CBA" or "ABC" "321" "DEF" or "ABC" "123" "DEF"? If you can't answer this
>> question, then you can't ask the one about the regexp being correct. What if I have a
>> lexically illegal sequence of accent marks and characters? What if I have the sequence
>> '`a? If 'a means à and `a means á (I'm not talking about the ANSI characters; here '
>> means U0300 and ` means U0301), what does '`a or `'a mean? Whoops, lexical error. There
>> is no rule in your regexp that detects this, therefore, it is wrong. (In UTF-32 these would
>> be U00000300 U00000061 and U00000301 U00000061, and in UTF-8 these would be "cc 80 61 cc 81
>> 61".)
>>
>> So the simple answer is "It is completely and utterly insufficient, and its correctness is
>> problematic, and it does not define even what a letter is", and even if you convert to
>> UTF-32 you have not solved this problem.
>> joe
>>
>> So the simplest answer is "No", under no imaginable conditions is this collection of
>> regexps even CLOSE to being usable, and even if expressed in UTF-32, there is no possible
>> way something this overly-simplistic could be construed to make sense, and the real
>> problem is vastly more complicated than you have imagined!
>> joe
>
>You are taking the incorrect approach in that if a solution does not
>provide support for every possible issue then this solution does not
>solve the problem. The failure in this approach is that for many
>problems most of these issues are entirely moot.
>
>For the purpose of creating an interpreted GUI scripting language that
>permits people to write GUI scripts in their native language I only need
>to be able to handle UTF-8 input and make sure that it is valid UTF-8.
>There is no need for me to validate this any further.
****
So you can accept a line of the form
A + B = C D - )) 0123

because it is lexically correct? If it is valid UTF-8, then it must necessarily be a
correct script?

Have I missed some important point here?

Why do you care if it is valid UTF-8? If it isn't, the UTF-8 to UTF-32 conversion will
fail and you can report an improper input file at that point! At no point would lex need
to know about UTF-8 text, because it should not be working with anything less than UTF-16.
Better still, UTF-32.
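
A minimal sketch of the decode-first order described above, in Python for brevity. It is not code from this thread, and `decode_script` is a hypothetical name. It assumes the whole script is available as a byte string, and it relies on Python's strict UTF-8 decoder to reject malformed input before any lexer runs:

```python
def decode_script(raw: bytes) -> list:
    """Decode UTF-8 bytes into a list of code points (UTF-32 values).

    Hypothetical helper: raises ValueError naming the offending byte offset
    when the input is not well-formed UTF-8.
    """
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"improper input file: invalid UTF-8 at byte offset {exc.start}"
        ) from exc
    # Each element is one code point, so a lexer downstream never sees UTF-8 bytes.
    return [ord(ch) for ch in text]
```

With this in front of the lexer, `decode_script(b"\xcc\x80a")` yields the two code points U+0300 and U+0061, while a stray byte such as 0xFF is reported as a bad input file before lexing starts.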

Curious yellow dreams sleep furiously.

(A syntactically correct sentence, if you are wondering, and one of the classics used in
language comprehension by computers)

Time flies like an arrow
Fruit flies like a banana

If you are doing a "scripting language" you need to handle syntax as well.
joe
>
>>
>> On Sat, 15 May 2010 09:20:42 -0500, "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote:
>>
>>>
>>> "Mihai N."<nmihai_year_2000(a)yahoo.com> wrote in message
>>> news:Xns9D7922762D9EBMihaiN(a)207.46.248.16...
>>>>
>>>>
>>>>> Possibly, but, I was really only looking for a yes or no
>>>>> answer.
>>>>
>>>> If you wanted a yes/no answer you should give complete
>>>> info
>>>> (like the fact that you are talking lex context)
>>>> Otherwise you will very likely get a wrong answer.
>>>
>>> I don't see why this would be the case for a yes or no
>>> question.
>>>
>>>>
>>>>
>>>> --
>>>> Mihai Nita [Microsoft MVP, Visual C++]
>>>> http://www.mihai-nita.net
>>>> ------------------------------------------
>>>> Replace _year_ with _ to get the real email
>>>>
>>>
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Peter Olcott on 17 May 2010 15:38

On 5/17/2010 12:04 PM, Joseph M. Newcomer wrote:
> See below..
> On Mon, 17 May 2010 09:17:41 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> You are taking the incorrect approach in that if a solution does not
>> provide support for every possible issue then this solution does not
>> solve the problem. The failure in this approach is that for many
>> problems most of these issues are entirely moot.
>>
>> For the purpose of creating an interpreted GUI scripting language that
>> permits people to write GUI scripts in their native language I only need
>> to be able to handle UTF-8 input and make sure that it is valid UTF-8.
>> There is no need for me to validate this any further.
> ****
> So you can accept a line of the form
> A + B = C D - )) 0123
>
> because it is lexically correct? If it is valid UTF-8, then it must necessarily be a
> correct script?

Any sequence of code points above the ASCII range will form acceptable
Identifiers.
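
As a rough illustration of what such a rule amounts to (a Python sketch, not the poster's actual lex specification), the pattern below treats every code point above U+007F as a letter. The C-like treatment of the ASCII part and the names `IDENTIFIER` and `identifiers` are assumptions made for the example:

```python
import re

# Assumed C-like rule for the ASCII part: letters or underscore start an
# identifier, digits may follow. Every code point above U+007F is a letter.
IDENTIFIER = re.compile(
    r"[A-Za-z_\u0080-\U0010FFFF][A-Za-z0-9_\u0080-\U0010FFFF]*"
)

def identifiers(source: str) -> list:
    """Return every identifier the rule finds in `source`, in order."""
    return IDENTIFIER.findall(source)
```

Under this rule `identifiers("gr\u00f6\u00dfe = 12")` returns only the one non-ASCII name, and any non-ASCII punctuation mark or digit (for example the Arabic comma U+060C in `"a\u060cb"`) is absorbed into the identifier as well.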

>
> Have I missed some important point here?
>

Yes, you have missed the project scope that is defined within the project's
intended purpose.

> Why do you care if it is valid UTF-8? If it isn't, the UTF-8 to UTF-32 conversion will
> fail and you can report an improper input file at that point! At no point would lex need
> to know about UTF-8 text, because it should not be working with anything less than UTF-16.
> Better still, UTF-32.

I was not going to convert the UTF-8, I was going to store it in the
SymbolTable as UTF-8.

>
> Curious yellow dreams sleep furiously.
>
> (A syntactically correct sentence, if you are wondering, and one of the classics used in
> language comprehension by computers)
>
> Time flies like an arrow
> Fruit flies like a banana
>
> If you are doing a "scripting language" you need to handle syntax as well.
> joe

Only the syntax of my GUI scripting language, not the syntax of any
natural language.

From: Mihai N. on 18 May 2010 04:10

Why not go to the root of the problem?

This is what you need:
> For the purpose of creating an interpreted GUI scripting language that
> permits people to write GUI scripts in their native language

Then expose the whole thing using a COM model, and it would allow
anyone to automate using any .NET language, Perl, JScript, you name it.
Solid languages, some of them supporting Unicode out of the box, way
more popular. You stop wasting your time developing a compiler,
and people will not be forced to waste time learning another
programming language (C-like but not quite C).

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Joseph M. Newcomer on 18 May 2010 12:26

See below...
On Tue, 18 May 2010 01:10:07 -0700, "Mihai N." <nmihai_year_2000(a)yahoo.com> wrote:

>
>
>Why not go to the root of the problem?
>
>This is what you need:
> > For the purpose of creating an interpreted GUI scripting language that
> > permits people to write GUI scripts in their native language
>
>Then expose the whole thing using a COM model, and it would allow
>anyone to automate using any .NET language, Perl, JScript, you name it.
>Solid languages, some of them supporting Unicode out of the box, way
>more popular. You stop wasting your time developing a compiler,
>and people will not be forced to waste time learning another
>programming language (C-like but not quite C).
****
But that sounds *reasonable*.

Note that "permits people to write GUI scripts in their native language" but "all
characters above the ASCII range" [which I presume means U007F] "are letters". Apparently,
these languages do not have localized punctuation marks or digits, which is true only if
you live deep in a Reality Distortion Field.

In what language, exactly, is my use of the localized punctuation marks or digits
considered part of the set of "letters"? Presumably, if this were cast into the context
of the 7-bit set, it would mean that I could have identifiers "A,B", "A.B", "A;B", "01ABC",
"3CAT" and so on. If my native language has a native comma, period, or semicolon, why is
this considered a "letter"? Why is it I can start an identifier with a digit? Why is my
native rendering of 12.34 considered an "identifier" and not a "number"? And localized
digits? If I were doing this, I'd have productions that defined numeric sequences, e.g.,
bengali_number, thai_number, etc. and then have a production that a number is an
"ascii_number", "bengali_number", "thai_number", etc., but unfortunately that would merely
make my implementation *correct*, rather than "small and fast" (this is the Unix mindset:
it doesn't matter if it is right as long as it is small and fast).
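
The per-script productions mentioned above could be sketched roughly as follows, in Python for brevity. This is an illustration only: the names mirror the ones in the paragraph, the digit ranges are the Bengali (U+09E6..U+09EF) and Thai (U+0E50..U+0E59) blocks from the Unicode code charts, and a real grammar would need many more scripts plus localized decimal separators:

```python
import re

# One production per script; mixing scripts inside one number is rejected.
ASCII_NUMBER = r"[0-9]+"
BENGALI_NUMBER = r"[\u09E6-\u09EF]+"
THAI_NUMBER = r"[\u0E50-\u0E59]+"

# number := ascii_number | bengali_number | thai_number
NUMBER = re.compile(f"(?:{ASCII_NUMBER}|{BENGALI_NUMBER}|{THAI_NUMBER})")

def is_number(token: str) -> bool:
    """True when the whole token is a digit sequence from a single script."""
    return NUMBER.fullmatch(token) is not None
```

Here `is_number("\u09e7\u09e8\u09e9")` (Bengali 123) and `is_number("\u0e51\u0e52\u0e53")` (Thai 123) are both accepted as numbers rather than identifiers, while a token mixing ASCII and Bengali digits is not.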

Of course, if you want to simplify the problem to make its implementation easy, and
violate your own specification of using native languages, then making nonsensical
statements such as "all characters above the ASCII range are letters" is acceptable.
Nonsensical, of course, but if you define nonsense away by saying "my implementation defines
correctness, not my specification", then it is presumably OK.
joe
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Peter Olcott on 18 May 2010 15:03

On 5/18/2010 3:10 AM, Mihai N. wrote:
>
>
> Why not go to the root of the problem?
>
> This is what you need:
> > For the purpose of creating an interpreted GUI scripting language that
> > permits people to write GUI scripts in their native language
>
> Then expose the whole thing using a COM model, and it would allow
> anyone to automate using any .NET language, Perl, JScript, you name it.
> Solid languages, some of them supporting Unicode out of the box, way
> more popular. You stop wasting your time developing a compiler,
> and people will not be forced to waste time learning another
> programming language (C-like but not quite C).
>
>
>
>
I considered that, but rejected it for two reasons:
(1) Not sufficiently platform independent.
(2) Makes my success too dependent upon Microsoft.
