Is this Regular Expression for UTF-8 Correct?? [Unix Programming]

From: Peter Olcott on 13 May 2010 16:06

Is this Regular Expression for UTF-8 Correct??

The solution is based on the GREEN portions of the first
chart shown on this link:
http://www.w3.org/2005/03/23-lex-U

A semantically identical regular expression is also found on
the above link under "Validating lex Template":

1   ['\u0000'-'\u007F']
2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
3 | ( '\u00E0' ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF'])
4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
5 | ( '\u00ED' ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
7 | ( '\u00F0' ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
9 | ( '\u00F4' ['\u0080'-'\u008F'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])

Here is my version. The syntax is different, but the UTF-8
portion should be semantically identical.

UTF8_BYTE_ORDER_MARK [\xEF][\xBB][\xBF]

ASCII [\x0-\x7F]

U1 [a-zA-Z_]
U2 [\xC2-\xDF][\x80-\xBF]
U3 [\xE0][\xA0-\xBF][\x80-\xBF]
U4 [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
U5 [\xED][\x80-\x9F][\x80-\xBF]
U6 [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
U7 [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
U8 [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
U9 [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]

UTF8 {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}

// This identifies the "Letter" portion of an Identifier.
L {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}

I guess that most of the analysis may simply boil down to
whether or not the original source from the link is
considered reliable. I had forgotten this original source
when I first asked this question, which is why I am
reposting it.
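
One way to cross-check the table without relying on the source page is to write the same nine byte ranges by hand and run test inputs through them. The following is only a minimal C++ sketch of that idea, not the lexer definitions above; the names utf8_sequence_length and is_valid_utf8 are hypothetical and chosen for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): returns the length (1-4) of the
// well-formed UTF-8 sequence starting at s[i], or 0 if the bytes there match
// none of the nine rows of the table.
std::size_t utf8_sequence_length(const std::string& s, std::size_t i) {
    // True if position k exists and its byte lies in [lo, hi].
    auto in = [&s](std::size_t k, unsigned lo, unsigned hi) {
        if (k >= s.size()) return false;
        const unsigned b = static_cast<unsigned char>(s[k]);
        return b >= lo && b <= hi;
    };
    if (i >= s.size()) return 0;
    const unsigned b0 = static_cast<unsigned char>(s[i]);

    if (b0 <= 0x7F) return 1;                          // row 1
    if (b0 >= 0xC2 && b0 <= 0xDF)                      // row 2
        return in(i + 1, 0x80, 0xBF) ? 2 : 0;
    if (b0 >= 0xE0 && b0 <= 0xEF) {                    // rows 3-6
        const unsigned lo = (b0 == 0xE0) ? 0xA0 : 0x80;  // E0: no overlong forms
        const unsigned hi = (b0 == 0xED) ? 0x9F : 0xBF;  // ED: no surrogates
        return (in(i + 1, lo, hi) && in(i + 2, 0x80, 0xBF)) ? 3 : 0;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4) {                    // rows 7-9
        const unsigned lo = (b0 == 0xF0) ? 0x90 : 0x80;  // F0: no overlong forms
        const unsigned hi = (b0 == 0xF4) ? 0x8F : 0xBF;  // F4: stop at U+10FFFF
        return (in(i + 1, lo, hi) && in(i + 2, 0x80, 0xBF) &&
                in(i + 3, 0x80, 0xBF)) ? 4 : 0;
    }
    return 0;  // 0x80-0xC1 and 0xF5-0xFF never start a sequence
}

// Hypothetical helper: true if the whole string is a run of such sequences.
bool is_valid_utf8(const std::string& s) {
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = utf8_sequence_length(s, i);
        if (n == 0) return false;
        i += n;
    }
    return true;
}
```

Feeding it known-bad inputs such as the overlong pair C0 80, the surrogate ED A0 80, or F4 90 80 80 (beyond U+10FFFF) should give false if the ranges are right.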

From: Leigh Johnston on 13 May 2010 16:27

What has this got to do with C++? What is your C++ language question?

/Leigh

From: Peter Olcott on 13 May 2010 16:36

"Leigh Johnston" <leigh(a)i42.co.uk> wrote in message
news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d(a)giganews.com...
> What has this got to do with C++? What is your C++
> language question?
>
> /Leigh

I will be implementing a utf8string to supplement
std::string and will be using a regular expression to
quickly divide up UTF-8 bytes into Unicode code points.

Since there are no UTF-8 groups, or even Unicode groups, I
must post these questions to groups that are at most
indirectly related to this subject matter.
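
To make the goal concrete, here is a rough C++ sketch of splitting UTF-8 bytes into code points. It uses a hand-written loop rather than a regular expression, so it is not the approach described above, and the name decode_utf8 is hypothetical. The range checks at the end are intended to reject the same inputs as the restricted second-byte ranges in the table.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): decodes a UTF-8 byte string into
// Unicode code points, throwing std::invalid_argument on ill-formed input.
std::vector<std::uint32_t> decode_utf8(const std::string& s) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const std::uint32_t b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;

        // The lead byte fixes the sequence length and supplies the high bits.
        if (b0 <= 0x7F)                    { len = 1; cp = b0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (s.size() - i < len)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte must be 10xxxxxx and adds six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const std::uint32_t b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if ((len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            throw std::invalid_argument("ill-formed UTF-8 sequence");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

A utf8string built on std::string could call something like this once and cache the result, or decode lazily; which is better depends on the access pattern.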

From: Leigh Johnston on 13 May 2010 16:41

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:xMOdnahxZJNX_3HWnZ2dnUVZ_hCdnZ2d(a)giganews.com...
>>
>> What has this got to do with C++? What is your C++ language question?
>>
>> /Leigh
>
> I will be implementing a utf8string to supplement std::string and will be
> using a regular expression to quickly divide up UTF-8 bytes into Unicode
> code points.
>
> Since there are no UTF-8 groups, or even Unicode groups, I must post these
> questions to groups that are at most indirectly related to this subject
> matter.

Wrong: off-topic is off-topic. If I chose to write a Tetris game in C++ it
would be inappropriate to ask about the rules of Tetris in this newsgroup
even if there was not a more appropriate newsgroup.

/Leigh

From: Peter Olcott on 13 May 2010 16:54

"Leigh Johnston" <leigh(a)i42.co.uk> wrote in message
news:v7CdnY8dPrNy_nHWnZ2dnUVZ8t-dnZ2d(a)giganews.com...
>
>
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:xMOdnahxZJNX_3HWnZ2dnUVZ_hCdnZ2d(a)giganews.com...
>>>
>>> What has this got to do with C++? What is your C++
>>> language question?
>>>
>>> /Leigh
>>
>> I will be implementing a utf8string to supplement
>> std::string and will be using a regular expression to
>> quickly divide up UTF-8 bytes into Unicode code points.
>>
>> Since there are no UTF-8 groups, or even Unicode groups, I
>> must post these questions to groups that are at most
>> indirectly related to this subject matter.
>
> Wrong: off-topic is off-topic. If I chose to write a
> Tetris game in C++ it would be inappropriate to ask about
> the rules of Tetris in this newsgroup even if there was
> not a more appropriate newsgroup.
>
> /Leigh

I think that posting to the next most relevant group(s)
where a directly relevant group does not exist is right, and
thus you are simply wrong.
