From: Sam on
Peter Olcott writes:

> On 5/13/2010 6:40 PM, Sam wrote:
>>> This time I found the original source of a semantically identical
>>> regular expression that you berated so rudely.
>>> http://www.w3.org/2005/03/23-lex-U
>>>
>>> Who knows, maybe www.w3.org is wrong and you are right?
>>
>> And as I wrote in the first thread, I suspected that the regular
>> expression mish-mash's actual purpose was to validate some defined
>> subset of the entire Unicode range, as encoded in UTF-8.
>
> And this view is clearly incorrect. It validates the entire set of
> UTF-8 encodings. Here is a quote:
>
> "This pattern does not restrict to the set of
> defined UCS characters, instead to the set that
> is permitted by UTF-8 encoding."
>
> The difference is the missing D800-DFFF High and Low surrogates that are
> not legal in UTF-8. All of the other CodePoints from 0-10FFFF are
> represented.

Since you claim to know so much about UTF-8 encoding and decoding -- even
more than RFC 2279 -- it's a wonder you had to ask your question at all. It
seems that you already knew the answer to the question.

Good luck with UTF-8 encoding and decoding.

From: Peter Olcott on
On 5/13/2010 8:01 PM, Sam wrote:
> Peter Olcott writes:
>
>> On 5/13/2010 6:40 PM, Sam wrote:
>>>> This time I found the original source of a semantically identical
>>>> regular expression that you berated so rudely.
>>>> http://www.w3.org/2005/03/23-lex-U
>>>>
>>>> Who knows, maybe www.w3.org is wrong and you are right?
>>>
>>> And as I wrote in the first thread, I suspected that the regular
>>> expression mish-mash's actual purpose was to validate some defined
>>> subset of the entire Unicode range, as encoded in UTF-8.
>>
>> And this view is clearly incorrect. It validates the entire set of
>> UTF-8 encodings. Here is a quote:
>>
>> "This pattern does not restrict to the set of
>> defined UCS characters, instead to the set that
>> is permitted by UTF-8 encoding."
>>
>> The difference is the missing D800-DFFF High and Low surrogates that
>> are not legal in UTF-8. All of the other CodePoints from 0-10FFFF are
>> represented.
>
> Since you claim to know so much about UTF-8 encoding and decoding --
> even more than RFC 2279 -- it's a wonder you had to ask your question at
> all. It seems that you already knew the answer to the question.

http://tools.ietf.org/html/rfc3629
This memo obsoletes and replaces RFC 2279.
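A minimal sketch of the rule being argued over, assuming Python's re
module on bytes rather than the lex syntax of the W3C page. The
alternation follows the UTF-8 byte ranges in RFC 3629, section 4, so it
should accept every code point from 0 to 10FFFF except the surrogates
D800-DFFF. The names UTF8_CHAR, UTF8_STRING and matches_utf8_pattern are
made up for illustration and do not come from the W3C page.

```python
import re

# One UTF-8 encoded code point, as a byte-level alternation.
# Ranges follow RFC 3629 section 4; this is an illustrative sketch,
# not a copy of the pattern on the W3C page.
UTF8_CHAR = (
    rb"[\x00-\x7F]"                      # U+0000..U+007F
    rb"|[\xC2-\xDF][\x80-\xBF]"          # U+0080..U+07FF (C0/C1 would be overlong)
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"      # U+0800..U+0FFF (no overlong forms)
    rb"|[\xE1-\xEC][\x80-\xBF]{2}"       # U+1000..U+CFFF
    rb"|\xED[\x80-\x9F][\x80-\xBF]"      # U+D000..U+D7FF (stops before D800-DFFF)
    rb"|[\xEE-\xEF][\x80-\xBF]{2}"       # U+E000..U+FFFF
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"   # U+10000..U+3FFFF
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"       # U+40000..U+FFFFF
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"   # U+100000..U+10FFFF
)

# Zero or more encoded code points.
UTF8_STRING = re.compile(rb"(?:" + UTF8_CHAR + rb")*")


def matches_utf8_pattern(data: bytes) -> bool:
    """Return True if the whole byte string matches the pattern above."""
    return UTF8_STRING.fullmatch(data) is not None
```

With this sketch, the encoded surrogate b"\xED\xA0\x80" (U+D800) is
rejected, while b"\xED\x9F\xBF" (U+D7FF) and b"\xF4\x8F\xBF\xBF"
(U+10FFFF) are accepted.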

>
> Good luck with UTF-8 encoding and decoding.
>

Thanks.

From: Jasen Betts on
On 2010-05-13, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:
>
> "Ian Collins" <ian-news(a)hotmail.com> wrote in message
> news:8539h9F7f1U1(a)mid.individual.net...
>> On 05/14/10 08:06 AM, Peter Olcott wrote:
>>> Is this Regular Expression for UTF-8 Correct??
>>
>> It's a fair bet you are off-topic in all the groups you
>> have cross posted to. Why don't you pick a group for a
>> language with built in UTF8 and regexp support (PHP?) and
>> badger them?
>>
>> --
>> Ian Collins
>
> What does this question have to do with the C++ language?
>
> At least my question is indirectly related to C++ by making
> a utf8string for the C++ language from the regular
> expression.

Just use iconv.

And don't cross-post off-topic.
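As a rough illustration of letting a converter do the validation
instead of a hand-written pattern, here is a sketch that uses Python's
built-in strict UTF-8 codec as a stand-in for iconv. It is not iconv
itself, and the helper name is hypothetical.

```python
def is_valid_utf8_by_decoding(data: bytes) -> bool:
    """Validate by attempting a strict decode (stand-in for an iconv call)."""
    try:
        # errors="strict" raises UnicodeDecodeError on malformed input,
        # including overlong forms and encoded surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

The idea is the same as with iconv: if the conversion fails, the input
was not well-formed UTF-8.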


From: David Schwartz on
On May 13, 3:04 pm, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:

> What does this question have to do with the C++ language?
>
> At least my question is indirectly related to C++ by making
> a utf8string for the C++ language from the regular
> expression.
>
> Your question is not even indirectly related to the C++
> language.

Unfortunately, no better way is known to keep conversations on topic.
If you know a better way, we'd all love to hear it. If you don't
respond immediately in the forum and point out that something is off
topic, other people browsing the forum will think the question was on
topic. Other ways have been tried in the past (such as private mails
where possible, monthly posts about topicality rather than replying to
each off-topic post, and so on). None have been shown to be effective.

Painful experience has shown that the most effective technique is to
verbally berate and ridicule people who post off topic. Thus others
will see the negative response by the group and not want their posts
to be met with a similar response.

Again, this wasn't anyone's first choice, and if you know a better
way, please tell us. (In the appropriate forum, of course!)

DS