Is this Regular Expression for UTF-8 Correct?? [Unix Programming]

From: Victor Bazarov on 13 May 2010 18:41

On 5/13/2010 6:04 PM, Peter Olcott wrote:
> "Ian Collins"<ian-news(a)hotmail.com> wrote in message
> news:8539h9F7f1U1(a)mid.individual.net...
>> On 05/14/10 08:06 AM, Peter Olcott wrote:
>>> Is this Regular Expression for UTF-8 Correct??
>>
>> It's a fair bet you are off-topic in all the groups you
>> have cross posted to. Why don't you pick a group for a
>> language with built in UTF8 and regexp support (PHP?) and
>> badger them?
>>
>> --
>> Ian Collins
>
> What does this question have to do with the C++ language?

It does not have to have anything to do with C++. A post on the
topicality of another post is *always on topic*.

> At least my question is indirectly related to C++ by making
> a utf8string for the C++ language from the regular
> expression.

<sarcasm>
I am about to hold a party where I expect my colleagues to show up.
They are all C++ programmers. Would the question on what to feed them,
or whether 1970s pop music is going to be appropriate, be on topic in
comp.lang.c++? It's *indirectly related* to C++, isn't it?
</sarcasm>

> Your question is not even indirectly related to the C++
> language.

See above.

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask

From: Sam on 13 May 2010 19:12

Victor Bazarov writes:

> On 5/13/2010 6:04 PM, Peter Olcott wrote:
>> "Ian Collins"<ian-news(a)hotmail.com> wrote in message
>> news:8539h9F7f1U1(a)mid.individual.net...
>>> On 05/14/10 08:06 AM, Peter Olcott wrote:
>>>> Is this Regular Expression for UTF-8 Correct??
>>>
>>> It's a fair bet you are off-topic in all the groups you
>>> have cross posted to. Why don't you pick a group for a
>>> language with built in UTF8 and regexp support (PHP?) and
>>> badger them?
>>>
>>> --
>>> Ian Collins
>>
>> What does this question have to do with the C++ language?
>
> It does not have to have anything to do with C++. A post on the
> topicality of another post is *always on topic*.
>
>> At least my question is indirectly related to C++ by making
>> a utf8string for the C++ language from the regular
>> expression.
>
> <sarcasm>
> I am about to hold a party where I expect my colleagues to show up.
> They are all C++ programmers. Would the question on what to feed them,
> or whether 1970s pop music is going to be appropriate, be on topic in
> comp.lang.c++? It's *indirectly related* to C++, isn't it?
> </sarcasm>
>
>> Your question is not even indirectly related to the C++
>> language.
>
> See above.

This guy is a tool. He re-posted this question a second time because when he
first posted that snippet nobody cared either. But after watching the
struggle in the original thread, the ugly carnage appealed to the
infinitesimally small humanitarian aspect of my psyche sufficiently enough
to motivate myself into actually looking at the regexp monstrosity. But
after I explained why that spaghetti of a regexp does not jibe with RFC
2279, he got all huffy about it. He was confident that I was wrong, and that
the regular expression was right. But I was able to explain my reasoning, by
referencing directly to the contents of RFC 2279, and he was unable to
explain why he thought I was wrong, instead sprinkling more URLs to some
apparently orphaned web pages that said something else.

Which raised an obvious question: if he was so sure that his regular
expressions were correct, why was he asking? What exactly is the part of RFC
2279 that he didn't understand?

It seems to be his personality trait: when he asks a question, he thinks he
knows what the answer is, and every other answer is wrong. I can't figure
out what the real reason for asking the question must be, but I think I
really don't want to know the answer.

It remains to be seen how long it will take him to figure out that the
difficulty he has in getting someone to answer this might be, just might be,
due to the simple fact that this is one of these things that can be answered
simply by RTFMing. Really, UTF-8 is not some patented trade secret. Its
specifications are openly available, to anyone who wants to read them. And
anyone who reads them should be able to figure out the correct regexp for
themselves. It's not rocket science.

Amusingly, he's been trying to find the answer to this question longer than
it took myself, originally, to read RFC 2279, and implement encoding and
decoding of Unicode using UTF-8. In C++. Well, in C actually, but it's still
technically valid C++. Which, I guess, makes this on-topic, under the new
rules that just came down, by fiat.

From: Peter Olcott on 13 May 2010 19:22

On 5/13/2010 6:12 PM, Sam wrote:
> Victor Bazarov writes:
>
>> On 5/13/2010 6:04 PM, Peter Olcott wrote:
>>> "Ian Collins"<ian-news(a)hotmail.com> wrote in message
>>> news:8539h9F7f1U1(a)mid.individual.net...
>>>> On 05/14/10 08:06 AM, Peter Olcott wrote:
>>>>> Is this Regular Expression for UTF-8 Correct??
>>>>
>>>> It's a fair bet you are off-topic in all the groups you
>>>> have cross posted to. Why don't you pick a group for a
>>>> language with built in UTF8 and regexp support (PHP?) and
>>>> badger them?
>>>>
>>>> --
>>>> Ian Collins
>>>
>>> What does this question have to do with the C++ language?
>>
>> It does not have to have anything to do with C++. A post on the
>> topicality of another post is *always on topic*.
>>
>>> At least my question is indirectly related to C++ by making
>>> a utf8string for the C++ language from the regular
>>> expression.
>>
>> <sarcasm>
>> I am about to hold a party where I expect my colleagues to show up.
>> They are all C++ programmers. Would the question on what to feed them,
>> or whether 1970s pop music is going to be appropriate, be on topic in
>> comp.lang.c++? It's *indirectly related* to C++, isn't it?
>> </sarcasm>
>>
>>> Your question is not even indirectly related to the C++
>>> language.
>>
>> See above.
>
> This guy is a tool. He re-posted this question a second time because
> when he first posted that snippet nobody cared either. But after
> watching the struggle in the original thread, the ugly carnage appealed
> to the infinitesimally small humanitarian aspect of my psyche
> sufficiently enough to motivate myself into actually looking at the
> regexp monstrosity. But after I explained why that spaghetti of a regexp
> does not jibe with RFC 2279, he got all huffy about it. He was confident
> that I was wrong, and that the regular expression was right. But I was
> able to explain my reasoning, by referencing directly to the contents of
> RFC 2279, and he was unable to explain why he thought I was wrong,
> instead sprinkling more URLs to some apparently orphaned web pages that
> said something else.
>
> Which raised an obvious question: if he was so sure that his regular
> expressions were correct, why was he asking? What exactly is the part of
> RFC 2279 that he didn't understand?
>
> It seems to be his personality trait: when he asks a question, he thinks
> he knows what the answer is, and every other answer is wrong. I can't
> figure out what the real reason for asking the question must be, but I
> think I really don't want to know the answer.
>
> It remains to be seen how long it will take him to figure out that the
> difficulty he has in getting someone to answer this might be, just might
> be, due to the simple fact that this is one of these things that can be
> answered simply by RTFMing. Really, UTF-8 is not some patented trade
> secret. Its specifications are openly available, to anyone who wants to
> read them. And anyone who reads them should be able to figure out the
> correct regexp for themselves. It's not rocket science.
>
> Amusingly, he's been trying to find the answer to this question longer
> than it took myself, originally, to read RFC 2279, and implement
> encoding and decoding of Unicode using UTF-8. In C++. Well, in C
> actually, but it's still technically valid C++. Which, I guess, makes
> this on-topic, under the new rules that just came down, by fiat.
>
>

This time I found the original source of a semantically identical
regular expression that you berated so rudely.
http://www.w3.org/2005/03/23-lex-U

Who knows, maybe www.w3.org is wrong and you are right?

From: Sam on 13 May 2010 19:40

Peter Olcott writes:

> On 5/13/2010 6:12 PM, Sam wrote:
>> Victor Bazarov writes:
>>
>>> On 5/13/2010 6:04 PM, Peter Olcott wrote:
>>>> "Ian Collins"<ian-news(a)hotmail.com> wrote in message
>>>> news:8539h9F7f1U1(a)mid.individual.net...
>>>>> On 05/14/10 08:06 AM, Peter Olcott wrote:
>>>>>> Is this Regular Expression for UTF-8 Correct??
>>>>>
>>>>> It's a fair bet you are off-topic in all the groups you
>>>>> have cross posted to. Why don't you pick a group for a
>>>>> language with built in UTF8 and regexp support (PHP?) and
>>>>> badger them?
>>>>>
>>>>> --
>>>>> Ian Collins
>>>>
>>>> What does this question have to do with the C++ language?
>>>
>>> It does not have to have anything to do with C++. A post on the
>>> topicality of another post is *always on topic*.
>>>
>>>> At least my question is indirectly related to C++ by making
>>>> a utf8string for the C++ language from the regular
>>>> expression.
>>>
>>> <sarcasm>
>>> I am about to hold a party where I expect my colleagues to show up.
>>> They are all C++ programmers. Would the question on what to feed them,
>>> or whether 1970s pop music is going to be appropriate, be on topic in
>>> comp.lang.c++? It's *indirectly related* to C++, isn't it?
>>> </sarcasm>
>>>
>>>> Your question is not even indirectly related to the C++
>>>> language.
>>>
>>> See above.
>>
>> This guy is a tool. He re-posted this question a second time because
>> when he first posted that snippet nobody cared either. But after
>> watching the struggle in the original thread, the ugly carnage appealed
>> to the infinitesimally small humanitarian aspect of my psyche
>> sufficiently enough to motivate myself into actually looking at the
>> regexp monstrosity. But after I explained why that spaghetti of a regexp
>> does not jibe with RFC 2279, he got all huffy about it. He was confident
>> that I was wrong, and that the regular expression was right. But I was
>> able to explain my reasoning, by referencing directly to the contents of
>> RFC 2279, and he was unable to explain why he thought I was wrong,
>> instead sprinkling more URLs to some apparently orphaned web pages that
>> said something else.
>>
>> Which raised an obvious question: if he was so sure that his regular
>> expressions were correct, why was he asking? What exactly is the part of
>> RFC 2279 that he didn't understand?
>>
>> It seems to be his personality trait: when he asks a question, he thinks
>> he knows what the answer is, and every other answer is wrong. I can't
>> figure out what the real reason for asking the question must be, but I
>> think I really don't want to know the answer.
>>
>> It remains to be seen how long it will take him to figure out that the
>> difficulty he has in getting someone to answer this might be, just might
>> be, due to the simple fact that this is one of these things that can be
>> answered simply by RTFMing. Really, UTF-8 is not some patented trade
>> secret. Its specifications are openly available, to anyone who wants to
>> read them. And anyone who reads them should be able to figure out the
>> correct regexp for themselves. It's not rocket science.
>>
>> Amusingly, he's been trying to find the answer to this question longer
>> than it took myself, originally, to read RFC 2279, and implement
>> encoding and decoding of Unicode using UTF-8. In C++. Well, in C
>> actually, but it's still technically valid C++. Which, I guess, makes
>> this on-topic, under the new rules that just came down, by fiat.
>>
>>
>
> This time I found the original source of a semantically identical
> regular expression that you berated so rudely.
> http://www.w3.org/2005/03/23-lex-U
>
> Who knows, maybe www.w3.org is wrong and you are right?

And as I wrote in the first thread, I suspected that the regular expression
mish-mash's actual purpose was to validate some defined subset of the
entire Unicode range, as encoded in UTF-8.

See message <cone.1273539693.340713.2085.500(a)commodore.email-scan.com>,
where I wrote:

> I think what that regexp really does is match a subset of all valid
> UTF-8 sequences that corresponds with a subset of Unicodes that the
> author was interested in. It doesn't match all valid UTF-8 sequences,
> which the non-regexp version does.

And reading the "www.w3.org" link, it's clear that's exactly what it does,
and what the criteria are. Still, you replied as follows, in
<cvydnfcDPJR5MHXWnZ2dnUVZ_vOdnZ2d(a)giganews.com>:

> I think that your understanding might be less than complete. If you read
> the commentary you will see that your view is not supported.

Obviously, it's your thoughts that turned out to be "less than complete". That
regular expression does not validate whether an arbitrary octet stream is a
UTF-8-encoded Unicode value sequence. That regular expression checks whether
an arbitrary octet stream is a UTF-8-encoded Unicode value sequence
and all Unicode values belong to a specific, defined subset of the entire
Unicode value range.
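
To make that distinction concrete, here is a minimal Python sketch. It is not
the regexp under discussion. The function names and the example subset
(everything except C0 control characters other than TAB, LF and CR) are
hypothetical, picked only for illustration, and the well-formedness check
relies on Python's built-in strict UTF-8 decoder rather than on a regexp.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """True if the octets decode under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def is_utf8_within_subset(data: bytes, allowed) -> bool:
    """True if the octets are well-formed UTF-8 AND every decoded
    code point satisfies the caller-supplied predicate `allowed`."""
    try:
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return all(allowed(ord(ch)) for ch in text)


def example_subset(cp: int) -> bool:
    # Hypothetical subset for illustration: reject C0 control
    # characters except TAB, LF and CR; accept everything else.
    return cp in (0x09, 0x0A, 0x0D) or cp >= 0x20
```

A single NUL octet shows the difference: it is well-formed UTF-8, so the
first check accepts it, while the second check with example_subset rejects
it because U+0000 is outside the chosen subset.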

From: Peter Olcott on 13 May 2010 19:58

On 5/13/2010 6:40 PM, Sam wrote:
>> This time I found the original source of a semantically identical
>> regular expression that you berated so rudely.
>> http://www.w3.org/2005/03/23-lex-U
>>
>> Who knows, maybe www.w3.org is wrong and you are right?
>
> And as I wrote in the first thread, I suspected that the regular
> expression mish-mash's actual purpose was to validate some defined
> subset of the entire Unicode range, as encoded in UTF-8.

And this view is clearly incorrect. It validates the entire set of
UTF-8 encodings. Here is a quote:

"This pattern does not restrict to the set of
defined UCS characters, instead to the set that
is permitted by UTF-8 encoding."

The only difference is that the surrogate range D800-DFFF (the high and low
surrogates), which is not legal in UTF-8, is left out. All of the other
code points from 0 to 10FFFF are represented.
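
For readers who want to test that claim against something concrete, here is a
hedged Python sketch of a byte-level pattern covering 0-10FFFF minus the
surrogates. It is not copied from the W3C page linked above. It assumes the
byte ranges of RFC 3629, which superseded RFC 2279 and limits UTF-8 to
10FFFF. The names UTF8_WELL_FORMED and matches_utf8_pattern are made up for
this example.

```python
import re

# Each alternative matches one encoded code point. Overlong forms,
# surrogates (D800-DFFF) and values above 10FFFF have no alternative.
UTF8_WELL_FORMED = re.compile(
    rb"""\A(?:
        [\x00-\x7F]                        # U+0000..U+007F
      | [\xC2-\xDF][\x80-\xBF]             # U+0080..U+07FF
      | \xE0[\xA0-\xBF][\x80-\xBF]         # U+0800..U+0FFF
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # U+1000..U+CFFF, U+E000..U+FFFF
      | \xED[\x80-\x9F][\x80-\xBF]         # U+D000..U+D7FF (stops before surrogates)
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # U+10000..U+3FFFF
      | [\xF1-\xF3][\x80-\xBF]{3}          # U+40000..U+FFFFF
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # U+100000..U+10FFFF
    )*\Z""",
    re.VERBOSE,
)


def matches_utf8_pattern(data: bytes) -> bool:
    """True if every octet of data belongs to a well-formed sequence."""
    return UTF8_WELL_FORMED.match(data) is not None
```

With this sketch, the octets ED A0 80 (what D800 would encode to) and
F4 90 80 80 (110000) are rejected, while ED 9F BF (D7FF) and F4 8F BF BF
(10FFFF) are accepted. Whether the pattern on the W3C page draws the same
boundaries is something to check against that page itself.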
