Is this Regular Expression for UTF-8 Correct?? [MFC]


From: Peter Olcott on 29 May 2010 10:01

On 5/28/2010 1:22 PM, Liviu wrote:
> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote...
>> On 5/28/2010 12:37 PM, Liviu wrote:
>>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote...
>>>> On 5/28/2010 11:52 AM, Liviu wrote:
>>>>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote...
>>>>>>
>>>>>> You are referring to the fact that I don't bother to invoke it in
>>>>>> main()? That was not an error.
>>>>>
>>>>> No, not that. Why do you have to _guess_ anyway? Just lower
>>>>> yourself to actually try and test it with any non-ASCII input.
>>>>
>>>> I have other priorities right now. I will exhaustively test it once
>>>> I derive the UTF32toUTF8 function. I need this function to generate
>>>> my test data.
>>>
>>> You really mean to generate test data using another (untested)
>>> function of yours? Brilliant.
>>
>> If I generate every possible valid CodePoint and translate to and from
>> UTF-8 and get the same value that I send in back out this will prove
>> with very high reliability that both functions are correct.
>
> ...and the following code demonstrates my novel implementation of the
> increment/decrement arithmetic, provably faster than all prior art, and
> which I deem to be correct "with very high reliability" ;-)
>
> inline int inc(int n) { return n; }
> inline int dec(int n) { return n; }
>
> int main(void)
> {
>     for(int n = 0; ++n; )
>         if(n != inc(dec(n)) || n != dec(inc(n)))
>             return -1; // failed
>     return 0; // verified ok
> }
>
> Liviu
>
>
>
Another way to test my function would be to use a large sample of
Chinese UTF-8 and compare this against another UTF-8 decoder. Finding a
large sample of Chinese UTF-8 would take me longer than I want to spend.
Also this way is not exhaustive because it would not test every
CodePoint, whereas my proposal does test every CodePoint.
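
A minimal sketch of the exhaustive round-trip test proposed above, assuming hypothetical names (`utf32_to_utf8`, `utf8_to_utf32`, `count_round_trip_failures`, `known_vectors_ok`); none of this is the code from UTF8.cpp. Because a round trip alone would also pass for a pair of functions that are wrong in mutually cancelling ways (the point of the inc/dec example), the sketch also checks the encoder against a few fixed, well-known UTF-8 encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical encoder: Unicode scalar value -> UTF-8 bytes.
// Returns an empty string for surrogates and values above 0x10FFFF.
std::string utf32_to_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical decoder for exactly one well-formed sequence (no validation).
std::uint32_t utf8_to_utf32(const std::string& s) {
    assert(!s.empty());
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    std::uint32_t cp;
    std::size_t extra;  // number of continuation bytes that follow
    if (b0 < 0x80)      { cp = b0;        extra = 0; }
    else if (b0 < 0xE0) { cp = b0 & 0x1F; extra = 1; }
    else if (b0 < 0xF0) { cp = b0 & 0x0F; extra = 2; }
    else                { cp = b0 & 0x07; extra = 3; }
    assert(s.size() == extra + 1);
    for (std::size_t i = 1; i <= extra; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
    return cp;
}

// Round trip over every Unicode scalar value; returns the number of mismatches.
std::uint32_t count_round_trip_failures() {
    std::uint32_t failures = 0;
    for (std::uint32_t cp = 0; cp <= 0x10FFFF; ++cp) {
        if (cp >= 0xD800 && cp <= 0xDFFF) continue;  // skip surrogates
        if (utf8_to_utf32(utf32_to_utf8(cp)) != cp) ++failures;
    }
    return failures;
}

// Independent check against fixed encodings: U+0024, U+00A2, U+20AC, U+10348.
bool known_vectors_ok() {
    return utf32_to_utf8(0x24)    == "\x24"
        && utf32_to_utf8(0xA2)    == "\xC2\xA2"
        && utf32_to_utf8(0x20AC)  == "\xE2\x82\xAC"
        && utf32_to_utf8(0x10348) == "\xF0\x90\x8D\x88";
}
```

The round trip only shows that the two functions are inverses of each other; the fixed vectors are what tie the encoder to the actual UTF-8 format.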

From: Peter Olcott on 29 May 2010 23:02

On 5/28/2010 11:52 AM, Liviu wrote:
> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote...
>> On 5/28/2010 11:22 AM, Liviu wrote:
>>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote...
>>>>
>>>> http://www.ocr4screen.com/UTF8.cpp
>>>
>>> Now maybe if you tried to actually test it, you'd find the next
>>> obvious error, painfully obvious to anyone even remotely fluent in
>>> C/C++. Which is even more odd since I thought you were writing
>>> code so perfectly designed that it needed virtually no debugging.
>>
>> You are referring to the fact that I don't bother to invoke it in
>> main()? That was not an error. The only reason that included main()
>> was so that the compiler would not complain. It is intended to be
>> used as a header file.
>
> No, not that. Why do you have to _guess_ anyway? Just lower yourself
> to actually try and test it with any non-ASCII input.
>
> Liviu
>
>

Here is the original:
http://www.ocr4screen.com/UTF8_ORIG.cpp

Here is the logically correct one, the only errors were:
(1) Make member functions public
(2) Change row to col on the second loop
(3) Change && to &
http://www.ocr4screen.com/UTF8.cpp

Aside from these trivial and typographical errors the class worked
correctly the first time without any debugging. There are two
enhancements that need to be made. Can you guess what they are?

From: Liviu on 30 May 2010 04:08

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote...
> On 5/28/2010 11:52 AM, Liviu wrote:
>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote...
>>> On 5/28/2010 11:22 AM, Liviu wrote:
>>>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote...
>>>>>
>>>>> http://www.ocr4screen.com/UTF8.cpp
>>>>
>>>> Now maybe if you tried to actually test it, you'd find the next
>>>> obvious error, painfully obvious to anyone even remotely fluent in
>>>> C/C++. Which is even more odd since I thought you were writing
>>>> code so perfectly designed that it needed virtually no debugging.
>>>
>>> You are referring to the fact that I don't bother to invoke it in
>>> main()? That was not an error. The only reason that included main()
>>> was so that the compiler would not complain. It is intended to be
>>> used as a header file.
>>
>> No, not that. Why do you have to _guess_ anyway? Just lower yourself
>> to actually try and test it with any non-ASCII input.
>
> Here is the original:
> http://www.ocr4screen.com/UTF8_ORIG.cpp

Not exactly. The original, before you rushed out what you _now_
present as the original (after Pete Delgado deservedly mocked you)
(a) had no copyright notice (funny that was your first worry), and
(b) had a handful more infinite loops. One out of several such,
lines #151-152 in your original utf8.cpp were...

|| for (col = 1; row < 0x7F; col++)
||     States[0][col] = FirstByteOfOneByte;

> Here is the logically correct one, the only errors were:
> (1) Make member functions public
> (2) Change row to col on the second loop

I missed that particular loop on the last read. But there were 10 other
infinite loops in your original code as posted. And, as word goes,
it doesn't take more than one to ruin the best performance ;-)

> (3) Change && to &

Anyone with a modicum of C/C++ fluency would recognize a construct
of "x |= y && z" as highly suspect and 99% wrong. The remaining 1%
allowance would be for cases where "x" is a "bool" and one tried to
outsmart the language around the missing "||=" operator. None of that
applies here, so you were just confusing logical vs. bitwise operators.
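
A minimal sketch of the difference, assuming hypothetical function names and a continuation-byte mask of 0x3F; the exact expression in UTF8.cpp is not quoted in this thread, so the operands here are only illustrative.

```cpp
#include <cstdint>

// Logical AND: "byte && 0x3F" is a bool, so x can only ever gain bit 0.
// (Compilers may warn about using a constant in a logical expression.)
std::uint32_t merge_logical(std::uint32_t x, std::uint8_t byte) {
    x |= byte && 0x3F;
    return x;
}

// Bitwise AND: keeps the low six payload bits of the byte.
std::uint32_t merge_bitwise(std::uint32_t x, std::uint8_t byte) {
    x |= byte & 0x3F;
    return x;
}
```

With `byte` equal to 0xA2, the logical version contributes 1 and the bitwise version contributes 0x22, which is why a decoder with the logical operator cannot get past pure ASCII.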

> Aside from these trivial and typographical errors

Typographical? Maybe you should turn off MS Word's auto-correct while
you are writing C++ code in it ;-)

> the class worked correctly the first time without any debugging.

...and lest I needed air and my bicycle could fly, I'd be on the moon
now. You must have a very peculiar notion of "worked correctly the
first time". Good luck selling that.

> There are two enhancements that need to be made.
> Can you guess what they are?

No, and I have no interest in playing second-guess. Plus, I am no real
expert in UTF-8 if that's what you are after.

As far as C++ in general, your class is stateless and not abstract.
You could as well use global functions, or a namespace just for scoping.

As far as style, goto'ing out of the switch instead of a proper loop is
in bad C taste, as has been noted before.

As far as efficiency, using a dynamic Array2D<uint8_t> for what is
essentially a static constant 2D array is an overkill. And the
constructor could be written better. Though, assuming it's a
singleton, it may not matter much in the grand scheme of things.

As far as functionality, the code seems to be very cavalier towards
malformed input.

Liviu

From: Peter Olcott on 30 May 2010 10:16

On 5/30/2010 3:08 AM, Liviu wrote:
> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote...
>> On 5/28/2010 11:52 AM, Liviu wrote:
>>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote...
>>>> On 5/28/2010 11:22 AM, Liviu wrote:
>>>>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote...
>>>>>>
>>>>>> http://www.ocr4screen.com/UTF8.cpp
>>>>>
>>>>> Now maybe if you tried to actually test it, you'd find the next
>>>>> obvious error, painfully obvious to anyone even remotely fluent in
>>>>> C/C++. Which is even more odd since I thought you were writing
>>>>> code so perfectly designed that it needed virtually no debugging.
>>>>
>>>> You are referring to the fact that I don't bother to invoke it in
>>>> main()? That was not an error. The only reason that included main()
>>>> was so that the compiler would not complain. It is intended to be
>>>> used as a header file.
>>>
>>> No, not that. Why do you have to _guess_ anyway? Just lower yourself
>>> to actually try and test it with any non-ASCII input.
>>
>> Here is the original:
>> http://www.ocr4screen.com/UTF8_ORIG.cpp
>
> Not exactly. The original, before you rushed out what you _now_
> present as the original (after Pete Delgado deservedly mocked you)
> (a) had no copyright notice (funny that was your first worry), and
> (b) had a handful more infinite loops. One out of several such,
> lines #151-152 in your original utf8.cpp were...
>
> || for (col = 1; row < 0x7F; col++)
> ||     States[0][col] = FirstByteOfOneByte;

By original I am referring to the last posting on the 27th of May. I
told Hector that even though I had already won his bet before he stated
his bet that I will give him a break and not count it as welshing on the
bet until I provided a correct program on the 27th.

Referring to this code as the "original" was a short-hand way of saying
all of the above.

>
>> Here is the logically correct one, the only errors were:
>> (1) Make member functions public
>> (2) Change row to col on the second loop
>
> I missed that particular loop on the last read. But there were 10 other
> infinite loops in your original code as posted. And, as word goes,
> it doesn't take more than one to ruin the best performance ;-)

Yes and I got all but one of them with zero testing.

>
>> (3) Change && to &
>
> Anyone with a modicum of C/C++ fluency would recognize a construct
> of "x |= y && z" as highly suspect and 99% wrong. The remaining 1%
> allowance would be for cases where "x" is a "bool" and one tried to
> outsmart the language around the missing "||=" operator. None of that
> applies here, so you were just confusing logical vs. bitwise operators.
>
>> Aside from these trivial and typographical errors
>
> Typographical? Maybe you should turn off MS Word's auto-correct while
> you are writing C++ code in it ;-)

Since I almost always use the && operator (since K&R was the de facto
standard "C") I merely typed && when I meant &.

>
>> the class worked correctly the first time without any debugging.
>
> ...and lest I needed air and my bicycle could fly, I'd be on the moon
> now. You must have a very peculiar notion of "worked correctly the
> first time". Good luck selling that.

If you just take the exactingly precise literal meaning of my words it
will be completely obvious that the statement is entirely 100% true.

Testing uncovered two typographical errors (the reason for the infinite
loop was that I cut and pasted a loop with one variable, and then
changed two of the three instances of the one variable to the other) and
one compile time error (it was not apparent that I needed to declare
the member functions public until I tried to invoke these member functions).

As I already said aside from these three trivial errors the code did
indeed work correctly the very first time. This is the way all of my
code is. That is why I only need to spend a total of 5% of my time on
testing and debugging combined.

>
>> There are two enhancements that need to be made.
>> Can you guess what they are?
>
> No, and I have no interest in playing second-guess. Plus, I am no real
> expert in UTF-8 if that's what you are after.
>
> As far as C++ in general, your class is stateless and not abstract.
> You could as well use global functions, or a namespace just for scoping.

The class is not stateless. It must dynamically create its state
transition matrix in its constructor. I could have used a global table
(sloppy; I always keep my class data encapsulated). I could have used a
static local table (takes up too much stack memory).

I always encode my classes so that they will fit on the stack. By doing
this I obtain another level of encapsulation (only the functions needing
the objects have access to them). Also I eliminate the need for dynamic
memory allocation. By eliminating the need for dynamic memory
allocation, I also eliminate the possibility of dynamic memory
allocation errors.

>
> As far as style, goto'ing out of the switch instead of a proper loop is
> in bad C taste, as has been noted before.

When I was in school, gotos were a swear-word that were never used in
programs: http://en.wikipedia.org/wiki/Edsger_W._Dijkstra

This almost always made perfect sense. The one exception that I found in
my whole life was to replace the break statement within a switch
statement (which is essentially a goto that jumps to the bottom) with a
goto that jumps to the top. I only do this in time critical code.

There is no increase in complexity, (a consistent jump to the bottom is
exactly as complex as a consistent jump to the top) and it eliminates a
redundant double jump. (jump to the bottom to a jump to the top)
Compiler optimizers might now be smart enough to eliminate this
redundant double jump. They were not smart enough when I began this
practice.
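
A minimal sketch of the two control-flow shapes being argued about, using a hypothetical code point counter rather than the decoder from UTF8.cpp. Both functions assume well-formed UTF-8 and count every byte that is not a 10xxxxxx continuation byte.

```cpp
#include <cstddef>

// Shape 1: each case jumps straight back to the top with goto.
std::size_t count_code_points_goto(const unsigned char* p, const unsigned char* end) {
    std::size_t n = 0;
top:
    if (p == end) return n;
    switch (*p++ >> 6) {
        case 2:  goto top;        // 10xxxxxx: continuation byte, not counted
        default: ++n; goto top;   // any other byte starts a code point
    }
}

// Shape 2: ordinary loop; each case breaks out of the switch and the loop repeats.
std::size_t count_code_points_loop(const unsigned char* p, const unsigned char* end) {
    std::size_t n = 0;
    for (; p != end; ++p) {
        switch (*p >> 6) {
            case 2:  break;       // continuation byte, not counted
            default: ++n; break;  // lead byte or ASCII
        }
    }
    return n;
}
```

The two return the same result for the same input. Whether they compile to different machine code depends on the compiler and optimization level, so any speed claim would need to be measured rather than assumed.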

>
> As far as efficiency, using a dynamic Array2D<uint8_t> for what is
> essentially a static constant 2D array is an overkill. And the
> constructor could be written better. Though, assuming it's a
> singleton, it may not matter much in the grand scheme of things.

It is not a singleton. I use two dimensional std::vectors all the time.

>
> As far as functionality, the code seems to be very cavalier towards
> malformed input.

There are several different opposing views on this. Since the purpose of
this class was very fast processing of UTF-8 input for benchmarking
purposes, robust error handling was explicitly out of scope.

>
> Liviu
>
>

From: Liviu on 30 May 2010 23:11

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote...
> On 5/30/2010 3:08 AM, Liviu wrote:
>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote...
>>>
>>> Here is the original:
>>> http://www.ocr4screen.com/UTF8_ORIG.cpp
>>
>> Not exactly. The original, before you rushed out what you _now_
>> present as the original (after Pete Delgado deservedly mocked you)
>
> By original I am referring to the last posting on the 27th of May.

At the time you posted on the 27th the link went to a different .cpp
file, which was _not_ the same as this utf8_orig.cpp you are claiming
now as the "original". Of course, the file itself was hosted on your
server, still is, and you can change its contents as often as you wish.
But you can't undo what you posted and others may have already
downloaded, so better be honest about it.

> Since I almost always use the && operator (since K&R was
> the de facto standard "C") I merely typed && when I meant &.

Are you _still_ confused? The distinction between logical && and
bitwise & operators hasn't changed since the beginning of C.

> As I already said aside from these three trivial errors the code did
> indeed work correctly the very first time.

The code didn't compile, then ran into infinite loops, then failed to
convert anything other than pure ASCII, then at long last may be
doing something remotely meaningful, however inefficiently, but still
lacks the "validate" part which you originally stated as a goal.

|| My method can completely validate any UTF-8 sequence of
|| bytes and decode it into its corresponding code point values in
|| fewer machine clock cycles than any possible alternative

Yet, you call that "work correctly the very first time". Oh well,
good luck with that notion of "correctly" in your future endeavors.
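
A minimal sketch of what the "validate" part could involve, assuming a hypothetical function `decode_validated`; it is not taken from UTF8.cpp. It rejects truncated sequences, bad continuation bytes, overlong encodings, surrogates, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>

// Decodes one sequence starting at p (len bytes available).
// Returns the number of bytes consumed (1 to 4), or 0 if the input is malformed.
// On failure the value left in cp is unspecified.
std::size_t decode_validated(const unsigned char* p, std::size_t len, std::uint32_t& cp) {
    if (len == 0) return 0;
    const unsigned char b0 = p[0];
    std::size_t need;        // continuation bytes required
    std::uint32_t minimum;   // smallest value allowed for this sequence length
    if (b0 < 0x80) { cp = b0; return 1; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; need = 1; minimum = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; need = 2; minimum = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; need = 3; minimum = 0x10000; }
    else return 0;           // stray continuation byte or invalid lead byte

    if (len < need + 1) return 0;                    // truncated sequence
    for (std::size_t i = 1; i <= need; ++i) {
        if ((p[i] & 0xC0) != 0x80) return 0;         // not a continuation byte
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < minimum) return 0;                      // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;      // UTF-16 surrogate
    if (cp > 0x10FFFF) return 0;                     // beyond the Unicode range
    return need + 1;
}
```

For example, the bytes C0 80 (an overlong encoding of U+0000) and ED A0 80 (the surrogate U+D800) are both rejected, while E2 82 AC decodes to U+20AC.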

> The class is not stateless. It must dynamically create its state
> transition matrix in its constructor.

Assuming you created multiple instances of that class, all objects
would hold the exact same transition matrix and would be identical
to each other for all functional purposes. In that sense, the class is
stateless. I just didn't have enough imagination to fathom that you'd
contemplate instantiating more than one static object of that class.

> I always encode my classes so that they will fit on the stack.
> By doing this [...] Also I eliminate the need for dynamic memory
> allocation.

You never know how large the (remaining) stack is, so you shouldn't
code for that. Also, when your code calls "States.resize(7, 256);"
for example, then that's a dynamic memory allocation right there.
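
A minimal sketch of the alternative: one shared, fixed-size table inside a namespace, with no heap allocation. The 7 x 256 shape comes from the `States.resize(7, 256)` call quoted above; the namespace, the enum values, and the row 0 contents are assumptions made for illustration, and the other rows are left unfilled.

```cpp
#include <array>
#include <cstdint>

namespace utf8_sketch {

// Hypothetical byte classes for the first byte of a sequence.
enum ByteClass : std::uint8_t {
    Invalid = 0, OneByte, TwoByteLead, ThreeByteLead, FourByteLead
};

using Table = std::array<std::array<std::uint8_t, 256>, 7>;

// Builds the table by value; only row 0 is filled in this sketch.
inline Table make_table() {
    Table t{};  // zero-initialised, so every entry starts as Invalid
    for (int col = 0x00; col <= 0x7F; ++col) t[0][col] = OneByte;
    for (int col = 0xC2; col <= 0xDF; ++col) t[0][col] = TwoByteLead;
    for (int col = 0xE0; col <= 0xEF; ++col) t[0][col] = ThreeByteLead;
    for (int col = 0xF0; col <= 0xF4; ++col) t[0][col] = FourByteLead;
    return t;
}

// One table for the whole program, built on first use, in static storage.
inline const Table& states() {
    static const Table table = make_table();
    return table;
}

inline std::uint8_t classify_first_byte(std::uint8_t b) {
    return states()[0][b];
}

}  // namespace utf8_sketch
```

Since every instance of the class would hold the same matrix, a single function-local static like this gives the same behavior while keeping the data out of the global namespace.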

> It is not a singleton. I use two dimensional std::vectors

I meant singleton in the sense of a class designed to only have one
object of its type ever instantiated.

Liviu
