From: Peter Olcott on
On 5/18/2010 9:26 AM, Oliver Regenfelder wrote:
> Hello,
>
> Peter Olcott wrote:
>> I completed the detailed design of the DFA that will validate any valid
>> UTF-8 byte sequence and translate it into UTF-32. It cannot be done
>> faster or more simply. The state transition matrix takes exactly 2 KB.
>
> Who cares about DFAs and state transition matrix sizes when all you want
> to do is convert UTF-8 to UTF-32? That is just some if/else and switch
> statements in your programming language of choice, plus error handling.
>
> Best regards,
>
> Oliver


I most often (and sometimes too often) try to write the best possible
code. The DFA-based finite state machine will perform at least twice as
fast as the alternatives, and as much as ten times faster. Since its basic
structure is of minimal complexity, it can be coded with 100% reliability
almost mindlessly.
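
To make that concrete, here is a minimal sketch of the kind of table-driven
decoder being described. The state names, the 8 x 256 matrix layout, and the
start-up table construction are assumptions made for illustration, not the
actual (as yet unpublished) code; they are simply one way to lay out such a
DFA so that the transition matrix occupies exactly 2048 bytes:

#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

namespace {

// States 0..7 are rows of the transition matrix; REJ is a sentinel with no row.
enum : uint8_t {
    ACC = 0,  // start/accept: the previous code point is complete
    C1,       // expect 1 more continuation byte (0x80..0xBF)
    C2,       // expect 2 more continuation bytes
    C3,       // expect 3 more continuation bytes
    E0,       // after lead 0xE0: second byte must be 0xA0..0xBF (no overlongs)
    ED,       // after lead 0xED: second byte must be 0x80..0x9F (no surrogates)
    F0,       // after lead 0xF0: second byte must be 0x90..0xBF (no overlongs)
    F4,       // after lead 0xF4: second byte must be 0x80..0x8F (<= U+10FFFF)
    REJ       // invalid input
};

// 8 states x 256 byte values x 1 byte per entry = 2048 bytes.
uint8_t g_next[8][256];

void BuildTransitionMatrix() {
    std::memset(g_next, REJ, sizeof g_next);
    for (int b = 0x00; b <= 0x7F; ++b) g_next[ACC][b] = ACC;  // ASCII
    for (int b = 0xC2; b <= 0xDF; ++b) g_next[ACC][b] = C1;   // 2-byte leads
    for (int b = 0xE1; b <= 0xEF; ++b) g_next[ACC][b] = C2;   // 3-byte leads
    for (int b = 0xF1; b <= 0xF3; ++b) g_next[ACC][b] = C3;   // 4-byte leads
    g_next[ACC][0xE0] = E0;   // leads whose second byte is range-restricted
    g_next[ACC][0xED] = ED;
    g_next[ACC][0xF0] = F0;
    g_next[ACC][0xF4] = F4;
    for (int b = 0x80; b <= 0xBF; ++b) {                      // continuations
        g_next[C1][b] = ACC;
        g_next[C2][b] = C1;
        g_next[C3][b] = C2;
        if (b >= 0xA0) g_next[E0][b] = C1;
        if (b <= 0x9F) g_next[ED][b] = C1;
        if (b >= 0x90) g_next[F0][b] = C2;
        if (b <= 0x8F) g_next[F4][b] = C2;
    }
}

// Fill the table once, before main() runs.
struct TableInit { TableInit() { BuildTransitionMatrix(); } } const g_table_init;

}  // namespace

// Validates 'in' as UTF-8 and appends the decoded code points to 'out'.
// Returns false on the first malformed byte or on a truncated tail.
bool Utf8ToUtf32(const std::string& in, std::vector<char32_t>* out) {
    uint8_t st = ACC;
    char32_t cp = 0;
    for (unsigned char b : in) {
        if (st == ACC) {                         // b is a lead byte (or ASCII)
            cp = b < 0x80 ? b : b < 0xE0 ? (b & 0x1F)
               : b < 0xF0 ? (b & 0x0F) : (b & 0x07);
        } else {                                 // b is a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        st = g_next[st][b];                      // one table lookup per byte
        if (st == REJ) return false;
        if (st == ACC) out->push_back(cp);       // code point complete
    }
    return st == ACC;
}

With this layout the hot loop does one table lookup, one shift, and one mask
per input byte. For example, the byte sequence 61 62 63 C3 A9 decodes to the
four code points U+0061, U+0062, U+0063, U+00E9, while any overlong form,
surrogate, or lead byte above 0xF4 drives the machine into REJ and the
function returns false.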
From: Peter Olcott on
On 5/18/2010 9:29 AM, Oliver Regenfelder wrote:
> Hello,
>
> Peter Olcott wrote:
>> Maybe it is much simpler for me than it would be for others because of
>> my strong bias towards DFA recognizers.
>
> I would say it is exactly the opposite. Your strong bias towards DFA
> recognizers lets you completely forget about the current abstraction
> level you are dealing with.
>
>> I bet my DFA recognizer is at least twice as fast as any other method
>> for validating UTF-8 and converting it to code points.
>> I am estimating about 20 machine clocks per code point.
>
> You might want to reread some of the postings regarding optimization
> from the earlier threads.
>
> Have you been a hardware engineer before, by any chance?
>
> Best regards,
>
> Oliver

No, but I always try to keep the hardware in mind when engineering the
software.
From: Peter Olcott on
On 5/18/2010 9:34 AM, James Kanze wrote:
> On 17 May, 14:08, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>> On 5/17/2010 1:35 AM, Mihai N. wrote:
>> Do you know of any faster way to validate and divide a UTF-8 sequence
>> into its constituent code point parts than a regular expression
>> implemented as a finite state machine? (please don't cite a software
>> package, I am only interested in the underlying methodology).
>
> If the goal is only to get to the start of the next character,
> something like:
> p += byteCount(*p);
> is a lot faster; you don't even have to look at the intermediate
> characters. If you don't know the validity of the UTF-8, and
> want to check that at the same time, then I'd still start with
> something like that. For checking for non-canonical
> representations and such, regular expressions might be useful,
> but they're likely to be a bit more complicated than what you
> present.

No, it is apparently very simple. I have the DFA already encoded.
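
For reference, one plausible reading of the byteCount() helper quoted above
(the name comes from the quoted post; the body below is an assumption, and it
presumes the buffer has already been validated as UTF-8):

// Sequence length implied by a lead byte.  Returns 1 for a stray continuation
// byte so that a caller scanning damaged input still makes forward progress.
inline int byteCount(unsigned char lead) {
    if (lead < 0x80) return 1;   // 0xxxxxxx: ASCII
    if (lead < 0xC0) return 1;   // 10xxxxxx: continuation, not a valid lead
    if (lead < 0xE0) return 2;   // 110xxxxx
    if (lead < 0xF0) return 3;   // 1110xxxx
    return 4;                    // 11110xxx
}

// Hopping from character to character without reading the bytes in between:
//   const unsigned char* p = start;
//   while (p < end) p += byteCount(*p);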

>
>> To the very best of my knowledge (and I have a patent on
>> a finite state recognizer)
>
> You have a patent on finite state machines? I suspect that
> you'll find that prior art will invalidate it.

<sarcasm>
Yes right, when I say that "I have a patent on a finite state
recognizer" of course I really mean that I have a patent on "every"
finite state recognizer.
</sarcasm>

Patent number 7,046,848

>
>> a regular expression implemented as a finite state machine is
>> the fastest and simplest of all possible ways to validate a UTF-8
>> sequence and divide it into its constituent parts.
>
> It all depends on the formal specification; one of the
> characteristics of UTF-8 is that you don't have to look at every
> character to find the length of a sequence. And a regular
> expression generally will have to look at every character.

Concurrent validation and translation to UTF-32 cannot be done faster
than with a DFA recognizer; in fact, any other method must always be slower.

>
> --
> James Kanze

From: Joseph M. Newcomer on
See below...
On Tue, 18 May 2010 13:48:22 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/18/2010 2:36 AM, Mihai N. wrote:
>>
>>> It will be faster than any other possible method.
>>
>> Not only "the fastest method today" but faster than any possible method.
>> Really?
>> Never make such absolute statements.
>> Looks like a sure sign of delusion.
>> No matter how fast the method, someone, at some point, will have something
>> faster. It's just how things work.
>
>In this case it is a sure sign of very extensive knowledge of DFA
>recognizers. I have a patent on a DFA recognizer: 7,046,848.
>
>>
>>
>>> I will post the source code when it is completed.
>> You do that.
>> That way it is easier to measure and compare with other methods.
>> Estimates mean nothing.
>>
>>
>If one achieves the absolute minimum number of clock cycles, then one
>achieves the fastest possible code. The code that I will publish will be
>in the ballpark of the minimum number of operations that can be encoded
>in C/C++. Hand-tweaking in assembly language might provide further
>improvements.
****
This is true, but it ignores various parts of reality. For example, you cannot correlate
"absolute minimum number of clock cycles" with "short instruction path", because a small,
tight piece of code may be suboptimal for cache line hits or page faults, meaning that
even though, under perfect conditions, it would execute in a minimum number of clock
cycles, a cache line miss or a page fault will cost you 20-100 clock cycles or 30,000,000
clock cycles (respectively). So it is dangerous to assert that shortest-instruction-path
is going to be shortest-execution-time. As someone who once worried about these issues
for code generation, I learned long ago that fewest-instruction-cycles did not correlate
to fewest-data-cycles, and you do not seem to have provided actual information on this,
apparently asserting it without evidence. We also learned that fewest-lines-of-code
rarely resulted in fewest-instructions, but that was because we were actually doing real
measurements on real compilers, not hypothesizing in the absence of data.

You make these assertions but you never back them up with evidence. You may be right, but
my experience suggests that you are just guessing based on common myths.
joe
****
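
For what it is worth, a minimal sketch of the kind of measurement being asked
for follows. It reports wall-clock nanoseconds per decoded code point over a
large buffer, so cache and paging effects show up in the number instead of
being assumed away. It presumes it is linked against the Utf8ToUtf32() sketch
shown earlier in the thread; the workload and repetition count are arbitrary
stand-ins:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Declared here, defined in the decoder sketch shown earlier in the thread.
bool Utf8ToUtf32(const std::string& in, std::vector<char32_t>* out);

// Times 'decode' over 'input' repeated 'reps' times; returns ns per code point.
template <typename Decoder>
double NsPerCodePoint(Decoder decode, const std::string& input, int reps) {
    std::vector<char32_t> out;
    std::size_t decoded = 0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        out.clear();
        decode(input, &out);
        decoded += out.size();
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> ns = stop - start;
    return ns.count() / static_cast<double>(decoded);
}

int main() {
    // Stand-in workload mixing 1-, 2-, and 3-byte sequences (about 8 MB).
    // A serious benchmark would use representative text and several buffer
    // sizes so that cache-resident and cache-missing behaviour both get measured.
    std::string input;
    input.reserve(8u << 20);
    for (int i = 0; i < (1 << 20); ++i) input += "abc\xC3\xA9\xE2\x82\xAC";

    std::printf("table-driven DFA: %.2f ns per code point\n",
                NsPerCodePoint(Utf8ToUtf32, input, 10));
    // A competing decoder is timed through the same call, e.g.:
    //   NsPerCodePoint(SomeOtherDecoder, input, 10);
}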
>
>It will not produce smaller code than the alternatives. The code will
>probably be about 2048 bytes larger because the state transition matrix
>requires that much space.
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
See below...
On Tue, 18 May 2010 14:17:02 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/18/2010 9:34 AM, James Kanze wrote:
>> On 17 May, 14:08, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>> On 5/17/2010 1:35 AM, Mihai N. wrote:
>>> Do you know of any faster way to validate and divide a UTF-8 sequence
>>> into its constituent code point parts than a regular expression
>>> implemented as a finite state machine? (please don't cite a software
>>> package, I am only interested in the underlying methodology).
>>
>> If the goal is only to get to the start of the next character,
>> something like:
>> p += byteCount(*p);
>> is a lot faster; you don't even have to look at the intermediate
>> characters. If you don't know the validity of the UTF-8, and
>> want to check that at the same time, then I'd still start with
>> something like that. For checking for non-canonical
>> representations and such, regular expressions might be useful,
>> but they're likely to be a bit more complicated than what you
>> present.
>
>No, it is apparently very simple. I have the DFA already encoded.
>
>>
>>> To the very best of my knowledge (and I have a patent on
>>> a finite state recognizer)
>>
>> You have a patent on finite state machines? I suspect that
>> you'll find that prior art will invalidate it.
>
><sarcasm>
>Yes right, when I say that "I have a patent on a finite state
>recognizer" of course I really mean that I have a patent on "every"
>finite state recognizer.
></sarcasm>
>
>Patent number 7,046,848
>
>>
>>> a regular expression implemented as a finite state machine is
>>> the fastest and simplest of all possible ways to validate a UTF-8
>>> sequence and divide it into its constituent parts.
>>
>> It all depends on the formal specification; one of the
>> characteristics of UTF-8 is that you don't have to look at every
>> character to find the length of a sequence. And a regular
>> expression generally will have to look at every character.
****
I will believe this when I see numerical proof based on actual timings. Note that if
anyone else can implement the same algorithm and get faster performance, you will be
proven wrong.

But we need to see what the performance is!
joe
****
>
>Concurrent validation and translation to UTF-32 cannot be done faster
>than with a DFA recognizer; in fact, any other method must always be slower.
>
>>
>> --
>> James Kanze
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm