From: Peter Olcott on
On 5/19/2010 12:16 PM, James Kanze wrote:
> On May 19, 3:32 pm, "Leigh Johnston"<le...(a)i42.co.uk> wrote:
>> "Peter Olcott"<NoS...(a)OCR4Screen.com> wrote in message
>> news:o86dnTxKTd3Xb27WnZ2dnUVZ_g6dnZ2d(a)giganews.com...
>>> On 5/19/2010 12:39 AM, Pete Delgado wrote:
>>>> "Peter Olcott"<NoS...(a)OCR4Screen.com> wrote in message
>>>> news:hcednaq6ks5ml2_WnZ2dnUVZ_tqdnZ2d(a)giganews.com...
>
>>>>> The finite state machine's detailed design is now
>>>>> completed. Its state transition matrix only takes 2048
>>>>> bytes. It will be faster than any other possible method.
>
>>>> So once again you find yourself with a *design* that is
>>>> complete but you have not done any *coding*? Yet you claim
>>>> that it will be faster than any other possible method?
>
>>>> Is anyone else noticing a pattern here??? (no pun intended...)
>
>>> The code will be complete within a week. Also most of the
>>> coding is complete for my major components.
>
>>> I would think that it would be self-evident to all who
>>> really understand deterministic finite automatons that
>>> nothing can beat the speed of a state transition matrix.
>
>> What if such a matrix does not fit into L1 cache?
>
> What if any number of things. The only thing that is
> self-evident to those who understand modern computers is that
> you really can't tell until you've measured (within certain
> limits, of course). There are too many variables involved. In
> general, however, avoiding memory accesses tends to help (as
> does avoiding conditions on some machines).
>
> --
> James Kanze

Take a look at the detailed design and see if you still think this.
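
To make this concrete, the recognizer's inner loop is essentially one
table lookup per input byte. The sketch below is a placeholder, not my
actual design: the state count and table contents shown are dummies,
but 8 states x 256 byte values x 1 byte per entry is exactly the 2048
bytes mentioned above.

    #include <cstddef>

    enum { ACCEPT = 0 };   // dummy state id

    // Placeholder matrix: the real contents ARE the design work.
    // 8 states x 256 byte values x 1 byte each = 2048 bytes.
    static const unsigned char transition[8][256] = { { 0 } };

    bool validate_utf8(const unsigned char* p, std::size_t n) {
        unsigned char state = ACCEPT;
        for (std::size_t i = 0; i < n; ++i)
            state = transition[state][p[i]];  // entire per-byte work
        return state == ACCEPT;  // must end on a sequence boundary
    }

The per-byte cost is a single indexed load, which is why I expect this
to be hard to beat.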
From: Peter Olcott on
On 5/19/2010 12:23 PM, James Kanze wrote:
> On May 19, 4:14 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>> On 5/19/2010 5:39 AM, James Kanze wrote:
>>> On May 18, 8:17 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>>> On 5/18/2010 9:34 AM, James Kanze wrote:
>>>>> On 17 May, 14:08, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>>>>> On 5/17/2010 1:35 AM, Mihai N. wrote:
>
>
> Design never proves anything with regards to speed, at least as
> long as the big-O is the same.

Maybe not complete proof, but it is substantial evidence, and sometimes
sufficient evidence.

>
>> I will also add that I only have twelve ActionCodes including
>> InvalidByteError and the OutOfdata sentinel.
>
> And how do you act on an ActionCode? Switch statements and
> indirect jumps are very, very slow on some machines (but not on
> others).

I cannot imagine why they would ever be very slow. I know that they
are much slower than an ordinary jump because of the infrastructure
overhead, but that is only about one order of magnitude or less.
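
For concreteness, the dispatch I have in mind is essentially the
following; the action names are placeholders for my twelve ActionCodes:

    #include <cstddef>

    // Placeholder names; the real design has twelve ActionCodes.
    enum ActionCode {
        AppendAscii, AppendTwoByte, AppendThreeByte, AppendFourByte,
        InvalidByteError, OutOfData
    };

    // A compiler typically lowers this switch to a jump table: one
    // indirect jump whose target depends on data. A mispredicted
    // indirect jump flushes the pipeline, which is why its cost
    // varies so much from machine to machine.
    void run(const ActionCode* codes, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            switch (codes[i]) {
            case AppendAscii:      /* emit one code unit */    break;
            case InvalidByteError: /* report, resynchronize */ break;
            case OutOfData:        /* wait for more input */   return;
            default:               /* remaining actions */     break;
            }
        }
    }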

>
> --
> James Kanze

See my newest thread for more details:
Designing a Finite State Machine DFA Recognizer for UTF-8
From: Hector Santos on
On May 19, 1:16 pm, James Kanze <james.ka...(a)gmail.com> wrote:

>> What if such a matrix does not fit into L1 cache?
>
> What if any number of things. The only thing that is
> self-evident to those who understand modern computers is that
> you really can't tell until you've measured (within certain
> limits, of course). There are too many variables involved.

That is obvious to all, but to the Peters of the world it is irrelevant.

> In general, however, avoiding memory accesses tends to help (as
> does avoiding conditions on some machines).


Hence the Peter Troll "Generalization Rule" - he can't be bothered
with practical real-world bottlenecks, experimentation, measurements,
or any kind of implementation testing. He needs you
(speaking in general) for that.

As a theory, his patent-filing mindset has to have general applicability
across the board. Even the troll will tell you that. So even if there
are other real-world variables involved, factors that must be
considered, in his mind they are factored out and don't apply to
the overall "Idea", which he believes means more than any actual
implementation or deviation from the idea.

Note he has proven he is not capable of doing the R&D on his own.
Hence his goal here (note how he cleverly broke it up into four
different threads across multiple groups where there are decent people
who will respond to questions) is to make certain IP claims and to
get people to squeeze out all the issues related to them.

Keep in mind he has no real demonstration capability; he has no
product to base any of his claims on. His 10-year DFA project
and patent are still theory, and by his own admission no coding is
done until the software engineering is complete. It's all on paper, and
only the failings of the USPTO allowed such frivolous
filings to exist. The fact is he has never been challenged, and
that's only because the worth of his patent is zero and he can't use it
to stop anyone else with preexisting methods already in place since
the 80s.

--
HLS
From: Peter Olcott on
On 5/19/2010 12:46 PM, Hector Santos wrote:
> On May 19, 1:16 pm, James Kanze<james.ka...(a)gmail.com> wrote:
>
>>> What if such a matrix does not fit into L1 cache?
>>
>> What if any number of things. The only thing that is
>> self-evident to those who understand modern computers is that
>> you really can't tell until you've measured (within certain
>> limits, of course). There are too many variables involved.
>
> That is obvious to all, but to the Peters of the world it is irrelevant.
>
>> In general, however, avoiding memory accesses tends to help (as
>> does avoiding conditions on some machines).
>
>
> Hence the Peter Troll "Generalization Rule" - he can't be bothered
> with practical real-world bottlenecks, experimentation, measurements,
> or any kind of implementation testing. He needs you
> (speaking in general) for that.
>
> As a theory, his patent-filing mindset has to have general applicability
> across the board. Even the troll will tell you that. So even if there
> are other real-world variables involved, factors that must be
> considered, in his mind they are factored out and don't apply to
> the overall "Idea", which he believes means more than any actual
> implementation or deviation from the idea.
>
> Note he has proven he is not capable of doing the R&D on his own.
> Hence his goal here (note how he cleverly broke it up into four
> different threads across multiple groups where there are decent people
> who will respond to questions) is to make certain IP claims and to
> get people to squeeze out all the issues related to them.
>
> Keep in mind he has no real demonstration capability; he has no
> product to base any of his claims on. His 10-year DFA project
> and patent are still theory, and by his own admission no coding is
> done until the software engineering is complete. It's all on paper, and

Here are the actual results from the working prototype of my original
DFA-based glyph recognition engine:
http://www.ocr4screen.com/Unique.html
The new algorithm is much better than this.

> only the failings of the USPTO allowed such frivolous
> filings to exist. The fact is he has never been challenged, and
> that's only because the worth of his patent is zero and he can't use it
> to stop anyone else with preexisting methods already in place since
> the 80s.
>
> --
> HLS

From: Joseph M. Newcomer on
See below...
On Wed, 19 May 2010 11:37:18 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/19/2010 10:12 AM, Joseph M. Newcomer wrote:
>> A classic error in predicting performance deals with the myth that instruction count is an
>> important parameter in determining timing.
>>
>> In a modern pipelined superscalar architecture, with asynchronous, opportunistic
>> sequencing of instructions, speculative execution, multilevel caches, etc., there is NO
>> WAY to look at code and predict its behavior unless you are mentally simulating the
>> asynchronous nature of the entire architecture, from memory access through write pipes,
>> caches, and the entire asynchronous execution engine. Therefore, "counting" memory
>> accesses does not work (speculative prefetch may have made the data available at the point
>> where it is needed, with none of the "memory delay" that a naive simulation may suggest). Counting
>> instructions is pointless, because such models suggest the instructions are executed
>> sequentially (this has not been true since the Pentium II).
>
>The above statement has at least one obvious exception. When the
>algorithm is such that all of the above re-sequencing of instructions
>can't possibly help because there is only a single sequence that can
>produce the correct result.
****
Sadly, this only proves you have NO CLUE as to what a pipelined superscalar architecture
with asynchronous opportunistic instruction execution does. You are still thinking like a
PDP-11 in an era in which that architecture is dead, dead, dead.

You are clearly ignoring the issues of opportunistic execution mediated by speculative
prefetch. In fact, your notion there is only one execution path demonstrates that you
have not looked at computer architectures since the mid-1980s (excepting the IBM Stretch,
which was designed in 1955, and incorporated a lot of these ideas of non-sequential
execution). Supercomputers of the 1960s and 1970s had similar issues, and it took until
the late 1980s until these ideas could be incorporated economically into microcomputer
chips. A friend of mine was the Pentium II architect, and they had speculative execution,
data prefetching, RISC pipelines, dynamic register renaming, and other things that had
previously been limited to multi-megabuck supercomputer systems.

Why is it you think the COMPUTER executes a unique path through the code? It doesn't. At
every branch point, the branch predictor circuitry routes the execution path in the way it
thinks best, and in some cases, there are concurrent multiple execution paths doing
speculative prefetch. Consider the following:
L:  mov eax, addr            ; load the VALUE at memory location addr
    inc addr                 ; increment the counter held in memory
    cmp eax, immediate_value ; stalls until the load into eax arrives
    je  L                    ; predictor guesses; execution runs ahead
    mov eax, other           ; speculatively fetched during the stall
    add eax, 17              ; may complete BEFORE the branch resolves

While this is executing, the value in the memory location addr may not be available
because it involves an L2 cache miss, but while waiting for it, execution proceeds to the
cmp instruction, which is stalled waiting for the data to arrive in eax (the inc is also
stalled, but execution proceeds to the cmp). Meanwhile, the speculative execution is
fetching the contents of 'other' from memory, which, if it is in the L1 cache, will allow
the add instruction to be executed before the data arrives for the cmp. Only when the
data arrives is the control transfer executed; if it loops back, the add of the value in
eax is discarded. The internal register used in mov eax,addr is NOT the same as the
internal register used for mov eax,other, so there is no conflict (this is dynamic
register renaming coming into play). But if the branch is not taken, the two instructions
that followed it have ALREADY been executed! eax is a temporary alias for an internal
register, and this aliasing can change at any time a register is a target of a mov
operation or other operation that can change its contents.

This is the nature of modern computers. So while your thought is that there is a unique
path of execution through the code, in fact this doesn't actually happen in terms of the
computations done inside the chip. Prefetch and speculative execution means that some
code is executed before the "logical" flow gets to it, in which case it then takes zero
time.
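
You can see the branch predictor's influence from ordinary user code by
timing the same loop over unpredictable and then predictable data. This
is only a rough sketch, and the ratio varies by machine, but the
instruction count is identical in both runs:

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <vector>

    // Sums elements >= 128. With random data the branch is
    // unpredictable; with sorted data the predictor is nearly
    // perfect, so the same instructions run markedly faster.
    static long long sum_if(const std::vector<int>& v) {
        long long s = 0;
        for (std::size_t i = 0; i < v.size(); ++i)
            if (v[i] >= 128) s += v[i];   // data-dependent branch
        return s;
    }

    static void time_it(const char* label, const std::vector<int>& v) {
        std::clock_t t0 = std::clock();
        long long s = 0;
        for (int r = 0; r < 100; ++r)     // repeat for measurable time
            s += sum_if(v);
        std::clock_t t1 = std::clock();
        std::printf("%s: %ld ticks (sum %.0f)\n",
                    label, (long)(t1 - t0), (double)s);
    }

    int main() {
        std::vector<int> v(1 << 20);
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = std::rand() % 256;

        time_it("unsorted", v);
        std::sort(v.begin(), v.end());
        time_it("sorted", v);
        return 0;
    }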
****
>
>>
>> In principle, the only way to know how long an algorithm takes is to use a high-resolution
>> (e.g., RDTSC) timer. But the way Windows is implemented, this also has problems.
>>
>> The chip itself does not guarantee the RDTSC instruction is executed when you think it is;
>> in fact, it is *documented* that it might be executed BEFORE instructions that precede it
>> and when executed, an arbitrary number of instructions that FOLLOW it might have *already*
>> been executed (and this can happen even if the preceding instructions have NOT YET been
>> executed), so to use the RDTSC instruction directly, it must always be preceded by a
>> "serializing instruction". The only valid user-mode serializing instruction is CPUID.
>> Note that QueryPerformanceCounter will do the right thing.
>>
>> However, in a multicore/multiprocessor environment, a thread may read the high-resolution
>> timer on one CPU, then be suspended, and read the timer again on another CPU. Note there
>> can be a "clock skew" between the two CPUs, so it *is* possible for the timer to give
>> NEGATIVE time for small intervals. There are many other problems with getting good
>> performance data on the x86 chipset under Windows, so at best all timings are
>> approximations, and you have to do a sufficient number of experiments that your collected
>> data has statistical significance (and if you make a claim about performance and cannot
>> give the mean and standard deviation of your measurements, you don't HAVE measurements,
>> you have random numbers which might approximate measurements but cannot be trusted).
>>
>> So trying to compare models by counting the number of potential memory accesses or the
>> potential instruction trace will NOT give correct information. You can no longer do
>> static analysis of code to understand what its performance is going to be. Well, you can,
>> but it is not practical, and in any case, if you can work it out for a particular model of
>> CPU with a particular motherboard chipset, it is ONLY valid for that instance of a
>> "computer", and will not necessarily hold for other instances of "computer" with different
>> CPUs or different chip sets on the motherboard. So performance metrics that say "I looked
>> at the instruction set and X is true" have no meaning.
>> joe
>
>Certainly head-to-head direct benchmarking will provide much more
>accurate and reliable information. This is not to say that instruction
>counts are always useless. When designing a fast algorithm it is too
>expensive to encode every alternative, thus some reasonable heuristic
>must be used to narrow down the number of alternatives.
****
But the point is, short instruction sequences DO NOT correlate with high performance (take
a single page fault in a tight loop, for example, and watch what happens to performance!).

They are a heuristic, but they do not accurately predict.
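
If you actually want to measure, here is a minimal sketch of the
approach I described above, using the MSVC intrinsics (__cpuid to
serialize, __rdtsc to read the counter), pinning the thread to one CPU
to avoid the clock-skew problem, and reporting mean and standard
deviation rather than a single number:

    #include <windows.h>   // SetThreadAffinityMask
    #include <intrin.h>    // __cpuid, __rdtsc
    #include <cmath>
    #include <cstdio>

    // CPUID is the user-mode serializing instruction; only read the
    // timestamp counter after it.
    static unsigned __int64 serialized_tsc() {
        int regs[4];
        __cpuid(regs, 0);   // nothing drifts past this point
        return __rdtsc();
    }

    static void workload() { /* code under test goes here */ }

    int main() {
        // Pin to one CPU so both TSC reads use the same clock.
        SetThreadAffinityMask(GetCurrentThread(), 1);

        // One timing is just a random number: collect many and
        // report mean and standard deviation.
        const int trials = 1000;
        double sum = 0.0, sumsq = 0.0;
        for (int i = 0; i < trials; ++i) {
            unsigned __int64 t0 = serialized_tsc();
            workload();
            unsigned __int64 t1 = serialized_tsc();
            double d = (double)(t1 - t0);
            sum += d; sumsq += d * d;
        }
        double mean = sum / trials;
        double sd = std::sqrt(sumsq / trials - mean * mean);
        std::printf("mean %.1f cycles, sd %.1f\n", mean, sd);
        return 0;
    }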
joe
****
>
>> ***
>> On Wed, 19 May 2010 03:39:00 -0700 (PDT), James Kanze<james.kanze(a)gmail.com> wrote:
>>
>>> On May 18, 8:17 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>>> On 5/18/2010 9:34 AM, James Kanze wrote:
>>>>> On 17 May, 14:08, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>>>>> On 5/17/2010 1:35 AM, Mihai N. wrote:
>>>
>>>>>> a regular expression implemented as a finite state machine
>>>>>> is the fastest and simplest possible way of every way that
>>>>>> can possibly exist to validate a UTF-8 sequence and divide
>>>>>> it into its constituent parts.
>>>
>>>>> It all depends on the formal specification; one of the
>>>>> characteristics of UTF-8 is that you don't have to look at
>>>>> every character to find the length of a sequence. And
>>>>> a regular expression generally will have to look at every
>>>>> character.
>>>
>>>> Validation and translation to UTF-32 concurrently cannot be
>>>> done faster than a DFA recognizer, in fact it must always be
>>>> slower.
>>>
>>> UTF-8 was designed intentionally in a way that it doesn't
>>> require a complete DFA to handle, but can be handled faster.
>>> Complete DFA's are usually slower than calculations on modern
>>> processors, since they require memory accesses, and memory is
>>> often the limiting factor.
>>>
>>> In fact, there is no "must always be slower". There are too
>>> many variables involved to be able to make such statements.
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
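
To illustrate Kanze's point above: in UTF-8 the lead byte alone encodes
the length of the sequence, so a decoder that trusts its input can skip
the continuation bytes instead of inspecting every one. A minimal
sketch:

    // Length of a UTF-8 sequence from its lead byte; 0 for a byte
    // that cannot begin a sequence. A full DFA still touches every
    // byte, but the length is known from the first byte alone.
    int utf8_seq_length(unsigned char lead) {
        if (lead < 0x80)           return 1;  // 0xxxxxxx: ASCII
        if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
        return 0;                             // continuation/invalid
    }

Skipping the continuation bytes trades validation for speed, which is
precisely the tradeoff being argued here.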
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm