From: Joseph M. Newcomer on
See below...
On Sat, 20 Mar 2010 13:02:02 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
>news:OshXHLFyKHA.5576(a)TK2MSFTNGP05.phx.gbl...
>> Hector Santos wrote:
>>
>>>> <not yelling, emphasizing>
>>>> MEMORY BANDWIDTH SPEED IS THE BOTTLENECK
>>>> </not yelling, emphasizing>
>>
>> >
>>
>>> <BIG>
>>> YOU DON'T KNOW WHAT YOU ARE DOING!
>>> </BIG>
>>>
>>> You don't have a freaking CRAY application need! Plus if
>>> you said your process time is 100ms or less, then YOU
>>> DON'T KNOW what you are talking about if you say you
>>> can't handle more than one thread.
>>>
>>> It means YOU PROGRAMMED YOUR SOFTWARE WRONG!
>>
>> Look, you can't take a single thread process that demands
>> 4GB of meta processing and believe that this is optimized
>> for a WINTEL QUAD machine to run as single thread process
>> instances, and then use it as a BASELINE for any other
>> WEB-SERVICE design. It's foolish.
>
>Do you want me to paypal you fifty dollars? All that I need
>is some way to get your paypal email address. You can email
>me at PeteOlcott(a)gmail.com Only send me your paypal address
>because I never check this mailbox. If you do send me your
>paypal address, please tell me so I can check this email box
>that I never otherwise check.
>
>>
>> You have to redesign your OCR software to make it
>> thread-ready and use sharable data so that it's only LOADED
>> once and USED many times.
>>
>> If you have thousands of font glyph files, then you can
>> use a memory mapped class array shared data. I will
>> guarantee you that will allow you to run multiple threads.
>
>I am still convinced that multiple threads for my OCR
>process is a bad idea. I think that the only reason that you
>are not seeing this is that you don't understand my
>technology well enough. I also don't think that there exists
>any possible redesign that would not reduce performance. The
>design is fundamentally based on leveraging large amounts of
>RAM to increase speed. Because I am specifically leveraging
>RAM to increase speed, the architecture is necessarily
>memory bandwidth intensive.***
****
Why? What evidence do you have to suggest this would be a "bad idea"? It would allow you
to have more than one recognition going on concurrently, in the same image, and if you
believe the whole image is going to remain resident, then the second thread would cause no
page faults and therefore effectively be "free". If you are running multicore, then you
should be able to get throughput equal to the number of cores, which means concurrent
requests would fall within the magical 500ms limit, which you thought was so critical
last week, so critical it was non-negotiable. I guess it wasn't, since you clearly don't
care about performance this week. Notice that multithreading doesn't require additional
memory bandwidth, because you are most likely going to be running on multiple cores, with
multiple caches, and if you aren't, it isn't going to require any more memory bandwidth on
a single core because the cache is probably going to smooth this out.
joe
****
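To make the concurrency argument concrete, here is a minimal sketch of the kind of per-request thread pool being described, assuming a hypothetical recognize() routine that takes ~100ms and reads only shared, read-only data. The names and types are illustrative, not taken from the actual OCR code, and std::thread is used for brevity even though it postdates this thread.

#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Placeholder for the real recognizer: assumed to take ~100ms per request
// and to read only shared, immutable data, so it needs no locking.
std::string recognize(const std::vector<unsigned char>& image)
{
    return std::to_string(image.size());   // stand-in result
}

// Process all pending requests with one worker thread per core; each worker
// handles a disjoint subset of the requests, so no synchronization is needed
// beyond the final joins.
std::vector<std::string> process_pending(
    const std::vector<std::vector<unsigned char>>& requests, unsigned ncores)
{
    std::vector<std::string> results(requests.size());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < ncores; ++t) {
        pool.emplace_back([&, t] {
            // Worker t handles requests t, t + ncores, t + 2*ncores, ...
            for (std::size_t i = t; i < requests.size(); i += ncores)
                results[i] = recognize(requests[i]);
        });
    }
    for (auto& th : pool)
        th.join();
    return results;
}

int main()
{
    std::vector<std::vector<unsigned char>> pending(10, std::vector<unsigned char>(1024));
    std::printf("processed %zu requests\n", process_pending(pending, 4).size());
}

With 10 pending requests, 4 cores, and ~100ms per recognition, the slowest worker handles 3 requests, so the last response arrives after roughly 300ms instead of the 1 second a strict FIFO single thread would need.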
>
>>
>> But if you insist it can only be a FIFO single thread
>> processor, well, you are really wasting people's time here
>> because everything else you want to do contradicts your
>> limitations. You want to put a web server INTO your OCR,
>> when in reality, you need to put your OCR into your WEB
>> SERVER.
>>
>> --
>> HLS
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
See below...
On Sat, 20 Mar 2010 15:12:43 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:
>>>(5) OCR processing time is less than 100 milliseconds.
>> ****
>> As you point out, if there are 10 requests pending, this
>> means 1 sec, which violates your
>> 500ms goal. But if you had a concurrent server, it would
>> be (nrequests * 100)/ncores
>> milliseconds, so for a quad-core server with 10 pending
>> requests and a 4-thread pool you
>> would have only 250ms, within your goals. If response
>> time is so critical, why are you
>> not running multithreaded already?
>> ****
>
>If I get more than an average of one request per second, I
>will get another server. My process does not improve with
>multiple cores, but it does get faster with faster memory.
>800% faster memory at the same core speed provided an 800%
>increase in speed. Two processes on a quad core resulted in
>half the speed for each.
****
Well, if you have not designed your code to run multithreaded, how do you KNOW it won't
run faster if you add more cores? If it is single-threaded, it will run at EXACTLY the
same speed on an 8-core system as it does on a uniprocessor, because you have no
concurrency, but if you want to process 8 requests, you currently require ~800ms, which
violates your apparently non-negotiable 500ms limit, but if you run multithreaded, an
8-core system could handle all 8 of them concurrently, meaning your total processing time
on 8 concurrent requests is ~100ms. Or have I missed something here, and was the 500ms
limit abandoned?

Seriously, how hard can it be to convert code that requires no locking to multithreaded?
joe
****
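One cheap way to answer this question before committing to a redesign is to time the same lock-free, read-only routine on 1, 2, and 4 threads and see whether aggregate throughput scales. The sketch below uses only a stand-in workload (a dependent, effectively random walk over a table much larger than cache); it is not the actual OCR code, and the sizes are arbitrary.

#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

std::atomic<unsigned long> sink{0};   // keeps the optimizer from removing the loop

// Stand-in for one request: dependent, effectively random reads over a
// large read-only table, loosely imitating a DFA traversal.
void do_one_request(const std::vector<unsigned>& table)
{
    unsigned long sum = 0;
    unsigned idx = 0;
    for (int i = 0; i < 5000000; ++i) {
        idx = table[idx];
        sum += idx;
    }
    sink += sum;
}

int main()
{
    const std::size_t n = 64u * 1024 * 1024;          // ~256 MB of unsigneds
    std::vector<unsigned> table(n);
    std::iota(table.begin(), table.end(), 0u);        // 0, 1, 2, ...
    std::shuffle(table.begin(), table.end(), std::mt19937(42));

    for (unsigned nthreads : {1u, 2u, 4u}) {
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < nthreads; ++t)
            workers.emplace_back([&] { do_one_request(table); });
        for (auto& w : workers)
            w.join();
        double secs = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
        std::printf("%u thread(s): %.2f requests/sec\n", nthreads, nthreads / secs);
    }
}

If 4 threads deliver close to 4x the single-thread rate, the work is core-bound and the conversion pays off; if throughput barely moves, the memory bus really is the limiter, which is what Peter predicts below.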
>
>>>
>>>
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
message news:felaq598kd3crokhet06bdmvl27bot4al6(a)4ax.com...
> See below...
> On Sat, 20 Mar 2010 13:02:02 -0500, "Peter Olcott"
> <NoSpam(a)OCR4Screen.com> wrote:
>
>>
>>"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in
>>message
>>news:OshXHLFyKHA.5576(a)TK2MSFTNGP05.phx.gbl...
>>> Hector Santos wrote:
>>>
>>>>> <not yelling, emphasizing>
>>>>> MEMORY BANDWIDTH SPEED IS THE BOTTLENECK
>>>>> </not yelling, emphasizing>
>>>
>>> >
>>>
>>>> <BIG>
>>>> YOU DON'T KNOW WHAT YOU ARE DOING!
>>>> </BIG>
>>>>
>>>> You don't have a freaking CRAY application need! Plus
>>>> if
>>>> you said your process time is 100ms or less, then YOU
>>>> DON'T KNOW what you are talking about if you say you
>>>> can't handle more than one thread.
>>>>
>>>> It means YOU PROGRAMMED YOUR SOFTWARE WRONG!
>>>
>>> Look, you can't take a single thread process that
>>> demands
>>> 4GB of meta processing and believe that this is
>>> optimized
>>> for a WINTEL QUAD machine to run as single thread
>>> process
>>> instances, and then use it as a BASELINE for any other
>>> WEB-SERVICE design. It's foolish.
>>
>>Do you want me to paypal you fifty dollars? All that I
>>need
>>is some way to get your paypal email address. You can
>>email
>>me at PeteOlcott(a)gmail.com Only send me your paypal
>>address
>>because I never check this mailbox. If you do send me
>>your
>>paypal address, please tell me so I can check this email
>>box
>>that I never otherwise check.
>>
>>>
>>> You have to redesign your OCR software to make it
>>> thread-ready and use sharable data so that it's only
>>> LOADED
>>> once and USED many times.
>>>
>>> If you have thousands of font glyph files, then you can
>>> use a memory mapped class array shared data. I will
>>> guarantee you that will allow you to run multiple
>>> threads.
>>
>>I am still convinced that multiple threads for my OCR
>>process is a bad idea. I think that the only reason that
>>you
>>are not seeing this is that you don't understand my
>>technology well enough. I also don't think that there
>>exists
>>any possible redesign that would not reduce performance.
>>The
>>design is fundamentally based on leveraging large amounts
>>of
>>RAM to increase speed. Because I am specifically
>>leveraging
>>RAM to increase speed, the architecture is necessarily
>>memory bandwidth intensive.***
> ****
> Why? What evidence do you have to suggest this would be a
> "bad idea". It would allow you
> to have more than one reconigtion going on concurrently,
> in the same image, and if you

(I have already said these things several times before)

Empirical:
(1) I tried it and it doesn't work; it cuts the performance
of each process by at least half.
(2) The fact that I achieved an 800% performance improvement
between one machine and another, where the primary difference
was 800% faster RAM, shows that my process must be consuming
essentially all of the memory bandwidth.

Analytical:
If my process is already taking ALL of the memory bandwidth,
then adding another thread of execution cannot possibly
help, because the process is memory bandwidth bound, not CPU
bound.
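A back-of-envelope version of this argument, with purely illustrative numbers rather than measurements from this system: if one recognition performs on the order of 10 million dependent DFA lookups and essentially every lookup misses cache, each miss pulls in a full 64-byte line, so one request moves roughly 10,000,000 x 64 bytes = 640 MB across the memory bus. At a sustained bandwidth of about 6.4 GB/s, that is ~100 ms per request with the bus already saturated by a single thread, and a second thread could only share that same 6.4 GB/s, which would match the observed halving. Whether those assumed numbers hold for the real DFA is exactly what the measurement discussed later in the thread would settle.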


> believe the whole image is going to remain resident, then
> the second thread would cause no
> page faults and therefore effectively be "free". If you
> are running multicore, then you

But each process would still have to take turns accessing
the memory bus. The memory bus has a finite maximum access
speed. If one process is already using ALL of this up, then
another process or thread cannot possibly help.

> should be able to get throughput equal to the number of
> cores, which means concurrent

Only for CPU bound processes, not for memory access bound
processes.

> requests would fall within the magical 500ms limit,
> which you thought was so critical
> last week, so critical it was non-negotiable. I guess it
> wasn't, since you clearly don't
> care about performance this week. Notice that
> multithreading doesn't require additional
> memory bandwidth, because you most likely are going to be
> running on multiple cores, with
> multiple caches, and if you aren't, it isn't going to
> require any more memory bandwidth on
> a single core because the cache is probably going to
> smooth this out.
> joe
> ****

Nope, not in my case. In my case I must have access to a much
larger DFA than could possibly fit into cache. With the
redesign there are often times when a DFA will fit into
cache. With this new design I may have 1,000 (or more) DFAs
all loaded at once, thus still requiring fast RAM access.
Some of these DFAs will not fit into cache. I have not
tested the new design yet. With the new design it might be
possible to gain from multiple cores.

Much more interesting to me than this, is testing your
theory about cache hit ratio. If you are right, then my
basic design will gain a huge amount of performance. The
cache hit ratio could improve from 5% to 95%.

>>
>>>
>>> But if you insist it can only be a FIFO single thread
>>> processor, well, you are really wasting people's time here
>>> because everything else you want to do contradicts your
>>> limitations. You want to put a web server INTO your
>>> OCR,
>>> when in reality, you need to put your OCR into your WEB
>>> SERVER.
>>>
>>> --
>>> HLS
>>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm


From: Peter Olcott on

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
message news:52maq5h8eg3490hrv06l066r76fp02fo0u(a)4ax.com...
> See below...
> On Sat, 20 Mar 2010 15:12:43 -0500, "Peter Olcott"
> <NoSpam(a)OCR4Screen.com> wrote:
>>>>(5) OCR processing time is less than 100 milliseconds.
>>> ****
>>> As you point out, if there are 10 requests pending, this
>>> means 1 sec, which violates your
>>> 500ms goal. But if you had a concurrent server, it
>>> would
>>> be (nrequests * 100)/ncores
>>> milliseconds, so for a quad-core server with 10 pending
>>> requests and a 4-thread pool you
>>> would have only 250ms, within your goals. If response
>>> time is so critical, why are you
>>> not running multithreaded already?
>>> ****
>>
>>If I get more than an average of one request per second, I
>>will get another server. My process does not improve with
>>multiple cores, but it does get faster with faster memory.
>>800% faster memory at the same core speed provided an 800%
>>increase in speed. Two processes on a quad core resulted
>>in
>>half the speed for each.
> ****
> Well, if you have not designed your code to run
> multithreaded, how do you KNOW it won't
> run faster if you add more cores? If it is single
> threaded, it will run at EXACTLY the
> same speed on an 8-core system as it does on a
> uniprocessor, because you have no
> concurrency, but if you want to process 8 requests, you
> currently require ~800ms, which
> violates your apparently non-negotiable 500ms limit, but if
> you run multithreaded, an 8-core
> system could handle all 8 of them concurrently,
> meaning your total processing time on
> 8 concurrent requests is ~100ms. Or have I missed
> something here, and was the 500ms limit
> abandoned?
>
> Seriously, how hard can it be to convert code that
> requires no locking to multithreaded?
> joe


How can I empirically test exactly how much of the total
memory bandwidth my process is taking up?

Would you agree that if the empirical test shows that my
single process is taking up 100% of the memory bandwidth,
then multiple cores or multiple threads could not help
increase speed?
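One rough way to do this, sketched here rather than prescribed: measure the machine's practical peak bandwidth with a STREAM-style loop such as the one below, then estimate the OCR process's own rate as the bytes it actually touches per request divided by the ~100 ms request time (or read the memory-controller counters with a profiler such as Intel VTune). The buffer sizes and repetition count below are arbitrary.

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    // Buffers far larger than any cache, so the loop streams from RAM.
    const std::size_t n = 32u * 1024 * 1024;            // 32M doubles each
    std::vector<double> a(n, 1.0), b(n, 2.0);

    const int reps = 10;
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        for (std::size_t i = 0; i < n; ++i)
            a[i] = b[i] * 3.0;                           // STREAM "scale" kernel
    double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();

    // Each iteration reads 8 bytes and writes 8 bytes (ignoring the extra
    // read-for-ownership traffic the stores cause).
    double gbytes = reps * n * 16.0 / 1e9;
    std::printf("approximate sustained bandwidth: %.2f GB/s\n", gbytes / secs);
}

If the OCR process's own data rate comes out near this number, the single-thread saturation claim holds; if it is well below, additional threads should still have bandwidth headroom.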


> ****
>>
>>>>
>>>>
>>> Joseph M. Newcomer [MVP]
>>> email: newcomer(a)flounder.com
>>> Web: http://www.flounder.com
>>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm


From: Hector Santos on
Peter Olcott wrote:

> Would you agree that if the empirical test shows that my
> single process is taking up 100% of the memory bandwidth,
> then multiple cores or multiple threads could not help
> increase speed?


You are thinking about this all wrong.

You have quantum-based context switching that you can't stop, even for a
single-thread process. In other words, you will never have 100% full
exclusive control of MEMORY ACCESS - never. If you did, nothing else
would run.

What I am saying is this: suppose you have 10 lines of DFA C code; the
compiler creates OP CODES for those 10 lines. Each OP CODE has a fixed
cycle cost. When the accumulated run time reaches a QUANTUM
(~15ms), you will get a context switch - in other words, your code is
preempted (stopped), swapped out, and Windows will give all other threads
a chance to run.

That gives other threads in your process, if it were multi-threaded, a
chance to do the same type of memory access work. Since the data is READ
ONLY, there is no contention. If your preempted thread BLOCKED it, then
you would have contention or even a deadlock - but you are not doing that.
You are only reading READ ONLY memory - which has a maximum access rate.

Now along comes a MULTI-CORE machine, and you have two or more threads;
the SPEED gain is that there is NO CONTEXT SWITCHING - you may still have
the same memory access, but that would be no slower than on a single CPU.
Your speed comes from less context switching. Understand?

In short:

single cpu: speed lost due to context switching
multi cpu/core: less context switching, more resident time.

You cannot think in terms of a single-thread process because there is
no advantage for it on a multi-core/cpu machine.

The INTEL multi-core chips have advanced technology to help
multi-threaded applications. Single-thread processes cannot benefit
on a multi-core machine. They must be designed for threads to see any
benefit. If you want to read up on it, check out the Intel technical
documents, like this one:

http://download.intel.com/technology/architecture/sma.pdf

Specifically read SMA "Smart Memory Access"

The bottom line is really simple:

You have a single process with a huge memory load. Each instance
redundantly creates an additional huge memory load, and that alone will
cause serious SYSTEM WIDE performance degradation with huge page
faulting and context switching delays.

You will never get any improvement until you change your memory usage
to intelligent sharable data and use threads. When done correctly, you
will gain the benefits provided by the OS and the machine.
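For what it's worth, here is a minimal sketch of the kind of sharing being suggested, using the Win32 file-mapping API to map a hypothetical prebuilt data file read-only; every process (or thread) that maps the same file shares the same physical pages, so the tables are paged in once no matter how many workers use them. The file name and layout are made up for illustration.

#include <windows.h>
#include <cstdio>

int main()
{
    // Hypothetical prebuilt, read-only file holding the DFA/glyph tables.
    HANDLE file = CreateFileA("dfa_tables.bin", GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) {
        std::printf("cannot open data file\n");
        return 1;
    }

    // Read-only mapping backed by the file (size 0,0 = whole file).
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (!mapping) { CloseHandle(file); return 1; }

    // Map the whole file into this process's address space.  Other processes
    // or threads mapping the same file see the same physical pages.
    const unsigned char* data = static_cast<const unsigned char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    if (!data) { CloseHandle(mapping); CloseHandle(file); return 1; }

    // ... hand 'data' (plus offsets into it) to the recognition threads ...
    std::printf("first byte of the shared table: %u\n",
                static_cast<unsigned>(data[0]));

    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}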

You really need to look at this as a whole:

1 process - with X number of threads

vs

X single thread Processes.

You need to trust us: this is NOT the same when the DATA is HUGE! In
the threaded model, it is shared. In the non-threaded model, it is
redundant for each instance - that will murder you!

If you had NO HUGE data requirement, then they would be roughly equal
because then it's just CODE.
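To make the contrast concrete, here is a minimal sketch of the "1 process with X threads" side, with illustrative names rather than the real OCR code: the big table is loaded once into a single std::vector and every worker reads it through a const reference, so the footprint stays at one copy no matter how many threads run. Launching X separate single-thread processes would instead load X copies of the same table.

#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical request handler: reads the shared table, writes nothing.
unsigned long handle_request(const std::vector<unsigned>& table, unsigned seed)
{
    unsigned long sum = 0;
    std::size_t idx = seed % table.size();
    for (int i = 0; i < 1000000; ++i) {
        idx = table[idx] % table.size();
        sum += idx;
    }
    return sum;
}

int main()
{
    // Loaded ONCE; shared read-only by every worker thread below.
    std::vector<unsigned> table(50u * 1024 * 1024, 7u);   // ~200 MB

    const unsigned nthreads = 4;
    std::vector<std::thread> workers;
    std::vector<unsigned long> results(nthreads);
    for (unsigned t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] { results[t] = handle_request(table, t); });
    for (auto& w : workers)
        w.join();

    for (unsigned t = 0; t < nthreads; ++t)
        std::printf("thread %u -> %lu\n", t, results[t]);
}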

Now, it is conceivable that for your specific application, you might
find that X may be 5-10 threads before you see a performance issue
that isn't to your liking.

Show me how you are using std::vector with your files, and I will
create a simulator for you to PROVE to you how your thinking is all
wrong. This simulator will allow you to fine-tune it to determine
what your boundary conditions for performance are.

While you have 20,000 hours into this WITHOUT even exploring high-end
thread designs, I have 6 years in Intel RMX

http://en.wikipedia.org/wiki/RMX_(operating_system)

which was considered one of the early Intel precursors of the
"multi-thread" frameworks we have today and gave me an early, natural
understanding when NT 3.1 came out (17 years ago?); I have done
exclusively high-end multi-threaded commercial server products since
then. Count the hours! I can assure you, your single-process thinking
is wrong.

--
HLS