From: Joseph M. Newcomer on
See below...

On Sat, 20 Mar 2010 22:27:40 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>message news:felaq598kd3crokhet06bdmvl27bot4al6(a)4ax.com...
>> See below...
>> On Sat, 20 Mar 2010 13:02:02 -0500, "Peter Olcott"
>> <NoSpam(a)OCR4Screen.com> wrote:
>>
>>>
>>>"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in
>>>message
>>>news:OshXHLFyKHA.5576(a)TK2MSFTNGP05.phx.gbl...
>>>> Hector Santos wrote:
>>>>
>>>>>> <not yelling, emphasizing>
>>>>>> MEMORY BANDWIDTH SPEED IS THE BOTTLENECK
>>>>>> </not yelling, emphasizing>
>>>>
>>>> >
>>>>
>>>>> <BIG>
>>>>> YOU DON'T KNOW WHAT YOU ARE DOING!
>>>>> </BIG>
>>>>>
>>>>> You don't have a freaking CRAY application need! Plus if
>>>>> you said your process time is 100ms or less, then YOU
>>>>> DON'T KNOW what you are talking about if you say you
>>>>> can't handle more than one thread.
>>>>>
>>>>> It means YOU PROGRAMMED YOUR SOFTWARE WRONG!
>>>>
>>>> Look, you can't take a single thread process that demands
>>>> 4GB of meta processing and believe that this is optimized
>>>> for a WINTEL QUAD machine to run as single thread process
>>>> instances, and then use it as a BASELINE for any other
>>>> WEB-SERVICE design. It's foolish.
>>>
>>>Do you want me to paypal you fifty dollars? All that I
>>>need
>>>is some way to get your paypal email address. You can
>>>email
>>>me at PeteOlcott(a)gmail.com Only send me your paypal
>>>address
>>>because I never check this mailbox. If you do send me
>>>your
>>>paypal address, please tell me so I can check this email
>>>box
>>>that I never otherwise check.
>>>
>>>>
>>>> You have to redesign your OCR software to make it
>>>> thread-ready and use sharable data so that it's only LOADED
>>>> once and USED many times.
>>>>
>>>> If you have thousands of font glyph files, then you can
>>>> use a memory mapped class array shared data. I will
>>>> guarantee you that will allow you to run multiple
>>>> threads.
>>>
>>>I am still convinced that multiple threads for my OCR
>>>process is a bad idea. I think that the only reason that
>>>you
>>>are not seeing this is that you don't understand my
>>>technology well enough. I also don't think that there
>>>exists
>>>any possible redesign that would not reduce performance.
>>>The
>>>design is fundamentally based on leveraging large amounts
>>>of
>>>RAM to increase speed. Because I am specifically
>>>leveraging
>>>RAM to increase speed, the architecture is necessarily
>>>memory bandwidth intensive.***
>> ****
>> Why? What evidence do you have to suggest this would be a
>> "bad idea". It would allow you
>> to have more than one recognition going on concurrently,
>> in the same image, and if you
>
>(I have already said these things several times before)
>
>Empirical:
>(1) I tried it and it doesn't work, it cuts the performance
>of each process by at least half.
>(2) The fact that I achieved an 800% performance improvement
>between one machine and another and the primary difference
>was 800% faster RAM shows that my process must be taking
>essentially all of the memory bandwidth.
>
>Analytical
>If my process is already taking ALL of the memory bandwidth,
>then adding another thread of execution can not possibly
>help, because the process is memory bandwidth bound, not CPU
>bound.
****
And the shared L3 cache means that bandwidth is shared across the threads, so even if one
thread saturates it, you have not accounted for the effects of the L3 cache on the second
thread.

And you are ignoring other aspects of the architecture improvements of the i7, not just
the shared L3 cache, but the fact that instruction fetches use bandwidth, and the deeper
instruction pipe can have an influence. You are looking at overly-simplified metrics and
believing they tell the whole story. I don't know what the whole story is, but unless
I've run ACTUAL experiments using multiple threads across multiple cores, I have no idea
how the complex L1/L2/L3 cache interactions are going to affect this. Also, you forgot
about the larger TLB on the i7, and for the large amount of memory you are using TLB
thrashing is a potential issue, reduced on the i7. So it isn't just one factor; in fact,
it could be an amazing coincidence that the speedup is exactly equal to the memory speed
factor. And from this coincidence, you are forming an unfounded conclusion.
joe
*****
>
>
>> believe the whole image is going to remain resident, then
>> the second thread would cause no
>> page faults and therefore effectively be "free". If you
>> are running multicore, then you
>
>But each process would still have to take turns accessing
>the memory bus. The memory bus has a finite maximum access
>speed. If one process is already using ALL of this up, then
>another process or thread can not possibly help.
***
You are confusing "multiple process" with "multiple thread". They are not at all the
same!
****
>
>> should be able to get throughput equal to the number of
>> cores, which means concurrent
>
>Only for CPU bound processes, not for memory access bound
>processes.
****
And I whisper again, "Shared L3 cache". You don't know. You really don't know.
****
>
>> requests would fall within the magical 500ms limit,
>> which you thought was so critical
>> last week, so critical it was non-negotiable. I guess it
>> wasn't, since you clearly don't
>> care about performance this week. Notice that
>> multithreading doesn't require additional
>> memory bandwidth, because you most likely are going to be
>> running on multiple cores, with
>> multiple caches, and if you aren't, it isn't going to
>> require any more memory bandwidth on
>> a single core because the cache is probably going to
>> smooth this out.
>> joe
>> ****
>
>Nope not in my case. In my case I must have access to a much
>larger DFA than will possibly fit into cache. With the
>redesign there are often times that a DFA will fit into
>cache. With this new design I may have 1,000 (or more) DFAs
>all loaded at once, thus still requiring fast RAM access.
>Some of these DFAs will not fit into cache. I have not
>tested the new design yet. In the case of the new design it
>might be possible to gain from multiple cores.
****
But you don't know what the access patterns are, so you cannot use this simplistic model to
predict actual behavior.
*****
>
>Much more interesting to me than this, is testing your
>theory about cache hit ratio. If you are right, then my
>basic design will gain a huge amount of performance. The
>cache hit ratio could improve from 5% to 95%.
****
So stop guessing and run the experiment! Get some REAL data!
****
>
>>>
>>>>
>>>> But if you insist it can only be a FIFO single thread
>>>> processor, well, you are really wasting people's time here
>>>> because everything else you want to do contradicts your
>>>> limitations. You want to put a web server INTO your
>>>> OCR,
>>>> when in reality, you need to put your OCR into your WEB
>>>> SERVER.
>>>>
>>>> --
>>>> HLS
>>>
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
See below...
On Sat, 20 Mar 2010 16:09:25 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
>news:O%23cPxAHyKHA.984(a)TK2MSFTNGP05.phx.gbl...
>> Peter Olcott wrote:
>>
>>>
>>> Geoff has explained this better than I have.
>>
>>
>> And I don't agree with him - not one iota.
>>
>
>Let's see where Joe weighs in on this.
>
>> Until you redesign your software MEMORY USAGE, your
>> current code is not optimized for your WINTEL box or for
>> any web service worth its salt. You might as well get a
>> DOS machine to reduce all Windows overhead especially
>> graphical overhead. Recompile your code to use DPMI and
>> you will be better off than what you have now.
>>
>> --
>> HLS
>
>I won't be running on Wintel, I will be running on Linux
>Intel. I won't need any GUI.
>
****
My linux expert uses only cgi scripting and server-side Java, and doesn't know how big a
linux process can get. But you had better find out real quick before you commit yourself
down a path that is not technically feasible.

But mostly this discussion seems to be one of your opinion wanting to be louder than
anyone else's suggestion. My opinion is that until you actually MEASURE what is going on,
you are flailing about pointlessly, having NO IDEA what is going to happen. RUN THE #$!
EXPERIMENT before you rule something out based on flimsy factoids that may or may not
have any meaning in a real environment. And go hang out in some linux newsgroups to find
out what you need to know.

Note: the way we ported an app to Linux was that we had a master directory in which 85% or
so of the code resided. Then we had two subdirectories, \linux and \windows, in which we
wrote GUI components that called that common code (the menu handlers were in the common
code, even though the whole GUI infrastructure was completely different). I suggest that
you isolate any Windows-specific code, come up with a platform-independent interface to
it, and write two functions, one whose source is in the linux subdirectory and one whose
source is in the windows subdirectory. The way we did builds was that we put the
makefile in the linux subdirectory and the VS4.2 project (.dsw/.dsp) files in the windows
directory, and included the files from the common source directory as part of each build
(..\whatever.c and ..\stuff.c for example). Then we would build the two systems and test
them. This was largely done with the linux expert on one side of the table and me on the
other side, and I'd say "I need a way to dispatch a menu item to a handler" and we'd write
an OnWhatever handler and then he and I would write the GUI code to call it at the right
time. The handler needed to open a file, so it would call a function called
PlatformGetFileName, and I wrote a file dialog and he used a common linux/unix dialog box,
and we agreed on the format of the filename coming back. Which we would then pass to
PlatformOpenFile and get back something we would pass to PlatformWriteFile to write it. We
did the whole port in three days, start to finish. An intense three days, mind you, but
it was only three days. So when we talk about multithreading, or loading your DFAs, you
have a lot of platform-specific code there which you will have to recode for linux.
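
A rough sketch of the "common code calls a platform-independent interface" idea, squeezed
into one file. The names PlatformOpenFile and PlatformWriteFile come from the description
above, but their signatures here are guesses; PlatformFile, PlatformCloseFile and SaveText
are hypothetical names added for illustration. The #ifdef stands in for the two
subdirectories: in the layout described above, each branch would be its own source file
under \windows or \linux and the build would pick one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#ifdef _WIN32
// --- would live in the windows subdirectory ---
#include <windows.h>
typedef HANDLE PlatformFile;            // hypothetical handle type

bool PlatformOpenFile(const char* name, PlatformFile* out) {
    *out = CreateFileA(name, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                       FILE_ATTRIBUTE_NORMAL, NULL);
    return *out != INVALID_HANDLE_VALUE;
}
bool PlatformWriteFile(PlatformFile f, const void* data, std::size_t len) {
    DWORD written = 0;
    return WriteFile(f, data, static_cast<DWORD>(len), &written, NULL) &&
           written == len;
}
void PlatformCloseFile(PlatformFile f) { CloseHandle(f); }
#else
// --- would live in the linux subdirectory ---
#include <fcntl.h>
#include <unistd.h>
typedef int PlatformFile;               // hypothetical handle type

bool PlatformOpenFile(const char* name, PlatformFile* out) {
    *out = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    return *out >= 0;
}
bool PlatformWriteFile(PlatformFile f, const void* data, std::size_t len) {
    ssize_t written = write(f, data, len);
    return written >= 0 && static_cast<std::size_t>(written) == len;
}
void PlatformCloseFile(PlatformFile f) { close(f); }
#endif

// --- common code: knows nothing about either operating system ---
bool SaveText(const char* name, const char* text) {
    assert(name != nullptr && text != nullptr);
    PlatformFile f;
    if (!PlatformOpenFile(name, &f)) return false;
    bool ok = PlatformWriteFile(f, text, std::strlen(text));
    PlatformCloseFile(f);
    return ok;
}
```

The common handler only ever sees the three Platform* calls, so porting means rewriting
the small platform branch and nothing else.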
joe
****
From: Joseph M. Newcomer on
See below...
On Sat, 20 Mar 2010 22:30:47 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>message news:52maq5h8eg3490hrv06l066r76fp02fo0u(a)4ax.com...
>> See below...
>> On Sat, 20 Mar 2010 15:12:43 -0500, "Peter Olcott"
>> <NoSpam(a)OCR4Screen.com> wrote:
>>>>>(5) OCR processing time is less than 100 milliseconds.
>>>> ****
>>>> As you point out, if there are 10 requests pending, this
>>>> means 1sec, which violates your
>>>> 500ms goal. But if you had a concurrent server, it
>>>> would
>>>> be (nrequests * 100)/ncores
>>>> milliseconds, so for a quad-core server with 10 pending
>>>> requests and a 4-thread pool you
>>>> would have only 250ms, within your goals. If response
>>>> time is so critical, why are you
>>>> not running multithreaded already?
>>>> ****
>>>
>>>If I get more than an average of one request per second, I
>>>will get another server. My process does not improve with
>>>my multiple cores, it does get faster with faster memory.
>>>800% faster memory at the same core speed provided an 800%
>>>increase in speed. Two processes on a quad core resulted
>>>in
>>>half the speed for each.
>> ****
>> Well, if you have not designed your code to run
>> multithreaded, how do you KNOW it won't
>> run faster if you add more cores? If it is single
>> threaded, it will run at EXACTLY the
>> same speed on an 8-core system as it does on a
>> uniprocessor, because you have no
>> concurrency, but if you want to process 8 requests, you
>> currently require ~800ms, which
>> violates your apparently nonnegotiable 500ms limit, but if
>> you run multithreaded, an 8-core
>> system could handle all 8 of them concurrently,
>> meaning your total processing time on
>> 8 concurrent requests is ~100ms. Or have I missed
>> something here, and was the 500ms limit
>> abandoned?
>>
>> Seriously, how hard can it be to convert code that
>> requires no locking to multithreaded?
>> joe
>
>
>How can I empirically test exactly how much of the total
>memory bandwidth that my process is taking up?
***
I don't know. But I know that unless you measure it running multithreaded, any *opinion*
you offer about its potential performance is unfounded.
****
>
>Would you agree that if the empirical test shows that my
>single process is taking up 100% of the memory bandwidth,
>that multiple cores or multiple threads could not help
>increase speed?
****
Since you have no idea that this is true, why do you keep insisting it must be true?

My objection to your methodology is you make some guess about something based upon
accidental data and/or wishful thinking, and predicate all future behavior on these
unsubstantiated guesses. I don't guess, I measure. And there are rules about measuring:
1. Know what you want to measure
2. Know that your experiment is going to measure (1)
3. Know what the limitations of your measurement tools are
4. Know the impact of your measurement tools on your experiment
(Heisenberg was right!)
5. Know the accuracy of your measurements
6. Know that your experiment measured what you intended to measure
7. Do not conclude from a single experiment that you have measured
what you set out to measure, in a meaningful way
8. Always be open to alternative hypotheses that would explain your results

I cite here the famous Michelson-Morley experiment that measured the absence of the
"luminiferous aether" through which electromagnetic waves were hypothesized to travel.

They did an elegant experiment in interference patterns that demonstrated a 10km/sec
difference between the speed of light measured in one direction and the speed of light
measured 90 degrees to that direction. They concluded the 10km/sec difference was "within
experimental error" and therefore insignificant. A generation of science fiction writers
and wannabe "scientists" used this to hypothesize that there really was an "aether", and
its true existence was going to turn physics on its head (which would have been true, had
it existed). But in the 1970s an enterprising graduate student got to use Michelson and
Morley's original equipment to redo the experiment, and discovered that they had not only
created a very sensitive and accurate way to measure the speed of light, but they had also
created an incredibly sensitive thermometer. When rotated 90 degrees, the mirrors were
placed in a different relationship to the ambient air in the lab, and the 10km/sec
difference was actually the result of the minute changes caused by the temperature changes
across the apparatus. When the temperature could be stabilized or accounted for (I forget
which) then the 10km/sec difference disappeared. Modern experiments with lasers have shown
that the differences are insignificant. There's an important lesson here;
they THOUGHT they were measuring the speed of light, but they were really measuring the
temperature differences in their lab! So in simplistic measurements, you might have a
number that just coincidentally ends up looking like the raw memory speed; there is no
reason to presume that IS what you have measured. Until you have measured the performance
of 8 threads running in an 8-core system, you have no data, and therefore you cannot guess
at what is really going to happen.

And what if there IS memory interference? It might mean that instead of 100ms it takes
125ms to process each request. So instead of taking 800ms to process 8 requests one after
another, you take 125ms, instead of 100, to process each of the 8 requests concurrently.
Isn't this better than 800ms, especially because you fastened onto 500ms as if it were some
nonnegotiable performance number? You have no data. Stop flailing about and do some science!
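
A back-of-the-envelope sketch of that arithmetic, for anyone who wants to plug in their
own numbers. The function names are made up for illustration, and the model is
deliberately crude: every request costs the same, and a pool of workers drains the queue
in whole "waves". The 125ms figure is the hypothetical interference penalty from the
paragraph above, not a measurement.

```cpp
#include <cassert>
#include <cstddef>

// Time until the LAST of `pending` requests finishes when they are handled
// strictly one after another (the single-threaded FIFO case).
double serial_drain_ms(std::size_t pending, double per_request_ms) {
    return static_cast<double>(pending) * per_request_ms;
}

// Time until the last request finishes when `workers` threads each take one
// request at a time. Requests are not divisible, so the queue drains in
// ceil(pending / workers) waves of per_request_ms each.
double pooled_drain_ms(std::size_t pending, double per_request_ms,
                       std::size_t workers) {
    assert(workers > 0);
    std::size_t waves = (pending + workers - 1) / workers;
    return static_cast<double>(waves) * per_request_ms;
}
```

With 8 pending requests, serial_drain_ms(8, 100) is 800 while pooled_drain_ms(8, 125, 8)
is 125, even after charging each request a 25% interference penalty. Note that counting
whole waves is slightly more pessimistic than the (nrequests * 100)/ncores estimate quoted
earlier: 10 requests on 4 workers come out at 300, not 250, which is still under 500.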
joe
****
From: Joseph M. Newcomer on
See below...
On Sun, 21 Mar 2010 00:01:46 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:
>
>One thing looks like it proves you wrong. Try it and see whether it
>does prove you wrong, or whether you have another way to explain it.
>
>I run one of my DFA processes on a quad-core machine and I
>get 100% of one of the CPUs' cycles (25% of the total of
>four CPUs).
>
>I run one additional DFA process on a quad-core machine
>and now the sum CPU usage of both processes is substantially
>lower than the CPU usage of the prior single process,
>something like 12 + 7 = 19, or less than a single CPU.
****
This is a failed experiment. The reason it is a failed experiment is that you are trying
to measure the performance of two PROCESSES, which is not at all the same as trying to
measure the performance of TWO THREADS IN ONE PROCESS. No number you produce from
measuring the performance of two processes can POSSIBLY have anything to do with running
two threads in the same process. Until you have run the
multiple-threads-in-a-single-process experiment in a multicore environment you have NO IDEA
what the performance is going to be!

You have done a poor experiment, come to an erroneous conclusion, and failed to understand
that what you measured has no relationship to what you need to measure!
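
For what it's worth, the experiment that WOULD be relevant is not hard to sketch. The
following is only a skeleton under loud assumptions: make_table, walk and run_walkers are
invented names, the table of random "next state" indices is a stand-in for a real DFA
(whose access pattern will differ), and std::thread is used so the same source builds on
Windows and linux. The point is the shape of the test: ONE process, ONE shared read-only
table, N threads, wall-clock time for the whole batch.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a transition table: entry i names the next index.
std::vector<std::uint32_t> make_table(std::size_t entries) {
    assert(entries > 0 && entries <= 0xFFFFFFFFu);
    std::vector<std::uint32_t> table(entries);
    std::uint64_t x = 88172645463325252ull;     // xorshift64, fixed seed
    for (auto& e : table) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        e = static_cast<std::uint32_t>(x % entries);
    }
    return table;
}

// One worker: `steps` dependent lookups through the shared table.
std::uint32_t walk(const std::vector<std::uint32_t>& table,
                   std::uint32_t start, std::size_t steps) {
    assert(!table.empty());
    std::uint32_t state = static_cast<std::uint32_t>(start % table.size());
    for (std::size_t i = 0; i < steps; ++i) state = table[state];
    return state;
}

struct RunResult {
    double seconds;                             // wall clock for the whole batch
    std::vector<std::uint32_t> final_states;    // keeps the work from being optimized away
};

// Run `nthreads` workers concurrently over the SAME table and time the batch.
RunResult run_walkers(const std::vector<std::uint32_t>& table,
                      unsigned nthreads, std::size_t steps) {
    assert(nthreads > 0);
    std::vector<std::uint32_t> finals(nthreads);
    std::vector<std::thread> pool;
    auto t0 = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&table, &finals, t, steps] {
            finals[t] = walk(table, t * 7919u, steps);
        });
    for (auto& th : pool) th.join();
    auto t1 = std::chrono::steady_clock::now();
    return { std::chrono::duration<double>(t1 - t0).count(), finals };
}
```

Build a table much larger than the cache, call run_walkers with 1, 2, 4 and 8 threads and
the same step count, repeat each run several times, and compare the times. If 8 threads
take about as long as 1, the threads scale; if they take 8 times as long, they don't.
Either way it is a number from the right experiment.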
joe
****
>
>What could explain this other than the fact that the first
>process ate up all of the RAM bandwidth?
>
***
Paging.

Your experiment was ill-conceived, badly-conducted, produced irrelevant results, and you
are generalizing from this to something unrelated to it.
****
From: Joseph M. Newcomer on
See below...
On Sun, 21 Mar 2010 10:52:08 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
>news:eQO7mgLyKHA.1548(a)TK2MSFTNGP02.phx.gbl...
>> Peter Olcott wrote:
>>
>>> One thing looks like it proves you wrong. Try it and see
>>> whether it does prove you wrong, or whether you have
>>> another way to explain it.
>>>
>>> I run one of my DFA processes on a quad-core machine and
>>> I get 100% of one of the CPUs' cycles (25% of the
>>> total of four CPUs).
>>
>>
>> In general, when you get 100% of a CPU, there could be two
>> reasons:
>>
>> 1) Your context switching is TOO HIGH!
>
>I am running my process on a quad-core machine and getting
>100% of one of the four cores. I don't think under this
>situation there is any context switching at all.
****
You continue to demonstrate how little you know about operating systems. For example,
interrupts that happen are charged to the running thread, and any Deferred Procedure Call
time is charged to the running thread. You don't see any of this accounted for in the
normal %utilization display, because it isn't. So all you know running on a uniprocessor
is that 100% of the total CPU cycles are charged against one thread. There could be
dozens to thousands of context swaps going on in there, and you don't know!

And scheduler interrupts and the time spent discovering that your thread is still the
biggest, baddest thread on the block and therefore should run again is charged to the
process. If you run longer than 15ms, you have one of these context-swaps-into-the-kernel
events every 15ms. The scheduler is invoked every 30 ms. You could have tens to
thousands of device interrupts (which upset the cache behavior) every second.

I repeat, you have NO IDEA what you are measuring, and therefore you cannot make valid
design decisions based on this data. Until you have built a true multithreaded app, you
have no clue what multithreading will do. It *might* not be a good idea, it might be the
best thing that ever happened, or something in between. But you have NO DATA on which to
base any of these decisions. Your data is questionable, your experimental design wouldn't
pass a sophomore physics course, and your logic for deriving design decisions from suspect
data is highly questionable. I may have gone to a liberal arts undergraduate college, but
my professors would never have accepted any of what you are doing as acceptable
methodology. At serious tech schools like MIT, Stanford, Caltech, etc. the freshmen
would probably laugh at these posts, given the bad science they represent.

You are working in a very complex world, so complex that there are rarely closed-form
analytic solutions to any of the problems you are concerned with. Yet you take one or two
numbers and leap to unfounded conclusions. This is pure guesswork, no real validation. So
the only validation is to run actual measurements and get REAL numbers.

Note that you have not told us HOW you get these measurements. For example, if you use
GetTickCount(), then your numbers are implicitly ±15ms. If you don't know what "gating
error" is then you really don't know what the reliability of your measurement tool is. So
you cannot compare ANY two experiments that differ by less than 30ms if you use the
GetTickCount() method.
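
A small sketch of what "know the limitations of your measurement tool" can look like in
code. Assumptions: std::chrono::steady_clock is used as a portable stand-in for whatever
timer you actually call (GetTickCount() itself is Windows-only), and observed_tick_ms and
distinguishable are made-up names. The first function estimates the smallest step the
clock is seen to take; the second applies the rule above, that two readings each good to
plus-or-minus one tick cannot be told apart unless they differ by more than two ticks.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Spin until the clock reading changes and record the size of the jump.
// The minimum over several tries is an upper bound on the clock's
// granularity (it also includes the cost of calling now()).
double observed_tick_ms(int samples = 20) {
    assert(samples > 0);
    using clock = std::chrono::steady_clock;
    double smallest = 1e9;
    for (int i = 0; i < samples; ++i) {
        auto start = clock::now();
        auto next = clock::now();
        while (next == start) next = clock::now();
        double step =
            std::chrono::duration<double, std::milli>(next - start).count();
        if (step < smallest) smallest = step;
    }
    return smallest;
}

// Two timings, each uncertain by one tick, are only meaningfully different
// if they are more than two ticks apart.
bool distinguishable(double a_ms, double b_ms, double tick_ms) {
    return std::fabs(a_ms - b_ms) > 2.0 * tick_ms;
}
```

With a 15ms tick, distinguishable(100, 125, 15) is false: a "100ms" run and a "125ms" run
are the same number as far as that timer can tell. A finer clock shrinks the tick, but the
rule stays the same.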
joe
****
>
>>
>> 2) You are not yielding enough or not doing any work that
>> causes system interrupts. In other words, context switching
>> is all natural based on the CPU quantum.
>>
>> Example:
>>
>> while (1)
>> {
>> }
>>
>> and you get 100% of the CPU. Change it to this:
>>
>> while (1)
>> {
>> if (_kbhit() && _getch() == 27) break;
>> }
>>
>> and you get a lot less, lets say 60%, change it to this:
>>
>> while (1)
>> {
>> Sleep(1);
>> }
>>
>> and you will see 0% CPU!!! But change it to this:
>>
>> while (1)
>> {
>> Sleep(0);
>> }
>>
>> and you will see 100% again!
>>
>> Why?
>>
>> The kbhit (keyboard) introduces a hardware interrupt, that
>> causes a pre-emptive context switch (CS) which means YOU
>> don't have 100% of the CPU.
>>
>> The Sleep(1) forces a minimum quantum (~15ms) CS. The
>> sleep is ~15ms, not 1ms. However the Sleep(0), is a
>> special sleep which acts like a POKE (a term used in RTOS
>> programming) to give others a chance to RUN but do not
>> SLEEP. Because of this POKING and NO SLEEPING, the
>> context switching is very high thus 100% CPU usage.
>>
>> So if you have 100% CPU, then either you are doing too
>> much context switching or not enough where the single
>> thread has 100% attention.
>
>Why would a single process that is executing on one of four
>CPU's need to do any context switching at all? Any other
>process could simply use another CPU.
>
>>
>> Now......
>>
>>> I run one additional DFA process on a quad-core machine
>>> and now the sum CPU usage of both processes is
>>> substantially lower than the CPU usage of the prior
>>> single process, something like 12 + 7 = 19, or less than
>>> a single CPU.
>>>
>>> What could explain this other than the fact that the
>>> first process ate up all of the RAM bandwidth?
>>
>>
>> It didn't!!! The 100% CPU is representative of the FACT
>> that you are either not doing any context switching or YOU
>> are doing too much.
>
>In any case actual benchmarking shows that adding another
>process disproportionally slows down the processing of both
>processes. The sum of the wall clock time for both processes
>is substantially more than double the time for a single
>process.
>
>I don't see how this could change much if I changed two
>processes into two threads, it might be a slowdown to a
>lesser degree, but, any slowdown at all is unacceptable.
>
>>
>> You can not use the CPU percentage like this to determine
>> what your performance will be.
>
>I am also measuring wall clock time, and wall clock time is
>a direct measure of response time.
>
>>
>> In other words, we called that HOGGING the CPU! Your
>> process is NOT friendly with the machine.
>
>
>The ONLY purpose of my dedicated server is to run my
>process, so it can not be considered to be hogging the CPU.
>Anything at all that is not necessary to the execution of my
>process is wasting the CPU.
>
>>
>> You will be surprised at the performance gains simply by
>> understanding the fundamental ideas of preemptive
>> multi-threading operating system.
>>
>> Try adding a Sleep(1) in some places, or a Sleep(0) in
>> some loop BUT NEVER A TIGHT LOOP (like above that does
>> little work). You can do a poke for example every 100ms.
>
>My whole process is necessarily (for maximum performance
>reasons) a single very tight loop.
>
>>
>> Trust me, YOUR PROCESS WILL BEHAVE SO MUCH DIFFERENTLY!
>>
>> Right now, you are hogging the system. Your MEMORY ACCESS
>> is not your problem, but the HUGE REDUNDANT LOADING is
>> contributing to other factors that will cause more paging
>> and slow downs.
>>
>> --
>> HLS
>
>Yeah OK this may be the case in multiple processes over
>multiple threads with shared data. What if each process used
>a different portion of 2 GB of shared data (assuming the
>system has plenty of excess RAM to hold this without
>paging). Couldn't it still be possible for the memory
>bandwidth speed to limit the performance of two simultaneous
>threads executing on separate CPUs over the performance of a
>two sequential threads executing on the same CPU?
>