From: Peter Olcott on

"Geoff" <geoff(a)invalid.invalid> wrote in message
news:7egbq5pif7se5jmkbc0idv3hg6umlt0uo8(a)4ax.com...
> On Sun, 21 Mar 2010 00:16:08 -0500, "Peter Olcott"
> <NoSpam(a)OCR4Screen.com> wrote:
>
>>Given (as in Geometry) as an immutable premise (to be
>>taken as true even if it is false) that my process takes
>>essentially ALL of the memory bandwidth, then within the
>>specific context of this immutable premise (to be taken as
>>true even if it is false) could adding another such process
>>speed things up or would it slow them down?
>>
>
> A thread is not a process.
>
>>> What I am saying is this: suppose you have 10 lines of
>>> DFA C code; the compiler creates OP CODES for these 10
>>> lines. Each OP CODE has a fixed frequency cycle. When
>>> the accumulated frequency reaches a QUANTUM (~15ms), you
>>> will get a context switch - in other words, your code is
>>> preempted (stopped), swapped out, and Windows will give
>>> all other threads a chance to run.
>>
>>This does not occur on my quad-core machine. I
>>consistently get every bit of all of the CPU cycles of a
>>single core.
>
> The other three cores sit idle.
> You are getting 100% use of 25% of the machine capacity.

Also, when I add another process (I add a process rather
than a thread because the code can already do this without
changes), the combined CPU usage of both processes becomes
substantially less than that of the prior single process.
The prior single process took 25% of the CPU time; the two
processes now take 11 + 7 = 19% of the CPU time.

Even though these are processes rather than threads, from
what I understand of the difference between them this test
would tend to indicate that adding another thread would have
comparable results.


From: Peter Olcott on

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:eQO7mgLyKHA.1548(a)TK2MSFTNGP02.phx.gbl...
> Peter Olcott wrote:
>
>> One thing looks like it proves you wrong try and see if
>> it does prove you wrong, or you have another way to
>> explain it.
>>
>> I run one of my DFA processes on a quad-core machine and
>> I get 100% of the one of the CPU's cycles (25% of the
>> total of four CPUs)
>
>
> In general, when you get 100% of a CPU, there could be two
> reasons:
>
> 1) Your context switching is TOO HIGH!

I am running my process on a quad-core machine and getting
100% of one of the four cores. I don't think under this
situation there is any context switching at all.

>
> 2) You are not yielding enough, or not doing any work
> that causes system interrupts. In other words, context
> switching is all natural, based on the CPU quantum.
>
> Example:
>
> while (1)
> {
> }
>
> and you get 100% of the CPU. Change it to this:
>
> while (1)
> {
>     if (_kbhit() && _getch() == 27) break;
> }
>
> and you get a lot less, let's say 60%. Change it to this:
>
> while (1)
> {
>     Sleep(1);
> }
>
> and you will see 0% CPU!!! But change it to this:
>
> while (1)
> {
>     Sleep(0);
> }
>
> and you will see 100% again!
>
> Why?
>
> The kbhit (keyboard) introduces a hardware interrupt that
> causes a pre-emptive context switch (CS), which means YOU
> don't have 100% of the CPU.
>
> The Sleep(1) forces a minimum quantum (~15ms) CS. The
> sleep is ~15ms, not 1ms. However, the Sleep(0) is a
> special sleep which acts like a POKE (a term used in RTOS
> programming) to give others a chance to RUN but not
> SLEEP. Because of this POKING and NO SLEEPING, the
> context switching is very high, thus 100% CPU usage.
>
> So if you have 100% CPU, then either you are doing too
> much context switching or not enough where the single
> thread has 100% attention.

Why would a single process that is executing on one of four
CPUs need to do any context switching at all? Any other
process could simply use another CPU.

>
> Now......
>
>> I run one additional DFA processes on a quad-core machine
>> and now the sum CPU usage of both processes is
>> substantially lower than the CPU usage of the prior
>> single process, something like 12 + 7 = 19, or less than
>> a single CPU.
>>
>> What could explain this other than the fact that the
>> first process ate up all of the RAM bandwidth?
>
>
> It didn't!!! The 100% CPU is representative of the FACT
> that you are either not doing any context switching or YOU
> are doing too much.

In any case, actual benchmarking shows that adding another
process disproportionately slows down the processing of both
processes. The sum of the wall clock time for both processes
is substantially more than double the time for a single
process.

I don't see how this could change much if I changed two
processes into two threads. It might be a slowdown to a
lesser degree, but any slowdown at all is unacceptable.

>
> You cannot use the CPU percentage like this to determine
> what your performance will be.

I am also measuring wall clock time, and wall clock time is
a direct measure of response time.

>
> In other words, we call that HOGGING the CPU! Your
> process is NOT friendly with the machine.


The ONLY purpose of my dedicated server is to run my
process, so it cannot be considered to be hogging the CPU.
Anything at all that is not necessary to the execution of my
process is wasting the CPU.

>
> You will be surprised at the performance gains simply by
> understanding the fundamental ideas of a preemptive
> multithreading operating system.
>
> Try adding a Sleep(1) in some places, or a Sleep(0) in
> some loop, BUT NEVER IN A TIGHT LOOP (like the one above
> that does little work). You can do a poke, for example,
> every 100ms.

My whole process is necessarily (for maximum performance
reasons) a single very tight loop.

>
> Trust me, YOUR PROCESS WILL BEHAVE SO MUCH DIFFERENTLY!
>
> Right now, you are hogging the system. Your MEMORY ACCESS
> is not your problem, but the HUGE REDUNDANT LOADING is
> contributing to other factors that will cause more paging
> and slowdowns.
>
> --
> HLS

Yeah, OK, this may be the case with multiple processes
versus multiple threads with shared data. What if each
process used a different portion of 2 GB of shared data
(assuming the system has plenty of excess RAM to hold this
without paging)? Couldn't it still be possible for the
memory bandwidth to limit the performance of two
simultaneous threads executing on separate CPUs relative to
two sequential threads executing on the same CPU?


From: Peter Olcott on

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:OgyWYsLyKHA.1548(a)TK2MSFTNGP02.phx.gbl...
> Peter Olcott wrote:
>
>> The next question is an attempt to see if we can agree on
>> anything or that your mind is stuck in refute mode.
>> Please do not avoid this question, please answer this
>> question.
>>
>> Given (as in Geometry) as an immutable premise (to be
>> taken as true even if it is false) that my process takes
>> essentially ALL of the memory bandwidth, then within the
>> specific context of this immutable premise (to be taken
>> as true even if it is false) could adding another such
>> process speed things up or would it slow them down?
>
>
> This is all based on the erroneous, flawed premise that
> your CPU usage is all based on MEMORY ACCESS, which is
> completely FALSE.
>
> Your thinking is based on the erroneous idea that you have
> 100% exclusive access to memory, which you don't and NEVER
> will, that your

If my process only uses 60% of the memory bandwidth, then
adding another thread that also wants 60% of the memory
bandwidth would necessarily slow down both threads:
together they demand 120% of what the bus can deliver, so
each gets at most 50% and runs at 50/60ths of its former
speed. The sum total of their wall clock time would
therefore exceed 200% of the wall clock time for a single
thread.

I think that the actual case is that my process wants as
close to 100% of the memory bandwidth as it can get. I
don't know how to empirically measure this.

I do know that almost all of the execution of the process
involves reading memory. The only part that does not involve
reading memory is the part that gets the next memory address
to be read, and it always gets this from reading memory,
plus a single sum of two integers.

> single process thread is always in a 100% active state
> (Running), which is FALSE. A thread has the following
> states:
>
> - Processor State: Ready, Standby, Running
> - Global States: deferred ready, waiting
>
> Your single-thread process will *never* always be in a
> running state. That is where your FLAWED thinking starts!
>
> However, if there is no one around to do work, then you
> have an "observed measurement" of 100% running state with
> 100% CPU usage.
>
> You are hogging the CPU.
>
>> This does not occur on my quad-core machine. I
>> consistently get every bit of all of the CPU cycles of a
>> single core.
>
>
> Are you saying your process never gets preempted? That
> it's already in a running state? That the CPU's
> hardware-issued QUANTUM does not apply to your process?
>
> I asked you for your context switch count, which you
> didn't provide before. What was it? ZERO?
>
> --
> HLS


From: Hector Santos on
Peter,

Your understanding is so off base, and with your
unwillingness to listen to or follow anyone, I am just going
to point you to links for your reading. Hopefully you will
get a better grasp and finally say "Ah HA!" I believe the
first one is very good "easy reading" and maybe that alone
will give you better insight. As you read, there are 4-5
things to highlight and pay attention to:

- Process and Thread Affinity
- Preferred Processor
- Non-Uniform Memory Access (NUMA)
- Virtual Allocation
- Network Interrupts (Web Server Requests!!)

Enjoy

Processor Affinity
http://www.tmurgent.com/WhitePapers/ProcessorAffinity.pdf

Perceived Performance
http://www.tmurgent.com/WhitePapers/PerceivedPerformance.pdf

Pushing the Limits of Windows: Processes and Threads
http://blogs.technet.com/markrussinovich/archive/2009/07/08/3261309.aspx

New NUMA Support with Windows Server 2008 R2 and Windows 7
http://code.msdn.microsoft.com/64plusLP

NUMA Support
http://msdn.microsoft.com/en-us/library/aa363804(VS.85).aspx

NUMA optimization in Windows Applications
http://developer.amd.com/documentation/articles/pages/1162007106.aspx

Mainstream NUMA and the TCP/IP stack: Part I. (EXPERT LEVEL)
http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx

Linux Support for NUMA Hardware
http://lse.sourceforge.net/numa/

Process Lasso (Utility to help balance your processors)
http://www.bitsum.com/prolasso.php


--
HLS


From: Peter Olcott on

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:eCRxtlRyKHA.4240(a)TK2MSFTNGP06.phx.gbl...
> Peter,
>
> Your understanding is so off base, and with your
> unwillingness to listen to or follow anyone, I am just
> going to point you to links for your reading. Hopefully
> you will get a better grasp and finally say "Ah HA!" I
> believe the first one is very good "easy reading" and
> maybe that alone will give you better insight. As you
> read, there are 4-5 things to highlight and pay attention
> to:
>
> - Process and Thread Affinity
> - Preferred Processor
> - Non-Uniform Memory Access (NUMA)
> - Virtual Allocation
> - Network Interrupts (Web Server Requests!!)
>
> Enjoy

I am going for a second opinion from another set of groups.
I made a copy of this other post and posted it to this
forum. I can't spend the time reading up on a whole lot of
different things that would seem to be moot at this point.

I am convinced that an HTTP-based web server directly hooked
to my code is the best approach to making my app
web-enabled; I really appreciate your help guiding me to
this path. This would seem to provide a very robust,
high-performance interface that can be implemented in
minimal time, exactly what I was looking for. Thanks again.
