From: Peter Olcott on

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:eGiPNBLyKHA.1548(a)TK2MSFTNGP02.phx.gbl...
> Peter Olcott wrote:
>
>> Would you agree that if the empirical test shows that my
>> single process is taking up 100% of the memory bandwidth,
>> that multiple cores or multiple threads could not help
>> increase speed?
>
>
> You are thinking about this all wrong.
>
> You have quantum-based context switching that you can't stop,
> even for a single-thread process. In other words, you
> will never have 100% exclusive control of MEMORY
> ACCESS - never. If you did, nothing else would run.
>
> What I am saying is this: suppose you have 10 lines of DFA
> C code, and the compiler creates OP CODES for those 10 lines.
> Each OP CODE takes a fixed number of cycles. When the
> accumulated time reaches a QUANTUM (~15ms), you will
> get a context switch - in other words, your code is
> preempted (stopped), swapped out, and Windows will give all
> other threads a chance to run.
>
> That gives other threads in your process, if it were
> multi-threaded, a chance to do the same type of memory
> access work. Since it is READ ONLY, there is no contention.
> If your preempted thread had BLOCKED it, then you would have
> contention or even a deadlock - but you are not doing that.
> You are reading only READ ONLY memory - which will have
> maximum access speed.
>
> Now comes a MULTI-CORE machine, and you have two or more
> threads; the SPEED gain is that there is NO CONTEXT
> SWITCHING - you may still have the same memory access, but
> that would be no slower than on a single CPU. Your speed
> comes from less context switching. Understand?
>
> In short:
>
> single cpu: speed lost due to context switching
> multi cpu/core: less context switching, more resident
> time.
>
> You cannot think in terms of a single-thread process,
> because there is no advantage for it on a multi-core/cpu
> machine.
>
> The INTEL multi-core chips have advanced technology to help
> multi-threaded applications. Single-thread processes can
> not benefit on a multi-core machine. They must be designed
> with threads to see any benefits. If you want to read up
> on it, check out the Intel technical documents, like this
> one:
>
>
> http://download.intel.com/technology/architecture/sma.pdf
>
> Specifically, read about SMA, "Smart Memory Access".
>
> The bottom line is really simple:
>
> You have a single process with a huge memory load. Each
> additional instance redundantly creates another huge memory
> load, and that alone will cause serious SYSTEM WIDE
> performance degradation, with huge page faulting and context
> switching delays.
>
> You will never get any improvements until you change your
> memory usage to something intelligently sharable and use
> threads. When done correctly, you will gain the benefits
> provided by the OS and the machine.
>
> You really need to look at this as a whole:
>
> 1 process with X number of threads
>
> vs
>
> X single-thread processes.
>
> You need to trust us: this is NOT the same when the DATA is
> HUGE! In the threaded model, it is shared. In the
> non-threaded model, it is redundant for each instance - that
> will murder you!
>
> If you had NO HUGE data requirement, then they become more
> equal, because now it's just CODE.
>
> Now, it is conceivable that for your specific application,
> you might find that X may be 5-10 threads before you
> see a performance issue that isn't to your liking.
>
> Show me how you are using std::vector with your files, and
> I will create a simulator for you to PROVE to you how your
> thinking is all wrong. This simulator will allow you to
> fine-tune it to determine what your boundary conditions
> for performance are.
>
> While you have 20,000 hrs into this WITHOUT even exploring
> high-end thread designs, I have 6 years in Intel RMX
>
> http://en.wikipedia.org/wiki/RMX_(operating_system)
>
> which was considered one of the early precursors of the Intel
> "multi-thread" frameworks we have today, and it gave me an
> early, natural understanding by the time NT 3.1 came out (17
> years ago?). I have done exclusively high-end multi-threaded
> commercial server products since then. Count the hours! I can
> assure you, your single-process thinking is wrong.
>
> --
> HLS

One thing looks like it proves you wrong; try to see if it
does prove you wrong, or whether you have another way to
explain it.

I run one of my DFA processes on a quad-core machine and I
get 100% of one CPU's cycles (25% of the total of the four
CPUs).

I run one additional DFA process on the quad-core machine,
and now the summed CPU usage of both processes is
substantially lower than the CPU usage of the prior single
process, something like 12 + 7 = 19, i.e. less than a single
CPU.

What could explain this other than the fact that the first
process ate up all of the RAM bandwidth?
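
One way to test the bandwidth theory directly is a small stand-alone
probe that does nothing but dependent reads from a table too large for
the CPU caches, run first as one copy and then as two copies side by
side. The sketch below is only an illustration, not the actual DFA
code; the table size, the affine transition function, and the step
count are arbitrary placeholders, and it assumes enough free RAM for a
~256 MB table.

#include <windows.h>
#include <cstdio>
#include <vector>

int main()
{
    // Build a "next state" table of ~256 MB, far larger than any CPU cache.
    // The affine map below is a full-period permutation of 0..states-1, so
    // the walk jumps all over the table and defeats the caches.
    const size_t states = 64 * 1024 * 1024;
    std::vector<unsigned> table(states);
    for (size_t i = 0; i < states; ++i)
        table[i] = (unsigned)((i * 2654435761u + 12345u) % states);

    // Walk the table with dependent loads and report the transition rate.
    const size_t steps = 100 * 1000 * 1000;
    unsigned state = 0;
    DWORD start = GetTickCount();
    for (size_t i = 0; i < steps; ++i)
        state = table[state];
    DWORD ms = GetTickCount() - start;

    printf("%u transitions/ms (final state %u)\n",
           (unsigned)(steps / (ms ? ms : 1)), state);
    return 0;
}

If two copies running together really do cut the combined rate roughly
in half, that points at memory bandwidth; if each copy keeps most of
its single-copy rate, the slowdown is coming from somewhere else.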


From: Peter Olcott on

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:eGiPNBLyKHA.1548(a)TK2MSFTNGP02.phx.gbl...
> Peter Olcott wrote:
>
>> Would you agree that if the empirical test shows that my
>> single process is taking up 100% of the memory bandwidth,
>> that multiple cores or multiple threads could not help
>> increase speed?
>
>
> You are thinking about this all wrong.
>
> You have quantum-based context switching that you can't stop,
> even for a single-thread process. In other words, you
> will never have 100% exclusive control of MEMORY
> ACCESS - never. If you did, nothing else would run.

OK, fine, then nitpick to death the insignificant imprecision
of my statement.

The next question is an attempt to see whether we can agree on
anything, or whether your mind is stuck in refute mode. Please
do not avoid this question; please answer it.

Given (as in Geometry) as an immutable premise (to be taken
as true even if it is false) that my process takes
essentially ALL of the memory bandwidth, then within the
specific context of this immutable premise (to be taken as
true even if it is false) could adding another such process
speed things up or would it slow them down?

> What I am saying is this: suppose you have 10 lines of DFA
> C code, and the compiler creates OP CODES for those 10 lines.
> Each OP CODE takes a fixed number of cycles. When the
> accumulated time reaches a QUANTUM (~15ms), you will
> get a context switch - in other words, your code is
> preempted (stopped), swapped out, and Windows will give all
> other threads a chance to run.

This does not occur on my quad-core machine. I consistently
get all of the CPU cycles of a single core.



From: Hector Santos on
Peter Olcott wrote:

> One thing looks like it proves you wrong; try to see if it
> does prove you wrong, or whether you have another way to
> explain it.
>
> I run one of my DFA processes on a quad-core machine and I
> get 100% of one CPU's cycles (25% of the total of the four
> CPUs).


In general, when you get 100% of a CPU, there could be two reasons:

1) Your context switching is TOO HIGH!

2) You are not yielding enough, or not doing any work that
causes system interrupts. In other words, the only context
switching is the natural kind, based on the CPU quantum.

Example (with <conio.h> and <windows.h> included for _kbhit/_getch and
Sleep):

while (1)
{
}

and you get 100% of the CPU. Change it to this:

while (1)
{
    if (_kbhit() && _getch() == 27) break;
}

and you get a lot less, let's say 60%. Change it to this:

while (1)
{
    Sleep(1);
}

and you will see 0% CPU!!! But change it to this:

while (1)
{
    Sleep(0);
}

and you will see 100% again!

Why?

The kbhit (keyboard) check introduces a hardware interrupt that causes
a pre-emptive context switch (CS), which means YOU don't have 100% of
the CPU.

The Sleep(1) forces a minimum quantum (~15ms) CS. The sleep is ~15ms,
not 1ms. However, Sleep(0) is a special sleep which acts like a
POKE (a term used in RTOS programming) to give others a chance to RUN
but without SLEEPING. Because of this POKING and NO SLEEPING, the
context switching is very high, thus 100% CPU usage.

So if you have 100% CPU, then either you are doing too much context
switching, or not enough, in which case the single thread has 100% of
the attention.

Now......

> I run one additional DFA process on the quad-core machine,
> and now the summed CPU usage of both processes is
> substantially lower than the CPU usage of the prior single
> process, something like 12 + 7 = 19, i.e. less than a single
> CPU.
>
> What could explain this other than the fact that the first
> process ate up all of the RAM bandwidth?


It didn't!!! The 100% CPU is representative of the FACT that you are
either not doing any context switching or YOU are doing too much.

You can not use the CPU percentage like this to determine what your
performance will be.

In other words, we call that HOGGING the CPU! Your process is NOT
being friendly with the machine.

You will be surprised at the performance gains you can get simply by
understanding the fundamental ideas of a preemptive multi-threading
operating system.

Try adding a Sleep(1) in some places, or a Sleep(0) in some loop - BUT
NEVER IN A TIGHT LOOP (like the ones above that do little work). You
can do a poke, for example, every 100ms, as in the sketch below.
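
A minimal sketch of that pattern, where a counter loop stands in for
the real work and the 5-second run and the 100 ms poke interval are
just placeholders:

#include <windows.h>
#include <cstdio>

int main()
{
    unsigned long long counter = 0;           // stand-in for real work
    DWORD start = GetTickCount();
    DWORD lastPoke = start;

    while (GetTickCount() - start < 5000)     // run for ~5 seconds
    {
        ++counter;                            // one "unit of work"
        DWORD now = GetTickCount();
        if (now - lastPoke >= 100)            // roughly every 100 ms...
        {
            Sleep(0);                         // ...poke: yield, but don't sleep
            lastPoke = now;
        }
    }

    printf("%lu million iterations\n", (unsigned long)(counter / 1000000));
    return 0;
}

The poke does not by itself reduce the work done; it just gives other
ready threads a chance to run at predictable points instead of waiting
for the quantum to expire.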

Trust me, YOUR PROCESS WILL BEHAVE SO MUCH DIFFERENTLY!

Right now, you are hogging the system. Your MEMORY ACCESS is not your
problem, but the HUGE REDUNDANT LOADING is contributing to other
factors that will cause more paging and slowdowns.
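
To make the contrast concrete, here is a minimal sketch of the
"1 process with X number of threads" shape, where one read-only table
is loaded once and then traversed by several worker threads with no
locking. The table contents, its size, and the thread count are
placeholders, not the real DFA data.

#include <windows.h>
#include <process.h>
#include <cstdio>
#include <vector>

// One read-only table, loaded once and shared by every worker thread.
static std::vector<unsigned> g_table;

unsigned __stdcall Worker(void* arg)
{
    unsigned state = (unsigned)(size_t)arg;          // different start state per thread
    for (size_t i = 0; i < 50u * 1000 * 1000; ++i)
        state = g_table[state % g_table.size()];     // read-only: no locks needed
    return state;
}

int main()
{
    const size_t states = 16 * 1024 * 1024;          // placeholder size (~64 MB)
    g_table.resize(states);
    for (size_t i = 0; i < states; ++i)
        g_table[i] = (unsigned)((i * 2654435761u + 12345u) % states);

    const int kThreads = 4;                          // e.g. one per core
    HANDLE threads[kThreads];
    for (int t = 0; t < kThreads; ++t)
        threads[t] = (HANDLE)_beginthreadex(NULL, 0, Worker, (void*)(size_t)t, 0, NULL);

    WaitForMultipleObjects(kThreads, threads, TRUE, INFINITE);
    for (int t = 0; t < kThreads; ++t)
        CloseHandle(threads[t]);
    return 0;
}

Running X separate single-thread processes instead would mean X
separate copies of g_table - which is exactly the redundant loading
being described.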

--
HLS
From: Hector Santos on
Peter Olcott wrote:

> The next question is an attempt to see if we can agree on
> anything or that your mind is stuck in refute mode. Please
> do not avoid this question, please answer this question.
>
> Given (as in Geometry) as an immutable premise (to be taken
> as true even if it is false) that my process takes
> essentially ALL of the memory bandwidth, then within the
> specific context of this immutable premise (to be taken as
> true even if it is false) could adding another such process
> speed things up or would it slow them down?


This is all based on the erroneous, flawed premise that your CPU usage
is all due to MEMORY ACCESS, which is completely FALSE.

Your thinking is based on the erroneous idea that you have 100%
exclusive access to memory, which you don't and NEVER will, and that
your single process thread is always in a 100% active state (Running),
which is FALSE. A thread has the following states:

- Processor State: Ready, Standby, Running
- Global States: deferred ready, waiting

Your single-thread process will *never* always be in the Running state.
That is where your FLAWED thinking starts!

However, if there is no one else around to do work, then you get an
"observed measurement" of a 100% running state with 100% CPU usage.

You are hogging the CPU.

> This does not occur on my quad-core machine. I consistently
> get all of the CPU cycles of a single core.


Are you saying your process never gets preempted? That it is always in
a running state? That the CPU hardware-issued QUANTUM does not apply
to your process?

I asked you before for your context switch count, which you didn't
provide. What was it? ZERO?
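
For reference, one way to watch that number is through the performance
counters. A rough sketch that samples the system-wide "Context
Switches/sec" counter, assuming Vista or later for PdhAddEnglishCounterW
and linking against pdh.lib (per-thread counts are also visible under
the Thread object in Performance Monitor):

#include <windows.h>
#include <pdh.h>
#include <cstdio>
#pragma comment(lib, "pdh.lib")

int main()
{
    PDH_HQUERY query = NULL;
    PDH_HCOUNTER counter = NULL;

    if (PdhOpenQuery(NULL, 0, &query) != ERROR_SUCCESS)
        return 1;
    if (PdhAddEnglishCounterW(query, L"\\System\\Context Switches/sec",
                              0, &counter) != ERROR_SUCCESS)
        return 1;

    PdhCollectQueryData(query);                 // first sample primes the counter
    for (int i = 0; i < 5; ++i)
    {
        Sleep(1000);
        PdhCollectQueryData(query);
        PDH_FMT_COUNTERVALUE value;
        if (PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE,
                                        NULL, &value) == ERROR_SUCCESS)
            printf("context switches/sec: %.0f\n", value.doubleValue);
    }

    PdhCloseQuery(query);
    return 0;
}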

--
HLS
From: Geoff on
On Sun, 21 Mar 2010 00:16:08 -0500, "Peter Olcott"
<NoSpam(a)OCR4Screen.com> wrote:

>Given (as in Geometry) as an immutable premise (to be taken
>as true even if it is false) that my process takes
>essentially ALL of the memory bandwidth, then within the
>specific context of this immutable premise (to be taken as
>true even if it is false) could adding another such process
>speed things up or would it slow them down?
>

A thread is not a process.

>> What I am saying is this: suppose you have 10 lines of DFA
>> C code, and the compiler creates OP CODES for those 10 lines.
>> Each OP CODE takes a fixed number of cycles. When the
>> accumulated time reaches a QUANTUM (~15ms), you will
>> get a context switch - in other words, your code is
>> preempted (stopped), swapped out, and Windows will give all
>> other threads a chance to run.
>
>This does not occur on my quad-core machine. I consistently
>get all of the CPU cycles of a single core.

The other three cores sit idle.
You are getting 100% use of 25% of the machine capacity.
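
To put numbers on that: Task Manager's per-process figure is CPU time
divided by (wall time x number of logical processors), so a fully busy
single thread on a quad-core reads as about 25%. A rough sketch of that
arithmetic, where the one-second busy loop is just a stand-in for real
work:

#include <windows.h>
#include <cstdio>

// Convert a FILETIME (100-ns units) to milliseconds.
static double ToMs(const FILETIME& ft)
{
    ULARGE_INTEGER u;
    u.LowPart = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart / 10000.0;
}

int main()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    const DWORD cores = si.dwNumberOfProcessors;

    FILETIME create, exit_, kern0, user0, kern1, user1;
    GetProcessTimes(GetCurrentProcess(), &create, &exit_, &kern0, &user0);
    DWORD start = GetTickCount();

    volatile unsigned x = 0;                    // stand-in: burn one core for ~1s
    while (GetTickCount() - start < 1000)
        ++x;

    GetProcessTimes(GetCurrentProcess(), &create, &exit_, &kern1, &user1);
    double wallMs = GetTickCount() - start;
    double cpuMs  = (ToMs(kern1) - ToMs(kern0)) + (ToMs(user1) - ToMs(user0));

    printf("%.0f%% of one core, %.0f%% of the machine (%lu cores)\n",
           100.0 * cpuMs / wallMs, 100.0 * cpuMs / (wallMs * cores), cores);
    return 0;
}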