From: Peter Olcott on
I have an application that uses enormous amounts of RAM in a
very memory bandwidth intensive way. I recently upgraded my
hardware to a machine with 600% faster RAM and 32-fold more
L3 cache. This L3 cache is also twice as fast as the prior
machine's cache. When I benchmarked my application across the
two machines, I gained an 800% improvement in wall clock
time. The new machine's CPU is only 11% faster than the prior
machine's. Both processes were tested on a single CPU.

I am thinking that all of the above would tend to show that
my process is very memory bandwidth intensive, and thus
could not benefit from multiple threads on the same machine
because the bottleneck is memory bandwidth rather than CPU
cycles. Is this analysis correct?


From: Hector Santos on
Geez, and here I was hoping you would get your "second opinion" from a
more appropriate forum like:

microsoft.public.win32.programmer.kernel

or one of the performance forums.

Peter Olcott wrote:

> I have an application that uses enormous amounts of RAM in a
> very memory bandwidth intensive way.


How do you do this?

How much memory is the process loading?

Show the code that demonstrates how intensive this is. Is it blocking memory access?

> I recently upgraded my
> hardware to a machine with 600% faster RAM and 32-fold more
> L3 cache. This L3 cache is also twice as fast as the prior
> machine's cache.


What kind of CPU? Intel, AMD?

If Intel, what kind of INTEL chips are you using?

> When I benchmarked my application across the
> two machines, I gained an 800% improvement in wall clock
> time. The new machine's CPU is only 11% faster than the prior
> machine's. Both processes were tested on a single CPU.


Does this make sense to anyone? Two physical machines?

> I am thinking that all of the above would tend to show that
> my process is very memory bandwidth intensive, and thus
> could not benefit from multiple threads on the same machine
> because the bottleneck is memory bandwidth rather than CPU
> cycles. Is this analysis correct?

No.

But if you believe your application has reached its optimal design
point and cannot gain any improvement from machine performance, then you
probably wasted money on improving your machine, which will provide you
no scalability benefits.

At best, it will allow you to do your email, web browsing and other
multi-tasking while your application is chugging along at
100%.

--
HLS
From: Joseph M. Newcomer on
Note that in the i7 architecture the L3 cache is shared across all CPUs, so you are less
likely to be hit by raw memory bandwidth (which, compared to a CPU, is dead-slow);
whether multiple threads will work effectively can only be determined by measurement of
a multithreaded app.

Because your logic seems to indicate that raw memory speed is the limiting factor, and you
have not accounted for the effects of a shared L3 cache, any opinion you offer on what is
going to happen is meaningless. In fact, any opinion about performance is by definition
meaningless; only actual measurements represent facts ("If you can't express it in
numbers, it ain't science, it's opinion" -- Robert A. Heinlein).

More below...
On Sun, 21 Mar 2010 13:19:34 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>I have an application that uses enormous amounts of RAM in a
>very memory bandwidth intensive way. I recently upgraded my
>hardware to a machine with 600% faster RAM and 32-fold more
>L3 cache. This L3 cache is also twice as fast as the prior
>machine's cache. When I benchmarked my application across the
>two machines, I gained an 800% improvement in wall clock
>time. The new machine's CPU is only 11% faster than the prior
>machine's. Both processes were tested on a single CPU.
***
The question is whether you are measuring multiple threads in a single executable image
across multiple cores, or multiple executable images on a single core. Not sure how you
know that both processes were tested on a single CPU, since you don't mention how you
accomplished this (there are several techniques, but it is important to know which one you
used, since each has its own implications for predicting overall behavior of a system).
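For example, one such technique (and it is only my assumption that this is what
was done here, since you didn't say) is to pin the process to a single logical
CPU with an affinity mask before the benchmark loop runs:

#include <windows.h>
#include <stdio.h>

int main()
{
    // Restrict every thread of this process to logical CPU 0.
    if (!SetProcessAffinityMask(GetCurrentProcess(), 1))
    {
        printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    // ... run the benchmark here; the scheduler keeps it on CPU 0 ...
    return 0;
}

Task Manager's "Set Affinity" dialog sets the same process-wide mask, while
SetThreadAffinityMask constrains only one thread and lets the rest of the
process roam, which is why it matters which technique was actually used.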
****
>
>I am thinking that all of the above would tend to show that
>my process is very memory bandwidth intensive, and thus
>could not benefit from multiple threads on the same machine
>because the bottleneck is memory bandwidth rather than CPU
>cycles. Is this analysis correct?
****
Nonsense! You have no idea what is going on here! The shared L3 cache could completely
wipe out the memory performance issue, reducing your problem to a cache-performance issue.
Since you have not conducted the experiment with multiple threading, you have no data to
indicate one way or the other what is going on, and it is the particular memory access
patterns of YOUR app that matter; therefore, nobody can offer a meaningful estimate
based on your L1/L2/L3 cache accesses, whatever they may be.
joe
****
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
More below...

On Sun, 21 Mar 2010 15:41:39 -0400, Hector Santos <sant9442(a)nospam.gmail.com> wrote:

>Geez, and here I was hoping you would get your "second opinion" from a
>more appropriate forum like:
>
> microsoft.public.win32.programmer.kernel
>
>or one of the performance forums.
>
>Peter Olcott wrote:
>
>> I have an application that uses enormous amounts of RAM in a
>> very memory bandwidth intensive way.
>
>
>How do you do this?
>
>How much memory is the process loading?
>
>Show the code that demonstrates how intensive this is. Is it blocking memory access?
>
>> I recently upgraded my
>> hardware to a machine with 600% faster RAM and 32-fold more
>> L3 cache. This L3 cache is also twice as fast as the prior
>> machine's cache.
>
>
>What kind of CPU? Intel, AMD?
***
Actually, he said it is an i7 architecture some hundreds of messages ago....
****
>
>If Intel, what kind of INTEL chips are you using?
>
>> When I benchmarked my application across the
>> two machines, I gained an 800% improvement in wall clock
>> time. The new machine's CPU is only 11% faster than the prior
****
Based on what metric? Certainly, I hope you are not using clock speed, which is known to
be irrelevant to performance. Did you look at the size of the i-pipe microinstruction
cache on the two architectures? Did you look at the amount of concurrency in the
execution engine (CPUs since 1991 have NOT executed instructions sequentially, they just
maintain the illusion that they are)? What about the new branch predictor in the i7
architecture? CPU clock time is only comparable within a chipset family. It bears no
relationship to another chipset family, particularly an older model, since most of the
improvements come in the instruction and data pipelines, cache management (why do you
think there is now an L3 cache in the i7s?) and other microaspects of the architecture.
And if you used a "benchmark" program to ascertain this nominal 11% improvement, do you
know what instruction sequence was being executed when it made the measurement? Probably
not, but it turns out that's the level that matters. So how did you arrive at this
magical number of 11%?

Note also that raw memory speed doesn't matter too much on real problems; cache management
is the killer of performance, and the wrong sequence of address accesses will thrash your
cache; and if you are modifying data it hurts even worse (a cache line has to be written
back before it can be reused). Caching read-only pages works well, and if you mark your
data pages as "read only" after reading them in you can improve performance. But you are
quoting performance numbers here without giving any explanation of why you think they
matter.
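
To illustrate (a minimal sketch; the array size and stride are my own
assumptions, and the timings will vary by machine), the same number of reads
can be fast or slow depending purely on the order of the addresses:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

static const int N = 16 * 1024 * 1024;   // 16M ints = 64 MB
static int *data;

static DWORD Scan(int stride)
{
    DWORD t0 = GetTickCount();
    volatile long long sum = 0;
    // Every call reads all N elements exactly once; only the ORDER differs.
    for (int s = 0; s < stride; s++)
        for (int i = s; i < N; i += stride)
            sum += data[i];
    return GetTickCount() - t0;
}

int main()
{
    data = (int *)calloc(N, sizeof(int));
    if (!data) return 1;
    printf("stride 1    : %lu ms (sequential, cache-friendly)\n", Scan(1));
    printf("stride 4096 : %lu ms (16 KB jumps, cache-hostile)\n", Scan(4096));
    free(data);
    return 0;
}

The strided scan touches a new cache line (and usually a new page) on every
access, so it runs several times slower even though the total work is identical.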
joe
****
>
>> machine's. Both processes were tested on a single CPU.
>
>
>Does this make sense to anyone? Two physical machines?
>
>> I am thinking that all of the above would tend to show that
>> my process is very memory bandwidth intensive, and thus
>> could not benefit from multiple threads on the same machine
>> because the bottleneck is memory bandwidth rather than CPU
>> cycles. Is this analysis correct?
****
Precisely because the bottleneck appears to be memory performance, and precisely because
you have an L3 cache shared across all the chips, you are offering meaningless opinion
here. The ONLY way to figure out what is going to happen is to try real experiments, and
measure what they do. No amount of guesswork is going to tell you anything relevant, and
you are guessing when it is clear you have NO IDEA what the implications of the i7
technology are. They are NOT just "faster memory" or an "11% faster CPU" (whatever THAT
means!). I downloaded the Intel docs and read them while I was working on my new
multithreading course, and the i7 is more than a clock speed and a memory speed.
joe
****
>
>No.
>
>But if you believe your application has reached its optimal design
>point and cannot gain any improvement from machine performance, then you
>probably wasted money on improving your machine, which will provide you
>no scalability benefits.
>
>At best, it will allow you to do your email, web browsing and other
>multi-tasking while your application is chugging along at
>100%.
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Hector Santos on
Peter Olcott wrote:

> I have an application that uses enormous amounts of RAM in a
> very memory bandwidth intensive way. I recently upgraded my
> hardware to a machine with 600% faster RAM and 32-fold more
> L3 cache. This L3 cache is also twice as fast as the prior
> machine's cache. When I benchmarked my application across the
> two machines, I gained an 800% improvement in wall clock
> time. The new machine's CPU is only 11% faster than the prior
> machine's. Both processes were tested on a single CPU.
>
> I am thinking that all of the above would tend to show that
> my process is very memory bandwidth intensive, and thus
> could not benefit from multiple threads on the same machine
> because the bottleneck is memory bandwidth rather than CPU
> cycles. Is this analysis correct?

As stated numerous times, your thinking is wrong. But I don't fault
you, because you don't have the experience here; still, you should not be
ignoring what EXPERTS are telling you - especially if you have never
written multi-threaded applications.

The attached C/C++ simulation (testpeter2t.cpp) illustrates how your
single-main-thread process with a HUGE redundant memory access
requirement is not optimized for a multi-core/processor machine or
for any kind of scalability and performance efficiency.

Compile the attached application.

TestPeter2T.CPP will allow you to test:

Test #1 - a single main-thread process
Test #2 - a multi-threaded (2 threads) process

To run the single thread process, just run the EXE with no switches:

Here is TEST #1

V:\wc5beta> testpeter2t

- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
---------------------------------------
Time: 12297 | Elapsed: 0 | Len: 0
---------------------------------------
Total Client Time: 12297

The source code is set to allocate a DWORD array with a total memory
block of 1.4 GB. I have a 2 GB XP Dual Core Intel box. It should show
50% CPU.

Now this single-process test provides the natural quantum scenario
with a ProcessData() function:

void ProcessData()
{
    KIND num;
    // Touch every element of the shared array, 'repeat' times over;
    // there is no work besides the memory read itself.
    for (int r = 0; r < repeat; r++)
        for (DWORD i = 0; i < size; i++)
            num = data[i];
}

By natural quantum I mean there are NO "man-made" interrupts, sleeps or
yields. The OS will preempt this as naturally as it can, at every quantum.

If you run TWO single-process instances like so:

start testpeter2T
start testpeter2T

On my machine it seriously degraded BOTH processes because of the HUGE
virtual memory and paging requirements. The page faults were really
HIGH and it just never completed, and I didn't wish to wait because it
TOO obviously was not optimized for multiple instances. The
memory load requirement was too high here.

Now comes test #2, with threads. Run the EXE with the /t switch; this
will start TWO threads. Here are the results:

- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads
- Resuming thread# 0 [000007DC] in 41 msecs.
- Resuming thread# 1 [000007F4] in 467 msecs.
* Wait For Thread Completion
* Done
---------------------------------------
0 | Time: 13500 | Elapsed: 0 | Len: 0
1 | Time: 13016 | Elapsed: 0 | Len: 0
---------------------------------------
Total Time: 26516

BEHOLD!! Scalability using a SHARED MEMORY ACCESS threaded design.
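
Since the attachment itself isn't reproduced in the thread, here is a minimal
sketch of how such a harness might be wired up. This is my reconstruction from
the output above, NOT the actual testpeter2t.cpp; the suspended-create and
staggered-resume details are assumptions:

#include <windows.h>
#include <process.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 2                    // 2 here; the runs below use 3 and 4
typedef DWORD KIND;

static const DWORD size   = 357913941;   // DWORD count from the runs above
static const int   repeat = 10;
static DWORD      *data   = NULL;        // ONE copy, shared by all threads

static unsigned __stdcall ThreadProc(void *arg)
{
    int   id = (int)(INT_PTR)arg;
    DWORD t0 = GetTickCount();
    volatile KIND num;                   // volatile keeps the read loop from
    for (int r = 0; r < repeat; r++)     // being optimized away
        for (DWORD i = 0; i < size; i++)
            num = data[i];
    printf("%d | Time: %lu\n", id, GetTickCount() - t0);
    return 0;
}

int main()
{
    // The full 1.4 GB needs enough address space (e.g., a 64-bit build).
    data = (DWORD *)malloc((size_t)size * sizeof(DWORD));
    if (!data) { printf("allocation failed\n"); return 1; }

    HANDLE h[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++)
        h[t] = (HANDLE)_beginthreadex(NULL, 0, ThreadProc,
                                      (void *)(INT_PTR)t,
                                      CREATE_SUSPENDED, NULL);
    for (int t = 0; t < NUM_THREADS; t++) {
        Sleep(rand() % 500);             // staggered resume, as in the output
        ResumeThread(h[t]);
    }
    WaitForMultipleObjects(NUM_THREADS, h, TRUE, INFINITE);
    for (int t = 0; t < NUM_THREADS; t++) CloseHandle(h[t]);
    free(data);
    return 0;
}

The key point is the single shared data[] array: every thread reads the SAME
memory, so there is one 1.4 GB footprint instead of one per process.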

I am going to recompile the code for 4 threads by changing:

#define NUM_THREADS 4 // # of threads

Let's try it:

V:\wc5beta>testpeter2t /t
- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
* Resuming threads
- Resuming thread# 0 [000007DC] in 41 msecs.
- Resuming thread# 1 [000007F4] in 467 msecs.
- Resuming thread# 2 [000007D8] in 334 msecs.
- Resuming thread# 3 [000007D4] in 500 msecs.
* Wait For Thread Completion
* Done
---------------------------------------
0 | Time: 26078 | Elapsed: 0 | Len: 0
1 | Time: 25250 | Elapsed: 0 | Len: 0
2 | Time: 25250 | Elapsed: 0 | Len: 0
3 | Time: 24906 | Elapsed: 0 | Len: 0
---------------------------------------
Total Time: 101484

So the summary so far:

1 thread  - ~12 sec
2 threads - ~13 sec each
4 threads - ~25 sec each
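
In other words (my reading of the figures above, taking the Time values as
per-thread milliseconds): one thread makes 10 passes over the array in ~12 s;
two threads, running concurrently, make 20 passes in ~13.5 s of wall time;
four threads make 40 passes in ~26 s. As throughput that is roughly
10/12.3 = 0.8, then 20/13.5 = 1.5, then 40/26 = 1.5 passes per second, so on
this dual-core box throughput nearly doubles at 2 threads and then saturates,
which is exactly what you'd expect with two cores.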

This is where you begin to look at various designs to improve things.
There are many ideas, but it requires a look at your actual workload.
We didn't use a MEMORY MAP FILE and that MIGHT help (see the sketch
after the 3-thread results below). I should try that, but let's try a
3-thread run first:

#define NUM_THREADS 3 // # of threads

and recompile, run testpeter2t /t

- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
* Resuming threads
- Resuming thread# 0 [000007DC] in 41 msecs.
- Resuming thread# 1 [000007F4] in 467 msecs.
- Resuming thread# 2 [000007D8] in 334 msecs.
* Wait For Thread Completion
* Done
---------------------------------------
0 | Time: 19453 | Elapsed: 0 | Len: 0
1 | Time: 13890 | Elapsed: 0 | Len: 0
2 | Time: 18688 | Elapsed: 0 | Len: 0
---------------------------------------
Total Time: 52031

How interesting!! One thread got a near best-case result.
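
On the MEMORY MAP FILE idea mentioned above, here is a minimal sketch of what
I mean (an assumption about the wiring, not code from the attachment; the
section name is made up): back the array with a pagefile-backed section
instead of malloc, so multiple processes could even share ONE copy by name:

#include <windows.h>
#include <stdio.h>

int main()
{
    const DWORD size  = 357913941;               // DWORD count as above
    ULONGLONG   bytes = (ULONGLONG)size * sizeof(DWORD);

    // Pagefile-backed section; the name lets other processes share it.
    HANDLE hMap = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                    PAGE_READWRITE,
                                    (DWORD)(bytes >> 32), (DWORD)bytes,
                                    TEXT("PeterSharedData"));   // made-up name
    if (!hMap) { printf("CreateFileMapping failed\n"); return 1; }

    DWORD *data = (DWORD *)MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0);
    if (!data) { printf("MapViewOfFile failed\n"); return 1; }

    // ... threads (or other processes that OpenFileMapping the same name)
    // read data[] exactly as before ...

    UnmapViewOfFile(data);
    CloseHandle(hMap);
    return 0;
}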

You can actually normalize all this and probably come up with a
formula to guesstimate what the performance will be as requests grow. But
this is where WORKER POOLS and IOCP come into play, and if you are
using NUMA, the Windows NUMA API will help there too! A sketch of the
worker-pool pattern follows.
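
A minimal sketch of that pattern (my illustration; the names, pool size, and
the fake numeric "requests" are made up): a fixed pool of worker threads
blocks on a completion port, and each posted packet is one unit of work:

#include <windows.h>
#include <process.h>
#include <stdio.h>

static HANDLE g_iocp;

static unsigned __stdcall Worker(void *)
{
    DWORD       bytes;
    ULONG_PTR   key;
    OVERLAPPED *ov;
    // Block until a work item is posted; key 0 means "shut down".
    while (GetQueuedCompletionStatus(g_iocp, &bytes, &key, &ov, INFINITE)
           && key != 0)
    {
        printf("worker %lu handling request %lu\n",
               GetCurrentThreadId(), (ULONG)key);
        // ... do the real per-request work here ...
    }
    return 0;
}

int main()
{
    const int POOL = 4;               // typically sized near the CPU count
    g_iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, POOL);

    HANDLE h[POOL];
    for (int i = 0; i < POOL; i++)
        h[i] = (HANDLE)_beginthreadex(NULL, 0, Worker, NULL, 0, NULL);

    for (ULONG_PTR req = 1; req <= 10; req++)   // queue ten fake requests
        PostQueuedCompletionStatus(g_iocp, 0, req, NULL);
    for (int i = 0; i < POOL; i++)              // one shutdown packet each
        PostQueuedCompletionStatus(g_iocp, 0, 0, NULL);

    WaitForMultipleObjects(POOL, h, TRUE, INFINITE);
    for (int i = 0; i < POOL; i++) CloseHandle(h[i]);
    CloseHandle(g_iocp);
    return 0;
}

The port's concurrency value (the last argument to CreateIoCompletionPort)
keeps roughly that many workers running at once, which is what stops a flood
of requests from oversubscribing the CPUs.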

All in all, Peter, this proves how multithreading using shared memory is
FAR superior to your misconceived idea that your application cannot
be redesigned for a multi-core/processor machine.

I am willing to bet this simulator is far more stressful than your own
DFA/OCR application in its workload. ProcessData() here does NO
WORK at all besides accessing memory. You will not be doing that, so the
ODDS are very high you will run much more efficiently than this simulator.

I want to hear you say "Oh My!" <g>

--
HLS