From: Peter Olcott on
I have an application that uses enormous amounts of RAM in a
very memory bandwidth intensive way. I recently upgraded my
hardware to a machine with 600% faster RAM and 32-fold more
L3 cache. This L3 cache is also twice as fast as the prior
machine's cache. When I benchmarked my application across the
two machines, I gained an 800% improvement in wall clock
time. The new machine's CPU is only 11% faster than the prior
machine. Both processes were tested on a single CPU.

I am thinking that all of the above would tend to show that
my process is very memory bandwidth intensive, and thus
could not benefit from multiple threads on the same machine
because the bottleneck is memory bandwidth rather than CPU
cycles. Is this analysis correct?


From: Eric Sosman on
On 3/21/2010 2:02 PM, Peter Olcott wrote:
> I have an application that uses enormous amounts of RAM in a
> very memory bandwidth intensive way. I recently upgraded my
> hardware to a machine with 600% faster RAM and 32-fold more
> L3 cache. This L3 cache is also twice as fast as the prior
> machine's cache. When I benchmarked my application across the
> two machines, I gained an 800% improvement in wall clock
> time. The new machine's CPU is only 11% faster than the prior
> machine. Both processes were tested on a single CPU.
>
> I am thinking that all of the above would tend to show that
> my process is very memory bandwidth intensive, and thus
> could not benefit from multiple threads on the same machine
> because the bottleneck is memory bandwidth rather than CPU
> cycles. Is this analysis correct?

Insufficient information, I think. The performance of the
memory subsystem is certainly important to your application's
elapsed time[*], and it seems likely that the CPU stalls
a lot waiting for memory to deliver or absorb data. But if
there's another CPU/core/strand/pipeline, it's possible that one
processor's stall time could be put to productive use by another
if there were multiple execution threads. It's also possible
that multiple threads could interfere, overload things and clog
the memory bus ...

[*] Rant: I really, really hate "800% improvement" and
similar phrases. If Machine A takes ten seconds and Machine B
shows an "800% improvement," can we conclude that Machine B
finishes the job in minus seventy seconds? Have you considered
a career in politics? ;-)

--
Eric Sosman
esosman(a)ieee-dot-org.invalid
From: Peter Olcott on

"Eric Sosman" <esosman(a)ieee-dot-org.invalid> wrote in
message news:ho5tof$lon$1(a)news.eternal-september.org...
> On 3/21/2010 2:02 PM, Peter Olcott wrote:
>> I have an application that uses enormous amounts of RAM in a
>> very memory bandwidth intensive way. I recently upgraded my
>> hardware to a machine with 600% faster RAM and 32-fold more
>> L3 cache. This L3 cache is also twice as fast as the prior
>> machine's cache. When I benchmarked my application across the
>> two machines, I gained an 800% improvement in wall clock
>> time. The new machine's CPU is only 11% faster than the prior
>> machine. Both processes were tested on a single CPU.
>>
>> I am thinking that all of the above would tend to show that
>> my process is very memory bandwidth intensive, and thus
>> could not benefit from multiple threads on the same machine
>> because the bottleneck is memory bandwidth rather than CPU
>> cycles. Is this analysis correct?
>
> Insufficient information, I think. The performance of the
> memory subsystem is certainly important to your application's
> elapsed time[*], and it seems likely that the CPU stalls
> a lot waiting for memory to deliver or absorb data. But if
> there's another CPU/core/strand/pipeline, it's possible that one
> processor's stall time could be put to productive use by another
> if there were multiple execution threads. It's also possible
> that multiple threads could interfere, overload things and clog
> the memory bus ...
>
> [*] Rant: I really, really hate "800% improvement" and
> similar phrases. If Machine A takes ten seconds and Machine B
> shows an "800% improvement," can we conclude that Machine B
> finishes the job in minus seventy seconds? Have you considered
> a career in politics? ;-)
>
> --
> Eric Sosman
> esosman(a)ieee-dot-org.invalid

The numbers are sufficiently accurate for this purpose. Take
it as a given premise (as in geometry, thus immutable) that
the only difference between the two machines is much faster
access to RAM, and that the faster machine runs the app
7.98-fold faster than the slower machine. Given that premise,
can you provide a specific, concrete example of any possible
way that this app is not memory bound?


From: Ian Collins on
On 03/22/10 07:02 AM, Peter Olcott wrote:

[please don't multi-post, cross-post if you must.]

> I have an application that uses enormous amounts of RAM in a
> very memory bandwidth intensive way. I recently upgraded my
> hardware to a machine with 600% faster RAM and 32-fold more
> L3 cache. This L3 cache is also twice as fast as the prior
> machines cache. When I benchmarked my application across the
> two machines, I gained an 800% improvement in wall clock
> time. The new machines CPU is only 11% faster than the prior
> machine. Both processes were tested on a single CPU.
>
> I am thinking that all of the above would tend to show that
> my process is very memory bandwidth intensive, and thus
> could not benefit from multiple threads on the same machine
> because the bottleneck is memory bandwidth rather than CPU
> cycles. Is this analysis correct?

Maybe; the only way to know for sure is to measure.

--
Ian Collins
From: Peter Olcott on

"Ian Collins" <ian-news(a)hotmail.com> wrote in message
news:80o47uFitsU1(a)mid.individual.net...
> On 03/22/10 07:02 AM, Peter Olcott wrote:
>
> [please don't multi-post, cross-post if you must.]
>
>> I have an application that uses enormous amounts of RAM in a
>> very memory bandwidth intensive way. I recently upgraded my
>> hardware to a machine with 600% faster RAM and 32-fold more
>> L3 cache. This L3 cache is also twice as fast as the prior
>> machine's cache. When I benchmarked my application across the
>> two machines, I gained an 800% improvement in wall clock
>> time. The new machine's CPU is only 11% faster than the prior
>> machine. Both processes were tested on a single CPU.
>>
>> I am thinking that all of the above would tend to show that
>> my process is very memory bandwidth intensive, and thus
>> could not benefit from multiple threads on the same machine
>> because the bottleneck is memory bandwidth rather than CPU
>> cycles. Is this analysis correct?
>
> Maybe; the only way to know for sure is to measure.
>
> --
> Ian Collins

I did measure; that is the whole point. An app is 7.98-fold
faster on one machine than another, and the only difference
is that the faster machine has much faster access to RAM.
Can anyone provide any possible scenario in which the app is
not memory bandwidth bound?