From: Ian Collins on
On 03/22/10 04:24 PM, Peter Olcott wrote:
> "Ian Collins"<ian-news(a)hotmail.com> wrote:
>>
>> Maybe, the only way to know for sure is to measure.

[please don't quote sigs]

> I did measure; that is the whole point. An app is 7.98-fold
> faster on one machine than another, and the only difference
> is that the faster machine has much faster access to RAM.

But you haven't measured with more than one thread. See Eric's reply.

--
Ian Collins
From: Peter Olcott on

"Ian Collins" <ian-news(a)hotmail.com> wrote in message
news:80o6p9FitsU2(a)mid.individual.net...
> On 03/22/10 04:24 PM, Peter Olcott wrote:
>> "Ian Collins"<ian-news(a)hotmail.com> wrote:
>>>
>>> Maybe, the only way to know for sure is to measure.
>
> [please don't quote sigs]
>
>> I did measure; that is the whole point. An app is
>> 7.98-fold faster on one machine than another, and the
>> only difference is that the faster machine has much
>> faster access to RAM.
>
> But you haven't measured with more than one thread. See
> Eric's reply.
>
> --
> Ian Collins

I have measured it with another process, and both slow down
disproportionately.

I don't want to spend hundreds of hours making a complex
process thread-safe just to prove what I knew all along.

(1) Machine A performs process B in X minutes.
(2) Machine C performs process B in X/8 minutes (8x as
fast).
(3) The only difference between machine A and machine C
is that machine C has much faster access to RAM (by
whatever means).
(4) Therefore process B is memory bandwidth bound, and
cannot possibly benefit from more CPU cycles.

If no one successfully provides a valid counter-example to
the above, I have no choice but to assume that I am right.

If I am clearly wrong, then providing a valid
counter-example should be easy.


From: David Schwartz on
On Mar 21, 8:24 pm, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:

> I did measure; that is the whole point. An app is 7.98-fold
> faster on one machine than another, and the only difference
> is that the faster machine has much faster access to RAM.
> Can anyone provide any possible scenario where the app is
> not memory bandwidth bound?

It was memory bandwidth bound in the slower case, most likely. But
there's no way to know whether it's memory bound in the faster case.
Also, the CPU might only be slightly faster, but there are cases where
slight differences in component performance result in huge differences
in workload performance, so the "11% faster" CPU might perform three
times faster on this particular workload. (Consider, for example, a
critical branch that is always mispredicted in one case and always
correctly predicted in the other. Consider a cache that's just big
enough in one case but just a bit too small in the other. And so on.)

You have to measure.

DS
From: David Schwartz on
On Mar 21, 8:58 pm, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:

> I have measured it with another process, and both slow
> down disproportionately.
>
> I don't want to spend hundreds of hours making a complex
> process thread-safe just to prove what I knew all along.
>
> (1) Machine A performs process B in X minutes.
> (2) Machine C performs process B in X/8 minutes (8x as
> fast).
> (3) The only difference between machine A and machine C
> is that machine C has much faster access to RAM (by
> whatever means).
> (4) Therefore process B is memory bandwidth bound, and
> cannot possibly benefit from more CPU cycles.
>
> If no one successfully provides a valid counter-example to
> the above, I have no choice but to assume that I am right.
>
> If I am clearly wrong, then providing a valid
> counter-example should be easy.

It's so obviously wrong. At a minimum, you must see that in the case
where the process is running faster, it could be largely CPU bound.

But also, it may be memory bandwidth bound because it's
single-threaded. Assume, for example, the memory access
pattern looks like this:

X, Y, X+1, Y+1, X+2, Y+2, X+3, Y+3

Imagine the prefetcher cannot see the request for 'X+1'
until after it processes 'Y'. This can be a worst-case
scenario, as the memory controller keeps opening and
closing pages and is unable to balance the channels.

Now imagine you turn it into two threads, one doing this:
X, X+1, X+2, X+3
and one doing this:
Y, Y+1, Y+2, Y+3

Now, the prefetcher (still seeing only one read ahead) will see the
read for X+1 when it processes the read for X. The net result will be
that the two threads will run about twice as fast with the same memory
hardware, even though they are purely memory limited.

There are many other ways this kind of crazily simplistic
assumption can go wrong.

DS
From: Peter Olcott on
I can't reply to this post with quoting turned off. I always
reply point for point, but with quoting turned off it would
be too difficult to see who said what. Is there any way that
you can repost your reply with quoting turned on?

It seems you may have missed this point: machine A and
machine C are given to have identical processors, and the
ONLY difference between them is that machine C has much
faster access to RAM than machine A.

"David Schwartz" <davids(a)webmaster.com> wrote in message
news:89bdf509-0afa-48c3-a107-67cdaaa27eee(a)t9g2000prh.googlegroups.com...
On Mar 21, 8:58 pm, "Peter Olcott" <NoS...(a)OCR4Screen.com>
wrote:

> I have measured it with another process and both slow down
> disproportionally.
>
> I don't want to spent hundreds of hours making a complex
> process thread-safe just to prove what I knew all along.
>
> (1) Machine A performs process B in X minutes.
> (2) Machine C performs process B in X/8 Minutes (800%
> faster)
> (3) The only difference between machine A and machine C
> is that machine C has much faster access to RAM (by
> whatever means).
> (4) Therefore Process B is memory bandwidth bound, and can
> not possibly benefit from more CPU cycles.
>
> If no one successfully provides a valid counter example to
> the above, I have no other choice than to simply assumes
> that I am right.
>
> If I am clearly wrong, then providing a valid
> counter-example should be easy.

It's so obviously wrong. At a minimum, you must see that in
the case
where the process is running faster, it could be largely CPU
bound.

But also, it may be memory bandwidth bound because it's
single-
threaded. Assume, for example, the memory access pattern
looks like
this:

X, Y, X+1, Y+1, X+2, Y+2, X+3, Y+3

Imagine the prefetcher cannot see the request for 'X+1'
until after it
processes 'Y'. This can be a worst case scenario, as the
memory
controller keeps opening and closing pages and is unable to
balance
thye channels.

Now imagine you turn it into two threads, one doing this:
X, X+1, X+2, X+3
and one doing this:
Y, Y+1, Y+2, Y+3

Now, the prefetcher (still seeing only one read ahead) will
see the
read for X+1 when it processes the read for X. The net
result will be
that the two threads will run about twice as fast with the
same memory
hardware, even though they are purely memory limited.

There are many other ways this kind of crazily simplistic
assuming can
go wrong.

DS