From: Scott Lurndal
"Peter Olcott" <NoSpam(a)OCR4Screen.com> writes:
>I can't reply to this post with quoting turned off. I always
>reply point for point, but, with quoting turned off it would
>be too difficult to see who said what. Is there any way that
>you can report your reply with quoting turned on?

Learn to use a better news client. There is 'no such thing' as
quoting, other than the perfectly legitimate quoting that
David had already provided in his post (hint: the Usenet RFCs
allow one or more leading '>' symbols to denote quoting).

scott
From: Scott Lurndal
David Schwartz <davids(a)webmaster.com> writes:
>On Mar 21, 9:29 pm, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:
>
>> It seems you may have missed this point machine A and
>> machine C are given to have identical processors, and the
>> ONLY difference between them is that machine C has much
>> faster access to RAM than machine A.
>
>You have previously said: "The new machines CPU is only 11% faster
>than the prior
>machine."

And that seems to have been based purely on clock speed. Of course
that doesn't include microarchitectural and superscalar improvements.

scott

From: Peter Olcott

"Scott Lurndal" <scott(a)slp53.sl.home> wrote in message
news:c7Ppn.5$xs7.4(a)news.usenetserver.com...
> David Schwartz <davids(a)webmaster.com> writes:
>>On Mar 21, 9:29 pm, "Peter Olcott"
>><NoS...(a)OCR4Screen.com> wrote:
>>
>>> It seems you may have missed this point machine A and
>>> machine C are given to have identical processors, and
>>> the
>>> ONLY difference between them is that machine C has much
>>> faster access to RAM than machine A.
>>
>>You have previously said: "The new machines CPU is only
>>11% faster
>>than the prior
>>machine."
>
> And that seems to have been based purely on clock speed.
> Of course
> that doesn't include microarchitectural and superscalar
> improvements.
>
> scott
>

Oh right, I forgot about these sorts of things. Basically,
more instructions per clock cycle.


From: Chris Friesen
On 03/21/2010 10:18 PM, David Schwartz wrote:


> Now imagine you turn it into two threads, one doing this:
> X, X+1, X+2, X+3
> and one doing this:
> Y, Y+1, Y+2, Y+3
>
> Now, the prefetcher (still seeing only one read ahead) will see the
> read for X+1 when it processes the read for X. The net result will be
> that the two threads will run about twice as fast with the same memory
> hardware, even though they are purely memory limited.

I was under the impression that the hardware prefetcher was independent
of threads of execution, in which case this wouldn't make any
difference. Are you aware of CPUs which tie the prefetcher to execution
context?

Also, you are probably aware of this, but for the benefit of other
readers: on modern processors the prefetcher can generally track
several prefetch streams simultaneously.

Chris
From: David Schwartz
On Mar 22, 1:11 pm, Chris Friesen <cbf...(a)mail.usask.ca> wrote:

> I was under the impression that the hardware prefetcher was independent
> of threads of execution, in which case this wouldn't make any
> difference.  Are you aware of CPUs which tie the prefetcher to execution
> context?

The prefetcher is a per-core construct and only sees the flow of
instructions on that particular core. Two cores means two prefetchers,
each seeing half of the operations.

> Also, you are probably aware of this, but for the benefit of other
> readers: on modern processors the prefetcher can generally track
> several prefetch streams simultaneously.

Right. The example was a huge oversimplification. More likely, there
will be a small number of expensive memory operations interleaved with
a large number of cheap (from cache) memory operations. The issue is
how often the prefetcher will be able to merge fetches in the
instruction stream that could be merged.

It's easy to see in an artificial example like this (where the Fx are
fast, cache-hit operations):
X, F1, F2, F3, X+1, F4, F5, F6
The prefetcher might not see the 'X+1' when it processes the 'X'.

But if two cores split the work, with one of them doing:
X, F2, X+1, F5, ...
then the prefetcher is more likely to merge the X and X+1 fetches.

When the prefetcher issues the fetch for X, a window opens up for the
duration of that fetch during which a fetch for X+1 is much less
expensive than it would ordinarily be. Whether the prefetcher sees
that fetch in that window or not will depend on how many instructions,
and how many fetches, are between the fetch for X and the fetch for
X+1. (And other factors.)

DS