From: Scott Lurndal on
David Schwartz <davids(a)webmaster.com> writes:
>On Mar 22, 1:11 pm, Chris Friesen <cbf...(a)mail.usask.ca> wrote:
>
>> I was under the impression that the hardware prefetcher was independent
>> of threads of execution, in which case this wouldn't make any
>> difference. Are you aware of CPUs which tie the prefetcher to execution
>> context?
>
>The prefetcher is a per-core construct and only sees the flow of
>instructions on that particular core. Two cores means two prefetchers,
>each seeing half of the operations.

Not necessarily; on the current gen Opteron the prefetcher is part of the
L3/Northbridge on each socket; since the L3 is shared, the NB can
adaptively prefetch for all active cores on the socket. The AMD
Family 10h BKDG goes into greater detail on the prefetcher and how
it is configured.

There are applications for which the prefetcher is completely unsuitable
(e.g. social graph analysis, extremely large datasets, pointer chasing
applications), and there are system configurations for which the prefetcher
is invaluable or pathetic (ccNUMA with long latencies to some remote memory).
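
To make the pointer-chasing case concrete, here is a minimal sketch in C
contrasting the two access patterns. The names (struct node, sum_array,
sum_list) are made up for illustration and come from no particular
codebase, and how much the prefetcher actually helps depends on the CPU
and memory layout.

```c
#include <stddef.h>

/* Hypothetical node type, for illustration only. */
struct node {
    struct node *next;
    long value;
};

/* Sequential walk: successive addresses differ by a fixed stride, which
 * is the kind of pattern a hardware prefetcher can typically pick up. */
long sum_array(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Pointer chase: the next address is only known once the current load
 * has completed, so unless the nodes happen to sit at regular offsets
 * there is no address pattern for a prefetcher to extrapolate. */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}
```

Both functions compute the same kind of result; the difference is only in
how predictable the resulting stream of memory addresses is.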

scott
From: David Schwartz on
On Mar 22, 3:46 pm, sc...(a)slp53.sl.home (Scott Lurndal) wrote:

> Not necessarily;  on the current gen Opteron the prefetcher is part of the
> L3/Northbridge on each socket; since the L3 is shared, the NB can
> adaptively prefetch for all active cores on the socket. The AMD
> Family 10h BKDG goes into greater detail on the prefetcher and how
> it is configured.

I think we're talking about two different prefetchers, though I'm not
100% sure -- I'm not that familiar with the internals of modern AMD
CPUs. What I mean by "prefetcher" is the mechanism that sees an
upcoming memory read in the instruction stream and attempts to get
that data before the CPU actually has to wait for the contents of the
memory to be read from the cache hierarchy. It has to be a per-core
construct because it's looking at the instruction stream for that core
at various stages in the pipeline.

DS
From: Chris Friesen on
On 03/22/2010 05:03 PM, David Schwartz wrote:
> On Mar 22, 3:46 pm, sc...(a)slp53.sl.home (Scott Lurndal) wrote:
>
>> Not necessarily; on the current gen Opteron the prefetcher is part of the
>> L3/Northbridge on each socket; since the L3 is shared, the NB can
>> adaptively prefetch for all active cores on the socket. The AMD
>> Family 10h BKDG goes into greater detail on the prefetcher and how
>> it is configured.
>
> I think we're talking about two different prefetchers, though I'm not
> 100% sure -- I'm not that familiar with the internals of modern AMD
> CPUs. What I mean by "prefetcher" is the mechanism that sees an
> upcoming memory read in the instruction stream and attempts to get
> that data before the CPU actually has to wait for the contents of the
> memory to be read from the cache hierarchy. It has to be a per-core
> construct because it's looking at the instruction stream for that core
> at various stages in the pipeline.

I'm not a hardware guy, but I think what you're referring to is
generally called a speculative read. It requires access to the
instruction stream and thus must exist on every core.

The hardware prefetcher that Scott is referring to monitors the actual
requested memory accesses and tries to look for patterns. So if I'm in
a tight loop and indirectly access address X, X+8, and X+16 the
prefetcher is going to preload X+24, X+32, X+40... for me.
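
As a rough illustration of that pattern matching, here is a toy model in
C. It is only a sketch: predict_stride is a hypothetical function, and the
rule it uses (three addresses with the same spacing) is an assumption for
the example, not the algorithm any real CPU is documented to use.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stride detector (an illustrative assumption, not real hardware).
 * If the last three observed addresses are equally spaced, write the
 * next `count` addresses of the sequence into `out` and return `count`.
 * Returns 0 when there are fewer than three addresses or no constant
 * stride. Unsigned arithmetic wraps, so descending strides work too. */
size_t predict_stride(const uintptr_t *seen, size_t n_seen,
                      uintptr_t *out, size_t count)
{
    if (n_seen < 3)
        return 0;

    uintptr_t a = seen[n_seen - 3];
    uintptr_t b = seen[n_seen - 2];
    uintptr_t c = seen[n_seen - 1];
    uintptr_t s1 = b - a;
    uintptr_t s2 = c - b;

    if (s1 == 0 || s1 != s2)
        return 0;

    for (size_t i = 0; i < count; i++)
        out[i] = c + s1 * (uintptr_t)(i + 1);
    return count;
}
```

Feeding it X, X+8 and X+16 and asking for three predictions yields X+24,
X+32 and X+40; an irregular sequence yields no predictions.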

Chris
From: David Schwartz on
On Mar 23, 7:23 am, Chris Friesen <cbf...(a)mail.usask.ca> wrote:

> I'm not a hardware guy, but I think what you're referring to is
> generally called a speculative read.  It requires access to the
> instruction stream and thus must exist on every core.

Yes.

> The hardware prefetcher that Scott is referring to monitors the actual
> requested memory accesses and tries to look for patterns. So if I'm in
> a tight loop and indirectly access address X, X+8, and X+16 the
> prefetcher is going to preload X+24, X+32, X+40... for me.

That's interesting. I didn't know there was such a mechanism.

DS
From: Scott Lurndal on
Chris Friesen <cbf123(a)mail.usask.ca> writes:
>I'm not a hardware guy, but I think what you're referring to is
>generally called a speculative read. It requires access to the
>instruction stream and thus must exist on every core.
>
>The hardware prefetcher that Scott is referring to monitors the actual
>requested memory accesses and tries to look for patterns. So if I'm in
>a tight loop and indirectly access address X, X+8, and X+16 the
>prefetcher is going to preload X+24, X+32, X+40... for me.

Yes, although it works by prefetching 64-byte cache-lines.
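
A small sketch of what cache-line granularity means for the X, X+8, X+16
example, assuming the 64-byte line size mentioned above. The helper names
(line_base, lines_touched) are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* line size as stated above */

/* Address of the first byte of the cache line containing addr. */
uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
}

/* Number of distinct cache lines touched by n accesses at
 * start, start+stride, start+2*stride, ...
 * Assumes an ascending stride, so line addresses never decrease. */
size_t lines_touched(uintptr_t start, size_t stride, size_t n)
{
    size_t lines = 0;
    uintptr_t prev = 0;

    for (size_t i = 0; i < n; i++) {
        uintptr_t base = line_base(start + i * stride);
        if (i == 0 || base != prev) {
            lines++;
            prev = base;
        }
    }
    return lines;
}
```

With a 64-byte-aligned X, the eight accesses X through X+56 all land in
one line, so fetching that single line covers them and prefetching would
target the following lines rather than individual 8-byte addresses.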

scott