From: MitchAlsup on
On Jul 7, 11:12 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> On 7/6/2010 7:31 PM, MitchAlsup wrote:
> But the benchmark in question was contrived. Run the program in the benchmark for a while, and you did not get this
> neat pattern. The linked list became randomized.

I completely agree with the implication of this--one of which is that
you need to simulate for trillions of instruction per benchmark to see
the "real" effects of minor microarchitectural decisions. Things that
look good at a billion instructions often go south by a trillion.

> I.e. your algo did not work as well in real life as it did in the benchmark, even for the program in the benchmark.

There are situations where it works better in real life than in
benchmark land. Consider a multithreaded application where a number of
threads are romping through a data structure. Having the prefetcher
down in the DRAM controller allows it to lock onto access patterns from
all members of the thread group, since the prefetcher does not care
whence the requests came (or go). {But I don't have an application in
mind that actually does one of these things, nor would a few examples
refute Andy's comments.}

The real advantage of a prefetcher down in the DRAM controller is
that it does not create DRAM activity that conflicts with existing
demand requests on the DRAM buses (address or data). It can cause
some minor backup if such a request arrives while a prefetch is in
progress. Prefetching back at the CPU, or at any level of the cache
hierarchy, cannot be throttled in this way.
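To make the throttling argument concrete, here is a minimal sketch (my own illustration, not any real controller design) of an arbiter in which demand requests always win the DRAM bus and prefetches are issued only from otherwise-idle cycles:

```python
from collections import deque

class DramController:
    """Toy DRAM-side arbiter: demand traffic always wins, prefetches
    fill idle cycles, so prefetching never steals bus bandwidth from
    demand requests (the property Mitch describes)."""

    def __init__(self):
        self.demand_q = deque()
        self.prefetch_q = deque()
        self.issued = []          # trace of (kind, addr) in issue order

    def enqueue(self, kind, addr):
        (self.demand_q if kind == "demand" else self.prefetch_q).append(addr)

    def tick(self):
        # Demand requests always win arbitration; a prefetch is issued
        # only when no demand is waiting, so the worst case is a short
        # backup while one already-started prefetch completes.
        if self.demand_q:
            self.issued.append(("demand", self.demand_q.popleft()))
        elif self.prefetch_q:
            self.issued.append(("prefetch", self.prefetch_q.popleft()))
```

With a prefetch and a demand both queued, two ticks issue the demand first even though the prefetch arrived earlier.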

Mitch

From: Andy 'Krazy' Glew on
On 7/8/2010 5:23 PM, MitchAlsup wrote:
> On Jul 7, 11:12 am, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>> On 7/6/2010 7:31 PM, MitchAlsup wrote:
>> But the benchmark in question was contrived. Run the program in the benchmark for a while, and you did not get this
>> neat pattern. The linked list became randomized.
>
> I completely agree with the implication of this--one of which is that
> you need to simulate for trillions of instruction per benchmark to see
> the "real" effects of minor microarchitectural decisions. Things that
> look good at a billion instructions often go south by a trillion.

Thanks for not getting PO'ed at me, Mitch.

When I said "simple access patterns like stride 1 and stride N", I probably should have said "repetitive access
patterns", such as the one you described:

>The DRAM controller on my last design actually did data structures like
>this with amazing clairvoyance. It ends up that the benchmark built
>then traversed a very large data structure. After setup, the data
>structure did not change "all that much" and left a clear pattern of
>+1,+1,-3 <repeat> that our DRAM prefetcher latched onto and eliminated
>the DRAM latency.

We can create a taxonomy of such patterns for prediction:

* repeating the same value

* repetitive
** stride 1
** stride N
** how can we describe more complex repetitions:
*** up to N elements, and then repeat?
*** K sets of N_k elements, each repeating at a different frequency
*** epicyclic ...
*** overall, we want to say something like "patterns representable in state machines of some fixed size"

and then we move into
* repetitive with some random effects

e.g. Wilkerson's footprint predictor, which doesn't predict the unpredictable pointer chase but which, at the pointer
chase, predicts a set of accesses that happen shortly thereafter (some of which Mutlu showed were due to memory
allocation patterns such as in the benchmark you mentioned, Mitch - was that MCF?)

* Hidden Markov Models - I don't like using that as a generic term
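A toy version of such a fixed-size state machine, sketched here as illustration (the class name and parameters are my invention), can lock onto a short repeating delta pattern like the +1,+1,-3 example Mitch described:

```python
class DeltaPatternPredictor:
    """Toy repetitive-pattern predictor: record recent address deltas
    and, when the delta history repeats with a short period, predict
    the address that continues the pattern."""

    def __init__(self, history=12, max_period=4):
        self.deltas = []
        self.history = history        # how many deltas to remember
        self.max_period = max_period  # longest repeat we look for
        self.last_addr = None

    def observe(self, addr):
        if self.last_addr is not None:
            self.deltas.append(addr - self.last_addr)
            self.deltas = self.deltas[-self.history:]
        self.last_addr = addr

    def predict(self):
        # Find the shortest period p such that the recorded deltas
        # repeat every p steps; the next delta is the one p steps ago.
        for p in range(1, self.max_period + 1):
            if len(self.deltas) >= 2 * p and all(
                self.deltas[i] == self.deltas[i - p]
                for i in range(p, len(self.deltas))
            ):
                return self.last_addr + self.deltas[-p]
        return None               # no pattern locked yet
```

Fed the address stream 100, 101, 102, 99, 100, 101, 98, ... (deltas +1,+1,-3 repeating), it finds period 3 and predicts the next address.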



From: Andy 'Krazy' Glew on
On 7/8/2010 5:23 PM, MitchAlsup wrote:
> On Jul 7, 11:12 am, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>> On 7/6/2010 7:31 PM, MitchAlsup wrote:
>> But the benchmark in question was contrived. Run the program in the benchmark for a while, and you did not get this
>> neat pattern. The linked list became randomized.
>
> I completely agree with the implication of this--one of which is that
> you need to simulate for trillions of instruction per benchmark to see
> the "real" effects of minor microarchitectural decisions. Things that
> look good at a billion instructions often go south by a trillion.
>
>> I.e. your algo did not work as well in real life as it did in the benchmark, even for the program in the benchmark.
>
> There are situations where it works better in real life than in
> benchmark land. Consider a multithreaded application where a number of
> threads are romping through a data structure. Having the prefetcher
> down in the DRAM controller allows it to lock onto access patterns from
> all members of the thread group, since the prefetcher does not care
> whence the requests came (or go). {But I don't have an application in
> mind that actually does one of these things, nor would a few examples
> refute Andy's comments.}
>
> The real advantage of a prefetcher down in the DRAM controller is
> that it does not create DRAM activity that conflicts with existing
> demand requests on the DRAM buses (address or data). It can cause
> some minor backup if such a request arrives while a prefetch is in
> progress. Prefetching back at the CPU, or at any level of the cache
> hierarchy, cannot be throttled in this way.

Perhaps not quite such good throttling, but why not mark prefetch requests and allow them to be preempted or thrown
away at any stage in the system? Of course, most systems expect a reply to every request sent out, but why should we?
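That suggestion might look something like this sketch (my own construction; the function name is hypothetical): requests carry a demand/prefetch tag, and any stage may silently discard tagged prefetches it has no capacity for, with no reply owed:

```python
def arbitrate(requests, spare_slots):
    """Forward all demand requests unconditionally; forward prefetches
    only while spare capacity lasts, silently discarding the rest.
    Dropped prefetches never generate a reply, so upstream logic must
    not block waiting for one."""
    forwarded, dropped = [], []
    for kind, addr in requests:
        if kind == "demand":
            forwarded.append((kind, addr))
        elif spare_slots > 0:
            forwarded.append((kind, addr))
            spare_slots -= 1
        else:
            dropped.append((kind, addr))   # thrown away, no reply sent
    return forwarded, dropped
```

Because only prefetch-tagged requests can be dropped, correctness is unaffected; a dropped prefetch merely costs its potential latency savings.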
From: Terje Mathisen on
Andy 'Krazy' Glew wrote:
> *** epicyclic ...
> *** overall, we want to say something like "patterns representable in
> state machines of some fixed size"

Since LFSR pseudo-random generators are both very small and quite
random, it would seem that the last specification is nearly worthless. :-(
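For instance, a 16-bit Galois LFSR (the standard 0xB400 tap mask gives the maximal period of 2^16 - 1) holds only two bytes of state, yet its output stream defeats any predictor looking for short repeating delta patterns:

```python
def lfsr16(seed=0xACE1):
    """16-bit Galois LFSR with taps 0xB400: a maximal-length generator.
    Two bytes of state, period 65535 - a tiny 'fixed-size state
    machine' whose address stream looks essentially random."""
    state = seed
    while True:
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= 0xB400
        yield state
```

The first output from seed 0xACE1 is 0xE270, and the generator does not revisit its seed until 65535 steps later - far beyond what a small delta-history table could capture.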

>
> and then we move into
> * repetitive with some random effects

This is much more interesting imho: You really don't want a predictor
which is blown away by a few random accesses in the middle of a long
sequential/stride-N pattern.
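One common way to get that robustness (a minimal sketch of my own, not any particular shipped design) is a stride predictor whose confidence decays on a mismatch instead of resetting the learned stride:

```python
class NoiseTolerantStride:
    """Stride predictor with a saturating confidence counter: a few
    noise accesses in the middle of a long stride-N run drain
    confidence (twice per noise access: entering and leaving it)
    rather than discarding the learned stride, so prediction resumes
    as soon as the pattern returns."""

    def __init__(self, max_conf=3):
        self.last = None
        self.stride = None
        self.conf = 0
        self.max_conf = max_conf

    def observe(self, addr):
        if self.last is not None:
            d = addr - self.last
            if d == self.stride:
                self.conf = min(self.conf + 1, self.max_conf)
            elif self.conf > 0:
                self.conf -= 1          # tolerate noise, keep the stride
            else:
                self.stride = d         # truly new pattern: retrain
        self.last = addr

    def predict(self):
        # Only prefetch when reasonably confident.
        return self.last + self.stride if self.conf >= 2 else None
```

After a stride-4 run, a single wild access merely suppresses prediction briefly; two more on-stride accesses restore it, where a zero-tolerance predictor would have had to relearn the stride from scratch.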

My friend John Cash once upon a time (1993?) implemented a prefetcher
for NetWare; he found that he needed to allow a certain percentage of
noise to get maximum benefit from his file read-ahead process.

Terje
PS. John went on to write the network code for Quake, then "retired"
from the frantic pace at id Software by moving to Blizzard where they
were just starting a massive multiplayer project.
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: nmm1 on
In article <drrig7-ibj2.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Andy 'Krazy' Glew wrote:
>> *** epicyclic ...
>> *** overall, we want to say something like "patterns representable in
>> state machines of some fixed size"
>
>Since LFSR pseudo-random generators are both very small and quite
>random, it would seem that the last specification is nearly worthless. :-(

Yes. There is a lot of good mathematical work on this, in the area
of real (i.e. Kolmogorov) complexity theory. Most of it can be
summarised for practical use as "if you have a genuinely random
finite state machine, you are up the creek without a paddle, so
consider what empirical, semi-mathematical heuristics you can use."

>> and then we move into
>> * repetitive with some random effects
>
>This is much more interesting imho: You really don't want a predictor
>which is blown away by a few random accesses in the middle of a long
>sequential/stride-N pattern.

Absolutely.

>My friend John Cash once upon a time (1993?) implemented a prefetcher
>for NetWare; he found that he needed to allow a certain percentage of
>noise to get maximum benefit from his file read-ahead process.

I found the same when working on sorting, and the reason that I like
the sorts that I do (and my tweaks to them) is that they are robust
against almost all such noise.

Quite a lot of the work on 'almost sorted' data was terrible (which
will not surprise you), because it took a single measure of degree
of sorting. There are several realistic ones, which are very
different, and a dataset can be almost sorted on one and almost
random on another. The same applies to prefetching, as it is
mathematically very similar.

For example, you can consider the minimum number of points that
need moving to make the data sorted, or the maximum distance any
point has to move, or the total number of pairwise swaps.
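Those three measures can be sketched concretely (my implementations of the standard definitions); the two example arrays below are almost sorted under one measure and nearly random under another:

```python
from bisect import bisect_left, insort

def min_moves(a):
    # n minus the longest non-decreasing subsequence: the fewest
    # elements that must be plucked out and reinserted to sort.
    tails = []
    for x in a:
        i = bisect_left(tails, x + 1)   # allow equal elements to extend
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(a) - len(tails)

def max_displacement(a):
    # How far any element sits from its position in the sorted order
    # (assumes distinct values for the simple dict lookup).
    pos = {x: i for i, x in enumerate(sorted(a))}
    return max(abs(i - pos[x]) for i, x in enumerate(a))

def inversions(a):
    # Total pairwise (adjacent) swaps needed: bubble-sort distance.
    seen, total = [], 0
    for x in a:
        i = bisect_left(seen, x)
        total += len(seen) - i
        insort(seen, x)
    return total

rotated = [9, 0, 1, 2, 3, 4, 5, 6, 7, 8]   # one element out of place
zigzag  = [1, 0, 3, 2, 5, 4, 7, 6, 9, 8]   # every element slightly off
```

`rotated` needs only one element moved (almost sorted by that measure) yet has the worst possible displacement of 9; `zigzag` is the opposite, with a maximum displacement of 1 but half its elements out of place.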


Regards,
Nick Maclaren.