From: Sid Touati
Hi all,
branch predictors and data prefetchers are usually evaluated by
considering a single task: a fixed benchmark is the only program
executed when evaluating the efficiency of the hardware data
prefetcher or the branch predictor.

With the multicore era, desktops and servers will execute more and
more multi-threaded applications, or multiple distinct applications
from distinct users. When executing multiple threads from multiple
applications, branch predictors and data prefetchers are disturbed,
and their "learning" becomes erroneous (especially when they use
physical addresses as tags).

Does anyone know of a serious experimental study of the performance
of hardware data prefetchers and branch predictors in such a context?

Thanks

S, waiting for the next generation of branch predictors and data
prefetchers for multicore processors
From: "Andy "Krazy" Glew" on
Sid Touati wrote:
> Hi all,
> branch predictors and data prefetchers are usually evaluated by
> considering a single task: a fixed benchmark is the only program
> executed when evaluating the efficiency of the hardware data
> prefetcher or the branch predictor.
>
> With the multicore era, desktops and servers will execute more and
> more multi-threaded applications, or multiple distinct applications
> from distinct users. When executing multiple threads from multiple
> applications, branch predictors and data prefetchers are disturbed,
> and their "learning" becomes erroneous (especially when they use
> physical addresses as tags).
>
> Does anyone know of a serious experimental study of the performance
> of hardware data prefetchers and branch predictors in such a context?
>
> Thanks
>
> S, waiting for the next generation of branch predictors and data
> prefetchers for multicore processors


a) There have been studies in academia, published I believe, on the
effects of context switching on branch predictors. As you might expect,
the more context switching, the worse.

b) Most branch predictors in my experience use virtual addresses,
although using physical addresses can shave a cycle in the front end.
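
For concreteness, here is a minimal C sketch (direct-mapped, all sizes
invented) of a BTB indexed and tagged by virtual PC bits; note that
nothing in the tag identifies which process a PC came from:

#include <stdint.h>
#include <stddef.h>

#define BTB_ENTRIES 512              /* hypothetical capacity */

struct btb_entry {
    uint64_t tag;                    /* upper virtual-PC bits */
    uint64_t target;                 /* predicted taken target */
    int      valid;
};

static struct btb_entry btb[BTB_ENTRIES];

/* Returns 1 and fills *target on a hit, 0 on a miss. With no process
   id in the tag, two programs whose branches share a virtual PC hit
   each other's entries. That is safe - the pipeline verifies every
   prediction - but it is exactly the cross-application pollution
   under discussion. */
static int btb_lookup(uint64_t vpc, uint64_t *target)
{
    size_t   idx = (vpc >> 2) & (BTB_ENTRIES - 1);  /* low PC bits */
    uint64_t tag = vpc >> 11;        /* bits above index and offset */

    if (btb[idx].valid && btb[idx].tag == tag) {
        *target = btb[idx].target;
        return 1;
    }
    return 0;
}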

c) P6 anecdote, circa 1991: the IFU (I-cache) designer wanted to flush
the BTB on all context switches. Because we cross-checked, we did not
need to do so for correctness, and not flushing turned out to be a
slight performance win.

d) Multicore in some ways *reduces* the frequency of context switches
(compared to the same workload running timesliced), so predictors may
improve. It's all a question of what you measure with respect to.

e) Since many multicore and GP-GPU workloads run the same code on
multiple processors, one might hope for possible IMPROVEMENTS in branch
predictors. Especially if learning from one thread can help another.

E.g., a shared BIdB (Branch Identification Buffer) and BTB - basically,
shared big expensive tagged structures, with private histories.
Problem: nobody wants to have shared structures. It's nicer if the
cores are independent. But if your units start becoming clusters of
2-4 processors, then such sharing is reasonable. Similarly, SIMT/CT
(Coherent Threading) warps or clusters may easily employ a shared
branch predictor. There should also be optimizations related to the
mostly shared history.
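
A toy gshare-style sketch of what I mean, with all sizes invented: the
counter table is shared across a cluster of cores, while each core
keeps its own private history register:

#include <stdint.h>

#define NCORES      4                /* hypothetical cluster size */
#define PHT_ENTRIES 4096             /* shared 2-bit counter table */

static uint8_t  pht[PHT_ENTRIES];    /* shared: one big structure */
static uint16_t ghist[NCORES];       /* private: per-core history */

static int predict_taken(int core, uint64_t pc)
{
    /* gshare: fold the branch PC with this core's own history. Cores
       running the same code index the same counters for the same
       (pc, history) pairs, so one thread's training helps another. */
    unsigned idx = ((unsigned)(pc >> 2) ^ ghist[core]) & (PHT_ENTRIES - 1);
    return pht[idx] >= 2;            /* 2-bit counter, 0..3 */
}

static void train(int core, uint64_t pc, int taken)
{
    unsigned idx = ((unsigned)(pc >> 2) ^ ghist[core]) & (PHT_ENTRIES - 1);
    if (taken  && pht[idx] < 3) pht[idx]++;
    if (!taken && pht[idx] > 0) pht[idx]--;
    ghist[core] = (uint16_t)((ghist[core] << 1) | (taken ? 1 : 0));
}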

f) I've long wanted to have the option of loading/unloading predictor
state like other context. Trouble is, it is often faster to recompute
than reload.
From: Sid Touati
Andy "Krazy" Glew a �crit :
> a) There have been studies in academia, published I believe, on the
> effects of context switching on branch predictors. As you might expect,
> the more context switching, the worse.
>

Do you have exact references for such academic studies? Of course, I
was talking about real experiments, not simulations: simulating the
performance of multicore systems is tricky.

> b) Most branch predictors in my experience use virtual addresses,
> although using physical addresses can shave a cycle in front end.

Fine, but how do they distinguish between the PCs of two separate
applications running in parallel on the same multicore processor?

> c) P6 anecdote, circa 1991: the IFU (I-cache) designer wanted to flush
> the BTB on all context switches. Because we cross-checked, we did not
> need to do so for correctness, and not flushing turned out to be a
> slight performance win.

It depends on the workload and on the application.


> e) Since many multicore and GP-GPU workloads run the same code on
> multiple processors, one might hope for possible IMPROVEMENTS in branch
> predictors. Especially if learning from one thread can help another.

You are right when we talk about executing multiple OpenMP threads of
the same application. In practice, though, multiple applications run
in parallel, and that is how we usually use computers (batch mode is
reserved for special situations only).

> f) I've long wanted to have the option of loading/unloading predictor
> state like other context. Trouble is, it is often faster to recompute
> than reload.

I am missing your point here.

Regards
From: Mayan Moudgill
Sid Touati wrote:
> Andy "Krazy" Glew a �crit :
>
>> f) I've long wanted to have the option of loading/unloading predictor
>> state like other context. Trouble is, it is often faster to recompute
>> than reload.
>

You can save the branch predictor tables and restore them on a
context switch. Or you can zero out the tables on a context switch.
Or you can just leave them alone and let them correct themselves as
the switched-in program runs.

It turns out that there is not much point in doing either of the
first two; the branch predictor will correct itself pretty quickly -
quickly enough that the extra cycles spent unloading and reloading
the predictor tables on a context switch overwhelm any actual
performance gain.
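
A toy standalone C example of how quickly a single 2-bit saturating
counter recovers after being left in the "wrong" state by the
previous task:

#include <stdio.h>

int main(void)
{
    /* Counter left "wrong" by the switched-out task:
       0 = strongly not-taken, 3 = strongly taken. */
    int ctr = 0;
    int mispredicts = 0;

    /* The switched-in program executes a branch that is always taken. */
    for (int i = 0; i < 100; i++) {
        if (ctr < 2) mispredicts++;  /* predicted not-taken: wrong */
        if (ctr < 3) ctr++;          /* train toward taken */
    }

    /* Prints 2: only the first two executions mispredict. */
    printf("mispredicts: %d of 100\n", mispredicts);
    return 0;
}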
From: Quadibloc
On Oct 28, 12:08 pm, Sid Touati <SidnospamTou...(a)inria.fr> wrote:

> With the multicore era, desktops and servers will execute more and
> more multi-threaded applications, or multiple distinct applications
> from distinct users. When executing multiple threads from multiple
> applications, branch predictors and data prefetchers are disturbed,
> and their "learning" becomes erroneous (especially when they use
> physical addresses as tags).

Multicore processors help rather than hinder, as someone else already
noted, since threads running on other cores are irrelevant; the
branch predictor is part of each core, so with more cores, each core
has to handle fewer of the total threads.

If one has a multithreaded core, then that core should have separate
branch predictor states for each thread as well.
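
A minimal C sketch of that idea (sizes invented): predictor state is
replicated per hardware thread, and the fetch stage selects it by
thread ID, so the threads' interleaved branch streams cannot scramble
each other's training.

#include <stdint.h>

#define HW_THREADS 2                 /* hypothetical 2-way SMT core */
#define CTRS       1024

struct thread_pred {
    uint16_t history;                /* this thread's branch history */
    uint8_t  counters[CTRS];         /* this thread's 2-bit counters */
};

static struct thread_pred pred[HW_THREADS];  /* one copy per thread */

/* Fetch consults only the issuing thread's state; the other thread's
   branch stream never touches it. */
static int predict_taken(int tid, uint64_t pc)
{
    struct thread_pred *t = &pred[tid];
    unsigned idx = ((unsigned)(pc >> 2) ^ t->history) & (CTRS - 1);
    return t->counters[idx] >= 2;
}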

John Savard