From: nmm1 on
In article <dd85042e-2ccf-4691-bada-6e5b29c4fdde(a)2g2000prl.googlegroups.com>,
Quadibloc <jsavard(a)ecn.ab.ca> wrote:
>On Oct 28, 12:08 pm, Sid Touati <SidnospamTou...(a)inria.fr> wrote:
>
>> With the multicore era, desktop and servers will execute more and more
>> multi-threaded applications, or multiple distinct applications, from
>> distinct users.  When executing multiple threads from multiple
>> applications, branch predictors and data prefetchers are disturbed, and
>> their "learning" becomes erroneous (especially when they use physical
>> addresses as tags).
>
>Multicore processors help, rather than hinder, as someone else
>already noted, since threads running on other processors are
>irrelevant; the branch predictor is a part of each core, so if there
>are other processors, this means fewer threads from the total have to
>be handled by each core.

Grrk. It's not that simple :-( Under most circumstances, threads
tend to wander around CPUs, and kernel code is often executed on
the CPU that invoked it. While it is rare for them to make the
problem worse, it can happen.


Regards,
Nick Maclaren.
From: Chris Gray on
Mayan Moudgill <mayan(a)bestweb.net> writes:

> You can save the branch predictor tables and restore them on a context
> switch. Or you can zero out the tables on a context switch. Or you can
> just leave them alone, and let them correct themselves as the
> switched-in program runs.

> Turns out that there is not much point to doing either of the first
> two approaches; the branch predictor will correct itself pretty
> quickly -
> quickly enough that the extra cycles spent unloading and reloading the
> predictor tables on a context switch overwhelm the actual performance
> gain.

Is it possible to maintain a small reversible *summary* of the contents of
the branch prediction unit? Something like a 64-bit word that has 6 sets of
low-4-bits-of-PC + 1-bit-of-taken plus 4 other bits. I guess what I'm thinking
of here might be just having a second, very small, branch predictor state
that gets a decent number of branches correct. What it suggests is overridden
by the main predictor's state.
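
Just to make the packing concrete, here's a rough C sketch of such a
summary word (the field layout, widths, and names are all invented for
illustration, not any real predictor's format):

#include <stdint.h>

/* Pack six (low-4-bits-of-PC, taken-bit) pairs, 5 bits each, into the
 * low 30 bits of a 64-bit word; the remaining bits are left spare for
 * the "4 other bits". */
#define SUMMARY_ENTRIES 6

static inline uint64_t summary_set(uint64_t w, int i, uint64_t pc, int taken)
{
    uint64_t field = ((pc & 0xFULL) << 1) | (uint64_t)(taken & 1);
    w &= ~(0x1FULL << (5 * i));        /* clear entry i           */
    return w | (field << (5 * i));     /* insert new 5-bit entry  */
}

static inline int summary_get(uint64_t w, int i, unsigned *pc_low4)
{
    uint64_t field = (w >> (5 * i)) & 0x1F;
    *pc_low4 = (unsigned)(field >> 1); /* low 4 bits of the PC    */
    return (int)(field & 1);           /* predicted taken bit     */
}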

Also, what about turning off the use of the branch predictor when switching
into kernel code, then turning it back on when returning to user mode? The first
instruction when returning to user mode could be "start reloading the branch
predictor from this summary word", and the last could be "return to user mode
and re-enable branch predictor".

Just throwing out some weird ideas.

--
Experience should guide us, not rule us.

Chris Gray cg(a)GraySage.COM
http://www.Nalug.ORG/ (Lego)
http://www.GraySage.COM/cg/ (Other)
From: "Andy "Krazy" Glew" on
Sid Touati wrote:
> Andy "Krazy" Glew a �crit :
>>
>>
>>
>> a) There have been studies in academia, published I believe, on the
>> effects of context switching on branch predictors. As you might
>> expect, the more context switching, the worse.
>>
>
> do you have exact references to such academic studies? Of course I was
> talking about real experiments, not simulations. Simulating the
> performance of multicore systems is tricky.
>
>> b) Most branch predictors in my experience use virtual addresses,
>> although using physical addresses can shave a cycle in the front end.
>
> Fine, how do they distinguish between the PCs of two separate
> applications running in parallel on the same multicore processor?

On existing Intel and AMD multicore systems, the processor cores are
separate, and do not share branch predictors. It would be hard to share
a BP, since the cores are attached only at the L2 or L3 cache.

Multithreaded cores could share branch predictors between threads.
Whether they share the tables, but arrange to hash different threads'
versions of the same address to different table entries, is interesting -
a minor research topic.
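
As a very rough sketch of what I mean by hashing (the sizes and the
hash itself are purely illustrative, not any shipping core's):

#include <stdint.h>

#define BP_ENTRIES 4096               /* say, 4K 2-bit counters */

static uint8_t bp_table[BP_ENTRIES];

/* Fold the hardware thread id into a gshare-style index so that the
 * same PC from different threads lands in different entries. */
static inline unsigned bp_index(uint64_t pc, unsigned tid, uint64_t ghist)
{
    uint64_t idx = (pc >> 2) ^ ghist ^ ((uint64_t)tid << 7);
    return (unsigned)(idx & (BP_ENTRIES - 1));
}

static inline int bp_predict(uint64_t pc, unsigned tid, uint64_t ghist)
{
    return bp_table[bp_index(pc, tid, ghist)] >= 2;  /* taken if >= 2 */
}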


> you are right when we talk about executing multiple OpenMP threads of
> the same application. In practice, multiple applications can be run in
> parallel, and this is usually the way we use computers (batch mode is
> reserved for special situations only)

Nevertheless, even different apps share many OS services and DLLs; the
training from one app on the shared code may help a separate app, whether
running simultaneously or later in time. The problem is distinguishing
shared from unshared. Standard multi-predictor choosers may work,
choosing between shared & unshared. Similarly, standard techniques such
as partial tags, and the aforementioned hashing.
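
E.g., a tournament-style chooser, sketched in C (all sizes, names, and
the indexing are made up for illustration):

#include <stdint.h>

#define CHOOSER_ENTRIES 1024

static uint8_t chooser[CHOOSER_ENTRIES];   /* 2-bit saturating counters */

/* Pick between the "shared" (unsalted) and "private" (thread/ASID
 * salted) component predictions. */
static inline int choose(uint64_t pc, int pred_shared, int pred_private)
{
    unsigned i = (unsigned)(pc >> 2) & (CHOOSER_ENTRIES - 1);
    return (chooser[i] >= 2) ? pred_private : pred_shared;
}

/* On resolve, nudge the counter toward whichever component was right. */
static inline void choose_update(uint64_t pc, int pred_shared,
                                 int pred_private, int taken)
{
    unsigned i = (unsigned)(pc >> 2) & (CHOOSER_ENTRIES - 1);
    if (pred_private == taken && pred_shared != taken && chooser[i] < 3)
        chooser[i]++;
    else if (pred_shared == taken && pred_private != taken && chooser[i] > 0)
        chooser[i]--;
}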

- - -

Sorry if I am cryptic.

Writing this using handwriting recognition on a tablet PC in a shuttle
van driving from Seattle to Portland as part of my weekly commute
to/from Intellectual Ventures in Bellevue.

The van bounces so much that the keyboard is almost useless.

Interesting: cursive is normally better than printing, but not when
driving/vibrating.
From: "Andy "Krazy" Glew" on
Mayan Moudgill wrote:
> Sid Touati wrote:
>> Andy "Krazy" Glew a �crit :
>>
>>> f) I've long wanted to have the option of loading/unloading predictor
>>> state like other context. Trouble is, it is often faster to
>>> recompute than reload.
>>
>
> You can save the branch predictor tables and restore them on a context
> switch. Or you can zero out the tables on a context switch. Or you can
> just leave them alone, and let them correct themselves as the
> switched-in program runs.
>
> Turns out that there is not much point to doing either of the first two
> approaches; the branch predictor will correct itself pretty quickly -
> quickly enough that the extra cycles spent unloading and reloading the
> predictor tables on a context switch overwhelm the actual performance gain.

Exactly. All work in this area has been disappointing. Perhaps it will
pay off as the tables grow bigger - but it is also quite likely that
cross-app training can be useful.

Also, R/W interfaces to the BP arrays have a cost. Maybe the DFT guys
have/want such access ports.

I am skeptical of any proposal to context-switch predictor state.

Perhaps an efficient delta: not the whole table, but a list of the most
costly mispredicts.
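
Something like this, roughly (the structure and sizes are invented just
to illustrate; bp_train() stands in for whatever update interface the
core might expose):

#include <stdint.h>

#define DELTA_ENTRIES 16

struct bp_delta_entry {
    uint64_t pc;          /* branch address                       */
    uint8_t  bias;        /* dominant direction: 1 = taken        */
    uint32_t mispredicts; /* cost observed in the last time slice */
};

struct bp_delta {
    struct bp_delta_entry e[DELTA_ENTRIES];
    unsigned n;
};

/* On switch-in, replay the recorded biases to pre-train the predictor
 * toward the worst offenders, instead of restoring the whole table. */
static void bp_delta_replay(const struct bp_delta *d,
                            void (*bp_train)(uint64_t pc, int taken))
{
    for (unsigned i = 0; i < d->n; i++)
        bp_train(d->e[i].pc, d->e[i].bias);
}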
From: "Andy "Krazy" Glew" on
Quadibloc wrote:
> If one has a multithreaded core, then that core should have separate
> branch predictor states for each thread as well.
>
> John Savard

could, not necessarily should.