From: Anton Ertl
"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>Anton Ertl wrote:
>> "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>>> b) Most branch predictors in my experience use virtual addresses,
>>> although using physical addresses can shave a cycle in the front end.
>>
>> I would expect physical addresses to cost extra cycles, because there
>> is additional translation.
>
>Other way around.
>
>If you have a physically addressed I$,

Some years ago the usual way was virtually-indexed physically-tagged
L1 caches. Has this changed?
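
As a minimal sketch of why that organization avoids translation
latency (all sizes invented for illustration): the index bits lie
entirely within the page offset, so the array can be read in parallel
with the TLB lookup, and only the tag compare waits for the physical
address.

  /* Illustrative VIPT L1 parameters, not any particular CPU:
     4 KiB pages, 64-byte lines, 64 sets (4 KiB per way). */
  #define PAGE_BITS 12
  #define LINE_BITS  6
  #define SET_BITS   6   /* LINE_BITS + SET_BITS <= PAGE_BITS */

  unsigned set_index(unsigned vaddr)
  {
      /* Bits [11:6] are below the page boundary, hence identical
         in the virtual and physical address. */
      return (vaddr >> LINE_BITS) & ((1u << SET_BITS) - 1);
  }

  int tag_matches(unsigned paddr, unsigned stored_tag)
  {
      /* The tag compare uses the physical address the TLB
         delivers while the array access is already under way. */
      return (paddr >> PAGE_BITS) == stored_tag;
  }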

> but a virtual branch predictor,

Ah, you mean the addresses coming out of the branch predictor, right?

I was thinking about the addresses going in; that's because
conditional branch predictors only predict taken/not-taken, and
because the question being discussed was the aliasing in the branch
predictor from merging the histories of different threads.

For the addresses going in, using physical addresses would increase
the latency (or at least the hardware required), and the benefit is
probably small.
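
To make the aliasing concrete, here is a toy gshare-style predictor
in C (sizes and hash invented); with two threads on the core, both
index the same counters with their virtual PCs and a merged global
history:

  /* Toy gshare-style predictor shared by all threads on a core. */
  #define PHT_BITS 12
  static unsigned char pht[1 << PHT_BITS];  /* 2-bit counters */
  static unsigned ghist;                    /* merged history */

  int predict_taken(unsigned pc)
  {
      unsigned idx = (pc ^ ghist) & ((1u << PHT_BITS) - 1);
      return pht[idx] >= 2;
  }

  void train(unsigned pc, int taken)
  {
      unsigned idx = (pc ^ ghist) & ((1u << PHT_BITS) - 1);
      if (taken  && pht[idx] < 3) pht[idx]++;
      if (!taken && pht[idx] > 0) pht[idx]--;
      /* With SMT, ghist interleaves both threads' outcomes, and
         branches at equal virtual PCs train the same counter. */
      ghist = (ghist << 1) | (unsigned)(taken != 0);
  }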

>you have to translate the, e.g., virtual branch target addresses into
>physical, giving you latency on a predicted taken branch. On the other
>hand, it is I-fetch, where latency can often be tolerated.

For the BTB, storing physical addresses may be a good idea (if it
gives any advantage over virtually-indexed physically-tagged access).

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: "Andy "Krazy" Glew" on
Anton Ertl wrote:
> "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:

>> If you have a physically addressed I$,
>
> Some years ago the usual way was virtually-indexed physically-tagged
> L1 caches. Has this changed?

Although there have been a number of systems with virtually indexed,
physically tagged D$ and I$, including IIRC the Willamette L0 D$,
most Intel x86s of the P6 family have physically indexed, physically
tagged caches.

IMHO virtual indexing has gotten a bit of a bad rap. But it
certainly had a bad reputation in quite a few design groups.

>> but a virtual branch predictor,
>
> Ah, you mean the addresses coming out of the branch predictor, right?

Could be the address coming out.

Could be the address going in.

It is convenient if they are of the same type, so that you can feed
the predictor output right back to the input.
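
A sketch of that loop, with hypothetical helper names: the address
produced this cycle is the address consumed next cycle, so the two
had better live in the same address space.

  #define FETCH_BYTES 16   /* illustrative fetch width */

  struct btb_entry { unsigned target; int predict_taken; };

  /* Hypothetical lookup: returns the matching entry or NULL. */
  struct btb_entry *btb_lookup(unsigned pc);

  unsigned next_fetch(unsigned pc)      /* in: this cycle's PC */
  {
      struct btb_entry *e = btb_lookup(pc);
      if (e && e->predict_taken)
          return e->target;             /* out: next cycle's PC */
      return pc + FETCH_BYTES;          /* sequential fall-through */
  }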


>
> I was thinking about the addresses going in; that's because
> conditional branch predictors only predict taken/not-taken, and
> because the question being discussed was the aliasing in the branch
> predictor from merging the histories of different threads.
>
> For the addresses going in, using physical addresses would increase
> the latency (or at least the hardware required), and the benefit is
> probably small.

Why would physical addresses going in increase the latency?

They would not increase latency of the array lookup or tag match.

They add complexity. And they require the target to be translated
when it is put into the array, typically on a misprediction when you are
doing an ifetch anyway.
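
A sketch of that fill path (helper names and sizes are invented):
the one translation happens at fill time, off the predict loop.

  #define BTB_BITS 10
  struct btb_fill_entry { unsigned phys_target; int valid; };
  static struct btb_fill_entry btb[1 << BTB_BITS];

  unsigned translate_va(unsigned vaddr);   /* assumed TLB port */

  void btb_fill(unsigned branch_pa, unsigned target_va)
  {
      unsigned idx = (branch_pa >> 2) & ((1u << BTB_BITS) - 1);
      /* Translate once, here: the front end is redirecting on
         the mispredict anyway, so this latency is off the
         predict-loop critical path. */
      btb[idx].phys_target = translate_va(target_va);
      btb[idx].valid = 1;
  }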

> For the BTB, storing physical addresses may be a good idea (if it
> gives any advantage over virtually-indexed physically-tagged access).

Like I said, unclear if it is a complexity win. Definitely costs devices.
From: Quadibloc
On Nov 2, 3:08 am, "Ken Hagan" <K.Ha...(a)thermoteknix.com> wrote:
> On Fri, 30 Oct 2009 19:21:28 -0000, Quadibloc <jsav...(a)ecn.ab.ca> wrote:

> > If one has a multithreaded core, then that core should have separate
> > branch predictor states for each thread as well.
>
> Isn't that the same as "For a multithreaded core, the space available for  
> storing branch predictor state should be divided exactly 1/N to each  
> context."? That's fair to each thread, but not necessarily the best use of  
> a presumably limited resource.

It's only the same if one has a branch predictor that is capable of
working that way. I certainly do agree that if one can optimally
allocate branch predictor state without incurring inordinate costs for
that capability, one should do so.

However, I was trying to get at something much simpler, and I think
less controversial:

If one has a multithreaded core, branch predictor information should
be labelled by thread, so that information gathered about the branches
in one thread isn't used to control how branches in another thread are
handled. The branch predictor should not simply ignore the fact that
multiple different threads are being executed in the core.
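
Two simple ways to do that labelling, both purely illustrative:
partition the table by thread ID, which is exactly the 1/N split Ken
describes, or fold the thread ID into the index hash, so threads
share capacity without deterministic aliasing.

  #define PHT_BITS 12
  #define TID_BITS  2   /* up to 4 hardware threads; illustrative */

  /* (a) Partition: each thread gets a fixed 1/N slice. */
  unsigned idx_partitioned(unsigned pc, unsigned tid)
  {
      return (tid << (PHT_BITS - TID_BITS))
           | ((pc >> 2) & ((1u << (PHT_BITS - TID_BITS)) - 1));
  }

  /* (b) Hash: XOR in the thread ID, so threads compete for the
     whole table but no longer collide deterministically. */
  unsigned idx_hashed(unsigned pc, unsigned tid)
  {
      return ((pc >> 2) ^ (tid * 0x9e3u)) & ((1u << PHT_BITS) - 1);
  }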

In other words, I was assuming that the branch predictor would be
crude and simple in design; a handful of gates, not a computer in its
own right, which is why I failed to be sufficiently explicit.

John Savard
From: Anton Ertl
"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>Anton Ertl wrote:
>> "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>
>>> If you have a physically addressed I$,
>>
>> Some years ago the usual way was virtually-indexed physically-tagged
>> L1 caches. Has this changed?
>
>Although there have been a number of systems with virtually indexed,
>physically tagged D$ and I$, including IIRC the Willamette L0 D$,
>most Intel x86s of the P6 family have physically indexed, physically
>tagged caches.
>
>IMHO virtual indexing has gotten a bit of a bad rap. But it
>certainly had a bad reputation in quite a few design groups.

Why is that? I had not heard about it before.

>> For the addresses going in, using physical addresses would increase
>> the latency (or at least the hardware required), and the benefit is
>> probably small.
>
>Why would physical addresses going in increase the latency?

My thoughts were along the following lines (but see below): either
the CPU produces the physical address by translating the virtual
address, in which case there is latency; or it maintains the physical
PC as well, in which case additional hardware is required (plus
latency in rare cases, e.g. on a page crossing).

>They would not increase latency of the array lookup or tag match.

OK: use the untranslated part of the address for indexing and delay
the tag match until after the translation, as in virtually-indexed
physically-tagged caches. Yes, that may be possible without extra
latency.
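
Concretely (sizes invented): with 4 KiB pages, address bits [11:0]
are untranslated, so a predictor index drawn only from those bits is
the same whether the incoming address is called virtual or physical.

  #define PAGE_BITS 12
  #define IDX_BITS  10   /* IDX_BITS + 2 <= PAGE_BITS */

  unsigned pred_index(unsigned addr)  /* virtual or physical */
  {
      /* Uses bits [11:2], all below the page boundary. */
      return (addr >> 2) & ((1u << IDX_BITS) - 1);
  }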

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: Terje Mathisen
Quadibloc wrote:
> On Nov 2, 3:08 am, "Ken Hagan"<K.Ha...(a)thermoteknix.com> wrote:
>> On Fri, 30 Oct 2009 19:21:28 -0000, Quadibloc<jsav...(a)ecn.ab.ca> wrote:
>
>>> If one has a multithreaded core, then that core should have separate
>>> branch predictor states for each thread as well.
>>
>> Isn't that the same as "For a multithreaded core, the space available for
>> storing branch predictor state should be divided exactly 1/N to each
>> context."? That's fair to each thread, but not necessarily the best use of
>> a presumably limited resource.
>
> It's only the same if one has a branch predictor that is capable of
> working that way. I certainly do agree that if one can optimally
> allocate branch predictor state without incurring inordinate costs for
> that capability, one should do so.
>
> However, I was trying to get at something much simpler, and I
> think less controversial:
>
> If one has a multithreaded core, branch predictor information should
> be labelled by thread, so that information gathered about the branches
> in one thread isn't used to control how branches in another thread are
> handled. The branch predictor should not simply ignore the fact that
> multiple different threads are being executed in the core.

In a multicore CPU, this is very probably exactly the wrong thing to do:

The usual programming paradigm for such a system is to have many threads
running the same algorithm, which means that training information from
one thread is likely to be useful for another, or at least not detrimental.

Cores that run different functions will have separate sets of
branches to consider, and again each set running the same code can
share branch info.
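
A toy illustration of that sharing benefit, with invented numbers:
once one thread has trained the counter for a loop branch, a second
thread running the same code starts out predicted correctly.

  #include <stdio.h>

  int main(void)
  {
      unsigned char ctr = 0;        /* shared 2-bit counter, 0..3 */

      /* Thread A executes the (taken) loop branch a few times. */
      for (int i = 0; i < 3; i++)
          if (ctr < 3) ctr++;

      /* Thread B, same code, same branch address: its first
         prediction is already "taken". */
      printf("thread B's first prediction: %s\n",
             ctr >= 2 ? "taken" : "not taken");
      return 0;
  }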

The main reason for keeping them separate is simply that the branch
predictor needs to be very close to the instruction fetch and
execution units, something that is hard to achieve if a single large
global branch table is many cycles away.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"