From: Ken Hagan on
On Fri, 30 Oct 2009 19:21:28 -0000, Quadibloc <jsavard(a)ecn.ab.ca> wrote:

> If one has a multithreaded core, then that core should have separate
> branch predictor states for each thread as well.

Isn't that the same as "For a multithreaded core, the space available for
storing branch predictor state should be divided exactly 1/N to each
context."? That's fair to each thread, but not necessarily the best use of
a presumably limited resource.
From: Sid Touati on
Mayan Moudgill wrote:
> Sid Touati wrote:
>> Andy "Krazy" Glew wrote:
>>
>>> f) I've long wanted to have the option of loading/unloading predictor
>>> state like other context. Trouble is, it is often faster to
>>> recompute than reload.
>>
>
> You can save the branch predictor tables and restore them on a context
> switch. Or you can zero out the tables on a context switch. Or you can
> just leave them alone, and let them correct themselves as the
> switched-in program runs.

Yeah, we can imagine lots of games inside a chip. My question was about
what has actually been done and experimented with. All we see in papers
are good performance numbers for branch predictors and prefetchers that
nobody is able to reproduce, simply because few people use a machine in
batch mode. In most cases the machine is running with multitasking,
multi-threading, etc.


>
> Turns out that there is not much point to doing either of the first two
> approaches; the branch predictor will correct itself pretty quickly -
> quickly enough that the extra cycles spent unloading and reloading the
> predictor tables on a context switch overwhelm the actual performance gain.

The term "learning" that is usually used to describe these dynamic
mechanisms is a misleading description of what goes on inside
speculative hardware: predictors and prefetchers do not "learn"
anything at execution time, they just play the odds. Learning implies
some kind of "understanding", and a simple automaton with a table
cannot learn anything :)

Anyway, if someone has an exact reference to a serious experimental
study of branch predictors and data prefetchers in a multitasking,
multi-threaded context, could you please point it out.

Best regards
From: Anton Ertl on
"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>b) Most branch predictors in my experience use virtual addresses,
>although using physical addresses can shave a cycle in front end.

I would expect physical addresses to cost extra cycles, because there
is additional translation.

Is there much aliasing from using virtual addresses without address
space numbers or similar? I wouldn't expect it.

>c) P6 anecdote, circa 1991: the IFU (I-cache) designer wanted to flush
>the BTB on all context switches. Because we cross checked, we did not
>need to do so for correctness, and not flushing turned out to be a
>slight performance win.

That seems obvious. With flushing, you have no chance of a hit,
without you have (even though it may be small). Am I overlooking
something?

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: "Andy "Krazy" Glew" on
Anton Ertl wrote:
> "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>> b) Most branch predictors in my experience use virtual addresses,
>> although using physical addresses can shave a cycle in front end.
>
> I would expect physical addresses to cost extra cycles, because there
> is additional translation.

Other way around.

If you have a physically addressed I$, but a virtual branch predictor,
you have to translate the, e.g., virtual branch target addresses into
physical, giving you latency on a predicted taken branch. On the other
hand, it is I-fetch, where latency can often be tolerated.

Whereas you could use physical addresses for I-fetch: e.g. have a
current I-fetch PC (Intel parlance, PFIP, physical fetch instruction
pointer (I made that up)), increment it to the next I$ line. Have the
BTB have physical addresses. Trouble is, you have to do extra work,
like translating when sequential instruction fetch crosses a page
boundary, or remembering such crossings. You pretty much have to
maintain the virtual or linear, VFIP or VLIP, instruction pointers as
well, although maybe not as fast as the main PFIP.

