From: Cameron Jack on
On Nov 18, 9:16 pm, Mayan Moudgill <ma...(a)bestweb.net> wrote:
> Andy "Krazy" Glew wrote:
> > Mayan Moudgill wrote:
>
> >> Matt wrote:
>
> >>> Have you seen the "eager execution" patent application (app. no.
> >>> 20090172370)? I mentioned it in my blog a while back (
> >>>http://citavia.blog.de/2009/07/07/more-details-on-bulldozers-multi-th....
>
> >> Had a look; it seems, at a first reading, to be one more of the
> >> "wouldn't it be great to follow both paths of a branch, and once we
> >> figure out which way the branch went, we'll copy registers from one
> >> path over to the other, and, oh, BTW, its not clear that it wouldn't
> >> have been less work to just re-execute the second path" ideas.
>
> > When forking, you copied the renamer state (not really
> > that many bits).
>
> You'll also have to duplicate the call-return prediction structure.
>
> You'll have to tweak the address matching logic in the store queue.
>
> You'll have to change the branch resolution clean-up. IMO, this is the
> difficult bit.
> - you have to free up some renamed registers (or, if you're using a
> different implementation, partially purge the writeback queue).
> - you have to free up entries in the store queue.
>
> On a non-eager machine, the store queue is logically a FIFO; when a
> branch mispredict resolves you (logically) move the valid pointer to the
> last correctly predicted store. In the case of eager execution, the
> store queue becomes a sparse structure,  which in turn means that there
> are going to be issues with allocation, load/store matching, and
> identifying age for commit.
>
> Even freeing of mispredicted registers could be a little more expensive.
> - You may need more bits to distinguish between true/false path predictions.
> - If you're going to be eager on more than one branch, then mispredict
> resolution can potentially kill trees of rename sequences, not just
> straight-line paths.
> - Detecting a free register may become more of a challenge.

Below is a repost of a comment I made at RWT a few weeks back. I had
no replies and am keen to hear what people think of my ideas - whether
they are derivative, someone else's, good or rubbish:

I've been wondering about dynamic predication ever since the first
static implementation in Itanium.

Could you use the existing instruction pointer stack with an
additional branch ID field (like a Thread ID) and a predicate value,
and tag each instruction with these as it passes through the
pipeline?

If the processor already uses a Future Register File, as the Opteron
does, then it should be an easy matter to reset all renamed registers
whose branch ID matches and whose predicate turned out false, while
completing all those with the matching BID whose predicate turned out
true. You'll need more rename registers and everything will get
bulkier. You'd also want to limit its use to cases where the BHT has
seen the branch many times but the outcome recorded in the BTB is less
than, say, 30% certain.
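
Roughly what I have in mind, as a toy sketch (the structure and names
here are mine, not anything from a real design):

  #include <cstdint>
  #include <vector>

  // Toy model: each renamed (future-file) register copy carries the ID of
  // the low-confidence branch it depends on and the predicate value it was
  // produced under.
  struct RenamedReg {
      uint16_t bid;      // branch ID this copy is speculative on (0 = none)
      bool     pred;     // predicate value assumed when the copy was written
      bool     valid;    // still a live copy
      uint64_t value;
  };

  // When branch 'bid' resolves to 'outcome': copies written under the wrong
  // predicate are reset (freed), copies written under the right one simply
  // lose their tag and complete as ordinary results.
  void resolve_branch(std::vector<RenamedReg>& regs, uint16_t bid, bool outcome) {
      for (auto& r : regs) {
          if (!r.valid || r.bid != bid) continue;
          if (r.pred == outcome) r.bid = 0;        // complete
          else                   r.valid = false;  // reset / free
      }
  }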

Added: My reasoning for going down the dynamic predication route
rather than DMT as such is that there is less of a problem with
context sharing - especially with VMs and the extra page-table
structures now on the CPU.

Also, since you mention register access becoming a problem, I've also
been wondering about using a register "block" use/renaming/retiring
scheme similar to Power5's instruction "blocks" (where up to 5
instructions in a certain mix are built, executed and retired
together). Obviously you'd need a lot more registers, but the wiring
could be made a lot simpler - in particular, renaming could be
extremely wide.
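
Just to pin the block idea down, here is a toy allocator sketch (my
own naming; it assumes one fixed-size block of physical registers
handed out per dispatch group and reclaimed whole when the group
retires):

  #include <cstdint>
  #include <vector>

  // Physical registers are allocated and freed in fixed-size blocks, one
  // block per dispatch group, instead of one register at a time.
  constexpr int kBlockSize = 5;   // e.g. one destination per slot of a 5-wide group

  struct RegBlockAllocator {
      std::vector<uint16_t> free_blocks;   // indices of free blocks

      // Allocate a whole block for a group; caller stalls dispatch if empty.
      uint16_t alloc_block() {
          uint16_t b = free_blocks.back();
          free_blocks.pop_back();
          return static_cast<uint16_t>(b * kBlockSize);   // first physical register
      }

      // Retiring (or squashing) a whole group frees its block in one shot,
      // so the free list is tracked per block, not per register.
      void free_block(uint16_t first_phys_reg) {
          free_blocks.push_back(static_cast<uint16_t>(first_phys_reg / kBlockSize));
      }
  };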
From: "Andy "Krazy" Glew" on
Cameron Jack wrote:
> I've been wondering about dynamic predication ever since the first
> static implementation in Itanium.

Go for it!

Some of us have done a lot of work on dynamic predication. (My resume
includes an OOO Itanium, plus I have been working on VLIW and
predication longer than OOO.) But since such work inside companies will
never see the light of day, do not let that hold you back, since you are
not so constrained by NDAs and trade secrets.

I worked on the trace cache (back in my MS, before Intel) mainly so that
I could easily see convergence points, and predicate the sides of the
branches. That was naive: you don't need to see the convergence point
when you start predicating.

Perhaps because I have worked on this, I am not so afraid of eager
execution as is Mayan. When renamed registers have multiple copies with
predicate mask sets attached to them, doing the same thing for the store
buffer is not such a step.

My problem with eager execution is not about how to build it, but about
how to make it worthwhile. Sure, you can track branch probabilities,
and then go down the other path too... But branch predictors are so accurate.

Go for it, Jack. Maybe I should brush off my dynamic predication work
from my MS circa 1990, and/or forget anything I did on Intel and try to
come up with new, better, ways. Trouble is, it is just so painful to
have to forget almost 2 decades' worth of ideas: I almost prefer not to
work on such topics.
From: Mayan Moudgill on
Andy "Krazy" Glew wrote:

> Perhaps because I have worked on this, I am not so afraid of eager
> execution as is Mayan. When renamed registers have multiple copies with
> predicate mask sets attached to them, doing the same thing for the store
> buffer is not such a step.
>

I don't know if that is correct (BTW: I've worked on this on paper, not
at the VHDL level, though).

The assumption is that all in-flight instructions have an issue order
(tag); this is the sequence number in which that instance of the
instruction would have been executed.

Let's look at the store buffer. Its entries are:
A. invalid
B. store written out to L1
C. committed but not yet written out to L1
D. uncommitted
Case B. is useful for bypassing access to the L1 in some cases. We shall
ignore this case, treating these entries as "free", but it's worth
keeping in mind.
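
For concreteness, here is roughly what an entry might carry (field
names are mine):

  #include <cstdint>

  enum class SBState : uint8_t {
      Invalid,       // A. free
      WrittenToL1,   // B. already drained; useful only for bypassing, treat as free
      Committed,     // C. committed but not yet written out to L1
      Uncommitted    // D. still speculative
  };

  struct StoreBufEntry {
      SBState  state;
      uint64_t addr;
      uint8_t  size;          // bytes written
      uint64_t data;
      uint32_t issue_order;   // sequence tag of this dynamic store
  };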

When a load comes in, it must search the store buffer. Possible answers
from each entry are:
E. missed
F. hit but store is from the future (i.e. was issued later than the load)
G. partial hit from unwritten store
H. complete hit
Case F. can be treated as a miss, but it's important to be able to
distinguish between future matches and past matches. Order is also
important when reconciling multiple G. & H. hits; for each byte, you
have to pick the issued-closest-to-the-load store that wrote to the
byte, and if there is no such store, you have to go to the L1 cache. We
call the case where there is a (partial) hit in the store queue a
(partial) bypass, and the issue order(s) of the bypassing store(s) the
bypass issue order.
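
In code, the per-byte reconciliation is roughly the following (a
behavioral sketch in my own terms, not how the CAM hardware is
actually organized):

  #include <cstdint>
  #include <optional>
  #include <vector>

  struct StoreBufEntry {       // simplified version of the entry sketched above
      bool     live;           // committed or uncommitted (cases C. and D.)
      uint64_t addr;
      uint8_t  size;
      uint8_t  data[8];
      uint32_t issue_order;
  };

  // For each byte of the load, pick the store that is older than the load but
  // issued closest to it and wrote that byte; uncovered bytes come from the L1.
  void load_lookup(const std::vector<StoreBufEntry>& sq,
                   uint64_t laddr, int lsize, uint32_t load_order,
                   std::optional<uint8_t>* bytes /* lsize entries */) {
      for (int b = 0; b < lsize; ++b) {
          uint32_t best = 0;
          bytes[b] = std::nullopt;                         // default: go to the L1
          for (const auto& st : sq) {
              if (!st.live) continue;                      // cases A./B.: miss
              if (st.issue_order >= load_order) continue;  // case F.: store from the future
              uint64_t byte = laddr + b;
              if (byte < st.addr || byte >= st.addr + st.size) continue;   // case E.
              if (!bytes[b] || st.issue_order > best) {    // closest-to-the-load store wins
                  best = st.issue_order;
                  bytes[b] = st.data[byte - st.addr];
              }
          }
      }
  }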

Additionally, if loads can overtake stores, then there must be a
load-bypass queue in which the load's issue order, address and size
are stored, along with the bypassed store issue order(s). [This can be
combined with the load-waiting-for-an-L1-miss queue].

When a new store comes in, an invalid (or free) entry is found, and the
store's address, size, data, and issue order are written to the entry.
[A couple of things: the entry can be allocated earlier, and the
computation of the data and the address can be split].

Additionally, if loads can overtake stores, the store must search the
load queue and check:
- does it overlap with a load?
- does the load come later in issue-order?
- would it have mattered (i.e. for each overlapping byte, did the byte
get bypassed from a yet-later store)?
If so, corrective action must be taken [generally equivalent to a branch
mispredict].
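
In code, that check is roughly the following (again my own
bookkeeping; a real design would do this with CAMs and per-byte valid
bits):

  #include <cstdint>
  #include <vector>

  struct LoadQEntry {                        // one load that executed early
      uint64_t addr;
      uint8_t  size;
      uint32_t issue_order;                  // when the load issued
      std::vector<uint32_t> byte_src_order;  // per byte: issue order of the store it
                                             // bypassed from, 0 if it came from the L1
  };

  // A store (saddr, ssize, store_order) executes. If a later load read a byte
  // this store writes and did NOT get that byte from a yet-later store, the
  // load saw stale data and must be squashed and replayed.
  bool store_conflicts(const std::vector<LoadQEntry>& lq,
                       uint64_t saddr, int ssize, uint32_t store_order) {
      for (const auto& ld : lq) {
          if (ld.issue_order <= store_order) continue;             // load is older: fine
          for (int b = 0; b < ld.size; ++b) {
              uint64_t byte = ld.addr + b;
              if (byte < saddr || byte >= saddr + ssize) continue; // no overlap
              if (ld.byte_src_order[b] < store_order)              // not covered by a
                  return true;                                     // yet-later store
          }
      }
      return false;   // no corrective action needed
  }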

The store retire logic generally will retire committed stores in issue
order. It may be possible to retire committed stores in other orders,
but things get squirrely w.r.t. memory ordering and overlapping writes.
[BTW: I am ignoring a large number of details w.r.t. memory ordering and
cache coherence protocols.]

When a branch mispredict is determined, any store queue entries that are
from the mispredicted path are rolled back; i.e. if an entry's issue
order is after the issue order of the mispredicted branch, the entry is
marked invalid.

One possible simplification is to allocate store queue entries in issue
order. This allows depth in the store queue to be a proxy for issue
order; the closer an entry is to the head of the queue, the earlier it
was issued. Writing to L1 is always from the head of the queue.
Mispredict rollback moves the tail of the queue back past the
wrong-path entries, to just after the last correctly predicted store.
The next free entry to be allocated is at the tail of the queue. [This
can also be done by allocating entries at execution time and executing
stores in-order.]
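
The in-order variant is basically a circular buffer; a sketch (my own
simplification, ignoring full/empty corner cases and multithreading):

  #include <cstdint>

  // Circular store queue: head = oldest (next to drain to L1), tail = next free.
  // Depth from the head is a proxy for issue order.
  struct StoreFifo {
      static const int N = 32;
      struct Entry { uint64_t addr, data; uint8_t size; bool committed; };
      Entry e[N];
      int head = 0, tail = 0;

      int  alloc()           { int i = tail; tail = (tail + 1) % N; return i; }
      void drain_one_to_l1() { /* write e[head] to the L1 */ head = (head + 1) % N; }

      // Branch mispredict: everything allocated after the branch is discarded
      // simply by moving the tail back to the first wrong-path slot.
      void rollback_to(int first_wrong_path_slot) { tail = first_wrong_path_slot; }
  };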

Eager execution complicates the concept of issue-order. It is now
issue-order-on-a-particular-path. The behaviors required of the store
queue remain the same.

However, allocating store queues in issue order is no longer practical;
it would lead to invalid entries between the head and the tail of the
FIFO, and a sparse store queue. This, in turn, would make the store
queue a potential bottleneck (either because of the number of effective
entries if you kept the store queue size the same, or because of the
cycle time if you increased the store queue size).

So, the store queue structure you would need would have to hold
out-of-order entries, with additional logic to check for order and
path, etc., etc.
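
Roughly, each entry ends up carrying an explicit issue order plus some
kind of path tag, and allocation comes from a free list rather than a
FIFO tail (again, just my sketch of the bookkeeping, not a claim about
any shipped design):

  #include <cstdint>
  #include <vector>

  struct EagerSQEntry {
      bool     valid;
      uint64_t addr, data;
      uint8_t  size;
      uint32_t issue_order;   // order along its own path
      uint32_t path_mask;     // which speculative path(s) this store belongs to
  };

  // Resolving an eager branch kills every entry on the losing path(s); unlike
  // the FIFO, the survivors are now scattered throughout the structure, so
  // age comparison and commit selection need explicit logic.
  void kill_paths(std::vector<EagerSQEntry>& sq, uint32_t losing_paths) {
      for (auto& e : sq)
          if (e.valid && (e.path_mask & losing_paths))
              e.valid = false;   // slot goes back on the free list
  }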

I will offer an interesting data-point: On both Power5, and AFAIK,
Core-i7/Nehalem, the store-queue is NOT shared between the two threads;
they have separate queues. The modifications to support eager execution
are similar to, but more complicated than, those required for two
threads. I would suggest that, given that store-queue partitioning for 2
threads was picked over a shared store queue, the architects found that
modifying the store queue for eager execution would indeed have a cycle
time impact.
From: Brett Davis on
In article <4B01BAD2.6080002(a)patten-glew.net>,
"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> wrote:

> Andy "Krazy" Glew wrote:
> > Brett Davis wrote:
> >> For SpMT I looked at Alpha style paired integer pipelines ...
> >
> > You haven't said enough about the physical layout to talk about those
> > clustering effects.
>
> The physical layout matters a lot, and hence has its own terminology.
>
> 8-wide on a bit interleaved datapath is pushing the envelope.
>
>
> 8-wide as 2 4-wide bit interleaved datapaths is not so bad, although you
> will pay a lot of area for the wire turns to get to side by side
> datapaths. So I might call this an 8-wide cluster composed of two
> adjacent 4-wide bit-interleaved clusters. With whatever bypassing you
> say.
>
> As I mentioned earlier, there is a sweet trick you can play with
> datapaths that are paired and opposite, reflections. You bit interleave
> the input and output wires within the, say, 4-wide cluster, and then you
> bit interleave the outputs (but not the inputs). Trouble with this
> trick is that it takes one of the ends of the datapath away - where do
> you put the scheduler? The data cache?
> So, I call this "opposed" clustering.

I am afraid I need a Transistor 101 refresher:

The speed of light in copper or aluminum in a modern fab process, the
resulting mm covered in a 3GHz cycle, and, translated into distance,
the number of wires you can cross at a right angle.

The time a transistor takes to switch, and the resulting number of
transistors you can jump through in 3GHz for a CPU stage.

The number of active inputs you can drive with one transistor's output.
The number of inactive inputs you can drive with one transistor's output.
(Does a hard limit of only two active listeners out of eight help?)


I do not really like the Alpha 4+4 and its two cycle delay to cross;
actually I hate it. I can live with 0 cycles to the two closest ALUs, with
a one cycle delay for every second ALU crossed. With 64 bit registers that
means crossing 127 wires. You have like 15 wire layers, so there is a
desire to stack the 2-in/1-out wires to reduce the distance covered by
the bypass wires if you do not interleave. (Ignoring the big economic
incentive to use big previous-gen patterning for the upper level wires.)

How many layers are 45nm on a 45nm CPU process, versus 65nm layers,
and 90nm? Most die slices I have seen tend to march up to low res fast.

I am willing to skin this old cat in new ways to make an 8-way design
work. RRAM is going to force many of the upper layers to high res
anyway, so I will be catching a wave that enables high density wire
stacking where it is needed. Lots of breakthroughs and patents to be
had in this area...

Brett
From: Del Cecchi on

"Brett Davis" <ggtgp(a)yahoo.com> wrote in message
news:ggtgp-191C1C.01014603122009(a)netnews.asp.att.net...
> [Andy Glew's quoted discussion of clustered datapaths snipped]
>
> I am afraid I need a Transistor 101 refresher:
>
> The speed of light in copper or aluminum in a modern fab process, the
> resulting mm covered in a 3GHz cycle, and, translated into distance,
> the number of wires you can cross at a right angle.
>
> The time a transistor takes to switch, and the resulting number of
> transistors you can jump through in 3GHz for a CPU stage.
>
> The number of active inputs you can drive with one transistor's output.
> The number of inactive inputs you can drive with one transistor's output.
> (Does a hard limit of only two active listeners out of eight help?)
>
>
> I do not really like the Alpha 4+4 and its two cycle delay to cross;
> actually I hate it. I can live with 0 cycles to the two closest ALUs, with
> a one cycle delay for every second ALU crossed. With 64 bit registers that
> means crossing 127 wires. You have like 15 wire layers, so there is a
> desire to stack the 2-in/1-out wires to reduce the distance covered by
> the bypass wires if you do not interleave. (Ignoring the big economic
> incentive to use big previous-gen patterning for the upper level wires.)
>
> How many layers are 45nm on a 45nm CPU process, versus 65nm layers,
> and 90nm? Most die slices I have seen tend to march up to low res fast.
>
> I am willing to skin this old cat in new ways to make an 8-way design
> work. RRAM is going to force many of the upper layers to high res
> anyway, so I will be catching a wave that enables high density wire
> stacking where it is needed. Lots of breakthroughs and patents to be
> had in this area...
>
> Brett

In CMOS, the transistors don't really have a switching time. Rather,
there is a delay associated with charging or discharging the gate
(transistor gate, not logic gate) between the on voltage and the off
voltage (usually VDD and Gnd respectively). Since both the gate
capacitance and the current capability of a transistor are proportional
to its width, this leads to the concept of gate-limited delay: the
minimum delay of a gate with no interconnect, driving an identical gate
and driven by an identical gate.

To the gate delay one must add the effects of the interconnect and fan
out.
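
One common back-of-envelope way to put numbers on that is the logical
effort model (Sutherland/Sproull): delay, in units of a basic inverter
delay tau, is a parasitic term plus the gate's logical effort times the
ratio of load capacitance to input capacitance. A toy calculation (the
numbers are the usual textbook approximations, not measurements of any
particular process):

  #include <cstdio>

  // Logical-effort style estimate: d = p + g * (C_out / C_in), in units of tau,
  // where tau is the delay of a minimum inverter driving one identical inverter.
  double gate_delay_tau(double p, double g, double c_out, double c_in) {
      return p + g * (c_out / c_in);
  }

  int main() {
      // Fanout-of-4 inverter (p ~ 1, g = 1 by definition): about 5 tau.
      std::printf("FO4 inverter: %.1f tau\n", gate_delay_tau(1.0, 1.0, 4.0, 1.0));
      // The same inverter into a heavy wire/fan-out load is much slower --
      // the interconnect and fan-out terms dominate.
      std::printf("heavy load:   %.1f tau\n", gate_delay_tau(1.0, 1.0, 20.0, 1.0));
      return 0;
  }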

So far as I know, most of the layers in a 45 nm process are 45 nm,
although typically the last few metal layers are thicker and coarser
for lower-resistance power distribution.

Unless things have changed recently I don't think anybody is doing 15
levels of interconnect.

del