From: nedbrek on
Hello all,
I am reading through the document at:
http://docs.google.com/fileview?id=0B5qTWL2s3LcQZDIyZDVmN2EtYjY4MC00YjU2LWE4ZGMtYzk2MmU4M2U2NDQ5&hl=en

Some pretty interesting stuff.

I am most surprised by Section 1.9.2 Scheduler (page 29):
"The S1 has 4 16 entry pickers, " and "The S2 scheduler contains 16
partitions, each of 64 entries. Each entry consists of 4 microoperations."

That's 64 uops in L1, and 4K uops in the L2!

====
When we were evaluating HSW, we assumed a ROB of 256 entries, mostly as a
nod to multithreading. Similar machines at the time (P4 and Core 2) used
~128 entry ROBs. Our L2 scheduler was pretty closely coupled to the ROB,
and our L1 was 18 entries (limited by the operand capture file).

I see you discovered the key to HSW (pages 36 and 37), combining the P4 and
P3 register file styles :)

More later!
Ned


From: "Andy "Krazy" Glew" on
nedbrek wrote:

> I see you discovered the key to HSW (pages 36 and 37), combining the P4 and
> P3 register file styles :)

What's HSW?
From: "Andy "Krazy" Glew" on
nedbrek wrote:
> Hello all,
> I am reading through the document at:
> http://docs.google.com/fileview?id=0B5qTWL2s3LcQZDIyZDVmN2EtYjY4MC00YjU2LWE4ZGMtYzk2MmU4M2U2NDQ5&hl=en
>
> Some pretty interesting stuff.
>
> I am most surprised by Section 1.9.2 Scheduler (page 29):
> "The S1 has 4 16 entry pickers, " and "The S2 scheduler contains 16
> partitions, each of 64 entries. Each entry consists of 4 microoperations."
>
> That's 64 uops in L1, and 4K uops in the L2!
>
> ====
> When we were evaluating HSW, we assumed a ROB of 256 entries, mostly as a
> nod to multithreading. Similar machines at the time (P4 and Core 2) used
> ~128 entry ROBs. Our L2 scheduler was pretty closely coupled to the ROB,
> and our L1 was 18 entries (limited by the operand capture file).
>
> I see you discovered the key to HSW (pages 36 and 37), combining the P4 and
> P3 register file styles :)
>
> More later!
> Ned
>
>



Ah, Ned, I see that HSW = Hierarchical Scheduling Windows, as in

Hierarchical Scheduling Windows by E Brekelbaum - 2002
http://portal.acm.org/ft_gateway.cfm?id=774865&type=pdf&coll=GUIDE&dl=GUIDE&CFID=72794697&CFTOKEN=44981784





I must admit that I never paid your HSW papers much attention. I was at AMD at the time, so didn't have the opportunity
to talk to you directly.

I don't like the combination of a slow scheduler and a fast scheduler feeding the same execution units; it seems to me to
lead to a mess of wires and to violate modularity. And replicating the execution units just leads to a bypass mess. But
your point about latency tolerance is well taken.

I also had bad experience with the EDF (Explicit Data Forwarding) system you use in your slow scheduler. I have
encountered this now in two different companies, and neither time did it pan out. Heck, I might as well say, such a
structure was the original P6 microarchitecture proposal, before I came in and showed that some CAMs are bad and some
CAMs are not so bad. However, my aversion to this is not so strong as my concerns of the previous paragraph: I have
spent my whole career eliminating CAMs, and I think that the EDF direction is the right one, if tweaked appropriately.
I say more on this below. I've just seen other people's EDF-like designs crash and burn twice now over 20 years (as well
as innumerable less detailed EDF-like proposals, like EDF itself), so I tend to shy away.



> I see you discovered the key to HSW (pages 36 and 37), combining the P4 and
> P3 register file styles :)

I got all excited when I read what you said above, and printed out and read the above HSW paper by you and Chris and
Bryan and Jeff. But I am afraid I don't see where you describe this. Is it in another HSW paper, or am I just not
reading carefully enough?
Perhaps that is the "extensive analysis (not documented in this paper)" you refer to on page 31 in the reference above.
However, I am probably looking at the wrong HSW paper, since the one I am looking at ends at page 36, and does not
have a page 37.

Ah, wait... I see on page 31 you say "register read is performed before instructions are inserted into the fast window
making it a data capture device".

Cool. Similar lines of thought.

Now, I think the key thing that I realized in Multi-star wrt register files has nothing to do with clusters or two level
schedulers. It is that you can combine the P6 and P4 ROB/PRF/register file/operand capture approaches, which I call
"read RF before schedule" and "read RF after schedule", even with a single level of scheduler.

You can have a big PRF (typically renamed into from the logical register file), a scheduler (your favorite scheduler - P6
style RS, a bit matrix, whatever), and a smaller operand structure. This smaller structure, OC, can be arranged as a
P6-style CAM, or an RF. Or a hybrid.

I always thought of the P6-style RS CAMs as controlling wordlines effecting capture of the data - and the scheduler
directly controlling those wordlines. Put another way, scheduler entries either contained or were in 1:1 direct
correspondence with operand capture entries (actually, 1:2 correspondence, since a P6 uop always had 2 src operands).

A level of indirection here is not necessarily a bad thing. Although it probably costs an extra cycle or so coming out
of the scheduler and going to the execution units. (For a long time I was thinking unclearly, assuming that the
advantage of capture structures was their lower latency loop. That is the advantage, but the source of it is mainly
reduced #ports * #bits_per_entry * #entries, not necessarily the elimination of the level of indirection - or, rather,
avoiding the introduction of the level of indirection we are about to describe.)

I.e. instead of the scheduler saying "I am scheduling uop #i" and the operand capture structure looking up entry #i,
i.e. where #i is used to index both scheduler and OC,
we might have the scheduler say "I am executing uop #i",
and then we might look up uop #i in the uop table, and find out that it needs none or one or more than one of the
operands #p1, #p2, #p3....
and then we would look up in the operand capture structure these operands.

This allows us to do things like
a) sharing operands between uops in the scheduler
and
b) using 0 or 1 or 2 or 3 operands, using just what is needed, instead of always using a fixed number of 2 or 3

although it does make the scheduler a little bit harder, since it must schedule RF ports.
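A toy sketch of that indirection, with all names (uop table, operand pointers, OC contents) invented for illustration: the scheduler names a uop, a uop table maps it to however many operand pointers it actually has, and those pointers index a shared operand-capture structure.

```python
# Toy model of the level of indirection described above (names invented).
# Direct scheme: scheduler entry #i *is* operand-capture entry #i.
# Indirect scheme: a uop table maps uop #i to 0..N operand pointers,
# which then index a shared operand-capture structure (OC / RF1).

oc = {}  # operand-capture structure: pointer -> value

uop_table = {
    # uop id -> list of operand pointers (0, 1, or more per uop)
    0: ["p1", "p2"],
    1: ["p2"],        # shares operand p2 with uop 0
    2: [],            # e.g. a load-immediate needs no captured operands
}

def capture(ptr, value):
    """Initial fill or writeback: write a value into the OC."""
    oc[ptr] = value

def dispatch(uop_id):
    """Scheduler says 'I am executing uop #uop_id'; gather its operands
    via the uop table, using only as many OC ports as there are operands."""
    return [oc[p] for p in uop_table[uop_id]]

capture("p1", 10)
capture("p2", 32)
print(dispatch(0))  # two operands read
print(dispatch(1))  # shared operand, read again through the same pointer
print(dispatch(2))  # no operand-capture ports consumed at all
```

The point of the sketch: the scheduler no longer needs one capture slot per source per entry; storage and ports are spent only on operands actually present, at the cost of the extra table lookup.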


This level of indirection basically makes the operand capture structure more like an RF.
Hence I sometimes say that you have an RF2 before the scheduler, and an RF1 after the scheduler, even though it is a
single level instruction window machine.


A related key point is the realization that the ports you use to access the operand capture array after coming out of
the scheduler
a) do not need to be 1:1 with the scheduler instruction entries
b) do not need to have the same indexing principle.

So, you can have three ports on the operand capture OC/RF1:
1) the port you write data into initially - RAM indexed
2) the port you use to obtain operands after scheduling.
a) Could be RAM indexed.
b) Could be 1:1 implicit, direct.
3) the port you use to writeback results from the execution units
a) could be CAM
b) could be RAM

P6 was 1) 2b) 3a).

I only realized (back when I was working on Multistar) that 1) 2a) 3a) was a good possibility - which I made the
Multistar strawman.

While 1) 2a) 3b) is a good possibility, given RF2/RF1 mapping.

Finally, 1) 2b) 3b) is, I suppose, a possibility - but I really don't know what it would be good for.


CAMs are convenient. If you don't use a CAM on the writeback, you need to have a mapping table to remap RF2/PRF numbers
into RF1/OC operand entry numbers. It's easier than the logical-to-PRF remapping, since it doesn't need as much
management, but everything you add is a pain. The CAM just wraps that up into one place.

This seems to be a general principle: you can often use CAMs on pipestages, or implement a mapping table. The mapping
table typically adds an extra pipestage, and often has management complexity such as checkpoint recovery (although not
always). So, it is just a tradeoff - an extra pipestage + some complexity versus CAMs.

---

Hey, this is fun. Haven't thought about this in years in such detail.

---

Getting back to the scheduler: while I think that a non-CAM approach, such as you say EDF is (I'm not sure if I have
read the EDF papers; I have encountered what sound like similar approaches) *should* be a good idea, my experience
when I have seen these worked out in detail was not so good.

Multistar has a placeholder for such a scheduler.

But another of the new ideas in Multistar was that it is not so bad to build a CAM for the big second level scheduler,
S2. Because you don't need to CAM all inputs.

E.g. if the second level scheduler contains entries of 4 or 8 or 16 instructions, you don't need to CAM all possible
inputs. You don't even need to CAM all live-ins. You only need to CAM some input reasonably close to the last-arriving
one. You can use a scoreboard-like timing estimator at the RAT to choose the input to CAM on.
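That selection step can be sketched as follows; the ready-time table, register names, and the rule of waking the whole entry on its predicted last-arriving live-in are all illustrative assumptions, not a worked-out design:

```python
# Sketch: at rename, a scoreboard-style estimator keeps a predicted
# ready-time for each physical register. When a group of uops is packed
# into one S2 entry, we CAM only the live-in predicted to arrive last;
# the other live-ins are assumed ready by the time it fires.

ready_time = {"r1": 3, "r2": 10, "r3": 5}  # predicted cycles until ready

def pick_cam_input(live_ins):
    """Choose the single live-in tag to wake the whole S2 entry on."""
    return max(live_ins, key=lambda r: ready_time.get(r, 0))

entry_live_ins = ["r1", "r2", "r3"]
print(pick_cam_input(entry_live_ins))  # r2: the last predicted arrival
```

If the prediction is wrong (some other input is actually last), the entry wakes early and the not-yet-ready case has to be caught downstream, e.g. in the S1; that is what makes the estimate allowed to be imprecise.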

Thus, in the strawman you quote above

> "The S2 scheduler contains 16 partitions, each of 64 entries. Each entry consists of 4 microoperations."

if each S2 entry has 4 uops, and is only CAMming 1 input, that's "only" 1K CAMs for 4K uops. Not the 8K CAMs that would
have been required if you had 2 CAMs per uop, or the 12K if you had 3 CAMs per uop to handle FMAC.

More and more my thinking has been leaning towards S2 scheduler entries of 8 or 16 uops - or possibly threaded S2
scheduler entries, say 4 uop chunks, chained together into, say, 12 uop groups. If you can do that, the S2 scheduler
might have "only" 512 or 256 CAMs - comparable to existing schedulers, but holding 16 times more uops.
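The arithmetic behind those counts, spelled out from the strawman's numbers (16 partitions of 64 entries, 4 uops per entry, one CAMmed input per entry):

```python
# CAM-count arithmetic for the S2 strawman quoted above.
partitions, entries_per_partition, uops_per_entry = 16, 64, 4
entries = partitions * entries_per_partition   # 1024 S2 entries
uops = entries * uops_per_entry                # 4096 uops in flight

cams_one_per_entry = entries * 1               # 1K CAMs, one per group
cams_two_per_uop = uops * 2                    # 8K CAMs, classic 2-src
cams_three_per_uop = uops * 3                  # 12K CAMs, with FMAC

# Bigger entries, still one CAM per entry:
print(uops // 8)    # 8-uop entries  -> 512 CAMs
print(uops // 16)   # 16-uop entries -> 256 CAMs
```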


By the way, this input reduction technique - performing imprecise scheduling on the S2 uop group entries - also
simplifies the EDF-like hardware that eliminates CAMs. Not sure if it is enough simplification to make EDF worthwhile -
the problem with EDF is handling the overflow cases, i.e. it is design complexity, not size.




Now, the traditional problems with having scheduler entries hold uop groups are
a) what happens if they have too many inputs for the hardware to handle
b) what happens if there are inter-group dependencies.
c) fragmentation. It's damned hard to get 16 dependence-related instructions in a group.

Again, that's a key realization in Multi-star: it doesn't matter!!!!!!


If you take the approach that all of the instruction groups coming out of the S2 are going to the S1, where they are
scheduled as single uops (or, at least, smaller instruction groups, more tightly coupled), then the S1 handles all the
work, scheduling instructions when they are ready.

a) Too many inputs in S2 (which, as we have said above, may have only 1 or 2 inputs per 8 instruction group) - well, let
the S1 handle all the inputs.

b) inter-group dependencies - again, the S1 handles them. All we need to do is ensure that we don't get a deadlock - that
we don't get into a situation where an instruction in the S1 is waiting for an instruction from the S2, which can't get
into the S1 because the S1 is full. There are many techniques to handle this.

c) fragmentation. Doesn't matter. I.e. it doesn't matter if we put unrelated instructions into the same group (as long
as we don't get the above-mentioned deadlocks).

This is another reason why I don't like the HSW approach of sending supposedly latency-insensitive instructions straight
from the S2 to the EUs: you can't slough off this work to the S1.


However, I think that I did note in the Multistar writeup that you could execute straight from the S2, even though not
necessarily all inputs are ready. E.g. by using Wmt-style replay, detecting non-readiness, and sending the non-ready
instructions back to either the S1 or S2. (Actually, I guess that is not Wmt-style replay - they did not replay through
the scheduler. Silly people.)


---


Hey, this was fun.

Those were the days.





From: nedbrek on

"Andy "Krazy" Glew" <ag-news(a)patten-glew.net> wrote in message
news:4B5B6334.4080201(a)patten-glew.net...
> nedbrek wrote:
>
> I don't like the combination of slow scheduler and fast scheduler feeding
> the same execution units, seems to me to lead to a mess of wires, violates
> modularity. And replicating the execution units just leads to a bypass
> mess. But your point about latency tolerance is well taken.

Two clusters worked for the paper; no one complained about it. When we were
looking into actually going to product, axing the slow cluster was one of
the first ideas we looked at... we could delay the results from slow to
fast, but it still meant wires somewhere.

> I also had bad experience with the EDF (Explicit Data Forwardng) system
> you use in your slow scheduler.

EDF was another "good for a paper" thing. We actually had some reviewers
say how much they liked integrating previous work. Talking to circuit guys,
they seemed sure that slow CAMs aren't too bad (i.e. the problem with CAMs
mostly apply only to fast CAMs).

> > I see you discovered the key to HSW (pages 36 and 37), combining the P4
> > and
> > P3 register file styles :)
>
> I got all excited when I read what you said above
>
> Ah, wait... I see on page 31 you say "register read is performed before
> instructions are inserted into the fast window making it a data capture
> device".

I was thinking of the ppt from the Micro presentation, it is made explicit
there (slide 5 is P3 vs. P4, slide 7 is "Stop! You're both right!")

> A level of indirection here is not necessarily a bad thing. Although it
> probably costs an extra cycle or so coming out of the scheduler and going
> to the execution units. (For a long time I was thinking unclearly,
> assuming that the advantage of capture structures was their lower latency
> loop. That is the advantage, but the source of it is mainly reduced
> #prts*#bitsper_entry*#entries, not necessarily the elimination of the
> level of indirection - or, rather, avoiding the introduction of the level
> of indirection we are about describe.)

Yes, P3 scheduling is all about fewer read ports (1 wide, vs 2 split per
uop). Half the ports makes a smaller structure, which gives you the lower
latency loop :)

> More and more my thinking has been leaning towards S2 scheduler entries of
> 8 or 16 uops - or possibly threaded S2 scheduler entries, say 4 uop
> chunks, chained together into, say, 12 uop groups. If you can do that, the
> S2 scheduler might have "only" 512 or 256 CAMs - comparable to existing
> schedulers, but holding 16 times more uops.

That's where my mind starts to boggle. I would need to see branch predictor
and serialization data showing a window this big would deliver significant
performance gains. We were looking at Itanium runs of Spec2k built by
Electron (i.e. super optimized). We were assuming very heavy implementation
(few serializing conditions). We were unable to scale this far.

> By the way, this input reduction technique - performing imprecise
> scheduling on the S2 uop group entries - also simplifies the EDF-like
> hardware that eliminates CAMs. Not sure if it is enough simplification to
> make EDF worthwhile - the problem with EDF is handling the overflow cases,
> i.e. it is design complexity, not size.

One of the nifty things we showed about EDF in the face of a 2 level
scheduler - you can drop the updates on overflow. The mover will eventually
shift the stuff down to L1. We filed a patent on that, it's out there
somewhere.

> Now, the traditional problems with having scheduler entries hold uop
> groups are
> a) what happens if they have too many inputs for the hardware to handle
> b) what happens if there are inter-group dependencies.
> c) fragmentation. It's damned hard to get 16 dependence-related
> instructions in a group.
>
> Again, that's a key realization in Multi-star: it doesn't matter!!!!!!
>
> If you take the approach that all of the instruction groups coming out of
> the S2 are going to the S1, where they are scheduled as single uops (or,
> at least, smaller instruction groups, more tightly coupled), then the S1
> handles all the work, scheduling instructions when they are ready.

Yea, that's what the "mover" is all about. It makes it possible for
unschedulable stuff to make it to L1.

> However, I think that I did note in the Multistar writeup that you could
> execute straight from the S2, even though not necessarily all inputs are
> ready. E.g. by using Wmt-style replay, detecting non-readiness, and sending
> the non-ready instructions back to either the S1 or S2. (Actually, I guess
> that is not Wmt-style replay - they did not replay through the scheduler.
> Silly people.)

That's probably our biggest area of disagreement.

We used to sit next to the Tejas guys. We heard so many horror stories
about replay, we were dead set against it. HSW was very much oriented
towards making it possible to have little or no replay (the L1 scheduler was
"snatch-back", while the L2 would eat the full L1 latency on dependent ops).

I actually had a slide in the Micro presentation showing relative replay
rates. It didn't go well for the P4 :)

> Hey, this was fun.
>
> Those were the days.

Yea, it's been over three years since I left Intel. My new boss tries to
humor the microarchitect in me, but it's just not relevant in my new job.

Enjoy,
Ned


From: "Andy "Krazy" Glew" on
nedbrek wrote:
>> Me (Andy Glew)
>> More and more my thinking has been leaning towards S2 scheduler entries of
>> 8 or 16 uops - or possibly threaded S2 scheduler entries, say 4 uop
>> chunks, chained together into, say, 12 uop groups. If you can do that, the
>> S2 scheduler might have "only" 512 or 256 CAMs - comparable to existing
>> schedulers, but holding 16 times more uops.
>
> That's where my mind starts to boggle. I would need to see branch predictor
> and serialization data showing a window this big would deliver significant
> performance gains. We were looking at Itanium runs of Spec2k built by
> Electron (i.e. super optimized). We were assuming very heavy implementation
> (few serializing conditions). We were unable to scale this far.

That's almost exactly what Jim Smith told me at Wisconsin - except then we were talking about 256 entry instruction
windows. Although I will freely admit that even then I was thinking in terms of instruction windows in the thousands
and larger. It has just taken years to invent the pieces, the mechanisms, that make it buildable.

In terms of academic research, look at ... well, I wanted to refer to the kilo-instruction processor papers, but I
don't have access to the ACM Digital Library. But, anyway, if memory serves they show some benefit for large
instruction windows, which they justify with mechanisms to reduce window cost. I just go further.

As for stuff that you can do on your own, here are some very simple limit studies:

a) First, do the perfect branch prediction study with unlimited instruction fetch bandwidth, and see how performance
changes as window size changes.

b) Then, go and have realistic branch prediction. But still keep ifetch bandwidth infinite. And see how many
instructions can be reached without a branch misprediction. Take advantage of control independence - e.g. you don't
need to cross past branch mispredictions to fetch instructions after a call returns - you are guaranteed to be able to
execute those.

Track how many of these are dependent on earlier instructions, and how many are not.

c) Now go and add in finite ifetch. But on potentially many threads - the non-speculative threads, and speculative
threads (if that is what you want to call them) e.g. at control independence points.

d) somewhere along the way, take into account branch mispredictions that are dependent on cache misses versus those
that are not.

You get the picture. Do the work: you will see that there are lots of potential instructions out there. Even without
SpMT/SkMT; but with SpMT/SkMT there are even more.
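Limit study (a) above can be sketched as a tiny dataflow simulator. Everything here is an illustrative assumption: the trace format (dest register plus source registers per op), unit latencies, infinite execution resources, and a simple in-order-allocate window where op i must wait for op i-W to complete.

```python
# Limit study (a): perfect branch prediction, infinite fetch bandwidth,
# infinite execution resources; only the window size constrains issue.

def windowed_ipc(trace, window):
    """trace: list of (dest, [srcs]); every op takes 1 cycle.
    In-order allocate, out-of-order issue within a sliding window."""
    done = {}               # dest reg -> cycle its value is ready
    retire = [0] * window   # completion cycle of each window slot
    cycles = 0
    for i, (dest, srcs) in enumerate(trace):
        alloc = retire[i % window]   # wait for op i-window to free the slot
        ready = max([done.get(s, 0) for s in srcs], default=0)
        finish = max(alloc, ready) + 1
        retire[i % window] = finish
        done[dest] = finish
        cycles = max(cycles, finish)
    return len(trace) / cycles

# Independent ops: IPC scales with window size. A dependence chain: IPC
# pins at 1 no matter how big the window gets.
indep = [("r%d" % i, []) for i in range(64)]
chain = [("r%d" % i, ["r%d" % (i - 1)] if i else []) for i in range(64)]
print(windowed_ipc(indep, 4), windowed_ipc(indep, 16))  # window-limited
print(windowed_ipc(chain, 4), windowed_ipc(chain, 16))  # dependence-limited
```

Swapping perfect prediction for a real predictor model, and infinite fetch for finite fetch, turns this same loop into studies (b) through (d).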

Having demonstrated that there is potential in keeping a large number of instructions in flight, you then have to figure
out how to build the hardware cheaply enough. That's what stuff like Multistar is about.


Note that when you have an S2 scheduler such as I am talking about, you are not necessarily keeping all instructions
around, nor allocating PRF entries for their results. A scheduler entry might be just an IP and a tag used to trigger
its dispatch.






>> However, I think that I did note in the Multistar writeup that you could
>> execute straight from the S2, even though not necessarily all inputs are
>> ready. E.g. by using Wmt-style replay, detecting non-readiness, and sending
>> the non-ready instructions back to either the S1 or S2. (Actually, I guess
>> that is not Wmt-style replay - they did not replay through the scheduler.
>> Silly people.)
>
> That's probably our biggest area of disagreement.
>
> We used to sit next to the Tejas guys. We heard so many horror stories
> about replay, we were dead set against it. HSW was very much oriented
> towards making it possible to have little or no replay (the L1 scheduler was
> "snatch-back", while the L2 would eat the full L1 latency on dependent ops).
>
> I actually had a slide in the Micro presentation showing relative replay
> rates. It didn't go well for the P4 :)