From: Andy Glew <"newsgroup at comp-arch.net"> on
Attempting to describe some microarchitectures in text-only email to a
friend inspires me to post this. It would be much better to have lots of
drawings illustrating precisely what I am talking about; but this
textual notation is fairly compact, can be understood by some people,
and in some ways, because it underspecifies certain configurations,
allows us to talk without getting bogged down in multiple different ways
of interconnecting the blocks. I.e. sometimes drawing is TOO specific.

= Basic Pipelines =

Let's start out with in-order pipelines
InO
and out-of-order pipelines
OoO

A typical in-order pipeline might look like
InO = BP -> I$ -> D -> RD -> X -> L1$ -> WB
with pipestage names
* BP = branch prediction
* I$ = instruction cache
* RD = register read
* D = decode
* X = execute
* L1$ = L1 data cache
* WB = writeback

A prototypical out-of-order pipeline might look like
OoO = BP -> I$ -> D -> RR -> S -> X -> L1$ -> IW
with additional pipestages
* RR = register rename
* S = schedule
* IW = instruction window - ROB, etc.

Where are the PRF reads?

On a P6-style read-before-schedule & capture machine:
OoO = BP -> I$ -> D -> RR -> RD -> S -> X -> L1$ -> IW

On a HaRRM (1987) or Wmt-style read-after-schedule machine:
OoO = BP -> I$ -> D -> RR -> S -> X -> RD -> L1$ -> IW

I am not going to go into details about this, or the ROB/PRF, because
that's not my topic of discussion here.

Actually, the execution units may be superscalar, or they may share the
same single port.
I could contrive ways of drawing this, but again it's not my real point.
Similarly, the L1$ is often accessed in parallel with the first few
cycles of X.
I might depict this as
X || L1$
or just
X L1$
rather than
X -> L1$

In fact, eliding the arrows altogether makes things less cluttered.
A typical in-order pipeline might look like
InO = BP I$ D RD X L1$ WB
OoO = BP I$ D RR S X RD L1$ IW

and it allows us to throw other stuff on, such as the L2$
InO = BP I$ D RD X L1$ WB L2$
OoO = BP I$ D RR S X RD L1$ IW L2$
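
If you want to play with this mechanically, a minimal sketch in Python
(purely illustrative - the names are mine, not anything standard) treats
a pipeline as a list of stage mnemonics:

   # A pipeline is just an ordered list of stage mnemonics;
   # rendering it is a string join.
   InO = ["BP", "I$", "D", "RD", "X", "L1$", "WB"]
   OoO = ["BP", "I$", "D", "RR", "S", "X", "RD", "L1$", "IW"]

   def render(pipe):
       return " ".join(pipe)

   print(render(InO + ["L2$"]))   # BP I$ D RD X L1$ WB L2$
   print(render(OoO + ["L2$"]))   # BP I$ D RR S X RD L1$ IW L2$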

We can stretch the notation to indicate store queues, etc.; it is nice
to use the DEC/AMD term LSU (load/store unit), even though Intel tends
to split them up

= SMT =

SMT, simultaneous multithreading, like Intel Hyper-Threading, runs
multiple threads on the same out-of-order pipeline.
SMT2(OoO) = SMT2(BP I$ D RR S X RD L1$ IW) L2$
Not indicated: the fact that the I$ is shared but the renamer is replicated, etc.

We might also have SMT in-order machines, like Intel's first generation
Atom:
SMT4(InO) = SMT4(BP I$ D RD X L1$ WB) L2$


By comparison, we can emphasize the single-threaded nature of the
original pipelines:
ST(OoO) = ST(BP I$ D RR S X RD L1$ IW) L2$
ST(InO) = ST(BP I$ D RD X L1$ WB) L2$
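
Continuing the illustrative Python sketch above, the thread wrappers
are just functions that parenthesize a sub-pipeline:

   def wrap(tag, pipe):
       # e.g. wrap("SMT2", OoO) -> "SMT2(BP I$ D RR S X RD L1$ IW)"
       return tag + "(" + render(pipe) + ")"

   print(wrap("SMT2", OoO) + " L2$")   # SMT2(BP I$ D RR S X RD L1$ IW) L2$
   print(wrap("SMT4", InO) + " L2$")   # SMT4(BP I$ D RD X L1$ WB) L2$
   print(wrap("ST", OoO) + " L2$")     # ST(BP I$ D RR S X RD L1$ IW) L2$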


= Multicluster =

The original multicluster microarchitectures that I am familiar with
created wider superscalar machines by replicating the execution units. They
were multicluster in that they did not have uniform bypassing: some
might bypass within the cluster fast, but impose an extra cycle
bypassing to other clusters; some might not permit bypassing at all,
except perhaps through special inter-cluster instructions.


We can attempt to draw this in ascii art as

MC(OoO) = BP I$ D RR S X RD L1$ IW L2$
                       X

MC(InO) = BP I$ D RD X L1$ WB L2$
                     X

but I will instead draw it as


MC(OoO) = BP I$ D RR S MC2(X) RD L1$ IW L2$
MC(InO) = BP I$ D RD MC2(X) L1$ WB L2$

The question always arises as to how much is replicated in the clusters:
MC(X)
MC(X RD)
MC(S X)
MC(S X RD)
MC(S X RD L1$)
This last will be revisited in MCMT.
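
In terms of the illustrative Python sketch, the replication boundary is
just where you cut the list (the slice indices below are, of course,
illustrative):

   def cluster(pipe, start, end, tag="MC2"):
       # Replicate the slice pipe[start:end] inside the cluster wrapper.
       return render(pipe[:start] + [wrap(tag, pipe[start:end])] + pipe[end:])

   # OoO = ["BP", "I$", "D", "RR", "S", "X", "RD", "L1$", "IW"]
   print(cluster(OoO, 5, 6))   # BP I$ D RR S MC2(X) RD L1$ IW
   print(cluster(OoO, 4, 8))   # BP I$ D RR MC2(S X RD L1$) IW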

Execution units can be shared between clusters, like the FPU in AMD
Bulldozer.
It's a pain to depict this in ascii art, so I won't bother here.

Should the renamer be inside the clusters? Depends. It's easier to
maintain single-thread semantics if the renamer is outside, and
similarly for the LSU:

MC(OoO) = BP I$ D RR MC2(S X RD L1$) LSU IW L2$

But since the S X LSU loop is as critical as the S X L1$ loop, it is
natural to put it inside. But then coordinating the two clusters becomes
a pain. You might, e.g., copy all stores to both clusters' LSUs.
I don't know how to draw that in ascii art.

MC(OoO) = BP I$ D RR MC2(S X RD L1$ LSU) IW L2$

You start down the slippery slope of complexity if you have an L1 LSU
inside the cluster for speed,
and an L2 LSU outside the cluster to coordinate things.
Note that the LSU2 need not have any data.
MC(OoO) = BP I$ D RR MC2(S X RD L1$ LSU1) LSU2 IW L2$

We'll assume that the IW is outside the cluster, for single-thread
semantics.


= MCMT =

Once you have multiple clusters, you can run multiple threads on them.
One thread per cluster is natural. It permits the nice simplification of
not having to do bypasses between clusters.

MCMT2(OoO) = BP I$ D RR MCMT2(S X RD L1$) IW L2$

With separate threads, an LSU inside the cluster is natural
MCMT2(OoO) = BP I$ D RR MCMT2(S X RD L1$ LSU) IW L2$
but the issues discussed above may still apply.

For separate threads, it might be natural to put IW inside the cluster.
But IW, instruction window management such as ROB, is not in the
critical loop. If you put IW outside, you might be able to share
non-uniformly, e.g. giving simple threads 1/4 of the IW and complex
threads 3/4.
I call one way of doing this a [[segmented sequential instruction window]].


= MCMT and SMT =

You could make each cluster itself SMT:

MCMT2(SMT(OoO)) = BP I$ D RR MCMT2(SMT(S X RD L1$ LSU)) IW L2$

although now you start driving the ifetch pretty hard.

= Speeding Up Single Threads =

SMT and MCMT are ways of supporting multiple threads. It would be nice
if they could speed up single threads.

MC is used for single threads. However, it is a challenge to do
scheduling for truly decoupled MC microarchitectures. One additional
cycle of bypassing can be afforded, but it gets harder the more
decoupled the clusters are.

My old, old name for using multiple threads to speed up a single thread
was IMT, implicit multithreading.

SpMT, speculative multithreading, is the form of IMT I am most familiar
with.
There are other forms of IMT - more fine-grained than SpMT - which
execute contiguous sections of the instruction stream, such as loop
bodies and regions separated by CALLs and RETurns.

IMT and SpMT need to run on a multithreaded substrate.
Above we have described two multithreaded substrates:
SMT
MCMT
with the obvious addition of
MP
Any of which can be InO or OoO. MCMT can be built with SMT inside the
clusters.
And MP, multiprocessor, i.e. separate cores, can be built out of
single-threaded cores, multithreaded cores, etc.
MP =
MP(ST)
MP(SMT)
MP(MCMT)
MP(MCMT(SMT))

You can therefore run, or at least think about running, IMT or SpMT on
any of these multithreaded substrates:
SpMT(SMT)
SpMT(MCMT)
SpMT(MCMT(SMT))
SpMT(MP)
SpMT(MP(ST))
SpMT(MP(SMT))
SpMT(MP(MCMT))
SpMT(MP(MCMT(SMT)))
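
Continuing the illustrative Python sketch, the nesting is just function
composition over the tag strings:

   def compose(*tags):
       # compose("SpMT", "MP", "SMT2") -> "SpMT(MP(SMT2))"
       out = tags[-1]
       for tag in reversed(tags[:-1]):
           out = tag + "(" + out + ")"
       return out

   print(compose("SpMT", "MCMT", "SMT"))        # SpMT(MCMT(SMT))
   print(compose("SpMT", "MP", "MCMT", "SMT"))  # SpMT(MP(MCMT(SMT)))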

Overall, SpMT(MP(ST)) is hard. It is hard to overcome the inter-core
latencies when forking threads.

SpMT(SMT) is doable - indeed, it is Haitham Akkary's DMT.
However, SMT is often limited to only 2 threads, which really does not
give you much opportunity to
speed things up with SpMT. And the speculative threads tend to interfere
with the non-speculative thread. First, do no harm.
::SpMT(SMT2) - low potential

I motivated MCMT as a way of building a multithreaded microarchitecture
for SpMT that might be able to support more threads. Alternately, I
motivated MCMT by itself, without SpMT, as something intermediate
between SMT and MP.

SpMT(MCMT(ST)) is okay, but there is still probably overhead to go
between clusters, felt on forking.
By the way, SpMT(MCMT(ST)) probably has lots of other stuff added to
preserve sequential semantics.
Won't discuss that here, although I will mention that I am fond of
log-based execution.

SpMT(MCMT(SMT2)) is a potential sweet spot. I think the normal
operating mode would be with one thread per cluster, but the SMT2 allows
a "run thread" to be used to fork off the non-speculative thread, or a
less speculative thread. The runt thread could cycle steal transferring
state from cluster to cluster, and thus minimize slowing down the
non-speculative thread.

SpMT(MP) is hard, as mentioned above.

But SpMT(MP(SMT2)) is another sweet spot - with the runt thread trick
again played to help forking.

So, we have two attractive configurations:
SpMT(MCMT(SMT2))
and
SpMT(MP(SMT2))
with the inner SMT2 used to speed up forking.

SpMT(MCMT(SMT2)) is probably smaller,
but SpMT(MP(SMT2))
makes better use of the many, many cores we have nowadays.

Combining both in
SpMT(MP(MCMT(SMT2)))
is imaginable,
but it is more complex than doing just
SpMT(MP(SMT2)).
Not worth doing unless there is some big win.
Or unless you want to fill a product line hole with MCMT(SMT2)
- e.g. something larger than a single core ST or SMT2(ST), but smaller
than MP2.

= Multilevel Schedulers, etc. =

TBD: S2 S1 schedulers, etc.

TBD: batch scheduling across clusters versus fine grain de-interleaved
scheduling.


= See =

Wiki at
http://semipublic.comp-arch.net/wiki/Alphabet_Soup:_a_Collection_of_Microarchitectures#Basic_Pipelines
From: Paul A. Clayton on
On Aug 4, 7:32 pm, Andy Glew <"newsgroup at comp-arch.net"> wrote:
[snip]
> = Basic Pipelines =
>
> Let's start out with in-order pipelines
>   InO
> and out-of-order pipelines
>   OoO
>
> A typical in-order pipeline might look like
>   InO = BP -> I$ -> D -> RD -> X -> L1$ -> WB
> with pipestage names
> * BP = branch prediction
> * I$ = instruction cache
> * RD = register read
> * D = decode
> * X = execute
> * L1$ = L1 data cache
> * WB = writeback
>
> A prototypical out-of-order pipeline might look like
>   OoO = BP -> I$ -> D -> RR -> S -> X -> L1$ -> IW
> with additional pipestages
> * RR = register rename
> * S = schedule
> * IW = instruction window - ROB, etc.
>
> Where are the PRF reads?
>
> On a P6-style read-before-schedule & capture machine:
>   OoO = BP -> I$ -> D -> RR -> RD -> S -> X -> L1$ -> IW
>
> On a HaRRM (1987) or Wmt-style read-after-schedule machine:
>   OoO = BP -> I$ -> D -> RR -> S -> X -> RD -> L1$ -> IW
>
> I am not going to go into details about this, or the ROB/PRF, because
> that's not my topic of discussion here.

One of the pipeline variants that I like is the use of a
future file for some values (particularly the stack
pointer, global pointer, thread-local storage pointer--
for early address generation; but also some LSbits of a
counter [with an upper-bits-not-zero bit] and perhaps
condition registers for early branch resolution). This
allows moving RD before RR (scheduling and execution
could also be simplified/accelerated--FIFO scheduling
and only shortish immediate offset addressing [perhaps
with base register stored in moderately redundant
format]). (It could even be possible in some cases to
resolve a branch rather than predict it.) Specializing
those three pointers also makes TLB specialization easier,
reducing storage redundancy and tag comparisons.


Paul A. Clayton
just a technophile
From: jacko on
Storage redundancy elimination? Sounds good.

Does the local stack pointer(s) (a slight Forth-language effect) have
to be a pointer, or would a compression codec with a finite logic
area, and depth-independent memory (perfect recall, but may occasionally
stall (rate dependent on area)) be more appropriate?
From: Andy Glew <"newsgroup at comp-arch.net"> on
On 8/4/2010 6:57 PM, Paul A. Clayton wrote:
> On Aug 4, 7:32 pm, Andy Glew <"newsgroup at comp-arch.net"> wrote:
> [snip]

>> Where are the PRF reads?
>>
>> On a P6-style read-before-schedule & capture machine:
>> OoO = BP -> I$ -> D -> RR -> RD -> S -> X -> L1$ -> IW
>>
>> On a HaRRM (1987) or Wmt-style read-after-schedule machine:
>> OoO = BP -> I$ -> D -> RR -> S -> X -> RD -> L1$ -> IW
>>
>> I am not going to go into details about this, or the ROB/PRF, because
>> that's not my topic of discussion here.
>
> One of the pipeline variants that I like is the use of a
> future file for some values (particularly the stack
> pointer, global pointer, thread-local storage pointer--
> for early address generation; but also some LSbits of a
> counter [with an upper-bits-not-zero bit] and perhaps
> condition registers for early branch resolution). This
> allows moving RD before RR (scheduling and execution
> could also be simplified/accelerated--FIFO scheduling
> and only shortish immediate offset addressing [perhaps
> with base register stored in moderately redundant
> format]). (It could even be possible in some cases to
> resolve a branch rather than predict it.) Specializing
> those three pointers also makes TLB specialization easier,
> reducing storage redundancy and tag comparisons.


Yep.

Talking about the different OOO pipes in detail was not the main purpose
of yesterday's exercise (which I did at a very late afternoon lunch
break, after email the night before), which was mainly to talk about MC
and MCMT and SpMT. But you can use the same notation to talk about
varieties of OOO.

The value of a notation is that it promotes conversation by eliding
unnecessary details. Its weakness is when it elides necessary details.

Viz, my post of yesterday hopefully makes it plain that you can have
MC( in-order )
MC( OoO P6 ROB style read before schedule )
MC( OoO Wmt / HaRRM style read after schedule )
MC( OoO future file AMD integer style )

I was about to say that I do not like future files, because most of the
data is invalid. But I see that you specified only certain values
likely to be known up front.

I think that it is unlikely that FF RD is moved before RR: they are
logically similar, indexing a data structure with the logical register
number, and getting out stuff like
   physical register number
   value (if known)
   estimated completion time
      ready/not ready
   producer
      where the value is coming from
      (which instruction, pipe, cluster, RF, PRF, ...)
   rewritten/optimized instructions.
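
To make that concrete, a hypothetical Python sketch of such an entry
(the field names are mine, purely illustrative):

   from dataclasses import dataclass
   from typing import Optional

   @dataclass
   class RenameEntry:
       preg: int                             # physical register number
       value: Optional[int] = None           # value (if known)
       est_completion: Optional[int] = None  # estimated completion time
       ready: bool = False                   # ready/not ready
       producer: Optional[str] = None        # where the value is coming from
                                             # (instruction, pipe, cluster, RF, PRF, ...)
       rewritten: Optional[str] = None       # rewritten/optimized instruction, if any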


Hmmm, I can feel the need for another blurb, on renamer tricks.
From: Paul A. Clayton on
On Aug 5, 9:20 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:
[snip]
> I think that it is unlikely that FF RD is moved before RR:  they are
> logically similar, indexing a data structure with the logical register
> number, and getting out stuff like
>    physical register number
>    value (if known)
>    estimated completion time
>       ready/not ready
>    producer
>       where the value is coming from
>       (which instruction, pipe, cluster, RF, PRF, ...)
>    rewritten/optimized instructions.

Yeah. I guess I was just thinking in terms of the early RISC
in-order pipelines in which register read was part of decode.
Once the logical RegID is known, a RF or RAT could be read.

A tiny future file (e.g., 16 bits of three registers) integrated
with an adder (or two) could be very low latency.

Interestingly, most recent x86 processors do have a limited
set of front-end registers--the segment registers--(with
dedicated adders to create a single immediate), though a
segment register update stalls the pipeline rather than
allowing forward progress on independent operations.


Paul A. Clayton
just a technophile