From: Andy 'Krazy' Glew on 28 May 2010 02:08
== Assumptions ==
Say you want to speed up a single thread of execution.
Say you have N processors to "spread" the execution across.
Or say that you have N "execution clusters" (e.g. see [[Multicluster Microarchitecture]], or [[MCMT]]).
For the moment, let's just forget issues and techniques such as [[SpMT]] or [[Control Independence]] or [[Fine Grained
Deinterleaving Multithreading of Single Threaded Execution]]. Forget [[Instruction Chain Scheduling for Multicluster]],
and the like.
== Cross-cluster dependencies, the killer issue ==
Even if you can figure out how to extract parallelism, such "single thread spreading" microarchitectures often suffer a
common weakness: cross-thread, cross-cluster, or cross-processor dependencies. If you can find completely independent
threads, or chains, or whatever, great. But if there are inter-chain or inter-thread dependencies, and the cost of
communicating such dependencies is high, well, then you have a weakness.
Without loss of generality, let us assume that we are talking about spreading a single thread of execution across
multiple clusters. This applies just as much to spreading it across multiple processors.
Thought experiment: how can we minimize dependencies when spreading a single thread across multiple clusters, without
any special knowledge of the code? E.g. assuming that the code we are executing has uniformly random dependencies
between instructions.
Given that assumption, it follows that the way to spread the code across clusters is to use as few "chunks" or
"batches" of code as possible. E.g. if each cluster is capable of holding 100 instructions, then start fetching
instructions. Feed the first 100 to the first cluster (plus maybe a few more, since some of them will have been
executed while the later instructions are being fetched). Feed the next 100 to the second cluster. And so on,
if you have more than 2 clusters. When you use up all of the clusters, wrap around to the beginning.
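As a minimal sketch, the contiguous-batch allocation above can be written as follows (the function and parameter names here are illustrative, not from any concrete design):

```python
# Round-robin allocation of contiguous batches to clusters: assign each
# dynamic instruction, in fetch order, to a cluster in batches of
# batch_size, wrapping back to cluster 0 when the clusters run out.

def assign_batches(num_instructions, num_clusters, batch_size):
    """Return a list mapping each instruction index to a cluster id."""
    assignment = []
    for i in range(num_instructions):
        batch = i // batch_size          # which contiguous batch this is
        cluster = batch % num_clusters   # wrap around when clusters run out
        assignment.append(cluster)
    return assignment

# With 2 clusters and batches of 2, instructions land in contiguous runs:
print(assign_batches(6, 2, 2))  # [0, 0, 1, 1, 0, 0]
```

The point of the sketch is that cross-cluster edges can only occur at the batch boundaries, of which there are as few as possible for a given window size.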
This minimizes the cross cluster dependencies for arbitrary, uniformly dependent code. It delivers contiguous batches
of instructions to each cluster.
Sure, "smart" algorithms might be able to allocate non-contiguous groups of instructions to each cluster, for particular
dependence graphs. But such smart algorithms either fall back to the batches described above, or are worse for some codes.
Note: I am not saying that smart clustering algorithms are impossible. I am just saying that this stupid algorithm for
allocating contiguous batches of instructions to separate threads is not so bad, for the most challenging case. And it
is really simple. I suggest it as a foundation, to which embellishments can be added.
Since there are a fixed number of registers, there can only ever be a fixed number of register-carried data
dependencies between batches, across cluster boundaries. Worst case, you could insert microinstructions to copy all
register values from one batch to the next batch on a different cluster. That's cheap for instruction sets with a small
number of registers; more registers make it more expensive.
If you can identify the truly live registers... either a priori (e.g. by [[instruction to kill or indicate liveness of
registers]]); or a posteriori (pull registers from the earlier batch/cluster to the later, when the register is actually
read)... then you can reduce even this copying.
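To illustrate the bound, here is a toy liveness computation over a hypothetical instruction format of (dest, sources) pairs. Only the registers a later batch reads before overwriting them actually need to cross the cluster boundary; everything else can be dropped:

```python
# A batch only needs to receive the registers it reads before
# (re)writing them -- its live-in set. Instructions are modeled as
# (dest_reg, [src_regs]) tuples; this is a sketch, not a real ISA.

def live_in(batch):
    """Registers the batch reads before (re)writing them."""
    live, written = set(), set()
    for dest, srcs in batch:
        for s in srcs:
            if s not in written:
                live.add(s)
        written.add(dest)
    return live

later = [("r0", ["r2"]), ("r3", ["r0"])]   # reads r2 before writing it

# Worst case: copy every architectural register across the boundary.
# With liveness info, only r2 actually needs to be carried here.
print(live_in(later))  # {'r2'}
```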
Memory carried dependencies do not have such a convenient upper limit. Potentially every store instruction in an
earlier batch/cluster might need to be forwarded to a load instruction in a later batch/cluster. ... But, we have to
do this anyway. Most practical proposals for building large instruction windows have large store-to-load conflict
detection windows. Only in the best of times can you elide many such memory dependencies.
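The basic matching rule for such a conflict detection window can be sketched as follows, assuming stores carry a program-order sequence number (the representation here is hypothetical):

```python
# Cross-batch store-to-load forwarding: a load in a later batch must see
# the youngest older store to the same address, no matter which cluster
# that store was allocated to. Sequence numbers follow original program
# order across all batches.

def forward_from(stores, load_addr, load_seq):
    """Return the value of the youngest store older than the load that
    matches its address, or None (the load goes to memory)."""
    best = None
    for seq, addr, value in stores:
        if addr == load_addr and seq < load_seq:
            if best is None or seq > best[0]:
                best = (seq, value)
    return None if best is None else best[1]

stores = [(5, 0x100, "A"), (12, 0x200, "B"), (20, 0x100, "C")]
print(forward_from(stores, 0x100, 25))  # 'C': youngest older store wins
print(forward_from(stores, 0x100, 10))  # 'A': the store at seq 20 is younger
```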
So here's what we have got:
* 2 or more "clusters", each with a scheduler, execution units, maybe a cache; i.e. each cluster has an instruction window
* allocate instructions to clusters in sequential, contiguous, "batches"
* a mechanism to carry registers between cluster/batches. e.g. explicit push, or pull, or ...
* a memory dependence mechanism
** perhaps a per-cluster L1 store buffer
** but definitely a master store-to-load conflict detection buffer, shared between the clusters.
There are many designs for the inter-cluster Store-to-Load buffer. (TBD, describe. Point to papers.)
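One possible shape for that two-level arrangement is sketched below. This is purely illustrative: ordering, capacity limits, and invalidation are all omitted, and none of the names correspond to a published design.

```python
# Two-level store buffering: each cluster keeps a small local store
# buffer, and every store is also entered into a shared master buffer
# that loads from other clusters can check. Program-order conflict
# checking (as in the forwarding sketch above) is elided for brevity.

class StoreBuffers:
    def __init__(self, num_clusters):
        self.local = [dict() for _ in range(num_clusters)]  # per-cluster L1 store buffers
        self.master = {}                                    # shared master buffer

    def store(self, cluster, addr, value):
        self.local[cluster][addr] = value
        self.master[addr] = value

    def load(self, cluster, addr):
        # Fast path: hit in the cluster's own store buffer.
        if addr in self.local[cluster]:
            return self.local[cluster][addr]
        # Slow path: cross-cluster check in the shared master buffer.
        return self.master.get(addr)

sb = StoreBuffers(2)
sb.store(0, 0x40, "X")
print(sb.load(1, 0x40))  # cross-cluster forwarding via the master buffer
```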
= Now Back Off =
Now let's back off from the assumption that all batches are maximal sized.
== Smaller Batches ==
How about allocating batches that are (approximately) half, or 1/4 the per-cluster instruction window?
This way, you get multiple batches in a cluster, some old, some new, rather than one cluster holding strictly the
oldest instructions at any time, which would tend to mean that first one cluster executes, then the next.
This way, if there is long distance parallelism, the batches spread across different clusters can find it.
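Reusing the round-robin scheme, the interleaving this produces is easy to see (parameters again hypothetical):

```python
# With batch_size set to half the per-cluster window, each cluster holds
# an older and a newer batch at once, so every cluster has work even
# when the exploitable parallelism is long-distance.

def cluster_contents(num_instructions, num_clusters, batch_size):
    contents = [[] for _ in range(num_clusters)]
    for i in range(num_instructions):
        contents[(i // batch_size) % num_clusters].append(i)
    return contents

# 2 clusters, window of 4, batches of 2: each cluster mixes old and new.
print(cluster_contents(8, 2, 2))  # [[0, 1, 4, 5], [2, 3, 6, 7]]
```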
== Loop Batching ==
Try to align the batch boundaries to loop boundaries. This way, you can get different loop bodies in different clusters.
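A simple heuristic for this, sketched under the assumption that a backward branch marks a loop back-edge, is to cut the batch at each such branch (the trace format here is made up for illustration):

```python
# Loop-aligned batching: end a batch at each backward branch (a likely
# loop back-edge), so successive loop bodies land in different clusters.
# Instructions are modeled as (pc, branch_target_or_None) pairs.

def loop_batches(trace, max_batch):
    batches, current = [], []
    for pc, target in trace:
        current.append(pc)
        back_edge = target is not None and target <= pc
        if back_edge or len(current) >= max_batch:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches

# A 3-instruction loop body ending in a backward branch at pc 12:
trace = [(10, None), (11, None), (12, 10)] * 2
print(loop_batches(trace, 100))  # [[10, 11, 12], [10, 11, 12]]
```

The size cap keeps straight-line code falling back to the plain contiguous batches described earlier.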
From: MitchAlsup on 28 May 2010 11:37
This sounds a lot like Guri Sohi's MultiScalar thing of the late
From: Andy 'Krazy' Glew on 28 May 2010 22:06
On 5/28/2010 8:37 AM, MitchAlsup wrote:
> This sounds a lot like Guri Sohi's MultiScalar thing of the late
It might not be surprising, since Guri was my advisor at UWisc.
But, oddly enough, I never thought of Multiscalar in this regard. Perhaps because when I was working with Guri, my head
was in the SpMT fork-on-call, etc. space. (I wasn't working on SpMT; I was assuming that Haitham Akkary would have
finished DMT, so I was trying to skip to what might come after DMT.)
Maybe I backed into Multiscalar... ?
Ah, no. Multiscalar was expandable split windows. Very much like SpMT, DMT, etc. The instruction window was not
contiguous. Or, rather, there was a large instruction window, but only discontiguous pieces of it were instantiated at
any time. Differs mainly from SpMT and DMT in details and mechanisms.
My batched instruction window is in some ways a retreat from discontiguous instruction windows.
Don't get me wrong: I still hope to see SpMT or the like get built. Maybe I'll even be lucky enough to build it. Or at
least publish some patents on it.
However, batched instruction window is a retreat. It doesn't look for places to skip to and run speculative threads. It
doesn't have multiple instruction pointers. It just seeks to paste together the instruction windows of two processors
or EU clusters, to make a single large instruction window.
Argument: in an MCMT machine you have the bandwidth to supply two threads. If only one is running, then can you use
that extra ifetch bandwidth? Using unrolled BTB, etc., not so hard to fetch. (Modulo branch mispredictions.) Two
cases: (1) the execution window drains as fast as the 2X ifetch can supply: cool, great! (2) the execution window clogs
up. In which case, switching to the execution window of the second cluster in a 2 cluster MCMT *might* get you some
more performance by exposing more parallelism.
Basically, it's worth a shot. The cross cluster dependencies are minimized as described, etc.
Now, I admit that I have an idea in flight that is more like batch and multiscalar. I'm even allowed to post it. But I
want to give my employer a chance to patent it.
From: Andy 'Krazy' Glew on 30 May 2010 12:01
On 5/28/2010 7:06 PM, Andy 'Krazy' Glew wrote:
> On 5/28/2010 8:37 AM, MitchAlsup wrote:
>> Andy: [[wrt my batched instruction window proposal]]
>> This sounds a lot like Guri Sohi's MultiScalar thing of the late
> Ah, no. Multiscalar was expandable split windows.
Another way this may have reminded you of multiscalar is that the easiest way to allocate the windows to clusters
would be in a ring. Multiscalar had a ring. I think that was one of multiscalar's biggest weaknesses.
From: Robert Myers on 30 May 2010 16:36
On May 28, 2:08 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> Memory carried dependencies do not have such a convenient upper limit. Potentially every store instruction in an
> earlier batch/cluster might need to be forwarded to a load instruction in a later batch/cluster. ... But, we have to
> do this anyway. Most practical proposals for building large instruction windows have large store-to-load conflict
> detection windows. Only in the best of times can you elide many such memory dependencies.
I am reminded of the famous cartoon with "and then a miracle happens"
in the middle of a complex proof.
One advantage that I can think of for your proposal is that it imposes
a natural order on memory references. The usual parallelism problem
of who got there first has a definite answer imposed by the ordering
of a serial instruction stream.
You can't retire any write instructions until all ambiguous pending
read instructions in preceding instruction blocks are complete. You
can't count on any read instructions until all ambiguous pending write
instructions in preceding instruction blocks are complete. You can
ignore the latter constraint and fix up, if necessary, but the thought
of doing it makes my head hurt. It's the same as hoisting a load
above a store. How often does that work?
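The constraint described above can be stated as a small predicate; this is just a restatement of the rule, with an illustrative encoding where an unresolved older store address is None:

```python
# A load in a later batch cannot be trusted while any older store's
# address is still unresolved (it might alias), unless you speculate
# and fix up on a mismatch -- the same problem as hoisting a load
# above a store.

def load_is_safe(older_store_addrs, load_addr):
    """older_store_addrs: resolved addresses, or None if unresolved."""
    for addr in older_store_addrs:
        if addr is None:
            return False          # ambiguous: might alias the load
        if addr == load_addr:
            return False          # known conflict: must forward, not bypass
    return True

print(load_is_safe([0x10, 0x20], 0x30))  # True: all resolved, no alias
print(load_is_safe([0x10, None], 0x30))  # False: one older store ambiguous
```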
I'll save you the trouble. "Aaargh!" But maybe others are as puzzled
about the details as I am.