From: Andy 'Krazy' Glew on
https://semipublic.comp-arch.net/wiki/Picking_N-th_ready_element

One common operation in schedulers is picking the 1st, or 1st and 2nd, or ... N-th ... ready elements to schedule.

Without loss of generality we will assume that we are picking in a physical order. E.g. we assume that an array looks like

0
1
2
....
N-1

from top to bottom, and we are trying to find the highest (lowest numbered) element that has a ready bit set.

If we want to pick according to some permuted, priority order we might simply send the ready bits through
[[permutation circuits]], perhaps a [[permutation matrix circuit]], select in that order, and then transform back to
determine the original locations.
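A small software sketch of that flow (the function name and list encoding are illustrative, not from the wiki page): permute the ready bits into priority order, find-first-set there, then map the winner back to its physical location.

```python
def pick_with_priority(ready, perm):
    """Pick the highest-priority ready entry.
    ready: list of 0/1 ready bits in physical order.
    perm:  perm[j] = physical index of the j-th highest priority entry.
    Permute the ready bits into priority order (the permutation circuit),
    find-first-set in that order, then transform back."""
    permuted = [ready[perm[j]] for j in range(len(perm))]
    try:
        rank = permuted.index(1)   # FFS in priority order
    except ValueError:
        return None                # nothing ready
    return perm[rank]              # back to the original location
```

For example, with `perm = [2, 3, 0, 1]` entry 2 has top priority, so it wins whenever its ready bit is set.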


= Picking 1 =

Finding the first, highest priority, lowest index number ready entry in the array is straightforward: use a
find-first-set bit circuit ([[FFS]]). E.g. a ripple carry circuit, or possibly transformed via any of a large number
of techniques for parallel prefix, such as carry lookahead.
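A software model of the FFS picker, treating the ready vector as an integer with bit 0 as the top (highest-priority) entry. The `x & -x` trick isolates the lowest set bit in one step; a hardware ripple or parallel-prefix circuit computes the same function.

```python
def ffs_grant(ready_bits: int) -> int:
    """One-hot grant vector for the lowest set bit (0 if nothing is ready)."""
    return ready_bits & -ready_bits

def ffs_index(ready_bits: int) -> int:
    """Index of the lowest set bit, or -1 if nothing is ready."""
    return (ready_bits & -ready_bits).bit_length() - 1
```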

= Picking 2 =

Picking 2 is slightly harder.

We could use two [[FFS]] circuits, cascading one into the other. This adds delay in most design styles, and scales
poorly when picking more than 2.
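In software terms the cascade is: take the first grant, mask it out of the ready vector, and run a second FFS on what remains (a sketch of the idea, not a circuit):

```python
def pick_two_cascaded(ready_bits: int):
    """Cascade two FFS circuits: the first grant is masked out of the
    ready vector before the second FFS runs, so the second circuit's
    delay adds to the first's. Returns two one-hot grant vectors."""
    first = ready_bits & -ready_bits      # first FFS
    rest = ready_bits & ~first            # mask out the winner
    second = rest & -rest                 # second FFS, after the first
    return first, second
```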

A common trick is to start two FFS circuits at opposite ends of the array, in opposite directions. Unfortunately, if
the array is sorted by priority, this gives you the highest priority element and the lowest priority element. Not the
first and second highest priority elements. And it does not scale past N=2.
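The failure mode is easy to see in a sketch: the two opposed FFS circuits return the lowest-index and highest-index ready entries, which under a priority-sorted array are the highest and lowest priority ready entries.

```python
def pick_two_ended(ready_bits: int):
    """Two FFS circuits started at opposite ends of the array.
    Returns (lowest ready index, highest ready index): under a
    priority-sorted array, the best and the WORST ready entries,
    not the first and second best."""
    if ready_bits == 0:
        return None, None
    lo = (ready_bits & -ready_bits).bit_length() - 1  # upward FFS
    hi = ready_bits.bit_length() - 1                  # downward FFS
    return lo, hi
```

With ready entries at indices 1, 3, and 6 this returns (1, 6), skipping the second-best entry at index 3.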

= Picking M =

The following technique has been suggested to pick any of the top M elements:
* for every entry K in the array, calculate Ready_Before_Me[K] = the number of ready entries I<K such that Ready[I] is true.
** This can be calculated using a well-known [[parallel prefix circuit]], which allows considerable sharing of logic: log(N) delay, N*log(N) hardware.
* at every entry K in the array, perform the comparisons Ready_Before_Me[K] = 0, 1, ..., M-1
** use this to select which of the M output ports an entry should be connected to
** this is basically the well-known [[sum addressed memory decoder]] circuit, which is carry free
** i.e. this whole computation - calculating Ready_Before_Me and selecting - is [[carry propagate free]]

Hardware complexity N*log(N) + M*N - which is not good if M is large, comparable to N. But good if M is small.
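A behavioral sketch of the whole scheme, with output ports numbered from 0. The prefix count is computed serially here but would be a log-depth parallel-prefix tree in hardware, and the per-entry compare-and-route plays the role of the carry-free sum-addressed-style decode.

```python
def pick_m(ready, m):
    """Route the first m ready entries to m output ports.
    ready: list of 0/1 bits, index 0 = highest priority.
    Returns ports, where ports[j] = index of the entry granted
    output port j (None if fewer than j+1 entries are ready).
    before = Ready_Before_Me[k], the number of ready entries with
    index < k; a ready entry drives port number Ready_Before_Me[k]."""
    ports = [None] * m
    before = 0   # parallel-prefix sum in hardware; serial here
    for k, r in enumerate(ready):
        if r and before < m:
            ports[before] = k
        before += r
    return ports
```

With `ready = [0,1,1,0,1,1]` and `m = 3`, entries 1, 2, and 4 land on ports 0, 1, and 2; entry 5, with three ready entries before it, gets no port.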
From: MitchAlsup on
On Jun 26, 1:51 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> https://semipublic.comp-arch.net/wiki/Picking_N-th_ready_element
>
> One common operation in schedulers is picking the 1st, or 1st and 2nd, or ... N-th ... ready elements to schedule.
>
> Without loss of generality we will assume that we are picking in a physical order.  E.g. we assume that an array  looks like
>
> 0
> 1
> 2
> ...
> N-1
>
> from top to bottom, and we are trying to find the highest (lowest numbered) element that has a ready bit set.
>
> If we want to pick according to some permuted, priority order we might simply send the ready bits through a
> [[permutation circuits]], perhaps a [[permutation matrix circuit]], select in that order, and then transform back to
> determine the original locations.
>
> = Picking 1 =
>
> Finding the first, highest priority, lowest index number ready entry in the array is straightforward: use a
> find-first-set bit circuit ([[FFS]]).   E.g. a ripple carry circuit, or possibly transformed via any of a large number
> of techniques for parallel prefix, such as carry lookahead.

With proper access to true and complement signals from latching points, one can pick the first element of 4 in one gate delay, 9 in 2 gate delays, and 24 in 3 gate delays. None of this violates 4-input fan-in rules, but does apply some heavy loads to some inverter fan-outs. I have the schematics if anyone is interested.

For 2 more gate delays, one can make this whole process circular with
a different starting point each cycle (if desired) and a window limit
if desired.
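Behaviorally, the circular variant is a rotate, an FFS, and a rotation of the winning index back (a sketch of the idea only, not Mitch's schematics):

```python
def circular_pick(ready_bits: int, n: int, start: int) -> int:
    """Pick the first ready entry at or after index `start`, wrapping
    around an n-entry array: rotate the ready vector right by `start`,
    find-first-set, then rotate the index back.
    Returns the winning index, or -1 if nothing is ready."""
    if ready_bits == 0:
        return -1
    mask = (1 << n) - 1
    rotated = ((ready_bits >> start) | (ready_bits << (n - start))) & mask
    ffs = (rotated & -rotated).bit_length() - 1
    return (ffs + start) % n
```

With ready entries at indices 1 and 5 of an 8-entry array, starting at 3 picks 5; starting at 6 wraps around and picks 1.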

> Picking 2 is slightly harder.
>
> We could use two [[FFS]] circuits, cascading one into the other. This adds delay in most design styles, and scales
> poorly beyond picking N > 2.

So, for 7 gate delays (3+1+3) one can pick 2 from at least 24 entries.
For 11 gate delays (3+1+3+1+3) one can pick 3 from at least 24
entries. For 14 gates, one can pick 4 from at least 24 entries. In all
of these cases, one is picking the oldest ready element in the vector
of candidates. That is, the oldest ready element is picked first, the
second oldest is picked second, ...

> A common trick is to start two FFS circuits at opposite ends of the array, in opposite directions.  Unfortunately, if
> the array is sorted by priority, this gives you the highest priority element and the lowest priority element. Not the
> first and second highest priority elements.  And it does not scale past N=2.

And not exactly what you want to do anyway. This kind of picker can
pick 'instructions' in unexpected orders and lead to other instruction
order issues.

Pickers are a seriously degenerate form of logic that uses something
resembling a carry chain from an integer adder. In the picker sense,
the carry chain is a single AND gate, while in the adder sense the
carry chain is an OR-AND (Manchester) gate per carry stage. The same
kinds of lookaheads, selectors, and tree arrangements that are
appropriate for adders work with pickers, except that the picker carry
chains are a lot simpler.
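Mitch's analogy can be modeled in software (a sketch of the analogy, not a netlist): the picker's "carry" at position K means "no ready entry has been seen yet", and it propagates through a single AND with the inverted ready bit.

```python
def picker_carry_chain(ready):
    """Picker as a degenerate carry chain. carry = 1 means no entry
    with a lower index was ready; each stage kills the carry with a
    single AND against the inverted ready bit (vs. the OR-AND
    Manchester stage of an adder). grant[k] = ready[k] AND carry."""
    grants = []
    carry = 1                 # carry-in: nothing picked yet
    for r in ready:
        grants.append(r & carry)
        carry &= 1 - r        # first ready entry kills the carry
    return grants
```

The same lookahead and tree transformations that speed up an adder's carry chain apply to this one.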

I strongly suspect, as Andy describes, that when attempting to pick
more than 4-ish from (maybe) up to 60-ish elements, that
multiplication array techniques would be in (logic) order; using
circuits that smell a lot like carry save adder cells.

For pickers faster than described above (14 gates for 4 from 24),
simply take all of the cones of logic, express them, and send the
whole kaboodle through a logic minimizer. Assign a logic engineer and
don't let him give up before the minimization is complete and the
result can be easily expressed in the logic library one happens
to have at hand.

Mitch
From: Terje Mathisen on
MitchAlsup wrote:
> On Jun 26, 1:51 pm, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net>
> wrote:
>> https://semipublic.comp-arch.net/wiki/Picking_N-th_ready_element
>>
>> One common operation in schedulers is picking the 1st, or 1st and
>> 2nd, or ... N-th ... ready elements to schedule.
[snip]
> I strongly suspect, as Andy describes, that when attempting to pick
> more than 4-ish from (maybe) up to 60-ish elements, that
> multiplication array techniques would be in (logic) order; using
> circuits that smell a lot like carry save adder cells.
>
> For pickers faster than described above (14 gates for 4 from 24),
> simply take all of the cones of logic, express them, and send the
> whole kaboodle through a logic minimizer. Assign a logic engineer
> and don't let him give up before the minimization is complete, and
> the minimization can be easily expressed in the logic library one
> happens to have at hand.

For large N and M, would it be appropriate to look at probabilistic methods?

I.e. a single counter knows how many ready (R) elements there are
at a given time; when asking for ready(M) we start at (approximately)
element M*N/R.

What I'm asking is if anyone has modeled probabilistic pickers?

BTW, it is well-known that you can pick the Nth percentile out of an
unordered array in expected linear time, using a non-recursive half
quicksort which always drills further into only the relevant partition.
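The "half quicksort" Terje describes is commonly known as quickselect; a minimal iterative sketch, partially sorting the array in place:

```python
import random

def quickselect(a, k):
    """Iterative quickselect: partition around a random pivot, then
    drill into only the partition containing index k. Expected linear
    time; the array ends up partially sorted as a side effect."""
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                       # Hoare-style partition
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        if k <= j:                          # k-th element is in the left part
            hi = j
        elif k >= i:                        # ... or the right part
            lo = i
        else:                               # ... or equals the pivot
            break
    return a[k]
```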

I assume this will always be out of the question though, since it would
throw away nearly all the effort expended in partly sorting the array.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Andy 'Krazy' Glew on
On 6/27/2010 8:25 AM, Terje Mathisen wrote:
> MitchAlsup wrote:
>> On Jun 26, 1:51 pm, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net>
>> wrote:
>>> https://semipublic.comp-arch.net/wiki/Picking_N-th_ready_element
>>>
>>> One common operation in schedulers is picking the 1st, or 1st and
>>> 2nd, or ... N-th ... ready elements to schedule.

>
> For large N and M, would it be appropriate to look at probabilistic
> methods?
>
> I.e. a single counter which knows how many ready (R) elements there are
> at a given time, when asking for ready(M) we start at (approximately)
> element M*N/R.
>
> What I'm asking is if anyone have modeled probabilistic pickers?
>
> BTW, it is well-known that you can pick the Nth percentile out of an
> unordered array in log(N) time, using a non-recursive half quick sort
> which always picks the relevant partition to drill further into.
>
> I assume this will always to out of the question though, since it would
> throw away nearly all the effort expended in partly sorting the array.


Terje, I don't really know of any work on probabilistic pickers in hardware. I encourage you to think/talk/work on them
some more.

A complication: your first suggestion, starting at element M*N/R, I think makes the implicit assumption that the
array is sorted. This is often not true in hardware schedulers. I did, in that first post, talk about running the array
ready bits through a permutation, which allows us to look at the ready bits in order even though the array itself is out
of order; but that really only works for small arrays.

The quicksort idea is interesting. I wonder if we get into the usual heapsort vs. quicksort argument, deterministic vs.
probabilistic, where deterministic corresponds to current synchronous clocking, and probabilistic corresponds to
asynchronous clocking? As I have discussed here, I have long suspected that asynchronous circuits are one road to the
future. Forgetting that, most present OOO schedulers must be pipelineable, delivering back-to-back instructions in
successive clocks. There has been little work that I am aware of on multicycle loop schedulers.

Of course I have been thinking about it. E.g. in my multicycle work. Some of the best aspects are figuring out how to
make use of a big slow outer scheduler. Key: make the big slow outer scheduler work in units of multiple
instructions, instruction groups of 8, 16, or more, where the latency through the execution group is related to, ideally
at least, the latency of the scheduler. Make the outer scheduler approximate: don't schedule based on all inputs, only
on one or two or a few: e.g. worst case a 16 instruction group might have 32 or even 48 inputs, but you might only
provide 2 or 3. Don't stop filling an instruction group when the number of inputs it can handle is exceeded: instead
just select a subset of the most important inputs, e.g. the last arriving. (Although there may be better choices
still.) Don't restrict instruction groups to dependency chains.

From: Andy 'Krazy' Glew on
On 6/27/2010 9:40 AM, Andy 'Krazy' Glew wrote:
> On 6/27/2010 8:25 AM, Terje Mathisen wrote:
>> MitchAlsup wrote:
>>> On Jun 26, 1:51 pm, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net>
>>> wrote:
>>>> https://semipublic.comp-arch.net/wiki/Picking_N-th_ready_element
>>>>
>>>> One common operation in schedulers is picking the 1st, or 1st and
>>>> 2nd, or ... N-th ... ready elements to schedule.
>
>> For large N and M, would it be appropriate to look at probabilistic
>> methods?
>>
>> I.e. a single counter which knows how many ready (R) elements there are
>> at a given time, when asking for ready(M) we start at (approximately)
>> element M*N/R.
>>
>> What I'm asking is if anyone have modeled probabilistic pickers?

> Terje, I don't really know of any work in probabilistic pickers in
> hardware. I encourage you to think/talk/work on them some more.

Tweaking my earlier answer:

I am not aware of anyone building probabilistic pickers for the main scheduling task in an OOO scheduler: picking N
ready instructions to execute out of a large array.

I am aware of people building probabilistic arbitrators: given M>K ready instructions, using probability to choose
which K to execute first.

I am quite a big advocate of using probabilistic methods between threads, since it is the best way I know of to provide
nice properties like fair-share scheduling (e.g. each thread getting 25% of the machine, or one getting 60% and the
other two getting 20% each).
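A lottery-style arbiter is one way such fair-share scheduling is often sketched (illustrative code, not any shipped design): each thread holds tickets in proportion to its share, and a random draw picks the thread each cycle.

```python
import random

def lottery_pick(shares):
    """Lottery arbiter: thread i holds shares[i] tickets; a uniform
    random draw over all tickets picks the thread for this cycle.
    Long-run issue rates converge to the share ratios (e.g. 60/20/20)."""
    total = sum(shares)
    draw = random.randrange(total)
    for tid, s in enumerate(shares):
        if draw < s:
            return tid
        draw -= s
```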

I am not aware of much work on probabilistic methods within a single thread.

Butler and Patt do evaluate a random scheduler in the paper whose reference is attached below. However, I do not
attach much credence to that paper, since it failed to reproduce the scheduler anomalies seen in the real world: the
eddies that we encountered on the P6 schedulers, and the tornadoes that plagued Willamette. They say "We have found for
the scheduling techniques modeled, performance is largely independent of the scheduling discipline." Since real world
experience has indicated the contrary, I suspect (a) they didn't model the right scheduler techniques, (b) their
modelling technology (simulator) was suspect, (c) they weren't looking at a good selection of workloads, and/or (d)
their analysis was faulty. Wrt the last, I recall on P6 that we were getting reasonable average performance across
workloads; we only noticed the scheduler anomalies when looking into smaller regions, and fixing them helped overall.

Guri Sohi is/was always insistent on his students modeling random as a policy to compare against. I'm not sure if any
serious work went into random as a scheduling policy in its own right.

---


Butler, M. and Patt, Y. 1992. An investigation of the performance of various dynamic scheduling techniques. In
Proceedings of the 25th Annual international Symposium on Microarchitecture (Portland, Oregon, United States, December
01 - 04, 1992). International Symposium on Microarchitecture. IEEE Computer Society Press, Los Alamitos, CA, 1-9. DOI=
http://doi.acm.org/10.1145/144953.144974