From: nedbrek on
Hello all,
Now we are getting into my bread and butter... :)

"Andy 'Krazy' Glew" <ag-news(a)patten-glew.net> wrote in message
news:4C264C37.8090008(a)patten-glew.net...
> https://semipublic.comp-arch.net/wiki/Picking_N-th_ready_element
>
> = Picking M =
>
> The following technique has been suggested to pick any of the top M
> elements:
> * for every entry K in the array, calculate Ready_Before_Me[K] = the number
>   of ready entries I<K such that Ready[I] is true.
> ** This can be calculated using a well-known [[parallel prefix circuit]].
>    It allows considerable sharing of logic. log(N) delay, N*log(N) hardware
> * at every entry in the array, perform the comparison
>   Ready_Before_Me[K] = 1,2,3 ...
> ** use this to select which of the M output ports an entry should be
>    connected to
> ** this is basically the well-known [[sum addressed memory decoder]]
>    circuit, which is carry free
> ** i.e. this whole computation - calculating Ready_Before_Me and selecting -
>    is [[carry propagate free]]
>
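For concreteness, here is how I read that scheme in software terms (a rough
model only; the names are mine, not from the wiki page). An exclusive prefix
sum counts the ready entries ahead of each entry - that is the parallel-prefix
part - and an entry routes to output port p when it is ready and its count
equals p:

  from itertools import accumulate

  def pick_m(ready, m):
      # Exclusive prefix sum of the ready bits: ready_before[k] = number of
      # ready entries with index < k.  In hardware this is the log(N)-depth
      # parallel prefix network.
      ready_before = [0] + list(accumulate(int(r) for r in ready))[:-1]
      ports = [None] * m
      for k, r in enumerate(ready):
          # The per-entry compare Ready_Before_Me[K] == p picks the port.
          if r and ready_before[k] < m:
              ports[ready_before[k]] = k
      return ports

  print(pick_m([0, 1, 1, 0, 1, 1], m=2))   # -> [1, 2]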

Has such a thing ever been published? I used to follow ISCA and MICRO (not
to mention HPCA, ISC, and ISPASS). No such thing was ever published.
Everyone would cite Palacharla (which had to be one of the most cited
papers, along with Tomasulo). At least until the matrix scheduler guys
(Goshima, Nishino, Kitamura, et al.).

Does it end up in a circuits conference?

It was Palacharla's paper that led everyone towards clustering (pick 2 is
hard)...

-----
The scheduler loop runs from the critical ready input (last input known
ready, from whatever mechanism) to asserting the entry ready. This then
propagates to the pick logic, and back as select. You need to de-assert the
ready line in time for the next cycle (to prevent double picking).

Pipelining that loop will cost 5-10% across the board (closer to 10% on a
well-balanced uarch). See "Loose loops sink chips".

You can work around this by pulsing the ready line (asserting it only once
every N cycles, where N equals the scheduler pick latency). I filed a patent on
this, but it was placed in def pub (defensive publication). It's not a good
solution for an L1 scheduler, but could be handy at the L2 level...
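A toy model of what I mean by pulsing (my own illustration, nothing out of the
def pub; the constants and names are made up): assume the pick loop takes
LATENCY cycles to de-assert a picked entry's ready line. If the entry only
raises ready on one phase out of every LATENCY cycles, the de-assert always
lands before the picker can see the entry again, so nothing gets double picked.

  LATENCY = 2  # assumed pick-loop latency in cycles

  class Entry:
      def __init__(self, name, ready_at):
          self.name = name
          self.ready_at = ready_at    # cycle at which the last input is known ready
          self.pick_cycle = None      # cycle at which the picker granted this entry

  def visible_ready(e, cycle, pulsed):
      if cycle < e.ready_at:
          return False
      # The de-assert only takes effect LATENCY cycles after the pick.
      if e.pick_cycle is not None and cycle >= e.pick_cycle + LATENCY:
          return False
      # The workaround: assert ready on only one phase out of every LATENCY cycles.
      if pulsed and cycle % LATENCY != 0:
          return False
      return True

  def run(pulsed, cycles=8):
      entries = [Entry("A", 0), Entry("B", 1)]
      picks = []
      for cycle in range(cycles):
          ready = [e for e in entries if visible_ready(e, cycle, pulsed)]
          if ready:
              e = ready[0]               # pick-1, lowest index first
              picks.append((cycle, e.name))
              if e.pick_cycle is None:
                  e.pick_cycle = cycle   # de-assert arrives LATENCY cycles later
      return picks

  print("free-running ready:", run(pulsed=False))  # entry A is double picked
  print("pulsed ready:      ", run(pulsed=True))   # each entry picked exactly once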

-----
I believe this algorithm can be expressed as (for N scheduler entries):
Given an NxN is_older_than matrix:

  ready[i] = inputs_ready[i] && (sum(j = 1..N; ready[j] && is_older_than[i, j]) == 0)

This requires the ready lines to be propagated to every entry.
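In software terms, the pick itself looks something like this (purely an
illustration; note I use inputs_ready on the right-hand side, which selects the
same single oldest ready entry without the combinational loop):

  def pick_oldest_ready(inputs_ready, is_older_than):
      # is_older_than[i][j] is True when entry j was allocated before entry i.
      n = len(inputs_ready)
      picked = []
      for i in range(n):
          blocked = any(inputs_ready[j] and is_older_than[i][j] for j in range(n))
          if inputs_ready[i] and not blocked:
              picked.append(i)
      return picked   # with a total age order, at most one entry survives

  # Example: entries 0 and 2 are ready; entry 2 is the oldest, so it wins.
  inputs_ready = [True, False, True, False]
  is_older_than = [
      [False, False, True,  False],   # entry 2 is older than entry 0
      [True,  False, True,  False],
      [False, False, False, False],   # entry 2 is the oldest
      [True,  True,  True,  False],
  ]
  print(pick_oldest_ready(inputs_ready, is_older_than))   # -> [2]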

-----
Building the is_older_than matrix (which takes care of ordering the array):
This might appear a daunting task, but it is really quite simple. At
dispatch, you must know which entries are free and which are occupied. When
allocating an entry, that occupancy vector (the complement of the free list) is
the value for the is_older_than row you are now writing!
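A rough software model of the allocate-time write (again just an illustration,
not RTL, assuming the row you write is the occupancy vector):

  def allocate(is_older_than, occupied, i):
      # Write row i of the age matrix when entry i is allocated: everything
      # already resident was dispatched earlier, hence is older.  Clear column
      # i so no existing entry thinks the newcomer is older than itself.
      n = len(occupied)
      for j in range(n):
          is_older_than[i][j] = occupied[j]
          is_older_than[j][i] = False
      occupied[i] = True

  # Example: entries 0 and 2 are already resident; allocate entry 1.
  n = 4
  occupied = [True, False, True, False]
  M = [[False] * n for _ in range(n)]
  allocate(M, occupied, 1)
  print(M[1])   # -> [True, False, True, False]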

Ned


From: jacko on
Taking dataflow...

consider the array to work as a fifo shift register, then this gives
in-order execution with stalling. The next step would be to consider what
order motion besides bump up to head would be needed to achieve no
duplications and minimum logic?

So

>head - bump up
=stall - wait
<tail (delay/restall/open slot) - delay
<<tail (slews gate to behind first staller delay) - also known as a
super stall

This allows a bump up to exchange with a delay, or a stall (as
priority overrides!). In some sense wait, delay, and super stall are
three priorities of wait delays.

Wait implies no progress, delay implies stall bunching, and super
stall implies a common drift down stream bus for pick up where the
stall stream starts.

The head would be most ready.

For multiple dispatch, multiple instruction stream buffers could be
filled, based on the assigned writeback group of registers from the
register pool. This would maximize data writebacks to registers, so
lessening stalls on data waits.

Cheers Jacko

??

From: Andy 'Krazy' Glew on
On 7/1/2010 6:54 AM, jacko wrote:
> Taking dataflow...
>
> consider the array to work as a fifo shift register, then this gives
> in-order execution with stalling. The next step would be to consider what
> order motion besides bump up to head would be needed to achieve no
> duplications and minimum logic?
>
> So
>
>> head - bump up
> =stall - wait
> <tail (delay/restall/open slot) - delay
> <<tail (slews gate to behind first staller delay) - also known as a
> super stall
>
> This allows a bump up to exchange with a delay, or a stall (as
> priority overrides!). In some sense wait, delay, and super stall are
> three priorities of wait delays.
>
> Wait implies no progress, delay implies stall bunching, and super
> stall implies a common drift down stream bus for pick up where the
> stall stream starts.
>
> The head would be most ready.
>
> For multiple dispatch, multiple instruction stream buffers could be
> filled, based on the assigned writeback group of registers from the
> register pool. This would maximize data writebacks to registers, so
> lessening stalls on data waits.
>
> Cheers Jacko

(1) I'm not sure I understand
(2) are you serious
(3) are you joking?

You seem to be proposing using a classic VLSI shift register as part of the scheduler. Are you? Exactly what sort of
shift register? It sounds like one that allows you to only remove the front element, and which allows all earlier
elements to be shifted up by one? Or one that allows multiple shifts? I.e. where
element[i] next cycle := SELECT( element[i] this cycle, element[i-1] )
or
element[i] next cycle := SELECT( element[i] this cycle, element[i-1], element[i-2] ... element[i-n] )

With the same control to all elements, or different control at each (the latter allowing holes to be swallowed up).

The "are you serious/joking?" part comes about because I have not encountered a classic VLSI shift register in more than
a decade. All of the shift registers I have recently encountered have been arrays, with a current index pointer
circling around.

Reason: activity factor. In a classic VLSI fifo of N entries, M bits wide, you may have N*M bits toggling. Lots of
power. We have already discussed how the self compacting dispatch queue of DEC Alpha and early AMD machines seems to
have been a mistake.
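To make the contrast concrete (a toy software model, not any particular
design): in the array version the data never moves; only the head and tail
indices circle around, so a push or pop touches one slot rather than rippling
every entry.

  class RingFifo:
      # Array-based FIFO: the entries stay put; only the head/tail indices
      # circle around, so each push or pop touches a single slot.
      def __init__(self, depth):
          self.slots = [None] * depth
          self.head = 0       # next entry to pop
          self.tail = 0       # next free slot
          self.count = 0

      def push(self, x):
          assert self.count < len(self.slots), "fifo full"
          self.slots[self.tail] = x
          self.tail = (self.tail + 1) % len(self.slots)
          self.count += 1

      def pop(self):
          assert self.count > 0, "fifo empty"
          x = self.slots[self.head]
          self.head = (self.head + 1) % len(self.slots)
          self.count -= 1
          return x

  q = RingFifo(4)
  for uop in ("add", "mul", "ld"):
      q.push(uop)
  print(q.pop(), q.pop())   # -> add mul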

Q: are there places and situations where the classic VLSI fifo is still used?

I conjecture that it might, might, almost make sense where the FIFO is narrow and shallow. Narrow: instead of holding
a full width instruction that is M bits wide, the fifo might hold only log(N) bits indexing a RAM array.

Shallow: may be answered below.

A fifo, classic or otherwise, doesn't solve the picker problem. A fifo only allows you to pick off the head; if you can
pick out of the middle, it isn't a fifo any more, it's what we are talking about. The at times widely discussed queue
or string or dependency chain based schedulers (e.g. Subbarao Palacharla's complexity-effective paper; I've done a lot of
work here, mostly unsuccessful, as has Dave Sager and others) somewhat handle this by having multiple queues, more than
the execution units: e.g. for 4 exec units, they might have 8 or 12 or 16 dependency chain heads to look at. Which
reduces the number from which a picker must pick by ... unfortunately, often not much more than 2x. Arithmetic
dependency chains are rarely long, while dependencies through memory are less tangible.

---

So much for my attempts to understand what you posted, jacko.

Care to explain what you mean by "a common drift down stream bus" and why you think your last paragraph lessens "stalls
on data waits"? Compared to what? I doubt compared to full OOO; maybe compared to some other queue based scheduler?

From: MitchAlsup on
On Jun 26, 8:35 pm, MitchAlsup <MitchAl...(a)aol.com> wrote:
>
> With proper access to true and complement signals from latching
> points, one can pick the first element of 4 in one gate delay, 9 in 2
> gate delays, and 24 in 3 gate delays. None of this violates 4-input
> fan-in rules, but does apply some heavy loads to some inverter fan-
> outs. I have the schematics if anyone is interested.

I had a reason to go back and review my schematics, and found that one
can get a 1:10 picker in 2 gates, and (up from 1:24) a 1:44 picker in
only 3 gates (with a good chance that this might be able to do 1:48 or
1:49 in 3 gates--but I have neither the time nor inclination to
pursue.)

While in there, I also revisited the fan-out issue mentioned above, and
by sizing some gates, I can achieve fan-out 4 for all of the above
circuits.

Thus, a 1:16 circular picker can resolve in just 4 gates.
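To see why the smallest case fits: with true and complement requests available
at the latch points, each grant of a find-first 1-of-4 is a single AND of at
most four literals, so none of it violates the 4-input fan-in rule. (This is a
simplified, fixed-priority sketch in software; the circular version takes a bit
more, but the fan-in argument is the same.)

  def pick_first_of_4(r0, r1, r2, r3):
      # Each grant below is one AND gate over at most four literals,
      # given true and complement forms of the request signals.
      g0 = r0
      g1 = (not r0) and r1
      g2 = (not r0) and (not r1) and r2
      g3 = (not r0) and (not r1) and (not r2) and r3
      return (g0, g1, g2, g3)

  print(pick_first_of_4(False, True, True, False))   # -> (False, True, False, False)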

Mitch

From: jacko on
Picking the head is the only pick allowed.

So it's either pick the head, flush the head down a bus to an insert point
freed by the freed head, or swap or skip an element toward the
head as things shift.