From: EricP on
Bart Swinnen wrote:
> EricP wrote:
>> I believe Mitch is referring to potential new hardware functionality
>> like AMD's Advanced Synchronization Facility proposal.
>> I can't seem to find any info on it on the AMD website as the proposal
>> seems to have degenerated into just a registered trademark notice.
>
> Hurray for Google?
> http://developer.amd.com/CPU/ASF/Pages/default.aspx

The problem with the current definition of ASF is that it has
_no_ guarantee of forward progress. Two contenders can wind up
ping-ponging back and forth in a livelock, and software must
intervene to resolve it.

I was suggesting in my other message that, in order to
guarantee that _one_ contender makes forward progress, one
need only establish a global order for aborts in the face of
contention. That does not provide fairness or eliminate
starvation, as an individual can still wind up being
continuously victimized. To guarantee fairness there must be
some sort of queue to fall back upon.
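
The classic queue of that sort is a ticket lock. A minimal C11
sketch of the idea (a software analogy only, not a hardware
proposal):

    #include <stdatomic.h>

    /* Ticket lock: the fetch-and-increment on next_ticket forms a
       FIFO queue, so contenders are served in arrival order and
       none can be victimized indefinitely. */
    typedef struct {
        atomic_uint next_ticket;   /* next ticket to hand out   */
        atomic_uint now_serving;   /* ticket allowed to proceed */
    } ticket_lock;

    static void tl_acquire(ticket_lock *l) {
        unsigned my = atomic_fetch_add(&l->next_ticket, 1u);
        while (atomic_load(&l->now_serving) != my)
            ;   /* spin until it is our turn */
    }

    static void tl_release(ticket_lock *l) {
        atomic_fetch_add(&l->now_serving, 1u);
    }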

Eric

From: EricP on
MitchAlsup wrote:
> On Dec 26, 2:39 pm, EricP <ThatWouldBeTell...(a)thevillage.com> wrote:
>> EricP wrote:
>
> You return the order as an integer response to
> a message that contains all of the participating addresses. This part
> of the process does not use any side-band signaling. After a CPU has
> been granted exclusive access to those cache lines, it is then
> enabled to NAK requests from other CPUs (or devices) so that it, the
> blessed CPU, makes forward progress while the unblessed are delayed.
>
>> So in this example the overhead cost would be 3 bus ops: a Read of
>> the global device, an Invalidate to indicate my order to the peers,
>> and an Invalidate at the end of the sequence.
>
> In my model, there is a message carrying up to eight 64-bit physical
> addresses to the Observer entity. If there are no current grants to
> any of the requested cache lines, the transaction as a whole is
> granted and a 64-bit response is returned to the sending CPU. Most
> would call this one fabric transaction, just like a Read-cache-Line
> is one fabric operation.
>
> The CPUs contend for the actual cache lines in the standard manner
> (with the exception of the NAK above).

In your method, the MU (Multiple Update) participants collect
their physical addresses together into a batch and send them
to an arbiter.
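
As I read it, the Observer amounts to something like the toy
model below (all of the names are mine, and it ignores device
capacity and failure cases):

    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_GRANTED 256          /* arbitrary model limit */

    static uint64_t granted[MAX_GRANTED];
    static int      n_granted;

    static bool is_granted(uint64_t pa) {
        for (int i = 0; i < n_granted; i++)
            if (granted[i] == pa) return true;
        return false;
    }

    /* Grant a batch of up to eight line addresses atomically:
       refuse the whole batch if any line is already granted.
       Returns a nonzero 64-bit order on success, 0 on refusal.
       One request/response pair == one fabric transaction. */
    static uint64_t observer_request(const uint64_t *pa, int n) {
        static uint64_t order = 1;
        if (n_granted + n > MAX_GRANTED) return 0;
        for (int i = 0; i < n; i++)
            if (is_granted(pa[i])) return 0;
        for (int i = 0; i < n; i++)
            granted[n_granted++] = pa[i];
        return order++;
    }

    /* Release a batch when the owning CPU finishes. */
    static void observer_release(const uint64_t *pa, int n) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n_granted; j++)
                if (granted[j] == pa[i]) {
                    granted[j] = granted[--n_granted];
                    break;
                }
    }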

I am exploring a slightly different approach, one which
dynamically self-organizes amongst the processors.
It is based on the idea that only the order number matters
for dispute resolution; the physical addresses are just
triggers. If there is a collision, the lower (earlier) order
number wins and the higher number aborts.

Once the order is established, the lower order will always
eventually win. The loser will retry using its original order
number, so it will eventually win too. The addresses don't
matter for dispute resolution.
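
The entire dispute rule then fits in a couple of lines of C;
mu_attempt below is a hypothetical stand-in for the hardware
sequence sketched next:

    #include <stdint.h>
    #include <stdbool.h>

    /* On a collision only the order numbers matter:
       lower (earlier) wins, higher aborts. */
    static bool i_win(uint64_t my_order, uint64_t peer_order) {
        return my_order < peer_order;
    }

    /* The loser keeps its ORIGINAL number across retries, so it
       only ever gets relatively older and must eventually win. */
    extern bool mu_attempt(uint64_t order);   /* hypothetical */

    static void mu_run(uint64_t my_order) {
        while (!mu_attempt(my_order))
            ;   /* aborted by an earlier peer: retry, same number */
    }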

It requires two bus message features: a broadcast of the
order number to all peers at the start of an MU attempt,
and the ability to NAK a ReadToOwn cache line request with
a special error code that triggers an Abort in the requester.

- A processor starting an MU attempt reads a 64-bit unsigned
integer from a global up-counter device.
It then broadcasts that value to all processors so
all agree on that processor's order.

The broadcast is necessary because there can be multiple buses
and therefore multiple pathways through the interconnect.
Only a central global counter can establish a single order.
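
Modeled in C11, with an atomic standing in for the up-counter
device and a shared table standing in for the broadcast (both
are modeling conveniences, not the proposed hardware):

    #include <stdatomic.h>
    #include <stdint.h>

    #define NCPU 64   /* assumed bus id space */

    /* Global up-counter device: one fetch-and-increment per MU
       attempt yields a unique, totally ordered ticket. */
    static _Atomic uint64_t up_counter = 1;

    /* The order each cpu last announced; 0 = not in an MU
       sequence. Stands in for the broadcast message. */
    static _Atomic uint64_t announced[NCPU];

    static uint64_t mu_get_order(int my_bus_id) {
        uint64_t order = atomic_fetch_add(&up_counter, 1);
        atomic_store(&announced[my_bus_id], order);  /* broadcast */
        return order;
    }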

- Each processor has a flag bit vector indexed by bus id #.
On receiving a broadcast, each peer compares the broadcast
order number to its own and sets the flag for the
broadcaster's bus id.

If a peer is not in an MU sequence then the flag = 0.
If a peer is in an MU sequence then it has its own order
number. We compare the broadcast number to ours, and the
flag = 1 if the broadcast is higher, otherwise = 0.
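
Per cpu, the broadcast handler would look roughly like this
(flag = 1 meaning "I beat that peer"):

    #include <stdint.h>
    #include <stdbool.h>

    /* Per-cpu state in this model. */
    static bool     in_mu;       /* inside an MU sequence?       */
    static uint64_t my_order;    /* our ticket, valid when in_mu */
    static uint64_t win_flags;   /* bit per bus id: 1 = NAK peer */

    /* Called when the order broadcast from bus `id` arrives. */
    static void on_order_broadcast(int id, uint64_t bcast) {
        bool win = in_mu && bcast > my_order;  /* not in MU => 0 */
        if (win)
            win_flags |=  (UINT64_C(1) << id);
        else
            win_flags &= ~(UINT64_C(1) << id);
    }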

- Each cpu now has a bit vector, indexed by bus id #, that
tells it whether to respond to an individual ReadToOwn by
sending the line and aborting itself, or by sending a NAK,
which triggers an abort in the peer.
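
So the snoop response for a conflicting ReadToOwn reduces to a
single bit test (continuing the previous sketch):

    #include <stdint.h>

    static uint64_t win_flags;   /* kept by the broadcast handler */

    typedef enum { SEND_LINE_AND_ABORT_SELF, SEND_NAK } rto_reply;

    static rto_reply on_conflicting_rto(int id) {
        if (win_flags & (UINT64_C(1) << id))
            return SEND_NAK;              /* peer later: it aborts */
        return SEND_LINE_AND_ABORT_SELF;  /* peer earlier: we do   */
    }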

- Each MU cpu executes a mu_load instruction.
If successful it records the physical address in a special
MU collision detect register.

If any peer later executes a mu_load to that same address
then there is a collision and we consult the bit vector
to decide how to respond.
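
The collision-detect register can be modeled as a small address
array; the capacity of 8 below is my assumption, matching the
eight-address batch mentioned earlier:

    #include <stdint.h>
    #include <stdbool.h>

    #define MU_MAX_LINES 8   /* assumed register capacity */

    static uint64_t mu_lines[MU_MAX_LINES];
    static int      mu_nlines;

    /* mu_load records the physical address of each line touched. */
    static void record_mu_load(uint64_t pa) {
        mu_lines[mu_nlines++] = pa;
    }

    /* A peer's mu_load, seen as a snooped ReadToOwn, collides iff
       it hits a recorded line; only then is win_flags consulted. */
    static bool collides(uint64_t pa) {
        for (int i = 0; i < mu_nlines; i++)
            if (mu_lines[i] == pa)
                return true;
        return false;
    }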

- As an optimization, when a cpu wins a contention and sends
a NAK, it also remembers the bus id of the peer it NAK'ed.
When it finishes its sequence, or if it aborts, it sends a
TryAgain message to the loser, which spins on a special
mu_wait instruction.
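
The bookkeeping for that optimization is again one bit per bus
id; send_try_again below is a hypothetical bus message:

    #include <stdint.h>

    #define NCPU 64

    static uint64_t nak_list;   /* bit per bus id we have NAK'ed */

    extern void send_try_again(int id);   /* hypothetical message */

    static void remember_nak(int id) {
        nak_list |= UINT64_C(1) << id;
    }

    /* On commit or abort, wake every loser so it leaves mu_wait
       and retries immediately instead of timing out. */
    static void wake_losers(void) {
        for (int id = 0; id < NCPU; id++)
            if (nak_list & (UINT64_C(1) << id))
                send_try_again(id);
        nak_list = 0;
    }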

- The MU instruction sequence is then designed to support
the above handshake.
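
Put together, a use of the sequence might read like this; every
intrinsic below is hypothetical, with mu_commit returning false
if the attempt aborted:

    #include <stdint.h>
    #include <stdbool.h>

    extern void     mu_begin(void);   /* read counter, broadcast  */
    extern uint64_t mu_load(volatile uint64_t *pa);  /* + record  */
    extern void     mu_store(volatile uint64_t *pa, uint64_t v);
    extern bool     mu_commit(void);  /* false if aborted         */
    extern void     mu_wait(void);    /* spin until TryAgain      */

    /* Atomically move `v` from *src to *dst. */
    static void mu_transfer(volatile uint64_t *src,
                            volatile uint64_t *dst, uint64_t v) {
        mu_begin();   /* order number is kept across retries */
        for (;;) {
            uint64_t s = mu_load(src);
            uint64_t d = mu_load(dst);
            mu_store(src, s - v);
            mu_store(dst, d + v);
            if (mu_commit())
                return;
            mu_wait();   /* lost to an earlier peer: wait, retry */
        }
    }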

Just thinking out loud...
Eric