From: MitchAlsup on
On Dec 26, 2:39 pm, EricP <ThatWouldBeTell...(a)thevillage.com> wrote:
> EricP wrote:
>
> > Having the ability to perform a LoadLocked/StoreConditional on
> > up to 4 separate memory locations would eliminate much of the
> > need to escalate to the heavyweight OS synchronization ops.
>
> This appears to require a method of establishing a global
> first-come-first-served ordering for the cpus that is
> independent of the physical memory locations involved.

Correct.

> In the unlikely event of cache line ownership contention then
> the first cpu to begin its multiple update sequence wins
> and the other rolls back.

But more importantly, the entity that determines who wins also
establishes an order over the current participants, avoiding contention
on the subsequent accesses. Thus, one can achieve something on the order
of O(log n) instead of O(n**2) memory references in the worst case. You
cannot get to this point unless the synchronization 'event' returns an
integer order number instead of a simple win/retry.

>
> The trick is for it be a low cost mechanism (ideally the cost of
> a single cache miss to establish the order) that works within
> the existing cpu hardware, bus and coherency protocol.

In practice it requires a two-way transfer through the fabric, but
does not require a DRAM access delay, so the latency is better than a
DRAM access. The entity looks and smells remarkably like a TLB and can
process a stream of requests as fast as the fabric can deliver them
(i.e. no back pressure--at least none required).
And the TLB does not have to be "that big" either.

> For that I'm thinking that maybe a global device located
> at some special physical memory location would establish
> the global order at the start of a multiple update sequence.

Yep, programmed up by a standard header making it look like a device
sitting anywhere in fabric-addressable space.

> Then using Invalidate bus ops to a set of other special
> physical memory locations could communicate that ordering
> to other cpus and they can associate that with the bus id.

Nope, dead wrong here. You return the order as an integer response to
a message that contains all of the participating addresses. This part
of the process does not use any side-band signaling. After a CPU has
been granted exclusive access to those cache lines, it is then enabled
to NAK requests from other CPUs (or devices), so that the blessed CPU
makes forward progress while the unblessed are delayed.

> So in this example the overhead cost would be 3 bus ops to
> Read the global device, an Invalidate indicate my order to
> the peers, and an Invalidate at the end of the sequence.

In my model, there is a message carrying up to eight 64-bit physical
addresses to the Observer entity. If there are no current grants on any
of the requested cache lines, the transaction as a whole is granted and
a 64-bit order number is returned as the response to the sending CPU.
Most would call this one fabric transaction, just like a Read-Cache-Line
is one fabric operation.
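The Observer described above can be modeled as a table of currently
granted lines plus a monotonically increasing order counter: a request
of up to eight line addresses is granted as a whole only if none of
them overlap an outstanding grant. A toy single-threaded sketch (the
function names, table size, and -1 denial code are my inventions; the
real entity would be a hardware unit in the fabric):

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_GRANTED 64   /* toy capacity for outstanding granted lines */

/* Observer state: the set of cache-line addresses currently granted,
 * and the next order number to hand out. */
static uint64_t granted[MAX_GRANTED];
static size_t   n_granted;
static int64_t  next_order = 1;

/* One "fabric transaction": up to eight physical line addresses in,
 * a 64-bit order number out, or -1 if any line is already granted
 * (the whole set is granted or denied atomically). */
int64_t observer_request(const uint64_t *addrs, size_t n)
{
    if (n == 0 || n > 8 || n_granted + n > MAX_GRANTED)
        return -1;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n_granted; j++)
            if (granted[j] == addrs[i])
                return -1;               /* conflict: deny whole set */
    for (size_t i = 0; i < n; i++)       /* no conflict: grant all */
        granted[n_granted++] = addrs[i];
    return next_order++;
}

/* Release the lines held under a grant (swap-remove each match). */
void observer_release(const uint64_t *addrs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n_granted; j++)
            if (granted[j] == addrs[i])
                granted[j] = granted[--n_granted];
}
```

A denied CPU simply retries; a real Observer would also need to age out
grants from CPUs that fail to complete, which this sketch omits.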

The CPUs contend for the actual cache lines in the standard manner
(with the exception of the NAK above).

Mitch
From: EricP on
EricP wrote:
>
> It requires 2 bus message features: a broadcast of the order
> number to all peers at the start of an MU attempt,
> and the ability to NAK a ReadToOwn cache line request with
> a special error code that triggers an Abort in the requester.
>
> <snip>
>
> - Each cpu now has a bit vector, indexed by bus id #,
> that tells that processor whether it should respond to
> an individual ReadToOwn by sending a line and aborting myself,
> or sending a NAK which will trigger an abort in the peer.
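The quoted per-cpu decision is, in effect, a pure function of the two
order numbers; the bit vector indexed by bus id is just this comparison
precomputed for each peer. A sketch with names of my own choosing
(assuming lower order number means earlier in the global FCFS order):

```c
#include <stdint.h>

/* Possible responses to an incoming ReadToOwn during an MU sequence. */
enum rto_reply { REPLY_LINE_AND_ABORT, REPLY_NAK };

/* If the requester is earlier in the global order than we are, hand
 * over the line and abort our own multiple-update sequence; otherwise
 * NAK so the requester aborts instead. */
enum rto_reply decide_rto(uint64_t my_order, uint64_t requester_order)
{
    return (requester_order < my_order) ? REPLY_LINE_AND_ABORT
                                        : REPLY_NAK;
}
```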

This could also be done without a NAK, though it is
not very elegant: it could do a grab-back.

If a line owner receives a ReadToOwn it consults the bus id bit vector.
If the requester is lower order, this cpu replies as normal and
aborts its own MU sequence.
If the requester is higher order, this cpu replies with the value
but immediately requests it back. That will trigger the same logic
sequence in the requester (because we all agree on the order numbers),
who will reply and abort its MU sequence.
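Because every cpu applies the same order comparison, the grab-back
converges after one extra hand-over: the holder passes the line to the
later-ordered requester, immediately re-requests it, and the requester
(running the identical rule against an earlier-ordered peer) replies
and aborts. A toy model of the exchange (struct and field names are
mine):

```c
#include <stdbool.h>
#include <stdint.h>

/* A cpu holding a line inside a multiple-update (MU) sequence. */
typedef struct {
    uint64_t order;     /* this cpu's global order number */
    bool     has_line;  /* currently owns the contended line */
    bool     aborted;   /* MU sequence was rolled back */
} cpu;

/* Deliver a ReadToOwn from `req` to `owner` under the grab-back rule:
 * the owner always replies with the line, but if the requester is
 * later in the order, the owner immediately requests it back, which
 * makes the requester (applying the same rule) reply and abort. */
void read_to_own(cpu *owner, cpu *req)
{
    owner->has_line = false;      /* always reply with the line */
    req->has_line   = true;
    if (req->order < owner->order)
        owner->aborted = true;    /* requester is earlier: we abort */
    else
        read_to_own(req, owner);  /* grab it back; requester aborts */
}
```

The net effect matches the NAK scheme (earlier order wins, later order
aborts), at the cost of one extra line transfer.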

Eric