From: EricP on
EricP wrote:
>
> Having the ability to perform a LoadLocked/StoreConditional on
> up to 4 separate memory locations would eliminate much of the
> need to escalate to the heavyweight OS synchronization ops.

This appears to require a method of establishing a global
first-come-first-served ordering for the cpus that is
independent of the physical memory locations involved.
In the unlikely event of cache line ownership contention,
the first cpu to begin its multiple update sequence wins
and the other rolls back.

The trick is for it to be a low-cost mechanism (ideally the cost of
a single cache miss to establish the order) that works within
the existing cpu hardware, bus and coherency protocol.

For that I'm thinking that maybe a global device located
at some special physical memory location would establish
the global order at the start of a multiple update sequence.
Then Invalidate bus ops to a set of other special
physical memory locations could communicate that ordering
to the other cpus, which would associate it with the sender's bus id.

So in this example the overhead cost would be 3 bus ops: a Read of
the global device, an Invalidate to indicate my order to the peers,
and an Invalidate at the end of the sequence.
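
A rough sketch in C of how that sequence might look from software's
point of view (TICKET_DEVICE and ORDER_WINDOW are hypothetical MMIO
addresses invented purely for illustration; the real mechanism would
live in the cache and bus hardware):

    #include <stdint.h>

    /* Hypothetical MMIO locations, invented for illustration only. */
    #define TICKET_DEVICE ((volatile uint64_t *)0xF0000000) /* global ticket source */
    #define ORDER_WINDOW  ((volatile uint64_t *)0xF0001000) /* per-cpu order slots */

    uint64_t begin_multi_update(int my_bus_id)
    {
        /* Bus op 1: the read miss to the global device establishes
           the first-come-first-served order. */
        uint64_t ticket = *TICKET_DEVICE;

        /* Bus op 2: the write's Invalidate broadcasts my ticket;
           peers snoop the special address and associate the ticket
           with my bus id. */
        ORDER_WINDOW[my_bus_id] = ticket;

        /* ... perform the LoadLocked/StoreConditional sequence on up
           to 4 lines; on contention the cpu holding the older ticket
           wins and the other rolls back ... */
        return ticket;
    }

    void end_multi_update(int my_bus_id)
    {
        /* Bus op 3: a final Invalidate tells peers the sequence is over. */
        ORDER_WINDOW[my_bus_id] = 0;
    }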

Eric



From: Bart Swinnen on
EricP wrote:
> I believe Mitch is referring to potential new hardware functionality
> like AMD's Advanced Synchronization Facility proposal.
> I can't seem to find any info on it on the AMD website as the proposal
> seems to have degenerated into just a registered trademark notice.

Hurray for Google?
http://developer.amd.com/CPU/ASF/Pages/default.aspx
From: "Andy "Krazy" Glew" on
nmm1(a)cam.ac.uk wrote:
> In article <hIGdnRxV8ZyP1qvWnZ2dnUVZ_uSdnZ2d(a)bestweb.net>,
> Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>> BTW: my experience is with systems where we're synchronizing on less
>> than 100 cycle granularity - at that granularity, you're basically
>> programming against bare metal, with fixed thread mappings, all-or-none
>> thread scheduling and no "system software" to speak of.
>
> That's largely because there are no adequate facilities for doing it
> any other way :-(

I'm not so sure. I tend to agree with something that is implicit in what
Mayan is saying.

I have seen two main classes of system:

(1) Systems that allow thread and process migration from CPU to CPU.

(2) Systems that don't.

When you allow migration, if you are cache coherent you need do nothing
much. Maybe a fence, but no cache flushes.

However, if you use caches but are not cache coherent, then on a process
or thread migration you need to flush all of the cache footprint of the
thread being migrated. Not necessarily all the way to memory, but at
least to the shared cache level.

E.g. in a multicore chip, typically the L1 and L2 caches are private,
but the LLC is shared. So if you migrate between processors on the same
chip, you need to flush to the LLC. If you migrate between chips,
you need to flush further out, typically all the way to memory.

With AMD Bulldozer's MCMT, private L1s but shared L2s, sometimes you
would need to flush to L2, sometimes to L3, and sometimes all the way to
memory.

Unfortunately, x86 doesn't have all the varieties of cache flush you
would need to do this.
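
If such an operation did exist, the migration path might look
something like this (flush_to_level() is entirely hypothetical;
x86's CLFLUSH always writes back to memory and takes no level
argument):

    #include <stddef.h>

    /* Hypothetical: write a dirty line back only as far as the given
       cache level (1 = L1, 2 = L2, ..., 0 = all the way to memory).
       No such instruction is architecturally exposed on x86. */
    extern void flush_to_level(const void *line, int level);

    /* On migration, push the departing thread's footprint out to the
       first cache level shared by the old and new cpu. */
    void migrate_flush(const void *footprint, size_t bytes, int shared_level)
    {
        const char *p = footprint;
        for (size_t off = 0; off < bytes; off += 64) /* 64-byte lines assumed */
            flush_to_level(p + off, shared_level);
    }

For the Bulldozer case above, shared_level would be 2 for a migration
within a module, 3 across modules on the same chip, and 0 (memory)
across chips.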

Unfortunately, it's chicken and egg: the vast majority of shared memory
cache coherent systems don't need this. So it isn't even provided, let
alone made fast.

Because of this, there seems to be a clustering of usage models:

(1) General purpose cache coherent systems that allow thread migration

(2) Special purpose systems that are not necessarily cache coherent,
which disallow, or at least highly deprecate, thread migration.

These latter I call embedded systems. Even if they are in big
supercomputers.

Unfortunately, high degrees of parallelism require thread migration for
load balancing. But thread migration in the absence of cache coherency
is really slow, unless somebody creates the appropriate cache flush
operations. Which are easy to do, but hard to justify in today's
incremental climate.
From: EricP on
EricP wrote:
>
> I was suggesting in my other message that it looks to me that
> in order to guarantee that _one_ contender makes forward progress,
> one need only establish a global order for aborts in the face of
> contention. That does not ensure fairness or eliminate livelock,
> as an individual can still wind up being continuously victimized.
> To guarantee fairness there must be some sort of queue to
> fall back upon.

Actually I was being a bit dumb here.
I was thinking that each locker would acquire an 'order ticket'
and if there is a collision, the one with the higher (younger)
ticket would abort, acquire a new ticket, and retry.
And that does lead to potential retry starvation.

However if the loser just aborts but retains its ticket
for the next try then that is a fair queue.

And since the winner knows who it collided with, it can send
a message to the loser letting it know when it is ok to retry.
So the loser can watch for a 'Done' message or have a timer
(just in case) and retry the multiple update sequence.
That doesn't guarantee success on the next try, but does
minimize latencies due to timer-delay retry loops.
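
As a sketch, with the ticket source, the multi-update attempt, and
the Done-message wait all treated as assumed primitives:

    #include <stdbool.h>
    #include <stdint.h>

    extern uint64_t take_ticket(void);             /* assumed global order source */
    extern bool try_multi_update(uint64_t ticket); /* assumed: false = lost a collision */
    extern void wait_done_or_timeout(void);        /* assumed: winner's Done msg, or timer */

    void multi_update(void)
    {
        uint64_t ticket = take_ticket(); /* taken once, retained across retries */

        while (!try_multi_update(ticket)) {
            /* Lost to an older ticket. Because we keep our ticket, we
               only age relative to newcomers, so eventually ours is
               the oldest in any collision and we must win: a fair
               queue. */
            wait_done_or_timeout();
        }
    }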

Eric

From: Mayan Moudgill on
EricP wrote:

> Mayan Moudgill wrote:

>> More heavyweight synchronization operations (such as a lock with
>> suspend on the lock if already locked) *can* be more expensive - but
>> the cost is due to all the additional function in the operation. Its
>> not clear that tweaking the underlying hardware primitives is going to
>> do much for this.
>
>
> I believe Mitch is referring to potential new hardware functionality
> like AMD's Advanced Synchronization Facility proposal.
> I can't seem to find any info on it on the AMD website as the proposal
> seems to have degenerated into just a registered trademark notice.
>
> Having the ability to perform a LoadLocked/StoreConditional on
> up to 4 separate memory locations would eliminate much of the
> need to escalate to the heavyweight OS synchronization ops.
>
> Eric

Let me ask the following question: assume that we have an N-process
barrier, how do you avoid using N system calls (use any hardware
synchronization you want)?

The fundamental problem is that, in the general case, the first N-1
processes that arrive at the barrier will need to (have the kernel) put
themselves on a waiting queue, and the last process to arrive will need
to (have the kernel) put the N-1 waiting processes back on the ready queue.
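
For concreteness, here is a minimal Linux futex sketch of exactly
that pattern (single-use barrier in shared memory; no error
handling):

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <limits.h>
    #include <stdatomic.h>

    /* remaining initialized to N, released to 0, e.g. in a
       MAP_SHARED mapping visible to all N processes. */
    struct barrier { atomic_int remaining; atomic_int released; };

    static long futex(atomic_int *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    void barrier_wait(struct barrier *b)
    {
        if (atomic_fetch_sub(&b->remaining, 1) == 1) {
            /* Last arrival: one syscall to put the N-1 waiters back
               on the ready queue. */
            atomic_store(&b->released, 1);
            futex(&b->released, FUTEX_WAKE, INT_MAX);
        } else {
            /* Each of the first N-1 arrivals: one syscall to sleep
               in the kernel. */
            while (atomic_load(&b->released) == 0)
                futex(&b->released, FUTEX_WAIT, 0);
        }
    }

Fancier hardware primitives can replace the counter update, but the
N-1 sleeps and the final wake still cost a kernel entry each.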

I doubt that software transactional memory solves this problem.