From: Mayan Moudgill on
nmm1(a)cam.ac.uk wrote:

> Yes. How many have been published in places you can find them, or
> even written up suitable for publication, I don't know. I know that
> mine weren't.

Pity

> Note that the situation involves more than just the synchronisation
> operations, because a lot of it is about scheduling. If you are
> trying to parallelise code with a 10 microsecond grain, having to do
> ANY interaction with the system scheduler runs the risk of a major
> problem. That is one of the main reasons that almost all HPC codes
> rely on gang scheduling, with all threads running all the time.
>

Agreed.

BTW: my experience is with systems where we're synchronizing at less
than 100-cycle granularity - at that granularity, you're basically
programming against bare metal, with fixed thread mappings, all-or-none
thread scheduling and no "system software" to speak of.
From: nmm1 on
In article <hIGdnRxV8ZyP1qvWnZ2dnUVZ_uSdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>More heavyweight synchronization operations (such as a lock with suspend
> on the lock if already locked) *can* be more expensive - but the cost
>is due to all the additional function in the operation. It's not clear
>that tweaking the underlying hardware primitives is going to do much for
>this.

It's not clear, I agree, but one problem with existing ones is that
they are usually privileged, which forces a system call. That isn't
what you want, for many reasons.

>BTW: my experience is with systems where we're synchronizing at less
>than 100-cycle granularity - at that granularity, you're basically
>programming against bare metal, with fixed thread mappings, all-or-none
>thread scheduling and no "system software" to speak of.

That's largely because there are no adequate facilities for doing it
any other way :-(


Regards,
Nick Maclaren.
From: Mayan Moudgill on
nmm1(a)cam.ac.uk wrote:

> In article <hIGdnRxV8ZyP1qvWnZ2dnUVZ_uSdnZ2d(a)bestweb.net>,
> Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>>More heavyweight synchronization operations (such as a lock with suspend
>> on the lock if already locked) *can* be more expensive - but the cost
>is due to all the additional function in the operation. It's not clear
>>that tweaking the underlying hardware primitives is going to do much for
>>this.
>
>
> It's not clear, I agree, but one problem with existing ones is that
> they are usually privileged, which forces a system call. That isn't
> what you want, for many reasons.
>

Again, that supports my original point - the performance of
synchronization has nothing to do with improving synchronization
primitives, but with everything else in the system.

The reason you need that system call, I assume, is to suspend a thread
on a contended lock or to resume suspended threads. You could always use
spin-locks and avoid that system call - but then you get into the issue
of utilization.

From: nmm1 on
In article <XIidncgyP_VLyKvWnZ2dnUVZ_hKdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>>
>> It's not clear, I agree, but one problem with existing ones is that
>> they are usually privileged, which forces a system call. That isn't
>> what you want, for many reasons.
>
>Again, that supports my original point - the performance of
>synchronization has nothing to do with improving synchronization
>primitives, but with everything else in the system.

"Nothing to do with" is too strong - part of the reason that the
rest of a system gets it wrong is that the hardware primitives do.
Only a part, I agree.

>The reason you need that system call, I assume, is to suspend a thread
>on a contended lock or to resume suspended threads. You could always use
>spin-locks and avoid that system call - but then you get into the issue
>of utilization.

It's worse than that :-(

Let's say that thread A wants to suspend itself in favour of thread B,
until the latter next suspends itself. If thread A uses a spin-loop
for its wait, thread B may never get to run, so thread A will wait for
ever ....

There are lots of important threading paradigms, known to be useful,
that are close to infeasible to use on modern systems.


Regards,
Nick Maclaren.
From: EricP on
Mayan Moudgill wrote:
>
> So core 1 writes some data, core 1&2 synchronize, and core 2 reads the
> data. What actually happens post-synchronization?
>
> Well, cache lines get copied from dcache-CPU-1 to dcache-CPU-2. This
> takes time. This time will be proportional to the shared data. The cost
> can actually be higher than in the case of an explicit message passing
> system.
>
> The synchronization, by contrast, can involve the transfer of exactly one
> cache-line [e.g. if you're doing an atomic-increment].
>
> More heavyweight synchronization operations (such as a lock with suspend
> on the lock if already locked) *can* be more expensive - but the cost
> is due to all the additional function in the operation. It's not clear
> that tweaking the underlying hardware primitives is going to do much for
> this.

I believe Mitch is referring to potential new hardware functionality
like AMD's Advanced Synchronization Facility proposal.
I can't seem to find any info on it on the AMD website as the proposal
seems to have degenerated into just a registered trademark notice.

Having the ability to perform a LoadLocked/StoreConditional on
up to 4 separate memory locations would eliminate much of the
need to escalate to the heavyweight OS synchronization ops.

Eric