From: Terje Mathisen on
Mayan Moudgill wrote:
> Let me ask the following question: assume that we have a N-process
> barrier, how do you avoid using N system calls (use any hardware
> synchronization you want)?

OK, so you want all N threads/cores/cpus to wait here until the last one
arrives?
>
> The fundamental problem becomes that, in the general case, the first N-1
> processes that arrive at the barrier will need to (have the kernel) put
> themselves on a waiting queue, and the last process to arrive will need
> to (have the kernel) put the N-1 waiting processes back on the ready queue.
>
> I doubt that software transactional memory solves this problem.

First of all, I assume dedicated cores for each thread, i.e. gang
scheduling, otherwise the OS is involved in thread switches anyway, right?

It seems to me that you could use a single LOCK XADD(counter,-1) from
every thread, followed by a spin loop that contains a MONITOR
instruction on 'counter' and a counter re-read to determine if it has
made it all the way down to zero by now.

This would seem to give forward progress on each bus transaction: the
other cores involved in a contention will do a hardware retry, leading
to at most N updates in total, as long as the bus protocol is such that
when multiple cores try to acquire the same cache line for update, one
of them will win, decrement the counter, and flush the line.
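
A rough, untested sketch of that flat counter, using C11 atomics to
stand in for the LOCK XADD (N, the counter initialization and the
barrier_wait() wrapper are made up here for illustration, and the
MONITOR part is left as a comment since it has no portable user-level
spelling):

#include <stdatomic.h>

#define N 8                       /* number of participating threads (assumed) */

atomic_int counter = N;           /* shared arrival counter */

void barrier_wait (void)
{
    /* the single LOCK XADD(counter,-1) from every thread */
    atomic_fetch_sub (&counter, 1);

    /* spin, re-reading the counter until the last arrival drives it to
       zero; this is where a MONITOR/MWAIT pair would ideally let the
       core sleep instead of burning bus cycles */
    while (atomic_load (&counter) > 0)
        ;
}

As written it is a one-shot barrier; the counter would have to be
re-armed to N before the next use.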

Alternatively, if N is _very_ large, you could use a multi-stage
counter, where sqrt(N) cores share each of the sqrt(N) first-level
counters, the "winner" of each of those updates the top-level counter,
and the winners then report back to the other first-level cores.
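
A correspondingly rough sketch of that two-level version (untested;
GROUPS and GROUP_SIZE stand in for sqrt(N), the "report back" step is
done through a single release flag, and all names are made up for
illustration):

#include <stdatomic.h>

#define GROUPS      8             /* ~sqrt(N) first-level counters (assumed) */
#define GROUP_SIZE  8             /* ~sqrt(N) threads sharing each of them */

struct group { _Alignas(64) atomic_int count; };  /* one counter per cache line */

static struct group level1[GROUPS];
static atomic_int   level2;       /* top-level counter */
static atomic_int   released;     /* the "report back" flag */

void tree_barrier_init (void)
{
    for (int i = 0; i < GROUPS; i++)
        atomic_store (&level1[i].count, GROUP_SIZE);
    atomic_store (&level2, GROUPS);
    atomic_store (&released, 0);
}

void tree_barrier_wait (int group_id)
{
    /* the last thread to decrement its group counter "wins" and goes up a level */
    if (atomic_fetch_sub (&level1[group_id].count, 1) == 1) {
        if (atomic_fetch_sub (&level2, 1) == 1)
            atomic_store (&released, 1);  /* the last winner releases everyone */
    }
    while (atomic_load (&released) == 0)
        ;                                 /* spin until the winners report back */
}

Only the GROUP_SIZE threads sharing a given first-level counter contend
on its cache line, and only the GROUPS winners ever touch the top-level
counter.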

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Mayan Moudgill on
Terje Mathisen wrote:
> Mayan Moudgill wrote:
>
>> Let me ask the following question: assume that we have a N-process
>> barrier, how do you avoid using N system calls (use any hardware
>> synchronization you want)?
>
>
> OK, so you want all N threads/cores/cpus to wait here until the last one
> arrives?
>

>
> First of all, I assume dedicated cores for each thread, i.e. gang
> scheduling, otherwise the OS is involved in thread switches anyway,
> right?
>
> It seems to me that you could use a single LOCK XADD(counter,-1) from
> every thread, followed by a spin loop that contains a MONITOR
> instruction on 'counter' and a counter re-read to determine if it has
> made it all the way down to zero by now.
>


This solution is (as stated) not general; in the restricted case being
considered, it is .... ummm ... baroque? Cleaner solutions are
definitely possible.

One particular red flag is the use of the MONITOR instruction. It's only
available at privilege level 0. How did you get there? With a system call?
From: EricP on
Mayan Moudgill wrote:
> Terje Mathisen wrote:
> > Mayan Moudgill wrote:
> >
> >> Let me ask the following question: assume that we have a N-process
> >> barrier, how do you avoid using N system calls (use any hardware
> >> synchronization you want)?
> >
> >
> > OK, so you want all N threads/cores/cpus to wait here until the last one
> > arrives?
> >
>
> >
> > First of all, I assume dedicated cores for each thread, i.e. gang
> > scheduling, otherwise the OS is involved in thread switches anyway,
> > right?
> >
> > It seems to me that you could use a single LOCK XADD(counter,-1) from
> > every thread, followed by a spin loop that contains a MONITOR
> > instruction on 'counter' and a counter re-read to determine if it has
> > made it all the way down to zero by now.
> >
>
>
> This solution is (as stated) not general; in the restricted case being
> considered, it is .... ummm ... baroque? Cleaner solutions are
> definitely possible.
>
> One particular red-flag is the use of the MONITOR instruction. Its only
> available at privilege level 0. How did you get there? With a system call?

This barrier question is quite different from the
atomic multiple update issue.

Monitor is available in user mode, and Mwait can be
if the OS enables it.

The general solution for a barrier spinwait requires 2 counters,
a WaitCount and a RunCount, ideally on separate cache lines.
Each thread increments WaitCount and, if the result is less than N,
spin-waits while RunCount == 0.
The thread whose increment takes WaitCount to N resets it to zero,
sets RunCount = N-1 to release the peers, and proceeds.
Each peer decrements RunCount and proceeds.

It may not be a great solution, because when the pack is
released they all immediately try to decrement RunCount.
Depending on the system that might cause a lot of contention,
but it can't be avoided, as the last thread leaving the barrier
must shut the door behind it.

#include <stdatomic.h>

#define THREADCOUNT 8        /* N, the number of participating threads */

atomic_int gWaitCount;       /* arrivals at the barrier so far */
atomic_int gRunCount;        /* set nonzero by the last arrival to release the pack */

void barrier_wait (void)
{
    if (atomic_fetch_add (&gWaitCount, 1) == THREADCOUNT-1)
    {   // Last arrival: reset the barrier and release the peers
        atomic_store (&gWaitCount, 0);
        atomic_store (&gRunCount, THREADCOUNT-1);
    }
    else
    {   // Wait for quorum
        while (atomic_load (&gRunCount) == 0)
        {   /* spin; CpuPause(), i.e. a PAUSE hint, would go here */ }
        atomic_fetch_sub (&gRunCount, 1);
    }
}

Eric



From: Terje Mathisen on
Mayan Moudgill wrote:
> Terje Mathisen wrote:
> > It seems to me that you could use a single LOCK XADD(counter,-1) from
> > every thread, followed by a spin loop that contains a MONITOR
> > instruction on 'counter' and a counter re-read to determine if it has
> > made it all the way down to zero by now.
> >
>
> This solution is (as stated) not general; in the restricted case being
> considered, it is .... ummm ... baroque? Cleaner solutions are
> definitely possible.

If so, then I really don't see what the problem is?
>
> One particular red-flag is the use of the MONITOR instruction. Its only
> available at privilege level 0. How did you get there? With a system call?

I didn't know that; until now I had assumed that MONITOR+PAUSE was a
user-level way to sleep a core, without involving the OS at all.

Skipping the MONITOR part leaves me with a stock spin loop where every
core will re-read the counter until it has reached zero, which also
seems OK, since there's nothing else useful they can do.

What's important is _only_ the latency from when the last core executes
its XADD() operation until all the waiting cores have been able to
reload the counter and seen that they are ready to go on.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: "Andy "Krazy" Glew" on
Mayan Moudgill wrote:
> Let me ask the following question: assume that we have a N-process
> barrier, how do you avoid using N system calls (use any hardware
> synchronization you want)?
>
> The fundamental problem becomes that, in the general case, the first N-1
> processes that arrive at the barrier will need to (have the kernel) put
> themselves on a waiting queue, and the last process to arrive will need
> to (have the kernel) put the N-1 waiting processes back on the ready queue.

Since I have my head in hardware multithreading land - GPUs, various
flavours of CPUs with a small number of actively running threads and a
larger number of threads sitting in a thread pool, not actively running
- this is straightforward.

The first N-1 threads to arrive at the barrier switch to not actively
running. They are in the thread pool, but other threads take their
place on the actively running hardware thread contexts.

When the last thread arrives at the barrier, they all are marked
runnable again.

If the OS wants to context switch other threads/processes in/out, it can.

---

Yes, I am talking about a hardware or microcode thread scheduler.

I got into this space because of my work on speculative multithreading.
In an SpMT system you typically need many more potential threads than
you need actively running threads. If you don't want the SpMT to be
software visible, then you need to do such scheduling in hardware or
microcode, underneath the OS or VMM.

Some folks want to get the OS involved in such scheduling. Sure, maybe
if you work at Microsoft. Not if you work at Intel or AMD. If you work
at a hardware company, you don't want your project schedule to depend on
a different company.

However, my attitude has always been to reduce the need for software
support, but to allow software access. So, if you have a hardware or
microcode thread scheduler for SpMT, why not, eventually, expose it to
software? If it has any advantages.

The sorts of advantages it might expose include things like power
management, load balancing, and QOS.