From: Nick Piggin on
On Thu, Feb 25, 2010 at 11:53:01AM -0500, Mathieu Desnoyers wrote:
> * Nick Piggin (npiggin(a)suse.de) wrote:
> > On Wed, Feb 24, 2010 at 10:22:52AM -0500, Mathieu Desnoyers wrote:
> > > * Nick Piggin (npiggin(a)suse.de) wrote:
> > > > On Mon, Feb 22, 2010 at 04:23:21PM -0500, Mathieu Desnoyers wrote:
> > > > > * Chris Friesen (cfriesen(a)nortel.com) wrote:
> > > > > > On 02/12/2010 04:46 PM, Mathieu Desnoyers wrote:
> > > > > >
> > > > > > > Editorial question:
> > > > > > >
> > > > > > > This synchronization only takes care of threads using the current process's
> > > > > > > memory map. It should not be used to synchronize accesses performed on memory
> > > > > > > maps shared between different processes. Is that a limitation we can live with?
> > > > > >
> > > > > > It makes sense for an initial version. It would be unfortunate if this
> > > > > > were a permanent limitation, since using separate processes with
> > > > > > explicit shared memory is a useful way to mitigate memory trampler issues.
> > > > > >
> > > > > > If we were going to allow that, it might make sense to add an address
> > > > > > range such that only those processes which have mapped that range would
> > > > > > execute the barrier. Come to think of it, it might be possible to use
> > > > > > this somehow to avoid having to execute the barrier on *all* threads
> > > > > > within a process.
> > > > >
> > > > > The extensible system call's mandatory and optional flags will allow this kind
> > > > > of improvement later on if it turns out to be needed. They will also allow
> > > > > user-space to detect whether later kernels support these new features. But
> > > > > meanwhile I think it's good to start with this implementation, which covers
> > > > > 99.99% of the use-cases I can currently think of (ok, well, maybe I'm just
> > > > > unimaginative) ;)
> > > >
> > > > It's a good point, I think having at least the ability to do
> > > > process-shared or process-private in the first version of the API might
> > > > be a good idea. That matches glibc's synchronisation routines so it
> > > > would probably be a desirable feature even if you don't implement it in
> > > > your library initially.
> > >
> > > I am tempted to say that we should wait for users of this API feature to
> > > manifest themselves before we go ahead and implement it. That ensures we don't
> > > end up maintaining an unused feature, and it guarantees a minimum of
> > > testability. For now, returning -EINVAL seems like an appropriate response for
> > > this part of the system call.
> >
> > It would be very trivial compared to the process-private case: just IPI all
> > CPUs. It would allow older kernels to work with newer process-based apps as
> > they get implemented. But... not a really big deal, I suppose.
>
> This is actually what I did in v1 of the patch, but that implementation met
> resistance from the RT people, who were concerned about the impact on RT tasks
> of a lower-priority process doing lots of sys_membarrier() calls. So if we want
> an other-process-aware sys_membarrier(), we would have to iterate over all
> CPUs and, for each running process's shared memory maps, check whether anything
> is shared with the current process's shared mappings. That is clearly not as
> trivial as just broadcasting the IPI to all CPUs.
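
(The v1 broadcast really was trivial on the kernel side - a minimal
sketch, assuming an IPI handler whose only job is to execute the memory
barrier:

	static void membarrier_ipi(void *unused)
	{
		smp_mb();	/* the barrier the calling thread asked for */
	}

	/* v1 scheme: interrupt every CPU, no filtering at all */
	on_each_cpu(membarrier_ipi, NULL, 1);

which is exactly why the RT people would worry about a low-priority
process hammering it.)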

I don't see how this is fundamentally worse than your existing approach:
on some architectures with ASIDs, the mm_cpumask isn't cleared when a
process is scheduled off a CPU, so you could effectively end up sending
IPIs to lots of CPUs anyway.

x86 may also one day implement ASIDs in the same way.

So if we are worried about this, then we need to solve it properly, IMO.
Rate-limiting it might work.
From: Steven Rostedt on
On Fri, 2010-02-26 at 16:08 +1100, Nick Piggin wrote:
> On Thu, Feb 25, 2010 at 11:53:01AM -0500, Mathieu Desnoyers wrote:

> > This is actually what I did in v1 of the patch, but that implementation met
> > resistance from the RT people, who were concerned about the impact on RT tasks
> > of a lower-priority process doing lots of sys_membarrier() calls. So if we want
> > an other-process-aware sys_membarrier(), we would have to iterate over all
> > CPUs and, for each running process's shared memory maps, check whether anything
> > is shared with the current process's shared mappings. That is clearly not as
> > trivial as just broadcasting the IPI to all CPUs.
>
> I don't see how this is fundamentally worse than your existing approach:
> on some architectures with ASIDs, the mm_cpumask isn't cleared when a
> process is scheduled off a CPU, so you could effectively end up sending
> IPIs to lots of CPUs anyway.

That's why checking mm_cpumask isn't the only check. It just limits which
CPUs we look at; before an IPI is sent, that CPU's rq lock is held and
cpu_curr(cpu)->mm is compared against current->mm. If they don't match,
no IPI is sent to that CPU.
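
In code terms, the filtering looks roughly like this - a sketch of the
scheme, not the literal patch (cpu_rq()/cpu_curr() and the rq lock are
private to kernel/sched.c, so the real code has to live there):

	/* IPI handler: just execute the memory barrier on that CPU */
	static void membarrier_ipi(void *unused)
	{
		smp_mb();
	}

	static void membarrier_expedited(void)
	{
		int cpu;

		for_each_cpu(cpu, mm_cpumask(current->mm)) {
			struct rq *rq = cpu_rq(cpu);
			unsigned long flags;
			int send_ipi;

			/* hold the rq lock so rq->curr can't change under us */
			raw_spin_lock_irqsave(&rq->lock, flags);
			send_ipi = (cpu_curr(cpu)->mm == current->mm);
			raw_spin_unlock_irqrestore(&rq->lock, flags);

			if (send_ipi)
				smp_call_function_single(cpu, membarrier_ipi,
							 NULL, 1);
		}
	}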

-- Steve


From: Josh Triplett on
On Thu, Feb 25, 2010 at 06:23:16PM -0500, Mathieu Desnoyers wrote:
> I am proposing this patch for the 2.6.34 merge window, as I think it is ready
> for inclusion.
>
> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads of the current process.
[...]

> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
> Acked-by: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
> Acked-by: Steven Rostedt <rostedt(a)goodmis.org>
> Acked-by: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
> CC: Nicholas Miell <nmiell(a)comcast.net>
> CC: Linus Torvalds <torvalds(a)linux-foundation.org>
> CC: mingo(a)elte.hu
> CC: laijs(a)cn.fujitsu.com
> CC: dipankar(a)in.ibm.com
> CC: akpm(a)linux-foundation.org
> CC: josh(a)joshtriplett.org

Acked-by: Josh Triplett <josh(a)joshtriplett.org>

I agree that v9 seems ready for inclusion.

Out of curiosity, do you have any benchmarks for the case of not
detecting sys_membarrier dynamically? Detecting it at library
initialization time, for instance, or even just compiling to assume its
presence? I'd like to know how much that would improve the numbers.

If the difference is significant, it might make sense to have a mechanism
similar to SMP alternatives, selecting different code for each case:
dlopen, function pointers, runtime code patching (nop out the rmb), or
similar.
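
A minimal sketch of the function-pointer flavour on the user-space side
(the __NR_membarrier number is a placeholder since none has been assigned
yet, the helper names are made up, and I pass flags=0 for simplicity):

	#include <unistd.h>
	#include <sys/syscall.h>

	#define __NR_membarrier 9999	/* placeholder, not a real number */

	/* fallback read-side barrier: a real smp_mb() */
	static void reader_mb_fallback(void)
	{
		__sync_synchronize();
	}

	/* read-side barrier when the kernel has sys_membarrier: the
	 * opaque indirect call itself acts as the compiler barrier */
	static void reader_mb_noop(void)
	{
	}

	static void (*reader_mb)(void) = reader_mb_fallback;

	/* run once, e.g. from a library constructor */
	static void membarrier_detect(void)
	{
		if (syscall(__NR_membarrier, 0) == 0)
			reader_mb = reader_mb_noop;
	}

The write side would then call syscall(__NR_membarrier, ...) where it
currently executes smp_mb(), keeping the reader/writer barrier pairing
intact; the code-patching variant would just nop out the indirect call.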

- Josh Triplett
From: Ingo Molnar on

* Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com> wrote:

> I am proposing this patch for the 2.6.34 merge window, as I think it is
> ready for inclusion.

It's a bit late for this merge window, I think.

> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads of the current process. It can be
> used to distribute the cost of user-space memory barriers asymmetrically by
> transforming pairs of memory barriers into pairs consisting of
> sys_membarrier() and a compiler barrier. For synchronization primitives that
> distinguish between read-side and write-side (e.g. userspace RCU, rwlocks),
> the read-side can be accelerated significantly by moving the bulk of the
> memory barrier overhead to the write-side.

Why is this such a low-level and yet special-purpose facility?

Synchronization facilities for high-performance threading may want to do a bit
more than just execute a barrier instruction on another CPU that has a
relevant thread running.

You cited signal-based numbers:

> (what we have now, with dynamic sys_membarrier check, expedited scheme)
> memory barriers in reader: 907693804 reads, 817793 writes
> sys_membarrier scheme: 4316818891 reads, 503790 writes
>
> (dynamic sys_membarrier check, non-expedited scheme)
> memory barriers in reader: 907693804 reads, 817793 writes
> sys_membarrier scheme: 8698725501 reads, 313 writes

Much of that signal handler overhead is, I think, due to:

- FPU/SSE context save/restore
- the need to wake up, run and deschedule all threads

Instead I'd suggest that you try to implement the user-space RCU speedups not
via the new sys_membarrier() syscall, but via two new signal extensions:

- SA_NOFPU: on x86, skip the FPU/SSE save/restore for such fast in/out,
special-purpose signal handlers. (I can whip up a quick patch for you if
you want.)

- SA_RUNNING: a way to signal only running threads - useful for user-space
concurrency control mechanisms that want to deschedule running threads (or,
like in your case, to implement barrier / garbage collection schemes).

( Note: to properly sync back you'll also need an sa_info field to tell
target tasks how many tasks were woken up. That way a futex can be used
as a semaphore to signal back to the issuing thread, making it all
properly event-triggered and nicely scalable. Also, queued signals are a
must for such a scheme. )
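
A hypothetical user-space sketch of that scheme (none of this exists yet:
the SA_RUNNING flag value and the signal_running_threads() helper are made
up purely for illustration):

	#define _GNU_SOURCE
	#include <signal.h>
	#include <unistd.h>
	#include <sys/types.h>
	#include <sys/syscall.h>
	#include <linux/futex.h>

	#define SA_RUNNING 0x04000000		/* hypothetical flag */

	/* hypothetical: signal only currently-running threads of a
	 * process, return how many were actually interrupted */
	extern int signal_running_threads(pid_t pid, int sig);

	static volatile int done;		/* futex word */

	static void mb_handler(int sig)
	{
		__sync_synchronize();		/* the memory barrier itself */
		__sync_fetch_and_add(&done, 1);
		syscall(SYS_futex, &done, FUTEX_WAKE, 1, NULL, NULL, 0);
	}

	static void membarrier_via_signals(void)
	{
		struct sigaction sa = { .sa_handler = mb_handler,
					.sa_flags = SA_RUNNING };
		int woken, d;

		sigemptyset(&sa.sa_mask);
		sigaction(SIGRTMIN, &sa, NULL);

		done = 0;
		woken = signal_running_threads(getpid(), SIGRTMIN);

		/* wait until every interrupted thread has executed its
		 * barrier; FUTEX_WAIT rechecks if 'done' already moved */
		while ((d = done) < woken)
			syscall(SYS_futex, &done, FUTEX_WAIT, d,
				NULL, NULL, 0);
	}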

My estimation is that it would be _much_ faster than the naive signal-based
approach - maybe even quite comparable to an open-coded sys_membarrier():

- as most of the overhead in a real scenario ought to be the IPI sending and
latency, not the syscall entry/exit. (With a signal approach we'd still go
into the target thread's user-mode, so one more syscall exit+re-entry.)

- or, for the common case where there are no other threads running, we are
just in/out of the SA_RUNNING path without having to do any synchronization.
In that case it should be quite close to sys_membarrier(), modulo some
minimal signal API overhead. [Which we could optimize some more, if it's
visible in your benchmarks.]

Signals per se are pretty scalable these days - now that most of the fastpaths
are decoupled from tasklist_lock and everything is RCU-ized.

Further benefits are:

- both SA_NOFPU and SA_RUNNING could be used by a _lot_ more user-space
facilities than just user-space RCU.

- synergistic effects: growing a real high-performance facility based on
signals would ensure further signal speedups in the future as well.
Currently any server app that runs into signal limitations tends to shy
away from them and use some different (and often inferior) signalling
scheme. It would be better to extend signals with 'lightweight'
capabilities instead.

All in all, signals are used by something like 99.9% of Linux apps, while
sys_membarrier() would be used only by [WAG] 0.00001% of them.

So before we can merge this (at least via the RCU tree, which you have sent
it to), I'd like to see you try _much_, _MUCH_ harder to fix the very obvious
signal overhead performance problems you have demonstrated so nicely with the
numbers above.

If _that_ fails, once we have reaped all the fruits of trying, _then_ we
might perhaps, with a lot of hesitation, concede defeat and think about
adding yet another syscall.

I know it's cool to add a brand new syscall - but, unfortunately, in practice
it doesn't help Linux apps all that much. (At least until we have tools/klibc/
or so.)

[ There are also a few small cleanliness details I noticed in your patch:
enums are a tiny bit nicer for ABIs than #define's, the #ifdef SMP is ugly,
etc. - but they don't really matter much, as I think we should concentrate
on the scalability problems of signals first. ]

Thanks,

Ingo
From: Linus Torvalds on


On Thu, 4 Mar 2010, Ingo Molnar wrote:
>
> - SA_NOFPU: on x86, skip the FPU/SSE save/restore for such fast in/out,
> special-purpose signal handlers. (I can whip up a quick patch for you if
> you want.)

I'd love to do this, but it's wrong.

It's too damn easy to use the FPU by mistake in user land without ever
being aware of it. memset()/memcpy() are obvious potential users of SSE,
but they might also be called implicitly by the compiler in non-obvious
ways (i.e. structure copy and setup).

And modern glibc ends up using SSE4 even for things like strstr and
strlen, so it really is creeping into all kinds of trivial helper
functions that might not be obvious. So SA_NOFPU is a lovely idea, but
it's also an idea that sucks rotten eggs in practice, with quite possibly
the same _binary_ working or not working depending on what kind of CPU and
what shared library it happens to be using.

Too damn fragile, in other words.

(Now, if it were accompanied by the kernel actually _testing_ that there is
no FPU activity - by setting the TS flag, checking at fault time, and raising
a SIGFPE - then that would be better. At least you'd get a nice clear signal
rather than random FPU state corruption. But you're still in the situation
that the binary might work on some machines and setups, and not on others.)
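
(Concretely, that test could live in the x86 #NM device-not-available trap
handler - a hypothetical sketch, where in_nofpu_handler is made-up per-task
state marking that we're inside an SA_NOFPU signal handler:

	/* x86 #NM: first FPU/SSE use after TS was set */
	dotraplinkage void
	do_device_not_available(struct pt_regs *regs, long error_code)
	{
		if (current->in_nofpu_handler) {	/* hypothetical */
			/* FPU use is forbidden here: fail loudly */
			force_sig(SIGFPE, current);
			return;
		}
		math_state_restore();	/* normal lazy FPU restore */
	}

so at least the corruption becomes a clear, synchronous failure.)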

> - SA_RUNNING: a way to signal only running threads - useful for user-space
> concurrency control mechanisms that want to deschedule running threads (or,
> like in your case, to implement barrier / garbage collection schemes).

Hmm. This sounds less fundamentally broken, but at the same time also
_way_ more invasive in the signal handling layer. It's already one of our
more "exciting" layers out there.

Linus