introduce sys_membarrier(): process-wide memory barrier (v9) [Kernel]


From: Ingo Molnar on 4 Mar 2010 15:30

* Linus Torvalds <torvalds(a)linux-foundation.org> wrote:

>
> On Thu, 4 Mar 2010, Ingo Molnar wrote:
> >
> > - SA_NOFPU: on x86 to skip the FPU/SSE save/restore, for such fast in/out special
> > purpose signal handlers? (can whip up a quick patch for you if you want)
>
> I'd love to do this, but it's wrong.
>
> It's too damn easy to use the FPU by mistake in user land, without ever
> being aware of it. memset()/memcpy() are obvious potential users of SSE, but they
> might be called in non-obvious ways implicitly by the compiler (ie structure
> copy and setup).
>
> And modern glibc ends up using SSE4 even for things like strstr and strlen,
> so it really is creeping into all kinds of trivial helper functions that
> might not be obvious. So SA_NOFPU is a lovely idea, but it's also an idea
> that sucks rotten eggs in practice, with quite possibly the same _binary_
> working or not working depending on what kind of CPU and what shared library
> it happens to be using.
>
> Too damn fragile, in other words.
>
> (Now, if it's accompanied by the kernel actually _testing_ that there is no
> FPU activity, by setting the TS flag and checking at fault time and causing
> a SIGFPE, then that would be better. At least you'd get a nice clear signal
> rather than random FPU state corruption. But you're still in the situation
> that now the binary might work on some machines and setups, and not on
> others.)

Perhaps NOFPU could do lazy context saving: clear the TS flag and only save
the FPU state if it's actually used by the signal handler?

This turns it into a 'hint', not into an FPU state corruption issue.

Clearing/enabling FPU instructions is still faster than a full-blown FPU
context save/restore.

Careful and lightweight signal handlers (like a GC scheme would likely be)
would thus be faster. In the worst case it incurs an extra trap and a
(measurable/profilable) slowdown.

In any case this would be a secondary optimization - the biggest difference
i'd expect to come from the 'don't wake up the world' logic:

> > - SA_RUNNING: a way to signal only running threads - as a way for user-space
> > based concurrency control mechanisms to deschedule running threads (or, like
> > in your case, to implement barrier / garbage collection schemes).
>
> Hmm. This sounds less fundamentally broken, but at the same time also _way_
> more invasive in the signal handling layer. It's already one of our more
> "exciting" layers out there.

Yeah, definitely. But i still tend to think it should be actively tried, at
which point we can still say 'yuck this cannot work, lets go for the
sys_membarrier() solution'.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds on 6 Mar 2010 14:50

On Thu, 4 Mar 2010, Ingo Molnar wrote:
>
> Perhaps NOFPU could do lazy context saving: clear the TS flag and only save
> the FPU state if it's actually used by the signal handler?

If we can get that working reliably, we probably shouldn't use NOFPU at
all, and we should just do it unconditionally. That big (and almost always
pointless) FPU state save is a _big_ performance issue on signal handling,
and if we can do it lazily, we should.

However, I'm not at all convinced we can do this reliably. How do we
detect the "signal frame is dead" case with things like siglongjmp() etc?

And if we can't detect that "frame no longer exists", we can't really do
the lazy context saving.

Now, there's _also_ the issue of the signal handler function possibly
actually looking at the FPU state on the stack, and for that, a SA_NOFPU
would be a good way to say "you can't do that". So it's possible that even
if we could reliably detect the frame liveness we'd really have to use
that new flag anyway.

But if we do need a SA_NOFPU flag, then that means that basically no app
will use it, and it will be some special case for some really unusual
library. So I really don't think this whole thing is worth it unless you
could do it automatically.

(The "user accesses the frame" case _could_ possibly be handled by
pointing the FP frame to a special faulting location, and never nesting
the FP optimization. Nested signal handlers are unusual enough that they
aren't worth optimizing for anyway. So I'm sure that there are possible
solutions for "automatically just do the right thing" in theory, but I
suspect they get rather complex)

Linus

From: Nick Piggin on 9 Mar 2010 02:10

On Sat, Mar 06, 2010 at 11:43:26AM -0800, Linus Torvalds wrote:
>
>
> On Thu, 4 Mar 2010, Ingo Molnar wrote:
> >
> > Perhaps NOFPU could do lazy context saving: clear the TS flag and only save
> > the FPU state if it's actually used by the signal handler?
>
> If we can get that working reliably, we probably shouldn't use NOFPU at
> all, and we should just do it unconditionally. That big (and almost always
> pointless) FPU state save is a _big_ performance issue on signal handling,
> and if we can do it lazily, we should.
>
> However, I'm not at all convinced we can do this reliably. How do we
> detect the "signal frame is dead" case with things like siglongjmp() etc?
>
> And if we can't detect that "frame no longer exists", we can't really do
> the lazy context saving.
>
> Now, there's _also_ the issue of the signal handler function possibly
> actually looking at the FPU state on the stack, and for that, a SA_NOFPU
> would be a good way to say "you can't do that". So it's possible that even
> if we could reliably detect the frame liveness we'd really have to use
> that new flag anyway.
>
> But if we do need a SA_NOFPU flag, then that means that basically no app
> will use it, and it will be some special case for some really unusual
> library. So I really don't think this whole thing is worth it unless you
> could do it automatically.

The library is librcu, which I suspect will become quite important for
parallel programming in future (maybe I hope for too much).

But maybe it's better to not merge _any_ librcu special case until
we see results from programs using it. More general speedups or features
(that also help librcu) are a different story.

From: Ingo Molnar on 16 Mar 2010 03:40

* Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com> wrote:

> * Mathieu Desnoyers (mathieu.desnoyers(a)efficios.com) wrote:
> > * Linus Torvalds (torvalds(a)linux-foundation.org) wrote:
> > > > - SA_RUNNING: a way to signal only running threads - as a way for user-space
> > > > based concurrency control mechanisms to deschedule running threads (or, like
> > > > in your case, to implement barrier / garbage collection schemes).
> > >
> > > Hmm. This sounds less fundamentally broken, but at the same time also
> > > _way_ more invasive in the signal handling layer. It's already one of our
> > > more "exciting" layers out there.
> > >
> >
> > Hrm, thinking about it a bit further, the only way I see we could provide a
> > usable SA_RUNNING flag would be to add hooks to the scheduler. These hooks would
> > somehow have to call user-space code (!) when scheduling in/out a thread. Yes,
> > this sounds utterly broken (since these hooks would have to be preemptable).
> >
> > The idea is this: if we look, for instance, at the kernel preemptable RCU
> > implementations, they consist of two parts: one is iteration on all CPUs to
> > consider all active CPUs, and the other is a modification of the scheduler to
> > note all preempted tasks that were in a preemptable RCU critical section.
> >
> > Just for the memory barrier we consider for sys_membarrier(), I had to ensure
> > that the scheduler issues memory barriers to order accesses to user-space memory
> > and mm_cpumask modifications. In reality, what we are doing is to ensure that
> > the operation required on the running thread is done by the scheduler too when
> > scheduling in/out the task.
> >
> > As soon as we have signal handlers which perform more than a simple memory
> > barrier (e.g. something that has side-effects outside of the processor), I
> > doubt it would ever make sense to only run the handler on running threads
> > unless we have hooks in the scheduler too.
>
> Unless this question is answered, Ingo's SA_RUNNING signal proposal, as
> appealing as it may look at first glance, falls into the "fundamentally
> broken" category. [...]

How is it different from your syscall? I.e. which lines of code make the
difference? We could certainly apply the (trivial) barrier change to
context_switch().

Ingo

From: Nick Piggin on 16 Mar 2010 04:00

On Tue, Mar 16, 2010 at 08:36:35AM +0100, Ingo Molnar wrote:
>
> * Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com> wrote:
>
> > * Mathieu Desnoyers (mathieu.desnoyers(a)efficios.com) wrote:
> > > * Linus Torvalds (torvalds(a)linux-foundation.org) wrote:
> > > > > - SA_RUNNING: a way to signal only running threads - as a way for user-space
> > > > > based concurrency control mechanisms to deschedule running threads (or, like
> > > > > in your case, to implement barrier / garbage collection schemes).
> > > >
> > > > Hmm. This sounds less fundamentally broken, but at the same time also
> > > > _way_ more invasive in the signal handling layer. It's already one of our
> > > > more "exciting" layers out there.
> > > >
> > >
> > > Hrm, thinking about it a bit further, the only way I see we could provide a
> > > usable SA_RUNNING flag would be to add hooks to the scheduler. These hooks would
> > > somehow have to call user-space code (!) when scheduling in/out a thread. Yes,
> > > this sounds utterly broken (since these hooks would have to be preemptable).
> > >
> > > The idea is this: if we look, for instance, at the kernel preemptable RCU
> > > implementations, they consist of two parts: one is iteration on all CPUs to
> > > consider all active CPUs, and the other is a modification of the scheduler to
> > > note all preempted tasks that were in a preemptable RCU critical section.
> > >
> > > Just for the memory barrier we consider for sys_membarrier(), I had to ensure
> > > that the scheduler issues memory barriers to order accesses to user-space memory
> > > and mm_cpumask modifications. In reality, what we are doing is to ensure that
> > > the operation required on the running thread is done by the scheduler too when
> > > scheduling in/out the task.
> > >
> > > As soon as we have signal handlers which perform more than a simple memory
> > > barrier (e.g. something that has side-effects outside of the processor), I
> > > doubt it would ever make sense to only run the handler on running threads
> > > unless we have hooks in the scheduler too.
> >
> > Unless this question is answered, Ingo's SA_RUNNING signal proposal, as
> > appealing as it may look at first glance, falls into the "fundamentally
> > broken" category. [...]
>
> How is it different from your syscall? I.e. which lines of code make the
> difference? We could certainly apply the (trivial) barrier change to
> context_switch().

I think it is just too easy for userspace to misuse it, or to think it does
something that it doesn't (because of races).

If a context switch includes a barrier, then it is easy to know that
either the task of interest will execute the barrier, or it will have
context switched.
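
For comparison: later mainline kernels do expose a membarrier() system
call, though with a command-based interface rather than the v9 interface
under discussion in this thread. The sketch below assumes that later
interface. The local constant names (CMD_QUERY, CMD_GLOBAL) and both
function names are made up for this example, and the numeric values are
taken from the mainline UAPI header, not from this patch. When the call is
unavailable it falls back to a local barrier, which does NOT give
process-wide ordering.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed values of MEMBARRIER_CMD_QUERY and MEMBARRIER_CMD_GLOBAL. */
enum { CMD_QUERY = 0, CMD_GLOBAL = 1 };

/* Thin wrapper: fails with ENOSYS where the syscall number is unknown. */
static long membarrier_call(int cmd)
{
#ifdef __NR_membarrier
    return syscall(__NR_membarrier, cmd, 0, 0);
#else
    (void)cmd;
    errno = ENOSYS;
    return -1;
#endif
}

/*
 * Returns 1 if the kernel issued a barrier on behalf of all running
 * threads, 0 if only the calling thread executed a barrier.
 */
int process_wide_barrier(void)
{
    /* The query command returns a bitmask of supported commands. */
    long supported = membarrier_call(CMD_QUERY);

    if (supported < 0 || !(supported & CMD_GLOBAL)) {
        __sync_synchronize();   /* local fallback only */
        return 0;
    }
    if (membarrier_call(CMD_GLOBAL) != 0) {
        __sync_synchronize();   /* local fallback only */
        return 0;
    }
    return 1;
}
```

A caller that needs the process-wide guarantee has to treat a return value
of 0 as "not provided" and use some other scheme, such as the signal-based
one debated above.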

What more complex operation could be done in the signal handler that
isn't broken by races? Programs that use realtime scheduling policies,
and maybe some statistical or heuristic operations... Any cool use that
would make anybody other than librcu bother using it?
