From: Linus Torvalds
On Wed, Jul 14, 2010 at 8:49 AM, Mathieu Desnoyers
<mathieu.desnoyers(a)efficios.com> wrote:
>> I think you're vastly overestimating what is sane to do from an NMI
>> context. It is utterly and totally insane to assume vmalloc is available
>> in NMI.

I agree that NMI handlers shouldn't touch vmalloc space. But now that
percpu data is mapped through the VM, I do agree that other CPUs may
potentially need to touch that data, and an interrupt (including an
NMI) might be the first to create the mapping.

And that's why the faulting code needs to be interrupt-safe for the
vmalloc area.

However, it does look like the current scheduler should make it safe
to access "current->mm->pgd" from regular interrupts, so the problem
is apparently only an NMI issue. So exactly what are the circumstances
that create and expose percpu data on a CPU _without_ mapping it on
that CPU?

IOW, I'm missing some background here. I agree that at least some
basic percpu data should generally be available for an NMI handler,
but at the same time I wonder how come that basic percpu data wasn't
already mapped?

The traditional irq vmalloc race was something like:
- one CPU does a "fork()", which copies the basic kernel mappings
- in another thread a driver does a vmalloc(), which creates a _new_
mapping that didn't get copied.
- later on a switch_to() switches to the newly forked process that
missed the vmalloc initialization
- we take an interrupt for the driver that needed the new vmalloc
space, but now it doesn't have it, and we fill it in at run-time for
the (rare) race (roughly as in the sketch below).

and I'm simply not seeing how fork() could ever race with percpu data setup.
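
For concreteness, here is a rough sketch of that run-time fill-in,
modeled on the x86 vmalloc_fault() idea. This is a hedged illustration,
not the exact kernel code: the function name is made up and the cr3
masking is simplified.

#include <linux/mm.h>
#include <asm/pgtable.h>

static int vmalloc_fault_sketch(unsigned long address)
{
        pgd_t *pgd, *pgd_ref;

        if (address < VMALLOC_START || address >= VMALLOC_END)
                return -1;              /* not a vmalloc address */

        pgd_ref = pgd_offset_k(address);        /* reference entry in init_mm */
        if (pgd_none(*pgd_ref))
                return -1;              /* no mapping exists at all */

        /*
         * Read the page-table base from cr3 rather than through
         * ->current, which (as discussed later in this thread) can
         * be stale in the middle of switch_to(). Real code masks
         * the cr3 flag/physical bits more carefully than this.
         */
        pgd = (pgd_t *)__va(read_cr3() & PAGE_MASK) + pgd_index(address);
        if (pgd_none(*pgd))
                set_pgd(pgd, *pgd_ref); /* copy the missed top-level entry */

        return 0;
}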

So please just document the sequence that actually needs the page
table setup for the NMI/percpu case.

This patch (1/2) doesn't look horrible per se. I have no problems with
it. I just want to understand why it is needed.

Linus
From: Mathieu Desnoyers
* Linus Torvalds (torvalds(a)linux-foundation.org) wrote:
> On Wed, Jul 14, 2010 at 8:49 AM, Mathieu Desnoyers
> <mathieu.desnoyers(a)efficios.com> wrote:

(I was quoting Peter Anvin below) ;)

> >> I think you're vastly overestimating what is sane to do from an NMI
> >> context. It is utterly and totally insane to assume vmalloc is available
> >> in NMI.
>
> I agree that NMI handlers shouldn't touch vmalloc space. But now that
> percpu data is mapped through the VM, I do agree that other CPUs may
> potentially need to touch that data, and an interrupt (including an
> NMI) might be the first to create the mapping.
>
[...]
> So please just document the sequence that actually needs the page
> table setup for the NMI/percpu case.
>
> This patch (1/2) doesn't look horrible per se. I have no problems with
> it. I just want to understand why it is needed.

The problem originally addressed by this patch is the case where an NMI handler
tries to access vmalloc'd per-cpu data, which goes as follows:

- One CPU does a fork(), which copies the basic kernel mappings.
- Perf allocates percpu memory for buffer control data structures.
This mapping does not get copied.
- Tracing is activated.
- switch_to() to the newly forked process which missed the new percpu
allocation.
- We take an NMI, which touches the vmalloc'd percpu memory in the Perf tracing
handler, therefore leading to a page fault in NMI context. Here, we might be
in the middle of switch_to(), where ->current might not be in sync with the
current cr3 register.

The three choices I am aware of for handling this are:
1) supporting page faults in NMI context, which implies removing the
->current dependency and supporting an iret-less return path.
2) duplicating the percpu alloc API with a variant that maps to kmalloc.
3) using vmalloc_sync_all() after creating the mapping (this only works on
x86_64, not x86_32).

Choice 3 seems like a no-go on x86_32, and choice 2 seems like a last resort
(it involves API duplication and reserving a fixed amount of per-cpu memory
at boot). Hence the proposal of choice 1.
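
For reference, a rough sketch of what choice 3 looks like in practice.
vmalloc_sync_all() is the real API; the wrapper and the struct below are
made-up names for illustration only:

#include <linux/percpu.h>
#include <linux/vmalloc.h>

/* Illustrative stand-in for perf's buffer control data. */
struct buf_ctl {
        unsigned long head, tail;
};

static struct buf_ctl __percpu *alloc_nmi_visible_percpu(void)
{
        struct buf_ctl __percpu *ctl = alloc_percpu(struct buf_ctl);

        /*
         * Propagate the new vmalloc-area mappings into every page
         * table in the system, so nobody ever faults on them later.
         * As noted above, this only does the job on x86_64.
         */
        if (ctl)
                vmalloc_sync_all();
        return ctl;
}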

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Linus Torvalds
On Wed, Jul 14, 2010 at 10:06 AM, Mathieu Desnoyers
<mathieu.desnoyers(a)efficios.com> wrote:
>>
>> This patch (1/2) doesn't look horrible per se. I have no problems with
>> it. I just want to understand why it is needed.

[ And patch 2/2 is much more intrusive, and touches a critical path
too... If it were just the 1/2 series, I don't think I would care. For
the 2/2, I think I'd want to explore all the alternative options. ]

> The problem originally addressed by this patch is the case where an NMI handler
> tries to access vmalloc'd per-cpu data, which goes as follows:
>
> - One CPU does a fork(), which copies the basic kernel mappings.
> - Perf allocates percpu memory for buffer control data structures.
>   This mapping does not get copied.
> - Tracing is activated.
> - switch_to() to the newly forked process which missed the new percpu
>   allocation.
> - We take an NMI, which touches the vmalloc'd percpu memory in the Perf tracing
>   handler, therefore leading to a page fault in NMI context. Here, we might be
>   in the middle of switch_to(), where ->current might not be in sync with the
>   current cr3 register.

Ok. I was wondering why anybody would allocate core percpu variables
so late that this would ever be an issue, but I guess perf is a
reasonable such case. And reasonable to do from NMI.

That said - grr. I really wish there were some other alternative than
adding yet more complexity to the exception return path. That "iret
re-enables NMIs unconditionally" thing annoys me.

In fact, I wonder if we couldn't just do a software NMI disable
instead? Have a per-cpu variable (in the _core_ percpu areas that get
allocated statically) that points to the NMI stack frame, and just
make the NMI code itself do something like:

NMI entry:
 - load the percpu NMI stack frame pointer
 - if non-zero, we know we're nested, and should ignore this NMI:
    - we're returning to kernel mode, so return immediately by using
      "popf/ret", which also keeps NMIs disabled in the hardware until the
      "real" NMI iret happens.
    - before the popf/ret, use the NMI stack pointer to make the NMI
      return stack be invalid and cause a fault
 - set the NMI stack pointer to the current stack pointer

NMI exit (not the above "immediate exit because we nested"):
 - clear the percpu NMI stack pointer
 - just do the iret.
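
In rough C, just to make the control flow explicit (the real thing would
be a handful of instructions in the entry assembly; the per-cpu variable
and the two helpers below are hypothetical):

#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, nmi_frame_ptr); /* 0 = no NMI running */

/* Hypothetical helpers standing in for a few assembly instructions: */
extern void invalidate_return_frame(unsigned long frame);
extern void return_via_popf_ret(void) __attribute__((__noreturn__));

void nmi_entry_sketch(unsigned long rsp)
{
        unsigned long outer = __this_cpu_read(nmi_frame_ptr);

        if (outer) {
                /*
                 * Nested NMI: make the outer NMI's saved return frame
                 * invalid so that its eventual iret faults and re-runs
                 * the NMI, then leave via popf/ret so that NMIs stay
                 * masked in hardware.
                 */
                invalidate_return_frame(outer);
                return_via_popf_ret();
        }
        __this_cpu_write(nmi_frame_ptr, rsp);

        /* ... the real NMI work; page faults are now survivable ... */

        __this_cpu_write(nmi_frame_ptr, 0);
        /* iret: unmasks NMIs; if we nested, it faults and redoes the NMI */
}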

Now, the thing is, the "iret" is atomic. If we had a nested NMI, we'll
take a fault, and that re-does our "delayed" NMI - and NMIs will stay
masked.

And if we didn't have a nested NMI, that iret will now unmask NMIs,
and everything is happy.

Doesn't the above sound like a good solution? In other words, we solve
the whole problem by simply _fixing_ the crazy Intel "iret-vs-NMI"
semantics. And we don't need to change the hotpath, and we'll just
_allow_ nested faults within NMIs.

Am I missing something? Maybe I'm not as clever as I think I am... But
I _feel_ clever.

Linus
From: Ingo Molnar

* Linus Torvalds <torvalds(a)linux-foundation.org> wrote:

> Ok. I was wondering why anybody would allocate core percpu variables so late
> that this would ever be an issue, but I guess perf is a reasonable such
> case. And reasonable to do from NMI.

Yeah.

Frederic (re-)discovered this problem via very hard-to-debug crashes when he
extended perf call-graph tracing to have a somewhat larger buffer and used
alloc_percpu() for it (which is entirely reasonable in itself).

> That said - grr. I really wish there was some other alternative than adding
> yet more complexity to the exception return path. That "iret re-enables
> NMI's unconditionally" thing annoys me.

Ok. We can solve it by allocating the space from the non-vmalloc percpu area -
8K per CPU.
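
That is, something statically reserved along these lines, where the
struct and name are illustrative and the size matches the figure above:

#include <linux/percpu.h>

/*
 * Lives in the core percpu area that every page table maps from
 * boot, so no vmalloc fault can ever hit it from NMI context.
 */
struct callchain_buf {
        char data[8192];        /* the "8K per CPU" mentioned above */
};
static DEFINE_PER_CPU(struct callchain_buf, callchain_buf);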

> In fact, I wonder if we couldn't just do a software NMI disable
> instead? Hav ea per-cpu variable (in the _core_ percpu areas that get
> allocated statically) that points to the NMI stack frame, and just
> make the NMI code itself do something like
>
> NMI entry:

I think at this point [NMI re-entry] we've corrupted the top of the NMI kernel
stack already, due to entering via the IST stack mechanism, which is
non-nesting and which enters at the same point - right?

We could solve that by copying that small stack frame off before entering the
'generic' NMI routine - but it all feels a bit pulled in by the hair.

I feel uneasy about taking pagefaults from the NMI handler. Even if we
implemented it all correctly, who knows what CPU errata are waiting there to
be discovered, etc ...

I think we should try to muddle through by preventing these situations from
happening (and adding a WARN_ONCE() to the vmalloc page-fault handler would
certainly help as well), and only go to more clever schemes if no other option
looks sane anymore?
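
Something like the following in the fault path, say, where in_nmi() and
WARN_ONCE() are the real interfaces and the handler around them is the
earlier sketch rather than the actual kernel code:

#include <linux/hardirq.h>
#include <linux/kernel.h>

static int vmalloc_fault_checked(unsigned long address)
{
        /* Scream (once) instead of dying mysteriously later. */
        WARN_ONCE(in_nmi(),
                  "vmalloc fault at %#lx from NMI context\n", address);

        return vmalloc_fault_sketch(address);   /* sketch from earlier */
}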

Thanks,

Ingo
From: Linus Torvalds
On Wed, Jul 14, 2010 at 11:46 AM, Ingo Molnar <mingo(a)elte.hu> wrote:
>> NMI entry:
>
> I think at this point [NMI re-entry] we've corrupted the top of the NMI kernel
> stack already, due to entering via the IST stack mechanism, which is
> non-nesting and which enters at the same point - right?

Yeah, you're right, but we could easily fix that up. We know we don't
need any stack for the nested case, so all we would need to do is to
just subtract a small bit off %rsp, and copy the three words or so to
create a "new" stack for the non-nested case.

> We could solve that by copying that small stack frame off before entering the
> 'generic' NMI routine - but it all feels a bit pulled in by the hair.

Why? It's much cleaner than making the _real_ codepaths much worse.

Linus