x86_64 page fault NMI-safe [Kernel]


From: Ingo Molnar on 14 Jul 2010 17:20

* Linus Torvalds <torvalds(a)linux-foundation.org> wrote:

> On Wed, Jul 14, 2010 at 1:17 PM, Mathieu Desnoyers
> <mathieu.desnoyers(a)efficios.com> wrote:
> >
> > It only handles the case of a single NMI coming in. What happens in this
> > scenario?
>
> [ two nested NMI's ]
>
> The _right_ thing happens.
>
> What do you think the hardware would have done itself? The NMI was blocked.
> It wouldn't get replayed twice. If you have two NMI's happening while
> another NMI is active, you will get a single NMI after the first NMI has
> completed.

If it ever became an issue, we could even do what softirqs do and re-execute
the NMI handler. At least for things like PMU NMIs we have to handle them once
they have been (re-)issued, or we'd get a stuck PMU.

But in any case it should be a non-issue.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds on 14 Jul 2010 17:30

On Wed, Jul 14, 2010 at 1:39 PM, Mathieu Desnoyers
<mathieu.desnoyers(a)efficios.com> wrote:
>
>>  - load percpu NMI stack frame pointer
>>  - if non-zero we know we're nested, and should ignore this NMI:
>>     - we're returning to kernel mode, so return immediately by using
>> "popf/ret", which also keeps NMI's disabled in the hardware until the
>> "real" NMI iret happens.
>
> Maybe incrementing a per-cpu missed NMIs count could be appropriate here so we
> know how many NMIs should be replayed at iret ?

No. As mentioned, there is no such counter in real hardware either.

Look at what happens for the not-nested case:

- NMI1 triggers. The CPU takes a fault, and runs the NMI handler with
NMI's disabled

- NMI2 triggers. Nothing happens, the NMI's are disabled.

- NMI3 triggers. Again, nothing happens, the NMI's are still disabled

- the NMI handler returns.

- What happens now?

How many NMI interrupts do you get? ONE. Exactly like my "emulate it
in software" approach. The hardware doesn't have any counters for
pending NMI's either. Why should the software emulation have them?

>>     - before the popf/iret, use the NMI stack pointer to make the NMI
>> return stack be invalid and cause a fault
>
> I assume you mean "popf/ret" here.

Yes, that was a typo. The whole point of using popf was obviously to
_avoid_ the iret ;)

> So assuming we use a frame copy, we should
> change the nmi stack pointer in the nesting 0 nmi stack copy, so the nesting 0
> NMI iret will trigger the fault
>
>>   - set the NMI stack pointer to the current stack pointer
>
> That would mean bringing back the NMI stack pointer to the (nesting - 1) nmi
> stack copy.

I think you're confused. Or I am by your question.

The NMI code would literally just do:

- check if the NMI was nested, by looking at whether the percpu
nmi-stack-pointer is non-NULL

- if it was nested, do nothing, and return with a popf/ret. The only
stack this sequence might need is to save/restore the register that
we use for the percpu value (although maybe we can just do a "cmpq
$0,%__percpu_seg:nmi_stack_ptr" and not even need that), and it's
atomic because at this point we know that NMI's are disabled (we've
not _yet_ taken any nested faults)

- if it's a regular (non-nesting) NMI, we'd basically do

6* pushq 48(%rsp)

to copy the five words that the NMI pushed (ss/esp/eflags/cs/eip)
and the one we saved ourselves (if we needed any, maybe we can make do
with just 5 words).

- then we just save that new stack pointer to the percpu thing with a simple

movq %rsp,%__percpu_seg:nmi_stack_ptr

and we're all done. The final "iret" will do the right thing (either
fault or return), and there are no races that I can see exactly
because we use a single nmi-atomic instruction (the "iret" itself) to
either re-enable NMI's _or_ test whether we should re-do an NMI.

There is a single-instruction window that is interesting in the return
path, which is the window between the two final instructions:

movq $0,%__percpu_seg:nmi_stack_ptr
iret

where I wonder what happens if we have re-enabled NMI (due to a fault
in the NMI handler), but we haven't actually taken the NMI itself yet,
so now we _will_ re-use the stack. Hmm. I suspect we need another of
those horrible "if the NMI happens at this particular %rip" cases that
we already have for the sysenter code on x86-32 for the NMI/DEBUG trap
case of fixing up the stack pointer.

And maybe I missed something else. But it does look reasonably simple.
Subtle, but not a lot of code. And the code is all very much about the
NMI itself, not about other random sequences. No?

Linus
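
The entry and exit steps described above can be modeled in plain C as a
rough sketch. This is a userspace simulation, not kernel code: struct cpu,
struct iret_frame, nmi_entry(), nmi_exit() and run_with_nested() are
hypothetical names invented for illustration, and a boolean "valid" flag
stands in for corrupting the copied return frame so that the final iret
faults. The single-instruction window between clearing the pointer and the
iret is deliberately not modeled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stands in for the copied ss/sp/flags/cs/ip words (hypothetical). */
struct iret_frame {
    bool valid;              /* false = frame corrupted, iret will fault */
};

/* Per-CPU state, modeled as a plain struct (hypothetical). */
struct cpu {
    struct iret_frame *nmi_stack_ptr; /* NULL when no NMI is in progress */
    struct iret_frame frame_copy;     /* copy made on non-nested entry */
    int handler_runs;                 /* times the real handler body ran */
};

/* NMI entry: ignore nested NMIs, but make the outer iret fault. */
static void nmi_entry(struct cpu *c)
{
    if (c->nmi_stack_ptr != NULL) {
        /* Nested: corrupt the outer frame, then leave via popf/ret. */
        c->nmi_stack_ptr->valid = false;
        return;
    }
    /* Non-nested: copy the frame and publish its address. */
    c->frame_copy.valid = true;
    c->nmi_stack_ptr = &c->frame_copy;
    c->handler_runs++;
}

/* NMI exit: clear the pointer, then "iret". Returns true if it faulted. */
static bool nmi_exit(struct cpu *c)
{
    bool faulted = !c->frame_copy.valid;
    c->nmi_stack_ptr = NULL;
    return faulted;
}

/* One real NMI, then `nested` more arriving while the handler runs. */
static int run_with_nested(int nested)
{
    struct cpu c;
    c.nmi_stack_ptr = NULL;
    c.frame_copy.valid = true;
    c.handler_runs = 0;

    nmi_entry(&c);
    for (int i = 0; i < nested; i++)
        nmi_entry(&c);
    /* A faulting iret is treated as "redo the NMI once". */
    while (nmi_exit(&c))
        nmi_entry(&c);
    return c.handler_runs;
}
```

In this model the handler body runs once with no nested NMI and twice with
any number of nested NMIs, because the corrupted frame carries one bit of
information and not a count.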

From: Maciej W. Rozycki on 14 Jul 2010 17:50

On Wed, 14 Jul 2010, Linus Torvalds wrote:

> No. As mentioned, there is no such counter in real hardware either.

There is a 1-bit counter or actually a latch.

> Look at what happens for the not-nested case:
>
> - NMI1 triggers. The CPU takes a fault, and runs the NMI handler with
> NMI's disabled

Correct.

> - NMI2 triggers. Nothing happens, the NMI's are disabled.

The NMI latch records the second NMI. Note this is edge-sensitive like
the NMI line itself.

> - NMI3 triggers. Again, nothing happens, the NMI's are still disabled

Correct.

> - the NMI handler returns.
>
> - What happens now?

NMI2 latched above causes the NMI handler to be invoked as the next
instruction after IRET. The latch is cleared as the interrupt is taken.

> How many NMI interrupts do you get? ONE. Exactly like my "emulate it
> in software" approach. The hardware doesn't have any counters for
> pending NMI's either. Why should the software emulation have them?

Two. :)

Maciej
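
The latch behaviour described above can be illustrated with a small C
model. It is only a sketch of the one-bit latch as stated in this message,
not a description of any particular CPU: struct nmi_hw, nmi_edge(),
nmi_iret() and deliveries_for_burst() are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical model of NMI delivery state on one CPU. */
struct nmi_hw {
    bool blocked;    /* set while an NMI handler runs, cleared by IRET */
    bool latched;    /* one-bit latch: at most one pending NMI is kept */
    int  delivered;  /* number of times the handler was entered */
};

/* An edge arrives on the NMI line. */
static void nmi_edge(struct nmi_hw *hw)
{
    if (hw->blocked) {
        hw->latched = true;  /* further edges collapse into the same bit */
        return;
    }
    hw->blocked = true;
    hw->delivered++;
}

/* The handler executes IRET: unblock, then take a latched NMI at once. */
static void nmi_iret(struct nmi_hw *hw)
{
    hw->blocked = false;
    if (hw->latched) {
        hw->latched = false; /* latch is cleared as the interrupt is taken */
        hw->blocked = true;
        hw->delivered++;
    }
}

/* NMI1 is delivered, then `extra` edges arrive before its IRET. */
static int deliveries_for_burst(int extra)
{
    struct nmi_hw hw = { false, false, 0 };

    nmi_edge(&hw);
    for (int i = 0; i < extra; i++)
        nmi_edge(&hw);
    nmi_iret(&hw);           /* end of the first handler */
    if (hw.blocked)
        nmi_iret(&hw);       /* end of the replayed handler, if any */
    return hw.delivered;
}
```

With NMI2 and NMI3 both arriving during the first handler, the model
enters the handler twice in total: the original NMI plus one replay.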

From: Linus Torvalds on 14 Jul 2010 18:00

On Wed, Jul 14, 2010 at 2:45 PM, Maciej W. Rozycki <macro(a)linux-mips.org> wrote:
> On Wed, 14 Jul 2010, Linus Torvalds wrote:
>
>> No. As mentioned, there is no such counter in real hardware either.
>
>  There is a 1-bit counter or actually a latch.

Well, that's what our single-word flag is too.

>> Look at what happens for the not-nested case:
>>
>>  - NMI1 triggers. The CPU takes a fault, and runs the NMI handler with
>> NMI's disabled
>
>  Correct.
>
>>  - NMI2 triggers. Nothing happens, the NMI's are disabled.
>
>  The NMI latch records the second NMI.  Note this is edge-sensitive like
> the NMI line itself.
>
>>  - NMI3 triggers. Again, nothing happens, the NMI's are still disabled
>
>  Correct.
>
>>  - the NMI handler returns.
>>
>>  - What happens now?
>
>  NMI2 latched above causes the NMI handler to be invoked as the next
> instruction after IRET.  The latch is cleared as the interrupt is taken.
>
>> How many NMI interrupts do you get? ONE. Exactly like my "emulate it
>> in software" approach. The hardware doesn't have any counters for
>> pending NMI's either. Why should the software emulation have them?
>
>  Two. :)

You just count differently. I don't count the first one (the "real"
NMI). That obviously happens. So I only count how many interrupts we
need to fake. That's my "one". That's the one that happens as a result
of the fault that we take on the iret in the emulated model.

So there is no need to count anything. We take a fault on the iret if
we got a nested NMI (regardless of how _many_ such nested NMI's we
took). That's the "latch", exactly like in the hardware. No counter.

(Yeah, yeah, you can call it a "one-bit counter", but I don't think
that's a counter. It's just a bit of information).

Linus

From: Frederic Weisbecker on 14 Jul 2010 18:20

On Wed, Jul 14, 2010 at 12:54:19PM -0700, Linus Torvalds wrote:
> On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> <fweisbec(a)gmail.com> wrote:
> >
> > There is also the fact we need to handle the lost NMI, by deferring its
> > treatment or so. That adds even more complexity.
>
> I don't think you read my proposal very deeply. It already handles
> them by taking a fault on the iret of the first one (that's why we
> point to the stack frame - so that we can corrupt it and force a
> fault).

Ah right, I missed this part.

