From: Linus Torvalds on
On Thu, Jul 15, 2010 at 11:43 AM, Linus Torvalds
<torvalds(a)linux-foundation.org> wrote:
>
> But maybe I'm missing something.

Hmm. Of course - one way of solving this might be to just make the
32-bit case switch stacks in software. That might be a good idea
regardless, and would not be complicated. We already do that for
sysenter, but the NMI case would be simpler because we don't need to
worry about being re-entered by NMI/DEBUG during the stack switch.

And since we have to play some games with moving the data on the stack
around _anyway_, doing the whole "switch stacks entirely rather than
just subtract a bit from the old stack" would be fairly logical.
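
A rough sketch of what such a software stack switch could look like on
32-bit, for the kernel-to-kernel case only ("nmi_stack_top" is an
illustrative name, not an actual kernel symbol, and per-cpu addressing
is omitted):

nmi_32:
# Free up a scratch register; the CPU pushed eflags, cs, eip above it.
pushl %eax
movl %esp,%eax
# Switch to the dedicated NMI stack.
movl $nmi_stack_top,%esp
# Copy the three-word hardware frame. eip starts at 4+0(%eax).
pushl 4+8(%eax) # eflags
pushl 4+4(%eax) # cs
pushl 4+0(%eax) # eip
movl 0(%eax),%eax # restore %eax
...
# Regular NMI work then runs on the private stack, and the final
# "iret" uses the copied frame.

An NMI that interrupts user space also pushes ss/esp, so that case would
copy five words instead of three.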

So I think you may end up being right: we don't need to save the
original NMI stack pointer, because we can make sure that the
replacement stack (that we need anyway) is always deterministic.

Linus
From: Linus Torvalds on
On Thu, Jul 15, 2010 at 11:31 AM, Mathieu Desnoyers
<mathieu.desnoyers(a)efficios.com> wrote:
>
> Hrm, we could probably get away with only keeping the nmi_stack_nested per-cpu
> variable. The nmi_stack_ptr could be known statically if we set it at a fixed
> offset from the bottom of stack rather than using an offset relative to the top
> (which can change depending if we are nested over the kernel or userspace).
> We just have to reserve enough space for the bottom of stack.

I thought about trying that, but I don't think it's true. At least not
for the 32-bit case.

The thing is, the 32-bit case will re-use the kernel stack if it
happens in kernel space, and will thus start from a random place (and
won't push all the information anyway). So a nested NMI really doesn't
know where the original NMI stack is to be found unless we save it
off.

In the case of x86-64, I think the stack will always be at a fixed
address, and the CPU will push a fixed amount of data (because we use the
IST thing). So there you could probably just use the flag, but you'd still
have to handle the 32-bit case, and quite frankly, I think it would be
much nicer if the logic could be shared for the 32-bit and 64-bit
cases.
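
For reference, the IST mechanism loads %rsp from the IST entry on every
NMI, so the handler always starts at the same fixed top of stack and the
CPU then pushes exactly five words:

# Frame at x86-64 NMI entry via IST (fixed size, fixed address):
# 32(%rsp) ss
# 24(%rsp) rsp (the interrupted stack pointer)
# 16(%rsp) rflags
#  8(%rsp) cs
#  0(%rsp) rip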

But maybe I'm missing something.

Linus
From: H. Peter Anvin on
On 07/15/2010 10:38 AM, Mathieu Desnoyers wrote:
>>
>> Of course, if there is some trap that re-enables interrupts even if
>> the trap happened in an interrupt-disabled region, then that would
>> change things, but that would be a bad bug regardless (and totally
>> independently of any NMI issues). So in that sense it's a "could
>> happen", but it's something that would be a totally separate bug.
>
> Yep, this kind of scenario would therefore be a bug that does not belong to the
> specific realm of nmis.
>

Yes, the only specific issue here is NMI -> trap -> IRET -> [nested
NMI], which is what this whole thing is about.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

From: Mathieu Desnoyers on
Hi Linus,

I modified your code, intending to handle the fake NMI entry gracefully given
that NMIs are not necessarily disabled at the entry point. It uses a "need fake
NMI" flag rather than playing games with CS and faults. When a fake NMI is
needed, it simply jumps back to the beginning of the regular NMI code. NMI exit code
and fake NMI entry are made reentrant with respect to NMI handler interruption
by testing, at the very beginning of the NMI handler, if a NMI is nested over
the whole nmi_atomic .. nmi_atomic_end code region. This code assumes NMIs have
a separate stack.

This code is still utterly untested and might eat your Doritos; it is provided
only for general enjoyment.

Thanks,

Mathieu

#
# Two per-cpu variables: an "are we nested" flag (one byte) and a
# "do we need to execute a fake NMI" flag (one byte).
# The stack copy is saved at a fixed %rsp, which leaves
# enough room at the bottom of the NMI stack for the "real" NMI entry stack. This
# assumes we have a separate NMI stack.
# The top of the NMI stack copy is at nmi_stack_copy.
# The NMI stack copy "rip" is at nmi_stack_copy_rip, which is set to
# nmi_stack_copy-40 (five words below the top of the copy).
#
nmi:
# Test if nested over atomic code.
cmpq $nmi_atomic,0(%rsp)
jae nmi_addr_is_ae
# Test if nested over general NMI code.
cmpb $0,%__percpu_seg:nmi_stack_nesting
jne nmi_nested_set_fake_and_return
# Create the new stack.
is_unnested_nmi:
# Save some space for nested NMIs. The exception itself
# will never use more space, but it might use less (since
# it will be a kernel-kernel transition).

# Save %rax on top of the stack (need to temporarily use it)
pushq %rax
movq %rsp, %rax
movq $nmi_stack_copy,%rsp

# copy the five words of stack info. rip starts at 8+0(%rax).
pushq 8+32(%rax) # ss
pushq 8+24(%rax) # rsp
pushq 8+16(%rax) # eflags
pushq 8+8(%rax) # cs
pushq 8+0(%rax) # rip
movq 0(%rax),%rax # restore %rax

set_nmi_nesting:
# and set the nesting flags
movb $0xff,%__percpu_seg:nmi_stack_nesting

regular_nmi_code:
...
# regular NMI code goes here, and can take faults,
# because this sequence now has proper nested-nmi
# handling
...

nmi_atomic:
# An NMI nesting over the whole nmi_atomic .. nmi_atomic_end region will
# be handled specially. This includes the fake NMI entry point.
cmpb $0,%__percpu_seg:need_fake_nmi
jne fake_nmi
movb $0,%__percpu_seg:nmi_stack_nesting
iret

# This is the fake NMI entry point.
fake_nmi:
movb $0x0,%__percpu_seg:need_fake_nmi
jmp regular_nmi_code
nmi_atomic_end:

# Make sure the address is in the nmi_atomic range and in the kernel CS segment.
nmi_addr_is_ae:
cmpq $nmi_atomic_end,0(%rsp)
jae is_unnested_nmi
# The saved rip points into the atomic NMI exit code. Check the CS segment to
# make sure we really interrupted kernel code.
cmpw $__KERNEL_CS,8(%rsp)
jne is_unnested_nmi

# This is the case where we hit just as a previous NMI was about to run its
# atomic exit code. We run the NMI using the old return address that is still
# on the stack, rather than copy the new one, which is bogus and points to where
# the nested NMI interrupted the original NMI handler!
# Easy: just set the stack pointer to point to the stack copy, clear
# need_fake_nmi (because we are directly going to execute the requested NMI) and
# jump to "nesting flag set" (which is followed by the regular NMI code).
movq $nmi_stack_copy_rip,%rsp
movb $0x0,%__percpu_seg:need_fake_nmi
jmp set_nmi_nesting

# This is the actual nested case. Flag that the fake NMI must run once the
# interrupted NMI handler is done.
nmi_nested_set_fake_and_return:
movb $0xff,%__percpu_seg:need_fake_nmi
popfq
jmp *(%rsp)


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Linus Torvalds on
On Thu, Jul 15, 2010 at 3:01 PM, Mathieu Desnoyers
<mathieu.desnoyers(a)efficios.com> wrote:
>
> . NMI exit code
> and fake NMI entry are made reentrant with respect to NMI handler interruption
> by testing, at the very beginning of the NMI handler, if a NMI is nested over
> the whole nmi_atomic .. nmi_atomic_end code region.

That is totally bogus. The NMI can be nested by exceptions and
function calls - that's the whole _point_ of this thing. So testing "rip" for
anything other than the specific final "iret" is meaningless. You will
be in an NMI region regardless of what rip is.

> This code assumes NMIs have a separate stack.

It also needs to be made per-cpu (and the flags be per-cpu).

Then you could in fact possibly test the stack pointer for whether it
is in the NMI stack area, and use the value of %rsp itself as the
flag. So you could avoid the flag entirely. Because testing %rsp is
valid - testing %rip is not.

That would also avoid the race, because %rsp (as a flag) now gets
cleared atomically by the "iret". So that might actually solve things.
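
A minimal sketch of one way that test could look, assuming the per-cpu
NMI stack spans nmi_stack_bottom..nmi_stack_top (illustrative names) and
using the interrupted %rsp saved in the hardware frame as the nesting
indicator:

nmi:
# The CPU pushed rip, cs, eflags, rsp, ss; the interrupted %rsp is at 24(%rsp).
cmpq $nmi_stack_bottom,24(%rsp)
jb not_nested
cmpq $nmi_stack_top,24(%rsp)
jae not_nested
# The interrupted %rsp lies inside the NMI stack area, so we are nested
# over the NMI handler itself: flag a replay and return.
...
not_nested:
# Not nested. The final "iret" of the handler restores the pre-NMI %rsp,
# so the "flag" clears atomically with the return.
...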

Linus