From: Mathieu Desnoyers on
Hi Linus,

What I omitted in my original description paragraph is that I also test for NMIs
nested over NMI "regular code" with a "nesting" per-cpu flag, which deals with
the concerns you expressed in your reply about function calls and traps.

I'm self-replying to keep track of Avi's comment about the need to save/restore
cr2 at the beginning/end of the NMI handler, so we don't end up corrupting a VM's
CR2 if we have the following scenario: trap in VM, NMI, trap in NMI. So I added
cr2 awareness to the code snippet below; we should be close to having something
that starts to make sense (although I'm not saying it's bug-free yet). ;)
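
(For reference: cr2 cannot be pushed or popped directly; it is only accessible
through a general-purpose register, so any save/restore has to be staged
through a scratch register, roughly:

    movq %cr2,%rax    # save the faulting address (assumes %rax is free)
    ...               # code that may take a page fault, clobbering cr2
    movq %rax,%cr2    # put it back

The snippet below stages it through %rax in exactly this way.)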

Please note that I'll be off on vacation for 2 weeks starting this evening (back
on August 2) without Internet access, so my answers might be delayed.

Thanks!

Mathieu


Code originally written by Linus Torvalds, modified by Mathieu Desnoyers,
intending to handle the fake NMI entry gracefully given that NMIs are not
necessarily disabled at the entry point. It uses a "need fake NMI" flag rather
than playing games with CS and faults. When a fake NMI is needed, it simply
jumps back to the beginning of the regular NMI code. The NMI exit code and fake
NMI entry are made reentrant with respect to NMI handler interruption by
testing, at the very beginning of the NMI handler, whether an NMI is nested
over the whole nmi_atomic .. nmi_atomic_end code region. It also tests for
nested NMIs by keeping a per-cpu "nmi nested" flag; this deals with detection
of nesting over the "regular nmi" execution. This code assumes NMIs have a
separate stack.

#
# Two per-cpu variables: an "are we nested" flag (one byte), and
# a "do we need to execute a fake NMI" flag (one byte).
# The stack copy is saved at a fixed address, which leaves enough room at the
# bottom of the NMI stack for the "real" NMI entry stack. This assumes we have
# a separate NMI stack.
# The top of the NMI stack copy is at nmi_stack_copy.
# The "rip" slot of the NMI stack copy is at nmi_stack_copy_rip, which is set
# to nmi_stack_copy-40.
#
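# For clarity, the assumed layout of the copy area (the cr2 slot occupies the
# word at nmi_stack_copy itself, above the copied frame):
#
#   nmi_stack_copy    (= nmi_stack_copy_rip+40) : saved cr2
#   nmi_stack_copy-8                            : ss
#   nmi_stack_copy-16                           : rsp
#   nmi_stack_copy-24                           : eflags
#   nmi_stack_copy-32                           : cs
#   nmi_stack_copy-40 (= nmi_stack_copy_rip)    : rip
#   ... room below for the "real" NMI entry stack ...
#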
nmi:
# Test if nested over atomic code.
cmpq $nmi_atomic,0(%rsp)
jae nmi_addr_is_ae
# Test if nested over general NMI code.
cmpb $0,%__percpu_seg:nmi_stack_nesting
jne nmi_nested_set_fake_and_return
# create new stack
is_unnested_nmi:
# Save some space for nested NMIs. The exception itself
# will never use more space, but it might use less (since
# it will be a kernel-kernel transition).

# Save %rax on top of the stack (we need a scratch register)
pushq %rax
# Save cr2 to handle nesting over page faults. cr2 can only be read
# through a general register, so stage it in %rax and store it at its
# fixed slot, nmi_stack_copy_rip+40 (the word at nmi_stack_copy itself,
# just above the copied frame).
movq %cr2,%rax
movq %rax,nmi_stack_copy_rip+40
movq %rsp,%rax
movq $nmi_stack_copy,%rsp

# copy the five words of stack info. rip starts at 8+0(%rax).
pushq 8+32(%rax) # ss
pushq 8+24(%rax) # rsp
pushq 8+16(%rax) # eflags
pushq 8+8(%rax) # cs
pushq 8+0(%rax) # rip
movq 0(%rax),%rax # restore %rax

set_nmi_nesting:
# and set the nesting flags
movb $0xff,%__percpu_seg:nmi_stack_nesting

regular_nmi_code:
...
# regular NMI code goes here, and can take faults,
# because this sequence now has proper nested-nmi
# handling
...

nmi_atomic:
# An NMI nesting over the whole nmi_atomic .. nmi_atomic_end region will
# be handled specially. This includes the fake NMI entry point.
cmpb $0,%__percpu_seg:need_fake_nmi
jne fake_nmi
movb $0,%__percpu_seg:nmi_stack_nesting
# restore cr2 (a page fault inside the handler may have clobbered it).
# cr2 can only be written through a general register, so borrow %rax;
# a nested NMI in this window is caught by the nmi_atomic range check.
pushq %rax
movq nmi_stack_copy_rip+40,%rax
movq %rax,%cr2
popq %rax
iret

# This is the fake NMI entry point.
fake_nmi:
movb $0x0,%__percpu_seg:need_fake_nmi
jmp regular_nmi_code
nmi_atomic_end:

# Make sure the address is in the nmi_atomic range and in the kernel CS segment.
nmi_addr_is_ae:
cmpq $nmi_atomic_end,0(%rsp)
jae is_unnested_nmi
# The saved rip points to the final NMI iret. Check the CS segment to
# make sure.
cmpw $__KERNEL_CS,8(%rsp)
jne is_unnested_nmi

# This is the case when we hit just as we're supposed to do the atomic code
# of a previous NMI. We run the NMI using the old return address that is still
# on the stack copy, rather than copying the new one, which is bogus and points
# to where the nested NMI interrupted the original NMI handler!
# Easy: just set the stack pointer to point to the stack copy, clear
# need_fake_nmi (because we are directly going to execute the requested NMI) and
# jump to "nesting flag set" (which is followed by regular nmi code execution).
movq $nmi_stack_copy_rip,%rsp
movb $0x0,%__percpu_seg:need_fake_nmi
jmp set_nmi_nesting

# This is the actual nested case. Make sure we branch to the fake NMI handler
# after this handler is done, then resume the interrupted NMI handler without
# an iret (an iret would unmask NMIs). The NMI entry switched us away from the
# interrupted handler's stack, so %rsp must be restored too: rebuild a small
# frame just below the interrupted stack pointer (a kernel-kernel transition,
# and NMIs are masked here, so that memory is safe to use).
nmi_nested_set_fake_and_return:
movb $0xff,%__percpu_seg:need_fake_nmi
pushq %rax
pushq %rdx
movq 40(%rsp),%rax # interrupted %rsp
addq $-160,%rax # 128-byte red-zone cushion + 4 words
movq 0(%rsp),%rdx
movq %rdx,0(%rax) # old %rdx
movq 8(%rsp),%rdx
movq %rdx,8(%rax) # old %rax
movq 32(%rsp),%rdx
movq %rdx,16(%rax) # old rflags
movq 16(%rsp),%rdx
movq %rdx,24(%rax) # old rip
movq %rax,%rsp
popq %rdx
popq %rax
popfq
ret $128 # restore %rip and %rsp
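
To sanity-check the intended flow: NMI#1 copies its frame to nmi_stack_copy,
sets nmi_stack_nesting and runs the handler; a page fault inside the handler
returns with an iret, silently re-enabling NMIs; NMI#2 then arrives over the
handler, takes nmi_nested_set_fake_and_return (setting need_fake_nmi) and
resumes NMI#1; when NMI#1 reaches nmi_atomic it sees the flag, jumps to
fake_nmi, and re-runs the handler on behalf of the otherwise-lost NMI#2.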


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Avi Kivity on
On 07/15/2010 04:23 AM, Linus Torvalds wrote:
> On Wed, Jul 14, 2010 at 3:37 PM, Linus Torvalds
> <torvalds(a)linux-foundation.org> wrote:
>
>> I think the %rip check should be pretty simple - exactly because there
>> is only a single point where the race is open between that 'mov' and
>> the 'iret'. So it's simpler than the (similar) thing we do for
>> debug/nmi stack fixup for sysenter that has to check a range.
>>
> So this is what I think it might look like, with the %rip in place.
> And I changed the "nmi_stack_ptr" thing to have both the pointer and a
> flag - because it turns out that in the single-instruction race case,
> we actually want the old pointer.
>
> Totally untested, of course. But _something_ like this might work:
>
> #
> # Two per-cpu variables: an "are we nested" flag (one byte), and
> # a "if we're nested, what is the %rsp for the nested case".
> #
> # The reason for why we can't just clear the saved-rsp field and
> # use that as the flag is that we actually want to know the saved
> # rsp for the special case of having a nested NMI happen on the
> # final iret of the unnested case.
> #
> nmi:
> cmpb $0,%__percpu_seg:nmi_stack_nesting
> jne nmi_nested_corrupt_and_return
> cmpq $nmi_iret_address,0(%rsp)
> je nmi_might_be_nested
> # create new stack
> is_unnested_nmi:
> # Save some space for nested NMI's. The exception itself
> # will never use more space, but it might use less (since
> # it will be a kernel-kernel transition). But the nested
> # exception will want two save registers and a place to
> # save the original CS that it will corrupt
> subq $64,%rsp
>
> # copy the five words of stack info. 96 = 64 + stack
> # offset of ss.
> pushq 96(%rsp) # ss
> pushq 96(%rsp) # rsp
> pushq 96(%rsp) # eflags
> pushq 96(%rsp) # cs
> pushq 96(%rsp) # rip
>
> # and set the nesting flags
> movq %rsp,%__percpu_seg:nmi_stack_ptr
> movb $0xff,%__percpu_seg:nmi_stack_nesting
>
>

By trading off some memory, we don't need this trickery. We can
allocate two nmi stacks, so the code becomes:

nmi:
cmpb $0, %__percpu_seg:nmi_stack_nesting
je unnested_nmi
cmpq $nmi_iret,(%rsp)
jne unnested_nmi
cmpw $__KERNEL_CS,8(%rsp)
jne unnested_nmi
# Nested over the final iretq: the saved rip and cs are known, so drop
# them, restore rflags, switch back to the interrupted NMI frame, and
# finish the interrupted iretq.
addq $16,%rsp
popfq
popq %rsp
jmp nmi_iret
unnested_nmi:
xorq $(nmi_stack_1 ^ nmi_stack_2),%__percpu_seg:tss_nmi_ist_entry
movb $1, %__percpu_seg:nmi_stack_nesting
regular_nmi:
...
regular_nmi_end:
movb $0, %__percpu_seg:nmi_stack_nesting
nmi_iret:
iretq
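
For concreteness, the xorq line relies on x ^ (a ^ b) mapping a to b and b to
a:

    nmi_stack_1 ^ (nmi_stack_1 ^ nmi_stack_2) == nmi_stack_2
    nmi_stack_2 ^ (nmi_stack_1 ^ nmi_stack_2) == nmi_stack_1

so each unnested entry flips the IST slot to the other stack before NMIs can
be re-enabled and nest.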




--
error compiling committee.c: too many arguments to function

From: Linus Torvalds on
On Sun, Jul 18, 2010 at 4:03 AM, Avi Kivity <avi(a)redhat.com> wrote:
>
> By trading off some memory, we don't need this trickery. We can allocate
> two nmi stacks, so the code becomes:

I really don't think you need even that. See earlier in the discussion
about how we could just test %rsp itself. Which makes all the %rip
testing totally unnecessary, because we don't even need any flags, and
we have no races because %rsp is atomically changed with taking the
exception.

Lookie here, the %rsp comparison really isn't that hard:

nmi:
pushq %rax
pushq %rdx
movq %rsp,%rdx # current stack top
movq 40(%rsp),%rax # old stack top
xor %rax,%rdx # same 8kB aligned area?
shrq $13,%rdx # ignore low 13 bits
je it_is_a_nested_nmi # looks nested..
non_nested:
...
... ok, we're not nested, do normal NMI handling ...
...
popq %rdx
popq %rax
iret

it_is_a_nested_nmi:
cmpw $0,48(%rsp) # double-check that it really was a nested exception
jne non_nested # from user space or something..
# this is the nested case
# NOTE! NMI's are blocked, we don't take any exceptions etc etc
addq $-160,%rax # 128-byte redzone on the old stack + 4 words
movq (%rsp),%rdx
movq %rdx,(%rax) # old %rdx
movq 8(%rsp),%rdx
movq %rdx,8(%rax) # old %rax
movq 32(%rsp),%rdx
movq %rdx,16(%rax) # old %rflags
movq 16(%rsp),%rdx
movq %rdx,24(%rax) # old %rip
movq %rax,%rsp
popq %rdx
popq %rax
popf
ret $128 # restore %rip and %rsp
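
To make the masking concrete, assuming an 8kB-aligned NMI stack (addresses
invented for illustration):

    0xffff8800001fffc8    # current %rsp, just below the NMI stack top
    0xffff8800001fe010    # interrupted %rsp, deeper in the same stack
                          # xor -> 0x1fd8; shrq $13 -> 0: same area, nested

Any %rsp outside this 8kB area differs in bit 13 or above, so the shifted
result is non-zero and the branch falls through to non_nested.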

doesn't that look pretty simple?

NOTE! OBVIOUSLY TOTALLY UNTESTED!

Linus
From: Avi Kivity on
On 07/18/2010 08:36 PM, Linus Torvalds wrote:
> On Sun, Jul 18, 2010 at 4:03 AM, Avi Kivity<avi(a)redhat.com> wrote:
>
>> By trading off some memory, we don't need this trickery. We can allocate
>> two nmi stacks, so the code becomes:
>>
> I really don't think you need even that. See earlier in the discussion
> about how we could just test %rsp itself. Which makes all the %rip
> testing totally unnecessary, because we don't even need any flags, and
> we have no races because %rsp is atomically changed with taking the
> exception.
>
> Lookie here, the %rsp comparison really isn't that hard:
>
> nmi:
> pushq %rax
> pushq %rdx
> movq %rsp,%rdx # current stack top
> movq 40(%rsp),%rax # old stack top
> xor %rax,%rdx # same 8kB aligned area?
> shrq $13,%rdx # ignore low 13 bits
> je it_is_a_nested_nmi # looks nested..
>
>

....

> doesn't that look pretty simple?
>
>

Too simple - an MCE will switch to its own stack, failing the test. Now
that we have correctable MCEs, that's not a good idea.

So the plain everyday sequence

NMI
#PF
MCE (uncompleted)
NMI

will fail.

Plus, even in the non-nested case, you have to copy the stack frame, or
the nested NMI will corrupt it. With stack switching, the nested NMI is
allocated its own frame.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

From: Linus Torvalds on
On Sun, Jul 18, 2010 at 10:36 AM, Linus Torvalds
<torvalds(a)linux-foundation.org> wrote:
>
> Lookie here, the %rsp comparison really isn't that hard:

A few notes on that (still) untested code suggestion:

>  nmi:
>      pushq %rax
>      pushq %rdx
>      movq %rsp,%rdx          # current stack top
>      movq 40(%rsp),%rax      # old stack top
>      xor %rax,%rdx           # same 8kB aligned area?
>      shrq $13,%rdx           # ignore low 13 bits
>      je it_is_a_nested_nmi   # looks nested..
>  non_nested:
>      ...
>      ... ok, we're not nested, do normal NMI handling ...
>      ...

The non_nested case still needs to start off with moving its stack
frame to a safe area that won't be overwritten by any nesting NMI's
(note that they cannot nest at this point, since we've done nothing
that can fault). So we'd still need that

7* pushq 48(%rsp)

which copies the five words that got pushed by hardware, and the two
register-save locations that we used for the nesting check and special
return.
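
Spelled out, that shorthand is seven identical pushes. Each one reads the
word 48 bytes above the current stack top, so the seven-word frame gets
duplicated right below itself, one word at a time:

    pushq 48(%rsp)    # ss
    pushq 48(%rsp)    # rsp
    pushq 48(%rsp)    # eflags
    pushq 48(%rsp)    # cs
    pushq 48(%rsp)    # rip
    pushq 48(%rsp)    # saved %rax
    pushq 48(%rsp)    # saved %rdx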

After we've done those 7 pushes, we can then run code that may take a
fault. Because when the fault returns with an "iret" and re-enables
NMI's, our nesting code is ready.

So all told, we need a maximum of about 216 bytes of stack for the
nested NMI case: 56 bytes for the seven copied words, and the 160
bytes that we build up _under_ the stack pointer for the nested case.
And we need the NMI stack itself to be aligned in order for that
"ignore low bits" check to work. Although we don't actually have to do
that "xor+shr", we could do the test equally well with a "sub+unsigned
compare against stack size".
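
A minimal sketch of that variant (assuming a per-cpu nmi_stack_base variable
holding the base of this CPU's 8kB NMI stack):

    movq 40(%rsp),%rdx                      # interrupted %rsp
    subq %__percpu_seg:nmi_stack_base,%rdx  # offset into this CPU's NMI stack
    cmpq $8192,%rdx                         # unsigned compare against the size
    jb it_is_a_nested_nmi                   # inside the NMI stack: nested

An interrupted %rsp on any other stack yields a huge unsigned offset, so the
compare fails.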

Other than that, I think the extra test that we're really nested might
better be done differently:

>  it_is_a_nested_nmi:
>      cmpw $0,48(%rsp)     # double-check that it really was a nested exception
>      jne non_nested       # from user space or something..
>      # this is the nested case

It might be safer to check the saved CS rather than the saved SS on
the stack to see that we really are in kernel mode. It's possible that
somebody could load a NULL SS in user mode and then just not use the
stack - and try to make it look like they are in kernel mode for when
the NMI happens. Now, I _think_ that loading a zero SS is supposed to
trap, but checking CS is still likely to be the better test for "were
we in kernel mode". That's where the CPL is really encoded, after all.

So that "cmpw $0,48(%rsp)" is probably ok, but it would likely be
better to do it as

testl $3,24(%rsp)
jne non_nested

instead. That's what entry_64.S does everywhere else.

Oh, and the non-nested case obviously needs all the regular "make the
kernel state look right" code. Like the swapgs stuff etc if required.
My example code was really meant to just document the nesting
handling, not the existing stuff we already need to do with
save_paranoid etc.

And I really think it should work, but I'd again like to stress that
it's just an RFD code sequence with no testing whatsoever.

Linus