From: Avi Kivity on
On 07/14/2010 06:49 PM, Mathieu Desnoyers wrote:
> Implements an alternative to iret, using popf and return, so trap and exception
> handlers can return to the NMI handler without issuing iret. iret would cause
> NMIs to be re-enabled prematurely. x86_32 uses popf and a far return. x86_64 has
> to copy the return instruction pointer to the top of the previous stack, issue a
> popf, load the previous esp and issue a near return (ret).
>
> It allows placing dynamically patched static jumps (in asm gotos), which will be
> used for optimized tracepoints, in NMI code, since returning from a breakpoint
> would be valid. Accessing vmalloc'd memory from NMI context, which allows
> executing module code or touching vmapped or vmalloc'd areas, would also be valid.
> This is very useful to tracers like LTTng.
>
> This patch makes all faults, traps and exceptions safe to be called from NMI
> context *except* single-stepping, which requires iret to restore the TF (trap
> flag) and jump to the return address in a single instruction. Sorry, no kprobes
> support in NMI handlers because of this limitation. This cannot be emulated
> with popf/lret, because lret would be single-stepped. It does not apply to
> "immediate values" because they do not use single-stepping. This code detects if
> the TF flag is set and uses the iret path for single-stepping, even if it
> reactivates NMIs prematurely.
>
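
For illustration, here is a minimal, self-contained sketch (not taken from the
patch) of the "dynamically patched static jump in an asm goto" idea the
changelog refers to; the function name is made up, and the NOP is simply left
unpatched here. A patching framework such as the jump-label code can rewrite
the 5-byte NOP into a jmp to the l_yes label at runtime.

/*
 * Illustrative only: a static-jump site built with gcc's "asm goto"
 * (gcc >= 4.5, x86).  The 5-byte NOP emitted at label 1: can be
 * live-patched into "jmp l_yes"; until then the function falls
 * through and returns false, i.e. the tracepoint stays disabled.
 */
#include <stdio.h>
#include <stdbool.h>

static inline bool trace_site_enabled(void)
{
	asm goto("1:\n\t"
		 ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n\t" /* 5-byte NOP */
		 : : : : l_yes);
	return false;		/* NOP not patched */
l_yes:
	return true;		/* reached only once the NOP is patched */
}

int main(void)
{
	printf("tracepoint %s\n",
	       trace_site_enabled() ? "enabled" : "disabled");
	return 0;
}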

You need to save/restore cr2 in addition, otherwise the following hits you

- page fault
- processor writes cr2, enters fault handler
- nmi
- page fault
- cr2 overwritten

I guess you would usually not notice the corruption since you'd just see
a spurious fault on the page the NMI handler touched, but if the first
fault happened in a kvm guest, then we'd corrupt the guest's cr2.
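
To make that concrete, a minimal sketch of the save/restore being asked for,
assuming it is done at the C entry point of the NMI handler; my_nmi_handler()
is a made-up name, read_cr2()/write_cr2() are the usual x86 accessors.

/*
 * Sketch only, not the actual patch: save CR2 on NMI entry and restore it
 * on exit, so a page fault taken inside the NMI handler cannot clobber the
 * CR2 of a fault that was being handled when the NMI hit.
 */
#include <linux/kernel.h>
#include <asm/ptrace.h>
#include <asm/processor.h>	/* pulls in read_cr2()/write_cr2() */

static void my_nmi_handler(struct pt_regs *regs)
{
	unsigned long saved_cr2 = read_cr2();

	/* ... NMI work that may itself take a page fault ... */

	write_cr2(saved_cr2);
}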

But the whole thing strikes me as overkill. If it's 8k per-cpu, what's
wrong with using a per-cpu pointer to a kmalloc() area?
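
A sketch of that alternative, with made-up names and the 8k size taken from
the sentence above: a per-CPU pointer filled from kmalloc() at init time.
kmalloc()'d memory lives in the direct mapping, so NMI context can touch it
without ever taking a vmalloc fault.

/*
 * Illustrative sketch of the per-cpu-pointer-to-kmalloc() idea; names are
 * made up.  Error handling is minimal (earlier allocations are not freed).
 */
#include <linux/init.h>
#include <linux/percpu.h>
#include <linux/slab.h>
#include <linux/smp.h>

#define TRACE_BUF_SIZE	8192

static DEFINE_PER_CPU(void *, trace_buf);

static int __init trace_buf_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		void *buf = kmalloc(TRACE_BUF_SIZE, GFP_KERNEL);

		if (!buf)
			return -ENOMEM;
		per_cpu(trace_buf, cpu) = buf;
	}
	return 0;
}

/* Safe from NMI context: the buffer is in the direct mapping. */
static void *nmi_trace_buf(void)
{
	return per_cpu(trace_buf, smp_processor_id());
}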

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

From: Mathieu Desnoyers on
* Avi Kivity (avi(a)redhat.com) wrote:
> On 07/14/2010 06:49 PM, Mathieu Desnoyers wrote:
>> Implements an alternative to iret, using popf and return, so trap and exception
>> handlers can return to the NMI handler without issuing iret. iret would cause
>> NMIs to be re-enabled prematurely. x86_32 uses popf and a far return. x86_64 has
>> to copy the return instruction pointer to the top of the previous stack, issue a
>> popf, load the previous esp and issue a near return (ret).
>>
>> It allows placing dynamically patched static jumps (in asm gotos), which will be
>> used for optimized tracepoints, in NMI code, since returning from a breakpoint
>> would be valid. Accessing vmalloc'd memory from NMI context, which allows
>> executing module code or touching vmapped or vmalloc'd areas, would also be valid.
>> This is very useful to tracers like LTTng.
>>
>> This patch makes all faults, traps and exceptions safe to be called from NMI
>> context *except* single-stepping, which requires iret to restore the TF (trap
>> flag) and jump to the return address in a single instruction. Sorry, no kprobes
>> support in NMI handlers because of this limitation. This cannot be emulated
>> with popf/lret, because lret would be single-stepped. It does not apply to
>> "immediate values" because they do not use single-stepping. This code detects if
>> the TF flag is set and uses the iret path for single-stepping, even if it
>> reactivates NMIs prematurely.
>>
>
> You need to save/restore cr2 in addition, otherwise the following hits you
>
> - page fault
> - processor writes cr2, enters fault handler
> - nmi
> - page fault
> - cr2 overwritten
>
> I guess you would usually not notice the corruption since you'd just see
> a spurious fault on the page the NMI handler touched, but if the first
> fault happened in a kvm guest, then we'd corrupt the guest's cr2.

OK, just to make sure: you mean we'd have to save/restore the cr2 register
at the beginning/end of the NMI handler execution, right? Then shouldn't we
save/restore cr3 too?

> But the whole thing strikes me as overkill. If it's 8k per-cpu, what's
> wrong with using a per-cpu pointer to a kmalloc() area?

Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
much more than perf) can potentially cause large latencies, which could be
squashed by allowing page faults in NMI handlers. This looks like a stronger
argument to me.
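
To make the latency point concrete, here is the kind of (hypothetical) setup
code such callers have to write today, precisely because NMI context cannot
take the lazy vmalloc fault; allowing the fault would let the sync call go
away.

/*
 * Hypothetical example, not quoted from any real caller: a buffer that will
 * later be touched from NMI context is vmalloc()'d, so its page-table
 * entries must be propagated to every pgd up front with vmalloc_sync_all().
 * That propagation walks all the page tables in the system, which is where
 * the latency comes from.
 */
#include <linux/init.h>
#include <linux/vmalloc.h>

static void *nmi_trace_area;

static int __init nmi_trace_setup(void)
{
	nmi_trace_area = vmalloc(2 * PAGE_SIZE);	/* e.g. 8k on x86 */
	if (!nmi_trace_area)
		return -ENOMEM;

	vmalloc_sync_all();	/* expensive: sync the new mapping everywhere */
	return 0;
}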

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Andi Kleen on
> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> much more than perf) can potentially cause large latencies, which could be

You need to fix all the other code that walks task lists too, to avoid all of those latencies.

% gid for_each_process | wc -l

In fact the mm_struct walk is cheaper than a task-list walk because there
are always fewer mm_structs than tasks.
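
For reference, the call sites that count finds typically look like the
(hypothetical) sketch below: a full task-list walk under tasklist_lock.
Since several tasks can share one mm_struct, a walk over mm_structs touches
at most as many entries, which is the point above.

/*
 * Hypothetical illustration of a task-list walk of the kind counted by
 * "gid for_each_process | wc -l": every process is visited while holding
 * tasklist_lock for reading.
 */
#include <linux/kernel.h>
#include <linux/sched.h>

static int count_processes(void)
{
	struct task_struct *p;
	int n = 0;

	read_lock(&tasklist_lock);
	for_each_process(p)
		n++;
	read_unlock(&tasklist_lock);

	return n;
}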

-Andi
From: Mathieu Desnoyers on
* Andi Kleen (andi(a)firstfloor.org) wrote:
> > Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> > much more than perf) can potentially cause large latencies, which could be
>
> You need to fix all the other code that walks task lists too, to avoid all of those latencies.
>
> % gid for_each_process | wc -l

This can very well be done incrementally. And I agree, these should eventually be
targeted too, especially those which hold locks. We've already started hearing
about tasklist lock live-locks in the past year, so I think we're pretty much at
the point where it should be looked at.

Thanks,

Mathieu

>
> In fact the mm_struct walk is cheaper than a task-list walk because there
> are always fewer mm_structs than tasks.

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Avi Kivity on
On 07/16/2010 05:49 PM, Mathieu Desnoyers wrote:
>
>> You need to save/restore cr2 in addition, otherwise the following hits you
>>
>> - page fault
>> - processor writes cr2, enters fault handler
>> - nmi
>> - page fault
>> - cr2 overwritten
>>
>> I guess you would usually not notice the corruption since you'd just see
>> a spurious fault on the page the NMI handler touched, but if the first
>> fault happened in a kvm guest, then we'd corrupt the guest's cr2.
>>
> OK, just to make sure: you mean we'd have to save/restore the cr2 register
> at the beginning/end of the NMI handler execution, right?

Yes.

> Then shouldn't we
> save/restore cr3 too?
>
>

No, faults should not change cr3.

>> But the whole thing strikes me as overkill. If it's 8k per-cpu, what's
>> wrong with using a per-cpu pointer to a kmalloc() area?
>>
> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> much more than perf) can potentially cause large latencies, which could be
> squashed by allowing page faults in NMI handlers. This looks like a stronger
> argument to me.

Why is that kernel code calling vmalloc_sync_all()? If it is only NMI
which cannot take vmalloc faults, why bother? If not, why not?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
