[patch 1/2] x86_64 page fault NMI-safe [Kernel]


From: Mathieu Desnoyers on 14 Jul 2010 12:10

> I think you're vastly overestimating what is sane to do from an NMI
> context. It is utterly and totally insane to assume vmalloc is available
> in NMI.
>
> -hpa
>

Ok, please tell me where I am wrong then. Looking at
arch/x86/mm/fault.c, I see that vmalloc_sync_all() touches pgd_list
entries while the pgd_lock spinlock is taken, with interrupts disabled.
So it's protected against concurrent pgd_list modification from:

a - vmalloc_sync_all() on other CPUs
b - local interrupts

However, a completely normal interrupt can arrive on a remote CPU, run
vmalloc_fault() and issue a set_pgd concurrently. Therefore I conclude
that this interrupt disable is not there to ensure any kind of
protection against concurrent updates.

Also, we see that vmalloc_fault has comments such as:

(for x86_32)
* Do _not_ use "current" here. We might be inside
* an interrupt in the middle of a task switch..

So it takes the pgd address from cr3, not from current. Using only the
stack/registers makes this NMI-safe even if "current" is invalid when
the NMI arrives. "current" can be invalid at that point because
__switch_to updates the registers before updating current_task, without
disabling interrupts.
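
To make the cr3-based lookup concrete, here is a minimal user-space
sketch of how a top-level (pgd) entry can be located from a table base
and a faulting address. It is not the kernel code: pgd_t is a stand-in
type, pgd_entry_from_base() is a hypothetical helper, and the table base
is assumed to be what __va(read_cr3()) would yield in the kernel. The
shift and table size are the x86_64 4-level paging values. Note the cast
to pgd_t * before the addition: with GCC's void * arithmetic extension,
adding an index to an untyped pointer advances by bytes, not by entries.

```c
#include <stddef.h>
#include <stdint.h>

/* x86_64 4-level paging: each pgd entry covers 2^39 bytes, 512 entries. */
#define PGDIR_SHIFT   39
#define PTRS_PER_PGD  512

/* Stand-in for the kernel's pgd_t: one 64-bit top-level entry. */
typedef struct { uint64_t pgd; } pgd_t;

/* Index of the top-level entry that covers a virtual address. */
static inline size_t pgd_index(uint64_t address)
{
	return (size_t)((address >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1));
}

/*
 * Hypothetical helper: pgd_base models the virtual address of the
 * top-level table currently loaded in cr3. The cast makes the pointer
 * arithmetic scale by sizeof(pgd_t) instead of by single bytes.
 */
static inline pgd_t *pgd_entry_from_base(void *pgd_base, uint64_t address)
{
	return (pgd_t *)pgd_base + pgd_index(address);
}
```

Nothing in this lookup dereferences "current"; it depends only on the
table base and the address, which is the property the NMI argument
relies on.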

You are right that x86_64 does not seem to play as safely as x86_32
on this matter; it uses current->active_mm. Probably it shouldn't
assume "current" is valid. Actually, I don't see where x86_64 disables
interrupts around __switch_to, so this would seem to be a race
condition. Or have I missed something?
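
The window in question can be illustrated with a toy model. This is only
a sketch under stated assumptions, not kernel code: struct cpu,
toy_context_switch() and observe() are hypothetical names, and the model
simply assumes the page-table register is updated before the "current"
pointer, with a callback standing in for an NMI that lands in between.

```c
#include <stdint.h>

struct mm   { uint64_t pgd_paddr; };
struct task { struct mm *active_mm; };

struct cpu {
	uint64_t cr3;          /* models the hardware page-table register */
	struct task *current;  /* models the per-CPU current_task pointer */
};

typedef void (*nmi_fn)(struct cpu *cpu, void *data);

/* Toy switch: registers first, then "current", with an NMI in between. */
static void toy_context_switch(struct cpu *cpu, struct task *next,
			       nmi_fn nmi, void *data)
{
	cpu->cr3 = next->active_mm->pgd_paddr;  /* step 1: registers */
	if (nmi)
		nmi(cpu, data);                 /* NMI hits the window */
	cpu->current = next;                    /* step 2: current_task */
}

struct observation {
	uint64_t via_current;  /* pgd found through current->active_mm */
	uint64_t via_cr3;      /* pgd found through the register */
};

/* What a handler would see through each path while inside the window. */
static void observe(struct cpu *cpu, void *data)
{
	struct observation *obs = data;

	obs->via_current = cpu->current->active_mm->pgd_paddr;
	obs->via_cr3 = cpu->cr3;
}
```

Switching from a task whose mm has pgd_paddr 0x1000 to one with 0x2000
and passing observe() as the callback records 0x1000 via "current" and
0x2000 via cr3: inside the window, only the register path names the page
tables the CPU is actually using.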

(Ingo)
> > the scheduler disables interrupts around __switch_to(). (x86 does
> > not set __ARCH_WANT_INTERRUPTS_ON_CTXSW)
>
(Mathieu)
> Ok, so I guess it's only useful to NMIs then. However, it makes me
> wonder why this comment was there in the first place on x86_32
> vmalloc_fault() and why it uses read_cr3() :
>
> * Do _not_ use "current" here. We might be inside
> * an interrupt in the middle of a task switch..
(Ingo)
hm, i guess it's still useful to keep the
__ARCH_WANT_INTERRUPTS_ON_CTXSW case working too. On -rt we used to
enable it to squeeze a tiny bit more latency out of the system.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers(a)polymtl.ca>
CC: akpm(a)osdl.org
CC: mingo(a)elte.hu
CC: "H. Peter Anvin" <hpa(a)zytor.com>
CC: Jeremy Fitzhardinge <jeremy(a)goop.org>
CC: Steven Rostedt <rostedt(a)goodmis.org>
CC: "Frank Ch. Eigler" <fche(a)redhat.com>
---
arch/x86/mm/fault.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/arch/x86/mm/fault.c
===================================================================
--- linux-2.6-lttng.orig/arch/x86/mm/fault.c 2010-03-13 16:56:46.000000000 -0500
+++ linux-2.6-lttng/arch/x86/mm/fault.c 2010-03-13 16:57:53.000000000 -0500
@@ -360,6 +360,7 @@ void vmalloc_sync_all(void)
  */
 static noinline __kprobes int vmalloc_fault(unsigned long address)
 {
+	unsigned long pgd_paddr;
 	pgd_t *pgd, *pgd_ref;
 	pud_t *pud, *pud_ref;
 	pmd_t *pmd, *pmd_ref;
@@ -374,7 +375,8 @@ static noinline __kprobes int vmalloc_fa
 	 * happen within a race in page table update. In the later
 	 * case just flush:
 	 */
-	pgd = pgd_offset(current->active_mm, address);
+	pgd_paddr = read_cr3();
+	pgd = __va(pgd_paddr) + pgd_index(address);
 	pgd_ref = pgd_offset_k(address);
 	if (pgd_none(*pgd_ref))
 		return -1;
