From: Frederic Weisbecker on
On Wed, Jul 14, 2010 at 12:14:01PM -0700, Linus Torvalds wrote:
> On Wed, Jul 14, 2010 at 11:46 AM, Ingo Molnar <mingo(a)elte.hu> wrote:
> >> NMI entry:
> >
> > I think at this point [NMI re-entry] we've corrupted the top of the NMI kernel
> > stack already, due to entering via the IST stack mechanism, which is
> > non-nesting and which enters at the same point - right?
>
> Yeah, you're right, but we could easily fix that up. We know we don't
> need any stack for the nested case, so all we would need to do is to
> just subtract a small bit off %rsp, and copy the three words or so to
> create a "new" stack for the non-nested case.
>
> > We could solve that by copying that small stack frame off before entering the
> > 'generic' NMI routine - but it all feels a bit pulled in by the hair.
>
> Why? It's much cleaner than making the _real_ codepaths much worse.
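
For concreteness, the fix-up described in the quote above might look
schematically like this (an illustrative C model, not code from the
thread -- the real thing would be a few instructions of x86-64 entry
assembly, and these names are invented):

/*
 * The IST mechanism always enters at the same stack top, so before
 * nesting can do any harm the handler copies the hardware iret frame
 * a little lower and continues on the copy, leaving the fixed top
 * free for a nested NMI to clobber harmlessly.
 */
struct iret_frame {
	unsigned long rip;
	unsigned long cs;
	unsigned long rflags;		/* the "three words or so" */
};

static struct iret_frame *relocate_nmi_frame(struct iret_frame *ist_top)
{
	struct iret_frame *copy = ist_top - 1;	/* a small bit off %rsp */

	*copy = *ist_top;
	/* real code would now point %rsp at 'copy' and carry on */
	return copy;
}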



There is also the fact that we need to handle the lost NMI, by deferring
its treatment or something similar. That adds even more complexity.

From: Linus Torvalds on
On Wed, Jul 14, 2010 at 12:14 PM, Linus Torvalds
<torvalds(a)linux-foundation.org> wrote:
> On Wed, Jul 14, 2010 at 11:46 AM, Ingo Molnar <mingo(a)elte.hu> wrote:
>> We could solve that by copying that small stack frame off before entering the
>> 'generic' NMI routine - but it all feels a bit pulled in by the hair.
>
> Why? It's much cleaner than making the _real_ codepaths much worse.

... but if the option is to never take a fault at all from the NMI
handler, and that is doable, then that would be good, of course.

But that may not be fixable. At least not without much more pain than
just adding a fairly simple hack to the NMI path itself, and keeping
all the NMI pain away from all the other cases.

And doing the per-cpu NMI nesting hack would actually also work as a
way to essentially block NMIs from critical regions. With my NMI
nesting avoidance suggestion, you could literally do something like
this to block NMIs:

/* This is just a fake stack structure */
struct nmi_blocker {
	unsigned long rflags;
	unsigned long cs;
	unsigned long rip;
};

void block_nmi_on_this_cpu(struct nmi_blocker *blocker)
{
	get_cpu();
	memset(blocker, 0, sizeof(*blocker));
	per_cpu_nmi_stack_frame = blocker;
}

void unblock_nmi_on_this_cpu(struct nmi_blocker *blocker)
{
	per_cpu_nmi_stack_frame = NULL;
	barrier();
	/*
	 * Did an NMI happen? If so, we're now running NMI-blocked by
	 * hardware, and we need to emulate the NMI and do a real 'iret'
	 * here.
	 */
	if (blocker->cs == INVALID_CS)
		asm volatile(".. build stack frame, call NMI routine ..");
	put_cpu();
}

or similar. Wouldn't that be nice to have as a capability?
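
As a usage sketch (illustrative only -- it assumes the two helpers
above existed; update_percpu_trace_state is an invented example):

static void update_percpu_trace_state(void)
{
	struct nmi_blocker blocker;

	block_nmi_on_this_cpu(&blocker);
	/* ... update per-CPU state that the NMI handler also reads ... */
	unblock_nmi_on_this_cpu(&blocker);	/* replays any NMI held off */
}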

Linus
From: Andi Kleen on
> or similar. Wouldn't that be nice to have as a capability?

It means the NMI watchdog would become useless if such areas
became common.

Again I suspect all of this is not really needed anyway if
vmalloc_sync_all() works properly. That would solve the original
problem Mathieu was trying to solve for per_cpu data. The rule
would be just to call vmalloc_sync_all() properly when changing
per-CPU data too.

In fact I'm pretty sure it worked originally. Perhaps it regressed?
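
As a minimal sketch of that rule (setup_trace_buffer is invented for
illustration; vmalloc_sync_all() itself is the interface under
discussion):

#include <linux/vmalloc.h>

/*
 * After creating vmalloc'ed backing store that an NMI handler may
 * touch, sync the kernel page tables so the first NMI to reference
 * it cannot take a lazy-PGD page fault.
 */
static void *setup_trace_buffer(unsigned long size)
{
	void *buf = vmalloc(size);

	if (buf)
		vmalloc_sync_all();
	return buf;
}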

-Andi
--
ak(a)linux.intel.com -- Speaking for myself only.
From: Linus Torvalds on
On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
<fweisbec(a)gmail.com> wrote:
>
> There is also the fact that we need to handle the lost NMI, by deferring
> its treatment or something similar. That adds even more complexity.

I don't think you read my proposal very closely. It already handles
them by taking a fault on the iret of the first one (that's why we
point to the stack frame - so that we can corrupt it and force a
fault).
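
A schematic C model of that flow (not real entry code -- the names
echo the nmi_blocker sketch elsewhere in the thread and are invented;
in the real scheme the replay happens via the fault on iret rather
than an explicit loop):

struct nmi_frame { unsigned long rflags, cs, rip; };
#define INVALID_CS	0
#define KERNEL_CS	0x10	/* any valid selector, for the model */

static struct nmi_frame *per_cpu_nmi_stack_frame;	/* per-CPU in reality */

static void nmi_handler(struct nmi_frame *frame)
{
	if (per_cpu_nmi_stack_frame) {
		/* Nested NMI: corrupt the outer frame so its iret faults. */
		per_cpu_nmi_stack_frame->cs = INVALID_CS;
		return;
	}
	per_cpu_nmi_stack_frame = frame;
	do {
		frame->cs = KERNEL_CS;		/* restore a valid frame */
		/* ... the real NMI work runs here ... */
	} while (frame->cs == INVALID_CS);	/* an NMI nested: replay it */
	per_cpu_nmi_stack_frame = NULL;
}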

Linus
From: Mathieu Desnoyers on
* Andi Kleen (andi(a)firstfloor.org) wrote:
> > or similar. Wouldn't that be nice to have as a capability?
>
> It means the NMI watchdog would become useless if such areas
> became common.
>
> Again I suspect all of this is not really needed anyway if
> vmalloc_sync_all() works properly. That would solve the original
> problem Mathieu was trying to solve for per_cpu data. The rule
> would be just to call vmalloc_sync_all() properly when changing
> per-CPU data too.

Yep, that would solve the page-fault-in-NMI problem altogether without
adding complexity.

>
> In fact I'm pretty sure it worked originally. Perhaps it regressed?

I'd first ask the obvious question to the perf authors: does perf issue
vmalloc_sync_all() between per-CPU data allocation and tracing
activation? The generic ring buffer library I posted last week already
does this as a precaution, for this very specific reason (making sure
NMIs never trigger page faults).

Thanks,

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com