From: Frederic Weisbecker on
On Wed, Jul 14, 2010 at 04:05:52PM -0400, Mathieu Desnoyers wrote:
> * Andi Kleen (andi(a)firstfloor.org) wrote:
> > > or similar. Wouldn't that be nice to have as a capability?
> >
> > It means the NMI watchdog would get useless if these areas
> > become common.
> >
> > Again I suspect all of this is not really needed anyways if
> > vmalloc_sync_all() works properly. That would solve the original
> > problem Mathieu was trying to solve for per_cpu data. The rule
> > would be just to call vmalloc_sync_all() properly when changing
> > per CPU data too.
>
> Yep, that would solve the page fault in nmi problem altogether without adding
> complexity.
>
> >
> > In fact I'm pretty sure it worked originally. Perhaps it regressed?
>
> I'd first ask the obvious to Perf authors: does perf issue vmalloc_sync_all()
> between percpu data allocation and tracing activation ? The generic ring buffer
> library I posted last week does it already as a precaution for this very
> specific reason (making sure NMIs never trigger page faults).


Ok, I should try.

Until now I didn't because I clearly misunderstand the vmalloc internals. I'm
not even quite sure why a memory allocated with vmalloc sometimes can be not
mapped (and then fault once for this to sync). Some people have tried to explain
me but the picture is still vague to me.

And moreover, I'm not quite sure whether vmalloc_sync_all() syncs the pgd
for every tasks or so... Tejun seemed to say it's not necessary the case in
x86-32... Again I think I haven't totally understood the details.

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on
On Wed, Jul 14, 2010 at 3:31 PM, Frederic Weisbecker <fweisbec(a)gmail.com> wrote:
>
> Until now I didn't because I clearly misunderstand the vmalloc internals. I'm
> not even quite sure why a memory allocated with vmalloc sometimes can be not
> mapped (and then fault once for this to sync). Some people have tried to explain
> me but the picture is still vague to me.

So the issue is that the system can have thousands and thousands of
page tables (one for each process), and what do you do when you add a
new kernel virtual mapping?

You can:

- make sure that you only ever use _one_ single top-level entry for
all vmalloc issues, and can make sure that all processes are created
with that static entry filled in. This is optimal, but it just doesn't
work on all architectures (eg on 32-bit x86, it would limit the
vmalloc space to 4MB in non-PAE, whatever)

- at vmalloc time, when adding a new page directory entry, walk all
the tens of thousands of existing page tables under a lock that
guarantees that we don't add any new ones (ie it will lock out fork())
and add the required pgd entry to them.

- or just take the fault and do the "fill the page tables" on demand.

Quite frankly, most of the time it's probably better to make that last
choice (unless your hardware makes it easy to make the first choice,
which is obviously simplest for everybody). It makes it _much_ cheaper
to do vmalloc. It also avoids that nasty latency issue. And it's just
simpler too, and has no interesting locking issues with how/when you
expose the page tables in fork() etc.

So the only downside is that you do end up taking a fault in the
(rare) case where you have a newly created task that didn't get an
even newer vmalloc entry. And that fault can sometimes be in an
interrupt or an NMI. Normally it's trivial to handle that fairly
simple nested fault. But NMI has that inconvenient "iret unblocks
NMI's, because there is no dedicated 'nmiret' instruction" problem on
x86.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jeremy Fitzhardinge on
On 07/14/2010 03:37 PM, Linus Torvalds wrote:
> I think the %rip check should be pretty simple - exactly because there
> is only a single point where the race is open between that 'mov' and
> the 'iret'. So it's simpler than the (similar) thing we do for
> debug/nmi stack fixup for sysenter that has to check a range.
>
> The only worry is if that crazy paravirt code wants to paravirtualize
> the iretq. Afaik, paravirt does that exactly because they screw up
> iret handling themselves. Maybe we could stop doing that stupid iretq
> paravirtualization, and just tell the paravirt people to do the same
> thing I propose, and just allow nesting.
>

We screw around with iret because there's a separate interrupt mask flag
which can't be set atomically with respect to a stack/ring change (well,
there's more to it, but I won't confuse matters).

But it only really matters if the PV guest can also get NMI-like
interrupts. While Xen supports NMI for PV guests, we don't use it much
and I haven't looked into implementing support for it yet.

J
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andi Kleen on
> - at vmalloc time, when adding a new page directory entry, walk all
> the tens of thousands of existing page tables under a lock that
> guarantees that we don't add any new ones (ie it will lock out fork())
> and add the required pgd entry to them.
>
> - or just take the fault and do the "fill the page tables" on demand.
>
> Quite frankly, most of the time it's probably better to make that last
> choice (unless your hardware makes it easy to make the first choice,

Adding new PGDs should happen only very rarely (in fact
at most once per boot on i386-PAE36 with only 4 entries, most used by user
space), most of the time when you do a vmalloc it changes only lower level
tables.

The PGD for the kernel mappings is already set up. On x86-64 it can happen
more often in theory, but in practice it should be also extremly rare because
a PGD is so large.

That's why I'm not sure this problem even happened. It should
be extremly rare that you exactly add that PGD during the per cpu
allocation.

It can happen in theory, but for such a rare case take a lock
and walking everything should be fine.

-Andi

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on
On Wed, Jul 14, 2010 at 3:51 PM, Jeremy Fitzhardinge <jeremy(a)goop.org> wrote:
>
> We screw around with iret because there's a separate interrupt mask flag
> which can't be set atomically with respect to a stack/ring change (well,
> there's more to it, but I won't confuse matters).

Umm, I know. It's what this whole discussion (non-paravirtualized) is
all about. And I have a suggestion that should fix the
non-paravirtualized case _without_ actually touching anything but the
NMI code itself.

What I tried to say is that the paravirtualized people should take a
look at my suggestion, and see if they can apply the logic to their
NMI handling too. And in the process totally remove the need for
paravirtualizing iret, exactly because the approach handles the magic
NMI lock logic entirely in the NMI handler itself.

Because I think the same thing that would make us not need to worry
about nested page faults _during_ NMI (because we could make the NMI
code do the right thing even when the hardware doesn't lock out NMI's
for us) is also the exact same logic that the paravirt monitor could
do for its own NMI handling.

Wouldn't it be nice to be able to remove the need to paravirtualize iret?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/