From: Jeremy Fitzhardinge on
On 07/14/2010 04:02 PM, Linus Torvalds wrote:
> Umm, I know. It's what this whole discussion (non-paravirtualized) is
> all about. And I have a suggestion that should fix the
> non-paravirtualized case _without_ actually touching anything but the
> NMI code itself.
>
> What I tried to say is that the paravirtualized people should take a
> look at my suggestion, and see if they can apply the logic to their
> NMI handling too.

My point is that it's moot (for now) because there is no NMI handling.

> And in the process totally remove the need for
> paravirtualizing iret, exactly because the approach handles the magic
> NMI lock logic entirely in the NMI handler itself.
>

Nothing in this thread is ringing any alarm bells from that perspective,
so I don't much mind either way. When I get around to dealing with
paravirtualized NMI, I'll look at the state of things and see where to go
from there. (Xen's iret hypercall takes a flag to say whether to unmask
NMIs, which will probably come in handy.)

I don't think any of the other pure PV implementations have NMI either,
so I don't think it affects them.

> Wouldn't it be nice to be able to remove the need to paravirtualize iret?
>

Of course. But we also need to do an iret in a hypercall to handle ring
switching in some cases, so we still need it aside from the interrupt issue.

J

From: Linus Torvalds on
On Wed, Jul 14, 2010 at 3:37 PM, Linus Torvalds
<torvalds(a)linux-foundation.org> wrote:
>
> I think the %rip check should be pretty simple - exactly because there
> is only a single point where the race is open between that 'mov' and
> the 'iret'. So it's simpler than the (similar) thing we do for
> debug/nmi stack fixup for sysenter that has to check a range.

So this is what I think it might look like, with the %rip check in place.
And I changed the "nmi_stack_ptr" thing to have both the pointer and a
flag - because it turns out that in the single-instruction race case,
we actually want the old pointer.

Totally untested, of course. But _something_ like this might work:

#
# Two per-cpu variables: an "are we nested" flag (one byte), and
# a "if we're nested, what is the %rsp for the nested case" pointer.
#
# The reason why we can't just clear the saved-rsp field and
# use that as the flag is that we actually want to know the saved
# rsp for the special case of having a nested NMI happen on the
# final iret of the unnested case.
#
nmi:
        cmpb $0,%__percpu_seg:nmi_stack_nesting
        jne nmi_nested_corrupt_and_return
        cmpq $nmi_iret_address,0(%rsp)
        je nmi_might_be_nested
        # create new stack
is_unnested_nmi:
        # Save some space for nested NMI's. The exception itself
        # will never use more space, but it might use less (since
        # it will be a kernel-kernel transition). But the nested
        # exception will want two save registers and a place to
        # save the original CS that it will corrupt
        subq $64,%rsp

        # copy the five words of stack info. 96 = 64 + stack
        # offset of ss.
        pushq 96(%rsp)   # ss
        pushq 96(%rsp)   # rsp
        pushq 96(%rsp)   # eflags
        pushq 96(%rsp)   # cs
        pushq 96(%rsp)   # rip

        # and set the nesting flags
        movq %rsp,%__percpu_seg:nmi_stack_ptr
        movb $0xff,%__percpu_seg:nmi_stack_nesting

regular_nmi_code:
        ...
        # regular NMI code goes here, and can take faults,
        # because this sequence now has proper nested-nmi
        # handling
        ...
nmi_exit:
        movb $0,%__percpu_seg:nmi_stack_nesting
nmi_iret_address:
        iret

# The saved rip points to the final NMI iret, after we've cleared
# the nmi_stack_nesting flag. Check the CS segment to make sure.
nmi_might_be_nested:
        cmpw $__KERNEL_CS,8(%rsp)
        jne is_unnested_nmi

        # This is the case when we hit just as we're supposed to do the final
        # iret of a previous nmi. We run the NMI using the old return address
        # that is still on the stack, rather than copy the new one that is bogus
        # and points to where the nested NMI interrupted the original NMI
        # handler!
        # Easy: just reset the stack pointer to the saved one (this is why
        # we use a separate "valid" flag, so that we can still use the saved
        # stack pointer)
        movq %__percpu_seg:nmi_stack_ptr,%rsp
        jmp regular_nmi_code

# This is the actual nested case. Make sure we fault on iret by setting
# CS to zero and saving the old CS. %rax contains the stack pointer to
# the original code.
nmi_nested_corrupt_and_return:
        pushq %rax
        pushq %rdx
        movq %__percpu_seg:nmi_stack_ptr,%rax
        movq 8(%rax),%rdx    # CS of original NMI
        testq %rdx,%rdx      # CS already zero?
        je nmi_nested_and_already_corrupted
        movq %rdx,40(%rax)   # save old CS away
        movq $0,8(%rax)
nmi_nested_and_already_corrupted:
        popq %rdx
        popq %rax
        popfq
        jmp *(%rsp)
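
For reference, the two per-cpu variables the code above relies on could be
declared along these lines (just a sketch; the names simply mirror the asm):

        #include <linux/percpu.h>

        /* Nonzero while we are inside the NMI handler, so that a nested
         * NMI can detect the nesting. */
        DEFINE_PER_CPU(char, nmi_stack_nesting);

        /* Pointer to the copied NMI stack frame, used by the nested-NMI
         * path and by the "NMI hits the final iret" special case. */
        DEFINE_PER_CPU(void *, nmi_stack_ptr);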

Hmm?

Linus
From: Frederic Weisbecker on
On Wed, Jul 14, 2010 at 03:56:43PM -0700, Linus Torvalds wrote:
> On Wed, Jul 14, 2010 at 3:31 PM, Frederic Weisbecker <fweisbec(a)gmail.com> wrote:
> >
> > Until now I didn't, because I clearly misunderstand the vmalloc internals. I'm
> > not even quite sure why memory allocated with vmalloc can sometimes be not
> > mapped (and then fault once for this to sync). Some people have tried to explain
> > it to me but the picture is still vague to me.
>
> So the issue is that the system can have thousands and thousands of
> page tables (one for each process), and what do you do when you add a
> new kernel virtual mapping?
>
> You can:
>
> - make sure that you only ever use _one_ single top-level entry for
> all vmalloc issues, and can make sure that all processes are created
> with that static entry filled in. This is optimal, but it just doesn't
> work on all architectures (eg on 32-bit x86, it would limit the
> vmalloc space to 4MB in non-PAE, whatever)


But then, even if you ensure that, don't we also need to fill in the
lower-level entries for the new mapping?

Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
risk adding a new memory mapping for memory newly allocated with kmalloc?



> - at vmalloc time, when adding a new page directory entry, walk all
> the tens of thousands of existing page tables under a lock that
> guarantees that we don't add any new ones (ie it will lock out fork())
> and add the required pgd entry to them.
>
> - or just take the fault and do the "fill the page tables" on demand.
>
> Quite frankly, most of the time it's probably better to make that last
> choice (unless your hardware makes it easy to make the first choice,
> which is obviously simplest for everybody). It makes it _much_ cheaper
> to do vmalloc. It also avoids that nasty latency issue. And it's just
> simpler too, and has no interesting locking issues with how/when you
> expose the page tables in fork() etc.
>
> So the only downside is that you do end up taking a fault in the
> (rare) case where you have a newly created task that didn't get an
> even newer vmalloc entry.


But then how did the previous tasks get this new mapping? You said
we don't walk through every process's page tables for vmalloc.

I would understand this race if we were to walk every process's page
tables and add the new mapping to them, but missed a newly forked task
because we didn't lock (or just use RCU).



> And that fault can sometimes be in an
> interrupt or an NMI. Normally it's trivial to handle that fairly
> simple nested fault. But NMI has that inconvenient "iret unblocks
> NMI's, because there is no dedicated 'nmiret' instruction" problem on
> x86.


Yeah.


So the parts of the problem I don't understand are:

- why don't we have this problem with kmalloc()?
- did I understand the race that makes the fault necessary correctly,
i.e.: we walk the task list locklessly and add the new mapping if
not present, but we might miss a recently forked task, and
the fault will fix that.

Thanks.

From: Andi Kleen on
> But then how did the previous tasks get this new mapping? You said
> we don't walk through every process's page tables for vmalloc.

No, because those are always shared for the kernel and have already been
filled in for init_mm.

Also, most updates only touch the lower tables anyway; top-level
updates are extremely rare. In fact, on PAE36 they should happen
at most once, if at all, and most likely at early boot anyway,
when you only have a single task.

On x86-64 they will only happen once per 512GB of vmalloc address space
(each top-level entry covers 2^39 bytes, i.e. 512GB), so for most systems
also at most once, at early boot.
>
> I would understand this race if we were to walk every process's page
> tables and add the new mapping to them, but missed a newly forked task
> because we didn't lock (or just use RCU).

The new task will always get a copy of the reference init_mm, which
was already updated.
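
Roughly, that copy is what x86 does at pgd allocation time (see pgd_ctor();
this is only a sketch, with a made-up function name but the real helpers):

        #include <asm/pgtable.h>

        /* Sketch: when a pgd is allocated for a new mm, the kernel half is
         * copied from the reference page table (swapper_pg_dir / init_mm),
         * so any vmalloc entry that already exists there is inherited by
         * the new task for free. */
        static void sketch_pgd_ctor(pgd_t *pgd)
        {
                clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
                                swapper_pg_dir + KERNEL_PGD_BOUNDARY,
                                KERNEL_PGD_PTRS);
        }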

-Andi
From: Steven Rostedt on
On Thu, 2010-07-15 at 16:11 +0200, Frederic Weisbecker wrote:

> > - make sure that you only ever use _one_ single top-level entry for
> > all vmalloc issues, and can make sure that all processes are created
> > with that static entry filled in. This is optimal, but it just doesn't
> > work on all architectures (eg on 32-bit x86, it would limit the
> > vmalloc space to 4MB in non-PAE, whatever)
>
>
> But then, even if you ensure that, don't we also need to fill in the
> lower-level entries for the new mapping?

If I understand your question, you do not need to worry about the lower
level entries because all the processes will share the same top-level
entry.

process 1's PGD ------,
                      |
                      +------> PMD --> ...
                      |
process 2's PGD ------'

Thus we have one top-level entry shared by all processes. The issue happens
when the vmalloc space crosses the PMD boundary and we need to update all the
PGDs of all processes to point to the new PMD we need to add to handle
the spread of the vmalloc space.

>
> Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> risk adding a new memory mapping for memory newly allocated with kmalloc?

Because all of memory (well, 800-some megs on 32-bit) is mapped into
the address space of all processes. That is, kmalloc only uses this memory (as
does get_free_page()). All processes have a PMD (or PUD, whatever) that
maps this memory. The issue only arises when we use new virtual memory,
which vmalloc does. Vmalloc may map to physical memory that is already
mapped into all processes, but the address that vmalloc uses to access
that memory is not yet mapped.

The usual reason the kernel uses vmalloc is to get a contiguous range of
memory. Vmalloc can map several pages as one contiguous piece of
memory that in reality is several different pages scattered around
physical memory. kmalloc can only return pages that are contiguous in
physical memory. That is, if kmalloc gets 8192 bytes on an arch with
4096-byte pages, it will allocate two consecutive pages in physical
memory. If two contiguous pages are not available, even if thousands of
single pages are, the kmalloc will fail, whereas the vmalloc will not.

A vmalloc allocation can use two different pages and just set up the
page tables to make them contiguous in the view of the kernel. Note, this
comes at a cost. One is that when we do this, we need
to update a bunch of page tables. The other is that we use up
TLB entries to map these separate pages. Kmalloc and
get_free_page() use the big memory mappings. That is, if the TLB allows
us to map large pages, we can do that for kernel memory since we just
want the contiguous memory as it is in physical memory.

Thus the kernel maps physical memory with the fewest TLB entries
needed (large pages and large TLB entries). If we can map 64K pages, we
do that. Then kmalloc just allocates within this range; it does not need
to map any pages, they are already mapped.
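
To make that concrete, here is a minimal (untested) module sketch; both
calls hand back two pages' worth of usable memory, but only the vmalloc()
one has to create a new kernel virtual mapping:

        #include <linux/init.h>
        #include <linux/module.h>
        #include <linux/slab.h>
        #include <linux/vmalloc.h>

        static void *kbuf, *vbuf;

        static int __init alloc_demo_init(void)
        {
                /* Two physically contiguous pages from the already-mapped
                 * direct mapping: no new page-table entries are created. */
                kbuf = kmalloc(2 * PAGE_SIZE, GFP_KERNEL);

                /* Any two free pages, made virtually contiguous by new
                 * entries in the vmalloc area: this is the mapping that
                 * other tasks may later have to fault in lazily. */
                vbuf = vmalloc(2 * PAGE_SIZE);

                if (!kbuf || !vbuf) {
                        kfree(kbuf);
                        vfree(vbuf);
                        return -ENOMEM;
                }
                return 0;
        }

        static void __exit alloc_demo_exit(void)
        {
                kfree(kbuf);
                vfree(vbuf);
        }

        module_init(alloc_demo_init);
        module_exit(alloc_demo_exit);
        MODULE_LICENSE("GPL");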

Does this make a bit more sense?

>
>
>
> > - at vmalloc time, when adding a new page directory entry, walk all
> > the tens of thousands of existing page tables under a lock that
> > guarantees that we don't add any new ones (ie it will lock out fork())
> > and add the required pgd entry to them.
> >
> > - or just take the fault and do the "fill the page tables" on demand.
> >
> > Quite frankly, most of the time it's probably better to make that last
> > choice (unless your hardware makes it easy to make the first choice,
> > which is obviously simplest for everybody). It makes it _much_ cheaper
> > to do vmalloc. It also avoids that nasty latency issue. And it's just
> > simpler too, and has no interesting locking issues with how/when you
> > expose the page tables in fork() etc.
> >
> > So the only downside is that you do end up taking a fault in the
> > (rare) case where you have a newly created task that didn't get an
> > even newer vmalloc entry.
>
>
> But then how did the previous tasks get this new mapping? You said
> we don't walk through every process's page tables for vmalloc.

Actually, we don't even need to walk the page tables in the first task
(although we might do that). When the kernel accesses that memory, we
take a page fault; the page fault handler will see that this is vmalloc
memory and fill in the page tables for the task at that time.
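
Roughly, that fault-time fix-up looks like this (a simplified sketch loosely
modeled on x86's vmalloc_fault(); the real code also walks and sanity-checks
the lower levels):

        #include <linux/mm.h>
        #include <linux/sched.h>
        #include <asm/pgtable.h>

        static int sketch_vmalloc_fault(unsigned long address)
        {
                pgd_t *pgd_ref, *pgd;

                if (address < VMALLOC_START || address >= VMALLOC_END)
                        return -1;              /* not a vmalloc address */

                /* The reference page table (init_mm) always has the
                 * up-to-date vmalloc entry... */
                pgd_ref = pgd_offset_k(address);
                if (pgd_none(*pgd_ref))
                        return -1;              /* no such mapping at all */

                /* ...so copy the missing top-level entry into the current
                 * task's page table and let the faulting access retry. */
                pgd = pgd_offset(current->active_mm, address);
                if (pgd_none(*pgd))
                        set_pgd(pgd, *pgd_ref);

                return 0;
        }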

>
> I would understand this race if we were to walk every process's page
> tables and add the new mapping to them, but missed a newly forked task
> because we didn't lock (or just use RCU).
>
>
>
> > And that fault can sometimes be in an
> > interrupt or an NMI. Normally it's trivial to handle that fairly
> > simple nested fault. But NMI has that inconvenient "iret unblocks
> > NMI's, because there is no dedicated 'nmiret' instruction" problem on
> > x86.
>
>
> Yeah.
>
>
> So the parts of the problem I don't understand are:
>
> - why don't we have this problem with kmalloc()?

I hope I explained that above.

> - did I understand the race that makes the fault necessary correctly,
> i.e.: we walk the task list locklessly and add the new mapping if
> not present, but we might miss a recently forked task, and
> the fault will fix that.

I'm lost on this race. If we take a lock and walk all the page tables, I think
the race goes away. So I don't understand this question either.

-- Steve

