From: Linus Torvalds on
On Thu, Jul 15, 2010 at 7:11 AM, Frederic Weisbecker <fweisbec(a)gmail.com> wrote:
> On Wed, Jul 14, 2010 at 03:56:43PM -0700, Linus Torvalds wrote:
>> You can:
>>
>>  - make sure that you only ever use _one_ single top-level entry for
>> all vmalloc issues, and can make sure that all processes are created
>> with that static entry filled in. This is optimal, but it just doesn't
>> work on all architectures (eg on 32-bit x86, it would limit the
>> vmalloc space to 4MB in non-PAE, whatever)
>
> But then, even if you ensure that, don't we need to also fill lower level
> entries for the new mapping.

Yes, but now they are all mapped by the one *shared* top-level entry.

Think about it.

[ Time passes ]

End result: if you can map the whole vmalloc area with a single
top-level entry that is shared by all processes, and can then just
fill in the lower levels when doing actual allocations, it means that
all processes will automatically get the entries added, and do not
need any fixups.

In other words, the page tables will be automatically correct and
filled in for everybody - without having to traverse any lists,
without any extra locking, and without any races. So this is efficient
and simple, and never needs any faulting to fill in page tables later
on.

(Side note: "single top-level entry" could equally well be "multiple
preallocated entries covering the whole region": the important part is
not really the "single entry", but the "preallocated and filled into
every page directory from the start" part)
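
For illustration only, here is an untested sketch (the function name is
made up, and the real code differs): preallocating the directory entries
that cover the vmalloc range into init_mm once at boot, so that every
page directory created afterwards inherits them and never needs a fixup:

/*
 * Untested sketch (name made up): preallocate the page-directory
 * entries covering the whole vmalloc range in init_mm, so every pgd
 * created later inherits them via the normal kernel-mapping copy.
 */
static void __init preallocate_vmalloc_pgds(void)
{
        unsigned long addr, next;

        for (addr = VMALLOC_START; addr < VMALLOC_END; addr = next) {
                pgd_t *pgd = pgd_offset_k(addr);

                next = pgd_addr_end(addr, VMALLOC_END);
                /* pud_alloc() fills in the pgd entry if it is empty */
                if (!pud_alloc(&init_mm, pgd, addr))
                        panic("cannot preallocate vmalloc page tables");
        }
}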

> Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> risk to add a new memory mapping for new memory allocated with kmalloc?

No. The kmalloc space is all in the 1:1 kernel mapping, and is always
mapped. Even with PAGEALLOC_DEBUG, it's always mapped at the top
level, and even if a particular page is unmapped/remapped for
debugging, it is done so in the shared kernel page tables (which ends
up being the above trivial case - there is just a single set of page
directory entries that are shared by everybody).
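
(A trivial sketch to make that concrete - demo function, not real code:
a kmalloc'ed object sits in the direct mapping, so translating it is
plain offset arithmetic and no per-process page table work is ever
involved.)

/*
 * Trivial sketch: kmalloc memory lives in the 1:1 kernel mapping,
 * set up once at boot and shared by everybody, so there is nothing
 * to fill in lazily and nothing per-process to synchronize.
 */
static int kmalloc_direct_mapping_demo(void)
{
        void *obj = kmalloc(128, GFP_KERNEL);
        phys_addr_t phys;

        if (!obj)
                return -ENOMEM;
        phys = virt_to_phys(obj);       /* offset arithmetic, no table walk */
        pr_info("virt %p -> phys %llx\n", obj, (unsigned long long)phys);
        kfree(obj);
        return 0;
}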

>>  - at vmalloc time, when adding a new page directory entry, walk all
>> the tens of thousands of existing page tables under a lock that
>> guarantees that we don't add any new ones (ie it will lock out fork())
>> and add the required pgd entry to them.
>>
>>  - or just take the fault and do the "fill the page tables" on demand.
>>
>> Quite frankly, most of the time it's probably better to make that last
>> choice (unless your hardware makes it easy to make the first choice,
>> which is obviously simplest for everybody). It makes it _much_ cheaper
>> to do vmalloc. It also avoids that nasty latency issue. And it's just
>> simpler too, and has no interesting locking issues with how/when you
>> expose the page tables in fork() etc.
>>
>> So the only downside is that you do end up taking a fault in the
>> (rare) case where you have a newly created task that didn't get an
>> even newer vmalloc entry.
>
> But then how did the previous tasks get this new mapping? You said
> we don't walk through every process page tables for vmalloc.

We always add the mapping to the "init_mm" page tables when it is
created (just a single mm), and when fork creates a new page table, it
will always copy the kernel mapping parts from the old one. So the
_common_ case is that all normal mappings are already set up in page
tables, including newly created page tables.
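
Roughly, as a sketch (macro names as on 32-bit x86; the real code has
more to it): the pgd constructor just clones the kernel half of the
reference page tables into every new pgd:

/*
 * Untested sketch: when a new pgd is allocated, clone the kernel part
 * of the reference page tables (swapper_pg_dir, i.e. init_mm) into it,
 * so every kernel mapping that already exists is visible to the new
 * process from the start.
 */
static void copy_kernel_mappings(pgd_t *new_pgd)
{
        clone_pgd_range(new_pgd + KERNEL_PGD_BOUNDARY,
                        swapper_pg_dir + KERNEL_PGD_BOUNDARY,
                        KERNEL_PGD_PTRS);
}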

The uncommon case is when there is a new page table created _and_ a
new vmalloc mapping, and the race that happens between those events.
When that new page table is then later used (and it can be _much_
later, of course: we're now talking process scheduling, so IO delays
etc are relevant), it won't necessarily have the page table entries
for vmalloc stuff that was created since the page tables were created.
So we fill _those_ in dynamically.
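
In code, the lazy fill is basically this (untested sketch along the
lines of the x86 vmalloc fault fixup, not the exact kernel code):

/*
 * Untested sketch of the lazy fill: on a kernel fault in the vmalloc
 * range, copy the missing page-directory entry from init_mm into the
 * current page tables. No list walking, no locking, no IO - the
 * reference entry already exists.
 */
static int fixup_vmalloc_fault(unsigned long address)
{
        pgd_t *pgd_ref, *pgd;

        if (address < VMALLOC_START || address >= VMALLOC_END)
                return -1;                      /* not a vmalloc address */

        pgd_ref = pgd_offset_k(address);        /* init_mm's entry */
        if (pgd_none(*pgd_ref))
                return -1;                      /* genuinely bad access */

        pgd = pgd_offset(current->active_mm, address);
        if (pgd_none(*pgd))
                set_pgd(pgd, *pgd_ref);         /* propagate the shared entry */
        return 0;
}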

But vmalloc mappings should be reasonably rare, and the actual "fill
them in" cases are much rarer still (since we fill in whole page
directory entries at a time: so even if you make a lot of vmalloc()
calls, we _fill_ at most once per page directory entry, which is
usually a pretty big chunk). On 32-bit x86, for example, we'd fill
once every 4MB (or 2MB if PAE), and you cannot have many vmalloc
mappings that large (since the VM space is limited).

So the cost of filling things in is basically zero, because it happens
so seldom. And by _allowing_ things to be done lazily, we avoid all
the locking costs, and all the costs of traversing every single
possible mm (and there can be many many thousands of those).

> I would understand this race if we were to walk on every processes page
> tables and add the new mapping on them, but we missed one new task that
> forked or so, because we didn't lock (or just rcu).

... and how do you keep track of which tasks you missed? And no, it's
not just the new tasks - you have old tasks that have their page
tables built up too, but need to be updated. They may never need the
mapping since they may be sleeping and never using the driver or data
structures that created it (in fact, that's a common case), so filling
them would be pointless. But if we don't do the lazy fill, we'd have
to fill them all, because WE DO NOT KNOW.
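
For contrast, the eager alternative would have to look roughly like
this (sketch only, list and lock names approximate) - which is exactly
the traversal and locking that the lazy scheme avoids:

/*
 * Sketch of the eager alternative (names approximate): walk every page
 * directory in the system under a lock that also excludes pgd creation,
 * and push the new entry into each of them - whether or not that mm
 * will ever touch the new mapping.
 */
static void eager_sync_pgds(unsigned long address)
{
        pgd_t *pgd_ref = pgd_offset_k(address);
        struct page *page;

        spin_lock(&pgd_lock);           /* also taken when a pgd is created */
        list_for_each_entry(page, &pgd_list, lru) {
                pgd_t *pgd = (pgd_t *)page_address(page) + pgd_index(address);

                if (pgd_none(*pgd))
                        set_pgd(pgd, *pgd_ref);
        }
        spin_unlock(&pgd_lock);
}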

> So the parts of the problem I don't understand are:
>
> - why don't we have this problem with kmalloc() ?

Hopefully clarified.

> - did I understand well the race that makes the fault necessary,
>   ie: we walk the tasklist lockless, add the new mapping if
>   not present, but we might miss a task lately forked, but
>   the fault will fix that.

But the _fundamental_ issue is that we do not want to walk the
tasklist (or the mm_list) AT ALL. It's a f*cking waste of time. It's a
long list, and nobody cares. In many cases it won't be needed.

The lazy algorithm is _better_. It's way simpler (we take nested
faults all the time in the kernel, and it's a particularly _easy_ page
fault to handle, with no IO and no locking needed), and it does less
work. It really boils down to that.

So it's not the lazy page table fill that is the problem. Never has
been. We've been doing the lazy fill for a long time, and it was
simple and useful way back when.

The problem has always been NMI, and nothing else. NMI's are nasty,
and the x86 NMI blocking is insane and crazy.

Which is why I'm so adamant that this should be fixed in the NMI code,
and we should _not_ talk about trying to screw up other, totally
unrelated, code. The lazy fill really was never the problem.

Linus
From: Mathieu Desnoyers on
* Frederic Weisbecker (fweisbec(a)gmail.com) wrote:
> On Wed, Jul 14, 2010 at 07:11:17PM -0400, Mathieu Desnoyers wrote:
> > * Frederic Weisbecker (fweisbec(a)gmail.com) wrote:
> > > On Wed, Jul 14, 2010 at 06:31:07PM -0400, Mathieu Desnoyers wrote:
> > > > * Frederic Weisbecker (fweisbec(a)gmail.com) wrote:
> > > > > On Wed, Jul 14, 2010 at 12:54:19PM -0700, Linus Torvalds wrote:
> > > > > > On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> > > > > > <fweisbec(a)gmail.com> wrote:
> > > > > > >
> > > > > > > There is also the fact we need to handle the lost NMI, by defering its
> > > > > > > treatment or so. That adds even more complexity.
> > > > > >
> > > > > > I don't think your read my proposal very deeply. It already handles
> > > > > > them by taking a fault on the iret of the first one (that's why we
> > > > > > point to the stack frame - so that we can corrupt it and force a
> > > > > > fault).
> > > > >
> > > > >
> > > > > Ah right, I missed this part.
> > > >
> > > > Hrm, Frederic, I hate to ask that but.. what are you doing with those percpu 8k
> > > > data structures exactly ? :)
> > > >
> > > > Mathieu
> > >
> > >
> > >
> > > So, when an event triggers in perf, we sometimes want to capture the stacktrace
> > > that led to the event.
> > >
> > > We want this stacktrace (here we call that a callchain) to be recorded
> > > locklessly. So we want this callchain buffer per cpu, with the following
> > > type:
> >
> > Ah OK, so you mean that perf now has 2 different ring buffer implementations ?
> > How about using a single one that is generic enough to handle perf and ftrace
> > needs instead ?
> >
> > (/me runs away quickly before the lightning strikes) ;)
> >
> > Mathieu
>
>
> :-)
>
> That's no ring buffer. It's a temporary linear buffer to fill a stacktrace,
> and get its effective size before committing it to the real ring buffer.

I was more thinking along the lines of making sure a ring buffer has the proper
support for your use-case. It shares a lot of requirements with a standard ring
buffer:

- Needs to be lock-less
- Needs to reserve space and write data into a buffer

By configuring a ring buffer with a 4k sub-buffer size (that's configurable
dynamically), all we need to add is the ability to squash a previously saved
record from the buffer. I am confident we can provide a clean API for this that
would allow discarding a previously committed entry as long as we are still on
the same, not yet fully committed sub-buffer. This fits your use-case exactly,
so that's fine.
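
Something along these lines (sketch only: the ring buffer API names are
approximate, and save_callchain()/MAX_DEPTH are made-up helpers):

/*
 * Sketch only (API names approximate, save_callchain() and MAX_DEPTH
 * are made up): reserve space for a callchain, fill it in place, then
 * either commit it or squash it again if it turns out to be unneeded.
 */
static void record_callchain(struct ring_buffer *buffer)
{
        struct ring_buffer_event *event;
        u64 *entries;
        int n;

        event = ring_buffer_lock_reserve(buffer, MAX_DEPTH * sizeof(u64));
        if (!event)
                return;

        entries = ring_buffer_event_data(event);
        n = save_callchain(entries, MAX_DEPTH);         /* fill in place */

        if (n > 0)
                ring_buffer_unlock_commit(buffer, event);
        else
                ring_buffer_discard_commit(buffer, event); /* squash it */
}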

You could have one 4k ring buffer per cpu per execution context. I wonder if
every Linux architecture has support for separate thread vs softirq vs irq vs
NMI stacks? Even then, given that you have only one stack for all shared irqs,
you need something that is concurrency-aware at the ring buffer level.

These small stack-like ring buffers could be used to save your temporary stack
copy. When you really need to save it to a larger ring buffer along with a
trace, then you just take a snapshot of the stack ring buffers.

So you get to use a single ring buffer synchronization and memory allocation
mechanism that everyone has reviewed. The advantage is that we would not be
having this NMI race discussion in the first place: the generic ring buffer uses
"get page" directly rather than relying on vmalloc, because these bugs were
identified and dealt with years ago.
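
(A minimal sketch of what I mean, with the size constant made up:
backing the buffer with individually allocated pages keeps it in the
always-mapped 1:1 kernel mapping, so NMI context can touch it without
ever needing a lazy page-table fill.)

/*
 * Minimal sketch (NR_BUF_PAGES is made up): allocate the buffer as
 * individual pages instead of using vmalloc, so it lives entirely in
 * the always-mapped 1:1 kernel mapping.
 */
#define NR_BUF_PAGES    4

static int alloc_buffer_pages(void **pages)
{
        int i;

        for (i = 0; i < NR_BUF_PAGES; i++) {
                pages[i] = (void *)__get_free_page(GFP_KERNEL);
                if (!pages[i])
                        goto fail;
        }
        return 0;
fail:
        while (--i >= 0)
                free_page((unsigned long)pages[i]);
        return -ENOMEM;
}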

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Linus Torvalds on
On Thu, Jul 15, 2010 at 9:44 AM, Mathieu Desnoyers
<mathieu.desnoyers(a)efficios.com> wrote:
>
> The first thing that strikes me is that we could be interrupted by a standard
> interrupt on top of the iret instruction below.

No, that can never happen.

Why? Simple: regular interrupts aren't ever enabled in eflags. So the
only kinds of traps we can get are NMI's (that don't follow the normal
rules), and exceptions.

Of course, if there is some trap that re-enables interrupts even if
the trap happened in an interrupt-disabled region, then that would
change things, but that would be a bad bug regardless (and totally
independently of any NMI issues). So in that sense it's a "could
happen", but it's something that would be a totally separate bug.

Linus
From: Mathieu Desnoyers on
* Linus Torvalds (torvalds(a)linux-foundation.org) wrote:
> On Wed, Jul 14, 2010 at 3:37 PM, Linus Torvalds
> <torvalds(a)linux-foundation.org> wrote:
> >
> > I think the %rip check should be pretty simple - exactly because there
> > is only a single point where the race is open between that 'mov' and
> > the 'iret'. So it's simpler than the (similar) thing we do for
> > debug/nmi stack fixup for sysenter that has to check a range.
>
> So this is what I think it might look like, with the %rip in place.
> And I changed the "nmi_stack_ptr" thing to have both the pointer and a
> flag - because it turns out that in the single-instruction race case,
> we actually want the old pointer.
>
> Totally untested, of course. But _something_ like this might work:
>
> #
> # Two per-cpu variables: a "are we nested" flag (one byte), and
> # a "if we're nested, what is the %rsp for the nested case".
> #
> # The reason for why we can't just clear the saved-rsp field and
> # use that as the flag is that we actually want to know the saved
> # rsp for the special case of having a nested NMI happen on the
> # final iret of the unnested case.
> #
> nmi:
> cmpb $0,%__percpu_seg:nmi_stack_nesting
> jne nmi_nested_corrupt_and_return
> cmpq $nmi_iret_address,0(%rsp)
> je nmi_might_be_nested
> # create new stack
> is_unnested_nmi:
> # Save some space for nested NMI's. The exception itself
> # will never use more space, but it might use less (since
> # it will be a kernel-kernel transition). But the nested
> # exception will want two save registers and a place to
> # save the original CS that it will corrupt
> subq $64,%rsp
>
> # copy the five words of stack info. 96 = 64 + stack
> # offset of ss.
> pushq 96(%rsp) # ss
> pushq 96(%rsp) # rsp
> pushq 96(%rsp) # eflags
> pushq 96(%rsp) # cs
> pushq 96(%rsp) # rip
>
> # and set the nesting flags
> movq %rsp,%__percpu_seg:nmi_stack_ptr
> movb $0xff,%__percpu_seg:nmi_stack_nesting
>
> regular_nmi_code:
> ...
> # regular NMI code goes here, and can take faults,
> # because this sequence now has proper nested-nmi
> # handling
> ...
> nmi_exit:
> movb $0,%__percpu_seg:nmi_stack_nesting

The first thing that strikes me is that we could be interrupted by a standard
interrupt on top of the iret instruction below. This interrupt handler could in
turn be interrupted by an NMI, so the NMI handler would not know that it is
nested over nmi_iret_address. Maybe we could simply disable interrupts
explicitly at the beginning of the handler, so they get re-enabled by the iret
below upon return from the NMI?

Doing that would ensure that only NMIs can interrupt us.

I'll look a bit more at the code and come back with more comments if things come
up.

Thanks,

Mathieu

> nmi_iret_address:
> iret
>
> # The saved rip points to the final NMI iret, after we've cleared
> # nmi_stack_ptr. Check the CS segment to make sure.
> nmi_might_be_nested:
> cmpw $__KERNEL_CS,8(%rsp)
> jne is_unnested_nmi
>
> # This is the case when we hit just as we're supposed to do the final
> # iret of a previous nmi. We run the NMI using the old return address
> # that is still on the stack, rather than copy the new one that is bogus
> # and points to where the nested NMI interrupted the original NMI
> # handler!
> # Easy: just reset the stack pointer to the saved one (this is why
> # we use a separate "valid" flag, so that we can still use the saved
> # stack pointer)
> movq %__percpu_seg:nmi_stack_ptr,%rsp
> jmp regular_nmi_code
>
> # This is the actual nested case. Make sure we fault on iret by setting
> # CS to zero and saving the old CS. %rax contains the stack pointer to
> # the original code.
> nmi_nested_corrupt_and_return:
> pushq %rax
> pushq %rdx
> movq %__percpu_seg:nmi_stack_ptr,%rax
> movq 8(%rax),%rdx # CS of original NMI
> testq %rdx,%rdx # CS already zero?
> je nmi_nested_and_already_corrupted
> movq %rdx,40(%rax) # save old CS away
> movq $0,8(%rax)
> nmi_nested_and_already_corrupted:
> popq %rdx
> popq %rax
> popfq
> jmp *(%rsp)
>
> Hmm?
>
> Linus

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Mathieu Desnoyers on
* Linus Torvalds (torvalds(a)linux-foundation.org) wrote:
> On Thu, Jul 15, 2010 at 9:44 AM, Mathieu Desnoyers
> <mathieu.desnoyers(a)efficios.com> wrote:
> >
> > The first thing that strikes me is that we could be interrupted by a standard
> > interrupt on top of the iret instruction below.
>
> No, that can never happen.
>
> Why? Simple: regular interrupts aren't ever enabled in eflags. So the
> only kinds of traps we can get are NMI's (that don't follow the normal
> rules), and exceptions.

Ah, you're right: since NMIs go through an interrupt gate, IF is disabled in
EFLAGS all along.

>
> Of course, if there is some trap that re-enables interrupts even if
> the trap happened in an interrupt-disabled region, then that would
> change things, but that would be a bad bug regardless (and totally
> independently of any NMI issues). So in that sense it's a "could
> happen", but it's something that would be a totally separate bug.

Yep, this kind of scenario would therefore be a bug that does not belong to the
specific realm of NMIs.

Thanks,

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com