From: Frederic Weisbecker on
On Thu, Jul 15, 2010 at 10:46:13AM -0400, Steven Rostedt wrote:
> On Thu, 2010-07-15 at 16:11 +0200, Frederic Weisbecker wrote:
>
> > > - make sure that you only ever use _one_ single top-level entry for
> > > all vmalloc issues, and can make sure that all processes are created
> > > with that static entry filled in. This is optimal, but it just doesn't
> > > work on all architectures (eg on 32-bit x86, it would limit the
> > > vmalloc space to 4MB in non-PAE, whatever)
> >
> >
> > But then, even if you ensure that, don't we also need to fill the
> > lower-level entries for the new mapping?
>
> If I understand your question, you do not need to worry about the lower
> level entries because all the processes will share the same top level.
>
> process 1's PGD ------,
>                       |
>                       +------> PMD --> ...
>                       |
> process 2's PGD ------'
>
> Thus we have one page table entry shared by all processes. The issue
> happens when the vm space crosses the PMD boundary and we need to update
> the PGDs of all processes to point to the new PMD we need to add to
> handle the spread of the vm space.




Oh right. We point to that PMD, and the update itself was made inside the
lower-level entries pointed to by the PMD. Indeed.



>
> >
> > Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> > risk adding a new memory mapping for new memory allocated with kmalloc?
>
> Because all of memory (well, 800-some megs on 32-bit) is mapped into
> the address space of all processes. That is, kmalloc only uses this
> memory (as does get_free_page()). All processes have a PMD (or PUD,
> whatever) that maps this memory. The issue only arises when we use new
> virtual memory, which vmalloc does. Vmalloc may map to physical memory
> that is already mapped to all processes, but the address that vmalloc
> uses to access that memory is not yet mapped.



Ok I see.




>
> The usual reason the kernel uses vmalloc is to get a contiguous range of
> memory. The vmalloc can map several pages as one contiguous piece of
> memory that in reality is several different pages scattered around
> physical memory. kmalloc can only map pages that are contiguous in
> physical memory. That is, if kmalloc gets 8192 bytes on an arch with
> 4096-byte pages, it will allocate two consecutive pages in physical
> memory. If two contiguous pages are not available, even if thousands of
> single pages are, the kmalloc will fail, whereas the vmalloc will not.
>
> An allocation of vmalloc can use two different pages and just map the
> page table to make them contiguous in the view of the kernel. Note, this
> comes at a cost. One is that when we do this, we need to update a bunch
> of page tables. The other is that we must waste TLB entries to point to
> these separate pages. Kmalloc and get_free_page() use the big memory
> mappings. That is, if the TLB allows us to map large pages, we can do
> that for kernel memory, since we just want the contiguous memory as it
> is in physical memory.
>
> Thus the kernel maps the physical memory with the fewest TLB entries as
> needed (large pages and large TLB entries). If we can map 64K pages, we
> do that. Then kmalloc just allocates within this range, it does not need
> to map any pages. They are already mapped.
>
> Does this make a bit more sense?



Totally! You've made it very clear to me.
Moreover, I did not know we could have such variable page sizes. I mean, I
thought we could have variable page sizes, but that it would apply to every page.





>
> >
> >
> >
> > > - at vmalloc time, when adding a new page directory entry, walk all
> > > the tens of thousands of existing page tables under a lock that
> > > guarantees that we don't add any new ones (ie it will lock out fork())
> > > and add the required pgd entry to them.
> > >
> > > - or just take the fault and do the "fill the page tables" on demand.
> > >
> > > Quite frankly, most of the time it's probably better to make that last
> > > choice (unless your hardware makes it easy to make the first choice,
> > > which is obviously simplest for everybody). It makes it _much_ cheaper
> > > to do vmalloc. It also avoids that nasty latency issue. And it's just
> > > simpler too, and has no interesting locking issues with how/when you
> > > expose the page tables in fork() etc.
> > >
> > > So the only downside is that you do end up taking a fault in the
> > > (rare) case where you have a newly created task that didn't get an
> > > even newer vmalloc entry.
> >
> >
> > But then how did the previous tasks get this new mapping? You said
> > we don't walk through every process's page tables for vmalloc.
>
> Actually we don't even need to walk the page tables in the first task
> (although we might do that). When the kernel accesses that memory we
> take the page fault, the page fault will see that this memory is vmalloc
> data and fill in the page tables for the task at that time.



Right.




> >
> > I would understand this race if we were to walk every process's page
> > tables and add the new mapping to them, but we missed one new task that
> > forked or so, because we didn't lock (or just use RCU).
> >
> >
> >
> > > And that fault can sometimes be in an
> > > interrupt or an NMI. Normally it's trivial to handle that fairly
> > > simple nested fault. But NMI has that inconvenient "iret unblocks
> > > NMI's, because there is no dedicated 'nmiret' instruction" problem on
> > > x86.
> >
> >
> > Yeah.
> >
> >
> > So the parts of the problem I don't understand are:
> >
> > - why don't we have this problem with kmalloc() ?
>
> I hope I explained that above.



Yeah :)

Thanks a lot for your explanations!

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Frederic Weisbecker on
On Thu, Jul 15, 2010 at 04:35:18PM +0200, Andi Kleen wrote:
> > But then how did the previous tasks get this new mapping? You said
> > we don't walk through every process's page tables for vmalloc.
>
> No because those are always shared for the kernel and have been
> filled in for init_mm.
>
> Also most updates only update the lower tables anyway; top-level
> updates are extremely rare. In fact on PAE36 they should only happen
> at most once, if at all, and most likely at early boot anyway, where
> you only have a single task.
>
> On x86-64 they will only happen once every 512GB of vmalloc.
> So for most systems also at most once at early boot.
> >
> > I would understand this race if we were to walk every process's page
> > tables and add the new mapping to them, but we missed one new task that
> > forked or so, because we didn't lock (or just use RCU).
>
> The new task will always get a copy of the reference init_mm, which
> was already updated.
>
> -Andi


Ok, got it.

But then, in the example here with perf, I'm allocating 8192 bytes per cpu
and my total memory is 2 GB.

And it always faults at least once on access, after the allocation.
I really doubt it's because we are adding a new top-level page table,
considering the amount of memory I have.

It seems to me that the mapping of a newly allocated vmalloc area is
always inserted the lazy way (updated on fault). Or there is
something I'm missing.

Thanks.

From: Steven Rostedt on
On Fri, 2010-07-16 at 12:47 +0200, Frederic Weisbecker wrote:
> > Thus the kernel maps the physical memory with the fewest TLB entries as
> > needed (large pages and large TLB entries). If we can map 64K pages, we
> > do that. Then kmalloc just allocates within this range, it does not need
> > to map any pages. They are already mapped.
> >
> > Does this make a bit more sense?
>
>
>
> Totally! You've made it very clear to me.
> Moreover I did not know we can have such variable page size. I mean I thought
> we can have variable page size but that would apply to every pages.

In x86_64, if bit 7 in the PDE (Page Directory Entry) is set, then it
points to a 2 Meg page. Otherwise it points to a page table which will
have 512 PTEs pointing to 4K pages.

Download:

http://support.amd.com/us/Processor_TechDocs/24593.pdf

It has nice diagrams that explain this. Check out page 207 (fig 5-17)
and 210 (fig 5-22).

The phys_pmd_init() in arch/x86/mm/init_64.c will try to map memory
using 2M pages if it can, otherwise it falls back to 4K pages.

-- Steve

From: Frederic Weisbecker on
On Thu, Jul 15, 2010 at 07:51:55AM -0700, Linus Torvalds wrote:
> On Thu, Jul 15, 2010 at 7:11 AM, Frederic Weisbecker <fweisbec(a)gmail.com> wrote:
> > On Wed, Jul 14, 2010 at 03:56:43PM -0700, Linus Torvalds wrote:
> >> You can:
> >>
> >> - make sure that you only ever use _one_ single top-level entry for
> >> all vmalloc issues, and can make sure that all processes are created
> >> with that static entry filled in. This is optimal, but it just doesn't
> >> work on all architectures (eg on 32-bit x86, it would limit the
> >> vmalloc space to 4MB in non-PAE, whatever)
> >
> > But then, even if you ensure that, don't we also need to fill the
> > lower-level entries for the new mapping?
>
> Yes, but now they are all mapped by the one *shared* top-level entry.
>
> Think about it.
>
> [ Time passes ]
>
> End result: if you can map the whole vmalloc area with a single
> top-level entry that is shared by all processes, and can then just
> fill in the lower levels when doing actual allocations, it means that
> all processes will automatically get the entries added, and do not
> need any fixups.
>
> In other words, the page tables will be automatically correct and
> filled in for everybody - without having to traverse any lists,
> without any extra locking, and without any races. So this is efficient
> and simple, and never needs any faulting to fill in page tables later
> on.
>
> (Side note: "single top-level entry" could equally well be "multiple
> preallocated entries covering the whole region": the important part is
> not really the "single entry", but the "preallocated and filled into
> every page directory from the start" part)



Right, I got it. Thanks for these explanations.



>
> > Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> > risk adding a new memory mapping for new memory allocated with kmalloc?
>
> No. The kmalloc space is all in the 1:1 kernel mapping, and is always
> mapped. Even with PAGEALLOC_DEBUG, it's always mapped at the top
> level, and even if a particular page is unmapped/remapped for
> debugging, it is done so in the shared kernel page tables (which ends
> up being the above trivial case - there is just a single set of page
> directory entries that are shared by everybody).



Ok.



> >> - at vmalloc time, when adding a new page directory entry, walk all
> >> the tens of thousands of existing page tables under a lock that
> >> guarantees that we don't add any new ones (ie it will lock out fork())
> >> and add the required pgd entry to them.
> >>
> >> - or just take the fault and do the "fill the page tables" on demand.
> >>
> >> Quite frankly, most of the time it's probably better to make that last
> >> choice (unless your hardware makes it easy to make the first choice,
> >> which is obviously simplest for everybody). It makes it _much_ cheaper
> >> to do vmalloc. It also avoids that nasty latency issue. And it's just
> >> simpler too, and has no interesting locking issues with how/when you
> >> expose the page tables in fork() etc.
> >>
> >> So the only downside is that you do end up taking a fault in the
> >> (rare) case where you have a newly created task that didn't get an
> >> even newer vmalloc entry.
> >
> > But then how did the previous tasks get this new mapping? You said
> > we don't walk through every process's page tables for vmalloc.
>
> We always add the mapping to the "init_mm" page tables when it is
> created (just a single mm), and when fork creates a new page table, it
> will always copy the kernel mapping parts from the old one. So the
> _common_ case is that all normal mappings are already set up in page
> tables, including newly created page tables.
>
> The uncommon case is when there is a new page table created _and_ a
> new vmalloc mapping, and the race that happens between those events.
> When that new page table is then later used (and it can be _much_
> later, of course: we're now talking process scheduling, so IO delays
> etc are relevant), it won't necessarily have the page table entries
> for vmalloc stuff that was created since the page tables were created.
> So we fill _those_ in dynamically.



Such a newly created page table that might race only concerns the
top-level entries, right? Otherwise it wouldn't race, since the top-level
entries are shared and updates inside the lower-level pages are naturally
propagated, if I understood you well.

So, if only newly added top-level entries can trigger such a lazy
mapping update, I wonder why I experienced this fault every time with
my patches.

I allocated 8192 bytes per cpu on an x86-32 system that has only 2 GB.
I doubt there is a top-level page table update there at this time with
such a small amount of available memory. But still it faults once on
access.

I have trouble visualizing the race and the problem here.



>
> But vmalloc mappings should be reasonably rare, and the actual "fill
> them in" cases are much rarer still (since we fill them in page
> directory entries at a time: so even if you make a lot of vmalloc()
> calls, we only _fill_ at most once per page directory entry, which is
> usually a pretty big chunk). On 32-bit x86, for example, we'd fill
> once every 4MB (or 2MB if PAE), and you cannot have a lot of vmalloc
> mappings that large (since the VM space is limited).
>
> So the cost of filling things in is basically zero, because it happens
> so seldom. And by _allowing_ things to be done lazily, we avoid all
> the locking costs, and all the costs of traversing every single
> possible mm (and there can be many many thousands of those).



Ok.



> > I would understand this race if we were to walk every process's page
> > tables and add the new mapping to them, but we missed one new task that
> > forked or so, because we didn't lock (or just use RCU).
>
> .. and how do you keep track of which tasks you missed? And no, it's
> not just the new tasks - you have old tasks that have their page
> tables built up too, but need to be updated. They may never need the
> mapping since they may be sleeping and never using the driver or data
> structures that created it (in fact, that's a common case), so filling
> them would be pointless. But if we don't do the lazy fill, we'd have
> to fill them all, because WE DO NOT KNOW.



Right.



>
> > So the parts of the problem I don't understand are:
> >
> > - why don't we have this problem with kmalloc() ?
>
> Hopefully clarified.


Indeed.



> > - did I understand the race that makes the fault necessary correctly,
> >   ie: we walk the tasklist locklessly, add the new mapping if
> >   not present, but we might miss a recently forked task, and
> >   the fault will fix that.
>
> But the _fundamental_ issue is that we do not want to walk the
> tasklist (or the mm_list) AT ALL. It's a f*cking waste of time. It's a
> long list, and nobody cares. In many cases it won't be needed.
>
> The lazy algorithm is _better_. It's way simpler (we take nested
> faults all the time in the kernel, and it's a particularly _easy_ page
> fault to handle with no IO or no locking needed), and it does less
> work. It really boils down to that.


Yeah, agreed. But I'm still confused about when exactly we need to fault
(doubts I detailed in my question above).



> So it's not the lazy page table fill that is the problem. Never has
> been. We've been doing the lazy fill for a long time, and it was
> simple and useful way back when.
>
> The problem has always been NMI, and nothing else. NMI's are nasty,
> and the x86 NMI blocking is insane and crazy.
>
> Which is why I'm so adamant that this should be fixed in the NMI code,
> and we should _not_ talk about trying to screw up other, totally
> unrelated, code. The lazy fill really was never the problem.


Yeah, agreed.

Thanks for your explanations!

From: Steven Rostedt on
On Fri, 2010-07-16 at 14:00 +0200, Frederic Weisbecker wrote:
> On Thu, Jul 15, 2010 at 07:51:55AM -0700, Linus Torvalds wrote:

>
> Such a newly created page table that might race only concerns the
> top-level entries, right? Otherwise it wouldn't race, since the top-level
> entries are shared and updates inside the lower-level pages are naturally
> propagated, if I understood you well.
>
> So, if only newly added top-level entries can trigger such a lazy
> mapping update, I wonder why I experienced this fault every time with
> my patches.
>
> I allocated 8192 bytes per cpu on an x86-32 system that has only 2 GB.
> I doubt there is a top-level page table update there at this time with
> such a small amount of available memory. But still it faults once on
> access.
>
> I have trouble visualizing the race and the problem here.
>

A few trace_printks and a tracing_off() on fault would probably show
exactly what was happening ;-)

-- Steve

