mm: Preemptible mmu_gather [Kernel]

From: Nick Piggin on 8 Apr 2010 23:30

On Thu, Apr 08, 2010 at 09:17:43PM +0200, Peter Zijlstra wrote:
> @@ -39,30 +33,48 @@
> struct mmu_gather {
> struct mm_struct *mm;
> unsigned int nr; /* set to ~0U means fast mode */
> + unsigned int max; /* nr < max */
> unsigned int need_flush;/* Really unmapped some ptes? */
> unsigned int fullmm; /* non-zero means full mm flush */
> - struct page * pages[FREE_PTE_NR];
> +#ifdef HAVE_ARCH_MMU_GATHER
> + struct arch_mmu_gather arch;
> +#endif
> + struct page **pages;
> + struct page *local[8];

Have you done some profiling on this? What I would like to see, if
it's not too much complexity, is to have a small set of pages to
handle common size frees, and then use them up first by default
before attempting to allocate more.

Also, it would be cool to be able to chain allocations to avoid
TLB flushes even on big frees (overridable by arch of course, in
case they're doing some non-preemptible work or you wish to break
up lock hold times). But that might be just getting over-engineered.

> };
>
> -/* Users of the generic TLB shootdown code must declare this storage space. */
> -DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
> +static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
> +{
> + unsigned long addr = __get_free_pages(GFP_ATOMIC, 0);

Slab allocations should be faster, so it's nice to use them in
performance critical code if you don't need the struct page.

Otherwise, looks ok to me.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Peter Zijlstra on 9 Apr 2010 04:20

On Fri, 2010-04-09 at 13:25 +1000, Nick Piggin wrote:
> On Thu, Apr 08, 2010 at 09:17:43PM +0200, Peter Zijlstra wrote:
> > @@ -39,30 +33,48 @@
> > struct mmu_gather {
> > struct mm_struct *mm;
> > unsigned int nr; /* set to ~0U means fast mode */
> > + unsigned int max; /* nr < max */
> > unsigned int need_flush;/* Really unmapped some ptes? */
> > unsigned int fullmm; /* non-zero means full mm flush */
> > - struct page * pages[FREE_PTE_NR];
> > +#ifdef HAVE_ARCH_MMU_GATHER
> > + struct arch_mmu_gather arch;
> > +#endif
> > + struct page **pages;
> > + struct page *local[8];
>
> Have you done some profiling on this? What I would like to see, if
> it's not too much complexity, is to have a small set of pages to
> handle common size frees, and then use them up first by default
> before attempting to allocate more.
>
> Also, it would be cool to be able to chain allocations to avoid
> TLB flushes even on big frees (overridable by arch of course, in
> case they're doing some non-preemptible work or you wish to break
> up lock hold times). But that might be just getting over-engineered.

Did no profiling at all; back when I wrote this I was in a hurry to get
this working for -rt.

But yes, those things do look like something we want to look into; we
can easily add a head structure to these pages like we did for the RCU
batches.

But as it stands I think we can do those things as incrementals on top
of this, no?

What kind of workload would you recommend I use to profile this?


From: Peter Zijlstra on 9 Apr 2010 16:40

On Fri, 2010-04-09 at 13:25 +1000, Nick Piggin wrote:
> Have you done some profiling on this? What I would like to see, if
> it's not too much complexity, is to have a small set of pages to
> handle common size frees, and then use them up first by default
> before attempting to allocate more.
>
> Also, it would be cool to be able to chain allocations to avoid
> TLB flushes even on big frees (overridable by arch of course, in
> case they're doing some non-preemptible work or you wish to break
> up lock hold times). But that might be just getting over-engineered.
>
Measuring ITLB_FLUSH on Intel Nehalem using:

perf stat -a -e r01ae make O=defconfig-build/ -j48 bzImage

-linus   5825850 +- 2545 (100%)
+patches 5891341 +- 6045 (101%)
+below   5783991 +- 4725 ( 99%)

(No slab allocations yet)
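
As a quick sanity check, the percentages in the table can be reproduced from the raw counts. The sketch below is only arithmetic on the numbers quoted above; `percent_of` and `delta` are made-up helper names, not part of the patch.

```c
#include <assert.h>

/* Mean ITLB_FLUSH counts copied from the table above. */
#define BASELINE 5825850LL	/* -linus   */
#define PATCHED  5891341LL	/* +patches */
#define CHAINED  5783991LL	/* +below   */

/* Hypothetical helper: value as a percentage of base, rounded to nearest. */
static long long percent_of(long long value, long long base)
{
	return (100 * value + base / 2) / base;
}

/* Hypothetical helper: signed difference from the baseline count. */
static long long delta(long long value, long long base)
{
	return value - base;
}
```

Calling `delta(PATCHED, BASELINE)` gives the extra flushes caused by the earlier patches, and `delta(CHAINED, BASELINE)` the reduction once batches are chained; both differences are around 1% of the baseline.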

Signed-off-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
---
include/asm-generic/tlb.h | 122 ++++++++++++++++++++++++++++++----------------
1 file changed, 82 insertions(+), 40 deletions(-)

Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -17,16 +17,6 @@
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>

-/*
- * For UP we don't need to worry about TLB flush
- * and page free order so much..
- */
-#ifdef CONFIG_SMP
- #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U)
-#else
- #define tlb_fast_mode(tlb) 1
-#endif
-
#ifdef HAVE_ARCH_RCU_TABLE_FREE
/*
* Semi RCU freeing of the page directories.
@@ -70,31 +60,66 @@ extern void tlb_remove_table(struct mmu_

#endif

+struct mmu_gather_batch {
+ struct mmu_gather_batch *next;
+ unsigned int nr;
+ unsigned int max;
+ struct page *pages[0];
+};
+
+#define MAX_GATHER_BATCH \
+ ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(unsigned long))
+
/* struct mmu_gather is an opaque type used by the mm code for passing around
* any data needed by arch specific code for tlb_remove_page.
*/
struct mmu_gather {
struct mm_struct *mm;
- unsigned int nr; /* set to ~0U means fast mode */
- unsigned int max; /* nr < max */
- unsigned int need_flush;/* Really unmapped some ptes? */
- unsigned int fullmm; /* non-zero means full mm flush */
- struct page **pages;
- struct page *local[8];
+ unsigned int need_flush : 1, /* Did free PTEs */
+ fast_mode : 1; /* No batching */
+ unsigned int fullmm; /* Flush full mm */
+
+ struct mmu_gather_batch *active;
+ struct mmu_gather_batch local;
+ struct page *__pages[8];

#ifdef HAVE_ARCH_RCU_TABLE_FREE
struct mmu_table_batch *batch;
#endif
};

-static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
+/*
+ * For UP we don't need to worry about TLB flush
+ * and page free order so much..
+ */
+#ifdef CONFIG_SMP
+ #define tlb_fast_mode(tlb) ((tlb)->fast_mode)
+#else
+ #define tlb_fast_mode(tlb) 1
+#endif
+
+static inline int tlb_next_batch(struct mmu_gather *tlb)
{
- unsigned long addr = __get_free_pages(GFP_ATOMIC, 0);
+ struct mmu_gather_batch *batch;

- if (addr) {
- tlb->pages = (void *)addr;
- tlb->max = PAGE_SIZE / sizeof(struct page *);
+ batch = tlb->active;
+ if (batch->next) {
+ tlb->active = batch->next;
+ return 1;
}
+
+ batch = (void *)__get_free_pages(GFP_ATOMIC, 0);
+ if (!batch)
+ return 0;
+
+ batch->next = NULL;
+ batch->nr = 0;
+ batch->max = MAX_GATHER_BATCH;
+
+ tlb->active->next = batch;
+ tlb->active = batch;
+
+ return 1;
}

/* tlb_gather_mmu
@@ -105,17 +130,16 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
{
tlb->mm = mm;

- tlb->max = ARRAY_SIZE(tlb->local);
- tlb->pages = tlb->local;
-
- if (num_online_cpus() > 1) {
- tlb->nr = 0;
- __tlb_alloc_pages(tlb);
- } else /* Use fast mode if only one CPU is online */
- tlb->nr = ~0U;
-
+ tlb->need_flush = 0;
+ /* Use fast mode if only one CPU is online */
+ tlb->fast_mode = (num_online_cpus() == 1);
tlb->fullmm = full_mm_flush;

+ tlb->local.next = NULL;
+ tlb->local.nr = 0;
+ tlb->local.max = ARRAY_SIZE(tlb->__pages);
+ tlb->active = &tlb->local;
+
#ifdef HAVE_ARCH_RCU_TABLE_FREE
tlb->batch = NULL;
#endif
@@ -124,6 +148,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
static inline void
tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
{
+ struct mmu_gather_batch *batch;
+
if (!tlb->need_flush)
return;
tlb->need_flush = 0;
@@ -131,12 +157,14 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
#ifdef HAVE_ARCH_RCU_TABLE_FREE
tlb_table_flush(tlb);
#endif
- if (!tlb_fast_mode(tlb)) {
- free_pages_and_swap_cache(tlb->pages, tlb->nr);
- tlb->nr = 0;
- if (tlb->pages == tlb->local)
- __tlb_alloc_pages(tlb);
+ if (tlb_fast_mode(tlb))
+ return;
+
+ for (batch = &tlb->local; batch; batch = batch->next) {
+ free_pages_and_swap_cache(batch->pages, batch->nr);
+ batch->nr = 0;
}
+ tlb->active = &tlb->local;
}

/* tlb_finish_mmu
@@ -146,13 +174,18 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
static inline void
tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
{
+ struct mmu_gather_batch *batch, *next;
+
tlb_flush_mmu(tlb, start, end);

/* keep the page table cache within bounds */
check_pgt_cache();

- if (tlb->pages != tlb->local)
- free_pages((unsigned long)tlb->pages, 0);
+ for (batch = tlb->local.next; batch; batch = next) {
+ next = batch->next;
+ free_pages((unsigned long)batch, 0);
+ }
+ tlb->local.next = NULL;
}

/* tlb_remove_page
@@ -162,14 +195,23 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
*/
static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
{
+ struct mmu_gather_batch *batch;
+
tlb->need_flush = 1;
+
if (tlb_fast_mode(tlb)) {
free_page_and_swap_cache(page);
return;
}
- tlb->pages[tlb->nr++] = page;
- if (tlb->nr >= tlb->max)
- tlb_flush_mmu(tlb, 0, 0);
+
+ batch = tlb->active;
+ if (batch->nr == batch->max) {
+ if (!tlb_next_batch(tlb))
+ tlb_flush_mmu(tlb, 0, 0);
+ batch = tlb->active;
+ }
+ tlb->need_flush = 1; /* a forced flush above clears it */
+ batch->pages[batch->nr++] = page;
}

/**
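
For readers who want to poke at the batching logic outside the kernel, here is a minimal userspace sketch of the chain described in the patch: a small embedded batch is used first, further page-sized batches are chained on demand, and a failed allocation falls back to an early flush. It is a model under stated assumptions, not the kernel code: `malloc()` stands in for `__get_free_pages(GFP_ATOMIC, 0)`, a 4 KiB page size is assumed, each batch carries an explicit `pages` pointer instead of relying on a zero-length array followed by the local slots, `need_flush` is set at the point a page is stored, and the `fail_alloc`, `freed`, `flushes` and `allocs` fields exist only for experimentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */
#define LOCAL_SLOTS 8u		/* mirrors __pages[8] in the patch */

struct batch {
	struct batch *next;
	unsigned int nr;	/* slots in use */
	unsigned int max;	/* capacity of pages[] */
	void **pages;		/* model-only: explicit pointer to the slots */
};

/* Slots that fit in one "page" after the header, like MAX_GATHER_BATCH. */
#define MODEL_MAX_BATCH \
	((MODEL_PAGE_SIZE - sizeof(struct batch)) / sizeof(void *))

struct gather {
	struct batch *active;
	struct batch local;
	void *local_pages[LOCAL_SLOTS];
	int need_flush;
	int fail_alloc;		/* model-only: simulate allocation failure */
	unsigned long freed, flushes, allocs;	/* model-only counters */
};

static void gather_init(struct gather *g)
{
	g->local.next = NULL;
	g->local.nr = 0;
	g->local.max = LOCAL_SLOTS;
	g->local.pages = g->local_pages;
	g->active = &g->local;
	g->need_flush = 0;
	g->fail_alloc = 0;
	g->freed = g->flushes = g->allocs = 0;
}

/* Advance to the next batch, reusing a chained one before allocating. */
static int gather_next_batch(struct gather *g)
{
	struct batch *b = g->active->next;

	if (!b) {
		b = g->fail_alloc ? NULL : malloc(MODEL_PAGE_SIZE);
		if (!b)
			return 0;
		g->allocs++;
		b->next = NULL;
		b->nr = 0;
		b->max = (unsigned int)MODEL_MAX_BATCH;
		b->pages = (void **)(b + 1);	/* slots follow the header */
		g->active->next = b;
	}
	g->active = b;
	return 1;
}

/* Stand-in for the TLB flush plus freeing the pages of every batch. */
static void gather_flush(struct gather *g)
{
	struct batch *b;

	if (!g->need_flush)
		return;
	g->need_flush = 0;
	g->flushes++;
	for (b = &g->local; b; b = b->next) {
		g->freed += b->nr;
		b->nr = 0;
	}
	g->active = &g->local;
}

static void gather_add(struct gather *g, void *page)
{
	struct batch *b = g->active;

	if (b->nr == b->max) {
		if (!gather_next_batch(g))
			gather_flush(g);	/* no memory: flush, start over */
		b = g->active;
	}
	g->need_flush = 1;	/* set here so the stored page is always flushed */
	b->pages[b->nr++] = page;
}

/* Final flush, then release every heap batch; only the local one remains. */
static void gather_finish(struct gather *g)
{
	struct batch *b, *next;

	gather_flush(g);
	for (b = g->local.next; b; b = next) {
		next = b->next;
		free(b);
	}
	g->local.next = NULL;
}
```

A caller would run `gather_init()`, feed pointers through `gather_add()`, and end with `gather_finish()`. Setting `fail_alloc` before adding more than `LOCAL_SLOTS` entries exercises the fallback path, where each full local batch forces a flush.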


From: Peter Zijlstra on 19 Apr 2010 15:20

On Fri, 2010-04-09 at 22:36 +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-09 at 13:25 +1000, Nick Piggin wrote:
> > Have you done some profiling on this? What I would like to see, if
> > it's not too much complexity, is to have a small set of pages to
> > handle common size frees, and then use them up first by default
> > before attempting to allocate more.
> >
> > Also, it would be cool to be able to chain allocations to avoid
> > TLB flushes even on big frees (overridable by arch of course, in
> > case they're doing some non-preemptible work or you wish to break
> > up lock hold times). But that might be just getting over-engineered.

[ patch to do very long queues ]

One thing that comes from having preemptible mmu_gather, and esp. when
we allow such very long gathers, is that we can potentially have a very
large number of pages stuck on these lists.

So we'd need to hook into reclaim somehow to allow flushing of them when
we're falling short.
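
One way to picture such a hook, purely as a thought experiment: track how many pages a gather is holding and ask for a flush either when a cap is reached or when reclaim signals pressure. Nothing below comes from the thread or from the kernel; `struct gather_budget`, its fields and `gather_should_flush` are hypothetical names, and how reclaim would actually raise the pressure signal is left open.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one gather; not a kernel structure. */
struct gather_budget {
	unsigned long queued;	/* pages currently parked on batch lists */
	unsigned long limit;	/* assumed cap, chosen by the caller */
	bool under_pressure;	/* assumed signal set by reclaim */
};

/* Decide whether the queued pages should be flushed and freed now. */
static bool gather_should_flush(const struct gather_budget *b)
{
	return b->under_pressure || b->queued >= b->limit;
}
```

In this picture the check would sit next to the point where a page is queued, so a long gather gives its pages back early instead of holding them until the final flush.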

