From: Christoph Lameter
On Thu, 29 Apr 2010, Mel Gorman wrote:

> There is a race between shift_arg_pages and migration that triggers this bug.
> A temporary stack is setup during exec and later moved. If migration moves
> a page in the temporary stack and the VMA is then removed before migration
> completes, the migration PTE may not be found leading to a BUG when the
> stack is faulted.

A simpler solution would be to not allow migration of the temporary stack?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli
On Mon, May 03, 2010 at 02:40:44AM +0900, Minchan Kim wrote:
> On Fri, Apr 30, 2010 at 1:21 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > Hi Mel,
> >
> > did you see my proposed fix? I'm running with it applied, I'd be
> > interested if you can test it. Surely it will also work for new
> > anon-vma code in upstream, because at that point there's just 1
> > anon-vma and nothing else attached to the vma.
> >
> > http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=6efa1dfa5152ef8d7f26beb188d6877525a9dd03
> >
> > I think it's wrong to try to handle the race in rmap walk by making
> > magic checks on vm_flags VM_GROWSDOWN|GROWSUP and
> > vma->vm_mm->map_count == 1, when we can fix it fully and simply in
> > exec.c by indexing two vmas in the same anon-vma with a different
> > vm_start so the pages will be found at all times by the rmap_walk.
> >
>
> I like this approach more than excluding the temporary stack from
> migration.
>
> From the viewpoint of performance, Mel and Kame's approach looks good
> and simple. But from the viewpoint of correctness, Andrea's looks
> better. I mean, Mel's approach is: the problem is here, but let's
> solve it over there. That creates a dependency between here and there.
> And in the future, if the temporary stack and the rmap code cause
> another problem, we would again have to solve it over there. :)

That explains exactly why I wanted to solve it locally in exec.c, and
by using the same method as mremap. And it fixes all users, not just
migrate (split_huge_page may also need it in the future, if we ever
allow the user stack to be born huge instead of growing huge through
khugepaged when needed).
From: Andrea Arcangeli
On Tue, May 04, 2010 at 11:32:13AM +0100, Mel Gorman wrote:
> Unfortunately, the same bug triggers after about 18 minutes. The objective of
> your fix is very simple - have a VMA covering the new range so that rmap can
> find it. However, no lock is held during move_page_tables() because of the
> need to call the page allocator. Due to the lack of locking, is it possible
> that something like the following is happening?
>
> Exec Process                              Migration Process
> begin move_page_tables
>                                           begin rmap walk
>                                           take anon_vma locks
>                                           find new location of pte (do nothing)
> copy migration pte to new location
> #### Bad PTE now in place
>                                           find old location of pte
>                                           remove old migration pte
>                                           release anon_vma locks
> remove temporary VMA
> some time later, bug on migration pte
>
> Even with the care taken, a migration PTE got copied and then left behind. What
> I haven't confirmed at this point is if the ordering of the walk in "migration
> process" is correct in the above scenario. The order is important for
> the race as described to happen.

Ok, so this is the ordering dependency on the anon_vma list striking
again. I didn't realize the ordering would matter here, but it does,
as shown above. Great catch! The destination vma of move_page_tables
has to be at the tail of the anon_vma list, just like the child vma
has to be at the tail to avoid the equivalent race in fork. This has
to be a requirement for mremap too. We just want to enforce the same
invariants that mremap already enforces, rather than add new special
cases to the VM.
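The ordering invariant can be seen in a small user-space model (a toy
sketch, not kernel code; the slot and function names here are made up):
a migration PTE sits in exactly one of two locations, the mover
transfers it from the old slot to the new one at some arbitrary point,
and the rmap walk finds it under every interleaving only when the new
(destination) vma is visited last:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the race, not kernel code.  A migration PTE lives in
 * exactly one of two slots: OLD (its original location) or NEW (where
 * move_page_tables() puts it).  The move is modeled as a single step
 * transferring the PTE from OLD to NEW.  The rmap walk visits the two
 * vmas in anon_vma list order and removes any migration PTE it finds.
 * stale_pte_left() runs the walk with the move interleaved at position
 * move_at (0 = before the walk, 1 = between the two visits, 2 = after
 * the walk) and reports whether a migration PTE survives.
 */
enum { OLD, NEW, NSLOT };

static bool stale_pte_left(const int walk_order[NSLOT], int move_at)
{
    bool pte[NSLOT] = { true, false };       /* PTE starts at OLD */

    for (int step = 0; step <= NSLOT; step++) {
        if (step == move_at && pte[OLD]) {   /* exec: move_page_tables */
            pte[OLD] = false;
            pte[NEW] = true;
        }
        if (step < NSLOT)                    /* migration: rmap walk visit */
            pte[walk_order[step]] = false;   /* remove migration PTE */
    }
    return pte[OLD] || pte[NEW];
}
```

With the destination at the tail (walk order OLD then NEW) no
interleaving of the move leaves a PTE behind; with the destination at
the head, the interleaving Mel describes does.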

== for new anon-vma code ==
Subject: fix race between shift_arg_pages and rmap_walk

From: Andrea Arcangeli <aarcange@redhat.com>

migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.

And split_huge_page() will have the same requirements as migrate.c
already has.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
#include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>

#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -503,7 +504,9 @@ static int shift_arg_pages(struct vm_are
unsigned long length = old_end - old_start;
unsigned long new_start = old_start - shift;
unsigned long new_end = old_end - shift;
+ unsigned long moved_length;
struct mmu_gather *tlb;
+ struct vm_area_struct *tmp_vma;

BUG_ON(new_start > new_end);

@@ -515,17 +518,43 @@ static int shift_arg_pages(struct vm_are
return -EFAULT;

/*
+ * We need to create a fake temporary vma and index it in the
+ * anon_vma list in order to allow the pages to be reachable
+ * at all times by the rmap walk for migrate, while
+ * move_page_tables() is running.
+ */
+ tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ if (!tmp_vma)
+ return -ENOMEM;
+ *tmp_vma = *vma;
+ INIT_LIST_HEAD(&tmp_vma->anon_vma_chain);
+ /*
* cover the whole range: [new_start, old_end)
*/
- if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
+ tmp_vma->vm_start = new_start;
+ /*
+ * The tmp_vma destination of the copy (with the new vm_start)
+ * has to be at the end of the anon_vma list for the rmap_walk
+ * to find the moved pages at all times.
+ */
+ if (unlikely(anon_vma_clone(tmp_vma, vma))) {
+ kmem_cache_free(vm_area_cachep, tmp_vma);
return -ENOMEM;
+ }

/*
* move the page tables downwards, on failure we rely on
* process cleanup to remove whatever mess we made.
*/
- if (length != move_page_tables(vma, old_start,
- vma, new_start, length))
+ moved_length = move_page_tables(vma, old_start,
+ vma, new_start, length);
+
+ vma->vm_start = new_start;
+ /* rmap walk will already find all pages using the new_start */
+ unlink_anon_vmas(tmp_vma);
+ kmem_cache_free(vm_area_cachep, tmp_vma);
+
+ if (length != moved_length)
return -ENOMEM;

lru_add_drain();
@@ -551,7 +580,7 @@ static int shift_arg_pages(struct vm_are
/*
* Shrink the vma to just the new range. Always succeeds.
*/
- vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
+ vma->vm_end = new_end;

return 0;
}


== for old anon-vma code ==
Subject: fix race between shift_arg_pages and rmap_walk

From: Andrea Arcangeli <aarcange@redhat.com>

migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.

And split_huge_page() will have the same requirements as migrate.c
already has.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
#include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>

#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -502,7 +503,9 @@ static int shift_arg_pages(struct vm_are
unsigned long length = old_end - old_start;
unsigned long new_start = old_start - shift;
unsigned long new_end = old_end - shift;
+ unsigned long moved_length;
struct mmu_gather *tlb;
+ struct vm_area_struct *tmp_vma;

BUG_ON(new_start > new_end);

@@ -514,16 +517,41 @@ static int shift_arg_pages(struct vm_are
return -EFAULT;

/*
+ * We need to create a fake temporary vma and index it in the
+ * anon_vma list in order to allow the pages to be reachable
+ * at all times by the rmap walk for migrate, while
+ * move_page_tables() is running.
+ */
+ tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ if (!tmp_vma)
+ return -ENOMEM;
+ *tmp_vma = *vma;
+
+ /*
* cover the whole range: [new_start, old_end)
*/
- vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL);
+ tmp_vma->vm_start = new_start;
+
+ /*
+ * The tmp_vma destination of the copy (with the new vm_start)
+ * has to be at the end of the anon_vma list for the rmap_walk
+ * to find the moved pages at all times.
+ */
+ anon_vma_link(tmp_vma);

/*
* move the page tables downwards, on failure we rely on
* process cleanup to remove whatever mess we made.
*/
- if (length != move_page_tables(vma, old_start,
- vma, new_start, length))
+ moved_length = move_page_tables(vma, old_start,
+ vma, new_start, length);
+
+ vma->vm_start = new_start;
+ /* rmap walk will already find all pages using the new_start */
+ anon_vma_unlink(tmp_vma);
+ kmem_cache_free(vm_area_cachep, tmp_vma);
+
+ if (length != moved_length)
return -ENOMEM;

lru_add_drain();
@@ -549,7 +577,7 @@ static int shift_arg_pages(struct vm_are
/*
* shrink the vma to just the new range.
*/
- vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
+ vma->vm_end = new_end;

return 0;
}


I'll release new THP-23 and THP-23-anon_vma_chain tags soon enough (a
prerelease is already in aa.git, but it's missing the above bit)...
From: Andrea Arcangeli
On Tue, May 04, 2010 at 03:33:11PM +0100, Mel Gorman wrote:
> I'm currently testing this and have seen no problems after an hour which
> is typically good. To be absolutly sure, it needs 24 hours but so far so
> good. The changelog is a tad on the light side so maybe you'd like to take
> this one instead and edit it to your liking?

I'll take your changelog for aa.git, thanks! And the non-trivial stuff
is documented in the code too.

So now in aa.git I have two branches: master -> old-anon_vma,
anon_vma_chain -> new-anon_vma.

anon_vma_chain starts with Rik's patch 1/2 and then this patch;
old-anon_vma starts with backout-anon-vma and then this patch 2
backported to the old anon-vma code. After the removal of all
vma->anon_vma->lock usages from the THP code, and after switching to a
slower get_page(); spin_unlock(page_table_lock); page_lock_anon_vma(page)
model, the anon_vma_chain branch has a chance to be as solid as the
master branch. The anon_vma_chain branch can be pulled into mainline
branches too. The master branch also no longer uses any
vma->anon_vma->lock, even though it still could (and it would be a bit
faster), in order to give more testing to the anon_vma_chain code.

You can see the difference with "git diff master anon_vma_chain".

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=summary
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog;h=refs/heads/anon_vma_chain

These should be the THP-23 and THP-23-anon_vma_chain tags; I'll do a
proper release soon.
From: Christoph Lameter
On Tue, 4 May 2010, Mel Gorman wrote:

> On Sat, May 01, 2010 at 08:56:18PM -0500, Christoph Lameter wrote:
> > On Thu, 29 Apr 2010, Mel Gorman wrote:
> >
> > > There is a race between shift_arg_pages and migration that triggers this bug.
> > > A temporary stack is setup during exec and later moved. If migration moves
> > > a page in the temporary stack and the VMA is then removed before migration
> > > completes, the migration PTE may not be found leading to a BUG when the
> > > stack is faulted.
> >
> > A simpler solution would be to not allow migration of the temporary stack?
> >
>
> The patch's intention is to not migrate pages within the temporary
> stack. What are you suggesting that is different?

A simple way to disallow migration of pages is to increment the refcount
of a page.
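A toy user-space sketch of that idea (illustrative only; the toy_*
names are invented and this is not the kernel API, though migration's
unmap path does bail out when a page's reference count exceeds the
references it can account for):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: migration refuses to proceed when the
 * page holds more references than the migration path can account for
 * (its mappings plus the reference taken to isolate the page).  Taking
 * one extra reference, as get_page() would, therefore pins the page.
 */
struct toy_page { int refcount; int mapcount; };

static bool try_migrate(struct toy_page *p)
{
    int expected_refs = p->mapcount + 1;   /* mappings + isolation ref */
    if (p->refcount != expected_refs)
        return false;                      /* pinned: migration fails */
    /* ...copy the page contents and replace the ptes here... */
    return true;
}

static void toy_get_page(struct toy_page *p) { p->refcount++; }
static void toy_put_page(struct toy_page *p) { p->refcount--; }
```

So holding an extra reference across the temporary-stack window would
make migration fail and retry rather than instantiate a migration PTE
there.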
