From: Mel Gorman on
On Wed, Apr 28, 2010 at 02:20:56AM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 28, 2010 at 09:13:45AM +0900, KAMEZAWA Hiroyuki wrote:
> > Doing some check in move_ptes() after vma_adjust() is not safe.
> > IOW, when vma's information and information in page-table is incosistent...objrmap
> > is broken and migartion will cause panic.
> >
> > Then...I think there are 2 ways.
> > 1. use seqcounter in "mm_struct" as previous patch and lock it at mremap.
> > or
> > 2. get_user_pages_fast() when do remap.
>
> 3 take the anon_vma->lock
>

I've been looking at ways during the day that the anon_vma lock can be held
while the page tables are being allocated. The schemes were way too hairy
just to cover a migration corner case.

As this is particular to exec, I'm wondering if Kamezawa's additional proposal
of just skipping migration of pages within the temporary stack might be the
best solution overall in terms of effectiveness and simplicity. His patch
introduced a new variable to the VMA but it shouldn't be necessary and it
altered vma_address which is unnecessary.

Here is a different version of the same basic idea to skip temporary VMAs
during migration. Maybe go with this?

(As a heads-up, I'll also be going offline in about 24 hours until Tuesday
morning. The area I'm in has zero internet access)

==== CUT HERE ====
mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks

Page migration requires rmap to be able to find all ptes mapping a page
at all times, otherwise the migration entry can be instantiated, but it
is possible to leave one behind if the second rmap_walk fails to find
the page. If this page is later faulted, migration_entry_to_page() will
call BUG because the page is locked indicating the page was migrated by
the migration PTE not cleaned up.

There is a race between shift_arg_pages and migration that allows this
bug to trigger. A temporary stack is setup during exec and later moved. If
migration moves a page in the temporary stack and the VMA is then removed,
the migration PTE may not be found leading to a BUG when the stack is faulted.

Ideally, shift_arg_pages must run atomically with respect of rmap_walk by
holding the anon_vma lock but this is problematic as pages must be allocated
for page tables. Instead, this patch identifies when it is about to migrate
pages from a temporary stack and leaves them alone. Memory hot-remove will
try again, sys_move_pages() wouldn't be operating during exec() time and
memory compaction will just continue to another page without concern.

[kamezawa.hiroyu(a)jp.fujitsu.com: Idea for having migration skip the stacks]
Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
---
mm/rmap.c | 31 ++++++++++++++++++++++++++++++-
1 files changed, 30 insertions(+), 1 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..5aaf4df 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1141,6 +1141,21 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
return ret;
}

+static bool is_vma_temporary_stack(struct vm_area_struct *vma)
+{
+ if (vma->vm_flags != VM_STACK_FLAGS)
+ return false;
+
+ /*
+ * Only during exec will the total VM consumed by a process
+ * be exacly the same as the stack
+ */
+ if (vma->vm_mm->stack_vm == 1 && vma->vm_mm->total_vm == 1)
+ return true;
+
+ return false;
+}
+
/**
* try_to_unmap_anon - unmap or unlock anonymous page using the object-based
* rmap method
@@ -1169,7 +1184,21 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)

list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
- unsigned long address = vma_address(page, vma);
+ unsigned long address;
+
+ /*
+ * During exec, a temporary VMA is setup and later moved.
+ * The VMA is moved under the anon_vma lock but not the
+ * page tables leading to a race where migration cannot
+ * find the migration ptes. Rather than increasing the
+ * locking requirements of exec(), migration skips
+ * temporary VMAs until after exec() completes.
+ */
+ if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
+ is_vma_temporary_stack(vma))
+ continue;
+
+ address = vma_address(page, vma);
if (address == -EFAULT)
continue;
ret = try_to_unmap_one(page, vma, address, flags);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Mel Gorman on
On Wed, Apr 28, 2010 at 03:23:56PM +0100, Mel Gorman wrote:
> On Wed, Apr 28, 2010 at 02:20:56AM +0200, Andrea Arcangeli wrote:
> > On Wed, Apr 28, 2010 at 09:13:45AM +0900, KAMEZAWA Hiroyuki wrote:
> > > Doing some check in move_ptes() after vma_adjust() is not safe.
> > > IOW, when vma's information and information in page-table is incosistent...objrmap
> > > is broken and migartion will cause panic.
> > >
> > > Then...I think there are 2 ways.
> > > 1. use seqcounter in "mm_struct" as previous patch and lock it at mremap.
> > > or
> > > 2. get_user_pages_fast() when do remap.
> >
> > 3 take the anon_vma->lock
> >
>
> <SNIP>
>
> Here is a different version of the same basic idea to skip temporary VMAs
> during migration. Maybe go with this?
>
> +static bool is_vma_temporary_stack(struct vm_area_struct *vma)
> +{
> + if (vma->vm_flags != VM_STACK_FLAGS)
> + return false;
> +
> + /*
> + * Only during exec will the total VM consumed by a process
> + * be exacly the same as the stack
> + */
> + if (vma->vm_mm->stack_vm == 1 && vma->vm_mm->total_vm == 1)
> + return true;
> +
> + return false;
> +}
> +

The assumptions on the vm flags is of course totally wrong. VM_EXEC might
be applied as well as default flags from the mm. The following is the same
basic idea, skip VMAs belonging to processes in exec rather than trying
to hold anon_vma->lock across move_page_tables(). Not tested yet.

==== CUT HERE ====
mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks

Page migration requires rmap to be able to find all ptes mapping a page
at all times, otherwise the migration entry can be instantiated, but it
is possible to leave one behind if the second rmap_walk fails to find
the page. If this page is later faulted, migration_entry_to_page() will
call BUG because the page is locked indicating the page was migrated by
the migration PTE not cleaned up.

There is a race between shift_arg_pages and migration that allows this
bug to trigger. A temporary stack is setup during exec and later moved. If
migration moves a page in the temporary stack and the VMA is then removed,
the migration PTE may not be found leading to a BUG when the stack is faulted.

Ideally, shift_arg_pages must run atomically with respect of rmap_walk by
holding the anon_vma lock but this is problematic as pages must be allocated
for page tables. Instead, this patch skips processes in exec by making an
assumption that a mm with stack_vm == 1 and total_vm == 1 is a process in
exec() that hasn't finalised the temporary stack yet. Memory hot-remove
will try again, sys_move_pages() wouldn't be operating during exec() time
and memory compaction will just continue to another page without concern.

[kamezawa.hiroyu(a)jp.fujitsu.com: Idea for having migration skip temporary stacks]
Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
---
mm/rmap.c | 28 +++++++++++++++++++++++++++-
1 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..9e20188 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1141,6 +1141,18 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
return ret;
}

+static bool is_vma_in_exec(struct vm_area_struct *vma)
+{
+ /*
+ * Only during exec will the total VM consumed by a process
+ * be exacly the same as the stack and both equal to 1
+ */
+ if (vma->vm_mm->stack_vm == 1 && vma->vm_mm->total_vm == 1)
+ return true;
+
+ return false;
+}
+
/**
* try_to_unmap_anon - unmap or unlock anonymous page using the object-based
* rmap method
@@ -1169,7 +1181,21 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)

list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
- unsigned long address = vma_address(page, vma);
+ unsigned long address;
+
+ /*
+ * During exec, a temporary VMA is setup and later moved.
+ * The VMA is moved under the anon_vma lock but not the
+ * page tables leading to a race where migration cannot
+ * find the migration ptes. Rather than increasing the
+ * locking requirements of exec, migration skips
+ * VMAs in processes calling exec.
+ */
+ if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
+ is_vma_in_exec(vma))
+ continue;
+
+ address = vma_address(page, vma);
if (address == -EFAULT)
continue;
ret = try_to_unmap_one(page, vma, address, flags);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli on
On Wed, Apr 28, 2010 at 03:57:38PM +0100, Mel Gorman wrote:
> On Wed, Apr 28, 2010 at 03:23:56PM +0100, Mel Gorman wrote:
> > On Wed, Apr 28, 2010 at 02:20:56AM +0200, Andrea Arcangeli wrote:
> > > On Wed, Apr 28, 2010 at 09:13:45AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > Doing some check in move_ptes() after vma_adjust() is not safe.
> > > > IOW, when vma's information and information in page-table is incosistent...objrmap
> > > > is broken and migartion will cause panic.
> > > >
> > > > Then...I think there are 2 ways.
> > > > 1. use seqcounter in "mm_struct" as previous patch and lock it at mremap.
> > > > or
> > > > 2. get_user_pages_fast() when do remap.
> > >
> > > 3 take the anon_vma->lock
> > >
> >
> > <SNIP>
> >
> > Here is a different version of the same basic idea to skip temporary VMAs
> > during migration. Maybe go with this?
> >
> > +static bool is_vma_temporary_stack(struct vm_area_struct *vma)
> > +{
> > + if (vma->vm_flags != VM_STACK_FLAGS)
> > + return false;
> > +
> > + /*
> > + * Only during exec will the total VM consumed by a process
> > + * be exacly the same as the stack
> > + */
> > + if (vma->vm_mm->stack_vm == 1 && vma->vm_mm->total_vm == 1)
> > + return true;
> > +
> > + return false;
> > +}
> > +
>
> The assumptions on the vm flags is of course totally wrong. VM_EXEC might
> be applied as well as default flags from the mm. The following is the same
> basic idea, skip VMAs belonging to processes in exec rather than trying
> to hold anon_vma->lock across move_page_tables(). Not tested yet.

This is better than the other, that made it look like people could set
broken rmap at arbitrary times, at least this shows it's ok only
during execve before anything run, but if we can't take the anon-vma
lock really better than having to make these special checks inside
every rmap_walk that has to be accurate, we should just delay the
linkage of the stack vma into its anon-vma, until after the move_pages.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Mel Gorman on
On Wed, Apr 28, 2010 at 05:16:14PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 28, 2010 at 03:57:38PM +0100, Mel Gorman wrote:
> > On Wed, Apr 28, 2010 at 03:23:56PM +0100, Mel Gorman wrote:
> > > On Wed, Apr 28, 2010 at 02:20:56AM +0200, Andrea Arcangeli wrote:
> > > > On Wed, Apr 28, 2010 at 09:13:45AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > Doing some check in move_ptes() after vma_adjust() is not safe.
> > > > > IOW, when vma's information and information in page-table is incosistent...objrmap
> > > > > is broken and migartion will cause panic.
> > > > >
> > > > > Then...I think there are 2 ways.
> > > > > 1. use seqcounter in "mm_struct" as previous patch and lock it at mremap.
> > > > > or
> > > > > 2. get_user_pages_fast() when do remap.
> > > >
> > > > 3 take the anon_vma->lock
> > > >
> > >
> > > <SNIP>
> > >
> > > Here is a different version of the same basic idea to skip temporary VMAs
> > > during migration. Maybe go with this?
> > >
> > > +static bool is_vma_temporary_stack(struct vm_area_struct *vma)
> > > +{
> > > + if (vma->vm_flags != VM_STACK_FLAGS)
> > > + return false;
> > > +
> > > + /*
> > > + * Only during exec will the total VM consumed by a process
> > > + * be exacly the same as the stack
> > > + */
> > > + if (vma->vm_mm->stack_vm == 1 && vma->vm_mm->total_vm == 1)
> > > + return true;
> > > +
> > > + return false;
> > > +}
> > > +
> >
> > The assumptions on the vm flags is of course totally wrong. VM_EXEC might
> > be applied as well as default flags from the mm. The following is the same
> > basic idea, skip VMAs belonging to processes in exec rather than trying
> > to hold anon_vma->lock across move_page_tables(). Not tested yet.
>
> This is better than the other, that made it look like people could set
> broken rmap at arbitrary times, at least this shows it's ok only
> during execve before anything run, but if we can't take the anon-vma
> lock really better than having to make these special checks inside
> every rmap_walk that has to be accurate, we should just delay the
> linkage of the stack vma into its anon-vma, until after the move_pages.
>

Is it possible to delay the linkage like that? As arguments get copied into
the temporary stack before it gets moved, I'd have expected the normal fault
path to prepare and attach the anon_vma. We could special case it but
that isn't very palatable either.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli on
On Wed, Apr 28, 2010 at 04:23:54PM +0100, Mel Gorman wrote:
> Is it possible to delay the linkage like that? As arguments get copied into
> the temporary stack before it gets moved, I'd have expected the normal fault
> path to prepare and attach the anon_vma. We could special case it but
> that isn't very palatable either.

I'm not sure what is more palatable, but I feel it should be fixed in
execve.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/