From: Rik van Riel on
On 04/26/2010 06:37 PM, Mel Gorman wrote:
> From: KAMEZAWA Hiroyuki<kamezawa.hiroyu(a)jp.fujitsu.com>
>
> At page migration, we replace pte with migration_entry, which has
> similar format as swap_entry and replace it with real pfn at the
> end of migration. But there is a race with fork()'s copy_page_range().

Reviewed-by: Rik van Riel <riel(a)redhat.com>

--
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli on
Ok I had a first look:

On Tue, Apr 27, 2010 at 10:30:50PM +0100, Mel Gorman wrote:
> CPU A                           CPU B
>                                 do_fork()
>                                 copy_mm() (from process1 to process2)
>                                 insert new vma to mmap_list (if inode/anon_vma)

Insert to the tail of the anon_vma list...

> pte_lock(process1)
> unmap a page
> insert migration_entry
> pte_unlock(process1)
>
> migrate page copy
>                                 copy_page_range
> remap new page by rmap_walk()

rmap_walk will walk process1 first! It's at the head, the vmas with
unmapped ptes are at the tail so process1 is walked before process2.

> pte_lock(process2)
> found no pte.
> pte_unlock(process2)
>                                 pte_lock(process2)
>                                 pte_lock(process1)
>                                 copy migration entry to process2
>                                 pte_unlock(process1)
>                                 pte_unlock(process2)
> pte_lock(process1)
> replace migration entry
> to new page's pte.
> pte_unlock(process1)

rmap_walk has to lock down process1 before process2; this is the
ordering issue I already mentioned in an earlier email. So it cannot
happen and this patch is unnecessary.

The ordering is fundamental and, as I said, anon_vma_link already adds new
vmas to the _tail_ of the anon_vma list. And this is why it has to add to
the tail. If anon_vma_link added new vmas to the head of the list,
the above bug could materialize, but it doesn't, so it cannot happen.

In mainline anon_vma_link is called anon_vma_chain_link, see the
list_add_tail there to provide this guarantee.
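
The tail-insertion guarantee can be illustrated outside the kernel. What
follows is a minimal user-space sketch, not kernel code: list_add and
list_add_tail are reimplemented here with the same semantics as the kernel
helpers of those names, while struct toy_vma and walk_order() are
hypothetical names invented for the illustration. It assumes vmas are
linked in fork order (process1 first, then process2) and that the walk
starts at the head of the list.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Link `entry` between two known neighbours. */
static void list_insert(struct list_head *entry,
                        struct list_head *prev, struct list_head *next)
{
    next->prev = entry;
    entry->next = next;
    entry->prev = prev;
    prev->next = entry;
}

/* Insert right after the head, so the newest entry is walked first. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    list_insert(entry, head, head->next);
}

/* Insert right before the head, so the newest entry is walked last. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    list_insert(entry, head->prev, head);
}

/* Hypothetical toy vma: just an owner id and its list linkage. */
struct toy_vma {
    struct list_head link;
    int owner; /* 1 = process1 (parent), 2 = process2 (child), ... */
};

/*
 * Link `count` vmas in fork order (owner 1 first), then walk the list
 * from the head and record the owners in `out`. Returns the number of
 * entries visited.
 */
int walk_order(int add_at_tail, int count, int *out)
{
    struct list_head head;
    struct toy_vma vmas[8];
    struct list_head *pos;
    int i, visited = 0;

    assert(count >= 0 && count <= 8);
    list_init(&head);
    for (i = 0; i < count; i++) {
        vmas[i].owner = i + 1;
        if (add_at_tail)
            list_add_tail(&vmas[i].link, &head);
        else
            list_add(&vmas[i].link, &head);
    }
    for (pos = head.next; pos != &head; pos = pos->next) {
        struct toy_vma *vma = (struct toy_vma *)
            ((char *)pos - offsetof(struct toy_vma, link));
        out[visited++] = vma->owner;
    }
    return visited;
}
```

With add_at_tail set and two vmas, the owners come back as 1, 2, so
process1 is walked first. With head insertion the order flips to 2, 1,
which is the case the ordering argument rules out.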

Because process1 is walked first by CPU A, the migration entry is
replaced by the final pte before copy-migration-entry
runs. Alternatively, if copy-migration-entry runs before
process1 is walked, the migration entry will be copied and found in
process2.
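
The two cases can be checked with a toy model. This is a sketch under
simplifying assumptions, not the kernel's logic: each process has a single
pte, fork's copy is modeled as one atomic assignment, and enum toy_pte,
remap_one() and child_pte_after_walk() are hypothetical names. The walk is
parent-first, and fork's copy is injected before, between, or after the
two visits.

```c
#include <assert.h>

/* Hypothetical pte states for the toy model. */
enum toy_pte { PTE_NONE, PTE_MIGRATION, PTE_NEW };

/* Replace a migration entry with the final pte; leave anything else alone. */
static void remap_one(enum toy_pte *pte)
{
    if (*pte == PTE_MIGRATION)
        *pte = PTE_NEW;
}

/*
 * Parent-first walk with fork's pte copy injected at `fork_point`:
 *   0 = before process1 is walked
 *   1 = between the visits to process1 and process2
 *   2 = after both visits
 * Returns the child's final pte.
 */
enum toy_pte child_pte_after_walk(int fork_point)
{
    enum toy_pte parent = PTE_MIGRATION; /* page already unmapped for migration */
    enum toy_pte child = PTE_NONE;       /* vma linked, ptes not copied yet */

    assert(fork_point >= 0 && fork_point <= 2);

    if (fork_point == 0)
        child = parent;   /* fork copies the migration entry */
    remap_one(&parent);   /* walk visits process1 */
    if (fork_point == 1)
        child = parent;   /* fork copies the already final pte */
    remap_one(&child);    /* walk visits process2 */
    if (fork_point == 2)
        child = parent;   /* fork copies the already final pte */

    return child;
}
```

For every fork_point (0, 1, 2) the child ends up with PTE_NEW, so in this
model no migration entry is left behind when process1 is walked first.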

Comments welcome.
Andrea
From: KAMEZAWA Hiroyuki on
On Wed, 28 Apr 2010 00:22:45 +0200
Andrea Arcangeli <aarcange(a)redhat.com> wrote:

> Ok I had a first look:
>
> On Tue, Apr 27, 2010 at 10:30:50PM +0100, Mel Gorman wrote:
> > CPU A                           CPU B
> >                                 do_fork()
> >                                 copy_mm() (from process1 to process2)
> >                                 insert new vma to mmap_list (if inode/anon_vma)
>
> Insert to the tail of the anon_vma list...
>
> > pte_lock(process1)
> > unmap a page
> > insert migration_entry
> > pte_unlock(process1)
> >
> > migrate page copy
> >                                 copy_page_range
> > remap new page by rmap_walk()
>
> rmap_walk will walk process1 first! It's at the head, the vmas with
> unmapped ptes are at the tail so process1 is walked before process2.
>
> > pte_lock(process2)
> > found no pte.
> > pte_unlock(process2)
> >                                 pte_lock(process2)
> >                                 pte_lock(process1)
> >                                 copy migration entry to process2
> >                                 pte_unlock(process1)
> >                                 pte_unlock(process2)
> > pte_lock(process1)
> > replace migration entry
> > to new page's pte.
> > pte_unlock(process1)
>
> rmap_walk has to lock down process1 before process2; this is the
> ordering issue I already mentioned in an earlier email. So it cannot
> happen and this patch is unnecessary.
>
> The ordering is fundamental and, as I said, anon_vma_link already adds new
> vmas to the _tail_ of the anon_vma list. And this is why it has to add to
> the tail. If anon_vma_link added new vmas to the head of the list,
> the above bug could materialize, but it doesn't, so it cannot happen.
>
> In mainline anon_vma_link is called anon_vma_chain_link, see the
> list_add_tail there to provide this guarantee.
>
> Because process1 is walked first by CPU A, the migration entry is
> replaced by the final pte before copy-migration-entry
> runs. Alternatively, if copy-migration-entry runs before
> process1 is walked, the migration entry will be copied and found in
> process2.
>

I already explained this doesn't happen and said "I'm sorry".

But considering maintenance, it's not necessary to copy migration ptes,
and we don't have to keep the fundamental risk of a migration circus.

So, I am not saying "we don't need this patch."

Thanks,
-Kame

From: Andrea Arcangeli on
On Wed, Apr 28, 2010 at 02:18:21AM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 28, 2010 at 08:52:03AM +0900, KAMEZAWA Hiroyuki wrote:
> > I already explained this doesn't happen and said "I'm sorry".
>
> Oops, I must have overlooked it, sorry! I just saw the trace quoted in
> the comment of the patch and that at least would need correction
> before it can be pushed to mainline, or it creates huge confusion to
> see a reverse trace for CPU A for an already tricky piece of code.
>
> > But considering maintenance, it's not necessary to copy migration ptes,
> > and we don't have to keep the fundamental risk of a migration circus.
> >
> > So, I am not saying "we don't need this patch."
>
> split_huge_page also has the same requirement and there is no bug to
> fix, so I don't see why we should make special changes just for migrate.c
> when we still have to list_add_tail for split_huge_page.
>
> Furthermore this patch isn't fixing anything in any case and it looks
> like a noop to me. If the order ever gets inverted, and process2 ptes are
> scanned before process1 ptes in the rmap_walk, sure the
> copy-page-tables will break and stop until the process1 rmap_walk
> completes, but that is not enough! You have to repeat the rmap_walk of
> process1 if the order ever gets inverted and this isn't happening in
  ^^^^^^^2
> the patch so I don't see how it could make any difference even just
> for migrate.c (obviously not for split_huge_page).
From: Andrea Arcangeli on
On Wed, Apr 28, 2010 at 08:52:03AM +0900, KAMEZAWA Hiroyuki wrote:
> I already explained this doesn't happen and said "I'm sorry".

Oops, I must have overlooked it, sorry! I just saw the trace quoted in
the comment of the patch and that at least would need correction
before it can be pushed to mainline, or it creates huge confusion to
see a reverse trace for CPU A for an already tricky piece of code.

> But considering maintenance, it's not necessary to copy migration ptes,
> and we don't have to keep the fundamental risk of a migration circus.
>
> So, I am not saying "we don't need this patch."

split_huge_page also has the same requirement and there is no bug to
fix, so I don't see why we should make special changes just for migrate.c
when we still have to list_add_tail for split_huge_page.

Furthermore this patch isn't fixing anything in any case and it looks
like a noop to me. If the order ever gets inverted, and process2 ptes are
scanned before process1 ptes in the rmap_walk, sure the
copy-page-tables will break and stop until the process1 rmap_walk
completes, but that is not enough! You have to repeat the rmap_walk of
process1 if the order ever gets inverted and this isn't happening in
the patch so I don't see how it could make any difference even just
for migrate.c (obviously not for split_huge_page).
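
The inverted-order case can be sketched with a toy model. It is an
illustration under stated assumptions, not the patch or the kernel code:
one pte per process, fork's copy as a single assignment, hypothetical
names throughout (toy_pte, remap_one, child_pte_inverted), and no
modeling of the patch's wait in copy-page-tables. It reads the walk that
must be repeated as the one over process2, which is what the ^^^^^^^2
follow-up appears to indicate. Process2 is walked before process1, with
fork copying the migration entry in between.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pte states for the toy model. */
enum toy_pte { PTE_NONE, PTE_MIGRATION, PTE_NEW };

/* Replace a migration entry with the final pte; leave anything else alone. */
static void remap_one(enum toy_pte *pte)
{
    if (*pte == PTE_MIGRATION)
        *pte = PTE_NEW;
}

/*
 * Inverted walk: the child (process2) is visited before the parent
 * (process1), and fork copies the parent's pte between the two visits.
 * When `rewalk_child` is true the child is visited a second time.
 * Returns the child's final pte.
 */
enum toy_pte child_pte_inverted(bool rewalk_child)
{
    enum toy_pte parent = PTE_MIGRATION; /* page already unmapped for migration */
    enum toy_pte child = PTE_NONE;       /* vma linked, ptes not copied yet */

    remap_one(&child);      /* walk visits process2: finds no pte */
    child = parent;         /* fork copies the migration entry */
    remap_one(&parent);     /* walk visits process1 */
    if (rewalk_child)
        remap_one(&child);  /* repeated visit to process2 */

    return child;
}
```

Without the second visit the child keeps PTE_MIGRATION in this model;
with it the entry becomes PTE_NEW.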