From: Mel Gorman on
On Thu, Apr 22, 2010 at 07:51:53PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 22 Apr 2010 19:31:06 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com> wrote:
>
> > On Thu, 22 Apr 2010 19:13:12 +0900
> > Minchan Kim <minchan.kim(a)gmail.com> wrote:
> >
> > > On Thu, Apr 22, 2010 at 6:46 PM, KAMEZAWA Hiroyuki
> > > <kamezawa.hiroyu(a)jp.fujitsu.com> wrote:
> >
> > > > Hmm..in my test, the case was.
> > > >
> > > > Before try_to_unmap:
> > > >        mapcount=1, SwapCache, remap_swapcache=1
> > > > After remap
> > > >        mapcount=0, SwapCache, rc=0.
> > > >
> > > > So, I think there may be some race in rmap_walk() and vma handling or
> > > > anon_vma handling. migration_entry isn't found by rmap_walk.
> > > >
> > > > Hmm..it seems this kind patch will be required for debug.
> > >
>
> Ok, here is my patch for a _fix_. But still testing...
> Running well for at least 30 minutes, where I could previously see the
> bug within 10 minutes.
> But this patch is too naive; please think of a better fix.
>
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com>
>
> At vma_adjust(), the vma's start address and pgoff are updated under
> the write lock of mmap_sem. This means the update of the vma's rmap
> information is atomic only with respect to readers of mmap_sem.
>
> Even if it's not atomic, in the usual case try_to_unmap() etc. just
> fails to decrease the mapcount to 0 - no problem.
>
> But page migration's rmap_walk() needs to find all migration entries
> in the page tables and recover the mapcount.
>
> So, this race on the vma's address is critical. When rmap_walk() hits
> the race, it will mistakenly get -EFAULT and not call rmap_one().
> This patch adds a lock for the vma's rmap information.
> But, this is _very slow_.

Ok wow. That is exceptionally well-spotted. This looks like a proper bug
that compaction exposes as opposed to a bug that compaction introduces.

> We need something sophisticated, a light-weight update for this..
>

In the event the VMA is backed by a file, the mapping's i_mmap_lock is taken
for the duration of the update and is also taken wherever the VMA information
is read, such as in rmap_walk_file().
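
For reference, the shape of that pattern is roughly the following (paraphrased
from mm/mmap.c and mm/rmap.c rather than quoted verbatim, so treat it as a
sketch):

	/* Writer side: vma_adjust() updates the fields under i_mmap_lock */
	if (mapping)
		spin_lock(&mapping->i_mmap_lock);
	...
	vma->vm_start = start;
	vma->vm_end = end;
	vma->vm_pgoff = pgoff;
	...
	if (mapping)
		spin_unlock(&mapping->i_mmap_lock);

	/* Reader side: rmap_walk_file() reads the same fields under the same lock */
	spin_lock(&mapping->i_mmap_lock);
	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
		unsigned long address = vma_address(page, vma);
		if (address == -EFAULT)
			continue;
		ret = rmap_one(page, vma, address, arg);
		if (ret != SWAP_AGAIN)
			break;
	}
	spin_unlock(&mapping->i_mmap_lock);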

In the event the VMA is anon, vma_adjust currently takes no locks and your
patch introduces a new one, but why not use the anon_vma lock here? Am I
missing something that requires the new lock?

For example;

==== CUT HERE ====
mm: Take the anon_vma lock in vma_adjust()

vma_adjust() is updating anon VMA information without any locks taken.
In contrast, file-backed mappings use the i_mmap_lock. This lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a
BUG_ON to trigger when the page is faulted in.

This patch takes the anon_vma->lock during vma_adjust to avoid such
races.

Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
---
mm/mmap.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61d6f1d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,9 @@ again: remove_next = 1 + (end > next->vm_end);
}
}

+ if (vma->anon_vma)
+ spin_lock(&vma->anon_vma->lock);
+
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -620,6 +623,9 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);

+ if (vma->anon_vma)
+ spin_unlock(&vma->anon_vma->lock);
+
if (remove_next) {
if (file) {
fput(file);
--
From: Mel Gorman on
On Thu, Apr 22, 2010 at 11:40:06PM +0900, Minchan Kim wrote:
> On Thu, 2010-04-22 at 23:23 +0900, Minchan Kim wrote:
> > On Thu, 2010-04-22 at 19:51 +0900, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 22 Apr 2010 19:31:06 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com> wrote:
> > >
> > > > On Thu, 22 Apr 2010 19:13:12 +0900
> > > > Minchan Kim <minchan.kim(a)gmail.com> wrote:
> > > >
> > > > > On Thu, Apr 22, 2010 at 6:46 PM, KAMEZAWA Hiroyuki
> > > > > <kamezawa.hiroyu(a)jp.fujitsu.com> wrote:
> > > >
> > > > > > Hmm..in my test, the case was.
> > > > > >
> > > > > > Before try_to_unmap:
> > > > > > mapcount=1, SwapCache, remap_swapcache=1
> > > > > > After remap
> > > > > > mapcount=0, SwapCache, rc=0.
> > > > > >
> > > > > > So, I think there may be some race in rmap_walk() and vma handling or
> > > > > > anon_vma handling. migration_entry isn't found by rmap_walk.
> > > > > >
> > > > > > Hmm..it seems this kind patch will be required for debug.
> > > > >
> > >
> > > Ok, here is my patch for a _fix_. But still testing...
> > > Running well for at least 30 minutes, where I could previously see
> > > the bug within 10 minutes.
> > > But this patch is too naive; please think of a better fix.
> > >
> > > ==
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com>
> > >
> > > At vma_adjust(), the vma's start address and pgoff are updated under
> > > the write lock of mmap_sem. This means the update of the vma's rmap
> > > information is atomic only with respect to readers of mmap_sem.
> > >
> > > Even if it's not atomic, in the usual case try_to_unmap() etc. just
> > > fails to decrease the mapcount to 0 - no problem.
> > >
> > > But page migration's rmap_walk() needs to find all migration entries
> > > in the page tables and recover the mapcount.
> > >
> > > So, this race on the vma's address is critical. When rmap_walk() hits
> > > the race, it will mistakenly get -EFAULT and not call rmap_one().
> > > This patch adds a lock for the vma's rmap information.
> > > But, this is _very slow_.
> > > We need something sophisticated, a light-weight update for this..
> > >
> > >
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com>
> > > ---
> > > include/linux/mm_types.h | 1 +
> > > kernel/fork.c | 1 +
> > > mm/mmap.c | 11 ++++++++++-
> > > mm/rmap.c | 3 +++
> > > 4 files changed, 15 insertions(+), 1 deletion(-)
> > >
> > > Index: linux-2.6.34-rc4-mm1/include/linux/mm_types.h
> > > ===================================================================
> > > --- linux-2.6.34-rc4-mm1.orig/include/linux/mm_types.h
> > > +++ linux-2.6.34-rc4-mm1/include/linux/mm_types.h
> > > @@ -183,6 +183,7 @@ struct vm_area_struct {
> > > #ifdef CONFIG_NUMA
> > > struct mempolicy *vm_policy; /* NUMA policy for the VMA */
> > > #endif
> > > + spinlock_t adjust_lock;
> > > };
> > >
> > > struct core_thread {
> > > Index: linux-2.6.34-rc4-mm1/mm/mmap.c
> > > ===================================================================
> > > --- linux-2.6.34-rc4-mm1.orig/mm/mmap.c
> > > +++ linux-2.6.34-rc4-mm1/mm/mmap.c
> > > @@ -584,13 +584,20 @@ again: remove_next = 1 + (end > next->
> > > if (adjust_next)
> > > vma_prio_tree_remove(next, root);
> > > }
> > > -
> > > + /*
> > > + * changing all params in atomic. If not, vma_address in rmap.c
> > > + * can see wrong result.
> > > + */
> > > + spin_lock(&vma->adjust_lock);
> > > vma->vm_start = start;
> > > vma->vm_end = end;
> > > vma->vm_pgoff = pgoff;
> > > + spin_unlock(&vma->adjust_lock);
> > > if (adjust_next) {
> > > + spin_lock(&next->adjust_lock);
> > > next->vm_start += adjust_next << PAGE_SHIFT;
> > > next->vm_pgoff += adjust_next;
> > > + spin_unlock(&next->adjust_lock);
> > > }
> > >
> > > if (root) {
> > > @@ -1939,6 +1946,7 @@ static int __split_vma(struct mm_struct
> > > *new = *vma;
> > >
> > > INIT_LIST_HEAD(&new->anon_vma_chain);
> > > + spin_lock_init(&new->adjust_lock);
> > >
> > > if (new_below)
> > > new->vm_end = addr;
> > > @@ -2338,6 +2346,7 @@ struct vm_area_struct *copy_vma(struct v
> > > if (IS_ERR(pol))
> > > goto out_free_vma;
> > > INIT_LIST_HEAD(&new_vma->anon_vma_chain);
> > > + spin_lock_init(&new_vma->adjust_lock);
> > > if (anon_vma_clone(new_vma, vma))
> > > goto out_free_mempol;
> > > vma_set_policy(new_vma, pol);
> > > Index: linux-2.6.34-rc4-mm1/kernel/fork.c
> > > ===================================================================
> > > --- linux-2.6.34-rc4-mm1.orig/kernel/fork.c
> > > +++ linux-2.6.34-rc4-mm1/kernel/fork.c
> > > @@ -350,6 +350,7 @@ static int dup_mmap(struct mm_struct *mm
> > > goto fail_nomem;
> > > *tmp = *mpnt;
> > > INIT_LIST_HEAD(&tmp->anon_vma_chain);
> > > + spin_lock_init(&tmp->adjust_lock);
> > > pol = mpol_dup(vma_policy(mpnt));
> > > retval = PTR_ERR(pol);
> > > if (IS_ERR(pol))
> > > Index: linux-2.6.34-rc4-mm1/mm/rmap.c
> > > ===================================================================
> > > --- linux-2.6.34-rc4-mm1.orig/mm/rmap.c
> > > +++ linux-2.6.34-rc4-mm1/mm/rmap.c
> > > @@ -332,11 +332,14 @@ vma_address(struct page *page, struct vm
> > > pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> > > unsigned long address;
> > >
> > > + spin_lock(&vma->adjust_lock);
> > > address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > > if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > > + spin_unlock(&vma->adjust_lock);
> > > /* page should be within @vma mapping range */
> > > return -EFAULT;
> > > }
> > > + spin_unlock(&vma->adjust_lock);
> > > return address;
> > > }
> > >
> >
> > Nice Catch, Kame. :)
> >
> > For further optimization, we could take vma->adjust_lock only if vma_address
> > returns -EFAULT. But I hope we can redesign this without new locking.
> > I don't have a good idea right now, though. :(
>
> How about this?
> I just merged the ideas of Mel and Kame. :)
>
> It just shows the concept, not a formal patch.
>
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f90ea92..61ea742 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -578,6 +578,8 @@ again: remove_next = 1 + (end > next->vm_end);
> }
> }
>
> + if (vma->anon_vma)
> + spin_lock(&vma->anon_vma->lock);
> if (root) {
> flush_dcache_mmap_lock(mapping);
> vma_prio_tree_remove(vma, root);
> @@ -619,7 +621,8 @@ again: remove_next = 1 + (end > next->vm_end);
>
> if (mapping)
> spin_unlock(&mapping->i_mmap_lock);
> -
> + if (vma->anon_vma)
> + spin_unlock(&vma->anon_vma->lock);
> if (remove_next) {
> if (file) {
> fput(file);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 3a53d9f..8075057 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1359,9 +1359,22 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
> spin_lock(&anon_vma->lock);
> list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> struct vm_area_struct *vma = avc->vma;
> - unsigned long address = vma_address(page, vma);
> - if (address == -EFAULT)
> + struct anon_vma *tmp_anon_vma = vma->anon_vma;
> + unsigned long address;
> + int tmp_vma_lock = 0;
> +
> + if (tmp_anon_vma != anon_vma) {
> + spin_lock(&tmp_anon_vma->lock);
> + tmp_vma_lock = 1;
> + }

heh, I thought of a similar approach at the same time as you but missed
this mail until later. However, with this approach I suspect there is a
possibility that two walkers of the same anon_vma list could livelock if
two locks on the list are held at the same time. Am still thinking of
how it could be resolved without introducing new locking.
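
To make that concern concrete, the interleaving I have in mind with the nested
locking above is something like the following (my reading of the problem, not
something taken from a trace):

	/*
	 * anon_vma A and anon_vma B are both linked into the same anon_vma
	 * chain. Two rmap walkers start from different pages:
	 *
	 *   walker 1 (page in A)              walker 2 (page in B)
	 *   --------------------              --------------------
	 *   spin_lock(&A->lock);              spin_lock(&B->lock);
	 *   walk chain, reach a vma
	 *   whose anon_vma is B:
	 *   spin_lock(&B->lock); <- spins      walk chain, reach a vma
	 *                                      whose anon_vma is A:
	 *                                      spin_lock(&A->lock); <- spins
	 *
	 * Each walker holds one lock on the chain and spins on the other,
	 * so neither makes progress.
	 */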

> + address = vma_address(page, vma);
> + if (address == -EFAULT) {
> + if (tmp_vma_lock)
> + spin_unlock(&tmp_anon_vma->lock);
> continue;
> + }
> + if (tmp_vma_lock)
> + spin_unlock(&tmp_anon_vma->lock);
> ret = rmap_one(page, vma, address, arg);
> if (ret != SWAP_AGAIN)
> break;
> --
> 1.7.0.5
>
>
>
> --
> Kind regards,
> Minchan Kim
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
From: Mel Gorman on
On Fri, Apr 23, 2010 at 08:31:35PM +0200, Andrea Arcangeli wrote:
> Hi Mel,
>
> On Thu, Apr 22, 2010 at 04:44:43PM +0100, Mel Gorman wrote:
> > heh, I thought of a similar approach at the same time as you but missed
> > this mail until later. However, with this approach I suspect there is a
> > possibility that two walkers of the same anon_vma list could livelock if
> > two locks on the list are held at the same time. Am still thinking of
> > how it could be resolved without introducing new locking.
>
> Trying to understand this issue and I've some questions. This
> vma_adjust and lock inversion troubles with the anon-vma lock in
> rmap_walk are a new issue introduced by the recent anon-vma changes,
> right?
>

In a manner of speaking. There was no locking going on but prior to the
anon_vma changes, there would have been only one anon_vma lock and the
fix would be easier - just take the lock on anon_vma->lock while the
VMAs are being updated.

> About swapcache, try_to_unmap just nukes the mappings, establishes the
> swap entry in the pte (not a migration entry), and then there's no need
> to call remove_migration_ptes.

That would be an alternative for swapcache but it's not necessarily
where the problem is.

> So it just needs to skip it for
> swapcache. page_mapped must return zero after try_to_unmap returns
> before we're allowed to migrate (plus the page count must be just 1
> and not 2 or more for gup users!).
>
> I don't get what's the problem about swapcache and the races connected
> to it, the moment I hear migration PTE in context of swapcache
> migration I'm confused because there's no migration PTE for swapcache.
>

That was a mistake on my part. The race appears to be between vma_adjust
changing the details of the VMA while rmap_walk is going on. It mistakenly
believes the vma no longer spans the address, gets -EFAULT from vma_address
and doesn't clean up the migration PTE. This is later encountered but the
page lock is no longer held and it bugs. An alternative would be to clean
up the migration PTE of unlocked pages on the assumption it was due to this
race but it's a bit sloppier.
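
To spell out where the -EFAULT comes from: vma_address() reads vm_pgoff,
vm_start and vm_end with no lock held, so a concurrent vma_adjust() can be
observed half way through its update. Annotated copy of the helper quoted
earlier in the thread (the comments are mine):

	static unsigned long
	vma_address(struct page *page, struct vm_area_struct *vma)
	{
		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
		unsigned long address;

		/*
		 * If vma_adjust() has already moved vm_start/vm_pgoff but the
		 * page is still mapped at its old address, this arithmetic can
		 * land outside [vm_start, vm_end) for an address that is in
		 * fact mapped...
		 */
		address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
		if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
			/*
			 * ...and the walker bails out here, leaving the
			 * migration PTE at the old address untouched.
			 */
			return -EFAULT;
		}
		return address;
	}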

> The new page will have no mappings either, it just needs to be part of
> the swapcache with the same page->index = swapentry, indexed in the
> radix tree with that page->index, and page->mapping pointing to the
> swapcache. Then new page faults will bring it into the page tables. The
> lookup_swap_cache has to be serialized against some lock, it should be
> the radix tree lock? So the migration has to happen with that lock
> held, no?

Think migrate_page_move_mapping() is what you're looking for? It takes
the mapping tree lock.
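
That is, the swapcache radix tree slot is swapped under mapping->tree_lock.
Very roughly, migrate_page_move_mapping() does something like this (a
from-memory sketch, not the exact code):

	spin_lock_irq(&mapping->tree_lock);

	pslot = radix_tree_lookup_slot(&mapping->page_tree, page_index(page));
	/* bail out if the refcounts show someone else still uses the page */
	...
	get_page(newpage);
	radix_tree_replace_slot(pslot, newpage);

	spin_unlock_irq(&mapping->tree_lock);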

> We can't just migrate swapcache without stopping swapcache
> radix tree lookups, no? I didn't digest the full migrate.c yet and I
> don't see where it happens. Freeing the swapcache, while simpler and
> safer, would be quite bad as it'd create I/O for potentially hot RAM.
>
> About the refcounting of anon-vma in migrate.c I think it'd be much
> simpler if zap_page_range and folks would just wait (like they do if
> they find a pmd_trans_huge && pmd_trans_splitting pmd), there would be
> no need of refcounting the anon-vma that way.
>

I'm not getting what you're suggesting here. The refcount is to make
sure the anon_vma doesn't go away after the page mapcount reaches zero.
What are we waiting for?
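
For reference, the shape of the pattern being discussed is roughly this
(simplified; anon_vma_hold()/anon_vma_drop() are placeholder names for the
refcount helpers, not the real identifiers):

	struct anon_vma *anon_vma = NULL;

	if (PageAnon(page)) {
		/*
		 * Pin the anon_vma so it cannot be freed once try_to_unmap()
		 * drops the last mapcount; rmap_walk() still needs to walk it
		 * afterwards to find and remove the migration PTEs.
		 */
		anon_vma = page_anon_vma(page);
		anon_vma_hold(anon_vma);		/* placeholder */
	}

	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);

	/* ... copy the page and fix up the mapping ... */
	remove_migration_ptes(page, newpage);

	if (anon_vma)
		anon_vma_drop(anon_vma);		/* placeholder */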

> I assume whatever is added to rmap_walk I also have to add to
> split_huge_page later when switching to mainline anon-vma code (for
> now I stick to 2.6.32 anon-vma code to avoid debugging anon-vma-chain,
> memory compaction, swapcache migration and transparent hugepage at the
> same time, which becomes a little beyond feasibility).
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
From: Mel Gorman on
On Fri, Apr 23, 2010 at 09:39:48PM +0200, Andrea Arcangeli wrote:
> On Fri, Apr 23, 2010 at 08:23:12PM +0100, Mel Gorman wrote:
> > On Fri, Apr 23, 2010 at 08:31:35PM +0200, Andrea Arcangeli wrote:
> > > Hi Mel,
> > >
> > > On Thu, Apr 22, 2010 at 04:44:43PM +0100, Mel Gorman wrote:
> > > > heh, I thought of a similar approach at the same time as you but missed
> > > > this mail until later. However, with this approach I suspect there is a
> > > > possibility that two walkers of the same anon_vma list could livelock if
> > > > two locks on the list are held at the same time. Am still thinking of
> > > > how it could be resolved without introducing new locking.
> > >
> > > Trying to understand this issue and I've some questions. This
> > > vma_adjust and lock inversion troubles with the anon-vma lock in
> > > rmap_walk are a new issue introduced by the recent anon-vma changes,
> > > right?
> > >
> >
> > In a manner of speaking. There was no locking going on but prior to the
> > anon_vma changes, there would have been only one anon_vma lock and the
> > fix would be easier - just take the lock on anon_vma->lock while the
> > VMAs are being updated.
>
> So it was very much a bug before too, and we could fail to find some
> pte mapping the page if vm_start was adjusted?
>

I thought it was but I was looking at an rc kernel instead of 2.6.33.
This looked as if it was safe but it's not any more with the new
anon_vma scheme.

> Also note, expand_downwards also moves vm_start with only the
> anon_vma->lock as it has to serialize against other expand_downwards
> and the rmap_walk. But expand_downwards takes the lock and it was safe
> before.
>
> Also for swapping, even if things screw up it's no big deal, because it
> will just skip, but migration has to find all ptes in
> remove_migration_ptes and try_to_unmap also has to unmap everything.
>

It either has to find them all or it has to be capable of a
lazy cleanup. I had a lazy cleanup patch but it was dropped because we
felt it should have been possible to properly lock this. I'm beginning
to think it can't be, because there appear to be a few cases where the VM
doesn't care if it doesn't find all the mappings.

> In the split_huge_page case even the ordering at which newly allocated
> vmas for the child are queued is critical, they've to be put at the
> end of the list to be safe (otherwise do_wp_huge_page may not wait and
> we may fail to mark the huge_pmd in the child as pmd_splitting).
>

There might be a locking snarl there as well, then. I confess that
the details of transparent hugepage support have fallen back out of my
head within the last two weeks.

> > > About swapcache, try_to_unmap just nukes the mappings, establishes the
> > > swap entry in the pte (not a migration entry), and then there's no need
> > > to call remove_migration_ptes.
> >
> > That would be an alternative for swapcache but it's not necessarily
> > where the problem is.
>
> Hmm try_to_unmap already nukes all swap entries without creating any
> migration pte for swapcache as far as I can tell.
>

When a mapped swapcache is unmapped, a migration PTE is put in place. If
that was not the case, we wouldn't be hitting the bug in the first
place.

> > > So it just needs to skip it for
> > > swapcache. page_mapped must return zero after try_to_unmap returns
> > > before we're allowed to migrate (plus the page count must be just 1
> > > and not 2 or more for gup users!).
> > >
> > > I don't get what's the problem about swapcache and the races connected
> > > to it, the moment I hear migration PTE in context of swapcache
> > > migration I'm confused because there's no migration PTE for swapcache.
> > >
> >
> > That was a mistake on my part. The race appears to be between vma_adjust
> > changing the details of the VMA while rmap_walk is going on. It mistakenly
> > believes the vma no longer spans the address, gets -EFAULT from vma_address
> > and doesn't clean up the migration PTE. This is later encountered but the
> > page lock is no longer held and it bugs. An alternative would be to clean
> > up the migration PTE of unlocked pages on the assumption it was due to this
> > race but it's a bit sloppier.
>
> Agreed, it's sure better to close the race... the other may have even
> more implications.

True, which is why I'm not keen on lazy cleanup.

> It's good to retain the invariant that when a
> migration PTE exists the page also still exists and it's locked
> (locked really mostly matters for pagecache I guess, but it's ok).
>
> > > The new page will have no mappings either, it just needs to be part of
> > > the swapcache with the same page->index = swapentry, indexed in the
> > > radix tree with that page->index, and page->mapping pointing to the
> > > swapcache. Then new page faults will bring it into the page tables. The
> > > lookup_swap_cache has to be serialized against some lock, it should be
> > > the radix tree lock? So the migration has to happen with that lock
> > > held, no?
> >
> > Think migrate_page_move_mapping() is what you're looking for? It takes
> > the mapping tree lock.
>
> Yep exactly!
>
> > > We can't just migrate swapcache without stopping swapcache
> > > radix tree lookups, no? I didn't digest the full migrate.c yet and I
> > > don't see where it happens. Freeing the swapcache, while simpler and
> > > safer, would be quite bad as it'd create I/O for potentially hot RAM.
> > >
> > > About the refcounting of anon-vma in migrate.c I think it'd be much
> > > simpler if zap_page_range and folks would just wait (like they do if
> > > they find a pmd_trans_huge && pmd_trans_splitting pmd), there would be
> > > no need of refcounting the anon-vma that way.
> > >
> >
> > I'm not getting what you're suggesting here. The refcount is to make
> > sure the anon_vma doesn't go away after the page mapcount reaches zero.
> > What are we waiting for?
>
> Causing zap_page_range to wait for the end of migration when it encounters
> migration ptes, instead of skipping them altogether by only releasing
> the rss and doing nothing about them. If the pte can't go away, neither can
> the mm, nor the vma, nor the anon-vma. I'm not suggesting to change that,
> but I guess that's the way I would have done it if I had implemented
> it; it'd avoid the refcounting.

Maybe.

> Just like I did in
> split_huge_page: I count the number of pmds marked splitting, and I
> compare it to the number of pmds that are converted from huge to
> not-huge, and I compare that again against the page_mapcount. If the
> three numbers aren't equal I BUG. It simply can't go wrong unnoticed
> that way. I can only do that because I stop the zapping.
>

That would be another way of doing it all right. Count how many
migration ptes we created, pass that to rmap_walk. If they don't match,
assume a race and do it again before the page is unlocked.
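
As a sketch of that scheme (purely illustrative; every identifier below is
invented, this is not existing migrate.c code):

	int nr_created, nr_removed = 0;

	/* try_to_unmap()-equivalent that reports how many migration PTEs it made */
	nr_created = unmap_page_counting_migration_ptes(page);

	/* ... copy the page contents and move the mapping ... */

	do {
		/*
		 * rmap_walk()-based removal that reports how many migration
		 * PTEs it found and replaced. If a racing vma_adjust() made
		 * the walk miss one, the counts differ; the page is still
		 * locked, so simply walking again is safe.
		 */
		nr_removed += remove_migration_ptes_counting(page, newpage);
	} while (nr_removed < nr_created);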

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
From: Mel Gorman on
On Fri, Apr 23, 2010 at 11:35:49PM +0200, Andrea Arcangeli wrote:
> On Fri, Apr 23, 2010 at 09:39:48PM +0200, Andrea Arcangeli wrote:
> > On Fri, Apr 23, 2010 at 08:23:12PM +0100, Mel Gorman wrote:
> > > On Fri, Apr 23, 2010 at 08:31:35PM +0200, Andrea Arcangeli wrote:
> > > > Hi Mel,
> > > >
> > > > On Thu, Apr 22, 2010 at 04:44:43PM +0100, Mel Gorman wrote:
> > > > > heh, I thought of a similar approach at the same time as you but missed
> > > > > this mail until later. However, with this approach I suspect there is a
> > > > > possibility that two walkers of the same anon_vma list could livelock if
> > > > > two locks on the list are held at the same time. Am still thinking of
> > > > > how it could be resolved without introducing new locking.
> > > >
> > > > Trying to understand this issue and I've some questions. This
> > > > vma_adjust and lock inversion troubles with the anon-vma lock in
> > > > rmap_walk are a new issue introduced by the recent anon-vma changes,
> > > > right?
> > > >
> > >
> > > In a manner of speaking. There was no locking going on but prior to the
> > > anon_vma changes, there would have been only one anon_vma lock and the
> > > fix would be easier - just take the lock on anon_vma->lock while the
> > > VMAs are being updated.
> >
> > So it was very much a bug before too, and we could fail to find some
> > pte mapping the page if vm_start was adjusted?
>
> Well, I looked deeper into it myself as I wanted to have this bit (and
> other bits) sorted out in aa.git, and this is definitely a bug
> introduced by the newest anon-vma changes in 2.6.34-rc, so aa.git
> cannot be affected as it's using the 2.6.33 anon-vma (and prior) code.
>

I think you're right. This is a new bug introduced by the anon_vma changes. On
the plus side, it means we don't have to worry about -stable.

> vma_adjust already takes the anon_vma->lock, and of course I also
> further verified that trying to apply your snippet to vma_adjust
> results in immediate deadlock, as the very same lock is already taken
> in my tree since it's the same anon-vma (simpler).

Yes, I expected that. Previously, there was only one anon_vma so if you
double-take the lock, bad things happen.

> So aa.git will be
> immune from these bugs for now.
>

It should be. I expect that's why you have never seen the BUG_ON in
swapops.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab