From: Peter Zijlstra
On Sat, 2010-04-10 at 11:21 -0700, Linus Torvalds wrote:
>

> Ho humm.
>
> Maybe I'm crazy, but something started bothering me. And I started
> wondering: when is the 'page->mapping' of an anonymous page actually
> cleared?
>
> The thing is, the mapping of an anonymous page is actually cleared only
> when the page is _freed_, in "free_hot_cold_page()".
>
> Now, let's think about that. And in particular, let's think about how that
> relates to the freeing of the 'anon_vma' that the page->mapping points to.
>
> The way the anon_vma is freed is when the mapping is torn down, and we do
> roughly:
>
> 	tlb = tlb_gather_mmu(mm,..)
> 	..
> 	unmap_vmas(&tlb, vma ..
> 	..
> 	free_pgtables()
> 	..
> 	tlb_finish_mmu(tlb, start, end);
>
> and we actually unmap all the pages in "unmap_vmas()", and then _after_
> unmapping all the pages we do the "unlink_anon_vmas(vma);" in
> "free_pgtables()". Fine so far - the anon_vma stay around until after the
> page has been happily unmapped.
>
> But "unmapped all the pages" is _not_ actually the same as "free'd all the
> pages". The actual _freeing_ of the page happens generally in
> tlb_finish_mmu(), because we can free the page only after we've flushed
> any TLB entries.
>
> So what we have in that tlb_gather structure is a list of _pending_ pages
> to be freed, while we already actually free'd the anon_vmas earlier!
>
> Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because
> we use a per-cpu variable), but as far as I can tell it is _not_ an
> RCU-safe region.
>
> So I think we might actually get a real RCU freeing event while this all
> happens. So now the 'anon_vma' that 'page->mapping' points to has not just
> been released back to the SLUB caches; the slab page backing it might have
> been released too.
>
> I dunno. Does the above sound at all sane? Or am I just raving?
>
> Something hacky like the above might fix it if I'm not just raving. I
> really might be missing something here.

Right, so unless you have CONFIG_TREE_PREEMPT_RCU=y, the assumption
that preempt-disable equals an RCU read-side critical section does
hold.
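To make the distinction concrete, here is a small illustrative sketch
(not code from any patch in this thread):

	/*
	 * With CONFIG_TREE_PREEMPT_RCU=y, rcu_read_lock() bumps a
	 * per-task nesting count instead of disabling preemption, so a
	 * grace period only waits for tasks inside rcu_read_lock()
	 * sections, not for regions that merely disabled preemption.
	 */
	preempt_disable();	/* NOT an RCU read-side section then */
	anon_mapping = (unsigned long)page->mapping; /* may be RCU-freed under us */
	preempt_enable();

	rcu_read_lock();	/* this does hold off the grace period */
	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
	rcu_read_unlock();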

But even with your patch it doesn't close all holes, because while
zap_pte_range() can remove the last mapcount of the page,
tlb_remove_page() et al. need not drop the last use count of the
page.

Concurrent reclaim/gup/whatever could still have a count out on the page
delaying the actual free beyond the tlb gather RCU section.
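Roughly (a sketch of the idea, not the exact mm/memory.c code), the
final free of each gathered page is tied to its refcount, not its
mapcount:

	/*
	 * tlb_finish_mmu() releases the gathered pages, but each page
	 * goes back to the allocator only when *its* last reference
	 * drops; page->mapping is reset only at that point.
	 */
	for (i = 0; i < nr_pages; i++) {
		struct page *page = pages[i];

		if (put_page_testzero(page))		/* last reference? */
			free_hot_cold_page(page, 0);	/* mapping reset here */
		/* else: reclaim/gup still holds the page, and its stale
		 * page->mapping, past the end of the gather section. */
	}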

So the reason page->mapping isn't cleared in page_remove_rmap() isn't
documented beyond a (possible) race with page_add_anon_rmap() (which I
guess would be reclaim trying to unmap the page and a fault re-instating
it).

This also complicates the whole page_lock_anon_vma() thing, so it would
be nice to be able to remove this race and clear page->mapping in
page_remove_rmap().
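That race, as the existing comment in page_remove_rmap() describes it
(it is quoted in full in the patch below), looks roughly like this
illustrative sketch:

	/*
	 * CPU0 (reclaim, unmapping)      CPU1 (fault, re-instating)
	 *
	 * page_remove_rmap()
	 *   atomic_add_negative(-1, ...)
	 *                                 page_add_anon_rmap()
	 *                                   atomic_inc_and_test(...)
	 *                                   page->mapping = anon_vma;
	 *   page->mapping = NULL;
	 *
	 * CPU1 increments the mapcount after CPU0, but sets ->mapping
	 * before CPU0's (hypothetical) clear: the page ends up mapped
	 * with a NULL ->mapping.
	 */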
From: Peter Zijlstra
On Mon, 2010-04-12 at 11:19 -0400, Rik van Riel wrote:
> On 04/12/2010 10:40 AM, Peter Zijlstra wrote:
>
> > So the reason page->mapping isn't cleared in page_remove_rmap() isn't
> > detailed beyond a (possible) race with page_add_anon_rmap() (which I
> > guess would be reclaim trying to unmap the page and a fault re-instating
> > it).
> >
> > This also complicates the whole page_lock_anon_vma() thing, so it would
> > be nice to be able to remove this race and clear page->mapping in
> > page_remove_rmap().
>
> For anonymous pages, I don't see where the race comes from.
>
> Both do_swap_page and the reclaim code hold the page lock
> across the entire operation, so they are already excluding
> each other.
>
> Hugh, do you remember what the race between page_remove_rmap
> and page_add_anon_rmap is/was all about?
>
> I don't see a race in the current code...


Something like the below would be nice if possible.


---
mm/rmap.c | 44 +++++++++++++++++++++++++++++++-------------
1 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..241f75d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -286,7 +286,22 @@ void __init anon_vma_init(void)

 /*
  * Getting a lock on a stable anon_vma from a page off the LRU is
- * tricky: page_lock_anon_vma rely on RCU to guard against the races.
+ * tricky:
+ *
+ * page_add_anon_rmap()
+ *   atomic_inc_and_test(&page->_mapcount);
+ *   page->mapping = anon_vma;
+ *
+ * page_remove_rmap()
+ *   atomic_add_negative(-1, &page->_mapcount);
+ *   page->mapping = NULL;
+ *
+ * So we have to first read page->mapping, and then verify
+ * page->_mapcount, and make sure we order these reads correctly.
+ *
+ * We take anon_vma->lock in between so that if we see the anon_vma
+ * with a non-zero mapcount we know it won't go away on us while
+ * we hold the lock.
  */
 struct anon_vma *page_lock_anon_vma(struct page *page)
 {
@@ -294,14 +309,24 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 	unsigned long anon_mapping;
 
 	rcu_read_lock();
-	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
+	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
 		goto out;
-	if (!page_mapped(page))
-		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
 	spin_lock(&anon_vma->lock);
+
+	/*
+	 * Order the reading of page->mapping and page->_mapcount against the
+	 * mb() implied by the atomic_add_negative() in page_remove_rmap().
+	 */
+	smp_rmb();
+	if (!page_mapped(page)) {
+		spin_unlock(&anon_vma->lock);
+		anon_vma = NULL;
+		goto out;
+	}
+
 	return anon_vma;
 out:
 	rcu_read_unlock();
@@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_update_file_mapped(page, -1);
 	}
-	/*
-	 * It would be tidy to reset the PageAnon mapping here,
-	 * but that might overwrite a racing page_add_anon_rmap
-	 * which increments mapcount after us but sets mapping
-	 * before us: so leave the reset to free_hot_cold_page,
-	 * and remember that it's only reliable while mapped.
-	 * Leaving it set also helps swapoff to reinstate ptes
-	 * faster for those pages still in swapcache.
-	 */
+
+	page->mapping = NULL;
 }
 
 /*
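The ordering the patch relies on, spelled out (an illustrative litmus
sketch, assuming atomic_add_negative() implies a full memory barrier,
as documented in Documentation/atomic_ops.txt):

	/*
	 * page_remove_rmap()              page_lock_anon_vma()
	 *
	 * atomic_add_negative(-1, &mc);   anon_vma = page->mapping;
	 *   [implied full mb]             spin_lock(&anon_vma->lock);
	 * page->mapping = NULL;           smp_rmb();
	 *                                 if (!page_mapped(page)) bail;
	 *
	 * If the reader still sees the page mapped after the smp_rmb(),
	 * the writer's decrement had not yet happened when we read the
	 * mapcount, so its clearing of page->mapping (ordered after the
	 * decrement by the mb) cannot have preceded our earlier read of
	 * page->mapping.
	 */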


From: Peter Zijlstra
On Mon, 2010-04-12 at 09:46 -0700, Linus Torvalds wrote:
>
> On Mon, 12 Apr 2010, Rik van Riel wrote:
>
> > On 04/12/2010 12:01 PM, Peter Zijlstra wrote:
> >
> > > @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
> > >  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> > >  		mem_cgroup_update_file_mapped(page, -1);
> > >  	}
> > > -	/*
> > > -	 * It would be tidy to reset the PageAnon mapping here,
> > > -	 * but that might overwrite a racing page_add_anon_rmap
> > > -	 * which increments mapcount after us but sets mapping
> > > -	 * before us: so leave the reset to free_hot_cold_page,
> > > -	 * and remember that it's only reliable while mapped.
> > > -	 * Leaving it set also helps swapoff to reinstate ptes
> > > -	 * faster for those pages still in swapcache.
> > > -	 */
> > > +
> > > +	page->mapping = NULL;
> > >  }
> >
> > That would be a bug for file pages :)
> >
> > I could see how it could work for anonymous memory, though.
>
> I think it's scary for anonymous pages too. The _common_ case of
> page_remove_rmap() is from unmap/exit, which holds no locks on the page
> what-so-ever. So assuming the page could be reachable some other way (swap
> cache etc), I think the above is pretty scary.

Fully agreed.

> Also do note that the bug we've been chasing has _always_ had that test
> for "page_mapped(page)". See my other email about why the unmapped case
> isn't even interesting, because it's so easy to see how page->mapping can
> be stale for unmapped pages.
>
> It's the _mapped_ case that is interesting, not the unmapped one. So
> setting page->mapping to NULL when unmapping is perhaps a nice consistency
> issue ("never have stale pointers"), but it's missing the fact that it's
> not really the case we care about.

Yes, I don't think this is the problem that has been plaguing us for
over a week now.

But while staring at that code it did get me worried that the current
code (page_lock_anon_vma):

 - is missing the smp_read_barrier_depends() after the ACCESS_ONCE;
 - isn't properly ordered wrt page->mapping and page->_mapcount;
 - doesn't appear to guarantee much at all when returning an anon_vma,
   since it takes the lock only after checking page->_mapcount, so:
    * it can return !NULL for an unmapped page (your patch cures that);
    * it can return !NULL but for a different anon_vma
      (my earlier patch checking page_rmapping() after the spin_lock
      cures that, but doesn't cure the above):

[ highly unlikely but not impossible race ]

page_referenced(page_A)
                          try_to_unmap(page_A)
                                                unrelated fault
                                                                  fault page_A

CPU0                      CPU1                  CPU2              CPU3

rcu_read_lock()
anon_vma = page->mapping;
if (!(anon_vma & ANON_BIT))
	goto out
if (!page_mapped(page))
	goto out
                          page_remove_rmap()
                          ...
                          anon_vma_free()-----\
                                               v
                                              anon_vma_alloc()

                                                                  anon_vma_alloc()
                                                                  page_add_anon_rmap()
                                               ^
spin_lock(anon_vma->lock)----------------------/


Now I don't think the above can happen due to how our slab allocators
work (they won't share a slab page between CPUs like that), but once we
make the whole thing preemptible this race becomes a lot more likely.
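For reference, those reuse semantics come from the anon_vma slab being
created with SLAB_DESTROY_BY_RCU; roughly (see anon_vma_init() in
mm/rmap.c):

	/*
	 * Under rcu_read_lock() the memory is guaranteed to remain a
	 * struct anon_vma (the backing slab page is not returned to the
	 * page allocator until a grace period elapses), but the object
	 * itself may be freed and re-allocated for a *different* mapping
	 * at any time: "safe to dereference" is not "the right object".
	 */
	anon_vma_cachep = kmem_cache_create("anon_vma",
			sizeof(struct anon_vma), 0,
			SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_ctor);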


So a page_lock_anon_vma() that looks a little like the below should
(I think) cure all our problems with it.


struct anon_vma *page_lock_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma;
	unsigned long anon_mapping;

	rcu_read_lock();
again:
	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		goto out;
	anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);

	/*
	 * The RCU read lock ensures we can safely dereference anon_vma
	 * since it ensures the backing slab won't go away. It will,
	 * however, not guarantee that it's the right object.
	 *
	 * First take the anon_vma->lock; per anon_vma_unlink() this
	 * prevents the anon_vma from being freed if it is a valid object.
	 */
	spin_lock(&anon_vma->lock);

	/*
	 * Secondly, re-read page->mapping to ensure it has not changed;
	 * rely on spin_lock() being at least a compiler barrier to force
	 * the re-read.
	 */
	if (unlikely(page_rmapping(page) != anon_vma)) {
		spin_unlock(&anon_vma->lock);
		goto again;
	}

	/*
	 * Ensure we read page->mapping before page->_mapcount,
	 * orders against the atomic_add_negative() in page_remove_rmap().
	 */
	smp_rmb();

	/*
	 * Finally check that the page is still mapped;
	 * if not, this can't possibly be the right anon_vma.
	 */
	if (!page_mapped(page))
		goto unlock;

	return anon_vma;

unlock:
	spin_unlock(&anon_vma->lock);
out:
	rcu_read_unlock();
	return NULL;
}


With this, I think we can actually drop the RCU read lock when returning
since if this is indeed a valid anon_vma for this page, then the page is
still mapped, and hence the anon_vma was not deleted, and a possible
future delete will be held back by us holding the anon_vma->lock.
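As the code stands, the matching unlock must then drop both the
anon_vma->lock and the RCU read-side section, mirroring the existing
page_unlock_anon_vma() in mm/rmap.c:

	void page_unlock_anon_vma(struct anon_vma *anon_vma)
	{
		spin_unlock(&anon_vma->lock);
		rcu_read_unlock();
	}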

Now I could be totally wrong and have confused myself thoroughly, but
how does this look?
From: Borislav Petkov
From: Rik van Riel <riel@redhat.com>
Date: Mon, Apr 12, 2010 at 02:40:22PM -0400

> On 04/12/2010 12:26 PM, Linus Torvalds wrote:
>
> >But there is a _much_ more subtle case that involved swapping.
> >
> >So guys, here's my fairly simple theory on what happens:
>
> That bug looks entirely possible. Given that Borislav
> has heavy swapping going on, it is quite possible that
> this is the bug he has been triggering.

Yeah, about that. I dunno whether you guys saw that, but the machine has
8GB of RAM and shouldn't be swapping, AFAIK. The largest mem usage I
saw was 5GB used, most of which was pagecache. So I was kinda doubtful
when Linus came up with the swapping theory earlier. I'll pay closer
attention to SwapCached in /proc/meminfo to see whether we do any
swapping. It could be that there is a small amount which is swapped out
for whatever reason... Maybe that's the bug...

But I'll give the patch a run in an hour or so anyway.

--
Regards/Gruss,
Boris.
From: Peter Zijlstra
On Mon, 2010-04-12 at 20:40 +0200, Peter Zijlstra wrote:

Hmm, if interleaved like so

> struct anon_vma *page_lock_anon_vma(struct page *page)
> {
> 	struct anon_vma *anon_vma;
> 	unsigned long anon_mapping;

	page_remove_rmap()
	  anon_vma_unlink()
	    anon_vma_free()

So that the below will all observe the old page->mapping:

> 	rcu_read_lock();
> again:
> 	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
> 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> 		goto out;
> 	anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
>
> 	/*
> 	 * The RCU read lock ensures we can safely dereference anon_vma
> 	 * since it ensures the backing slab won't go away. It will,
> 	 * however, not guarantee that it's the right object.
> 	 *
> 	 * First take the anon_vma->lock; per anon_vma_unlink() this
> 	 * prevents the anon_vma from being freed if it is a valid object.
> 	 */
> 	spin_lock(&anon_vma->lock);
>
> 	/*
> 	 * Secondly, re-read page->mapping to ensure it has not changed;
> 	 * rely on spin_lock() being at least a compiler barrier to force
> 	 * the re-read.
> 	 */
> 	if (unlikely(page_rmapping(page) != anon_vma)) {
> 		spin_unlock(&anon_vma->lock);
> 		goto again;
> 	}

	page_add_anon_rmap()

so that the page_mapped() test below would be positive,

> 	/*
> 	 * Ensure we read page->mapping before page->_mapcount,
> 	 * orders against the atomic_add_negative() in page_remove_rmap().
> 	 */
> 	smp_rmb();
>
> 	/*
> 	 * Finally check that the page is still mapped;
> 	 * if not, this can't possibly be the right anon_vma.
> 	 */
> 	if (!page_mapped(page))
> 		goto unlock;

Here we could return an invalid, already freed anon_vma (a hypothetical
extra check is sketched after the quoted code below).

> 	return anon_vma;
>
> unlock:
> 	spin_unlock(&anon_vma->lock);
> out:
> 	rcu_read_unlock();
> 	return NULL;
> }
>
>
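A hypothetical way to narrow (though, by the same interleaving
argument, not to fully close) that window would be to repeat the
identity check after the smp_rmb() and page_mapped() test; this is
purely an illustrative sketch, not something proposed in the thread:

	/*
	 * Hypothetical extra validation: after the page_mapped() test,
	 * re-check that page->mapping still points at the anon_vma we
	 * locked; if it changed, the object we hold may have been
	 * recycled for another mapping, so retry.  An interleave can
	 * still slip in between the individual checks, which is what
	 * makes this path so hard to fix with ordering alone.
	 */
	if (unlikely(page_rmapping(page) != anon_vma)) {
		spin_unlock(&anon_vma->lock);
		goto again;
	}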
