From: Rik van Riel on
On 04/12/2010 11:44 AM, Linus Torvalds wrote:
> On Sun, 11 Apr 2010, Rik van Riel wrote:
>>
>> Looking around the code some more, zap_pte_range()
>> calls page_remove_rmap(), which leaves the
>> page->mapping in place and has this comment:
>
> See my earlier email about this exact issue. It's well-known that there
> are stale page->mapping pointers. The "page_mapped()" check _should_ have
> meant that in that case we never follow them, though.

Good point. I wonder if we have some SMP reordering
issue then?

>> I wonder if we can clear page->mapping here, if
>> list_is_singular(anon_vma->head). That way we
>> will not leave stale pointers behind.
>
> What does that help? What if list _isn't_ singular?

Yeah, that was a bad idea. Looking at the same code for
11 days straight seems to have put some knots in my brain :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rik van Riel on
On 04/12/2010 12:01 PM, Peter Zijlstra wrote:

> @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
> __dec_zone_page_state(page, NR_FILE_MAPPED);
> mem_cgroup_update_file_mapped(page, -1);
> }
> - /*
> - * It would be tidy to reset the PageAnon mapping here,
> - * but that might overwrite a racing page_add_anon_rmap
> - * which increments mapcount after us but sets mapping
> - * before us: so leave the reset to free_hot_cold_page,
> - * and remember that it's only reliable while mapped.
> - * Leaving it set also helps swapoff to reinstate ptes
> - * faster for those pages still in swapcache.
> - */
> +
> + page->mapping = NULL;
> }

That would be a bug for file pages :)

I could see how it could work for anonymous memory, though.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on


On Mon, 12 Apr 2010, Borislav Petkov wrote:
> >
> > If the warnings do happen, they are not going to be printing out any
> > hugely informative data apart from the fact that the bad case happened at
> > all. But If they do trigger, I can try to improve on them - it's just not
> > worth trying to make them any more interesting if they never trigger.
>
> Haa, I think you're gonna want to improve them :)
>
> WARN_ONCE(1, "page->mapping does not exist in vma chain");
>
> triggered on the first resume showing a rather messy 4 WARN_ONCEs. Had I
> more cores, there maybe would've been more of them :) Maybe need locking
> if clean output is of interest (see below).

Goodie.

I can't trigger this on my machine (not that I tried very hard - but I did
do some swapping loads etc by limiting my memory to just 1GB etc). So I'm
pretty sure my verification code is "correct", and verifies things that
should be right.

And the fact that it triggers under the exact load that you use to then
trigger the bug is a damn good thing. That means that we are finally on
the right track, and we have somethign that correlates well with the
actual bug.

> So, anyway, if I can read this correctly, there is a page->mapping
> anon_vma which is _not_ in the anon_vmas chain of the vma
> (avc->same_vma).

Yes, and that is supposed to be a no-no. The page is clearly associated
with the vma in question (since we are unmapping it through that vma), but
the vma list of 'anon_vma's doesn't actually have the one that
'page->mapping' points to.

And that, in turn, means that we've lost sight of the 'page->mapping'
anon_vma, and THAT in turn means that it could well have been free'd as
being no longer referenced.

And if it was free'd, it could be re-allocated as something else (after
the RCU grace period), and that directly explains your oops.

> By the way, I completely understand when you say that your head hurts
> from looking at this :).

Well, I have to say that I'm happy I've spent the time on it, because this
way I got to learn all the new rules. It's just that I really wish I
wouldn't have _had_ to.

Anyway, I'll have to think way more about this to see if I can come up
with a debugging patch that shows more details about what actually caused
this to happen in the first place. But we definitely have a smoking gun.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on


On Mon, 12 Apr 2010, Linus Torvalds wrote:
>
> Yes, and that is supposed to be a no-no. The page is clearly associated
> with the vma in question (since we are unmapping it through that vma), but
> the vma list of 'anon_vma's doesn't actually have the one that
> 'page->mapping' points to.
>
> And that, in turn, means that we've lost sight of the 'page->mapping'
> anon_vma, and THAT in turn means that it could well have been free'd as
> being no longer referenced.
>
> And if it was free'd, it could be re-allocated as something else (after
> the RCU grace period), and that directly explains your oops.

I have a new theory. And this new theory is completely different from all
the other things we've been looking at.

The new theory is really simple: 'page->mapping' has been re-set to the
wrong mapping.

Now, there is one case where we reset page->mapping _intentionally_,
namely in the COW-breaking case of having the last user
("page_move_anon_rmap"). And that looks fine, and happens under normal
loads all the time. We _want_ to do it there.

But there is a _much_ more subtle case that involved swapping.

So guys, here's my fairly simple theory on what happens:

- page gets allocated/mapped by process A. Let's call the anon_vma we
associate the page with 'A' to keep it easy to track.

- Process A forks, creating process B. The anon_vma in B is 'B', and has
a chain that looks like 'B' -> 'A'. Everything is fine.

- Swapping happens. The page (with mapping pointing to 'A') gets swapped
out (perhaps not to disk - it's enough to assume that it's just not
mapped any more, and lives entirely in the swap-cache)

- Process B pages it in, which goes like this:

do_swap_page ->
page = lookup_swap_cache(entry);
...
set_pte_at(mm, address, page_table, pte);
page_add_anon_rmap(page, vma, address);

And think about what happens here!

In particular, what happens is that this will now be the "first" mapping
of that page, so page_add_anon_rmap() will do

if (first)
__page_set_anon_rmap(page, vma, address);

and notice what anon_vma it will use? It will use the anon_vma for process
B!

So now page->mapping actually points to anon_vma 'B', not 'A' like it used
to.

What happens then? Trivial: process 'A' also pages it in (nothing happens,
it's not the first mapping), and then process 'B' execve's or exits or
unmaps, making anon_vma B go away.

End result: process A has a page that points to anon_vma B, but anon_vma B
does not exist any more. This can go on forever. Forget about RCU grace
periods, forget about locking, forget anything like that. The bug is
simply that page->mapping points to an anon_vma that was correct at one
point, but was _not_ the one that was shared by all users of that possible
mapping.

The patch below is my largely mindless try at fixing this. It's untested.
I'm not entirely sure that it actually works. But it makes some amount of
conceptual sense. No?

Linus

---
mm/rmap.c | 15 +++++++++++++--
1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index ee97d38..4bad326 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page,
static void __page_set_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
{
- struct anon_vma *anon_vma = vma->anon_vma;
+ struct anon_vma_chain *avc;
+ struct anon_vma *anon_vma;
+
+ BUG_ON(!vma->anon_vma);
+
+ /*
+ * We must use the _oldest_ possible anon_vma for the page mapping!
+ *
+ * So take the last AVC chain entry in the vma, which is the deepest
+ * ancestor, and use the anon_vma from that.
+ */
+ avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+ anon_vma = avc->anon_vma;

- BUG_ON(!anon_vma);
anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
page->mapping = (struct address_space *) anon_vma;
page->index = linear_page_index(vma, address);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on


On Mon, 12 Apr 2010, Rik van Riel wrote:

> On 04/12/2010 12:01 PM, Peter Zijlstra wrote:
>
> > @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
> > __dec_zone_page_state(page, NR_FILE_MAPPED);
> > mem_cgroup_update_file_mapped(page, -1);
> > }
> > - /*
> > - * It would be tidy to reset the PageAnon mapping here,
> > - * but that might overwrite a racing page_add_anon_rmap
> > - * which increments mapcount after us but sets mapping
> > - * before us: so leave the reset to free_hot_cold_page,
> > - * and remember that it's only reliable while mapped.
> > - * Leaving it set also helps swapoff to reinstate ptes
> > - * faster for those pages still in swapcache.
> > - */
> > +
> > + page->mapping = NULL;
> > }
>
> That would be a bug for file pages :)
>
> I could see how it could work for anonymous memory, though.

I think it's scary for anonymous pages too. The _common_ case of
page_remove_rmap() is from unmap/exit, which holds no locks on the page
what-so-ever. So assuming the page could be reachable some other way (swap
cache etc), I think the above is pretty scary.

Also do note that the bug we've been chasing has _always_ had that test
for "page_mapped(page)". See my other email about why the unmapped case
isn't even interesting, because it's so easy to see how page->mapping can
be stale for unmapped pages.

It's the _mapped_ case that is interesting, not the unmapped one. So
setting page->mapping to NULL when unmapping is perhaps a nice consistency
issue ("never have stale pointers"), but it's missing the fact that it's
not really the case we care about.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/