From: Linus Torvalds on


On Sun, 11 Apr 2010, Borislav Petkov wrote:
>
> > Conversely, if you still see the oops (rather than the watchdog), that
> > means that we actually have pages that are still marked mapped, and that
> > despite that mapped state have a stale page->mapping pointer. I actually
> > find that the more likely case, because otherwise the window is _so_ small
> > that I don't see how you can hit the oops so reliably.
>
> Ok, did test with the all 5 patches applied. It oopsed with the same
> trace, see below. Except one kernel/sched.c:3555 warning checking
> spinlock count overflowing, nothing else. :(

Ok, that preempt-count thing is a real problem, but should be unrelated to
your issues.

Anyway, so this all means that we definitely have lost sight of an
'anon_vma', even if page->mapping still points to it, and even though the
page is still mapped.

I'll see if I can come up with a patch to do the same kind of validation
on page->mapping as on the anon-vma chains themselves.

> I tried to see whether the page->mapping pointer is stale, I dunno,
> maybe there could be something in the register dump which could tell us
> what's happening.

Sadly, you cannot tell by the pointer. A stale pointer still is a
perfectly fine kernel pointer, it's just that we've long since released
the anon_vma it used to point to, and now it points to some random other
data structure.

> So, it really looks like at least that list_head in anon_vma is
> bollocks, or even the whole anon_vma. So if this is correct, it is
> highly likely that the anon_vma is already freed material or not
> initialized at all.

Yes, it's pretty certain it is long free'd, and re-allocated to something
else.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on


On Sun, 11 Apr 2010, Linus Torvalds wrote:
>
> I'll see if I can come up with a patch to do the same kind of validation
> on page->mapping as on the anon-vma chains themselves.

Ok, this may or may not work. It hasn't triggered for me, which may be
because it's broken, but maybe it's because I'm not doing whatever it is
you are doing to break our VM.

It checks each anonymous page at unmap time against the vma it gets
unmapped from. It depends on the previous vma_verify debugging patch, and
it would be interesting to hear whether this patch causes any new warnngs
for you..

If the warnings do happen, they are not going to be printing out any
hugely informative data apart from the fact that the bad case happened at
all. But If they do trigger, I can try to improve on them - it's just not
worth trying to make them any more interesting if they never trigger.

Linus

---
mm/memory.c | 21 +++++++++++++++++++++
mm/mmap.c | 2 +-
2 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..5d2df59 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -890,6 +890,25 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
return ret;
}

+extern void verify_vma(struct vm_area_struct *);
+
+static void verify_anon_page(struct vm_area_struct *vma, struct page *page)
+{
+ struct anon_vma *anon_vma = vma->anon_vma;
+ struct anon_vma *need_anon_vma = page_anon_vma(page);
+ struct anon_vma_chain *avc;
+
+ verify_vma(vma);
+ if (WARN_ONCE(!anon_vma, "anonymous page in vma without anon_vma"))
+ return;
+ list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) {
+ WARN_ONCE(avc->vma != vma, "anon_vma_chain vma entry doesn't match");
+ if (avc->anon_vma == need_anon_vma)
+ return;
+ }
+ WARN_ONCE(1, "page->mapping does not exist in vma chain");
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -940,6 +959,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
tlb_remove_tlb_entry(tlb, pte, addr);
if (unlikely(!page))
continue;
+ if (PageAnon(page))
+ verify_anon_page(vma, page);
if (unlikely(details) && details->nonlinear_vma
&& linear_page_index(details->nonlinear_vma,
addr) != page->index)
diff --git a/mm/mmap.c b/mm/mmap.c
index 890c169..461f59c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1565,7 +1565,7 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,

EXPORT_SYMBOL(get_unmapped_area);

-static void verify_vma(struct vm_area_struct *vma)
+void verify_vma(struct vm_area_struct *vma)
{
if (vma->anon_vma) {
struct anon_vma_chain *avc;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rik van Riel on
On 04/12/2010 10:40 AM, Peter Zijlstra wrote:

> So the reason page->mapping isn't cleared in page_remove_rmap() isn't
> detailed beyond a (possible) race with page_add_anon_rmap() (which I
> guess would be reclaim trying to unmap the page and a fault re-instating
> it).
>
> This also complicates the whole page_lock_anon_vma() thing, so it would
> be nice to be able to remove this race and clear page->mapping in
> page_remove_rmap().

For anonymous pages, I don't see where the race comes from.

Both do_swap_page and the reclaim code hold the page lock
across the entire operation, so they are already excluding
each other.

Hugh, do you remember what the race between page_remove_rmap
and page_add_anon_rmap is/was all about?

I don't see a race in the current code...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on


On Sun, 11 Apr 2010, Rik van Riel wrote:
>
> Looking around the code some more, zap_pte_range()
> calls page_remove_rmap(), which leaves the
> page->mapping in place and has this comment:

See my earlier email about this exact issue. It's well-known that there
are stale page->mapping pointers. The "page_mapped()" check _should_ have
meant that in that case we never follow them, though.

> I wonder if we can clear page->mapping here, if
> list_is_singular(anon_vma->head). That way we
> will not leave stale pointers behind.

What does that help? What if list _isn't_ singular?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on


On Sun, 11 Apr 2010, Rik van Riel wrote:
>
> Another thing I just thought of.
>
> The anon_vma struct will not be reused for something completely
> different due to the SLAB_DESTROY_BY_RCU flag that the anon_vma_cachep
> is created with.

Rik, we _know_ it got re-used by something totally different. That's
clearly the problem. The page->mapping pointer does _not_ point to an
anon_vma any more. That's the problem here.

What we need to figure out is how we have a page on the LRU list that is
still marked as 'mapped' that has that stale mapping pointer.

I can easily see how the stale mapping pointer happens for a non-mapped
page. That part is trivial. Here's a simple case:

- vmscan does that whole "isolate LRU pages", and one of them is a (at
that time mapped) anonymous page. It's now not on any LRU lists at all.

- vmscan ends up waiting for pageout and/or writeback while holding that
list of pages.

- in the meantime, the process that had the page exists or unmaps,
unmapping the page and freeing the vma and the anon_vma.

- vmscan eventually gets to the page, and does that page_referenced()
dance. page->mapping points to something that is long long gone (as in
"IO access lifetimes", so we're talking something that has been freed
literally milliseconds ago, rather than any RCU delays)

So I can see the stale page->mapping pointer happening. That part is even
trivial. What I don't see is how the page would be still marked 'mapped'.
Everything that actually free's the vma/anon_vmas should also have
unmapped the page before that - even if it didn't _free_ the page.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/