rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA [Kernel]

Prev: [Bug #15615] NULL pointer deref in task_is_waking
Next: [Bug #15603] lockdep warning at boot time when determining whether to resume

From: Linus Torvalds on 10 Apr 2010 19:40

On Sun, 11 Apr 2010, Johannes Weiner wrote:
>
> Did you have something in mind that I missed?

Mostly that the corner cases will never matter, and I'd prefer to keep the
code simpler than to care deeply.

For example, the only case you'd see vm_ops->close() is for special device
mappings. It's true that they cannot have their vma's merged, but it's
also true that they (a) will seldom have anon_vma's anyway and (b) would
never get mapped very many times so that anon_vma merging would be an
issue.

In other words, it's a "don't care" situation, where to keep the code
simpler we just document that we don't care.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds on 11 Apr 2010 13:20

On Sun, 11 Apr 2010, Borislav Petkov wrote:
>
> Ok, I could verify that the three patches we were talking about still
> can't fix the issue. However, just to make sure I'm sending the versions
> of the patches I used for you guys to check.

Yup, the patches are the ones I wanted you to try.

So either my fixes were buggy (possible, especially for the vma_adjust
case), or there are other bugs still lurking.

The scary part is that the _old_ anon_vma code didn't really care about
the anon_vma all that deeply. It was just a placeholder, if you got some
of it wrong the worst that would probably happen would be that a page
could never find all the mappings it had. So it was a possible swap
efficiency problem when we cannot get rid of all mapped pages, but if it
only happens for some small and unusual special case, nobody would ever
have noticed.

With the new code, when you have a page that is associated with a stale
anon_vma, you get the page_referenced() oops instead.

And I can't find the bug. Everything I've looked at looks fine. So I'm
going to ask you to start applying "validation patches" - code to check
some internal consistency, and seeing if we break that internal
consistency somewhere.

It may be that Rik has some patches like this from his development work,
but here's the first one. This patch should have caught the vma_adjust()
problem, but all it caught for me was that "anon_vma_clone()" ended up
cloning the avc entries in the wrong order so the lists didn't actually
look exactly the same.

The patch fixes that case, so if this triggers any warnings for you, I
think it's a real bug.

But I'm pretty sure that the problem is that we have a "page->mapping"
that points to an anon_vma that no longer exists, and you can easily get
that while still having valid vma chains - they just aren't necessarily
the complete _set_ of chains they should be.

[ In particular, I think that the _real_ problem is that we don't clear
"page->mapping" when we unmap a page.

See the comment at the end of page_remove_rmap(), and it also explains
the test for "page_mapped()" in page_lock_anon_vma().

But I think the bug you see might be exactly the race between
page_mapped() and actually getting the anon_vma spinlock. I'd have
expected that window to be too small to ever hit, though, which is why I
find it a bit unlikely. But it would explain why you _sometimes_
actually get a hung spinlock too - you never get the spinlock at all,
and somebody replaced the data with something that the spinlock code
thinks is a locked spinlock - but is no longer a spinlock at all ]

Linus

---
mm/mmap.c | 18 ++++++++++++++++++
mm/rmap.c | 2 +-
2 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..890c169 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1565,6 +1565,22 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,

EXPORT_SYMBOL(get_unmapped_area);

+static void verify_vma(struct vm_area_struct *vma)
+{
+ if (vma->anon_vma) {
+ struct anon_vma_chain *avc;
+ if (WARN_ONCE(list_empty(&vma->anon_vma_chain), "vma has anon_vma but empty chain"))
+ return;
+ /* The first entry of the avc chain should match! */
+ avc = list_entry(vma->anon_vma_chain.next, struct anon_vma_chain, same_vma);
+ WARN_ONCE(avc->anon_vma != vma->anon_vma, "anon_vma entry doesn't match anon_vma_chain");
+ WARN_ONCE(avc->vma != vma, "vma entry doesn't match anon_vma_chain");
+ } else {
+ WARN_ONCE(!list_empty(&vma->anon_vma_chain), "vma has no anon_vma but has chain");
+ }
+}
+
+
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
@@ -1598,6 +1614,8 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
mm->mmap_cache = vma;
}
}
+ if (vma)
+ verify_vma(vma);
return vma;
}

diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..ee97d38 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -182,7 +182,7 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
{
struct anon_vma_chain *avc, *pavc;

- list_for_each_entry(pavc, &src->anon_vma_chain, same_vma) {
+ list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
avc = anon_vma_chain_alloc();
if (!avc)
goto enomem_failure;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds on 11 Apr 2010 13:30

On Sun, 11 Apr 2010, Linus Torvalds wrote:
>
> But I think the bug you see might be exactly the race between
> page_mapped() and actually getting the anon_vma spinlock. I'd have
> expected that window to be too small to ever hit, though, which is why I
> find it a bit unlikely. But it would explain why you _sometimes_
> actually get a hung spinlock too - you never get the spinlock at all,
> and somebody replaced the data with something that the spinlock code
> thinks is a locked spinlock - but is no longer a spinlock at all ]

Actually, so if it's that race, then we might get rid of the oops with
this total hack.

NOTE! If this is the race, then the hack really is just a hack, because it
doesn't really solve anything. We still take the spinlock, and if bad
things has happened, _that_ can still very much fail, and you get the
watchdog lockup message instead. So this doesn't really fix anything.

But if this patch changes behavior, and you no longer see the oops, that
tells us _something_. I'm not sure how useful that "something" is, but it
at least means that there are no _mapped_ pages that have that stale
anon_vma pointer in page->mapping.

Conversely, if you still see the oops (rather than the watchdog), that
means that we actually have pages that are still marked mapped, and that
despite that mapped state have a stale page->mapping pointer. I actually
find that the more likely case, because otherwise the window is _so_ small
that I don't see how you can hit the oops so reliably.

Anyway - probably worth testing, along with the verify_vma() patch. If
nothing else, if there is no new behavior, even that tells us something.
Even if that "something" is not a huge piece of information.

Linus

---
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -302,7 +302,11 @@ struct anon_vma *page_lock_anon_vma(struct page *page)

anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
spin_lock(&anon_vma->lock);
- return anon_vma;
+
+ if (page_mapped(page))
+ return anon_vma;
+
+ spin_unlock(&anon_vma->lock);
out:
rcu_read_unlock();
return NULL;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Rik van Riel on 11 Apr 2010 16:00

On 04/11/2010 01:16 PM, Linus Torvalds wrote:

> NOTE! If this is the race, then the hack really is just a hack, because it
> doesn't really solve anything. We still take the spinlock, and if bad
> things has happened, _that_ can still very much fail, and you get the
> watchdog lockup message instead. So this doesn't really fix anything.

Looking around the code some more, zap_pte_range()
calls page_remove_rmap(), which leaves the
page->mapping in place and has this comment:

/*
* It would be tidy to reset the PageAnon mapping here,
* but that might overwrite a racing page_add_anon_rmap
* which increments mapcount after us but sets mapping
* before us: so leave the reset to free_hot_cold_page,
* and remember that it's only reliable while mapped.
* Leaving it set also helps swapoff to reinstate ptes
* faster for those pages still in swapcache.
*/

I wonder if we can clear page->mapping here, if
list_is_singular(anon_vma->head). That way we
will not leave stale pointers behind.

Adding another VMA to the anon_vma can happen
at fork time - which will not happen simultaneously
with exit or munmap, because the mmap_sem is taken
for write during either code path.

Am I overlooking something here?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Rik van Riel on 11 Apr 2010 17:50

On 04/11/2010 01:16 PM, Linus Torvalds wrote:

> Actually, so if it's that race, then we might get rid of the oops with
> this total hack.

Another thing I just thought of.

The anon_vma struct will not be reused for something completely
different due to the SLAB_DESTROY_BY_RCU flag that the anon_vma_cachep
is created with.

The anon_vma_chain structs are allocated from a slab without that
flag, so they can be reused for something else in the middle of
an RCU section.

Is that something worth fixing, or is this so subtle that we'd
rather not have the code rely on this kind of behaviour at all?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10 11
Prev: [Bug #15615] NULL pointer deref in task_is_waking
Next: [Bug #15603] lockdep warning at boot time when determining whether to resume