From: Linus Torvalds on


On Sat, 10 Apr 2010, Linus Torvalds wrote:
>
> I dunno. Does the above sound at all sane? Or am I just raving?
>
> Something hacky like the above might fix it if I'm not just raving. I
> really might be missing something here.

Btw, if this turns out to be accurate, the real fix is to probably just
have a separate phase at the very end to actually release all the vma's,
rather than do it in "free_page_tables()". We don't want to make the
tlb-gather any more atomic than it already is. In fact, Nick is trying to
make it preemptible.

So the patch included in that mail was meant very much as a "let's test my
crazy theory" patch, rather than as the real solution.

The patch is also untested. Maybe it doesn't work at all and introduces
new bugs. Caveat emptor.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rik van Riel on
On 04/10/2010 02:21 PM, Linus Torvalds wrote:

> Maybe I'm crazy, but something started bothering me. And I started
> wondering: when is the 'page->mapping' of an anonymous page actually
> cleared?
>
> The thing is, the mapping of an anonymous page is actually cleared only
> when the page is _freed_, in "free_hot_cold_page()".

Which is also where they are removed from the LRU.
The plot thickens...

> Now, let's think about that. And in particular, let's think about how that
> relates to the freeing of the 'anon_vma' that the page->mapping points to.
>
> The way the anon_vma is freed is when the mapping is torn down, and we do
> roughly:
>
> tlb = tlb_gather_mmu(mm,..)
> ..
> unmap_vmas(&tlb, vma ..
> ..
> free_pgtables()
> ..
> tlb_finish_mmu(tlb, start, end);

Looks like we should move the anon_vma freeing from free_pgtables
over to remove_vma?

This code is just below the tlb_finish_mmu in exit_mmap:

/*
* Walk the list again, actually closing and freeing it,
* with preemption enabled, without holding any MM locks.
*/
while (vma)
vma = remove_vma(vma);

This comment in free_pgtables is a little suspect:

/*
* Hide vma from rmap and truncate_pagecache before freeing
* pgtables
*/
unlink_anon_vmas(vma);
unlink_file_vma(vma);

After all, the rmap code will quickly notice that there either are
no page tables, or the page tables no longer have anything in them.

It looks like we may have had this use-after-free bug in the VM for
quite a while... I am not entirely sure what exposed the bug, but
I can see how it works.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on


On Sat, 10 Apr 2010, Borislav Petkov wrote:
> From: Borislav Petkov <bp(a)alien8.de>
> Date: Sat, Apr 10, 2010 at 08:51:45PM +0200
>
> > Anyways, testing...
>
> Nope, still b0rked. And this time is not a funny pattern but
> ffffffffffffffe0 we had originally.

Ok, I think that just depends on who happens to re-use the allocation and
how it does it.

I'm pretty sure it's a use-after-free issue, where we have free'd an
anon_vma too early, even though it has pages associated with it.

If it wasn't the RCU case, it's just something else.

I think it's worth looking at "vma_adjust()", because as I already
mentioned to Rik earlier - the code is very hard to understand, and it's
accrued crud over many many years.

And vma_adjust is the one place that does that anon_vma_merge(), which is
apart from the actual unmapping sequence the only other place that
actually free's anon_vmas. So there are reasons to be very suspicious of
that code.

And I think that code can actually lose an anon_vma chain. It's totally
screwing up the "import anonvma" case: when it does

if (anon_vma_clone(importer, vma)) {
return -ENOMEM;
}
importer->anon_vma = anon_vma;

we can actually have "importer == vma", but "anon_vma = next->anon_vma".

In which case we actually end up with an _empty_ chain (because importer
didn't have a chain to begin with!) but "importer->anon_vma" points to an
anon_vma.

And then when we do that "remove_next", we actually get rid of the only
chain we ever had, and have lost all our references to the anon_vma.

That looks _horribly_ buggy.

Also, the conditional nesting makes no sense (the whole anon_vma_clone()
only makes sense if importer is set, and it is only ever set _inside_ the
earlier if-statement, so the whole code should be moved inside there), nor
does some of the comments.

This patch is scary and untested, but the more I look at that code, the
more convinced I am that vma_adjust was _really_ badly screwed up. The
patch below may make things worse. I'll test it myself too, but I'm
sending it out first, since I was writing the email as I was looking at
the piece of cr*p.

Linus

---
mm/mmap.c | 24 ++++++++----------------
1 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index acb023e..f90ea92 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -507,11 +507,12 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
struct address_space *mapping = NULL;
struct prio_tree_root *root = NULL;
struct file *file = vma->vm_file;
- struct anon_vma *anon_vma = NULL;
long adjust_next = 0;
int remove_next = 0;

if (next && !insert) {
+ struct vm_area_struct *exporter = NULL;
+
if (end >= next->vm_end) {
/*
* vma expands, overlapping all the next, and
@@ -519,7 +520,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
*/
again: remove_next = 1 + (end > next->vm_end);
end = next->vm_end;
- anon_vma = next->anon_vma;
+ exporter = next;
importer = vma;
} else if (end > next->vm_start) {
/*
@@ -527,7 +528,7 @@ again: remove_next = 1 + (end > next->vm_end);
* mprotect case 5 shifting the boundary up.
*/
adjust_next = (end - next->vm_start) >> PAGE_SHIFT;
- anon_vma = next->anon_vma;
+ exporter = next;
importer = vma;
} else if (end < vma->vm_end) {
/*
@@ -536,28 +537,19 @@ again: remove_next = 1 + (end > next->vm_end);
* mprotect case 4 shifting the boundary down.
*/
adjust_next = - ((vma->vm_end - end) >> PAGE_SHIFT);
- anon_vma = next->anon_vma;
+ exporter = vma;
importer = next;
}
- }

- /*
- * When changing only vma->vm_end, we don't really need anon_vma lock.
- */
- if (vma->anon_vma && (insert || importer || start != vma->vm_start))
- anon_vma = vma->anon_vma;
- if (anon_vma) {
/*
* Easily overlooked: when mprotect shifts the boundary,
* make sure the expanding vma has anon_vma set if the
* shrinking vma had, to cover any anon pages imported.
*/
- if (importer && !importer->anon_vma) {
- /* Block reverse map lookups until things are set up. */
- if (anon_vma_clone(importer, vma)) {
+ if (exporter && exporter->anon_vma && !importer->anon_vma) {
+ if (anon_vma_clone(importer, exporter))
return -ENOMEM;
- }
- importer->anon_vma = anon_vma;
+ importer->anon_vma = exporter->anon_vma;
}
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on


On Sat, 10 Apr 2010, Linus Torvalds wrote:
>
> This patch is scary and untested, but the more I look at that code, the
> more convinced I am that vma_adjust was _really_ badly screwed up. The
> patch below may make things worse. I'll test it myself too, but I'm
> sending it out first, since I was writing the email as I was looking at
> the piece of cr*p.

Ok, it boots. Which means it must be bug-free and perfect. And I really am
convinced that the old vma_adjust() use of anon_vma_clone() was _totally_
broken, so this really could explain everything.

The RCU grace period thing for the TLB flush does look like a real bug
too, but it's one that is probably impossible to hit in practice.

A broken vma_adjust(), however, would seem to be trivial to hit once you
just get the right memory freeing patterns going, because the anon_vma
would easily be _loong_ gone because we didn't create a chain to it at
all, so the anon_vma code decided that it's not used any more.

So I'm actually pretty optimistic that this really is it.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rik van Riel on
On 04/10/2010 04:05 PM, Linus Torvalds wrote:

> And vma_adjust is the one place that does that anon_vma_merge(), which is
> apart from the actual unmapping sequence the only other place that
> actually free's anon_vmas. So there are reasons to be very suspicious of
> that code.

It frees anon_vma_chain structures, but not actual anon_vmas.

Walking the anon_vma (from rmap) requires the anon_vma->lock,
which is taken in anon_vma_merge whenever a chain is unlinked.

> And I think that code can actually lose an anon_vma chain. It's totally
> screwing up the "import anonvma" case: when it does
>
> if (anon_vma_clone(importer, vma)) {
> return -ENOMEM;
> }
> importer->anon_vma = anon_vma;
>
> we can actually have "importer == vma", but "anon_vma = next->anon_vma".

A few lines up from that code, we have:

if (vma->anon_vma && (insert || importer || start !=
vma->vm_start))
anon_vma = vma->anon_vma;

So anon_vma should always be vma->anon_vma.

If we have already imported an anon_vma, we will not
do so twice, because of the !importer->anon_vma check.

What am I overlooking?

> In which case we actually end up with an _empty_ chain (because importer
> didn't have a chain to begin with!) but "importer->anon_vma" points to an
> anon_vma.

If we import a chain, from vma to importer, importer->anon_vma
will be equal to vma->anon_vma.

I do not see how 'importer' could get a state different from 'vma'.

> Also, the conditional nesting makes no sense (the whole anon_vma_clone()
> only makes sense if importer is set, and it is only ever set _inside_ the
> earlier if-statement, so the whole code should be moved inside there), nor
> does some of the comments.

No argument there, vma_adjust is very hard to read and it took
me a few days to convince myself that my changes kept things
equivalent to how they were before.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/