rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA [Kernel]

From: Linus Torvalds on 9 Apr 2010 20:40

On Sat, 10 Apr 2010, Johannes Weiner wrote:
>
> That leaves the chance that my code was correct and we leave a conceptual
> error around somewhere that can materialize again.

Absolutely. I really don't know whether your merge routine works or not.
I'd just rather not have to even _try_ to understand it.

I have a fairly simple rule for most of the code I see: if I have a hard
time understanding why it should work, I don't really want to rely on it.

> But I am at a point where simplification never sounded more blissful, so
> yeah, I like it :)

Exactly. This is the "let's limit things a bit to keep them much simpler"
approach.

> Let's hope it fixes Boris's issue.

I'm going to just guess that it won't, and that Boris' issue was actually
due to something else entirely, and we've all been staring at totally the
wrong code.

But we can hope.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Rik van Riel on 10 Apr 2010 10:50

On 04/10/2010 07:26 AM, Borislav Petkov wrote:

> This time we got stuck on the anon_vma->lock (yep, we've seen that
> oopsie before). So, it might be that we _really_ are staring at the
> wrong code... Back to square one.

This is a different bug, though.

If the null pointer dereference is gone, Linus's patch
fixed that bug and we can move forward to fixing the
anon_vma->lock bug.

I'll start auditing the code to see if we forget to
unlock the anon_vma in some unlikely error path...

From: Linus Torvalds on 10 Apr 2010 11:30

On Sat, 10 Apr 2010, Borislav Petkov wrote:
> >
> > I will keep running that kernel in the next couple of days and keep you
> > informed in case this is the fix we're gonna use.
>
> Yep, you jinxed it :)
>
> This time we got stuck on the anon_vma->lock (yep, we've seen that
> oopsie before). So, it might be that we _really_ are staring at the
> wrong code... Back to square one.

No, I think we're good. I suspect this is a different issue. Do you have
lockdep enabled, along with mutex and spinlock debugging etc? That might
help pinpoint what triggers this.

But I think the fact that you are apparently not able to get the list
corruption is a good sign. Of course, it might just be harder to trigger,
and these things could all be a sign of a different bug, but my gut feel
is that we did fix something, and you are just damn good at stressing the
new code. Kudos.

Linus

From: Linus Torvalds on 10 Apr 2010 13:20

On Sat, 10 Apr 2010, Borislav Petkov wrote:
>
> And I got an oops again, this time the #GP from couple of days ago.

Oh damn. So the list corruption really does happen still.

And the pattern is similar, but not the same: now it's 0032323200323232,
rather than 002e2e2e002e2e2e. Very intriguing. 0x32 instead of 0x2e, but
the same pattern of duplicated bytes. And not very helpful in that it
still doesn't actually make any sense.

> <thinking out loud>
>
> I'm starting to think that maybe there could be something wrong with the
> machine I'm running it on. Especially since there are only two people
> who reported this issue, Steinar and me, so how probable is it that
> maybe those two machines have failing RAM module somewhere? Or some
> other data corrupting thing? Although I should be getting mchecks...
> Hmm...

No. Just the fact that there are two people who reported the same
thing is already a pretty strong sign that it's real. Also, hardware
problems don't tend to be as consistent in the details as yours have
been.

And in fact I have seen it personally (but couldn't reproduce it) on the
kids' Mac mini after you reported it.

So I'm convinced the problem is real, and just not so easily
triggered, and you're being a great tester.

Linus
--
Here's the one I've seen, in case you care. I haven't posted it, because
it doesn't really add anything new.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<c02850cf>] page_referenced+0xd6/0x199
*pde = 21d73067 *pte = 00000000
Oops: 0000 [#2] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host2/target2:0:0/2:0:0:0/block/sda/uevent
Modules linked in: [last unloaded: scsi_wait_scan]

Pid: 14440, comm: firefox Tainted: G D 2.6.34-rc2-00391-gfc1203c #3 Mac-F4208EC8/Macmini1,1
EIP: 0060:[<c02850cf>] EFLAGS: 00210287 CPU: 1
EIP is at page_referenced+0xd6/0x199
EAX: f59e65d4 EBX: c10b5480 ECX: 00000000 EDX: fffffff0
ESI: f59e65d0 EDI: 00000000 EBP: d8f77cd8 ESP: d8f77ca0
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process firefox (pid: 14440, ti=d8f76000 task=cb795440 task.ti=d8f76000)
Stack:
f59e65d4 00000000 fffffff0 c15ba000 d8f77cbc c02885b8 c07972c4 d8f77cdc
c0276712 00000000 00000001 c10b5498 c10b5480 d8f77e94 d8f77d58 c0276b53
d8f77d48 00000000 00000000 00000000 0000001d d8f77de8 00000001 c07972c4
Call Trace:
[<c02885b8>] ? swapcache_free+0x1b/0x24
[<c0276712>] ? __remove_mapping+0x90/0xb2
[<c0276b53>] ? shrink_page_list+0x109/0x3ba
[<c0277099>] ? shrink_inactive_list+0x295/0x48e
[<c0273d68>] ? determine_dirtyable_memory+0x34/0x4b
[<c0273dd0>] ? get_dirty_limits+0x16/0x26d
[<c027750c>] ? shrink_zone+0x27a/0x327
[<c03c55a5>] ? i915_gem_shrink+0x67/0x22c
[<c0277e6d>] ? do_try_to_free_pages+0x17d/0x292
[<c0278078>] ? try_to_free_pages+0x6a/0x72
[<c0275cd7>] ? isolate_pages_global+0x0/0x1bd
[<c0273210>] ? __alloc_pages_nodemask+0x2c2/0x447
[<c027f1c1>] ? handle_mm_fault+0x188/0x605
[<c02192c3>] ? do_page_fault+0x253/0x269
[<c0219070>] ? do_page_fault+0x0/0x269
[<c05b9e82>] ? error_code+0x66/0x6c
[<c05b0000>] ? azx_probe+0x5e8/0x8ae
[<c0219070>] ? do_page_fault+0x0/0x269
Code: f9 f2 74 18 ff 75 08 8d 45 f0 50 89 d8 e8 62 f6 ff ff 01 c7 59 83 7d f0 00 58 74 20 8b 55 d0 8b 42 10 83 e8 10 89 45 d0 8b 55 d0 <8b> 42 10 0f 18 00 90 89 d0 83 c0 10 39 45 c8 75 ab fe 06 e9 90
EIP: [<c02850cf>] page_referenced+0xd6/0x199 SS:ESP 0068:d8f77ca0
CR2: 0000000000000000
---[ end trace 890710798f4c0070 ]---
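
For anyone decoding the oops above: the bytes around the marked instruction
in the Code: line, together with EDX = fffffff0 and CR2 = 0, fit a list walk
that picked up a NULL 'next' pointer. "83 e8 10" is "sub $0x10,%eax" and
"<8b> 42 10" is "mov 0x10(%edx),%eax". The sketch below only replays that
32-bit arithmetic in userspace C. The helper names are made up for
illustration, and reading 0x10 as the offset of a list node inside a larger
structure is an assumption, not something the trace itself says.

```c
#include <stdint.h>

/*
 * Offset taken from the immediate/displacement bytes in the oops.
 * Treating it as "offset of the list node inside its entry" is an
 * assumption made for this illustration.
 */
#define TOY_LIST_OFFSET 0x10u

/* Models "83 e8 10" (sub $0x10,%eax): step from a list node back to
 * the start of the entry that would contain it. 32-bit wraparound is
 * intentional, to match the i386 register width. */
static uint32_t toy_entry_from_node(uint32_t node)
{
    return (uint32_t)(node - TOY_LIST_OFFSET);
}

/* Models "<8b> 42 10" (mov 0x10(%edx),%eax): the address the faulting
 * load reads from, given the entry pointer held in EDX. */
static uint32_t toy_load_address(uint32_t entry)
{
    return (uint32_t)(entry + TOY_LIST_OFFSET);
}
```

Feeding a NULL node pointer (0) through toy_entry_from_node() gives
0xfffffff0, the EDX value in the register dump, and toy_load_address() on
that value gives 0, the CR2 value.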


From: Linus Torvalds on 10 Apr 2010 14:30

On Sat, 10 Apr 2010, Linus Torvalds wrote:
> On Sat, 10 Apr 2010, Borislav Petkov wrote:
> >
> > And I got an oops again, this time the #GP from couple of days ago.
>
> Oh damn. So the list corruption really does happen still.

Ho humm.

Maybe I'm crazy, but something started bothering me. And I started
wondering: when is the 'page->mapping' of an anonymous page actually
cleared?

The thing is, the mapping of an anonymous page is actually cleared only
when the page is _freed_, in "free_hot_cold_page()".

Now, let's think about that. And in particular, let's think about how that
relates to the freeing of the 'anon_vma' that the page->mapping points to.

The way the anon_vma is freed is when the mapping is torn down, and we do
roughly:

	tlb = tlb_gather_mmu(mm,..)
	..
	unmap_vmas(&tlb, vma ..
	..
	free_pgtables()
	..
	tlb_finish_mmu(tlb, start, end);

and we actually unmap all the pages in "unmap_vmas()", and then _after_
unmapping all the pages we do the "unlink_anon_vmas(vma);" in
"free_pgtables()". Fine so far - the anon_vma stays around until after the
page has been happily unmapped.

But "unmapped all the pages" is _not_ actually the same as "free'd all the
pages". The actual _freeing_ of the page happens generally in
tlb_finish_mmu(), because we can free the page only after we've flushed
any TLB entries.

So what we have in that tlb_gather structure is a list of _pending_ pages
to be freed, while we already actually free'd the anon_vmas earlier!
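
A toy userspace model may make that window easier to see. This is only a
sketch of the ordering described above, not kernel code: every toy_* name is
invented for illustration, the "free" of the anon_vma is modelled as a flag
so nothing is actually dereferenced after free, and RCU is not modelled at
all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 4

/* Hypothetical stand-ins; these are NOT the kernel's structures. */
struct toy_anon_vma { bool freed; };
struct toy_page {
    struct toy_anon_vma *mapping; /* cleared only when the page is freed */
    bool mapped;
};
struct toy_gather {
    struct toy_page *pending[TOY_NR_PAGES]; /* pages queued for freeing */
    size_t nr;
};
struct toy_result {
    size_t stale_after_unmap;  /* after the unmap_vmas() step */
    size_t stale_in_window;    /* after free_pgtables(), before finish */
    size_t stale_after_finish; /* after the tlb_finish_mmu() step */
};

/* unmap_vmas() step: unmap and queue each page; mapping is left alone. */
static void toy_unmap_vmas(struct toy_gather *tlb, struct toy_page *pages,
                           size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(tlb->nr < TOY_NR_PAGES);
        pages[i].mapped = false;
        tlb->pending[tlb->nr++] = &pages[i];
    }
}

/* free_pgtables() step: the anon_vma is released here (flag only). */
static void toy_free_pgtables(struct toy_anon_vma *av)
{
    av->freed = true;
}

/* tlb_finish_mmu() step: pending pages are freed, clearing mapping. */
static void toy_finish_mmu(struct toy_gather *tlb)
{
    for (size_t i = 0; i < tlb->nr; i++)
        tlb->pending[i]->mapping = NULL;
    tlb->nr = 0;
}

/* Count pages whose mapping still points at a released anon_vma. */
static size_t toy_count_stale(const struct toy_page *pages, size_t n)
{
    size_t stale = 0;
    for (size_t i = 0; i < n; i++)
        if (pages[i].mapping && pages[i].mapping->freed)
            stale++;
    return stale;
}

/* Run the three steps in order and record the stale count after each. */
static struct toy_result toy_teardown(void)
{
    struct toy_anon_vma av = { .freed = false };
    struct toy_page pages[TOY_NR_PAGES];
    struct toy_gather tlb = { .nr = 0 };
    struct toy_result r;

    for (size_t i = 0; i < TOY_NR_PAGES; i++) {
        pages[i].mapping = &av;
        pages[i].mapped = true;
    }

    toy_unmap_vmas(&tlb, pages, TOY_NR_PAGES);
    r.stale_after_unmap = toy_count_stale(pages, TOY_NR_PAGES);

    toy_free_pgtables(&av);
    r.stale_in_window = toy_count_stale(pages, TOY_NR_PAGES);

    toy_finish_mmu(&tlb);
    r.stale_after_finish = toy_count_stale(pages, TOY_NR_PAGES);
    return r;
}
```

Calling toy_teardown() from a main() of your own shows no stale pointers
right after the unmap step, every queued page holding a stale mapping between
the free_pgtables() step and the finish step, and none again once the pending
pages have been freed.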

Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because
we use a per-cpu variable), but as far as I can tell it is _not_ an
RCU-safe region.

So I think we might actually get a real RCU freeing event while this all
happens. So now the 'anon_vma' that 'page->mapping' points to has not just
been released back to the SLUB caches, the page itself might have been
released too.

I dunno. Does the above sound at all sane? Or am I just raving?

Something hacky like the patch below might fix it if I'm not just raving. I
really might be missing something here.

Linus

---
include/asm-generic/tlb.h | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e43f976..2678118 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -14,6 +14,7 @@
 #define _ASM_GENERIC__TLB_H

 #include <linux/swap.h>
+#include <linux/rcupdate.h>
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>

@@ -62,6 +63,7 @@ tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)

 	tlb->fullmm = full_mm_flush;

+	rcu_read_lock();
 	return tlb;
 }

@@ -90,6 +92,7 @@ tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 	/* keep the page table cache within bounds */
 	check_pgt_cache();

+	rcu_read_unlock();
 	put_cpu_var(mmu_gathers);
 }

