Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) [Kernel]


From: Borislav Petkov on 2 Apr 2010 14:10

Hi,

I've hit the following oopsie twice now when hibernating. It doesn't
happen every time I hibernate, only sometimes, say once in a blue
moon.

And yeah, I couldn't catch it over serial console so I had to make ugly
pictures. By the way, the numbers in the filenames increment as I scroll
down the whole oops (yep, it hadn't completely frozen and I still could
do Shift->PgUp or Shift->PgDn on the console):

http://www.kernel.org/pub/linux/kernel/people/bp/

So, here's what I could decipher from the oopsie. Someone who's more
knowledgeable in mm, rmap and anon_vma list traversal should be able
to tell what goes wrong there.

EIP is at page_referenced+0xee

which is

<disasm>
    10c4:  41 01 c4      add         %eax,%r12d
    10c7:  83 7d cc 00   cmpl        $0x0,-0x34(%rbp)
    10cb:  74 19         je          10e6 <page_referenced+0xff>
    10cd:  4d 8b 6d 20   mov         0x20(%r13),%r13
    10d1:  49 83 ed 20   sub         $0x20,%r13

    10d5:  49 8b 45 20   mov         0x20(%r13),%rax   <--------------

    10d9:  0f 18 08      prefetcht0  (%rax)
    10dc:  49 8d 45 20   lea         0x20(%r13),%rax
    10e0:  48 39 45 80   cmp         %rax,-0x80(%rbp)
</disasm>

Corresponding asm:

<asm>
	.loc 1 496 0
	movq	32(%r13), %r13	# <variable>.same_anon_vma.next, __mptr.451
.LVL295:
	subq	$32, %r13	#, avc
.LVL296:
.L184:
.LBE1278:
	movq	32(%r13), %rax	# <variable>.same_anon_vma.next, <variable>.same_anon_vma.next <----------------
	prefetcht0	(%rax)	# <variable>.same_anon_vma.next
	leaq	32(%r13), %rax	#, tmp97
	cmpq	%rax, -128(%rbp)	# tmp97, %sfp
	jne	.L187	#,
.L186:
	.loc 1 514 0
	movq	%r14, %rdi	# anon_vma,
	call	page_unlock_anon_vma	#
</asm>

and the NULL pointer in question is loaded into %r13 and then 32 is
subtracted from it (I'm guessing container_of()). This is consistent
with the register snapshot, where %r13 contains 0xffffffffffffffe0,
i.e. -32, and with the code dump in the oops: in CIMG1640.JPG the code
pointer is at the opcode bytes 49 8b 45 20.

Which is the following piece of code in <mm/rmap.c:page_referenced_anon()>.

<source>

	mapcount = page_mapcount(page);
	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct vm_area_struct *vma = avc->vma;
		unsigned long address = vma_address(page, vma);
		if (address == -EFAULT)
			continue;

</source>

which tells us that same_anon_vma.next is NULL. Hmm...
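
The arithmetic behind that conclusion can be checked in userspace. The
following is only a sketch under stated assumptions: the struct mirrors
what the field order of struct anon_vma_chain is assumed to be in this
kernel (the two leading pointers are made opaque), an LP64 target is
assumed, and the helpers avc_from_next(), next_field_addr() and
check_null_next() are hypothetical names invented for the example. With
that layout same_anon_vma sits at offset 32, so a NULL ->next becomes an
avc value of -32 and the following load of avc->same_anon_vma.next reads
from address 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/*
 * Assumed layout, modelled on struct anon_vma_chain. The two leading
 * pointers are opaque here because only the field offsets matter.
 */
struct anon_vma_chain {
	void *vma;
	void *anon_vma;
	struct list_head same_vma;
	struct list_head same_anon_vma;
};

/*
 * What container_of(next, struct anon_vma_chain, same_anon_vma) boils
 * down to, done in integer arithmetic so that a NULL input stays well
 * defined in this sketch.
 */
static inline uintptr_t avc_from_next(const struct list_head *next)
{
	return (uintptr_t)next - offsetof(struct anon_vma_chain, same_anon_vma);
}

/* Address that a load of avc->same_anon_vma.next would read from. */
static inline uintptr_t next_field_addr(uintptr_t avc)
{
	return avc + offsetof(struct anon_vma_chain, same_anon_vma);
}

/* Self-check: NULL ->next gives avc == -32, and the next load hits 0. */
static inline void check_null_next(void)
{
	uintptr_t avc = avc_from_next(NULL);

	assert(avc == (uintptr_t)-32);
	assert(next_field_addr(avc) == 0);
}
```

Calling check_null_next() from a test program on a 64-bit build should
pass silently; on a 32-bit build the offset differs and the constant 32
no longer applies.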

--
Regards/Gruss,
Boris.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Rik van Riel on 2 Apr 2010 18:10

On 04/02/2010 02:37 PM, Linus Torvalds wrote:
> On Fri, 2 Apr 2010, Andrew Morton wrote:
>> On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds<torvalds(a)linux-foundation.org> wrote:
>>
>>>
>>> I think this is likely due to the new scalable anon_vma linking by Rik.
>>
>> Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680
>
> Yup, looks like the same thing, except that bugzilla entry was due to
> swapping rather than hibernation and memory shrinking. But same end
> result, just different reasons for why we were trying to shrink the page
> lists.

Interesting that it is a null pointer dereference, given
that we do not zero out the anon_vma_chain structs before
freeing them.

page_referenced_anon() takes the anon_vma->lock before
walking the list. In the three places where we modify the
anon_vma_chain->same_anon_vma list, we also hold the
lock.

No doubt something in mm/ is doing something silly, but
I have not found anything yet :(

If I had to guess, I'd say maybe we got one of the
mprotect & vma_adjust cases wrong. Maybe a page stayed
around in the LRU (and in a process?) after its anon_vma
already got freed?

There has to be a reason why a very heavy AIM7 workload
and some other stress tests did not trigger it, but a few
people are able to trigger it on their systems...

From: Rik van Riel on 4 Apr 2010 13:30

On 04/04/2010 12:12 PM, Minchan Kim wrote:

> While reviewing the code again because of this BUG, I found something
> strange.
>
> In anon_vma_fork, if anon_vma_clone succeeds but anon_vma_alloc
> fails, what happens? The parent VMA's anon_vmas are left with
> anon_vma_chains pointing to a vma which has been destroyed.
> I couldn't find any cleanup routine that removes this garbage.
> Am I missing something?

Good catch. The parent VMA's anon_vmas will get delinked
eventually, but we need to get rid of the newly allocated
child anon_vmas. You found a hopefully rare memory leak...

We need a call to unlink_anon_vmas(vma) at the error label
to do that.
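
As a rough userspace illustration of that cleanup pattern (not the
kernel code): clone_chains(), unlink_chains(), fork_like() and
demo_error_path() are hypothetical stand-ins invented for this sketch,
the failing allocation is simulated with a flag, and a plain counter
models how many chain entries are still allocated so the leak becomes
visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model: one node per chain entry linked to the child VMA. */
struct chain {
	struct chain *next;
};

struct vma {
	struct chain *chains;	/* entries created by the clone step */
};

static int live_chains;		/* chain entries currently allocated */

/* Stand-in for the clone step: link n freshly allocated entries to vma. */
static int clone_chains(struct vma *vma, int n)
{
	for (int i = 0; i < n; i++) {
		struct chain *c = malloc(sizeof(*c));

		if (!c)
			return -1;
		c->next = vma->chains;
		vma->chains = c;
		live_chains++;
	}
	return 0;
}

/* Stand-in for the unlink step: free every entry hanging off vma. */
static void unlink_chains(struct vma *vma)
{
	while (vma->chains) {
		struct chain *c = vma->chains;

		vma->chains = c->next;
		free(c);
		live_chains--;
	}
}

/*
 * Stand-in for the fork path. alloc_ok simulates whether the second
 * allocation succeeds; do_unlink selects whether the error label
 * cleans up what the clone step created.
 */
static int fork_like(struct vma *vma, bool alloc_ok, bool do_unlink)
{
	if (clone_chains(vma, 3))
		goto out_error;
	if (!alloc_ok)
		goto out_error;
	return 0;

out_error:
	if (do_unlink)
		unlink_chains(vma);
	return -1;
}

/* Demonstration: the error path leaks unless the unlink step runs. */
static void demo_error_path(void)
{
	struct vma leaky = { NULL }, clean = { NULL };

	assert(fork_like(&leaky, false, false) == -1);
	assert(live_chains == 3);	/* three entries left behind */
	unlink_chains(&leaky);		/* tidy up before the next case */

	assert(fork_like(&clean, false, true) == -1);
	assert(live_chains == 0);	/* nothing left behind */
}
```

The only difference between the two cases in demo_error_path() is the
call at the error label, which is the shape of the fix suggested above.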

> But I think it isn't related to this bug because the oops is not in
> vma_address but at anon_vma_chain.next.

Agreed, it's probably not it.

From: KOSAKI Motohiro on 6 Apr 2010 05:00

>
> I think this is likely due to the new scalable anon_vma linking by Rik.
> Nothing else I can imagine should have introduced anything like it.
>
> Rik: the pictures have the information, but you need to look at several to
> see both the oops and the backtrace. Here's a condensed version:
>
> shrink_all_memory ->
> do_try_to_free_pages ->
> shrink_zone ->
> shrink_inactive_list ->
> shrink_page_list ->
> page_referenced
>
> where page_referenced() oopses due to page_referenced_anon() as per
> Borislav's description below.
>
> Added all the usual suspects to the Cc list. Left the full report appended
> so that the new people don't have to search for it on lkml.

Today I reviewed this patch carefully, but I haven't found any bug.

1) anon_vma->list is always protected by anon_vma->lock.
2) Even if someone forgot to take the lock, list_add() and list_del()
   never assign NULL.

So a NULL means one of three possibilities:

a) we see uninitialized data
b) we see already-freed data
c) we see memory corruption caused by another bug

But (a) can't happen because

static inline void __list_add(struct list_head *new,
			      struct list_head *prev,
			      struct list_head *next)
{
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;	/* (*) */
}

Even if an uninitialized entry were linked into the avc list, its
new->next would already have been set to a non-NULL value at (*).
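
The point about (a) can be tried out with a small userspace model of
the list primitives. This is a sketch, not include/linux/list.h:
model_list_add(), model_list_del() and demo_add_del() are names made up
for the example, and POISON1/POISON2 are placeholder values in the
spirit of the kernel's LIST_POISON1/LIST_POISON2. It shows that an
entry which starts out all zeroes has non-NULL links once it has been
added, and that deletion in this model leaves poison values rather than
NULL.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Placeholder poison values for this sketch (non-NULL by construction). */
#define POISON1 ((struct list_head *)0x100)
#define POISON2 ((struct list_head *)0x200)

/* An empty list is a head that points at itself. */
static inline void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Insert entry right after head; every store writes a valid pointer. */
static inline void model_list_add(struct list_head *entry,
				  struct list_head *head)
{
	struct list_head *prev = head, *next = head->next;

	next->prev = entry;
	entry->next = next;
	entry->prev = prev;
	prev->next = entry;
}

/* Unlink entry from its neighbours and poison its own links. */
static inline void model_list_del(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = POISON1;
	entry->prev = POISON2;
}

/* Demonstration: a zero-filled entry never ends up with NULL links. */
static inline void demo_add_del(void)
{
	struct list_head head, e = { NULL, NULL };

	init_list_head(&head);
	model_list_add(&e, &head);
	assert(e.next == &head && e.prev == &head);

	model_list_del(&e);
	assert(e.next == POISON1 && e.prev == POISON2);
	assert(head.next == &head && head.prev == &head);
}
```

Under this model a NULL in ->next has to come from somewhere other than
the add and delete helpers themselves.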

(b) is also impossible. SLAB_DESTROY_BY_RCU delays freeing of the page
that holds the anon_vma until the next RCU grace period. That means
rcu_read_lock()+page_mapped() can see a kfree()ed page, but it is safe:
no one corrupts it.

So now I suspect (c) ;-)

Also, I ran a stress workload with shrink_all_memory() today, but
I couldn't reproduce the issue. Hmm... (perhaps I'm just unlucky;
I frequently fail to reproduce things.)

I'll keep working on it.


From: KOSAKI Motohiro on 6 Apr 2010 06:10

> (b) is also impossible. SLAB_DESTROY_BY_RCU delays freeing of the page
> that holds the anon_vma until the next RCU grace period. That means
> rcu_read_lock()+page_mapped() can see a kfree()ed page, but it is safe:
> no one corrupts it.

By the way, I don't understand why Rik's per-process anon_vma concept
works correctly with KSM. KSM increments anon_vma->ksm_refcount, but it
does not seem guaranteed that vma->anon_vma and page->anon_vma are the
same.

But I guess the bug reporter doesn't use KSM; it's a minor feature.

