2.6.16-rc1-mm3 [Kernel]

From: Nick Piggin on 26 Jan 2006 14:50

Michal Piotrowski wrote:
> Hi,
>
> On 25/01/06, Nick Piggin <nickpiggin(a)yahoo.com.au> wrote:
>
>>Hi,
>>
>>Michal Piotrowski wrote:
>>
>>>------------[ cut here ]------------
>>>kernel BUG at /usr/src/linux-mm/include/linux/mm.h:302!
>>>invalid opcode: 0000 [#1]
>>>PREEMPT SMP DEBUG_PAGEALLOC
>>>last sysfs file: /class/vc/vcsa7/dev
>>>Modules linked in: binfmt_misc thermal fan processor ipv6 w83627hf
>>>hwmon_vid hwmon i2c_isa snd_intel8x0 snd_ac97_codec snd_ac97_bus
>>>sk98lin snd_pcm_oss snd_mixer_oss skge intel_agp snd_pcm snd_timer snd
>>>soundcore i2c_i801 parport_pc parport snd_page_alloc 8250_pnp 8250
>>>serial_core agpgart rtc ide_cd cdrom hw_random unix
>>>CPU: 0
>>>EIP: 0060:[<b013fe81>] Not tainted VLI
>>>EFLAGS: 00210246 (2.6.16-rc1-mm3 #1)
>>>EIP is at release_pages+0x33/0x15e
>>
>>Is it repeatable?
>>
>>If so, I'd imagine it must be a specific driver page which is not properly
>>refcounted somewhere. A bug in generic code would have shown up elsewhere
>>by now.
>>
>>Can you try something like the attached patch and see what it gives you?
>>

Thanks, it confirms my suspicions.

Can you try the following patch, please?
It appears the warnings were brought out by my improvement to
the put_page_testzero debugging code (which previously did not
check that we might be attempting to free a constituent compound
page).

Can you test the following patch please?

--
SUSE Labs, Novell Inc.

From: Michal Piotrowski on 26 Jan 2006 15:00

Hi,

On 26/01/06, Nick Piggin <nickpiggin(a)yahoo.com.au> wrote:
[snip]
> Thanks, it confirms my suspicions.
>
> Can you try the following patch, please?
> It appears the warnings were brought out by my improvement to
> the put_page_testzero debugging code (which previously did not
> check that we might be attempting to free a constituent compound
> page).
>
> Can you test the following patch please?
>
> --
> SUSE Labs, Novell Inc.
>
>
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -15,6 +15,7 @@
> #include <linux/prio_tree.h>
> #include <linux/fs.h>
> #include <linux/mutex.h>
> +#include <linux/kallsyms.h>
>
> struct mempolicy;
> struct anon_vma;
> @@ -264,6 +265,8 @@ struct page {
> void *virtual; /* Kernel virtual address (NULL if
> not kmapped, ie. highmem) */
> #endif /* WANT_PAGE_VIRTUAL */
> +
> + void *debug;
> };
>
> #define page_private(page) ((page)->private)
> @@ -294,8 +297,14 @@ struct page {
> */
> static inline int put_page_testzero(struct page *page)
> {
> - BUG_ON(atomic_read(&page->_count) == 0);
> - return atomic_dec_and_test(&page->_count);
> + if (unlikely(atomic_read(&page->_count) == 0)) {
> + printk(KERN_WARNING "put_page_testzero found free page (flags = %lx)\n", page->flags);
> + if (page->debug)
> + print_symbol(KERN_WARNING "nopage is %s\n", (unsigned long)page->debug);
> + WARN_ON(1);
> + return 0;
> + } else
> + return atomic_dec_and_test(&page->_count);
> }
>
> /*
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -2056,6 +2056,8 @@ retry:
> if (new_page == NOPAGE_OOM)
> return VM_FAULT_OOM;
>
> + new_page->debug = (struct address_space *)vma->vm_ops->nopage;
> +
> /*
> * Should we do an early C-O-W break?
> */
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -521,6 +521,8 @@ static int prep_new_page(struct page *pa
> if (PageReserved(page))
> return 1;
>
> + page->debug = NULL;
> +
> page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
> 1 << PG_referenced | 1 << PG_arch_1 |
> 1 << PG_checked | 1 << PG_mappedtodisk);
>
>
>

I have tried this patch, here is dmesg:
http://www.stardust.webpages.pl/files/mm/2.6.16-rc1-mm3/mm-dmesg

Regards,
Michal Piotrowski
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Nick Piggin on 26 Jan 2006 15:00

Nick Piggin wrote:

> Thanks, it confirms my suspicions.
>
> Can you try the following patch, please?
> It appears the warnings were brought out by my improvement to
> the put_page_testzero debugging code (which previously did not
> check that we might be attempting to free a constituent compound
> page).
>
> Can you test the following patch please?
>

Sorry, wrong patch.

Note the warnings you are seeing should not result in memory
corruption, but will result in the given hugepage leaking.

--
SUSE Labs, Novell Inc.
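
To illustrate why the warning path leaks rather than corrupts, here is a minimal user-space sketch in C of how the debug version of put_page_testzero() behaves. It is not kernel code: struct model_page, model_page_init() and model_put_page_testzero() are hypothetical names, and C11 atomics stand in for the kernel's atomic_t. It assumes, as in the debug patch quoted earlier in the thread, that a zero count produces a warning and a return value of 0 instead of a BUG.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical stand-in for struct page; only the reference count is modelled. */
struct model_page {
    atomic_int count;   /* plays the role of page->_count */
};

static void model_page_init(struct model_page *page, int initial)
{
    assert(initial >= 0);
    atomic_init(&page->count, initial);
}

/*
 * Returns 1 if this call dropped the last reference, 0 otherwise.
 * If the count is already zero, warn and return 0 without decrementing,
 * mirroring the debug patch (a warning in place of the BUG_ON).
 */
static int model_put_page_testzero(struct model_page *page)
{
    if (atomic_load(&page->count) == 0) {
        fprintf(stderr, "put_page_testzero found free page\n");
        return 0;   /* caller never sees "last reference", so nothing is freed */
    }
    /* atomic_fetch_sub returns the old value; old == 1 means the count is now 0 */
    return atomic_fetch_sub(&page->count, 1) == 1;
}
```

In this model, a caller that frees only when the function returns 1 skips the free on the warning path. Nothing is freed twice, so memory is not corrupted, but the page is never handed back to the allocator, which is one way to read the remark about the hugepage leaking.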

From: Andy Whitcroft on 26 Jan 2006 19:30

Pekka Enberg wrote:
> Hi Andy,
>
> Pekka Enberg wrote:
>
>>>Does vanilla 2.6.16-rc1 work for you? The oops definitely makes me think
>>>it's slab related but the other patches don't seem likely suspects.
>
>
> On 1/25/06, Andy Whitcroft <apw(a)shadowen.org> wrote:
>
>>None of the other patches you suggested seem to be it either :/. Yes,
>>2.6.16-rc1 was ok on the boxes in question.
>
>
> Then I don't see how it could be slab related. At this point, the only
> suggestion I have is bisecting akpm-style:

Yes. I think I have this one. It appears that the patch below is the
trigger for all our recent panic woes. The last of the testing should
complete in the next few hours and I will be able to confirm that
hypothesis; results so far are all good.

reduce-size-of-percpudata-and-make-sure-per_cpuobject.patch

From the nature of the patch I would guess it's likely not the patch
itself that is at issue, but some errant user of percpu space. Perhaps a
more gentle approach is needed such that we get to the point at which
consoles are available and we can report the issue (at least as an
option).

Eric, can you give us some help confirming whether there is an issue with
the patch, or finding the source of the errant access?

Cheers.

-apw
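
The bisection approach mentioned above can be sketched as a binary search over an ordered patch series. This is only an illustration in C under stated assumptions: first_bad_patch(), boots_fn and boots_until() are hypothetical names, the kernel with no patches applied is assumed to boot, the kernel with all patches applied is assumed to fail, and the failure is assumed to persist once the offending patch is in. A real run would replace the callback with "apply the first n patches, build, boot test".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: nonzero if a kernel with the first n patches applied boots. */
typedef int (*boots_fn)(size_t n_applied, void *ctx);

/*
 * Returns the 0-based index of the first patch whose application breaks boot.
 * Assumes a prefix of 0 patches boots and the full series of n_patches fails.
 */
static size_t first_bad_patch(size_t n_patches, boots_fn boots, void *ctx)
{
    assert(n_patches > 0 && boots != NULL);

    size_t good = 0;          /* largest prefix length known to boot */
    size_t bad = n_patches;   /* smallest prefix length known to fail */

    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (boots(mid, ctx))
            good = mid;
        else
            bad = mid;
    }
    return bad - 1;           /* the patch that turns prefix "good" into prefix "bad" */
}

/* Example predicate for experimenting: ctx points at the culprit's 0-based index. */
static int boots_until(size_t n_applied, void *ctx)
{
    size_t culprit = *(size_t *)ctx;
    return n_applied <= culprit;   /* boots as long as the culprit is not applied */
}
```

Each step halves the range of suspect patches, so a series of a few hundred patches needs fewer than ten build-and-boot cycles under these assumptions.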

From: Andrew Morton on 26 Jan 2006 22:30

Andy Whitcroft <apw(a)shadowen.org> wrote:
>
> Yes. I think I have this one. It appears that the patch below is the
> trigger for all our recent panic woes. The last of the testing should
> complete in the next few hours and I will be able to confirm that
> hypothesis; results so far are all good.
>
> reduce-size-of-percpudata-and-make-sure-per_cpuobject.patch

That patch did have some missed conversions, which might well explain the
crash.

Thanks for narrowing it down - I'll keep that patch in next -mm (and will
include the known fixups). Could you please boot test that? If we're
still in trouble, I'll drop it.

