From: Andre Noll on
Hi

MATLAB triggers the following bug on both of our new 16-way Opteron
machines (64G RAM). The same kernel runs without problems on a
bunch of smaller (8-way, 4-way, max 32G RAM) cluster nodes.

Any hints?
Andre


----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522
invalid opcode: 0000 [1] SMP
CPU 14
Pid: 12948, comm: MATLAB Not tainted 2.6.18-tt64-6 #1
RIP: 0010:[<ffffffff8015ee54>] [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP: 0018:ffff810207a19d70 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff81101aa1dd90 RCX: 000000000000001c
RDX: 00002aaad63b2000 RSI: 00002aaad63b2000 RDI: ffff81102d129ce8
RBP: 0000000f59e3b067 R08: 0000000000000023 R09: ffff810e30000680
R10: ffff8106079df408 R11: ffff8105970284a8 R12: ffff81102d129ce8
R13: ffff810e3005e160 R14: 00002aaad63b2000 R15: 0000000000000000
FS: 00002b3f6aa704a0(0000) GS:ffff810e301b9440(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002aaad8592000 CR3: 00000005bdd17000 CR4: 00000000000006a0
Process MATLAB (pid: 12948, threadinfo ffff810207a18000, task ffff8101ea581080)
Stack: ffffffff80157a37 ffff810e30000680 00002aaad63b3000 ffff81101aa1dd98
ffff81102fb53668 ffffffffffffffb8 ffff8106079df400 ffff810207a19e98
00002aaad6400000 ffff810feecdc088 00002aaad6400000 ffff810b24d11588
Call Trace:
[<ffffffff80157a37>] zap_pte_range+0x1c4/0x2c0
[<ffffffff80157d0e>] unmap_page_range+0x1db/0x23a
[<ffffffff80157e5b>] unmap_vmas+0xee/0x1e3
[<ffffffff8015c6fe>] unmap_region+0xb4/0x127
[<ffffffff8015caa7>] do_munmap+0x183/0x19a
[<ffffffff8015caf7>] sys_munmap+0x39/0x52
[<ffffffff80109726>] system_call+0x7e/0x83


Code: 0f 0b 68 c0 4e 46 80 c2 0a 02 31 f6 f6 47 18 01 40 0f 94 c6
RIP [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP <ffff810207a19d70>
<0>Bad page state in process 'MATLAB'
page:ffff81102d129ce8 flags:0x1e00000000000014 mapping:0000000000000000 mapcount:-1 count:0
Trying to fix it up, but a reboot is needed
Backtrace:

Call Trace:
[<ffffffff8014f325>] bad_page+0x51/0x7b
[<ffffffff8014f788>] prep_new_page+0x57/0x15f
[<ffffffff8014feb1>] buffered_rmqueue+0x128/0x14a
[<ffffffff8015002c>] get_page_from_freelist+0xbd/0xe2
[<ffffffff801500a3>] __alloc_pages+0x52/0x29f
[<ffffffff80159a26>] do_anonymous_page+0x46/0x1b8
[<ffffffff8015a023>] __handle_mm_fault+0x18f/0x29d
[<ffffffff8011b359>] do_page_fault+0x1bd/0x4e7
[<ffffffff8015bf42>] do_mmap_pgoff+0x5fd/0x6de
[<ffffffff8022094b>] __up_write+0x14/0x108
[<ffffffff8010a3f9>] error_exit+0x0/0x84

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522
invalid opcode: 0000 [2] SMP
CPU 14
Pid: 12079, comm: MATLAB Tainted: G B 2.6.18-tt64-6 #1
RIP: 0010:[<ffffffff8015ee54>] [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP: 0018:ffff810a0916fd70 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff810e7c68d300 RCX: 000000000000001c
RDX: ffff810e30000000 RSI: 00002aab5ea60000 RDI: ffff81102bb25c10
RBP: 0000000ef53ee067 R08: 0000000000000023 R09: ffff810e30000680
R10: ffff810c01b69c88 R11: ffff810bce5484a8 R12: ffff81102bb25c10
R13: ffff810e3005e160 R14: 00002aab5ea60000 R15: 0000000000000000
FS: 00002b1f8100d4a0(0000) GS:ffff810e301b9440(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aab31345000 CR3: 0000000c1bebd000 CR4: 00000000000006a0
Process MATLAB (pid: 12079, threadinfo ffff810a0916e000, task ffff810a092d2180)
Stack: ffffffff80157a37 ffff810e30000680 00002aab5ea61000 ffff810e7c68d308
ffff81102a0b6ee8 00000000ffffff9f ffff810c01b69c80 ffff810a0916fe98
00002aab5ec00000 ffff810fd4924298 00002aab5ec00000 ffff8100812777a8
Call Trace:
[<ffffffff80157a37>] zap_pte_range+0x1c4/0x2c0
[<ffffffff80157d0e>] unmap_page_range+0x1db/0x23a
[<ffffffff80157e5b>] unmap_vmas+0xee/0x1e3
[<ffffffff8015c6fe>] unmap_region+0xb4/0x127
[<ffffffff8015caa7>] do_munmap+0x183/0x19a
[<ffffffff8015caf7>] sys_munmap+0x39/0x52
[<ffffffff80109726>] system_call+0x7e/0x83


Code: 0f 0b 68 c0 4e 46 80 c2 0a 02 31 f6 f6 47 18 01 40 0f 94 c6
RIP [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP <ffff810a0916fd70>
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522
invalid opcode: 0000 [3] SMP
CPU 15
Pid: 20344, comm: MATLAB Tainted: G B 2.6.18-tt64-6 #1
RIP: 0010:[<ffffffff8015ee54>] [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP: 0018:ffff8101113b7d70 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff810e87e8d660 RCX: 000000000000001c
RDX: ffff810e30000000 RSI: 00002aaaf6acc000 RDI: ffff81102ef01410
RBP: 0000000fe24ee067 R08: 0000000000000023 R09: ffff810e30000680
R10: ffff81010a2d5248 R11: ffff810162b6a608 R12: ffff81102ef01410
R13: ffff810e300639e0 R14: 00002aaaf6acc000 R15: 0000000000000000
FS: 00002b88026364a0(0000) GS:ffff810e301f4440(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002aab0284c0d0 CR3: 00000001038e3000 CR4: 00000000000006a0
Process MATLAB (pid: 20344, threadinfo ffff8101113b6000, task ffff81018fbd18a0)
Stack: ffffffff80157a37 ffff810e30000680 00002aaaf6acd000 ffff810e87e8d668
ffff81102a33aee8 00000000ffffff3f ffff81010a2d5240 ffff8101113b7e98
00002aaaf6c00000 ffff810ff0aa7818 00002aaaf6c00000 ffff810eaffd3da8
Call Trace:
[<ffffffff80157a37>] zap_pte_range+0x1c4/0x2c0
[<ffffffff80157d0e>] unmap_page_range+0x1db/0x23a
[<ffffffff80157e5b>] unmap_vmas+0xee/0x1e3
[<ffffffff8015c6fe>] unmap_region+0xb4/0x127
[<ffffffff8015caa7>] do_munmap+0x183/0x19a
[<ffffffff8015caf7>] sys_munmap+0x39/0x52
[<ffffffff80109726>] system_call+0x7e/0x83


Code: 0f 0b 68 c0 4e 46 80 c2 0a 02 31 f6 f6 47 18 01 40 0f 94 c6
RIP [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP <ffff8101113b7d70>
<3>swap_free: Unused swap offset entry 00000060
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522
invalid opcode: 0000 [4] SMP
CPU 14
Pid: 5985, comm: MATLAB Tainted: G B 2.6.18-tt64-6 #1
RIP: 0010:[<ffffffff8015ee54>] [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP: 0018:ffff810204875d70 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff810e87e852b0 RCX: ffff8101ef34d9
From: Nick Piggin on
Andre Noll wrote:
> Hi
>
> MATLAB triggers the following bug on both of our new 16-way Opteron
> machines (64G RAM). The same kernel runs without problems on a
> bunch of smaller (8-way, 4-way, max 32G RAM) cluster nodes.
>
> Any hints?
> Andre
>
>
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522

Ah, this old thing. I hope it is repeatable?

What we really want is the bit before this, the "Eeek! page_mapcount went
negative" part.

It is also nice if we can work out where the page actually came from. The
following attached patch should help out a bit with that, if you could
run with it?

Thanks a lot for reporting, this is very useful.

Nick

--
SUSE Labs, Novell Inc.
From: Andre Noll on
On 23:59, Nick Piggin wrote:

> Ah, this old thing. I hope it is repeatable?

Well, it happened on both of the new machines we got last week. One
of these is still up BTW and I'm able to ssh into it.

> What we really want is the bit before this, the "Eeek! page_mapcount went
> negative" part.

There's no such message in the log. The preceding lines are just normal
startup messages:

Adding 16779852k swap on /dev/sda1. Priority:42 extents:1 across:16779852k
Adding 16779852k swap on /dev/sdb1. Priority:42 extents:1 across:16779852k
process `syslogd' is using obsolete setsockopt SO_BSDCOMPAT

> It is also nice if we can work out where the page actually came from. The
> following attached patch should help out a bit with that, if you could
> run with it?

Okay. I'll reboot with your patch and let you know if it crashes again.

Thanks for the quick response.
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe
From: Peter Zijlstra on
On Wed, 2006-10-04 at 17:42 +0200, Andre Noll wrote:
> On 23:59, Nick Piggin wrote:
>
> > Ah, this old thing. I hope it is repeatable?
>
> Well, it happened on both of the new machines we got last week. One
> of these is still up BTW and I'm able to ssh into it.
>
> > What we really want is the bit before this, the "Eeek! page_mapcount went
> > negative" part.
>
> There's no such message in the log. The preceding lines are just normal
> startup messages:
>
> Adding 16779852k swap on /dev/sda1. Priority:42 extents:1 across:16779852k
> Adding 16779852k swap on /dev/sdb1. Priority:42 extents:1 across:16779852k
> process `syslogd' is using obsolete setsockopt SO_BSDCOMPAT
>
> > It is also nice if we can work out where the page actually came from. The
> > following attached patch should help out a bit with that, if you could
> > run with it?
>
> Okay. I'll reboot with your patch and let you know if it crashes again.

Enable CONFIG_DEBUG_VM to get that.

From: Andre Noll on
On 17:49, Peter Zijlstra wrote:

> enable CONFIG_DEBUG_VM to get that.

Yup, that was disabled. It's enabled now.

Thanks
Andre

--
The only person who always got his work done by Friday was Robinson Crusoe