From: Andre Noll on 4 Oct 2006 06:50

Hi

MATLAB triggers the following bug on both of our new 16-way opteron
machines (64G Ram): The same kernel is running with no problems on a
bunch of smaller (8-way, 4-way, max 32G Ram) cluster nodes.

Any hints?
Andre

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522
invalid opcode: 0000 [1] SMP
CPU 14
Pid: 12948, comm: MATLAB Not tainted 2.6.18-tt64-6 #1
RIP: 0010:[<ffffffff8015ee54>] [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP: 0018:ffff810207a19d70 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff81101aa1dd90 RCX: 000000000000001c
RDX: 00002aaad63b2000 RSI: 00002aaad63b2000 RDI: ffff81102d129ce8
RBP: 0000000f59e3b067 R08: 0000000000000023 R09: ffff810e30000680
R10: ffff8106079df408 R11: ffff8105970284a8 R12: ffff81102d129ce8
R13: ffff810e3005e160 R14: 00002aaad63b2000 R15: 0000000000000000
FS: 00002b3f6aa704a0(0000) GS:ffff810e301b9440(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002aaad8592000 CR3: 00000005bdd17000 CR4: 00000000000006a0
Process MATLAB (pid: 12948, threadinfo ffff810207a18000, task ffff8101ea581080)
Stack: ffffffff80157a37 ffff810e30000680 00002aaad63b3000 ffff81101aa1dd98
ffff81102fb53668 ffffffffffffffb8 ffff8106079df400 ffff810207a19e98
00002aaad6400000 ffff810feecdc088 00002aaad6400000 ffff810b24d11588
Call Trace:
[<ffffffff80157a37>] zap_pte_range+0x1c4/0x2c0
[<ffffffff80157d0e>] unmap_page_range+0x1db/0x23a
[<ffffffff80157e5b>] unmap_vmas+0xee/0x1e3
[<ffffffff8015c6fe>] unmap_region+0xb4/0x127
[<ffffffff8015caa7>] do_munmap+0x183/0x19a
[<ffffffff8015caf7>] sys_munmap+0x39/0x52
[<ffffffff80109726>] system_call+0x7e/0x83
Code: 0f 0b 68 c0 4e 46 80 c2 0a 02 31 f6 f6 47 18 01 40 0f 94 c6
RIP [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP <ffff810207a19d70>
<0>Bad page state in process 'MATLAB'
page:ffff81102d129ce8 flags:0x1e00000000000014 mapping:0000000000000000 mapcount:-1 count:0
Trying to fix it up, but a reboot is needed
Backtrace:
Call Trace:
[<ffffffff8014f325>] bad_page+0x51/0x7b
[<ffffffff8014f788>] prep_new_page+0x57/0x15f
[<ffffffff8014feb1>] buffered_rmqueue+0x128/0x14a
[<ffffffff8015002c>] get_page_from_freelist+0xbd/0xe2
[<ffffffff801500a3>] __alloc_pages+0x52/0x29f
[<ffffffff80159a26>] do_anonymous_page+0x46/0x1b8
[<ffffffff8015a023>] __handle_mm_fault+0x18f/0x29d
[<ffffffff8011b359>] do_page_fault+0x1bd/0x4e7
[<ffffffff8015bf42>] do_mmap_pgoff+0x5fd/0x6de
[<ffffffff8022094b>] __up_write+0x14/0x108
[<ffffffff8010a3f9>] error_exit+0x0/0x84

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522
invalid opcode: 0000 [2] SMP
CPU 14
Pid: 12079, comm: MATLAB Tainted: G B 2.6.18-tt64-6 #1
RIP: 0010:[<ffffffff8015ee54>] [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP: 0018:ffff810a0916fd70 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff810e7c68d300 RCX: 000000000000001c
RDX: ffff810e30000000 RSI: 00002aab5ea60000 RDI: ffff81102bb25c10
RBP: 0000000ef53ee067 R08: 0000000000000023 R09: ffff810e30000680
R10: ffff810c01b69c88 R11: ffff810bce5484a8 R12: ffff81102bb25c10
R13: ffff810e3005e160 R14: 00002aab5ea60000 R15: 0000000000000000
FS: 00002b1f8100d4a0(0000) GS:ffff810e301b9440(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aab31345000 CR3: 0000000c1bebd000 CR4: 00000000000006a0
Process MATLAB (pid: 12079, threadinfo ffff810a0916e000, task ffff810a092d2180)
Stack: ffffffff80157a37 ffff810e30000680 00002aab5ea61000 ffff810e7c68d308
ffff81102a0b6ee8 00000000ffffff9f ffff810c01b69c80 ffff810a0916fe98
00002aab5ec00000 ffff810fd4924298 00002aab5ec00000 ffff8100812777a8
Call Trace:
[<ffffffff80157a37>] zap_pte_range+0x1c4/0x2c0
[<ffffffff80157d0e>] unmap_page_range+0x1db/0x23a
[<ffffffff80157e5b>] unmap_vmas+0xee/0x1e3
[<ffffffff8015c6fe>] unmap_region+0xb4/0x127
[<ffffffff8015caa7>] do_munmap+0x183/0x19a
[<ffffffff8015caf7>] sys_munmap+0x39/0x52
[<ffffffff80109726>] system_call+0x7e/0x83
Code: 0f 0b 68 c0 4e 46 80 c2 0a 02 31 f6 f6 47 18 01 40 0f 94 c6
RIP [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP <ffff810a0916fd70>

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522
invalid opcode: 0000 [3] SMP
CPU 15
Pid: 20344, comm: MATLAB Tainted: G B 2.6.18-tt64-6 #1
RIP: 0010:[<ffffffff8015ee54>] [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP: 0018:ffff8101113b7d70 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff810e87e8d660 RCX: 000000000000001c
RDX: ffff810e30000000 RSI: 00002aaaf6acc000 RDI: ffff81102ef01410
RBP: 0000000fe24ee067 R08: 0000000000000023 R09: ffff810e30000680
R10: ffff81010a2d5248 R11: ffff810162b6a608 R12: ffff81102ef01410
R13: ffff810e300639e0 R14: 00002aaaf6acc000 R15: 0000000000000000
FS: 00002b88026364a0(0000) GS:ffff810e301f4440(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002aab0284c0d0 CR3: 00000001038e3000 CR4: 00000000000006a0
Process MATLAB (pid: 20344, threadinfo ffff8101113b6000, task ffff81018fbd18a0)
Stack: ffffffff80157a37 ffff810e30000680 00002aaaf6acd000 ffff810e87e8d668
ffff81102a33aee8 00000000ffffff3f ffff81010a2d5240 ffff8101113b7e98
00002aaaf6c00000 ffff810ff0aa7818 00002aaaf6c00000 ffff810eaffd3da8
Call Trace:
[<ffffffff80157a37>] zap_pte_range+0x1c4/0x2c0
[<ffffffff80157d0e>] unmap_page_range+0x1db/0x23a
[<ffffffff80157e5b>] unmap_vmas+0xee/0x1e3
[<ffffffff8015c6fe>] unmap_region+0xb4/0x127
[<ffffffff8015caa7>] do_munmap+0x183/0x19a
[<ffffffff8015caf7>] sys_munmap+0x39/0x52
[<ffffffff80109726>] system_call+0x7e/0x83
Code: 0f 0b 68 c0 4e 46 80 c2 0a 02 31 f6 f6 47 18 01 40 0f 94 c6
RIP [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP <ffff8101113b7d70>
<3>swap_free: Unused swap offset entry 00000060

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522
invalid opcode: 0000 [4] SMP
CPU 14
Pid: 5985, comm: MATLAB Tainted: G B 2.6.18-tt64-6 #1
RIP: 0010:[<ffffffff8015ee54>] [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP: 0018:ffff810204875d70 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff810e87e852b0 RCX: ffff8101ef34d9

From: Nick Piggin on 4 Oct 2006 10:10

Andre Noll wrote:
> Hi
>
> MATLAB triggers the following bug on both of our new 16-way opteron
> machines (64G Ram): The same kernel is running with no problems on a
> bunch of smaller (8-way, 4-way, max 32G Ram) cluster nodes.
>
> Any hints?
> Andre
>
>
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522

Ah, this old thing. I hope it is repeatable?

What we really want is the bit before this, the "Eeek! page_mapcount went
negative" part.

It is also nice if we can work out where the page actually came from. The
following attached patch should help out a bit with that, if you could
run with it?

Thanks a lot for reporting, this is very useful.

Nick

--
SUSE Labs, Novell Inc.

From: Andre Noll on 4 Oct 2006 11:50

On 23:59, Nick Piggin wrote:

> Ah, this old thing. I hope it is repeatable?

Well, it happened on both of the new machines we got last week. One
of these is still up BTW and I'm able to ssh into it.

> What we really want is the bit before this, the "Eeek! page_mapcount went
> negative" part.

There's no such message in the log. The preceding lines are just normal
startup messages:

Adding 16779852k swap on /dev/sda1. Priority:42 extents:1 across:16779852k
Adding 16779852k swap on /dev/sdb1. Priority:42 extents:1 across:16779852k
process `syslogd' is using obsolete setsockopt SO_BSDCOMPAT

> It is also nice if we can work out where the page actually came from. The
> following attached patch should help out a bit with that, if you could
> run with it?

Okay. I'll reboot with your patch and let you know if it crashes again.

Thanks for the quick response.
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe

From: Peter Zijlstra on 4 Oct 2006 12:00

On Wed, 2006-10-04 at 17:42 +0200, Andre Noll wrote:
> On 23:59, Nick Piggin wrote:
>
> > Ah, this old thing. I hope it is repeatable?
>
> Well, it happened on both of the new machines we got last week. One
> of these is still up BTW and I'm able to ssh into it.
>
> > What we really want is the bit before this, the "Eeek! page_mapcount went
> > negative" part.
>
> There's no such message in the log. The preceding lines are just normal
> startup messages:
>
> Adding 16779852k swap on /dev/sda1. Priority:42 extents:1 across:16779852k
> Adding 16779852k swap on /dev/sdb1. Priority:42 extents:1 across:16779852k
> process `syslogd' is using obsolete setsockopt SO_BSDCOMPAT
>
> > It is also nice if we can work out where the page actually came from. The
> > following attached patch should help out a bit with that, if you could
> > run with it?
>
> Okay. I'll reboot with your patch and let you know if it crashes again.

enable CONFIG_DEBUG_VM to get that.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Andre Noll on 4 Oct 2006 12:20

On 17:49, Peter Zijlstra wrote:
> enable CONFIG_DEBUG_VM to get that.

Yup, that was disabled. It's enabled now.

Thanks
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe