From: Robert Redelmeier on 12 Mar 2010 23:38

ANTant(a)zimage.com wrote in part:
> Bah, the error came back again after my tests:
>
> dmesg:
> [32399.988020] Machine check events logged
>
> From /var/log/messages:
> Mar 12 14:45:16 foobar kernel: [32399.988020] Machine check events logged
> Mar 12 14:45:16 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!
> Mar 12 14:45:16 foobar mcelog: Please contact your hardware vendor
> Mar 12 14:45:16 foobar mcelog: MCE 0
> Mar 12 14:45:16 foobar mcelog: CPU 1 1 instruction cache
> Mar 12 14:45:16 foobar mcelog: ADDR c11b6ff0
> Mar 12 14:45:16 foobar mcelog: TIME 1268433916 Fri Mar 12 14:45:16 2010
> Mar 12 14:45:16 foobar mcelog: TLB parity error in virtual array
> Mar 12 14:45:16 foobar mcelog: TLB error 'instruction transaction, level 1'
> Mar 12 14:45:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
> Mar 12 14:45:16 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
> Mar 12 14:45:16 foobar mcelog: CPUID Vendor AMD Family 15 Model 43

Noting the addr is in kernel space and the instruction cache,
this is going to take much ingenuity to replicate :(

-- Robert
From: Yousuf Khan on 13 Mar 2010 00:11

Robert Redelmeier wrote:
> Ant <ant(a)zimage.comant> wrote in part:
>> I am planning to leave them running for about 15 hours straight until
>> I need to use the box locally again tonight. I am curious if I will
>> get no errors and crashes like yesterday's seven processes test.
>
> Yes, this seems to be running well. I'm not sure what else to suggest.
> Odd to see stability under load but instability at idle. mobo caps/PS?
> You might try running 66 `burnMMX O` or 132 `burnMMX N` or even 264
> `burnMMX M` to increase TLB swapping (more smaller maps).
> But that may be too much trouble.

My guess here, but is it possible that the TLB only decays if it isn't
being used constantly?

Yousuf Khan
From: Ant on 13 Mar 2010 01:42

On 3/12/2010 8:38 PM PT, Robert Redelmeier typed:

>> Bah, the error came back again after my tests:
>>
>> dmesg:
>> [32399.988020] Machine check events logged
>>
>> From /var/log/messages:
>> Mar 12 14:45:16 foobar kernel: [32399.988020] Machine check events logged
>> Mar 12 14:45:16 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!
>> Mar 12 14:45:16 foobar mcelog: Please contact your hardware vendor
>> Mar 12 14:45:16 foobar mcelog: MCE 0
>> Mar 12 14:45:16 foobar mcelog: CPU 1 1 instruction cache
>> Mar 12 14:45:16 foobar mcelog: ADDR c11b6ff0
>> Mar 12 14:45:16 foobar mcelog: TIME 1268433916 Fri Mar 12 14:45:16 2010
>> Mar 12 14:45:16 foobar mcelog: TLB parity error in virtual array
>> Mar 12 14:45:16 foobar mcelog: TLB error 'instruction transaction, level 1'
>> Mar 12 14:45:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
>> Mar 12 14:45:16 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
>> Mar 12 14:45:16 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
>
> Noting the addr is in kernel space and the instruction cache,
> this is going to take much ingenuity to replicate :(

You just gave me an idea:

# cat /var/log/messages |grep MCGSTATUS
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 09:37:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 14:29:37 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 15:12:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 16:19:37 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 17:42:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 18:14:37 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 18:42:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 18:59:37 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 19:32:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 19:39:37 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 20:12:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 21:14:37 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 21:47:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 22:24:37 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 23:32:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 08:25:24 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 13:52:54 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 15:35:24 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 15:42:54 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 15:50:24 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 15:57:54 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 17:30:24 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 18:07:54 foobar mcelog: STATUS d400000000010011 MCGSTATUS 0
Mar 7 19:55:24 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 20:12:54 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 21:55:24 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 22:02:54 foobar mcelog: STATUS d400000000010011 MCGSTATUS 0
Mar 7 23:05:24 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 8 14:22:55 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 8 15:55:25 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 9 01:52:55 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 9 05:15:25 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 9 06:27:55 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 9 06:40:25 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 9 21:17:55 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 10 00:35:25 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 10 01:27:55 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 10 02:50:25 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 10 04:52:55 foobar mcelog: STATUS d400000000010011 MCGSTATUS 0
Mar 10 05:10:25 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 10 22:42:55 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 10 22:50:25 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 10 23:57:55 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 11 00:17:55 foobar mcelog: STATUS 9000000000000171 MCGSTATUS 0
Mar 12 14:45:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 12 22:02:46 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0

There are also three d400000000010011 and one 9000000000000171 status
values. Do they give any more clues? I wonder if we can test those areas
outside of my Debian install with any LiveCD? I wonder if Memtest86 even
tests those. I need to look at the kernel panic errors in text mode and
see if they match too, but they scroll off too quickly. :(
--
"Whence we see spiders, flies, or ants entombed and preserved forever in
amber, a more than royal tomb." --Sir Francis Bacon in Historia Vitæ et
Mortis; Sylva Sylvarum, Cent. i. Exper. 100.
   /\___/\
  / /\ /\ \    Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
 | |o   o| |   Ant's Quality Foraged Links: http://aqfl.net
    \ _ /      Nuke ANT from e-mail address: philpi(a)earthlink.netANT
     ( )       or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.
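
On the question of whether the other STATUS values give more clues: the sketch below splits a STATUS value into the architectural flag bits at the top of the x86 machine-check status register and the 16-bit error code at the bottom. It is a rough illustration, not mcelog's own decoder. The function name `decode_status` and the tables are made up for this example, and the model-specific bits 31:16 are left undecoded.

```python
# Rough sketch of decoding an x86 MCi_STATUS value as printed by mcelog.
# Only the architectural fields are handled; bits 31:16 are model-specific.

# Flag bits at the top of the 64-bit status register.
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten before being read
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}

# Sub-fields of the 16-bit error code.
TT = {0: "instruction", 1: "data", 2: "generic"}
LL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}
RRRR = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
        5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}


def decode_status(status):
    """Return (list of flag names, description of the error code)."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    tt = TT.get((code >> 2) & 3, "reserved")
    ll = LL[code & 3]
    if (code & 0xFFF0) == 0x0010:
        # TLB errors: 0000 0000 0001 TTLL
        kind = f"TLB error, {tt}, {ll}"
    elif (code & 0xFF00) == 0x0100:
        # Cache errors: 0000 0001 RRRR TTLL
        rrrr = RRRR.get((code >> 4) & 0xF, "reserved")
        kind = f"cache error, {rrrr}, {tt}, {ll}"
    else:
        kind = f"other (error code {code:#06x})"
    return flags, kind


for value in (0x9400000000010011, 0xD400000000010011, 0x9000000000000171):
    flags, kind = decode_status(value)
    print(f"{value:016x}: {' '.join(flags)}; {kind}")
```

Read this way, d400000000010011 is the same level 1 instruction TLB error as 9400000000010011 with the overflow bit also set, so a second error arrived before the first was read out. 9000000000000171 is a different event: a level 1 instruction cache error on an evict, with no valid address recorded.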
From: Ant on 13 Mar 2010 01:48

On 3/12/2010 9:11 PM PT, Yousuf Khan typed:

> My guess here, but is it possible that the TLB only decays if it isn't
> being used constantly?

Last night I ran memtest86+ v4.00's test #9.
http://www.memtest86.com/tech.html#descri says:

"Test 9 [Bit fade test, 90 min, 2 patterns]
The bit fade test initializes all of memory with a pattern and then
sleeps for 90 minutes. Then memory is examined to see if any memory bits
have changed. All ones and all zero patterns are used. This test takes 3
hours to complete. The Bit Fade test is not included in the normal test
sequence and must be run manually via the runtime configuration menu."

I only ran it for over 3.25 hours and it passed (only one test).
Shouldn't this test that problem? Or is that TLB somewhere else? Maybe I
need to run it longer and more?
--
"Ants die in sugar." --Malawi
   /\___/\
  / /\ /\ \    Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
 | |o   o| |   Ant's Quality Foraged Links: http://aqfl.net
    \ _ /      Nuke ANT from e-mail address: philpi(a)earthlink.netANT
     ( )       or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.
From: Yousuf Khan on 13 Mar 2010 01:49

ANTant(a)zimage.com wrote:
> Bah, the error came back again after my tests:
>
> dmesg:
> [32399.988020] Machine check events logged
>
> From /var/log/messages:
> Mar 12 14:45:16 foobar kernel: [32399.988020] Machine check events logged
> Mar 12 14:45:16 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!
> Mar 12 14:45:16 foobar mcelog: Please contact your hardware vendor
> Mar 12 14:45:16 foobar mcelog: MCE 0
> Mar 12 14:45:16 foobar mcelog: CPU 1 1 instruction cache
> Mar 12 14:45:16 foobar mcelog: ADDR c11b6ff0
> Mar 12 14:45:16 foobar mcelog: TIME 1268433916 Fri Mar 12 14:45:16 2010
> Mar 12 14:45:16 foobar mcelog: TLB parity error in virtual array
> Mar 12 14:45:16 foobar mcelog: TLB error 'instruction transaction, level 1'
> Mar 12 14:45:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
> Mar 12 14:45:16 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
> Mar 12 14:45:16 foobar mcelog: CPUID Vendor AMD Family 15 Model 43

Error always on CPU 1? Maybe try to disable that core?

Yousuf Khan
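
One way to check whether the error is always on CPU 1 is to tally the mcelog lines in the log rather than read them by eye. The sketch below assumes the line format shown in this thread ("mcelog: CPU 1 1 instruction cache" and "mcelog: STATUS ... MCGSTATUS 0"). The second number on the CPU line appears to be the bank number, which is an assumption here, and the helper name `tally` is made up for this example.

```python
import re
from collections import Counter

# Patterns for the mcelog line formats quoted in this thread (assumed stable).
CPU_RE = re.compile(r"mcelog: CPU (\d+) (\d+)")
STATUS_RE = re.compile(r"mcelog: STATUS ([0-9a-f]+)")


def tally(lines):
    """Count machine-check events per (cpu, bank) pair and per STATUS value."""
    cpus, statuses = Counter(), Counter()
    for line in lines:
        match = CPU_RE.search(line)
        if match:
            cpus[(int(match.group(1)), int(match.group(2)))] += 1
        match = STATUS_RE.search(line)
        if match:
            statuses[match.group(1)] += 1
    return cpus, statuses


# Two lines copied from the log excerpt above, used as sample input.
SAMPLE = [
    "Mar 12 14:45:16 foobar mcelog: CPU 1 1 instruction cache",
    "Mar 12 14:45:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0",
]

cpus, statuses = tally(SAMPLE)
print("events per (cpu, bank):", dict(cpus))
print("events per STATUS:", dict(statuses))
```

To run it against the real log, pass an open file instead of `SAMPLE`, for example `tally(open("/var/log/messages", errors="replace"))`. If every event lands on the same (cpu, bank) pair, that would support trying the box with that core disabled.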