From: Robert Redelmeier on
ANTant(a)zimage.com wrote in part:
> Bah, the error came back again after my tests:
>
> dmesg:
> [32399.988020] Machine check events logged
>
> From /var/log/messages:
> Mar 12 14:45:16 foobar kernel: [32399.988020] Machine check events logged
> Mar 12 14:45:16 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!
> Mar 12 14:45:16 foobar mcelog: Please contact your hardware vendor
> Mar 12 14:45:16 foobar mcelog: MCE 0
> Mar 12 14:45:16 foobar mcelog: CPU 1 1 instruction cache
> Mar 12 14:45:16 foobar mcelog: ADDR c11b6ff0
> Mar 12 14:45:16 foobar mcelog: TIME 1268433916 Fri Mar 12 14:45:16 2010
> Mar 12 14:45:16 foobar mcelog: TLB parity error in virtual array
> Mar 12 14:45:16 foobar mcelog: TLB error 'instruction transaction, level 1'
> Mar 12 14:45:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
> Mar 12 14:45:16 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
> Mar 12 14:45:16 foobar mcelog: CPUID Vendor AMD Family 15 Model 43




Noting the addr is in kernel space and the instruction cache,
this is going to take much ingenuity to replicate :(
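
For anyone wondering why that ADDR reads as kernel space: on 32-bit x86 Linux with the default 3G/1G split, kernel virtual addresses start at 0xC0000000. Below is a minimal Python sketch of that check. It assumes the default split, 4 KiB pages, and that ADDR is a virtual address; none of that is confirmed by the log, and `PAGE_OFFSET` and `describe_addr` are just illustrative names.

```python
# Assumption: 32-bit x86 Linux with the default 3G/1G user/kernel split
# and 4 KiB pages. Neither is confirmed by the log itself.
PAGE_OFFSET = 0xC0000000
PAGE_SIZE = 4096

def describe_addr(addr: int) -> dict:
    """Classify a 32-bit virtual address and split it into page and offset."""
    return {
        "kernel_space": addr >= PAGE_OFFSET,
        "page_base": addr & ~(PAGE_SIZE - 1),
        "page_offset": addr & (PAGE_SIZE - 1),
    }

info = describe_addr(0xC11B6FF0)  # ADDR value from the mcelog output above
print(info["kernel_space"], hex(info["page_base"]), hex(info["page_offset"]))
```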


-- Robert


From: Yousuf Khan on
Robert Redelmeier wrote:
> Ant <ant(a)zimage.comant> wrote in part:
>> I am planning to leave them running for about 15 hours straight until
>> I need to use the box locally again tonight. I am curious if I will
>> get no errors and crashes like yesterday's seven processes test.
>
> Yes, this seems to be running well. I'm not sure what else to suggest.
> Odd to see stability under load but instability at idle. mobo caps/PS?
> You might try running 66 `burnMMX O` or 132 `burnMMX N` or even 264
> `burnMMX M` to increase TLB swapping (more smaller maps).
> But that may be too much trouble.

My guess here, but is it possible that the TLB only decays if it isn't
being used constantly?

Yousuf Khan
From: Ant on
On 3/12/2010 8:38 PM PT, Robert Redelmeier typed:

>> Bah, the error came back again after my tests:
>>
>> dmesg:
>> [32399.988020] Machine check events logged
>>
>> From /var/log/messages:
>> Mar 12 14:45:16 foobar kernel: [32399.988020] Machine check events logged
>> Mar 12 14:45:16 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!
>> Mar 12 14:45:16 foobar mcelog: Please contact your hardware vendor
>> Mar 12 14:45:16 foobar mcelog: MCE 0
>> Mar 12 14:45:16 foobar mcelog: CPU 1 1 instruction cache
>> Mar 12 14:45:16 foobar mcelog: ADDR c11b6ff0
>> Mar 12 14:45:16 foobar mcelog: TIME 1268433916 Fri Mar 12 14:45:16 2010
>> Mar 12 14:45:16 foobar mcelog: TLB parity error in virtual array
>> Mar 12 14:45:16 foobar mcelog: TLB error 'instruction transaction, level 1'
>> Mar 12 14:45:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
>> Mar 12 14:45:16 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
>> Mar 12 14:45:16 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
>
> Noting the addr is in kernel space and the instruction cache,
> this is going to take much ingenuity to replicate :(

You just gave me an idea:

# cat /var/log/messages | grep MCGSTATUS
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 09:37:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 14:29:37 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 15:12:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 16:19:37 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 17:42:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 18:14:37 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 18:42:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 18:59:37 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 19:32:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 19:39:37 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 20:12:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 21:14:37 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 21:47:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 22:24:37 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 23:32:07 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 08:25:24 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 13:52:54 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 15:35:24 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 15:42:54 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 15:50:24 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 15:57:54 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 17:30:24 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 18:07:54 foobar mcelog: STATUS d400000000010011 MCGSTATUS 0
Mar 7 19:55:24 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 20:12:54 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 21:55:24 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 22:02:54 foobar mcelog: STATUS d400000000010011 MCGSTATUS 0
Mar 7 23:05:24 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 8 14:22:55 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 8 15:55:25 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 9 01:52:55 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 9 05:15:25 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 9 06:27:55 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 9 06:40:25 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 9 21:17:55 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 10 00:35:25 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 10 01:27:55 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 10 02:50:25 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 10 04:52:55 foobar mcelog: STATUS d400000000010011 MCGSTATUS 0
Mar 10 05:10:25 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 10 22:42:55 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 10 22:50:25 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 10 23:57:55 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 11 00:17:55 foobar mcelog: STATUS 9000000000000171 MCGSTATUS 0
Mar 12 14:45:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 12 22:02:46 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
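
If anyone wants to tally these without counting by eye, here is a rough Python sketch. It only assumes the line format shown in the grep output above; `SAMPLE` is a handful of those lines pasted in, and `tally_status` is a made-up helper name.

```python
import re
from collections import Counter

# A few lines copied from the grep output above; in practice, read
# /var/log/messages (or pipe the grep output in) instead.
SAMPLE = """\
Mar 7 17:30:24 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 18:07:54 foobar mcelog: STATUS d400000000010011 MCGSTATUS 0
Mar 11 00:17:55 foobar mcelog: STATUS 9000000000000171 MCGSTATUS 0
Mar 12 14:45:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
"""

STATUS_RE = re.compile(r"mcelog: STATUS ([0-9a-fA-F]{16}) MCGSTATUS")

def tally_status(lines):
    """Count how often each distinct STATUS value appears."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts

counts = tally_status(SAMPLE.splitlines())
for status, n in counts.most_common():
    print(status, n)
```

To run it on the real log, pass the open file object to `tally_status` instead of `SAMPLE.splitlines()`.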

There are also three d400000000010011 entries and one 9000000000000171
entry among the STATUS values. Do they give any more clues? I wonder if
we can test those areas from outside my Debian install with a LiveCD? I
wonder if Memtest86 even tests those.
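
As a possible starting point for reading the STATUS values in the log above, here is a small Python sketch that splits one into its flag bits and error code. It follows the generic x86 machine-check MCi_STATUS layout (flags in bits 63..57, error code in bits 15..0) as I understand it. The CPU-specific fields are not decoded, and `decode_status` is a hypothetical helper, so treat the output as a hint rather than a diagnosis.

```python
# High flag bits of MCi_STATUS (generic x86 machine-check layout).
FLAGS = {
    63: "VAL",    # entry is valid
    62: "OVER",   # an earlier error was overwritten before being read
    61: "UC",     # uncorrected error
    60: "EN",     # reporting was enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds valid data
    57: "PCC",    # processor context corrupt
}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}

def decode_status(status: int) -> dict:
    """Decode the flag bits and the low 16-bit error code of a STATUS value."""
    out = {"flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1]}
    code = status & 0xFFFF
    out["extended"] = (status >> 16) & 0xFFFF  # model-specific, left raw
    if code & 0xFFF0 == 0x0010:        # 0000 0000 0001 TTLL
        out["kind"] = "TLB"
    elif code & 0xFF00 == 0x0100:      # 0000 0001 RRRR TTLL
        out["kind"] = "cache"
        out["request"] = REQUEST.get((code >> 4) & 0xF, "unknown")
    else:
        out["kind"] = "other"
        return out
    out["transaction"] = TRANSACTION.get((code >> 2) & 0x3, "unknown")
    out["level"] = LEVEL[code & 0x3]
    return out

for value in (0x9400000000010011, 0xD400000000010011, 0x9000000000000171):
    print(hex(value), decode_status(value))
```

Read this way, the d4... entries would differ from the 94... ones only by the OVER flag, and the 90...0171 entry would be a level 1 instruction cache error rather than a TLB one, with no valid ADDR.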

I need to look at the kernel panic messages in text mode and see if they
match too, but they scroll off the screen too quickly. :(
--
"Whence we see spiders, flies, or ants entombed and preserved forever in
amber, a more than royal tomb." --Sir Francis Bacon in Historia Vitæ et
Mortis; Sylva Sylvarum, Cent. i. Exper. 100.
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: philpi(a)earthlink.netANT
( ) or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.
From: Ant on
On 3/12/2010 9:11 PM PT, Yousuf Khan typed:

> My guess here, but is it possible that the TLB only decays if it isn't
> being used constantly?

Last night I ran memtest86+ v4.00's test #9.
http://www.memtest86.com/tech.html#descri says: "Test 9 [Bit fade test,
90 min, 2 patterns]

The bit fade test initializes all of memory with a pattern and then
sleeps for 90 minutes. Then memory is examined to see if any memory bits
have changed. All ones and all zero patterns are used. This test takes 3
hours to complete. The Bit Fade test is not included in the normal test
sequence and must be run manually via the runtime configuration menu."
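
To make the quoted description concrete, here is a toy version of the same idea in Python. It is only a sketch: it works on an ordinary process buffer rather than raw physical memory, so it cannot stand in for memtest86+, and the names `bit_fade_check`, `size`, and `hold_seconds` are made up for illustration.

```python
import time

def bit_fade_check(size: int, hold_seconds: float) -> list:
    """Fill a buffer with a pattern, wait, then report bytes that changed.

    Returns a list of (pattern, offset, value_found) tuples.
    """
    mismatches = []
    for pattern in (0xFF, 0x00):  # all ones, then all zeros
        buf = bytearray([pattern]) * size
        time.sleep(hold_seconds)  # the real test waits 90 minutes per pattern
        for offset, value in enumerate(buf):
            if value != pattern:
                mismatches.append((pattern, offset, value))
    return mismatches

# Tiny buffer and no hold time, just to show the call.
result = bit_fade_check(size=1024, hold_seconds=0)
print(len(result))
```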

I ran it for a little over 3.25 hours and it passed (one pass only).
Shouldn't this test catch that problem? Or is the TLB somewhere else?
Maybe I need to run it longer and for more passes?
--
"Ants die in sugar." --Malawi
From: Yousuf Khan on
ANTant(a)zimage.com wrote:
> Bah, the error came back again after my tests:
>
> dmesg:
> [32399.988020] Machine check events logged
>
> From /var/log/messages:
> Mar 12 14:45:16 foobar kernel: [32399.988020] Machine check events logged
> Mar 12 14:45:16 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!
> Mar 12 14:45:16 foobar mcelog: Please contact your hardware vendor
> Mar 12 14:45:16 foobar mcelog: MCE 0
> Mar 12 14:45:16 foobar mcelog: CPU 1 1 instruction cache
> Mar 12 14:45:16 foobar mcelog: ADDR c11b6ff0
> Mar 12 14:45:16 foobar mcelog: TIME 1268433916 Fri Mar 12 14:45:16 2010
> Mar 12 14:45:16 foobar mcelog: TLB parity error in virtual array
> Mar 12 14:45:16 foobar mcelog: TLB error 'instruction transaction, level 1'
> Mar 12 14:45:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
> Mar 12 14:45:16 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
> Mar 12 14:45:16 foobar mcelog: CPUID Vendor AMD Family 15 Model 43


Error always on CPU 1? Maybe try to disable that core?

Yousuf Khan