From: Ant on
Hello.

Lately, I have been getting random and rare kernel panics on my old
Debian/Linux box (tried both kernel versions 2.6.30 and 2.6.32). I
couldn't figure out what it was until I discovered mcelog a couple of
days ago, and it revealed some interesting, scary data in my
dmesg/messages and syslog:

# cat /var/log/messages
....
Mar 7 08:25:24 MyLinuxBox kernel: [ 3299.988026] Machine check events logged
Mar 7 08:25:24 MyLinuxBox mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Mar 7 08:25:24 MyLinuxBox mcelog: Please contact your hardware vendor
Mar 7 08:25:24 MyLinuxBox mcelog: MCE 0
Mar 7 08:25:24 MyLinuxBox mcelog: CPU 1 1 instruction cache
Mar 7 08:25:24 MyLinuxBox mcelog: ADDR c11b6ff0
Mar 7 08:25:24 MyLinuxBox mcelog: TIME 1267979124 Sun Mar 7 08:25:24 2010
Mar 7 08:25:24 MyLinuxBox mcelog: TLB parity error in virtual array
Mar 7 08:25:24 MyLinuxBox mcelog: TLB error 'instruction transaction, level 1'
Mar 7 08:25:24 MyLinuxBox mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 08:25:24 MyLinuxBox mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 7 08:25:24 MyLinuxBox mcelog: CPUID Vendor AMD Family 15 Model 43

I am not familiar with hardware, so I assume this is very bad, but which
part(s) is/are bad? Is my old Athlon 64 X2 CPU dying/damaged? I have had
it and its motherboard since 12/24/2006, so it is not that old yet. I
have the full details on my secondary machine at
http://alpha.zimage.com/~ant/antfarm/about/computers.txt ...

Although, this might be related to the PSU's death back in early
December 2009. My friend and I believe it also took out my EVGA GeForce
8800 GT video card and damaged a 512 MB stick of RAM (I tested all 3 GB
together and then each piece separately with memtest86+ v4.00 to narrow
it down). http://alpha.zimage.com/~ant/antfarm/about/toys.html has a log
of the details of my systems. I did run memtest86+ again a couple of
weeks ago and this morning for 5-6 hours, and got no errors after five
full passes. I also do not overclock/OC.

Thank you in advance. :)
--
"Above ground I shall be food for kites; below I shall be food for
mole-crickets and ants. Why rob one to feed the other?" --Juang-zu (4th
Century B.C.)
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: philpi(a)earthlink.netANT
( ) or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.
From: Yousuf Khan on
Ant wrote:
> Hello.
>
> Lately, I have been getting random and rare kernel panics on my old
> Debian/Linux box (tried both kernel versions 2.6.30 and 2.6.32). I
> couldn't figure out what it was until I discovered mcelog a couple of
> days ago, and it revealed some interesting, scary data in my
> dmesg/messages and syslog:
>
> # cat /var/log/messages
> ...
> Mar 7 08:25:24 MyLinuxBox kernel: [ 3299.988026] Machine check events logged
> Mar 7 08:25:24 MyLinuxBox mcelog: HARDWARE ERROR. This is *NOT* a software problem!
> Mar 7 08:25:24 MyLinuxBox mcelog: Please contact your hardware vendor
> Mar 7 08:25:24 MyLinuxBox mcelog: MCE 0
> Mar 7 08:25:24 MyLinuxBox mcelog: CPU 1 1 instruction cache
> Mar 7 08:25:24 MyLinuxBox mcelog: ADDR c11b6ff0
> Mar 7 08:25:24 MyLinuxBox mcelog: TIME 1267979124 Sun Mar 7 08:25:24 2010
> Mar 7 08:25:24 MyLinuxBox mcelog: TLB parity error in virtual array
> Mar 7 08:25:24 MyLinuxBox mcelog: TLB error 'instruction transaction, level 1'
> Mar 7 08:25:24 MyLinuxBox mcelog: STATUS 9400000000010011 MCGSTATUS 0
> Mar 7 08:25:24 MyLinuxBox mcelog: MCGCAP 105 APICID 1 SOCKETID 0
> Mar 7 08:25:24 MyLinuxBox mcelog: CPUID Vendor AMD Family 15 Model 43
>
> I am not familiar with hardware, so I assume this is very bad, but which
> part(s) is/are bad? Is my old Athlon 64 X2 CPU dying/damaged? I have had
> it and its motherboard since 12/24/2006, so it is not that old yet. I
> have the full details on my secondary machine at
> http://alpha.zimage.com/~ant/antfarm/about/computers.txt ...


Yeah, TLB stands for Translation Lookaside Buffer; it's the part of the
processor that caches virtual-to-physical address translations for
memory pages. I'm not sure what they are referring to when they talk
about "virtual array", unless it has something to do with OS
virtualization. In any case, if your TLB is damaged, then various
programs will fail if their memory pages get tracked by the bad TLB
entry.
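
For anyone who wants to read the STATUS value in the log by hand, here is
a minimal Python sketch. It assumes the standard x86 machine-check layout
for the MCi_STATUS register (flag bits in the top byte, error code in the
low 16 bits, TLB errors encoded as 0000 0000 0001 TTLL). The function and
table names are made up for illustration and are not part of mcelog.

```python
# Sketch: decode an x86 machine-check MCi_STATUS value.
# All names here are illustrative; only the bit layout is architectural.
STATUS = 0x9400000000010011  # STATUS field from the mcelog output

# Architectural flag bits in the top byte of MCi_STATUS.
FLAG_BITS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
             59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Field values for TLB error codes of the form 0000 0000 0001 TTLL.
TLB_TYPE = {0: "instruction", 1: "data", 2: "generic"}
TLB_LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}


def decode_status(status):
    """Return the set flags and, for TLB errors, the decoded TT/LL fields."""
    flags = [name for bit, name in sorted(FLAG_BITS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF  # low 16 bits: MCA error code
    result = {"flags": flags, "error_code": code}
    if (code & 0xFFF0) == 0x0010:  # TLB error pattern
        result["kind"] = "TLB"
        result["transaction"] = TLB_TYPE.get((code >> 2) & 0x3, "reserved")
        result["level"] = TLB_LEVEL[code & 0x3]
    return result


if __name__ == "__main__":
    print(decode_status(STATUS))
```

Under that layout the value decodes to VAL, EN and ADDRV set, with an
instruction TLB error at level 1, which agrees with the text mcelog
printed. UC is not set, which under this layout would mean the event was
logged as corrected rather than uncorrected.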

> Although, this might be related to the PSU's death back in early
> December 2009. My friend and I believe it also took out my EVGA GeForce
> 8800 GT video card and damaged a 512 MB stick of RAM (I tested all 3 GB
> together and then each piece separately with memtest86+ v4.00 to narrow
> it down). http://alpha.zimage.com/~ant/antfarm/about/toys.html has a log
> of the details of my systems. I did run memtest86+ again a couple of
> weeks ago and this morning for 5-6 hours, and got no errors after five
> full passes. I also do not overclock/OC.
>
> Thank you in advance. :)

If that PSU failure took out so much other hardware, then it's likely it
damaged your processor too, and the processor just took longer to
finally fail. CPU chips tend to be more robust than memory chips and GPU
chips, with a lot more redundancy, so they may show signs of failure
much later.

Memtest86+ won't find faults inside the CPU, it only tests for faults in
the RAM.

Yousuf Khan
From: Robert Redelmeier on
Ant <ant(a)zimage.comant> wrote in part:
> part(s) is/are bad? Is my old Athlon 64 X2 CPU dying/damaged? [snip]
>
> Although, this might be related to the PSU's death back in early
> December 2009. My friend and I believe it also took out my EVGA
> GeForce 8800 GT video card and damaged a 512 MB stick of RAM (I
> tested all 3 GB together and then each piece separately with
> memtest86+ v4.00 to narrow it down).

As Yousuf has mentioned, any PSU failure serious enough to
damage RAM could easily damage the CPU, especially an AMD CPU,
which has the RAM controller and buses inside the CPU.


> http://alpha.zimage.com/~ant/antfarm/about/toys.html has a log of
> the details of my systems. I did run memtest86+ again a couple of
> weeks ago and this morning for 5-6 hours, and got no errors
> after five full passes. I also do not overclock/OC.

memtest86 is a good program, but it is more extensive than intensive.
It tests all memory, but not especially hard. If you want to
diagnose further, you could try running a few dozen copies of my
`burnMMX P`. It is a bit old and does not reach quite as high a
bandwidth as is possible on newer processors. If there is no error,
the copies should stay running indefinitely. Watch for terminations
and/or dmesg. Running them under `nice -19` should increase TLB
transitions.
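
The "start many copies and watch for terminations" idea can be sketched
in Python roughly as follows. This is only an illustration: `run_copies`
is a made-up helper, the count of 24 copies is an arbitrary pick for
"a few dozen", and `nice -n 19` is assumed to be equivalent to the
`nice -19` form mentioned above. It also assumes `burnMMX` is on the
PATH. Any copy that exits is reported, since a healthy run should never
exit on its own.

```python
import subprocess
import sys
import time


def run_copies(cmd, copies, poll_seconds=1.0, max_polls=None):
    """Start `copies` instances of cmd and report any that exit.

    Returns a list of (copy_index, return_code) in the order the exits
    were noticed. Stops once every copy has exited, or after max_polls
    polls (None means keep polling until all copies have exited).
    """
    procs = {i: subprocess.Popen(cmd) for i in range(copies)}
    exited = []
    polls = 0
    try:
        while procs and (max_polls is None or polls < max_polls):
            time.sleep(poll_seconds)
            polls += 1
            for i, proc in list(procs.items()):
                rc = proc.poll()  # None while the copy is still running
                if rc is not None:
                    print(f"copy {i} exited with code {rc}", file=sys.stderr)
                    exited.append((i, rc))
                    del procs[i]
    finally:
        # Clean up any copies still running when we stop watching.
        for proc in procs.values():
            proc.terminate()
            proc.wait()
    return exited


# Example call (assumed invocation, not run here):
# run_copies(["nice", "-n", "19", "burnMMX", "P"], copies=24, poll_seconds=10)
```

Checking dmesg or /var/log/messages for new mcelog entries while this
runs is still a separate, manual step.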

-- Robert author `cpuburn` http://pages.sbcglobal.net/redelm

From: Ant on
On 3/8/2010 7:01 PM PT, Yousuf Khan typed:

> If that PSU failure took out so much other hardware, then it's likely it
> damaged your processor too, and the processor just took longer to
> finally fail. CPU chips tend to be more robust than memory chips and GPU
> chips, with a lot more redundancy, so they may show signs of failure
> much later.

Ah, that could be it. So far, a 512 MB stick of RAM and the video card
went bust with the PSU. Too bad my friend and I did not see any physical
evidence such as busted caps, discoloration, etc. :(


> Memtest86+ won't find faults inside the CPU, it only tests for faults in
> the RAM.

What's a good way to test the CPU? I tried sys_basher, unraring 10 GB of
data, memtest86+ v4.00 (you said it is only for RAM), etc. None of them
caused kernel panics. The crashes seem to happen during idle time. I do
not use AMD's Cool'n'Quiet and PowerNow-K8.
--
"To conquer the world, we must be as meticulous and calculating as a
colony of ants on the march." --Julius Caesar
From: Ant on
On 3/8/2010 8:49 PM PT, Robert Redelmeier typed:

> As Yousuf has mentioned, any PSU failure serious enough to
> damage RAM could easily damage the CPU, especially an AMD CPU,
> which has the RAM controller and buses inside the CPU.

Damn. Do Intel CPUs do better with this?


> memtest86 is a good program, but it is more extensive than intensive.
> It tests all memory, but not especially hard. If you want to
> diagnose further, you could try running a few dozen copies of my
> `burnMMX P`. It is a bit old and does not reach quite as high a
> bandwidth as is possible on newer processors. If there is no error,
> the copies should stay running indefinitely. Watch for terminations
> and/or dmesg. Running them under `nice -19` should increase TLB
> transitions.
>
> -- Robert author `cpuburn` http://pages.sbcglobal.net/redelm

Thanks. I will try it when I don't need to use the box. You should
update your program to support the newer processors. :)
--
"Is this stuff any good for ants?" "No, it kills them." --unknown