From: Ant on 8 Mar 2010 11:45

Hello.

Lately, I have been getting rare, random kernel panics on my old Debian/Linux box (tried both kernel versions 2.6.30 and 2.6.32). I couldn't figure out what it was until I discovered mcelog a couple of days ago, and it revealed some interesting, scary data in my dmesg/messages and syslog:

# cat /var/log/messages
....
Mar 7 08:25:24 MyLinuxBox kernel: [ 3299.988026] Machine check events logged
Mar 7 08:25:24 MyLinuxBox mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Mar 7 08:25:24 MyLinuxBox mcelog: Please contact your hardware vendor
Mar 7 08:25:24 MyLinuxBox mcelog: MCE 0
Mar 7 08:25:24 MyLinuxBox mcelog: CPU 1 1 instruction cache
Mar 7 08:25:24 MyLinuxBox mcelog: ADDR c11b6ff0
Mar 7 08:25:24 MyLinuxBox mcelog: TIME 1267979124 Sun Mar 7 08:25:24 2010
Mar 7 08:25:24 MyLinuxBox mcelog: TLB parity error in virtual array
Mar 7 08:25:24 MyLinuxBox mcelog: TLB error 'instruction transaction, level 1'
Mar 7 08:25:24 MyLinuxBox mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 08:25:24 MyLinuxBox mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 7 08:25:24 MyLinuxBox mcelog: CPUID Vendor AMD Family 15 Model 43

I am not familiar with hardware, so I assume this is very bad, but which part(s) is/are bad? Is my old Athlon 64 X2 CPU dying/damaged? I have had it and its motherboard since 12/24/2006, so it is not that old yet. I have the full details on my secondary machine at http://alpha.zimage.com/~ant/antfarm/about/computers.txt ...

Although, this might be related to the PSU's death back in early December 2009. My friend and I believe it also took out my EVGA GeForce 8800 GT video card and damaged a 512 MB piece of RAM (tested the full 3 GB and each piece individually with memtest86+ v4.00 to narrow it down). http://alpha.zimage.com/~ant/antfarm/about/toys.html has a log of the details of my systems. I did run memtest86+ again a couple of weeks ago and this morning for 5-6 hours, but got no errors after five full tests (passed). I also do not overclock/OC.
Thank you in advance. :)
--
"Above ground I shall be food for kites; below I shall be food for mole-crickets and ants. Why rob one to feed the other?" --Juang-zu (4th Century B.C.)
   /\___/\
  / /\ /\ \    Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
 | |o   o| |   Ant's Quality Foraged Links: http://aqfl.net
    \ _ /      Nuke ANT from e-mail address: philpi(a)earthlink.netANT
     ( )       or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.
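For reference, the STATUS value in the log above can be unpacked by hand. What follows is a minimal Python sketch, not mcelog's own code. It assumes the common x86 machine-check status layout (flag bits 63 down to 57, error code in the low 16 bits) and treats bits 16-19 as the AMD extended error code without interpreting them. The names `decode_status`, `FLAG_BITS`, `TRANSACTION` and `LEVEL` are made up for this illustration.

```python
# Sketch: decode the architectural fields of an x86 machine-check STATUS value.
# Assumption: standard MCA bit layout; the extended code is extracted only.

FLAG_BITS = {
    "VAL": 63,    # register holds a valid error record
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MISC register holds extra information
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}

TRANSACTION = ["instruction", "data", "generic", "reserved"]
LEVEL = ["level 0", "level 1", "level 2", "generic level"]


def decode_status(status):
    """Return the flags, error code and (for TLB errors) the TT/LL fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    error_code = status & 0xFFFF
    result = {
        "flags": flags,
        "error_code": error_code,
        "extended_code": (status >> 16) & 0xF,
        "kind": "other",
    }
    # TLB errors use the pattern 0000 0000 0001 TTLL in the low 16 bits.
    if (error_code & 0xFFF0) == 0x0010:
        result["kind"] = "TLB"
        result["transaction"] = TRANSACTION[(error_code >> 2) & 0x3]
        result["level"] = LEVEL[error_code & 0x3]
    return result


if __name__ == "__main__":
    info = decode_status(0x9400000000010011)
    set_flags = [name for name, on in info["flags"].items() if on]
    print("flags set:", ", ".join(set_flags))
    print("kind:", info["kind"], info.get("transaction"), info.get("level"))
    print("extended code:", info["extended_code"])
```

For the value in this log the sketch reports VAL, EN and ADDRV as set and decodes a level 1 instruction TLB error, which agrees with the text mcelog printed. UC and PCC come out clear, which under this layout suggests the event was logged as a corrected error rather than one that corrupted processor state.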
From: Yousuf Khan on 8 Mar 2010 22:01

Ant wrote:
> Hello.
>
> Lately, I have been getting rare, random kernel panics on my old
> Debian/Linux box (tried both kernel versions 2.6.30 and 2.6.32). I
> couldn't figure out what it was until I discovered mcelog a couple of
> days ago, and it revealed some interesting, scary data in my
> dmesg/messages and syslog:
>
> # cat /var/log/messages
> ...
> Mar 7 08:25:24 MyLinuxBox kernel: [ 3299.988026] Machine check events logged
> Mar 7 08:25:24 MyLinuxBox mcelog: HARDWARE ERROR. This is *NOT* a software problem!
> Mar 7 08:25:24 MyLinuxBox mcelog: Please contact your hardware vendor
> Mar 7 08:25:24 MyLinuxBox mcelog: MCE 0
> Mar 7 08:25:24 MyLinuxBox mcelog: CPU 1 1 instruction cache
> Mar 7 08:25:24 MyLinuxBox mcelog: ADDR c11b6ff0
> Mar 7 08:25:24 MyLinuxBox mcelog: TIME 1267979124 Sun Mar 7 08:25:24 2010
> Mar 7 08:25:24 MyLinuxBox mcelog: TLB parity error in virtual array
> Mar 7 08:25:24 MyLinuxBox mcelog: TLB error 'instruction transaction, level 1'
> Mar 7 08:25:24 MyLinuxBox mcelog: STATUS 9400000000010011 MCGSTATUS 0
> Mar 7 08:25:24 MyLinuxBox mcelog: MCGCAP 105 APICID 1 SOCKETID 0
> Mar 7 08:25:24 MyLinuxBox mcelog: CPUID Vendor AMD Family 15 Model 43
>
> I am not familiar with hardware, so I assume this is very bad, but
> which part(s) is/are bad? Is my old Athlon 64 X2 CPU dying/damaged? I
> have had it and its motherboard since 12/24/2006, so it is not that old
> yet. I have the full details on my secondary machine at
> http://alpha.zimage.com/~ant/antfarm/about/computers.txt ...

Yeah, TLB stands for Translation Lookaside Buffer; it's the part of the processor that keeps track of memory page translations (virtual-to-physical address mappings). I'm not sure what they are referring to when they talk about "virtual array", unless it has something to do with OS virtualization. In any case, if your TLB is damaged, then various programs will fail whenever their memory pages get tracked by the bad TLB entry.
> Although, this might be related to the PSU's death back in early
> December 2009. My friend and I believe it also took out my EVGA GeForce
> 8800 GT video card and damaged a 512 MB piece of RAM (tested the full
> 3 GB and each piece individually with memtest86+ v4.00 to narrow it
> down). http://alpha.zimage.com/~ant/antfarm/about/toys.html has a log
> of the details of my systems. I did run memtest86+ again a couple of
> weeks ago and this morning for 5-6 hours, but got no errors after five
> full tests (passed). I also do not overclock/OC.
>
> Thank you in advance. :)

If that PSU failure took out so much other hardware, then it's likely it damaged your processor too, and the processor simply took longer to finally fail. CPU chips tend to be more robust than memory chips and GPU chips, with a lot more redundancy, so they may show signs of failure much later.

Memtest86+ won't find faults inside the CPU; it only tests for faults in the RAM.

Yousuf Khan
From: Robert Redelmeier on 8 Mar 2010 23:49

Ant <ant(a)zimage.comant> wrote in part:
> part(s) is/are bad? Is my old Athlon 64 X2 CPU dying/damaged? [snip]
>
> Although, this might be related to the PSU's death back in early
> December 2009. My friend and I believe it also took out my EVGA
> GeForce 8800 GT video card and damaged a 512 MB piece of RAM (tested
> the full 3 GB and each piece individually with memtest86+ v4.00 to
> narrow it down).

As Yousuf has mentioned, any PSU failure serious enough to damage RAM could easily damage the CPU, especially an AMD one, with the RAM controller and busses inside the CPU.

> http://alpha.zimage.com/~ant/antfarm/about/toys.html has a log of
> the details of my systems. I did run memtest86+ again a couple of
> weeks ago and this morning for 5-6 hours, but got no errors
> after five full tests (passed). I also do not overclock/OC.

memtest86 is a good program, but it is more extensive than intensive: it tests all memory, but not especially hard. If you want to diagnose further, you could try running a few dozen copies of my `burnMMX P`. It is a bit old and does not reach quite as high a bandwidth as is possible on newer processors. If there is no error, the copies should keep running indefinitely. Watch for terminations and/or dmesg. Running them under `nice -19` should increase TLB transitions.

-- Robert  author `cpuburn`  http://pages.sbcglobal.net/redelm
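The procedure Robert describes (start many copies, then watch for any that terminate) could be automated along these lines. This is only a rough Python sketch under stated assumptions: `run_copies` and its parameters are hypothetical names, 24 stands in for "a few dozen", and the example invocation assumes `burnMMX` is on the PATH and reads `nice -19` as the older spelling of `nice -n 19`. It does not watch dmesg.

```python
import subprocess
import time


def run_copies(command, copies, poll_seconds=1.0, max_seconds=None):
    """Start `copies` instances of `command` and report any that exit.

    Returns a list of (copy index, exit code) for copies that terminated
    on their own. Copies still running when the watch ends are stopped.
    """
    procs = [subprocess.Popen(command) for _ in range(copies)]
    reported = set()
    exited = []
    start = time.monotonic()
    try:
        while len(exited) < copies:
            for index, proc in enumerate(procs):
                code = proc.poll()
                if code is not None and index not in reported:
                    reported.add(index)
                    exited.append((index, code))
                    print(f"copy {index} terminated with exit code {code}")
            if max_seconds is not None and time.monotonic() - start >= max_seconds:
                break
            time.sleep(poll_seconds)
    finally:
        # Stop whatever is still running so no burner is left behind.
        for proc in procs:
            if proc.poll() is None:
                proc.terminate()
        for proc in procs:
            proc.wait()
    return exited


# Example invocation (assumption: burnMMX is installed and on the PATH):
#   run_copies(["nice", "-n", "19", "burnMMX", "P"], copies=24)
```

With `max_seconds` left at `None` the watch continues until every copy has exited or the script is interrupted, which matches the "should stay running indefinitely" expectation: on a healthy machine nothing is ever reported.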
From: Ant on 9 Mar 2010 01:37

On 3/8/2010 7:01 PM PT, Yousuf Khan typed:
> If that PSU failure took out so much other hardware, then it's likely it
> took out your processor too, and it took longer for it to finally fail.
> CPU chips tend to be more robust than memory chips and GPU chips, a lot
> more redundancy, so they may show the signs of the failure much later.

Ah, that could be it. So far, a 512 MB piece of RAM and the video card went bust with the PSU. Too bad my friend and I did not see any physical evidence such as busted caps, discoloration, etc. :(

> Memtest86+ won't find faults inside the CPU, it only tests for faults in
> the RAM.

What's a good way to test the CPU? I tried sys_basher, unraring 10 GB of data, memtest86+ v4.00 (you said it is only for RAM), etc. None of them caused kernel panics. The crashes seem to happen during idle time. I do not use AMD's Cool'n'Quiet and PowerNow-K8.
--
"To conquer the world, we must be as meticulous and calculating as a colony of ants on the march." --Julius Caesar
From: Ant on 9 Mar 2010 01:43
On 3/8/2010 8:49 PM PT, Robert Redelmeier typed:
> As Yousef has mentioned, any PSU failure serious enough to
> damage RAM could easily damage the CPU. Especially AMD with
> the RAM controller and busses inside the CPU.

Damn. Do Intel CPUs do better with this?

> memtest86 is a good pgm, but it is more extensive than intensive.
> It tests all memory, but not especially hard. If you want to
> diagnose further, you could try running a few dozen copies of my
> `burnMMX P`. It is a bit old and not quite as high bandwidth as
> possible on newer processors. If there is no error, they should
> stay running indefinitely. Watch for terminations and/or dmesg.
> Run by `nice -19` should increase TLB transitions.
>
> -- Robert author `cpuburn` http://pages.sbcglobal.net/redelm

Thanks. I will try it when I don't need to use the box. You should update your program to support the newer processors. :)
--
"Is this stuff any good for ants?" "No, it kills them." --unknown