Logs and dumps for kernel panics to collect and analyze? Dying CPU? [Linux Hardware]

From: Darren Salt on 6 Mar 2010 17:13

I demand that Ant may or may not have written...

> Uh oh. I just discovered mcelog and something new and scary in its
> /var/log/syslog:
[snip]
> Mar 6 08:24:37 foobar kernel: [40799.988036] Machine check events logged
> Mar 6 08:45:19 foobar -- MARK --
> Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
> problem!
> Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
> Mar 6 08:52:09 foobar mcelog: MCE 0
> Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
> Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
> Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
> Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
> Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
> level 1'
> Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
> Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
> Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
[snip duplicate entries]

Ouch.

> What does that mean? Dying CPU (had it since 12/24/2006)?

12/12/2007? ;-)

(Hint: use ISO8601 date formats or use month names. Broken-endian dates can
all too easily cause error; fortunately, that one's unambiguous.)

Anyway, it does look like a fault in that CPU. I'd certainly be considering
replacing it, though due to your earlier mention of kernel panics, I wouldn't
rule out board problems either; are there any visible signs of hardware
problems (leaky/bulging capacitors etc.)? Checking the PSU is probably also
worthwhile.

(http://en.wikipedia.org/wiki/Translation_lookaside_buffer describes the
affected area of the CPU.)
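
For reference, the STATUS word in the log can be picked apart by hand. The
following is only a sketch in Python, not how mcelog itself does it: the flag
bit positions and the TLB error-code pattern (0000 0000 0001 TTLL in the low
16 bits) are assumed from the architectural machine-check register layout,
and the function and table names are made up for illustration.

```python
# Sketch: decode the top flag bits and the TLB error code of an MCi_STATUS
# value. Bit positions are ASSUMED from the architectural machine-check
# register layout; names below are illustrative, not taken from mcelog.

FLAGS = {
    63: "VAL (entry valid)",
    62: "OVER (overflow)",
    61: "UC (uncorrected)",
    60: "EN (reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic"}


def decode_status(status):
    """Return (set flag names, 16-bit error code, TLB details or None)."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    tlb = None
    if code & 0xFFF0 == 0x0010:          # matches 0000 0000 0001 TTLL
        tt = (code >> 2) & 0x3           # transaction type
        ll = code & 0x3                  # cache/TLB level
        tlb = (TRANSACTION.get(tt, "reserved"), LEVEL[ll])
    return flags, code, tlb


flags, code, tlb = decode_status(0x9400000000010011)
print(flags)
print(hex(code), tlb)
```

Under those assumptions, 9400000000010011 has VAL, EN and ADDRV set with UC
and PCC clear, and the low 16 bits give an instruction TLB error at level 1,
which agrees with mcelog's own text.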

> Maybe that's why memtest86+ didn't find any problems last week.

That doesn't seem to be relevant.

> On 3/5/2010 11:12 PM PT, Ant typed:

(And that one hasn't happened yet.)

>> Is /var/log/syslog the only place where Linux keeps records of kernel
>> (v2.6.30 and v2.6.32) panics? dmesg and /var/log/messages don't seem to
>> show anything about the crashes unless I am misreading them. I am trying
>> to figure out a rare and random kernel panic issue on my old Debian box.

http://www.mjmwired.net/kernel/Documentation/networking/netconsole.txt

That needs a second computer, but it will at least allow most panics to be
captured. (Exceptions include hard hangs, where there may be no panic which
can be reported, and problems which affect the network interface over which
the log is being sent.)
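
On the second computer, anything that listens for UDP packets will do. As a
minimal sketch (Python, standard library only), the receiver below appends
each packet to a file with a timestamp. The port 6666 and the file name
netconsole.log are assumptions chosen for illustration; the sending side is
set up as described in the netconsole.txt document linked above.

```python
import socket
import time


def stamp(payload, when=None):
    """Prefix one received datagram with a local ISO 8601 timestamp."""
    when = time.localtime() if when is None else when
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    return time.strftime("%Y-%m-%dT%H:%M:%S", when) + " " + text


def receive(port=6666, logfile="netconsole.log"):
    """Listen for UDP datagrams forever and append them to logfile.

    The port and file name are assumptions; match the port to whatever
    the sending machine's netconsole is configured to use.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    # Line-buffered, so each message reaches the file as soon as it arrives.
    with open(logfile, "a", buffering=1) as log:
        while True:
            payload, _sender = sock.recvfrom(65535)
            log.write(stamp(payload) + "\n")


# On the receiving machine, call receive() and leave it running.
```

Writing each packet out immediately matters here: the interesting messages
are the last ones sent before the other box dies.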

[snip]
--
| Darren Salt | linux at youmustbejoking | nr. Ashington, | Doon
| using Debian GNU/Linux | or ds ,demon,co,uk | Northumberland | Army
| + They're after you...

I'd like to, but I'm going to count the bristles in my toothbrush.

From: Darren Salt on 7 Mar 2010 08:52

I demand that Ant may or may not have written...

> On 3/6/2010 2:13 PM PT, Darren Salt typed:
>>> Uh oh. I just discovered mcelog and something new and scary in its
>>> /var/log/syslog:
>> [snip]
>>> Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
>>> problem!
>>> Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
>>> Mar 6 08:52:09 foobar mcelog: MCE 0
>>> Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
>>> Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
>>> Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
>>> Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
>>> Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
>>> level 1'
>>> Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
>>> Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
>>> Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
>> [snip duplicate entries]
>> Ouch.

> :(

>>> What does that mean? Dying CPU (had it since 12/24/2006)?
>> 12/12/2007? ;-)

> Eh?

Normalisation in progress. ;-)

>> (Hint: use ISO8601 date formats or use month names. Broken-endian dates
>> can all too easily cause error; fortunately, that one's unambiguous.)

> I don't get it. :(

Well... today is 7/3/2010 or 3/7/2010, according to locale; it is better
represented as 2010-03-07.
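
To make the ambiguity concrete, here is a small Python sketch (the function
name is made up for illustration) that lists every ISO 8601 date a
slash-separated date could mean, trying both month-first and day-first
orderings.

```python
from datetime import datetime


def readings(text):
    """Return the ISO 8601 dates a slash-separated date could mean."""
    results = set()
    for fmt in ("%m/%d/%Y", "%d/%m/%Y"):   # month-first, then day-first
        try:
            results.add(datetime.strptime(text, fmt).date().isoformat())
        except ValueError:
            pass  # not a valid date under this ordering
    return sorted(results)


print(readings("3/7/2010"))    # two possible dates
print(readings("12/24/2006"))  # only one valid reading
```

A date such as 12/24/2006 survives only because there is no 24th month;
anything with a day of 12 or less reads both ways.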

>> Anyway, it does look like a fault in that CPU. I'd certainly be
>> considering replacing it, though due to your earlier mention of kernel
>> panics, I wouldn't rule out board problems either; are there any visible
>> signs of hardware problems (leaky/bulging capacitors etc.)? Checking the
>> PSU is probably also worthwhile.

> Hmmm, I just swapped my PSU because the old one (FSP650-80GLC PSU (650
> watts) from 5/14/2007) died in 12/2009. I recall that days before,
> something smelled of burning but I couldn't figure out where it came from
> since I had two desktops. I guess it was the PSU that went poof!

I've had that happen once here. Advice given was to replace the whole lot
because of possible damage to components, and I can see where that's coming
from: brief over-voltage or over-current. (Would anybody who knows more about
your typical switched-mode PSU care to comment?)

> At the same time, my EVGA GeForce 8800 GT video card had to be RMA'ed: it
> didn't work anymore, and even with the new PSU the box still wouldn't boot
> up at all.

Dead card, due to The Way of the Exploding PSU?

> After getting an RMA'ed refurbished video card back, my box was fine for a
> bit and then got kernel panics once in a while. Then they slowly seemed to
> become more frequent. One day in February, I ran memtest86+ v4.00 for
> like five hours and found lots of errors. My friend and I narrowed it down
> to a 512 MB RAM

I've seen bad RAM before. On visual inspection, it looks exactly like good
RAM.

> and left with 2.5 GB remaining (still plenty for an old
> Linux workstation!). Oh and we didn't see anything burned, busted, etc.

That's the thing. It might not *look* damaged...

> It sounds like that PSU bust damaged a lot of my hardware. Argh! I
> don't have the time and resources to build another one

Yet you have the time to respond here. ;-)

> (guess I could do a clean install with it too :P). :(

Hmm...

>> (http://en.wikipedia.org/wiki/Translation_lookaside_buffer describes the
>> affected area of the CPU.)

> Hmm, I wonder if that 512 MB RAM that memtest86 detected having errors
> wasn't bad?

Chances are that memtest86 was right. (I can see how bad memory might cause
incorrect TLB entries, but not parity errors.)

>>> Maybe that's why memtest86+ didn't find any problems last week.
>> That doesn't seem to be relevant.

> Why do you say that? I am going to run it again soon to double check.

It's testing the memory, and (probably) isn't making use of logical
addressing. If it isn't, then it's not going to be making use of the TLB, so
it's not going to cause MCEs. (Or perhaps they *were* happening, but
memtest86+ was ignoring them.)

>> http://www.mjmwired.net/kernel/Documentation/networking/netconsole.txt
>> That needs a second computer, but it will at least allow most panics to be
>> captured. (Exceptions include hard hangs, where there may be no panic
>> which can be reported, and problems which affect the network interface
>> over which the log is being sent.)

> Interesting. I wish Linux's kernel panics would log to a file, like
> Windows' memory dumps from blue screens, so I could use a debugger to see
> what's in the dumps.

Logging to a file isn't an option (at this point, things are probably too far
gone for this to be practical); but they could, perhaps, be stored in some
non-volatile memory. (You'd need at least 16K for this, ideally 64K or more;
and I don't think that there's enough in your typical PC RTC.)

--
| Darren Salt | linux at youmustbejoking | nr. Ashington, | Doon
| using Debian GNU/Linux | or ds ,demon,co,uk | Northumberland | Army
| + This comment has been censored.

Would ye both eat your cake and have your cake?

From: Darren Salt on 7 Mar 2010 22:08

I demand that Ant may or may not have written...

> On 3/7/2010 5:52 AM PT, Darren Salt typed:
[snip]
>>>> Anyway, it does look like a fault in that CPU. I'd certainly be
>>>> considering replacing it, though due to your earlier mention of kernel
>>>> panics, I wouldn't rule out board problems either; are there any visible
>>>> signs of hardware problems (leaky/bulging capacitors etc.)? Checking the
>>>> PSU is probably also worthwhile.
>>> Hmmm, I just swapped my PSU because the old one (FSP650-80GLC PSU (650
>>> watts) from 5/14/2007) died in 12/2009. I recall that days before,
>>> something smelled of burning but I couldn't figure out where it came from
>>> since I had two desktops. I guess it was the PSU that went poof!
>> I've had that happen once here. Advice given was to replace the whole lot
>> because of possible damage to components, and I can see where that's
>> coming from: brief over-voltage or over-current. (Would anybody who knows
>> more about your typical switched-mode PSU care to comment?)

> :( It sounds common I guess.

Cheap components, I shouldn't wonder.

> I ran memtest86+ v4.00 overnight for over five hours. It completed two
> passes and was almost done with the third one, on test 8. I guess RAM is
> still OK!

Probably. :-)

>>> At the same time, my EVGA GeForce 8800 GT video card had to be RMA'ed:
>>> it didn't work anymore, and even with the new PSU the box still wouldn't
>>> boot up at all.
>> Dead card, due to The Way of the Exploding PSU?

> I guess so, if it stopped working right after the PSU went dead and was
> replaced with a new one. Or a coincidence?

Coincidence, I'd say. Too much of one does seem rather likely.

>>> After getting an RMA'ed refurbished video card back, my box was fine for
>>> a bit and then got kernel panics once in a while. Then they slowly seemed
>>> to become more frequent. One day in February, I ran memtest86+ v4.00 for
>>> like five hours and found lots of errors. My friend and I narrowed it
>>> down to a 512 MB RAM
>> I've seen bad RAM before. On visual inspection, it looks exactly like good
>> RAM.

> Yeah. It's old too (four years I think)!

Good hardware should last quite a bit longer than that. Assuming that it /is/
good hardware, of course...

>>> and left with 2.5 GB remaining (still plenty for an old
>>> Linux workstation!). Oh and we didn't see anything burned, busted, etc.
>> That's the thing. It might not *look* damaged...

> Right, but you asked if there was any damage visible to our eyes. :P

Yes, on the grounds that it's not worth looking further if you see obvious
damage. :-P

[snip]
>>> Hmm, I wonder if that 512 MB RAM that memtest86 detected having errors
>>> wasn't bad?
>> Chances are that memtest86 was right. (I can see how bad memory might
>> cause incorrect TLB entries, but not parity errors.)

> So parity errors are from the CPU only? I am no expert in hardware.

If you happen to be using ECC RAM, errors can be reported from that too.
Hopefully, they'd be correctable ones...

>>>>> Maybe that's why memtest86+ didn't find any problems last week.
>>>> That doesn't seem to be relevant.
>>> Why do you say that? I am going to run it again soon to double check.
>> It's testing the memory, and (probably) isn't making use of logical
>> addressing. If it isn't, then it's not going to be making use of the TLB,
>> so it's not going to cause MCEs. (Or perhaps they *were* happening, but
>> memtest86+ was ignoring them.)

> So how can I test this with another bootable tool like memtest86+?

Boot from USB or CD, drop to a text console, stress it with a kernel compile
or something (preferably without touching disk). Wait. :-)

>>> Interesting. I wish Linux's kernel panics would log to a file, like
>>> Windows' memory dumps from blue screens, so I could use a debugger to see
>>> what's in the dumps.
>> Logging to a file isn't an option (at this point, things are probably too
>> far gone for this to be practical); but they could, perhaps, be stored in
>> some non-volatile memory. (You'd need at least 16K for this, ideally 64K
>> or more; and I don't think that there's enough in your typical PC RTC.)

> Bummer. I am surprised Linux doesn't do this, but MS does with its
> NT-based Windows.

Given suitable storage, and the right kernel options...

--
| Darren Salt | linux at youmustbejoking | nr. Ashington, | Doon
| using Debian GNU/Linux | or ds ,demon,co,uk | Northumberland | Army
| + http://www.youmustbejoking.demon.co.uk/ & http://tlasd.wordpress.com/

Scotty: "It's comin' apart, lad!"

From: Darren Salt on 8 Mar 2010 08:56

I demand that Ant may or may not have written...

> On 3/7/2010 7:08 PM PT, Darren Salt typed:
[snip]
>>>>> Hmm, I wonder if that 512 MB RAM that memtest86 detected having errors
>>>>> wasn't bad?
>>>> Chances are that memtest86 was right. (I can see how bad memory might
>>>> cause incorrect TLB entries, but not parity errors.)
>>> So parity errors are from the CPU only? I am no expert in hardware.
>> If you happen to be using ECC RAM, errors can be reported from that too.
>> Hopefully, they'd be correctable ones...

> Hmm, I don't know if my RAM uses ECC? How can I check?

dmidecode will tell you. ECC was also "a bit" more expensive, although when I
last looked at memory prices etc., I noticed ECC showing up a lot more and
relatively inexpensively.
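
As a rough sketch of what to look for: the memory section of dmidecode's
output (for example from "dmidecode --type memory", run as root) carries an
"Error Correction Type" field, and a few lines of Python can pull it out. The
helper names are made up for illustration, and SAMPLE is a trimmed,
illustrative excerpt rather than output captured from the machine discussed
here.

```python
import re

# Illustrative excerpt only; real output has many more fields.
SAMPLE = """\
Physical Memory Array
\tLocation: System Board Or Motherboard
\tUse: System Memory
\tError Correction Type: None
"""


def ecc_types(dmidecode_text):
    """Return every 'Error Correction Type' value found in the text."""
    return re.findall(r"^\s*Error Correction Type:\s*(.+?)\s*$",
                      dmidecode_text, re.MULTILINE)


def has_ecc(dmidecode_text):
    """True if any reported type is something other than None/Unknown."""
    return any(t.lower() not in ("none", "unknown")
               for t in ecc_types(dmidecode_text))


print(ecc_types(SAMPLE), has_ecc(SAMPLE))
```

The text to feed in would come from the real command on the box in question;
a value other than "None" there suggests ECC is in use.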

>>> So how can I test this with another bootable tool like memtest86+?
>> Boot from USB or CD, drop to a text console, stress it with a kernel
>> compile or something (preferably without touching disk). Wait. :-)

> Hmm, I did that in my regular Debian and no problems! I used sys_basher,
> unrar'ed 10 GB of data, etc. I can't make it happen with stress tests. Most
> of the kernel panics happened when idle! :D

Fine; then let it idle for a while too...

--
| Darren Salt | linux at youmustbejoking | nr. Ashington, | Doon
| using Debian GNU/Linux | or ds ,demon,co,uk | Northumberland | Army
| + http://www.youmustbejoking.demon.co.uk/ & http://tartarus.org/ds/

You will hear good news from one you thought unfriendly to you.
