From: Ant on
> The crashes seem to happen during idle time. I do
> not use AMD's Cool'n'Quiet and PowerNow-K8.

FYI: for the first time, I got a kernel panic while I was using my
computer, mostly surfing the Web in Mozilla's SeaMonkey v2.0.4. So it is
not tied to idle time after all.
From: Ant on
On 4/24/2010 11:11 PM PT, Ant typed:

>> The crashes seem to happen during idle time. I do
>> not use AMD's Cool'n'Quiet and PowerNow-K8.
>
> FYI: for the first time, I got a kernel panic while I was using my
> computer, mostly surfing the Web in Mozilla's SeaMonkey v2.0.4. So it is
> not tied to idle time after all.

And another. Grr.
--
"Have I told you how much I like ants, huh? Especially fried in a subtle
blend of mech fluid and grated gears?" --Rampage to Inferno,
"Transmutate" in Transformers (Beast Wars)
/\___/\ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
/ /\ /\ \ Ant's Quality Foraged Links: http://aqfl.net
| |o o| |
\ _ / If crediting, then use Ant nickname and AQFL URL/link.
( ) If e-mailing, then axe ANT from its address if needed.
Ant is currently not listening to any songs on this computer.
From: Yousuf Khan on
Ant wrote:
> On 4/24/2010 11:11 PM PT, Ant typed:
>
>>> The crashes seem to happen during idle time. I do
>>> not use AMD's Cool'n'Quiet and PowerNow-K8.
>>
>> FYI: for the first time, I got a kernel panic while I was using my
>> computer, mostly surfing the Web in Mozilla's SeaMonkey v2.0.4. So it is
>> not tied to idle time after all.
>
> And another. Grr.

It's probably getting worse. Might be time to think about replacement.

Yousuf Khan
From: Ant on
On 4/26/2010 8:10 AM PT, Yousuf Khan typed:

> It's probably getting worse. Might be time to think about replacement.

Yeah, probably at the end of this year when I update my newer Windows
box's hardware. If it fails completely before then, I will just go
back to the single-core Athlon 64 system leftovers I have here.

BTW, I still cannot reproduce these machine errors and kernel panics
outside of my 2005 Debian installation: an Ubuntu live CD ran for 15
hours of mixed use and idling without any. I wonder if my old Debian
installation is causing it rather than the hardware, but that doesn't
explain why it was stable before the PSU, video card, RAM, and other
failures a few months ago. I probably need to do a full clean reinstall
and reconfigure from scratch, which I don't have time for these days. I
will probably save that for when I swap/upgrade my hardware later on. I
just want to reproduce this outside of my old Debian installation! Grr!! :(

Here's something more interesting. After I booted back into old Debian,
with over 1.5 days of uptime, I haven't gotten any new kernel panics
(last one: 4/25/2010, 4-something AM PDT; I don't remember the exact
minute) or machine errors (last one: 4/21/2010 3:26 AM PDT) so far...

Again, the issues are not easily reproducible. They come and go! No
specific pattern now (not related to temperatures, idle time, etc.). :(
--
"Yeah, what's left of it. I was in the militia -- national guard...
That's good! Wasn't any war any more than there's war between men and
ants." --stranger; "And we're eat-able ants. I found that out... What
will they do with us?" --Pierson from H.G. Wells' The War of the Worlds
From: Ant on
On Mar 16, 1:02 pm, ANT...(a)zimage.com wrote:
> >> Having a better look through your logs, I see this addr is
> >> very common (almost all errs are at this addr).  Aren't
> >> you curious about the instruction that produced the errors?
> >> /boot/System.map should contain the addr of all kernel fns,
> >> and there should be some way to lookup modules.
>
> > I did a "cat /var/log/messages |grep ADDR" and found these addresses:
> > c104e3f0
> > c106e8c0
> > c11b6ff0 (most common)
>
> > But none of them matched to /boot/System.map-2.6.32-trunk-686. Here are
> > close addresses around them for each one:
>
> > c104e2f9 T tick_handle_periodic
> > c104e360 T tick_get_broadcast_device
>
> > c1063e1b t stop_cpu
> > c1063ec6 T stop_machine_destroy
>
> > c11b6fb8 T acpi_pm_read_verified
> > c11b6ffc t acpi_pm_read
>
> Since I did a Kernel upgrade (2.6.32-3 from -2 trunk) yesterday morning,
> I noticed a new address in my /var/log/messages (only one so far):
> Mar 16 05:41:16 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!
> Mar 16 05:41:16 foobar mcelog: Please contact your hardware vendor
> Mar 16 05:41:16 foobar mcelog: MCE 0
> Mar 16 05:41:16 foobar mcelog: CPU 1 1 instruction cache
> Mar 16 05:41:16 foobar mcelog: ADDR c104e570
> Mar 16 05:41:16 foobar mcelog: TIME 1268743276 Tue Mar 16 05:41:16 2010
> Mar 16 05:41:16 foobar mcelog:   TLB parity error in virtual array
> Mar 16 05:41:16 foobar mcelog:   TLB error 'instruction transaction, level 1'
> Mar 16 05:41:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
> Mar 16 05:41:16 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
> Mar 16 05:41:16 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
>
> # ls -all /boot/System.map-2.6.32-3-686
> -rw-r--r-- 1 root root 1259340 2010-02-25 01:00 /boot/System.map-2.6.32-3-686
>
> I am going to assume the contents changed in both the kernel and System.map. I looked up that c104e570 address; the closest addresses were:
> # cat /boot/System.map-2.6.32-3-686 |grep c104e
> c104e07d t tick_notify
> c104e374 t tick_periodic
> c104e3dd T tick_handle_periodic
> c104e444 T tick_get_broadcast_device
> c104e44a T tick_get_broadcast_mask
> c104e450 T tick_is_broadcast_device
> c104e464 T tick_set_periodic_handler
> c104e477 T tick_get_broadcast_oneshot_mask
> c104e47d T tick_broadcast_oneshot_active
> c104e48a T tick_shutdown_broadcast_oneshot
> c104e4ac T tick_check_oneshot_broadcast
> c104e4d5 T tick_resume_broadcast_oneshot
> c104e4e2 T tick_broadcast_setup_oneshot
> c104e5ae T tick_broadcast_switch_to_oneshot
> c104e5e0 t tick_do_broadcast
> c104e634 t tick_handle_oneshot_broadcast
> c104e71d t tick_do_periodic_broadcast
> c104e74a T tick_broadcast_oneshot_control
> c104e82c T tick_resume_broadcast
> c104e8a3 T tick_device_uses_broadcast
> c104e91b T tick_suspend_broadcast
> c104e943 T tick_shutdown_broadcast
> c104e989 t tick_handle_periodic_broadcast
> c104e9ce T tick_broadcast_on_off
> c104eb0e T tick_check_broadcast_device
> c104eb60 T tick_oneshot_mode_active
> c104eb96 T tick_switch_to_oneshot
> c104ec1e T tick_init_highres
> c104ec28 T tick_dev_program_event
> c104eca9 T tick_setup_oneshot
> c104ecd9 T tick_program_event
> c104ecfc T tick_resume_oneshot
> c104ed24 T tick_get_tick_sched
> c104ed33 T tick_nohz_get_sleep_length
> c104ed4c T tick_oneshot_notify
> c104ed63 t tick_init_jiffy_update
> c104edae T tick_check_oneshot_change
> c104eea1 t tick_do_update_jiffies64
> c104ef87 t tick_nohz_handler

About 1.5 months later, I compared the last two weeks' logs across two
different kernel 2.6.32 i686 packages (-3 and -4).

-3:
Apr 20 04:13:52 mcelog: ADDR c104e500
Apr 14 01:36:16 mcelog: ADDR c104e530
Apr 16 06:03:52 mcelog: ADDR c104e540
Apr 20 02:51:22 mcelog: ADDR c104e570
/boot/System.map-2.6.32-3-686 showed:
c104e4e2 T tick_broadcast_setup_oneshot
c104e5ae T tick_broadcast_switch_to_oneshot

Apr 13 23:58:46 mcelog: ADDR c104f2c0
/boot/System.map-2.6.32-4-686 showed:
c104f2bb T tick_check_idle
c104f32f T tick_nohz_restart_sched_tick

Most of the addresses in /var/log/messages were c104e570 for kernel
2.6.32-3.
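
For anyone who would rather tally the addresses than eyeball grep output,
here is a rough Python sketch. It assumes the log lines look like the
"mcelog: ADDR ..." lines pasted in this thread. The names count_addrs,
ADDR_RE, and SAMPLE_LOG are made up for illustration, and the sample is just
a few lines copied from the posts, not the full log, so the counts it gives
say nothing about the real distribution.

```python
import re
from collections import Counter

# Assumed line shape, based on the log excerpts in this thread:
#   "Apr 20 02:51:22 mcelog: ADDR c104e570"
ADDR_RE = re.compile(r"mcelog: ADDR ([0-9a-fA-F]+)")


def count_addrs(lines):
    """Count how often each ADDR value appears in mcelog log lines."""
    counts = Counter()
    for line in lines:
        match = ADDR_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


# A few lines copied from this thread (not the complete log).
SAMPLE_LOG = [
    "Apr 20 04:13:52 mcelog: ADDR c104e500",
    "Apr 14 01:36:16 mcelog: ADDR c104e530",
    "Apr 16 06:03:52 mcelog: ADDR c104e540",
    "Apr 20 02:51:22 mcelog: ADDR c104e570",
    "Mar 16 05:41:16 foobar mcelog: ADDR c104e570",
]

if __name__ == "__main__":
    for addr, n in count_addrs(SAMPLE_LOG).most_common():
        print(addr, n)
```

To run it on a real log, pass an open file instead of SAMPLE_LOG, e.g.
count_addrs(open("/var/log/messages")).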


-4 has four days and 21 hours of uptime after upgrading the kernel and
rebooting. So far, only two machine errors and no kernel panics:
Apr 27 09:00:20 mcelog: ADDR c1046d30
/boot/System.map-2.6.32-4-686 showed:
c1046cae T hrtimer_interrupt
c1046e08 t __hrtimer_peek_ahead_timers

Apr 30 07:17:50 mcelog: ADDR c106ee80
/boot/System.map-2.6.32-4-686 showed:
c106ee3c T rcu_irq_enter
c106ee88 T rcu_nmi_exit

Completely different now. Weird.
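
The manual lookups above (find the last System.map symbol at or below the
ADDR value) can be sketched in Python like this. It is only a sketch: it
assumes the usual "address type name" line layout shown in the excerpts, and
parse_system_map, lookup, and SAMPLE_MAP are hypothetical names. It also
assumes the address falls inside the kernel image; an address in a loadable
module would not be covered by System.map.

```python
import bisect


def parse_system_map(lines):
    """Parse 'addr type name' lines into a sorted list of (addr, type, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        try:
            addr = int(parts[0], 16)
        except ValueError:
            continue  # skip anything that is not a hex address
        symbols.append((addr, parts[1], parts[2]))
    symbols.sort()
    return symbols


def lookup(symbols, addr):
    """Return (name, offset) for the last symbol at or below addr, else None."""
    starts = [s[0] for s in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    base, _kind, name = symbols[i]
    return name, addr - base


# Symbol lines copied from the System.map excerpts in this thread.
SAMPLE_MAP = [
    "c104e4e2 T tick_broadcast_setup_oneshot",
    "c104e5ae T tick_broadcast_switch_to_oneshot",
]

if __name__ == "__main__":
    symbols = parse_system_map(SAMPLE_MAP)
    for addr in (0xC104E500, 0xC104E570):
        name, offset = lookup(symbols, addr)
        print(f"{addr:08x} -> {name}+0x{offset:x}")
```

With the real file, use parse_system_map(open("/boot/System.map-2.6.32-3-686"))
and make sure the map matches the kernel that was running when the error was
logged, since the addresses move between package versions.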