From: Robert Redelmeier on 16 Mar 2010 11:04

Jerry Peters <jerry(a)example.invalid> wrote in part:
> No, ACPI is also involved with hardware configuration:
> Advanced *Configuration* & Power Interface.

That was the intent, a replacement for PnP, however AFAIK Linux
_only_ implements the power features, and even has trouble with
that. Linus has been known to rail against ACPI.

-- Robert
From: ANTant on 16 Mar 2010 16:02

>> Having a better look through your logs, I see this addr is
>> very common (almost all errs are at this addr). Aren't
>> you curious about the instruction that produced the errors?
>> /boot/System.map should contain the addr of all kernel fns,
>> and there should be some way to lookup modules.
>
> I did a "cat /var/log/messages |grep ADDR" and found these addresses:
> c104e3f0
> c106e8c0
> c11b6ff0 (most common)
>
> But none of them matched to /boot/System.map-2.6.32-trunk-686. Here are
> close addresses around them for each one:
>
> c104e2f9 T tick_handle_periodic
> c104e360 T tick_get_broadcast_device
>
> c1063e1b t stop_cpu
> c1063ec6 T stop_machine_destroy
>
> c11b6fb8 T acpi_pm_read_verified
> c11b6ffc t acpi_pm_read

Since I upgraded the kernel (to 2.6.32-3 from -2 trunk) yesterday
morning, I have noticed a new address in my /var/log/messages (only
one so far):

Mar 16 05:41:16 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Mar 16 05:41:16 foobar mcelog: Please contact your hardware vendor
Mar 16 05:41:16 foobar mcelog: MCE 0
Mar 16 05:41:16 foobar mcelog: CPU 1 1 instruction cache
Mar 16 05:41:16 foobar mcelog: ADDR c104e570
Mar 16 05:41:16 foobar mcelog: TIME 1268743276 Tue Mar 16 05:41:16 2010
Mar 16 05:41:16 foobar mcelog: TLB parity error in virtual array
Mar 16 05:41:16 foobar mcelog: TLB error 'instruction transaction, level 1'
Mar 16 05:41:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 16 05:41:16 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 16 05:41:16 foobar mcelog: CPUID Vendor AMD Family 15 Model 43

# ls -all /boot/System.map-2.6.32-3-686
-rw-r--r-- 1 root root 1259340 2010-02-25 01:00 /boot/System.map-2.6.32-3-686

I am going to assume the contents changed in both the kernel and the
System.map. I did a lookup to match that c104e570 address. The closest
addresses were:

# cat /boot/System.map-2.6.32-3-686 |grep c104e
c104e07d t tick_notify
c104e374 t tick_periodic
c104e3dd T tick_handle_periodic
c104e444 T tick_get_broadcast_device
c104e44a T tick_get_broadcast_mask
c104e450 T tick_is_broadcast_device
c104e464 T tick_set_periodic_handler
c104e477 T tick_get_broadcast_oneshot_mask
c104e47d T tick_broadcast_oneshot_active
c104e48a T tick_shutdown_broadcast_oneshot
c104e4ac T tick_check_oneshot_broadcast
c104e4d5 T tick_resume_broadcast_oneshot
c104e4e2 T tick_broadcast_setup_oneshot
c104e5ae T tick_broadcast_switch_to_oneshot
c104e5e0 t tick_do_broadcast
c104e634 t tick_handle_oneshot_broadcast
c104e71d t tick_do_periodic_broadcast
c104e74a T tick_broadcast_oneshot_control
c104e82c T tick_resume_broadcast
c104e8a3 T tick_device_uses_broadcast
c104e91b T tick_suspend_broadcast
c104e943 T tick_shutdown_broadcast
c104e989 t tick_handle_periodic_broadcast
c104e9ce T tick_broadcast_on_off
c104eb0e T tick_check_broadcast_device
c104eb60 T tick_oneshot_mode_active
c104eb96 T tick_switch_to_oneshot
c104ec1e T tick_init_highres
c104ec28 T tick_dev_program_event
c104eca9 T tick_setup_oneshot
c104ecd9 T tick_program_event
c104ecfc T tick_resume_oneshot
c104ed24 T tick_get_tick_sched
c104ed33 T tick_nohz_get_sleep_length
c104ed4c T tick_oneshot_notify
c104ed63 t tick_init_jiffy_update
c104edae T tick_check_oneshot_change
c104eea1 t tick_do_update_jiffies64
c104ef87 t tick_nohz_handler

A quick Google search
(http://www.google.com/search?q=linux+kernel+tick+broadcast) seems to
show these are related to the APIC? Does anyone know what these ticks
do to cause these rare and random machine errors and kernel panics?
The address seems to hang out in the broadcast area. Again, I am not
familiar with hardware. :(

--
"We are anthill men upon an anthill world." --Ray Bradbury
   /\___/\
  / /\ /\ \    Phillip (Ant) @ http://antfarm.ma.cx (Personal Web Site)
 | |o   o| |   Ant's Quality Foraged Links (AQFL): http://aqfl.net
    \ _ /      Please remove ANT if replying by e-mail.
     ( )
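
A note on reading System.map for the lookup above: a logged address will rarely match a line exactly, because the file lists only where each symbol starts. The function containing an address is the last symbol whose start is at or below it. A minimal Python sketch of that lookup follows. The helper names `load_system_map` and `containing_symbol` are made up for illustration, the sample is a four-line excerpt of the listing above, and it assumes the logged ADDR is a kernel virtual address inside the running kernel image (module addresses are not in System.map).

```python
import bisect


def load_system_map(text):
    """Parse System.map text into a sorted list of (address, name) pairs."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        addr, _sym_type, name = parts
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def containing_symbol(symbols, addr):
    """Return (name, offset) of the last symbol starting at or below addr."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None  # addr lies below every known symbol
    start, name = symbols[i]
    return name, addr - start


# Excerpt of the System.map-2.6.32-3-686 listing from the post above.
SAMPLE = """\
c104e4d5 T tick_resume_broadcast_oneshot
c104e4e2 T tick_broadcast_setup_oneshot
c104e5ae T tick_broadcast_switch_to_oneshot
c104e5e0 t tick_do_broadcast
"""

if __name__ == "__main__":
    syms = load_system_map(SAMPLE)
    name, offset = containing_symbol(syms, 0xC104E570)
    print(f"{name}+{offset:#x}")
```

Run against the excerpt, this prints tick_broadcast_setup_oneshot+0x8e for c104e570. System.map carries no symbol sizes, so the result only says which symbol starts closest below the address; treat it as a hint, and make sure the map file matches the kernel that logged the error.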
From: Jerry Peters on 16 Mar 2010 16:41

Robert Redelmeier <redelm(a)ev1.net.invalid> wrote:
> Jerry Peters <jerry(a)example.invalid> wrote in part:
>> No, ACPI is also involved with hardware configuration:
>> Advanced *Configuration* & Power Interface.
>
> That was the intent, a replacement for PnP, however AFAIK Linux
> _only_ implements the power features, and even has trouble with
> that. Linus has been known to rail against ACPI.
>
> -- Robert
>

Wrong, Linux implements the configuration features also. Some
machines, probably newer laptops, can't be configured without ACPI.
And I'd expect that desktop machines will be getting to that point
also. Linus hates the ACPI design, the AML language that invokes
unknown and probably buggy firmware routines. It's another "everything
including the kitchen sink" design.

I'd doubt that the OP's problem is caused by ACPI though. The TLB on
x86 is mostly hardware maintained, the OS's sole responsibility is to
purge the TLB when it changes the page tables. He's getting a parity
error in the associative array, that's a hardware problem.

Jerry
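
The STATUS value in the mcelog record quoted earlier in the thread (9400000000010011) can be unpacked by hand, which shows where the "TLB error 'instruction transaction, level 1'" text comes from. Below is a minimal Python sketch, assuming the standard x86 machine-check status layout (flag bits 57 to 63, error code in the low 16 bits, TLB codes of the form 0000 0000 0001 TTLL). `decode_status` is a hypothetical helper, and the vendor-specific bits 16 to 31, which mcelog presumably uses for the "parity error in virtual array" wording, are deliberately left undecoded.

```python
# Flag bits in the upper part of an x86 machine-check status register.
FLAGS = {
    63: "VAL",    # register holds a valid error record
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # the MISC register holds extra information
    58: "ADDRV",  # the ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}


def decode_status(status):
    """Decode the flag bits and, for TLB errors, the low 16-bit error code."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "error_code": code}
    # TLB errors use the bit pattern 0000 0000 0001 TTLL.
    if (code & 0xFFF0) == 0x0010:
        result["kind"] = "TLB"
        result["transaction"] = TRANSACTION.get((code >> 2) & 3, "reserved")
        result["level"] = LEVEL[code & 3]
    return result


if __name__ == "__main__":
    # STATUS value copied from the mcelog output quoted in the thread.
    for key, value in decode_status(0x9400000000010011).items():
        print(key, value)
```

For the value in the log this gives the flags VAL, EN and ADDRV, with the UC (uncorrected) bit clear, and a TLB code for an instruction transaction at level 1, which agrees with mcelog's own decoding.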
From: Robert Redelmeier on 16 Mar 2010 17:34

Jerry Peters <jerry(a)example.invalid> wrote in part:
> Wrong, Linux implements the configuration features also. Some
> machines, probably newer laptops, can't be configured without ACPI.

While I cannot say that _none_ of the 1000s of device modules use
ACPI, I can say that most do not need it. Not to say BIOS didn't
use it. I've compiled lots of kernels and never needed CONFIG_ACPI_*.
Nor did it help when I couldn't get a device working -- something
fairly frequent under Linux, especially for wireless. Very frustrating
when `lspci` shows it. I presume some sort of device code IPL is
required.

I have no problem squirting arbitrary bytes at known PCI addr[s],
nor do I imagine Linus does either, although Stallman might. But
giving execution over to foreign code in ring0 is a recipe for
insecurity. You wanna get Theo de Raadt even hotter under the
collar? :)

> I'd doubt that the OP's problem is caused by ACPI though. The TLB on
> x86 is mostly hardware maintained, the OS's sole responsibility is to
> purge the TLB when it changes the page tables. He's getting a parity
> error in the associative array, that's a hardware problem.

Agreed it looks like a hardware problem. But the fact it arises
almost exclusively at one code address is very suspicious. Some
code there seems to be triggering some hardware "sensitivity".
Especially since the OP did not have this problem prior to a
known PSU fry-fest.

There have been recent changes to the kernel in this area --
perhaps a roll-back to an earlier kernel (that gave good service
on the hardware) would be a good test. Newer is not always better.

-- Robert

>
> Jerry
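
To check how strongly the errors cluster at one address, the ADDR lines can be tallied rather than eyeballed. A minimal Python sketch, assuming syslog lines in the `mcelog: ADDR <hex>` form shown earlier in the thread. `count_addrs` is a made-up name, and the sample is just that single quoted record, so real use would pass the lines of /var/log/messages instead.

```python
import re
from collections import Counter

# Matches the address field of an mcelog ADDR line as logged via syslog.
ADDR_RE = re.compile(r"mcelog: ADDR ([0-9a-fA-F]+)")


def count_addrs(lines):
    """Count how often each machine-check ADDR value appears in the lines."""
    counts = Counter()
    for line in lines:
        match = ADDR_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


# Lines copied from the single mcelog record quoted in the thread.
SAMPLE = [
    "Mar 16 05:41:16 foobar mcelog: CPU 1 1 instruction cache",
    "Mar 16 05:41:16 foobar mcelog: ADDR c104e570",
    "Mar 16 05:41:16 foobar mcelog: TIME 1268743276 Tue Mar 16 05:41:16 2010",
]

if __name__ == "__main__":
    # Print the most frequent addresses first.
    for addr, n in count_addrs(SAMPLE).most_common():
        print(f"{n:6d}  {addr}")
```

Comparing the tally before and after a kernel roll-back would show whether the errors follow a particular piece of code or keep appearing regardless of which kernel is running.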
From: Trevor Hemsley on 16 Mar 2010 18:39

On Tue, 16 Mar 2010 20:02:49 UTC in comp.os.linux.hardware,
ANTant(a)zimage.com wrote:

> Does anyone know what these ticks do to cause
> these rare and random machine errors and kernel panics?

No but everything about those errors looks hardware related so I'd be
looking at replacing the cpu at the very least. That looks like the
most likely component but it's not necessarily the right one - other
bits that spring to mind are motherboard, PSU and RAM.

--
Trevor Hemsley, Brighton, UK
Trevor dot Hemsley at ntlworld dot com