From: Ant on
On 3/13/2010 10:51 AM PT, Ant typed:

>>>> OTOH, I would rmmod cpufreq* because they get loaded in kernel space.
>>
>> Just make sure they're not running. Look at the
>> module dependences with depmod or moddep, then
>> rmmod them in the correct order.
>
> # lsmod |grep cpufreq
> cpufreq_powersave 602 0
> cpufreq_userspace 1444 0
> cpufreq_stats 1940 0
> cpufreq_conservative 4018 0
>
> For kicks, I removed these four cpufreq modules in that order to see if
> I still get errors and/or kernel panics.

NOPE! Still happening:
Mar 13 12:36:53 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Mar 13 12:36:53 foobar mcelog: Please contact your hardware vendor
Mar 13 12:36:53 foobar mcelog: MCE 0
Mar 13 12:36:53 foobar mcelog: CPU 1 1 instruction cache
Mar 13 12:36:53 foobar mcelog: ADDR c11b6ff0
Mar 13 12:36:53 foobar mcelog: TIME 1268512613 Sat Mar 13 12:36:53 2010
Mar 13 12:36:53 foobar mcelog: TLB parity error in virtual array
Mar 13 12:36:53 foobar mcelog: TLB error 'instruction transaction, level 1'
Mar 13 12:36:53 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 13 12:36:53 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 13 12:36:53 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 13 12:36:53 foobar kernel: [45599.988029] Machine check events logged

:(
--
"Don't be no Ant-Man. An Ant-Man has very low horizons." --Forrest Gump
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: philpi(a)earthlink.netANT
( ) or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.
From: Ant on
On 3/13/2010 1:32 PM PT, Robert Redelmeier typed:

>> # lsmod |grep cpufreq
>> cpufreq_powersave 602 0
>> cpufreq_userspace 1444 0
>> cpufreq_stats 1940 0
>> cpufreq_conservative 4018 0
>
> The 0 is good because these modules are independent.

OK cool.
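
To make the "0 is good" check repeatable, here is a minimal Python sketch that reads lsmod-style text and lists the modules whose use count (third column) is zero. The sample text is the lsmod output quoted above; parse_lsmod and removable are hypothetical helper names, not part of any standard tool.

```python
# Sketch: flag modules that lsmod reports as unused (third column == 0).
# SAMPLE is the lsmod output quoted in this thread; the helper names are
# made up for illustration.

SAMPLE = """\
cpufreq_powersave 602 0
cpufreq_userspace 1444 0
cpufreq_stats 1940 0
cpufreq_conservative 4018 0
"""


def parse_lsmod(text):
    """Return {module: (size, use_count, users)} from lsmod-style lines."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module Size Used by" header.
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        name, size, use_count = fields[0], int(fields[1]), int(fields[2])
        # A fourth column, when present, lists the users separated by commas.
        users = fields[3].split(",") if len(fields) > 3 else []
        modules[name] = (size, use_count, users)
    return modules


def removable(modules):
    """Names of modules with a zero use count, sorted alphabetically."""
    return sorted(name for name, (_, count, _) in modules.items() if count == 0)


if __name__ == "__main__":
    for name in removable(parse_lsmod(SAMPLE)):
        print(name)
```

A zero use count only says that nothing holds a reference to the module right now; it does not say why the module was loaded in the first place.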


>> Is Kernel autoloading these modules even if I don't use
>> AMD's Cool'n'Quiet and powernow?
>
> I don't think so -- such dependencies should show on the right
> (instead of 0) for some other mod. But daemons do strange things,
> and acpid is one of the strangest.

Hmm, lsmod |grep acpid showed nothing.
--
"The antics begin!" --SimAnt Game
From: Robert Redelmeier on
Ant <ant(a)zimage.comant> wrote in part:
> Hmm, lsmod |grep acpid showed nothing.

It won't -- acpid is a daemon, not a module.
It shows on the process task list (ps/top), not lsmod.

But it is unlikely to be the cause if cpufreq modules
aren't loaded.
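
The daemon-versus-module distinction can be checked from /proc directly. The following is a rough Python sketch, assuming a Linux /proc layout where /proc/<pid>/status starts with a "Name:" line and /proc/modules lists one module per line; find_processes and module_loaded are hypothetical names used only for illustration.

```python
import os


def find_processes(name, proc_root="/proc"):
    """Return sorted PIDs whose 'Name:' field in <proc_root>/<pid>/status equals name."""
    pids = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return pids  # no /proc here (not Linux, or not mounted)
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                for line in f:
                    if line.startswith("Name:"):
                        if line.split(":", 1)[1].strip() == name:
                            pids.append(int(entry))
                        break
        except OSError:
            # The process exited between listdir() and open(), or access was denied.
            continue
    return sorted(pids)


def module_loaded(name, modules_path="/proc/modules"):
    """True if name is the first field of some line in the modules file."""
    try:
        with open(modules_path) as f:
            return any(line.split()[0] == name for line in f if line.strip())
    except OSError:
        return False
```

On a box where the daemon is running, find_processes("acpid") should return its PID while module_loaded("acpid") stays False, which is the situation described above.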

-- Robert

From: Robert Redelmeier on
Robert Redelmeier <redelm(a)ev1.net.invalid> wrote in part:
>>>> Bah, the error came back again after my tests:
>>>>
>>>> dmesg:
>>>> [32399.988020] Machine check events logged
>>>>
>>>> From /var/log/messages:
>>>> Mar 12 14:45:16 foobar kernel: [32399.988020] Machine check events logged
>>>> Mar 12 14:45:16 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!
>>>> Mar 12 14:45:16 foobar mcelog: Please contact your hardware vendor
>>>> Mar 12 14:45:16 foobar mcelog: MCE 0
>>>> Mar 12 14:45:16 foobar mcelog: CPU 1 1 instruction cache
>>>> Mar 12 14:45:16 foobar mcelog: ADDR c11b6ff0
>>>> Mar 12 14:45:16 foobar mcelog: TIME 1268433916 Fri Mar 12 14:45:16 2010
>>>> Mar 12 14:45:16 foobar mcelog: TLB parity error in virtual array
>>>> Mar 12 14:45:16 foobar mcelog: TLB error 'instruction transaction, level 1'
>>>> Mar 12 14:45:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
>>>> Mar 12 14:45:16 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
>>>> Mar 12 14:45:16 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
>>>
>>> Noting the addr is in kernel space and the instruction cache,
>>> this is going to take much ingenuity to replicate :(
>>
>> You just gave me an idea: # cat /var/log/messages |grep MCGSTATUS
>> Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
>> Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
>> Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
>
> [snip] no, this is the status word where the bits have meanings.
> The ADDR line tells you where the error occurred. 0xC+ is kernel space
> on most kernels.
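
As an illustration of "the bits have meanings", here is a minimal Python sketch that pulls apart the STATUS word from the logs. It assumes the architectural x86 machine-check layout (flag bits 63..57, the MCA error code in bits 0..15, and the TLB error code pattern 0000 0000 0001 TTLL); bits 16..31 are model-specific and are left undecoded. decode_status is a hypothetical helper, not mcelog's own code.

```python
# Sketch: decode the architectural fields of an MCi_STATUS value.
# Assumes the generic x86 machine-check layout; model-specific bits
# (16..31) are reported as a raw number only.

STATUS = 0x9400000000010011  # value from the mcelog lines in this thread

FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic"}


def decode_status(status):
    """Return a dict with the set flag names and the decoded error code."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {
        "flags": flags,
        "mca_code": code,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }
    # TLB errors use the pattern 0000 0000 0001 TTLL.
    if (code & 0xFFF0) == 0x0010:
        result["kind"] = "TLB error"
        result["transaction"] = TRANSACTION.get((code >> 2) & 3, "reserved")
        result["level"] = LEVEL[code & 3]
    return result


if __name__ == "__main__":
    for key, value in decode_status(STATUS).items():
        print(key, value)
```

For this value the set flags come out as VAL, EN and ADDRV, and the low 16 bits (0x0011) decode as an instruction TLB error at level 1, which agrees with the text mcelog printed. UC is clear, so under this layout the error was not flagged as uncorrected.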



Having a better look through your logs, I see this addr is
very common (almost all errs are at this addr). Aren't
you curious about the instruction that produced the errors?
/boot/System.map should contain the addr of all kernel fns,
and there should be some way to look up modules.

-- Robert

From: Ant on
On 3/13/2010 9:28 PM PT, Robert Redelmeier typed:

> Having a better look through your logs, I see this addr is
> very common (almost all errs are at this addr). Aren't
> you curious about the instruction that produced the errors?
> /boot/System.map should contain the addr of all kernel fns,
> and there should be some way to look up modules.

I did a "cat /var/log/messages |grep ADDR" and found these addresses:
c104e3f0
c106e8c0
c11b6ff0 (most common)
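
The same tally can be scripted. This is a small Python sketch that counts how often each address appears on mcelog ADDR lines; the two sample lines are copied from the logs in this thread, and count_addrs is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches syslog lines such as "... foobar mcelog: ADDR c11b6ff0".
ADDR_RE = re.compile(r"mcelog: ADDR ([0-9a-fA-F]+)")

# Two real lines from this thread; a full /var/log/messages has more.
SAMPLE_LINES = [
    "Mar 12 14:45:16 foobar mcelog: ADDR c11b6ff0",
    "Mar 13 12:36:53 foobar mcelog: ADDR c11b6ff0",
]


def count_addrs(lines):
    """Return a Counter mapping each logged address to its number of hits."""
    counts = Counter()
    for line in lines:
        match = ADDR_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


if __name__ == "__main__":
    for addr, hits in count_addrs(SAMPLE_LINES).most_common():
        print(addr, hits)
```

To run it over the real log, pass an open file instead of the sample list, e.g. count_addrs(open("/var/log/messages")), since iterating a file yields its lines.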

But none of them matched an entry in /boot/System.map-2.6.32-trunk-686
exactly. Here are the nearest symbols around each one:

c104e2f9 T tick_handle_periodic
c104e360 T tick_get_broadcast_device

c1063e1b t stop_cpu
c1063ec6 T stop_machine_destroy

c11b6fb8 T acpi_pm_read_verified
c11b6ffc t acpi_pm_read
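
An exact match is not expected, because System.map lists only the start address of each symbol. A logged address belongs to the nearest symbol that starts at or below it. Below is a minimal Python sketch of that lookup, assuming the usual three-column "address type name" format; the sample map holds only lines quoted above, so results are only meaningful for addresses near them, and load_symbols and resolve are hypothetical helper names.

```python
import bisect

# A few System.map lines quoted in this thread; the real file lists
# every kernel symbol.
SAMPLE_MAP = """\
c104e2f9 T tick_handle_periodic
c104e360 T tick_get_broadcast_device
c11b6fb8 T acpi_pm_read_verified
c11b6ffc t acpi_pm_read
"""


def load_symbols(text):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore anything that is not a plain symbol line
        addr, _kind, name = parts
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Return (name, offset) of the last symbol starting at or below addr."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, addr) - 1
    if index < 0:
        return None  # addr lies below the first known symbol
    start, name = symbols[index]
    return name, addr - start


if __name__ == "__main__":
    name, offset = resolve(load_symbols(SAMPLE_MAP), 0xC11B6FF0)
    print("%s+0x%x" % (name, offset))  # prints acpi_pm_read_verified+0x38
```

This only covers symbols built into the kernel image. Addresses inside loadable modules would need the module load addresses as well, which this sketch does not handle.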


The most common one, c11b6ff0, falls between acpi_pm_read_verified
(c11b6fb8) and acpi_pm_read (c11b6ffc), so it is ACPI. Hmm!
# locate acpi_pm
/usr/src/linux-headers-2.6.30-2-common/include/linux/acpi_pmtmr.h
/usr/src/linux-headers-2.6.32-trunk-common/include/linux/acpi_pmtmr.h
# more /usr/src/linux-headers-2.6.32-trunk-common/include/linux/acpi_pmtmr.h
#ifndef _ACPI_PMTMR_H_
#define _ACPI_PMTMR_H_

#include <linux/clocksource.h>

/* Number of PMTMR ticks expected during calibration run */
#define PMTMR_TICKS_PER_SEC 3579545

/* limit it to 24 bits */
#define ACPI_PM_MASK CLOCKSOURCE_MASK(24)

/* Overrun value */
#define ACPI_PM_OVRRUN (1<<24)

#ifdef CONFIG_X86_PM_TIMER

extern u32 acpi_pm_read_verified(void);
extern u32 pmtmr_ioport;

static inline u32 acpi_pm_read_early(void)
{
	if (!pmtmr_ioport)
		return 0;
	/* mask the output to 24 bits */
	return acpi_pm_read_verified() & ACPI_PM_MASK;
}

extern void pmtimer_wait(unsigned);

#else

static inline u32 acpi_pm_read_early(void)
{
	return 0;
}

#endif

#endif


Hmm, what is using ACPI then?
# lsof |grep acpi
kacpid 22 root cwd DIR 3,1 1024 2 /
kacpid 22 root rtd DIR 3,1 1024 2 /
kacpid 22 root txt unknown /proc/22/exe
kacpi_not 23 root cwd DIR 3,1 1024 2 /
kacpi_not 23 root rtd DIR 3,1 1024 2 /
kacpi_not 23 root txt unknown /proc/23/exe
kacpi_hot 24 root cwd DIR 3,1 1024 2 /
kacpi_hot 24 root rtd DIR 3,1 1024 2 /
kacpi_hot 24 root txt unknown /proc/24/exe
acpid 1986 root cwd DIR 3,1 1024 2 /
acpid 1986 root rtd DIR 3,1 1024 2 /
acpid 1986 root txt REG 3,6 34684 353719 /usr/sbin/acpid
acpid 1986 root mem REG 3,1 1331496 14245 /lib/libc-2.10.2.so
acpid 1986 root mem REG 3,1 117416 14243 /lib/ld-2.10.2.so
acpid 1986 root 0u CHR 1,3 0t0 1344 /dev/null
acpid 1986 root 1u CHR 1,3 0t0 1344 /dev/null
acpid 1986 root 2u CHR 1,3 0t0 1344 /dev/null
acpid 1986 root 3r CHR 13,64 0t0 4005 /dev/input/event0
acpid 1986 root 4r CHR 13,65 0t0 4012 /dev/input/event1
acpid 1986 root 5r CHR 13,66 0t0 4016 /dev/input/event2
acpid 1986 root 6r DIR 0,10 0 1 inotify
acpid 1986 root 7u sock 0,6 0t0 5680 can't identify protocol
acpid 1986 root 8u unix 0xf5749c00 0t0 5681 /var/run/acpid.socket
acpid 1986 root 9u unix 0xf52ad400 0t0 7044 /var/run/acpid.socket
acpid 1986 root 10u unix 0xf6fef800 0t0 5683 socket
acpid 1986 root 11u unix 0xf5eb1200 0t0 1585927 /var/run/acpid.socket
acpid 1986 root 12u unix 0xf543a000 0t0 1585931 /var/run/acpid.socket
hald-addo 2632 haldaemon txt REG 3,6 11604 401855 /usr/lib/hal/hald-addon-acpi

I looked around my Debian installation and found an acpid package,
so I uninstalled it to see what happens. FYI:
# apt-get remove acpi
Reading package lists... Done
Building dependency tree
Reading state information... Done
Package acpi is not installed, so not removed
0 upgraded, 0 newly installed, 0 to remove and 126 not upgraded.
foobar:/home/ant/download# apt-cache show acpid
Package: acpid
Priority: optional
Section: admin
Installed-Size: 196
Maintainer: Debian Acpi Team <pkg-acpi-devel(a)lists.alioth.debian.org>
Architecture: i386
Version: 1:2.0.2-1
Depends: libc6 (>= 2.4), lsb-base (>= 3.2-14), module-init-tools (>> 3.1-rel-2)
Recommends: acpi-support-base (>= 0.114-1)
Filename: pool/main/a/acpid/acpid_2.0.2-1_i386.deb
Size: 48204
MD5sum: f7a607fe746c5503f364ef82cd47cbd8
SHA1: 7fac7cedade5d17f6644da1cff1bdafc10d798b3
SHA256: 852fe7a6ac15d4c11a0d9df2739b34dab3307a3b96ffb9a96029a1b0e23cca81
Description: Advanced Configuration and Power Interface event daemon
 Modern computers support the Advanced Configuration and Power Interface (ACPI)
 to allow intelligent power management on your system and to query battery and
 configuration status.
 .
 ACPID is a completely flexible, totally extensible daemon for delivering
 ACPI events. It listens on netlink interface (or on the deprecated file
 /proc/acpi/event), and when an event occurs, executes programs to handle the
 event. The programs it executes are configured through a set of configuration
 files, which can be dropped into place by packages or by the admin.
Homepage: http://acpid.sourceforge.net/
Tag: admin::power-management, hardware::power, hardware::power:acpi, interface::daemon, role::program
Task: laptop


I am not sure why that is installed since this is a desktop. :P So, now
we wait again... At least we're getting more clues. :)
--
"To the gods I am an ant, but to the ants, I am a god." --unknown