From: Yousuf Khan on
Ant wrote:
> On 3/12/2010 10:49 PM PT, Yousuf Khan typed:
>
>> Error always on CPU 1? Maybe try to disable that core?
>
> No, there was a 0 and some interesting parts:
>
> Mar 11 00:17:55 foobar mcelog: CPU 0 1 instruction cache
> Mar 11 00:17:55 foobar mcelog: memory/cache error 'evict mem transaction, instruction transaction, level 1'
> Mar 11 00:29:19 foobar kernel: [ 0.008322] Checking 'hlt' instruction... OK.
> Mar 12 05:45:36 foobar kernel: [ 0.004322] Checking 'hlt' instruction... OK.
> Mar 12 14:45:16 foobar mcelog: CPU 1 1 instruction cache
> Mar 12 14:45:16 foobar mcelog: TLB error 'instruction transaction, level 1'
> Mar 12 22:02:46 foobar mcelog: CPU 1 1 instruction cache
> Mar 12 22:02:46 foobar mcelog: TLB error 'instruction transaction, level 1'
>
>
> I wonder how I can output more sections of those errors instead of
> lines. Is there really a way to disable a core? I don't know how nor saw
> one in CMOS (yes, latest BIOS).

The vast majority seem to be on CPU 1 rather than CPU 0. The error on
CPU 0 is also slightly different from the one on CPU 1. It seems like
CPU 0's error might be related to some kind of bad cache transfer from
CPU 1.
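
A rough way to check that tally over the whole log rather than by eye is
sketched below. It assumes the mcelog lines keep the "mcelog: CPU <n> ..."
shape quoted above; the function name and the sample list are only for
illustration.

```python
import re
from collections import Counter

# Matches lines such as
# "Mar 12 14:45:16 foobar mcelog: CPU 1 1 instruction cache".
# The pattern is an assumption based on the lines quoted in this thread.
CPU_LINE = re.compile(r"mcelog: CPU (\d+) ")


def count_errors_by_cpu(lines):
    """Return a Counter mapping CPU number to the number of mcelog CPU lines."""
    counts = Counter()
    for line in lines:
        match = CPU_LINE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Sample taken from the log lines quoted above.
sample = [
    "Mar 11 00:17:55 foobar mcelog: CPU 0 1 instruction cache",
    "Mar 11 00:17:55 foobar mcelog: memory/cache error 'evict mem transaction'",
    "Mar 12 14:45:16 foobar mcelog: CPU 1 1 instruction cache",
    "Mar 12 22:02:46 foobar mcelog: CPU 1 1 instruction cache",
]

for cpu, count in sorted(count_errors_by_cpu(sample).items()):
    print(f"CPU {cpu}: {count} event(s)")
```

To run it against the real log, pass an open file instead of the sample,
e.g. count_errors_by_cpu(open("/var/log/messages")).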

As for disabling the core, I'm not sure where to look for it in BIOS. My
own BIOS has a feature called Advanced Clock Calibration (ACC), which
allows me to change how many cores come up on my Phenom II X3. I can
enable up to 4 cores, or change which cores are enabled, theoretically.
However, in my case, doing anything but the default results in a hang.
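
Outside the BIOS, Linux itself can usually take a core offline through
the CPU hotplug files in sysfs, provided the kernel was built with CPU
hotplug support. The sketch below is only that, a sketch: the helper
names are made up, it needs root, and cpu0 often has no "online" file
at all. It only stops the OS from scheduling on the core; it is not the
same as disabling it in firmware.

```python
from pathlib import Path

# Standard sysfs location for per-CPU control files on Linux.
SYSFS_CPU_ROOT = "/sys/devices/system/cpu"


def cpu_online_path(cpu, sysfs_root=SYSFS_CPU_ROOT):
    """Build the path of the 'online' control file for the given CPU number."""
    return Path(sysfs_root) / f"cpu{cpu}" / "online"


def set_cpu_online(cpu, online, sysfs_root=SYSFS_CPU_ROOT):
    """Write '1' (online) or '0' (offline) to the CPU's control file.

    Hypothetical helper: on a real system this requires root and a kernel
    with CPU hotplug enabled, and it raises an OSError otherwise.
    """
    path = cpu_online_path(cpu, sysfs_root)
    path.write_text("1" if online else "0")
    return path
```

Calling set_cpu_online(1, False) as root would take CPU 1 offline until
it is set back to True or the machine reboots.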

Yousuf Khan
From: Yousuf Khan on
Ant wrote:
> On 3/12/2010 9:11 PM PT, Yousuf Khan typed:
>
>> My guess here, but is it possible that the TLB only decays if it isn't
>> being used constantly?
>
> Last night I ran memtest86+ v4.00's test #9.
> http://www.memtest86.com/tech.html#descri says: "Test 9 [Bit fade test,
> 90 min, 2 patterns]
>
> The bit fade test initializes all of memory with a pattern and then
> sleeps for 90 minutes. Then memory is examined to see if any memory bits
> have changed. All ones and all zero patterns are used. This test takes 3
> hours to complete. The Bit Fade test is not included in the normal test
> sequence and must be run manually via the runtime configuration menu."
>
> I only ran it for over 3.25 hours and it passed (only one test).
> Shouldn't this test that problem? Or is that TLB somewhere else? Maybe I
> need to run it longer and more?

No, the TLB is inside the processor, not in RAM. It's part of the
processor's caching system. So if there was a Memtest equivalent for the
caching system, then this is the sort of test that would probably catch
it. Though the caching system caches your RAM, they are not directly
related otherwise. A problem with your RAM will not result in a problem
with your cache, or vice-versa.

If you look at the functional hierarchy in a system, it usually goes
like this: Core -> Cache -> Memory Controller -> RAM. So as you can see,
the cache is sitting two levels up from the RAM. These days, everything
from the Core to the Memory Controller sits inside the processor, and
RAM remains outside. In the olden days, even the Memory Controller was
outside the processor; it used to be part of the chipset.

Basically, it's not a problem with your memory; it's a problem with your
processor. At what point will you just simply decide to replace the
processor? I'm sure you can get a Socket 939 Athlon X2 relatively cheap
these days used.

Yousuf Khan
From: Robert Redelmeier on
Ant <ant(a)zimage.comant> wrote in part:
> On 3/12/2010 8:38 PM PT, Robert Redelmeier typed:
>
>>> Bah, the error came back again after my tests:
>>>
>>> dmesg:
>>> [32399.988020] Machine check events logged
>>>
>>> From /var/log/messages:
>>> Mar 12 14:45:16 foobar kernel: [32399.988020] Machine check events logged
>>> Mar 12 14:45:16 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!
>>> Mar 12 14:45:16 foobar mcelog: Please contact your hardware vendor
>>> Mar 12 14:45:16 foobar mcelog: MCE 0
>>> Mar 12 14:45:16 foobar mcelog: CPU 1 1 instruction cache
>>> Mar 12 14:45:16 foobar mcelog: ADDR c11b6ff0
>>> Mar 12 14:45:16 foobar mcelog: TIME 1268433916 Fri Mar 12 14:45:16 2010
>>> Mar 12 14:45:16 foobar mcelog: TLB parity error in virtual array
>>> Mar 12 14:45:16 foobar mcelog: TLB error 'instruction transaction, level 1'
>>> Mar 12 14:45:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
>>> Mar 12 14:45:16 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
>>> Mar 12 14:45:16 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
>>
>> Noting the addr is in kernel space and the instruction cache,
>> this is going to take much ingenuity to replicate :(
>
> You just gave me an idea: # cat /var/log/messages |grep MCGSTATUS
> Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
> Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
> Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0

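[snip] No, STATUS is the status word, where the individual bits have
meanings, so seeing the same value repeated just means the same kind of
error. The ADDR line tells you where the error occurred. 0xC0000000 and
up is kernel space on most 32-bit x86 Linux kernels.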
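
As a rough illustration, the STATUS value quoted above can be picked
apart like this. The flag positions (bits 63 down to 57) and the TLB
error-code layout (0000 0000 0001 TTLL) follow the usual x86 machine
check register format; the function names are made up, and the
0xC0000000 split is an assumption that only holds for the default
32-bit 3G/1G layout.

```python
# Flag bits in the upper part of an x86 machine-check status register.
STATUS_FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MISC register holds valid data
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mc_status(status):
    """Split a 64-bit status word into its flag bits and 16-bit error code."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    decoded["error_code"] = status & 0xFFFF
    return decoded


def decode_tlb_code(code):
    """Decode a TLB error code of the form 0000 0000 0001 TTLL, else None."""
    if code & 0xFFF0 != 0x0010:
        return None
    transaction = ("instruction", "data", "generic", "reserved")[(code >> 2) & 3]
    level = ("level 0", "level 1", "level 2", "generic")[code & 3]
    return transaction, level


def is_kernel_address(addr, split=0xC0000000):
    """Assumes the default 32-bit 3G/1G user/kernel split."""
    return addr >= split


status = decode_mc_status(0x9400000000010011)
print({name: value for name, value in status.items() if value is True})
print(decode_tlb_code(status["error_code"]))
print(is_kernel_address(0xC11B6FF0))
```

With the quoted value, VAL, EN and ADDRV come out set and UC comes out
clear, and the low 16 bits decode to the same "instruction transaction,
level 1" TLB error that mcelog printed.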

-- Robert

From: Robert Redelmeier on
Ant <ant(a)zimage.comant> wrote in part:
> On 3/12/2010 11:14 PM PT, Ant typed:
>>> Wait a minute.
>>> # dpkg -l | grep ^ii |grep cpu
>>> ii cpufrequtils 006-2 utilities to deal with the cpufreq Linux kernel
>>> feature
>>> ii cpulimit 1.1-13 tool for limiting the CPU usage of a process
>>> ii libcpufreq0 006-2 shared library to deal with the cpufreq Linux
>>> kernel fe
>>>
>>> I don't think I am supposed to have these even though I disabled
>>> cool'n'quiet and don't have powernow module. I will try removing them
>>> and see if I still have problems.
>
> Also:
>
> # lsmod |grep cpu
> cpufreq_powersave 602 0
> cpufreq_userspace 1444 0
> cpufreq_stats 1940 0
> cpufreq_conservative 4018 0
> xt_tcpudp 1743 92
> x_tables 8335 6 xt_tcpudp,xt_limit,xt_state,ipt_LOG,ipt_REJECT,ip_tables
>
> Not sure if those are bad or not if I don't use AMD's Cool'n'Quiet and .


IIRC, there are some errata out on AMD CnQ.

I wouldn't worry too much about lib* and other userspace tools.
OTOH, I would rmmod cpufreq* because they get loaded in kernel space.
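
Since rmmod takes module names rather than a wildcard, something like
the sketch below could list the candidates first. It only parses lsmod
text and prints the commands without running anything; the function
name is made up, and the sample is the lsmod output quoted above.

```python
def cpufreq_modules(lsmod_output):
    """Return names of loaded modules whose name starts with 'cpufreq'."""
    modules = []
    for line in lsmod_output.splitlines():
        fields = line.split()
        # The first column of lsmod output is the module name.
        if fields and fields[0].startswith("cpufreq"):
            modules.append(fields[0])
    return modules


# Sample copied from the lsmod output quoted in this thread.
sample = """cpufreq_powersave 602 0
cpufreq_userspace 1444 0
cpufreq_stats 1940 0
cpufreq_conservative 4018 0
xt_tcpudp 1743 92
"""

# Print the commands for review instead of executing them.
for name in cpufreq_modules(sample):
    print("rmmod", name)
```

Running the printed commands as root is left as a manual step, so
nothing gets unloaded by accident.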

Although perhaps justified by laptop battery despair, I'm skeptical of
CPU frequency savings. Idle at HLT ought to be enough, and base leakage
is more significant. This business of clocks eating 10% of power might
apply at HLT, but looks like poor design if true at full load.


-- Robert


From: Ant on
On 3/13/2010 7:32 AM PT, Yousuf Khan typed:

> No, the TLB is inside the processor, not in RAM. It's part of the
> processor's caching system. So if there was a Memtest equivalent for the
> caching system, then this is the sort of test that would probably catch
> it. Though the caching system caches your RAM, they are not directly
> related otherwise. A problem with your RAM will not result in a problem
> with your cache, or vice-versa.
>
> If you look at the functional hierarchy in a system, it usually goes
> like this: Core -> Cache -> Memory Controller -> RAM. So as you can see,
> the cache is sitting two levels up from the RAM. These days, everything
> from the Core to the Memory Controller sits inside the processor, and
> RAM remains outside. In the olden days, even the Memory Controller was
> outside the processor; it used to be part of the chipset.
>
> Basically, it's not a problem with your memory; it's a problem with your
> processor. At what point will you just simply decide to replace the
> processor? I'm sure you can get a Socket 939 Athlon X2 relatively cheap
> these days used.

Ah! Now if only I could find a way to test the processor, both in my old
Debian OS and outside it (KNOPPIX v6.2.1 LiveCD and memtest86+ boot
disk). However, I have not been able to reproduce the problems except by
waiting while the machine sits mostly idle in my old Debian. I even
wonder if my old Debian is causing it. Hence why I researched my
Cool'n'Quiet stuff, which appears to be disabled and not in use. :(

Actually, I did look at their prices recently and they're not cheap. :(
--
"I got this aunt... Carpenter ant." --Girl and Crow
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: philpi(a)earthlink.netANT
( ) or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.