From: Yousuf Khan on
ANTant(a)zimage.com wrote:
> So far no errors (no TLB errors or crashes) within eight hours. I will
> keep it running for another 3-4 hours and then I am going to killall
> those processes so I can use the machine.
>
> It seems like the issue only comes up if my box is not idled? What the
> frak?

Just a guess here: could it be the version of Linux you are running? I
don't remember what the version numbers were, but is it somewhat out of
date? Have you considered updating the kernel?

Yousuf Khan
From: ANTant on
>> So far no errors (no TLB errors or crashes) within eight hours. I will
>> keep it running for another 3-4 hours and then I am going to killall
>> those processes so I can use the machine.
>>
>> It seems like the issue only comes up if my box is not idled? What the
>> frak?
>
> Just a guess here: could it be the version of Linux you are running? I
> don't remember what the version numbers were, but is it somewhat out of
> date? Have you considered updating the kernel?

I was on Debian's kernel 2.6.30 and am currently using 2.6.32. Neither
made a difference. :(

I just need to reproduce this problem outside of my Debian install, via
a LiveCD, memtest86, something!
--
"We are anthill men upon an anthill world." --Ray Bradbury
/\___/\
/ /\ /\ \ Phillip (Ant) @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links (AQFL): http://aqfl.net
\ _ / Please remove ANT if replying by e-mail.
( )
From: Robert Redelmeier on
ANTant(a)zimage.com wrote in part:
> Too much trouble as in what? Swap partition going crazy
> like when I ran 40 processes? :D

Thrashing (continuous swapping in/out) is an obvious
sign of an overloaded system. Useful to see if you
haven't seen it before.
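
For anyone who wants to watch for this rather than eyeball it, here is a minimal sketch that samples the `pswpin` and `pswpout` counters in `/proc/vmstat` and reports pages swapped per second. The function names and the threshold of 100 pages/s are illustrative assumptions, not anything from this thread; pick a threshold that suits your machine.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rates(before, after, seconds):
    """Return (pages swapped in per second, pages swapped out per second)."""
    rate_in = (after["pswpin"] - before["pswpin"]) / seconds
    rate_out = (after["pswpout"] - before["pswpout"]) / seconds
    return rate_in, rate_out


def looks_like_thrashing(rate_in, rate_out, threshold=100.0):
    # Assumption: heavy traffic in BOTH directions at once is the warning
    # sign; the threshold is arbitrary and only for illustration.
    return rate_in > threshold and rate_out > threshold


def read_vmstat(path="/proc/vmstat"):
    with open(path) as f:
        return parse_vmstat(f.read())


if __name__ == "__main__":
    INTERVAL = 5  # seconds between samples
    prev = read_vmstat()
    while True:
        time.sleep(INTERVAL)
        cur = read_vmstat()
        rate_in, rate_out = swap_rates(prev, cur, INTERVAL)
        flag = "  <-- possible thrashing" if looks_like_thrashing(rate_in, rate_out) else ""
        print(f"swap in {rate_in:8.1f} pages/s, out {rate_out:8.1f} pages/s{flag}")
        prev = cur
```

Run it in a second terminal while the burn processes are going; `vmstat 5` shows the same information in its `si`/`so` columns if you would rather not script it.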

> BTW, so far no crashes or errors with 33 processes. I
> think it has been over six hours. I can abort them now if
> you think this is enough. :P

Probably is.

-- Robert

From: ANTant on
>> Too much trouble as in what? Swap partition going crazy
>> like when I ran 40 processes? :D
>
> Thrashing (continuous swapping in/out) is an obvious
> sign of an overloaded system. Useful to see if you
> haven't seen it before.

Yeah, never seen it with 40 cpuburn processes. ;)


>> BTW, so far no crashes or errors with 33 processes. I
>> think it has been over six hours. I can abort them now if
>> you think this is enough. :P
>
> Probably is.

OK, I will abort it after this post. FYI, here are the numbers after
almost 8.25 hours of this 33-process test:

$ top
top - 14:10:29 up 8:25, 1 user, load average: 32.89, 32.48, 32.54
Tasks: 176 total, 34 running, 142 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.3%us, 0.7%sy, 99.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 75.1%ni, 24.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2595064k total, 2347444k used, 247620k free, 3380k buffers
Swap: 2361512k total, 181152k used, 2180360k free, 19972k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 4135 ant       39  19 65632  64m    4 R   18  2.5  29:46.00 burnMMX
 4170 ant       39  19 65632  64m    4 R   16  2.5  29:41.27 burnMMX
 4171 ant       39  19 65632  64m    4 R    8  2.5  29:46.18 burnMMX
 4148 ant       39  19 65632  64m    4 R    8  2.5  29:44.35 burnMMX
 4139 ant       39  19 65632  64m    4 R    8  2.5  29:59.29 burnMMX
 4149 ant       39  19 65632  64m    4 R    7  2.5  29:47.52 burnMMX
 4152 ant       39  19 65632  64m    4 R    7  2.5  29:41.53 burnMMX
 4165 ant       39  19 65632  64m    4 R    7  2.5  29:39.02 burnMMX
 4166 ant       39  19 65632  64m    4 R    7  2.5  29:47.40 burnMMX
 4167 ant       39  19 65632  64m    4 R    7  2.5  29:56.93 burnMMX
 4186 ant       39  19 65632  64m    4 R    7  2.5  29:52.83 burnMMX
 4192 ant       39  19 65632  64m    4 R    7  2.5  29:51.10 burnMMX
 4145 ant       39  19 65632  64m    4 R    7  2.5  29:46.08 burnMMX
 4153 ant       39  19 65632  64m    4 R    7  2.5  29:37.84 burnMMX
 4187 ant       39  19 65632  64m    4 R    7  2.5  29:24.71 burnMMX
 4164 ant       39  19 65632  64m    4 R    5  2.5  29:25.25 burnMMX
 4169 ant       39  19 65632  64m    4 R    5  2.5  29:27.45 burnMMX
 4196 ant       39  19 65632  64m    4 R    5  2.5  29:28.98 burnMMX
 4162 ant       39  19 65632  64m    4 R    5  2.5  29:29.86 burnMMX
 4163 ant       39  19 65632  64m    4 R    5  2.5  29:40.07 burnMMX
 4191 ant       39  19 65632  64m    4 R    5  2.5  29:43.16 burnMMX
 4193 ant       39  19 65632  64m    4 R    5  2.5  29:30.78 burnMMX
 4195 ant       39  19 65632  64m    4 R    4  2.5  29:47.66 burnMMX
 4151 ant       39  19 65632  64m    4 R    3  2.5  29:29.78 burnMMX
 4194 ant       39  19 65632  64m    4 R    3  2.5  29:48.43 burnMMX
 4142 ant       39  19 65632  64m    4 R    3  2.5  29:45.24 burnMMX
 4150 ant       39  19 65632  64m    4 R    3  2.5  29:53.71 burnMMX
 4168 ant       39  19 65632  64m    4 R    3  2.5  29:36.53 burnMMX
 4188 ant       39  19 65632  64m    4 R    3  2.5  29:32.41 burnMMX
 4189 ant       39  19 65632  64m    4 R    3  2.5  29:11.58 burnMMX
 4190 ant       39  19 65632  64m    4 R    3  2.5  29:24.99 burnMMX
 4197 ant       39  19 65632  64m    4 R    3  2.5  29:39.87 burnMMX
 4198 ant       39  19 65632  64m    4 R    3  2.5  29:37.67 burnMMX
 5108 ant       40   0  2464 1204  888 R    0  0.0   0:00.04 top
    1 root      40   0  2036  152  132 S    0  0.0   0:00.82 init
    2 root      40   0     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0
    4 root      20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/0
    5 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    6 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/1
    7 root      20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/1
    8 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/1
    9 root      20   0     0    0    0 S    0  0.0   0:00.00 events/0
   10 root      20   0     0    0    0 S    0  0.0   0:00.01 events/1
   11 root      20   0     0    0    0 S    0  0.0   0:00.00 cpuset
   12 root      20   0     0    0    0 S    0  0.0   0:00.00 khelper
   13 root      20   0     0    0    0 S    0  0.0   0:00.00 netns
   14 root      20   0     0    0    0 S    0  0.0   0:00.00 async/mgr
   15 root      20   0     0    0    0 S    0  0.0   0:00.00 pm

$ sensors -f
acpitz-virtual-0
Adapter: Virtual device
temp1: +71.2°F (crit = +206.2°F)

k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +120.2°F
Core1 Temp: +91.4°F


I noticed that swap partition usage has grown since early this morning. :)
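
The swap growth is roughly what the numbers in the top output predict. As a back-of-the-envelope sketch (assuming every burnMMX process really keeps its full 64m RES resident, and ignoring the kernel and everything else running), the resident demand of N processes can be compared against the "Mem: 2595064k total" line:

```python
# Figures taken from the top output above.
MEM_TOTAL_KIB = 2595064        # "Mem: 2595064k total"
RES_PER_PROC_KIB = 64 * 1024   # each burnMMX shows RES = 64m


def resident_demand_kib(nprocs):
    """Memory needed to keep every process's working set in RAM."""
    return nprocs * RES_PER_PROC_KIB


def headroom_kib(nprocs):
    """RAM left over for everything else (negative means it cannot fit)."""
    return MEM_TOTAL_KIB - resident_demand_kib(nprocs)


if __name__ == "__main__":
    for n in (33, 40):
        print(f"{n} processes: need {resident_demand_kib(n)} KiB, "
              f"headroom {headroom_kib(n)} KiB")
```

With 33 processes there is still about 420 MiB left for the rest of the system, so only a little gets pushed to swap. With 40 processes the burn processes alone want more than the total RAM, which fits with the swap partition going crazy in the earlier 40-process run.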
From: ANTant on
> OK, I will abort it after this post. FYI, here are the numbers after
> almost 8.25 hours of this 33-process test:
>
> I noticed that swap partition usage has grown since early this morning. :)

Bah, the error came back again after my tests:

dmesg:
[32399.988020] Machine check events logged

From /var/log/messages:
Mar 12 14:45:16 foobar kernel: [32399.988020] Machine check events logged
Mar 12 14:45:16 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Mar 12 14:45:16 foobar mcelog: Please contact your hardware vendor
Mar 12 14:45:16 foobar mcelog: MCE 0
Mar 12 14:45:16 foobar mcelog: CPU 1 1 instruction cache
Mar 12 14:45:16 foobar mcelog: ADDR c11b6ff0
Mar 12 14:45:16 foobar mcelog: TIME 1268433916 Fri Mar 12 14:45:16 2010
Mar 12 14:45:16 foobar mcelog: TLB parity error in virtual array
Mar 12 14:45:16 foobar mcelog: TLB error 'instruction transaction, level 1'
Mar 12 14:45:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 12 14:45:16 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 12 14:45:16 foobar mcelog: CPUID Vendor AMD Family 15 Model 43

:(
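
For reference, the STATUS value in that log can be picked apart by hand. Below is a rough sketch that decodes the generic x86 machine-check status layout: the flag bits at the top of the register and the 16-bit MCA error code at the bottom, where a TLB error has the form 0000 0000 0001 TTLL. The function name and dict layout are my own, and the model-specific bits are only extracted, not interpreted, since their meaning depends on the CPU family.

```python
# Flag bits in the upper part of an MCi_STATUS register.
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Sub-fields of a TLB error code (0000 0000 0001 TTLL).
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic"}


def decode_status(status):
    """Decode the flag bits and, for TLB errors, the TT/LL sub-fields."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {
        "flags": flags,
        "mca_code": code,
        # Bits 31:16 are model-specific; left undecoded here.
        "model_specific": (status >> 16) & 0xFFFF,
    }
    if (code & 0xFFF0) == 0x0010:
        result["kind"] = "TLB"
        result["transaction"] = TRANSACTION.get((code >> 2) & 0x3, "reserved")
        result["level"] = LEVEL[code & 0x3]
    return result


if __name__ == "__main__":
    # STATUS value from the mcelog output above.
    for key, value in decode_status(0x9400000000010011).items():
        print(f"{key}: {value}")
```

For this value the set flags are VAL, EN and ADDRV, and the error code 0x0011 decodes to an instruction TLB error at level 1, matching the mcelog line. The UC (uncorrected) bit is clear, which would be consistent with the machine logging the event and carrying on rather than crashing.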