From: ANTant on
>> $ top - 09:07:52 up 8:40, 1 user, load average: 6.99, 6.16, 3.49
>> Tasks: 122 total, 8 running, 114 sleeping, 0 stopped, 0 zombie
>> Cpu0 : 0.0%us, 0.2%sy, 74.9%ni, 24.9%id, 0.0%wa, 0.0%hi, 0.0%si,
>> Cpu1 : 0.0%us, 0.2%sy, 74.8%ni, 25.0%id, 0.0%wa, 0.0%hi, 0.0%si,
>> Mem: 2595064k total, 1178604k used, 1416460k free, 124472k buffers
>> Swap: 2361512k total, 0k used, 2361512k free, 455528k cached
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 5919 ant 39 19 65632 64m 4 R 39 2.5 3:08.23 burnMMX
>> 5908 ant 39 19 65632 64m 4 R 37 2.5 3:01.57 burnMMX
>> 5913 ant 39 19 65632 64m 4 R 26 2.5 3:00.80 burnMMX
>> 5917 ant 39 19 65632 64m 4 R 26 2.5 2:59.61 burnMMX
>> 5916 ant 39 19 65632 64m 4 R 25 2.5 3:06.34 burnMMX
>> 5914 ant 39 19 65632 64m 4 R 24 2.5 3:02.03 burnMMX
>> 5918 ant 39 19 65632 64m 4 R 23 2.5 3:07.14 burnMMX
>
>
> You started _40_ and only _7_ are left running? Bad news.

No, you told me to do seven instead of 40. I think I had all 40 when I aborted yesterday.


>> I will follow-up later. BTW, how long should I run these nonstop? All day?
>
> As long as you can. Minimum 2h. But if you are getting early
> abends, then you have just confirmed a hardware problem.

Still running seven and not hogging my HDD like yesterday's 40:

Tasks: 129 total, 8 running, 121 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.5%sy, 74.6%ni, 24.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 1.0%sy, 74.2%ni, 24.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2595064k total, 1352692k used, 1242372k free, 144940k buffers
Swap: 2361512k total, 0k used, 2361512k free, 536844k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5914 ant 39 19 65632 64m 4 R 33 2.5 46:37.13 burnMMX
5908 ant 39 19 65632 64m 4 R 33 2.5 46:18.56 burnMMX
5917 ant 39 19 65632 64m 4 R 30 2.5 46:32.14 burnMMX
5916 ant 39 19 65632 64m 4 R 27 2.5 46:27.77 burnMMX
5918 ant 39 19 65632 64m 4 R 27 2.5 46:39.00 burnMMX
5919 ant 39 19 65632 64m 4 R 24 2.5 46:21.80 burnMMX
5913 ant 39 19 65632 64m 4 R 24 2.5 46:38.61 burnMMX
4174 ant 40 0 61304 44m 4700 S 1 1.8 1:22.50 launch_here.rb
6152 ant 40 0 2464 1172 888 R 1 0.0 0:00.03 top
2532 root 40 0 2704 924 792 S 0 0.0 0:00.17 syslogd
3211 root 40 0 3392 1116 972 S 0 0.0 0:02.03 hald-addon-stor
1 root 40 0 2036 704 604 S 0 0.0 0:00.97 init
2 root 40 0 0 0 0 S 0 0.0 0:00.00 kthreadd
....

Nothing unusual in my dmesg. So far, so good.

>
> To get exit status, you could try
> nice -19 ./burnMMX || echo $? &
>
> burnMMX typically exits 127 when it encounters a memory error.
> It could do this within the first second if there is a problem
> with memory mapping (hardware does not obey kernel instructions).

Ah, I will try that if I need to run it. I am not aborting the seven
processes now. ;)
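
For reference, a minimal sketch of one way to see the exit status of a single run. The helper name run_and_report is made up for illustration and is not part of cpuburn; it uses the POSIX spelling `nice -n 19`, and the 127 status is only what was reported above, not something checked here.

```sh
#!/bin/sh
# Hypothetical helper (not part of cpuburn): run a command at the lowest
# priority and print its exit status once it stops.
run_and_report() {
    nice -n 19 "$@"
    status=$?
    echo "$1 exited with status $status"
    return "$status"
}

# Usage, assuming ./burnMMX sits in the current directory:
#   run_and_report ./burnMMX P &
```

The status is read only after the command has finished, so a copy that dies within the first second still leaves a line on the terminal.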
--
"We are anthill men upon an anthill world." --Ray Bradbury
/\___/\
/ /\ /\ \ Phillip (Ant) @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links (AQFL): http://aqfl.net
\ _ / Please remove ANT if replying by e-mail.
( )
From: ANTant on
>>> $ top
>>> top - 07:35:06 up 1 day, 23:52, 1 user, load average: 42.33, 37.41, 20.82
>>> Tasks: 188 total, 37 running, 151 sleeping, 0 stopped, 0 zombie
>>> ...
>>>
>>> Do I need to run this overnight or something?
>>
>> Looking at your process list more closely, I notice big gaps in
>> the PIDs. Either you have very active daemons, or you tried to
>> start burnMMX and they quickly abended (very, very bad sign).
>> Please run under `time` so you can spot these quick terminations.
>>
>> Running overnight would give you some assurance, since I
>> have seen rare errors (2-3/day) produce unstable systems.

So far no errors (no TLB errors or crashes) within eight hours. I will
keep it running for another 3-4 hours and then I am going to killall
those processes so I can use the machine.

It seems like the issue only comes up if my box is not idled? What the
frak?
From: Robert Redelmeier on
ANTant(a)zimage.com wrote in part:
> No, you told me to do seven instead of 40. I think I had all 40 when I aborted yesterday.

No, I believe I told you to run 7 _fewer_, so 33 instead of 40.
This still does not explain the odd PID numbering unless
you are slow on the kbd or have very active daemons.

> Ah, I will try that if I need to run it. I am not aborting
> the seven processes now. ;)

Fine. Nothing stops you from launching another 26. You want
to use as much RAM as possible without thrashing. More TLB
reloads with more tag patterns.
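
A rough way to size the run, sketched under two assumptions: each `burnMMX P` copy stays at the 64m resident size that top reports, and some amount of RAM is held back for the kernel and everything else. The function name max_copies and the 400 MB reserve are illustrative values, not figures from this thread.

```sh
#!/bin/sh
# Hypothetical helper: how many copies of a given resident size fit in RAM.
#   $1 = total RAM in kB (the "Mem: ... total" figure from top)
#   $2 = resident size of one copy in MB
#   $3 = MB to hold back for the rest of the system (optional, default 0)
max_copies() {
    total_kb=$1
    per_copy_kb=$(($2 * 1024))
    reserve_kb=$((${3:-0} * 1024))
    echo $(((total_kb - reserve_kb) / per_copy_kb))
}

# Usage with the 2595064k total shown by top:
#   max_copies 2595064 64        # no reserve at all
#   max_copies 2595064 64 400    # with an assumed 400 MB held back
# On Linux the total can be read with:
#   awk '/^MemTotal:/ {print $2}' /proc/meminfo
```

With no reserve the answer for this box is 39, which is consistent with 40 copies hitting the disk; holding some memory back lands near the 33 suggested above.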


-- Robert



From: Ant on
On 3/11/2010 9:40 PM PT, ANTant(a)zimage.com typed:

>> This still does not explain the odd PID numbering unless
>> you are slow on the kbd or have very active daemons.
>
> Nope, I ran a test script that had all those "time nice -19 ./burnMMX P
> &" lines.
>
>... OK, I just stopped my seven processes earlier so I can use it and no
> machine check errors in logs and crashes after about 12.25 hours
> nonstop. SO weird!
>
> I am going to try to run memtest86+ v4.00's test #9 during my sleep. And
> then try 33 burnMMX processes tomorrow while working.

Memtest86+ v4.00's test #9 passed after 3.25 hours. I am not sure if I
need to run more of it. I will wait for more replies about it in my
http://forum.canardpc.com/showthread.php?p=3021104 forum thread.

I just started 33 "time nice -19 ./burnMMX P &" processes from an
executable script text file in bash. After a few minutes, top showed
(note that I had just booted it up and was not running X):

$ top
top - 06:08:53 up 23 min, 1 user, load average: 33.05, 28.65, 15.85
Tasks: 173 total, 34 running, 139 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.0%sy, 75.1%ni, 24.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.7%us, 0.3%sy, 99.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2595064k total, 2520296k used, 74768k free, 39504k buffers
Swap: 2361512k total, 2376k used, 2359136k free, 196776k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4189 ant 39 19 65632 64m 4 R 15 2.5 0:37.57 burnMMX
4170 ant 39 19 65632 64m 4 R 12 2.5 0:38.69 burnMMX
4151 ant 39 19 65632 64m 4 R 9 2.5 0:38.01 burnMMX
4145 ant 39 19 65632 64m 4 R 8 2.5 0:38.04 burnMMX
4148 ant 39 19 65632 64m 4 R 8 2.5 0:38.09 burnMMX
4164 ant 39 19 65632 64m 4 R 8 2.5 0:36.95 burnMMX
4192 ant 39 19 65632 64m 4 R 8 2.5 0:36.19 burnMMX
4135 ant 39 19 65632 64m 4 R 7 2.5 0:36.08 burnMMX
4150 ant 39 19 65632 64m 4 R 7 2.5 0:34.74 burnMMX
4167 ant 39 19 65632 64m 4 R 7 2.5 0:36.15 burnMMX
4169 ant 39 19 65632 64m 4 R 7 2.5 0:39.14 burnMMX
4193 ant 39 19 65632 64m 4 R 7 2.5 0:35.66 burnMMX
4153 ant 39 19 65632 64m 4 R 6 2.5 0:37.67 burnMMX
4163 ant 39 19 65632 64m 4 R 6 2.5 0:33.46 burnMMX
4186 ant 39 19 65632 64m 4 R 6 2.5 0:35.52 burnMMX
4190 ant 39 19 65632 64m 4 R 6 2.5 0:33.59 burnMMX
4149 ant 39 19 65632 64m 4 R 6 2.5 0:36.19 burnMMX
4165 ant 39 19 65632 64m 4 R 6 2.5 0:35.24 burnMMX
4171 ant 39 19 65632 64m 4 R 6 2.5 0:38.67 burnMMX
4191 ant 39 19 65632 64m 4 R 6 2.5 0:36.90 burnMMX
4194 ant 39 19 65632 64m 4 R 6 2.5 0:38.18 burnMMX
4168 ant 39 19 65632 64m 4 R 5 2.5 0:37.40 burnMMX
4152 ant 39 19 65632 64m 4 R 4 2.5 0:35.72 burnMMX
4195 ant 39 19 65632 64m 4 R 4 2.5 0:34.68 burnMMX
4198 ant 39 19 65632 64m 4 R 4 2.5 0:36.17 burnMMX
4162 ant 39 19 65632 64m 4 R 4 2.5 0:37.35 burnMMX
4187 ant 39 19 65632 64m 4 R 4 2.5 0:36.55 burnMMX
4196 ant 39 19 65632 64m 4 R 4 2.5 0:37.77 burnMMX
4142 ant 39 19 65632 64m 4 R 3 2.5 0:37.45 burnMMX
4188 ant 39 19 65632 64m 4 R 3 2.5 0:35.61 burnMMX
4197 ant 39 19 65632 64m 4 R 3 2.5 0:37.66 burnMMX
4139 ant 39 19 65632 64m 4 R 3 2.5 0:35.24 burnMMX
4166 ant 39 19 65632 64m 4 R 3 2.5 0:34.89 burnMMX
4249 ant 40 0 2464 1204 888 R 1 0.0 0:00.04 top
2876 ant 40 0 58428 43m 4692 S 0 1.7 0:11.61 launch_here.rb
1 root 40 0 2036 640 604 S 0 0.0 0:00.81 init
2 root 40 0 0 0 0 S 0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/0
4 root 20 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/0
5 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
6 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/1
7 root 20 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/1
8 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/1
9 root 20 0 0 0 0 S 0 0.0 0:00.00 events/0
....

$ sensors -f
acpitz-virtual-0
Adapter: Virtual device
temp1: +71.2°F (crit = +206.2°F)

k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +125.6°F
Core1 Temp: +100.4°F

I am planning to leave them running for about 15 hours straight until I
need to use the box locally again tonight. I am curious if I will get no
errors and crashes like yesterday's seven processes test.
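
For anyone repeating this, here is a minimal sketch of a launcher that replaces the 33 pasted lines with a loop and logs how long each copy lasted, so quick terminations stand out. It is not the script actually used above: the function name launch_copies and the BURN_LOG variable are made up for illustration, and it relies on `date +%s`, which is common but not strictly POSIX.

```sh
#!/bin/sh
# Hypothetical launcher: start N copies of a command at the lowest priority,
# then log each copy's exit status and run time when it stops.
#   $1 = number of copies, remaining arguments = command to run
launch_copies() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        (
            start=$(date +%s)
            nice -n 19 "$@"
            status=$?
            end=$(date +%s)
            echo "copy $i: exit $status after $((end - start))s"
        ) >>"${BURN_LOG:-burn.log}" 2>&1 &
        i=$((i + 1))
    done
    wait    # blocks until every copy has stopped
}

# Usage, assuming ./burnMMX sits in the current directory:
#   launch_copies 33 ./burnMMX P
```

A copy that exits after 0s or 1s in the log would be the quick termination Robert was asking about.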
--
"We are anthill men upon an anthill world." --Ray Bradbury
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: philpi(a)earthlink.netANT
( ) or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.
From: Robert Redelmeier on
Ant <ant(a)zimage.comant> wrote in part:
> I am planning to leave them running for about 15 hours straight until
> I need to use the box locally again tonight. I am curious if I will
> get no errors and crashes like yesterday's seven processes test.

Yes, this seems to be running well. I'm not sure what else to suggest.
Odd to see stability under load but instability at idle. mobo caps/PS?
You might try running 66 `burnMMX O` or 132 `burnMMX N` or even 264
`burnMMX M` to increase TLB swapping (more smaller maps).
But that may be too much trouble.
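
The counts above keep the total memory footprint constant. A small sketch of the arithmetic, assuming each letter step down from P halves the 64 MB map size, which is what the doubling counts suggest (the function name burn_plan is illustrative only):

```sh
#!/bin/sh
# Print copy counts that keep the total footprint at 33 x 64 MB while the
# per-copy map size halves with each letter (assumed: P = 64 MB).
burn_plan() {
    total_mb=$((33 * 64))
    size_mb=64
    for letter in P O N M; do
        echo "burnMMX $letter: $((total_mb / size_mb)) copies x $size_mb MB = $total_mb MB"
        size_mb=$((size_mb / 2))
    done
}

burn_plan
```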

-- Robert