From: ANTant on 11 Mar 2010 14:41

>> $ top - 09:07:52 up 8:40, 1 user, load average: 6.99, 6.16, 3.49
>> Tasks: 122 total, 8 running, 114 sleeping, 0 stopped, 0 zombie
>> Cpu0 : 0.0%us, 0.2%sy, 74.9%ni, 24.9%id, 0.0%wa, 0.0%hi, 0.0%si,
>> Cpu1 : 0.0%us, 0.2%sy, 74.8%ni, 25.0%id, 0.0%wa, 0.0%hi, 0.0%si,
>> Mem: 2595064k total, 1178604k used, 1416460k free, 124472k buffers
>> Swap: 2361512k total, 0k used, 2361512k free, 455528k cached
>>
>>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  5919 ant   39  19 65632  64m    4 R   39  2.5   3:08.23 burnMMX
>>  5908 ant   39  19 65632  64m    4 R   37  2.5   3:01.57 burnMMX
>>  5913 ant   39  19 65632  64m    4 R   26  2.5   3:00.80 burnMMX
>>  5917 ant   39  19 65632  64m    4 R   26  2.5   2:59.61 burnMMX
>>  5916 ant   39  19 65632  64m    4 R   25  2.5   3:06.34 burnMMX
>>  5914 ant   39  19 65632  64m    4 R   24  2.5   3:02.03 burnMMX
>>  5918 ant   39  19 65632  64m    4 R   23  2.5   3:07.14 burnMMX
>
> You started _40_ and only _7_ are left running?  Bad news.

No, you told me to do seven instead of 40. I think I had all 40 when I
aborted yesterday.

>> I will follow up later. BTW, how long should I run these nonstop? All day?
>
> As long as you can. Min 2h. But if you are getting early
> abends, then you have just confirmed a hardware problem.
Still running seven, and not hogging my HDD like yesterday's 40:

Tasks: 129 total, 8 running, 121 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.5%sy, 74.6%ni, 24.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 1.0%sy, 74.2%ni, 24.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2595064k total, 1352692k used, 1242372k free, 144940k buffers
Swap: 2361512k total, 0k used, 2361512k free, 536844k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5914 ant   39  19 65632  64m    4 R   33  2.5  46:37.13 burnMMX
 5908 ant   39  19 65632  64m    4 R   33  2.5  46:18.56 burnMMX
 5917 ant   39  19 65632  64m    4 R   30  2.5  46:32.14 burnMMX
 5916 ant   39  19 65632  64m    4 R   27  2.5  46:27.77 burnMMX
 5918 ant   39  19 65632  64m    4 R   27  2.5  46:39.00 burnMMX
 5919 ant   39  19 65632  64m    4 R   24  2.5  46:21.80 burnMMX
 5913 ant   39  19 65632  64m    4 R   24  2.5  46:38.61 burnMMX
 4174 ant   40   0 61304  44m 4700 S    1  1.8   1:22.50 launch_here.rb
 6152 ant   40   0  2464 1172  888 R    1  0.0   0:00.03 top
 2532 root  40   0  2704  924  792 S    0  0.0   0:00.17 syslogd
 3211 root  40   0  3392 1116  972 S    0  0.0   0:02.03 hald-addon-stor
    1 root  40   0  2036  704  604 S    0  0.0   0:00.97 init
    2 root  40   0     0    0    0 S    0  0.0   0:00.00 kthreadd
....

Nothing unusual in my dmesg. So far, so good.

>
> To get exit status, you could try
>   nice -19 ./burnMMX | echo $? &
>
> burnMMX typically exits 127 when it encounters a memory error.
> It could do this within the first second if there is a problem
> with memory mapping (hardware does not obey kernel instructions).

Ah, I will try that if I need to run it. I am not aborting the seven
processes now. ;)

--
"We are anthill men upon an anthill world." --Ray Bradbury
   /\___/\
  / /\ /\ \    Phillip (Ant) @ http://antfarm.ma.cx (Personal Web Site)
 | |o   o| |   Ant's Quality Foraged Links (AQFL): http://aqfl.net
    \ _ /      Please remove ANT if replying by e-mail.
     ( )
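A note on the exit-status one-liner quoted above: in `nice -19 ./burnMMX | echo $?`, the `echo` runs concurrently in the other half of the pipeline, so `$?` holds the status of whatever command finished before the pipeline started, not burnMMX's. The minimal bash sketch below runs one instance for a bounded time and reports how it actually ended. It assumes GNU coreutils `timeout`, which exits with 124 when it had to stop the command; `run_checked`, `BURN`, `SIZE` and `DURATION` are hypothetical names for this example only (defaulting to `./burnMMX`, `P` and the 2-hour minimum suggested in the thread), not burnMMX options.

```shell
#!/bin/bash
# Sketch only: run one niced burnMMX instance for a bounded time and report
# how it ended. BURN, SIZE and DURATION are hypothetical knobs, not burnMMX options.
run_checked() {
  local burn=${BURN:-./burnMMX} size=${SIZE:-P} duration=${DURATION:-2h}
  local status

  # GNU timeout stops the command after $duration and then exits with 124.
  # `nice -n 19` is the POSIX spelling of `nice -19`.
  timeout "$duration" nice -n 19 "$burn" "$size"
  status=$?

  if [ "$status" -eq 124 ]; then
    echo "survived $duration"
  else
    # Any other status means the program ended on its own before the deadline.
    # Note that 127 can also mean nice could not find the program at all.
    echo "ended early with status $status"
  fi
}
```

Run it in the background with `run_checked &` (or `DURATION=8h run_checked &`). Without a helper function, the plain-shell fix for the original one-liner is to group the commands so `echo` runs only after burnMMX exits: `( nice -19 ./burnMMX; echo $? ) &`.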
From: ANTant on 11 Mar 2010 20:51

>>> $ top
>>> top - 07:35:06 up 1 day, 23:52, 1 user, load average: 42.33, 37.41, 20.82
>>> Tasks: 188 total, 37 running, 151 sleeping, 0 stopped, 0 zombie
>>> ...
>>>
>>> Do I need to run this overnight or something?
>>
>> Looking at your process list more closely, I notice big gaps in
>> the PIDs. Either you have very active daemons, or you tried to
>> start burnMMX and they quickly abended (very, very bad sign).
>> Please run under `time` so you can spot these quick terminations.
>>
>> Running overnight would give you some assurance, since I
>> have seen rare errors (2-3/day) produce unstable systems.

So far, no errors: no TLB errors and no crashes within eight hours. I will
keep it running for another 3-4 hours, and then I am going to killall those
processes so I can use the machine. It seems like the issue only comes up
when my box is idle? What the frak?
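Robert's suggestion to run each instance under `time` is about catching instances that die within seconds. The hedged bash sketch below does the same bookkeeping without reading `time` output by hand: it starts COUNT niced instances and, as each one exits, appends a line to a log with its exit status and how many seconds it ran. `launch_burners`, `BURN` and the log format are hypothetical, and the 60-second "QUICK" threshold is an arbitrary choice for illustration.

```shell
#!/bin/bash
# Sketch only: start COUNT niced burnMMX instances and log, for each one,
# its exit status and run time. A short run or non-zero status is the
# kind of quick abend that `time` was meant to reveal.
launch_burners() {
  local count=$1 size=$2 log=$3
  local burn=${BURN:-./burnMMX}   # hypothetical override, handy for testing
  local i

  : > "$log"                      # start with an empty log
  for ((i = 1; i <= count; i++)); do
    (
      SECONDS=0                   # bash counts seconds from this assignment
      nice -n 19 "$burn" "$size"
      status=$?
      if (( SECONDS < 60 )); then flag=" QUICK"; else flag=""; fi
      echo "instance $i: status $status after ${SECONDS}s$flag" >> "$log"
    ) &
  done
  wait                            # block until every instance has exited
}
```

Started as `launch_burners 33 P burn.log &`, the prompt comes back immediately, and `burn.log` stays empty for as long as every instance keeps running.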
From: Robert Redelmeier on 11 Mar 2010 23:37

ANTant(a)zimage.com wrote in part:
> No, you told me to do seven instead of 40. I think I had all 40 when I aborted yesterday.

No, I believe I told you to run 7 _fewer_, so 33 instead of 40.
This still does not explain the odd PID numbering unless
you are slow on the kbd or have very active daemons.

> Ah, I will try that if I need to run it. I am not aborting
> the seven processes now. ;)

Fine. Nothing stops you from launching another 26. You want
to use as much RAM as possible without thrashing: that means
more TLB reloads with more tag patterns.

-- Robert
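Robert's rule of thumb (as much RAM as possible without thrashing) reduces to simple arithmetic. A sketch, assuming each `burnMMX P` instance costs about the 65632 kB VIRT that top reports for it in this thread, and treating `reserve_kb` as a hypothetical allowance for the kernel, daemons and X; `fit_instances` is a name made up for this example.

```shell
#!/bin/bash
# Sketch only: how many burnMMX instances fit in RAM.
#   total_kb   - physical memory in kB (MemTotal in /proc/meminfo)
#   per_kb     - memory per instance; 65632 kB is the VIRT top shows for `burnMMX P`
#   reserve_kb - hypothetical headroom left for the kernel, daemons and X
fit_instances() {
  local total_kb=$1 per_kb=${2:-65632} reserve_kb=${3:-0}
  local usable=$(( total_kb - reserve_kb ))

  if (( usable <= 0 )); then
    echo 0
  else
    echo $(( usable / per_kb ))   # shell integer division rounds down
  fi
}
```

For this box's 2595064 kB, `fit_instances 2595064` gives 39 with no reserve, and reserving a hypothetical 400000 kB (`fit_instances 2595064 65632 400000`) brings it down to 33. On a live system the total can be read with `awk '/^MemTotal:/ {print $2}' /proc/meminfo`.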
From: Ant on 12 Mar 2010 09:11

On 3/11/2010 9:40 PM PT, ANTant(a)zimage.com typed:
>> This still does not explain the odd PID numbering unless
>> you are slow on the kbd or have very active daemons.
>
> Nope, I ran a test script that had all those "time nice -19 ./burnMMX P &"
> lines.
>
> ... OK, I just stopped my seven processes earlier so I can use the machine.
> No machine check errors in the logs and no crashes after about 12.25 hours
> nonstop. SO weird!
>
> I am going to try to run memtest86+ v4.00's test #9 during my sleep, and
> then try 33 burnMMX processes tomorrow while working.

Memtest86+ v4.00's test #9 passed after 3.25 hours. I am not sure if I need
to run more of it. I will wait for more replies in my
http://forum.canardpc.com/showthread.php?p=3021104 forum thread.

I just started 33 "time nice -19 ./burnMMX P &" processes from an executable
bash script. After a few minutes, top showed the following (note that I had
just booted the box and was not running X):

$ top
top - 06:08:53 up 23 min, 1 user, load average: 33.05, 28.65, 15.85
Tasks: 173 total, 34 running, 139 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.0%sy, 75.1%ni, 24.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.7%us, 0.3%sy, 99.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2595064k total, 2520296k used, 74768k free, 39504k buffers
Swap: 2361512k total, 2376k used, 2359136k free, 196776k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4189 ant   39  19 65632  64m    4 R   15  2.5   0:37.57 burnMMX
 4170 ant   39  19 65632  64m    4 R   12  2.5   0:38.69 burnMMX
 4151 ant   39  19 65632  64m    4 R    9  2.5   0:38.01 burnMMX
 4145 ant   39  19 65632  64m    4 R    8  2.5   0:38.04 burnMMX
 4148 ant   39  19 65632  64m    4 R    8  2.5   0:38.09 burnMMX
 4164 ant   39  19 65632  64m    4 R    8  2.5   0:36.95 burnMMX
 4192 ant   39  19 65632  64m    4 R    8  2.5   0:36.19 burnMMX
 4135 ant   39  19 65632  64m    4 R    7  2.5   0:36.08 burnMMX
 4150 ant   39  19 65632  64m    4 R    7  2.5   0:34.74 burnMMX
 4167 ant   39  19 65632  64m    4 R    7  2.5   0:36.15 burnMMX
 4169 ant   39  19 65632  64m    4 R    7  2.5   0:39.14 burnMMX
 4193 ant   39  19 65632  64m    4 R    7  2.5   0:35.66 burnMMX
 4153 ant   39  19 65632  64m    4 R    6  2.5   0:37.67 burnMMX
 4163 ant   39  19 65632  64m    4 R    6  2.5   0:33.46 burnMMX
 4186 ant   39  19 65632  64m    4 R    6  2.5   0:35.52 burnMMX
 4190 ant   39  19 65632  64m    4 R    6  2.5   0:33.59 burnMMX
 4149 ant   39  19 65632  64m    4 R    6  2.5   0:36.19 burnMMX
 4165 ant   39  19 65632  64m    4 R    6  2.5   0:35.24 burnMMX
 4171 ant   39  19 65632  64m    4 R    6  2.5   0:38.67 burnMMX
 4191 ant   39  19 65632  64m    4 R    6  2.5   0:36.90 burnMMX
 4194 ant   39  19 65632  64m    4 R    6  2.5   0:38.18 burnMMX
 4168 ant   39  19 65632  64m    4 R    5  2.5   0:37.40 burnMMX
 4152 ant   39  19 65632  64m    4 R    4  2.5   0:35.72 burnMMX
 4195 ant   39  19 65632  64m    4 R    4  2.5   0:34.68 burnMMX
 4198 ant   39  19 65632  64m    4 R    4  2.5   0:36.17 burnMMX
 4162 ant   39  19 65632  64m    4 R    4  2.5   0:37.35 burnMMX
 4187 ant   39  19 65632  64m    4 R    4  2.5   0:36.55 burnMMX
 4196 ant   39  19 65632  64m    4 R    4  2.5   0:37.77 burnMMX
 4142 ant   39  19 65632  64m    4 R    3  2.5   0:37.45 burnMMX
 4188 ant   39  19 65632  64m    4 R    3  2.5   0:35.61 burnMMX
 4197 ant   39  19 65632  64m    4 R    3  2.5   0:37.66 burnMMX
 4139 ant   39  19 65632  64m    4 R    3  2.5   0:35.24 burnMMX
 4166 ant   39  19 65632  64m    4 R    3  2.5   0:34.89 burnMMX
 4249 ant   40   0  2464 1204  888 R    1  0.0   0:00.04 top
 2876 ant   40   0 58428  43m 4692 S    0  1.7   0:11.61 launch_here.rb
    1 root  40   0  2036  640  604 S    0  0.0   0:00.81 init
    2 root  40   0     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root  RT   0     0    0    0 S    0  0.0   0:00.00 migration/0
    4 root  20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/0
    5 root  RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    6 root  RT   0     0    0    0 S    0  0.0   0:00.00 migration/1
    7 root  20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/1
    8 root  RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/1
    9 root  20   0     0    0    0 S    0  0.0   0:00.00 events/0
....

$ sensors -f
acpitz-virtual-0
Adapter: Virtual device
temp1:        +71.2°F  (crit = +206.2°F)

k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp:  +125.6°F
Core1 Temp:  +100.4°F

I am planning to leave them running for about 15 hours straight, until I
need to use the box locally again tonight.
I am curious if I will get no errors and crashes like yesterday's seven processes test.
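For a 15-hour unattended run, a timestamped record of when (if ever) the kernel starts reporting hardware errors is more informative than a single dmesg check at the end. A minimal sketch follows; the grep patterns are assumptions and should be adjusted to the exact machine-check or TLB messages your kernel prints, `count_hw_errors` and `watch_hw_errors` are hypothetical helpers, and reading dmesg may require root on some systems.

```shell
#!/bin/bash
# Sketch only: periodically count kernel-log lines that look like hardware
# error reports. The patterns are assumptions; match them to your own dmesg.
count_hw_errors() {
  # -c counts matching lines, -i ignores case, -E enables alternation.
  # grep prints 0 (and exits 1) when nothing matches; `|| true` hides that exit code.
  grep -ciE 'machine check|hardware error|tlb' "$1" || true
}

watch_hw_errors() {
  local interval=${1:-600} log=${2:-hw_errors.log}
  local snap
  snap=$(mktemp)
  while :; do
    dmesg > "$snap" 2>/dev/null
    # One line per sample: date, time, number of suspicious lines so far.
    echo "$(date '+%F %T') $(count_hw_errors "$snap")" >> "$log"
    sleep "$interval"
  done
}
```

Started as `watch_hw_errors 600 hw_errors.log &` and stopped with `kill` when the run is over, the log shows roughly when the count first rose, which can then be compared with the burnMMX run times.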
From: Robert Redelmeier on 12 Mar 2010 11:43
Ant <ant(a)zimage.comant> wrote in part:
> I am planning to leave them running for about 15 hours straight until
> I need to use the box locally again tonight. I am curious if I will
> get no errors and crashes like yesterday's seven processes test.

Yes, this seems to be running well. I'm not sure what else to suggest.
It is odd to see stability under load but instability at idle.
Motherboard capacitors or the power supply?

You might try running 66 `burnMMX O` or 132 `burnMMX N` or even
264 `burnMMX M` to increase TLB swapping (more, smaller maps).
But that may be too much trouble.

-- Robert
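Robert's progression (33 `P`, 66 `O`, 132 `N`, 264 `M`) doubles the instance count at each letter down, which suggests that each letter step halves the per-instance buffer so total RAM use stays about the same. Under that assumption (inferred from the numbers, not confirmed from burnMMX documentation), the count for a given size letter can be computed rather than typed. `count_for_size`, `launch_size`, `BASE` and `BURN` are hypothetical names for this sketch.

```shell
#!/bin/bash
# Sketch only. Assumption: each size letter below P halves the buffer, so the
# instance count doubles to keep total memory use about the same as 33 x P.
count_for_size() {
  local letter=$1 base=${BASE:-33}
  # printf "'X" prints the character code of X, so P=80, O=79, N=78, M=77.
  local steps=$(( $(printf '%d' "'P") - $(printf '%d' "'$letter") ))
  if (( steps < 0 )); then
    echo "sizes above P are not handled by this sketch" >&2
    return 1
  fi
  echo $(( base << steps ))      # base * 2^steps
}

launch_size() {
  local letter=$1 burn=${BURN:-./burnMMX} count i
  count=$(count_for_size "$letter") || return 1
  for ((i = 0; i < count; i++)); do
    nice -n 19 "$burn" "$letter" &
  done
  echo "started $count instances of size $letter"
}
```

With the default base of 33, `count_for_size O` gives 66, `N` gives 132 and `M` gives 264, matching the counts Robert suggests; `launch_size N` would then start 132 niced instances in the background.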