From: Robert Redelmeier on 10 Mar 2010 15:19

Ant <ant(a)zimage.comant> wrote in part:
> So far, nothing interesting in my logs or any crashes. Just a very slow
> Debian/Linux! Also, the HDD light was very busy. And top shows swap
> usage. I checked iotop and saw:
>
> $ iotop
>
> Total DISK READ: 3.02 M/s | Total DISK WRITE: 1259.75 K/s
>   TID PRIO USER    DISK READ   DISK WRITE   SWAPIN      IO>  COMMAND
>
>    31 be/4 root     0.00 B/s     0.00 B/s   0.00 %  99.99 %  [kswapd0]
>  1045 be/4 root     0.00 B/s    38.17 K/s   0.00 %  46.36 %  [kjournald]
>  1465 be/4 ant    690.95 K/s    76.35 K/s  50.71 %   4.72 %  ruby ./launch_here.rb -b
>  1844 be/7 ant    580.25 K/s     0.00 B/s  99.99 %   0.00 %  ./burnMMX P
>  1859 be/7 ant    255.77 K/s     0.00 B/s  92.82 %   0.00 %  ./burnMMX P
>  1860 be/7 ant    209.96 K/s     0.00 B/s  99.99 %   0.00 %  ./burnMMX P
>  1874 be/7 ant    427.55 K/s     0.00 B/s  63.77 %   0.00 %  ./burnMMX P
>  1875 be/7 ant    244.31 K/s     0.00 B/s  99.99 %   0.00 %  ./burnMMX P
>  1880 be/7 ant    454.27 K/s     0.00 B/s  99.99 %   0.00 %  ./burnMMX P
>  1840 be/7 ant    263.40 K/s     0.00 B/s  99.99 %   0.00 %  ./burnMMX P
>     1 be/4 root     0.00 B/s     0.00 B/s   0.00 %   0.00 %  init [2]
>     2 be/4 root     0.00 B/s     0.00 B/s   0.00 %   0.00 %  [kthreadd]
>     3 rt/4 root     0.00 B/s     0.00 B/s   0.00 %   0.00 %  [migration/0]
>     4 be/4 root     0.00 B/s     0.00 B/s   0.00 %   0.00 %  [ksoftirqd/0]
>     5 rt/4 root     0.00 B/s     0.00 B/s   0.00 %   0.00 %  [watchdog/0]
>     6 rt/4 root     0.00 B/s     0.00 B/s   0.00 %   0.00 %  [migration/1]
>     7 be/4 root     0.00 B/s     0.00 B/s   0.00 %   0.00 %  [ksoftirqd/1]
>     8 rt/4 root     0.00 B/s     0.00 B/s   0.00 %   0.00 %  [watchdog/1]
> ...
>
> Are you sure it is not supposed to use swap? I am not even running X.
> Do I need to run this overnight or something?

You have less free RAM than I expected. 7 of the 40 burnMMX are
hitting swap (which should be avoided). Run only 32. A small swapout
at startup is OK, but thrashing (as above) is not, and only reduces
the severity of the test. Better to run one burnMMX too light than
one too many.

-- Robert
From: Robert Redelmeier on 10 Mar 2010 15:37

Ant <ant(a)zimage.comant> wrote in part:
> $ top
> top - 07:35:06 up 1 day, 23:52, 1 user, load average: 42.33, 37.41, 20.82
> Tasks: 188 total, 37 running, 151 sleeping, 0 stopped, 0 zombie
> ...
>
> Do I need to run this overnight or something?

Looking at your process list more closely, I notice big gaps in
the PIDs. Either you have very active daemons, or you tried to
start burnMMX and they quickly abended (very, very bad sign).
Please run under `time` so you can spot these quick terminations.

Running overnight would give you some assurance, since I
have seen rare errors (2-3/day) produce unstable systems.

-- Robert
From: Ant on 11 Mar 2010 12:00

On 3/10/2010 12:37 PM PT, Robert Redelmeier typed:

> Ant <ant(a)zimage.comant> wrote in part:
>> $ top
>> top - 07:35:06 up 1 day, 23:52, 1 user, load average: 42.33, 37.41, 20.82
>> Tasks: 188 total, 37 running, 151 sleeping, 0 stopped, 0 zombie
>> ...
>>
>> Do I need to run this overnight or something?
>
> Looking at your process list more closely, I notice big gaps in
> the PIDs. Either you have very active daemons, or you tried to
> start burnMMX and they quickly abended (very, very bad sign).
> Please run under `time` so you can spot these quick terminations.
>
> Running overnight would give you some assurance, since I
> have seen rare errors (2-3/day) produce unstable systems.

I made a text file with 40 of these lines:

time nice -19 ./burnMMX P &

And then ran it. Here's with seven of them after about three minutes:

Tasks: 122 total, 8 running, 114 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.1%us, 0.1%sy, 0.5%ni, 98.4%id, 0.9%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.1%us, 0.0%sy, 0.5%ni, 99.3%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2595064k total, 1174968k used, 1420096k free, 123336k buffers
Swap: 2361512k total, 0k used, 2361512k free, 453880k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 5914 ant   39  19 65632  64m    4 R   51  2.5  0:47.73 burnMMX
 5908 ant   39  19 65632  64m    4 R   35  2.5  0:47.66 burnMMX
 5917 ant   39  19 65632  64m    4 R   27  2.5  0:45.95 burnMMX
 5916 ant   39  19 65632  64m    4 R   23  2.5  0:48.26 burnMMX
 5913 ant   39  19 65632  64m    4 R   20  2.5  0:46.34 burnMMX
 5919 ant   39  19 65632  64m    4 R   20  2.5  0:49.24 burnMMX
 5918 ant   39  19 65632  64m    4 R   16  2.5  0:47.99 burnMMX
 5929 ant   40   0  2460 1076  804 R    4  0.0  0:00.02 top
 4174 ant   40   0 57972  43m 4700 S    2  1.7  0:22.89 launch_here.rb
    1 root  40   0  2036  704  604 S    0  0.0  0:00.97 init
    2 root  40   0     0    0    0 S    0  0.0  0:00.00 kthreadd
    3 root  RT   0     0    0    0 S    0  0.0  0:00.00 migration/0
    4 root  20   0     0    0    0 S    0  0.0  0:00.00 ksoftirqd/0
    5 root  RT   0     0    0    0 S    0  0.0  0:00.00 watchdog/0
    6 root  RT   0     0    0    0 S    0  0.0  0:00.00 migration/1
    7 root  20   0     0    0    0 S    0  0.0  0:00.01 ksoftirqd/1
    8 root  RT   0     0    0    0 S    0  0.0  0:00.00 watchdog/1
    9 root  20   0     0    0    0 S    0  0.0  0:00.00 events/0
   10 root  20   0     0    0    0 S    0  0.0  0:00.00 events/1
   11 root  20   0     0    0    0 S    0  0.0  0:00.00 cpuset
   12 root  20   0     0    0    0 S    0  0.0  0:00.00 khelper
   13 root  20   0     0    0    0 S    0  0.0  0:00.00 netns
   14 root  20   0     0    0    0 S    0  0.0  0:00.00 async/mgr

--
"Now I have you where I want you... where is my jar of Bull ants?" --unknown
   /\___/\
  / /\ /\ \    Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
 | |o   o| |   Ant's Quality Foraged Links: http://aqfl.net
    \ _ /      Nuke ANT from e-mail address: philpi(a)earthlink.netANT
     ( )       or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.
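The earlier advice about keeping burnMMX out of swap comes down to rough arithmetic: divide the free memory by the per-instance footprint and stay at least one instance below the result. The function below is only a sketch of that calculation. The name `max_instances` is hypothetical, and subtracting one instance as a safety margin is an assumption taken from the "one too light rather than one too many" remark, not a rule from burnMMX itself.

```shell
#!/bin/sh
# max_instances FREE_KB PER_INSTANCE_KB
# Prints how many instances fit in FREE_KB of memory, minus one
# instance as a safety margin (an assumed margin, not a burnMMX rule).
max_instances() {
    free_kb=$1
    per_instance_kb=$2
    n=$(( free_kb / per_instance_kb - 1 ))
    # Never report a negative count when memory is very tight.
    if [ "$n" -lt 0 ]; then
        n=0
    fi
    echo "$n"
}
```

With the figures from the top output above (1420096k free while seven instances are already running, 65632k per burnMMX), `max_instances 1420096 65632` prints 20, a rough estimate of how many more instances would fit before swap gets involved.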
From: Ant on 11 Mar 2010 12:08

On 3/10/2010 12:19 PM PT, Robert Redelmeier typed:

> You have less free RAM than I expected. 7 of the 40 burnMMX are
> hitting swap (which should be avoided). Run only 32. A small swapout
> at startup is OK, but thrashing (as above) is not and only reduces
> severity. Better to run one burnMMX too light than one too many.

Yeah. 2.5 GB of RAM. I used to have three (512 MB), but it came out bad
in memtest86+ v4.00 when I tested it last month. I thought that was the
problem, but I still have kernel panics.

I will keep it running all day. I might need to kill them if I need to
use the box at full speed. I did have another kernel panic after
midnight while idling.

I started the test at about 8:57 AM PST. After about ten minutes, I saw:

$ sensors -f
acpitz-virtual-0
Adapter: Virtual device
temp1: +71.2°F (crit = +206.2°F)

k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +122.0°F
Core1 Temp: +95.0°F

$ top - 09:07:52 up 8:40, 1 user, load average: 6.99, 6.16, 3.49
Tasks: 122 total, 8 running, 114 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.2%sy, 74.9%ni, 24.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.2%sy, 74.8%ni, 25.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2595064k total, 1178604k used, 1416460k free, 124472k buffers
Swap: 2361512k total, 0k used, 2361512k free, 455528k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 5919 ant   39  19 65632  64m    4 R   39  2.5  3:08.23 burnMMX
 5908 ant   39  19 65632  64m    4 R   37  2.5  3:01.57 burnMMX
 5913 ant   39  19 65632  64m    4 R   26  2.5  3:00.80 burnMMX
 5917 ant   39  19 65632  64m    4 R   26  2.5  2:59.61 burnMMX
 5916 ant   39  19 65632  64m    4 R   25  2.5  3:06.34 burnMMX
 5914 ant   39  19 65632  64m    4 R   24  2.5  3:02.03 burnMMX
 5918 ant   39  19 65632  64m    4 R   23  2.5  3:07.14 burnMMX
 4174 ant   40   0 60260  44m 4700 S    0  1.8  0:26.20 launch_here.rb
    1 root  40   0  2036  704  604 S    0  0.0  0:00.97 init
    2 root  40   0     0    0    0 S    0  0.0  0:00.00 kthreadd
    3 root  RT   0     0    0    0 S    0  0.0  0:00.00 migration/0
    4 root  20   0     0    0    0 S    0  0.0  0:00.00 ksoftirqd/0
    5 root  RT   0     0    0    0 S    0  0.0  0:00.00 watchdog/0
    6 root  RT   0     0    0    0 S    0  0.0  0:00.00 migration/1
    7 root  20   0     0    0    0 S    0  0.0  0:00.01 ksoftirqd/1
    8 root  RT   0     0    0    0 S    0  0.0  0:00.00 watchdog/1
    9 root  20   0     0    0    0 S    0  0.0  0:00.00 events/0
   10 root  20   0     0    0    0 S    0  0.0  0:00.00 events/1
   11 root  20   0     0    0    0 S    0  0.0  0:00.00 cpuset
   12 root  20   0     0    0    0 S    0  0.0  0:00.00 khelper
   13 root  20   0     0    0    0 S    0  0.0  0:00.00 netns
   14 root  20   0     0    0    0 S    0  0.0  0:00.00 async/mgr
   15 root  20   0     0    0    0 S    0  0.0  0:00.00 pm
....

I will follow up later. BTW, how long should I run these nonstop? All day?

--
"Where there is sugar, there are bound to be ants." --Malay Proverb
   /\___/\
  / /\ /\ \    Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
 | |o   o| |   Ant's Quality Foraged Links: http://aqfl.net
    \ _ /      Nuke ANT from e-mail address: philpi(a)earthlink.netANT
     ( )       or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.
From: Robert Redelmeier on 11 Mar 2010 14:28

Ant <ant(a)zimage.comant> wrote in part:
> $ top - 09:07:52 up 8:40, 1 user, load average: 6.99, 6.16, 3.49
> Tasks: 122 total, 8 running, 114 sleeping, 0 stopped, 0 zombie
> Cpu0 : 0.0%us, 0.2%sy, 74.9%ni, 24.9%id, 0.0%wa, 0.0%hi, 0.0%si,
> Cpu1 : 0.0%us, 0.2%sy, 74.8%ni, 25.0%id, 0.0%wa, 0.0%hi, 0.0%si,
> Mem: 2595064k total, 1178604k used, 1416460k free, 124472k buffers
> Swap: 2361512k total, 0k used, 2361512k free, 455528k cached
>
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>  5919 ant   39  19 65632  64m    4 R   39  2.5  3:08.23 burnMMX
>  5908 ant   39  19 65632  64m    4 R   37  2.5  3:01.57 burnMMX
>  5913 ant   39  19 65632  64m    4 R   26  2.5  3:00.80 burnMMX
>  5917 ant   39  19 65632  64m    4 R   26  2.5  2:59.61 burnMMX
>  5916 ant   39  19 65632  64m    4 R   25  2.5  3:06.34 burnMMX
>  5914 ant   39  19 65632  64m    4 R   24  2.5  3:02.03 burnMMX
>  5918 ant   39  19 65632  64m    4 R   23  2.5  3:07.14 burnMMX

You started _40_ and only _7_ are left running? Bad news.
What happened to PIDs 5909-12, 5915, 5920-47?

The seven still running might be mapped to non-defective areas/TLB.
The others might have abended when the memory the kernel mapped
produced a segfault, TLB fault, or memory error. Each burnMMX
has its own pages and mmap and stomps them all.

> I will follow-up later. BTW, how long should I run these nonstop? All day?

As long as you can. Minimum 2 h. But if you are getting early abends,
then you have just confirmed a hardware problem.

To get exit status, you could try

    (nice -19 ./burnMMX ; echo $?) &

burnMMX typically exits 127 when it encounters a memory error.
It could do this within the first second if there is a problem
with memory mapping (hardware does not obey kernel instructions).

-- Robert
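Robert's suggestion of catching early abends by their exit status can be sketched as a small shell function that starts several copies of a command in the background and prints each copy's exit status as soon as it terminates. This is a minimal sketch, not part of cpuburn: the name `run_and_report` and the message format are made up for illustration.

```shell
#!/bin/sh
# run_and_report COUNT COMMAND [ARGS...]
# Starts COUNT background copies of COMMAND and prints the exit status
# of each copy when it ends. (Hypothetical helper, for illustration.)
run_and_report() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        # Each copy runs in its own subshell, so its status is printed
        # the moment it terminates, even if the other copies keep running.
        (
            "$@"
            echo "instance $i exited with status $?"
        ) &
        i=$((i + 1))
    done
    # Block until every copy has finished (or been killed).
    wait
}
```

A call such as `run_and_report 32 nice -n 19 ./burnMMX P` would then show a line for any instance that quits early, while healthy instances stay silent until they are killed.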