"TLB parity error in virtual array; TLB error 'instruction"? [Chips]

Prev: Any monthly web hosting companies?
Next: "TLB parity error in virtual array; TLB error 'instruction"?(acpidump)

From: Robert Redelmeier on 10 Mar 2010 15:19

Ant <ant(a)zimage.comant> wrote in part:
> So far, nothing interesting in my logs or any crashes. Just a very slow
> Debian/Linux! Also, the HDD light was very busy. And top shows swap
> usage. I checked iotop and saw:
>
> $ iotop
>
> Total DISK READ: 3.02 M/s | Total DISK WRITE: 1259.75 K/s
> TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
>
> 31 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % [kswapd0]
> 1045 be/4 root 0.00 B/s 38.17 K/s 0.00 % 46.36 % [kjournald]
> 1465 be/4 ant 690.95 K/s 76.35 K/s 50.71 % 4.72 % ruby
> ./launch_here.rb -b
> 1844 be/7 ant 580.25 K/s 0.00 B/s 99.99 % 0.00 % ./burnMMX P
> 1859 be/7 ant 255.77 K/s 0.00 B/s 92.82 % 0.00 % ./burnMMX P
> 1860 be/7 ant 209.96 K/s 0.00 B/s 99.99 % 0.00 % ./burnMMX P
> 1874 be/7 ant 427.55 K/s 0.00 B/s 63.77 % 0.00 % ./burnMMX P
> 1875 be/7 ant 244.31 K/s 0.00 B/s 99.99 % 0.00 % ./burnMMX P
> 1880 be/7 ant 454.27 K/s 0.00 B/s 99.99 % 0.00 % ./burnMMX P
> 1840 be/7 ant 263.40 K/s 0.00 B/s 99.99 % 0.00 % ./burnMMX P
> 1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init [2]
> 2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd]
> 3 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/0]
> 4 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0]
> 5 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/0]
> 6 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/1]
> 7 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/1]
> 8 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/1]
> ...
>
> Are you sure it is not supposed to use swap? I am not even running X.
> Do I need to run this overnight or something?

You have less free RAM than I expected. 7 of the 40 burnMMX are
hitting swap (which should be avoided). Run only 32. A small swapout
at startup is OK, but thrashing (as above) is not: it only reduces the
severity of the test. Better to run one burnMMX too few than one too many.

-- Robert
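
The sizing advice above comes down to one division: memory you can spare, divided by the footprint of one burnMMX. A minimal sketch, assuming both figures are in kB as top reports them; `max_instances` is a hypothetical helper name, not part of cpuburn.

```shell
#!/bin/sh
# max_instances AVAILABLE_KB PER_INSTANCE_KB
# Estimates how many fixed-size memory burners fit without swapping.
# Both arguments are in kB, as shown by top or /proc/meminfo.
# (Hypothetical helper for illustration; not shipped with burnMMX.)
max_instances() {
    avail_kb=$1
    per_kb=$2
    # Integer division rounds down, so the estimate errs toward
    # running one too few rather than one too many.
    echo $(( avail_kb / per_kb ))
}
```

With the figures from the top output later in this thread (1420096k free, 65632k VIRT per burnMMX), `max_instances 1420096 65632` prints 21. Counting buffers and cache as reclaimable (1420096 + 123336 + 453880 = 1997312k) gives 30, which is in the same range as the 32 suggested above.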

From: Robert Redelmeier on 10 Mar 2010 15:37

Ant <ant(a)zimage.comant> wrote in part:
> $ top
> top - 07:35:06 up 1 day, 23:52, 1 user, load average: 42.33, 37.41, 20.82
> Tasks: 188 total, 37 running, 151 sleeping, 0 stopped, 0 zombie
> ...
>
> Do I need to run this overnight or something?

Looking at your process list more closely, I notice big gaps in
the PIDs. Either you have very active daemons, or you tried to
start burnMMX and they quickly abended (very, very bad sign).
Please run under `time` so you can spot these quick terminations.

Running overnight would give you some assurance, since I
have seen rare errors (2-3/day) produce unstable systems.

-- Robert

From: Ant on 11 Mar 2010 12:00

On 3/10/2010 12:37 PM PT, Robert Redelmeier typed:

> Ant<ant(a)zimage.comant> wrote in part:
>> $ top
>> top - 07:35:06 up 1 day, 23:52, 1 user, load average: 42.33, 37.41, 20.82
>> Tasks: 188 total, 37 running, 151 sleeping, 0 stopped, 0 zombie
>> ...
>>
>> Do I need to run this overnight or something?
>
> Looking at your process list more closely, I notice big gaps in
> the PIDs. Either you have very active daemons, or you tried to
> start burnMMX and they quickly abended (very, very bad sign).
> Please run under `time` so you can spot these quick terminations.
>
> Running overnight would give you some assurance, since I
> have seen rare errors (2-3/day) produce unstable systems.

I made a text file with 40 of these lines:
time nice -19 ./burnMMX P &

And then ran it.

Here's with seven of them after about three minutes:
Tasks: 122 total, 8 running, 114 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.1%us, 0.1%sy, 0.5%ni, 98.4%id, 0.9%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.1%us, 0.0%sy, 0.5%ni, 99.3%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2595064k total, 1174968k used, 1420096k free, 123336k buffers
Swap: 2361512k total, 0k used, 2361512k free, 453880k cached

  PID USER PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
 5914 ant  39 19 65632  64m    4 R   51  2.5 0:47.73 burnMMX
 5908 ant  39 19 65632  64m    4 R   35  2.5 0:47.66 burnMMX
 5917 ant  39 19 65632  64m    4 R   27  2.5 0:45.95 burnMMX
 5916 ant  39 19 65632  64m    4 R   23  2.5 0:48.26 burnMMX
 5913 ant  39 19 65632  64m    4 R   20  2.5 0:46.34 burnMMX
 5919 ant  39 19 65632  64m    4 R   20  2.5 0:49.24 burnMMX
 5918 ant  39 19 65632  64m    4 R   16  2.5 0:47.99 burnMMX
 5929 ant  40  0  2460 1076  804 R    4  0.0 0:00.02 top
 4174 ant  40  0 57972  43m 4700 S    2  1.7 0:22.89 launch_here.rb
    1 root 40  0  2036  704  604 S    0  0.0 0:00.97 init
    2 root 40  0     0    0    0 S    0  0.0 0:00.00 kthreadd
    3 root RT  0     0    0    0 S    0  0.0 0:00.00 migration/0
    4 root 20  0     0    0    0 S    0  0.0 0:00.00 ksoftirqd/0
    5 root RT  0     0    0    0 S    0  0.0 0:00.00 watchdog/0
    6 root RT  0     0    0    0 S    0  0.0 0:00.00 migration/1
    7 root 20  0     0    0    0 S    0  0.0 0:00.01 ksoftirqd/1
    8 root RT  0     0    0    0 S    0  0.0 0:00.00 watchdog/1
    9 root 20  0     0    0    0 S    0  0.0 0:00.00 events/0
   10 root 20  0     0    0    0 S    0  0.0 0:00.00 events/1
   11 root 20  0     0    0    0 S    0  0.0 0:00.00 cpuset
   12 root 20  0     0    0    0 S    0  0.0 0:00.00 khelper
   13 root 20  0     0    0    0 S    0  0.0 0:00.00 netns
   14 root 20  0     0    0    0 S    0  0.0 0:00.00 async/mgr
--
"Now I have you where I want you... where is my jar of Bull ants?" --unknown
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: philpi(a)earthlink.netANT
( ) or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.

From: Ant on 11 Mar 2010 12:08

On 3/10/2010 12:19 PM PT, Robert Redelmeier typed:

> You have less free RAM than I expected. 7 of the 40 burnMMX are
> hitting swap (which should be avoided). Run only 32. A small swapout
> at startup is OK, but thrashing (as above) is not and only reduces
> severity. Better to run one burnMMX too light than one too many.

Yeah. 2.5 GB of RAM. I used to have 3 GB, but one 512 MB module came up bad
in memtest86+ v4.00 when I tested it last month. I thought that was the
problem, but I still have kernel panics.

I will keep it running all day. I might need to kill them if I need to
use the box at full speed. I did have another kernel panic after midnight
while idling.

I started the test at about 8:57 AM PST. After about ten minutes, I saw:

$ sensors -f
acpitz-virtual-0
Adapter: Virtual device
temp1: +71.2°F (crit = +206.2°F)

k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +122.0°F
Core1 Temp: +95.0°F

$ top - 09:07:52 up 8:40, 1 user, load average: 6.99, 6.16, 3.49
Tasks: 122 total, 8 running, 114 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.2%sy, 74.9%ni, 24.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.2%sy, 74.8%ni, 25.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2595064k total, 1178604k used, 1416460k free, 124472k buffers
Swap: 2361512k total, 0k used, 2361512k free, 455528k cached

  PID USER PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
 5919 ant  39 19 65632  64m    4 R   39  2.5 3:08.23 burnMMX
 5908 ant  39 19 65632  64m    4 R   37  2.5 3:01.57 burnMMX
 5913 ant  39 19 65632  64m    4 R   26  2.5 3:00.80 burnMMX
 5917 ant  39 19 65632  64m    4 R   26  2.5 2:59.61 burnMMX
 5916 ant  39 19 65632  64m    4 R   25  2.5 3:06.34 burnMMX
 5914 ant  39 19 65632  64m    4 R   24  2.5 3:02.03 burnMMX
 5918 ant  39 19 65632  64m    4 R   23  2.5 3:07.14 burnMMX
 4174 ant  40  0 60260  44m 4700 S    0  1.8 0:26.20 launch_here.rb
    1 root 40  0  2036  704  604 S    0  0.0 0:00.97 init
    2 root 40  0     0    0    0 S    0  0.0 0:00.00 kthreadd
    3 root RT  0     0    0    0 S    0  0.0 0:00.00 migration/0
    4 root 20  0     0    0    0 S    0  0.0 0:00.00 ksoftirqd/0
    5 root RT  0     0    0    0 S    0  0.0 0:00.00 watchdog/0
    6 root RT  0     0    0    0 S    0  0.0 0:00.00 migration/1
    7 root 20  0     0    0    0 S    0  0.0 0:00.01 ksoftirqd/1
    8 root RT  0     0    0    0 S    0  0.0 0:00.00 watchdog/1
    9 root 20  0     0    0    0 S    0  0.0 0:00.00 events/0
   10 root 20  0     0    0    0 S    0  0.0 0:00.00 events/1
   11 root 20  0     0    0    0 S    0  0.0 0:00.00 cpuset
   12 root 20  0     0    0    0 S    0  0.0 0:00.00 khelper
   13 root 20  0     0    0    0 S    0  0.0 0:00.00 netns
   14 root 20  0     0    0    0 S    0  0.0 0:00.00 async/mgr
   15 root 20  0     0    0    0 S    0  0.0 0:00.00 pm
....

I will follow up later. BTW, how long should I run these nonstop? All day?

From: Robert Redelmeier on 11 Mar 2010 14:28

Ant <ant(a)zimage.comant> wrote in part:
> $ top - 09:07:52 up 8:40, 1 user, load average: 6.99, 6.16, 3.49
> Tasks: 122 total, 8 running, 114 sleeping, 0 stopped, 0 zombie
> Cpu0 : 0.0%us, 0.2%sy, 74.9%ni, 24.9%id, 0.0%wa, 0.0%hi, 0.0%si,
> Cpu1 : 0.0%us, 0.2%sy, 74.8%ni, 25.0%id, 0.0%wa, 0.0%hi, 0.0%si,
> Mem: 2595064k total, 1178604k used, 1416460k free, 124472k buffers
> Swap: 2361512k total, 0k used, 2361512k free, 455528k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5919 ant 39 19 65632 64m 4 R 39 2.5 3:08.23 burnMMX
> 5908 ant 39 19 65632 64m 4 R 37 2.5 3:01.57 burnMMX
> 5913 ant 39 19 65632 64m 4 R 26 2.5 3:00.80 burnMMX
> 5917 ant 39 19 65632 64m 4 R 26 2.5 2:59.61 burnMMX
> 5916 ant 39 19 65632 64m 4 R 25 2.5 3:06.34 burnMMX
> 5914 ant 39 19 65632 64m 4 R 24 2.5 3:02.03 burnMMX
> 5918 ant 39 19 65632 64m 4 R 23 2.5 3:07.14 burnMMX

You started _40_ and only _7_ are left running? Bad news.
What happened to PIDs 5909-12, 5915, 5920-47? The seven
running might be mapped to non-defective areas/TLB.

The others might have abended when the memory the kernel mapped
for them produced a segfault, a TLB fault, or a memory error.
Each burnMMX has its own pages and mmap and stomps them all.
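
The gap-spotting above can be done mechanically. A rough sketch, assuming the launched processes received consecutive PIDs (only true if nothing else forked during the launch); `missing_pids` is a hypothetical helper name.

```shell
#!/bin/sh
# missing_pids FIRST COUNT PID...
# Prints every PID in FIRST..FIRST+COUNT-1 that is not among the
# PIDs listed after COUNT.
# Assumption: the launched processes got consecutive PIDs.
missing_pids() {
    first=$1
    count=$2
    shift 2
    running=" $* "                 # pad so every PID is space-delimited
    pid=$first
    last=$(( first + count - 1 ))
    while [ "$pid" -le "$last" ]; do
        case "$running" in
            *" $pid "*) ;;         # still in the process list
            *) echo "$pid" ;;      # expected but absent
        esac
        pid=$(( pid + 1 ))
    done
}
```

For example, `missing_pids 5908 40 5908 5913 5914 5916 5917 5918 5919` lists the 33 PIDs between 5908 and 5947 that do not appear in the top output.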

> I will follow-up later. BTW, how long should I run these nonstop? All day?

As long as you can. Minimum 2 h. But if you are getting early
abends, then you have just confirmed a hardware problem.

To get exit status, you could try
( nice -19 ./burnMMX ; echo $? ) &

burnMMX typically exits 127 when it encounters a memory error.
It could do this within the first second if there is a problem
with memory mapping (hardware does not obey kernel instructions).

-- Robert
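
To catch both quick terminations and their exit codes when many copies run at once, each instance can be wrapped so that it reports its own status and runtime. This is only a sketch: `run_logged` and its label format are made up for illustration, and it assumes a `date` that supports `+%s` (GNU and BSD do).

```shell
#!/bin/sh
# run_logged LABEL COMMAND [ARGS...]
# Runs COMMAND in the foreground, then prints LABEL, the command's
# exit status, and the elapsed wall-clock seconds on one line.
# (Hypothetical wrapper for illustration; not part of burnMMX.)
run_logged() {
    label=$1
    shift
    start=$(date +%s)
    "$@"                           # run the command exactly as given
    status=$?                      # capture before anything overwrites $?
    end=$(date +%s)
    echo "$label exit=$status elapsed=$(( end - start ))s"
    return "$status"
}

# Possible use, backgrounding each wrapped instance:
#   i=1
#   while [ "$i" -le 32 ]; do
#       run_logged "burnMMX#$i" nice -19 ./burnMMX P &
#       i=$(( i + 1 ))
#   done
#   wait
```

An instance that dies early shows up as a line with a nonzero `exit=` and a small `elapsed=` value, while healthy instances print nothing until they are killed.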
