From: nash_rack1 on
We have this problem on our database machine where the load average
shown by 'top' goes very high ( > 100) at random times and the database
becomes really slow. We're trying to find out which process could be
causing this load. For some reason, top does not show any processes
that could be suspects. It shows only 2 running processes using some
CPU. Other processes are not using any CPU. How can we find out what is
causing the load average to be so high?

11:18:06 up 163 days, 5:56, 7 users, load average: 101.27, 102.72, 69.93
Tasks: 443 total, 2 running, 441 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.3% us, 1.4% sy, 0.0% ni, 88.3% id, 3.0% wa, 0.1% hi, 0.0% si
Mem: 16634608k total, 15611264k used, 1023344k free, 3260k buffers
Swap: 32796688k total, 119724k used, 32676964k free, 14511340k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28419 oracle 16 0 1493m 1.1g 1.1g R 55.3 6.9 19:59.11 oracle
22906 oracle 15 0 1494m 1.2g 1.2g D 17.2 7.4 0:39.87 oracle
31900 dba01 15 0 2156 1096 692 R 5.7 0.0 0:00.04 top
8 root RT 0 0 0 0 S 1.9 0.0 0:37.66 migration/3
3198 root 15 0 0 0 0 S 1.9 0.0 196:38.27 vxiod
1 root 16 0 2760 608 520 S 0.0 0.0 1:50.83 init
2 root RT 0 0 0 0 S 0.0 0.0 0:24.24 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 1:00.85 ksoftirqd/0
4 root RT 0 0 0 0 S 0.0 0.0 0:34.51 migration/1
5 root 34 19 0 0 0 S 0.0 0.0 1:13.25 ksoftirqd/1
6 root RT 0 0 0 0 S 0.0 0.0 0:25.12 migration/2
7 root 34 19 0 0 0 S 0.0 0.0 1:04.21 ksoftirqd/2
9 root 34 19 0 0 0 S 0.0 0.0 1:12.07 ksoftirqd/3
10 root RT 0 0 0 0 S 0.0 0.0 0:26.34 migration/4
11 root 34 19 0 0 0 S 0.0 0.0 1:02.59 ksoftirqd/4
12 root RT 0 0 0 0 S 0.0 0.0 0:32.50 migration/5
13 root 34 19 0 0 0 S 0.0 0.0 1:11.32 ksoftirqd/5
14 root RT 0 0 0 0 S 0.0 0.0 0:32.99 migration/6
15 root 34 19 0 0 0 S 0.0 0.0 1:04.03 ksoftirqd/6
16 root RT 0 0 0 0 S 0.0 0.0 0:33.00 migration/7
17 root 34 19 0 0 0 S 0.0 0.0 1:08.40 ksoftirqd/7
18 root 5 -10 0 0 0 S 0.0 0.0 0:00.58 events/0
19 root 5 -10 0 0 0 S 0.0 0.0 0:00.58 events/1
20 root 5 -10 0 0 0 S 0.0 0.0 0:00.54 events/2
21 root 5 -10 0 0 0 S 0.0 0.0 0:00.50 events/3
22 root 5 -10 0 0 0 S 0.0 0.0 0:00.56 events/4
23 root 5 -10 0 0 0 S 0.0 0.0 0:00.55 events/5
24 root 5 -10 0 0 0 S 0.0 0.0 0:00.56 events/6
25 root 5 -10 0 0 0 S 0.0 0.0 0:00.51 events/7
26 root 5 -10 0 0 0 S 0.0 0.0 0:00.01 khelper
27 root 13 -10 0 0 0 S 0.0 0.0 0:00.00 kacpid
124 root 5 -10 0 0 0 S 0.0 0.0 0:00.00 kblockd/0


$ uname -a
Linux db01 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:54:53 EST 2006 i686 i686 i386 GNU/Linux

This is a 4 CPU machine.

If you notice anything unusual in the above output or if there's
another command we can use, please let me know.

Thanks,
Nash.

From: Patrick on
<nash_rack1(a)yahoo.com> wrote in message
news:1156193776.060140.111970(a)m79g2000cwm.googlegroups.com

> We have this problem on our database machine where the load average
> shown by 'top' goes very high ( > 100) at random times and the
> database becomes really slow. We're trying to find out which process
> could be causing this load. ...
> 11:18:06 up 163 days, 5:56, 7 users, load average: 101.27, 102.72, 69.93
> Tasks: 443 total, 2 running, 441 sleeping, 0 stopped, 0 zombie
....
> If you notice anything unusual in the above output or if there's
> another command we can use, please let me know.

What are those 441 sleeping processes, and why are they sleeping? Waiting
for disk I/O or interprocess communications?

From the man page: "The load averages are the average number of processes
ready to run during the last 1, 5 and 15 minutes."
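
One caveat on that man page wording: on Linux the load average counts tasks in uninterruptible sleep (state D, typically blocked on disk I/O) as well as runnable ones, so a very high load with mostly idle CPUs is possible. A minimal Python 3 sketch, assuming a Linux /proc filesystem, that reads /proc/loadavg and shows the load next to the number of currently runnable tasks. The function names are made up for illustration.

```python
import os


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dict."""
    fields = text.split()
    one, five, fifteen = (float(x) for x in fields[:3])
    # Fourth field looks like "runnable/total" scheduling entities.
    runnable, total = (int(x) for x in fields[3].split("/"))
    return {"1min": one, "5min": five, "15min": fifteen,
            "runnable": runnable, "total": total}


def load_per_cpu(load, cpus):
    """Load average divided by the number of logical CPUs."""
    return load / cpus


if __name__ == "__main__":
    try:
        with open("/proc/loadavg") as f:
            info = parse_loadavg(f.read())
    except OSError as exc:
        print("cannot read /proc/loadavg:", exc)
    else:
        cpus = os.cpu_count() or 1
        print(info)
        print("1-minute load per CPU: %.2f" % load_per_cpu(info["1min"], cpus))
```

If the 1-minute load stays far above the runnable count, that suggests the difference is made up of tasks in state D rather than tasks competing for CPU.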

From: The Natural Philosopher on
Patrick wrote:
> <nash_rack1(a)yahoo.com> wrote in message
> news:1156193776.060140.111970(a)m79g2000cwm.googlegroups.com
>
>> We have this problem on our database machine where the load average
>> shown by 'top' goes very high ( > 100) at random times and the
>> database becomes really slow. We're trying to find out which process
>> could be causing this load. ...
>> 11:18:06 up 163 days, 5:56, 7 users, load average: 101.27, 102.72, 69.93
>> Tasks: 443 total, 2 running, 441 sleeping, 0 stopped, 0 zombie
> ...
>> If you notice anything unusual in the above output or if there's
>> another command we can use, please let me know.
>
> What are those 441 sleeping processes, and why are they sleeping? Waiting
> for disk I/O or interprocess communications?

Almost certainly. I ran some DB stuff on SCO Unix once, and it beat the
hell out of it... we increased the file, inode, file name and directory
caching by a factor of about a thousand, and it went much better... ;-)

No idea whether modern Linux needs that or not, or whether it's
possible. Back then it was generally a kernel boot-time option.

From: nash_rack1 on
Most of these sleeping processes are the processes that Oracle creates
for pooled database connections. I believe they're sleeping because the
application is not doing any SQL activity on those connections.

What other commands can I use to find the root cause of this high load?

Thanks,
Nash.
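
To check what the processes are actually doing rather than guess, one option is to tally process states straight from /proc. The following is only a sketch, assuming Python 3 on a Linux box. parse_stat and scan_processes are illustrative names, and it looks at processes only, not individual threads.

```python
import os
from collections import Counter


def parse_stat(line):
    """Return (pid, command, state) from one /proc/<pid>/stat line."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split around the last ')'.
    left, right = line.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid), comm, state


def scan_processes(proc="/proc"):
    """Yield (pid, command, state) for every process visible under proc."""
    for name in os.listdir(proc):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc, name, "stat")) as f:
                yield parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line could not be parsed


if __name__ == "__main__":
    try:
        procs = list(scan_processes())
    except OSError as exc:
        procs = []
        print("cannot read /proc:", exc)
    # Summary of how many processes are in each state (S, R, D, ...).
    print(Counter(state for _, _, state in procs))
    # Runnable (R) and uninterruptible (D) processes count toward the load.
    for pid, comm, state in procs:
        if state in ("R", "D"):
            print(pid, state, comm)
```

Run it a few times while the load is high. Any PIDs that keep showing up in state D are the ones to look at first.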
