[RFC] Large weight differential leads to inefficient load balancing [Kernel]

Prev: [PATCH -staging] slicoss: Remove net_device_stats from the driver's private
Next: [PATCH 4/6] vmscan: tracing: Correct units in post-processing script

From: Mike Galbraith on 30 Jul 2010 09:40

On Thu, 2010-07-29 at 22:19 -0700, Nikhil Rao wrote:
> Hi all,
>
> We have observed that a large weight differential between tasks on a runqueue
> leads to sub-optimal machine utilization and poor load balancing. For example,
> if you have lots of SCHED_IDLE tasks (sufficient number to keep the machine 100%
> busy) and a few SCHED_NORMAL soaker tasks, we see that the machine has
> significant idle time.
>
> The data below highlights this problem. The test machine is a 4 socket quad-core
> box (16 cpus). These experiemnts were done with v2.6.25-rc6. We spawn 16
> SCHED_IDLE soaker threads (one per-cpu) to completely fill up the machine. CPU
> utilization numbers gathered from mpstat for 10s are:
>
> 03:30:24 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
> 03:30:25 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 16234.65
> 03:30:26 PM all 99.88 0.06 0.06 0.00 0.00 0.00 0.00 0.00 16374.00
> 03:30:27 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 16392.00
> 03:30:28 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 16612.12
> 03:30:29 PM all 99.88 0.00 0.12 0.00 0.00 0.00 0.00 0.00 16375.00
> 03:30:30 PM all 99.94 0.06 0.00 0.00 0.00 0.00 0.00 0.00 16440.00
> 03:30:31 PM all 99.81 0.00 0.19 0.00 0.00 0.00 0.00 0.00 16237.62
> 03:30:32 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 16360.00
> 03:30:33 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 16405.00
> 03:30:34 PM all 99.38 0.06 0.50 0.00 0.00 0.00 0.00 0.06 18881.82
> Average: all 99.86 0.02 0.12 0.00 0.00 0.00 0.00 0.01 16628.20
>
> We then spawn one SCHED_NORMAL while-1 task (the absolute number does not matter
> so long as we introduce some large weight differential).
>
> 03:40:57 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
> 03:40:58 PM all 83.06 0.00 0.06 0.00 0.00 0.00 0.00 16.88 14555.00
> 03:40:59 PM all 78.25 0.00 0.06 0.00 0.00 0.00 0.00 21.69 14527.00
> 03:41:00 PM all 82.71 0.06 0.06 0.00 0.00 0.00 0.00 17.17 14879.00
> 03:41:01 PM all 87.34 0.00 0.06 0.00 0.00 0.00 0.00 12.59 15466.00
> 03:41:02 PM all 80.80 0.06 0.19 0.00 0.00 0.00 0.00 18.95 14584.00
> 03:41:03 PM all 82.90 0.00 0.06 0.00 0.00 0.00 0.00 17.04 14570.00
> 03:41:04 PM all 79.45 0.00 0.06 0.00 0.00 0.00 0.00 20.49 14536.00
> 03:41:05 PM all 86.48 0.00 0.07 0.00 0.00 0.00 0.00 13.46 14577.00
> 03:41:06 PM all 76.73 0.06 0.06 0.00 0.00 0.06 0.00 23.10 14594.00
> 03:41:07 PM all 86.48 0.00 0.07 0.00 0.00 0.00 0.00 13.45 14703.03
> Average: all 82.31 0.02 0.08 0.00 0.00 0.01 0.00 17.59 14699.10

What happens with s/SCHED_IDLE/nice 19?

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Peter Zijlstra on 2 Aug 2010 07:50

On Thu, 2010-07-29 at 22:19 -0700, Nikhil Rao wrote:
> If folks think this is a good direction and this idea has some merit, then I
> will be more than happy to continue working with the community to improve this
> patchset. If there is another way to fix this problem, then please do tell; I
> would be more than happy to help out.

No its terrible. I've already explained how to solve this on several
occasions (although my google skillz seem to fail me to find the latest
occasion a few months ago).

The thing is that from a fairness point of view 1 nice-0 (weight=1024)
on one CPU and 512 SCHED_IDLE (weight=2) tasks on another CPU and all
other CPUs idle is correct.

It just doesn't seem to be the thing that most people expect.

Special casing things like you've done is utterly the wrong thing to do.

This problem comes in two forms and its name is infeasible weight
distribution. The load-balancer tries to ensure W_k ~= W_l, k,l elem_of
CPUs, where W_k = \Sum_i w_i^k, where w_i^k is the i-th task on CPU k.

The two cases are statically infeasible, and dynamically infeasible.

We say the task-set is statically infeasible if for a task set of n
tasks there is no way to statically distribute them on N <= n CPUs such
that each task gets equal service (assuming the scheduling on each CPU
is fair).

We say the task-set is dynamically infeasible if for the given scenario
there is no way to rotate the tasks to obtain equality.

Lets assume 2 CPUs.

Ex.1: 2 tasks of different weight.

Ex.2: 3 tasks of equal weight.

The first example is both statically and dynamically infeasible as there
is no way to occupy both CPUs such that each task gets the proportional
correct service.

The second example is statically infeasible, but dynamically feasible,
for if we rotate one task, such that we alternate between 2:1 and 1:2 in
equal measures, each task will receive its correct 2/3rd CPU service.

The current load-balancer isn't particularly skilled at either issue.

The proper solution is to 'fix' find_busiest_group() so that it will:
- pick the heaviest cpu with more than 1 task on it
- slowly over-balance things

The first thing will solve your issue.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

|
Pages: 1
Prev: [PATCH -staging] slicoss: Remove net_device_stats from the driver's private
Next: [PATCH 4/6] vmscan: tracing: Correct units in post-processing script