From: Paul Turner on
On Fri, Jul 30, 2010 at 6:32 AM, Mike Galbraith <efault(a)gmx.de> wrote:
> On Thu, 2010-07-29 at 22:19 -0700, Nikhil Rao wrote:
>> Hi all,
>>
>> We have observed that a large weight differential between tasks on a runqueue
>> leads to sub-optimal machine utilization and poor load balancing. For example,
>> if you have lots of SCHED_IDLE tasks (sufficient number to keep the machine 100%
>> busy) and a few SCHED_NORMAL soaker tasks, we see that the machine has
>> significant idle time.
>>
>> The data below highlights this problem. The test machine is a 4 socket quad-core
>> box (16 cpus). These experiments were done with v2.6.35-rc6. We spawn 16
>> SCHED_IDLE soaker threads (one per-cpu) to completely fill up the machine. CPU
>> utilization numbers gathered from mpstat for 10s are:
>>
>> 03:30:24 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
>> 03:30:25 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16234.65
>> 03:30:26 PM  all   99.88    0.06    0.06    0.00    0.00    0.00    0.00    0.00  16374.00
>> 03:30:27 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16392.00
>> 03:30:28 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16612.12
>> 03:30:29 PM  all   99.88    0.00    0.12    0.00    0.00    0.00    0.00    0.00  16375.00
>> 03:30:30 PM  all   99.94    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16440.00
>> 03:30:31 PM  all   99.81    0.00    0.19    0.00    0.00    0.00    0.00    0.00  16237.62
>> 03:30:32 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16360.00
>> 03:30:33 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16405.00
>> 03:30:34 PM  all   99.38    0.06    0.50    0.00    0.00    0.00    0.00    0.06  18881.82
>> Average:     all   99.86    0.02    0.12    0.00    0.00    0.00    0.00    0.01  16628.20
>>
>> We then spawn one SCHED_NORMAL while-1 task (the absolute number does not matter
>> so long as we introduce some large weight differential).
>>
>> 03:40:57 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
>> 03:40:58 PM  all   83.06    0.00    0.06    0.00    0.00    0.00    0.00   16.88  14555.00
>> 03:40:59 PM  all   78.25    0.00    0.06    0.00    0.00    0.00    0.00   21.69  14527.00
>> 03:41:00 PM  all   82.71    0.06    0.06    0.00    0.00    0.00    0.00   17.17  14879.00
>> 03:41:01 PM  all   87.34    0.00    0.06    0.00    0.00    0.00    0.00   12.59  15466.00
>> 03:41:02 PM  all   80.80    0.06    0.19    0.00    0.00    0.00    0.00   18.95  14584.00
>> 03:41:03 PM  all   82.90    0.00    0.06    0.00    0.00    0.00    0.00   17.04  14570.00
>> 03:41:04 PM  all   79.45    0.00    0.06    0.00    0.00    0.00    0.00   20.49  14536.00
>> 03:41:05 PM  all   86.48    0.00    0.07    0.00    0.00    0.00    0.00   13.46  14577.00
>> 03:41:06 PM  all   76.73    0.06    0.06    0.00    0.00    0.06    0.00   23.10  14594.00
>> 03:41:07 PM  all   86.48    0.00    0.07    0.00    0.00    0.00    0.00   13.45  14703.03
>> Average:     all   82.31    0.02    0.08    0.00    0.00    0.01    0.00   17.59  14699.10
>
> What happens with s/SCHED_IDLE/nice 19?
>
>        -Mike
>
>

So this is also a concern of mine that I wanted to raise.

The problem observed above is a function of balancing entities which
can't meaningfully contribute to improving the observed imbalance.

While this does occur with {nice 19, SCHED_IDLE} threads, it is a more
general problem and actually much more prevalent in the group
scheduling case where the group entity weight is more arbitrary. This
is compounded by the fact that we iterate over the task_groups in load
balance and don't have a good way of differentiating in-group high/low
weight entities.

Something I have been investigating is how we can fix this problem
more generally: e.g., how we can effectively load-balance a low-weight
entity in the presence of high-weight competition. This could just as
easily be a {1000, 2} share split as a {64000, 2000} share split --
the latter won't benefit from SCHED_IDLE optimizations but is a
realistic test case if someone is trying to provide minimal latencies
for the high-weight task.

One avenue I've been considering: instead of load-balancing the
low-weight entities when their h_load relative to the imbalance is
'not meaningful' -- by some fitness function -- tag them for a
separate load-balancing pass. This secondary pass would only need to
be periodic in nature, and would consider the task_groups 'tagged' by
the generic load-balancing operations. At that point it can be
re-evaluated whether the load is still 'not meaningful', and
distribution towards the most idle cpu (by utilization) can be
considered.

The one caveat here is that while something like the above is fairly
general for the low-weight group-entity case, it does not extend
particularly cleanly to the non-group-scheduling case. I don't yet
have a particularly good answer for this, since without a group entity
representing the low-weight threads, tracking them for a second pass
becomes more cumbersome.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Nikhil Rao on
On Fri, Jul 30, 2010 at 6:32 AM, Mike Galbraith <efault(a)gmx.de> wrote:
> On Thu, 2010-07-29 at 22:19 -0700, Nikhil Rao wrote:
>> Hi all,
>>
>> We have observed that a large weight differential between tasks on a runqueue
>> leads to sub-optimal machine utilization and poor load balancing. For example,
>> if you have lots of SCHED_IDLE tasks (sufficient number to keep the machine 100%
>> busy) and a few SCHED_NORMAL soaker tasks, we see that the machine has
>> significant idle time.
>>
>> The data below highlights this problem. The test machine is a 4 socket quad-core
>> box (16 cpus). These experiments were done with v2.6.35-rc6. We spawn 16
>> SCHED_IDLE soaker threads (one per-cpu) to completely fill up the machine. CPU
>> utilization numbers gathered from mpstat for 10s are:
>>
>> 03:30:24 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
>> 03:30:25 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16234.65
>> 03:30:26 PM  all   99.88    0.06    0.06    0.00    0.00    0.00    0.00    0.00  16374.00
>> 03:30:27 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16392.00
>> 03:30:28 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16612.12
>> 03:30:29 PM  all   99.88    0.00    0.12    0.00    0.00    0.00    0.00    0.00  16375.00
>> 03:30:30 PM  all   99.94    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16440.00
>> 03:30:31 PM  all   99.81    0.00    0.19    0.00    0.00    0.00    0.00    0.00  16237.62
>> 03:30:32 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16360.00
>> 03:30:33 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16405.00
>> 03:30:34 PM  all   99.38    0.06    0.50    0.00    0.00    0.00    0.00    0.06  18881.82
>> Average:     all   99.86    0.02    0.12    0.00    0.00    0.00    0.00    0.01  16628.20
>>
>> We then spawn one SCHED_NORMAL while-1 task (the absolute number does not matter
>> so long as we introduce some large weight differential).
>>
>> 03:40:57 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
>> 03:40:58 PM  all   83.06    0.00    0.06    0.00    0.00    0.00    0.00   16.88  14555.00
>> 03:40:59 PM  all   78.25    0.00    0.06    0.00    0.00    0.00    0.00   21.69  14527.00
>> 03:41:00 PM  all   82.71    0.06    0.06    0.00    0.00    0.00    0.00   17.17  14879.00
>> 03:41:01 PM  all   87.34    0.00    0.06    0.00    0.00    0.00    0.00   12.59  15466.00
>> 03:41:02 PM  all   80.80    0.06    0.19    0.00    0.00    0.00    0.00   18.95  14584.00
>> 03:41:03 PM  all   82.90    0.00    0.06    0.00    0.00    0.00    0.00   17.04  14570.00
>> 03:41:04 PM  all   79.45    0.00    0.06    0.00    0.00    0.00    0.00   20.49  14536.00
>> 03:41:05 PM  all   86.48    0.00    0.07    0.00    0.00    0.00    0.00   13.46  14577.00
>> 03:41:06 PM  all   76.73    0.06    0.06    0.00    0.00    0.06    0.00   23.10  14594.00
>> 03:41:07 PM  all   86.48    0.00    0.07    0.00    0.00    0.00    0.00   13.45  14703.03
>> Average:     all   82.31    0.02    0.08    0.00    0.00    0.01    0.00   17.59  14699.10
>
> What happens with s/SCHED_IDLE/nice 19?
>
>        -Mike

We see the same result with nice 19 as well.

w/ 16 nice-19 soakers:

10:15:16 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
10:15:17 AM  all    0.06   99.94    0.00    0.00    0.00    0.00    0.00    0.00  16296.04
10:15:18 AM  all    0.00   99.94    0.06    0.00    0.00    0.00    0.00    0.00  16379.00
10:15:19 AM  all    0.00   99.94    0.06    0.00    0.00    0.00    0.00    0.00  16414.00
10:15:20 AM  all    0.00   99.94    0.06    0.00    0.00    0.00    0.00    0.00  16413.00
10:15:21 AM  all    0.00  100.00    0.00    0.00    0.00    0.00    0.00    0.00  16402.00
10:15:22 AM  all    0.00   99.88    0.06    0.00    0.00    0.06    0.00    0.00  16419.00
10:15:23 AM  all    0.00   99.94    0.06    0.00    0.00    0.00    0.00    0.00  16406.00
10:15:24 AM  all    0.19   99.69    0.12    0.00    0.00    0.00    0.00    0.00  16613.13
10:15:25 AM  all    0.38   99.31    0.31    0.00    0.00    0.00    0.00    0.00  16313.86
10:15:26 AM  all    0.50   99.31    0.19    0.00    0.00    0.00    0.00    0.00  16623.23
Average:     all    0.11   99.79    0.09    0.00    0.00    0.01    0.00    0.00  16427.30

w/ adding a SCHED_NORMAL soaker to the mix:

10:17:44 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
10:17:45 AM  all    6.20   74.38    0.06    0.00    0.00    0.00    0.00   19.35  14419.80
10:17:46 AM  all    6.25   74.89    0.06    0.00    0.00    0.00    0.00   18.80  14619.00
10:17:47 AM  all    6.30   74.84    0.06    0.00    0.00    0.00    0.00   18.79  14590.00
10:17:48 AM  all    6.25   80.57    0.06    0.00    0.00    0.00    0.00   13.12  15511.00
10:17:49 AM  all    6.51   80.33    0.07    0.00    0.00    0.00    0.00   13.09  14904.00
10:17:50 AM  all    6.06   72.62    0.06    0.00    0.00    0.00    0.00   21.26  14564.00
10:17:51 AM  all    6.21   74.47    0.06    0.00    0.00    0.00    0.00   19.25  14584.00
10:17:52 AM  all    6.47   77.67    0.12    0.00    0.00    0.00    0.00   15.73  15295.96
10:17:53 AM  all    6.27   79.39    0.06    0.00    0.00    0.00    0.00   14.29  15251.00
10:17:54 AM  all    6.32   75.85    0.00    0.00    0.00    0.00    0.00   17.83  14537.00
Average:     all    6.28   76.47    0.06    0.00    0.00    0.00    0.00   17.18  14826.70

The problem is the large weight differential between nice
19/SCHED_IDLE and SCHED_NORMAL. I ran a quick experiment with the
soaker tasks at different nice levels. Data is in the table below.
First column is nice level, second is idle% on the machine (mpstat 10s
average) and third is ratio of nice weight/1024.

nice   %idle   weight/1024
  0     0.00   1
  1     0.00   0.800781
  2     0.00   0.639648
  3     0.00   0.513672
  4     0.00   0.413086
  5     0.17   0.327148
  6     1.06   0.265625
  7     7.62   0.209961
  8     4.47   0.167969
  9    11.78   0.133789
 10    13.52   0.107422
 11    14.92   0.0849609
 12    14.33   0.0683594
 13    17.47   0.0546875
 14    15.89   0.0439453
 15    18.69   0.0351562
 16    16.63   0.0283203
 17    17.04   0.0224609
 18    17.86   0.0175781
 19    18.13   0.0146484

It looks like we start seeing sub-optimal utilization once the weight
ratio drops below about 0.3.

-Thanks,
Nikhil
From: Nikhil Rao on
Peter,

Thanks for the feedback!

On Mon, Aug 2, 2010 at 4:39 AM, Peter Zijlstra <peterz(a)infradead.org> wrote:
> The thing is that from a fairness point of view 1 nice-0 (weight=1024)
> on one CPU and 512 SCHED_IDLE (weight=2) tasks on another CPU and all
> other CPUs idle is correct.
>
> It just doesn't seem to be the thing that most people expect.
>
> Special casing things like you've done is utterly the wrong thing to do.
>

I see your point here, and yes I agree having 1 nice-0 on one cpu, 512
SCHED_IDLE tasks on another cpu and all other cpus idle is correct if
we only considered fairness. However, we would also like to maximize
machine utilization. The fitness function we would ideally like to
optimize for is a combination of both fairness and utilization.

I'm going to take a step back here and explain the bigger problem we
are trying to solve. We have two types of workloads -- high priority
tasks that need to run for short periods of time and can consume as
much cpu as required when they run; and low priority tasks that
ideally consume slack cpu. Let's say we had a machine with enough
tasks to soak up all the cpus. There are two cases to implement the
priority scheme.

The first case is to run all tasks as SCHED_NORMAL. The machine is
fully utilized but this has other side-effects. Low priority tasks
consume much more cpu than we would like, they get to preempt high
priority tasks, and this results in poor performance of high priority
tasks.

The second case is to run high priority tasks as SCHED_NORMAL and low
priority tasks as SCHED_IDLE. We choose SCHED_IDLE to minimize
interference with high priority tasks -- we don't want low priority
tasks to preempt high priority or to be favored over high priority
tasks in any way. When there are high priority tasks, we want low
priority tasks to get as little cpu as possible; thus giving them the
minimum possible weight works out well. In this case, we are able to
isolate high prio tasks from low prio tasks reasonably well. However,
sub-optimal machine utilization defeats the purpose of packing a
machine with lots of low priority tasks as we are not able to consume
the slack cpu.

This RFC is really meant to explain the problem that we are facing. We
presented one possible solution to fix this but we are also open to
other suggestions and ideas.

> This problem comes in two forms and its name is infeasible weight
> distribution. The load-balancer tries to ensure W_k ~= W_l, k,l elem_of
> CPUs, where W_k = \Sum_i w_i^k, and w_i^k is the weight of the i-th
> task on CPU k.
>
> The two cases are statically infeasible, and dynamically infeasible.
>
> We say the task-set is statically infeasible if for a task set of n
> tasks there is no way to statically distribute them on N <= n CPUs such
> that each task gets equal service (assuming the scheduling on each CPU
> is fair).
>
> We say the task-set is dynamically infeasible if for the given scenario
> there is no way to rotate the tasks to obtain equality.
>
> Lets assume 2 CPUs.
>
> Ex.1: 2 tasks of different weight.
>
> Ex.2: 3 tasks of equal weight.
>
> The first example is both statically and dynamically infeasible as there
> is no way to occupy both CPUs such that each task gets the proportional
> correct service.
>
> The second example is statically infeasible, but dynamically feasible,
> for if we rotate one task, such that we alternate between 2:1 and 1:2 in
> equal measures, each task will receive its correct 2/3rd CPU service.
>
> The current load-balancer isn't particularly skilled at either issue.
>
> The proper solution is to 'fix' find_busiest_group() so that it will:
>  - pick the heaviest cpu with more than 1 task on it
>  - slowly over-balance things
>
> The first thing will solve your issue.
>

Thanks for your suggestions; I explored the first one a bit and I
added a check into find_busiest_queue() (instead of
find_busiest_group()) to skip a cpu if it has only 1 task on it (patch
attached below - did you have something else in mind?). This fixes the
example I posted in the RFC, but it doesn't work as well when the
SCHED_NORMAL tasks have a sleep/wakeup pattern. I have some data below
where the load balancer fails to fully utilize a machine. In these
examples, I ran with the upstream kernel and with a kernel compiled
with the check in fbq().

Setup: We run 16 SCHED_IDLE soakers and about half as many
SCHED_NORMAL tasks which have 100ms / 100ms sleep/busy cycles. This is
actually a very common use case that we run into.

2.6.35-rc6

12:41:45 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
12:41:46 PM  all   97.50    0.00    0.00    0.00    0.00    0.00    0.00    2.50  16405.00
12:41:47 PM  all   89.72    0.00    0.06    0.00    0.00    0.00    0.00   10.22  15736.00
12:41:48 PM  all   90.54    0.06    0.06    0.00    0.00    0.00    0.00    9.34  15791.09
12:41:49 PM  all   93.00    0.00    0.06    0.00    0.00    0.00    0.00    6.93  15816.83
12:41:50 PM  all   96.44    0.00    0.06    0.00    0.00    0.06    0.00    3.44  16362.00
12:41:51 PM  all   97.62    0.00    0.06    0.00    0.00    0.00    0.00    2.31  16326.00
12:41:52 PM  all   99.56    0.00    0.06    0.00    0.00    0.00    0.00    0.38  16512.12
12:41:53 PM  all   99.06    0.00    0.06    0.00    0.00    0.00    0.00    0.88  16289.00
12:41:54 PM  all   99.50    0.00    0.06    0.00    0.00    0.00    0.00    0.44  16149.50
12:41:55 PM  all   98.06    0.00    0.06    0.00    0.00    0.00    0.00    1.88  16405.05
Average:     all   96.10    0.01    0.06    0.00    0.00    0.01    0.00    3.83  16177.92

2.6.35-rc6 + fbq-fix

12:42:48 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
12:42:49 PM  all   98.07    0.00    0.06    0.00    0.00    0.00    0.00    1.87  16346.00
12:42:50 PM  all   98.75    0.00    0.12    0.00    0.00    0.00    0.00    1.12  16236.63
12:42:51 PM  all   99.56    0.06    0.19    0.00    0.00    0.00    0.00    0.19  16616.16
12:42:52 PM  all   97.94    0.00    0.06    0.00    0.00    0.06    0.00    1.94  16348.00
12:42:53 PM  all   96.94    0.00    0.25    0.00    0.00    0.00    0.00    2.81  16234.65
12:42:54 PM  all   98.56    0.06    0.06    0.00    0.00    0.00    0.00    1.31  16339.00
12:42:55 PM  all   97.56    0.00    0.19    0.00    0.00    0.00    0.00    2.25  16570.71
12:42:56 PM  all   99.50    0.00    0.06    0.00    0.00    0.00    0.00    0.44  16400.00
12:42:57 PM  all   95.25    0.00    0.37    0.00    0.00    0.00    0.00    4.37  16367.00
12:42:58 PM  all   97.75    0.00    0.06    0.00    0.00    0.00    0.00    2.19  16409.00
Average:     all   97.99    0.01    0.14    0.00    0.00    0.01    0.00    1.85  16386.00

-Thanks,
Nikhil

---
fbq-fix patch:

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a878b53..e05c61f 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2742,13 +2742,8 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
 			continue;
 
 		rq = cpu_rq(i);
-		wl = weighted_cpuload(i);
 
-		/*
-		 * When comparing with imbalance, use weighted_cpuload()
-		 * which is not scaled with the cpu power.
-		 */
-		if (capacity && rq->nr_running == 1 && wl > imbalance)
+		if (capacity && rq->nr_running == 1)
 			continue;
 
 		/*
@@ -2757,6 +2752,7 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
 		 * the load can be moved away from the cpu that is potentially
 		 * running at a lower capacity.
 		 */
+		wl = weighted_cpuload(i);
 		wl = (wl * SCHED_LOAD_SCALE) / power;
 
 		if (wl > max_load) {