[RFC] Large weight differential leads to inefficient load balancing [Kernel]

From: Peter Zijlstra on 4 Aug 2010 06:20

On Tue, 2010-08-03 at 14:28 -0700, Nikhil Rao wrote:

> I see your point here, and yes I agree having 1 nice-0 on one cpu, 512
> SCHED_IDLE tasks on another cpu and all other cpus idle is correct if
> we only considered fairness. However, we would also like to maximize
> machine utilization. The fitness function we would ideally like to
> optimize for is a combination of both fairness and utilization.

Sure, I see (and agree with) the fact that we want to optimize
utilization as well (although I bet the power management people might
feel otherwise :-)

> Thanks for your suggestions; I explored the first one a bit and I
> added a check into find_busiest_queue() (instead of
> find_busiest_group()) to skip a cpu if it has only 1 task on it (patch
> attached below - did you have something else in mind?).

You might also need some changes to find_busiest_group(). Suppose you
have a 4-cpu machine with 2 groups of 2 cpus, and 4 tasks: 2 nice-0 and
2 idle. If both nice-0 tasks are in the same group, each on its own cpu,
then f_b_g() could select that group as the busiest (it's got W=2048,
against W=4 for the other group, after all).

Once you have that group, f_b_q() won't be able to do anything sensible:
every cpu in it has only 1 task, so your check skips them all.
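
To make the example concrete, here is a toy model of it in Python. It is
only a sketch and not kernel code: the two functions merely borrow the
names find_busiest_group() and find_busiest_queue(), the per-task weights
(1024 for nice-0, 2 for idle) are back-derived from the W=2048 and W=4
figures above, and placing both idle tasks on one cpu of the second group
is an assumption made for illustration.

```python
# Toy model of the 4-cpu, 2-group example; NOT the kernel implementation.
NICE_0_WEIGHT = 1024  # assumed: W=2048 for two nice-0 tasks
IDLE_WEIGHT = 2       # assumed: W=4 for two idle tasks


def group_weight(group):
    """Sum the weights of all tasks on all cpus of a group."""
    return sum(sum(cpu) for cpu in group)


def find_busiest_group(groups):
    """Return the index of the group with the largest total weight."""
    return max(range(len(groups)), key=lambda g: group_weight(groups[g]))


def find_busiest_queue(group, skip_single_task=True):
    """Return the index of the heaviest cpu in the group.

    With skip_single_task set, cpus running at most one task are
    ignored (the check discussed in this thread). Returns None when
    no cpu qualifies.
    """
    best, best_load = None, 0
    for i, cpu in enumerate(group):
        if skip_single_task and len(cpu) <= 1:
            continue
        load = sum(cpu)
        if load > best_load:
            best, best_load = i, load
    return best


# groups -> cpus -> list of task weights
groups = [
    [[NICE_0_WEIGHT], [NICE_0_WEIGHT]],  # group 0: one nice-0 task per cpu
    [[IDLE_WEIGHT, IDLE_WEIGHT], []],    # group 1: both idle tasks on one cpu
]

busiest = find_busiest_group(groups)
print("busiest group:", busiest, "weight:", group_weight(groups[busiest]))
print("busiest queue in it:", find_busiest_queue(groups[busiest]))
```

In this model the group holding the two nice-0 tasks wins on weight, yet
the queue search inside it comes back empty because each of its cpus runs
a single task.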

> This fixes the
> example I posted in the RFC, but it doesn't work as well when the
> SCHED_NORMAL tasks have a sleep/wakeup pattern. I have some data below
> where the load balancer fails to fully utilize a machine. In these
> examples, I ran with the upstream kernel and with a kernel compiled
> with the check in fbq().

Right, so wakeup/sleep are indeed more interesting. For wakeup we also
have select_task_rq() to consider; it is responsible for choosing where
to run the newly woken task.

For sleeps we have new idle balancing, which is a lot like the regular
load-balancing but differs enough to need looking at.

From the data you provided I cannot tell which of these two is
responsible for what you see (although under-utilization suggests the
new-idle balancer). You can use perf/ftrace to look at what your tasks
are doing and how they could be doing it better (Arjan's timechart might
be a good help).

If they get woken on the wrong CPU, it's select_task_rq(); if they leave
a CPU idle too long, it's new idle balancing, or possibly it's something
I overlooked altogether :-)

