From: Suresh Siddha on
On Thu, 2010-04-01 at 23:20 -0700, Mike Galbraith wrote:
> Yes, if task A and task B are more or less unrelated, you'd want them to
> stay in separate domains, you'd not want some random event to pull. The
> other side of the coin is tasks which fork off partners that they will
> talk to at high frequency. They land just as far away, and desperately
> need to move into a shared cache domain. There's currently no
> discriminator, so while always asking wake_affine() may reduce the risk
> of moving a task with a large footprint, it also increases the risk of
> leaving buddies jabbering cross cache.

Mike, apart from this small tweak that you added in the wake_up() path, there
is no extra logic that keeps buddies together for long. As I was saying,
fork/exec balancing starts them apart, and in the partially loaded case (i.e.,
when the number of running tasks is <= the number of sockets or total cores)
the default load balancer policy also tries to distribute the load equally
among sockets/cores (for peak cache/memory controller bandwidth, etc.). While
the wakeup path may keep the buddies on SMT siblings, the next load balancing
event will move them far apart. If we need to keep buddies together, we need
more changes than this small tweak.

> Do you have a compute load bouncing painfully which this patch cures?
>
> I have no strong objections, and the result is certainly easier on the
> eye. If I were making the decision, I'd want to see some numbers.

All I saw in the changelog when you added this new tweak was:

commit 8b911acdf08477c059d1c36c21113ab1696c612b
Author: Mike Galbraith <efault(a)gmx.de>
Date: Thu Mar 11 17:17:16 2010 +0100

sched: Fix select_idle_sibling()

Don't bother with selection when the current cpu is idle. ....

Is it me or you who needs to provide the data justifying your new tweak
that changes the current behavior? ;)

I will run some workloads as well!

thanks,
suresh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Mike Galbraith on
On Fri, 2010-04-02 at 10:05 -0700, Suresh Siddha wrote:
> On Thu, 2010-04-01 at 23:20 -0700, Mike Galbraith wrote:
> > Yes, if task A and task B are more or less unrelated, you'd want them to
> > stay in separate domains, you'd not want some random event to pull. The
> > other side of the coin is tasks which fork off partners that they will
> > talk to at high frequency. They land just as far away, and desperately
> > need to move into a shared cache domain. There's currently no
> > discriminator, so while always asking wake_affine() may reduce the risk
> > of moving a task with a large footprint, it also increases the risk of
> > leaving buddies jabbering cross cache.
>
> Mike, apart from this small tweak that you added in the wake_up() path, there
> is no extra logic that keeps buddies together for long.

The wakeup logic is the only thing that keeps buddies together.

> As I was saying,
> fork/exec balancing starts them apart, and in the partially loaded case (i.e.,
> when the number of running tasks is <= the number of sockets or total cores)
> the default load balancer policy also tries to distribute the load equally among
> sockets/cores (for peak cache/memory controller bandwidth, etc.). While the
> wakeup path may keep the buddies on SMT siblings, the next load balancing event
> will move them far apart. If we need to keep buddies together, we need
> more changes than this small tweak.

We very definitely need to keep buddies cache affine. Yes, maybe some
additional logic is needed, as there is a conflict between load types:
a kbuild spreading to the four winds is fine, while netperf jabbering
cross-cache is horrible.

> > Do you have a compute load bouncing painfully which this patch cures?
> >
> > I have no strong objections, and the result is certainly easier on the
> > eye. If I were making the decision, I'd want to see some numbers.
>
> All I saw in the changelog when you added this new tweak was:
>
> commit 8b911acdf08477c059d1c36c21113ab1696c612b
> Author: Mike Galbraith <efault(a)gmx.de>
> Date: Thu Mar 11 17:17:16 2010 +0100
>
> sched: Fix select_idle_sibling()
>
> Don't bother with selection when the current cpu is idle. ....
>
> Is it me or you who needs to provide the data justifying your new tweak
> that changes the current behavior? ;)
>
> I will run some workloads as well!

Whoa. It was a simple question, no need to get defensive. You need not
provide anything. Forget I even asked, it's not my decision.

-Mike

From: Peter Zijlstra on
On Wed, 2010-03-31 at 16:47 -0700, Suresh Siddha wrote:

> From: Suresh Siddha <suresh.b.siddha(a)intel.com>
> Subject: sched: fix select_idle_sibling() logic in select_task_rq_fair()
>
> Issues in the current select_idle_sibling() logic in select_task_rq_fair()
> in the context of a task wake-up:
>
> a) Once we select the idle sibling, we use that domain (spanning the cpu the
> task is currently being woken up on and the idle sibling we found) in our
> wake_affine() decisions. This is completely different from the domain we are
> supposed to use: the one that spans the cpu the task is being woken up on
> and the cpu where the task previously ran.
>
> b) We do the select_idle_sibling() check only for the cpu that the task is
> currently being woken up on. If select_task_rq_fair() selects the previously
> run cpu for waking the task, doing a select_idle_sibling() check for that
> cpu would also help, and we don't do this currently.
>
> c) In the scenarios where the cpu the task is woken up on is busy but its HT
> siblings are idle, we select the idle HT sibling for the wake-up instead of
> a core that the task previously ran on and that is currently completely idle.
> That is, we are not making decisions based on wake_affine() but directly
> selecting an idle sibling, which can cause an imbalance at the SMT/MC level
> that will only later be corrected by the periodic load balancer.
>
> Fix this by first going through the load imbalance calculations using
> wake_affine(), and once we have decided between the woken-up cpu and the
> previously-run cpu, then choosing a possible idle sibling to wake the task on.
>
> Signed-off-by: Suresh Siddha <suresh.b.siddha(a)intel.com>

OK, so I'm going to take this, but I had one concern, see below:

> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 49ad993..f905a4b 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1385,28 +1385,48 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> * Try and locate an idle CPU in the sched_domain.
> */
> static int
> +select_idle_sibling(struct task_struct *p, int target)
> {
> int cpu = smp_processor_id();
> int prev_cpu = task_cpu(p);
> int i;
> + struct sched_domain *sd;
>
> /*
> + * If the task is going to be woken-up on this cpu and if it is
> + * already idle, then it is the right target.
> */
> + if (target == cpu && !cpu_rq(cpu)->cfs.nr_running)
> + return cpu;
> +
> + /*
> + * If the task is going to be woken-up on the cpu where it previously
> + * ran and if it is currently idle, then it is the right target.
> + */
> + if (target == prev_cpu && !cpu_rq(prev_cpu)->cfs.nr_running)
> return prev_cpu;
>
> /*
> + * Otherwise, iterate the domains and find an eligible idle cpu.
> */
> + for_each_domain(target, sd) {
> + if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
> break;
> +
> + for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
> + if (!cpu_rq(i)->cfs.nr_running) {
> + target = i;
> + break;
> + }
> }
> +
> + /*
> + * Let's stop looking for an idle sibling once we reach
> + * the domain that spans the current cpu and prev_cpu.
> + */
> + if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> + cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> + break;
> }
>
> return target;

So here we keep using !cfs.nr_running to mean idle, which might not at
all be true when there are real-time tasks around.

So should we be using idle_cpu(i) instead?

> @@ -1429,7 +1449,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
> int cpu = smp_processor_id();
> int prev_cpu = task_cpu(p);
> int new_cpu = cpu;
> + int want_affine = 0;
> int want_sd = 1;
> int sync = wake_flags & WF_SYNC;
>
> @@ -1467,36 +1487,15 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
> want_sd = 0;
> }
>
> if (want_affine) {
> /*
> * If both cpu and prev_cpu are part of this domain,
> * cpu is a valid SD_WAKE_AFFINE target.
> */
> + if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))
> + && (tmp->flags & SD_WAKE_AFFINE)) {
> + affine_sd = tmp;
> + want_affine = 0;
> }
> }
>
> @@ -1527,8 +1526,10 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
> #endif
>
> if (affine_sd) {
> + if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
> + return select_idle_sibling(p, cpu);
> + else
> + return select_idle_sibling(p, prev_cpu);
> }
>
> while (sd) {
>
>



From: Peter Zijlstra on
On Tue, 2010-04-20 at 10:46 +0200, Peter Zijlstra wrote:
> So should we be using idle_cpu(i) instead?

something like the below..

---
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1375,26 +1375,25 @@ find_idlest_cpu(struct sched_group *grou
/*
* Try and locate an idle CPU in the sched_domain.
*/
-static int
-select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int target)
{
int cpu = smp_processor_id();
int prev_cpu = task_cpu(p);
- int i;
struct sched_domain *sd;
+ int i;

/*
* If the task is going to be woken-up on this cpu and if it is
* already idle, then it is the right target.
*/
- if (target == cpu && !cpu_rq(cpu)->cfs.nr_running)
+ if (target == cpu && idle_cpu(cpu))
return cpu;

/*
* If the task is going to be woken-up on the cpu where it previously
 * ran and if it is currently idle, then it is the right target.
*/
- if (target == prev_cpu && !cpu_rq(prev_cpu)->cfs.nr_running)
+ if (target == prev_cpu && idle_cpu(prev_cpu))
return prev_cpu;

/*
@@ -1405,7 +1404,7 @@ select_idle_sibling(struct task_struct *
break;

for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
- if (!cpu_rq(i)->cfs.nr_running) {
+ if (idle_cpu(i)) {
target = i;
break;
}
@@ -1479,16 +1478,14 @@ select_task_rq_fair(struct rq *rq, struc
want_sd = 0;
}

- if (want_affine) {
- /*
- * If both cpu and prev_cpu are part of this domain,
- * cpu is a valid SD_WAKE_AFFINE target.
- */
- if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))
- && (tmp->flags & SD_WAKE_AFFINE)) {
- affine_sd = tmp;
- want_affine = 0;
- }
+ /*
+ * If both cpu and prev_cpu are part of this domain,
+ * cpu is a valid SD_WAKE_AFFINE target.
+ */
+ if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
+ cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
+ affine_sd = tmp;
+ want_affine = 0;
}

if (!want_sd && !want_affine)


From: Suresh Siddha on
On Tue, 2010-04-20 at 01:55 -0700, Peter Zijlstra wrote:
> On Tue, 2010-04-20 at 10:46 +0200, Peter Zijlstra wrote:
> > So should we be using idle_cpu(i) instead?
>
> something like the below..

Looks good to me.

Acked-by: Suresh Siddha <suresh.b.siddha(a)intel.com>

>
> ---
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -1375,26 +1375,25 @@ find_idlest_cpu(struct sched_group *grou
> /*
> * Try and locate an idle CPU in the sched_domain.
> */
> -static int
> -select_idle_sibling(struct task_struct *p, int target)
> +static int select_idle_sibling(struct task_struct *p, int target)
> {
> int cpu = smp_processor_id();
> int prev_cpu = task_cpu(p);
> - int i;
> struct sched_domain *sd;
> + int i;
>
> /*
> * If the task is going to be woken-up on this cpu and if it is
> * already idle, then it is the right target.
> */
> - if (target == cpu && !cpu_rq(cpu)->cfs.nr_running)
> + if (target == cpu && idle_cpu(cpu))
> return cpu;
>
> /*
> * If the task is going to be woken-up on the cpu where it previously
> * ran and if it is currently idle, then it the right target.
> */
> - if (target == prev_cpu && !cpu_rq(prev_cpu)->cfs.nr_running)
> + if (target == prev_cpu && idle_cpu(prev_cpu))
> return prev_cpu;
>
> /*
> @@ -1405,7 +1404,7 @@ select_idle_sibling(struct task_struct *
> break;
>
> for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
> - if (!cpu_rq(i)->cfs.nr_running) {
> + if (idle_cpu(i)) {
> target = i;
> break;
> }
> @@ -1479,16 +1478,14 @@ select_task_rq_fair(struct rq *rq, struc
> want_sd = 0;
> }
>
> - if (want_affine) {
> - /*
> - * If both cpu and prev_cpu are part of this domain,
> - * cpu is a valid SD_WAKE_AFFINE target.
> - */
> - if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))
> - && (tmp->flags & SD_WAKE_AFFINE)) {
> - affine_sd = tmp;
> - want_affine = 0;
> - }
> + /*
> + * If both cpu and prev_cpu are part of this domain,
> + * cpu is a valid SD_WAKE_AFFINE target.
> + */
> + if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> + affine_sd = tmp;
> + want_affine = 0;
> }
>
> if (!want_sd && !want_affine)
>
>
