From: Vaidyanathan Srinivasan on
* Suresh Siddha <suresh.b.siddha(a)intel.com> [2010-02-12 17:14:22]:

> PeterZ/Ingo,
>
> Ling Ma and Yanmin reported this SMT scheduler regression which led to
> the condition where both the SMT threads on a core are busy while the
> other cores in the same socket are completely idle, causing a major
> performance regression. I have appended a fix for this. This is a
> relatively low-risk fix and if you agree with both the fix and the
> risk assessment, can we please push this to Linus so that we can address
> this in 2.6.33?

Hi Suresh,

I have been looking at this issue in order to make
sched_smt_powersavings work. In my simple tests I find that the
default behavior is to have one task per core first, since the total
cpu power of the core will be 1178 (589*2), which is not sufficient to
keep two tasks balanced in the group.

In the scenario that you have described, even though the group has
been identified as busiest, find_busiest_queue() will return NULL
since wl will be 1780 {load(1024)*SCHED_LOAD_SCALE/power(589)}, leading
to wl being greater than the imbalance.
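
To make the arithmetic concrete, here is a minimal stand-alone sketch of
the two comparisons (a user-space illustration only, not the kernel code;
the 589 power and 1024 load/imbalance values are the ones quoted above):

#include <stdio.h>

#define SCHED_LOAD_SCALE 1024UL

int main(void)
{
	unsigned long power     = 589;   /* cpu power of one SMT sibling */
	unsigned long load      = 1024;  /* weighted_cpuload() of the busy thread */
	unsigned long imbalance = 1024;  /* approximate group imbalance */

	/* Scaling introduced by commit bdb94aa5: wl becomes ~1780, so
	 * "wl > imbalance" holds, the runqueue is skipped and
	 * find_busiest_queue() ends up returning NULL. */
	unsigned long scaled_wl = load * SCHED_LOAD_SCALE / power;
	printf("scaled wl = %lu, skipped = %d\n", scaled_wl, scaled_wl > imbalance);

	/* With the proposed fix the unscaled load is compared instead, so a
	 * single-task runqueue is no longer skipped and the task can be
	 * pulled to an idle core. */
	printf("raw wl    = %lu, skipped = %d\n", load, load > imbalance);

	return 0;
}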

The fix that you have posted will solve the problem described.
However we need to make sched_smt_powersavings also work by increasing
the group capacity and allowing two tasks to run in a core.

As Peter mentioned, the SD_PREFER_SIBLING flag is meant to spread the work
across groups at any sched domain, so that the solution will work for
pre-Nehalem quad cores as well. But it still needs some work to get it
right. Please refer to my earlier bug report at:

http://lkml.org/lkml/2010/2/8/80

The solution you have posted will not work for non-HT quad cores, where
we want the tasks to be spread across cache domains for best
performance, though the impact there is not as severe a performance
regression as in the case of Nehalem.

I will test your solution in different scenarios and post updates.

Thanks,
Vaidy


> thanks,
> suresh
> ---
>
> From: Suresh Siddha <suresh.b.siddha(a)intel.com>
> Subject: sched: fix SMT scheduler regression in find_busiest_queue()
>
> Fix a SMT scheduler performance regression that is leading to a scenario
> where SMT threads in one core are completely idle while both the SMT threads
> in another core (on the same socket) are busy.
>
> This is caused by this commit (with the problematic code highlighted)
>
> commit bdb94aa5dbd8b55e75f5a50b61312fe589e2c2d1
> Author: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
> Date: Tue Sep 1 10:34:38 2009 +0200
>
> sched: Try to deal with low capacity
>
> @@ -4203,15 +4223,18 @@ find_busiest_queue()
> ...
> for_each_cpu(i, sched_group_cpus(group)) {
> + unsigned long power = power_of(i);
>
> ...
>
> - wl = weighted_cpuload(i);
> + wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> + wl /= power;
>
> - if (rq->nr_running == 1 && wl > imbalance)
> + if (capacity && rq->nr_running == 1 && wl > imbalance)
> continue;
>
> On a SMT system, power of the HT logical cpu will be 589 and
> the scheduler load imbalance (for scenarios like the one mentioned above)
> can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
> the weighted load with the power will result in "wl > imbalance",
> ultimately causing find_busiest_queue() to return NULL and
> load_balance() to think that the load is well balanced. But in fact
> one of the tasks can be moved to the idle core for optimal performance.
>
> We don't need to use the weighted load (wl) scaled by the cpu power to
> compare with imbalance. In that condition, we already know there is only a
> single task ("rq->nr_running == 1") and the comparison between imbalance and
> wl is to make sure that we select the correct priority thread which matches
> the imbalance. So we really need to compare the imbalance with the original
> weighted load of the cpu and not the scaled load.
>
> But in other conditions where we want the most hammered(busiest) cpu, we can
> use scaled load to ensure that we consider the cpu power in addition to the
> actual load on that cpu, so that we can move the load away from the
> guy that is getting most hammered with respect to the actual capacity,
> as compared with the rest of the cpu's in that busiest group.
>
> Fix it.
>
> Reported-by: Ma Ling <ling.ma(a)intel.com>
> Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang(a)linux.intel.com>
> Signed-off-by: Suresh Siddha <suresh.b.siddha(a)intel.com>
> Cc: stable(a)kernel.org [2.6.32.x]
> ---
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 3a8fb30..bef5369 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> continue;
>
> rq = cpu_rq(i);
> - wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> - wl /= power;
> + wl = weighted_cpuload(i);
>
> + /*
> + * When comparing with imbalance, use weighted_cpuload()
> + * which is not scaled with the cpu power.
> + */
> if (capacity && rq->nr_running == 1 && wl > imbalance)
> continue;
>
> + /*
> + * For the load comparisons with the other cpu's, consider
> + * the weighted_cpuload() scaled with the cpu power, so that
> + * the load can be moved away from the cpu that is potentially
> + * running at a lower capacity.
> + */
> + wl = (wl * SCHED_LOAD_SCALE) / power;
> +
> if (wl > max_load) {
> max_load = wl;
> busiest = rq;
>
>
From: Suresh Siddha on
On Sat, 2010-02-13 at 11:27 -0700, Vaidyanathan Srinivasan wrote:
> The fix that you have posted will solve the problem described.

Thanks. This SMT scheduler regression is critical for performance and
I would like Ingo/PeterZ to push this to Linus as soon as possible. We can
fix other known issues when we have patches ready and acceptable to
everyone. Agree?

> However we need to make sched_smt_powersavings also work by increasing
> the group capacity and allowing two tasks to run in a core.

I don't think you are saying that this patch breaks sched_smt_powersavings?
If so, we need to address the power-saving aspect differently. At least this
is not as critical, as we don't have any customer who is using the
smt/mc powersavings tunables.

> As Peter mentioned, the SD_PREFER_SIBLING flag is meant to spread the work
> across groups at any sched domain, so that the solution will work for
> pre-Nehalem quad cores as well. But it still needs some work to get it
> right.

Agree.

> The solution you have posted will not work for non-HT quad cores, where
> we want the tasks to be spread across cache domains for best
> performance, though the impact there is not as severe a performance
> regression as in the case of Nehalem.

This is a completely different issue from this patch and I have started
another thread for it.

thanks
suresh


From: Vaidyanathan Srinivasan on
* Suresh Siddha <suresh.b.siddha(a)intel.com> [2010-02-13 10:39:36]:

> On Sat, 2010-02-13 at 11:27 -0700, Vaidyanathan Srinivasan wrote:
> > The fix that you have posted will solve the problem described.
>
> Thanks. This SMT scheduler regression is critical for performance and
> I would like Ingo/PeterZ to push this to Linus as soon as possible. We can
> fix other known issues when we have patches ready and acceptable to
> everyone. Agree?

Yes, Agreed.

> > However we need to make sched_smt_powersavings also work by increasing
> > the group capacity and allowing two tasks to run in a core.
>
> I don't think you are saying that this patch breaks sched_smt_powersavings?
> If so, we need to address the power-saving aspect differently. At least this
> is not as critical, as we don't have any customer who is using the
> smt/mc powersavings tunables.

Correct. This patch does not break sched_smt_powersavings; an additional
change in group capacity is needed. More work is needed, but nothing
to hold against this fix.

We would want customers to start using powersavings tunables and make
them work reliably. However, I agree that performance comes first :)

> > As Peter mentioned, the SD_PREFER_SIBLING flag is meant to spread the work
> > across groups at any sched domain, so that the solution will work for
> > pre-Nehalem quad cores as well. But it still needs some work to get it
> > right.
>
> Agree.
>
> > The solution you have posted will not work for non-HT quad cores, where
> > we want the tasks to be spread across cache domains for best
> > performance, though the impact there is not as severe a performance
> > regression as in the case of Nehalem.
>
> This is a completely different issue from this patch and I have started
> another thread for it.

Correct. We can incrementally solve the balancing for different scenarios.

--Vaidy
From: Vaidyanathan Srinivasan on
* Vaidyanathan Srinivasan <svaidy(a)linux.vnet.ibm.com> [2010-02-13 23:57:48]:

> * Suresh Siddha <suresh.b.siddha(a)intel.com> [2010-02-12 17:14:22]:
>
> > PeterZ/Ingo,
> >
> > Ling Ma and Yanmin reported this SMT scheduler regression which led to
> > the condition where both the SMT threads on a core are busy while the
> > other cores in the same socket are completely idle, causing a major
> > performance regression. I have appended a fix for this. This is a
> > relatively low-risk fix and if you agree with both the fix and the
> > risk assessment, can we please push this to Linus so that we can address
> > this in 2.6.33?
>
> Hi Suresh,
>
> I have been looking at this issue in order to make
> sched_smt_powersavings work. In my simple tests I find that the
> default behavior is to have one task per core first, since the total
> cpu power of the core will be 1178 (589*2), which is not sufficient to
> keep two tasks balanced in the group.
>
> In the scenario that you have described, even though the group has
> been identified as busiest, find_busiest_queue() will return NULL
> since wl will be 1780 {load(1024)*SCHED_LOAD_SCALE/power(589)}, leading
> to wl being greater than the imbalance.
>
> The fix that you have posted will solve the problem described.
> However we need to make sched_smt_powersavings also work by increasing
> the group capacity and allowing two tasks to run in a core.
>
> As Peter mentioned, the SD_PREFER_SIBLING flag is meant to spread the work
> across groups at any sched domain, so that the solution will work for
> pre-Nehalem quad cores as well. But it still needs some work to get it
> right. Please refer to my earlier bug report at:
>
> http://lkml.org/lkml/2010/2/8/80
>
> The solution you have posted will not work for non-HT quad cores, where
> we want the tasks to be spread across cache domains for best
> performance, though the impact there is not as severe a performance
> regression as in the case of Nehalem.
>
> I will test your solution in different scenarios and post updates.

I have tested this patch on Nehalem and it provides the desired result
when sched_smt/mc_powersavings=0. One task per core is placed before
using sibling threads. However the core usage is mostly balanced across
different packages but not always. Incorrect consolidation to SMT
threads when free cores are available does not happen once this fix is
applied.

> Thanks,
> Vaidy
>
>
> > thanks,
> > suresh
> > ---
> >
> > From: Suresh Siddha <suresh.b.siddha(a)intel.com>
> > Subject: sched: fix SMT scheduler regression in find_busiest_queue()
> >
> > Fix a SMT scheduler performance regression that is leading to a scenario
> > where SMT threads in one core are completely idle while both the SMT threads
> > in another core (on the same socket) are busy.
> >
> > This is caused by this commit (with the problematic code highlighted)
> >
> > commit bdb94aa5dbd8b55e75f5a50b61312fe589e2c2d1
> > Author: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
> > Date: Tue Sep 1 10:34:38 2009 +0200
> >
> > sched: Try to deal with low capacity
> >
> > @@ -4203,15 +4223,18 @@ find_busiest_queue()
> > ...
> > for_each_cpu(i, sched_group_cpus(group)) {
> > + unsigned long power = power_of(i);
> >
> > ...
> >
> > - wl = weighted_cpuload(i);
> > + wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > + wl /= power;
> >
> > - if (rq->nr_running == 1 && wl > imbalance)
> > + if (capacity && rq->nr_running == 1 && wl > imbalance)
> > continue;
> >
> > On a SMT system, power of the HT logical cpu will be 589 and
> > the scheduler load imbalance (for scenarios like the one mentioned above)
> > can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
> > the weighted load with the power will result in "wl > imbalance",
> > ultimately causing find_busiest_queue() to return NULL and
> > load_balance() to think that the load is well balanced. But in fact
> > one of the tasks can be moved to the idle core for optimal performance.
> >
> > We don't need to use the weighted load (wl) scaled by the cpu power to
> > compare with imbalance. In that condition, we already know there is only a
> > single task ("rq->nr_running == 1") and the comparison between imbalance and
> > wl is to make sure that we select the correct priority thread which matches
> > the imbalance. So we really need to compare the imbalance with the original
> > weighted load of the cpu and not the scaled load.
> >
> > But in other conditions where we want the most hammered(busiest) cpu, we can
> > use scaled load to ensure that we consider the cpu power in addition to the
> > actual load on that cpu, so that we can move the load away from the
> > guy that is getting most hammered with respect to the actual capacity,
> > as compared with the rest of the cpu's in that busiest group.
> >
> > Fix it.
> >
> > Reported-by: Ma Ling <ling.ma(a)intel.com>
> > Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang(a)linux.intel.com>
> > Signed-off-by: Suresh Siddha <suresh.b.siddha(a)intel.com>

Acked-by: Vaidyanathan Srinivasan <svaidy(a)linux.vnet.ibm.com>

> > Cc: stable(a)kernel.org [2.6.32.x]
> > ---
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 3a8fb30..bef5369 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> > continue;
> >
> > rq = cpu_rq(i);
> > - wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > - wl /= power;
> > + wl = weighted_cpuload(i);
> >
> > + /*
> > + * When comparing with imbalance, use weighted_cpuload()
> > + * which is not scaled with the cpu power.
> > + */
> > if (capacity && rq->nr_running == 1 && wl > imbalance)
> > continue;
> >
> > + /*
> > + * For the load comparisons with the other cpu's, consider
> > + * the weighted_cpuload() scaled with the cpu power, so that
> > + * the load can be moved away from the cpu that is potentially
> > + * running at a lower capacity.
> > + */
> > + wl = (wl * SCHED_LOAD_SCALE) / power;
> > +
> > if (wl > max_load) {
> > max_load = wl;
> > busiest = rq;
> >
> >
From: Vaidyanathan Srinivasan on
* Vaidyanathan Srinivasan <svaidy(a)linux.vnet.ibm.com> [2010-02-14 01:55:52]:

> * Vaidyanathan Srinivasan <svaidy(a)linux.vnet.ibm.com> [2010-02-13 23:57:48]:
>
> > * Suresh Siddha <suresh.b.siddha(a)intel.com> [2010-02-12 17:14:22]:
> >
> > > PeterZ/Ingo,
> > >
> > > Ling Ma and Yanmin reported this SMT scheduler regression which led to
> > > the condition where both the SMT threads on a core are busy while the
> > > other cores in the same socket are completely idle, causing a major
> > > performance regression. I have appended a fix for this. This is a
> > > relatively low-risk fix and if you agree with both the fix and the
> > > risk assessment, can we please push this to Linus so that we can address
> > > this in 2.6.33?
> >
> > Hi Suresh,
> >
> > I have been looking at this issue in order to make
> > sched_smt_powersavings work. In my simple tests I find that the
> > default behavior is to have one task per core first, since the total
> > cpu power of the core will be 1178 (589*2), which is not sufficient to
> > keep two tasks balanced in the group.
> >
> > In the scenario that you have described, even though the group has
> > been identified as busiest, find_busiest_queue() will return NULL
> > since wl will be 1780 {load(1024)*SCHED_LOAD_SCALE/power(589)}, leading
> > to wl being greater than the imbalance.
> >
> > The fix that you have posted will solve the problem described.
> > However we need to make sched_smt_powersavings also work by increasing
> > the group capacity and allowing two tasks to run in a core.
> >
> > As Peter mentioned, the SD_PREFER_SIBLING flag is meant to spread the work
> > across groups at any sched domain, so that the solution will work for
> > pre-Nehalem quad cores as well. But it still needs some work to get it
> > right. Please refer to my earlier bug report at:
> >
> > http://lkml.org/lkml/2010/2/8/80
> >
> > The solution you have posted will not work for non-HT quad cores, where
> > we want the tasks to be spread across cache domains for best
> > performance, though the impact there is not as severe a performance
> > regression as in the case of Nehalem.
> >
> > I will test your solution in different scenarios and post updates.
>
> I have tested this patch on Nehalem and it provides the desired result
> when sched_smt/mc_powersavings=0. One task per core is placed before
> using sibling threads. However the core usage is mostly balanced across
> different packages but not always. Incorrect consolidation to SMT
> threads when free cores are available does not happen once this fix is
> applied.
>
> > Thanks,
> > Vaidy
> >
> >
> > > thanks,
> > > suresh
> > > ---
> > >
> > > From: Suresh Siddha <suresh.b.siddha(a)intel.com>
> > > Subject: sched: fix SMT scheduler regression in find_busiest_queue()
> > >
> > > Fix a SMT scheduler performance regression that is leading to a scenario
> > > where SMT threads in one core are completely idle while both the SMT threads
> > > in another core (on the same socket) are busy.
> > >
> > > This is caused by this commit (with the problematic code highlighted)
> > >
> > > commit bdb94aa5dbd8b55e75f5a50b61312fe589e2c2d1
> > > Author: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
> > > Date: Tue Sep 1 10:34:38 2009 +0200
> > >
> > > sched: Try to deal with low capacity
> > >
> > > @@ -4203,15 +4223,18 @@ find_busiest_queue()
> > > ...
> > > for_each_cpu(i, sched_group_cpus(group)) {
> > > + unsigned long power = power_of(i);
> > >
> > > ...
> > >
> > > - wl = weighted_cpuload(i);
> > > + wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > > + wl /= power;
> > >
> > > - if (rq->nr_running == 1 && wl > imbalance)
> > > + if (capacity && rq->nr_running == 1 && wl > imbalance)
> > > continue;
> > >
> > > On a SMT system, power of the HT logical cpu will be 589 and
> > > the scheduler load imbalance (for scenarios like the one mentioned above)
> > > can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
> > > the weighted load with the power will result in "wl > imbalance",
> > > ultimately causing find_busiest_queue() to return NULL and
> > > load_balance() to think that the load is well balanced. But in fact
> > > one of the tasks can be moved to the idle core for optimal performance.
> > >
> > > We don't need to use the weighted load (wl) scaled by the cpu power to
> > > compare with imbalance. In that condition, we already know there is only a
> > > single task ("rq->nr_running == 1") and the comparison between imbalance and
> > > wl is to make sure that we select the correct priority thread which matches
> > > the imbalance. So we really need to compare the imbalance with the original
> > > weighted load of the cpu and not the scaled load.
> > >
> > > But in other conditions where we want the most hammered(busiest) cpu, we can
> > > use scaled load to ensure that we consider the cpu power in addition to the
> > > actual load on that cpu, so that we can move the load away from the
> > > guy that is getting most hammered with respect to the actual capacity,
> > > as compared with the rest of the cpu's in that busiest group.
> > >
> > > Fix it.
> > >
> > > Reported-by: Ma Ling <ling.ma(a)intel.com>
> > > Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang(a)linux.intel.com>
> > > Signed-off-by: Suresh Siddha <suresh.b.siddha(a)intel.com>
>
> Acked-by: Vaidyanathan Srinivasan <svaidy(a)linux.vnet.ibm.com>
>
> > > Cc: stable(a)kernel.org [2.6.32.x]
> > > ---
> > >
> > > diff --git a/kernel/sched.c b/kernel/sched.c
> > > index 3a8fb30..bef5369 100644
> > > --- a/kernel/sched.c
> > > +++ b/kernel/sched.c
> > > @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> > > continue;
> > >
> > > rq = cpu_rq(i);
> > > - wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > > - wl /= power;
> > > + wl = weighted_cpuload(i);
> > >
> > > + /*
> > > + * When comparing with imbalance, use weighted_cpuload()
> > > + * which is not scaled with the cpu power.
> > > + */
> > > if (capacity && rq->nr_running == 1 && wl > imbalance)
> > > continue;
> > >
> > > + /*
> > > + * For the load comparisons with the other cpu's, consider
> > > + * the weighted_cpuload() scaled with the cpu power, so that
> > > + * the load can be moved away from the cpu that is potentially
> > > + * running at a lower capacity.
> > > + */
> > > + wl = (wl * SCHED_LOAD_SCALE) / power;
> > > +
> > > if (wl > max_load) {
> > > max_load = wl;
> > > busiest = rq;
> > >
> > >

In addition to the above fix, for sched_smt_powersavings to work, the
group capacity of the core (mc level) should be made 2 in
update_sg_lb_stats() by changing DIV_ROUND_CLOSEST() to
DIV_ROUND_UP():

	sgs->group_capacity =
		DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE);

Ideally we can change this to DIV_ROUND_UP and let the SD_PREFER_SIBLING
flag force the capacity to 1. We need to see whether there are any side
effects of setting SD_PREFER_SIBLING at the SIBLING-level sched domain
based on the sched_smt_powersavings flag.
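
For sched_smt_powersavings, a rough stand-alone sketch of why the rounding
matters for a core with two SMT siblings (the macros are reimplemented
locally just for this illustration, and the 589-per-thread power is the
value quoted earlier in the thread):

#include <stdio.h>

/* Local stand-ins for the kernel macros, for illustration only. */
#define DIV_ROUND_CLOSEST(x, d)	(((x) + ((d) / 2)) / (d))
#define DIV_ROUND_UP(x, d)	(((x) + (d) - 1) / (d))
#define SCHED_LOAD_SCALE	1024UL

int main(void)
{
	unsigned long cpu_power = 589 * 2;	/* two SMT siblings: 1178 */

	/* Current code: 1178/1024 rounds to 1, so the MC-level group
	 * is treated as having room for only one task per core. */
	printf("DIV_ROUND_CLOSEST -> capacity %lu\n",
	       DIV_ROUND_CLOSEST(cpu_power, SCHED_LOAD_SCALE));

	/* Proposed change: rounds up to 2, allowing two tasks to be
	 * consolidated onto one core when powersavings asks for it. */
	printf("DIV_ROUND_UP      -> capacity %lu\n",
	       DIV_ROUND_UP(cpu_power, SCHED_LOAD_SCALE));

	return 0;
}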

Will post a patch after more tests.

--Vaidy
