From: Peter Zijlstra on
On Fri, 2010-04-16 at 15:58 +0200, Peter Zijlstra wrote:
>
>
> Hrmm, my brain seems muddled but I might have another solution, let me
> ponder this for a bit..
>

Right, so the thing I was thinking about is taking the group capacity
into account when determining the capacity for a single cpu.

Say the group contains all the SMT siblings, then use the group capacity
(usually larger than 1024) and then distribute the capacity over the
group members, preferring CPUs with higher individual cpu_power over
those with less.

So suppose you've got 4 siblings with cpu_power=294 each, then we assign
capacity 1 to the first member, and the remaining 152 is insufficient,
and thus we stop and the rest lives with 0 capacity.

Now take the example where the first sibling is running a heavy RT
load and its cpu_power is reduced to, say, 50; then we've still got 932
over the whole group, which is still sufficient for one unit of capacity,
but because the first sibling is low we'll assign it 0 and instead
assign 1 to the second, again leaving the third and fourth 0.

If the group were a core group, the total would be much higher and we'd
likely end up assigning 1 to each before we'd run out of capacity.
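
To make the arithmetic concrete, here is a rough, untested sketch of the
distribution described above (the function and array names are invented for
the example; DIV_ROUND_CLOSEST() and SCHED_LOAD_SCALE are the existing
kernel definitions):

	/*
	 * Sketch only: compute the group capacity the way
	 * update_sg_lb_stats() does today, then hand the whole units to
	 * the members with the highest individual cpu_power.  power[] is
	 * assumed sorted from highest to lowest cpu_power.
	 */
	static void distribute_group_capacity(const unsigned int *power,
					      unsigned int *capacity, int nr)
	{
		unsigned int group_power = 0;
		int units, i;

		for (i = 0; i < nr; i++)
			group_power += power[i];

		/* 4 * 294 = 1176 -> 1 unit;  50 + 3 * 294 = 932 -> still 1 unit */
		units = DIV_ROUND_CLOSEST(group_power, SCHED_LOAD_SCALE);

		/* hand the whole units to the strongest siblings; the rest get 0 */
		for (i = 0; i < nr; i++)
			capacity[i] = (i < units) ? 1 : 0;
	}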


For power savings, we can lower the threshold and maybe use the maximal
individual cpu_power in the group to base 1 capacity from.

So, suppose the second example, where sibling0 has 50 and the others
have 294, you'd end up with a capacity distribution of: {0,1,1,1}.


From: Vaidyanathan Srinivasan on
* Peter Zijlstra <peterz(a)infradead.org> [2010-05-31 10:33:16]:

> On Fri, 2010-04-16 at 15:58 +0200, Peter Zijlstra wrote:
> >
> >
> > Hrmm, my brain seems muddled but I might have another solution, let me
> > ponder this for a bit..
> >
>
> Right, so the thing I was thinking about is taking the group capacity
> into account when determining the capacity for a single cpu.
>
> Say the group contains all the SMT siblings, then use the group capacity
> (usually larger than 1024) and then distribute the capacity over the
> group members, preferring CPUs with higher individual cpu_power over
> those with less.
>
> So suppose you've got 4 siblings with cpu_power=294 each, then we assign
> capacity 1 to the first member, and the remaining 152 is insufficient,
> and thus we stop and the rest lives with 0 capacity.
>
> Now take the example where the first sibling is running a heavy RT
> load and its cpu_power is reduced to, say, 50; then we've still got 932
> over the whole group, which is still sufficient for one unit of capacity,
> but because the first sibling is low we'll assign it 0 and instead
> assign 1 to the second, again leaving the third and fourth 0.

Hi Peter,

Thanks for the suggestion.

> If the group were a core group, the total would be much higher and we'd
> likely end up assigning 1 to each before we'd run out of capacity.

This is a tricky case because we are depending upon
DIV_ROUND_CLOSEST to decide whether to flag capacity as 0 or 1. We
will not have any task movement until capacity is depleted to quite a
low value due to the RT task. Having a threshold to flag 0/1 instead of
DIV_ROUND_CLOSEST, as you have suggested for the power savings
case, may help here as well to move tasks to other idle cores.
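
To put numbers on the dependence on DIV_ROUND_CLOSEST (a worked example,
not from the original mail, using the kernel's DIV_ROUND_CLOSEST() and
SCHED_LOAD_SCALE): for a core-level group whose original power is 1177,
the rounded capacity stays at 1 until the remaining power drops below 512,
i.e. until RT has consumed more than half the core, whereas a 90%-style
threshold would react much earlier:

	/* power eaten down to 600 of an original 1177 */
	unsigned long cap_rounded   = DIV_ROUND_CLOSEST(600, SCHED_LOAD_SCALE); /* = 1, no movement yet  */
	unsigned long cap_threshold = (600 * 100 >= 1177 * 90) ? 1 : 0;         /* = 0, triggers movement */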

> For power savings, we can lower the threshold and maybe use the maximal
> individual cpu_power in the group to base 1 capacity from.
>
> So, suppose the second example, where sibling0 has 50 and the others
> have 294, you'd end up with a capacity distribution of: {0,1,1,1}.

One challenge here is that if RT tasks run on more than one thread in
this group, we will have slightly different cpu powers. Arranging
them from max to min and having a cutoff threshold should work.

Should we keep the RT scaling as a separate entity alongside
cpu_power to simplify these thresholds? Whenever we need to scale
group load with cpu power we can take the product of cpu_power and
scale_rt_power, but in these cases where we compute capacity we can
mark a 0 or 1 just based on whether scale_rt_power was less than
SCHED_LOAD_SCALE or not. Alternatively we can keep cpu_power as
a product of all scaling factors as it is today but also save the component
scale factors, like scale_rt_power() and arch_scale_freq_power(),
so that they can be used in load balance decisions.
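
A sketch of what that second alternative could look like (the struct and
field names here are invented for illustration; this is not what was
eventually merged):

	/*
	 * Keep cpu_power as the product of all factors, as today, but also
	 * remember the individual components so that load balance code can
	 * tell RT pressure apart from frequency scaling.
	 */
	struct sg_power_components {
		unsigned int cpu_power;		/* product of all scale factors        */
		unsigned int scale_rt;		/* last scale_rt_power(cpu) result     */
		unsigned int scale_freq;	/* last arch_scale_freq_power() result */
	};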

Basically, in power save balance we would give all threads a capacity of
'1' unless the cpu_power was reduced due to an RT task. Similarly, in
the non-power-save case, we can have flags 1,0,0,0 unless the first thread
had RT scaling during the last interval.

I am suggesting that we distinguish the reduction in cpu_power due to
architectural (hardware DVFS) reasons from that due to RT tasks, so that it
is easy to decide whether moving tasks to a sibling thread or core can help.

--Vaidy

From: Peter Zijlstra on
On Wed, 2010-06-02 at 04:22 +0530, Vaidyanathan Srinivasan wrote:

> > If the group were a core group, the total would be much higher and we'd
> > likely end up assigning 1 to each before we'd run out of capacity.
>
> This is a tricky case because we are depending upon
> DIV_ROUND_CLOSEST to decide whether to flag capacity as 0 or 1. We
> will not have any task movement until capacity is depleted to quite a
> low value due to the RT task. Having a threshold to flag 0/1 instead of
> DIV_ROUND_CLOSEST, as you have suggested for the power savings
> case, may help here as well to move tasks to other idle cores.

Right, well, we could put the threshold higher than 50%, say at 90% or
so.
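
In check form that would be something like the following (this is
essentially the rt_freq_influence() test that appears in the patch later
in the thread):

	/*
	 * Only keep the unit of capacity if at least ~90% of the original
	 * cpu_power is still available.
	 */
	if (group->cpu_power * 100 < group->cpu_power_orig * 90)
		return 1;	/* significantly reduced by rt/freq */

	return 0;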

> > For power savings, we can lower the threshold and maybe use the maximal
> > individual cpu_power in the group to base 1 capacity from.
> >
> > So, suppose the second example, where sibling0 has 50 and the others
> > have 294, you'd end up with a capacity distribution of: {0,1,1,1}.
>
> One challenge here is that if RT tasks run on more than one thread in
> this group, we will have slightly different cpu powers. Arranging
> them from max to min and having a cutoff threshold should work.

Right, like the 90% above.

> Should we keep the RT scaling as a separate entity alongside
> cpu_power to simplify these thresholds? Whenever we need to scale
> group load with cpu power we can take the product of cpu_power and
> scale_rt_power, but in these cases where we compute capacity we can
> mark a 0 or 1 just based on whether scale_rt_power was less than
> SCHED_LOAD_SCALE or not. Alternatively we can keep cpu_power as
> a product of all scaling factors as it is today but also save the component
> scale factors, like scale_rt_power() and arch_scale_freq_power(),
> so that they can be used in load balance decisions.

Right, so the question is: do we only care about RT, or should capacity
reflect the full asymmetric MP case?

I don't quite see why RT is special compared to any of the other scale
factors; if someone pegged one core at half the frequency of the others you'd
still want it to get 0 capacity so that we only try to populate it on
overload.

> Basically, in power save balance we would give all threads a capacity of
> '1' unless the cpu_power was reduced due to an RT task. Similarly, in
> the non-power-save case, we can have flags 1,0,0,0 unless the first thread
> had RT scaling during the last interval.
>
> I am suggesting that we distinguish the reduction in cpu_power due to
> architectural (hardware DVFS) reasons from that due to RT tasks, so that it
> is easy to decide whether moving tasks to a sibling thread or core can help.

For power savings such a special heuristic _might_ make sense.
From: Srivatsa Vaddagiri on
On Mon, May 31, 2010 at 10:33:16AM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 15:58 +0200, Peter Zijlstra wrote:
> >
> >
> > Hrmm, my brain seems muddled but I might have another solution, let me
> > ponder this for a bit..
> >
>
> Right, so the thing I was thinking about is taking the group capacity
> into account when determining the capacity for a single cpu.

Peter,
We are exploring an alternate solution which seems to be working as
expected. Basically, allow a capacity of 1 for SMT threads provided there is
no significant influence from RT tasks or freq scaling. Note that at the core
level, capacity is unchanged and hence this affects only how tasks are
distributed within a core.

Mike Neuling should post an updated patchset containing this patch
(with more comments added, of course!).


Signed-off-by: Srivatsa Vaddagiri <vatsa(a)linux.vnet.ibm.com>

---
include/linux/sched.h | 2 +-
kernel/sched_fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 24 insertions(+), 8 deletions(-)

Index: linux-2.6-ozlabs/include/linux/sched.h
===================================================================
--- linux-2.6-ozlabs.orig/include/linux/sched.h
+++ linux-2.6-ozlabs/include/linux/sched.h
@@ -860,7 +860,7 @@ struct sched_group {
* CPU power of this group, SCHED_LOAD_SCALE being max power for a
* single CPU.
*/
- unsigned int cpu_power;
+ unsigned int cpu_power, cpu_power_orig;

/*
* The CPUs this group covers.
Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -2285,13 +2285,6 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
unsigned long power = SCHED_LOAD_SCALE;
struct sched_group *sdg = sd->groups;

- if (sched_feat(ARCH_POWER))
- power *= arch_scale_freq_power(sd, cpu);
- else
- power *= default_scale_freq_power(sd, cpu);
-
- power >>= SCHED_LOAD_SHIFT;
-
if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
if (sched_feat(ARCH_POWER))
power *= arch_scale_smt_power(sd, cpu);
@@ -2301,6 +2294,15 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
power >>= SCHED_LOAD_SHIFT;
}

+ sdg->cpu_power_orig = power;
+
+ if (sched_feat(ARCH_POWER))
+ power *= arch_scale_freq_power(sd, cpu);
+ else
+ power *= default_scale_freq_power(sd, cpu);
+
+ power >>= SCHED_LOAD_SHIFT;
+
power *= scale_rt_power(cpu);
power >>= SCHED_LOAD_SHIFT;

@@ -2333,6 +2335,22 @@ static void update_group_power(struct sched_domain *sd, int cpu)
sdg->cpu_power = power;
}

+static inline int
+rt_freq_influence(struct sched_group *group, struct sched_domain *sd)
+{
+ if (sd->child)
+ return 1;
+
+ /*
+ * Check to see if the final cpu power was reduced by more
+ * than 10% by frequency or rt tasks
+ */
+ if (group->cpu_power * 100 < group->cpu_power_orig * 90)
+ return 1;
+
+ return 0;
+}
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @sd: The sched_domain whose statistics are to be updated.
@@ -2426,6 +2444,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,

sgs->group_capacity =
DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
+ if (!sgs->group_capacity && !rt_freq_influence(group, sd))
+ sgs->group_capacity = 1;
}

/**
@@ -2725,7 +2745,8 @@ ret:
*/
static struct rq *
find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
- unsigned long imbalance, const struct cpumask *cpus)
+ unsigned long imbalance, const struct cpumask *cpus,
+ struct sched_domain *sd)
{
struct rq *busiest = NULL, *rq;
unsigned long max_load = 0;
@@ -2736,6 +2757,9 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
unsigned long wl;

+ if (!capacity && !rt_freq_influence(group, sd))
+ capacity = 1;
+
if (!cpumask_test_cpu(i, cpus))
continue;

@@ -2852,7 +2876,7 @@ redo:
goto out_balanced;
}

- busiest = find_busiest_queue(group, idle, imbalance, cpus);
+ busiest = find_busiest_queue(group, idle, imbalance, cpus, sd);
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[idle]);
goto out_balanced;


From: Michael Neuling on
Peter,

It looks like when this was pushed to Ingo, some of the logic was
changed. rt_freq_influence() became fix_small_capacity(), but the return
values for these two functions are opposite (return 1 => return 0 and
vice versa).

This was changed in the sibling test (return 1 => return 0), but the
check for the change in cpu power due to freq and rt was not.
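
To spell the inversion out, here are the two call sites side by side (the
lines are taken from the patches in this thread; the comments are added
here):

	/* posted version: rt_freq_influence() == 1 means power *was* reduced,
	 * so capacity is only bumped when it returns 0 */
	if (!sgs->group_capacity && !rt_freq_influence(group, sd))
		sgs->group_capacity = 1;

	/* upstream version: fix_small_capacity() returns the capacity itself,
	 * so it must return 1 when power was *not* significantly reduced */
	if (!sgs->group_capacity)
		sgs->group_capacity = fix_small_capacity(sd, group);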

So either the return values need to be changed at the end of
fix_small_capacity() or the cpu_power test needs to be the other way
around. The patch below changes the cpu_power test, as that brings it more
in line with the comment above it.

Without this the asymmetric packing doesn't work.

Mikey
---
Subject: sched: fix the CPU power test for fix_small_capacity

The CPU power test is the wrong way around in fix_small_capacity.

This was due to a small change made when the patch posted on lkml was
taken upstream.

This patch fixes asymmetric packing for POWER7.

Signed-off-by: Michael Neuling <mikey(a)neuling.org>

Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -2354,7 +2354,7 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
/*
* If ~90% of the cpu_power is still there, we're good.
*/
- if (group->cpu_power * 32 < group->cpu_power_orig * 29)
+ if (group->cpu_power * 32 > group->cpu_power_orig * 29)
return 1;

return 0;


In message <tip-9d5efe05eb0c904545a28b19c18b949f23334de0(a)git.kernel.org> you wrote:
> Commit-ID: 9d5efe05eb0c904545a28b19c18b949f23334de0
> Gitweb: http://git.kernel.org/tip/9d5efe05eb0c904545a28b19c18b949f23334de0
> Author: Srivatsa Vaddagiri <vatsa(a)linux.vnet.ibm.com>
> AuthorDate: Tue, 8 Jun 2010 14:57:02 +1000
> Committer: Ingo Molnar <mingo(a)elte.hu>
> CommitDate: Wed, 9 Jun 2010 10:34:54 +0200
>
> sched: Fix capacity calculations for SMT4
>
> Handle cpu capacity being reported as 0 on cores with a larger number of
> hardware threads. For example on a Power7 core with 4 hardware
> threads, core power is 1177 and thus power of each hardware thread is
> 1177/4 = 294. This low power can lead to capacity for each hardware
> thread being calculated as 0, which leads to tasks bouncing within the
> core madly!
>
> Fix this by reporting capacity for hardware threads as 1, provided
> their power is not scaled down significantly because of frequency
> scaling or real-time tasks usage of cpu.
>
> Signed-off-by: Srivatsa Vaddagiri <vatsa(a)linux.vnet.ibm.com>
> Signed-off-by: Michael Neuling <mikey(a)neuling.org>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
> Cc: Arjan van de Ven <arjan(a)linux.intel.com>
> LKML-Reference: <20100608045702.21D03CC895(a)localhost.localdomain>
> Signed-off-by: Ingo Molnar <mingo(a)elte.hu>
> ---
> include/linux/sched.h | 2 +-
> kernel/sched_fair.c | 53 +++++++++++++++++++++++++++++++++++++++---------
> 2 files changed, 44 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a3e5b1c..c731296 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -857,7 +857,7 @@ struct sched_group {
> * CPU power of this group, SCHED_LOAD_SCALE being max power for a
> * single CPU.
> */
> - unsigned int cpu_power;
> + unsigned int cpu_power, cpu_power_orig;
>
> /*
> * The CPUs this group covers.
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 6ee2e0a..b9b3462 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -2285,13 +2285,6 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
> unsigned long power = SCHED_LOAD_SCALE;
> struct sched_group *sdg = sd->groups;
>
> - if (sched_feat(ARCH_POWER))
> - power *= arch_scale_freq_power(sd, cpu);
> - else
> - power *= default_scale_freq_power(sd, cpu);
> -
> - power >>= SCHED_LOAD_SHIFT;
> -
> if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
> if (sched_feat(ARCH_POWER))
> power *= arch_scale_smt_power(sd, cpu);
> @@ -2301,6 +2294,15 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
> power >>= SCHED_LOAD_SHIFT;
> }
>
> + sdg->cpu_power_orig = power;
> +
> + if (sched_feat(ARCH_POWER))
> + power *= arch_scale_freq_power(sd, cpu);
> + else
> + power *= default_scale_freq_power(sd, cpu);
> +
> + power >>= SCHED_LOAD_SHIFT;
> +
> power *= scale_rt_power(cpu);
> power >>= SCHED_LOAD_SHIFT;
>
> @@ -2333,6 +2335,31 @@ static void update_group_power(struct sched_domain *sd, int cpu)
> sdg->cpu_power = power;
> }
>
> +/*
> + * Try and fix up capacity for tiny siblings, this is needed when
> + * things like SD_ASYM_PACKING need f_b_g to select another sibling
> + * which on its own isn't powerful enough.
> + *
> + * See update_sd_pick_busiest() and check_asym_packing().
> + */
> +static inline int
> +fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
> +{
> + /*
> + * Only siblings can have significantly less than SCHED_LOAD_SCALE
> + */
> + if (sd->level != SD_LV_SIBLING)
> + return 0;
> +
> + /*
> + * If ~90% of the cpu_power is still there, we're good.
> + */
> + if (group->cpu_power * 32 < group->cpu_power_orig * 29)
> + return 1;
> +
> + return 0;
> +}
> +
> /**
> * update_sg_lb_stats - Update sched_group's statistics for load balancing.
> * @sd: The sched_domain whose statistics are to be updated.
> @@ -2426,6 +2453,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
>
> sgs->group_capacity =
> DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
> + if (!sgs->group_capacity)
> + sgs->group_capacity = fix_small_capacity(sd, group);
> }
>
> /**
> @@ -2724,8 +2753,9 @@ ret:
> * find_busiest_queue - find the busiest runqueue among the cpus in group.
> */
> static struct rq *
> -find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> - unsigned long imbalance, const struct cpumask *cpus)
> +find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
> + enum cpu_idle_type idle, unsigned long imbalance,
> + const struct cpumask *cpus)
> {
> struct rq *busiest = NULL, *rq;
> unsigned long max_load = 0;
> @@ -2736,6 +2766,9 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
> unsigned long wl;
>
> + if (!capacity)
> + capacity = fix_small_capacity(sd, group);
> +
> if (!cpumask_test_cpu(i, cpus))
> continue;
>
> @@ -2852,7 +2885,7 @@ redo:
> goto out_balanced;
> }
>
> - busiest = find_busiest_queue(group, idle, imbalance, cpus);
> + busiest = find_busiest_queue(sd, group, idle, imbalance, cpus);
> if (!busiest) {
> schedstat_inc(sd, lb_nobusyq[idle]);
> goto out_balanced;
>