From: Peter Zijlstra on
Sorry for the delay, I got side-tracked :/

On Fri, 2010-07-02 at 09:56 -0700, Venkatesh Pallipadi wrote:
> My fault. I completely missed that code path, assuming this_cpu in
> load_balance means this_cpu :(

Right, but local_group is still valid because that's
cpumask_test_cpu(this_cpu, sched_group_cpus(group));

So in that regard your initial patch was still correct. However, since
there can be multiple CPUs in the local group, we want to only have one
do the actual balancing, which is what the !*balance logic is about, so
by keeping the call site where it is gains that.

The below proposal avoids calling update_groups_power() too often for:
- !local_groups,
- multiple cpus in the same group.

Does this look correct?

Related to this, Mikey was looking at avoiding the for_each_cpu_and()
loops by converting update_sg_lb_stats into something similar to
update_group_power() and have 1 cpu in the group update the statistics
and propagate upwards by summing groups instead of individual cpus.

---
kernel/sched_fair.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 9910e1b..62f34a0 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2433,7 +2433,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
return;
}

- update_group_power(sd, this_cpu);
+ if (local_group)
+ update_group_power(sd, this_cpu);

/* Adjust by relative CPU power of the group */
sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra on
On Thu, 2010-07-08 at 10:45 -0700, Suresh Siddha wrote:
>
> @@ -2456,6 +2454,9 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
> init_sd_power_savings_stats(sd, sds, idle);
> load_idx = get_sd_load_idx(sd, idle);
>
> + if (this_cpu == smp_processor_id())
> + update_group_power(sd, this_cpu);
> +
> do {
> int local_group;

Which will break for nohz_idle_balance..
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Suresh Siddha on
On Thu, 2010-07-08 at 07:12 -0700, Peter Zijlstra wrote:
> Sorry for the delay, I got side-tracked :/
>
> On Fri, 2010-07-02 at 09:56 -0700, Venkatesh Pallipadi wrote:
> > My fault. I completely missed that code path, assuming this_cpu in
> > load_balance means this_cpu :(
>
> Right, but local_group is still valid because that's
> cpumask_test_cpu(this_cpu, sched_group_cpus(group));
>
> So in that regard your initial patch was still correct. However, since
> there can be multiple CPUs in the local group, we want to only have one
> do the actual balancing, which is what the !*balance logic is about, so
> by keeping the call site where it is gains that.
>
> The below proposal avoids calling update_groups_power() too often for:
> - !local_groups,
> - multiple cpus in the same group.
>
> Does this look correct?
>
> Related to this, Mikey was looking at avoiding the for_each_cpu_and()
> loops by converting update_sg_lb_stats into something similar to
> update_group_power() and have 1 cpu in the group update the statistics
> and propagate upwards by summing groups instead of individual cpus.
>
> ---
> kernel/sched_fair.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 9910e1b..62f34a0 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -2433,7 +2433,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
> return;
> }
>
> - update_group_power(sd, this_cpu);
> + if (local_group)
> + update_group_power(sd, this_cpu);

if IDLE == CPU_NEWLY_IDLE, then all the cpu's in the local group will do
this. Also update_group_power() can be done on only on the local cpu,
i.e., when this_cpu == smp_processor_id() right?

So it should be something like?

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a878b53..2d8a74e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2406,8 +2406,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
return;
}

- update_group_power(sd, this_cpu);
-
/* Adjust by relative CPU power of the group */
sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;

@@ -2456,6 +2454,9 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
init_sd_power_savings_stats(sd, sds, idle);
load_idx = get_sd_load_idx(sd, idle);

+ if (this_cpu == smp_processor_id())
+ update_group_power(sd, this_cpu);
+
do {
int local_group;



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Suresh Siddha on
On Thu, 2010-07-08 at 10:49 -0700, Peter Zijlstra wrote:
> On Thu, 2010-07-08 at 10:45 -0700, Suresh Siddha wrote:
> >
> > @@ -2456,6 +2454,9 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
> > init_sd_power_savings_stats(sd, sds, idle);
> > load_idx = get_sd_load_idx(sd, idle);
> >
> > + if (this_cpu == smp_processor_id())
> > + update_group_power(sd, this_cpu);
> > +
> > do {
> > int local_group;
>
> Which will break for nohz_idle_balance..

Then the logic is broken somewhere because update_group_power() reads
APERF/MPERF MSR's which doesn't make sense when this_cpu !=
smp_processor_id(). What am I missing?

thanks,
suresh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra on
On Thu, 2010-07-08 at 10:50 -0700, Suresh Siddha wrote:
> On Thu, 2010-07-08 at 10:49 -0700, Peter Zijlstra wrote:
> > On Thu, 2010-07-08 at 10:45 -0700, Suresh Siddha wrote:
> > >
> > > @@ -2456,6 +2454,9 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
> > > init_sd_power_savings_stats(sd, sds, idle);
> > > load_idx = get_sd_load_idx(sd, idle);
> > >
> > > + if (this_cpu == smp_processor_id())
> > > + update_group_power(sd, this_cpu);
> > > +
> > > do {
> > > int local_group;
> >
> > Which will break for nohz_idle_balance..
>
> Then the logic is broken somewhere because update_group_power() reads
> APERF/MPERF MSR's which doesn't make sense when this_cpu !=
> smp_processor_id(). What am I missing?

The APERF/MPERF code is utterly broken.. (and currently disabled by
default) but yeah, that's one aspect of its brokenness I only realized
after your email.

The problem is that it measures current throughput, not current
capacity. So for an idle thread/core it would return 0, instead of the
potential.

I've been meaning to revisit this.. maybe I should simply rip that out
until I get it working.

I was thinking of measuring temporal maxima to approximate capacity
instead of throughput.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/