From: Benjamin Herrenschmidt
On Tue, 2010-01-26 at 17:28 -0600, Joel Schopp wrote:
> On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads
> there is performance benefit to idling the higher numbered threads in
> the core.
>
> This patch implements arch_scale_smt_power to dynamically update smt
> thread power in these idle cases in order to prefer threads 0,1 over
> threads 2,3 within a core.
>
> v2 - Same functionality as v1, better coding style.

Better. Some more comments...

> Signed-off-by: Joel Schopp <jschopp(a)austin.ibm.com>
> ---
> Version 2 addresses style and optimization, same basic functionality
> Index: linux-2.6.git/arch/powerpc/kernel/smp.c
> ===================================================================
> --- linux-2.6.git.orig/arch/powerpc/kernel/smp.c
> +++ linux-2.6.git/arch/powerpc/kernel/smp.c
> @@ -620,3 +620,55 @@ void __cpu_die(unsigned int cpu)
> smp_ops->cpu_die(cpu);
> }
> #endif
> +
> +unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
> +{
> + int sibling;
> + int idle_count = 0;
> + int thread;
> +
> + struct cpumask *sibling_map = sched_domain_span(sd);

What about an early exit if !cpu_has_feature(CPU_FTR_SMT) ? That would
de-facto compile it out for 32-bit CPU platforms that don't support SMT
at all and avoid some overhead on POWER3,4,970...
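A minimal user-space sketch of the early exit being suggested (cpu_has_feature() and CPU_FTR_SMT are stubbed here, since the real ones live in the kernel's headers; this is illustrative, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Stub stand-ins for the kernel's CPU feature test. */
#define CPU_FTR_SMT (1UL << 0)
static unsigned long cur_cpu_features;	/* 0 == pretend non-SMT CPU */

static bool cpu_has_feature(unsigned long feature)
{
	return (cur_cpu_features & feature) != 0;
}

/* The early-exit shape: bail out with the unscaled gain before doing
 * any sibling-mask work on CPUs without SMT. */
static unsigned long scale_smt_power_sketch(unsigned long smt_gain,
					    unsigned long weight)
{
	if (!cpu_has_feature(CPU_FTR_SMT))
		return smt_gain;	/* fast path for non-SMT parts */
	return smt_gain / weight;
}
```

With the feature bit clear the function returns immediately; setting CPU_FTR_SMT in cur_cpu_features exercises the normal scaling path.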

> + unsigned long weight = cpumask_weight(sibling_map);
> + unsigned long smt_gain = sd->smt_gain;
> +
> + if (cpu_has_feature(CPU_FTR_ASYNC_SMT4) && weight == 4) {

So that will only handle the case where all 4 threads are online, right ?
There is no provision for the case where the user plays tricks like
offlining threads, in which case it will stop trying to "push down"
processes, right ? Not a big deal per se I suppose, just something to be
aware of.

Also, can you add a comment in the code itself, above the
if (cpu_has_feature(...)) statement, as to why this is done ?

> + for_each_cpu(sibling, sibling_map) {
> + if (idle_cpu(sibling))
> + idle_count++;
> + }
> +
> + /* the following section attempts to tweak cpu power based
> + * on current idleness of the threads dynamically at runtime
> + */
> + if (idle_count > 1) {
> + thread = cpu_thread_in_core(cpu);
> + if (thread < 2) {
> + /* add 75 % to thread power */
> + smt_gain += (smt_gain >> 1) + (smt_gain >> 2);
> + } else {
> + /* reduce thread power by 75 % */
> + smt_gain = smt_gain >> 2;
> + }
> + }
> + }
> +
> + /* default smt gain is 1178, weight is # of SMT threads */
> + switch (weight) {
> + case 1:
> + /* divide by 1, do nothing */
> + break;
> + case 2:
> + smt_gain = smt_gain >> 1;
> + break;
> + case 4:
> + smt_gain = smt_gain >> 2;
> + break;
> + default:
> + smt_gain /= weight;
> + break;
> + }
> +
> + return smt_gain;
> +}
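As a quick sanity check of the shift arithmetic in the hunk above (plain user-space C mirroring the patch's integer math, with the scheduler's default smt_gain of 1178):

```c
/* +75%: gain + gain/2 + gain/4, as in the thread < 2 branch */
unsigned long boost75(unsigned long gain)
{
	return gain + (gain >> 1) + (gain >> 2);
}

/* down to 25% of gain, as in the else branch */
unsigned long cut75(unsigned long gain)
{
	return gain >> 2;
}
```

With smt_gain = 1178, boost75() gives 2061 and cut75() gives 294; after the final divide by weight == 4, that is 515 vs 73 per thread.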

Apart from that, it looks all right to me.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Benjamin Herrenschmidt
On Thu, 2010-01-28 at 17:24 -0600, Joel Schopp wrote:
> On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads
> there is performance benefit to idling the higher numbered threads in
> the core.
>
> This patch implements arch_scale_smt_power to dynamically update smt
> thread power in these idle cases in order to prefer threads 0,1 over
> threads 2,3 within a core.

Almost there :-) Joel, Peter, can you help me figure something out tho ?

On machines that don't have SMT, I would like to avoid calling
arch_scale_smt_power() at all if possible (in addition to not compiling
it in if SMT is not enabled in .config).

Now, I must say I'm utterly confused by how the domains are setup and I
haven't quite managed to sort it out... it looks to me that
SD_SHARE_CPUPOWER is always going to be set on all CPUs when the config
option is set (though each CPU will have its own domain) or am I
misguided ? I.e. is there any sense in having at least a fast exit path
out of arch_scale_smt_power() for non-SMT CPUs ?

Joel, can you look at compiling it out when SMT is not set ? We don't
want to bloat SMP kernels for 32-bit non-SMT embedded platforms.

Oh, and one minor nit:

> Signed-off-by: Joel Schopp <jschopp(a)austin.ibm.com>
> ---
> Version 2 addresses style and optimization, same basic functionality
> Version 3 adds a comment, resent due to mailing format error
> Index: linux-2.6.git/arch/powerpc/kernel/smp.c
> ===================================================================
> --- linux-2.6.git.orig/arch/powerpc/kernel/smp.c
> +++ linux-2.6.git/arch/powerpc/kernel/smp.c
> @@ -620,3 +620,59 @@ void __cpu_die(unsigned int cpu)
> smp_ops->cpu_die(cpu);
> }
> #endif

^^^ Please add the /* CONFIG_CPU_HOTPLUG */ (or whatever it is) that's
missing after that #endif :-)

Cheers,
Ben.


From: Peter Zijlstra
On Fri, 2010-01-29 at 12:23 +1100, Benjamin Herrenschmidt wrote:
> On machines that don't have SMT, I would like to avoid calling
> arch_scale_smt_power() at all if possible (in addition to not compiling
> it in if SMT is not enabled in .config).
>
> Now, I must say I'm utterly confused by how the domains are setup and I
> haven't quite managed to sort it out... it looks to me that
> SD_SHARE_CPUPOWER is always going to be set on all CPUs when the config
> option is set (though each CPU will have its own domain) or am I
> misguided ? I.e. is there any sense in having at least a fast exit path
> out of arch_scale_smt_power() for non-SMT CPUs ?

The sched_domain creation code is a f'ing stink pile that hurts
everybody's brain.

The AMD Magny-Cours people sort of cleaned it up a bit but didn't go
nearly far enough. Doing so is somewhere on my todo list, but sadly that
thing is way larger than my spare time.

Now SD_SHARE_CPUPOWER _should_ only be set for SMT domains, because only
SMT siblings share cpupower.

SD_SHARE_PKG_RESOURCES _should_ be set for both SMT and MC, because they
all share the same cache domain.

Whether it all works out that way in practice on powerpc is another
question entirely ;-)

That said, I'm still not entirely convinced I like this usage of
cpu_power; it's supposed to be a normalization scale for load-balancing,
not a placement hook.

I'd be much happier with a SD_GROUP_ORDER or something like that, that
works together with SD_PREFER_SIBLING to pack active tasks to cpus in
ascending group order.
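A toy illustration of the packing policy being proposed (SD_GROUP_ORDER does not exist; the helper below is only a hypothetical model of "pack in ascending order": new work goes to the lowest-numbered CPU with room, so the high-numbered SMT threads stay idle as long as possible):

```c
/* One slot per cpu; nonzero == busy.  Packing in ascending order. */
int pick_cpu_packed(const int nr_running[], int ncpus)
{
	for (int i = 0; i < ncpus; i++)
		if (nr_running[i] == 0)
			return i;
	return 0;	/* everyone busy: fall back to cpu 0 */
}
```

Starting empty, three wakeups land on cpus 0, 1, 2 in turn, which is exactly the "prefer threads 0,1" behaviour the patch tries to get via cpu_power.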

From: Peter Zijlstra
On Fri, 2010-02-05 at 14:57 -0600, Joel Schopp wrote:
> On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads
> there is performance benefit to idling the higher numbered threads in
> the core.
>
> This patch implements arch_scale_smt_power to dynamically update smt
> thread power in these idle cases in order to prefer threads 0,1 over
> threads 2,3 within a core.
>
> Signed-off-by: Joel Schopp <jschopp(a)austin.ibm.com>
> ---

> Index: linux-2.6.git/arch/powerpc/kernel/smp.c
> ===================================================================
> --- linux-2.6.git.orig/arch/powerpc/kernel/smp.c
> +++ linux-2.6.git/arch/powerpc/kernel/smp.c
> @@ -620,3 +620,61 @@ void __cpu_die(unsigned int cpu)
> smp_ops->cpu_die(cpu);
> }
> #endif
> +
> +#ifdef CONFIG_SCHED_SMT
> +unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
> +{
> + int sibling;
> + int idle_count = 0;
> + int thread;
> +
> + /* Setup the default weight and smt_gain used by most cpus for SMT
> + * Power. Doing this right away covers the default case and can be
> + * used by cpus that modify it dynamically.
> + */
> + struct cpumask *sibling_map = sched_domain_span(sd);
> + unsigned long weight = cpumask_weight(sibling_map);
> + unsigned long smt_gain = sd->smt_gain;
> +
> +
> + if (cpu_has_feature(CPU_FTR_ASYNC_SMT4) && weight == 4) {
> + for_each_cpu(sibling, sibling_map) {
> + if (idle_cpu(sibling))
> + idle_count++;
> + }
> +
> + /* the following section attempts to tweak cpu power based
> + * on current idleness of the threads dynamically at runtime
> + */
> + if (idle_count > 1) {
> + thread = cpu_thread_in_core(cpu);
> + if (thread < 2) {
> + /* add 75 % to thread power */
> + smt_gain += (smt_gain >> 1) + (smt_gain >> 2);
> + } else {
> + /* reduce thread power by 75 % */
> + smt_gain = smt_gain >> 2;
> + }
> + }
> + }
> +
> + /* default smt gain is 1178, weight is # of SMT threads */
> + switch (weight) {
> + case 1:
> + /* divide by 1, do nothing */
> + break;
> + case 2:
> + smt_gain = smt_gain >> 1;
> + break;
> + case 4:
> + smt_gain = smt_gain >> 2;
> + break;
> + default:
> + smt_gain /= weight;
> + break;
> + }
> +
> + return smt_gain;
> +}
> +#endif

Suppose for a moment we have 2 threads (hot-unplugged threads 1 and 3; we
can construct an equivalent but more complex example for 4 threads), and
we have 4 tasks: 3 SCHED_OTHER of equal nice level and 1 SCHED_FIFO. The
SCHED_FIFO task will consume exactly 50% walltime of whatever cpu it
ends up on.

In that situation, provided that each cpu's cpu_power is of equal
measure, scale_rt_power() ensures that we run 2 SCHED_OTHER tasks on the
cpu that doesn't run the RT task, and 1 SCHED_OTHER task next to the RT
task, so that each task consumes 50%, which is all fair and proper.

However, if you do the above, thread 0 will have +75% = 1.75 and thread
2 will have -75% = 0.25. Then if the RT task lands on thread 0,
we'll be having: 0.875 vs 0.25, or on thread 2, 1.75 vs 0.125. In either
case thread 0 will receive too many (if not all) SCHED_OTHER tasks.
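Peter's numbers, spelled out (a simplification of what scale_rt_power() does: assume the capacity left for SCHED_OTHER is just cpu_power times the non-RT fraction of walltime):

```c
/* Capacity left for SCHED_OTHER tasks on a cpu whose RT load eats
 * rt_fraction of its walltime. */
double other_capacity(double cpu_power, double rt_fraction)
{
	return cpu_power * (1.0 - rt_fraction);
}
```

With the +/-75% tweak the two remaining threads have cpu_power 1.75 and 0.25; a FIFO task eating 50% on thread 0 leaves 0.875 vs 0.25, and on thread 2 leaves 1.75 vs 0.125, so the balancer keeps piling SCHED_OTHER tasks onto thread 0 either way.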

That is, unless these threads 2 and 3 really are _that_ weak, at which
point one wonders why IBM bothered with the silicon ;-)

So tell me again, why is fiddling with the cpu_power a good placement
tool?


From: Peter Zijlstra
On Thu, 2010-02-18 at 14:17 +0100, Peter Zijlstra wrote:
>
> There's one fundamental assumption, and one weakness in the
> implementation.
>
Aside from bugs and the like... ;-)
