accelerate grace period if last non-dynticked CPU [Kernel]

Prev: [34/98] [SCSI] iscsi class: modify handling of replacement timeout
Next: [73/98] md: fix small irregularity with start_ro module parameter

From: Mathieu Desnoyers on 26 Jan 2010 19:00

* Andi Kleen (andi(a)firstfloor.org) wrote:
> "Paul E. McKenney" <paulmck(a)linux.vnet.ibm.com> writes:
>
> Kind of offtopic to the original patch, but I couldn't
> resist...
>
> > +config RCU_FAST_NO_HZ
> > + bool "Accelerate last non-dyntick-idle CPU's grace periods"
> > + depends on TREE_RCU && NO_HZ && SMP
>
> Having such a thing as a config option doesn't really make
> any sense to me. Who would want to recompile their kernel
> to enable/disable this? If anything it should be runtime, or better
> just unconditionally on.
>
> In general RCU could probably reduce its number of "weird"
> Kconfig options.
>
> While I think I have a better understanding of RCU than a lot
> of normal users I often have no clue what to set there when
> building a kernel.

Maybe we could keep them under a CONFIG_DEBUG_RCU umbrella. Compiling
out parts of the rcu options can be useful for debugging purposes, but
I agree with you that end users should not be bothered with that much
options when some of them are "obviously" wanted.

OTOH, I understand that Paul seems to want to introduce new RCU
features gradually, without hitting all kernel users with bugs in newer
features. That's a sane approach to keep things generally stable, but
maybe it is time to set some of the stabilized RCU options to default Y
and move their config to the debug menu.

Let's see what Paul has to say about this...

Thanks,

Mathieu

>
> -Andi
>
> --
> ak(a)linux.intel.com -- Speaking for myself only.

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Paul E. McKenney on 27 Jan 2010 00:20

On Mon, Jan 25, 2010 at 08:28:34PM +0800, Lai Jiangshan wrote:
> Paul E. McKenney wrote:
> > [Experimental RFC, not for inclusion.]
> >
> > I recently received a complaint that RCU was refusing to let a system
> > go into low-power state immediately, instead waiting a few ticks after
> > the system had gone idle before letting go of the last CPU. Of course,
> > the reason for this was that there were a couple of RCU callbacks on
> > the last CPU.
> >
> > Currently, rcu_needs_cpu() simply checks whether the current CPU has
> > an outstanding RCU callback, which means that the last CPU to go into
> > dyntick-idle mode might wait a few ticks for the relevant grace periods
> > to complete. However, if all the other CPUs are in dyntick-idle mode,
> > and if this CPU is in a quiescent state (which it is for RCU-bh and
> > RCU-sched any time that we are considering going into dyntick-idle mode),
> > then the grace period is instantly complete.
> >
> > This patch therefore repeatedly invokes the RCU grace-period machinery
> > in order to force any needed grace periods to complete quickly. It does
> > so a limited number of times in order to prevent starvation by an RCU
> > callback function that might pass itself to call_rcu().
> >
> > Thoughts?
> >
> > Signed-off-by: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
> >
> > diff --git a/init/Kconfig b/init/Kconfig
> > index d95ca7c..42bf914 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -396,6 +396,22 @@ config RCU_FANOUT_EXACT
> >
> > Say N if unsure.
> >
> > +config RCU_FAST_NO_HZ
> > + bool "Accelerate last non-dyntick-idle CPU's grace periods"
> > + depends on TREE_RCU && NO_HZ && SMP
> > + default n
> > + help
> > + This option causes RCU to attempt to accelerate grace periods
> > + in order to allow the final CPU to enter dynticks-idle state
> > + more quickly. On the other hand, this option increases the
> > + overhead of the dynticks-idle checking, particularly on systems
> > + with large numbers of CPUs.
> > +
> > + Say Y if energy efficiency is critically important, particularly
> > + if you have relatively few CPUs.
> > +
> > + Say N if you are unsure.
> > +
> > config TREE_RCU_TRACE
> > def_bool RCU_TRACE && ( TREE_RCU || TREE_PREEMPT_RCU )
> > select DEBUG_FS
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index 099a255..29d88c0 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -1550,10 +1550,9 @@ static int rcu_pending(int cpu)
> > /*
> > * Check to see if any future RCU-related work will need to be done
> > * by the current CPU, even if none need be done immediately, returning
> > - * 1 if so. This function is part of the RCU implementation; it is -not-
> > - * an exported member of the RCU API.
> > + * 1 if so.
> > */
> > -int rcu_needs_cpu(int cpu)
> > +static int rcu_needs_cpu_quick_check(int cpu)
> > {
> > /* RCU callbacks either ready or pending? */
> > return per_cpu(rcu_sched_data, cpu).nxtlist ||
> > diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> > index e77cdf3..d6170a9 100644
> > --- a/kernel/rcutree_plugin.h
> > +++ b/kernel/rcutree_plugin.h
> > @@ -906,3 +906,72 @@ static void __init __rcu_init_preempt(void)
> > }
> >
> > #endif /* #else #ifdef CONFIG_TREE_PREEMPT_RCU */
> > +
> > +#if defined(CONFIG_TREE_PREEMPT_RCU) || !defined(CONFIG_RCU_FAST_NO_HZ)
> > +
> > +/*
> > + * Check to see if any future RCU-related work will need to be done
> > + * by the current CPU, even if none need be done immediately, returning
> > + * 1 if so. This function is part of the RCU implementation; it is -not-
> > + * an exported member of the RCU API.
> > + *
> > + * Because we have preemptible RCU, just check whether this CPU needs
> > + * any flavor of RCU. Do not chew up lots of CPU cycles with preemption
> > + * disabled in a most-likely vain attempt to cause RCU not to need this CPU.
> > + */
> > +int rcu_needs_cpu(int cpu)
> > +{
> > + return rcu_needs_cpu_quick_check(cpu);
> > +}
> > +
> > +#else
> > +
> > +#define RCU_NEEDS_CPU_FLUSHES 5
> > +
> > +/*
> > + * Check to see if any future RCU-related work will need to be done
> > + * by the current CPU, even if none need be done immediately, returning
> > + * 1 if so. This function is part of the RCU implementation; it is -not-
> > + * an exported member of the RCU API.
> > + *
> > + * Because we are not supporting preemptible RCU, attempt to accelerate
> > + * any current grace periods so that RCU no longer needs this CPU, but
> > + * only if all other CPUs are already in dynticks-idle mode. This will
> > + * allow the CPU cores to be powered down immediately, as opposed to after
> > + * waiting many milliseconds for grace periods to elapse.
> > + */
> > +int rcu_needs_cpu(int cpu)
> > +{
> > + int c = 1;
> > + int i;
> > + int thatcpu;
> > +
> > + /* Don't bother unless we are the last non-dyntick-idle CPU. */
> > + for_each_cpu(thatcpu, nohz_cpu_mask)
> > + if (thatcpu != cpu)
> > + return rcu_needs_cpu_quick_check(cpu);
>
> The comment and the code are not the same, I think.

Indeed, for this code to be correct, I would need to sequence through
the non-dyntick-idle CPUs, not the dyntick-idle ones.

Good catch!

I will likely come back with something similar to Steve Rostedt's
suggestion. Probably better to sequence through all the CPUs rather
than to allocate a cpumask and invert it. Or a 'for_each_cpu_not()'
or some such. ;-)

There does appear to be a cpumask_next_zero() that I should be able to
use.

Thanx, Paul

> -----------
> I found this thing, Although I think it is a ugly thing.
> Is it help?
>
> See select_nohz_load_balancer().
>
> /*
> * This routine will try to nominate the ilb (idle load balancing)
> * owner among the cpus whose ticks are stopped. ilb owner will do the idle
> * load balancing on behalf of all those cpus. If all the cpus in the system
> * go into this tickless mode, then there will be no ilb owner (as there is
> * no need for one) and all the cpus will sleep till the next wakeup event
> * arrives...
> *
> * For the ilb owner, tick is not stopped. And this tick will be used
> * for idle load balancing. ilb owner will still be part of
> * nohz.cpu_mask..
> *
> * While stopping the tick, this cpu will become the ilb owner if there
> * is no other owner. And will be the owner till that cpu becomes busy
> * or if all cpus in the system stop their ticks at which point
> * there is no need for ilb owner.
> *
> * When the ilb owner becomes busy, it nominates another owner, during the
> * next busy scheduler_tick()
> */
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Paul E. McKenney on 27 Jan 2010 00:30

On Tue, Jan 26, 2010 at 06:55:16PM -0500, Mathieu Desnoyers wrote:
> * Andi Kleen (andi(a)firstfloor.org) wrote:
> > "Paul E. McKenney" <paulmck(a)linux.vnet.ibm.com> writes:
> >
> > Kind of offtopic to the original patch, but I couldn't
> > resist...
> >
> > > +config RCU_FAST_NO_HZ
> > > + bool "Accelerate last non-dyntick-idle CPU's grace periods"
> > > + depends on TREE_RCU && NO_HZ && SMP
> >
> > Having such a thing as a config option doesn't really make
> > any sense to me. Who would want to recompile their kernel
> > to enable/disable this? If anything it should be runtime, or better
> > just unconditionally on.
> >
> > In general RCU could probably reduce its number of "weird"
> > Kconfig options.
> >
> > While I think I have a better understanding of RCU than a lot
> > of normal users I often have no clue what to set there when
> > building a kernel.
>
> Maybe we could keep them under a CONFIG_DEBUG_RCU umbrella. Compiling
> out parts of the rcu options can be useful for debugging purposes, but
> I agree with you that end users should not be bothered with that much
> options when some of them are "obviously" wanted.
>
> OTOH, I understand that Paul seems to want to introduce new RCU
> features gradually, without hitting all kernel users with bugs in newer
> features. That's a sane approach to keep things generally stable, but
> maybe it is time to set some of the stabilized RCU options to default Y
> and move their config to the debug menu.

That is indeed a big part of the motivation. ;-)

Thanx, Paul

> Let's see what Paul has to say about this...
>
> Thanks,
>
> Mathieu
>
> >
> > -Andi
> >
> > --
> > ak(a)linux.intel.com -- Speaking for myself only.
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Andi Kleen on 27 Jan 2010 04:50

On Tue, Jan 26, 2010 at 09:20:50PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 26, 2010 at 10:30:57PM +0100, Andi Kleen wrote:
> > "Paul E. McKenney" <paulmck(a)linux.vnet.ibm.com> writes:
> >
> > Kind of offtopic to the original patch, but I couldn't
> > resist...
> >
> > > +config RCU_FAST_NO_HZ
> > > + bool "Accelerate last non-dyntick-idle CPU's grace periods"
> > > + depends on TREE_RCU && NO_HZ && SMP
> >
> > Having such a thing as a config option doesn't really make
> > any sense to me. Who would want to recompile their kernel
> > to enable/disable this? If anything it should be runtime, or better
> > just unconditionally on.
>
> It adds significant overhead on entry to dyntick-idle mode for systems
> with large numbers of CPUs. :-(

Can't you simply check that at runtime then?

if (num_possible_cpus() > 20)
...

BTW the new small is large. This years high end desktop PC will come with
upto 12 CPU threads. It would likely be challenging to find a good
number for 20 that holds up with the future.

Or better perhaps have some threshold that you don't do it
that often, or only do it when you expect to be idle for a long
enough time that the CPU can enter deeper idle states

(I higher idle states some more wakeups typically don't matter
that much)

The cpufreq/cstate governour have a reasonable good idea
now how "idle" the system is and will be. Maybe you can reuse
that information somehow.

-Andi

--
ak(a)linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Andi Kleen on 27 Jan 2010 05:10

On Wed, Jan 27, 2010 at 10:50:50AM +0100, Peter Zijlstra wrote:
> On Wed, 2010-01-27 at 10:43 +0100, Andi Kleen wrote:
> >
> > Can't you simply check that at runtime then?
> >
> > if (num_possible_cpus() > 20)
> > ...
> >
> > BTW the new small is large. This years high end desktop PC will come with
> > upto 12 CPU threads. It would likely be challenging to find a good
> > number for 20 that holds up with the future.
>
> If only scalability were that easy :/

See the rest of my email...

I see you're already suffering from "Linus disease" ... only reading
the first paragraph/sentence in the email.

-Andi

--
ak(a)linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

| Next | Last
Pages: 1 2 3
Prev: [34/98] [SCSI] iscsi class: modify handling of replacement timeout
Next: [73/98] md: fix small irregularity with start_ro module parameter