From: Paul E. McKenney on
On Wed, Jan 27, 2010 at 10:43:36AM +0100, Andi Kleen wrote:
> On Tue, Jan 26, 2010 at 09:20:50PM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 26, 2010 at 10:30:57PM +0100, Andi Kleen wrote:
> > > "Paul E. McKenney" <paulmck(a)linux.vnet.ibm.com> writes:
> > >
> > > Kind of off-topic for the original patch, but I couldn't
> > > resist...
> > >
> > > > +config RCU_FAST_NO_HZ
> > > > + bool "Accelerate last non-dyntick-idle CPU's grace periods"
> > > > + depends on TREE_RCU && NO_HZ && SMP
> > >
> > > Having such a thing as a config option doesn't really make
> > > any sense to me. Who would want to recompile their kernel
> > > to enable/disable this? If anything it should be runtime, or better
> > > just unconditionally on.
> >
> > It adds significant overhead on entry to dyntick-idle mode for systems
> > with large numbers of CPUs. :-(
>
> Can't you simply check that at runtime then?
>
> if (num_possible_cpus() > 20)
> ...
>
> BTW the new small is large. This year's high-end desktop PCs will come
> with up to 12 CPU threads. It would likely be challenging to find a good
> number such as 20 that holds up in the future.

And this was another line of reasoning that led me to the extra kernel
config parameter.
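
(For concreteness, the runtime check being suggested amounts to something
like the sketch below; the function name is invented, and the cutoff of 20
is exactly the arbitrary number I don't know how to choose.)

static inline bool rcu_fast_no_hz_worthwhile(void)
{
        /*
         * Sketch only: do the extra dyntick-idle work only on small
         * systems.  The cutoff of 20 is the arbitrary part.
         */
        return num_possible_cpus() <= 20;
}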

> Or better, perhaps have some threshold so that you don't do it
> that often, or only do it when you expect to be idle for a long
> enough time that the CPU can enter deeper idle states.
>
> (In higher idle states a few more wakeups typically don't matter
> that much.)
>
> The cpufreq/cstate governors have a reasonably good idea
> now of how "idle" the system is and will be. Maybe you can reuse
> that information somehow.

My first thought was to find an existing "I am a small device running on
battery power" or "low power consumption is critical to me" config
parameter. I didn't find anything that looked like that. If there was
one, I would make RCU_FAST_NO_HZ depend on it.

Or did I miss some kernel parameter or API?

Thanx, Paul
From: Paul E. McKenney on
On Wed, Jan 27, 2010 at 10:50:50AM +0100, Peter Zijlstra wrote:
> On Wed, 2010-01-27 at 10:43 +0100, Andi Kleen wrote:
> >
> > Can't you simply check that at runtime then?
> >
> > if (num_possible_cpus() > 20)
> > ...
> >
> > BTW the new small is large. This year's high-end desktop PCs will come
> > with up to 12 CPU threads. It would likely be challenging to find a good
> > number such as 20 that holds up in the future.
>
> If only scalability were that easy :/
>
> These massive core/thread count things are causing more problems as
> well: the cpus/node ratios are constantly growing, giving grief in the
> page allocator as well as other places that used to scale per node.
>
> As to the current problem, the call_rcu() interface doesn't make a hard
> promise that the callback will be done on the same cpu, right? So why
> not simply move the callback list over to a more active cpu?

I could indeed do that. However, there is nothing stopping the
more-active CPU from going into dynticks-idle mode between the time
that I decide to push the callback to it and the time I actually do
the pushing. :-(

I considered pushing the callbacks to the orphanage, but that is protected
by a global lock that I would rather not acquire on each dyntick-idle
transition.
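
(To make the race concrete, here is a rough sketch of the "push the
callbacks elsewhere" idea; both helper functions are invented for
illustration and do not match the actual rcu_data layout.)

static void rcu_push_callbacks_to_active_cpu(int mycpu)
{
        int target = find_some_non_dyntick_idle_cpu();  /* invented helper */

        if (target < 0)
                return;  /* every other CPU is already idle */

        /*
         * Window: nothing prevents "target" from entering dyntick-idle
         * mode right here, before the callbacks reach its list.
         */
        move_callback_list(mycpu, target);  /* invented helper */
}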

This conversation is having the effect of making me much more comfortable
adding a kernel configuration parameter. Might not have been the intent,
but there you have it! ;-)

Thanx, Paul
From: Andi Kleen on
> > Can't you simply check that at runtime then?
> >
> > if (num_possible_cpus() > 20)
> > ...
> >
> > BTW the new small is large. This year's high-end desktop PCs will come
> > with up to 12 CPU threads. It would likely be challenging to find a good
> > number such as 20 that holds up in the future.
>
> And this was another line of reasoning that led me to the extra kernel
> config parameter.

Which doesn't solve the problem at all.

> > Or better, perhaps have some threshold so that you don't do it
> > that often, or only do it when you expect to be idle for a long
> > enough time that the CPU can enter deeper idle states.
> >
> > (In higher idle states a few more wakeups typically don't matter
> > that much.)
> >
> > The cpufreq/cstate governors have a reasonably good idea
> > now of how "idle" the system is and will be. Maybe you can reuse
> > that information somehow.
>
> My first thought was to find an existing "I am a small device running on
> battery power" or "low power consumption is critical to me" config
> parameter. I didn't find anything that looked like that. If there was
> one, I would make RCU_FAST_NO_HZ depend on it.
>
> Or did I miss some kernel parameter or API?

There are a few for scalability (e.g. numa_distance()), but they're
obscure. The really good heuristics are simply known by the code that
needs them.

But I think in this case scalability is not the key thing to check
for, but rather the expected idle latency. Even on a large system, if
nearly all CPUs are idle, spending some time to keep them idle even longer
is a good thing, but only if the CPUs actually benefit from long idle.

There's the "pm_qos_latency" framework that could be used for this
in theory, but it's not 100% the right match because it's not
dynamic.

Unfortunately, last time I looked the interfaces were rather clumsy
(e.g. they don't allow interrupt-level notifiers).

Better would be some insight into the expected future latency:
look at exporting this information from the various frequency/idle
governors.

Perhaps pm_qos_latency could be extended to support that?
CC Arjan, maybe he has some ideas on that.
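
(Something along these lines is what I have in mind; just a sketch
assuming the existing pm_qos_requirement()/PM_QOS_CPU_DMA_LATENCY
interface, with an invented threshold.)

#include <linux/pm_qos_params.h>

/*
 * Sketch: do the extra idle-side work only when the system-wide
 * latency requirement is loose enough that deep idle states are
 * reachable anyway.  The 1000us cutoff is invented.
 */
static bool deep_idle_worth_chasing(void)
{
        return pm_qos_requirement(PM_QOS_CPU_DMA_LATENCY) >= 1000;
}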

After all of that there would still, of course, be the question of
what the right latency threshold should be, but at least that's
a much easier question than the number of CPUs.

-Andi
--
ak(a)linux.intel.com -- Speaking for myself only.
From: Paul E. McKenney on
On Wed, Jan 27, 2010 at 11:13:42AM +0100, Andi Kleen wrote:
> > > Can't you simply check that at runtime then?
> > >
> > > if (num_possible_cpus() > 20)
> > > ...
> > >
> > > BTW the new small is large. This year's high-end desktop PCs will come
> > > with up to 12 CPU threads. It would likely be challenging to find a good
> > > number such as 20 that holds up in the future.
> >
> > And this was another line of reasoning that led me to the extra kernel
> > config parameter.
>
> Which doesn't solve the problem at all.

Depending on what you consider the problem to be, of course.

From what I can see, most people would want RCU_FAST_NO_HZ=n. Only
people with extreme power-consumption concerns would likely care enough
to select this.

> > > Or better, perhaps have some threshold so that you don't do it
> > > that often, or only do it when you expect to be idle for a long
> > > enough time that the CPU can enter deeper idle states.
> > >
> > > (In higher idle states a few more wakeups typically don't matter
> > > that much.)
> > >
> > > The cpufreq/cstate governors have a reasonably good idea
> > > now of how "idle" the system is and will be. Maybe you can reuse
> > > that information somehow.
> >
> > My first thought was to find an existing "I am a small device running on
> > battery power" or "low power consumption is critical to me" config
> > parameter. I didn't find anything that looked like that. If there was
> > one, I would make RCU_FAST_NO_HZ depend on it.
> >
> > Or did I miss some kernel parameter or API?
>
> There are a few for scalability (e.g. numa_distance()), but they're
> obscure. The really good heuristics are simply known by the code that
> needs them.
>
> But I think in this case scalability is not the key thing to check
> for, but rather the expected idle latency. Even on a large system, if
> nearly all CPUs are idle, spending some time to keep them idle even longer
> is a good thing, but only if the CPUs actually benefit from long idle.

The larger the number of CPUs, the lower the probability of all of them
going idle, so the less difference this patch makes. Perhaps some
larger system will care about this on a per-socket basis, but I have yet
to hear any requests.

> There's the "pm_qos_latency" framework that could be used for this
> in theory, but it's not 100% the right match because it's not
> dynamic.
>
> Unfortunately, last time I looked the interfaces were rather clumsy
> (e.g. they don't allow interrupt-level notifiers).

I do need to query from interrupt context, but could potentially have a
notifier set up state for me. Still, the real question is "how important
is a small reduction in power consumption?"
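
(A sketch of what "have a notifier set up state for me" could look like,
assuming the existing pm_qos_add_notifier() interface; the cached flag,
the function names, and the threshold are invented for illustration.)

#include <linux/notifier.h>
#include <linux/pm_qos_params.h>

/* Cached by the notifier so that interrupt context can read it cheaply. */
static atomic_t latency_tolerant = ATOMIC_INIT(0);

static int rcu_pm_qos_notify(struct notifier_block *nb,
                             unsigned long new_latency_us, void *unused)
{
        /* The 1000us threshold is invented, per the discussion above. */
        atomic_set(&latency_tolerant, new_latency_us >= 1000);
        return NOTIFY_OK;
}

static struct notifier_block rcu_pm_qos_nb = {
        .notifier_call = rcu_pm_qos_notify,
};

/* Registered once, e.g.:
 *      pm_qos_add_notifier(PM_QOS_CPU_DMA_LATENCY, &rcu_pm_qos_nb);
 */

/* Cheap enough for the interrupt-disabled dyntick-idle path. */
static bool rcu_latency_tolerant(void)
{
        return atomic_read(&latency_tolerant) != 0;
}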

> Better would be some insight into the expected future latency:
> look at exporting this information from the various frequency/idle
> governors.
>
> Perhaps pm_qos_latency could be extended to support that?
> CC Arjan, maybe he has some ideas on that.
>
> After all of that there would still, of course, be the question of
> what the right latency threshold should be, but at least that's
> a much easier question than the number of CPUs.

Hmmm... I still believe that very few people want RCU_FAST_NO_HZ,
and that those who want it can select it for their devices.

Trying to apply this to server-class machines gets into questions like
"where are the core/socket boundaries", "can this hardware turn entire
cores/sockets off", "given the current workload, does it really make sense
to try to turn off entire cores/sockets", and "is a few ticks important
when migrating processes, irqs, timers, and whatever else is required to
actually turn off a given core or socket for a significant time period".

I took a quick look at the pm_qos_latency framework and, as you note, it
doesn't really seem to be designed to handle this situation.

And we really should not be gold-plating this thing. I have one requester
(off list) who needs it badly, and who is willing to deal with a kernel
configuration parameter. I have no other requesters, and therefore
cannot reasonably anticipate their needs. As a result, we cannot justify
building any kind of infrastructure beyond what is reasonable for the
single requester.

Maybe the situation will be different next year. But if so, we would
then have some information on what people really need. So, if it turns
out that more will be needed in 2011, I will be happy to do something
about it once I have some hard information on what will really be needed.

Fair enough?

Thanx, Paul
From: Paul E. McKenney on
On Wed, Jan 27, 2010 at 10:39:22PM +1100, Nick Piggin wrote:
> On Wed, Jan 27, 2010 at 02:04:34AM -0800, Paul E. McKenney wrote:
> > I could indeed do that. However, there is nothing stopping the
> > more-active CPU from going into dynticks-idle mode between the time
> > that I decide to push the callback to it and the time I actually do
> > the pushing. :-(
> >
> > I considered pushing the callbacks to the orphanage, but that is a
> > global lock that I would rather not acquire on each dyntick-idle
> > transition.
>
> Well, we already have to do atomic operations on the nohz mask, so
> maybe it would be acceptable to actually have a spinlock there to
> serialise operations on the nohz mask and also allow some
> subsystem-specific things (synchronisation here should allow either of
> the approaches above).
>
> It's not going to be zero cost, but seeing as there is already a
> contended cacheline there, it's not going to introduce a
> fundamentally new bottleneck.

Good point, although a contended global lock is nastier than a contended
cache line.
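
(Roughly what I understand Nick to be suggesting; every name in this
sketch is hypothetical, and nothing like it exists in the tree today.)

#include <linux/spinlock.h>
#include <linux/cpumask.h>

static struct cpumask rcu_nohz_mask;            /* stand-in for the real nohz mask */
static DEFINE_SPINLOCK(rcu_nohz_lock);          /* the proposed co-located lock */

static bool push_callbacks_if_still_awake(int mycpu, int target)
{
        bool pushed = false;

        spin_lock(&rcu_nohz_lock);
        if (!cpumask_test_cpu(target, &rcu_nohz_mask)) {
                /* Target cannot enter dyntick-idle mode while we hold the lock. */
                move_callback_list(mycpu, target);      /* invented helper */
                pushed = true;
        }
        spin_unlock(&rcu_nohz_lock);
        return pushed;
}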

Thanx, Paul