From: Ingo Molnar on

* Ingo Molnar <mingo(a)elte.hu> wrote:

> - Create a 'deep idle' mode that suspends. This, if all constraints
> are met, is triggered by the scheduler automatically: just like the other
> idle modes are triggered currently. This approach fixes the wakeup
> races because an incoming wakeup event will set need_resched() and
> abort the suspend.
>
> ( This mode can even use the existing suspend code to bring stuff down,
> therefore it also solves the pending timer problem and works even on
> PC style x86. )

Note that this does not necessarily have to be implemented as 'execute suspend
from the idle task' code: scheduling from the idle task, while it can certainly
be made to work, is a somewhat recursive concept that we might want to avoid
for robustness reasons.

Instead, the 'deepest idle' (suspend) method could consist of a wakeup of a
kernel thread (or of any of the existing kernel threads such as the migration
thread) - which kernel thread then does a race-free suspend: it offlines all
but one CPU [on platforms that need that] and then initiates the suspend - but
aborts the attempt if there's any sign of wakeup activity.
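
The flow described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: every name in it (model_try_suspend, model_signal_wakeup, MODEL_NR_CPUS and so on) is hypothetical and not a kernel API, and the "CPUs" are plain array entries. It shows a dedicated thread's suspend attempt that offlines all but one CPU and backs out as soon as a wakeup is seen.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model state; none of these names are kernel APIs. */
#define MODEL_NR_CPUS 4

static atomic_bool wakeup_pending;   /* set by any incoming wakeup event */
static bool cpu_online[MODEL_NR_CPUS] = { true, true, true, true };
static bool system_suspended;

/* Called from a (simulated) wakeup source such as an interrupt. */
void model_signal_wakeup(void)
{
    atomic_store(&wakeup_pending, true);
}

/* Bring every non-boot CPU back online. */
static void model_online_all(void)
{
    for (int cpu = 1; cpu < MODEL_NR_CPUS; cpu++)
        cpu_online[cpu] = true;
}

int model_online_count(void)
{
    int n = 0;
    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        n += cpu_online[cpu] ? 1 : 0;
    return n;
}

bool model_is_suspended(void)
{
    return system_suspended;
}

/* Simulated resume: leave the suspended state and restore all CPUs. */
void model_resume(void)
{
    system_suspended = false;
    model_online_all();
}

/*
 * One suspend attempt as the dedicated thread might run it.
 * Returns true if the model reached the suspended state,
 * false if the attempt was aborted because of wakeup activity.
 */
bool model_try_suspend(void)
{
    assert(!system_suspended);

    /* Step 1: offline all but CPU 0, checking for wakeups before each one. */
    for (int cpu = 1; cpu < MODEL_NR_CPUS; cpu++) {
        if (atomic_load(&wakeup_pending))
            goto abort;
        cpu_online[cpu] = false;
    }

    /* Step 2: last check before committing to the suspend. */
    if (atomic_load(&wakeup_pending))
        goto abort;

    system_suspended = true;
    return true;

abort:
    model_online_all();
    atomic_store(&wakeup_pending, false);   /* the wakeup has been handled */
    return false;
}
```

In this model a wakeup that arrives after the final check is not caught. A real implementation would have to rely on the hardware wakeup path for that window, which the sketch does not attempt to model.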

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on


On Fri, 4 Jun 2010, Ingo Molnar wrote:
>
> What you say is absolutely true, hence this would be driven via sched_tick() +
> TIF notifiers - i.e. only ever treat user-mode tasks as 'idle-able'. This can
> be done with no overhead to the regular fastpaths.
>
> The TIF notifier would be the one scheduling to idle - and would thus do it
> only to user-mode tasks.

The thing is, unless there is some _really_ deep other reason to do
something like this, I still think it's total overdesign to push any
knowledge/choices like this into the scheduler. I'd rather keep things way
more independent, less tied to each other and to deep kernel subsystems.

IOW, my personal opinion is that something like a suspend (blocker or not)
decision simply shouldn't be important enough to be tied into the
scheduler. Especially not if it could just be its own layer.

That said, as far as I know, the Android people have mostly been looking
at the suspend angle from a single-core standpoint. And I'm not at all
convinced that they should hijack the existing "/sys/power/state" thing
which is what I think they do now.

And those two things go together. The /sys/power/state thing is a global
suspend - which I don't think is appropriate for an opportunistic thing in
the first place, especially for multi-core.

A well-designed opportunistic suspend should be a two-phase thing: an
opportunistic CPU hotunplug (shutting down cores one by one as the system
is idle), and not a "global" event in the first place. And only when
you've reached single-core state should you then say "do I suspend the
system too".
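
As a rough illustration of the two-phase idea, the policy can be written as a tiny decision function. This is a sketch under assumptions, not code from the patches under discussion: struct opp_state, opp_tick() and the idle_threshold tunable are all hypothetical names, and "idle" is reduced to a boolean supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* What the (hypothetical) policy asks the caller to do on this tick. */
enum opp_action {
    OPP_NONE,               /* nothing to do */
    OPP_OFFLINE_ONE_CPU,    /* phase 1: take one more core down */
    OPP_CONSIDER_SUSPEND    /* phase 2: single core left, ask about suspend */
};

struct opp_state {
    int online_cpus;        /* currently online cores, always >= 1 */
    int idle_ticks;         /* consecutive ticks the system looked idle */
    int idle_threshold;     /* hypothetical tunable: idle ticks before acting */
};

/*
 * Advance the policy by one tick. Cores are shut down one at a time while
 * the system stays idle; only once a single core remains does the policy
 * raise the question of suspending the whole system.
 */
enum opp_action opp_tick(struct opp_state *s, bool system_idle)
{
    assert(s->online_cpus >= 1);

    if (!system_idle) {
        s->idle_ticks = 0;          /* any activity restarts the countdown */
        return OPP_NONE;
    }

    if (++s->idle_ticks < s->idle_threshold)
        return OPP_NONE;

    s->idle_ticks = 0;
    if (s->online_cpus > 1) {
        s->online_cpus--;
        return OPP_OFFLINE_ONE_CPU;
    }
    return OPP_CONSIDER_SUSPEND;
}
```

The point of the split is that OPP_CONSIDER_SUSPEND can only be returned from the single-core state, so the suspend decision never has to reason about other run-queues.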

So I've tried to look a bit at the patches, and my admittedly rough
comments so far are

 - I really do prefer the "off to the side" approach that the current
   google opportunistic suspend patches have. As mentioned, I don't think
   this should be deep in the scheduler. Not at all.

 - I do think there are possibly races and CPU idle issues there, but I
   think they are mainly for the multi-core thing. And I think that's a
   totally separate issue. Or it _should_ be.

 - once you're single-core (whether because you never had more cores to
   begin with, or because the "opportunistic CPU offlining" has taken down
   the other cores), I think the suspend-blocker is fine as a concept, and
   certainly shouldn't need any deep scheduler hooks.

so I'd like to see the opportunistic suspend thing think about CPU
offlining, and I'd like to see it disconnect from the existing
/sys/power/state. And I'd really not like to involve deep internal kernel
hooks in it.

But I'll also admit that maybe I'm not seeing some problems. I've frankly
tried to avoid the whole discussion until Andrew pulled me in yesterday.

Linus
--
From: Linus Torvalds on


On Thu, 3 Jun 2010, Linus Torvalds wrote:
>
> so I'd like to see the opportunistic suspend thing think about CPU
> offlining

Side note: one reason for me being somewhat interested in the CPU
offlining is that I think the Android kind of opportunistic suspend is
_not_ likely something I'd like to see on a desktop. But an
"opportunistic CPU offliner"? That might _well_ be useful even outside of
any other suspend activity.

If the system is idle (or almost idle) for long times, I would heartily
recommend actively shutting down unused cores. Some CPUs are hopefully
smart enough to not even need that kind of software management, but I
suspect even the really smart ones might be able to take advantage of the
kernel saying: "I'm shutting you down, you don't have to worry about
latency AT ALL, because I'm keeping another CPU active to do any real
work".

I'd also be interested to see if it could even improve single-thread
performance if we end up doing the whole SMP->UP "lock" prefix rewriting
when the system is idle enough that we'd be better off running just a
single core. I dunno - just throwing that out there.

Anyway, the only reason I think this is related is literally because I
think that if we know there is only a single CPU active, I think the
actual "real" opportunistic suspend is easier. Suddenly you don't have to
worry about what happens on other run-queues etc, and whether another CPU
is just about to create a suspend block etc.

So I think they tie together, although it's mostly tangential. And as
mentioned, I think an opportunistic CPU suspend part is more relevant
outside of Android, and thus perhaps more widely interesting.

Linus
--
From: Arjan van de Ven on
On Thu, 3 Jun 2010 19:26:50 -0700 (PDT)
Linus Torvalds <torvalds(a)linux-foundation.org> wrote:

>
> If the system is idle (or almost idle) for long times, I would
> heartily recommend actively shutting down unused cores. Some CPUs
> are hopefully smart enough to not even need that kind of software
> management, but I suspect even the really smart ones might be able to
> take advantage of the kernel saying: "I'm shutting you down, you
> don't have to worry about latency AT ALL, because I'm keeping another
> CPU active to do any real work".

sadly the reality is that "offline" is actually the same as "deepest C
state". At best.

As far as I can see, this is at least true for all Intel and AMD cpus.

And because there's then no power saving (but a performance cost), it's
actually a negative for battery life/total energy.

(lots of experiments inside Intel seem to confirm that, it's not just
theory)

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
From: Arve Hjønnevåg on
On Thu, Jun 3, 2010 at 7:16 PM, Linus Torvalds
<torvalds(a)linux-foundation.org> wrote:
>
>
> On Fri, 4 Jun 2010, Ingo Molnar wrote:
>>
>> What you say is absolutely true, hence this would be driven via sched_tick() +
>> TIF notifiers - i.e. only ever treat user-mode tasks as 'idle-able'. This can
>> be done with no overhead to the regular fastpaths.
>>
>> The TIF notifier would be the one scheduling to idle - and would thus do it
>> only to user-mode tasks.
>
> The thing is, unless there is some _really_ deep other reason to do
> something like this, I still think it's total overdesign to push any
> knowledge/choices like this into the scheduler. I'd rather keep things way
> more independent, less tied to each other and to deep kernel subsystems.
>
> IOW, my personal opinion is that something like a suspend (blocker or not)
> decision simply shouldn't be important enough to be tied into the
> scheduler. Especially not if it could just be its own layer.
>
> That said, as far as I know, the Android people have mostly been looking
> at the suspend angle from a single-core standpoint. And I'm not at all
> convinced that they should hijack the existing "/sys/power/state" thing
> which is what I think they do now.
>

While it is true that we have not used this code on a multi-core
system yet, I'm not sure why multiple cores would affect it. We
annotate that work needs to be done before it is safe to suspend, but
we don't care which core does the work (or if multiple cores do pieces
of it).

> And those two things go together. The /sys/power/state thing is a global
> suspend - which I don't think is appropriate for an opportunistic thing in
> the first place, especially for multi-core.
>
> A well-designed opportunistic suspend should be a two-phase thing: an
> opportunistic CPU hotunplug (shutting down cores one by one as the system
> is idle), and not a "global" event in the first place. And only when
> you've reached single-core state should you then say "do I suspend the
> system too".
>

This seems to fit better into the cpuidle and/or frequency scaling framework.

> So I've tried to look a bit at the patches, and my admittedly rough
> comments so far are
>
>  - I really do prefer the "off to the side" approach that the current
>    google opportunistic suspend patches have. As mentioned, I don't think
>    this should be deep in the scheduler. Not at all.
>
>  - I do think there are possibly races and CPU idle issues there, but I
>    think they are mainly for the multi-core thing. And I think that's a
>    totally separate issue. Or it _should_ be.
>

I'm not aware of any races with multi-core systems unless there are
existing problems in suspend. We check if any suspend blockers are
active after disable_nonboot_cpus() has returned.
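
To make the ordering concrete, here is a minimal userspace model of that check. It is a sketch under assumptions and not the actual patch code: sb_block(), sb_unblock(), sb_try_suspend() and the CPU counter are hypothetical stand-ins, and the step that would be disable_nonboot_cpus() in the kernel is reduced to setting a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SB_TOTAL_CPUS 2             /* hypothetical two-core system */

static atomic_int active_blockers;  /* number of suspend blockers held */
static int sb_online = SB_TOTAL_CPUS;

/* Any core may take or drop a blocker; which core does so is irrelevant. */
void sb_block(void)
{
    atomic_fetch_add(&active_blockers, 1);
}

void sb_unblock(void)
{
    int prev = atomic_fetch_sub(&active_blockers, 1);
    assert(prev > 0);               /* unbalanced unblock is a bug */
    (void)prev;
}

int sb_online_cpus(void)
{
    return sb_online;
}

/* Simulated resume path: all cores come back. */
void sb_resume(void)
{
    sb_online = SB_TOTAL_CPUS;
}

/*
 * Model of one opportunistic suspend attempt. The blocker check happens
 * only after the non-boot CPUs are down, so no other core can be in the
 * middle of taking a blocker when the count is read.
 */
bool sb_try_suspend(void)
{
    sb_online = 1;                  /* stand-in for disable_nonboot_cpus() */

    if (atomic_load(&active_blockers) > 0) {
        sb_online = SB_TOTAL_CPUS;  /* abort: bring the cores back */
        return false;
    }
    return true;                    /* model only: would enter suspend here */
}
```

In this model the number of online cores before the attempt does not change the outcome, only whether a blocker is held at the moment of the check.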

>  - once you're single-core (whether because you never had more cores to
>    begin with, or because the "opportunistic CPU offlining" has taken down
>    the other cores), I think the suspend-blocker is fine as a concept, and
>    certainly shouldn't need any deep scheduler hooks.
>
> so I'd like to see the opportunistic suspend thing think about CPU
> offlining,

I see this as a separate problem. We ignore a single busy CPU for
opportunistic suspend, so why should the number of online CPUs matter?

> and I'd like to see it disconnect from the existing
> /sys/power/state.

The entry point is not important to us. The current interface is what
Rafael wanted instead of the /sys/power/request-state interface which
is what we changed it to last year.

> And I'd really not like to involve deep internal kernel
> hooks in it.
>
> But I'll also admit that maybe I'm not seeing some problems. I've frankly
> tried to avoid the whole discussion until Andrew pulled me in yesterday.
>
>                         Linus
>



--
Arve Hjønnevåg