From: Florian Mickler on
On Mon, 21 Jun 2010 18:58:37 -0700
mark gross <640e9920(a)gmail.com> wrote:


> wrong. They are about 1) adding opportunistic suspend, and 2) adding
> critical sections to block suspend.
>
> No race issues were ever talked about until alternative solutions
> were put up.

The whole and only reason to even define the term "critical sections" is
when you need to define "a race". Or vice versa. A race is prevented by
defining critical sections and protecting these against concurrent
access.

[..snip..]

[some rant that Alan is not familiar with the Android userspace..]

Are you suggesting that only android developers are supposed to talk
about this?

This is a pretty basic thing. It has only to do with system suspend.
(And using system suspend aggressively)

>
> --mgross
>

Cheers,
Flo

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rafael J. Wysocki on
On Tuesday, June 22, 2010, Alan Stern wrote:
> On Mon, 21 Jun 2010, mark gross wrote:
>
....
> > I'm starting to think you're just trolling.
>
> You may think what you like. Perhaps there is a certain degree of
> truth to the accusation -- goodness knows, I do tend to prolong
> technical conversations when I think I'm right. (Ask anybody who has
> corresponded with me for any length of time.) But this is probably
> true of a lot of kernel developers...

I certainly don't mind discussing with you. :-)

Thanks,
Rafael
From: Rafael J. Wysocki on
On Tuesday, June 22, 2010, mark gross wrote:
> On Mon, Jun 21, 2010 at 11:57:21AM -0400, Alan Stern wrote:
> > On Sun, 20 Jun 2010, mark gross wrote:
....
> > I don't know -- I have never looked at the Android userspace. No doubt
> > Arve can provide some examples.
> who are you? You don't know about the Android stack, and you keep
> blowing about how suspend blockers solve all the PM problems? I'm
> starting to think you're just trolling.

And that's just an excellent way of convincing people that you're right. Guess
how it worked on me.

For your information, Alan is one of the people most actively engaged in
power management development at the kernel level and his ideas have heavily
influenced the design of the I/O runtime PM framework, among other things.

I certainly appreciate Alan's opinions very much, because he usually is
technically right.

Thanks,
Rafael
From: Rafael J. Wysocki on
On Tuesday, June 22, 2010, Rafael J. Wysocki wrote:
> On Tuesday, June 22, 2010, Alan Stern wrote:
> > On Mon, 21 Jun 2010, Florian Mickler wrote:
> >
> > > > In the end you would want to have communication in both directions:
> > > > suspend blockers _and_ callbacks. Polling is bad if done too often.
> > > > But I think the idea is a good one.
> > >
> > > Actually, I'm not so sure.
> > >
> > > 1. you have to roundtrip whereas in the suspend_blocker scheme you have
> > > active annotations (i.e. no further action needed)
> >
> > That's why it's best to use both. The normal case is that programs
> > activate and deactivate blockers by sending one-way messages to the PM
> > process. The exceptional case is when the PM process is about to
> > initiate a suspend; that's when it does the round-trip polling. Since
> > the only purpose of the polling is to avoid a race, 90% of the time it
> > will succeed.
> >
> > > 2. it may not be possible for a user to determine if a wake-event is
> > > in-flight. you would have to somehow pass the wake-event-number with
> > > it, so that the userspace process could ack it properly without
> > > confusion. Or... I don't know of anything else...
> > >
> > > 1. userspace-manager (UM) reads a number (42).
> > >
> > > 2. it questions userspace program X: is it ok to suspend?
> > >
> > > [please fill in how userspace program X determines to block
> > > suspend]
> > >
> > > 3a. UM's roundtrip ends and it proceeds to write "42" to the
> > > kernel [suspending]
> > > 3b. UM's roundtrip ends and it aborts suspend, because a
> > > (userspace-)suspend-blocker got activated
> > >
> > > I'm not sure how the userspace program could determine that there is a
> > > wake-event in flight. Perhaps by storing the number of last wake-event.
> > > But then you need per-wake-event-counters... :|
> >
> > Rafael seems to think timeouts will fix this. I'm not so sure.
> >
> > > Do you have some thoughts about the wake-event-in-flight detection?
> >
> > Not really, except for something like the original wakelock scheme in
> > which the kernel tells the PM core when an event is over.
>
> But the kernel doesn't really know that, so it really can't tell the PM core
> anything useful. What happens with suspend blockers is that a kernel subsystem
> cooperates with a user space consumer of the event to get the story straight.
>
> However, that will only work if the user space is not buggy and doesn't crash,
> for example, before releasing the suspend blocker it's holding.

Having reconsidered that, I think there's more to it.

Take the PCI subsystem as an example, specifically pcie_pme_handle_request().
This is the place where wakeup events are started, but it has no idea about
how they are going to be handled. Thus in the suspend blocker scheme it would
need to activate a blocker, but it wouldn't be able to release it. So, it
seems, we would need to associate a suspend blocker with each PCIe device
that can generate wakeup events and require all drivers of such devices to
deal with a blocker activated by someone else (PCIe PME driver in this
particular case). That sounds cumbersome to say the least.

Moreover, even if we do that, it still doesn't solve the entire problem,
because the event may need to be delivered to user space and processed by it.
While a driver can check if user space has already read the event, it has
no way to detect whether or not it has finished processing it. In fact,
processing an event may involve an interaction with a (human) user and there's
no general way by which software can figure out when the user considers the
event as processed.

It looks like user space suspend blockers only help in some special cases
when the user space processing of a wakeup event is simple enough, but I don't
think they help in general. For an extreme example, a user may want to wake up
a system using wake-on-LAN to log into it, do some work and log out, so
effectively the initial wakeup event has not been processed entirely until the
user finally logs out of the system. Now, after the system wakeup (resulting
from the wake-on-LAN signal) we need to give the user some time to log in, but
if the user doesn't do that within a certain time, it may be reasonable to suspend
and let the user wake up the system again.

A similar situation takes place when the system is woken up by a lid switch.
Evidently, the user has opened the laptop lid to do something, but we don't
even know what the user is going to do, so there's no way we can say when
the wakeup event is finally processed.

So, even if we can say when the kernel has finished processing the event
(although that would be complicated in the PCIe case above), I don't think
it's generally possible to ensure that the entire processing of a wakeup event
has been completed. This leads to the question whether or not it is worth
trying to detect the ending of the processing of a wakeup event.

Now, going back to the $subject patch, I didn't really think it would be
suitable for opportunistic suspend, so let's focus on the "forced" suspend
instead. It still has the problem that wakeup events occurring while
/sys/power/state is written to (or even slightly before) should cause the
system to cancel the suspend, but they generally won't. With the patch
applied that can be avoided by (a) reading from /sys/power/wakeup_count,
(b) waiting for a certain time (such that if a wakeup event is not entirely
processed within that time, it's worth suspending and waking up the
system again) and (c) writing to /sys/power/wakeup_count right before writing
to /sys/power/state (where the latter is only done if the former succeeds).

Rafael
From: Alan Stern on
On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:

> Having reconsidered that, I think there's more to it.
>
> Take the PCI subsystem as an example, specifically pcie_pme_handle_request().
> This is the place where wakeup events are started, but it has no idea about
> how they are going to be handled. Thus in the suspend blocker scheme it would
> need to activate a blocker, but it wouldn't be able to release it. So, it
> seems, we would need to associate a suspend blocker with each PCIe device
> that can generate wakeup events and require all drivers of such devices to
> deal with a blocker activated by someone else (PCIe PME driver in this
> particular case). That sounds cumbersome to say the least.

Maybe this means pcie_pme_handle_request() isn't a good place to note
the arrival of a wakeup event. Doing it in the individual driver might
work out better.

Or look at this from a different point of view. Adopting Mark's
terminology, let's say that the kernel is in a "critical section" at
times when we don't want the system to suspend. Currently there's no
way to tell the PM core when the kernel enters or leaves a critical
section, so there's no way to prevent the system from suspending at the
wrong time.

Most wakeup events indicate the start of a critical section, in the
sense that you hardly ever say: "I want the computer to wake up when I
press this button, but I don't care what it does afterward -- it can go
right back to sleep without doing anything if it wants." Much more
common is that a wakeup event requires a certain amount of processing,
and you don't want the system to go back to sleep until the processing
is over. Of course, if the processing is simple enough that it can all
be done in an interrupt handler or a resume method, then nothing
extra is needed since obviously the system won't suspend while an
interrupt handler or a resume method is running. But for more
complicated cases, we need to do more.

The problem in your example is that pcie_pme_handle_request() has no
idea about the nature or extent of the critical section to follow.
Therefore it's not in a good position to mark the beginning of the
critical section, even though it is in an excellent position to mark
the receipt of a wakeup event.

> Moreover, even if we do that, it still doesn't solve the entire problem,
> because the event may need to be delivered to user space and processed by it.
> While a driver can check if user space has already read the event, it has
> no way to detect whether or not it has finished processing it. In fact,
> processing an event may involve an interaction with a (human) user and there's
> no general way by which software can figure out when the user considers the
> event as processed.

Is this the kernel's problem? Once userspace has read the event, we
can safely say that the kernel's critical section is over. Perhaps a
userspace critical section has begun, perhaps not; either way, it's no
longer the kernel's responsibility.

> It looks like user space suspend blockers only help in some special cases
> when the user space processing of a wakeup event is simple enough, but I don't
> think they help in general. For an extreme example, a user may want to wake up
> a system using wake-on-LAN to log into it, do some work and log out, so
> effectively the initial wakeup event has not been processed entirely until the
> user finally logs out of the system. Now, after the system wakeup (resulting
> from the wake-on-LAN signal) we need to give the user some time to log in, but
> if the user doesn't do that within a certain time, it may be reasonable to suspend
> and let the user wake up the system again.

I agree. This is a case where there is no clear-cut end to the
kernel's critical section, because the event is not handed over to
userspace. A reasonable approach would be to use a timeout, perhaps
also with some heuristic like: End the critical section early if we
receive 100(?) more network packets before the timeout expires.

> A similar situation takes place when the system is woken up by a lid switch.
> Evidently, the user has opened the laptop lid to do something, but we don't
> even know what the user is going to do, so there's no way we can say when
> the wakeup event is finally processed.

In this case, the kernel could inform an appropriate user process (some
part of DeviceKit? or the power-manager process?) about the lid-switch
event. Once that information had been passed on, the kernel's critical
section would be over. The process could start its own critical
section or not, as it sees fit.

If there is no process to send the information to, the kernel could
again end the critical section after a reasonable timeout (1 minute?).

> So, even if we can say when the kernel has finished processing the event
> (although that would be complicated in the PCIe case above), I don't think
> it's generally possible to ensure that the entire processing of a wakeup event
> has been completed. This leads to the question whether or not it is worth
> trying to detect the ending of the processing of a wakeup event.

As Arve pointed out, in some cases it definitely is worthwhile (the
gpio keypad matrix example). In other cases there may be no reasonable
way to tell. That doesn't mean we have to give up entirely.

> Now, going back to the $subject patch, I didn't really think it would be
> suitable for opportunistic suspend, so let's focus on the "forced" suspend
> instead. It still has the problem that wakeup events occurring while
> /sys/power/state is written to (or even slightly before) should cause the
> system to cancel the suspend, but they generally won't. With the patch
> applied that can be avoided by (a) reading from /sys/power/wakeup_count,
> (b) waiting for a certain time (such that if a wakeup event is not entirely
> processed within that time, it's worth suspending and waking up the
> system again) and (c) writing to /sys/power/wakeup_count right before writing
> to /sys/power/state (where the latter is only done if the former succeeds).

In other words, you could detect if a critical section begins after the
user process has decided to initiate a suspend. Yes, that's so.

On the other hand, we should already be able to abort a suspend if a
wakeup event arrives after tasks are frozen (to pick a reasonable
boundary point). If we can't -- if some wakeup events are able to slip
through our fingers -- I would say it's a bug and the relevant drivers
need to be fixed. In the end this probably will require adding a
function to notify the PM core that a wakeup event occurred and
therefore a suspend-in-progress should be aborted -- almost exactly
what pm_wakeup_event() does.

So I'm not opposed to the new function. But it doesn't solve the
entire problem of avoiding suspends during critical sections.

Alan Stern
