PM: Avoid losing wakeup events during suspend [Kernel]

Prev: [PATCH] [RFC] kvm: time accelerator/decelerator for kvm guests
Next: wait_for_completion_interruptible does not wait !!

From: Rafael J. Wysocki on 24 Jun 2010 11:00

On Thursday, June 24, 2010, Andy Lutomirski wrote:
> Rafael J. Wysocki wrote:
> > On Tuesday, June 22, 2010, Rafael J. Wysocki wrote:
> >> On Tuesday, June 22, 2010, Alan Stern wrote:
> >>> On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:
> > ...
> >>>> So, even if we can say when the kernel has finished processing the event
> >>>> (although that would be complicated in the PCIe case above), I don't think
> >>>> it's generally possible to ensure that the entire processing of a wakeup event
> >>>> has been completed. This leads to the question whether or not it is worth
> >>>> trying to detect the ending of the processing of a wakeup event.
> >>> As Arve pointed out, in some cases it definitely is worthwhile (the
> >>> gpio keypad matrix example). In other cases there may be no reasonable
> >>> way to tell. That doesn't mean we have to give up entirely.
> >> Well, I'm not sure, because that really depends on the hardware and bus in
> >> question. The necessary condition seems to be that the event be detected
> >> and handled entirely by the same functional unit (eg. a device driver) within
> >> the kernel and such that it is able to detect whether or not user space has
> >> acquired the event information. That doesn't seem to be a common case to me.
> >
> > Anyway, below's an update that addresses this particular case.
> >
> > It adds two more functions, pm_wakeup_begin() and pm_wakeup_end()
> > that play similar roles to suspend_block() and suspend_unblock(), but they
> > don't operate on suspend blocker objects. Instead, the first of them increases
> > a counter of events in progress and the other one decreases this counter.
> > Together they have the same effect as pm_wakeup_event(), but the counter
> > of wakeup events in progress they operate on is also checked by
> > pm_check_wakeup_events().
> >
> > Thus there are two ways kernel subsystems can signal wakeup events. First,
> > if the event is not explicitly handed over to user space and "instantaneous",
> > they can simply call pm_wakeup_event() and be done with it. Second, if the
> > event is going to be delivered to user space, the subsystem that processes
> > the event can call pm_wakeup_begin() right when the event is detected and
> > pm_wakeup_end() when it's been handed over to user space.
>
> How does userspace handle this without races? (I don't see an example
> in a driver that talks to userspace in your code...)
>
> For example, if I push a button on my keyboard, the driver calls
> pm_wakeup_begin(). Then userspace reads the key from the evdev device
> and tells the userspace suspend manager not to go to sleep.
>
> But there's a race: the keyboard driver (or input subsystem) could call
> pm_wakeup_end() before the userspace program has a chance to tell the
> suspend manager not to sleep.

That basically is what /sys/power/wakeup_count is for. The power manager is
supposed to first read from it (that will block if there are any events in
progress in the last, recently posted, version of the patch), obtain ACKs from
user space event consumers known to it, then write to
/sys/power/wakeup_count (that will fail if there were any wakeup events in the
meantime) and write to /sys/power/state only if that was successful.

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Rafael J. Wysocki on 24 Jun 2010 11:10

On Thursday, June 24, 2010, Rafael J. Wysocki wrote:
> On Thursday, June 24, 2010, Rafael J. Wysocki wrote:
> > On Wednesday, June 23, 2010, Alan Stern wrote:
> > > On Wed, 23 Jun 2010, Rafael J. Wysocki wrote:
> > >
> > > > > Didn't we agree that timeouts would be needed for Wake-on-LAN?
> > > >
> > > > Yes, we did, but in the WoL case the timeout will have to be used by the user
> > > > space rather than the kernel IMO.
> > >
> > > The kernel will still have to specify some small initial timeout. Just
> > > long enough for userspace to realize what has happened and start its
> > > own critical section.
> > >
> > > > It would make sense to add a timeout argument to pm_wakeup_event() that would
> > > > make the system stay in the working state long enough for the driver wakeup
> > > > code to start in the PCIe case. I think pm_wakeup_event() mght just increment
> > > > events_in_progress and the timer might simply decrement it.
> > >
> > > Hmm. I was thinking about a similar problem with the USB hub driver.
> > >
> > > Maybe a better answer for this particular issue is to change the
> > > workqueue code. Don't allow a work thread to enter the freezer until
> > > its queue is empty. Then you wouldn't need a timeout.
> > >
> > > > So, maybe it's just better to have pm_wakeup_event(dev, timeout) that will
> > > > increment events_in_progress and set up a timer and pm_wakeup_commit(dev) that
> > > > will delete the timer, decrement events_in_progress and increment event_count
> > > > (unless the timer has already expired before).
> > > >
> > > > That would cost us a (one more) timer in struct dev_pm_info, but it would
> > > > allow us to cover all of the cases identified so far. So, if a wakeup event is
> > > > handled within one functional unit that both detects it and delivers it to
> > > > user space, it would call pm_wakeup_event(dev, 0) (ie. infinite timeout) at the
> > > > beginning and then pm_wakeup_commit(dev) when it's done with the event.
> > > > If a wakeup event it's just detected by one piece of code and is going to
> > > > be handled by another, the first one could call pm_wakeup_event(dev, tm) and
> > > > allow the other one to call pm_wakeup_commit(dev) when it's done. However,
> > > > calling pm_wakeup_commit(dev) is not mandatory, so the second piece of code
> > > > (eg. a PCI driver) doesn't really need to do anything in the simplest case.
> > >
> > > You have to handle the case where pm_wakeup_commit() gets called after
> > > the timer expires (it should do nothing).
> >
> > Yup.
> >
> > > And what happens if the device gets a second wakeup event before the timer
> > > for the first one expires?
> >
> > Good question. I don't have an answer to it at the moment, but it seems to
> > arise from using a single timer for all events.
> >
> > It looks like it's simpler to make pm_wakeup_event() allocate a timer for each
> > event and make the timer function remove it. That would cause suspend to
> > be blocked until the timer expires without a way to cancel it earlier, though.
>
> So, I decided to try this after all.
>
> Below is a new version of the patch. It introduces pm_stay_awake(dev) and
> pm_relax() that play the roles of the "old" pm_wakeup_begin() and
> pm_wakeup_end().
>
> pm_wakeup_event() now takes an extra timeout argument and uses it for
> deferred execution of pm_relax(). So, one can either use the
> pm_stay_awake(dev) / pm_relax() pair, or use pm_wakeup_event(dev, timeout)
> if the ending is under someone else's control.
>
> In addition to that, pm_get_wakeup_count() blocks until events_in_progress is
> zero.
>
> Please tell me what you think.

Ah, one piece is missing. Namely, the waiting /sys/power/wakeup_count reader
needs to be woken up when events_in_progress goes down to zero.

I'll send a new version with this bug fixed later today.

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Andy Lutomirski on 24 Jun 2010 11:30

On Jun 24, 2010, at 10:48 AM, "Rafael J. Wysocki" <rjw(a)sisk.pl> wrote:

> On Thursday, June 24, 2010, Andy Lutomirski wrote:
>> Rafael J. Wysocki wrote:
>>> On Tuesday, June 22, 2010, Rafael J. Wysocki wrote:
>>>> On Tuesday, June 22, 2010, Alan Stern wrote:
>>>>> On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:
>>> ...
>>>>>> So, even if we can say when the kernel has finished processing
>>>>>> the event
>>>>>> (although that would be complicated in the PCIe case above), I
>>>>>> don't think
>>>>>> it's generally possible to ensure that the entire processing of
>>>>>> a wakeup event
>>>>>> has been completed. This leads to the question whether or not
>>>>>> it is worth
>>>>>> trying to detect the ending of the processing of a wakeup event.
>>>>> As Arve pointed out, in some cases it definitely is worthwhile
>>>>> (the
>>>>> gpio keypad matrix example). In other cases there may be no
>>>>> reasonable
>>>>> way to tell. That doesn't mean we have to give up entirely.
>>>> Well, I'm not sure, because that really depends on the hardware
>>>> and bus in
>>>> question. The necessary condition seems to be that the event be
>>>> detected
>>>> and handled entirely by the same functional unit (eg. a device
>>>> driver) within
>>>> the kernel and such that it is able to detect whether or not user
>>>> space has
>>>> acquired the event information. That doesn't seem to be a common
>>>> case to me.
>>>
>>> Anyway, below's an update that addresses this particular case.
>>>
>>> It adds two more functions, pm_wakeup_begin() and pm_wakeup_end()
>>> that play similar roles to suspend_block() and suspend_unblock(),
>>> but they
>>> don't operate on suspend blocker objects. Instead, the first of
>>> them increases
>>> a counter of events in progress and the other one decreases this
>>> counter.
>>> Together they have the same effect as pm_wakeup_event(), but the
>>> counter
>>> of wakeup events in progress they operate on is also checked by
>>> pm_check_wakeup_events().
>>>
>>> Thus there are two ways kernel subsystems can signal wakeup
>>> events. First,
>>> if the event is not explicitly handed over to user space and
>>> "instantaneous",
>>> they can simply call pm_wakeup_event() and be done with it.
>>> Second, if the
>>> event is going to be delivered to user space, the subsystem that
>>> processes
>>> the event can call pm_wakeup_begin() right when the event is
>>> detected and
>>> pm_wakeup_end() when it's been handed over to user space.
>>
>> How does userspace handle this without races? (I don't see an
>> example
>> in a driver that talks to userspace in your code...)
>>
>> For example, if I push a button on my keyboard, the driver calls
>> pm_wakeup_begin(). Then userspace reads the key from the evdev
>> device
>> and tells the userspace suspend manager not to go to sleep.
>>
>> But there's a race: the keyboard driver (or input subsystem) could
>> call
>> pm_wakeup_end() before the userspace program has a chance to tell the
>> suspend manager not to sleep.
>
> That basically is what /sys/power/wakeup_count is for. The power
> manager is
> supposed to first read from it (that will block if there are any
> events in
> progress in the last, recently posted, version of the patch), obtain
> ACKs from
> user space event consumers known to it, then write to
> /sys/power/wakeup_count (that will fail if there were any wakeup
> events in the
> meantime) and write to /sys/power/state only if that was successful.

That sounds a good deal more complicated than the poll, block suspend,
read, unblock approach, but I guess that as long as both work, the
kernel doesn't really have to care.

>
> Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Rafael J. Wysocki on 24 Jun 2010 12:30

On Thursday, June 24, 2010, Alan Stern wrote:
> On Thu, 24 Jun 2010, Rafael J. Wysocki wrote:
>
> > > > And what happens if the device gets a second wakeup event before the timer
> > > > for the first one expires?
> > >
> > > Good question. I don't have an answer to it at the moment, but it seems to
> > > arise from using a single timer for all events.
> > >
> > > It looks like it's simpler to make pm_wakeup_event() allocate a timer for each
> > > event and make the timer function remove it. That would cause suspend to
> > > be blocked until the timer expires without a way to cancel it earlier, though.
> >
> > So, I decided to try this after all.
> >
> > Below is a new version of the patch. It introduces pm_stay_awake(dev) and
> > pm_relax() that play the roles of the "old" pm_wakeup_begin() and
> > pm_wakeup_end().
> >
> > pm_wakeup_event() now takes an extra timeout argument and uses it for
> > deferred execution of pm_relax(). So, one can either use the
> > pm_stay_awake(dev) / pm_relax() pair, or use pm_wakeup_event(dev, timeout)
> > if the ending is under someone else's control.
> >
> > In addition to that, pm_get_wakeup_count() blocks until events_in_progress is
> > zero.
> >
> > Please tell me what you think.
>
> This is slightly different from the wakelock design. Each call to
> pm_stay_awake() must be paired with a call to pm_relax(), allowing one
> device to have multiple concurrent critical sections, whereas calls to
> pm_wakeup_event() must not be paired with anything. With wakelocks,
> you couldn't have multiple pending events for the same device.

You could, but you needed to define multiple wakelocks for the same device for
this purpose.

> I'm not sure which model is better in practice. No doubt the Android people
> will prefer their way.

I suppose so.

> This requires you to define an explicit PCI_WAKEUP_COOLDOWN delay. I
> think that's okay; I had to do something similar with USB and SCSI.
> (And I still think it would be a good idea to prevent workqueue threads
> from freezing until their queues are empty.)

I guess you mean the freezable ones? I'm not sure if that helps a lot, because
new work items may still be added after the workqueue thread has been frozen.

> Instead of allocating the work structures dynamically, would you be
> better off using a memory pool?

Well, it would be kind of equivalent to defining my own slab cache for that,
wouldn't it?

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Rafael J. Wysocki on 24 Jun 2010 19:10

On Thursday, June 24, 2010, Alan Stern wrote:
> On Thu, 24 Jun 2010, Rafael J. Wysocki wrote:
>
> > > This is slightly different from the wakelock design. Each call to
> > > pm_stay_awake() must be paired with a call to pm_relax(), allowing one
> > > device to have multiple concurrent critical sections, whereas calls to
> > > pm_wakeup_event() must not be paired with anything. With wakelocks,
> > > you couldn't have multiple pending events for the same device.
> >
> > You could, but you needed to define multiple wakelocks for the same device for
> > this purpose.
>
> Yeah, okay, but who's going to do that?
>
> > > I'm not sure which model is better in practice. No doubt the Android people
> > > will prefer their way.
> >
> > I suppose so.
>
> It may not make a significant difference in the end. You can always
> emulate the wakelock approach by not calling pm_stay_awake() when you
> know there is an earlier call still pending.
>
> > > This requires you to define an explicit PCI_WAKEUP_COOLDOWN delay. I
> > > think that's okay; I had to do something similar with USB and SCSI.
> > > (And I still think it would be a good idea to prevent workqueue threads
> > > from freezing until their queues are empty.)
> >
> > I guess you mean the freezable ones?
>
> Yes. The unfreezable workqueue threads don't have to worry about
> getting frozen while their queues are non-empty. :-)
>
> > I'm not sure if that helps a lot, because
> > new work items may still be added after the workqueue thread has been frozen.
>
> That's not the point. If a wakeup handler queues a work item (for
> example, by calling pm_request_resume) then it wouldn't need to guess a
> timeout. The work item would be guaranteed to run before the system
> could suspend again.

You seem to be referring to the PM workqueue specifically. Perhaps it would be
better to special-case it and stop it by adding a barrier work during suspend
instead of just freezing? Then, it wouldn't need to be singlethread any more.

Still, I think the timeout is necessary anyway in case the driver simply
doesn't handle the event and user space needs time to catch up. Unfortunately,
the PCI wakeup code doesn't know what happens next in advance.

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10 11 12
Prev: [PATCH] [RFC] kvm: time accelerator/decelerator for kvm guests
Next: wait_for_completion_interruptible does not wait !!