From: Alan Stern on
On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:

> Anyway, below's an update that addresses this particular case.
>
> It adds two more functions, pm_wakeup_begin() and pm_wakeup_end()
> that play similar roles to suspend_block() and suspend_unblock(), but they
> don't operate on suspend blocker objects. Instead, the first of them increases
> a counter of events in progress and the other one decreases this counter.
> Together they have the same effect as pm_wakeup_event(), but the counter
> of wakeup events in progress they operate on is also checked by
> pm_check_wakeup_events().
>
> Thus there are two ways kernel subsystems can signal wakeup events. First,
> if the event is not explicitly handed over to user space and "instantaneous",
> they can simply call pm_wakeup_event() and be done with it. Second, if the
> event is going to be delivered to user space, the subsystem that processes
> the event can call pm_wakeup_begin() right when the event is detected and
> pm_wakeup_end() when it's been handed over to user space.

Or if the event is going to be handled entirely in the kernel but over
a prolonged period of time.

> Please tell me what you think.

I like it a lot. It addresses the main weakness in the earlier
version. With this, you could essentially duplicate the in-kernel part
of the wakelocks/suspend blockers stuff. All except the timed
wakelocks -- you might want to consider adding a
pm_wakeup_begin_timeout() convenience routine.

Here's another possible enhancement (if you can figure out a way to do
it without too much effort): After a suspend begins, keep track of the
first wakeup event you get. Then when the suspend is aborted, print a
log message saying why and indicating which device was responsible for
the wakeup.

One little thing: You have the PCI subsystem call pm_wakeup_event().
If the driver then wants to call pm_wakeup_begin(), the event will get
counted twice. I guess this doesn't matter much, but it does seem
peculiar.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Alan Stern on
On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:

> > > Please tell me what you think.
> >
> > I like it a lot. It addresses the main weakness in the earlier
> > version. With this, you could essentially duplicate the in-kernel part
> > of the wakelocks/suspend blockers stuff. All except the timed
> > wakelocks -- you might want to consider adding a
> > pm_wakeup_begin_timeout() convenience routine.
>
> That may be added in future quite easily if it really turns out to be necessary.
> IIRC Arve said Android only used timeouts in user space wakelocks, not in the
> kernel ones.

Didn't we agree that timeouts would be needed for Wake-on-LAN?

> > Here's another possible enhancement (if you can figure out a way to do
> > it without too much effort): After a suspend begins, keep track of the
> > first wakeup event you get. Then when the suspend is aborted, print a
> > log message saying why and indicating which device was responsible for
> > the wakeup.
>
> Good idea, but I'd prefer to add it in a separate patch not to complicate
> things too much to start with.

Okay. Another thing to be considered later is whether there should be
a way to write to /sys/power/state that would block until the active
wakeup count drops to 0. On the other hand polling maybe once per
second wouldn't be so bad. It would happen only when the kernel had
some events outstanding and userspace didn't.

One thing that stands out is the new spinlock. It could potentially be
a big source of contention. Any wakeup-enabled device is liable to
need it during every interrupt. Do you think this could cause a
noticeable slowdown?

> > One little thing: You have the PCI subsystem call pm_wakeup_event().
> > If the driver then wants to call pm_wakeup_begin(), the event will get
> > counted twice. I guess this doesn't matter much, but it does seem
> > peculiar.
>
> Knowing that the PCI core has increased the wakeup count of its device, a
> PCI driver may simply use pm_wakeup_begin(NULL) and that will only cause
> the main counter to be increased in the end. Which kind of makes sense,
> because in that case there really is a sequence of events. First, the PCI core
> detects a wakeup signal and requests wakeup so that the driver has a chance
> to access the device and get the event information from it (although at this
> point it is not known whether or not the driver will need to do that). Second,
> the driver requests that the system stay in the working state, because it needs
> time to process the event data and (presumably) hand it over to user space.
> The device has only signaled wakeup once, though, and that should be recorded.

Okay, that works. Although if anybody wanted to keep track of timing
statistics, the results wouldn't be as effective since the start and
end times would not be associated with the device.

> BTW, I thought I would make pm_get_wakeup_count() and pm_save_wakeup_count()
> fail if they are called when events_in_progress is nonzero. For
> pm_save_wakeup_count() that's pretty obvious (I think) and it also kind of
> makes sense for pm_get_wakeup_count(), because that will tell the reader of
> /sys/power/wakeup_count that the value is going to change immediately so it
> should really try again.

Sensible.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Alan Stern on
On Wed, 23 Jun 2010, Rafael J. Wysocki wrote:

> > Didn't we agree that timeouts would be needed for Wake-on-LAN?
>
> Yes, we did, but in the WoL case the timeout will have to be used by the user
> space rather than the kernel IMO.

The kernel will still have to specify some small initial timeout. Just
long enough for userspace to realize what has happened and start its
own critical section.

> It would make sense to add a timeout argument to pm_wakeup_event() that would
> make the system stay in the working state long enough for the driver wakeup
> code to start in the PCIe case. I think pm_wakeup_event() mght just increment
> events_in_progress and the timer might simply decrement it.

Hmm. I was thinking about a similar problem with the USB hub driver.

Maybe a better answer for this particular issue is to change the
workqueue code. Don't allow a work thread to enter the freezer until
its queue is empty. Then you wouldn't need a timeout.

> So, maybe it's just better to have pm_wakeup_event(dev, timeout) that will
> increment events_in_progress and set up a timer and pm_wakeup_commit(dev) that
> will delete the timer, decrement events_in_progress and increment event_count
> (unless the timer has already expired before).
>
> That would cost us a (one more) timer in struct dev_pm_info, but it would
> allow us to cover all of the cases identified so far. So, if a wakeup event is
> handled within one functional unit that both detects it and delivers it to
> user space, it would call pm_wakeup_event(dev, 0) (ie. infinite timeout) at the
> beginning and then pm_wakeup_commit(dev) when it's done with the event.
> If a wakeup event it's just detected by one piece of code and is going to
> be handled by another, the first one could call pm_wakeup_event(dev, tm) and
> allow the other one to call pm_wakeup_commit(dev) when it's done. However,
> calling pm_wakeup_commit(dev) is not mandatory, so the second piece of code
> (eg. a PCI driver) doesn't really need to do anything in the simplest case.

You have to handle the case where pm_wakeup_commit() gets called after
the timer expires (it should do nothing). And what happens if the
device gets a second wakeup event before the timer for the first one
expires? dev_pm_info would need to store a count of pending events.

In fact, this gets even worse. What if the second event causes you to
move the timeout forward, but then you get a commit for the second
event before the original timer would have expired? It's not clear
that timeouts and early commit work well together.

You could consider changing some of the new function names. Instead of
"wakeup" (which implies that the system was asleep previously) use
"awake" (which implies that you want to prevent the system from going
to sleep, as in "stay awake").

> > One thing that stands out is the new spinlock. It could potentially be
> > a big source of contention. Any wakeup-enabled device is liable to
> > need it during every interrupt. Do you think this could cause a
> > noticeable slowdown?
>
> That really depends on the number of wakeup devices. However, ISTR the
> original wakelocks code had exactly the same issue (it used a spinlock to
> protect the lists of wakelocks).

Yeah, that's right. I have already forgotten the details of how that
original design worked.

Alan Stern


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andy Lutomirski on
Rafael J. Wysocki wrote:
> On Tuesday, June 22, 2010, Rafael J. Wysocki wrote:
>> On Tuesday, June 22, 2010, Alan Stern wrote:
>>> On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:
> ...
>>>> So, even if we can say when the kernel has finished processing the event
>>>> (although that would be complicated in the PCIe case above), I don't think
>>>> it's generally possible to ensure that the entire processing of a wakeup event
>>>> has been completed. This leads to the question whether or not it is worth
>>>> trying to detect the ending of the processing of a wakeup event.
>>> As Arve pointed out, in some cases it definitely is worthwhile (the
>>> gpio keypad matrix example). In other cases there may be no reasonable
>>> way to tell. That doesn't mean we have to give up entirely.
>> Well, I'm not sure, because that really depends on the hardware and bus in
>> question. The necessary condition seems to be that the event be detected
>> and handled entirely by the same functional unit (eg. a device driver) within
>> the kernel and such that it is able to detect whether or not user space has
>> acquired the event information. That doesn't seem to be a common case to me.
>
> Anyway, below's an update that addresses this particular case.
>
> It adds two more functions, pm_wakeup_begin() and pm_wakeup_end()
> that play similar roles to suspend_block() and suspend_unblock(), but they
> don't operate on suspend blocker objects. Instead, the first of them increases
> a counter of events in progress and the other one decreases this counter.
> Together they have the same effect as pm_wakeup_event(), but the counter
> of wakeup events in progress they operate on is also checked by
> pm_check_wakeup_events().
>
> Thus there are two ways kernel subsystems can signal wakeup events. First,
> if the event is not explicitly handed over to user space and "instantaneous",
> they can simply call pm_wakeup_event() and be done with it. Second, if the
> event is going to be delivered to user space, the subsystem that processes
> the event can call pm_wakeup_begin() right when the event is detected and
> pm_wakeup_end() when it's been handed over to user space.

How does userspace handle this without races? (I don't see an example
in a driver that talks to userspace in your code...)

For example, if I push a button on my keyboard, the driver calls
pm_wakeup_begin(). Then userspace reads the key from the evdev device
and tells the userspace suspend manager not to go to sleep.

But there's a race: the keyboard driver (or input subsystem) could call
pm_wakeup_end() before the userspace program has a chance to tell the
suspend manager not to sleep.

One possibility would be for poll to report that events are pending
without calling pm_wakeup_end(), giving userspace a chance to prevent
itself from suspending before actually reading the event.


(Also, should "echo mem >/sys/power/state" be different from "echo
mem_respect_suspend_blockers >/sys/power/state?" If I physically press
the suspend key on my laptop, I want it to go to sleep even though I'm
still holding the Fn key that was part of the suspend hotkey.)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Alan Stern on
On Thu, 24 Jun 2010, Andy Lutomirski wrote:

> How does userspace handle this without races? (I don't see an example
> in a driver that talks to userspace in your code...)
>
> For example, if I push a button on my keyboard, the driver calls
> pm_wakeup_begin(). Then userspace reads the key from the evdev device
> and tells the userspace suspend manager not to go to sleep.
>
> But there's a race: the keyboard driver (or input subsystem) could call
> pm_wakeup_end() before the userspace program has a chance to tell the
> suspend manager not to sleep.

The assumption is that the user program will poll the evdev device.
When the poll indicates data is available, the program will tell the
userspace power manager not to go to sleep before reading the data.
This avoids the race.

> One possibility would be for poll to report that events are pending
> without calling pm_wakeup_end(), giving userspace a chance to prevent
> itself from suspending before actually reading the event.

Exactly. The critical section doesn't end until the events have been
read. Polling alone doesn't affect the critical section.

> (Also, should "echo mem >/sys/power/state" be different from "echo
> mem_respect_suspend_blockers >/sys/power/state?" If I physically press
> the suspend key on my laptop, I want it to go to sleep even though I'm
> still holding the Fn key that was part of the suspend hotkey.)

With Rafael's scheme, the difference is not in the write to
/sys/power/state but rather in the write to /sys/power/wakeup_count.
If the power manager program doesn't write to /sys/power/wakeup_count
first then /sys/power/state takes effect immediately, just as it does
now.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/