- race-free suspend. Was: Re: [linux-pm] [PATCH 0/8] Suspend block api (version 8) [Kernel]

Prev: p54spi: replace internal "cx3110x" name with "p54spi"
Next: [git pull] Please pull powerpc.git merge branch

From: Neil Brown on 2 Jun 2010 19:30

On Wed, 2 Jun 2010 21:05:21 +0200
Florian Mickler <florian(a)mickler.org> wrote:

> Could someone perhaps make a recap on what are the problems with the
> API? I have no clear eye (experience?) for that (or so it seems).

Good interface design is an acquired taste. And it isn't always easy to
explain satisfactorily. But let me try to explain what I see.

A key aspect of a good interface is unity, and sometimes uniformity.
For example, the file descriptor is a key element to the unity of the
Unix (and hence Posix and Linux) interface. "everything is a file" and
even when it isn't, everything is accessed via a file descriptor.
This is one of the reasons that signals cause so much problem when
programming in Unix - they aren't files, don't have file descriptors and
don't look them at all. That is why signalfd was created, to try to tie
signals back in to the 'file descriptor' model.

So unity is important. Adding new concepts is best done as an extension of
an existing concept. That means that all the infrastructure, not only code
and design but also developer understanding, can be leveraged to help get the
new concept *right* first time. It also means that using the new concept is
easier to learn.

So the problem with the wake-locks / suspend-blockers (and I've actually come
to like the first name much more) is that it introduces a new concept without
properly leveraging existing concepts.

The new concept is opportunistic suspend, though maybe a better name would be
automatic suspend - not sure.

There appear to be two ways you can get opportunistic suspend leveraging
already-existing concepts.

One is to leverage the current "power/state = mem" architecture and just let
userspace choose the opportune moment. The user-space daemon that chooses
this moment would need full information about states of various things to do
this, but sysfs is good at providing full information about what is in the
kernel, and there are dozens of ways for user-space processes to communicate
their state to each other. So this is all doable today without introducing
new design concepts.
Except there is a race between suspending and new events, so we just need to
fix the race. Hence my patch.

The other is to leverage the more general power management infrastructure.
We can already detect when the processor won't be needed for a while, and put
it into a low-power state. We can already put devices to sleep when they
aren't being used. We can just generalise this so that we can detect when
all devices are either unused, or capable of supporting an S3 transition, and
detect when the next timer interrupt is far enough in the future that S3
latency wont be a problem - set the rtc alarm to match the next timer and go
to S3. All completely transparent. (I admit I'm not entirely sure what the
qos that is being discussed brings us, but I assume it is better quality
rather than correctness).

So there are at least two different ways that opportunistic suspend could be
integrated into existing infrastructure with virtually no change of interface
and no new concepts - just refining or extending existing concepts.

Yet the approach used and preferred by android is to create something
substantially new. Yes, it does use the existing suspend infrastructure, but
in a very new and different way. Suspend is now initiated by the kernel, but
in a completely different way to the ways that the kernel already initiates
power saving. So we have two infrastructures for essentially one task.
Looked at the other way, it moves the initiation of suspend from user-space
into the kernel, and then allows user-space to tell the kernel not to suspend.
That to me is very ugly.
In general, the kernel should provide information to user-space, and provide
services to user-space, and user-space should use that information to decide
what services to request. This is the essence the "policy lives in
user-space" maxim.
The Android code has user-space giving information to the kernel, so the
kernel can make a policy decision. This approach is generally less flexible
and is best avoided.

Just as a bit of background, let's think about some of the areas in the
kernel where the kernel does make policy decisions based on user-space input.
- the scheduler - based on 'nice' setting it decided who should run when
- the VM - based on read-ahead settings, madvise/fadvise, recent-use
heuristics, it tries to decide what to keep in memory and what to swap out.
I think those are the main ones. There are other smaller fish like the
choice of IO scheduler and various ways to tune network connections.

But the two big ones are perfect examples of subsystems that have proven very
hard to get *right*, and have been substantially re-written more than once.
In each case, the difficulty wasn't how to perform the work, it was the
choice of what work to perform. It probably also involved getting different
sorts of information about the current state.

That perspective leaves me very sceptical of any design that involves making
policy decisions in the kernel. It is too easy to get wrong, then too hard
to change.
Admittedly the power subsystem does seem to make policy decisions in the
kernel, via the various governors. Though I don't know much about how these
work, it seems significant that there is a pluggable infrastructure with
multiple governors, and one of them leaves the decisions to user-space.

So that is what I see as wrong with the android API : it doesn't bring unity
by simply leveraging existing infrastructure, and it makes policy decisions
in the kernel.

>
> > It is a pity that this extra requirement was not clear from your introduction
> > to the "Opportunistic suspend support" patch.
>
> I think that the main problem was that _all_ the requirements were
> not communicated well. That caused everybody to think that their
> solution would be a better fit. You are not alone.
>
> > If that be the case, I'll stop bothering you with suggestions that can never
> > work.
> > Thanks for your time,
> > NeilBrown
>
> Don't be frustrated. What should Arve be? :)
>

Sometimes appearing frustrated can elicit a different style of response to
appearing polite and constructive... can be helpful.
And yes: I fully understand that Arve would be frustrated. There seems to
be a big disconnect in perceptions of what problem is trying to be solved, and
thus disconnects in what the solution should look like, and I suspect that
would be very frustrating all around.

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Neil Brown on 2 Jun 2010 23:30

On Wed, 2 Jun 2010 19:44:59 -0700
Arve Hjønnevåg <arve(a)android.com> wrote:

> On Wed, Jun 2, 2010 at 4:32 PM, Dmitry Torokhov
> <dmitry.torokhov(a)gmail.com> wrote:
> > On Wed, Jun 02, 2010 at 09:05:21PM +0200, Florian Mickler wrote:
> >> On Wed, 2 Jun 2010 21:02:24 +1000
> >> Neil Brown <neilb(a)suse.de> wrote:
> >> >
> >> > And this decision (to block suspend) really needs to be made in the driver,
> >> > not in userspace?
> >>
> >> Well, it fits. The requirement is a direct consequence of the intimate
> >> knowledge the driver has about the driven devices.
> >
> > That is not really true. A driver does have intimate knowledge of the
> > device, however it does not necessarily have an idea about the data read
> > from the device. Consider the gpio_matrix driver: Arve says that it has
> > to continue scanning matrix once first interrupt arrvies. But it really
> > depends on what key has been pressed - if user pressed KEY_SUSPEND or
> > KEY_POWER it cmight be better if we did not wait for key release but
> > initiated the action right away. The decision on how system reacts to a
> > key press does not belong to the driver but really to userspace.
>
> If we just suspend while the keypad scan is active, a second press of
> KEY_POWER would not be able wake the device up. The gpio keypad matrix
> driver has two operating modes. No keys are pressed: all the outputs
> are active so a key press will generate an interrupt in one of the
> inputs. Keys are pressed: Activate a single output at a time so we can
> determine which keys are pressed. The second mode is entered when the
> driver gets an interrupt in the first mode. The first mode is entered
> if the scan detected that no keys are pressed. The driver could be
> modified to stop the scan on suspend and forcibly put the device back
> in no-keys-pressed state, but pressing additional keys connected to
> the same input can no longer generate any events (the input does not
> change).
>

Thanks for the detailed explanation. That helps a lot.

I see that if follows that performing a suspend while keys are pressed would
not be good, and keys could be pressed for arbitrarily long periods.
It does not follow that the keypad driver should therefore explicitly disable
suspend. It could simply inform user-space that the keypad is in the
alternate scan mode, so user-space can choose not to enter suspend in at that
time (i.e. policy in user-space).

I can see also how one might want to express this behaviour as a PM_QOS
constraint (now that I have read a bit about PM_QOS), but I cannot see that
you would need to. As the keypad driver is polling during this mode, it
would presumably set a timer that would fire in the near future, and the very
existence of this timer (with a duration shorter than the S3 latency) would
be enough to ensure S3 wasn't automatically entered by PM.
At most you might use set_timer_slack to give a slightly higher upper bound
to the timeout.

So if we take the suspend-from-idle approach, this driver needs no
annotation, and if we take the 'fix the suspend/wake-event race' then this
driver needs no special treatment - just enough to close the race, and some
user-space policy support.

i.e. it doesn't seem to be a special case.

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Neil Brown on 3 Jun 2010 02:40

On Wed, 2 Jun 2010 11:05:18 -0700
Brian Swetland <swetland(a)google.com> wrote:

> On Wed, Jun 2, 2010 at 1:06 AM, Neil Brown <neilb(a)suse.de> wrote:
> > On Wed, 2 Jun 2010 00:05:14 -0700
> > Arve Hjønnevåg <arve(a)android.com> wrote:
> >> > The user-space suspend daemon avoids losing wake-events by using
> >> > fcntl(F_OWNER) to ensure it gets a signal whenever any important wake-event
> >> > is ready to be read by user-space. This may involve:
> >> > - the one daemon processing all wake events
> >>
> >> Wake up events are not all processed by one daemon.
> >
> > Not with your current user-space code, no. Are you saying that you are not
> > open to any significant change in the Android user-space code? That would
> > make the situation a lot harder to resolve.
>
> There are many wakeup events possible in a typical system --
> keypresses or other input events, network traffic, telephony events,
> media events (fill audio buffer, fill video decoder buffer, etc), and
> I think requiring that all wakeup event processing bottleneck through
> a single userspace process is non-optimal here.

Just to be clear: I'm not suggesting all wake-events need to go through one
process. That was just one example of how the interface I proposed could be
used. There were two other examples.
However one process would need to know about any wakeup event that happens.
I don't think that needs to be a significant bottleneck, but I don't really
know enough about all the requirement to try devising a demonstration.

>
> The current suspend-blocker proposal already involves userspace
> changes (it's different than our existing wakelock interface), and
> we're certainly not opposed to any/all userspace changes on principle,
> but on the other hand we're not interested in significant reworks of
> userspace unless they actually improve the situation somehow. I think
> bottlenecking events through a central daemon would represent a step
> backwards.

I guess it becomes an question of economics for you then. Does the cost of
whatever user-space changes are required exceed the value of using an upstream
kernel? Both the cost and the value would be very hard to estimate in
advance. I don't envy you the decision...

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev |
Pages: 1 2
Prev: p54spi: replace internal "cx3110x" name with "p54spi"
Next: [git pull] Please pull powerpc.git merge branch