From: Rafael J. Wysocki on
On Friday 11 December 2009, Alan Stern wrote:
> Up front: This is my personal view of the matter. Which probably isn't
> of much interest to anybody, so I won't bother to defend these views or
> comment any further on them. The decision about what version to use is
> up to the two of you. The fact is, either implementation would get the
> job done.
>
> On Thu, 10 Dec 2009, Linus Torvalds wrote:
>
> > Completions really are "locks that were initialized to locked". That is,
> > in fact, how completions came to be: we literally used to use semaphores
> > for them, and the reason for completions is literally the magic lifetime
> > rules they have.
> >
> > So when you do
> >
> > INIT_COMPLETION(dev->power.completion);
> >
> > that really is historically, logically, and conceptually exactly the same
> > thing as initializing a lock to the locked state. We literally used to do
> > it with the equivalent of
> >
> > init_MUTEX_LOCKED()
> >
> > way back when (well, except we didn't have mutexes back then, we had only
> > counting semaphores) and instead of "complete()", we had "up()" on the
> > semaphore to complete it.
>
> You think of it that way because you have been closely involved in the
> development of the various kinds of locks. Speaking as an outsider who
> has relatively little interest in the internal details, completions
> appear simpler than rwsems. Mostly because they have a smaller API:
> complete() (or complete_all()) and wait_for_completion() as opposed to
> down_read(), up_read(), down_write(), and up_write().

Agreed.

> > > Besides, suppose a device driver wants some off-tree constraints to be
> > > satisfied.
> >
> > .. and I've told you several times that we should simply not do such
> > devices asynchronously. At least not unless there is some _overriding_
> > reason to. And so far, nobody has suggested anything even remotely
> > likely for that.
>
> Agreed. The fact that async non-tree suspend constraints are difficult
> with rwsems isn't a drawback if nobody needs to use them.

Well, see my reply to Linus. The only thing that bothers me is that if we use
rwsems, there's no way to handle that even if it turns out that someone
needs them after all.

> > > Well, why actually do we need to preserve the state of the data structure from
> > > one cycle to another? There's no need whatsoever.
> >
> > My point is, with locks, none of that is necessary. Because they
> > automatically do the right thing.
> >
> > By picking the right concept, you don't have any of those "oh, we need to
> > re-initialize things" issues. They just work.
>
> That's true, but it's not entirely clear. There are subtle questions
> about what happens if you stop in the middle or a device gets
> unregistered or registered in the middle. They require careful thought
> in both approaches.
>
> Having to reinitialize a completion each time doesn't bother me. It's
> merely an indication that each suspend & resume is independent of all
> the others.

YES!

> > > I still don't think there are many places where locks are used in a way you're
> > > suggesting. I would even say it's quite unusual to use locks this way.
> >
> > See above. It's what completions _are_.
>
> This is almost a philosophical issue. If each A_i must wait for some
> B_j's, is the onus on each A_i to test the B_j's it's interested in?
> Or is the onus on each B_j to tell the A_i's waiting for it that they
> may proceed? As Humpty-Dumpty said, "The question is which is to be
> master -- that's all".

Agreed.

> > > Well, I guess your point is that the implementation of completions is much
> > > more complicated that we really need, but I'm not sure if that really hurts.
> >
> > No. The implementation of completions is actually pretty simple, exactly
> > because they have that spinlock that is required to protect them.
> >
> > That wasn't the point. The point was that locks are actually the "normal"
> > thing to use.
> >
> > You are arguing as if completions are somehow the simpler model. That's
> > simply not true. Completions are just a _special_case_of_locking_.
>
> Doesn't that make them simpler by definition? Special cases always
> have less to worry about than the general case.

Heh, good point.

> > So why not just use regular locks instead, when it's actually the natural
> > way to do it, and results in simpler code?
>
> Simpler but also more subtle, IMO. If you didn't already know how the
> algorithm worked, figuring it out from the code would be harder with
> rwsems than with completions.

Indeed.

> Partly because of the way readers and
> writers exchange roles in suspend vs. resume, and partly because
> sometimes devices lock themselves and sometimes they lock other
> devices. With completions each device has its own, and each device
> waits for other devices' completions -- easier to keep track of
> mentally.

Agreed again.

> (I still think this whole readers vs. writers thing is a red herring.
> The essential property is that there are two opposing classes of lock
> holders. The fact that multiple writers can't hold the lock at the
> same time whereas multiple readers can is of no importance; the
> algorithm would work just as well if multiple writers _could_ hold the
> lock simultaneously.)
>
> Balancing the additional conceptual complexity of the rwsem approach is
> the conceptual simplicity afforded by not needing to check all the
> children. To me this makes it pretty much a toss-up.

Yup.

Thanks!
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rafael J. Wysocki on
On Friday 11 December 2009, Linus Torvalds wrote:
>
> On Fri, 11 Dec 2009, Rafael J. Wysocki wrote:
> >
> > But fine, say we use the approach based on rwsems and consider suspend and
> > the inner lock. We acquire it using down_write(), because we want to wait for
> > multiple other dirvers. Now, in fact we could do literally
> >
> > down_write(dev->power.rwsem);
> > up_write(dev->power.rwsem);
> >
> > because the lock doesn't really protect anything from anyone. What it does is
> > to prevent _us_ from doing something too early. To me, personally, it's not a
> > usual use of locks.
>
> I agree that it's fairly unusual, but on the other hand, it's unusual only
> because you contrieved it to be.

Whatever. The very fact that you can freely move the up_write() (as long as
it's after the down_write()) is fairly unusual.

> But think about it this way: if you could abort a failed suspend, and
> start resuming devices immediately, without doing that
> "async_synchronize_full()" in between - simply because you know that the
> node locking itself will just "do the right thing".

I'd rather not. :-)

> To me, that's a sign of a _good_ design. Using a rwsem is simply just more
> robust and natural for the problem in question. Exactly because it's a
> real lock.
....
> Solve the problem at hand _first_. Solve it as simply as you can. And hope
> that you never ever will need anything more complex.

Below is a patch I've just tested, but there's a lockdep problem in it I don't
know how to solve. Namely, lockdep is apparently unhappy with us not releasing
the lock taken in device_suspend() and it complains we take it twice in a row
(which we do, but for another device). I need to use down_read_non_owner()
to make it shut up and then I also need to use up_read_non_owner() in
__device_suspend(), although there's the comment in include/linux/rwsem.h
saying exatly this about that:

/*
* Take/release a lock when not the owner will release it.
*
* [ This API should be avoided as much as possible - the
* proper abstraction for this case is completions. ]
*/

(I'd like to know your opinion about that). Yet, that's not all, because next
it complains during resume that __device_resume() releases a lock it didn't
acquire, which it clearly does, but that is intentional. Unfortunately,
there's no up_write_non_owner() ...

So, what am I supposed to do about that?

Rafael


---
drivers/base/power/main.c | 107 +++++++++++++++++++++++++++++++++++++++----
include/linux/device.h | 6 ++
include/linux/pm.h | 3 +
include/linux/resume-trace.h | 7 ++
4 files changed, 114 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -26,6 +26,7 @@
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/timer.h>
+#include <linux/rwsem.h>

/*
* Callbacks for platform drivers to implement.
@@ -412,9 +413,11 @@ struct dev_pm_info {
pm_message_t power_state;
unsigned int can_wakeup:1;
unsigned int should_wakeup:1;
+ unsigned async_suspend:1;
enum dpm_state status; /* Owned by the PM core */
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
+ struct rw_semaphore rwsem;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -25,6 +25,7 @@
#include <linux/resume-trace.h>
#include <linux/rwsem.h>
#include <linux/interrupt.h>
+#include <linux/async.h>

#include "../base.h"
#include "power.h"
@@ -42,6 +43,7 @@
LIST_HEAD(dpm_list);

static DEFINE_MUTEX(dpm_list_mtx);
+static pm_message_t pm_transition;

/*
* Set once the preparation of devices for a PM transition has started, reset
@@ -56,6 +58,7 @@ static bool transition_started;
void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
+ init_rwsem(&dev->power.rwsem);
pm_runtime_init(dev);
}

@@ -381,17 +384,22 @@ void dpm_resume_noirq(pm_message_t state
EXPORT_SYMBOL_GPL(dpm_resume_noirq);

/**
- * device_resume - Execute "resume" callbacks for given device.
+ * __device_resume - Execute "resume" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_resume(struct device *dev, pm_message_t state)
+static int __device_resume(struct device *dev, pm_message_t state)
{
+ struct device *parent = dev->parent;
int error = 0;

TRACE_DEVICE(dev);
TRACE_RESUME(0);

+ /* Wait for the parent's resume to complete, if necessary. */
+ if (parent)
+ down_read_nested(&parent->power.rwsem, SINGLE_DEPTH_NESTING);
+
down(&dev->sem);

if (dev->bus) {
@@ -426,11 +434,41 @@ static int device_resume(struct device *
}
End:
up(&dev->sem);
+ if (parent)
+ up_read(&parent->power.rwsem);
+
+ /* Allow the children to resume now. */
+ up_write(&dev->power.rwsem);

TRACE_RESUME(error);
return error;
}

+static void async_resume(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_resume(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async", error);
+ put_device(dev);
+}
+
+static int device_resume(struct device *dev)
+{
+ /* Prevent the children from resuming before us. */
+ down_write(&dev->power.rwsem);
+
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume, dev);
+ return 0;
+ }
+
+ return __device_resume(dev, pm_transition);
+}
+
/**
* dpm_resume - Execute "resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -444,6 +482,7 @@ static void dpm_resume(pm_message_t stat

INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);

@@ -454,7 +493,7 @@ static void dpm_resume(pm_message_t stat
dev->power.status = DPM_RESUMING;
mutex_unlock(&dpm_list_mtx);

- error = device_resume(dev, state);
+ error = device_resume(dev);

mutex_lock(&dpm_list_mtx);
if (error)
@@ -469,6 +508,7 @@ static void dpm_resume(pm_message_t stat
}
list_splice(&list, &dpm_list);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
}

/**
@@ -584,13 +624,11 @@ static int device_suspend_noirq(struct d
{
int error = 0;

- if (!dev->bus)
- return 0;
-
- if (dev->bus->pm) {
+ if (dev->bus && dev->bus->pm) {
pm_dev_dbg(dev, state, "LATE ");
error = pm_noirq_op(dev, dev->bus->pm, state);
}
+
return error;
}

@@ -623,17 +661,24 @@ int dpm_suspend_noirq(pm_message_t state
}
EXPORT_SYMBOL_GPL(dpm_suspend_noirq);

+static int async_error;
+
/**
* device_suspend - Execute "suspend" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_suspend(struct device *dev, pm_message_t state)
+static int __device_suspend(struct device *dev, pm_message_t state)
{
int error = 0;

+ /* Wait for the suspends of the children to complete, if necessary. */
+ down_write_nested(&dev->power.rwsem, SINGLE_DEPTH_NESTING);
down(&dev->sem);

+ if (async_error)
+ goto End;
+
if (dev->class) {
if (dev->class->pm) {
pm_dev_dbg(dev, state, "class ");
@@ -666,12 +711,50 @@ static int device_suspend(struct device
suspend_report_result(dev->bus->suspend, error);
}
}
+
+ if (!error)
+ dev->power.status = DPM_OFF;
+
End:
up(&dev->sem);
+ up_write(&dev->power.rwsem);
+
+ /* Allow the parent to suspend now. */
+ if (dev->parent)
+ up_read_non_owner(&dev->parent->power.rwsem);

return error;
}

+static void async_suspend(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_suspend(dev, pm_transition);
+ if (error) {
+ pm_dev_err(dev, pm_transition, " async", error);
+ async_error = error;
+ }
+
+ put_device(dev);
+}
+
+static int device_suspend(struct device *dev, pm_message_t state)
+{
+ /* Prevent the parent from suspending before us. */
+ if (dev->parent)
+ down_read_non_owner(&dev->parent->power.rwsem);
+
+ if (dev->power.async_suspend) {
+ get_device(dev);
+ async_schedule(async_suspend, dev);
+ return 0;
+ }
+
+ return __device_suspend(dev, pm_transition);
+}
+
/**
* dpm_suspend - Execute "suspend" callbacks for all non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -683,6 +766,7 @@ static int dpm_suspend(pm_message_t stat

INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.prev);

@@ -697,13 +781,17 @@ static int dpm_suspend(pm_message_t stat
put_device(dev);
break;
}
- dev->power.status = DPM_OFF;
if (!list_empty(&dev->power.entry))
list_move(&dev->power.entry, &list);
put_device(dev);
+ if (async_error)
+ break;
}
list_splice(&list, dpm_list.prev);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
+ if (!error)
+ error = async_error;
return error;
}

@@ -762,6 +850,7 @@ static int dpm_prepare(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
transition_started = true;
+ async_error = 0;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);

Index: linux-2.6/include/linux/resume-trace.h
===================================================================
--- linux-2.6.orig/include/linux/resume-trace.h
+++ linux-2.6/include/linux/resume-trace.h
@@ -6,6 +6,11 @@

extern int pm_trace_enabled;

+static inline int pm_trace_is_enabled(void)
+{
+ return pm_trace_enabled;
+}
+
struct device;
extern void set_trace_device(struct device *);
extern void generate_resume_trace(const void *tracedata, unsigned int user);
@@ -17,6 +22,8 @@ extern void generate_resume_trace(const

#else

+static inline int pm_trace_is_enabled(void) { return 0; }
+
#define TRACE_DEVICE(dev) do { } while (0)
#define TRACE_RESUME(dev) do { } while (0)

Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h
+++ linux-2.6/include/linux/device.h
@@ -472,6 +472,12 @@ static inline int device_is_registered(s
return dev->kobj.state_in_sysfs;
}

+static inline void device_enable_async_suspend(struct device *dev, bool enable)
+{
+ if (dev->power.status == DPM_ON)
+ dev->power.async_suspend = enable;
+}
+
void driver_init(void);

/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rafael J. Wysocki on
On Saturday 12 December 2009, Alan Stern wrote:
> On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
>
> > Below is a patch I've just tested, but there's a lockdep problem in it I don't
> > know how to solve. Namely, lockdep is apparently unhappy with us not releasing
> > the lock taken in device_suspend() and it complains we take it twice in a row
> > (which we do, but for another device). I need to use down_read_non_owner()
> > to make it shut up and then I also need to use up_read_non_owner() in
> > __device_suspend(), although there's the comment in include/linux/rwsem.h
> > saying exatly this about that:
> >
> > /*
> > * Take/release a lock when not the owner will release it.
> > *
> > * [ This API should be avoided as much as possible - the
> > * proper abstraction for this case is completions. ]
> > */
> >
> > (I'd like to know your opinion about that). Yet, that's not all, because next
> > it complains during resume that __device_resume() releases a lock it didn't
> > acquire, which it clearly does, but that is intentional. Unfortunately,
> > there's no up_write_non_owner() ...
>
> Hah! I knew it!
>
> How come lockdep didn't complain earlier? What's different about this
> patch? Only the nesting annotations? Why should adding annotations
> make lockdep less happy?

I'm not sure. Perhaps I made a mistake during the previous tests.

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rafael J. Wysocki on
On Saturday 12 December 2009, Linus Torvalds wrote:
>
> On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
> >
> > I'd like to put it into my tree in this form, if you don't mind.
>
> This version still has a major problem, which is not related to
> completions vs rwsems, but simply to the fact that you wanted to do this
> at the generic device layer level rather than do it at the actual
> low-level suspend/resume level.
>
> Namely that there's no apparent sane way to say "don't wait for children".
>
> PCI bridges that don't suspend at all - or any other device that only
> suspends in the 'suspend_late()' thing, for that matter - don't have any
> reason what-so-ever to wait for children, since they aren't actually
> suspending in the first place. But you make them wait regardless, which
> then serializes things unnecessarily (for example, two unrelated USB
> controllers).

This is a problem that needs to be solved.

One solution that we have discussed on linux-pm is to start a bunch of async
threads searching for async devices that can be suspended and suspending
them (assuming suspend is considered) out of order with respect to dpm_list.
For example, leaf async devices can always be suspended at the same time
regardless of their positions in dpm_list. This way we could get almost the
entire gain resulting from suspending or resuming devices in parallel without
bothering drivers with the problem of dependencies that need to be honoured.

That's something we can add on top of this patch, though, not to complicate
things from the start and it surely requires more discussion.

> And no, making _everything_ be async is _not_ the answer.

I'm not sure what you mean, really.

Speaking of PCI bridges, even though they don't "suspend" in the sense of
being put into low power states or something, we still need to save their
registers on suspend and restore them on resume, and that restore has to
be done before we start to access devices below the bridge.

There are devices with totally null suspend and resume routines that even
the bus type doesn't really handle, but those can be marked as "async" from
the start and they won't really get in the way any more (this creates another
issue to solve, namely that we shouldn't really start a new async thread for
each of them; we have considered that too).

Even if we move that all to drivers, the constraints won't go away and someone
will have to take care of them. Now, since _we_ have problems with reaching
an agreement about how to do it, the driver writers will be even less likely to
figure that out.

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rafael J. Wysocki on
On Saturday 12 December 2009, Rafael J. Wysocki wrote:
> On Saturday 12 December 2009, Linus Torvalds wrote:
> >
> > On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
> > >
....
>
> > And no, making _everything_ be async is _not_ the answer.
>
> I'm not sure what you mean, really.
>
> Speaking of PCI bridges, even though they don't "suspend" in the sense of
> being put into low power states or something, we still need to save their
> registers on suspend and restore them on resume, and that restore has to
> be done before we start to access devices below the bridge.

Of course we restore them at the early stage now so the above remark does't
apply to the patch in question, sorry.

But the one below does.

> Even if we move that all to drivers, the constraints won't go away and someone
> will have to take care of them. Now, since _we_ have problems with reaching
> an agreement about how to do it, the driver writers will be even less likely to
> figure that out.

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/