From: Rafael J. Wysocki on
On Wednesday 09 December 2009, Alan Stern wrote:
> On Wed, 9 Dec 2009, Rafael J. Wysocki wrote:
>
> > On Wednesday 09 December 2009, Alan Stern wrote:
> > > On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> > >
> > > > For completness, below is the full async suspend/resume patch with rwlocks,
> > > > that has been (very slightly) tested and doesn't seem to break things.
> > > >
> > > > [Note to Alan: lockdep doesn't seem to complain about the not annotated nested
> > > > locks.]
> > >
> > > I can't imagine why not. And wouldn't lockdep get confused by the fact
> > > that in the async case, the rwsems are released by a different process
> > > from the one that acquired them?
> >
> > /me looks at the .config
> >
> > I have CONFIG_LOCKDEP_SUPPORT set, is there anything else I need to set
> > in .config?
>
> How about CONFIG_PROVE_LOCKING? If lockdep really does start
> complaining then switching to completions would be a simple way to
> appease it.

Ah, that one is not set. I guess I'll try it later, although I've already
decided to use completions anyway.

....
> > > How about exporting a wait_for_device_to_resume() routine? Drivers
> > > could call it for non-tree resume constraints:
> > >
> > > void wait_for_device_to_resume(struct device *other)
> > > {
> > > down_read(&other->power.rwsem);
> > > up_read(&other->power.rwsem);
> > > }
> > >
> > > Unfortunately there is no equivalent for non-tree suspend constraints.
> >
> > If we use completions, it will be possible to just export something like
> >
> > dpm_wait(dev)
> > {
> > if (dev)
> > wait_for_completion(dev->power.completion);
> > }
> >
> > I think. It appears that will also work for suspend, unless I'm missing
> > something.
>
> It will.

Completions it is, then.

Additionally, I've removed the async support from the _noirq parts and moved
the setting of power.status on suspend to __device_suspend(). The result is
appended.

Rafael


---
drivers/base/power/main.c | 124 ++++++++++++++++++++++++++++++++++++++++---
include/linux/device.h | 6 ++
include/linux/pm.h | 12 ++++
include/linux/resume-trace.h | 7 ++
4 files changed, 143 insertions(+), 6 deletions(-)

Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -26,6 +26,7 @@
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/timer.h>
+#include <linux/completion.h>

/*
* Callbacks for platform drivers to implement.
@@ -412,9 +413,11 @@ struct dev_pm_info {
pm_message_t power_state;
unsigned int can_wakeup:1;
unsigned int should_wakeup:1;
+ unsigned async_suspend:1;
enum dpm_state status; /* Owned by the PM core */
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
+ struct completion completion;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
@@ -508,6 +511,13 @@ extern void __suspend_report_result(cons
__suspend_report_result(__func__, fn, ret); \
} while (0)

+extern int __dpm_wait(struct device *dev, void *ign);
+
+static inline void dpm_wait(struct device *dev)
+{
+ __dpm_wait(dev, NULL);
+}
+
#else /* !CONFIG_PM_SLEEP */

#define device_pm_lock() do {} while (0)
@@ -520,6 +530,8 @@ static inline int dpm_suspend_start(pm_m

#define suspend_report_result(fn, ret) do {} while (0)

+static inline void dpm_wait(struct device *dev) {}
+
#endif /* !CONFIG_PM_SLEEP */

/* How to reorder dpm_list after device_move() */
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -25,6 +25,7 @@
#include <linux/resume-trace.h>
#include <linux/rwsem.h>
#include <linux/interrupt.h>
+#include <linux/async.h>

#include "../base.h"
#include "power.h"
@@ -42,6 +43,7 @@
LIST_HEAD(dpm_list);

static DEFINE_MUTEX(dpm_list_mtx);
+static pm_message_t pm_transition;

/*
* Set once the preparation of devices for a PM transition has started, reset
@@ -56,6 +58,7 @@ static bool transition_started;
void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
+ init_completion(&dev->power.completion);
pm_runtime_init(dev);
}

@@ -162,6 +165,39 @@ void device_pm_move_last(struct device *
}

/**
+ * __dpm_wait - Wait for a PM operation to complete.
+ * @dev: Device to wait for.
+ * @ign: This value is not used by the function.
+ */
+int __dpm_wait(struct device *dev, void *ign)
+{
+ if (dev)
+ wait_for_completion(&dev->power.completion);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(__dpm_wait);
+
+static void dpm_wait_for_children(struct device *dev)
+{
+ device_for_each_child(dev, NULL, __dpm_wait);
+}
+
+/**
+ * dpm_synchronize - Wait for PM callbacks of all devices to complete.
+ */
+static void dpm_synchronize(void)
+{
+ struct device *dev;
+
+ async_synchronize_full();
+
+ mutex_lock(&dpm_list_mtx);
+ list_for_each_entry(dev, &dpm_list, power.entry)
+ INIT_COMPLETION(dev->power.completion);
+ mutex_unlock(&dpm_list_mtx);
+}
+
+/**
* pm_op - Execute the PM operation appropriate for given PM event.
* @dev: Device to handle.
* @ops: PM operations to choose from.
@@ -381,17 +417,18 @@ void dpm_resume_noirq(pm_message_t state
EXPORT_SYMBOL_GPL(dpm_resume_noirq);

/**
- * device_resume - Execute "resume" callbacks for given device.
+ * __device_resume - Execute "resume" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_resume(struct device *dev, pm_message_t state)
+static int __device_resume(struct device *dev, pm_message_t state)
{
int error = 0;

TRACE_DEVICE(dev);
TRACE_RESUME(0);

+ dpm_wait(dev->parent);
down(&dev->sem);

if (dev->bus) {
@@ -426,11 +463,34 @@ static int device_resume(struct device *
}
End:
up(&dev->sem);
+ complete_all(&dev->power.completion);

TRACE_RESUME(error);
return error;
}

+static void async_resume(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_resume(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async", error);
+ put_device(dev);
+}
+
+static int device_resume(struct device *dev)
+{
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume, dev);
+ return 0;
+ }
+
+ return __device_resume(dev, pm_transition);
+}
+
/**
* dpm_resume - Execute "resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -444,6 +504,7 @@ static void dpm_resume(pm_message_t stat

INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);

@@ -454,7 +515,7 @@ static void dpm_resume(pm_message_t stat
dev->power.status = DPM_RESUMING;
mutex_unlock(&dpm_list_mtx);

- error = device_resume(dev, state);
+ error = device_resume(dev);

mutex_lock(&dpm_list_mtx);
if (error)
@@ -469,6 +530,7 @@ static void dpm_resume(pm_message_t stat
}
list_splice(&list, &dpm_list);
mutex_unlock(&dpm_list_mtx);
+ dpm_synchronize();
}

/**
@@ -533,6 +595,8 @@ static void dpm_complete(pm_message_t st
mutex_unlock(&dpm_list_mtx);
}

+static atomic_t async_error;
+
/**
* dpm_resume_end - Execute "resume" callbacks and complete system transition.
* @state: PM transition of the system being carried out.
@@ -628,10 +692,11 @@ EXPORT_SYMBOL_GPL(dpm_suspend_noirq);
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_suspend(struct device *dev, pm_message_t state)
+static int __device_suspend(struct device *dev, pm_message_t state)
{
int error = 0;

+ dpm_wait_for_children(dev);
down(&dev->sem);

if (dev->class) {
@@ -666,12 +731,50 @@ static int device_suspend(struct device
suspend_report_result(dev->bus->suspend, error);
}
}
+
+ if (!error)
+ dev->power.status = DPM_OFF;
+
End:
up(&dev->sem);
+ complete_all(&dev->power.completion);

return error;
}

+static void async_suspend(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error = atomic_read(&async_error);
+
+ if (error) {
+ complete_all(&dev->power.completion);
+ goto End;
+ }
+
+ error = __device_suspend(dev, pm_transition);
+ if (error) {
+ pm_dev_err(dev, pm_transition, " async", error);
+ atomic_set(&async_error, error);
+ }
+
+ End:
+ put_device(dev);
+}
+
+static int device_suspend(struct device *dev, pm_message_t state)
+{
+ int error;
+
+ if (dev->power.async_suspend) {
+ get_device(dev);
+ async_schedule(async_suspend, dev);
+ return 0;
+ }
+
+ return __device_suspend(dev, pm_transition);
+}
+
/**
* dpm_suspend - Execute "suspend" callbacks for all non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -683,6 +786,7 @@ static int dpm_suspend(pm_message_t stat

INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.prev);

@@ -697,13 +801,18 @@ static int dpm_suspend(pm_message_t stat
put_device(dev);
break;
}
- dev->power.status = DPM_OFF;
if (!list_empty(&dev->power.entry))
list_move(&dev->power.entry, &list);
put_device(dev);
+ error = atomic_read(&async_error);
+ if (error)
+ break;
}
list_splice(&list, dpm_list.prev);
mutex_unlock(&dpm_list_mtx);
+ dpm_synchronize();
+ if (!error)
+ error = atomic_read(&async_error);
return error;
}

@@ -762,6 +871,7 @@ static int dpm_prepare(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
transition_started = true;
+ atomic_set(&async_error, 0);
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);

@@ -793,8 +903,10 @@ static int dpm_prepare(pm_message_t stat
break;
}
dev->power.status = DPM_SUSPENDING;
- if (!list_empty(&dev->power.entry))
+ if (!list_empty(&dev->power.entry)) {
list_move_tail(&dev->power.entry, &list);
+ INIT_COMPLETION(dev->power.completion);
+ }
put_device(dev);
}
list_splice(&list, &dpm_list);
Index: linux-2.6/include/linux/resume-trace.h
===================================================================
--- linux-2.6.orig/include/linux/resume-trace.h
+++ linux-2.6/include/linux/resume-trace.h
@@ -6,6 +6,11 @@

extern int pm_trace_enabled;

+static inline int pm_trace_is_enabled(void)
+{
+ return pm_trace_enabled;
+}
+
struct device;
extern void set_trace_device(struct device *);
extern void generate_resume_trace(const void *tracedata, unsigned int user);
@@ -17,6 +22,8 @@ extern void generate_resume_trace(const

#else

+static inline int pm_trace_is_enabled(void) { return 0; }
+
#define TRACE_DEVICE(dev) do { } while (0)
#define TRACE_RESUME(dev) do { } while (0)

Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h
+++ linux-2.6/include/linux/device.h
@@ -472,6 +472,12 @@ static inline int device_is_registered(s
return dev->kobj.state_in_sysfs;
}

+static inline void device_enable_async_suspend(struct device *dev, bool enable)
+{
+ if (dev->power.status == DPM_ON)
+ dev->power.async_suspend = enable;
+}
+
void driver_init(void);

/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rafael J. Wysocki on
On Thursday 10 December 2009, Linus Torvalds wrote:
>
> On Thu, 10 Dec 2009, Rafael J. Wysocki wrote:
> >
> > Completions it is, then.
>
> What was so hard with the "Try the simple one first" to understand? You
> had a simpler working patch, why are you making this more complex one
> without ever having had any problems with the simpler one?

OK, why don't you just say you won't merge anything that doesn't use rwsems
(although you said before that completions would be fine with you)? That would
make things clear, but also it would mean we gave up handling the off-tree
dependencies in general.

> Btw, your 'atomic_set()' with errors is pure voodoo programming. That's
> not how atomics work. They do SMP-atomic addition etc, the 'atomic_set()'
> and 'atomic_read()' things are not in any way more atomic than any other
> access.
>
> They are meant for racy reads (atomic_read()) and for initializations
> (atomic_set()), and the way you use them that 'atomic' part is entirely
> pointless, because it really isn't anything different from an 'int',
> except that it may be very very expensive on some architectures due to
> hashed spinlocks etc.
>
> So stop this overdesign thing. Start simple. If you _ever_ see real
> problems, that's when you add stuff. As it is, any time you add
> complexity, you just add bugs.

OK, so that need not be atomic.

> > +/**
> > + * dpm_synchronize - Wait for PM callbacks of all devices to complete.
> > + */
> > +static void dpm_synchronize(void)
> > +{
> > + struct device *dev;
> > +
> > + async_synchronize_full();
> > +
> > + mutex_lock(&dpm_list_mtx);
> > + list_for_each_entry(dev, &dpm_list, power.entry)
> > + INIT_COMPLETION(dev->power.completion);
> > + mutex_unlock(&dpm_list_mtx);
> > +}
>
> And this, for example, is pretty disgusting. Not only is that
> INIT_COMPLETION purely brought on by the whole problem with completions
> (they are fundamentally one-shot, but you want to use them over and over

Actually, twice. However, since I don't want to do any async handling in the
_noirq phases any more, I can get rid of this whole function. Thanks for
pointing that out to me.

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rafael J. Wysocki on
On Thursday 10 December 2009, Alan Stern wrote:
> On Thu, 10 Dec 2009, Rafael J. Wysocki wrote:
>
> > > How about CONFIG_PROVE_LOCKING? If lockdep really does start
> > > complaining then switching to completions would be a simple way to
> > > appease it.
> >
> > Ah, that one is not set. I guess I'll try it later, although I've already
> > decided to use completions anyway.
>
> You should see how badly lockdep complains about the rwsems. If it
> really doesn't like them then using completions makes sense.

It does complain about them, but when the nested _down operations are marked
as nested, it stops complaining (that's in the version where there's no async
in the _noirq phases).

> > Index: linux-2.6/drivers/base/power/main.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/base/power/main.c
> > +++ linux-2.6/drivers/base/power/main.c
> > @@ -56,6 +58,7 @@ static bool transition_started;
> > void device_pm_init(struct device *dev)
> > {
> > dev->power.status = DPM_ON;
> > + init_completion(&dev->power.completion);
> > pm_runtime_init(dev);
> > }
>
> You need a matching complete_all() in device_pm_remove(), in case
> someone else is waiting for the device when it gets unregistered.

Right, added.

> > +/**
> > + * dpm_synchronize - Wait for PM callbacks of all devices to complete.
> > + */
> > +static void dpm_synchronize(void)
> > +{
> > + struct device *dev;
> > +
> > + async_synchronize_full();
> > +
> > + mutex_lock(&dpm_list_mtx);
> > + list_for_each_entry(dev, &dpm_list, power.entry)
> > + INIT_COMPLETION(dev->power.completion);
> > + mutex_unlock(&dpm_list_mtx);
> > +}
>
> I agree with Linus, initializing the completions here is weird. You
> should initialize them just before using them.

I removed that completely and now the INIT_COMPLETION() is always done in the
preceding phase.

> > @@ -683,6 +786,7 @@ static int dpm_suspend(pm_message_t stat
> >
> > INIT_LIST_HEAD(&list);
> > mutex_lock(&dpm_list_mtx);
> > + pm_transition = state;
> > while (!list_empty(&dpm_list)) {
> > struct device *dev = to_device(dpm_list.prev);
> >
> > @@ -697,13 +801,18 @@ static int dpm_suspend(pm_message_t stat
> > put_device(dev);
> > break;
> > }
> > - dev->power.status = DPM_OFF;
> > if (!list_empty(&dev->power.entry))
> > list_move(&dev->power.entry, &list);
> > put_device(dev);
> > + error = atomic_read(&async_error);
> > + if (error)
> > + break;
> > }
> > list_splice(&list, dpm_list.prev);
>
> Here's something you might want to do in a later patch. These awkward
> list-pointer manipulations can be simplified as follows:

Well, I'm not sure if that's more straightforward.

Anyway, as you said, that's something for a different patch. :-)

Below is an updated version of the $subject one. I don't use the atomic_t for
async_error any more and (apart from this fixed issue) I don't see any problems
in the suspend error path now.

Rafael


---
drivers/base/power/main.c | 113 ++++++++++++++++++++++++++++++++++++++++---
include/linux/device.h | 6 ++
include/linux/pm.h | 12 ++++
include/linux/resume-trace.h | 7 ++
4 files changed, 131 insertions(+), 7 deletions(-)

Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -26,6 +26,7 @@
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/timer.h>
+#include <linux/completion.h>

/*
* Callbacks for platform drivers to implement.
@@ -412,9 +413,11 @@ struct dev_pm_info {
pm_message_t power_state;
unsigned int can_wakeup:1;
unsigned int should_wakeup:1;
+ unsigned async_suspend:1;
enum dpm_state status; /* Owned by the PM core */
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
+ struct completion completion;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
@@ -508,6 +511,13 @@ extern void __suspend_report_result(cons
__suspend_report_result(__func__, fn, ret); \
} while (0)

+extern int __dpm_wait(struct device *dev, void *ign);
+
+static inline void dpm_wait(struct device *dev)
+{
+ __dpm_wait(dev, NULL);
+}
+
#else /* !CONFIG_PM_SLEEP */

#define device_pm_lock() do {} while (0)
@@ -520,6 +530,8 @@ static inline int dpm_suspend_start(pm_m

#define suspend_report_result(fn, ret) do {} while (0)

+static inline void dpm_wait(struct device *dev) {}
+
#endif /* !CONFIG_PM_SLEEP */

/* How to reorder dpm_list after device_move() */
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -25,6 +25,7 @@
#include <linux/resume-trace.h>
#include <linux/rwsem.h>
#include <linux/interrupt.h>
+#include <linux/async.h>

#include "../base.h"
#include "power.h"
@@ -42,6 +43,7 @@
LIST_HEAD(dpm_list);

static DEFINE_MUTEX(dpm_list_mtx);
+static pm_message_t pm_transition;

/*
* Set once the preparation of devices for a PM transition has started, reset
@@ -56,6 +58,7 @@ static bool transition_started;
void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
+ init_completion(&dev->power.completion);
pm_runtime_init(dev);
}

@@ -111,6 +114,7 @@ void device_pm_remove(struct device *dev
pr_debug("PM: Removing info for %s:%s\n",
dev->bus ? dev->bus->name : "No Bus",
kobject_name(&dev->kobj));
+ complete_all(&dev->power.completion);
mutex_lock(&dpm_list_mtx);
list_del_init(&dev->power.entry);
mutex_unlock(&dpm_list_mtx);
@@ -162,6 +166,24 @@ void device_pm_move_last(struct device *
}

/**
+ * __dpm_wait - Wait for a PM operation to complete.
+ * @dev: Device to wait for.
+ * @ign: This value is not used by the function.
+ */
+int __dpm_wait(struct device *dev, void *ign)
+{
+ if (dev)
+ wait_for_completion(&dev->power.completion);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(__dpm_wait);
+
+static void dpm_wait_for_children(struct device *dev)
+{
+ device_for_each_child(dev, NULL, __dpm_wait);
+}
+
+/**
* pm_op - Execute the PM operation appropriate for given PM event.
* @dev: Device to handle.
* @ops: PM operations to choose from.
@@ -366,7 +388,7 @@ void dpm_resume_noirq(pm_message_t state

mutex_lock(&dpm_list_mtx);
transition_started = false;
- list_for_each_entry(dev, &dpm_list, power.entry)
+ list_for_each_entry(dev, &dpm_list, power.entry) {
if (dev->power.status > DPM_OFF) {
int error;

@@ -375,23 +397,27 @@ void dpm_resume_noirq(pm_message_t state
if (error)
pm_dev_err(dev, state, " early", error);
}
+ /* Needed by the subsequent dpm_resume(). */
+ INIT_COMPLETION(dev->power.completion);
+ }
mutex_unlock(&dpm_list_mtx);
resume_device_irqs();
}
EXPORT_SYMBOL_GPL(dpm_resume_noirq);

/**
- * device_resume - Execute "resume" callbacks for given device.
+ * __device_resume - Execute "resume" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_resume(struct device *dev, pm_message_t state)
+static int __device_resume(struct device *dev, pm_message_t state)
{
int error = 0;

TRACE_DEVICE(dev);
TRACE_RESUME(0);

+ dpm_wait(dev->parent);
down(&dev->sem);

if (dev->bus) {
@@ -426,11 +452,34 @@ static int device_resume(struct device *
}
End:
up(&dev->sem);
+ complete_all(&dev->power.completion);

TRACE_RESUME(error);
return error;
}

+static void async_resume(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_resume(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async", error);
+ put_device(dev);
+}
+
+static int device_resume(struct device *dev)
+{
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume, dev);
+ return 0;
+ }
+
+ return __device_resume(dev, pm_transition);
+}
+
/**
* dpm_resume - Execute "resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -444,6 +493,7 @@ static void dpm_resume(pm_message_t stat

INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);

@@ -454,7 +504,7 @@ static void dpm_resume(pm_message_t stat
dev->power.status = DPM_RESUMING;
mutex_unlock(&dpm_list_mtx);

- error = device_resume(dev, state);
+ error = device_resume(dev);

mutex_lock(&dpm_list_mtx);
if (error)
@@ -469,6 +519,7 @@ static void dpm_resume(pm_message_t stat
}
list_splice(&list, &dpm_list);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
}

/**
@@ -623,15 +674,18 @@ int dpm_suspend_noirq(pm_message_t state
}
EXPORT_SYMBOL_GPL(dpm_suspend_noirq);

+static int async_error;
+
/**
* device_suspend - Execute "suspend" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_suspend(struct device *dev, pm_message_t state)
+static int __device_suspend(struct device *dev, pm_message_t state)
{
int error = 0;

+ dpm_wait_for_children(dev);
down(&dev->sem);

if (dev->class) {
@@ -666,12 +720,48 @@ static int device_suspend(struct device
suspend_report_result(dev->bus->suspend, error);
}
}
+
+ if (!error)
+ dev->power.status = DPM_OFF;
+
End:
up(&dev->sem);
+ complete_all(&dev->power.completion);

return error;
}

+static void async_suspend(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ if (async_error) {
+ complete_all(&dev->power.completion);
+ goto End;
+ }
+
+ error = __device_suspend(dev, pm_transition);
+ if (error) {
+ pm_dev_err(dev, pm_transition, " async", error);
+ async_error = error;
+ }
+
+ End:
+ put_device(dev);
+}
+
+static int device_suspend(struct device *dev, pm_message_t state)
+{
+ if (dev->power.async_suspend) {
+ get_device(dev);
+ async_schedule(async_suspend, dev);
+ return 0;
+ }
+
+ return __device_suspend(dev, pm_transition);
+}
+
/**
* dpm_suspend - Execute "suspend" callbacks for all non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -683,6 +773,7 @@ static int dpm_suspend(pm_message_t stat

INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.prev);

@@ -697,13 +788,17 @@ static int dpm_suspend(pm_message_t stat
put_device(dev);
break;
}
- dev->power.status = DPM_OFF;
if (!list_empty(&dev->power.entry))
list_move(&dev->power.entry, &list);
put_device(dev);
+ if (async_error)
+ break;
}
list_splice(&list, dpm_list.prev);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
+ if (!error)
+ error = async_error;
return error;
}

@@ -762,6 +857,7 @@ static int dpm_prepare(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
transition_started = true;
+ async_error = 0;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);

@@ -793,8 +889,11 @@ static int dpm_prepare(pm_message_t stat
break;
}
dev->power.status = DPM_SUSPENDING;
- if (!list_empty(&dev->power.entry))
+ if (!list_empty(&dev->power.entry)) {
list_move_tail(&dev->power.entry, &list);
+ /* Needed by the subsequent dpm_suspend(). */
+ INIT_COMPLETION(dev->power.completion);
+ }
put_device(dev);
}
list_splice(&list, &dpm_list);
Index: linux-2.6/include/linux/resume-trace.h
===================================================================
--- linux-2.6.orig/include/linux/resume-trace.h
+++ linux-2.6/include/linux/resume-trace.h
@@ -6,6 +6,11 @@

extern int pm_trace_enabled;

+static inline int pm_trace_is_enabled(void)
+{
+ return pm_trace_enabled;
+}
+
struct device;
extern void set_trace_device(struct device *);
extern void generate_resume_trace(const void *tracedata, unsigned int user);
@@ -17,6 +22,8 @@ extern void generate_resume_trace(const

#else

+static inline int pm_trace_is_enabled(void) { return 0; }
+
#define TRACE_DEVICE(dev) do { } while (0)
#define TRACE_RESUME(dev) do { } while (0)

Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h
+++ linux-2.6/include/linux/device.h
@@ -472,6 +472,12 @@ static inline int device_is_registered(s
return dev->kobj.state_in_sysfs;
}

+static inline void device_enable_async_suspend(struct device *dev, bool enable)
+{
+ if (dev->power.status == DPM_ON)
+ dev->power.async_suspend = enable;
+}
+
void driver_init(void);

/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rafael J. Wysocki on
On Thursday 10 December 2009, Alan Stern wrote:
> On Thu, 10 Dec 2009, Rafael J. Wysocki wrote:
>
> > > You should see how badly lockdep complains about the rwsems. If it
> > > really doesn't like them then using completions makes sense.
> >
> > It does complain about them, but when the nested _down operations are marked
> > as nested, it stops complaining (that's in the version where there's no async
> > in the _noirq phases).
>
> Did you set the async_suspend flag for any devices during the test?

Yes. All ACPI, all PCI, all serio, as usual. ;-)

> And did you run more than one suspend/resume cycle?

Sure. Actually, I test it in the /sys/power/pm_test = core mode, but that
shouldn't really matter.

> > +extern int __dpm_wait(struct device *dev, void *ign);
> > +
> > +static inline void dpm_wait(struct device *dev)
> > +{
> > + __dpm_wait(dev, NULL);
> > +}
>
> Sorry, I intended to mention this before but forgot. This design is
> inelegant. You shouldn't have inlines calling functions with extra
> unused arguments; they just waste code space. Make dpm_wait() be a
> real routine and add a shim to the device_for_each_child() loop.

I thought about that myself, done now.

> > @@ -366,7 +388,7 @@ void dpm_resume_noirq(pm_message_t state
> >
> > mutex_lock(&dpm_list_mtx);
> > transition_started = false;
> > - list_for_each_entry(dev, &dpm_list, power.entry)
> > + list_for_each_entry(dev, &dpm_list, power.entry) {
> > if (dev->power.status > DPM_OFF) {
> > int error;
> >
> > @@ -375,23 +397,27 @@ void dpm_resume_noirq(pm_message_t state
> > if (error)
> > pm_dev_err(dev, state, " early", error);
> > }
> > + /* Needed by the subsequent dpm_resume(). */
> > + INIT_COMPLETION(dev->power.completion);
>
> You're still doing it. Don't initialize the completions in a totally
> different phase! Initialize them directly before they are used.
> Namely, at the start of device_resume() and device_suspend().

The idea was to initialize them all at the same time, before entering the
phase in which they were used, but I came to the conclusion that this was not
necessary, because the dpm_list ordering was such that the devices to be waited
for would always have their completions reinitialized before starting
__device_suspend() or __device_resume() for the waiting ones.

> One more thing. A logical time to check for errors is just after
> waiting for the children in __device_suspend(), instead of beforehand
> in async_suspend(). After all, if an error occurs then it's likely to
> happen while we are waiting.

Good idea, done.

Updated patch is appended.

Rafael


---
drivers/base/power/main.c | 106 ++++++++++++++++++++++++++++++++++++++++---
include/linux/device.h | 6 ++
include/linux/pm.h | 7 ++
include/linux/resume-trace.h | 7 ++
4 files changed, 121 insertions(+), 5 deletions(-)

Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -26,6 +26,7 @@
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/timer.h>
+#include <linux/completion.h>

/*
* Callbacks for platform drivers to implement.
@@ -412,9 +413,11 @@ struct dev_pm_info {
pm_message_t power_state;
unsigned int can_wakeup:1;
unsigned int should_wakeup:1;
+ unsigned async_suspend:1;
enum dpm_state status; /* Owned by the PM core */
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
+ struct completion completion;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
@@ -508,6 +511,8 @@ extern void __suspend_report_result(cons
__suspend_report_result(__func__, fn, ret); \
} while (0)

+extern void dpm_wait(struct device *dev);
+
#else /* !CONFIG_PM_SLEEP */

#define device_pm_lock() do {} while (0)
@@ -520,6 +525,8 @@ static inline int dpm_suspend_start(pm_m

#define suspend_report_result(fn, ret) do {} while (0)

+static inline void dpm_wait(struct device *dev) {}
+
#endif /* !CONFIG_PM_SLEEP */

/* How to reorder dpm_list after device_move() */
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -25,6 +25,7 @@
#include <linux/resume-trace.h>
#include <linux/rwsem.h>
#include <linux/interrupt.h>
+#include <linux/async.h>

#include "../base.h"
#include "power.h"
@@ -42,6 +43,7 @@
LIST_HEAD(dpm_list);

static DEFINE_MUTEX(dpm_list_mtx);
+static pm_message_t pm_transition;

/*
* Set once the preparation of devices for a PM transition has started, reset
@@ -56,6 +58,7 @@ static bool transition_started;
void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
+ init_completion(&dev->power.completion);
pm_runtime_init(dev);
}

@@ -111,6 +114,7 @@ void device_pm_remove(struct device *dev
pr_debug("PM: Removing info for %s:%s\n",
dev->bus ? dev->bus->name : "No Bus",
kobject_name(&dev->kobj));
+ complete_all(&dev->power.completion);
mutex_lock(&dpm_list_mtx);
list_del_init(&dev->power.entry);
mutex_unlock(&dpm_list_mtx);
@@ -162,6 +166,28 @@ void device_pm_move_last(struct device *
}

/**
+ * dpm_wait - Wait for a PM operation to complete.
+ * @dev: Device to wait for.
+ */
+void dpm_wait(struct device *dev)
+{
+ if (dev)
+ wait_for_completion(&dev->power.completion);
+}
+EXPORT_SYMBOL_GPL(dpm_wait);
+
+static int dpm_wait_fn(struct device *dev, void *ignore)
+{
+ dpm_wait(dev);
+ return 0;
+}
+
+static void dpm_wait_for_children(struct device *dev)
+{
+ device_for_each_child(dev, NULL, dpm_wait_fn);
+}
+
+/**
* pm_op - Execute the PM operation appropriate for given PM event.
* @dev: Device to handle.
* @ops: PM operations to choose from.
@@ -381,17 +407,18 @@ void dpm_resume_noirq(pm_message_t state
EXPORT_SYMBOL_GPL(dpm_resume_noirq);

/**
- * device_resume - Execute "resume" callbacks for given device.
+ * __device_resume - Execute "resume" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_resume(struct device *dev, pm_message_t state)
+static int __device_resume(struct device *dev, pm_message_t state)
{
int error = 0;

TRACE_DEVICE(dev);
TRACE_RESUME(0);

+ dpm_wait(dev->parent);
down(&dev->sem);

if (dev->bus) {
@@ -426,11 +453,34 @@ static int device_resume(struct device *
}
End:
up(&dev->sem);
+ complete_all(&dev->power.completion);

TRACE_RESUME(error);
return error;
}

+static void async_resume(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_resume(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async", error);
+ put_device(dev);
+}
+
+static int device_resume(struct device *dev)
+{
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume, dev);
+ return 0;
+ }
+
+ return __device_resume(dev, pm_transition);
+}
+
/**
* dpm_resume - Execute "resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -444,6 +494,7 @@ static void dpm_resume(pm_message_t stat

INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);

@@ -451,10 +502,11 @@ static void dpm_resume(pm_message_t stat
if (dev->power.status >= DPM_OFF) {
int error;

+ INIT_COMPLETION(dev->power.completion);
dev->power.status = DPM_RESUMING;
mutex_unlock(&dpm_list_mtx);

- error = device_resume(dev, state);
+ error = device_resume(dev);

mutex_lock(&dpm_list_mtx);
if (error)
@@ -469,6 +521,7 @@ static void dpm_resume(pm_message_t stat
}
list_splice(&list, &dpm_list);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
}

/**
@@ -623,17 +676,23 @@ int dpm_suspend_noirq(pm_message_t state
}
EXPORT_SYMBOL_GPL(dpm_suspend_noirq);

+static int async_error;
+
/**
* device_suspend - Execute "suspend" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_suspend(struct device *dev, pm_message_t state)
+static int __device_suspend(struct device *dev, pm_message_t state)
{
int error = 0;

+ dpm_wait_for_children(dev);
down(&dev->sem);

+ if (async_error)
+ goto End;
+
if (dev->class) {
if (dev->class->pm) {
pm_dev_dbg(dev, state, "class ");
@@ -666,12 +725,42 @@ static int device_suspend(struct device
suspend_report_result(dev->bus->suspend, error);
}
}
+
+ if (!error)
+ dev->power.status = DPM_OFF;
+
End:
up(&dev->sem);
+ complete_all(&dev->power.completion);

return error;
}

+static void async_suspend(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_suspend(dev, pm_transition);
+ if (error) {
+ pm_dev_err(dev, pm_transition, " async", error);
+ async_error = error;
+ }
+
+ put_device(dev);
+}
+
+static int device_suspend(struct device *dev, pm_message_t state)
+{
+ if (dev->power.async_suspend) {
+ get_device(dev);
+ async_schedule(async_suspend, dev);
+ return 0;
+ }
+
+ return __device_suspend(dev, pm_transition);
+}
+
/**
* dpm_suspend - Execute "suspend" callbacks for all non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -683,10 +772,12 @@ static int dpm_suspend(pm_message_t stat

INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.prev);

get_device(dev);
+ INIT_COMPLETION(dev->power.completion);
mutex_unlock(&dpm_list_mtx);

error = device_suspend(dev, state);
@@ -697,13 +788,17 @@ static int dpm_suspend(pm_message_t stat
put_device(dev);
break;
}
- dev->power.status = DPM_OFF;
if (!list_empty(&dev->power.entry))
list_move(&dev->power.entry, &list);
put_device(dev);
+ if (async_error)
+ break;
}
list_splice(&list, dpm_list.prev);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
+ if (!error)
+ error = async_error;
return error;
}

@@ -762,6 +857,7 @@ static int dpm_prepare(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
transition_started = true;
+ async_error = 0;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);

Index: linux-2.6/include/linux/resume-trace.h
===================================================================
--- linux-2.6.orig/include/linux/resume-trace.h
+++ linux-2.6/include/linux/resume-trace.h
@@ -6,6 +6,11 @@

extern int pm_trace_enabled;

+static inline int pm_trace_is_enabled(void)
+{
+ return pm_trace_enabled;
+}
+
struct device;
extern void set_trace_device(struct device *);
extern void generate_resume_trace(const void *tracedata, unsigned int user);
@@ -17,6 +22,8 @@ extern void generate_resume_trace(const

#else

+static inline int pm_trace_is_enabled(void) { return 0; }
+
#define TRACE_DEVICE(dev) do { } while (0)
#define TRACE_RESUME(dev) do { } while (0)

Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h
+++ linux-2.6/include/linux/device.h
@@ -472,6 +472,12 @@ static inline int device_is_registered(s
return dev->kobj.state_in_sysfs;
}

+static inline void device_enable_async_suspend(struct device *dev, bool enable)
+{
+ if (dev->power.status == DPM_ON)
+ dev->power.async_suspend = enable;
+}
+
void driver_init(void);

/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Rafael J. Wysocki on
On Friday 11 December 2009, Linus Torvalds wrote:
>
> On Thu, 10 Dec 2009, Rafael J. Wysocki wrote:
....
>
> IOW, I'll happily take the completions version, but dammit, I refuse to
> take it when there is a simpler approach that does NOT need to iterate,
> and does NOT need to re-initialize the data structures each round etc.

I don't think it really is that simple. For example, the fact that the outer
lock has to be taken by one thread and released by another is not exactly
straightforward. [One might ask what's the critical section in this case.]

Besides, suppose a device driver wants some off-tree constraints to be
satisfied. What's the driver writer supposed to do? He only can lock the
other device, but that will cause lockdep to complain, because this lock
is going to be nested. Moreover, it's already too late, because his async
thread has started and there's no guarantee that the other device hasn't
acquired its rwsem yet.

With completions, the driver doesn't have to take any action to prevent another
one from suspending too early. Instead, the other one has to wait for its
suspend to complete, and for me personally this is a much more natural thing
to do. IOW, if I were a driver writed, I'd probably prefer to wait on a
completion than to use a lock in a tricky manner.

> That's what I've been arguing against the whole time. It started as
> arguing against complex and unnecessary infrastructure, and trying to show
> that it _can_ be done so much simpler using existing basic locking.
>
> And I get annoyed when you guys continually seem to want to make it more
> complex than it needs to be.
>
> > > And this, for example, is pretty disgusting. Not only is that
> > > INIT_COMPLETION purely brought on by the whole problem with completions
> > > (they are fundamentally one-shot, but you want to use them over and over
> >
> > Actually, twice. However, since I don't want to do any async handling in the
> > _noirq phases any more, I can get rid of this whole function. Thanks for
> > pointing that out to me.
>
> Well, my point was that you'll need to do that
>
> INIT_COMPLETION(dev->power.completion);
>
> thing each suspend and each resume. Exactly because completions are
> designed to be "onw-way" things, so you end up having to reset them each
> cycle (you just reset them even _more_ than you needed).

Well, why actually do we need to preserve the state of the data structure from
one cycle to another? There's no need whatsoever.

> Again, my point was that using locks is actually a very _natural_ thing to
> do. I really don't understand what problems you and Alan have with just
> using locks - we have way more locks in the kernel than we have
> completions, so they are the "default" thing to do, and they really are
> very natural to use.
>
> [ Ok, so admittedly the actual use of 'struct rw_semaphore' is pretty
> unusual, but my point is that people are used to locking semantics in
> general, more so than the semantics of completions ]

I still don't think there are many places where locks are used in a way you're
suggesting. I would even say it's quite unusual to use locks this way.

> Completions were literally designed to be used for one-off things - one of
> the most common uses is that the 'struct completion' is on the _stack_. It
> doesn't get much more one-off than that - and the completions are really
> very explicitly designed so that you can do a 'complete()' on something
> that will literally disappear from under you as you do it (because the
> struct completion might be on the stack of the thing that is waiting for
> it, and gets de-allocated when the waiter goes ahead).

We could literally throw away a completion after all of the potentially waiting
threads have finished their operations and then allocate it back again when
necessary. We only need the synchronization in this particular phase of
suspend or resume and it doesn't need to extend to the other phases or other
cycles, because all of the concurrent threads we need to synchronize will
only live during this one particular phase of suspend or resume. They will
all exit when it's finished anyway.

> That is why 'wait_for_completion()' always has to take the spinlock, for
> example - there is no fastpath for completion, because the races for the
> waiter releasing things too early are too nasty.
>
> So completions are actually very subtle things - and you don't need any of
> that subtlety. I realize that from a user perspective, completions look
> very simple, but in many ways they actually have subtler semantics than a
> regular lock has.

Well, I guess your point is that the implementation of completions is much
more complicated that we really need, but I'm not sure if that really hurts.

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/