From: Rafael J. Wysocki on
Hi,

One of the arguments during the suspend blockers discussion was that the
mainline kernel didn't contain any mechanisms allowing it to avoid losing
wakeup events during system suspend.

Generally, there are two problems in that area. First, if a wakeup event
occurs exactly at the same time when /sys/power/state is being written to,
the even may be delivered to user space right before the freezing of it,
in which case the user space consumer of the event may not be able to process
it before the system is suspended. Second, if a wakeup event occurs after user
space has been frozen and that event is not a wakeup interrupt, the kernel will
not react to it and the system will be suspended.

The following patch illustrates my idea of how these two problems may be
addressed. It introduces a new global sysfs attribute,
/sys/power/wakeup_count, associated with a running counter of wakeup events
and a helper function, pm_wakeup_event(), that may be used by kernel subsystems
to increment the wakeup events counter.

/sys/power/wakeup_count may be read from or written to by user space. Reads
will always succeed and return the current value of the wakeup events counter.
Writes, however, will only succeed if the written number is equal to the
current value of the wakeup events counter. If a write is successful, it will
cause the kernel to save the current value of the wakeup events counter and
to compare the saved number with the current value of the counter at certain
points of the subsequent suspend (or hibernate) sequence. If the two values
don't match, the suspend will be aborted just as though a wakeup interrupt
happened. Reading from /sys/power/wakeup_count again will turn that mechanism
off.

The assumption is that there's a user space power manager that will first
read from /sys/power/wakeup_count. Then it will check all user space consumers
of wakeup events known to it for unprocessed events. If there are any, it will
wait for them to be processed and repeat. In turn, if there are not any,
it will try to write to /sys/power/wakeup_count and if the write is successful,
it will write to /sys/power/state to start suspend, so if any wakeup events
accur past that point, they will be noticed by the kernel and will eventually
cause the suspend to be aborted.

In addition to the above, the patch adds a wakeup events counter to the
power member of struct device and makes these per-device wakeup event counters
available via sysfs, so that it's possible to check the activity of various
wakeup event sources within the kernel.

To illustrate how subsystems can use pm_wakeup_event(), I added it to the
PCI runtime PM wakeup-handling code.

At the moment the patch only contains code changes (ie. no documentation),
but I'm going to add comments etc. if people like the idea.

Please tell me what you think.

Rafael

---
drivers/base/power/Makefile | 2 -
drivers/base/power/main.c | 1
drivers/base/power/power.h | 3 +
drivers/base/power/sysfs.c | 9 ++++
drivers/base/power/wakeup.c | 74 ++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci-acpi.c | 2 +
drivers/pci/pcie/pme/pcie_pme.c | 2 +
include/linux/pm.h | 6 +++
kernel/power/hibernate.c | 14 ++++---
kernel/power/main.c | 24 ++++++++++++
kernel/power/power.h | 6 +++
kernel/power/suspend.c | 2 -
12 files changed, 138 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -204,6 +204,28 @@ static ssize_t state_store(struct kobjec

power_attr(state);

+static ssize_t wakeup_count_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%lu\n", pm_get_wakeup_count());
+}
+
+static ssize_t wakeup_count_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t n)
+{
+ unsigned long val;
+
+ if (sscanf(buf, "%lu", &val) == 1) {
+ if (pm_save_wakeup_count(val))
+ return n;
+ }
+ return -EINVAL;
+}
+
+power_attr(wakeup_count);
+
#ifdef CONFIG_PM_TRACE
int pm_trace_enabled;

@@ -236,6 +258,7 @@ static struct attribute * g[] = {
#endif
#ifdef CONFIG_PM_SLEEP
&pm_async_attr.attr,
+ &wakeup_count_attr.attr,
#ifdef CONFIG_PM_DEBUG
&pm_test_attr.attr,
#endif
@@ -266,6 +289,7 @@ static int __init pm_init(void)
int error = pm_start_workqueue();
if (error)
return error;
+ pm_wakeup_events_init();
power_kobj = kobject_create_and_add("power", NULL);
if (!power_kobj)
return -ENOMEM;
Index: linux-2.6/drivers/base/power/wakeup.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/wakeup.c
@@ -0,0 +1,74 @@
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+static unsigned long event_count;
+static unsigned long saved_event_count;
+static bool events_check_enabled;
+static spinlock_t events_lock;
+
+void pm_wakeup_events_init(void)
+{
+ spin_lock_init(&events_lock);
+}
+
+void pm_wakeup_event(struct device *dev)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&events_lock, flags);
+ event_count++;
+ if (dev)
+ dev->power.wakeup_count++;
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+bool pm_check_wakeup_events(bool enable)
+{
+ unsigned long flags;
+ bool ret;
+
+ spin_lock_irqsave(&events_lock, flags);
+ ret = !events_check_enabled || (event_count == saved_event_count);
+ events_check_enabled = enable;
+ spin_unlock_irqrestore(&events_lock, flags);
+ return ret;
+}
+
+unsigned long pm_get_wakeup_count(void)
+{
+ unsigned long flags;
+ unsigned long count;
+
+ spin_lock_irqsave(&events_lock, flags);
+ events_check_enabled = false;
+ count = event_count;
+ spin_unlock_irqrestore(&events_lock, flags);
+ return count;
+}
+
+bool pm_save_wakeup_count(unsigned long count)
+{
+ unsigned long flags;
+ bool ret = false;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (count == event_count) {
+ saved_event_count = count;
+ events_check_enabled = true;
+ ret = true;
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
+ return ret;
+}
+
+unsigned long pm_dev_wakeup_count(struct device *dev)
+{
+ unsigned long flags;
+ unsigned long count;
+
+ spin_lock_irqsave(&events_lock, flags);
+ count = dev->power.wakeup_count;
+ spin_unlock_irqrestore(&events_lock, flags);
+ return count;
+}
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -457,6 +457,7 @@ struct dev_pm_info {
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
struct completion completion;
+ unsigned long wakeup_count;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
@@ -552,6 +553,9 @@ extern void __suspend_report_result(cons
} while (0)

extern void device_pm_wait_for_dev(struct device *sub, struct device *dev);
+
+/* drivers/base/power/wakeup.c */
+extern void pm_wakeup_event(struct device *dev);
#else /* !CONFIG_PM_SLEEP */

#define device_pm_lock() do {} while (0)
@@ -565,6 +569,8 @@ static inline int dpm_suspend_start(pm_m
#define suspend_report_result(fn, ret) do {} while (0)

static inline void device_pm_wait_for_dev(struct device *a, struct device *b) {}
+
+static inline void pm_wakeup_event(struct device *dev) {}
#endif /* !CONFIG_PM_SLEEP */

/* How to reorder dpm_list after device_move() */
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,5 @@
obj-$(CONFIG_PM) += sysfs.o
-obj-$(CONFIG_PM_SLEEP) += main.o
+obj-$(CONFIG_PM_SLEEP) += main.o wakeup.o
obj-$(CONFIG_PM_RUNTIME) += runtime.o
obj-$(CONFIG_PM_OPS) += generic_ops.o
obj-$(CONFIG_PM_TRACE_RTC) += trace.o
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -59,6 +59,7 @@ void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
init_completion(&dev->power.completion);
+ dev->power.wakeup_count = 0;
pm_runtime_init(dev);
}

Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -184,6 +184,12 @@ static inline void suspend_test_finish(c
#ifdef CONFIG_PM_SLEEP
/* kernel/power/main.c */
extern int pm_notifier_call_chain(unsigned long val);
+
+/* drivers/base/power/wakeup.c */
+extern void pm_wakeup_events_init(void);
+extern bool pm_check_wakeup_events(bool enable);
+extern unsigned long pm_get_wakeup_count(void);
+extern bool pm_save_wakeup_count(unsigned long count);
#endif

#ifdef CONFIG_HIGHMEM
Index: linux-2.6/kernel/power/suspend.c
===================================================================
--- linux-2.6.orig/kernel/power/suspend.c
+++ linux-2.6/kernel/power/suspend.c
@@ -157,7 +157,7 @@ static int suspend_enter(suspend_state_t

error = sysdev_suspend(PMSG_SUSPEND);
if (!error) {
- if (!suspend_test(TEST_CORE))
+ if (!suspend_test(TEST_CORE) && pm_check_wakeup_events(false))
error = suspend_ops->enter(state);
sysdev_resume();
}
Index: linux-2.6/kernel/power/hibernate.c
===================================================================
--- linux-2.6.orig/kernel/power/hibernate.c
+++ linux-2.6/kernel/power/hibernate.c
@@ -277,7 +277,7 @@ static int create_image(int platform_mod
goto Enable_irqs;
}

- if (hibernation_test(TEST_CORE))
+ if (hibernation_test(TEST_CORE) || !pm_check_wakeup_events(true))
goto Power_up;

in_suspend = 1;
@@ -511,14 +511,18 @@ int hibernation_platform_enter(void)

local_irq_disable();
sysdev_suspend(PMSG_HIBERNATE);
+ if (!pm_check_wakeup_events(false))
+ goto Power_up;
+
hibernation_ops->enter();
/* We should never get here */
while (1);

- /*
- * We don't need to reenable the nonboot CPUs or resume consoles, since
- * the system is going to be halted anyway.
- */
+ Power_up:
+ sysdev_resume();
+ local_irq_enable();
+ enable_nonboot_cpus();
+
Platform_finish:
hibernation_ops->finish();

Index: linux-2.6/drivers/pci/pci-acpi.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci-acpi.c
+++ linux-2.6/drivers/pci/pci-acpi.c
@@ -48,6 +48,8 @@ static void pci_acpi_wake_dev(acpi_handl
if (event == ACPI_NOTIFY_DEVICE_WAKE && pci_dev) {
pci_check_pme_status(pci_dev);
pm_runtime_resume(&pci_dev->dev);
+ if (device_may_wakeup(&pci_dev->dev))
+ pm_wakeup_event(&pci_dev->dev);
if (pci_dev->subordinate)
pci_pme_wakeup_bus(pci_dev->subordinate);
}
Index: linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
===================================================================
--- linux-2.6.orig/drivers/pci/pcie/pme/pcie_pme.c
+++ linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
@@ -147,6 +147,8 @@ static bool pcie_pme_walk_bus(struct pci
/* Skip PCIe devices in case we started from a root port. */
if (!pci_is_pcie(dev) && pci_check_pme_status(dev)) {
pm_request_resume(&dev->dev);
+ if (device_may_wakeup(&dev->dev))
+ pm_wakeup_event(&dev->dev);
ret = true;
}

Index: linux-2.6/drivers/base/power/power.h
===================================================================
--- linux-2.6.orig/drivers/base/power/power.h
+++ linux-2.6/drivers/base/power/power.h
@@ -30,6 +30,9 @@ extern void device_pm_move_before(struct
extern void device_pm_move_after(struct device *, struct device *);
extern void device_pm_move_last(struct device *);

+/* drivers/base/power/wakeup.c */
+extern unsigned long pm_dev_wakeup_count(struct device *dev);
+
#else /* !CONFIG_PM_SLEEP */

static inline void device_pm_init(struct device *dev)
Index: linux-2.6/drivers/base/power/sysfs.c
===================================================================
--- linux-2.6.orig/drivers/base/power/sysfs.c
+++ linux-2.6/drivers/base/power/sysfs.c
@@ -144,6 +144,14 @@ wake_store(struct device * dev, struct d

static DEVICE_ATTR(wakeup, 0644, wake_show, wake_store);

+static ssize_t wakeup_count_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%lu\n", pm_dev_wakeup_count(dev));
+}
+
+static DEVICE_ATTR(wakeup_count, 0444, wakeup_count_show, NULL);
+
#ifdef CONFIG_PM_ADVANCED_DEBUG
#ifdef CONFIG_PM_RUNTIME

@@ -230,6 +238,7 @@ static struct attribute * power_attrs[]
&dev_attr_control.attr,
#endif
&dev_attr_wakeup.attr,
+ &dev_attr_wakeup_count.attr,
#ifdef CONFIG_PM_ADVANCED_DEBUG
&dev_attr_async.attr,
#ifdef CONFIG_PM_RUNTIME

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: David Brownell on

> > > Indeed, the same problem arises if the event
> isn't delivered to
> > > userspace until after userspace is frozen.

Can we put this more directly: the problem is
that the *SYSTEM ISN'T FULLY SUSPENDED* when the
hardware wake event triggers? (Where "*SYSTEM*
includes userspace not just kernel. In fact the
overall system is built from many subsystems,
some in the kernel and some in userspace.

At the risk of being prematurely general: I'd
point out that these subsystems probably have
sequencing requirements. kernel-then-user is
a degenerate case, and surely oversimplified.
There are other examples, e.g. between kernel
subsystems... Like needing to suspend a PMIC
before the bus it uses, where that bus uses
a task to manage request/response protocols.
(Think I2C or SPI.)

This is like the __init/__exit sequencing mess...

In terms of userspace event delivery, I'd say
it's a bug in the event mechanism if taking the
next step in suspension drops any event. It
should be queued, not lost... As a rule the
hardware queuing works (transparently)...

> Of course, the underlying
> > > issue here is that the kernel has no direct way
> to know when userspace
> > > has finished processing an event.


Again said more directly: there's no current
mechanism to coordinate subsystems. Userspace
can't communicate "I'm ready" to kernel, and
vice versa. (a few decades ago, APM could do
that ... we dropped such mechanisms though, and
I'm fairly sure APM's implementation was holey.)




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: David Brownell on


--- On Sun, 6/20/10, David Brownell <david-b(a)pacbell.net> wrote:

.... in a sort of "aren't we asking the
wrong questions??" manner ...


I suspect that
looking at the problem in terms of how to
coordinate subsystems (an abstraction which
is at best very ad-hoc today!) we would
end up with a cleaner model, which doesn't
bother so many folk the ay wakelocks or
even suspend blockers seem to bother them...


> From: David Brownell <david-b(a)pacbell.net>
> Subject: Re: [linux-pm] [RFC][PATCH] PM: Avoid losing wakeup events during suspend
> To: markgross(a)thegnar.org, "Alan Stern" <stern(a)rowland.harvard.edu>
> Cc: "Neil Brown" <neilb(a)suse.de>, linux-pm(a)lists.linux-foundation.org, "Dmitry Torokhov" <dmitry.torokhov(a)gmail.com>, "Linux Kernel Mailing List" <linux-kernel(a)vger.kernel.org>, "mark gross" <640e9920(a)gmail.com>
> Date: Sunday, June 20, 2010, 9:04 PM
>
> > > > Indeed, the same problem arises if the
> event
> > isn't delivered to
> > > > userspace until after userspace is frozen.
>
> Can we put this more directly:� the problem is
> that the *SYSTEM ISN'T FULLY SUSPENDED* when the
> hardware wake event triggers?� (Where "*SYSTEM*
> includes userspace not just kernel.� In fact the
> overall system is built from many subsystems,
> some in the kernel and some in userspace.
>
> At the risk of being prematurely general:� I'd
> point out that these subsystems probably have
> sequencing requirements.� kernel-then-user is
> a degenerate case, and surely oversimplified.
> There are other examples, e.g. between kernel
> subsystems...� Like needing to suspend a PMIC
> before the bus it uses, where that bus uses
> a task to manage request/response protocols.
> (Think I2C or SPI.)
>
> This is like the __init/__exit sequencing mess...
>
> In terms of userspace event delivery, I'd say
> it's a bug in the event mechanism if taking the
> next step in suspension drops any event.� It
> should be queued, not lost...� As a rule the
> hardware queuing works (transparently)...
>
> > Of course, the underlying
> > > > issue here is that the kernel has no direct
> way
> > to know when userspace
> > > > has finished processing an event.
>
>
> Again said more directly:� there's no current
> mechanism to coordinate subsystems.� Userspace
> can't communicate "I'm ready" to kernel, and
> vice versa.� (a few decades ago, APM could do
> that ... we dropped such mechanisms though, and
> I'm fairly sure APM's implementation was holey.)
>
>
>
>
> _______________________________________________
> linux-pm mailing list
> linux-pm(a)lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/linux-pm
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
 | 
Pages: 1
Prev: Tracing of syscalls on x86
Next: (none)