From: Don Zickus on
On Tue, Jun 01, 2010 at 04:46:28PM +0200, Jiri Slaby wrote:
> On 06/01/2010 03:50 PM, Don Zickus wrote:
> > On Mon, May 31, 2010 at 04:22:00PM +0200, Jiri Slaby wrote:
> >> Hi,
> >>
> >> with -next I get the following errors while trying to hibernate in
> >> qemu-kvm after the image is stored on disk:
> >
> > Is this the host that is hibernating or the guest?
>
> Guest.
>
> > KVM guests don't emulate the performance counters, so the nmi piece
> > shouldn't be functioning and the soft lockup piece just sits on top of an
> > hrtimer, so off the top of my head it is hard to imagine it interfering
> > with a sata driver.
> >
> > I'll need your whole boot up log to see how the lockup detector
> > initialized itself.

Ok, so I found out what is causing the problem. I'm not entirely sure why, or
what the right fix is, but this patch should do the trick.

This is probably one of those "fix the symptoms, not the problem" patches,
but I don't know enough about suspend/resume to understand what the real
problem is.

---->SNIP<---------------------
[lockup detector] don't return NOTIFY_BAD when cpu goes online for suspend

KVM guests do not support performance counter emulation, so if the nmi
watchdog piece is compiled in, it will always fail during boot. The
failure returns NOTIFY_BAD when the cpu goes online in the cpu notifier
callback. Returning NOTIFY_BAD causes hibernation to do really bad
things, so avoid doing that.

The cpu failure shouldn't be a critical failure anyway, so returning
NOTIFY_BAD was probably overstating things.

Signed-off-by: Don Zickus <dzickus(a)redhat.com>


diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 6b7fad8..fda9770 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -550,8 +550,7 @@ cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		break;
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
-		if (watchdog_enable(hotcpu))
-			return NOTIFY_BAD;
+		watchdog_enable(hotcpu);
 		break;
 #ifdef CONFIG_HOTPLUG_CPU
 	case CPU_UP_CANCELED:
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Frederic Weisbecker on
On Wed, Jun 02, 2010 at 02:44:59PM -0400, Don Zickus wrote:
> On Tue, Jun 01, 2010 at 04:46:28PM +0200, Jiri Slaby wrote:
> > On 06/01/2010 03:50 PM, Don Zickus wrote:
> > > On Mon, May 31, 2010 at 04:22:00PM +0200, Jiri Slaby wrote:
> > >> Hi,
> > >>
> > >> with -next I get the following errors while trying to hibernate in
> > >> qemu-kvm after the image is stored on disk:
> > >
> > > Is this the host that is hibernating or the guest?
> >
> > Guest.
> >
> > > KVM guests don't emulate the performance counters, so the nmi piece
> > > shouldn't be functioning and the soft lockup piece just sits on top of an
> > > hrtimer, so off the top of my head it is hard to imagine it interfering
> > > with a sata driver.
> > >
> > > I'll need your whole boot up log to see how the lockup detector
> > > initialized itself.
>
> Ok, so I found out what is causing the problem. I'm not entirely sure why, or
> what the right fix is, but this patch should do the trick.
>
> This is probably one of those "fix the symptoms, not the problem" patches,
> but I don't know enough about suspend/resume to understand what the real
> problem is.


So the problem is that we stop the cpu hotplug notifications. I guess this
prevents some ATA callbacks from executing in the cpu hotplug notifier, which
then provokes this crash.
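
To illustrate the point about stopping the notifications: in
include/linux/notifier.h, NOTIFY_BAD is NOTIFY_STOP_MASK | 0x0002, so a
callback that returns it stops the rest of the chain for that event. Below is
a minimal userspace sketch of that behaviour, not kernel code. The names
demo_chain, demo_run, watchdog_old, watchdog_new, later_subsystem and
DEMO_CPU_ONLINE are made up for the illustration, and which real callback
ends up skipped in this bug is not established here.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes as defined in include/linux/notifier.h. */
#define NOTIFY_DONE      0x0000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000
#define NOTIFY_BAD       (NOTIFY_STOP_MASK | 0x0002)

/* Placeholder event value for the demo, not the kernel's CPU_ONLINE. */
#define DEMO_CPU_ONLINE  1UL
#define MAX_CALLBACKS    8

typedef int (*notifier_fn)(unsigned long action);

/* Hypothetical simplified chain: a fixed array, not the kernel's sorted list. */
struct demo_chain {
	notifier_fn fn[MAX_CALLBACKS];
	size_t count;
};

void demo_chain_register(struct demo_chain *c, notifier_fn fn)
{
	assert(c->count < MAX_CALLBACKS);
	c->fn[c->count++] = fn;
}

/* Call each callback in order; stop once one sets NOTIFY_STOP_MASK. */
int demo_chain_call(const struct demo_chain *c, unsigned long action)
{
	int ret = NOTIFY_DONE;

	for (size_t i = 0; i < c->count; i++) {
		ret = c->fn[i](action);
		if (ret & NOTIFY_STOP_MASK)
			break;
	}
	return ret;
}

static int later_calls; /* how many times the later callback ran */

/* Stand-in for the old watchdog callback when watchdog_enable() fails. */
int watchdog_old(unsigned long action) { (void)action; return NOTIFY_BAD; }

/* Stand-in for the patched callback: the failure is no longer propagated. */
int watchdog_new(unsigned long action) { (void)action; return NOTIFY_OK; }

/* Stand-in for any callback registered after the watchdog. */
int later_subsystem(unsigned long action)
{
	(void)action;
	later_calls++;
	return NOTIFY_OK;
}

/* Returns how many times the later callback ran for one online event. */
int demo_run(notifier_fn watchdog_cb)
{
	struct demo_chain chain = { .count = 0 };

	later_calls = 0;
	demo_chain_register(&chain, watchdog_cb);
	demo_chain_register(&chain, later_subsystem);
	demo_chain_call(&chain, DEMO_CPU_ONLINE);
	return later_calls;
}
```

Calling demo_run(watchdog_old) returns 0 because the later callback never
runs, while demo_run(watchdog_new) returns 1.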

The patch looks ok, but I think you should at least print a message when the
watchdog fails like this.

Thanks.




From: Frederic Weisbecker on
On Wed, Jun 02, 2010 at 09:13:40PM +0200, Frederic Weisbecker wrote:
> On Wed, Jun 02, 2010 at 02:44:59PM -0400, Don Zickus wrote:
> > On Tue, Jun 01, 2010 at 04:46:28PM +0200, Jiri Slaby wrote:
> > > On 06/01/2010 03:50 PM, Don Zickus wrote:
> > > > On Mon, May 31, 2010 at 04:22:00PM +0200, Jiri Slaby wrote:
> > > >> Hi,
> > > >>
> > > >> with -next I get the following errors while trying to hibernate in
> > > >> qemu-kvm after the image is stored on disk:
> > > >
> > > > Is this the host that is hibernating or the guest?
> > >
> > > Guest.
> > >
> > > > KVM guests don't emulate the performance counters, so the nmi piece
> > > > shouldn't be functioning and the soft lockup piece just sits on top of an
> > > > hrtimer, so off the top of my head it is hard to imagine it interfering
> > > > with a sata driver.
> > > >
> > > > I'll need your whole boot up log to see how the lockup detector
> > > > initialized itself.
> >
> > Ok, so I found out what is causing the problem. I'm not entirely sure why, or
> > what the right fix is, but this patch should do the trick.
> >
> > This is probably one of those "fix the symptoms, not the problem" patches,
> > but I don't know enough about suspend/resume to understand what the real
> > problem is.
>
>
> So the problem is that we stop the cpu hotplug notifications. I guess this
> prevents some ATA callbacks from executing in the cpu hotplug notifier, which
> then provokes this crash.


(Adding more people in Cc)


But on second thought, I'm surprised by this: stopping the cpu hotplug
callbacks prevents some ATA resume callbacks from executing.

Does that mean some ATA resume paths run from a cpu hotplug notifier?
That looks weird.

From: Don Zickus on
On Wed, Jun 02, 2010 at 09:13:40PM +0200, Frederic Weisbecker wrote:
> On Wed, Jun 02, 2010 at 02:44:59PM -0400, Don Zickus wrote:
> > On Tue, Jun 01, 2010 at 04:46:28PM +0200, Jiri Slaby wrote:
> > > On 06/01/2010 03:50 PM, Don Zickus wrote:
> > > > On Mon, May 31, 2010 at 04:22:00PM +0200, Jiri Slaby wrote:
> > > >> Hi,
> > > >>
> > > >> with -next I get the following errors while trying to hibernate in
> > > >> qemu-kvm after the image is stored on disk:
> > > >
> > > > Is this the host that is hibernating or the guest?
> > >
> > > Guest.
> > >
> > > > KVM guests don't emulate the performance counters, so the nmi piece
> > > > shouldn't be functioning and the soft lockup piece just sits on top of an
> > > > hrtimer, so off the top of my head it is hard to imagine it interfering
> > > > with a sata driver.
> > > >
> > > > I'll need your whole boot up log to see how the lockup detector
> > > > initialized itself.
> >
> > Ok, so I found out what is causing the problem. I'm not entirely sure why, or
> > what the right fix is, but this patch should do the trick.
> >
> > This is probably one of those "fix the symptoms, not the problem" patches,
> > but I don't know enough about suspend/resume to understand what the real
> > problem is.
>
>
> So the problem is that we stop the cpu hotplug notifications. I guess this
> prevents some ATA callbacks from executing in the cpu hotplug notifier, which
> then provokes this crash.
>
> The patch looks ok, but I think you should at least print a message when the
> watchdog fails like this.

Well, this is already printed and shows up in Jiri's dmesg output:

NMI watchdog failed to create perf event on cpu1: ffffffffffffffed

I could change it to make it more obvious?
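
For reference, the trailing value in that message looks like a negative error
code printed as a 64-bit pointer. A small userspace sketch for decoding it
follows; decode_err is a hypothetical helper, and reading the result as an
errno value is an assumption about how the kernel produced the number.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Interpret a 64-bit hex string as a two's-complement signed value.
 * Hypothetical helper for illustration; it does no input validation
 * beyond rejecting a NULL pointer.
 */
long long decode_err(const char *hex)
{
	unsigned long long raw;

	assert(hex != NULL);
	raw = strtoull(hex, NULL, 16);

	/* Values above LLONG_MAX have the sign bit set: negate by hand. */
	if (raw > (unsigned long long)LLONG_MAX)
		return -(long long)(~raw) - 1;
	return (long long)raw;
}
```

decode_err("ffffffffffffffed") returns -19, and on Linux errno 19 is ENODEV
("No such device"), which would fit a guest that has no performance counters
to offer.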

Cheers,
Don
From: Tejun Heo on
Hello,

On 06/02/2010 09:43 PM, Frederic Weisbecker wrote:
> But on second thought, I'm surprised by this: stopping the cpu hotplug
> callbacks prevents some ATA resume callbacks from executing.
>
> Does that mean some ATA resume paths run from a cpu hotplug notifier?
> That looks weird.

libata itself doesn't use any cpu hotplug notifier. It just uses
suspend and resume callbacks. It looks like somehow IRQ delivery is
screwed up?

Thanks.

--
tejun