From: Cyrill Gorcunov on
On Fri, Apr 16, 2010 at 04:46:17PM +0200, Frederic Weisbecker wrote:
....
> > > > + if (hardlockup_panic)
> > > > + panic("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > > > + else
> > > > + WARN(1, "Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > > > +
> > > > + cpumask_set_cpu(this_cpu, to_cpumask(hardlockup_mask));
> > >
> > >
> > >
> > > Maybe use an arch spin lock there to update your cpu mask safely.
> > >
> >
> > Hmm, this is the NMI handler path, so what are we protecting this per-cpu data from?
> > Am I missing something? /me confused
>
>
> The cpu mask is not per cpu here; it is a shared bitmap, so you
> can race against other cpus' NMIs.
>
> That said, as I suggested, having a per cpu var that we set when we
> warned would be much better than a spinlock here.
>

Yeah, I saw DECLARE_BITMAP but read it as DEFINE_PER_CPU for some reason.
Having any spinlock in an irq handler is highly suspicious.

-- Cyrill
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Don Zickus on
On Fri, Apr 16, 2010 at 04:43:04PM +0200, Frederic Weisbecker wrote:
> On Fri, Apr 16, 2010 at 10:12:13AM -0400, Don Zickus wrote:
> > On Fri, Apr 16, 2010 at 03:47:14AM +0200, Frederic Weisbecker wrote:
> > > > config PERF_EVENTS_NMI
> > > > bool
> > > > + depends on PERF_EVENTS
> > > > help
> > > > Arch has support for nmi_watchdog
> > >
> > >
> > >
> > > That looks too general. It's more about the fact the arch supports
> > > cpu cycle events and generates NMIs on overflow.
> >
> > I was trying to figure out a way to add the PERF_EVENTS dependency as I
> > didn't want to impose it on the CONFIG_NMI_WATCHDOG if that config
> > supported softlockup (which doesn't need the PERF_EVENTS).
>
>
>
> Yeah and this is fine. I was talking about the help description.

Oh. heh. ok, will expand that.

>
>
>
> > > I'm confused, do we have two versions of the softlockup
> > > detector now? You should drop the older one.
> >
> > Originally Ingo talked about a migration path, so I was going to support
> > the older one in case the new one was having issues, sort of like what he
> > suggested about moving the nmi code from arch/x86/kernel/apic/nmi.c to
> > kernel/watchdog.c. But I can probably drop the softlockup case as the
> > migration isn't as tricky as the nmi case.
>
>
>
> Ok.
>
> > > > + return;
> > > > + }
> > > > +
> > > > + cpumask_clear_cpu(this_cpu, to_cpumask(hardlockup_mask));
> > >
> > >
> > >
> > > Hmm...this is probably not necessary.
> >
> > I was just thinking of the case where, despite the WARN above, the cpu
> > actually recovered and then failed again separately. But I probably won't
> > spend any more time defending it. :-)
>
>
>
> This is really just a corner case; I guess you don't need to
> bother with that. It is actually racy against other cpus, and adding
> a spinlock here (in the everything-is-fine path) would be overkill.
>
> In fact, having two per cpu vars named hardlockup_warned and
> softlockup_warned would be better than cpumasks. I'm sorry I
> suggested the cpumask to you, but such per cpu vars will spare
> you these synchronization issues. And one of the primary
> rules is usually to never take a lock from NMIs if we can avoid it :)

Yeah, I guess per cpu is better. I agree that locks in NMI are frowned
upon, but I wasn't sure if it was dealt with.

I'll try to implement this. Any objections if I combine hardlockup and
softlockup into a per cpu watchdog_warn with bit masks for HARDLOCKUP
and SOFTLOCKUP? I hate to waste per cpu space on this.

>
>
>
> > > You probably want a backtrace cpu mask here as well
> > > (but better not to use the same one as the hardlockup thing)
> >
> > yup.
>
>
> So actually, per_cpu softlockup_warned would be better :)
>
>
> > > Also you should half-drop the DETECT_SOFTLOCKUP thing:
> > > keep its definition but drop the ability to choose it from
> > > the prompt:
> > >
> > > config DETECT_SOFTLOCKUP
> > > 	bool
> > > 	depends on DEBUG_KERNEL && !S390
> > > 	default y
> > >
> > > This way we keep it for compatibility with defconfigs: it will
> > > enable WATCHDOG by default if it is "y", and we can schedule
> > > its removal later.
>
> > I understand the general idea but not quite the implementation idea. I will work
> > on it and see what I come up with.
>
>
> We currently have:
>
> config DETECT_SOFTLOCKUP
> 	bool "Blah"
> 	depends on DEBUG_KERNEL && !S390
> 	default y
> 	help
> 	  .......
>
> The idea is to remove the "Blah" so that the user can't select it
> from make menuconfig any more, and to remove the help text as well,
> since it would be useless.
>
> That way, config WATCHDOG can be "default y if DETECT_SOFTLOCKUP".
> Then if someone comes with a config that has DETECT_SOFTLOCKUP,
> its new implementation (WATCHDOG) will be enabled by default.

Ah, I missed the bool part. I got it. Thanks for the clarification.

Cheers,
Don

From: Frederic Weisbecker on
On Fri, Apr 16, 2010 at 11:04:07AM -0400, Don Zickus wrote:
> > This is really just a corner case; I guess you don't need to
> > bother with that. It is actually racy against other cpus, and adding
> > a spinlock here (in the everything-is-fine path) would be overkill.
> >
> > In fact, having two per cpu vars named hardlockup_warned and
> > softlockup_warned would be better than cpumasks. I'm sorry I
> > suggested the cpumask to you, but such per cpu vars will spare
> > you these synchronization issues. And one of the primary
> > rules is usually to never take a lock from NMIs if we can avoid it :)
>
> Yeah, I guess per cpu is better. I agree that locks in NMI are frowned
> upon, but I wasn't sure if it was dealt with.


They do work, in fact; they are just not checked by lockdep.
Mostly, though, they are very dangerous: if something else can
take the lock (from an interrupt or from normal context), then this
is a deadlock. And even when we ensure it is only taken from NMI, we
tend to avoid that.



> I'll try to implement this. Any objections if I combine hardlockup and
> softlockup into a per cpu watchdog_warn with bit masks for HARDLOCKUP
> and SOFTLOCKUP? I hate to waste per cpu space on this.



Hmm, a hardlockup can come in after a softlockup.
Don't worry too much about memory: usually the more cpus you have,
the more memory you have :)
Plus this is debugging code, not something supposed to be enabled
in production.

From: Don Zickus on
On Fri, Apr 16, 2010 at 05:32:12PM +0200, Frederic Weisbecker wrote:
> > I'll try to implement this. Any objections if I combine hardlockup and
> > softlockup into a per cpu watchdog_warn with bit masks for HARDLOCKUP
> > and SOFTLOCKUP? I hate to waste per cpu space on this.
>
>
>
> Hmm, a hardlockup can come in after a softlockup.

Let me re-explain what I meant. It was meant to do double duty. The
softlockup code only checks the SOFTLOCKUP bit and the hardlockup only
ever checks the HARDLOCKUP bit.

i.e. if (get_cpu_var(watchdog_warn) & HARDLOCKUP) { return; }

> Don't worry too much about memory: usually the more cpus you have,
> the more memory you have :)
> Plus this is debugging code, not something supposed to be enabled
> in production.

Well, that depends on your POV. In RHEL we enable both NMI_WATCHDOG and
SOFTLOCKUP on production systems (and we have customers that are
thankful for that :-) ).

Cheers,
Don
From: Frederic Weisbecker on
On Fri, Apr 16, 2010 at 12:14:01PM -0400, Don Zickus wrote:
> On Fri, Apr 16, 2010 at 05:32:12PM +0200, Frederic Weisbecker wrote:
> > > I'll try to implement this. Any objections if I combine hardlockup and
> > > softlockup into a per cpu watchdog_warn with bit masks for HARDLOCKUP
> > > and SOFTLOCKUP? I hate to waste per cpu space on this.
> >
> >
> >
> > Hmm, a hardlockup can come in after a softlockup.
>
> Let me re-explain what I meant. It was meant to do double duty. The
> softlockup code only checks the SOFTLOCKUP bit and the hardlockup only
> ever checks the HARDLOCKUP bit.
>
> i.e. if (get_cpu_var(watchdog_warn) & HARDLOCKUP) { return; }


Ah right.



>
> > Don't worry too much about memory: usually the more cpus you have,
> > the more memory you have :)
> > Plus this is debugging code, not something supposed to be enabled
> > in production.
>
> Well, that depends on your POV. In RHEL we enable both NMI_WATCHDOG and
> SOFTLOCKUP on production systems (and we have customers that are
> thankful for that :-) ).


Ok :)
