Prev: serial: add support for OX16PCI958 card
Next: x86: Do not try to disable hpet if it hasn't been initialized before
From: Yinghai Lu on 22 Jul 2010 19:10 On 07/22/2010 02:51 PM, Don Zickus wrote: > Hi, > > When debugging a problem with Yinghai, I noticed that when the perf event > subsystem has a user (in this case the new generic nmi_watchdog), it just > blindly swallows all the NMIs in the system. > > This causes issues for people like Yinghai, who want to use an external > nmi button to generate a panic, or other big companies that like to > registered the nmi handlers at a lower priority to be a catch-all for NMI > problems or also it will start masking any unknown nmi problems that would > have cropped up due to broken firmware or such. > > The problem is spelled out in the comment in > arch/x86/kernel/cpu/perf_event.c::perf_event_nmi_handler > > perf_event_nmi_handler(struct notifier_block *self, > unsigned long cmd, void *__args) > { > struct die_args *args = __args; > struct pt_regs *regs; > static int eat_nmis = 0; > > if (!atomic_read(&active_events)) > return NOTIFY_DONE; > > switch (cmd) { > case DIE_NMI: > case DIE_NMI_IPI: > break; > > default: > return NOTIFY_DONE; > } > > regs = args->regs; > > apic_write(APIC_LVTPC, APIC_DM_NMI); > /* > * Can't rely on the handled return value to say it was our NMI, > * two > * events could trigger 'simultaneously' raising two back-to-back > * NMIs. > * > * If the first NMI handles both, the latter will be empty and > * daze > * the CPU. > */ > x86_pmu.handle_irq(regs); > > return NOTIFY_STOP; > } > > In the normal case, there is no perf user, so the function returns with > NOTIFY_DONE right away. But with the new nmi_watchdog, which is a user of > the perf subsystem, it catches DIE_NMI, executes x86_pmu.handle_irq, and > finally returns NOTIFY_STOP. > > The comment above describes the problem well, but as a result no other > NMIs can get through. > > I looked at the code and thought I could modify the handle_irq to only > handle one PMU at a time, with the thought that there is probably another > NMI waiting for the other PMUs. This would handle the problem nicely. > > But I believe the code is structured such that an event can occupy more > than one PMU in complex cases and as a result would probably break things > because the event would be in limbo until all the NMIs happened to > disable it?? I am not familiar enough with how perf works to know if that > case is correct or not. > > So I hacked up some stupid code to start a conversation that just keeps > track of how many NMIs are supposed to happen based on the number of PMUs > handled. Then on future NMIs those are 'eaten' until the count is zero > again. > > Like I said this patch is just something to start a conversation. I > tested it, but could not do anything complicated enough such that more > than one PMU was handled during one NMI call. > > Comments? > > Cheers, > Don > > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c > index f2da20f..df6255c 100644 > --- a/arch/x86/kernel/cpu/perf_event.c > +++ b/arch/x86/kernel/cpu/perf_event.c > @@ -1154,7 +1156,7 @@ static int x86_pmu_handle_irq(struct pt_regs *regs) > /* > * event overflow > */ > - handled = 1; > + handled += 1; > data.period = event->hw.last_period; > > if (!x86_perf_event_set_period(event)) > @@ -1206,6 +1210,7 @@ perf_event_nmi_handler(struct notifier_block *self, > { > struct die_args *args = __args; > struct pt_regs *regs; > + static int eat_nmis = 0; > > if (!atomic_read(&active_events)) > return NOTIFY_DONE; > @@ -1229,9 +1234,13 @@ perf_event_nmi_handler(struct notifier_block *self, > * If the first NMI handles both, the latter will be empty and daze > * the CPU. > */ > - x86_pmu.handle_irq(regs); > + eat_nmis += x86_pmu.handle_irq(regs); > + if (eat_nmis) { > + eat_nmis--; > + return NOTIFY_STOP; > + } > > - return NOTIFY_STOP; > + return NOTIFY_DONE; > } > > static __read_mostly struct notifier_block perf_event_nmi_notifier = { cool, with your patch and following patch i can use nmi button now with CONFIG_LOCKUP_DETECTOR defined [PATCH] x86,nmi: move unknown_nmi_panic to traps.c So we use it even LOCKUP_DETECTOR is defined. need Don's patch... Signed-off-by: Yinghai Lu <yinghai(a)kernel.org> --- arch/x86/include/asm/nmi.h | 8 -------- arch/x86/kernel/apic/hw_nmi.c | 1 - arch/x86/kernel/apic/nmi.c | 27 --------------------------- arch/x86/kernel/traps.c | 29 ++++++++++++++++++++++++++++- kernel/sysctl.c | 4 +++- 5 files changed, 31 insertions(+), 38 deletions(-) Index: linux-2.6/arch/x86/kernel/apic/nmi.c =================================================================== --- linux-2.6.orig/arch/x86/kernel/apic/nmi.c +++ linux-2.6/arch/x86/kernel/apic/nmi.c @@ -37,7 +37,6 @@ #include <asm/mach_traps.h> -int unknown_nmi_panic; int nmi_watchdog_enabled; /* For reliability, we're prepared to waste bits here. */ @@ -483,23 +482,6 @@ static void disable_ioapic_nmi_watchdog( on_each_cpu(stop_apic_nmi_watchdog, NULL, 1); } -static int __init setup_unknown_nmi_panic(char *str) -{ - unknown_nmi_panic = 1; - return 1; -} -__setup("unknown_nmi_panic", setup_unknown_nmi_panic); - -static int unknown_nmi_panic_callback(struct pt_regs *regs, int cpu) -{ - unsigned char reason = get_nmi_reason(); - char buf[64]; - - sprintf(buf, "NMI received for unknown reason %02x\n", reason); - die_nmi(buf, regs, 1); /* Always panic here */ - return 0; -} - /* * proc handler for /proc/sys/kernel/nmi */ @@ -540,15 +522,6 @@ int proc_nmi_enabled(struct ctl_table *t #endif /* CONFIG_SYSCTL */ -int do_nmi_callback(struct pt_regs *regs, int cpu) -{ -#ifdef CONFIG_SYSCTL - if (unknown_nmi_panic) - return unknown_nmi_panic_callback(regs, cpu); -#endif - return 0; -} - void arch_trigger_all_cpu_backtrace(void) { int i; Index: linux-2.6/arch/x86/kernel/apic/hw_nmi.c =================================================================== --- linux-2.6.orig/arch/x86/kernel/apic/hw_nmi.c +++ linux-2.6/arch/x86/kernel/apic/hw_nmi.c @@ -100,7 +100,6 @@ void acpi_nmi_disable(void) { return; } #endif atomic_t nmi_active = ATOMIC_INIT(0); /* oprofile uses this */ EXPORT_SYMBOL(nmi_active); -int unknown_nmi_panic; void cpu_nmi_set_wd_enabled(void) { return; } void stop_apic_nmi_watchdog(void *unused) { return; } void setup_apic_nmi_watchdog(void *unused) { return; } Index: linux-2.6/arch/x86/kernel/traps.c =================================================================== --- linux-2.6.orig/arch/x86/kernel/traps.c +++ linux-2.6/arch/x86/kernel/traps.c @@ -377,6 +377,33 @@ unknown_nmi_error(unsigned char reason, printk(KERN_EMERG "Dazed and confused, but trying to continue\n"); } +#if defined(CONFIG_SYSCTL) && defined(CONFIG_X86_LOCAL_APIC) +int unknown_nmi_panic; +static int __init setup_unknown_nmi_panic(char *str) +{ + unknown_nmi_panic = 1; + return 1; +} +__setup("unknown_nmi_panic", setup_unknown_nmi_panic); + +static int unknown_nmi_panic_callback(struct pt_regs *regs, int cpu) +{ + unsigned char reason = get_nmi_reason(); + char buf[64]; + + sprintf(buf, "NMI received for unknown reason %02x\n", reason); + die_nmi(buf, regs, 1); /* Always panic here */ + return 0; +} + +static int do_nmi_callback(struct pt_regs *regs, int cpu) +{ + if (unknown_nmi_panic) + return unknown_nmi_panic_callback(regs, cpu); + return 0; +} +#endif + static notrace __kprobes void default_do_nmi(struct pt_regs *regs) { unsigned char reason = 0; @@ -405,8 +432,8 @@ static notrace __kprobes void default_do */ if (nmi_watchdog_tick(regs, reason)) return; - if (!do_nmi_callback(regs, cpu)) #endif /* !CONFIG_LOCKUP_DETECTOR */ + if (!do_nmi_callback(regs, cpu)) unknown_nmi_error(reason, regs); #else unknown_nmi_error(reason, regs); Index: linux-2.6/kernel/sysctl.c =================================================================== --- linux-2.6.orig/kernel/sysctl.c +++ linux-2.6/kernel/sysctl.c @@ -741,7 +741,7 @@ static struct ctl_table kern_table[] = { .extra2 = &one, }, #endif -#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86) && !defined(CONFIG_LOCKUP_DETECTOR) +#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86) { .procname = "unknown_nmi_panic", .data = &unknown_nmi_panic, @@ -749,6 +749,8 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, +#endif +#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86) && !defined(CONFIG_LOCKUP_DETECTOR) { .procname = "nmi_watchdog", .data = &nmi_watchdog_enabled, Index: linux-2.6/arch/x86/include/asm/nmi.h =================================================================== --- linux-2.6.orig/arch/x86/include/asm/nmi.h +++ linux-2.6/arch/x86/include/asm/nmi.h @@ -7,14 +7,6 @@ #ifdef ARCH_HAS_NMI_WATCHDOG -/** - * do_nmi_callback - * - * Check to see if a callback exists and execute it. Return 1 - * if the handler exists and was handled successfully. - */ -int do_nmi_callback(struct pt_regs *regs, int cpu); - extern void die_nmi(char *str, struct pt_regs *regs, int do_panic); extern int check_nmi_watchdog(void); #if !defined(CONFIG_LOCKUP_DETECTOR) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Don Zickus on 23 Jul 2010 09:00 On Thu, Jul 22, 2010 at 03:58:21PM -0700, Yinghai Lu wrote: > cool, with your patch and following patch i can use nmi button now with CONFIG_LOCKUP_DETECTOR defined > > [PATCH] x86,nmi: move unknown_nmi_panic to traps.c > > So we use it even LOCKUP_DETECTOR is defined. > > need Don's patch... Thanks for the feedback Yinghai! > > Signed-off-by: Yinghai Lu <yinghai(a)kernel.org> With regards to your patch, I am still a little uncomfortable with it, as it seems to be redundant with what is used in unknown_nmi_error() and the panic_on_unrecovered_nmi flag. I agree that keeping the familiar flag 'unknown_nmi_panic' is needed. But I was still wondering if wrapping it around 'panic_on_unrecovered_nmi' would be simpler and less code. Cheers, Don > > --- > arch/x86/include/asm/nmi.h | 8 -------- > arch/x86/kernel/apic/hw_nmi.c | 1 - > arch/x86/kernel/apic/nmi.c | 27 --------------------------- > arch/x86/kernel/traps.c | 29 ++++++++++++++++++++++++++++- > kernel/sysctl.c | 4 +++- > 5 files changed, 31 insertions(+), 38 deletions(-) > > Index: linux-2.6/arch/x86/kernel/apic/nmi.c > =================================================================== > --- linux-2.6.orig/arch/x86/kernel/apic/nmi.c > +++ linux-2.6/arch/x86/kernel/apic/nmi.c > @@ -37,7 +37,6 @@ > > #include <asm/mach_traps.h> > > -int unknown_nmi_panic; > int nmi_watchdog_enabled; > > /* For reliability, we're prepared to waste bits here. */ > @@ -483,23 +482,6 @@ static void disable_ioapic_nmi_watchdog( > on_each_cpu(stop_apic_nmi_watchdog, NULL, 1); > } > > -static int __init setup_unknown_nmi_panic(char *str) > -{ > - unknown_nmi_panic = 1; > - return 1; > -} > -__setup("unknown_nmi_panic", setup_unknown_nmi_panic); > - > -static int unknown_nmi_panic_callback(struct pt_regs *regs, int cpu) > -{ > - unsigned char reason = get_nmi_reason(); > - char buf[64]; > - > - sprintf(buf, "NMI received for unknown reason %02x\n", reason); > - die_nmi(buf, regs, 1); /* Always panic here */ > - return 0; > -} > - > /* > * proc handler for /proc/sys/kernel/nmi > */ > @@ -540,15 +522,6 @@ int proc_nmi_enabled(struct ctl_table *t > > #endif /* CONFIG_SYSCTL */ > > -int do_nmi_callback(struct pt_regs *regs, int cpu) > -{ > -#ifdef CONFIG_SYSCTL > - if (unknown_nmi_panic) > - return unknown_nmi_panic_callback(regs, cpu); > -#endif > - return 0; > -} > - > void arch_trigger_all_cpu_backtrace(void) > { > int i; > Index: linux-2.6/arch/x86/kernel/apic/hw_nmi.c > =================================================================== > --- linux-2.6.orig/arch/x86/kernel/apic/hw_nmi.c > +++ linux-2.6/arch/x86/kernel/apic/hw_nmi.c > @@ -100,7 +100,6 @@ void acpi_nmi_disable(void) { return; } > #endif > atomic_t nmi_active = ATOMIC_INIT(0); /* oprofile uses this */ > EXPORT_SYMBOL(nmi_active); > -int unknown_nmi_panic; > void cpu_nmi_set_wd_enabled(void) { return; } > void stop_apic_nmi_watchdog(void *unused) { return; } > void setup_apic_nmi_watchdog(void *unused) { return; } > Index: linux-2.6/arch/x86/kernel/traps.c > =================================================================== > --- linux-2.6.orig/arch/x86/kernel/traps.c > +++ linux-2.6/arch/x86/kernel/traps.c > @@ -377,6 +377,33 @@ unknown_nmi_error(unsigned char reason, > printk(KERN_EMERG "Dazed and confused, but trying to continue\n"); > } > > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_X86_LOCAL_APIC) > +int unknown_nmi_panic; > +static int __init setup_unknown_nmi_panic(char *str) > +{ > + unknown_nmi_panic = 1; > + return 1; > +} > +__setup("unknown_nmi_panic", setup_unknown_nmi_panic); > + > +static int unknown_nmi_panic_callback(struct pt_regs *regs, int cpu) > +{ > + unsigned char reason = get_nmi_reason(); > + char buf[64]; > + > + sprintf(buf, "NMI received for unknown reason %02x\n", reason); > + die_nmi(buf, regs, 1); /* Always panic here */ > + return 0; > +} > + > +static int do_nmi_callback(struct pt_regs *regs, int cpu) > +{ > + if (unknown_nmi_panic) > + return unknown_nmi_panic_callback(regs, cpu); > + return 0; > +} > +#endif > + > static notrace __kprobes void default_do_nmi(struct pt_regs *regs) > { > unsigned char reason = 0; > @@ -405,8 +432,8 @@ static notrace __kprobes void default_do > */ > if (nmi_watchdog_tick(regs, reason)) > return; > - if (!do_nmi_callback(regs, cpu)) > #endif /* !CONFIG_LOCKUP_DETECTOR */ > + if (!do_nmi_callback(regs, cpu)) > unknown_nmi_error(reason, regs); > #else > unknown_nmi_error(reason, regs); > Index: linux-2.6/kernel/sysctl.c > =================================================================== > --- linux-2.6.orig/kernel/sysctl.c > +++ linux-2.6/kernel/sysctl.c > @@ -741,7 +741,7 @@ static struct ctl_table kern_table[] = { > .extra2 = &one, > }, > #endif > -#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86) && !defined(CONFIG_LOCKUP_DETECTOR) > +#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86) > { > .procname = "unknown_nmi_panic", > .data = &unknown_nmi_panic, > @@ -749,6 +749,8 @@ static struct ctl_table kern_table[] = { > .mode = 0644, > .proc_handler = proc_dointvec, > }, > +#endif > +#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86) && !defined(CONFIG_LOCKUP_DETECTOR) > { > .procname = "nmi_watchdog", > .data = &nmi_watchdog_enabled, > Index: linux-2.6/arch/x86/include/asm/nmi.h > =================================================================== > --- linux-2.6.orig/arch/x86/include/asm/nmi.h > +++ linux-2.6/arch/x86/include/asm/nmi.h > @@ -7,14 +7,6 @@ > > #ifdef ARCH_HAS_NMI_WATCHDOG > > -/** > - * do_nmi_callback > - * > - * Check to see if a callback exists and execute it. Return 1 > - * if the handler exists and was handled successfully. > - */ > -int do_nmi_callback(struct pt_regs *regs, int cpu); > - > extern void die_nmi(char *str, struct pt_regs *regs, int do_panic); > extern int check_nmi_watchdog(void); > #if !defined(CONFIG_LOCKUP_DETECTOR) > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Yinghai Lu on 23 Jul 2010 12:50 On 07/23/2010 05:55 AM, Don Zickus wrote: > On Thu, Jul 22, 2010 at 03:58:21PM -0700, Yinghai Lu wrote: >> cool, with your patch and following patch i can use nmi button now with CONFIG_LOCKUP_DETECTOR defined >> >> [PATCH] x86,nmi: move unknown_nmi_panic to traps.c >> >> So we use it even LOCKUP_DETECTOR is defined. >> >> need Don's patch... > > Thanks for the feedback Yinghai! >> >> Signed-off-by: Yinghai Lu <yinghai(a)kernel.org> > > With regards to your patch, I am still a little uncomfortable with it, as > it seems to be redundant with what is used in unknown_nmi_error() and the > panic_on_unrecovered_nmi flag. I agree that keeping the familiar flag > 'unknown_nmi_panic' is needed. But I was still wondering if wrapping it > around 'panic_on_unrecovered_nmi' would be simpler and less code. how about the one in sysctl? Thanks Yinghai -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Don Zickus on 23 Jul 2010 13:20
On Fri, Jul 23, 2010 at 09:42:18AM -0700, Yinghai Lu wrote: > On 07/23/2010 05:55 AM, Don Zickus wrote: > > On Thu, Jul 22, 2010 at 03:58:21PM -0700, Yinghai Lu wrote: > >> cool, with your patch and following patch i can use nmi button now with CONFIG_LOCKUP_DETECTOR defined > >> > >> [PATCH] x86,nmi: move unknown_nmi_panic to traps.c > >> > >> So we use it even LOCKUP_DETECTOR is defined. > >> > >> need Don's patch... > > > > Thanks for the feedback Yinghai! > >> > >> Signed-off-by: Yinghai Lu <yinghai(a)kernel.org> > > > > With regards to your patch, I am still a little uncomfortable with it, as > > it seems to be redundant with what is used in unknown_nmi_error() and the > > panic_on_unrecovered_nmi flag. I agree that keeping the familiar flag > > 'unknown_nmi_panic' is needed. But I was still wondering if wrapping it > > around 'panic_on_unrecovered_nmi' would be simpler and less code. > > how about the one in sysctl? Would it be wrong to do something like (sorry for the copy/paste): --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -744,7 +744,7 @@ static struct ctl_table kern_table[] = { #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86) && !defined(CONFIG_LOCKUP_DETECTOR) { .procname = "unknown_nmi_panic", - .data = &unknown_nmi_panic, + .data = &panic_on_unrecovered_nmi, .maxlen = sizeof (int), .mode = 0644, .proc_handler = proc_dointvec, Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |