[RFC] x86: perf swallows all NMIs when registered with a user [Kernel]

From: Don Zickus on 22 Jul 2010 18:00

Hi,

When debugging a problem with Yinghai, I noticed that when the perf event
subsystem has a user (in this case the new generic nmi_watchdog), it just
blindly swallows all the NMIs in the system.

This causes issues for people like Yinghai, who want to use an external
NMI button to generate a panic, and for other big companies that like to
register their NMI handlers at a lower priority as a catch-all for NMI
problems. It will also start masking any unknown NMI problems that would
have cropped up due to broken firmware or the like.

The problem is spelled out in the comment in
arch/x86/kernel/cpu/perf_event.c::perf_event_nmi_handler

perf_event_nmi_handler(struct notifier_block *self,
			 unsigned long cmd, void *__args)
{
	struct die_args *args = __args;
	struct pt_regs *regs;
	static int eat_nmis = 0;

	if (!atomic_read(&active_events))
		return NOTIFY_DONE;

	switch (cmd) {
	case DIE_NMI:
	case DIE_NMI_IPI:
		break;

	default:
		return NOTIFY_DONE;
	}

	regs = args->regs;

	apic_write(APIC_LVTPC, APIC_DM_NMI);
	/*
	 * Can't rely on the handled return value to say it was our NMI, two
	 * events could trigger 'simultaneously' raising two back-to-back
	 * NMIs.
	 *
	 * If the first NMI handles both, the latter will be empty and daze
	 * the CPU.
	 */
	x86_pmu.handle_irq(regs);

	return NOTIFY_STOP;
}

In the normal case, there is no perf user, so the function returns with
NOTIFY_DONE right away. But with the new nmi_watchdog, which is a user of
the perf subsystem, it catches DIE_NMI, executes x86_pmu.handle_irq, and
finally returns NOTIFY_STOP.
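
To make the effect on lower-priority handlers concrete, here is a small
userspace sketch of a notifier chain. It is not kernel code: the sim_*
names, the return values and the catch-all handler are all made up for
illustration, and the only thing it tries to model is that the chain walk
ends at the first handler that returns a "stop" value.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for NOTIFY_DONE / NOTIFY_STOP (not the kernel values). */
enum { SIM_NOTIFY_DONE = 0, SIM_NOTIFY_STOP = 1 };

typedef int (*sim_handler_fn)(int perf_active);

/* Models the current perf handler: claims every NMI while perf has a user. */
static int sim_perf_handler(int perf_active)
{
	if (!perf_active)
		return SIM_NOTIFY_DONE;
	return SIM_NOTIFY_STOP;
}

/* Hypothetical lower-priority catch-all, e.g. an external NMI button. */
static int sim_catch_all_calls;

static int sim_catch_all_handler(int perf_active)
{
	(void)perf_active;
	sim_catch_all_calls++;
	return SIM_NOTIFY_STOP;
}

/*
 * Walk the handlers in priority order and stop at the first one that
 * returns SIM_NOTIFY_STOP. Returns the index of that handler, or -1 if
 * nobody claimed the NMI.
 */
static int sim_call_chain(sim_handler_fn *chain, size_t n, int perf_active)
{
	assert(chain != NULL);
	for (size_t i = 0; i < n; i++)
		if (chain[i](perf_active) == SIM_NOTIFY_STOP)
			return (int)i;
	return -1;
}
```

With the chain { sim_perf_handler, sim_catch_all_handler }, the catch-all
only runs when perf_active is 0. As soon as perf has a user, the walk ends
at index 0 and the catch-all never sees the NMI.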

The comment above describes the problem well, but because the handler
then returns NOTIFY_STOP unconditionally, no other NMIs can get through.

I looked at the code and thought I could modify the handle_irq to only
handle one PMU at a time, with the thought that there is probably another
NMI waiting for the other PMUs. This would handle the problem nicely.

But I believe the code is structured such that an event can occupy more
than one PMU in complex cases and as a result would probably break things
because the event would be in limbo until all the NMIs happened to
disable it?? I am not familiar enough with how perf works to know if that
case is correct or not.

So I hacked up some stupid code to start a conversation that just keeps
track of how many NMIs are supposed to happen based on the number of PMUs
handled. Then on future NMIs those are 'eaten' until the count is zero
again.
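
The counting scheme can be tried out in userspace with a sketch like the
one below. It mirrors the logic of the patch, but the eat_sim_* names and
return values are invented for this example, and the "handled" argument
stands in for whatever x86_pmu.handle_irq() would return for that NMI.

```c
#include <assert.h>

/* Simplified stand-ins for NOTIFY_DONE / NOTIFY_STOP (not the kernel values). */
enum { EAT_SIM_DONE = 0, EAT_SIM_STOP = 1 };

/* Number of NMIs perf still expects to claim, carried across calls. */
static int eat_sim_pending;

/*
 * Called once per NMI. 'handled' is the number of overflowed counters
 * serviced during this NMI (0 if the NMI was not caused by perf).
 */
static int eat_sim_nmi(int handled)
{
	assert(handled >= 0);

	eat_sim_pending += handled;
	if (eat_sim_pending) {
		/* Either this NMI or an earlier one accounted for it: eat it. */
		eat_sim_pending--;
		return EAT_SIM_STOP;
	}

	/* Nothing handled and nothing owed: let other handlers see it. */
	return EAT_SIM_DONE;
}
```

Feeding it handled = 2, 0, 0 gives STOP, STOP, DONE: the first NMI
services both counters, the second (empty) NMI is eaten, and the third is
passed down the chain as unknown.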

Like I said, this patch is just something to start a conversation. I
tested it, but could not set up anything complicated enough that more
than one PMU was handled during a single NMI call.

Comments?

Cheers,
Don

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index f2da20f..df6255c 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1154,7 +1156,7 @@ static int x86_pmu_handle_irq(struct pt_regs *regs)
/*
* event overflow
*/
- handled = 1;
+ handled += 1;
data.period = event->hw.last_period;

if (!x86_perf_event_set_period(event))
@@ -1206,6 +1210,7 @@ perf_event_nmi_handler(struct notifier_block *self,
{
struct die_args *args = __args;
struct pt_regs *regs;
+ static int eat_nmis = 0;

if (!atomic_read(&active_events))
return NOTIFY_DONE;
@@ -1229,9 +1234,13 @@ perf_event_nmi_handler(struct notifier_block *self,
* If the first NMI handles both, the latter will be empty and daze
* the CPU.
*/
- x86_pmu.handle_irq(regs);
+ eat_nmis += x86_pmu.handle_irq(regs);
+ if (eat_nmis) {
+ eat_nmis--;
+ return NOTIFY_STOP;
+ }

- return NOTIFY_STOP;
+ return NOTIFY_DONE;
}

static __read_mostly struct notifier_block perf_event_nmi_notifier = {
