From: Zhang, Yanmin
On Tue, 2010-07-13 at 10:50 +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin_zhang(a)linux.intel.com> wrote:
>
> > Peter,
> >
> > perf doesn't work on my Nehalem EX machine.
> > 1) The 1st start of 'perf top' is ok;
> > 2) Kill the 1st perf and restart it. It doesn't work; no data is shown.
> >
> > I tracked it down to the commit below:
> > commit 1ac62cfff252fb668405ef3398a1fa7f4a0d6d15
> > Author: Peter Zijlstra <peterz(a)infradead.org>
> > Date: Fri Mar 26 14:08:44 2010 +0100
> >
> > perf, x86: Add Nehelem PMU programming errata workaround
> >
> > workaround From: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
> > Date: Fri Mar 26 13:59:41 CET 2010
> >
> > Implement the workaround for Intel Errata AAK100 and AAP53.
> >
> > Also, remove the Core-i7 name for Nehalem events since there are
> > also Westmere based i7 chips.
> >
> >
> > If I comment out the workaround in function intel_pmu_nhm_enable_all,
> > perf could work.
> >
> > A quick glance shows:
> > wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x3);
> > should be:
> > wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x7);
>
> > I triggered sysrq to dump the PMU registers and found that the last bit
> > of the global status register is 1. I added a status reset operation, as
> > in the patch below:
> >
> > --- linux-2.6.35-rc5/arch/x86/kernel/cpu/perf_event_intel.c 2010-07-14 09:38:11.000000000 +0800
> > +++ linux-2.6.35-rc5_fork/arch/x86/kernel/cpu/perf_event_intel.c 2010-07-14 14:41:42.000000000 +0800
> > @@ -505,8 +505,13 @@ static void intel_pmu_nhm_enable_all(int
> > wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x4300B1);
> > wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + 2, 0x4300B5);
> >
> > - wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x3);
> > + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x7);
> > wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x0);
> > + /*
> > + * Reset the last 3 bits of global status register in case
> > + * previous enabling causes overflows.
> > + */
> > + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, 0x7);
> >
> > for (i = 0; i < 3; i++) {
> > struct perf_event *event = cpuc->events[i];
> >
> >
> > However, it still doesn't work. For now the only way to make perf work
> > is to comment out the workaround.
>
> Well, how about doing it like this:
>
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x7);
> /*
> * Reset the last 3 bits of global status register in case
> * previous enabling causes overflows.
> */
> wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, 0x7);
>
> for (i = 0; i < 3; i++) {
> struct perf_event *event = cpuc->events[i];
> ...
> }
>
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x0);
>

> I.e. global-mask, overflow-clear, explicit-enable, then global-enable?
Ingo,

wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x7) is a global enable. So isn't your
sequence actually global-enable, overflow-clear, explicit-enable, then
global-disable?
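
Did you perhaps mean something like the sketch below (my guess, not a
tested patch), with the global enable done last? i, cpuc and added are
as in intel_pmu_nhm_enable_all above.

	/* global-mask: keep GLOBAL_CTRL cleared while reprogramming */
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x0);
	/* overflow-clear: reset stale overflow bits of counters 0-2 */
	wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, 0x7);
	/* explicit-enable: reprogram the real events */
	for (i = 0; i < 3; i++) {
		struct perf_event *event = cpuc->events[i];

		if (!event)
			continue;

		__x86_pmu_enable_event(&event->hw, ARCH_PERFMON_EVENTSEL_ENABLE);
	}
	/* global-enable: set GLOBAL_CTRL only after everything is programmed */
	intel_pmu_enable_all(added);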

Yanmin


From: Zhang, Yanmin
On Wed, 2010-07-14 at 02:36 +0200, Stephane Eranian wrote:
> What about running simpler commands like perf stat?
'perf stat ls' seems ok.
I compared the PMU register dumps taken before and after 'perf stat ls':
on some processors, bit0 of the status register changed from 1 to 0.

I can't guarantee that every 'perf stat' run is ok, because some
processors don't seem to collect statistics correctly while others do.

Running perf under gdb can appear to work because one processor seems ok
while the others aren't, so 'gdb perf' actually misses some data.
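
For reference, a small sketch (assumptions mine, not from the original
mail) of the check being discussed: bit n of MSR_CORE_PERF_GLOBAL_STATUS
reports a pending overflow on generic counter n, so bit0 corresponds to
the counter programmed through MSR_ARCH_PERFMON_EVENTSEL0. In kernel
context that is:

	static bool pmc0_overflow_pending(void)
	{
		u64 status;

		rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
		return status & 0x1;	/* bit0 <-> generic counter 0 */
	}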

>
>
> On Wed, Jul 14, 2010 at 2:13 AM, Zhang, Yanmin
> <yanmin_zhang(a)linux.intel.com> wrote:
> > On Tue, 2010-07-13 at 17:16 +0200, Stephane Eranian wrote:
> >> On Tue, Jul 13, 2010 at 10:14 AM, Zhang, Yanmin
> >> <yanmin_zhang(a)linux.intel.com> wrote:
> >> > [original report trimmed]
> >> > I triggered sysrq to dump the PMU registers and found that the last bit
> >> > of the global status register is 1. I added a status reset operation, as
> >> > in the patch below:
> >> >
> >> What do you call the last bit? bit0 or bit63?
> > Sorry for confusing you. It's bit0, mapping to PERFMON_EVENTSEL0.
> >
> >>
> >> > --- linux-2.6.35-rc5/arch/x86/kernel/cpu/perf_event_intel.c 2010-07-14 09:38:11.000000000 +0800
> >> > +++ linux-2.6.35-rc5_fork/arch/x86/kernel/cpu/perf_event_intel.c 2010-07-14 14:41:42.000000000 +0800
> >> > @@ -505,8 +505,13 @@ static void intel_pmu_nhm_enable_all(int
> >> > wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x4300B1);
> >> > wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + 2, 0x4300B5);
> >> >
> >> > - wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x3);
> >> > + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x7);
> >> > wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x0);
> >> > + /*
> >> > + * Reset the last 3 bits of global status register in case
> >> > + * previous enabling causes overflows.
> >> > + */
> >>
> >> The workaround cannot cause an overflow because the associated counters
> >> won't count anything given their umask value is 0 (which does not correspond
> >> to anything for event 0xB1, event 0xB5 is undocumented). This is for the events
> >> described in table A.2. If NHM-EX has a different definition for 0xB1, 0xB5,
> >> then that's another story.
> > I found the status bit set when triggering sysrq to dump the PMU registers.
> >
> > If I start perf under gdb, it sometimes works: one processor's status
> > register bit0 is 0 while the other processors' are 1. If I just start
> > perf normally, all of them are equal to 1.
> >
> >>
> >>
> >> > + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, 0x7);
> >> >
> >> > for (i = 0; i < 3; i++) {
> >> > struct perf_event *event = cpuc->events[i];
> >> >
> >> >
> >> >
> >> > However, it still doesn't work. For now the only way to make perf work
> >> > is to comment out the workaround.


From: Zhang, Yanmin
On Wed, 2010-07-14 at 08:49 +0800, Zhang, Yanmin wrote:
> On Tue, 2010-07-13 at 10:50 +0200, Ingo Molnar wrote:
> > [original report and patch trimmed]
> > Well, how about doing it like this:
> >
> > wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x7);
> > /*
> > * Reset the last 3 bits of global status register in case
> > * previous enabling causes overflows.
> > */
> > wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, 0x7);
> >
> > for (i = 0; i < 3; i++) {
> > struct perf_event *event = cpuc->events[i];
> > ...
> > }
> >
> > wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x0);
> >
>
> > I.e. global-mask, overflow-clear, explicit-enable, then global-enable?
> Ingo,
>
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x7) is a global enable. So isn't your
> sequence actually global-enable, overflow-clear, explicit-enable, then
> global-disable?
It doesn't work.

I copied the PMU register dump of logical processor 0 below.

Besides the status register's bit0 being 1, the PMC0 count is small, so
it would take a long time for the counter to overflow.


CPU#0: ctrl: 000000070000000f
CPU#0: status: 0000000000000001
CPU#0: overflow: 0000000000000000
CPU#0: fixed: 0000000000000000
CPU#0: pebs: 0000000000000000
CPU#0: active: 0000000000000001
CPU#0: gen-PMC0 ctrl: 000000000053003c
CPU#0: gen-PMC0 count: 00000000549bffd9
CPU#0: gen-PMC0 left: 0000000000000002
CPU#0: gen-PMC1 ctrl: 00000000004300b1
CPU#0: gen-PMC1 count: 0000000000000000
CPU#0: gen-PMC1 left: 0000000000000000
CPU#0: gen-PMC2 ctrl: 00000000004300b5
CPU#0: gen-PMC2 count: 0000000000000000
CPU#0: gen-PMC2 left: 0000000000000000
CPU#0: gen-PMC3 ctrl: 0000000000000000
CPU#0: gen-PMC3 count: 0000000000000000
CPU#0: gen-PMC3 left: 0000000000000000
CPU#0: fixed-PMC0 count: 0000000000000000
CPU#0: fixed-PMC1 count: 0000000000000000
CPU#0: fixed-PMC2 count: 0000000000000000
SysRq : Show Regs


I instrumented intel_pmu_nhm_enable_all to check the PMC0 count register
before the workaround and again after clearing the 3 bits of
MSR_CORE_PERF_GLOBAL_CTRL; the count changed unexpectedly. Below is the
debug output for processor 0.

PMU register counter is changed. before[281474976710654] after[1]
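
The instrumentation was along these lines (a sketch, reconstructed from
the output above, not the exact code):

	u64 before, after;

	rdmsrl(MSR_ARCH_PERFMON_PERFCTR0, before);
	/* ... the errata workaround: program 0x4300D2/0x4300B1/0x4300B5,
	 * toggle MSR_CORE_PERF_GLOBAL_CTRL ... */
	rdmsrl(MSR_ARCH_PERFMON_PERFCTR0, after);
	if (before != after)
		printk(KERN_INFO "PMU register counter is changed. "
		       "before[%llu] after[%llu]\n",
		       (unsigned long long)before,
		       (unsigned long long)after);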

So I think the event 0x4300D2 overflows. We need to do a save and restore
of the counters. The patch below fixes it.
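
The arithmetic behind that conclusion, as a small standalone sketch
(mine, not from the mail): the generic counters here are 48 bits wide,
and perf programs a counter to 2^48 - left so that it overflows after
'left' more counts.

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		const uint64_t width = 48;		/* NHM generic counter width */
		uint64_t before = 281474976710654ULL;	/* from the debug line above */
		uint64_t left = (1ULL << width) - before;

		/* Prints 2, matching "gen-PMC0 left: 2" in the earlier dump:
		 * the counter was only 2 events away from wrapping when the
		 * workaround clobbered and re-enabled it. */
		printf("events left before overflow: %llu\n",
		       (unsigned long long)left);
		return 0;
	}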

Yanmin

---

--- linux-2.6.35-rc5/arch/x86/kernel/cpu/perf_event_intel.c 2005-01-01 13:19:50.800000253 +0800
+++ linux-2.6.35-rc5_perf/arch/x86/kernel/cpu/perf_event_intel.c 2005-01-01 16:01:35.324000300 +0800
@@ -499,21 +499,34 @@ static void intel_pmu_nhm_enable_all(int
{
if (added) {
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ struct perf_event *event;
int i;

+ for (i = 0; i < 3; i++) {
+ event = cpuc->events[i];
+ if (!event)
+ continue;
+ x86_perf_event_update(event);
+ }
wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + 0, 0x4300D2);
wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x4300B1);
wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + 2, 0x4300B5);

- wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x3);
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x7);
wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x0);
+ /*
+ * Reset the last 3 bits of global status register in case
+ * previous enabling causes overflows.
+ */
+ wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, 0x7);

for (i = 0; i < 3; i++) {
- struct perf_event *event = cpuc->events[i];
+ event = cpuc->events[i];

if (!event)
continue;

+ x86_perf_event_set_period(event);
__x86_pmu_enable_event(&event->hw,
ARCH_PERFMON_EVENTSEL_ENABLE);
}


From: Peter Zijlstra
On Tue, 2010-07-13 at 16:14 +0800, Zhang, Yanmin wrote:

> If I comment out the workaround in function intel_pmu_nhm_enable_all,
> perf could work.
>
> A quick glance shows:
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x3);
> should be:
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x7);

Correct, I've got a patch for that lying around; I've posted it several
times in the struct pmu rework series.

> I triggered sysrq to dump the PMU registers and found that the last bit
> of the global status register is 1. I added a status reset operation, as
> in the patch below:
>
> --- linux-2.6.35-rc5/arch/x86/kernel/cpu/perf_event_intel.c 2010-07-14 09:38:11.000000000 +0800
> +++ linux-2.6.35-rc5_fork/arch/x86/kernel/cpu/perf_event_intel.c 2010-07-14 14:41:42.000000000 +0800
> @@ -505,8 +505,13 @@ static void intel_pmu_nhm_enable_all(int
> wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x4300B1);
> wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + 2, 0x4300B5);
>
> - wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x3);
> + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x7);
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x0);
> + /*
> + * Reset the last 3 bits of global status register in case
> + * previous enabling causes overflows.
> + */
> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, 0x7);

Like Stephane already pointed out (and the comment above this function
mentions), these counters should be non-counting, and thus should not
cause an overflow (and this function gets called with GLOBAL_CTRL
cleared, so it shouldn't be counting to begin with).
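
To make those magic constants concrete, a decoding sketch (mine; field
layout per the architectural perfmon event-select format): 0x4300B1 is
event 0xB1 with umask 0x00, plus the USR (bit 16), OS (bit 17) and EN
(bit 22) flags, i.e. 0x430000; it is the zero umask that is supposed to
make these events non-counting.

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t evtsel = 0x4300B1;	/* also try 0x4300D2, 0x4300B5 */

		printf("event 0x%02x umask 0x%02x USR=%d OS=%d EN=%d\n",
		       (unsigned)(evtsel & 0xff),
		       (unsigned)((evtsel >> 8) & 0xff),
		       (int)((evtsel >> 16) & 1),
		       (int)((evtsel >> 17) & 1),
		       (int)((evtsel >> 22) & 1));
		return 0;
	}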

The Nehalem-EX errata BA72 seems to agree (assuming Xeon-7500 is indeed
the EX, Intel really should do something about this naming madness)
http://www.intel.com/Assets/en_US/PDF/specupdate/323344.pdf

[ Also see the trace output below, PERFCTR1 and PERFCTR2 stay 0 ]

> for (i = 0; i < 3; i++) {
> struct perf_event *event = cpuc->events[i];
>
>
>
> However, it still doesn't work. For now the only way to make perf work
> is to comment out the workaround.

So I frobbed the patchlet below to debug things, as it appears I can
indeed reproduce this on my Westmere.

---
arch/x86/kernel/cpu/perf_event.c | 24 +++++++++++++++++++++++-
arch/x86/kernel/cpu/perf_event_intel.c | 2 +-
2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index f2da20f..2ca41f7 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -31,7 +31,7 @@
#include <asm/nmi.h>
#include <asm/compat.h>

-#if 0
+#if 1
#undef wrmsrl
#define wrmsrl(msr, val) \
do { \
@@ -42,6 +42,16 @@ do { \
} while (0)
#endif

+#if 1
+#undef rdmsrl
+#define rdmsrl(msr, val) \
+do { \
+ (val) = native_read_msr(msr); \
+ trace_printk("rdmsrl(%lx, %lx)\n", (unsigned long)(msr),\
+ (unsigned long)(val)); \
+} while (0)
+#endif
+
/*
* best effort, GUP based copy_from_user() that assumes IRQ or NMI context
*/
@@ -583,6 +593,14 @@ static void x86_pmu_disable_all(void)
}
}

+static void noinline check_status(void)
+{
+ u64 status;
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+ if (WARN_ON(status))
+ perf_event_print_debug();
+}
+
void hw_perf_disable(void)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -598,6 +616,8 @@ void hw_perf_disable(void)
barrier();

x86_pmu.disable_all();
+
+ check_status();
}

static void x86_pmu_enable_all(int added)
@@ -816,6 +836,8 @@ void hw_perf_enable(void)
if (cpuc->enabled)
return;

+ check_status();
+
if (cpuc->n_added) {
int n_running = cpuc->n_events - cpuc->n_added;
/*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 214ac86..56ce7dc 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -505,7 +505,7 @@ static void intel_pmu_nhm_enable_all(int added)
wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x4300B1);
wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + 2, 0x4300B5);

- wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x3);
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x7);
wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x0);

for (i = 0; i < 3; i++) {
---

It results in the trace below (I've edited out some superfluous stack traces):

<...>-1871 [000] 96.759853: intel_pmu_disable_all: wrmsrl(38f, 0)
<...>-1871 [000] 96.759853: <stack trace>
=> hw_perf_disable
=> perf_disable
=> perf_ctx_adjust_freq
=> perf_event_task_tick
=> scheduler_tick
=> update_process_times
=> tick_sched_timer
=> __run_hrtimer
<...>-1871 [000] 96.759872: check_status: rdmsrl(38e, 1)
<...>-1871 [000] 96.759872: <stack trace>
=> hw_perf_disable
=> perf_disable
=> perf_ctx_adjust_freq
=> perf_event_task_tick
=> scheduler_tick
=> update_process_times
=> tick_sched_timer
=> __run_hrtimer
<...>-1871 [000] 96.760180: perf_event_print_debug: rdmsrl(38f, 0)
<...>-1871 [000] 96.760186: perf_event_print_debug: rdmsrl(38e, 1)
<...>-1871 [000] 96.760191: perf_event_print_debug: rdmsrl(390, 0)
<...>-1871 [000] 96.760196: perf_event_print_debug: rdmsrl(38d, 0)
<...>-1871 [000] 96.760201: perf_event_print_debug: rdmsrl(3f1, 0)
<...>-1871 [000] 96.760248: perf_event_print_debug: rdmsrl(186, 53003c)
<...>-1871 [000] 96.760258: perf_event_print_debug: rdmsrl(c1, d1e91)
<...>-1871 [000] 96.760285: perf_event_print_debug: rdmsrl(187, 4300b1)
<...>-1871 [000] 96.760290: perf_event_print_debug: rdmsrl(c2, 0)
<...>-1871 [000] 96.760320: perf_event_print_debug: rdmsrl(188, 4300b5)
<...>-1871 [000] 96.760340: perf_event_print_debug: rdmsrl(c3, 0)
<...>-1871 [000] 96.760359: perf_event_print_debug: rdmsrl(189, 0)
<...>-1871 [000] 96.760379: perf_event_print_debug: rdmsrl(c4, 0)
<...>-1871 [000] 96.760398: perf_event_print_debug: rdmsrl(309, 0)
<...>-1871 [000] 96.760417: perf_event_print_debug: rdmsrl(30a, 0)
<...>-1871 [000] 96.760432: perf_event_print_debug: rdmsrl(30b, 0)
<...>-1871 [000] 96.760432: <stack trace>
=> check_status
=> hw_perf_disable
=> perf_disable
=> perf_ctx_adjust_freq
=> perf_event_task_tick
=> scheduler_tick
=> update_process_times
=> tick_sched_timer

Which seems to indicate we somehow disable the PMU (clear GLOBAL_CTRL)
with a pending overflow (as indicated by GLOBAL_STATUS and PERFCTR0).

So there are two questions:
1) how can we end up in the above state
2) why does the fixup change things

most confusing..
From: Stephane Eranian
On Tue, Aug 3, 2010 at 2:20 PM, Peter Zijlstra <peterz(a)infradead.org> wrote:
> On Tue, 2010-07-13 at 16:14 +0800, Zhang, Yanmin wrote:
> [Peter's full reply, debug patchlet and trace trimmed; his conclusion:]
>
> Which seems to indicate we somehow disable the PMU (clear GLOBAL_CTRL)
> with a pending overflow (as indicated by GLOBAL_STATUS and PERFCTR0).
>
At the time you stop the PMU, you could have an in-flight overflow, i.e.,
it may take some time for the overflow status bit to be set. But this
seems far-fetched, because it would then be very hard to reproduce. Do you
get the same problem if you restrict the event to counting at the user
level, for instance?
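
For instance (a sketch on my side, untested on the affected boxes),
restricting the event to user level via perf_event_open():

	#include <linux/perf_event.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int open_user_cycles(void)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_HARDWARE;
		attr.config = PERF_COUNT_HW_CPU_CYCLES;
		attr.exclude_kernel = 1;	/* count at the user level only */
		attr.exclude_hv = 1;

		/* monitor the calling thread on any CPU */
		return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	}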

Another good source of information is the Intel VTune/PTU driver, called
SEP; the source is provided when you download PTU. I checked the code and
they do have a fix for the same errata. It seems to issue many more
wrmsrl() calls.