[tip:perf/urgent] perf_events: Fix event scheduling issues introduced by transactional API [Kernel]


From: tip-bot for Stephane Eranian on 31 May 2010 03:30

Commit-ID: 90151c35b19633e0cab5a6c80f1ba4a51e7c913b
Gitweb: http://git.kernel.org/tip/90151c35b19633e0cab5a6c80f1ba4a51e7c913b
Author: Stephane Eranian <eranian(a)google.com>
AuthorDate: Tue, 25 May 2010 16:23:10 +0200
Committer: Ingo Molnar <mingo(a)elte.hu>
CommitDate: Mon, 31 May 2010 08:46:10 +0200

perf_events: Fix event scheduling issues introduced by transactional API

The transactional API patch between the generic and model-specific
code introduced several important bugs with event scheduling, at
least on X86. If you had pinned events, e.g., watchdog, and were
over-committing the PMU, you would get bogus counts. The bug showed
up on Intel CPUs because events move around more often there than
on AMD. The problem also existed on AMD, though it was harder to
expose.

The issues were:

- group_sched_in() was missing a cancel_txn() in the error path

- cpuc->n_added was not properly maintained, leading to missing
actions in hw_perf_enable(), i.e., n_running being 0. You cannot
update n_added until you know the transaction has succeeded. After
a failed transaction, n_added was not adjusted back.

- in case of failed transactions, event_sched_out() was called
and eventually invoked x86_disable_event() to touch the HW reg.
But with transactions, on X86, event_sched_in() does not touch
HW registers, it simply collects events into a list. Thus, you
could end up calling x86_disable_event() on a counter which
did not correspond to the current event when idx != -1.

The patch modifies the generic and X86 code to avoid all those problems.

First, we keep track of the number of events added last. In case the
transaction fails, we subtract them from n_added. This approach is
necessary (as opposed to delaying updates to n_added) because not all
event updates use the transaction API, e.g., single events.
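
The bookkeeping described above can be illustrated with a small user-space
model. This is only a sketch, not kernel code: the struct and the counts_*
helpers are hypothetical names invented for illustration, and they mirror
only the n_events, n_added and n_txn arithmetic from the patch.

```c
#include <assert.h>

/* Hypothetical stand-in for the three counters in struct cpu_hw_events. */
struct counts_model {
	int n_events;	/* events currently collected */
	int n_added;	/* events added since the last enable */
	int n_txn;	/* events added since the transaction started */
};

/* Models start_txn: begin counting additions from zero. */
static void counts_start_txn(struct counts_model *c)
{
	c->n_txn = 0;
}

/* Models the tail of x86_pmu_enable(): all three counters grow together. */
static void counts_enable(struct counts_model *c, int n)
{
	c->n_events += n;
	c->n_added += n;
	c->n_txn += n;
}

/* Models cancel_txn: subtract whatever this transaction added. */
static void counts_cancel_txn(struct counts_model *c)
{
	assert(c->n_txn <= c->n_added && c->n_txn <= c->n_events);
	c->n_added -= c->n_txn;
	c->n_events -= c->n_txn;
}

/* Models commit_txn: zero n_txn so a later cancel undoes nothing. */
static void counts_commit_txn(struct counts_model *c)
{
	c->n_txn = 0;
}

/* One single event outside a transaction, then a failed 2-event group. */
static int counts_demo_failed_group(void)
{
	struct counts_model c = { 0, 0, 0 };

	counts_enable(&c, 1);		/* single event, no transaction */
	counts_start_txn(&c);
	counts_enable(&c, 2);		/* group of two collected */
	counts_cancel_txn(&c);		/* group failed: roll back two */
	return c.n_added;
}

/* A 2-event group that commits; cancel runs afterwards as in the patch. */
static int counts_demo_committed_group(void)
{
	struct counts_model c = { 0, 0, 0 };

	counts_start_txn(&c);
	counts_enable(&c, 2);
	counts_commit_txn(&c);
	counts_cancel_txn(&c);		/* n_txn is 0, so nothing is undone */
	return c.n_added;
}
```

In this model the single event survives the failed group (n_added stays
at 1), which is why the rollback subtracts n_txn rather than resetting
n_added outright.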

Second, we encapsulate the event_sched_in() and event_sched_out() in
group_sched_in() inside the transaction. That makes the operations
symmetrical and you can also detect that you are inside a transaction
and skip the HW reg access by checking cpuc->group_flag.
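
Why the ordering matters can be sketched with another user-space model.
The flag_* names and FLAG_TXN_STARTED are hypothetical stand-ins for
cpuc->group_flag and PERF_EVENT_TXN_STARTED, and the hw_accesses counter
merely simulates touching a hardware register. As a simplification, the
model applies the new flag check to both orderings.

```c
#include <assert.h>

#define FLAG_TXN_STARTED 0x1u	/* stand-in for PERF_EVENT_TXN_STARTED */

struct flag_model {
	unsigned int group_flag;
	int hw_accesses;	/* simulated hardware register accesses */
};

static void flag_start_txn(struct flag_model *m)
{
	m->group_flag |= FLAG_TXN_STARTED;
}

static void flag_cancel_txn(struct flag_model *m)
{
	m->group_flag &= ~FLAG_TXN_STARTED;
}

/* Models the disable path: skip the hardware while inside a transaction. */
static void flag_disable(struct flag_model *m)
{
	if (m->group_flag & FLAG_TXN_STARTED)
		return;
	m->hw_accesses++;
}

/*
 * Models the error path of a group schedule-in. If cancel runs before
 * the sched-out step, the flag is already clear and hardware is touched.
 */
static int flag_demo(int cancel_before_sched_out)
{
	struct flag_model m = { 0, 0 };

	flag_start_txn(&m);
	if (cancel_before_sched_out)
		flag_cancel_txn(&m);
	flag_disable(&m);		/* the event_sched_out() step */
	if (!cancel_before_sched_out)
		flag_cancel_txn(&m);
	assert(!(m.group_flag & FLAG_TXN_STARTED));
	return m.hw_accesses;
}
```

With the cancel moved after the sched-out step, flag_demo(0) reports no
simulated hardware access, whereas flag_demo(1) reports one.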

With this patch, you can now overcommit the PMU even with pinned
system-wide events present and still get valid counts.

Signed-off-by: Stephane Eranian <eranian(a)google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
LKML-Reference: <1274796225.5882.1389.camel(a)twins>
Signed-off-by: Ingo Molnar <mingo(a)elte.hu>
---
arch/x86/kernel/cpu/perf_event.c | 22 ++++++++++++++++++++++
kernel/perf_event.c | 11 +++++++----
2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index c775860..5db5b7d 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -106,6 +106,7 @@ struct cpu_hw_events {

int n_events;
int n_added;
+ int n_txn;
int assign[X86_PMC_IDX_MAX]; /* event to counter assignment */
u64 tags[X86_PMC_IDX_MAX];
struct perf_event *event_list[X86_PMC_IDX_MAX]; /* in enabled order */
@@ -983,6 +984,7 @@ static int x86_pmu_enable(struct perf_event *event)
out:
cpuc->n_events = n;
cpuc->n_added += n - n0;
+ cpuc->n_txn += n - n0;

return 0;
}
@@ -1089,6 +1091,14 @@ static void x86_pmu_disable(struct perf_event *event)
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
int i;

+ /*
+ * If we're called during a txn, we don't need to do anything.
+ * The events never got scheduled and ->cancel_txn will truncate
+ * the event_list.
+ */
+ if (cpuc->group_flag & PERF_EVENT_TXN_STARTED)
+ return;
+
x86_pmu_stop(event);

for (i = 0; i < cpuc->n_events; i++) {
@@ -1379,6 +1389,7 @@ static void x86_pmu_start_txn(const struct pmu *pmu)
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);

cpuc->group_flag |= PERF_EVENT_TXN_STARTED;
+ cpuc->n_txn = 0;
}

/*
@@ -1391,6 +1402,11 @@ static void x86_pmu_cancel_txn(const struct pmu *pmu)
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);

cpuc->group_flag &= ~PERF_EVENT_TXN_STARTED;
+ /*
+ * Truncate the collected events.
+ */
+ cpuc->n_added -= cpuc->n_txn;
+ cpuc->n_events -= cpuc->n_txn;
}

/*
@@ -1419,6 +1435,12 @@ static int x86_pmu_commit_txn(const struct pmu *pmu)
*/
memcpy(cpuc->assign, assign, n*sizeof(int));

+ /*
+ * Clear out the txn count so that ->cancel_txn() which gets
+ * run after ->commit_txn() doesn't undo things.
+ */
+ cpuc->n_txn = 0;
+
return 0;
}

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 10a1aee..42a0e91 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -687,8 +687,11 @@ group_sched_in(struct perf_event *group_event,
if (txn)
pmu->start_txn(pmu);

- if (event_sched_in(group_event, cpuctx, ctx))
+ if (event_sched_in(group_event, cpuctx, ctx)) {
+ if (txn)
+ pmu->cancel_txn(pmu);
return -EAGAIN;
+ }

/*
* Schedule in siblings as one group (if any):
@@ -710,9 +713,6 @@ group_sched_in(struct perf_event *group_event,
}

group_error:
- if (txn)
- pmu->cancel_txn(pmu);
-
/*
* Groups can be scheduled in as one unit only, so undo any
* partial group before returning:
@@ -724,6 +724,9 @@ group_error:
}
event_sched_out(group_event, cpuctx, ctx);

+ if (txn)
+ pmu->cancel_txn(pmu);
+
return -EAGAIN;
}
