perf: Add persistent events [Kernel]


From: Peter Zijlstra on 23 May 2010 16:10

On Sun, 2010-05-23 at 20:54 +0200, Borislav Petkov wrote:
> From: Peter Zijlstra <peterz(a)infradead.org>
> Date: Sun, May 23, 2010 at 08:40:47PM +0200
>
> > > > A persistent event would simply be a regular event, but created by the
> > > > kernel and not tied to a file-desc's lifetime.
> > >
> > > So you're saying the trace_mce_record() tracepoint for example should
> > > be created completely internally in the kernel and cease to be a
> > > tracepoint? Will it still be able to be selected by perf -e?
> >
> > No, it should be a regular tracepoint as far as tracepoints are
> > concerned.
> >
> > But the only thing persistence should add is an instance of a
> > perf_event, it should not modify either the perf_event nor the
> > tracepoint code.
>
> which means that subsystems which initialize earlier than perf (mce,
> for example) should have to be notified when perf is ready so that they
> could register a persistent event. How does that sound?

Either we add some notifier thing, or we simply add an explicit call in
the init sequence after the perf_event subsystem is running. I would
suggest we start with some explicit call, and take it from there.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Steven Rostedt on 24 May 2010 23:20

On Sat, 2010-05-22 at 21:04 +0200, Borislav Petkov wrote:
> From: Borislav Petkov <bp(a)alien8.de>
>
> Register and enable events marked as persistent right after perf events
> has initialized.
>
> Not-yet-signed-off-by: Borislav Petkov <bp(a)alien8.de>
> ---
> include/linux/ftrace_event.h | 10 +++++++
> include/linux/perf_event.h | 1 +
> kernel/perf_event.c | 59 +++++++++++++++++++++++++++++++++++++----
> kernel/trace/trace.h | 1 -
> 4 files changed, 64 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
> index c0f4b36..b40d637 100644
> --- a/include/linux/ftrace_event.h
> +++ b/include/linux/ftrace_event.h
> @@ -13,6 +13,8 @@ struct dentry;
>
> DECLARE_PER_CPU(struct trace_seq, ftrace_event_seq);
>
> +extern struct list_head ftrace_events;
> +
>  struct trace_print_flags {
>  	unsigned long mask;
>  	const char *name;
> @@ -134,6 +136,7 @@ struct ftrace_event_call {
>  	int perf_refcount;
>  	int (*perf_event_enable)(struct ftrace_event_call *);
>  	void (*perf_event_disable)(struct ftrace_event_call *);
> +	unsigned int type;
>  };

If you look at the latest tip/perf/core, can you add this to the
ftrace_event_class instead? Or, if it must be per event, can we find a
way to include it in the flags field? Changes to flags must be made
with the event_mutex held.
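
The flags-field alternative can be sketched as a small userspace model.
This is only an illustration under stated assumptions: the names
MODEL_EVENT_FL_PERSISTENT, struct model_event_call and
model_event_mutex_held are hypothetical, and a plain boolean stands in
for "event_mutex is held"; none of this is the actual ftrace code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; only the idea of a shared flags word is from the mail. */
enum {
	MODEL_EVENT_FL_ENABLED    = 1u << 0,
	MODEL_EVENT_FL_PERSISTENT = 1u << 1,
};

struct model_event_call {
	unsigned int flags;	/* replaces a separate 'type' member */
};

/* Stand-in for "event_mutex is held"; a real lock would replace this. */
static bool model_event_mutex_held;

/* Mark an event persistent; refuses to touch flags without the lock. */
static int model_mark_persistent(struct model_event_call *call)
{
	assert(call != NULL);
	if (!model_event_mutex_held)
		return -1;
	call->flags |= MODEL_EVENT_FL_PERSISTENT;
	return 0;
}

static bool model_is_persistent(const struct model_event_call *call)
{
	return (call->flags & MODEL_EVENT_FL_PERSISTENT) != 0;
}
```

Folding the marker into an existing flags word avoids growing every
event structure by one member, at the cost of routing every update
through the lock that already protects the other flag bits.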

-- Steve

>
> #define PERF_MAX_TRACE_SIZE 2048
> @@ -155,6 +158,13 @@ enum {
>  	FILTER_PTR_STRING,
>  };


From: Borislav Petkov on 25 May 2010 03:40

From: Peter Zijlstra <peterz(a)infradead.org>
Date: Sun, May 23, 2010 at 09:23:21PM +0200

> Either we add some notifier thing, or we simply add an explicit call in
> the init sequence after the perf_event subsystem is running. I would
> suggest we start with some explicit call, and take it from there.

Ok, this couldn't be more straightforward. So I looked at the init
sequence we do when booting wrt perf/ftrace initialization:

start_kernel
  ....
  |-> sched_init
  |-> perf_event_init
  ....
  |-> ftrace_init
  rest_init
    kernel_init
      |-> do_pre_smp_initcalls
      |...
      |-> smp_init
      |-> do_basic_setup
        |-> do_initcalls

and one of the convenient places after both perf is initialized and
ftrace has enumerated the tracepoints is do_initcalls() (It cannot be an
early_initcall since at that time we're not running SMP yet and we want
the MCE event per cpu.)
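
The ordering argument can be checked with a small userspace model. It is
only a sketch: struct boot_state, the boot_steps table and
can_register_at() are hypothetical names, and the table mirrors just the
call sequence listed above, not the real init code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of what is available once a boot step has run. */
struct boot_state {
	bool perf_ready;	/* perf_event_init() has run */
	bool tracepoints_ready;	/* ftrace_init() has enumerated tracepoints */
	bool smp_up;		/* smp_init() has brought up the other CPUs */
};

struct boot_step {
	const char *name;
	void (*run)(struct boot_state *st);
};

static void step_perf(struct boot_state *st)   { st->perf_ready = true; }
static void step_ftrace(struct boot_state *st) { st->tracepoints_ready = true; }
static void step_smp(struct boot_state *st)    { st->smp_up = true; }
static void step_none(struct boot_state *st)   { (void)st; }

/* Same order as the call tree in the mail. */
static const struct boot_step boot_steps[] = {
	{ "sched_init",           step_none },
	{ "perf_event_init",      step_perf },
	{ "ftrace_init",          step_ftrace },
	{ "do_pre_smp_initcalls", step_none },	/* early_initcalls run here */
	{ "smp_init",             step_smp },
	{ "do_initcalls",         step_none },	/* core_initcalls run here */
};

/* True if perf, tracepoints and SMP are all available when 'step' starts. */
static bool can_register_at(const char *step)
{
	struct boot_state st = { false, false, false };
	size_t i;

	assert(step != NULL);
	for (i = 0; i < sizeof(boot_steps) / sizeof(boot_steps[0]); i++) {
		if (strcmp(boot_steps[i].name, step) == 0)
			return st.perf_ready && st.tracepoints_ready && st.smp_up;
		boot_steps[i].run(&st);
	}
	return false;	/* unknown step name */
}
```

In this model can_register_at("do_pre_smp_initcalls") is false because
SMP is not up yet, while can_register_at("do_initcalls") is true.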

So I added a core_initcall that registers the mce perf event. This makes
it more or less a persistent event without any changes to the perf_event
subsystem. I guess this should work - at least it builds here, will give
it a run later.

As a further enhancement, the init-function should read out all the
logged mce events which survived the warm reboot and those which happen
between mce init and the actual event registration so that perf can
postprocess those too at a more convenient time.

Thanks.

---
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 8a6f0af..e3370a2 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -94,6 +94,7 @@ static char *mce_helper_argv[2] = { mce_helper, NULL };

 static DECLARE_WAIT_QUEUE_HEAD(mce_wait);
 static DEFINE_PER_CPU(struct mce, mces_seen);
+static DEFINE_PER_CPU(struct perf_event *, mce_event);
 static int cpu_missing;

 /*
@@ -1996,6 +1997,60 @@ static void __cpuinit mce_reenable_cpu(void *h)
 	}
 }

+static struct perf_event_attr pattr = {
+	.type = PERF_TYPE_TRACEPOINT,
+	.size = sizeof(pattr),
+};
+
+static int mcheck_enable_perf_event_on_cpu(int cpu)
+{
+	struct perf_event *event;
+
+	pattr.config = event_mce_record.id;
+
+	event = perf_event_create_kernel_counter(&pattr, cpu, -1, NULL);
+	if (IS_ERR(event))
+		return PTR_ERR(event);
+
+	perf_event_enable(event);
+	per_cpu(mce_event, cpu) = event;
+
+	return 0;
+}
+
+static void mcheck_disable_perf_event_on_cpu(int cpu)
+{
+	struct perf_event *event = per_cpu(mce_event, cpu);
+
+	if (!event)
+		return;
+
+	perf_event_disable(event);
+	per_cpu(mce_event, cpu) = NULL;
+	perf_event_release_kernel(event);
+}
+
+static int mcheck_init_perf_event(void)
+{
+	int cpu, err = 0;
+
+	get_online_cpus();
+
+	for_each_online_cpu(cpu) {
+		err = mcheck_enable_perf_event_on_cpu(cpu);
+		if (err) {
+			printk(KERN_ERR "mce: error initializing mce tracepoint"
+					" on cpu %d\n", cpu);
+			break;
+		}
+	}
+
+	put_online_cpus();
+
+	return err;
+}
+core_initcall(mcheck_init_perf_event);
+
 /* Get notified when a cpu comes on/off. Be hotplug friendly. */
 static int __cpuinit
 mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
@@ -2009,6 +2064,7 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		mce_create_device(cpu);
 		if (threshold_cpu_callback)
 			threshold_cpu_callback(action, cpu);
+		mcheck_enable_perf_event_on_cpu(cpu);
 		break;
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
@@ -2020,6 +2076,7 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
 	case CPU_DOWN_PREPARE_FROZEN:
 		del_timer_sync(t);
 		smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
+		mcheck_disable_perf_event_on_cpu(cpu);
 		break;
 	case CPU_DOWN_FAILED:
 	case CPU_DOWN_FAILED_FROZEN:
@@ -2029,6 +2086,7 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
 			add_timer_on(t, cpu);
 		}
 		smp_call_function_single(cpu, mce_reenable_cpu, &action, 1);
+		mcheck_enable_perf_event_on_cpu(cpu);
 		break;
 	case CPU_POST_DEAD:
 		/* intentionally ignoring frozen here */
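
Note that mcheck_init_perf_event() stops at the first CPU that fails, so
events already created on earlier CPUs stay registered. One possible way
to handle that is to unwind on failure. The following is a userspace
model of that pattern only: NR_CPUS_MODEL, struct model_event,
fail_on_cpu and the model_* helpers are hypothetical stand-ins, not
kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_MODEL 4		/* hypothetical CPU count for the model */

struct model_event {
	int cpu;
};

static struct model_event event_storage[NR_CPUS_MODEL];
/* Stands in for per_cpu(mce_event, cpu) in the patch. */
static struct model_event *cpu_event[NR_CPUS_MODEL];
/* Test knob: creation fails on this CPU (-1 means never fail). */
static int fail_on_cpu = -1;

static struct model_event *model_event_create(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	if (cpu == fail_on_cpu)
		return NULL;
	event_storage[cpu].cpu = cpu;
	return &event_storage[cpu];
}

static void model_event_release(int cpu)
{
	cpu_event[cpu] = NULL;
}

/* Returns 0 on success; on failure returns -1 and leaves nothing registered. */
static int model_register_all(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		cpu_event[cpu] = model_event_create(cpu);
		if (!cpu_event[cpu])
			goto unwind;
	}
	return 0;

unwind:
	/* Release only the CPUs that were set up before the failure. */
	while (--cpu >= 0)
		model_event_release(cpu);
	return -1;
}

static int model_registered_count(void)
{
	int cpu, n = 0;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		if (cpu_event[cpu])
			n++;
	return n;
}
```

Whether a partial registration should be rolled back or kept for the
CPUs where it succeeded is a policy choice the thread does not settle.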

--
Regards/Gruss,
Boris.

From: Peter Zijlstra on 25 May 2010 11:10

On Tue, 2010-05-25 at 09:32 +0200, Borislav Petkov wrote:
> From: Peter Zijlstra <peterz(a)infradead.org>
> Date: Sun, May 23, 2010 at 09:23:21PM +0200
>
> > Either we add some notifier thing, or we simply add an explicit call in
> > the init sequence after the perf_event subsystem is running. I would
> > suggest we start with some explicit call, and take it from there.
>
> Ok, this couldn't be more straightforward. So I looked at the init
> sequence we do when booting wrt perf/ftrace initialization:
>
> start_kernel
>   ...
>   |-> sched_init
>   |-> perf_event_init
>   ...
>   |-> ftrace_init
>   rest_init
>     kernel_init
>       |-> do_pre_smp_initcalls
>       |...
>       |-> smp_init
>       |-> do_basic_setup
>         |-> do_initcalls
>
> and one of the convenient places after both perf is initialized and
> ftrace has enumerated the tracepoints is do_initcalls() (It cannot be an
> early_initcall since at that time we're not running SMP yet and we want
> the MCE event per cpu.)
>
> So I added a core_initcall that registers the mce perf event. This makes
> it more or less a persistent event without any changes to the perf_event
> subsystem. I guess this should work - at least it builds here, will give
> it a run later.
>
> As a further enhancement, the init-function should read out all the
> logged mce events which survived the warm reboot and those which happen
> between mce init and the actual event registration so that perf can
> postprocess those too at a more convenient time.

Right, so that looks good. Now the interesting part is twofold:

 1) expose these perf_events to userspace, since they're now created
    in kernel, there is no user-space access point to them. One way
    would be to extend the perf syscall to allow attaching to an
    existing instance (but that would limit us to a single instance per
    'attr'), or create some /debug or /sys iteration of all such events.

 2) get these things a buffer, perf_events as created don't actually
    have an output buffer, normally that is created at mmap() time, but
    since you cannot mmap() a kernel side event, it doesn't get to have
    a buffer. This could be done by extracting perf_mmap_data_alloc()
    into a sensible interface.
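
To make point 2 concrete, here is a userspace sketch of the kind of
fixed-size output buffer an in-kernel event would need to own. It is not
the perf buffer and not what perf_mmap_data_alloc() does: struct
model_ring, RING_SIZE, ring_write() and ring_read() are hypothetical,
and the sketch assumes a single writer and reader with no concurrency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 16u	/* bytes; must be a power of two */

struct model_ring {
	uint64_t head;			/* total bytes ever written */
	uint64_t tail;			/* total bytes ever consumed */
	unsigned char data[RING_SIZE];
};

static size_t ring_used(const struct model_ring *rb)
{
	return (size_t)(rb->head - rb->tail);
}

/* Copy len bytes in; returns 0, or -1 if the record does not fit (dropped). */
static int ring_write(struct model_ring *rb, const void *src, size_t len)
{
	const unsigned char *p = src;
	size_t i;

	assert((RING_SIZE & (RING_SIZE - 1)) == 0);
	if (len > RING_SIZE - ring_used(rb))
		return -1;
	for (i = 0; i < len; i++)
		rb->data[(rb->head + i) & (RING_SIZE - 1)] = p[i];
	rb->head += len;
	return 0;
}

/* Copy out up to len bytes; returns the number of bytes consumed. */
static size_t ring_read(struct model_ring *rb, void *dst, size_t len)
{
	unsigned char *p = dst;
	size_t i, n = ring_used(rb);

	if (len < n)
		n = len;
	for (i = 0; i < n; i++)
		p[i] = rb->data[(rb->tail + i) & (RING_SIZE - 1)];
	rb->tail += n;
	return n;
}
```

The sketch drops a record that does not fit rather than overwriting old
data; which policy a persistent event should use is left open here.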


From: Ingo Molnar on 28 May 2010 10:40

* Peter Zijlstra <peterz(a)infradead.org> wrote:

> On Tue, 2010-05-25 at 09:32 +0200, Borislav Petkov wrote:
> > From: Peter Zijlstra <peterz(a)infradead.org>
> > Date: Sun, May 23, 2010 at 09:23:21PM +0200
> >
> > > Either we add some notifier thing, or we simply add an explicit call in
> > > the init sequence after the perf_event subsystem is running. I would
> > > suggest we start with some explicit call, and take it from there.
> >
> > Ok, this couldn't be more straightforward. So I looked at the init
> > sequence we do when booting wrt perf/ftrace initialization:
> >
> > start_kernel
> >   ...
> >   |-> sched_init
> >   |-> perf_event_init
> >   ...
> >   |-> ftrace_init
> >   rest_init
> >     kernel_init
> >       |-> do_pre_smp_initcalls
> >       |...
> >       |-> smp_init
> >       |-> do_basic_setup
> >         |-> do_initcalls
> >
> > and one of the convenient places after both perf is initialized and
> > ftrace has enumerated the tracepoints is do_initcalls() (It cannot be an
> > early_initcall since at that time we're not running SMP yet and we want
> > the MCE event per cpu.)
> >
> > So I added a core_initcall that registers the mce perf event. This makes
> > it more or less a persistent event without any changes to the perf_event
> > subsystem. I guess this should work - at least it builds here, will give
> > it a run later.
> >
> > As a further enhancement, the init-function should read out all the
> > logged mce events which survived the warm reboot and those which happen
> > between mce init and the actual event registration so that perf can
> > postprocess those too at a more convenient time.
>
> Right, so that looks good. Now the interesting part is twofold:
>
>  1) expose these perf_events to userspace, since they're now created
>     in kernel, there is no user-space access point to them. One way
>     would be to extend the perf syscall to allow attaching to an
>     existing instance (but that would limit us to a single instance per
>     'attr'), or create some /debug or /sys iteration of all such events.

Yeah.

>  2) get these things a buffer, perf_events as created don't actually
>     have an output buffer, normally that is created at mmap() time, but
>     since you cannot mmap() a kernel side event, it doesn't get to have
>     a buffer. This could be done by extracting perf_mmap_data_alloc()
>     into a sensible interface.

#2 could be a new syscall: sys_create_ring_buffer or so?

Ingo