perf, ftrace and MCEs [Kernel]

From: Borislav Petkov on 1 May 2010 14:20

Hi,

so I finally had some spare time to stare at perf/ftrace code and ponder
on how to use those facilities for MCE collecting and reporting. Btw, I
have to say, it took me quite a while to understand what goes where - my
suggestion to anyone who tries to understand how perf/ftrace works is
to do make <file.i> where there is at least one trace_XXX emit record
function call and start untangling code paths from there.

So anyway, here are some questions I had, I just as well may've missed
something so please correct me if I'm wrong:

1. Since machine checks can happen at any time, we need to have the
MCE tracepoint (trace_mce_record() in <include/trace/events/mce.h>)
always enabled. This, in turn, means that we need the ftrace/perf
infrastructure always compiled in (lockless ring buffer, perf_event.c
stuff) on any x86 system so that MCEs can be handled at any time. Is this
going to be ok to be enabled on _all_ machines, hmmm... I dunno, maybe
only a subset of those facilities at least.

2. Tangential to 1., we need that "thin" software layer prepared for
decoding and reporting them as early as possible. event_trace_init() is
an fs_initcall and executed too late, IMHO. The ->perf_event_enable in
the ftrace_event_call is enabled even later on the perf init path over
the sys_perf_event_open which is at userspace time. In our case, this is
going to be executed by the error logging and decoding daemon I guess.

3. Since we want to listen for MCEs all the time, the concept of
enabling and disabling those events does not apply in the sense of
performance profiling. IOW, MCEs need to be able to be logged to the
ring buffer at any time. I guess this is easily done - we simply enable
MCE events at the earliest moment possible and disable them on shutdown;
done.

So yeah, some food for thought but what do you guys think?

Thanks.

--
Regards/Gruss,
Boris.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Steven Rostedt on 3 May 2010 10:50

On Sat, 2010-05-01 at 20:12 +0200, Borislav Petkov wrote:
> Hi,
>
> so I finally had some spare time to stare at perf/ftrace code and ponder
> on how to use those facilities for MCE collecting and reporting. Btw, I
> have to say, it took me quite a while to understand what goes where - my
> suggestion to anyone who tries to understand how perf/ftrace works is
> to do make <file.i> where there is at least one trace_XXX emit record
> function call and start untangling code paths from there.
>
> So anyway, here are some questions I had, I just as well may've missed
> something so please correct me if I'm wrong:
>
> 1. Since machine checks can happen at any time, we need to have the
> MCE tracepoint (trace_mce_record() in <include/trace/events/mce.h>)
> always enabled. This, in turn, means that we need the ftrace/perf
> infrastructure always compiled in (lockless ring buffer, perf_event.c
> stuff) on any x86 system so that MCEs can be handled at any time. Is this
> going to be ok to be enabled on _all_ machines, hmmm... I dunno, maybe
> only a subset of those facilities at least.

I'm not exactly sure what your goal is, but if you need to do something
directly, you can bypass ftrace and perf. All trace events can be
connected by anything even when ftrace and perf are not enabled.

That is, you need to connect to the tracepoint and write your own
callback. This can be done pretty much at any time during boot up. To see
how to connect to a trace point, you can look at
register_trace_sched_switch() in kernel/trace/ftrace.c. This registers a
callback to the trace_sched_switch() trace point in sched.c.

>
> 2. Tangential to 1., we need that "thin" software layer prepared for
> decoding and reporting them as early as possible. event_trace_init() is
> an fs_initcall and executed too late, IMHO. The ->perf_event_enable in
> the ftrace_event_call is enabled even later on the perf init path over
> the sys_perf_event_open which is at userspace time. In our case, this is
> going to be executed by the error logging and decoding daemon I guess.
>
> 3. Since we want to listen for MCEs all the time, the concept of
> enabling and disabling those events does not apply in the sense of
> performance profiling. IOW, MCEs need to be able to be logged to the
> ring buffer at any time. I guess this is easily done - we simply enable
> MCE events at the earliest moment possible and disable them on shutdown;
> done.

This looks like a good reason to have your own handler. More than one
callback may be registered to a tracepoint, so you do not need to worry
about having other handlers affect your code.
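
To make the callback model concrete, here is a minimal userspace sketch in
C. It is only a model of the idea, not the kernel tracepoint API: the names
tracepoint_model, mce_sample, tp_register(), tp_unregister(), tp_emit() and
the two example probes are all hypothetical, and the fixed-size probe table
is an assumption made to keep the sketch short. It shows one emit site
fanning out to several independently registered callbacks, and one of them
being removed without disturbing the other.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROBES 4	/* assumed limit, for the sketch only */

/* Illustrative payload; the fields are hypothetical, not struct mce. */
struct mce_sample {
	unsigned int cpu;
	unsigned int bank;
	unsigned long long status;
};

typedef void (*probe_fn)(const struct mce_sample *m, void *data);

/* Hypothetical userspace stand-in for a tracepoint's probe list. */
struct tracepoint_model {
	probe_fn probes[MAX_PROBES];
	void *data[MAX_PROBES];
	size_t nr;
};

/* Attach a callback; returns 0 on success, -1 if the table is full. */
int tp_register(struct tracepoint_model *tp, probe_fn fn, void *data)
{
	assert(tp != NULL && fn != NULL);
	if (tp->nr == MAX_PROBES)
		return -1;
	tp->probes[tp->nr] = fn;
	tp->data[tp->nr] = data;
	tp->nr++;
	return 0;
}

/* Detach the first matching (fn, data) pair; -1 if it was not registered. */
int tp_unregister(struct tracepoint_model *tp, probe_fn fn, void *data)
{
	size_t i, j;

	for (i = 0; i < tp->nr; i++) {
		if (tp->probes[i] != fn || tp->data[i] != data)
			continue;
		for (j = i + 1; j < tp->nr; j++) {
			tp->probes[j - 1] = tp->probes[j];
			tp->data[j - 1] = tp->data[j];
		}
		tp->nr--;
		return 0;
	}
	return -1;
}

/* The "trace_xxx()" call site: invoke every registered probe in order. */
void tp_emit(const struct tracepoint_model *tp, const struct mce_sample *m)
{
	size_t i;

	for (i = 0; i < tp->nr; i++)
		tp->probes[i](m, tp->data[i]);
}

/* Example probe: count how many events were seen. */
void count_probe(const struct mce_sample *m, void *data)
{
	(void)m;
	++*(unsigned int *)data;
}

/* Example probe: remember the bank of the most recent event. */
void last_bank_probe(const struct mce_sample *m, void *data)
{
	*(unsigned int *)data = m->bank;
}
```

In the kernel, registration and unregistration have to be safe against
probes that are running concurrently; the sketch ignores that and only
shows the fan-out to independent handlers.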

-- Steve

>
> So yeah, some food for thought but what do you guys think?
>
> Thanks.
>


From: Borislav Petkov on 3 May 2010 17:30

From: Steven Rostedt <rostedt(a)goodmis.org>
Date: Mon, May 03, 2010 at 10:41:12AM -0400

Hi Steven,

> On Sat, 2010-05-01 at 20:12 +0200, Borislav Petkov wrote:
> > Hi,
> >
> > so I finally had some spare time to stare at perf/ftrace code and ponder
> > on how to use those facilities for MCE collecting and reporting. Btw, I
> > have to say, it took me quite a while to understand what goes where - my
> > suggestion to anyone who tries to understand how perf/ftrace works is
> > to do make <file.i> where there is at least one trace_XXX emit record
> > function call and start untangling code paths from there.
> >
> > So anyway, here are some questions I had, I just as well may've missed
> > something so please correct me if I'm wrong:
> >
> > 1. Since machine checks can happen at any time, we need to have the
> > MCE tracepoint (trace_mce_record() in <include/trace/events/mce.h>)
> > always enabled. This, in turn, means that we need the ftrace/perf
> > infrastructure always compiled in (lockless ring buffer, perf_event.c
> > stuff) on any x86 system so that MCEs can be handled at any time. Is this
> > going to be ok to be enabled on _all_ machines, hmmm... I dunno, maybe
> > only a subset of those facilities at least.
>
> I'm not exactly sure what your goal is, but if you need to do something
> directly, you can bypass ftrace and perf. All trace events can be
> connected by anything even when ftrace and perf are not enabled.

Right, so the idea is to use the perf/ftrace infrastructure to detect
failing hardware which is signalled through machine checks, among
others. I'm thinking a lockless ring buffer would be cool so we can
execute in any context... wait a minute, right, we have that already.

However, if I use the perf/ftrace facilities, I have to have them
enabled on every system since machine checks are core processor
functionality and the software support for those has to be always
enabled. And the code has to be small and execute fast since after a
critical mcheck happens all bets are off.

So I'm thinking maybe a core perf/ftrace stuff which is thin and is
always enabled... but I'm not sure for I haven't stared at the code
enough yet.

> That is, you need to connect to the tracepoint and write your own
> callback. This can be done pretty much at any time during boot up. To see
> how to connect to a trace point, you can look at
> register_trace_sched_switch() in kernel/trace/ftrace.c. This registers a
> callback to the trace_sched_switch() trace point in sched.c.

The base tracepoint functionality should suffice for now but this is
definitely a cool point and good to know, thanks.

> >
> > 2. Tangential to 1., we need that "thin" software layer prepared for
> > decoding and reporting them as early as possible. event_trace_init() is
> > an fs_initcall and executed too late, IMHO. The ->perf_event_enable in
> > the ftrace_event_call is enabled even later on the perf init path over
> > the sys_perf_event_open which is at userspace time. In our case, this is
> > going to be executed by the error logging and decoding daemon I guess.
> >
> > 3. Since we want to listen for MCEs all the time, the concept of
> > enabling and disabling those events does not apply in the sense of
> > performance profiling. IOW, MCEs need to be able to be logged to the
> > ring buffer at any time. I guess this is easily done - we simply enable
> > MCE events at the earliest moment possible and disable them on shutdown;
> > done.
>
> This looks like a good reason to have your own handler. More than one
> callback may be registered to a tracepoint, so you do not need to worry
> about having other handlers affect your code.

Yep, good ideas, thanks. /me goes back to the drawing board.

--
Regards/Gruss,
Boris.

From: Andi Kleen on 4 May 2010 06:20

Borislav Petkov <bp(a)alien8.de> writes:

> so I finally had some spare time to stare at perf/ftrace code and ponder
> on how to use those facilities for MCE collecting and reporting. Btw, I

A good beginning of any such investigations would be to describe
what exact problems you're trying to solve here.

-Andi

--
ak(a)linux.intel.com -- Speaking for myself only.

From: Ingo Molnar on 4 May 2010 07:40

* Borislav Petkov <bp(a)alien8.de> wrote:

> Hi,
>
> so I finally had some spare time to stare at perf/ftrace code and ponder on
> how to use those facilities for MCE collecting and reporting. Btw, I have to
> say, it took me quite a while to understand what goes where - my suggestion
> to anyone who tries to understand how perf/ftrace works is to do make
> <file.i> where there is at least one trace_XXX emit record function call and
> start untangling code paths from there.
>
> So anyway, here are some questions I had, I just as well may've missed
> something so please correct me if I'm wrong:
>
> 1. Since machine checks can happen at any time, we need to have the MCE
> tracepoint (trace_mce_record() in <include/trace/events/mce.h>) always
> enabled. This, in turn, means that we need the ftrace/perf infrastructure
> always compiled in (lockless ring buffer, perf_event.c stuff) on any x86
> system so that MCEs can be handled at any time. Is this going to be ok to be
> enabled on _all_ machines, hmmm... I dunno, maybe only a subset of those
> facilities at least.

Yeah - and this happens on x86 anyway so you can rely on it.

> 2. Tangential to 1., we need that "thin" software layer prepared for
> decoding and reporting them as early as possible. event_trace_init() is an
> fs_initcall and executed too late, IMHO. The ->perf_event_enable in the
> ftrace_event_call is enabled even later on the perf init path over the
> sys_perf_event_open which is at userspace time. In our case, this is going to be
> executed by the error logging and decoding daemon I guess.

We could certainly move bits of this initialization earlier.

Also we could add the notion of 'persistent' events that don't have a
user-space process attached to them - and which would buffer to a certain
degree. Such persistent events could be initialized during bootup and the
daemon could pick up the events later on.
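
One way to picture such a persistent event is a small fixed-size buffer
that is filled from boot onwards and drained whenever a consumer finally
attaches. The sketch below is a single-threaded userspace model in C, not
perf code: persist_buf, event_rec, persist_log(), persist_drain() and the
slot count are hypothetical names and values, and overwriting the oldest
record when the buffer is full is an assumed policy (the mail only says the
event would buffer "to a certain degree").

```c
#include <assert.h>
#include <stddef.h>

#define PERSIST_SLOTS 8	/* assumed capacity, for the sketch only */

/* Illustrative record; the fields are hypothetical. */
struct event_rec {
	unsigned long long seq;		/* order in which events were logged */
	unsigned int cpu;
	unsigned long long status;
};

struct persist_buf {
	struct event_rec slot[PERSIST_SLOTS];
	size_t head;			/* next slot to write */
	size_t count;			/* records not yet drained */
	unsigned long long lost;	/* records overwritten before being read */
	unsigned long long next_seq;
};

/* Log an event with no consumer attached; overwrites the oldest when full. */
void persist_log(struct persist_buf *b, unsigned int cpu,
		 unsigned long long status)
{
	struct event_rec *r;

	assert(b != NULL);
	r = &b->slot[b->head];
	if (b->count == PERSIST_SLOTS)
		b->lost++;		/* oldest record is about to be replaced */
	else
		b->count++;
	r->seq = b->next_seq++;
	r->cpu = cpu;
	r->status = status;
	b->head = (b->head + 1) % PERSIST_SLOTS;
}

/* Consumer side: copy out up to max buffered records, oldest first. */
size_t persist_drain(struct persist_buf *b, struct event_rec *out, size_t max)
{
	size_t n = 0;
	size_t tail = (b->head + PERSIST_SLOTS - b->count) % PERSIST_SLOTS;

	while (b->count > 0 && n < max) {
		out[n++] = b->slot[tail];
		tail = (tail + 1) % PERSIST_SLOTS;
		b->count--;
	}
	return n;
}
```

A real implementation would have to be usable from machine check context,
which this model makes no attempt at; it only illustrates events being
retained until a late-starting daemon picks them up, with a counter for
what was lost in the meantime.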

In-kernel actions/policy could work off the callback mechanism. (See for
example how the new NMI watchdog code in tip:perf/nmi makes use of it - or how
the hw-breakpoints code utilizes it.) These would work even if there's no
user-space daemon attached (or if the daemon has been stopped). So I don't see
a significant design problem here - it's all natural extensions of existing
perf facilities and could be used for other purposes as well.

> 3. Since we want to listen for MCEs all the time, the concept of enabling
> and disabling those events does not apply in the sense of performance
> profiling. [...]

Correct.

> [...] IOW, MCEs need to be able to be logged to the ring buffer at any time.
> I guess this is easily done - we simply enable MCE events at the earliest
> moment possible and disable them on shutdown; done.
>
> So yeah, some food for thought but what do you guys think?

Note that it doesn't _have to_ go to the ftrace ring-buffer. If the daemon (or
whatever facility picking up the events) keeps a global (per cpu) MCE perf
event enabled all the time then it might be doing that regardless of ftrace.

Some decoupling from ftrace could be done here easily. I'd suggest to not
worry about it - once we have the MCE event code we can certainly reshape the
underlying support code to be more readily available/configurable (or even
built-in). This is not really a significant design issue.

To start with this, a quick initial prototype could use the 'perf trace' live
mode tracing script. (See latest -tip, 'perf trace --script <script-name>' and
'perf record -o -' to activate live mode.)

Note that there's also 'perf inject' now, which can be used to simulate rare
events and test the daemon's reactions to them. (Right now perf-inject is only
used to inject certain special build-id events, but it can be used for the
injection of MCE events as well.)

Thanks,

Ingo
