From: Stephane Eranian on
Hi,

[Repost because of HTML]

On Mon, Dec 21, 2009 at 4:40 PM, Peter Zijlstra <peterz(a)infradead.org> wrote:
>
> On Fri, 2009-12-11 at 12:59 +0100, stephane eranian wrote:
>
> > There is a major difference between PPC and X86 here. PPC has a
> > centralized register to control start/stop. This register uses a
> > bitmask to enable or disable counters. Thus, in hw_perf_enable(), if
> > n_added=0, then you just need to use the pre-computed bitmask.
> > Otherwise, you need to recompute the bitmask to include the new
> > registers. The assignment of events and validation is done in
> > hw_group_sched_in().
> >
> > In X86, assignment and validation is done in hw_group_sched_in().
> > Activation is done individually for each counter. There is no
> > centralized register used here, thus no bitmask to update.
>
> intel core2 has the global control reg, but for all intents and purposes
> the perf_enable/disable calls emulate this global enable/disable.
>
> > Disabling a counter does not trigger a complete reschedule of events.
> > This happens only when hw_group_sched_in() is called.
> >
> > The n_events = 0 in hw_perf_disable() is used to signal that something
> > is changing. It should not really be here, but it is. The problem is
> > that hw_group_sched_in() needs a way to know that it is called for a
> > completely new series of group scheduling so it can discard any
> > previous assignment. This goes back to the issue I raised in my
> > previous email. You could add a parameter to hw_group_sched_in() that
> > would indicate this is the first group. That would cause n_events = 0
> > and the function would start accumulating events for the new
> > scheduling period.
>
> I'm not really seeing the problem here...
>
>
>  perf_disable() <-- shut down the full pmu
>
>  pmu->disable() <-- hey someone got removed (easy free the reg)
>  pmu->enable()  <-- hey someone got added (harder, check constraints)
>
>  hw_perf_group_sched_in() <-- hey a full group got added
>                              (better than multiple ->enable)
>
>  perf_enable() <-- re-enable pmu
>
>
> So ->disable() is used to track freeing, ->enable is used to add
> individual counters, check constraints etc..
>
> hw_perf_group_sched_in() is used to optimize the full group enable.
>

Does that mean that after a disable() I can assume that there won't
be an enable() without a group_sched_in()?

I suspect not. In fact, there is a counter-example in perf_ctx_adjust_freq().

> Afaict that is what power does (Paul?) and that should I think be
> sufficient to track x86 as well.
>
> Since sched_in() is balanced with sched_out(), the ->disable() calls
> should provide the required information as to the occupation of the pmu.
> I don't see the need for more hooks.
>
> Paul, could you comment, since you did all this for power?
>
From: Peter Zijlstra on
On Mon, 2009-12-21 at 20:00 +0100, Stephane Eranian wrote:

> > perf_disable() <-- shut down the full pmu
> >
> > pmu->disable() <-- hey someone got removed (easy free the reg)
> > pmu->enable() <-- hey someone got added (harder, check constraints)
> >
> > hw_perf_group_sched_in() <-- hey a full group got added
> > (better than multiple ->enable)
> >
> > perf_enable() <-- re-enable pmu
> >
> >
> > So ->disable() is used to track freeing, ->enable is used to add
> > individual counters, check constraints etc..
> >
> > hw_perf_group_sched_in() is used to optimize the full group enable.
> >
>
> Does that mean that after a disable() I can assume that there won't
> be an enable() without a group_sched_in()?
>
> I suspect not. In fact, there is a counter-example in perf_ctx_adjust_freq().

Why would that be a problem?

A perf_disable() will simply disable the pmu, but not alter its
configuration, perf_enable() will program the pmu and re-enable it (with
the obvious optimization that if the current state and the new state
match we can skip the actual programming).

If a ->disable() was observed in between, it will simply not re-enable
that particular counter; that is, it will remove that counter's config,
and perf_enable() can do a smart reprogram by simply not re-enabling
that counter.

If an ->enable() or hw_perf_group_sched_in() was observed in between,
you have to recompute the full state in order to validate the
availability, if that fails, no state change, if it succeeds you have a
new pmu state and perf_enable() will program that.
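
For illustration only, here is a minimal sketch of what such an
incremental ->enable() could look like on x86. The compute_assignment()
helper, the MAX_EVENTS limit and the cpu_hw_events fields used here
(event_list, assign, n_added) are made-up names for this sketch, not the
actual kernel API:

        /* hypothetical incremental ->enable(): add one event, revalidate all */
        static int x86_pmu_enable(struct perf_event *event)
        {
                struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
                int n0 = cpuc->n_events;

                if (n0 >= MAX_EVENTS)
                        return -ENOSPC;

                /* tentatively append the new event to the current set */
                cpuc->event_list[n0] = event;

                /* recompute the full assignment, old events plus the new one */
                if (compute_assignment(cpuc->event_list, n0 + 1, cpuc->assign) < 0)
                        return -ENOSPC;         /* no state change on failure */

                cpuc->n_events = n0 + 1;
                cpuc->n_added++;                /* tell perf_enable() to reprogram */
                return 0;
        }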

In the case of perf_ctx_adjust_freq():

        if (!interrupts) {
                perf_disable();
                event->pmu->disable(event);
                atomic64_set(&hwc->period_left, 0);
                event->pmu->enable(event);
                perf_enable();
        }

You'll very likely end up with the same state you had before, but if not,
who cares -- it's supposed to be an unlikely case.

That is, the ->disable() will clear the cpu_hw_events::used_mask bit.
The ->enable() will want to place the counter on its last known reg,
which (not so very surprisingly) will be available, hence it will
trivially succeed; depending on its smarts it might or might not find
that the 'new' counter's config is identical to the current hw state; if
not, it might set a cpu_hw_event::reprogram state.

perf_enable() will, when it sees cpu_hw_event::reprogram, re-write the
pmu config and re-enable the whole thing (global ctrl reg, or iterate
individual EN bits).
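
Along the same lines, a hedged sketch of that perf_enable() side; the
reprogram flag and the program_counter()/enable_all_counters() helpers
are invented names, not the real implementation:

        /* hypothetical arch hook run from perf_enable() */
        void hw_perf_enable(void)
        {
                struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
                int i;

                if (cpuc->reprogram) {
                        /* hw state is stale: rewrite each counter's config */
                        for (i = 0; i < cpuc->n_events; i++)
                                program_counter(cpuc->assign[i], cpuc->event_list[i]);
                        cpuc->reprogram = 0;
                }

                /* re-enable the pmu: global ctrl reg, or iterate individual EN bits */
                enable_all_counters(cpuc);
        }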



From: Stephane Eranian on
On Mon, Dec 21, 2009 at 8:31 PM, Peter Zijlstra <peterz(a)infradead.org> wrote:
> On Mon, 2009-12-21 at 20:00 +0100, Stephane Eranian wrote:
>
>> >  perf_disable() <-- shut down the full pmu
>> >
>> >  pmu->disable() <-- hey someone got removed (easy free the reg)
>> >  pmu->enable()  <-- hey someone got added (harder, check constraints)
>> >
>> >  hw_perf_group_sched_in() <-- hey a full group got added
>> >                              (better than multiple ->enable)
>> >
>> >  perf_enable() <-- re-enable pmu
>> >
>> >
>> > So ->disable() is used to track freeing, ->enable is used to add
>> > individual counters, check constraints etc..
>> >
>> > hw_perf_group_sched_in() is used to optimize the full group enable.
>> >
>>
>> Does that mean that after a disable() I can assume that there won't
>> be an enable() without a group_sched_in()?
>>
>> I suspect not. In fact, there is a counter-example in perf_ctx_adjust_freq().
>
> Why would that be a problem?
>
> A perf_disable() will simply disable the pmu, but not alter its
> configuration, perf_enable() will program the pmu and re-enable it (with
> the obvious optimization that if the current state and the new state
> match we can skip the actual programming).
>
> If a ->disable() was observed in between, it will simply not re-enable
> that particular counter; that is, it will remove that counter's config,
> and perf_enable() can do a smart reprogram by simply not re-enabling
> that counter.
>
> If an ->enable() or hw_perf_group_sched_in() was observed in between,
> you have to recompute the full state in order to validate the
> availability, if that fails, no state change, if it succeeds you have a
> new pmu state and perf_enable() will program that.
>

Ok, so what you are suggesting is that the assignment is actually done
incrementally in ->enable(). hw_group_sched_in() would simply validate
that a single group is sane (i.e., can be scheduled if it was alone).

> In the case of perf_ctx_adjust_freq():
>
>        if (!interrupts) {
>                perf_disable();
>                event->pmu->disable(event);
>                atomic64_set(&hwc->period_left, 0);
>                event->pmu->enable(event);
>                perf_enable();
>        }
>
> You'll very likely end up with the same state you had before, but if not,
> who cares -- it's supposed to be an unlikely case.
>
> That is, the ->disable() will clear the cpu_hw_events::used_mask bit.
> The ->enable() will want to place the counter on its last known reg,
> which (not so very surprisingly) will be available, hence it will

Not in the case I am looking at on AMD. The counter you need to grab
depends on what is going on on the other cores on the socket. So
there is no guarantee that you will get the same counter back. Something
similar exists on Nehalem, but it has to do with HT.

> trivially succeed; depending on its smarts it might or might not find
> that the 'new' counter's config is identical to the current hw state; if
> not, it might set a cpu_hw_event::reprogram state.
>
> perf_enable() will, when it sees cpu_hw_event::reprogram, re-write the
> pmu config and re-enable the whole thing (global ctrl reg, or iterate
> individual EN bits).
From: Peter Zijlstra on
On Mon, 2009-12-21 at 21:59 +0100, Stephane Eranian wrote:
> On Mon, Dec 21, 2009 at 8:31 PM, Peter Zijlstra <peterz(a)infradead.org> wrote:
> > On Mon, 2009-12-21 at 20:00 +0100, Stephane Eranian wrote:
> >
> >> > perf_disable() <-- shut down the full pmu
> >> >
> >> > pmu->disable() <-- hey someone got removed (easy free the reg)
> >> > pmu->enable() <-- hey someone got added (harder, check constraints)
> >> >
> >> > hw_perf_group_sched_in() <-- hey a full group got added
> >> > (better than multiple ->enable)
> >> >
> >> > perf_enable() <-- re-enable pmu
> >> >
> >> >
> >> > So ->disable() is used to track freeing, ->enable is used to add
> >> > individual counters, check constraints etc..
> >> >
> >> > hw_perf_group_sched_in() is used to optimize the full group enable.
> >> >
> >>
> >> Does that mean that after a disable() I can assume that there won't
> >> be an enable() without a group_sched_in()?
> >>
> >> I suspect not. In fact, there is a counter-example in perf_ctx_adjust_freq().
> >
> > Why would that be a problem?
> >
> > A perf_disable() will simply disable the pmu, but not alter its
> > configuration, perf_enable() will program the pmu and re-enable it (with
> > the obvious optimization that if the current state and the new state
> > match we can skip the actual programming).
> >
> > If a ->disable() was observed in between, it will simply not re-enable
> > that particular counter; that is, it will remove that counter's config,
> > and perf_enable() can do a smart reprogram by simply not re-enabling
> > that counter.
> >
> > If an ->enable() or hw_perf_group_sched_in() was observed in between,
> > you have to recompute the full state in order to validate the
> > availability, if that fails, no state change, if it succeeds you have a
> > new pmu state and perf_enable() will program that.
> >
>
> Ok, so what you are suggesting is that the assignment is actually done
> incrementally in ->enable(). hw_group_sched_in() would simply validate
> that a single group is sane (i.e., can be scheduled if it was alone).

hw_perf_group_sched_in() can be used to do a whole group at once (see
how a !0 return value will short-circuit the whole incremental group
buildup), but yes ->enable() is incremental.
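
Roughly, the generic path looks like the sketch below (simplified from
memory of kernel/perf_event.c of that era; the undo path for a partially
scheduled group is omitted, so treat it as approximate rather than the
exact code):

        static int group_sched_in(struct perf_event *group_event,
                                  struct perf_cpu_context *cpuctx,
                                  struct perf_event_context *ctx,
                                  int cpu)
        {
                struct perf_event *event;
                int ret;

                /*
                 * Let the arch try to schedule the whole group in one go; a
                 * non-zero return short-circuits the incremental path below
                 * (negative means the group does not fit).
                 */
                ret = hw_perf_group_sched_in(group_event, cpuctx, ctx, cpu);
                if (ret)
                        return ret < 0 ? ret : 0;

                /* incremental path: leader first, then each sibling */
                if (event_sched_in(group_event, cpuctx, ctx, cpu))
                        return -EAGAIN;

                list_for_each_entry(event, &group_event->sibling_list, group_entry) {
                        if (event_sched_in(event, cpuctx, ctx, cpu))
                                return -EAGAIN; /* undo of the leader omitted */
                }

                return 0;
        }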

> > In the case of perf_ctx_adjust_freq():
> >
> >        if (!interrupts) {
> >                perf_disable();
> >                event->pmu->disable(event);
> >                atomic64_set(&hwc->period_left, 0);
> >                event->pmu->enable(event);
> >                perf_enable();
> >        }
> >
> > You'll very likely end up with the same state you had before, but if not,
> > who cares -- it's supposed to be an unlikely case.
> >
> > That is, the ->disable() will clear the cpu_hw_events::used_mask bit.
> > The ->enable() will want to place the counter on its last known reg,
> > which (not so very surprisingly) will be available, hence it will
>
> Not in the case I am looking at on AMD. The counter you need to grab
> depends on what is going on on the other cores on the socket. So
> there is no guarantee that you will get the same counter back. Something
> similar exists on Nehalem, but it has to do with HT.

Well the above code is assumed correct, so we need to make it true,
which should not be too hard to do.

If there are cross-cpu constraints (AMD has per-node constraints IIRC),
then we need a structure that describes these (a per-node structure for
AMD, a per-core structure for HT). If we take a lock on this structure
the first time we touch it in a perf_disable() region, and release it on
perf_enable(), then we ensure those constraints remain invariant.

With something like that in place the above code will still be valid,
since the act of removing the counter will freeze all needed state to
re-instate it.
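
Purely as a sketch of the idea (every name here is invented, this is not
existing kernel code): a per-node object holds the shared state and is
locked on first touch inside the perf_disable()/perf_enable() section:

        /* hypothetical per-node state shared by the cores of an AMD socket */
        struct amd_node_constraints {
                spinlock_t      lock;
                u64             owned_mask;     /* counters claimed by cores on this node */
        };

        /* called the first time a constrained counter is touched while disabled */
        static void node_constraints_get(struct cpu_hw_events *cpuc)
        {
                if (!cpuc->node_locked) {
                        spin_lock(&cpuc->node->lock);
                        cpuc->node_locked = 1;
                }
        }

        /* called from perf_enable(): the node state stayed invariant in between */
        static void node_constraints_put(struct cpu_hw_events *cpuc)
        {
                if (cpuc->node_locked) {
                        cpuc->node_locked = 0;
                        spin_unlock(&cpuc->node->lock);
                }
        }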

Does that make sense?

From: Paul Mackerras on
On Mon, Dec 21, 2009 at 09:59:45PM +0100, Stephane Eranian wrote:

> Ok, so what you are suggesting is that the assignment is actually done
> incrementally in ->enable(). hw_group_sched_in() would simply validate
> that a single group is sane (i.e., can be scheduled if it was alone).

No, hw_group_sched_in needs to validate that this group can go on
along with everything else that has already been enabled. But as I
have said, if you have the complete list of enabled events to hand,
that's not hard.

On the other hand, hw_perf_event_init does need to validate that a
single group is sane by itself.
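
As an illustration of that init-time check (the collect_into() and
compute_assignment() helpers and the fake_cpu_state type are made up for
this sketch -- the idea is just to schedule the group against an empty,
throw-away pmu state):

        /* throw-away scheduling state used only for validation */
        struct fake_cpu_state {
                int                     n_events;
                struct perf_event       *event_list[MAX_EVENTS];
        };

        /* from hw_perf_event_init(): could this group ever fit on its own? */
        static int validate_group(struct perf_event *event)
        {
                struct perf_event *leader = event->group_leader;
                struct perf_event *sibling;
                struct fake_cpu_state fake = { .n_events = 0 };

                if (collect_into(&fake, leader) < 0)
                        return -ENOSPC;

                list_for_each_entry(sibling, &leader->sibling_list, group_entry) {
                        if (collect_into(&fake, sibling) < 0)
                                return -ENOSPC;
                }

                /* can a valid assignment be found for this group alone?
                 * (no assignment output needed, hence NULL) */
                return compute_assignment(fake.event_list, fake.n_events, NULL);
        }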

Paul.