From: Peter Zijlstra on
Sorry for the delay.

In general I'd very much like to see some of this generalized because I
think Sparc64 has very similar 'simple' constraints, I'll have to defer
to Paul on how much of this could possibly be re-used for powerpc.

On Mon, 2009-10-19 at 17:03 +0200, Stephane Eranian wrote:
> arch/x86/include/asm/perf_event.h | 6 +-
> arch/x86/kernel/cpu/perf_event.c | 497 +++++++++++++++++++++++--------------
> 2 files changed, 318 insertions(+), 185 deletions(-)
>
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index 8d9f854..7c737af 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -26,7 +26,9 @@
> /*
> * Includes eventsel and unit mask as well:
> */
> -#define ARCH_PERFMON_EVENT_MASK 0xffff
> +#define ARCH_PERFMON_EVENTSEL_EVENT_MASK 0x00ff
> +#define ARCH_PERFMON_EVENTSEL_UNIT_MASK 0xff00
> +#define ARCH_PERFMON_EVENT_MASK (ARCH_PERFMON_EVENTSEL_UNIT_MASK|ARCH_PERFMON_EVENTSEL_EVENT_MASK)

There are various forms of the above throughout the code; it would be nice
not to add more.

And in general this patch has way too many >80-column lines and other
style nits, but those can be fixed up easily enough I guess.

> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 2e20bca..0f96c51 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c

> @@ -68,6 +69,15 @@ struct debug_store {
> u64 pebs_event_reset[MAX_PEBS_EVENTS];
> };
>
> +#define BITS_TO_U64(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(u64))

Do we need this, is it realistic to expect X86_PMC_IDX_MAX to be
anything other than 64?

> -#define EVENT_CONSTRAINT(c, m) { .code = (c), .idxmsk[0] = (m) }
> -#define EVENT_CONSTRAINT_END { .code = 0, .idxmsk[0] = 0 }
> +#define EVENT_CONSTRAINT(c, n, w, m) { \
> + .code = (c), \
> + .mask = (m), \
> + .weight = (w), \
> + .idxmsk[0] = (n) }

If not, we can do away with the weight argument here and use hweight64(),
which should reduce to a compile-time constant.
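A minimal sketch of that (the weight is derived from the index mask at the
point of use; idxmsk[0] covers everything as long as X86_PMC_IDX_MAX stays
at or below 64):

#define EVENT_CONSTRAINT(c, n, m) {	\
	.code = (c),			\
	.mask = (m),			\
	.idxmsk[0] = (n) }

	/* and in schedule_events(), instead of c->weight: */
	weight = c ? hweight64(c->idxmsk[0]) : x86_pmu.num_events;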

> @@ -124,7 +137,7 @@ static DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
> .enabled = 1,
> };
>
> -static const struct event_constraint *event_constraints;
> +static struct event_constraint *event_constraints;

I'm thinking this ought to live in x86_pmu, or possibly, if we can
generalize this enough, in pmu.

> +static struct event_constraint intel_core_event_constraints[] =
> +{

Inconsistent style with below:

> +static struct event_constraint intel_nehalem_event_constraints[] = {
> + EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32)), 3, ARCH_PERFMON_FIXED_EVENT_MASK), /* INSTRUCTIONS_RETIRED */
> + EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33)), 3, ARCH_PERFMON_FIXED_EVENT_MASK), /* UNHALTED_CORE_CYCLES */
> + EVENT_CONSTRAINT(0x40, 0x3, 2, ARCH_PERFMON_EVENTSEL_EVENT_MASK), /* L1D_CACHE_LD */
> + EVENT_CONSTRAINT(0x41, 0x3, 2, ARCH_PERFMON_EVENTSEL_EVENT_MASK), /* L1D_CACHE_ST */
> + EVENT_CONSTRAINT(0x42, 0x3, 2, ARCH_PERFMON_EVENTSEL_EVENT_MASK), /* L1D_CACHE_LOCK */
> + EVENT_CONSTRAINT(0x43, 0x3, 2, ARCH_PERFMON_EVENTSEL_EVENT_MASK), /* L1D_ALL_REF */
> + EVENT_CONSTRAINT(0x4e, 0x3, 2, ARCH_PERFMON_EVENTSEL_EVENT_MASK), /* L1D_PREFETCH */
> + EVENT_CONSTRAINT(0x4c, 0x3, 2, ARCH_PERFMON_EVENTSEL_EVENT_MASK), /* LOAD_HIT_PRE */
> + EVENT_CONSTRAINT(0x51, 0x3, 2, ARCH_PERFMON_EVENTSEL_EVENT_MASK), /* L1D */
> + EVENT_CONSTRAINT(0x52, 0x3, 2, ARCH_PERFMON_EVENTSEL_EVENT_MASK), /* L1D_CACHE_PREFETCH_LOCK_FB_HIT */
> + EVENT_CONSTRAINT(0x53, 0x3, 2, ARCH_PERFMON_EVENTSEL_EVENT_MASK), /* L1D_CACHE_LOCK_FB_HIT */
> + EVENT_CONSTRAINT(0xc5, 0x3, 2, ARCH_PERFMON_EVENTSEL_EVENT_MASK), /* CACHE_LOCK_CYCLES */
> EVENT_CONSTRAINT_END
> };

Would be nice to get rid of that EVENT_MASK part, maybe write
EVENT_CONSTRAINT like:

.mask = ARCH_PERFMON_EVENTSEL_EVENT_MASK | ((n)>>32 ? ARCH_PERFMON_FIXED_MASK : 0),

Which ought to work for Intel-based things; AMD will need a different
base event mask.
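Roughly, folding the mask selection into the macro (just a sketch; this
uses the patch's ARCH_PERFMON_FIXED_EVENT_MASK name, and a u64 cast so
the >>32 stays well-defined for plain 32-bit constants like 0x3):

#define EVENT_CONSTRAINT(c, n, w) {					\
	.code	= (c),							\
	.mask	= ARCH_PERFMON_EVENTSEL_EVENT_MASK |			\
		  (((u64)(n) >> 32) ? ARCH_PERFMON_FIXED_EVENT_MASK : 0), \
	.weight	= (w),							\
	.idxmsk[0] = (n) }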

> @@ -1120,9 +1136,15 @@ static void amd_pmu_disable_all(void)
>
> void hw_perf_disable(void)
> {
> + struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> +
> if (!x86_pmu_initialized())
> return;
> - return x86_pmu.disable_all();
> +
> + if (cpuc->enabled)
> + cpuc->n_events = 0;
> +
> + x86_pmu.disable_all();
> }

Right, so I cannot directly see the above being correct. You fully erase
the PMU state when we disable it, but things like single-counter
add/remove don't reprogram the full PMU afaict.

The powerpc code has n_added, which indicates a delta algorithm is used
there.

> +static struct event_constraint *intel_get_event_constraints(struct perf_event *event)
> +{
> + struct event_constraint *c;
> +
> + c = intel_special_constraints(event);
> + if (c)
> + return c;
> +
> + if (event_constraints)
> + for_each_event_constraint(c, event_constraints) {
> + if ((event->hw.config & c->mask) == c->code)
> + return c;
> + }

This wants extra { }, since it's a multi-line stmt.
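I.e. something like:

	if (event_constraints) {
		for_each_event_constraint(c, event_constraints) {
			if ((event->hw.config & c->mask) == c->code)
				return c;
		}
	}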

> + return NULL;
> +}
> +
> +static struct event_constraint *amd_get_event_constraints(struct perf_event *event)
> +{
> + return NULL;
> +}

I guess we'll need to fill that out a bit more, but that can be another
patch.

> +static int schedule_events(struct cpu_hw_events *cpuhw, int n, bool assign)
> +{
> + int i, j , w, num, lim;
> + int weight, wmax;
> + struct event_constraint *c;
> + unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
> + int assignments[X86_PMC_IDX_MAX];
> + struct hw_perf_event *hwc;
> +
> + bitmap_zero(used_mask, X86_PMC_IDX_MAX);
> +
> + /*
> + * weight = number of possible counters
> + *
> + * 1 = most constrained, only works on one counter
> + * wmax = least constrained, works on 1 fixed counter
> + * or any generic counter
> + *
> + * assign events to counters starting with most
> + * constrained events.
> + */
> + wmax = 1 + x86_pmu.num_events;
> + num = n;
> + for(w=1; num && w <= wmax; w++) {
> +
> + /* for each event */
> + for(i=0; i < n; i++) {
> + c = cpuhw->constraints[i];
> + hwc = &cpuhw->event_list[i]->hw;
> +
> + weight = c ? c->weight : x86_pmu.num_events;
> + if (weight != w)
> + continue;
> +
> + /*
> + * try to reuse previous assignment
> + *
> + * This is possible despite the fact that
> + * events or events order may have changed.
> + *
> + * What matters is the level of constraints
> + * of an event and this is constant for now.
> + *
> + * This is possible also because we always
> + * scan from most to least constrained. Thus,
> + * if a counter can be reused, it means no,
> + * more constrained events, needed it. And
> + * next events will either compete for it
> + * (which cannot be solved anyway) or they
> + * have fewer constraints, and they can use
> + * another counter.
> + */
> + j = hwc->idx;
> + if (j != -1 && !test_bit(j, used_mask))
> + goto skip;
> +
> + if (c) {
> + lim = X86_PMC_IDX_MAX;
> + for_each_bit(j, (unsigned long *)c->idxmsk, lim)
> + if (!test_bit(j, used_mask))
> + break;
> +
> + } else {
> + lim = x86_pmu.num_events;
> + /*
> + * fixed counter events have necessarily a
> + * constraint thus we come here only for
> + * generic counters and thus we limit the
> + * scan to those
> + */
> + j = find_first_zero_bit(used_mask, lim);
> + }
> + if (j == lim)
> + return -EAGAIN;
> +skip:
> + set_bit(j, used_mask);
> + assignments[i] = j;
> + num--;
> + }
> + }
> + if (num)
> + return -ENOSPC;
> +
> + /* just simulate scheduling */
> + if (!assign)
> + return 0;
> +
> + /*
> + * commit assignments
> + */
> + for(i=0; i < n; i++) {
> + hwc = &cpuhw->event_list[i]->hw;
> +
> + hwc->idx = assignments[i];
> +
> + set_bit(hwc->idx, cpuhw->used_mask);
> +
> + if (hwc->idx == X86_PMC_IDX_FIXED_BTS) {
> + hwc->config_base = 0;
> + hwc->event_base = 0;
> + } else if (hwc->idx >= X86_PMC_IDX_FIXED) {
> + hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
> + /*
> + * We set it so that event_base + idx in wrmsr/rdmsr maps to
> + * MSR_ARCH_PERFMON_FIXED_CTR0 ... CTR2:
> + */
> + hwc->event_base =
> + MSR_ARCH_PERFMON_FIXED_CTR0 - X86_PMC_IDX_FIXED;
> + } else {
> + hwc->config_base = x86_pmu.eventsel;
> + hwc->event_base = x86_pmu.perfctr;
> + }
> + }
> + cpuhw->n_events = n;
> + return 0;
> +}

Straightforward O(n^2) algorithm looking for the best match; seems good
from a single read -- will go over it again on another day to find more
details.

> +static int collect_events(struct cpu_hw_events *cpuhw, struct perf_event *leader)
> +{
> + struct perf_event *event;
> + int n, max_count;
> +
> + max_count = x86_pmu.num_events + x86_pmu.num_events_fixed;
> +
> + /* current number of events already accepted */
> + n = cpuhw->n_events;
> +
> + if (!is_software_event(leader)) {

With things like the hw-breakpoint stuff also growing a pmu,
!is_software_event() isn't strong enough; something like:

static inline int is_x86_event(struct perf_event *event)
{
return event->pmu == &pmu;
}

Should work though.
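In collect_events() that substitution would then read roughly:

	if (is_x86_event(leader)) {
		if (n >= max_count)
			return -ENOSPC;
		cpuhw->constraints[n] = x86_pmu.get_event_constraints(leader);
		cpuhw->event_list[n] = leader;
		n++;
	}

	list_for_each_entry(event, &leader->sibling_list, group_entry) {
		if (!is_x86_event(event) ||
		    event->state == PERF_EVENT_STATE_OFF)
			continue;
		/* ... as before ... */
	}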

> + if (n >= max_count)
> + return -ENOSPC;
> + cpuhw->constraints[n] = x86_pmu.get_event_constraints(leader);
> + cpuhw->event_list[n] = leader;
> + n++;
> + }
> +
> + list_for_each_entry(event, &leader->sibling_list, group_entry) {
> + if (is_software_event(event) ||
> + event->state == PERF_EVENT_STATE_OFF)
> + continue;
> +
> + if (n >= max_count)
> + return -ENOSPC;
> +
> + cpuhw->constraints[n] = x86_pmu.get_event_constraints(event);
> + cpuhw->event_list[n] = event;
> + n++;
> + }
> + return n;
> +}
> +
> +/*
> + * Called to enable a whole group of events.
> + * Returns 1 if the group was enabled, or -EAGAIN if it could not be.
> + * Assumes the caller has disabled interrupts and has
> + * frozen the PMU with hw_perf_save_disable.
> + */
> +int hw_perf_group_sched_in(struct perf_event *leader,
> + struct perf_cpu_context *cpuctx,
> + struct perf_event_context *ctx, int cpu)
> +{
> + struct cpu_hw_events *cpuhw = &per_cpu(cpu_hw_events, cpu);
> + int n, ret;
> +
> + n = collect_events(cpuhw, leader);
> + if (n < 0)
> + return n;
> +
> + ret = schedule_events(cpuhw, n, true);
> + if (ret)
> + return ret;
> +
> + /* 0 means successful and enable each event in caller */
> + return 0;
> +}

This is where powerpc does n_added += n, and it delays the
schedule_events() bit to hw_perf_enable(), conditional on n_added > 0.
When you remove events it simply keeps the current layout and only
disables the removed one.
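On x86 that delta scheme would look something like this (only a sketch:
n_added would be a new cpu_hw_events field maintained by the add paths,
and error handling is glossed over):

void hw_perf_enable(void)
{
	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);

	if (!x86_pmu_initialized())
		return;

	if (cpuc->n_added) {
		/*
		 * events were added since the last disable:
		 * recompute the assignment for the whole set
		 */
		schedule_events(cpuc, cpuc->n_events, true);
		cpuc->n_added = 0;
	}

	/* reprogram the counters according to the (possibly new) layout */
	x86_pmu.enable_all();
}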

> @@ -2123,8 +2245,8 @@ static int intel_pmu_init(void)
> memcpy(hw_cache_event_ids, core2_hw_cache_event_ids,
> sizeof(hw_cache_event_ids));
>
> - pr_cont("Core2 events, ");
> event_constraints = intel_core_event_constraints;
> + pr_cont("Core2 events, ");
> break;
> default:
> case 26:

Not that I object to the above change, but it seems out of place in this
patch.

> +/*
> + * validate a single event group
> + *
> + * validation include:
> + * - check events are compatible which each other
> + * - events do not compete for the same counter
> + * - number of events <= number of counters
> + *
> + * validation ensures the group can be loaded onto the
> + * PMU if it were the only group available.
> + */
> static int validate_group(struct perf_event *event)
> {
> - struct perf_event *sibling, *leader = event->group_leader;
> + struct perf_event *leader = event->group_leader;
> struct cpu_hw_events fake_pmu;
> + int n, ret;
>
> memset(&fake_pmu, 0, sizeof(fake_pmu));
>
> - if (!validate_event(&fake_pmu, leader))
> + /*
> + * the event is not yet connected with its
> + * siblings thererfore we must first collect
> + * existing siblings, then add the new event
> + * before we can simulate the scheduling
> + */
> + n = collect_events(&fake_pmu, leader);
> + if (n < 0)
> return -ENOSPC;
>
> - list_for_each_entry(sibling, &leader->sibling_list, group_entry) {
> - if (!validate_event(&fake_pmu, sibling))
> - return -ENOSPC;
> - }
> + fake_pmu.n_events = n;
>
> - if (!validate_event(&fake_pmu, event))
> + n = collect_events(&fake_pmu, event);
> + if (n < 0)
> return -ENOSPC;
>
> - return 0;
> + ret = schedule_events(&fake_pmu, n, false);
> + return ret ? -ENOSPC : 0;
> }

This seems good.

From: stephane eranian on
On Wed, Nov 18, 2009 at 5:32 PM, Peter Zijlstra <peterz(a)infradead.org> wrote:

> In general I'd very much like to see some of this generalized because I
> think Sparc64 has very similar 'simple' constraints, I'll have to defer
> to Paul on how much of this could possibly be re-used for powerpc.
>
Let's wait until we have all of the X86 constraint scheduling code. There
is way more needed. I have a first cut on the AMD64 constraints, but
there is an issue in the generic/arch-specific API that needs to be
flushed out first (see below).

> On Mon, 2009-10-19 at 17:03 +0200, Stephane Eranian wrote:
>> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
>> index 8d9f854..7c737af 100644
>> --- a/arch/x86/include/asm/perf_event.h
>> +++ b/arch/x86/include/asm/perf_event.h
>> @@ -26,7 +26,9 @@
>>  /*
>>   * Includes eventsel and unit mask as well:
>>   */
>> -#define ARCH_PERFMON_EVENT_MASK                                  0xffff
>> +#define ARCH_PERFMON_EVENTSEL_EVENT_MASK                 0x00ff
>> +#define ARCH_PERFMON_EVENTSEL_UNIT_MASK                          0xff00
>> +#define ARCH_PERFMON_EVENT_MASK                                  (ARCH_PERFMON_EVENTSEL_UNIT_MASK|ARCH_PERFMON_EVENTSEL_EVENT_MASK)
>
> There's various forms of the above throughout the code, it would be nice
> not to add more..
>
I will simplify this.
>
>> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
>> index 2e20bca..0f96c51 100644
>> --- a/arch/x86/kernel/cpu/perf_event.c
>> +++ b/arch/x86/kernel/cpu/perf_event.c
>
>> @@ -68,6 +69,15 @@ struct debug_store {
>>       u64     pebs_event_reset[MAX_PEBS_EVENTS];
>>  };
>>
>> +#define BITS_TO_U64(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(u64))
>
> Do we need this, is it realistic to expect X86_PMC_IDX_MAX to be
> anything else than 64?
>
The issue had to do with i386 mode, where longs are 32 bits, not 64, and
in particular with the initializer .idxmsk[0] = V. In the future we may
exceed 32 registers, which means the initializer would have to change.
But I guess we have quite a way to go before that case is reached, so I
will revert all of this to unsigned long.



>> -#define EVENT_CONSTRAINT(c, m) { .code = (c), .idxmsk[0] = (m) }
>> -#define EVENT_CONSTRAINT_END  { .code = 0, .idxmsk[0] = 0 }
>> +#define EVENT_CONSTRAINT(c, n, w, m) { \
>> +     .code = (c),    \
>> +     .mask = (m),    \
>> +     .weight = (w),  \
>> +     .idxmsk[0] = (n) }
>
> If not, we can do away with the weight argument here and use hweight64()
> which should reduce to a compile time constant.
>
I have dropped weight and use bitmap_weight() to recompute it; with all
the inlining it should come out to be a simple popcnt instruction.
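For example, one way to fill the weights in once at init time (just a
sketch, the helper name is made up):

static void fill_constraint_weights(struct event_constraint *constraints)
{
	struct event_constraint *c;

	for_each_event_constraint(c, constraints)
		c->weight = bitmap_weight((unsigned long *)c->idxmsk,
					  X86_PMC_IDX_MAX);
}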


>> @@ -124,7 +137,7 @@ static DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
>>         .enabled = 1,
>>  };
>>
>> -static const struct event_constraint *event_constraints;
>> +static struct event_constraint *event_constraints;
>
> I'm thinking this ought to live in x86_pmu, or possible, if we can
> generalize this enough, in pmu.

True.

> Inconsistent style with below:
>
>> +static struct event_constraint intel_nehalem_event_constraints[] = {
>> +     EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32)), 3, ARCH_PERFMON_FIXED_EVENT_MASK), /* INSTRUCTIONS_RETIRED */
>
> Would be nice to get rid of that EVENT_MASK part, maybe write
> EVENT_CONSTRAINT like:
>
>  .mask = ARCH_PERFMON_EVENTSEL_EVENT_MASK | ((n)>>32 ? ARCH_PERFMON_FIXED_MASK : 0),
>
> Which ought to work for Intel based things, AMD will need a different
> base event mask.
>
Will try to clean this up. AMD does not need a mask; it is a
completely different kind of constraint.

>> @@ -1120,9 +1136,15 @@ static void amd_pmu_disable_all(void)
>>
>>  void hw_perf_disable(void)
>>  {
>> +     struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
>> +
>>       if (!x86_pmu_initialized())
>>               return;
>> -     return x86_pmu.disable_all();
>> +
>> +     if (cpuc->enabled)
>> +             cpuc->n_events = 0;
>> +
>> +     x86_pmu.disable_all();
>>  }
>
> Right, so I cannot directly see the above being correct. You fully erase
> the PMU state when we disable it, but things like single counter
> add/remove doesn't reprogram the full PMU afaict.
>
Here is the key point. There is no clear API that says 'this is the final
stop for this event group, i.e., the next start will be with a different
group'. In other words, there is no signal that the 'resource' can be
freed. This is a showstopper for the AMD constraints. The generic code
needs to signal that this is the final stop for the event group, so the
machine-specific layer can release the registers. hw_perf_disable() is
used in many places and does not indicate the same thing as what I just
described.

>> +static int collect_events(struct cpu_hw_events *cpuhw, struct perf_event *leader)
>> +{
>> +     struct perf_event *event;
>> +     int n, max_count;
>> +
>> +     max_count = x86_pmu.num_events + x86_pmu.num_events_fixed;
>> +
>> +     /* current number of events already accepted */
>> +     n = cpuhw->n_events;
>> +
>> +     if (!is_software_event(leader)) {
>
> With things like the hw-breakpoint stuff also growing a pmu !
> is_software() isn't strong enough, something like:
>
> static inline int is_x86_event(struct perf_event *event)
> {
>        return event->pmu == &pmu;
> }
>
Agreed, I am using something like this now.

>> +int hw_perf_group_sched_in(struct perf_event *leader,
>> +            struct perf_cpu_context *cpuctx,
>> +            struct perf_event_context *ctx, int cpu)
>> +{
>> +     struct cpu_hw_events *cpuhw = &per_cpu(cpu_hw_events, cpu);
>> +     int n, ret;
>> +
>> +     n = collect_events(cpuhw, leader);
>> +     if (n < 0)
>> +             return n;
>> +
>> +     ret = schedule_events(cpuhw, n, true);
>> +     if (ret)
>> +             return ret;
>> +
>> +     /* 0 means successful and enable each event in caller */
>> +     return 0;
>> +}
>
> This is where powerpc does n_added += n, and it delays the
> schedule_events() bit to hw_perf_enable() conditional on n_added > 0.
> When you remove events it simply keeps the current layout and disables
> the one.
>
Will look into this.

>> @@ -2123,8 +2245,8 @@ static int intel_pmu_init(void)
>>               memcpy(hw_cache_event_ids, core2_hw_cache_event_ids,
>>                      sizeof(hw_cache_event_ids));
>>
>> -             pr_cont("Core2 events, ");
>>               event_constraints = intel_core_event_constraints;
>> +             pr_cont("Core2 events, ");
>>               break;
>>       default:
>>       case 26:
>
> Not that I object to the above change, but it seems out of place in this
> patch.
>
Will remove that.
From: stephane eranian on
>
>>> @@ -1120,9 +1136,15 @@ static void amd_pmu_disable_all(void)
>>>
>>>  void hw_perf_disable(void)
>>>  {
>>> +     struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
>>> +
>>>       if (!x86_pmu_initialized())
>>>               return;
>>> -     return x86_pmu.disable_all();
>>> +
>>> +     if (cpuc->enabled)
>>> +             cpuc->n_events = 0;
>>> +
>>> +     x86_pmu.disable_all();
>>>  }
>>
>> Right, so I cannot directly see the above being correct. You fully erase
>> the PMU state when we disable it, but things like single counter
>> add/remove doesn't reprogram the full PMU afaict.
>>

>>> +int hw_perf_group_sched_in(struct perf_event *leader,
>>> +            struct perf_cpu_context *cpuctx,
>>> +            struct perf_event_context *ctx, int cpu)
>>> +{
>>> +     struct cpu_hw_events *cpuhw = &per_cpu(cpu_hw_events, cpu);
>>> +     int n, ret;
>>> +
>>> +     n = collect_events(cpuhw, leader);
>>> +     if (n < 0)
>>> +             return n;
>>> +
>>> +     ret = schedule_events(cpuhw, n, true);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     /* 0 means successful and enable each event in caller */
>>> +     return 0;
>>> +}
>>
>> This is where powerpc does n_added += n, and it delays the
>> schedule_events() bit to hw_perf_enable() conditional on n_added > 0.
>> When you remove events it simply keeps the current layout and disables
>> the one.
>>
There is a major difference between PPC and X86 here. PPC has a
centralized register to control start/stop. This register uses a bitmask
to enable or disable counters. Thus, in hw_perf_enable(), if n_added == 0,
you just need to use the pre-computed bitmask. Otherwise, you need to
recompute the bitmask to include the new registers. The assignment of
events and validation is done in hw_group_sched_in().

In X86, assignment and validation is done in hw_group_sched_in().
Activation is done individually for each counter. There is no centralized
register used here, thus no bitmask to update.

Disabling a counter does not trigger a complete reschedule of events.
That happens only when hw_group_sched_in() is called.

The n_events = 0 in hw_perf_disable() is used to signal that something is
changing. It should not be here, but there is no better place for it. The
problem is that hw_group_sched_in() needs a way to know that it is called
for a completely new series of group scheduling, so it can discard any
previous assignment. This goes back to the issue I raised in my previous
email. You could add a parameter to hw_group_sched_in() that would
indicate this is the first group; that would cause n_events = 0 and the
function would start accumulating events for the new scheduling period.
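Purely as an illustration of that (the extra parameter does not exist
today, and all callers would have to pass it):

int hw_perf_group_sched_in(struct perf_event *leader,
			   struct perf_cpu_context *cpuctx,
			   struct perf_event_context *ctx,
			   int cpu, int first_group)
{
	struct cpu_hw_events *cpuhw = &per_cpu(cpu_hw_events, cpu);
	int n, ret;

	/* new scheduling period: forget the previous assignment */
	if (first_group)
		cpuhw->n_events = 0;

	n = collect_events(cpuhw, leader);
	if (n < 0)
		return n;

	ret = schedule_events(cpuhw, n, true);
	if (ret)
		return ret;

	/* 0 means successful and enable each event in caller */
	return 0;
}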
From: Stephane Eranian on
On Fri, Dec 11, 2009 at 12:00 PM, stephane eranian
<eranian(a)googlemail.com> wrote:
>>> --- a/arch/x86/kernel/cpu/perf_event.c
>>> +++ b/arch/x86/kernel/cpu/perf_event.c
>>
>>> @@ -68,6 +69,15 @@ struct debug_store {
>>>       u64     pebs_event_reset[MAX_PEBS_EVENTS];
>>>  };
>>>
>>> +#define BITS_TO_U64(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(u64))
>>
>> Do we need this, is it realistic to expect X86_PMC_IDX_MAX to be
>> anything else than 64?
>>
> The issue had to do with i386 mode where long are 32 bits  < 64. And in
> particular with the initializer .idxmsk[0] = V. In the future we may exceed
> 32 registers. That means the initializer would have to change. But I guess
> we have quite some ways before this case is reached. So I will revert all
> of this to unsigned long.
>
Well, in fact we have to use u64 because you are already using register
indexes >= 32, e.g., for the Intel fixed counters, which start at position
32. Using unsigned long would make the static initializer uglier in this case.
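For example, the INSTRUCTIONS_RETIRED entry (generic counters 0-1 plus
fixed counter 0 at bit 32) keeps a one-element initializer with u64:

	EVENT_CONSTRAINT(0xc0, 0x3ULL | (1ULL << 32), 3, ARCH_PERFMON_FIXED_EVENT_MASK),

	/*
	 * with 32-bit unsigned long idxmsk[] the same mask would have to be
	 * split by hand, e.g. .idxmsk[0] = 0x3, .idxmsk[1] = 0x1, which the
	 * single bitmask argument of EVENT_CONSTRAINT() cannot express.
	 */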
From: Peter Zijlstra on
On Fri, 2009-12-11 at 12:59 +0100, stephane eranian wrote:

> There is a major difference between PPC and X86 here. PPC has a
> centralized register to control start/stop. This register uses
> bitmask to enable or disable counters. Thus, in hw_perf_enable(), if
> n_added=0, then you just need to use the pre-computed bitmask.
> Otherwise, you need to recompute the bitmask to include the new
> registers. The assignment of events and validation is done in
> hw_group_sched_in().
>
> In X86, assignment and validation is done in hw_group_sched_in().
> Activation is done individually for each counter. There is no
> centralized register used here, thus no bitmask to update.

Intel Core2 has the global control reg, but for all intents and purposes
the perf_enable/disable calls emulate this global enable/disable.
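On the Intel side that pair boils down to roughly the following (ignoring
details such as the BTS handling):

static void intel_pmu_disable_all(void)
{
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
}

static void intel_pmu_enable_all(void)
{
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, x86_pmu.intel_ctrl);
}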

> Disabling a counter does not trigger a complete reschedule of events.
> This happens only when hw_group_sched_in() is called.
>
> The n_events = 0 in hw_perf_disable() is used to signal that something
> is changing. It should not be here, but there is no better place for
> it. The problem is that
> hw_group_sched_in() needs a way to know that it is called for a
> completely new series of group scheduling so it can discard any
> previous assignment. This goes back to the issue I raised in my
> previous email. You could add a parameter to hw_group_sched_in() that
> would indicate this is the first group. that would cause n_events =0
> and the function would start accumulating events for the new
> scheduling period.

I'm not really seeing the problem here...


perf_disable() <-- shut down the full pmu

pmu->disable() <-- hey someone got removed (easy, free the reg)
pmu->enable() <-- hey someone got added (harder, check constraints)

hw_perf_group_sched_in() <-- hey a full group got added
(better than multiple ->enable)

perf_enable() <-- re-enable pmu


So ->disable() is used to track freeing, ->enable() is used to add
individual counters, check constraints, etc.

hw_perf_group_sched_in() is used to optimize the full group enable.

Afaict that is what power does (Paul?) and that should I think be
sufficient to track x86 as well.

Since sched_in() is balanced with sched_out(), the ->disable() calls
should provide the required information as to the occupation of the pmu.
I don't see the need for more hooks.
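In terms of the new code that bookkeeping would look roughly like this
(a sketch only; the real hooks also have to program/stop the counter
itself, which is omitted here):

static int x86_pmu_enable(struct perf_event *event)
{
	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
	int n = cpuc->n_events;

	if (n >= x86_pmu.num_events + x86_pmu.num_events_fixed)
		return -ENOSPC;

	/* add this one event to the tracked set and re-check constraints */
	cpuc->constraints[n] = x86_pmu.get_event_constraints(event);
	cpuc->event_list[n] = event;

	return schedule_events(cpuc, n + 1, true);
}

static void x86_pmu_disable(struct perf_event *event)
{
	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
	int i;

	/* drop the event from the tracked set; its counter is free again */
	for (i = 0; i < cpuc->n_events; i++) {
		if (cpuc->event_list[i] != event)
			continue;
		while (++i < cpuc->n_events) {
			cpuc->event_list[i - 1] = cpuc->event_list[i];
			cpuc->constraints[i - 1] = cpuc->constraints[i];
		}
		cpuc->n_events--;
		break;
	}
	if (event->hw.idx >= 0)
		clear_bit(event->hw.idx, cpuc->used_mask);
}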

Paul, could you comment, since you did all this for power?
