From: Corey Ashford
On 06/24/2010 07:28 AM, Peter Zijlstra wrote:
> These patches prepare the perf code for multiple pmus (no user
> interface yet, Lin Ming is working on that). These patches remove all
> weak functions and rework the struct pmu interface.
>
> The patches are boot tested on x86_64 and compile tested on: powerpc
> (!fsl, what config is that?), sparc and arm (sorry no SH compiler)
>
> On x86 perf stat and record with both software and hardware events still worked.
>
> I changed all (hardware) pmu implementations, so there's a fair chance I messed
> something up; please have a look at it.
>
> If people agree with this, I'll continue with the per-pmu context stuff I
> talked about last time, which should get us the hardware write batching back.

Hi Peter,

These patches look like they are really taking us in the right
direction. Thanks for all your effort on this!

As for the "hardware write batching", can you describe a bit more about
what you have in mind there? I wonder if this might have something to
do with accounting for PMU hardware which is slow to access, for
example, via I2C over an internal bridge.

Cheers,

- Corey

From: Peter Zijlstra
On Sat, 2010-06-26 at 09:22 -0700, Corey Ashford wrote:

> These patches look like they are really taking us in the right
> direction. Thanks for all your effort on this!

Yeah, although I seem to have managed to wreck all the architectures
tested so far (I found something iffy on x86 too), so I guess I need to
take a careful look at things.

> As for the "hardware write batching", can you describe a bit more about
> what you have in mind there? I wonder if this might have something to
> do with accounting for PMU hardware which is slow to access, for
> example, via I2C over an internal bridge.

Right, so the write batching is basically delaying writing out the PMU
state to hardware until pmu::pmu_enable() time. It avoids having to
re-program the hardware when, due to a scheduling constraint, we have to
move counters around.

So say we context-switch a task and remove the old events and add the
new ones under a single pmu::pmu_disable()/::pmu_enable() pair; we will
only hit the hardware twice (once to disable, once to enable), instead
of once for each individual ::del()/::add().
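
Roughly what I have in mind, as a sketch only; the nest_pmu type, the
hw_dirty flag and the nest_hw_*() helpers below are invented for
illustration and are not part of the patches:

#include <linux/perf_event.h>

/* nest_hw_*() stand in for the actual (slow) hardware accessors */
struct nest_pmu {
	struct pmu	pmu;
	int		disable_count;	/* nesting depth of pmu_disable() */
	bool		hw_dirty;	/* software state changed since last sync */
};

static void nest_pmu_disable(struct pmu *pmu)
{
	struct nest_pmu *np = container_of(pmu, struct nest_pmu, pmu);

	if (!np->disable_count++)
		nest_hw_stop(np);		/* 1st hardware access: disable */
}

static int nest_pmu_add(struct perf_event *event, int flags)
{
	struct nest_pmu *np = container_of(event->pmu, struct nest_pmu, pmu);

	/* only update the software copy of the configuration */
	np->hw_dirty = true;
	return 0;
}

static void nest_pmu_del(struct perf_event *event, int flags)
{
	struct nest_pmu *np = container_of(event->pmu, struct nest_pmu, pmu);

	np->hw_dirty = true;		/* again, no hardware access here */
}

static void nest_pmu_enable(struct pmu *pmu)
{
	struct nest_pmu *np = container_of(pmu, struct nest_pmu, pmu);

	if (--np->disable_count)
		return;

	if (np->hw_dirty)		/* write the whole config out at once */
		nest_hw_sync_config(np);
	np->hw_dirty = false;

	nest_hw_start(np);		/* 2nd hardware access: re-enable */
}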

For this to work we need to have an association between a context and a
pmu, otherwise it's very hard to know what pmu to disable/enable; the
alternative is all of them which isn't very attractive.

Then again, it doesn't make sense to have task-counters on an I2C
attached PMU simply because writing to the PMU could cause context
switches.


From: Corey Ashford
On 06/28/2010 08:13 AM, Peter Zijlstra wrote:
> On Sat, 2010-06-26 at 09:22 -0700, Corey Ashford wrote:
>> As for the "hardware write batching", can you describe a bit more about
>> what you have in mind there? I wonder if this might have something to
>> do with accounting for PMU hardware which is slow to access, for
>> example, via I2C over an internal bridge.
>
> Right, so the write batching is basically delaying writing out the PMU
> state to hardware until pmu::pmu_enable() time. It avoids having to
> re-program the hardware when, due to a scheduling constraint, we have to
> move counters around.
>
> So say we context-switch a task and remove the old events and add the
> new ones under a single pmu::pmu_disable()/::pmu_enable() pair; we will
> only hit the hardware twice (once to disable, once to enable), instead
> of once for each individual ::del()/::add().
>
> For this to work we need to have an association between a context and a
> pmu, otherwise it's very hard to know what pmu to disable/enable; the
> alternative is all of them which isn't very attractive.
>
> Then again, it doesn't make sense to have task-counters on an I2C
> attached PMU simply because writing to the PMU could cause context
> switches.

Thanks for your reply.

In our case, writing to some of the nest PMUs' control registers is done
via an internal bridge. We write to a specific memory address, and an
internal MMIO-to-SCOM bridge (SCOM is similar to I2C) translates the
access into a serial transaction and sends it over the internal serial
bus. The address we write to is based upon the control register's serial
bus address plus an offset from the base of the MMIO-to-SCOM bridge. The
same process works for reads.
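
To make that a bit more concrete, here is a sketch of the access
pattern; the base address, the <<3 scaling and the scom_write() name
are invented for illustration and are not our real code:

#include <linux/io.h>

#define SCOM_BRIDGE_BASE	0xf8000000UL	/* hypothetical MMIO base of the bridge */

static void __iomem *scom_bridge;	/* = ioremap(SCOM_BRIDGE_BASE, ...) */

static void scom_write(u32 scom_addr, u64 val)
{
	/*
	 * The bridge turns this ordinary MMIO store into a serial-bus
	 * (SCOM) transaction; the control register's serial bus address
	 * selects the offset within the bridge window.
	 */
	writeq(val, scom_bridge + ((unsigned long)scom_addr << 3));
}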

While it does not cause a context switch, since there are no I/O drivers
to go through, an access still takes several thousand CPU cycles to
complete, which by the same token makes these PMUs inappropriate for
task-counters (not to mention that the nest units operate asynchronously
from the CPU).

There are still concerns, though, about writing these control registers
from an interrupt handler because of the latency that will be incurred,
no matter how slowly we choose to do the event rotation. So at least for
the Wire-Speed processor, we may need a worker thread of some sort to
hand the work off to.
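
Roughly what I mean by handing the work off, as a sketch only;
nest_write_control() and the struct/function names are made up, this is
not code we have:

#include <linux/workqueue.h>
#include <linux/slab.h>

/*
 * A deferred write request; nest_write_control() is the (hypothetical)
 * slow SCOM write described above.
 */
struct nest_deferred_write {
	struct work_struct	work;
	u32			reg;
	u64			val;
};

static void nest_deferred_write_fn(struct work_struct *work)
{
	struct nest_deferred_write *dw =
		container_of(work, struct nest_deferred_write, work);

	nest_write_control(dw->reg, dw->val);	/* slow; fine in process context */
	kfree(dw);
}

/* Called from the hot path (interrupt/rotation); returns without
 * touching the hardware. */
static int nest_queue_write(u32 reg, u64 val)
{
	struct nest_deferred_write *dw = kzalloc(sizeof(*dw), GFP_ATOMIC);

	if (!dw)
		return -ENOMEM;

	dw->reg = reg;
	dw->val = val;
	INIT_WORK(&dw->work, nest_deferred_write_fn);
	schedule_work(&dw->work);
	return 0;
}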

Our current code, based on Linux 2.6.31 (soon to be 2.6.32), doesn't use
worker threads; we are just taking the latency hit for now.

- Corey

From: Peter Zijlstra
On Wed, 2010-06-30 at 10:19 -0700, Corey Ashford wrote:
> However, there still are concerns relative to writing these control
> registers from an interrupt handler because of the latency that will be
> incurred, however slow we choose to do the event rotation. So at least
> for the Wire-Speed processor, we may need a worker thread of some sort
> to hand off the work to.

Right, once we have per-pmu contexts we can look at providing a way to
override how rotation is driven, such that the pmu can optionally drive
it itself instead of relying on the common tick.
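
E.g. something along these lines; purely hypothetical, none of the
hooks below (perf_rotate_context(), the rotate_timer/rotate_interval
fields on the pmu) exist:

#include <linux/hrtimer.h>
#include <linux/ktime.h>

static enum hrtimer_restart nest_pmu_rotate(struct hrtimer *timer)
{
	struct nest_pmu *np = container_of(timer, struct nest_pmu, rotate_timer);

	/*
	 * Rotate only this pmu's context; for a really slow bus this
	 * would probably punt the actual work to a workqueue instead.
	 */
	perf_rotate_context(np->ctx);

	hrtimer_forward_now(timer, np->rotate_interval);
	return HRTIMER_RESTART;
}

static void nest_pmu_start_rotation(struct nest_pmu *np)
{
	np->rotate_interval = ktime_set(0, 100 * NSEC_PER_MSEC);	/* whatever the hw can afford */
	hrtimer_init(&np->rotate_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	np->rotate_timer.function = nest_pmu_rotate;
	hrtimer_start(&np->rotate_timer, np->rotate_interval, HRTIMER_MODE_REL);
}
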
From: Peter Zijlstra
On Fri, 2010-06-25 at 16:50 +0200, Peter Zijlstra wrote:

> Not exactly sure how I could have messed up the ARM architecture code to
> make this happen though... will have a peek.

I did find a bug in there, not sure it could have been responsible for
this but who knows...

Pushed out a new git tree with the below delta folded in.

---
Index: linux-2.6/arch/arm/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/arch/arm/kernel/perf_event.c
+++ linux-2.6/arch/arm/kernel/perf_event.c
@@ -221,27 +221,6 @@ again:
 }
 
 static void
-armpmu_del(struct perf_event *event, int flags)
-{
-	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-	struct hw_perf_event *hwc = &event->hw;
-	int idx = hwc->idx;
-
-	WARN_ON(idx < 0);
-
-	clear_bit(idx, cpuc->active_mask);
-	armpmu->disable(hwc, idx);
-
-	barrier();
-
-	armpmu_event_update(event, hwc, idx);
-	cpuc->events[idx] = NULL;
-	clear_bit(idx, cpuc->used_mask);
-
-	perf_event_update_userpage(event);
-}
-
-static void
 armpmu_read(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
@@ -267,6 +246,8 @@ armpmu_stop(struct perf_event *event, in
 	 */
 	if (!(hwc->state & PERF_HES_STOPPED)) {
 		armpmu->disable(hwc, hwc->idx);
+		barrier(); /* why? */
+		armpmu_event_update(event, hwc, hwc->idx);
 		hwc->state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
 	}
 }
@@ -289,7 +270,7 @@ armpmu_start(struct perf_event *event, i
 	hwc->state = 0;
 	/*
 	 * Set the period again. Some counters can't be stopped, so when we
-	 * were throttled we simply disabled the IRQ source and the counter
+	 * were stopped we simply disabled the IRQ source and the counter
 	 * may have been left counting. If we don't do this step then we may
 	 * get an interrupt too soon or *way* too late if the overflow has
 	 * happened since disabling.
@@ -298,6 +279,23 @@ armpmu_start(struct perf_event *event, i
 	armpmu->enable(hwc, hwc->idx);
 }
 
+static void
+armpmu_del(struct perf_event *event, int flags)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct hw_perf_event *hwc = &event->hw;
+	int idx = hwc->idx;
+
+	WARN_ON(idx < 0);
+
+	clear_bit(idx, cpuc->active_mask);
+	armpmu_stop(event, PERF_EF_UPDATE);
+	cpuc->events[idx] = NULL;
+	clear_bit(idx, cpuc->used_mask);
+
+	perf_event_update_userpage(event);
+}
+
 static int
 armpmu_add(struct perf_event *event, int flags)
 {
@@ -1075,7 +1073,7 @@ armv6pmu_handle_irq(int irq_num,
 			continue;
 
 		if (perf_event_overflow(event, 0, &data, regs))
-			armpmu_stop(event, 0);
+			armpmu->disable(hwc, idx);
 	}
 
 	/*
