From: Peter Zijlstra on
On Fri, 2010-07-02 at 11:57 +0900, Paul Mundt wrote:
> At the moment it's not an issue since we have big enough counters that
> overflows don't really happen, especially if we're primarily using them
> for one-shot measuring.
>
> SH-4A style counters behave in such a fashion that we have 2 general
> purpose counters, and 2 counters for measuring bus transactions. These
> bus counters can optionally be disabled and used in a chained mode to
> provide the general purpose counters with a 64-bit counter (how many
> bits are actually valid in the upper half of the chained counter varies
> from CPU to CPU, but all of them can do at least 48 bits when chained).

Right, so I was reading some of that code and I couldn't actually find
where you keep consistency between the hardware counter value and the
stored prev_count value.
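
For reference, the pattern I'd expect is the read/cmpxchg loop most
architectures use; a rough, completely untested sketch, with
sh4a_read_counter() standing in for the actual register read:

static void sh_perf_event_update_sketch(struct perf_event *event,
					struct hw_perf_event *hwc, int idx)
{
	u64 prev_raw_count, new_raw_count;

again:
	prev_raw_count = local64_read(&hwc->prev_count);
	new_raw_count = sh4a_read_counter(idx);

	/* Retry if somebody else updated prev_count under us. */
	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
			    new_raw_count) != prev_raw_count)
		goto again;

	/* Fold the delta since the last read into the event total. */
	local64_add(new_raw_count - prev_raw_count, &event->count);
}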

That is, suppose I'm counting, the hardware starts at 0, hwc->prev_count
= 0 and event->count = 0.

At some point x, we context switch this task away, so we ->disable(),
which disables the counter and updates the values, so at that time
hwc->prev = x and event->count = x, right?

Now suppose we schedule the task back in, so we do ->enable(), then what
happens? sh_pmu_enable() finds an unused index (disables it for some
reason... it should already be cleared if it's not used, but I guess a few
extra hardware writes don't hurt) and calls sh4a_pmu_enable() on it.

sh4a_pmu_enable() does 3 writes:

  PPC_PMCAT -- does this clear the counter value?
  PPC_CCBR  -- writes the ->config bits
  PPC_CCBR  -- adds CCBR_DUC (couldn't this be done in the previous
               write to this reg?)

Now assuming that enable does indeed clear the hardware counter value,
shouldn't you also set hwc->prev_count to 0 again? Otherwise the next
update will see a massive jump?

Alternatively you could write the hwc->prev_count value back to the
register.
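
Either way, something like the below (completely untested;
sh4a_pmu_clear_and_enable() is a made-up stand-in for the PMCAT/CCBR
sequence above):

static void sh_pmu_enable_sketch(struct perf_event *event, int idx)
{
	struct hw_perf_event *hwc = &event->hw;

	/* The PMCAT write zeroes the hardware counter... */
	sh4a_pmu_clear_and_enable(hwc, idx);

	/* ...so the software side must forget the old value too. */
	local64_set(&hwc->prev_count, 0);
}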

If you eventually want to drop the chained counter support I guess it
would make sense to have sh_perf_event_update() read and clear the
counter so that you're always 0 based and then enforce an update from
the arch tick handler so you never overflow.
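
Roughly (assuming a read-and-clear primitive existed;
sh4a_read_and_clear() is made up):

static void sh_perf_event_update_sketch2(struct perf_event *event, int idx)
{
	/* Counter restarts from 0 on each read; the raw value is the delta. */
	local64_add(sh4a_read_and_clear(idx), &event->count);
}

Called from both ->disable() and the tick handler, even a 48-bit counter
would never get anywhere near wrapping.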


From: Will Deacon on
Hi Peter,

On Thu, 2010-07-01 at 15:36 +0100, Peter Zijlstra wrote:
> On Fri, 2010-06-25 at 16:50 +0200, Peter Zijlstra wrote:
>
> > Not exactly sure how I could have messed up the ARM architecture code to
> > make this happen though... will have a peek.
>
> I did find a bug in there, not sure it could have been responsible for
> this but who knows...
>
> Pushed out a new git tree with the below delta folded in.
>
I had a look at this yesterday and discovered a bug in the ARM
backend code, for which I've posted a patch to the linux-arm-kernel list:

http://lists.infradead.org/pipermail/linux-arm-kernel/2010-July/019461.html

Unfortunately, with this applied and your latest changes I still
get 0 from pinned hardware counters:

# perf stat -r 5 -e cycles -e instructions -e cs -e faults -e branches -a -- git status

Performance counter stats for 'git status' (5 runs):

             0  cycles                   ( +-     nan% )
             0  instructions             #    0.000 IPC    ( +-     nan% )
         88447  context-switches         ( +-  12.624% )
         13647  page-faults              ( +-   0.015% )
             0  branches                 ( +-     nan% )

The changes you've made to arch/arm/kernel/perf_event.c
look sane. If I get some time I'll try and dig deeper.

Will

From: Paul Mundt on
On Fri, Jul 02, 2010 at 11:52:03AM +0200, Peter Zijlstra wrote:
> Right, so I was reading some of that code and I couldn't actually find
> where you keep consistency between the hardware counter value and the
> stored prev_count value.
>
> That is, suppose I'm counting, the hardware starts at 0, hwc->prev_count
> = 0 and event->count = 0.
>
> At some point x, we context switch this task away, so we ->disable(),
> which disables the counter and updates the values, so at that time
> hwc->prev = x and event->count = x, right?
>
> Now suppose we schedule the task back in, so we do ->enable(), then what
> happens? sh_pmu_enable() finds an unused index (disables it for some
> reason... it should already be cleared if it's not used, but I guess a few
> extra hardware writes don't hurt) and calls sh4a_pmu_enable() on it.
>
I don't quite remember where the ->disable() came from; I vaguely recall
copying it from one of the other architectures, but it could also have
been a remnant of some debug code. In any event, you're correct, we
don't seem to need it anymore.

> sh4a_pmu_enable() does 3 writes:
>
> PPC_PMCAT -- does this clear the counter value?

Yes, the counters themselves are read-only, so clearing is done through
the PMCAT control register.

>   PPC_CCBR  -- writes the ->config bits
>   PPC_CCBR  -- adds CCBR_DUC (couldn't this be done in the previous
>                write to this reg?)
>
No, the DUC bit needs to be set by itself or the write is discarded on
some CPUs. Clearing it with other bits is fine, however. This is what
starts the counter running.
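
So the working sequence looks roughly like this (helpers and masks made
up for illustration; the real code reads the registers back and masks
bits):

static void sh4a_pmu_enable_sketch(struct hw_perf_event *hwc, int idx)
{
	/* Counters are read-only; writing PMCAT is what clears them. */
	__raw_writel(pmcat_clear_bits(idx), PPC_PMCAT);	/* made-up helper */

	/* Program the event selection first... */
	__raw_writel(hwc->config, PPC_CCBR(idx));

	/*
	 * ...then set DUC on its own; combined with the config bits the
	 * write gets discarded on some CPUs. This starts the counter.
	 */
	__raw_writel(hwc->config | CCBR_DUC, PPC_CCBR(idx));
}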

> Now assuming that enable does indeed clear the hardware counter value,
> shouldn't you also set hwc->prev_count to 0 again? Otherwise the next
> update will see a massive jump?
>
I think that's a correct observation, but I'm having difficulty verifying
it on my current board, since it seems someone moved the PMCAT register:
the counters aren't being cleared on this particular CPU. I'll test
tomorrow on the board I originally wrote this code for and see how that
goes. It used to work fine, at least.

> Alternatively you could write the hwc->prev_count value back to the
> register.
>
That would be an option if the counters weren't read-only, yes.

> If you eventually want to drop the chained counter support I guess it
> would make sense to have sh_perf_event_update() read and clear the
> counter so that you're always 0 based and then enforce an update from
> the arch tick handler so you never overflow.
>
Yes, I'd thought about that too. I'll give it a go once I find out where
the other half of my registers disappeared to. As it is, it seems my bat
and I have an appointment to make.
From: Peter Zijlstra on
On Thu, 2010-07-01 at 17:39 +0200, Peter Zijlstra wrote:
>
> Ah, for sampling for sure, simply group a software perf event and a
> hardware perf event together and use PERF_SAMPLE_READ.

So the idea is to sample using a software event (periodic timer of
sorts, maybe randomize it) and weight its samples by the hardware event
deltas.
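
From userspace that would look something like this (untested sketch,
error handling elided, the period is arbitrary):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

static int sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
			       int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int open_weighted_group(void)
{
	struct perf_event_attr sw = { 0 }, hw = { 0 };
	int leader;

	/* The software timer drives the samples... */
	sw.type          = PERF_TYPE_SOFTWARE;
	sw.config        = PERF_COUNT_SW_CPU_CLOCK;
	sw.size          = sizeof(sw);
	sw.sample_period = 100000;	/* arbitrary */
	sw.sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_READ;
	sw.read_format   = PERF_FORMAT_GROUP;

	leader = sys_perf_event_open(&sw, 0, -1, -1, 0);

	/* ...and the grouped hardware event supplies the weights. */
	hw.type   = PERF_TYPE_HARDWARE;
	hw.config = PERF_COUNT_HW_CACHE_MISSES;
	hw.size   = sizeof(hw);
	sys_perf_event_open(&hw, 0, -1, leader, 0);

	return leader;
}

With PERF_SAMPLE_READ plus PERF_FORMAT_GROUP every timer sample carries
the current hardware counter value, which is exactly the weight we want.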

Suppose you have a workload consisting of two main parts:

my_important_work()
{
	load_my_data();
	compute_me_silly();
}

Now, let's assume that both these functions take the same time to
complete for each unit of work. In that case a periodic timer generates
samples that are distributed about 50/50 between these two functions.

Now, let us further assume that load_my_data() is so slow because it's
missing all the caches and compute_me_silly() is slow because it's
defeating the branch predictor.

So what we want to end up with is that when we sample for cache-misses
we get load_my_data() as the predominant function, not a nice 50/50
split. Idem for branch misses and compute_me_silly().

By weighting the samples by the hw counter delta we get this: if we
assume that the sampling frequency is not a harmonic of the runtime of
these functions, then statistics will dtrt (do the right thing).
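
To make that concrete with made-up numbers, suppose the timer hits the
two functions alternately (50/50) while the cache-miss counter advances
by 90 per period inside load_my_data() and by 10 inside
compute_me_silly():

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct sample {
	const char *func;	/* resolved from the sampled IP */
	uint64_t misses;	/* cache-miss count read at sample time */
};

int main(void)
{
	/* Timer hits alternate 50/50; the miss deltas do not. */
	struct sample s[] = {
		{ "load_my_data",      90 }, { "compute_me_silly", 100 },
		{ "load_my_data",     190 }, { "compute_me_silly", 200 },
	};
	uint64_t prev = 0, load = 0, compute = 0;
	unsigned int i;

	for (i = 0; i < sizeof(s) / sizeof(s[0]); i++) {
		uint64_t delta = s[i].misses - prev;	/* the weight */

		prev = s[i].misses;
		if (!strcmp(s[i].func, "load_my_data"))
			load += delta;
		else
			compute += delta;
	}

	/* Weighted profile: 180 vs 20, i.e. ~90/10, not 50/50. */
	printf("load_my_data: %llu, compute_me_silly: %llu\n",
	       (unsigned long long)load, (unsigned long long)compute);
	return 0;
}

The raw sample counts are still 50/50, but the weighted profile comes
out 180 vs 20, i.e. roughly 90/10.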

It basically generates a massive skid on the sample, but as long as most
of the samples end up hitting the right function we're good. For a
periodic workload like:

	while (lots) { my_important_work(); }

that is even true for period > function_runtime, with the exception of
that harmonic thing. For less neat workloads like:

	while (lots) { my_important_work(); other_random_things(); }

this is not guaranteed to work unless period < function_runtime.

Clearly we cannot attribute anything to the actual instruction hit due
to the massive skid, but we can (possibly) say something about the
function based on these statistical rules.


From: Ingo Molnar on

* Peter Zijlstra <peterz(a)infradead.org> wrote:

> On Thu, 2010-07-01 at 17:39 +0200, Peter Zijlstra wrote:
> >
> > Ah, for sampling for sure, simply group a software perf event and a
> > hardware perf event together and use PERF_SAMPLE_READ.
>
> So the idea is to sample using a software event (periodic timer of sorts,
> maybe randomize it) and weight its samples by the hardware event deltas.
>
> Suppose you have a workload consisting of two main parts:
>
> my_important_work()
> {
> 	load_my_data();
> 	compute_me_silly();
> }
>
> Now, let's assume that both these functions take the same time to complete
> for each unit of work. In that case a periodic timer generates samples that
> are distributed about 50/50 between these two functions.
>
> Now, let us further assume that load_my_data() is so slow because it's
> missing all the caches and compute_me_silly() is slow because it's defeating
> the branch predictor.
>
> So what we want to end up with is that when we sample for cache-misses we
> get load_my_data() as the predominant function, not a nice 50/50 split.
> Idem for branch misses and compute_me_silly().
>
> By weighting the samples by the hw counter delta we get this: if we assume
> that the sampling frequency is not a harmonic of the runtime of these
> functions, then statistics will dtrt (do the right thing).

Yes.

And if the platform code implements this then the tooling side already takes
care of it - even if the CPU itself cannot generate interrupts based on, say,
cache misses or branches (but can measure them via counts).

The only situation where statistics will not do the right thing is when the
likelihood of the sample tick significantly correlates with the likelihood of
the workload itself executing. Timer-dominated workloads would be an example.

Real hrtimers are sufficiently tickless to avoid most of these artifacts in
practice.

Ingo