From: Peter Zijlstra
On Thu, 2010-05-06 at 16:03 +0200, Stephane Eranian wrote:
> Hi,
>
> Looking at ctx_flexible_sched_in(), the logic is that if group_sched_in()
> fails for a HW group, then no other HW group in the list is even tried.
> I don't understand this restriction. Groups are independent of each other.
> The failure of one group should not block others from being scheduled,
> otherwise you under-utilize the PMU.
>
> What is the reason for this restriction? Can we lift it somehow?

Sure, but it will make scheduling much more expensive. The current
scheme will only ever check the first N events because it stops at the
first one that fails, and since you can fit at most N events on the PMU
it's constant time.

To fix this issue you'd basically have to iterate all events every time
and only stop once the PMU is fully booked, which reduces to an O(n)
worst-case algorithm.

But yeah, I did think of making the thing an RB-tree and basically
scheduling on service received; that should fix the lop-sided RR we get
with constrained events.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra
On Thu, 2010-05-06 at 16:41 +0200, Stephane Eranian wrote:
> On Thu, May 6, 2010 at 4:20 PM, Peter Zijlstra <peterz(a)infradead.org> wrote:
> > On Thu, 2010-05-06 at 16:03 +0200, Stephane Eranian wrote:
> >> Hi,
> >>
> >> Looking at ctx_flexible_sched_in(), the logic is that if group_sched_in()
> >> fails for a HW group, then no other HW group in the list is even tried.
> >> I don't understand this restriction. Groups are independent of each other.
> >> The failure of one group should not block others from being scheduled,
> >> otherwise you under-utilize the PMU.
> >>
> >> What is the reason for this restriction? Can we lift it somehow?
> >
> > Sure, but it will make scheduling much more expensive. The current
> > scheme will only ever check the first N events because it stops at the
> > first one that fails, and since you can fit at most N events on the PMU
> > it's constant time.
> >
> You may fail not because the PMU is full but because an event is incompatible
> with the others, i.e., there may still be room for more events. By relying on
> the RR to get coverage for all events, you also increase the blind spots for
> events which have been skipped. Longer blind spots imply less accuracy when
> you scale.
>
> > To fix this issue you'd have to basically always iterate all events and
> > only stop once the PMU is fully booked, which reduces to an O(n) worst
> > case algorithm.
> >
>
> Yes, but if you have X events and you don't know if you have at least N
> that are compatible with each other, then you have to scan the whole list.

I'm not sure why you're arguing; you asked why it does what it does, and
I gave an answer ;-)

I agree it's not optimal, but fixing it isn't trivial. I would very much
like to avoid a full O(n) loop over all events, especially since creating
them is a non-privileged operation.

So what we can look at is trying to do better, and making it a service
based scheduler instead of a strict RR should at least get a more equal
distribution.

Another thing we can do is quit at the second or third fail.
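A toy user-space sketch of that bail-after-a-few-fails heuristic; `struct group`, its `fits` flag and the 0-on-success convention of `group_sched_in()` here are stand-ins for illustration, not the real perf code:

```c
#include <assert.h>

#define MAX_FAILS 2	/* tolerated failures before giving up */

/* toy stand-in: a "group" just records whether the PMU can take it */
struct group { int fits; };

static int group_sched_in(const struct group *g)
{
	return g->fits ? 0 : -1;	/* 0 on success, like the kernel's */
}

/*
 * Instead of stopping at the very first group that fails, tolerate up
 * to MAX_FAILS failed groups: the scan stays bounded without walking
 * the whole list, and compatible groups behind one bad group still run.
 */
static int sched_in_flexible(const struct group *groups, int ngroups)
{
	int fails = 0, scheduled = 0;

	for (int i = 0; i < ngroups; i++) {
		if (group_sched_in(&groups[i]) == 0)
			scheduled++;
		else if (++fails > MAX_FAILS)
			break;
	}
	return scheduled;
}
```

With MAX_FAILS set to 0 this degenerates to the current stop-at-first-fail behaviour, so it is a strict generalization of what ctx_flexible_sched_in() does today.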
From: Peter Zijlstra
On Fri, 2010-05-07 at 11:37 +0200, Stephane Eranian wrote:

> > If we define lag to be the difference between perfect service and our
> > approximation thereof: lag_i = S - s_i, then for a scheduler to be fair
> > we must place two conditions thereon:
> >
>
> I assume S represents the time an event would be on the PMU in the
> case of perfect scheduling. And thus S is the same for all events. The
> index i represents the event index.

Ah indeed, I should have clarified that.

> > So eligibility can be expressed as: s_i < avg(s_i).
> >
> Which would mean: if my total time on PMU is less than the average
> time on the PMU for all events thus far, then "schedule me now".

Yes, although I would state the action like: "consider me for
scheduling", since there might not be room for all eligible events on
the PMU.

[ If you start adding weights (like we do for task scheduling) this
becomes a weighted average. ]

> You would have to sort the events by increasing s_i (using the RB tree, I assume)

Exactly.
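As a rough user-space sketch of that pick loop (a sorted array stands in for the RB-tree read leftmost-first; the names and the strict `s_i < avg` test are assumptions for illustration, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/*
 * `s` holds per-event service, sorted ascending -- the equivalent of
 * walking the RB-tree from its leftmost node.  An event is only
 * considered for the PMU while its service is below the average;
 * n * s_i < \Sum s_j avoids the division.
 */
static int pick_eligible(const u64 *s, int n, int *picked, int max_pick)
{
	u64 sum = 0;
	int k = 0;

	for (int i = 0; i < n; i++)
		sum += s[i];

	/* least-serviced first; stop once s_i >= avg (ineligible) */
	for (int i = 0; i < n && k < max_pick; i++) {
		if (s[i] * (u64)n >= sum)
			break;
		picked[k++] = i;
	}
	return k;
}
```

The early break is what keeps this cheap: once the leftmost remaining event is at or above the average, everything further right is too, so the walk never degenerates to a full O(n) scan.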

> > With this, we will get a schedule like:
> >
> > / {A, C}, {B} /
> >
> > We are however still fully greedy, which is still O(n), which we don't
> > want. However if we stop being greedy and use the same heuristic we do
> > now, stop filling the PMU at the first fail, we'll still be fair,
> > because the algorithm ensures that.
> >
> Let's see if I understand with an example. Assume the PMU multiplex
> timing is 1ms, 2 counters. s(n) = total time in ms at time n.
>
> evt A B C
> s(0) 0 0 0 -> avg = 0/3=0.00, sort = A, B, C, schedule A, fail on B
> s(1) 1 0 0 -> avg = 1/3=0.33, sort = B, C, A, schedule B, C,
> s(2) 1 1 1 -> avg = 3/3=1.00, sort = A, B, C, schedule A, fail on B
> s(3) 2 1 1 -> avg = 4/3=1.33, sort = B, C, A, schedule B, C
> s(4) 2 2 2 -> avg = 6/3=2.00, sort = A, B, C, schedule A, fail on B
> s(5) 3 2 2 -> avg = 7/3=2.33, sort = B, C, A, schedule B, C
>
> What if there is no constraints on all 3 events?
>
> evt A B C
> s(0) 0 0 0 -> avg = 0/3=0.00, sort = A, B, C, schedule A, B
> s(1) 1 1 0 -> avg = 2/3=0.66, sort = C, A, B, schedule C (A, B > avg)
> s(2) 1 1 1 -> avg = 3/3=1.00, sort = A, B, C, schedule A, B
> s(3) 2 2 1 -> avg = 5/3=1.66, sort = C, A, B, schedule C (A, B > avg)
> s(4) 2 2 2 -> avg = 6/3=2.00, sort = B, C, A, schedule B, C
> s(5) 2 3 3 -> avg = 8/3=2.66, sort = A, B, C, schedule A (B, C > avg)
> s(6) 3 3 3 -> avg = 9/3=3.00, sort = A, B, C, schedule A, B
>
> When all timings are equal, the sort could yield any order; it would not
> matter, because over time each event will be scheduled if it lags.
>
> Am I understanding your algorithm right?

Perfectly!

So the ramification of not using a greedy algorithm is that the
potential schedule of constrained events/groups gets longer than is
absolutely required, but I think that is something we'll have to live
with, since O(n) just isn't a nice option.

This can be illustrated if we consider B to be exclusive with both A and
C, in that case we could end up with:

/ {A}, {B}, {C} /

instead of

/ {A, C}, {B} /

Depending on the order in which we find events sorted.

From: Peter Zijlstra
On Fri, 2010-05-07 at 12:49 +0200, Stephane Eranian wrote:
> You'd have to insert all events into the tree, read leftmost.
> I believe you need more than just basic integer arithmetic
> to compare s_i to avg. Or you need to tweak the values
> to get more precision. But you may already be doing that
> elsewhere in the kernel.

I've got a modification to CFS which implements EEVDF which needs
similar eligibility checks. So yeah, I've got code to do this.

The trick to a computable avg is to keep a monotonic min_s around and use
ds_i = s_i - min_s. These ds_i will be 'small', on the order of the max
lag.

We can thus keep a sum of ds_i up-to-date when inserting/removing events
from the tree without fear of overflowing our integer.

When we update min_s, we must also update our relative sum
proportionally and in the opposite direction.

Comparing for eligibility can be done by:

s_i < 1/n \Sum_j s_j, i.e. ds_i < 1/n \Sum_j ds_j (subtracting min_s
from both sides), which we can write as: n*ds_i < \Sum_j ds_j

Again, this can be done without fear of overflows because ds_i is small.
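A hedged sketch of that bookkeeping (all names are made up for illustration, and min_s here is kept as a true monotonic minimum so the ds_i stay small and non-negative; this is not the kernel implementation):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* hypothetical queue state mirroring the scheme described above */
struct queue {
	u64 min_s;	/* monotonic minimum service */
	u64 rel_sum;	/* \Sum ds_i = \Sum (s_i - min_s) */
	unsigned int n;	/* number of waiting events */
};

/* s_i < 1/n \Sum s_j  <=>  n * ds_i < \Sum ds_j, with ds_i = s_i - min_s */
static int eligible(const struct queue *q, u64 s_i)
{
	u64 ds_i = s_i - q->min_s;

	/* only small relative values are multiplied: no overflow risk */
	return q->n && (u64)q->n * ds_i < q->rel_sum;
}

/* raising min_s by delta shrinks every ds_i, so the sum drops by n*delta */
static void advance_min(struct queue *q, u64 delta)
{
	q->min_s += delta;
	q->rel_sum -= (u64)q->n * delta;
}
```

With three events at service 100, 105 and 110 (avg 105), only the first is eligible under the strict comparison, which matches the s_i < avg rule above.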

> Yes. Not clear how you could avoid this without having a global
> view of the dependencies between events (which are really event
> groups, BTW). Here you'd need to know that if you have
> evt A B C
> s(0) 0 0 0 -> avg = 0/3=0.00, sort = A, B, C, schedule A, fail on B
> s(1) 1 0 0 -> avg = 1/3=0.33, sort = B, C, A, schedule B, fail on C
>
> You'd have two options:
> s(2) 1 1 0 -> avg = 2/3=0.66, sort = C, A, B, schedule A, C
> or
> s(2) 1 1 0 -> avg = 2/3=0.66, sort = C, B, A, schedule C
>
> The first is more attractive because it shortens the blind spots on C.
> Both are equally fair, though. Looks like you'd need to add a 2nd
> parameter to the sort when s_i are equal. That would have to be
> related to the number of constraints. You could sort in reverse order
> to the number of constraints, assuming you can express the constraint
> as a number. For simple constraints, such as counter restrictions, you
> could simply compare the number of possible counters between events.
> But there are PMU where there is no counter constraints but events are
> incompatible. What values do you compare in this case?

Not sure, but yeah, using constraint data to tie break is indeed an
interesting option.

I wonder how much tie breaking we'll really need in practice though; if
we use event->total_time_running as our s_i we've got ns resolution
timestamps, and with sub-jiffies preemption being common I doubt we'll
actually see a lot of equal service numbers.

From: Peter Zijlstra
On Mon, 2010-05-10 at 11:41 +0200, Stephane Eranian wrote:
> Looks like a good solution, at least better than what is there right now.

Something like the below I guess, still need to make it an actual patch
though ;-)

The interesting bit is the schedule() function, which tries to schedule
two queues fairly (cpu context and task context) and selects between
the two based on lag -- I'm not quite sure that works out as
expected though, still have to do the actual math.

Also, I guess we should come up with a saner solution for the
per-task-per-cpu events (the unservicable stuff).

---


struct flexible_queue {
	struct rb_root waiting;
	struct rb_node *rb_leftmost;

	u64 min_service;
	s64 rel_avg;	/* \Sum rel over waiting; signed, rel <= 0 */

	u64 clock;
	u64 running_stamp;

	struct list_head running;
	struct list_head unservicable;
	int unservicable_cpu;
	int nr_waiting;
};

enum flexible_type {
	flexible_waiting,
	flexible_running,
	flexible_unservicable,
};

struct flexible_event {
	union {
		struct rb_node rb_entry;
		struct list_head list_entry;
	};
	enum flexible_type type;

	u64 service;
};

static inline
s64 flexible_event_rel(struct flexible_queue *fq, struct flexible_event *fe)
{
	return fe->service - fq->min_service;
}

static void __enqueue_event(struct flexible_queue *fq, struct flexible_event *fe)
{
	struct rb_node **link = &fq->waiting.rb_node;
	struct rb_node *parent = NULL;
	struct flexible_event *entry;
	s64 rel = flexible_event_rel(fq, fe);
	int leftmost = 1;

	if (rel > 0) {
		fq->min_service += rel;
		fq->rel_avg -= fq->nr_waiting * rel;
		rel = 0;
	}

	fq->nr_waiting++;
	fq->rel_avg += rel;

	fe->type = flexible_waiting;

	while (*link) {
		parent = *link;
		entry = rb_entry(parent, struct flexible_event, rb_entry);
		/*
		 * We don't care about collisions. Nodes with
		 * the same rel stay together.
		 */
		if (rel < flexible_event_rel(fq, entry)) {
			link = &parent->rb_left;
		} else {
			link = &parent->rb_right;
			leftmost = 0;
		}
	}

	if (leftmost)
		fq->rb_leftmost = &fe->rb_entry;

	rb_link_node(&fe->rb_entry, parent, link);
	rb_insert_color(&fe->rb_entry, &fq->waiting);
}

static void __dequeue_event(struct flexible_queue *fq, struct flexible_event *fe)
{
	if (fq->rb_leftmost == &fe->rb_entry) {
		struct rb_node *next_node;

		next_node = rb_next(&fe->rb_entry);
		fq->rb_leftmost = next_node;
	}

	rb_erase(&fe->rb_entry, &fq->waiting);

	fq->nr_waiting--;
	fq->rel_avg -= flexible_event_rel(fq, fe);
}

static void
flexible_event_add(struct flexible_queue *fq, struct flexible_event *fe)
{
	/* start a new event at the current average service: zero lag */
	fe->service = fq->min_service;
	if (fq->nr_waiting)
		fe->service += fq->rel_avg / fq->nr_waiting;
	__enqueue_event(fq, fe);
}

static void
flexible_event_del(struct flexible_queue *fq, struct flexible_event *fe)
{
	switch (fe->type) {
	case flexible_waiting:
		__dequeue_event(fq, fe);
		break;

	case flexible_running:
	case flexible_unservicable:
		list_del(&fe->list_entry);
		break;
	}
}

static void flexible_unschedule(struct flexible_queue *fq)
{
	struct flexible_event *fe, *tmp;
	s64 delta = (s64)(fq->clock - fq->running_stamp);

	list_for_each_entry_safe(fe, tmp, &fq->running, list_entry) {
		list_del(&fe->list_entry);
		fe->service += delta;
		__enqueue_event(fq, fe);
	}
}

static
s64 flexible_event_lag(struct flexible_queue *fq, struct flexible_event *fe)
{
	return fq->min_service + (fq->rel_avg / fq->nr_waiting) - fe->service;
}

static struct flexible_event *__pick_event(struct flexible_queue *fq)
{
	struct flexible_event *fe;

	if (!fq->rb_leftmost)
		return NULL;

	fe = rb_entry(fq->rb_leftmost, struct flexible_event, rb_entry);

	return fe;
}

/*
 * Events marked unservicable on another cpu may be servicable here,
 * so put them back on the tree before scheduling.
 */
static void flexible_requeue_unservicable(struct flexible_queue *fq)
{
	struct flexible_event *fe, *tmp;

	if (fq->unservicable_cpu == smp_processor_id())
		return;

	list_for_each_entry_safe(fe, tmp, &fq->unservicable, list_entry) {
		list_del(&fe->list_entry);
		flexible_event_add(fq, fe);
	}

	fq->unservicable_cpu = smp_processor_id();
}

static
int flexible_schedule(struct flexible_queue *fq1, struct flexible_queue *fq2)
{
	struct flexible_event *tmp, *fe;
	struct flexible_queue *fq;
	s64 tmp_lag, max_lag;

	flexible_requeue_unservicable(fq1);
	flexible_requeue_unservicable(fq2);

again:
	max_lag = 0;
	fe = NULL;

	tmp = __pick_event(fq1);
	if (tmp) {
		tmp_lag = flexible_event_lag(fq1, tmp);
		if (tmp_lag > max_lag) {
			fq = fq1;
			fe = tmp;
			max_lag = tmp_lag;
		}
	}

	tmp = __pick_event(fq2);
	if (tmp) {
		tmp_lag = flexible_event_lag(fq2, tmp);
		if (tmp_lag > max_lag) {
			fq = fq2;
			fe = tmp;
			max_lag = tmp_lag;
		}
	}

	if (!fe)
		return 1; /* no more events to schedule */

	fq->running_stamp = fq->clock;

	if (event_of(fe)->cpu != -1 && event_of(fe)->cpu != smp_processor_id()) {
		__dequeue_event(fq, fe);
		fe->type = flexible_unservicable;
		list_add(&fe->list_entry, &fq->unservicable);
		goto again; /* can't run this event, try another */
	}

	if (group_sched_in(event_of(fe), ...) == 0) {
		__dequeue_event(fq, fe);
		fe->type = flexible_running;
		list_add(&fe->list_entry, &fq->running);
		return 0; /* success */
	}

	return -EBUSY; /* doesn't fit */
}

