From: Linus Torvalds on


On Sun, 18 Apr 2010, Mathieu Desnoyers wrote:
>
> CFS fix place entity spread issue (v2)
>
> Huge CFS vruntime spread (18 minutes) has been observed with LTTng while simply
> running Xorg on a uniprocessor machine. Detailed explanation in my ELC2010
> presentation at:
>
> http://www.efficios.com/elc2010

Hmm. I tested this patch with my favourite non-scientific desktop load
test: do web browsing while doing a kernel compile with "make -j16" (after
doing a "git clean -dqfx" and "ccache -C" to get rid of old object files
and ccache).

This is on a dual-core (with SMT, so 4 threads) Core i5, so "make -j16"
overcommits the CPUs quite a bit.

And quite frankly, I think your patch makes things much worse. I don't
have numbers, but it felt much choppier and slower to scroll in
firefox or move windows around while the load average is 20+.

My testload is in no way objective, nor necessarily any good, but it's the
one I happen to use, and while it's subjective, I think the difference was
pretty clear and not good.

Linus
From: Mathieu Desnoyers on
* Linus Torvalds (torvalds(a)linux-foundation.org) wrote:
>
>
> On Sun, 18 Apr 2010, Mathieu Desnoyers wrote:
> >
> > CFS fix place entity spread issue (v2)
> >
> > Huge CFS vruntime spread (18 minutes) has been observed with LTTng while simply
> > running Xorg on a uniprocessor machine. Detailed explanation in my ELC2010
> > presentation at:
> >
> > http://www.efficios.com/elc2010
>
> Hmm. I tested this patch with my favourite non-scientific desktop load
> test: do web browsing while doing a kernel compile with "make -j16" (after
> doing a "git clean -dqfx" and "ccache -C" to get rid of old object files
> and ccache).
>
> This is on a dual-core (with SMT, so 4 threads) Core i5, so "make -j16"
> overcommits the CPUs quite a bit.
>
> And quite frankly, I think your patch makes things much worse. I don't
> have numbers, but it felt much choppier and slower to scroll in
> firefox or move windows around while the load average is 20+.
>
> My testload is in no way objective, nor necessarily any good, but it's the
> one I happen to use, and while it's subjective, I think the difference was
> pretty clear and not good.
>
> Linus

Degradation of this particular workload is expected, because the nature of the
large CFS spread issue favors Xorg: it gets literally minutes of available
vruntime, while most other threads (e.g. the kernel build, audio playback, etc.)
suffer from this.

I'll try to play a bit with this (I've got plenty of time to kill at the airport
today; I missed my connecting flight back from San Francisco). Ideally, we might
want to find a way to ensure that place_entity() is kind to sleepers, but
without putting them before min_vruntime. I don't have any SMP machine with Xorg
handy, but I'll try to push my UP laptop a bit harder with a kernel build.

If we are lucky, we might just have to tweak a knob or two to make Xorg
more responsive.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Linus Torvalds on


On Sun, 18 Apr 2010, Mathieu Desnoyers wrote:
>
> Degradation of this particular workload is expected, because the nature of the
> large CFS spread issue favors Xorg: it gets literally minutes of available
> vruntime, while most other threads (e.g. the kernel build, audio playback, etc.)
> suffer from this.

In that case, I think your patch is fundamentally incorrect.

I care a _whole_ lot more about "usable desktop" than I care about some
inane "available vruntime". If the current behavior breaks your vruntime
rules, then the current behavior is clearly _correct_.

Linus
From: Peter Zijlstra on
On Sun, 2010-04-18 at 09:13 -0400, Mathieu Desnoyers wrote:

> Detailed explanation in my ELC2010 presentation at:
>
> http://www.efficios.com/elc2010

So what happened to comprehensive changelogs?

> I've torn the CFS scheduler apart over the past few days to figure out what is causing
> this weird behavior, and the culprit seems to be place_entity(). The problem
> appears to be the cumulative effect of letting the min_vruntime go backward when
> putting sleepers back on the runqueue.

Right, that should not happen, ever: min_vruntime should be monotonic.
Your explanation fails to mention how exactly this happens.

> It lets the vruntime spread grow to
> "entertaining" values (it is supposed to be in the 5ms range, not 18 minutes!).
>
> In the original code, a max between the sched entity vruntime and the calculated
> vruntime was supposed to "ensure that the thread's time never goes backward". But I
> don't see why we even care about that.

Fairness: a task's vruntime going backwards means it can obtain more
cputime than a while(1) loop would get, which is not a good thing.
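
The max in question is the existing guard at the end of place_entity(),
which roughly reads:

	/* ensure we never gain time by being placed backwards. */
	vruntime = max_vruntime(se->vruntime, vruntime);

	se->vruntime = vruntime;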

> The key point is that the min_vruntime
> of the runqueue should not go backward.

That is certainly important too.

> I propose to fix this by calculating the relative offset from
> min_vruntime + sysctl_sched_latency rather than directly from min_vruntime. I
> also ensure that the value never goes below min_vruntime.

I'm not sure that's a good idea; that'll end up being a 'fairer'
scheduler in the strict sense, but I suspect it will break the wakeup
preemption model we have.
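
If I read the changelog right, the placement for a wakeup becomes
something like this (sketch, not the actual diff):

	u64 vruntime = cfs_rq->min_vruntime + sysctl_sched_latency;

	if (!initial) {
		unsigned long thresh = sysctl_sched_latency;

		if (sched_feat(GENTLE_FAIR_SLEEPERS))
			thresh >>= 1;

		vruntime -= thresh;
	}

	/* never place below min_vruntime */
	se->vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);

With GENTLE_FAIR_SLEEPERS that lands a sleeper at min_vruntime +
latency/2 instead of min_vruntime - latency/2, so the wakeup credit
turns into a debit, which is where I'd expect preemption to hurt.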

> Under the Xorg workload, moving a few windows around and starting firefox while
> executing the wakeup-latency.c program (program waking up every 10ms and
> reporting wakeup latency), this patch brings the worst-case latency from 60ms down to
> 12ms. Even doing a kernel compilation at the same time, the worst-case latency stays
> around 20ms now.

That seems nice enough. But really, this changelog doesn't explain the
cause at all.

/me goes to try and figure out where this min_vruntime goes backwards.

So update_min_vruntime() is the sole place where we change min_vruntime.
It is called from update_curr(), where we add runtime to the task we're
currently servicing, and from dequeue_entity(), where we remove someone
from the tree.

These two sites are the only places where min_vruntime can change,
because only adding service to the leftmost task, or removal thereof,
can make the min_vruntime go forward.

So update_min_vruntime():

- takes the smaller vruntime between the current task and the leftmost
tree entry (current is not in the tree, and we can pass beyond the
leftmost waiting task due to limited preemption and minimum service
windows).

- sets min_vruntime to the larger between the existing min_vruntime and
the smallest vruntime.
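
In code, that is roughly:

	static void update_min_vruntime(struct cfs_rq *cfs_rq)
	{
		u64 vruntime = cfs_rq->min_vruntime;

		/* current is not represented in the tree */
		if (cfs_rq->curr)
			vruntime = cfs_rq->curr->vruntime;

		if (cfs_rq->rb_leftmost) {
			struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
							   struct sched_entity,
							   run_node);

			if (!cfs_rq->curr)
				vruntime = se->vruntime;
			else
				vruntime = min_vruntime(vruntime, se->vruntime);
		}

		/* monotonic: only ever moves forward */
		cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
	}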

So colour me confused if I'm not seeing min_vruntime going backwards...



From: Mathieu Desnoyers on
* Peter Zijlstra (peterz(a)infradead.org) wrote:
> On Sun, 2010-04-18 at 09:13 -0400, Mathieu Desnoyers wrote:
>
> > Detailed explanation in my ELC2010 presentation at:
> >
> > http://www.efficios.com/elc2010
>
> So what happened to comprehensive changelogs?
>
> > I've torn the CFS scheduler apart over the past few days to figure out what is causing
> > this weird behavior, and the culprit seems to be place_entity(). The problem
> > appears to be the cumulative effect of letting the min_vruntime go backward when
> > putting sleepers back on the runqueue.
>
> Right, that should not happen, ever: min_vruntime should be monotonic.
> Your explanation fails to mention how exactly this happens.
>
> > It lets the vruntime spread grow to
> > "entertaining" values (it is supposed to be in the 5ms range, not 18 minutes!).
> >
> > In the original code, a max between the sched entity vruntime and the calculated
> > vruntime was supposed to "ensure that the thread's time never goes backward". But I
> > don't see why we even care about that.
>
> Fairness: a task's vruntime going backwards means it can obtain more
> cputime than a while(1) loop would get, which is not a good thing.
>
> > The key point is that the min_vruntime
> > of the runqueue should not go backward.
>
> That is certainly important too.
>
> > I propose to fix this by calculating the relative offset from
> > min_vruntime + sysctl_sched_latency rather than directly from min_vruntime. I
> > also ensure that the value never goes below min_vruntime.
>
> I'm not sure that's a good idea; that'll end up being a 'fairer'
> scheduler in the strict sense, but I suspect it will break the wakeup
> preemption model we have.
>
> > Under the Xorg workload, moving a few windows around and starting firefox while
> > executing the wakeup-latency.c program (program waking up every 10ms and
> > reporting wakeup latency), this patch brings the worst-case latency from 60ms down to
> > 12ms. Even doing a kernel compilation at the same time, the worst-case latency stays
> > around 20ms now.
>
> That seems nice enough. But really, this changelog doesn't explain the
> cause at all.
>
> /me goes to try and figure out where this min_vruntime goes backwards.
>
> So update_min_vruntime() is the sole place where we change min_vruntime.
> It is called from update_curr(), where we add runtime to the task we're
> currently servicing, and from dequeue_entity(), where we remove someone
> from the tree.
>
> These two sites are the only places where min_vruntime can change,
> because only adding service to the leftmost task, or removal thereof,
> can make the min_vruntime go forward.
>
> So update_min_vruntime():
>
> - takes the smaller vruntime between the current task and the leftmost
> tree entry (current is not in the tree, and we can pass beyond the
> leftmost waiting task due to limited preemption and minimum service
> windows).
>
> - sets min_vruntime to the larger between the existing min_vruntime and
> the smallest vruntime.
>
> So colour me confused if I'm not seeing min_vruntime going backwards...

place_entity(), when placing a sleeper on the runqueue, takes its vruntime
from min_vruntime and decrements it by "thresh" (derived from the load
calculation). So if we end up with a woken-up task that does not consume
enough vruntime to pass the previous min_vruntime value, we effectively
decrement min_vruntime.
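
For reference, the sleeper path in place_entity() reads roughly:

	u64 vruntime = cfs_rq->min_vruntime;
	...
	if (!initial) {
		unsigned long thresh = sysctl_sched_latency;

		if (sched_feat(GENTLE_FAIR_SLEEPERS))
			thresh >>= 1;

		/* this can land below cfs_rq->min_vruntime */
		vruntime -= thresh;
	}

	/* guards only the task's own vruntime, not min_vruntime */
	vruntime = max_vruntime(se->vruntime, vruntime);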

The other thing I am currently looking into is that, even without considering
going below min_vruntime, a woken-up sleeper gets an extra share of vruntime
(again because of place_entity()), so it may keep running between -thresh and
the previous min_vruntime, thereby "stalling" min_vruntime at low values while
other threads push the maximum vruntime higher, which increases the spread.
I'm currently running tests that try to cope with this by consuming woken-up
sleepers' vruntime more quickly for a vruntime duration of "thresh".
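
Something along these lines (untested sketch; "wakeup_credit" would be
a new sched_entity field charged in place_entity()):

	/*
	 * Consume a freshly woken sleeper's vruntime at twice the
	 * normal rate until the "thresh" credit is repaid.
	 */
	static inline u64 sleeper_scaled_delta(struct sched_entity *se, u64 delta)
	{
		if (se->wakeup_credit) {
			u64 repay = min(se->wakeup_credit, delta);

			se->wakeup_credit -= repay;
			delta += repay;
		}
		return delta;
	}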

I am also testing alternative ways to deal with the min_vruntime-going-backward
problem, e.g. keeping a "floor_vruntime", which ensures that place_entity() can
never decrement min_vruntime below the floor left by its previous decrement.
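
Roughly (again an untested sketch; "floor_vruntime" would be a new
cfs_rq field):

	if (!initial) {
		unsigned long thresh = sysctl_sched_latency;

		if (sched_feat(GENTLE_FAIR_SLEEPERS))
			thresh >>= 1;

		vruntime -= thresh;

		/* never place below the floor left by a previous placement */
		vruntime = max_vruntime(cfs_rq->floor_vruntime, vruntime);
		cfs_rq->floor_vruntime = vruntime;
	}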

Sorry for the missing changelog info; I was between planes this weekend.
Hopefully my sleep-deprived explanation makes sense. ;)

Thanks,

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com