From: Con Kolivas on
On Thu, 10 Sep 2009 06:50:43 Jens Axboe wrote:
> On Wed, Sep 09 2009, Nikos Chantziaras wrote:
> > On 09/09/2009 09:04 PM, Ingo Molnar wrote:
> >> [...]
> >>
> >> * Jens Axboe<jens.axboe(a)oracle.com> wrote:
> >>> On Wed, Sep 09 2009, Jens Axboe wrote:
> >>> [...]
> >>> BFS210 runs on the laptop (dual core intel core duo). With make -j4
> >>> running, I clock the following latt -c8 'sleep 10' latencies:
> >>>
> >>> -rc9
> >>>
> >>> Max 17895 usec
> >>> Avg 8028 usec
> >>> Stdev 5948 usec
> >>> Stdev mean 405 usec
> >>>
> >>> Max 17896 usec
> >>> Avg 4951 usec
> >>> Stdev 6278 usec
> >>> Stdev mean 427 usec
> >>>
> >>> Max 17885 usec
> >>> Avg 5526 usec
> >>> Stdev 6819 usec
> >>> Stdev mean 464 usec
> >>>
> >>> -rc9 + mike
> >>>
> >>> Max 6061 usec
> >>> Avg 3797 usec
> >>> Stdev 1726 usec
> >>> Stdev mean 117 usec
> >>>
> >>> Max 5122 usec
> >>> Avg 3958 usec
> >>> Stdev 1697 usec
> >>> Stdev mean 115 usec
> >>>
> >>> Max 6691 usec
> >>> Avg 2130 usec
> >>> Stdev 2165 usec
> >>> Stdev mean 147 usec
> >>
> >> At least in my tests these latencies were mainly due to a bug in
> >> latt.c - i've attached the fixed version.
> >>
> >> The other reason was wakeup batching. If you do this:
> >>
> >> echo 0 > /proc/sys/kernel/sched_wakeup_granularity_ns
> >>
> >> ... then you can switch on insta-wakeups on -tip too.
> >>
> >> With a dual-core box and a make -j4 background job running, on
> >> latest -tip i get the following latencies:
> >>
> >> $ ./latt -c8 sleep 30
> >> Entries: 656 (clients=8)
> >>
> >> Averages:
> >> ------------------------------
> >> Max 158 usec
> >> Avg 12 usec
> >> Stdev 10 usec
> >
> > With your version of latt.c, I get these results with 2.6-tip vs
> > 2.6.31-rc9-bfs:
> >
> >
> > (mainline)
> > Averages:
> > ------------------------------
> > Max 50 usec
> > Avg 12 usec
> > Stdev 3 usec
> >
> >
> > (BFS)
> > Averages:
> > ------------------------------
> > Max 474 usec
> > Avg 11 usec
> > Stdev 16 usec
> >
> >
> > However, the interactivity problems still remain. Does that mean it's
> > not a latency issue?
>
> It probably just means that latt isn't a good measure of the problem.
> Which isn't really too much of a surprise.

And that's a real shame, because this was one of the first really good
attempts I've seen to actually measure the difference, and I thank you for
your efforts, Jens. I believe the reason it's limited is that all you're
measuring is the time from wakeup, and the test app isn't actually doing any
work. The issue is more than just waking up as fast as possible; it's then
doing some meaningful amount of work within a reasonable time frame as well.
What the "meaningful amount of work" and the "reasonable time frame" are
remains a mystery, but I guess they could be added to this testing app.
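
To illustrate, something along these lines is roughly what I have in mind.
This is a throwaway sketch only, not latt.c itself, and the loop sizes are
entirely arbitrary placeholders:

/*
 * Rough sketch only, not latt.c: the parent stamps the wakeup time into
 * a pipe, and the woken child measures not just how quickly it received
 * the stamp (pure wakeup latency) but how long until it has also
 * finished a fixed chunk of work.  The loop sizes are placeholders.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

static unsigned long long nsecs(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
        unsigned long long stamp;
        int pfd[2], i;

        if (pipe(pfd))
                exit(1);

        if (!fork()) {                          /* child: "interactive" task */
                volatile unsigned long sum = 0;
                unsigned long j;

                for (i = 0; i < 100; i++) {
                        if (read(pfd[0], &stamp, sizeof(stamp)) != sizeof(stamp))
                                exit(1);
                        for (j = 0; j < 5000000; j++)   /* arbitrary "work" */
                                sum += j;
                        printf("wake+work: %llu usec\n",
                               (nsecs() - stamp) / 1000);
                }
                exit(0);
        }

        for (i = 0; i < 100; i++) {             /* parent: the waker */
                usleep(10000);
                stamp = nsecs();
                if (write(pfd[1], &stamp, sizeof(stamp)) != sizeof(stamp))
                        break;
        }
        wait(NULL);
        return 0;
}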

What does please me now, though, is that this message thread is finally
concentrating on what BFS was all about. The fact that it doesn't scale is no
mystery whatsoever. The fact that throughput and the lack of scaling were
what got all the attention was missing the point entirely. To point that out
I used the bluntest response possible, because I know that works on lkml
(does it not?). Unfortunately I was so blunt that I ended up writing it in
another language: Troll. So for that, I apologise.

The unfortunate part is that BFS is still far from a working, complete state,
yet word got out that I had "released" something, which I had not. But
obviously there's no great distinction between putting something on a server
for testing and a real release with an announcement.

BFS is a scheduling experiment to demonstrate what effect the CPU scheduler
really has on the desktop, and how it might be able to perform if we design
the scheduler for that one purpose.

It pleases me immensely to see that it has already spurred on a flood of
changes to the interactivity side of mainline development in its few days of
existence, including some ideas that BFS itself uses. That in itself, to me,
means it has already started to accomplish its goal, which ultimately, one
way or another, is to improve what the CPU scheduler can do for the Linux
desktop. I can't track all the sensitive areas of the mainline scheduler
changes without getting involved more deeply than I care to, so it would be
counterproductive of me to try and hack on mainline. I much prefer the
quieter inbox.

If people want to use BFS for their own purposes or projects, or, even
better, help hack on it, that would make me happy for different reasons. I
will continue to work on my little project, in my own time, and hope that it
continues to drive further development of the mainline kernel in its own way.
We need more experiments like this to question what we currently have and
accept. Other major kernel subsystems are no exception.

Regards,
--
-ck

<code before rhetoric>
From: Mike Galbraith on
On Wed, 2009-09-09 at 23:12 +0300, Nikos Chantziaras wrote:

> With your version of latt.c, I get these results with 2.6-tip vs
> 2.6.31-rc9-bfs:
>
>
> (mainline)
> Averages:
> ------------------------------
> Max 50 usec
> Avg 12 usec
> Stdev 3 usec
>
>
> (BFS)
> Averages:
> ------------------------------
> Max 474 usec
> Avg 11 usec
> Stdev 16 usec
>
>
> However, the interactivity problems still remain. Does that mean it's
> not a latency issue?

Could be a fairness issue. If X+client needs more than its fair share
of CPU, there's nothing to do but use nice levels. I'm stuck with
unaccelerated X (nvidia card), so if I want a good DVD-watching or
whatever eye-candy experience while my box does a lot of other work, I
either have to use SCHED_IDLE/nice for the background stuff, or renice
X. That's the downside of a fair scheduler.
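
FWIW, something like this little wrapper is what I mean by SCHED_IDLE for
the background stuff. It's a throwaway sketch (the "idle-run" name is made
up), so treat it as illustration rather than a recommendation:

/*
 * Re-exec a command under SCHED_IDLE so it only gets CPU time that the
 * foreground doesn't want, e.g.:  ./idle-run make -j4
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        struct sched_param sp = { .sched_priority = 0 };

        if (argc < 2) {
                fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
                return 1;
        }

        /* SCHED_IDLE ignores sched_priority; it must be 0 */
        if (sched_setscheduler(0, SCHED_IDLE, &sp)) {
                perror("sched_setscheduler");
                return 1;
        }

        execvp(argv[1], argv + 1);
        perror("execvp");
        return 1;
}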

There is another variant of latency-related interactivity issue for the
desktop though: too LOW latency. If X and clients are switching too
fast, redraw can look nasty, sliced and diced.

-Mike

From: Ingo Molnar on

* Nikos Chantziaras <realnc(a)arcor.de> wrote:

> On 09/09/2009 09:04 PM, Ingo Molnar wrote:
>> [...]
>> * Jens Axboe<jens.axboe(a)oracle.com> wrote:
>>
>>> On Wed, Sep 09 2009, Jens Axboe wrote:
>>> [...]
>>> BFS210 runs on the laptop (dual core intel core duo). With make -j4
>>> running, I clock the following latt -c8 'sleep 10' latencies:
>>>
>>> -rc9
>>>
>>> Max 17895 usec
>>> Avg 8028 usec
>>> Stdev 5948 usec
>>> Stdev mean 405 usec
>>>
>>> Max 17896 usec
>>> Avg 4951 usec
>>> Stdev 6278 usec
>>> Stdev mean 427 usec
>>>
>>> Max 17885 usec
>>> Avg 5526 usec
>>> Stdev 6819 usec
>>> Stdev mean 464 usec
>>>
>>> -rc9 + mike
>>>
>>> Max 6061 usec
>>> Avg 3797 usec
>>> Stdev 1726 usec
>>> Stdev mean 117 usec
>>>
>>> Max 5122 usec
>>> Avg 3958 usec
>>> Stdev 1697 usec
>>> Stdev mean 115 usec
>>>
>>> Max 6691 usec
>>> Avg 2130 usec
>>> Stdev 2165 usec
>>> Stdev mean 147 usec
>>
>> At least in my tests these latencies were mainly due to a bug in
>> latt.c - i've attached the fixed version.
>>
>> The other reason was wakeup batching. If you do this:
>>
>> echo 0 > /proc/sys/kernel/sched_wakeup_granularity_ns
>>
>> ... then you can switch on insta-wakeups on -tip too.
>>
>> With a dual-core box and a make -j4 background job running, on
>> latest -tip i get the following latencies:
>>
>> $ ./latt -c8 sleep 30
>> Entries: 656 (clients=8)
>>
>> Averages:
>> ------------------------------
>> Max 158 usec
>> Avg 12 usec
>> Stdev 10 usec
>
> With your version of latt.c, I get these results with 2.6-tip vs
> 2.6.31-rc9-bfs:
>
>
> (mainline)
> Averages:
> ------------------------------
> Max 50 usec
> Avg 12 usec
> Stdev 3 usec
>
>
> (BFS)
> Averages:
> ------------------------------
> Max 474 usec
> Avg 11 usec
> Stdev 16 usec
>
> However, the interactivity problems still remain. Does that mean
> it's not a latency issue?

It means that Jens's test-app, which demonstrated and helped us fix
the issue for him, does not help us fix it for you just yet.

The "fluidity problem" you described might not be a classic latency
issue per se (which latt.c measures), but a timeslicing / CPU time
distribution problem.

A slight shift in CPU time allocation can change the flow of tasks
to result in a 'choppier' system.

Have you tried, in addition to the granularity tweaks you've done,
to renice mplayer either up or down? (or compiz and Xorg, for that
matter)

I'm not necessarily suggesting this as a 'real' solution (we really
prefer kernels that just get it right) - but it's an additional
parameter dimension along which you can tweak CPU time distribution
on your box.

Here's my general rule of thumb: one nice level gives plus 5% CPU
time to a task and takes away 5% CPU time from another task - i.e.
it shifts the CPU allocation by 10%. So with two CPU hogs competing,
a single nice level of difference turns a 50%/50% split into roughly
55%/45%.

( this is modified by all sorts of dynamic conditions: by the number
of tasks running and their wakeup patterns, so it's not a rule cast
in stone - but still a good ballpark figure for CPU-intensive tasks. )

Btw., i've read your descriptions about what you've tuned so far -
have you seen/checked the wakeup_granularity tunable as well?
Setting that to 0 will change the general balance of how CPU time is
allocated between tasks too.

There's also a whole bunch of scheduler features you can turn on/off
individually via /debug/sched_features. For example, to turn off
NEW_FAIR_SLEEPERS, you can do:

# cat /debug/sched_features
NEW_FAIR_SLEEPERS NO_NORMALIZED_SLEEPER ADAPTIVE_GRAN WAKEUP_PREEMPT
START_DEBIT AFFINE_WAKEUPS CACHE_HOT_BUDDY SYNC_WAKEUPS NO_HRTICK
NO_DOUBLE_TICK ASYM_GRAN LB_BIAS LB_WAKEUP_UPDATE ASYM_EFF_LOAD
NO_WAKEUP_OVERLAP LAST_BUDDY OWNER_SPIN

# echo NO_NEW_FAIR_SLEEPERS > /debug/sched_features

Btw., NO_NEW_FAIR_SLEEPERS is something that will turn the scheduler
into a more classic fair scheduler (like BFS is too).

NO_START_DEBIT might be another thing that improves (or worsens :-/)
make -j type of kernel build workloads.

Note, these flags are all runtime tunables; the new settings take
effect almost immediately (and at the latest when a task has started
up), and they are safe to change at runtime.

It basically gives us 32768 pluggable schedulers, each with a
slightly different algorithm - each setting in essence creates a new
scheduler. (This mechanism is how we introduce new scheduler
features and allow their debugging / regression-testing.)

(okay, almost, so beware: turning on HRTICK might lock up your
system.)

Plus, yet another dimension of tuning on SMP systems (such as
dual-core) is the sched-domains tunables. There's a whole world of
tuning in that area, and BFS essentially implements a very aggressive
'always balance to other CPUs' policy.

I've attached my sched-tune-domains script which helps tune these
parameters.

For example on a testbox of mine it outputs:

usage: tune-sched-domains <val>
{cpu0/domain0:SIBLING} SD flag: 239
+ 1: SD_LOAD_BALANCE: Do load balancing on this domain
+ 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
+ 4: SD_BALANCE_EXEC: Balance on exec
+ 8: SD_BALANCE_FORK: Balance on fork, clone
- 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
+ 32: SD_WAKE_AFFINE: Wake task to waking CPU
+ 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
+ 128: SD_SHARE_CPUPOWER: Domain members share cpu power
- 256: SD_POWERSAVINGS_BALANCE: Balance for power savings
- 512: SD_SHARE_PKG_RESOURCES: Domain members share cpu pkg resources
-1024: SD_SERIALIZE: Only a single load balancing instance
-2048: SD_WAKE_IDLE_FAR: Gain latency sacrificing cache hit
-4096: SD_PREFER_SIBLING: Prefer to place tasks in a sibling domain
{cpu0/domain1:MC} SD flag: 4735
+ 1: SD_LOAD_BALANCE: Do load balancing on this domain
+ 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
+ 4: SD_BALANCE_EXEC: Balance on exec
+ 8: SD_BALANCE_FORK: Balance on fork, clone
+ 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
+ 32: SD_WAKE_AFFINE: Wake task to waking CPU
+ 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
- 128: SD_SHARE_CPUPOWER: Domain members share cpu power
- 256: SD_POWERSAVINGS_BALANCE: Balance for power savings
+ 512: SD_SHARE_PKG_RESOURCES: Domain members share cpu pkg resources
-1024: SD_SERIALIZE: Only a single load balancing instance
-2048: SD_WAKE_IDLE_FAR: Gain latency sacrificing cache hit
+4096: SD_PREFER_SIBLING: Prefer to place tasks in a sibling domain
{cpu0/domain2:NODE} SD flag: 3183
+ 1: SD_LOAD_BALANCE: Do load balancing on this domain
+ 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
+ 4: SD_BALANCE_EXEC: Balance on exec
+ 8: SD_BALANCE_FORK: Balance on fork, clone
- 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
+ 32: SD_WAKE_AFFINE: Wake task to waking CPU
+ 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
- 128: SD_SHARE_CPUPOWER: Domain members share cpu power
- 256: SD_POWERSAVINGS_BALANCE: Balance for power savings
- 512: SD_SHARE_PKG_RESOURCES: Domain members share cpu pkg resources
+1024: SD_SERIALIZE: Only a single load balancing instance
+2048: SD_WAKE_IDLE_FAR: Gain latency sacrificing cache hit
-4096: SD_PREFER_SIBLING: Prefer to place tasks in a sibling domain

The way i can turn on, say, SD_WAKE_IDLE for the NODE domain is to:

tune-sched-domains 239 4735 $((3183+16))

( This is a pretty stone-age script i admit ;-)
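
In case the flag arithmetic above looks opaque, here's a quick throwaway
decoder - just a sketch, not part of the script, and it only knows the bit
values listed above. Usage would be e.g. ./sd-decode $((3183+16)):

/* Decode an SD flags value (239, 4735, 3199, ...) into its bits. */
#include <stdio.h>
#include <stdlib.h>

static const struct { int bit; const char *name; } sd_flags[] = {
        {    1, "SD_LOAD_BALANCE" },
        {    2, "SD_BALANCE_NEWIDLE" },
        {    4, "SD_BALANCE_EXEC" },
        {    8, "SD_BALANCE_FORK" },
        {   16, "SD_WAKE_IDLE" },
        {   32, "SD_WAKE_AFFINE" },
        {   64, "SD_WAKE_BALANCE" },
        {  128, "SD_SHARE_CPUPOWER" },
        {  256, "SD_POWERSAVINGS_BALANCE" },
        {  512, "SD_SHARE_PKG_RESOURCES" },
        { 1024, "SD_SERIALIZE" },
        { 2048, "SD_WAKE_IDLE_FAR" },
        { 4096, "SD_PREFER_SIBLING" },
};

int main(int argc, char **argv)
{
        unsigned int i, val;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <flags value>\n", argv[0]);
                return 1;
        }
        val = atoi(argv[1]);
        for (i = 0; i < sizeof(sd_flags) / sizeof(sd_flags[0]); i++)
                printf("%c%5d: %s\n",
                       (val & sd_flags[i].bit) ? '+' : '-',
                       sd_flags[i].bit, sd_flags[i].name);
        return 0;
}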

Thanks for all your testing so far,

Ingo
From: Ingo Molnar on

* Ingo Molnar <mingo(a)elte.hu> wrote:

> > However, the interactivity problems still remain. Does that
> > mean it's not a latency issue?
>
> It means that Jens's test-app, which demonstrated and helped us
> fix the issue for him does not help us fix it for you just yet.

Lemme qualify that by saying that Jens's issues are improved, not
fixed [he has not re-run with the latest latt.c yet], but not all
things are fully fixed yet. For example, the xmodmap thing sounds
interesting - could that be a child-runs-first effect?

Ingo
From: Peter Zijlstra on
On Wed, 2009-09-09 at 14:20 +0200, Jens Axboe wrote:
>
> One thing I also noticed is that when I have logged in, I run xmodmap
> manually to load some keymappings (I always tell myself to add this to
> the login scripts, but I suspend/resume this laptop for weeks at a
> time and forget before the next boot). With the stock kernel, xmodmap
> will halt X updates and take forever to run. With BFS, it returned
> instantly. As I would expect.

Can you provide a little more detail (I'm an xmodmap n00b): how does one
run xmodmap, and could you maybe provide your xmodmap config?
