From: Steven Rostedt on
On Thu, 2009-09-10 at 11:44 +0200, Jens Axboe wrote:
> On Thu, Sep 10 2009, Ingo Molnar wrote:

> trace.txt attached. Steven, you seem to go through a lot of trouble to
> find the debugfs path, yet at the very end do:
>
> > system("cat /debug/tracing/trace");
>
> which doesn't seem quite right :-)
>

That's an older version of the tool. The newer version (still in alpha)
doesn't do that.

-- Steve


From: Peter Zijlstra on
On Thu, 2009-09-10 at 09:02 -0700, Bret Towe wrote:
>
> thanks to this thread and others I've seen several kernel tunables
> that can affect how the scheduler performs/acts
> but what I don't see after a bit of looking is where all these are
> documented
> perhaps that's also part of the reason there are unhappy people with
> the current code in the kernel just because they don't know how
> to tune it for their workload

The thing is, ideally they should not need to poke at these. These knobs
are under CONFIG_SCHED_DEBUG, and that is exactly what they are for.

From: Bret Towe on
On Wed, Sep 9, 2009 at 11:08 PM, Ingo Molnar <mingo(a)elte.hu> wrote:
>
> * Nikos Chantziaras <realnc(a)arcor.de> wrote:
>
>> On 09/09/2009 09:04 PM, Ingo Molnar wrote:
>>> [...]
>>> * Jens Axboe <jens.axboe(a)oracle.com> wrote:
>>>
>>>> On Wed, Sep 09 2009, Jens Axboe wrote:
>>>>  [...]
>>>> BFS210 runs on the laptop (dual core intel core duo). With make -j4
>>>> running, I clock the following latt -c8 'sleep 10' latencies:
>>>>
>>>> -rc9
>>>>
>>>>          Max                17895 usec
>>>>          Avg                 8028 usec
>>>>          Stdev               5948 usec
>>>>          Stdev mean           405 usec
>>>>
>>>>          Max                17896 usec
>>>>          Avg                 4951 usec
>>>>          Stdev               6278 usec
>>>>          Stdev mean           427 usec
>>>>
>>>>          Max                17885 usec
>>>>          Avg                 5526 usec
>>>>          Stdev               6819 usec
>>>>          Stdev mean           464 usec
>>>>
>>>> -rc9 + mike
>>>>
>>>>          Max                 6061 usec
>>>>          Avg                 3797 usec
>>>>          Stdev               1726 usec
>>>>          Stdev mean           117 usec
>>>>
>>>>          Max                 5122 usec
>>>>          Avg                 3958 usec
>>>>          Stdev               1697 usec
>>>>          Stdev mean           115 usec
>>>>
>>>>          Max                 6691 usec
>>>>          Avg                 2130 usec
>>>>          Stdev               2165 usec
>>>>          Stdev mean           147 usec
>>>
>>> At least in my tests these latencies were mainly due to a bug in
>>> latt.c - i've attached the fixed version.
>>>
>>> The other reason was wakeup batching. If you do this:
>>>
>>>     echo 0 > /proc/sys/kernel/sched_wakeup_granularity_ns
>>>
>>> ... then you can switch on insta-wakeups on -tip too.
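
A quick way to try that tunable and put it back afterwards (the path and the
latt invocation are taken from the mails above; saving the old value first is
just a precaution, not part of the original suggestion):

  # remember the current wakeup granularity, zero it for the test, restore it
  old=$(cat /proc/sys/kernel/sched_wakeup_granularity_ns)
  echo 0 > /proc/sys/kernel/sched_wakeup_granularity_ns
  ./latt -c8 'sleep 10'        # or whatever workload you are measuring
  echo "$old" > /proc/sys/kernel/sched_wakeup_granularity_ns
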
>>>
>>> With a dual-core box and a make -j4 background job running, on
>>> latest -tip i get the following latencies:
>>>
>>>   $ ./latt -c8 sleep 30
>>>   Entries: 656 (clients=8)
>>>
>>>   Averages:
>>>   ------------------------------
>>>      Max           158 usec
>>>      Avg            12 usec
>>>      Stdev          10 usec
>>
>> With your version of latt.c, I get these results with 2.6-tip vs
>> 2.6.31-rc9-bfs:
>>
>>
>> (mainline)
>> Averages:
>> ------------------------------
>>         Max            50 usec
>>         Avg            12 usec
>>         Stdev           3 usec
>>
>>
>> (BFS)
>> Averages:
>> ------------------------------
>>         Max           474 usec
>>         Avg            11 usec
>>         Stdev          16 usec
>>
>> However, the interactivity problems still remain.  Does that mean
>> it's not a latency issue?
>
> It means that Jens's test-app, which demonstrated and helped us fix
> the issue for him, does not help us fix it for you just yet.
>
> The "fluidity problem" you described might not be a classic latency
> issue per se (which latt.c measures), but a timeslicing / CPU time
> distribution problem.
>
> A slight shift in CPU time allocation can change the flow of tasks
> to result in a 'choppier' system.
>
> Have you tried, in addition to the granularity tweaks you've done,
> to renice mplayer either up or down? (or compiz and Xorg for that
> matter)
>
> I'm not necessarily suggesting this as a 'real' solution (we really
> prefer kernels that just get it right) - but it's an additional
> parameter dimension along which you can tweak CPU time distribution
> on your box.
>
> Here's my general rule of thumb: one nice level gives plus 5% CPU
> time to a task and takes away 5% CPU time from another task -
> i.e. it shifts the CPU allocation by 10%.
>
> ( this is modified by all sorts of dynamic conditions: by the number
>   of tasks running and their wakeup patterns - so it's not a rule cast
>   in stone, but still a good ballpark figure for CPU-intense tasks. )
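
A sketch of the renice experiment suggested above (the increments are only
illustrative, and negative ones need root; per the ~10% per-nice-level rule
of thumb a couple of levels is usually enough):

  # shift CPU time towards mplayer, or away from the background build
  renice -n -2 -p $(pidof mplayer)    # boost mplayer by two nice levels
  renice -n +5 -p $(pidof make)       # or deprioritize the make -j4 job
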
>
> Btw., i've read your descriptions about what you've tuned so far -
> have you seen/checked the wakeup_granularity tunable as well?
> Setting that to 0 will change the general balance of how CPU time is
> allocated between tasks too.
>
> There's also a whole bunch of scheduler features you can turn on/off
> individually via /debug/sched_features. For example, to turn off
> NEW_FAIR_SLEEPERS, you can do:
>
>   # cat /debug/sched_features
>   NEW_FAIR_SLEEPERS NO_NORMALIZED_SLEEPER ADAPTIVE_GRAN WAKEUP_PREEMPT
>   START_DEBIT AFFINE_WAKEUPS CACHE_HOT_BUDDY SYNC_WAKEUPS NO_HRTICK
>   NO_DOUBLE_TICK ASYM_GRAN LB_BIAS LB_WAKEUP_UPDATE ASYM_EFF_LOAD
>   NO_WAKEUP_OVERLAP LAST_BUDDY OWNER_SPIN
>
>   # echo NO_NEW_FAIR_SLEEPERS > /debug/sched_features
>
> Btw., NO_NEW_FAIR_SLEEPERS is something that will turn the scheduler
> into a more classic fair scheduler (like BFS is too).
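
A small sketch of flipping that feature and putting it back, using the /debug
mount point from the listing above (on many setups debugfs is mounted at
/sys/kernel/debug instead):

  # disable NEW_FAIR_SLEEPERS, check the feature list, then re-enable it
  echo NO_NEW_FAIR_SLEEPERS > /debug/sched_features
  cat /debug/sched_features           # should now list NO_NEW_FAIR_SLEEPERS
  echo NEW_FAIR_SLEEPERS > /debug/sched_features
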
>
> NO_START_DEBIT might be another thing that improves (or worsens :-/)
> make -j type of kernel build workloads.

thanks to this thread and others I've seen several kernel tunables
that can affect how the scheduler performs/acts,
but what I don't see after a bit of looking is where all these are documented.
Perhaps that's also part of the reason there are unhappy people with
the current code in the kernel - just because they don't know how
to tune it for their workload.

> Note, these flags can all be changed at runtime; the new settings take
> effect almost immediately (at the latest when a task has started up)
> and are safe to change on a running system.
>
> It basically gives us 32768 pluggable schedulers, each with a
> slightly different algorithm - each setting in essence creates a new
> scheduler. (this mechanism is how we introduce new scheduler
> features and allow their debugging / regression-testing.)
>
> (okay, almost, so beware: turning on HRTICK might lock up your
> system.)
>
> Plus, yet another dimension of tuning on SMP systems (such as
> dual-core) is the set of sched-domains tunables. There's a whole world
> of tuning in that area and BFS essentially implements a very aggressive
> 'always balance to other CPUs' policy.
>
> I've attached my sched-tune-domains script which helps tune these
> parameters.
>
> For example on a testbox of mine it outputs:
>
> usage: tune-sched-domains <val>
> {cpu0/domain0:SIBLING} SD flag: 239
> +   1: SD_LOAD_BALANCE:          Do load balancing on this domain
> +   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
> +   4: SD_BALANCE_EXEC:          Balance on exec
> +   8: SD_BALANCE_FORK:          Balance on fork, clone
> -  16: SD_WAKE_IDLE:             Wake to idle CPU on task wakeup
> +  32: SD_WAKE_AFFINE:           Wake task to waking CPU
> +  64: SD_WAKE_BALANCE:          Perform balancing at task wakeup
> + 128: SD_SHARE_CPUPOWER:        Domain members share cpu power
> - 256: SD_POWERSAVINGS_BALANCE:  Balance for power savings
> - 512: SD_SHARE_PKG_RESOURCES:   Domain members share cpu pkg resources
> -1024: SD_SERIALIZE:             Only a single load balancing instance
> -2048: SD_WAKE_IDLE_FAR:         Gain latency sacrificing cache hit
> -4096: SD_PREFER_SIBLING:        Prefer to place tasks in a sibling domain
> {cpu0/domain1:MC} SD flag: 4735
> +   1: SD_LOAD_BALANCE:          Do load balancing on this domain
> +   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
> +   4: SD_BALANCE_EXEC:          Balance on exec
> +   8: SD_BALANCE_FORK:          Balance on fork, clone
> +  16: SD_WAKE_IDLE:             Wake to idle CPU on task wakeup
> +  32: SD_WAKE_AFFINE:           Wake task to waking CPU
> +  64: SD_WAKE_BALANCE:          Perform balancing at task wakeup
> - 128: SD_SHARE_CPUPOWER:        Domain members share cpu power
> - 256: SD_POWERSAVINGS_BALANCE:  Balance for power savings
> + 512: SD_SHARE_PKG_RESOURCES:   Domain members share cpu pkg resources
> -1024: SD_SERIALIZE:             Only a single load balancing instance
> -2048: SD_WAKE_IDLE_FAR:         Gain latency sacrificing cache hit
> +4096: SD_PREFER_SIBLING:        Prefer to place tasks in a sibling domain
> {cpu0/domain2:NODE} SD flag: 3183
> +   1: SD_LOAD_BALANCE:          Do load balancing on this domain
> +   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
> +   4: SD_BALANCE_EXEC:          Balance on exec
> +   8: SD_BALANCE_FORK:          Balance on fork, clone
> -  16: SD_WAKE_IDLE:             Wake to idle CPU on task wakeup
> +  32: SD_WAKE_AFFINE:           Wake task to waking CPU
> +  64: SD_WAKE_BALANCE:          Perform balancing at task wakeup
> - 128: SD_SHARE_CPUPOWER:        Domain members share cpu power
> - 256: SD_POWERSAVINGS_BALANCE:  Balance for power savings
> - 512: SD_SHARE_PKG_RESOURCES:   Domain members share cpu pkg resources
> +1024: SD_SERIALIZE:             Only a single load balancing instance
> +2048: SD_WAKE_IDLE_FAR:         Gain latency sacrificing cache hit
> -4096: SD_PREFER_SIBLING:        Prefer to place tasks in a sibling domain
>
> The way i can turn on, say, SD_WAKE_IDLE for the NODE domain is to:
>
> � tune-sched-domains 239 4735 $((3183+16))
>
> ( This is a pretty stone-age script i admit ;-)
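
For the flag arithmetic itself, shell bit operations are a convenient way to
build the per-domain values from the listing above; using | and & ~ instead
of plain addition avoids double-adding a bit that is already set (a sketch,
assuming the same three-domain layout):

  # NODE domain: set SD_WAKE_IDLE (16), keep the SIBLING and MC values as-is
  tune-sched-domains 239 4735 $((3183 | 16))
  # or clear SD_SERIALIZE (1024) in the NODE domain instead
  tune-sched-domains 239 4735 $((3183 & ~1024))
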
>
> Thanks for all your testing so far,
>
> � � � �Ingo
>
From: Bret Towe on
On Thu, Sep 10, 2009 at 9:05 AM, Peter Zijlstra <a.p.zijlstra(a)chello.nl> wrote:
> On Thu, 2009-09-10 at 09:02 -0700, Bret Towe wrote:
>>
>> thanks to this thread and others I've seen several kernel tunables
>> that can affect how the scheduler performs/acts
>> but what I don't see after a bit of looking is where all these are
>> documented
>> perhaps that's also part of the reason there are unhappy people with
>> the current code in the kernel just because they don't know how
>> to tune it for their workload
>
> The thing is, ideally they should not need to poke at these. These knobs
> are under CONFIG_SCHED_DEBUG, and that is exactly what they are for.

even then I would think they should be documented, so people can find out
which setting is hurting their workload and can better report the bug, no?

From: Ingo Molnar on

* Bret Towe <magnade(a)gmail.com> wrote:

> On Thu, Sep 10, 2009 at 9:05 AM, Peter Zijlstra <a.p.zijlstra(a)chello.nl> wrote:
> > On Thu, 2009-09-10 at 09:02 -0700, Bret Towe wrote:
> >>
> >> thanks to this thread and others I've seen several kernel tunables
> >> that can affect how the scheduler performs/acts
> >> but what I don't see after a bit of looking is where all these are
> >> documented
> >> perhaps that's also part of the reason there are unhappy people with
> >> the current code in the kernel just because they don't know how
> >> to tune it for their workload
> >
> > The thing is, ideally they should not need to poke at these.
> > These knobs are under CONFIG_SCHED_DEBUG, and that is exactly
> > what they are for.
>
> even then I would think they should be documented so people can
> find out what item is hurting their workload so they can better
> report the bug no?

Would be happy to apply such documentation patches. You could also
help start adding a 'scheduler performance' wiki portion to
perf.wiki.kernel.org, if you have time for that.

Ingo