From: Benjamin Herrenschmidt on
On Wed, 2009-09-09 at 20:44 +0930, David Newall wrote:
> Benjamin Herrenschmidt wrote:
> > On Tue, 2009-09-08 at 22:22 +0200, Frans Pop wrote:
> >
> >> Arjan van de Ven wrote:
> >>
> >>> the latest version of latencytop also has a GUI (thanks to Ben)
> >>>
> >> That looks nice, but...
> >>
> >> I kind of miss the split screen feature where latencytop would show both
> >> the overall figures + the ones for the currently most affected task.
> >> Downside of that last was that I never managed to keep the display on a
> >> specific task.
> >>
> >
> > Any idea of how to present it? I'm happy to spend five minutes
> > improving the GUI :-)
>
> Use a second window.

I'm not a big fan of cluttering the screen with windows... I suppose I
could have a separate pane for the "global" view, but I haven't found a
way to lay it out that doesn't suck :-) I could have added a third
column on the right with the overall view, but it felt like using too
much screen real estate.

I'll experiment a bit; maybe two windows are indeed the solution. But then
you get into the problem of what to do if only one of them is closed. Do I
add a menu bar on each of them to re-open the "other" one if it's closed?
etc...

Don't get me wrong, I have a shitload of experience doing GUIs (back in
the old days when I was hacking on MacOS), though I'm relatively new to
GTK. But GUI design is rather hard in general :-)

Ben.


From: Frans Pop on
On Wednesday 09 September 2009, Benjamin Herrenschmidt wrote:
> On Tue, 2009-09-08 at 22:22 +0200, Frans Pop wrote:
> > Arjan van de Ven wrote:
> > > the latest version of latencytop also has a GUI (thanks to Ben)
> >
> > That looks nice, but...
> >
> > I kind of miss the split screen feature where latencytop would show
> > both the overall figures + the ones for the currently most affected
> > task. Downside of that last was that I never managed to keep the
> > display on a specific task.
>
> Any idea of how to present it? I'm happy to spend five minutes improving
> the GUI :-)

I'd say add an extra horizontal split in the second column, so you'd get
three areas in the right column:
- top for the global target (permanently)
- middle for current, either:
  - "current most lagging" if "Global" is selected in left column
  - selected process if a specific target is selected in left column
- bottom for backtrace

Maybe with that setup "Global" in the left column should be renamed to
something like "Dynamic".

The backtrace area would show the selection from either the top or middle
area (so selecting a cause in one of them should unselect causes in the
other).
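
Not sure how hard that would be to wire up, but here is a minimal GTK+ 2.x
sketch of the arrangement, just to make the layout concrete (the placeholder
labels and widget names are mine, not latencytop's actual GUI code):

#include <gtk/gtk.h>

int main(int argc, char **argv)
{
	GtkWidget *win, *hpaned, *right, *lower;
	GtkWidget *targets, *global, *current, *backtrace;

	gtk_init(&argc, &argv);

	win = gtk_window_new(GTK_WINDOW_TOPLEVEL);
	gtk_window_set_title(GTK_WINDOW(win), "latencytop layout sketch");
	g_signal_connect(win, "destroy", G_CALLBACK(gtk_main_quit), NULL);

	/* Left column: the target list ("Dynamic" plus specific processes). */
	targets = gtk_label_new("target list");

	/* Right column: three stacked areas built from two vertical panes. */
	global    = gtk_label_new("global figures (permanent)");
	current   = gtk_label_new("current: most lagging / selected process");
	backtrace = gtk_label_new("backtrace of the selected cause");

	lower = gtk_vpaned_new();
	gtk_paned_pack1(GTK_PANED(lower), current, TRUE, TRUE);
	gtk_paned_pack2(GTK_PANED(lower), backtrace, TRUE, TRUE);

	right = gtk_vpaned_new();
	gtk_paned_pack1(GTK_PANED(right), global, TRUE, TRUE);
	gtk_paned_pack2(GTK_PANED(right), lower, TRUE, TRUE);

	hpaned = gtk_hpaned_new();
	gtk_paned_pack1(GTK_PANED(hpaned), targets, FALSE, TRUE);
	gtk_paned_pack2(GTK_PANED(hpaned), right, TRUE, TRUE);

	gtk_container_add(GTK_CONTAINER(win), hpaned);
	gtk_widget_show_all(win);
	gtk_main();
	return 0;
}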

Cheers,
FJP
From: Jens Axboe on
On Wed, Sep 09 2009, Jens Axboe wrote:
> On Wed, Sep 09 2009, Mike Galbraith wrote:
> > On Wed, 2009-09-09 at 08:13 +0200, Ingo Molnar wrote:
> > > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > >
> > > > On Tue, Sep 08 2009, Peter Zijlstra wrote:
> > > > > On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> > > > > > And here's a newer version.
> > > > >
> > > > > I tinkered a bit with your proglet and finally found the
> > > > > problem.
> > > > >
> > > > > You used a single pipe per child, which means the loop in
> > > > > run_child() would consume what it just wrote out until it got
> > > > > force-preempted by the parent, which would also get woken.
> > > > >
> > > > > This results in the child spinning a while (its full quota) and
> > > > > only reporting the last timestamp to the parent.
> > > >
> > > > Oh doh, that's not well thought out. Well it was a quick hack :-)
> > > > Thanks for the fixup, now it's at least usable to some degree.
> > >
> > > What kind of latencies does it report on your box?
> > >
> > > Our vanilla scheduler default latency targets are:
> > >
> > > single-core: 20 msecs
> > > dual-core: 40 msecs
> > > quad-core: 60 msecs
> > > octo-core: 80 msecs
> > >
> > > You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via
> > > /proc/sys/kernel/sched_latency_ns:
> > >
> > > echo 10000000 > /proc/sys/kernel/sched_latency_ns
> >
> > He would also need to lower min_granularity, otherwise it'd be larger
> > than the whole latency target.
> >
> > I'm testing right now, and one thing that is definitely a problem is the
> > amount of sleeper fairness we're giving. A full latency is just too
> > much short-term fairness in my testing. While sleepers are catching up,
> > hogs languish. That's the biggest issue going on.
> >
> > I've also been doing some timings of make -j4 (looking at idle time),
> > and find that child_runs_first is mildly detrimental to fork/exec load,
> > as are buddies.
> >
> > I'm running with the below at the moment. (the kthread/workqueue thing
> > is just because I don't see any reason for it to exist, so consider it
> > to be a waste of perfectly good math;)
>
> Using latt, it seems better than -rc9. Below are entries logged while
> running make -j128 on a 64-thread box. I did two runs on each kernel, and
> latt is using 8 clients.
>
> -rc9
> Max 23772 usec
> Avg 1129 usec
> Stdev 4328 usec
> Stdev mean 117 usec
>
> Max 32709 usec
> Avg 1467 usec
> Stdev 5095 usec
> Stdev mean 136 usec
>
> -rc9 + patch
>
> Max 11561 usec
> Avg 1532 usec
> Stdev 1994 usec
> Stdev mean 48 usec
>
> Max 9590 usec
> Avg 1550 usec
> Stdev 2051 usec
> Stdev mean 50 usec
>
> max latency is way down, and much smaller variation as well.

Things are much better with this patch on the notebook! I cannot compare
with BFS as that still doesn't run anywhere I want it to run, but it's
way better than -rc9-git stock. latt numbers on the notebook have 1/3
the max latency, average is lower, and stddev is much smaller too.
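
As an aside, lowering both of the tunables Ingo and Mike mention above could
look something like the below; the sched_min_granularity_ns path and the
10ms/2ms values are just my assumptions for illustration, and it obviously
needs CONFIG_SCHED_DEBUG=y and root.

#include <stdio.h>
#include <stdlib.h>

static void write_tunable(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	/* 10 msec latency target, same value as the echo above... */
	write_tunable("/proc/sys/kernel/sched_latency_ns", "10000000");
	/* ...and a min granularity below it, so it doesn't end up larger
	 * than the whole latency target. */
	write_tunable("/proc/sys/kernel/sched_min_granularity_ns", "2000000");
	return 0;
}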

--
Jens Axboe

From: Nikos Chantziaras on
On 09/08/2009 06:23 PM, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
>> And here's a newer version.
>
> I tinkered a bit with your proglet and finally found the problem.
>
> You used a single pipe per child, which means the loop in run_child()
> would consume what it just wrote out until it got force-preempted by the
> parent, which would also get woken.
>
> This results in the child spinning a while (its full quota) and only
> reporting the last timestamp to the parent.
>
> Since consumer (parent) is a single thread the program basically
> measures the worst delay in a thundering herd wakeup of N children.
>
> The below version yields:
>
> idle
>
> [root@opteron sched]# ./latt -c8 sleep 30
> Entries: 664 (clients=8)
>
> Averages:
> ------------------------------
> Max 128 usec
> Avg 26 usec
> Stdev 16 usec
>
>
> make -j4
>
> [root@opteron sched]# ./latt -c8 sleep 30
> Entries: 648 (clients=8)
>
> Averages:
> ------------------------------
> Max 20861 usec
> Avg 3763 usec
> Stdev 4637 usec
>
>
> Mike's patch, make -j4
>
> [root@opteron sched]# ./latt -c8 sleep 30
> Entries: 648 (clients=8)
>
> Averages:
> ------------------------------
> Max 17854 usec
> Avg 6298 usec
> Stdev 4735 usec
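
For reference, here's a minimal sketch of the fix Peter describes above: one
pipe per direction per child, so run_child() can never read back its own
writes. This is only my illustration, not latt's actual code; the struct and
helper names are made up.

#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

struct child_pipes {
	int to_child[2];	/* parent writes a wakeup token, child reads */
	int to_parent[2];	/* child writes its timestamp, parent reads */
};

static void run_child(struct child_pipes *cp)
{
	struct timeval tv;
	char token;

	close(cp->to_child[1]);
	close(cp->to_parent[0]);

	/* Block until the parent wakes us, then report when we actually
	 * ran. With a single bidirectional pipe this loop could consume
	 * its own replies and spin for its full quota, reporting only the
	 * last timestamp. */
	while (read(cp->to_child[0], &token, 1) == 1) {
		gettimeofday(&tv, NULL);
		if (write(cp->to_parent[1], &tv, sizeof(tv)) != sizeof(tv))
			break;
	}
	exit(0);
}

int main(void)
{
	struct child_pipes cp;
	struct timeval tv;
	char token = 'w';

	if (pipe(cp.to_child) || pipe(cp.to_parent))
		return 1;
	if (fork() == 0)
		run_child(&cp);

	close(cp.to_child[0]);
	close(cp.to_parent[1]);

	/* Wake the child once and read back the time it actually ran; a
	 * real measurement would do this in a loop and compute the stats. */
	if (write(cp.to_child[1], &token, 1) != 1)
		return 1;
	if (read(cp.to_parent[0], &tv, sizeof(tv)) != sizeof(tv))
		return 1;
	close(cp.to_child[1]);
	return 0;
}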

I've run two tests with this tool: one with mainline (2.6.31-rc9) and
one with a kernel patched with 2.6.31-rc9-sched-bfs-210.patch.

Before running this test, I disabled the cron daemon in order not to
have something pop up in the background all of a sudden.

The test consisted of starting a "make -j2" in the kernel tree inside a
3GB tmpfs mountpoint and then running 'latt "mplayer -vo gl2 -framedrop
videofile.mkv"' (mplayer in this case is a single-threaded
application). Caches were warmed up first; the results below are from
the second run of each test.

The kernel .config file used by the running kernels and also for "make
-j2" is:

http://foss.math.aegean.gr/~realnc/kernel/config-2.6.31-rc9-latt-test

The video file used for mplayer is:

http://foss.math.aegean.gr/~realnc/vids/3DMark2000.mkv (100MB)
(The reason this was used is that it's a 60FPS video,
therefore very smooth and makes all skips stand out
clearly.)


Results for mainline:

Averages:
------------------------------
Max 29930 usec
Avg 11043 usec
Stdev 5752 usec


Results for BFS:

Averages:
------------------------------
Max 14017 usec
Avg 49 usec
Stdev 697 usec


One thing that's worth noting is that with mainline, mplayer would
occasionally spit this out:

YOUR SYSTEM IS TOO SLOW TO PLAY THIS

which doesn't happen with BFS.
From: Jens Axboe on
On Wed, Sep 09 2009, Jens Axboe wrote:
> On Wed, Sep 09 2009, Jens Axboe wrote:
> > On Wed, Sep 09 2009, Mike Galbraith wrote:
> > > On Wed, 2009-09-09 at 08:13 +0200, Ingo Molnar wrote:
> > > > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > > >
> > > > > On Tue, Sep 08 2009, Peter Zijlstra wrote:
> > > > > > On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> > > > > > > And here's a newer version.
> > > > > >
> > > > > > I tinkered a bit with your proglet and finally found the
> > > > > > problem.
> > > > > >
> > > > > > You used a single pipe per child, which means the loop in
> > > > > > run_child() would consume what it just wrote out until it got
> > > > > > force-preempted by the parent, which would also get woken.
> > > > > >
> > > > > > This results in the child spinning a while (its full quota) and
> > > > > > only reporting the last timestamp to the parent.
> > > > >
> > > > > Oh doh, that's not well thought out. Well it was a quick hack :-)
> > > > > Thanks for the fixup, now it's at least usable to some degree.
> > > >
> > > > What kind of latencies does it report on your box?
> > > >
> > > > Our vanilla scheduler default latency targets are:
> > > >
> > > > single-core: 20 msecs
> > > > dual-core: 40 msecs
> > > > quad-core: 60 msecs
> > > > octo-core: 80 msecs
> > > >
> > > > You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via
> > > > /proc/sys/kernel/sched_latency_ns:
> > > >
> > > > echo 10000000 > /proc/sys/kernel/sched_latency_ns
> > >
> > > He would also need to lower min_granularity, otherwise it'd be larger
> > > than the whole latency target.
> > >
> > > I'm testing right now, and one thing that is definitely a problem is the
> > > amount of sleeper fairness we're giving. A full latency is just too
> > > much short-term fairness in my testing. While sleepers are catching up,
> > > hogs languish. That's the biggest issue going on.
> > >
> > > I've also been doing some timings of make -j4 (looking at idle time),
> > > and find that child_runs_first is mildly detrimental to fork/exec load,
> > > as are buddies.
> > >
> > > I'm running with the below at the moment. (the kthread/workqueue thing
> > > is just because I don't see any reason for it to exist, so consider it
> > > to be a waste of perfectly good math;)
> >
> > Using latt, it seems better than -rc9. Below are entries logged while
> > running make -j128 on a 64-thread box. I did two runs on each kernel, and
> > latt is using 8 clients.
> >
> > -rc9
> > Max 23772 usec
> > Avg 1129 usec
> > Stdev 4328 usec
> > Stdev mean 117 usec
> >
> > Max 32709 usec
> > Avg 1467 usec
> > Stdev 5095 usec
> > Stdev mean 136 usec
> >
> > -rc9 + patch
> >
> > Max 11561 usec
> > Avg 1532 usec
> > Stdev 1994 usec
> > Stdev mean 48 usec
> >
> > Max 9590 usec
> > Avg 1550 usec
> > Stdev 2051 usec
> > Stdev mean 50 usec
> >
> > max latency is way down, and much smaller variation as well.
>
> Things are much better with this patch on the notebook! I cannot compare
> with BFS as that still doesn't run anywhere I want it to run, but it's
> way better than -rc9-git stock. latt numbers on the notebook have 1/3
> the max latency, average is lower, and stddev is much smaller too.

BFS210 runs on the laptop (dual-core Intel Core Duo). With make -j4
running, I clock the following latt -c8 'sleep 10' latencies:

-rc9

Max 17895 usec
Avg 8028 usec
Stdev 5948 usec
Stdev mean 405 usec

Max 17896 usec
Avg 4951 usec
Stdev 6278 usec
Stdev mean 427 usec

Max 17885 usec
Avg 5526 usec
Stdev 6819 usec
Stdev mean 464 usec

-rc9 + mike

Max 6061 usec
Avg 3797 usec
Stdev 1726 usec
Stdev mean 117 usec

Max 5122 usec
Avg 3958 usec
Stdev 1697 usec
Stdev mean 115 usec

Max 6691 usec
Avg 2130 usec
Stdev 2165 usec
Stdev mean 147 usec

-rc9 + bfs210

Max 92 usec
Avg 27 usec
Stdev 19 usec
Stdev mean 1 usec

Max 80 usec
Avg 23 usec
Stdev 15 usec
Stdev mean 1 usec

Max 97 usec
Avg 27 usec
Stdev 21 usec
Stdev mean 1 usec

One thing I also noticed is that after I log in, I run xmodmap
manually to load some keymappings (I always tell myself to add this to
the login scripts, but I suspend/resume this laptop for weeks at a
time and forget before the next boot). With the stock kernel, xmodmap
will halt X updates and take forever to run. With BFS, it returns
instantly, as I would expect.

So the BFS design may be lacking on the scalability end (which is
obviously true if you look at the code), but I can understand the
appeal of the scheduler for "normal" desktop people.

--
Jens Axboe
