From: Frans Pop
Ingo Molnar wrote:
> So the testbox i picked fits into the upper portion of what i
> consider a sane range of systems to tune for - and should still fit
> into BFS's design bracket as well according to your description:
> it's a dual quad core system with hyperthreading.

Ingo,

Nice that you've looked into this.

Would it be possible for you to run the same tests on e.g. a dual-core
and/or a UP system (or maybe just offline some CPUs)? It would be very
interesting to see whether BFS does better in the lower portion of the
range, or whether the differences you show between the two schedulers
are consistent across the range.
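
(Offlining at runtime should just be a matter of the sysfs hotplug
interface; a sketch, assuming CONFIG_HOTPLUG_CPU is enabled:

  echo 0 > /sys/devices/system/cpu/cpu7/online   # take cpu7 offline
  echo 1 > /sys/devices/system/cpu/cpu7/online   # bring it back

Note that cpu0 typically can't be offlined on x86.)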

Cheers,
FJP
From: Nikos Chantziaras
On 09/06/2009 11:59 PM, Ingo Molnar wrote:
>[...]
> Also, i'd like to outline that i agree with the general goals
> described by you in the BFS announcement - small desktop systems
> matter more than large systems. We find it critically important
> that the mainline Linux scheduler performs well on those systems
> too - and if you (or anyone else) can reproduce suboptimal behavior
> please let the scheduler folks know so that we can fix/improve it.

BFS improved the behavior of many applications on my Intel Core 2 box
in ways that can't be benchmarked. Examples:

mplayer using the OpenGL renderer no longer drops frames when the video
window is dragged around in an OpenGL-composited desktop (KDE 4.3.1).
(Start moving the mplayer window around, then drop it. At the moment the
move starts and at the moment you drop the window back onto the desktop,
there's a big frame skip, as if mplayer were frozen for a bit; around
200 or 300ms.)

Composited desktop effects like zoom and fade-out don't stall for
sub-second periods while there's CPU load in the background. In other
words, the desktop is more fluid and less skippy, even during heavy CPU
load. Moving windows around with CPU load in the background doesn't
result in short skips.

LMMS (a tool utilizing real-time sound synthesis) no longer produces
"pops", "crackles" and dropouts in the sound during real-time playback
due to buffer under-runs. Those problems amplify when there's heavy CPU
load in the background; with BFS, heavy load doesn't produce those
artifacts (though LMMS makes itself run SCHED_ISO under BFS). Also,
hitting a key on the keyboard needs less time for the note to become
audible when using BFS. The same should hold true for other tools that
traditionally benefit from the "-rt" kernel sources.
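
For the curious, a minimal sketch of how a program can request
SCHED_ISO under BFS (mainline headers don't define SCHED_ISO, so the
policy number below is an assumption taken from the BFS patch; on a
kernel without BFS the call simply fails):

#include <sched.h>
#include <stdio.h>

/* SCHED_ISO is BFS-only; mainline doesn't define it. The value 4
 * matches the BFS patch, but treat it as an assumption. */
#ifndef SCHED_ISO
#define SCHED_ISO 4
#endif

int main(void)
{
	struct sched_param sp = { .sched_priority = 0 };

	if (sched_setscheduler(0, SCHED_ISO, &sp) < 0)
		perror("sched_setscheduler");
	return 0;
}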

Games like Doom 3 and such don't "freeze" periodically for small
amounts of time (again, sub-second amounts) when something in the
background grabs CPU time (be it my mailer checking for new mail, a
cron job, or whatever).

And, the most drastic improvement here: with BFS I can do a "make -j2"
in the kernel tree and the GUI stays fluid. Without BFS, things start
to lag, even with in-RAM builds (like having the whole kernel tree
inside a tmpfs) and gcc running with nice 19 and ionice -c 3.
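
(Concretely, the deprioritized build amounts to:

  nice -n 19 ionice -c 3 make -j2

i.e. lowest CPU priority plus the "idle" IO scheduling class.)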

Unfortunately, I can't come up with any way to benchmark all of this.
There's no benchmark for "fluidity" and "responsiveness". Running the
Doom 3 benchmark, or any other benchmark, doesn't say anything about
responsiveness; it only measures how many frames were calculated in a
specific period of time. How "stable" (free of stalls) the delivery of
those frames to the screen was is not measured.

If BFS implied small drops in pure performance, counted in instructions
per second, that would be a totally acceptable regression for
desktop/multimedia/gaming PCs. Not for server machines, of course.
However, on my machine, BFS is faster even in classic workloads. When I
run "make -j2" with BFS and with the standard scheduler, BFS always
finishes a bit faster. Not by much, but still. One thing I'm noticing
here is that BFS produces 100% CPU load on each core with "make -j2",
while the normal scheduler stays at about 90-95% on at least one of the
cores with -j2 or higher. There seems to be under-utilization of CPU
time.
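
(I'm reading the per-core load off a sampling tool; if anyone wants to
check this on their own box, something like "mpstat -P ALL 1" from the
sysstat package prints per-CPU utilization once a second.)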

Also, from searching around the net and from discussions on various
mailing lists, there seems to be a trend: for some reason the problems
seem to occur more often with Intel CPUs (Core 2 chips and lower; I
can't say anything about Core i7), while people on AMD CPUs are mostly
not affected by most or even all of the above. (And because of this,
flame wars often break out, with one party accusing the other of
imagining things.) Could the integrated memory controller on AMD chips
have something to do with this? Do AMD chips generally offer better
"multithreading" behavior? Unfortunately, you didn't mention what CPU
you ran your tests on. If it was AMD, it might be a good idea to run
tests on Pentium and Core 2 CPUs.

For reference, my system is:

CPU: Intel Core 2 Duo E6600 (2.4GHz)
Mainboard: Asus P5E (Intel X38 chipset)
RAM: 6GB (2+2+1+1) dual channel DDR2 800
GPU: RV770 (Radeon HD4870).

From: Jens Axboe
On Sun, Sep 06 2009, Ingo Molnar wrote:
> So ... to get to the numbers - i've tested both BFS and the tip of
> the latest upstream scheduler tree on a testbox of mine. I
> intentionally didnt test BFS on any really large box - because you
> described its upper limit like this in the announcement:

I ran a simple test as well, since I was curious to see how it performed
wrt interactivity. One of my pet peeves with the current scheduler is
that I have to nice compile jobs, or my X experience is just awful while
the compile is running.

Now, this test case is something that attempts to see what
interactivity would be like. It'll run a given command line while at
the same time logging delays. The delays are measured as follows:

- The app creates a pipe, and forks a child that blocks on reading from
  that pipe.
- The app sleeps for a random period of time, anywhere between 100ms
  and 2s. When it wakes up, it gets the current time and writes that to
  the pipe.
- The child then gets woken, checks the time on its own, and logs the
  difference between the two.

The idea here being that the delay between writing to the pipe and the
child reading the data and comparing timestamps should (in some way) be
indicative of how responsive the system would seem to a user; see the
sketch below.
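
To give an idea, here's a minimal sketch of that measurement loop (for
illustration only: it's not the actual app, it omits spawning the
workload command line, and it assumes CLOCK_MONOTONIC timestamps with
delays logged in milliseconds):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

/* Convert a timespec to milliseconds as a double. */
static double ts_ms(const struct timespec *ts)
{
	return ts->tv_sec * 1000.0 + ts->tv_nsec / 1e6;
}

int main(void)
{
	int fd[2], i;
	struct timespec sent, got;

	if (pipe(fd) < 0) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {
		/* Child: block on the pipe, log write->wakeup delay. */
		close(fd[1]);
		while (read(fd[0], &sent, sizeof(sent)) == sizeof(sent)) {
			clock_gettime(CLOCK_MONOTONIC, &got);
			printf("delay: %.3f ms\n",
			       ts_ms(&got) - ts_ms(&sent));
		}
		return 0;
	}

	/* Parent: sleep a random 100ms-2s, then write the current time
	 * into the pipe right after waking up. */
	close(fd[0]);
	srand(getpid());
	for (i = 0; i < 100; i++) {
		long us = 100000 + rand() % 1900000;	/* 100ms .. 2s */
		struct timespec nap = {
			.tv_sec  = us / 1000000,
			.tv_nsec = (us % 1000000) * 1000
		};
		nanosleep(&nap, NULL);
		clock_gettime(CLOCK_MONOTONIC, &sent);
		if (write(fd[1], &sent, sizeof(sent)) < 0)
			break;
	}
	close(fd[1]);	/* EOF lets the child exit */
	wait(NULL);
	return 0;
}

(Link with -lrt on older glibc for clock_gettime.)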

The test app was quickly hacked up, so don't put too much into it. The
test run is a simple kernel compile, using -jX where X is the number of
threads in the system. The files are cache hot, so little IO is done.
The -x2 run uses twice as many processes as there are threads, e.g.
-j128 on a 64-thread box.
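
In other words, the runs amount to roughly:

  make -j$(getconf _NPROCESSORS_ONLN)             # the -jX run
  make -j$((2 * $(getconf _NPROCESSORS_ONLN)))    # the -x2 run

with the tree built once beforehand so everything is cache hot.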

And I have to apologize for using a large system to test this on, I
realize it's out of the scope of BFS, but it's just easier to fire one
of these beasts up than it is to sacrifice my notebook or desktop
machine... So it's a 64-thread box. The CFS -jX runtime is the baseline
at 100; a lower number means faster, and vice versa. The latency numbers
are in msecs.


Scheduler       Runtime    Max lat    Avg lat    Std dev
--------------------------------------------------------
CFS                 100        951        462        267
CFS-x2              100        983        484        308
BFS
BFS-x2

And unfortunately this is where it ends for now, since BFS doesn't boot
on the two boxes I tried. It hard-hangs right after disk detection. But
the latency numbers look pretty appalling for CFS, so it's a bit of a
shame that I didn't get to compare. I'll try again later with a newer
revision, when available.

--
Jens Axboe

From: Nikos Chantziaras
On 09/07/2009 12:49 PM, Jens Axboe wrote:
> [...]
> And I have to apologize for using a large system to test this on, I
> realize it's out of the scope of BFS, but it's just easier to fire one
> of these beasts up than it is to sacrifice my notebook or desktop
> machine...

How does a kernel rebuild constitute "sacrifice"?


> So it's a 64-thread box. The CFS -jX runtime is the baseline
> at 100; a lower number means faster, and vice versa. The latency
> numbers are in msecs.
>
>
> Scheduler       Runtime    Max lat    Avg lat    Std dev
> --------------------------------------------------------
> CFS                 100        951        462        267
> CFS-x2              100        983        484        308
> BFS
> BFS-x2
>
> And unfortunately this is where it ends for now, since BFS doesn't boot
> on the two boxes I tried.

Then why post this in the first place?

From: Jens Axboe
On Mon, Sep 07 2009, Nikos Chantziaras wrote:
> On 09/07/2009 12:49 PM, Jens Axboe wrote:
>> [...]
>> And I have to apologize for using a large system to test this on, I
>> realize it's out of the scope of BFS, but it's just easier to fire one
>> of these beasts up than it is to sacrifice my notebook or desktop
>> machine...
>
> How does a kernel rebuild constitute "sacrifice"?

It's more of a bother since I have to physically be at the notebook,
whereas the server-type boxes usually have remote management. The
workstation is the machine I currently use, so it'd be very disruptive
to do it there. And as things are apparently very alpha on the BFS side
currently, it's easier to 'sacrifice' an idle test box. That's the
keyword: 'test' boxes. You know, machines used for testing. Not
production machines.

Plus the notebook is using btrfs, whose on-disk format isn't compatible
with 2.6.30.

Is there a point to this question?

>> So it's a 64-thread box. The CFS -jX runtime is the baseline
>> at 100; a lower number means faster, and vice versa. The latency
>> numbers are in msecs.
>>
>>
>> Scheduler       Runtime    Max lat    Avg lat    Std dev
>> --------------------------------------------------------
>> CFS                 100        951        462        267
>> CFS-x2              100        983        484        308
>> BFS
>> BFS-x2
>>
>> And unfortunately this is where it ends for now, since BFS doesn't boot
>> on the two boxes I tried.
>
> Then why post this in the first place?

You snipped the relevant part of the conclusion, the part where I make
a comment on the CFS latencies.

Don't bother replying to any of my emails if YOU continue writing emails
in this fashion. I have MUCH better things to do than entertain kiddies.
If you do get your act together and want to reply, follow lkml etiquette
and group reply.

--
Jens Axboe
