From: Jens Axboe
On Mon, Sep 07 2009, Peter Zijlstra wrote:
> On Mon, 2009-09-07 at 22:46 +0200, Jens Axboe wrote:
> > > a bug in the SMP load-balancer that can cause interactivity problems
> > > on large CPU count systems.
> >
> > Worth trying on the dual core box?
>
> I debugged the issue on a dual core :-)
>
> It should be more pronounced on larger machines, but it's present on
> dual-core too.

Alright, I'll upgrade that box to -tip tomorrow and see if it makes
a noticeable difference. At -j4 or higher, I can literally see windows
slowly popping up when switching to a different virtual desktop.

--
Jens Axboe

From: Peter Zijlstra
On Mon, 2009-09-07 at 22:46 +0200, Jens Axboe wrote:
> > a bug in the SMP load-balancer that can cause interactivity problems
> > on large CPU count systems.
>
> Worth trying on the dual core box?

I debugged the issue on a dual core :-)

It should be more pronounced on larger machines, but it's present on
dual-core too.

From: Ingo Molnar

* Jens Axboe <jens.axboe(a)oracle.com> wrote:

> On Mon, Sep 07 2009, Peter Zijlstra wrote:
> > On Mon, 2009-09-07 at 22:46 +0200, Jens Axboe wrote:
> > > > a bug in the SMP load-balancer that can cause interactivity problems
> > > > on large CPU count systems.
> > >
> > > Worth trying on the dual core box?
> >
> > I debugged the issue on a dual core :-)
> >
> > It should be more pronounced on larger machines, but it's present on
> > dual-core too.
>
> Alright, I'll upgrade that box to -tip tomorrow and see if it
> makes a noticeable difference. At -j4 or higher, I can literally
> see windows slowly popping up when switching to a different
> virtual desktop.

btw., if you run -tip and have these enabled:

CONFIG_PERF_COUNTER=y
CONFIG_EVENT_TRACING=y

cd tools/perf/
make -j install

... then you can use a couple of new perfcounters features to
measure scheduler latencies. For example:

perf stat -e sched:sched_stat_wait -e task-clock ./hackbench 20

This will tell you how many times the workload got delayed by waiting
for CPU time.

You can repeat the workload as well and see the statistical
properties of those metrics:

aldebaran:/home/mingo> perf stat --repeat 10 -e \
sched:sched_stat_wait:r -e task-clock ./hackbench 20
Time: 0.251
Time: 0.214
Time: 0.254
Time: 0.278
Time: 0.245
Time: 0.308
Time: 0.242
Time: 0.222
Time: 0.268
Time: 0.244

 Performance counter stats for './hackbench 20' (10 runs):

          59826  sched:sched_stat_wait    #      0.026 M/sec   ( +-   5.540% )
    2280.099643  task-clock-msecs         #      7.525 CPUs    ( +-   1.620% )

    0.303013390  seconds time elapsed   ( +-   3.189% )

To get scheduling events, do:

# perf list 2>&1 | grep sched:
sched:sched_kthread_stop [Tracepoint event]
sched:sched_kthread_stop_ret [Tracepoint event]
sched:sched_wait_task [Tracepoint event]
sched:sched_wakeup [Tracepoint event]
sched:sched_wakeup_new [Tracepoint event]
sched:sched_switch [Tracepoint event]
sched:sched_migrate_task [Tracepoint event]
sched:sched_process_free [Tracepoint event]
sched:sched_process_exit [Tracepoint event]
sched:sched_process_wait [Tracepoint event]
sched:sched_process_fork [Tracepoint event]
sched:sched_signal_send [Tracepoint event]
sched:sched_stat_wait [Tracepoint event]
sched:sched_stat_sleep [Tracepoint event]
sched:sched_stat_iowait [Tracepoint event]

stat_wait/sleep/iowait would be the interesting ones for latency
analysis.
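
For instance, all three could be counted in one run with the same
syntax as above (untested sketch, still using hackbench 20 as the
example workload):

perf stat -e sched:sched_stat_wait -e sched:sched_stat_sleep \
 -e sched:sched_stat_iowait -e task-clock ./hackbench 20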

Or, if you want to see all the individual delays and their
min/max/avg, you can do:

perf record -e sched:sched_stat_wait:r -f -R -c 1 ./hackbench 20
perf trace
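
If you'd rather look at the whole box instead of just one workload,
system-wide recording should work too (untested sketch; -a records on
all CPUs, and sleep 10 just sets the measurement window):

perf record -a -e sched:sched_stat_wait:r -f -R -c 1 sleep 10
perf trace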

Ingo

From: Thomas Fjellstrom
On Sun September 6 2009, Nikos Chantziaras wrote:
> On 09/06/2009 11:59 PM, Ingo Molnar wrote:
> >[...]
> > Also, i'd like to outline that i agree with the general goals
> > described by you in the BFS announcement - small desktop systems
> > matter more than large systems. We find it critically important
> > that the mainline Linux scheduler performs well on those systems
> > too - and if you (or anyone else) can reproduce suboptimal behavior
> > please let the scheduler folks know so that we can fix/improve it.
>
> BFS improved behavior of many applications on my Intel Core 2 box in a
> way that can't be benchmarked. Examples:
>
> mplayer using OpenGL renderer doesn't drop frames anymore when dragging
> and dropping the video window around in an OpenGL composited desktop
> (KDE 4.3.1). (Start moving the mplayer window around; then drop it. At
> the moment the move starts and at the moment you drop the window back to
> the desktop, there's a big frame skip as if mplayer was frozen for a
> bit; around 200 or 300ms.)
>
> Composite desktop effects like zoom and fade out don't stall for
> sub-second periods of time while there's CPU load in the background. In
> other words, the desktop is more fluid and less skippy even during heavy
> CPU load. Moving windows around with CPU load in the background doesn't
> result in short skips.
>
> LMMS (a tool utilizing real-time sound synthesis) does not produce
> "pops", "crackles" and drops in the sound during real-time playback due
> to buffer under-runs. Those problems get worse when there's heavy CPU
> load in the background, while with BFS heavy load doesn't produce those
> artifacts (though LMMS makes itself run SCHED_ISO with BFS). Also,
> hitting a key on the keyboard needs less time for the note to become
> audible when using BFS. The same should hold true for other tools that
> traditionally benefit from the "-rt" kernel sources.
>
> Games like Doom 3 and such don't "freeze" periodically for small amounts
> of time (again for sub-second amounts) when something in the background
> grabs CPU time (be it my mailer checking for new mail or a cron job, or
> whatever.)
>
> And, the most drastic improvement here, with BFS I can do a "make -j2"
> in the kernel tree and the GUI stays fluid. Without BFS, things start
> to lag, even with in-RAM builds (like having the whole kernel tree
> inside a tmpfs) and gcc running with nice 19 and ionice -c 3.
>
> Unfortunately, I can't come up with any way to benchmark all of
> this. There's no benchmark for "fluidity" and "responsiveness".
> Running the Doom 3 benchmark, or any other benchmark, doesn't say
> anything about responsiveness; it only measures how many frames were
> calculated in a specific period of time. How steadily (i.e. without
> stalls) those frames made it to the screen is not measurable.
>
> If BFS implied small drops in raw performance, counted in
> instructions per second, that would be a totally acceptable regression
> for desktop/multimedia/gaming PCs. Not for server machines, of course.
> However, on my machine, BFS is faster in classic workloads. When I
> run "make -j2" with BFS and with the standard scheduler, BFS always
> finishes a bit faster. Not by much, but still. One thing I'm noticing
> here is that BFS produces 100% CPU load on each core with "make -j2",
> while the normal scheduler stays at about 90-95% on at least one of
> the cores with -j2 or higher. There seems to be under-utilization of
> CPU time.
>
> Also, from searching around the net and from discussions on various
> mailing lists, there seems to be a trend: the problems for some reason
> seem to occur more often with Intel CPUs (Core 2 chips and lower; I
> can't say anything about Core i7), while people on AMD CPUs are mostly
> not affected by most or even all of the above. (And because of this,
> flame wars often break out, with one party accusing the other of
> imagining things.) Can the integrated memory controller on AMD chips
> have something to do with this? Do AMD chips generally offer better
> "multithreading" behavior? Unfortunately, you didn't mention on what
> CPU you ran your tests. If it was AMD, it might be a good idea to run
> tests on Pentium and Core 2 CPUs.
>
> For reference, my system is:
>
> CPU: Intel Core 2 Duo E6600 (2.4GHz)
> Mainboard: Asus P5E (Intel X38 chipset)
> RAM: 6GB (2+2+1+1) dual channel DDR2 800
> GPU: RV770 (Radeon HD4870).
>

My Phenom 9550 (2.2GHz) whips the pants off my Intel Q6600 (2.6GHz). A
friend of mine and I both get large amounts of stalling when doing a lot
of IO. I haven't seen such horrible desktop interactivity since before the
new schedulers and the -ck patchset came out for 2.4.x. It's a heck of a
lot better on my AMD Phenoms, but some lag is noticeable these days, even
though it wasn't a few kernel releases ago.

Intel Specs:
CPU: Intel Core 2 Quad Q6600 (2.6GHz)
Mainboard: ASUS P5K-SE (Intel P35 iirc)
RAM: 4G 800MHz DDR2 dual channel (4x1G)
GPU: NVidia 8800GTS 320M

AMD Specs:
CPU: AMD Phenom I 9550 (2.2GHz)
Mainboard: Gigabyte MA78GM-S2H
RAM: 4G 800MHz DDR2 dual channel (2x2G)
GPU: Onboard Radeon 3200HD

AMD Specs x2:
CPU: AMD Phenom II 810 (2.6GHz)
Mainboard: Gigabyte MA790FXT-UD5P
RAM: 4G 1066MHz DDR3 dual channel (2x2G)
GPU: NVidia 8800GTS 320M (or currently a 8400GS)

Of course I get better performance out of the Phenom II than either other
box, but it surprises me that I'd get more out of the budget AMD box than
out of the not-so-budget Intel box.

--
Thomas Fjellstrom
tfjellstrom(a)shaw.ca

From: Pekka Pietikainen
On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
> > > Could you profile it please? Also, what's the context-switch rate?
> >
> > As far as I can tell, the broadcom mips architecture does not have
> > profiling support. It does only have some proprietary profiling
> > registers that nobody wrote kernel support for, yet.
> Well, what does 'vmstat 1' show - how many context switches are
> there per second on the iperf server? In theory if it's a truly
> saturated box, there shouldn't be many - just a single iperf task
Yay, finally something that's measurable in this thread \o/

Gigabit Ethernet iperf on an Atom or so might be something that
shows similar effects yet is debuggable. Anyone feel like taking a shot?
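
Something along these lines should be a reasonable starting point
(untested; iperf 2.x style flags assumed, and the 'cs' column of vmstat
is the context switch rate Ingo asked about). On the Atom box:

vmstat 1 &
iperf -s

and from another host:

iperf -c <atom-box-ip> -t 60 -P 4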

That beast doing iperf probably ends up running quite close to its
limits (IO, memory bandwidth, CPU). IIRC the routing/bridging performance
is something like 40Mbps (it depends a lot on the model, and corresponds
pretty well with the MHz of the beast).

Maybe not totally unlike what make -j16 does to a 1-4 core box?

