From: Nikos Chantziaras on
On 09/07/2009 05:40 PM, Arjan van de Ven wrote:
> On Mon, 07 Sep 2009 06:38:36 +0300
> Nikos Chantziaras<realnc(a)arcor.de> wrote:
>
>> On 09/06/2009 11:59 PM, Ingo Molnar wrote:
>>> [...]
>>> Also, i'd like to outline that i agree with the general goals
>>> described by you in the BFS announcement - small desktop systems
>>> matter more than large systems. We find it critically important
>>> that the mainline Linux scheduler performs well on those systems
>>> too - and if you (or anyone else) can reproduce suboptimal behavior
>>> please let the scheduler folks know so that we can fix/improve it.
>>
>> BFS improved behavior of many applications on my Intel Core 2 box in
>> a way that can't be benchmarked. Examples:
>
> Have you tried to see if latencytop catches such latencies ?

I've just tried it.

I start latencytop and then mplayer on a video that doesn't max out
the CPU (it needs about 20-30% of a single core, out of the 2
available). Then, while the video is playing, I press Alt+Tab
repeatedly, which makes the desktop compositor kick in and stay
active (it lays out all windows in a "flip switch", similar to the
Microsoft Vista Aero Alt+Tab effect). Repeatedly pressing Alt+Tab
keeps the compositor (in this case KDE 4.3.1) busy processing. With
the mainline scheduler, mplayer starts dropping frames and skipping
sound like crazy for the whole duration of this exercise.

latencytop has this to say:

http://foss.math.aegean.gr/~realnc/pics/latop1.png

Though I don't really understand what this tool is trying to tell me, I
hope someone does.
From: Ingo Molnar on

* Ingo Molnar <mingo(a)elte.hu> wrote:

> That's interesting. I tried to reproduce it on x86, but the
> profile does not show any scheduler overhead at all on the server:

I've now simulated a saturated iperf server by adding a
udelay(3000) to e1000_intr(), via the patch below.
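
In essence the change is just this (a sketch only - the exact
insertion point inside the e1000 interrupt handler, and the
surrounding handler code shown here, may differ from the real
driver):

/* drivers/net/e1000/e1000_main.c - needs <linux/delay.h> for udelay() */
static irqreturn_t e1000_intr(int irq, void *data)
{
	struct net_device *netdev = data;
	struct e1000_adapter *adapter = netdev_priv(netdev);

	/* HACK: burn ~3ms of CPU per interrupt so the box never goes idle */
	udelay(3000);

	/* ... rest of the original interrupt handler, unchanged ... */
}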

There's no idle time left that way:

Cpu(s): 0.0%us, 2.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 93.2%hi, 4.2%si, 0.0%st
Mem: 1021044k total, 93400k used, 927644k free, 5068k buffers
Swap: 8193140k total, 0k used, 8193140k free, 25404k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1604 mingo 20 0 38300 956 724 S 99.4 0.1 3:15.07 iperf
727 root 15 -5 0 0 0 S 0.2 0.0 0:00.41 kondemand/0
1226 root 20 0 6452 336 240 S 0.2 0.0 0:00.06 irqbalance
1387 mingo 20 0 78872 1988 1300 S 0.2 0.2 0:00.23 sshd
1657 mingo 20 0 12752 1128 800 R 0.2 0.1 0:01.34 top
1 root 20 0 10320 684 572 S 0.0 0.1 0:01.79 init
2 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kthreadd

And the server is only able to saturate half of the 1 gigabit
bandwidth:

Client connecting to t, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.0.1.19 port 50836 connected with 10.0.1.14 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 504 MBytes 423 Mbits/sec
------------------------------------------------------------
Client connecting to t, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.0.1.19 port 50837 connected with 10.0.1.14 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 502 MBytes 420 Mbits/sec


perf top is showing:

------------------------------------------------------------------------------
PerfTop: 28517 irqs/sec kernel:99.4% [100000 cycles], (all, 1 CPUs)
------------------------------------------------------------------------------

samples pcnt kernel function
_______ _____ _______________

139553.00 - 93.2% : delay_tsc
2098.00 - 1.4% : hmac_digest
561.00 - 0.4% : ip_call_ra_chain
335.00 - 0.2% : neigh_alloc
279.00 - 0.2% : __hash_conntrack
257.00 - 0.2% : dev_activate
186.00 - 0.1% : proc_tcp_available_congestion_control
178.00 - 0.1% : e1000_get_regs
167.00 - 0.1% : tcp_event_data_recv

delay_tsc() dominates, as expected. Still zero scheduler overhead,
and the context-switch rate is well below 1000 per sec.
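
(To double-check that rate on any box, 'vmstat 1' shows it directly
in the 'cs' column - or a trivial user-space loop over the "ctxt"
counter in /proc/stat will do. A rough sketch, illustration only:)

/* ctxt-rate.c: print system-wide context switches per second,
 * roughly what the "cs" column of 'vmstat 1' reports. */
#include <stdio.h>
#include <unistd.h>

static unsigned long long read_ctxt(void)
{
	char line[256];
	unsigned long long ctxt = 0;
	FILE *f = fopen("/proc/stat", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "ctxt %llu", &ctxt) == 1)
			break;
	fclose(f);
	return ctxt;
}

int main(void)
{
	unsigned long long prev = read_ctxt();

	for (;;) {
		unsigned long long cur;

		sleep(1);
		cur = read_ctxt();
		printf("%llu context switches/sec\n", cur - prev);
		prev = cur;
	}
	return 0;
}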

Then i booted v2.6.30 vanilla, added the udelay(3000) and got:

[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47026
[ 5] 0.0-10.0 sec 493 MBytes 412 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47027
[ 4] 0.0-10.0 sec 520 MBytes 436 Mbits/sec
[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47028
[ 5] 0.0-10.0 sec 506 MBytes 424 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47029
[ 4] 0.0-10.0 sec 496 MBytes 415 Mbits/sec

i.e. essentially the same throughput. (and this shows that using .30
versus .31 did not materially impact iperf performance in this test,
under these conditions and with this hardware)

Then i applied the BFS patch to v2.6.30 and used the same
udelay(3000) hack and got:

[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38505
[ 5] 0.0-10.1 sec 481 MBytes 401 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38506
[ 4] 0.0-10.0 sec 505 MBytes 423 Mbits/sec
[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38507
[ 5] 0.0-10.0 sec 508 MBytes 426 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38508
[ 4] 0.0-10.0 sec 486 MBytes 406 Mbits/sec

No measurable change in throughput.

Obviously, this test is not equivalent to your test - but it does
show that even saturated iperf is getting scheduled just fine. (or,
rather, does not get scheduled all that much.)

So either your MIPS system has some unexpected dependency on the
scheduler, or there's something weird going on.

Mind poking on this one to figure out whether it's all repeatable
and why that slowdown happens? Multiple attempts to reproduce it
failed here for me.

Ingo
From: Ingo Molnar on

* Pekka Pietikainen <pp(a)ee.oulu.fi> wrote:

> On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
> > > > Could you profile it please? Also, what's the context-switch rate?
> > >
> > > As far as I can tell, the broadcom mips architecture does not have
> > > profiling support. It does only have some proprietary profiling
> > > registers that nobody wrote kernel support for, yet.
> > Well, what does 'vmstat 1' show - how many context switches are
> > there per second on the iperf server? In theory if it's a truly
> > saturated box, there shouldn't be many - just a single iperf task
>
> Yay, finally something that's measurable in this thread \o/

My initial posting in this thread contains 6 separate types of
measurements, rather extensive ones. Out of those, four were
latency oriented and two were throughput oriented. Plenty of data,
plenty of results, and very good reproducibility.

> Gigabit Ethernet iperf on an Atom or so might be something that
> shows similar effects yet is debuggable. Anyone feel like taking a
> shot?

I tried iperf on x86 and simulated saturation and no, there's no BFS
versus mainline performance difference that i can measure - simply
because a saturated iperf server does not schedule much - it's busy
handling all that networking workload.

I did notice that iperf is somewhat noisy: it can easily have weird
outliers regardless of which scheduler is used. That could be an
effect of queueing/timing: depending on precisely what order packets
arrive in and how they get queued by the networking stack, a
cache-effective pathway for packets may open up - while with
slightly different timings that pathway closes and we get much worse
queueing performance. I saw noise on the order of 10%, so iperf has
to be measured carefully before drawing conclusions.

> That beast doing iperf probably ends up making it go quite close
> to it's limits (IO, mem bw, cpu). IIRC the routing/bridging
> performance is something like 40Mbps (depends a lot on the model,
> corresponds pretty well with the Mhz of the beast).
>
> Maybe not totally unlike what make -j16 does to a 1-4 core box?

No, a single iperf session is very different from kbuild make -j16.

Firstly, the iperf server is just a single long-lived task - so we
context-switch between that and the idle thread [and perhaps a
kernel thread such as ksoftirqd]. The scheduler essentially has no
leeway in which task to schedule and for how long: if there's work
going on, the iperf server task will run - if there's none, the idle
task runs. [modulo ksoftirqd - depending on the driver model and on
precise timings.]

kbuild -j16 on the other hand is a complex hierarchy and mixture of
thousands of short-lived and long-lived tasks. The scheduler has a
lot of leeway to decide what to schedule and for how long.

From a scheduler perspective the two workloads could not be any more
different. Kbuild does test scheduler decisions in non-trivial ways
- iperf server does not really.
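
(One easy way to see the iperf side of this on the server box is the
per-task context-switch counters the kernel exports in
/proc/<pid>/status: a long-running iperf server should show only a
modest number of voluntary switches and very few involuntary ones.
A rough sketch that dumps them, illustration only:)

/* task-ctxt.c: dump a task's context-switch counters from
 * /proc/<pid>/status.  Usage: ./task-ctxt <pid> */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "ctxt_switches"))
			fputs(line, stdout);	/* voluntary + nonvoluntary */
	fclose(f);
	return 0;
}

(Run it against the iperf server's PID while the test is running.)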

Ingo
From: Nikos Chantziaras on
On 09/08/2009 11:04 AM, Ingo Molnar wrote:
>
> * Pekka Pietikainen<pp(a)ee.oulu.fi> wrote:
>
>> On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
>>>>> Could you profile it please? Also, what's the context-switch rate?
>>>>
>>>> As far as I can tell, the broadcom mips architecture does not have
>>>> profiling support. It does only have some proprietary profiling
>>>> registers that nobody wrote kernel support for, yet.
>>> Well, what does 'vmstat 1' show - how many context switches are
>>> there per second on the iperf server? In theory if it's a truly
>>> saturated box, there shouldn't be many - just a single iperf task
>>
>> Yay, finally something that's measurable in this thread \o/
>
> My initial posting in this thread contains 6 separate types of
> measurements, rather extensive ones. Out of those, four were
> latency oriented and two were throughput oriented. Plenty of data,
> plenty of results, and very good reproducibility.

None of which involve latency-prone GUI applications running on cheap
commodity hardware, though. I listed examples where mainline seems to
behave sub-optimally, and ways to reproduce them, but this doesn't
seem to be an area of interest.
From: Arjan van de Ven on
On Tue, 08 Sep 2009 10:19:06 +0300
Nikos Chantziaras <realnc(a)arcor.de> wrote:

> latencytop has this to say:
>
> http://foss.math.aegean.gr/~realnc/pics/latop1.png
>
> Though I don't really understand what this tool is trying to tell me,
> I hope someone does.

Unfortunately this is both an older version of latencytop, and it's
incorrectly installed ;-(
Latencytop is supposed to translate those cryptic strings into
English, but because it's not correctly installed, it does not do
this ;(

The latest version of latencytop also has a GUI (thanks to Ben).

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org