From: Arjan van de Ven on
On Tue, 08 Sep 2009 10:19:06 +0300
Nikos Chantziaras <realnc(a)arcor.de> wrote:

> latencytop has this to say:
>
> http://foss.math.aegean.gr/~realnc/pics/latop1.png
>
> Though I don't really understand what this tool is trying to tell me,
> I hope someone does.

despite the untranslated content, it is clear that you have scheduler
delays (either due to scheduler bugs or cpu contention) of up to 68
msecs... Second in line is your binary AMD graphics driver that is
chewing up 14% of your total latency...


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jens Axboe on
On Mon, Sep 07 2009, Jens Axboe wrote:
> On Mon, Sep 07 2009, Jens Axboe wrote:
> > > And yes, it would be wonderful to get a test-app from you that would
> > > express the kind of pain you are seeing during compile jobs.
> >
> > I was hoping this one would, but it's not showing anything. I even added
> > support for doing the ping and wakeup over a socket, to see if the pipe
> > test was doing well because of the sync wakeup we do there. The net
> > latency is a little worse, but still good. So no luck in making that app
> > so far.
>
> Here's a version that bounces timestamps between a producer and a number
> of consumers (clients). Not really tested much, but perhaps someone can
> compare this on a box that boots BFS and see what happens.

And here's a newer version. It ensures that clients are running before
sending a timestamp, and it drops the first and last log entry to
eliminate any weird effects there. Accuracy should also be improved.

On an idle box, it'll usually log all zeroes. Sometimes I see 3-4 msec
latencies, though, which is odd.

--
Jens Axboe

From: Benjamin Herrenschmidt on
On Tue, 2009-09-08 at 09:48 +0200, Ingo Molnar wrote:
> So either your MIPS system has some unexpected dependency on the
> scheduler, or there's something weird going on.
>
> Mind poking on this one to figure out whether it's all repeatable
> and why that slowdown happens? Multiple attempts to reproduce it
> failed here for me.

Could it be that the scheduler is using constructs that don't do well on MIPS?

I remember at some stage we spotted an expensive multiply in there;
maybe there's something similar, or some data structure that is
unaligned or unfriendly to the MIPS cache line size, that sort of thing ...

Is this a SW-loaded TLB? Does it miss on kernel space? There could
also be some difference in how many pages are touched by each scheduler,
causing more TLB pressure. This will be mostly invisible on x86.

At this stage, it will be hard to tell without some profile data, I
suppose. Maybe next week I can try on a small SW-loaded TLB embedded PPC
and see if I can reproduce some of that, but no promises here.

Cheers,
Ben.

From: Ingo Molnar on

* Nikos Chantziaras <realnc(a)arcor.de> wrote:

> On 09/08/2009 11:04 AM, Ingo Molnar wrote:
>>
>> * Pekka Pietikainen<pp(a)ee.oulu.fi> wrote:
>>
>>> On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
>>>>>> Could you profile it please? Also, what's the context-switch rate?
>>>>>
>>>>> As far as I can tell, the broadcom mips architecture does not have
>>>>> profiling support. It does only have some proprietary profiling
>>>>> registers that nobody wrote kernel support for, yet.
>>>> Well, what does 'vmstat 1' show - how many context switches are
>>>> there per second on the iperf server? In theory if it's a truly
>>>> saturated box, there shouldn't be many - just a single iperf task
>>>
>>> Yay, finally something that's measurable in this thread \o/
>>
>> My initial posting in this thread contains 6 separate types of
>> measurements, rather extensive ones. Out of those, 4 measurements
>> were latency oriented, two were throughput oriented. Plenty of
>> data, plenty of results, and very good reproducibility.
>
> None of which involve latency-prone GUI applications running on
> cheap commodity hardware though. [...]

The lat_tcp, lat_pipe and pipe-test numbers are all benchmarks that
characterise such workloads - they show the latency of context
switches.

I also tested the area where Con posted numbers showing BFS has an edge
over mainline: kbuild performance. Should I not have done that?

Also note the interbench latency measurements that Con posted:

http://ck.kolivas.org/patches/bfs/interbench-bfs-cfs.txt

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load     Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None     0.004 +/- 0.00436     0.006         100             100
Video    0.008 +/- 0.00879     0.015         100             100
X        0.006 +/- 0.0067      0.014         100             100
Burn     0.005 +/- 0.00563     0.009         100             100
Write    0.005 +/- 0.00887     0.16          100             100
Read     0.006 +/- 0.00696     0.018         100             100
Compile  0.007 +/- 0.00751     0.019         100             100

Versus the mainline scheduler:

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load     Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None     0.005 +/- 0.00562     0.007         100             100
Video    0.003 +/- 0.00333     0.009         100             100
X        0.003 +/- 0.00409     0.01          100             100
Burn     0.004 +/- 0.00415     0.006         100             100
Write    0.005 +/- 0.00592     0.021         100             100
Read     0.004 +/- 0.00463     0.009         100             100
Compile  0.003 +/- 0.00426     0.014         100             100

Look at those standard deviation numbers: their spread is way too
high, often 50% or more, which makes such noisy data very hard to compare.

Furthermore, they happen to show the 2.6.30 mainline scheduler
outperforming BFS in almost every interactivity metric.

Check it for yourself and compare the entries. I haven't made those
measurements, Con did.

For example 'Compile' latencies:

--- Benchmarking simulated cpu of Audio in the presence of simulated Load
         Load     Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
v2.6.30: Compile  0.003 +/- 0.00426     0.014         100             100
BFS:     Compile  0.007 +/- 0.00751     0.019         100             100

but ... with a near 100% standard deviation that's pretty hard to
judge. The Max Latency went from 14 usecs under v2.6.30 to 19 usecs
on BFS.

> [...] I listed examples where mainline seems to behave
> sub-optimally and ways to reproduce them, but this doesn't seem to be
> an area of interest.

It is an area of interest, of course. That's how the interactivity
results above became possible.

Ingo
From: Nikos Chantziaras on
On 09/08/2009 11:38 AM, Arjan van de Ven wrote:
> On Tue, 08 Sep 2009 10:19:06 +0300
> Nikos Chantziaras<realnc(a)arcor.de> wrote:
>
>> latencytop has this to say:
>>
>> http://foss.math.aegean.gr/~realnc/pics/latop1.png
>>
>> Though I don't really understand what this tool is trying to tell me,
>> I hope someone does.
>
> despite the untranslated content, it is clear that you have scheduler
> delays (either due to scheduler bugs or cpu contention) of up to 68
> msecs... Second in line is your binary AMD graphics driver that is
> chewing up 14% of your total latency...

I've now used a correctly installed and up-to-date version of latencytop
and repeated the test. Also, I got rid of AMD's binary blob and used
kernel DRM drivers for my graphics card to throw fglrx out of the
equation (which btw didn't help; the exact same problems occur).

Here is the result:

http://foss.math.aegean.gr/~realnc/pics/latop2.png

Again: this is on an Intel Core 2 Duo CPU.