From: David Miller on
From: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Date: Wed, 09 Sep 2009 10:28:22 +1000

>> The TLB is SW loaded, yes. However it should not do any misses on kernel
>> space, since the whole segment is in a wired TLB entry.
>
> Including vmalloc space ?

No, MIPS does take SW tlb misses on vmalloc space. :-)

From: Ralf Baechle on
On Tue, Sep 08, 2009 at 07:50:00PM +1000, Benjamin Herrenschmidt wrote:

> On Tue, 2009-09-08 at 09:48 +0200, Ingo Molnar wrote:
> > So either your MIPS system has some unexpected dependency on the
> > scheduler, or there's something weird going on.
> >
> > Mind poking on this one to figure out whether it's all repeatable
> > and why that slowdown happens? Multiple attempts to reproduce it
> > failed here for me.
>
> Could it be the scheduler using constructs that don't do well on MIPS ?

It would surprise me.

I'm wondering if BFS has properties that make it perform better on a very
low-memory system; I would guess the BCM74xx system has only 32MB or 64MB
of RAM.

> I remember at some stage we spotted an expensive multiply in there,
> maybe there's something similar, or some unaligned or non-cache friendly
> vs. the MIPS cache line size data structure, that sort of thing ...
>
> Is this a SW loaded TLB ? Does it miss on kernel space ? That could
> also be some differences in how many pages are touched by each scheduler
> causing more TLB pressure. This will be mostly invisible on x86.

Software refilled. No misses ever for kernel space or low-mem; think of
it as low-mem and kernel executable living in a 512MB page that is mapped
by a mechanism outside the TLB. Vmalloc ranges are TLB mapped; ioremap
ranges are TLB mapped only if they sit above physical address 512MB.
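
To make that concrete, here is a rough sketch (simplified constants, not
the kernel's actual asm/addrspace.h definitions) of why low-mem accesses
never need a TLB entry on 32-bit MIPS:

/*
 * Illustrative only.  KSEG0/KSEG1 are fixed 512MB windows onto physical
 * memory, so turning a physical address into a kernel virtual address is
 * pure arithmetic and never touches the TLB.
 */
#include <stdint.h>
#include <stdio.h>

#define KSEG0_BASE  0x80000000UL   /* cached, unmapped window    */
#define KSEG1_BASE  0xa0000000UL   /* uncached, unmapped window  */
#define KSEG_MASK   0x1fffffffUL   /* 512MB of physical address  */

/* physical -> kernel virtual, no TLB involved */
static inline unsigned long phys_to_kseg0(unsigned long phys)
{
        return KSEG0_BASE | (phys & KSEG_MASK);
}

/* kernel virtual (KSEG0/KSEG1) -> physical */
static inline unsigned long kseg_to_phys(unsigned long vaddr)
{
        return vaddr & KSEG_MASK;
}

int main(void)
{
        /* 16MB into RAM ends up at 0x81000000, always resident */
        printf("%#lx\n", phys_to_kseg0(0x01000000UL));
        /* anything above 512MB physical needs a real (TLB) mapping */
        return 0;
}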

An emulated unaligned load/store is very expensive; one that GCC encodes
properly for __attribute__((packed)) costs only 1 extra cycle and 1 extra
instruction (= 4 bytes).
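
A minimal illustration of the two cases (nothing to do with the scheduler
code itself, just a sketch):

/*
 * Reading "val" through the packed struct lets GCC emit the
 * unaligned-capable sequence (lwl/lwr on MIPS) itself, roughly one extra
 * instruction.  Dereferencing a misaligned plain pointer instead traps
 * and gets emulated by the kernel, which costs far more.
 */
#include <stdint.h>

struct __attribute__((packed)) wire_hdr {
        uint8_t  type;
        uint32_t val;           /* ends up at offset 1: misaligned */
};

uint32_t read_packed(const struct wire_hdr *h)
{
        return h->val;          /* compiler-generated unaligned load */
}

uint32_t read_naive(const void *buf)
{
        /* traps on strict-alignment CPUs like MIPS */
        return *(const uint32_t *)((const uint8_t *)buf + 1);
}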

> At this stage, it will be hard to tell without some profile data I
> suppose. Maybe next week I can try on a small SW loaded TLB embedded PPC
> see if I can reproduce some of that, but no promises here.

Ralf
From: Felix Fietkau on
Ralf Baechle wrote:
>> I remember at some stage we spotted an expensive multiply in there,
>> maybe there's something similar, or some unaligned or non-cache friendly
>> vs. the MIPS cache line size data structure, that sort of thing ...
>>
>> Is this a SW loaded TLB ? Does it miss on kernel space ? That could
>> also be some differences in how many pages are touched by each scheduler
>> causing more TLB pressure. This will be mostly invisible on x86.
>
> Software refilled. No misses ever for kernel space or low-mem; think of
> it as low-mem and kernel executable living in a 512MB page that is mapped
> by a mechanism outside the TLB. Vmalloc ranges are TLB mapped; ioremap
> ranges are TLB mapped only if they sit above physical address 512MB.
>
> An emulated unaligned load/store is very expensive; one that GCC encodes
> properly for __attribute__((packed)) costs only 1 extra cycle and 1 extra
> instruction (= 4 bytes).

CFS definitely isn't causing any emulated unaligned load/stores on these
devices; we've tested that.

- Felix
From: Ingo Molnar on

* Jens Axboe <jens.axboe(a)oracle.com> wrote:

> On Tue, Sep 08 2009, Peter Zijlstra wrote:
> > On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> > > And here's a newer version.
> >
> > I tinkered a bit with your proglet and finally found the
> > problem.
> >
> > You used a single pipe per child, this means the loop in
> > run_child() would consume what it just wrote out until it got
> > force preempted by the parent which would also get woken.
> >
> > This results in the child spinning a while (its full quota) and
> > only reporting the last timestamp to the parent.
>
> Oh doh, that's not well thought out. Well it was a quick hack :-)
> Thanks for the fixup, now it's at least usable to some degree.
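
For reference, a rough sketch of the corrected structure (one pipe per
direction, so the child can never read back its own writes; this is an
illustration, not Jens's actual proglet):

/*
 * Parent pokes the child, child answers with a timestamp, parent
 * computes the wakeup latency.  Two pipes, one per direction.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

static long long now_ns(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
        int to_child[2], to_parent[2];
        long long sent, reply;
        int i;

        pipe(to_child);
        pipe(to_parent);

        if (fork() == 0) {              /* child: echo a timestamp back */
                char c;

                close(to_child[1]);
                close(to_parent[0]);
                while (read(to_child[0], &c, 1) == 1) {
                        long long t = now_ns();
                        write(to_parent[1], &t, sizeof(t));
                }
                _exit(0);
        }

        close(to_child[0]);
        close(to_parent[1]);

        for (i = 0; i < 10; i++) {
                sent = now_ns();
                write(to_child[1], "x", 1);
                read(to_parent[0], &reply, sizeof(reply));
                printf("wakeup latency: %lld ns\n", reply - sent);
                usleep(10000);
        }

        close(to_child[1]);             /* child sees EOF and exits */
        wait(NULL);
        return 0;
}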

What kind of latencies does it report on your box?

Our vanilla scheduler default latency targets are:

single-core: 20 msecs
dual-core: 40 msecs
quad-core: 60 msecs
octo-core: 80 msecs

You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via
/proc/sys/kernel/sched_latency_ns:

echo 10000000 > /proc/sys/kernel/sched_latency_ns
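
That echo sets it to 10 msecs. If you want to double-check what the knob
currently holds, something like this works (illustrative only; the file
is only there with CONFIG_SCHED_DEBUG=y):

#include <stdio.h>

int main(void)
{
        unsigned long long ns;
        FILE *f = fopen("/proc/sys/kernel/sched_latency_ns", "r");

        if (!f || fscanf(f, "%llu", &ns) != 1) {
                perror("sched_latency_ns");
                return 1;
        }
        fclose(f);
        /* report the target in milliseconds */
        printf("sched_latency: %.1f ms\n", ns / 1e6);
        return 0;
}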

Ingo
From: Nikos Chantziaras on
On 09/09/2009 09:13 AM, Ingo Molnar wrote:
>
> * Jens Axboe <jens.axboe(a)oracle.com> wrote:
>
>> On Tue, Sep 08 2009, Peter Zijlstra wrote:
>>> On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
>>>> And here's a newer version.
>>>
>>> I tinkered a bit with your proglet and finally found the
>>> problem.
>>>
>>> You used a single pipe per child, this means the loop in
>>> run_child() would consume what it just wrote out until it got
>>> force preempted by the parent which would also get woken.
>>>
>>> This results in the child spinning a while (its full quota) and
>>> only reporting the last timestamp to the parent.
>>
>> Oh doh, that's not well thought out. Well it was a quick hack :-)
>> Thanks for the fixup, now it's at least usable to some degree.
>
> What kind of latencies does it report on your box?
>
> Our vanilla scheduler default latency targets are:
>
> single-core: 20 msecs
> dual-core: 40 msecs
> quad-core: 60 msecs
> octo-core: 80 msecs
>
> You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via
> /proc/sys/kernel/sched_latency_ns:
>
> echo 10000000 > /proc/sys/kernel/sched_latency_ns

I've tried values ranging from 10000000 down to 100000. This makes the
stalls/freezes a bit shorter, but they are clearly still there; it does
not eliminate them.

If there's anything else I can try/test, I would be happy to do so.