From: Zhang, Yanmin
On Thu, 2010-04-01 at 10:53 -0500, Christoph Lameter wrote:
> On Thu, 1 Apr 2010, Zhang, Yanmin wrote:
>
> > I suspect that moving cpu_slab's place in kmem_cache causes the new cache
> > misses. But when I move it to the tail of the structure, the kernel always panics
> > when booting. Perhaps there is another potential bug?
>
> Why would that cause an additional cache miss?
>
>
> The node array follows at the end of the structure. If you want to
> move cpu_slab down, then it needs to be placed before the node field.

Thanks. Moving cpu_slab to the tail doesn't improve it.
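
For reference, the layout change being discussed looks roughly like the
following (a simplified sketch of struct kmem_cache from
include/linux/slub_def.h, trimmed to the fields relevant here; the real
definitions carry more fields and #ifdefs):

/* 2.6.33: per-cpu pointers sit in an NR_CPUS array at the tail */
struct kmem_cache {
	unsigned long flags;
	int size;		/* object size including metadata */
	int objsize;		/* object size without metadata */
	int offset;		/* free pointer offset */
	/* ... */
	struct kmem_cache_node *node[MAX_NUMNODES];
	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
};

/* 2.6.34: a single percpu pointer, moved to the head.  node[] has to
 * stay last, since the structure is only allocated up to
 * node[nr_node_ids]; presumably that is why moving cpu_slab behind
 * it panics at boot. */
struct kmem_cache {
	struct kmem_cache_cpu *cpu_slab;	/* from alloc_percpu() */
	unsigned long flags;
	int size;
	int objsize;
	int offset;
	/* ... */
	struct kmem_cache_node *node[MAX_NUMNODES];
};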

I used perf to collect statistics. Only the data cache misses show a small difference.
My testing command on my 2 socket machine:
#hackbench 100 process 20000

With 2.6.33, it takes about 96 seconds, while 2.6.34-rc2 (or the latest tip tree)
takes about 101 seconds.

perf shows that some SLUB functions have higher CPU utilization, while other
SLUB functions have lower CPU utilization.


From: Christoph Lameter
On Fri, 2 Apr 2010, Zhang, Yanmin wrote:

> My testing command on my 2 socket machine:
> #hackbench 100 process 20000
>
> With 2.6.33, it takes about 96 seconds, while 2.6.34-rc2 (or the latest tip tree)
> takes about 101 seconds.
>
> perf shows that some SLUB functions have higher CPU utilization, while other
> SLUB functions have lower CPU utilization.

Hmnmmm... The dynamic percpu areas use page tables, and that data is used
in the fast path. Maybe the high thread count causes TLB thrashing?
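
To make the TLB concern concrete: since the 2.6.34 conversion the
kmem_cache_cpu structures come from alloc_percpu() rather than from a
per-cache static array, so the fast path reaches them roughly as below
(an illustration of the mechanism, not the literal mm/slub.c code):

struct kmem_cache_cpu *c;

/*
 * s->cpu_slab is a percpu pointer returned by alloc_percpu(); adding
 * the CPU's percpu offset yields an address inside a percpu chunk.
 * The first chunk normally sits in the linear mapping, but dynamically
 * grown chunks are mapped page by page (their pcpu_get_vm_areas
 * entries show up in /proc/vmallocinfo), so accessing them can cost
 * additional TLB entries.
 */
c = per_cpu_ptr(s->cpu_slab, smp_processor_id());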

From: Pekka Enberg
(I'm CC'ing Tejun)

On Mon, Apr 5, 2010 at 4:54 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> On Fri, 2 Apr 2010, Zhang, Yanmin wrote:
>
>> My testing command on my 2 socket machine:
>> #hackbench 100 process 20000
>>
>> With 2.6.33, it takes about 96 seconds, while 2.6.34-rc2 (or the latest tip tree)
>> takes about 101 seconds.
>>
>> perf shows that some SLUB functions have higher CPU utilization, while other
>> SLUB functions have lower CPU utilization.
>
> Hmnmmm... The dynamic percpu areas use page tables, and that data is used
> in the fast path. Maybe the high thread count causes TLB thrashing?

Hmm indeed. I don't see anything particularly funny in the SLUB percpu
conversion, so maybe this is more of an issue with the new percpu
allocator?
From: Christoph Lameter
On Tue, 6 Apr 2010, Zhang, Yanmin wrote:

> Thanks. I tried 2 and 4 times the reserve and didn't see much improvement.
> I checked /proc/vmallocinfo and it has no pcpu_get_vm_areas entry
> when I use 4 times PERCPU_DYNAMIC_RESERVE.

> I used perf to collect dTLB misses and LLC misses. The dTLB miss data is not
> stable: sometimes we see more dTLB misses but get a better result.
>
> The LLC miss data is more stable. Only LLC-load-misses shows a clear difference now;
> LLC-store-misses shows no big difference.

What exactly does the LLC-load-miss event count?

The cacheline environment in the hot path should only include the following
cache lines (without debugging and counters); a sketch annotated with them
follows the list:

1. The first cacheline from the kmem_cache structure

(This is different from the situation before the 2.6.34 changes. Earlier,
some critical values (object length etc.) were available
from the kmem_cache_cpu structure. The cacheline containing the percpu
structure array was needed to determine the kmem_cache_cpu address!)

2. The first cacheline from kmem_cache_cpu

3. The first cacheline of the data object (free pointer)

And in the case of a kfree/kmem_cache_free:

4. Cacheline that contains the page struct of the page the object resides
in.
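
A rough sketch of that fast path, annotated with the cache lines above
(simplified from the 2.6.34 slab_alloc() in mm/slub.c; irq handling,
node matching, debug hooks and the slow-path fallback are left out, and
the function name is just for illustration):

static __always_inline void *fastpath_sketch(struct kmem_cache *s)
{
	struct kmem_cache_cpu *c;
	void *object;

	c = __this_cpu_ptr(s->cpu_slab);	/* cacheline 1: head of kmem_cache */
	object = c->freelist;			/* cacheline 2: head of kmem_cache_cpu */
	if (object)
		/* cacheline 3: the object's own free pointer at s->offset
		 * (s->offset itself is on cacheline 1) */
		c->freelist = get_freepointer(s, object);

	/* cacheline 4 applies only to kfree()/kmem_cache_free(), where
	 * virt_to_head_page(object) touches the object's struct page */
	return object;
}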

Can you post the .config you are using and the bootup messages?

From: Christoph Lameter
We cannot reproduce the issue here. Our tests (dual quad-core Dell) show a
performance increase in hackbench instead.

Linux 2.6.33.2 #2 SMP Mon Apr 5 11:30:56 CDT 2010 x86_64 GNU/Linux
../hackbench 100 process 200000
Running with 100*40 (== 4000) tasks.
Time: 3102.142
../hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 308.731
../hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 311.591
../hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 310.200
../hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 38.048
../hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 44.711
../hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 39.407
../hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 9.411
../hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 8.765
../hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 8.822

Linux 2.6.34-rc3 #1 SMP Tue Apr 6 13:30:34 CDT 2010 x86_64 GNU/Linux
../hackbench 100 process 200000
Running with 100*40 (== 4000) tasks.
Time: 3003.578
../hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 300.289
../hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 301.462
../hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 301.173
../hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 41.191
../hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 41.964
../hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 41.470
../hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 8.829
../hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 9.166
../hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 8.681

