From: Eric Dumazet on
On Tuesday 06 April 2010 at 15:55 -0500, Christoph Lameter wrote:
> We cannot reproduce the issue here. Our tests here (dual quad dell) show a
> performance increase in hackbench instead.
>
> Linux 2.6.33.2 #2 SMP Mon Apr 5 11:30:56 CDT 2010 x86_64 GNU/Linux
> ./hackbench 100 process 200000
> Running with 100*40 (== 4000) tasks.
> Time: 3102.142
> ./hackbench 100 process 20000
> Running with 100*40 (== 4000) tasks.
> Time: 308.731
> ./hackbench 100 process 20000
> Running with 100*40 (== 4000) tasks.
> Time: 311.591
> ./hackbench 100 process 20000
> Running with 100*40 (== 4000) tasks.
> Time: 310.200
> ./hackbench 10 process 20000
> Running with 10*40 (== 400) tasks.
> Time: 38.048
> ./hackbench 10 process 20000
> Running with 10*40 (== 400) tasks.
> Time: 44.711
> ./hackbench 10 process 20000
> Running with 10*40 (== 400) tasks.
> Time: 39.407
> ./hackbench 1 process 20000
> Running with 1*40 (== 40) tasks.
> Time: 9.411
> ./hackbench 1 process 20000
> Running with 1*40 (== 40) tasks.
> Time: 8.765
> ./hackbench 1 process 20000
> Running with 1*40 (== 40) tasks.
> Time: 8.822
>
> Linux 2.6.34-rc3 #1 SMP Tue Apr 6 13:30:34 CDT 2010 x86_64 GNU/Linux
> ./hackbench 100 process 200000
> Running with 100*40 (== 4000) tasks.
> Time: 3003.578
> ./hackbench 100 process 20000
> Running with 100*40 (== 4000) tasks.
> Time: 300.289
> ./hackbench 100 process 20000
> Running with 100*40 (== 4000) tasks.
> Time: 301.462
> ./hackbench 100 process 20000
> Running with 100*40 (== 4000) tasks.
> Time: 301.173
> ./hackbench 10 process 20000
> Running with 10*40 (== 400) tasks.
> Time: 41.191
> ./hackbench 10 process 20000
> Running with 10*40 (== 400) tasks.
> Time: 41.964
> ./hackbench 10 process 20000
> Running with 10*40 (== 400) tasks.
> Time: 41.470
> ./hackbench 1 process 20000
> Running with 1*40 (== 40) tasks.
> Time: 8.829
> ./hackbench 1 process 20000
> Running with 1*40 (== 40) tasks.
> Time: 9.166
> ./hackbench 1 process 20000
> Running with 1*40 (== 40) tasks.
> Time: 8.681
>
>


Well, your config might be very different... and hackbench results can
vary by 10% on the same machine with the same kernel.

This is not a reliable benchmark, because af_unix is not prepared for
such a crazy workload.

We really should warn people about this.



# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 12.922
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 12.696
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 13.060
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 14.108
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 13.165
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 13.310
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 12.530
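Given that variance, runs like the seven above are best compared as a mean and spread rather than as single numbers. A small sketch (the awk helper is mine, not part of hackbench):

```shell
# mean_time: read "Time: <seconds>" lines on stdin, print the number of
# runs, the mean, and the standard deviation across runs.
mean_time() {
    awk '/^Time:/ { s += $2; ss += $2 * $2; n++ }
         END { m = s / n;
               printf "runs=%d mean=%.3f stddev=%.3f\n", n, m, sqrt(ss / n - m * m) }'
}

# The seven "hackbench 25 process 3000" times quoted above:
printf 'Time: %s\n' 12.922 12.696 13.060 14.108 13.165 13.310 12.530 | mean_time
# → runs=7 mean=13.113 stddev=0.476
```

With a stddev around 3.6% of the mean here, a single-run difference of a few percent between two kernels is not meaningful.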


booting with slub_min_order=3 does change hackbench results, for example ;)

All writers can compete on the spinlock for a target UNIX socket, so we spend a _lot_ of time spinning.

If we _really_ want to speed up hackbench, we would have to change unix_state_lock()
to use a non-spinning locking primitive (such as lock_sock()), and slow down the normal path.


# perf record -f hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 13.330
[ perf record: Woken up 289 times to write data ]
[ perf record: Captured and wrote 54.312 MB perf.data (~2372928 samples) ]
# perf report
# Samples: 2370135
#
# Overhead Command Shared Object Symbol
# ........ ......... ............................ ......
#
9.68% hackbench [kernel] [k] do_raw_spin_lock
6.50% hackbench [kernel] [k] schedule
4.38% hackbench [kernel] [k] __kmalloc_track_caller
3.95% hackbench [kernel] [k] copy_to_user
3.86% hackbench [kernel] [k] __alloc_skb
3.77% hackbench [kernel] [k] unix_stream_recvmsg
3.12% hackbench [kernel] [k] sock_alloc_send_pskb
2.75% hackbench [vdso] [.] 0x000000ffffe425
2.28% hackbench [kernel] [k] sysenter_past_esp
2.03% hackbench [kernel] [k] __mutex_lock_common
2.00% hackbench [kernel] [k] kfree
2.00% hackbench [kernel] [k] delay_tsc
1.75% hackbench [kernel] [k] update_curr
1.70% hackbench [kernel] [k] kmem_cache_alloc
1.69% hackbench [kernel] [k] do_raw_spin_unlock
1.60% hackbench [kernel] [k] unix_stream_sendmsg
1.54% hackbench [kernel] [k] sched_clock_local
1.46% hackbench [kernel] [k] __slab_free
1.37% hackbench [kernel] [k] do_raw_read_lock
1.34% hackbench [kernel] [k] __switch_to
1.24% hackbench [kernel] [k] select_task_rq_fair
1.23% hackbench [kernel] [k] sock_wfree
1.21% hackbench [kernel] [k] _raw_spin_unlock_irqrestore
1.19% hackbench [kernel] [k] __mutex_unlock_slowpath
1.05% hackbench [kernel] [k] trace_hardirqs_off
0.99% hackbench [kernel] [k] __might_sleep
0.93% hackbench [kernel] [k] do_raw_read_unlock
0.93% hackbench [kernel] [k] _raw_spin_lock
0.91% hackbench [kernel] [k] try_to_wake_up
0.81% hackbench [kernel] [k] sched_clock
0.80% hackbench [kernel] [k] trace_hardirqs_on
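To put one number on the locking overhead in the report above, the lock-related lines can be summed. A sketch (the symbol selection and awk filter are mine, not a perf feature):

```shell
# Sum the "Overhead" column for lock-related symbols in a perf report
# listing. awk coerces "9.68%" to the number 9.68 automatically.
lock_share() {
    awk '/spin|mutex/ { sum += $1 } END { printf "%.2f%%\n", sum }'
}

# Fed the lock-related lines from the report above; prints "16.73%":
lock_share <<'EOF'
9.68% hackbench [kernel] [k] do_raw_spin_lock
2.03% hackbench [kernel] [k] __mutex_lock_common
1.69% hackbench [kernel] [k] do_raw_spin_unlock
1.21% hackbench [kernel] [k] _raw_spin_unlock_irqrestore
1.19% hackbench [kernel] [k] __mutex_unlock_slowpath
0.93% hackbench [kernel] [k] _raw_spin_lock
EOF
```

Roughly a sixth of all samples land in lock primitives, which is the spinning described above.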


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Zhang, Yanmin on
On Tue, 2010-04-06 at 10:41 -0500, Christoph Lameter wrote:
> On Tue, 6 Apr 2010, Zhang, Yanmin wrote:
>
> > Thanks. I tried 2 and 4 times and didn't see much improvement.
> > I checked /proc/vmallocinfo and it doesn't have an entry for pcpu_get_vm_areas
> > when I use 4 times the PERCPU_DYNAMIC_RESERVE.
>
> > I used perf to collect dTLB misses and LLC misses. The dTLB miss data is not
> > stable. Sometimes we see more dTLB misses but get a better result.
> >
> > The LLC miss data is more stable. Only LLC-load-misses is a clear signal now.
> > LLC-store-misses shows no big difference.
>
> What exactly does an LLC-load-miss indicate?
I don't know; it is just the clearest signal we have. Otherwise there is no
clear sign at all. The per-function statistics collected by perf with the
LLC-load-misses event are very scattered.

>
> The cacheline environment in the hotpath should only include the following
> cache lines (without debugging and counters):
>
> 1. The first cacheline from the kmem_cache structure
>
> (This is different from the situation before the 2.6.34 changes. Earlier,
> some critical values (object length etc.) were available
> from the kmem_cache_cpu structure. The cacheline containing the percpu
> structure array was needed to determine the kmem_cache_cpu address!)
>
> 2. The first cacheline from kmem_cache_cpu
>
> 3. The first cacheline of the data object (free pointer)
>
> And in case of a kfree/ kmem_cache_free:
>
> 4. Cacheline that contains the page struct of the page the object resides
> in.
I agree with your analysis, but we still have no answer.

>
> Can you post the .config you are using and the bootup messages?
>

Please see the two attachments.

CONFIG_SLUB_STATS has no big impact on the results.

Yanmin

From: Zhang, Yanmin on
On Tue, 2010-04-06 at 15:55 -0500, Christoph Lameter wrote:
> We cannot reproduce the issue here. Our tests here (dual quad dell) show a
> performance increase in hackbench instead.
I ran hackbench on many machines. The regression shows up on a Nehalem machine
(dual socket, 2*4*2 logical CPUs) and on a Tigerton machine (4 sockets, 4*4
logical CPUs). On a dual quad-core Core 2 machine it behaves as you said.

The regression also exists on 2 newer-generation Nehalem machines (dual socket,
2*6*2 logical CPUs).

It seems hyperthreaded CPUs have more chances to trigger it.



From: Zhang, Yanmin on
On Wed, 2010-04-07 at 00:10 +0200, Eric Dumazet wrote:
> On Tuesday 06 April 2010 at 15:55 -0500, Christoph Lameter wrote:
> > We cannot reproduce the issue here. Our tests here (dual quad dell) show a
> > performance increase in hackbench instead.
>
>
> Well, your config might be very different... and hackbench results can
> vary by 10% on the same machine with the same kernel.
>
> This is not a reliable benchmark, because af_unix is not prepared for
> such a crazy workload.
Thanks. I found that too. Normally my script runs hackbench 3 times and
takes the average. To decrease the variation, I use
'./hackbench 100 process 200000' to get a more stable result.


>
> We really should warn people about this.
>
>
> booting with slub_min_order=3 does change hackbench results, for example ;)
By default, slub_min_order=3 on my Nehalem machines. I also tried several
larger slub_min_order values and it didn't help.


>
> All writers can compete on the spinlock for a target UNIX socket, so we spend a _lot_ of time spinning.
>
> If we _really_ want to speed up hackbench, we would have to change unix_state_lock()
> to use a non-spinning locking primitive (such as lock_sock()), and slow down the normal path.

I collected retired instructions, dTLB misses and LLC misses.
Below is the LLC miss data.

Kernel 2.6.33:
# Samples: 11639436896 LLC-load-misses
#
# Overhead Command Shared Object Symbol
# ........ ............... ...................................................... ......
#
20.94% hackbench [kernel.kallsyms] [k] copy_user_generic_string
14.56% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
12.88% hackbench [kernel.kallsyms] [k] kfree
7.37% hackbench [kernel.kallsyms] [k] kmem_cache_free
7.18% hackbench [kernel.kallsyms] [k] kmem_cache_alloc_node
6.78% hackbench [kernel.kallsyms] [k] kfree_skb
6.27% hackbench [kernel.kallsyms] [k] __kmalloc_node_track_caller
2.73% hackbench [kernel.kallsyms] [k] __slab_free
2.21% hackbench [kernel.kallsyms] [k] get_partial_node
2.01% hackbench [kernel.kallsyms] [k] _raw_spin_lock
1.59% hackbench [kernel.kallsyms] [k] schedule
1.27% hackbench hackbench [.] receiver
0.99% hackbench libpthread-2.9.so [.] __read
0.87% hackbench [kernel.kallsyms] [k] unix_stream_sendmsg




Kernel 2.6.34-rc3:
# Samples: 13079611308 LLC-load-misses
#
# Overhead Command Shared Object Symbol
# ........ ............... .................................................................... ......
#
18.55% hackbench [kernel.kallsyms] [k] copy_user_generic_string
13.19% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
11.62% hackbench [kernel.kallsyms] [k] kfree
8.54% hackbench [kernel.kallsyms] [k] kmem_cache_free
7.88% hackbench [kernel.kallsyms] [k] __kmalloc_node_track_caller
6.54% hackbench [kernel.kallsyms] [k] kmem_cache_alloc_node
5.94% hackbench [kernel.kallsyms] [k] kfree_skb
3.48% hackbench [kernel.kallsyms] [k] __slab_free
2.15% hackbench [kernel.kallsyms] [k] _raw_spin_lock
1.83% hackbench [kernel.kallsyms] [k] schedule
1.82% hackbench [kernel.kallsyms] [k] get_partial_node
1.59% hackbench hackbench [.] receiver
1.37% hackbench libpthread-2.9.so [.] __read


From: Eric Dumazet on
On Wednesday 07 April 2010 at 10:34 +0800, Zhang, Yanmin wrote:

> I collected retired instructions, dTLB misses and LLC misses.
> Below is the LLC miss data.

Please check the values of /proc/sys/net/core/rmem_default
and /proc/sys/net/core/wmem_default on your machines.

Their values can also change hackbench results, because increasing
wmem_default allows af_unix senders to consume many more skbs and stress
the slab allocators (__slab_free), far beyond what slub_min_order can tune.

When 2000 senders are running (and 2000 receivers), we might consume
something like 2000 * 100,000 bytes of kernel memory for skbs. TLB
thrashing is expected, because all these skbs can span many 2MB pages.
Maybe some node imbalance happens too.
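That back-of-the-envelope figure can be checked with shell arithmetic; the 100,000 bytes per sender is a rough estimate from the paragraph above, not a measured value:

```shell
senders=2000
bytes_per_sender=100000            # rough per-sender skb footprint
total=$((senders * bytes_per_sender))
# ~200 MB of skbs, spread across many 2MB huge pages (2097152 bytes each):
echo "$((total / 1000000)) MB total, spanning about $((total / 2097152)) 2MB pages"
# → 200 MB total, spanning about 95 2MB pages
```

Two hundred megabytes of short-lived skbs touching ~95 huge pages is consistent with the TLB thrashing described.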



You could try booting your machine with less RAM per node and check:

# cat /proc/buddyinfo
Node 0, zone DMA 2 1 2 2 1 1 1 0 1 1 3
Node 0, zone DMA32 219 298 143 584 145 57 44 41 31 26 517
Node 1, zone DMA32 4 1 17 1 0 3 2 2 2 2 123
Node 1, zone Normal 126 169 83 8 7 5 59 59 49 28 459


One experiment on your Nehalem machine would be to change hackbench so
that each group (20 senders / 20 receivers) runs on a particular NUMA
node.
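Without patching hackbench itself, a rough version of that experiment can be sketched with numactl (assuming numactl is installed; the group count of 12 per node roughly halves the 25 used above). Printed as a dry run so the commands can be inspected; drop the "echo" to execute:

```shell
# One set of hackbench groups per NUMA node, binding both CPUs and
# memory to the same node. Remove "echo" to actually run the commands.
for node in 0 1; do
    echo numactl --cpunodebind=$node --membind=$node ./hackbench 12 process 3000 \&
done
```

This keeps each group's senders, receivers and skbs on one node, which would separate cross-node traffic from the slab-allocator effect.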

x86info -c ->

CPU #1
EFamily: 0 EModel: 1 Family: 6 Model: 26 Stepping: 5
CPU Model: Core i7 (Nehalem)
Processor name string: Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
Type: 0 (Original OEM) Brand: 0 (Unsupported)
Number of cores per physical package=8
Number of logical processors per socket=16
Number of logical processors per core=2
APIC ID: 0x10 Package: 0 Core: 1 SMT ID 0
Cache info
L1 Instruction cache: 32KB, 4-way associative. 64 byte line size.
L1 Data cache: 32KB, 8-way associative. 64 byte line size.
L2 (MLC): 256KB, 8-way associative. 64 byte line size.
TLB info
Data TLB: 4KB pages, 4-way associative, 64 entries
64 byte prefetching.
Found unknown cache descriptors: 55 5a b2 ca e4

