hackbench regression due to commit 9dfc6e68bfe6e [Kernel]

Prev: nvidia controller failed command, possibly related to SMART selftest (2.6.32)
Next: powernow-k8: Core Performance Boost and effective frequency support

From: Zhang, Yanmin on 7 Apr 2010 05:10

On Wed, 2010-04-07 at 08:39 +0200, Eric Dumazet wrote:
> Le mercredi 07 avril 2010 � 10:34 +0800, Zhang, Yanmin a �crit :
>
> > I collected retired instruction, dtlb miss and LLC miss.
> > Below is data of LLC miss.
> >
> > Kernel 2.6.33:
> > # Samples: 11639436896 LLC-load-misses
> > #
> > # Overhead Command Shared Object Symbol
> > # ........ ............... ...................................................... ......
> > #
> > 20.94% hackbench [kernel.kallsyms] [k] copy_user_generic_string
> > 14.56% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
> > 12.88% hackbench [kernel.kallsyms] [k] kfree
> > 7.37% hackbench [kernel.kallsyms] [k] kmem_cache_free
> > 7.18% hackbench [kernel.kallsyms] [k] kmem_cache_alloc_node
> > 6.78% hackbench [kernel.kallsyms] [k] kfree_skb
> > 6.27% hackbench [kernel.kallsyms] [k] __kmalloc_node_track_caller
> > 2.73% hackbench [kernel.kallsyms] [k] __slab_free
> > 2.21% hackbench [kernel.kallsyms] [k] get_partial_node
> > 2.01% hackbench [kernel.kallsyms] [k] _raw_spin_lock
> > 1.59% hackbench [kernel.kallsyms] [k] schedule
> > 1.27% hackbench hackbench [.] receiver
> > 0.99% hackbench libpthread-2.9.so [.] __read
> > 0.87% hackbench [kernel.kallsyms] [k] unix_stream_sendmsg
> >
> >
> >
> >
> > Kernel 2.6.34-rc3:
> > # Samples: 13079611308 LLC-load-misses
> > #
> > # Overhead Command Shared Object Symbol
> > # ........ ............... .................................................................... ......
> > #
> > 18.55% hackbench [kernel.kallsyms] [k] copy_user_generic_str
> > ing
> > 13.19% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
> > 11.62% hackbench [kernel.kallsyms] [k] kfree
> > 8.54% hackbench [kernel.kallsyms] [k] kmem_cache_free
> > 7.88% hackbench [kernel.kallsyms] [k] __kmalloc_node_track_
> > caller
> > 6.54% hackbench [kernel.kallsyms] [k] kmem_cache_alloc_node
> > 5.94% hackbench [kernel.kallsyms] [k] kfree_skb
> > 3.48% hackbench [kernel.kallsyms] [k] __slab_free
> > 2.15% hackbench [kernel.kallsyms] [k] _raw_spin_lock
> > 1.83% hackbench [kernel.kallsyms] [k] schedule
> > 1.82% hackbench [kernel.kallsyms] [k] get_partial_node
> > 1.59% hackbench hackbench [.] receiver
> > 1.37% hackbench libpthread-2.9.so [.] __read
> >
> >
>
> Please check values of /proc/sys/net/core/rmem_default
> and /proc/sys/net/core/wmem_default on your machines.
>
> Their values can also change hackbench results, because increasing
> wmem_default allows af_unix senders to consume much more skbs and stress
> slab allocators (__slab_free), way beyond slub_min_order can tune them.
>
> When 2000 senders are running (and 2000 receivers), we might consume
> something like 2000 * 100.000 bytes of kernel memory for skbs. TLB
> trashing is expected, because all these skbs can span many 2MB pages.
> Maybe some node imbalance happens too.
It's a good pointer. rmem_default and wmem_default are about 116k on my machine.
I changed them to 52K and it seems there is no improvement.

>
>
>
> You could try to boot your machine with less ram per node and check :
>
> # cat /proc/buddyinfo
> Node 0, zone DMA 2 1 2 2 1 1 1 0 1 1 3
> Node 0, zone DMA32 219 298 143 584 145 57 44 41 31 26 517
> Node 1, zone DMA32 4 1 17 1 0 3 2 2 2 2 123
> Node 1, zone Normal 126 169 83 8 7 5 59 59 49 28 459
>
>
> One experiment on your Nehalem machine would be to change hackbench so
> that each group (20 senders/ 20 receivers) run on a particular NUMA
> node.
I expect process scheduler to work well in scheduling different groups
to different nodes.

I suspected dynamic percpu data didn't take care of NUMA, but kernel dump shows
it does take care of NUMA.

>
> x86info -c ->
>
> CPU #1
> EFamily: 0 EModel: 1 Family: 6 Model: 26 Stepping: 5
> CPU Model: Core i7 (Nehalem)
> Processor name string: Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
> Type: 0 (Original OEM) Brand: 0 (Unsupported)
> Number of cores per physical package=8
> Number of logical processors per socket=16
> Number of logical processors per core=2
> APIC ID: 0x10 Package: 0 Core: 1 SMT ID 0
> Cache info
> L1 Instruction cache: 32KB, 4-way associative. 64 byte line size.
> L1 Data cache: 32KB, 8-way associative. 64 byte line size.
> L2 (MLC): 256KB, 8-way associative. 64 byte line size.
> TLB info
> Data TLB: 4KB pages, 4-way associative, 64 entries
> 64 byte prefetching.
> Found unknown cache descriptors: 55 5a b2 ca e4
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Eric Dumazet on 7 Apr 2010 05:30

Le mercredi 07 avril 2010 à 17:07 +0800, Zhang, Yanmin a écrit :
> >
> > One experiment on your Nehalem machine would be to change hackbench so
> > that each group (20 senders/ 20 receivers) run on a particular NUMA
> > node.
> I expect process scheduler to work well in scheduling different groups
> to different nodes.
>
> I suspected dynamic percpu data didn't take care of NUMA, but kernel dump shows
> it does take care of NUMA.
>

hackbench allocates all unix sockets on one single node, then
forks/spans its children.

Thats huge node imbalance.

You can see this with lsof on a running hackbench :

# lsof -p 14802
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
hackbench 14802 root cwd DIR 104,7 4096 12927240 /data/src/linux-2.6
hackbench 14802 root rtd DIR 104,2 4096 2 /
hackbench 14802 root txt REG 104,2 17524 697317 /usr/bin/hackbench
hackbench 14802 root mem REG 104,2 112212 558042 /lib/ld-2.3.4.so
hackbench 14802 root mem REG 104,2 1547588 558043 /lib/tls/libc-2.3.4.so
hackbench 14802 root mem REG 104,2 107928 557058 /lib/tls/libpthread-2.3.4.so
hackbench 14802 root mem REG 0,0 0 [heap] (stat: No such file or directory)
hackbench 14802 root 0u CHR 136,0 3 /dev/pts/0
hackbench 14802 root 1u CHR 136,0 3 /dev/pts/0
hackbench 14802 root 2u CHR 136,0 3 /dev/pts/0
hackbench 14802 root 3u unix 0xffff8800ac0da100 28939 socket
hackbench 14802 root 4u unix 0xffff8800ac0da400 28940 socket
hackbench 14802 root 5u unix 0xffff8800ac0da700 28941 socket
hackbench 14802 root 6u unix 0xffff8800ac0daa00 28942 socket
hackbench 14802 root 8u unix 0xffff8800aeac1800 28984 socket
hackbench 14802 root 9u unix 0xffff8800aeac1e00 28986 socket
hackbench 14802 root 10u unix 0xffff8800aeac2400 28988 socket
hackbench 14802 root 11u unix 0xffff8800aeac2a00 28990 socket
hackbench 14802 root 12u unix 0xffff8800aeac3000 28992 socket
hackbench 14802 root 13u unix 0xffff8800aeac3600 28994 socket
hackbench 14802 root 14u unix 0xffff8800aeac3c00 28996 socket
hackbench 14802 root 15u unix 0xffff8800aeac4200 28998 socket
hackbench 14802 root 16u unix 0xffff8800aeac4800 29000 socket
hackbench 14802 root 17u unix 0xffff8800aeac4e00 29002 socket
hackbench 14802 root 18u unix 0xffff8800aeac5400 29004 socket
hackbench 14802 root 19u unix 0xffff8800aeac5a00 29006 socket
hackbench 14802 root 20u unix 0xffff8800aeac6000 29008 socket
hackbench 14802 root 21u unix 0xffff8800aeac6600 29010 socket
hackbench 14802 root 22u unix 0xffff8800aeac6c00 29012 socket
hackbench 14802 root 23u unix 0xffff8800aeac7200 29014 socket
hackbench 14802 root 24u unix 0xffff8800aeac0f00 29016 socket
hackbench 14802 root 25u unix 0xffff8800aeac0900 29018 socket
hackbench 14802 root 26u unix 0xffff8800aeac7b00 29020 socket
hackbench 14802 root 27u unix 0xffff8800aeac7500 29022 socket

All sockets structures (where all _hot_ locks reside) are on a single node.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Pekka Enberg on 7 Apr 2010 06:50

Zhang, Yanmin kirjoitti:
> Kernel 2.6.34-rc3:
> # Samples: 13079611308 LLC-load-misses
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... .................................................................... ......
> #
> 18.55% hackbench [kernel.kallsyms] [k] copy_user_generic_str
> ing
> 13.19% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
> 11.62% hackbench [kernel.kallsyms] [k] kfree
> 8.54% hackbench [kernel.kallsyms] [k] kmem_cache_free
> 7.88% hackbench [kernel.kallsyms] [k] __kmalloc_node_track_
> caller
> 6.54% hackbench [kernel.kallsyms] [k] kmem_cache_alloc_node
> 5.94% hackbench [kernel.kallsyms] [k] kfree_skb
> 3.48% hackbench [kernel.kallsyms] [k] __slab_free
> 2.15% hackbench [kernel.kallsyms] [k] _raw_spin_lock
> 1.83% hackbench [kernel.kallsyms] [k] schedule
> 1.82% hackbench [kernel.kallsyms] [k] get_partial_node
> 1.59% hackbench hackbench [.] receiver
> 1.37% hackbench libpthread-2.9.so [.] __read

Btw, you might want to try out "perf record -g" and "perf report
--callchain fractal,5" to get a better view of where we're spending
time. Perhaps you can spot the difference with that more easily.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Christoph Lameter on 7 Apr 2010 12:40

On Wed, 7 Apr 2010, Zhang, Yanmin wrote:

> > booting with slub_min_order=3 do change hackbench results for example ;)
> By default, slub_min_order=3 on my Nehalem machines. I also tried different
> larger slub_min_order and didn't find help.

Lets stop fiddling with kernel command line parameters for these test.
Leave as default. That is how I tested.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Christoph Lameter on 7 Apr 2010 12:50

On Wed, 7 Apr 2010, Zhang, Yanmin wrote:

> I collected retired instruction, dtlb miss and LLC miss.
> Below is data of LLC miss.
>
> Kernel 2.6.33:
> 20.94% hackbench [kernel.kallsyms] [k] copy_user_generic_string
> 14.56% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
> 12.88% hackbench [kernel.kallsyms] [k] kfree
> 7.37% hackbench [kernel.kallsyms] [k] kmem_cache_free
> 7.18% hackbench [kernel.kallsyms] [k] kmem_cache_alloc_node
> 6.78% hackbench [kernel.kallsyms] [k] kfree_skb
> 6.27% hackbench [kernel.kallsyms] [k] __kmalloc_node_track_caller
> 2.73% hackbench [kernel.kallsyms] [k] __slab_free
> 2.21% hackbench [kernel.kallsyms] [k] get_partial_node
> 2.01% hackbench [kernel.kallsyms] [k] _raw_spin_lock
> 1.59% hackbench [kernel.kallsyms] [k] schedule
> 1.27% hackbench hackbench [.] receiver
> 0.99% hackbench libpthread-2.9.so [.] __read
> 0.87% hackbench [kernel.kallsyms] [k] unix_stream_sendmsg
>
> Kernel 2.6.34-rc3:
> 18.55% hackbench [kernel.kallsyms] [k] copy_user_generic_str
> ing
> 13.19% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
> 11.62% hackbench [kernel.kallsyms] [k] kfree
> 8.54% hackbench [kernel.kallsyms] [k] kmem_cache_free
> 7.88% hackbench [kernel.kallsyms] [k] __kmalloc_node_track_
> caller

Seems that the overhead of __kmalloc_node_track_caller was increased. The
function inlines slab_alloc().

> 6.54% hackbench [kernel.kallsyms] [k] kmem_cache_alloc_node
> 5.94% hackbench [kernel.kallsyms] [k] kfree_skb
> 3.48% hackbench [kernel.kallsyms] [k] __slab_free
> 2.15% hackbench [kernel.kallsyms] [k] _raw_spin_lock
> 1.83% hackbench [kernel.kallsyms] [k] schedule
> 1.82% hackbench [kernel.kallsyms] [k] get_partial_node
> 1.59% hackbench hackbench [.] receiver
> 1.37% hackbench libpthread-2.9.so [.] __read

I wonder if this is not related to the kmem_cache_cpu structure straggling
cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
structure was larger and therefore tight packing resulted in different
alignment.

Could you see how the following patch affects the results. It attempts to
increase the size of kmem_cache_cpu to a power of 2 bytes. There is also
the potential that other per cpu fetches to neighboring objects affect the
situation. We could cacheline align the whole thing.

---
include/linux/slub_def.h | 5 +++++
1 file changed, 5 insertions(+)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2010-04-07 11:33:50.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2010-04-07 11:35:18.000000000 -0500
@@ -38,6 +38,11 @@ struct kmem_cache_cpu {
void **freelist; /* Pointer to first free per cpu object */
struct page *page; /* The slab from which we are allocating */
int node; /* The node of the page (or -1 for debug) */
+#ifndef CONFIG_64BIT
+ int dummy1;
+#endif
+ unsigned long dummy2;
+
#ifdef CONFIG_SLUB_STATS
unsigned stat[NR_SLUB_STAT_ITEMS];
#endif
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10 11
Prev: nvidia controller failed command, possibly related to SMART selftest (2.6.32)
Next: powernow-k8: Core Performance Boost and effective frequency support