From: David Rientjes
On Fri, 9 Jul 2010, Christoph Lameter wrote:

> SLUB+Q also wins against SLAB in netperf:
>
> Script:
>
> #!/bin/bash
>
> TIME=60 # seconds
> HOSTNAME=localhost # netserver
>
> NR_CPUS=$(grep ^processor /proc/cpuinfo | wc -l)
> echo NR_CPUS=$NR_CPUS
>
> run_netperf() {
>     for i in $(seq 1 $1); do
>         netperf -H $HOSTNAME -t TCP_RR -l $TIME &
>     done
> }
>
> ITERATIONS=0
> while [ $ITERATIONS -lt 12 ]; do
>     RATE=0
>     ITERATIONS=$[$ITERATIONS + 1]
>     THREADS=$[$NR_CPUS * $ITERATIONS]
>     # column 6 of the netperf TCP_RR result line is the transaction rate
>     RESULTS=$(run_netperf $THREADS | grep -v '[a-zA-Z]' | awk '{ print $6 }')
>
>     # sum the per-instance rates, truncated to integers
>     for j in $RESULTS; do
>         RATE=$[$RATE + ${j/.*}]
>     done
>     echo threads=$THREADS rate=$RATE
> done
>
>
> Dell Dual Quad Penryn on Linux 2.6.35-rc4
>
> Loop counts: Larger is better.
>
> Threads SLAB SLUB+Q %
> 8 690869 714788 + 3.4
> 16 680295 711771 + 4.6
> 24 672677 703014 + 4.5
> 32 676780 703914 + 4.0
> 40 668458 699806 + 4.6
> 48 667017 698908 + 4.7
> 56 671227 696034 + 3.6
> 64 667956 696913 + 4.3
> 72 668332 694931 + 3.9
> 80 667073 695658 + 4.2
> 88 682866 697077 + 2.0
> 96 668089 694719 + 3.9
>

I see you're using my script for collecting netperf TCP_RR benchmark data;
thanks very much for looking into this workload for slab allocator
performance!

There are a couple of differences, however, between how you're using it
and how I showed the initial regression between slab and slub: you're
using localhost for your netserver, which isn't representative of a real
networking round-robin workload, and you're using a smaller system with
eight cores. We never measured a _significant_ performance problem with
slub compared to slab on four or eight cores; the problem only emerges
on larger systems.
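
For reference, a remote run only differs in where netserver lives; a rough
sketch, with the server name below used only as a placeholder:

  # on the server machine: start the netperf daemon (TCP port 12865 by default)
  netserver

  # on the client machine: single-stream sanity check against the server
  netperf -H netperf-server.example.com -t TCP_RR -l 60

  # then set HOSTNAME=netperf-server.example.com in the script above instead
  # of localhost and run it unchanged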

When running this patchset on two machines (client and server, running
netperf-2.4.5), each with four 2.2GHz quad-core AMD processors and 64GB of
memory, here are the results:

threads SLAB SLUB+Q diff
16 205580 179109 -12.9%
32 264024 215613 -18.3%
48 286175 237036 -17.2%
64 305309 253222 -17.1%
80 308248 243848 -20.9%
96 299845 243848 -18.7%
112 305560 259427 -15.1%
128 312668 263803 -15.6%
144 329671 271335 -17.7%
160 318737 280290 -12.1%
176 325295 287918 -11.5%
192 333356 287995 -13.6%

If you'd like to add statistics to your patchset that are enabled with
CONFIG_SLUB_STATS, I'd be happy to run it on this setup and collect more
data for you.
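
For reference, the event counters SLUB already has behind CONFIG_SLUB_STATS
can be read straight from sysfs once the option is enabled; a rough sketch,
with the cache name used only as an example:

  # per-cache event counters, one file per event (totals plus per-cpu counts)
  for f in alloc_fastpath alloc_slowpath free_fastpath free_slowpath; do
      echo -n "$f: "
      cat /sys/kernel/slab/kmalloc-256/$f
  done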

From: David Rientjes
On Fri, 9 Jul 2010, Christoph Lameter wrote:

> The following patchset cleans up some pieces and then equips SLUB with
> per-cpu queues that work similarly to SLAB's queues.

Pekka, I think patches 4-8 could be applied to your tree now; they're
relatively unchanged from what's been posted before. (I didn't ack patch
9 because I think it makes slab_lock() -> slab_unlock() matching more
difficult with little win, but I don't feel strongly about it.)

I'd also consider patch 7 for 2.6.35-rc6 (and -stable).

From: Christoph Lameter
On Wed, 14 Jul 2010, David Rientjes wrote:

> There are a couple of differences, however, between how you're using it
> and how I showed the initial regression between slab and slub: you're
> using localhost for your netserver, which isn't representative of a real
> networking round-robin workload, and you're using a smaller system with
> eight cores. We never measured a _significant_ performance problem with
> slub compared to slab on four or eight cores; the problem only emerges
> on larger systems.

Larger systems would need more NUMA support than is present in the current
patches.

> When running this patchset on two machines (client and server, running
> netperf-2.4.5), each with four 2.2GHz quad-core AMD processors and 64GB
> of memory, here are the results:

What is their NUMA topology? I don't have anything beyond two nodes here.


From: David Rientjes
On Thu, 15 Jul 2010, Christoph Lameter wrote:

> > When running this patchset on two machines (client and server, running
> > netperf-2.4.5), each with four 2.2GHz quad-core AMD processors and 64GB
> > of memory, here are the results:
>
> What is their NUMA topology? I don't have anything beyond two nodes here.
>

These two machines happen to have four 16GB nodes with non-uniform
distances:

# cat /sys/devices/system/node/node*/distance
10 20 20 30
20 10 20 20
20 20 10 20
30 20 20 10
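
The same distance matrix (along with per-node cpus and memory sizes) can
also be read with numactl, assuming it is installed:

  numactl --hardware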

From: Pekka Enberg
David Rientjes wrote:
> On Fri, 9 Jul 2010, Christoph Lameter wrote:
>
>> The following patchset cleans up some pieces and then equips SLUB with
>> per-cpu queues that work similarly to SLAB's queues.
>
> Pekka, I think patches 4-8 could be applied to your tree now; they're
> relatively unchanged from what's been posted before. (I didn't ack patch
> 9 because I think it makes slab_lock() -> slab_unlock() matching more
> difficult with little win, but I don't feel strongly about it.)

Yup, I applied 4-8. Thanks guys!

> I'd also consider patch 7 for 2.6.35-rc6 (and -stable).

It's an obvious bug fix, but is it triggered in practice? Is there a
bugzilla report for that?