From: Xiao Guangrong on
We found that if we enable the lock trace events, the 'sysbench mutex'
benchmark program runs quicker.

The attached simple program can reproduce it; the system info,
kernel config, and test script are also attached.

The test steps are below:

# tar -zxvf test-mutex.tar.bz
# cd test-mutex
# make
# ./tscript.sh >& log
# cat log | grep "real"
real 0m46.765s < all trace events are disabled >
real 0m47.073s
real 0m47.402s
real 0m46.458s
real 0m47.433s
real 0m47.395s
real 0m47.010s
real 0m47.454s
real 0m47.044s
real 0m47.464s
real 0m39.245s < enable lock's trace events >
real 0m40.822s
real 0m40.779s
real 0m40.549s
real 0m40.605s
real 0m40.923s
real 0m40.560s
real 0m41.050s
real 0m40.757s
real 0m40.715s

[ "< ... >" is my comments ]

From the results, we can see the program's runtime is lower when the
lock trace events are enabled.

The result is weird, but I don't know why.
From: Hitoshi Mitake on
On 03/11/10 18:43, Xiao Guangrong wrote:
> We found that if we enable the lock trace events, the 'sysbench mutex'
> benchmark program runs quicker.
>
> The attached simple program can reproduce it; the system info,
> kernel config, and test script are also attached.
>
> The test steps are below:
>
> # tar -zxvf test-mutex.tar.bz
> # cd test-mutex
> # make
> # ./tscript.sh >& log
> # cat log | grep "real"
> real 0m46.765s < all trace events are disabled >
> real 0m47.073s
> real 0m47.402s
> real 0m46.458s
> real 0m47.433s
> real 0m47.395s
> real 0m47.010s
> real 0m47.454s
> real 0m47.044s
> real 0m47.464s
> real 0m39.245s < enable lock's trace events >
> real 0m40.822s
> real 0m40.779s
> real 0m40.549s
> real 0m40.605s
> real 0m40.923s
> real 0m40.560s
> real 0m41.050s
> real 0m40.757s
> real 0m40.715s
>
> [ "< ...>" is my comments ]
>
> From the results, we can see the program's runtime is lower when the
> lock trace events are enabled.
>
> The result is weird, but I don't know why.

Hi Xiao,

It's hard to believe, but...

% sudo ./tscript.sh &> log
% grep real log
real 0m24.132s
real 0m23.535s
real 0m20.064s
real 0m16.636s <- enabled from here
real 0m16.435s
real 0m17.339s

I could reproduce your surprising result.
(I only ran your benchmark 3 times per case.)

I rewrote your mainc.c and counted the contended acquisitions in each
test like this:

    if (pthread_mutex_trylock(&mutex) == EBUSY) {
            pthread_mutex_lock(&mutex);
            atomic_inc(&contended);
    }
# I'll attach my new mainc.c

% cat log
Run mutex with trace events disabled...
contended:25191221

real 0m24.132s
user 0m17.149s
sys 1m18.933s
contended:25360563

real 0m23.535s
user 0m17.233s
sys 1m16.213s
contended:23813911

real 0m20.064s
user 0m15.561s
sys 1m4.332s
Run mutex with lockdep events enabled...
contended:11458318

real 0m16.636s
user 0m10.173s
sys 0m55.595s
contended:11881095

real 0m16.435s
user 0m10.273s
sys 0m54.911s
contended:11261650

real 0m17.339s
user 0m10.225s
sys 0m58.556s

It seems that the number of contentions decreased by about half.
I don't know why this happened or how it affects performance,
but this result is worth considering.

Thanks,
Hitoshi
From: Xiao Guangrong on
Hi Michel,

Michel Lespinasse wrote:
> Sorry for the late reply...
>
> One thing to consider in locking micro-benchmarks is that often, code
> changes that slow down parts of the contended code path where the lock
> is not held, will result in an increase of the reported
> micro-benchmark metric. This effect is particularly marked for
> micro-benchmarks that consist of multiple threads doing empty
> acquire/release loops.
>
> As a thought experiment, imagine what would happen if you added a
> one-millisecond sleep in the contended code path for mutex
> acquisition. Soon all but one of your benchmark threads would be
> sleeping, and the only non-sleeping thread would be able to spin on
> that lock/unlock loop with no contention, resulting in very nice
> results for the micro-benchmark. Remove the sleep and the lock/unlock
> threads will have to contend, resulting in lower reported performance
> metrics.

Many thanks for your valuable reply, which helps us see the
issue much more clearly.

I've done a test to address your conjecture, adding usleep(1) to the
mutex acquisition path; the result shows that contention is reduced.

I also did a test that does more work while holding the mutex; the
result shows that the optimization ratio decreases.

Thanks,
Xiao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/