From: john cooper on
Darren Hart wrote:
> john cooper wrote:
>> But here we're otherwise completely
>> open to indiscriminate scheduling preemption even though
>> we may be holding a userland lock.
>
> That's true with any userland lock.

Agreed. However, from your earlier mail it seemed that
addressing this scenario was within the scope of
consideration.

There are two ways to deal with this condition: reactively,
in the sense that we act after the lock holder has been
preempted and we subsequently find ourselves spinning in
sibling thread context attempting to acquire the lock; or
proactively, where we provide a time-bounded deferral of
lock holder preemption, on the assumption that the lock
hold path's overhead has a negligible effect on deferring a
potentially coincident scheduling operation.
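
To make the reactive flavor concrete, a minimal userland
sketch follows; owner_is_running() and futex_block() are
hypothetical placeholders for the scheduler hint and kernel
fallback, not anything proposed here:

#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helpers, named only for this sketch. */
extern bool owner_is_running(int owner_tid);          /* scheduler disposition hint */
extern void futex_block(atomic_int *futex, int val);  /* block in the kernel */

/*
 * Reactive adaptive lock: spin only while the current owner appears
 * to be on a CPU; once the owner has been preempted, stop burning
 * cycles in the sibling thread and block in the kernel instead.
 */
static void reactive_adaptive_lock(atomic_int *futex, int my_tid)
{
        for (;;) {
                int owner = 0;
                if (atomic_compare_exchange_strong(futex, &owner, my_tid))
                        return;                 /* acquired uncontended */
                if (owner_is_running(owner))
                        continue;               /* owner on CPU: keep spinning */
                futex_block(futex, owner);      /* owner preempted: block */
        }
}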

It is fairly straightforward to demonstrate the impact on
performance with a focused micro benchmark; it is less so
for a more "typical" application, where the effect is
diluted among other application activity.

The two approaches are complementary, with differing
system-wide tradeoffs. Both require some insight into
the scheduling disposition of the lock holder, the
preemption deferral mechanism more so. If a scheme to
expose scheduler state transitions to (or cooperate
with) userland locking primitives is being considered,
it seems opportune to also consider support for this
scenario.
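
Purely as an illustration of the proactive variant, consider
a hypothetical per-thread word, registered with the kernel by
some unspecified means, which asks the scheduler to briefly
defer preemption while a short userland critical section is
held. No such interface exists today; defer_preempt and its
bounded-deferral semantics are assumptions of this sketch only:

#include <stdatomic.h>

/* Hypothetical per-thread word the kernel would honor for a bounded time. */
static __thread atomic_int defer_preempt;

static void lock_with_deferral(atomic_int *futex)
{
        atomic_store(&defer_preempt, 1);        /* hint: entering a short critical section */
        while (atomic_exchange(futex, 1))       /* simple test-and-set lock for the sketch */
                ;                               /* contention handling elided */
}

static void unlock_with_deferral(atomic_int *futex)
{
        atomic_store(futex, 0);
        atomic_store(&defer_preempt, 0);        /* normal preemption again */
}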

-john

--
john.cooper(a)redhat.com
From: Darren Hart on
dvhltc(a)us.ibm.com wrote:

> Now that an advantage can be shown using FUTEX_LOCK_ADAPTIVE over FUTEX_LOCK,
> the next steps as I see them are:
>
> o Try and show improvement of FUTEX_LOCK_ADAPTIVE over FUTEX_WAIT based
> implementations (pthread_mutex specifically).

I've spent a bit of time on this, and made huge improvements through
some simple optimizations of the testcase lock/unlock routines. I'll be
away for a few days and wanted to let people know where things stand
with FUTEX_LOCK_ADAPTIVE.

I ran all the tests with the following options:
-i 1000000 -p 1000 -d 20
where:
-i iterations
-p period (in instructions)
-d duty cycle (in percent)
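
Roughly, those options drive a loop of the following shape;
this is a sketch only (a plain pthread mutex stands in for
whichever mechanism is under test, and the real futextest
harness may differ in detail):

#include <pthread.h>

/*
 * Sketch of the measured loop: each iteration holds the lock for
 * duty_cycle percent of "period" (counted in dummy instructions)
 * and runs unlocked for the remainder. Illustrative only.
 */
static void locktest(pthread_mutex_t *lock, long iterations,
                     int period, int duty_cycle)
{
        int hold = period * duty_cycle / 100;   /* instructions inside the lock */
        int rest = period - hold;               /* instructions outside the lock */

        for (long i = 0; i < iterations; i++) {
                pthread_mutex_lock(lock);
                for (volatile int j = 0; j < hold; j++)
                        ;                       /* simulated critical-section work */
                pthread_mutex_unlock(lock);
                for (volatile int j = 0; j < rest; j++)
                        ;                       /* simulated non-critical work */
        }
}

With the options above (-i 1000000 -p 1000 -d 20) that
corresponds to locktest(&lock, 1000000, 1000, 20): a 200
instruction hold followed by 800 instructions outside the
lock, one million times.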

MECHANISM KITERS/SEC
----------------------------------
pthread_mutex_adaptive 1562
FUTEX_LOCK_ADAPTIVE 1190
pthread_mutex 1010
FUTEX_LOCK 532


I took some perf data while running each of the above tests as well. Any
thoughts on getting more out of perf are appreciated; this is my first
pass at it. I recorded with "perf record -fg", and snippets of "perf
report" follow:

FUTEX_LOCK (not adaptive) spends a lot of time spinning on the futex
hashbucket lock.
# Overhead Command Shared Object Symbol
# ........ .......... .................. ......
#
40.76% futex_lock [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--62.16%-- do_futex
| sys_futex
| system_call_fastpath
| syscall
|
|--31.05%-- futex_wake
| do_futex
| sys_futex
| system_call_fastpath
| syscall
...
14.98% futex_lock futex_lock [.] locktest
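
Both call chains above funnel through the spinlock of the futex
hash bucket that uaddr hashes to. A simplified kernel-side
sketch of that pattern, loosely modeled on kernel/futex.c (the
hash function, field names, and table size here are
illustrative, not the actual code):

struct futex_hash_bucket {
        spinlock_t lock;
        struct plist_head chain;        /* waiters queued on futexes hashing here */
};

static struct futex_hash_bucket futex_queues[1 << 8];   /* illustrative size */

static struct futex_hash_bucket *hash_bucket(u32 __user *uaddr)
{
        return &futex_queues[hash_ptr(uaddr, 8)];       /* hash of the futex address */
}

static void futex_op_slowpath(u32 __user *uaddr)
{
        struct futex_hash_bucket *hb = hash_bucket(uaddr);

        spin_lock(&hb->lock);   /* the _raw_spin_lock hits in the profile above */
        /* ... enqueue a waiter, or walk the chain and wake waiters ... */
        spin_unlock(&hb->lock);
}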


FUTEX_LOCK_ADAPTIVE spends much of its time in the test loop itself,
followed by the actual adaptive loop in the kernel. It appears much of
our savings over FUTEX_LOCK comes from not contending on the hashbucket
lock.
# Overhead Command Shared Object Symbol
# ........ .......... .................. ......
#
36.07% futex_lock futex_lock [.] locktest
|
--- locktest
|
--100.00%-- 0x400e7000000000

9.12% futex_lock perf [.] 0x00000000000eee
...
8.26% futex_lock [kernel.kallsyms] [k] futex_spin_on_owner
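
The futex_spin_on_owner time above is presumably a loop in the
spirit of the kernel's mutex_spin_on_owner(): keep spinning
only while the lock owner is actually running on a CPU. A
rough kernel-side sketch of that pattern, where
futex_owner_unchanged() and task_is_running_on_cpu() are
placeholder checks rather than the patch's real code:

static int futex_spin_on_owner_sketch(u32 __user *uaddr, struct task_struct *owner)
{
        while (futex_owner_unchanged(uaddr, owner)) {   /* placeholder: owner still holds it */
                if (!task_is_running_on_cpu(owner) || need_resched())
                        return 0;       /* owner preempted, or we should yield: stop spinning */
                cpu_relax();            /* be easy on the sibling hardware thread */
        }
        return 1;                       /* owner dropped the futex while we spun */
}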


Pthread Mutex Adaptive spends most of its time in the glibc heuristic
spinning, as expected, followed by the test loop itself. An impressively
minimal 3.35% is spent on the hashbucket lock.
# Overhead Command Shared Object Symbol
# ........ ............... ........................ ......
#
47.88% pthread_mutex_2 libpthread-2.5.so [.] __pthread_mutex_lock_internal
|
--- __pthread_mutex_lock_internal

22.78% pthread_mutex_2 pthread_mutex_2 [.] locktest
...
15.16% pthread_mutex_2 perf [.] ...
...
3.35% pthread_mutex_2 [kernel.kallsyms] [k] _raw_spin_lock
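
For reference, the heuristic behind that
__pthread_mutex_lock_internal time is roughly the following
(paraphrased from the PTHREAD_MUTEX_ADAPTIVE_NP case in
nptl/pthread_mutex_lock.c; the lll_* helper names and the
MAX_ADAPTIVE_COUNT value here are stand-ins rather than the
exact glibc source):

#include <pthread.h>

/* Stand-ins for glibc's LLL_MUTEX_TRYLOCK / LLL_MUTEX_LOCK / BUSY_WAIT_NOP. */
extern int  lll_mutex_trylock(pthread_mutex_t *m);
extern void lll_mutex_lock(pthread_mutex_t *m);
extern void busy_wait_nop(void);

#define MAX_ADAPTIVE_COUNT 100  /* illustrative upper bound on the spin */

static void adaptive_mutex_lock_sketch(pthread_mutex_t *mutex)
{
        if (lll_mutex_trylock(mutex) == 0)
                return;                         /* uncontended fast path */

        int cnt = 0;
        int max_cnt = MAX_ADAPTIVE_COUNT < mutex->__data.__spins * 2 + 10
                      ? MAX_ADAPTIVE_COUNT : mutex->__data.__spins * 2 + 10;

        do {
                if (cnt++ >= max_cnt) {
                        lll_mutex_lock(mutex);  /* give up and block on the futex */
                        break;
                }
                busy_wait_nop();                /* pause-style spin */
        } while (lll_mutex_trylock(mutex) != 0);

        /* Nudge the per-mutex spin estimate toward this attempt's cost. */
        mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
}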


Pthread Mutex (not adaptive) spends much of its time on the hashbucket
lock as expected, followed by the test loop.
33.89% pthread_mutex_2 [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--56.90%-- futex_wake
| do_futex
| sys_futex
| system_call_fastpath
| __lll_unlock_wake
|
|--28.95%-- futex_wait_setup
| futex_wait
| do_futex
| sys_futex
| system_call_fastpath
| __lll_lock_wait
...
16.60% pthread_mutex_2 pthread_mutex_2 [.] locktest


These results mostly confirm the expected: the adaptive versions spend
more time in their spin loops and less time contending for hashbucket
locks, while the non-adaptive versions take the hashbucket lock more
often and therefore show more contention there.

I believe I should be able to get the plain FUTEX_LOCK implementation to
be much closer in performance to the plain pthread mutex version. I
expect much of the work done to benefit FUTEX_LOCK will also benefit
FUTEX_LOCK_ADAPTIVE. If that's true, and I can make a significant
improvement to FUTEX_LOCK, it wouldn't take much to get
FUTEX_LOCK_ADAPTIVE to beat the heuristic spinning in glibc.

It could also be that this synthetic benchmark is an ideal situation for
glibc's heuristics, and a more realistic load with varying lock hold
times wouldn't favor the adaptive pthread mutex over FUTEX_LOCK_ADAPTIVE
by such a large margin.

More next week.

Thanks,

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team