From: Darren Hart on
And by V2 I meant V4....

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
From: Darren Hart on
Avi Kivity wrote:

>> > At 10%
>>> duty cycle you have 25 waiters behind the lock on average. I don't
>>> think this is realistic, and it means that spinning is invoked only
>>> rarely.
>>
>> Perhaps some instrumentation is in order; it seems to get invoked often
>> enough to achieve some 20% increase in lock/unlock iterations. Perhaps
>> another metric would be of more value, such as average wait time?
>
> Why measure an unrealistic workload?

No argument there, thus my proposal for an alternate configuration below.

>>> I'd be interested in seeing runs where the average number of waiters
>>> is 0.2, 0.5, 1, and 2, corresponding to moderate-to-bad contention.
>>> 25 average waiters on compute bound code means the application needs
>>> to be rewritten, no amount of mutex tweaking will help it.
>>
>> Perhaps something like NR_CPUS threads would be of more interest?
>
> That seems artificial.

How so? Several real world applications use one thread per CPU to
dispatch work to, wait for events, etc.

>
>> At 10% that's about 0.8, and at 25% it reaches the 2 at the top of your
>> range. I could add a few more duty-cycle points and make 25% the max.
>> I'll kick that off and post the results... probably tomorrow. 10M
>> iterations takes a while, but it makes the results relatively stable.
>
> Thanks. But why not vary the number of threads as well?

Absolutely, I don't disagree that all the variables should be varied in
order to get a complete picture. I'm starting with 8 threads - it takes
several hours to collect the data.
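
(For reference, the 0.8 and 2 in the quoted paragraph are just nthreads
times the duty cycle: 8 * 0.10 = 0.8 and 8 * 0.25 = 2.)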

>>> Does the wakeup code select the spinning waiter, or just a random
>>> waiter?
>>
>> The wakeup code selects the highest-priority task, in FIFO order, to
>> wake up - however, under contention it is most likely going to go back
>> to sleep, as another waiter will steal the lock out from under it. This
>> locking strategy is unashamedly about as "unfair" as it gets.
>
> Best to avoid the wakeup if we notice the lock was stolen.

You really can't do this precisely. You can read the futex value at
various points along the wakeup path, but at some point you have to
commit to waking a task, and there is still a race between the time you
call wake_up_task() and the time the woken task is scheduled and attempts
the cmpxchg itself.

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
From: Darren Hart on
Avi Kivity wrote:

>>> An interesting (but perhaps difficult to achieve) optimization would
>>> be to spin in userspace.
>>
>> I couldn't think of a lightweight way to determine in userspace when
>> the owner has been scheduled out; kernel assistance is required. You
>> could do this on the schedule() side of things, but I figured I'd get
>> some strong pushback if I tried to add a hook into descheduling that
>> flipped a bit in the futex value indicating the owner was about to be
>> descheduled. Still, that might be something to explore.
>
> In the futex value it's hopeless (since a thread can hold many locks),

It can, but there is a futex value per lock. If the task_struct had a
list of held futex locks (as it does for PI futex locks), the
deschedule() path could walk that list and set the FUTEX_OWNER_SLEEPING
bit in each.
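
Roughly what I have in mind, as a hand-wavy sketch only - the list and
the helper below don't exist, and writing to a user-space word from the
scheduler path (where we can't fault) is exactly the part that would
draw pushback:

#include <linux/list.h>
#include <linux/types.h>

/* Hypothetical: one entry per adaptive futex the task currently holds. */
struct held_futex {
        struct list_head list;
        u32 __user *uaddr;      /* the user-space futex word */
};

/*
 * Hypothetical deschedule-side hook: walk the outgoing task's list of
 * held futexes and flag each one.  set_futex_owner_sleeping() is a
 * placeholder for "set FUTEX_OWNER_SLEEPING in *uaddr somehow".
 */
static void futex_mark_owner_sleeping(struct list_head *held_futexes)
{
        struct held_futex *hf;

        list_for_each_entry(hf, held_futexes, list)
                set_futex_owner_sleeping(hf->uaddr);
}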


> but I don't think it's unreasonable to set a bit in the thread local
> storage area. The futex format would then need to be extended to
> contain a pointer to this bit.

This appears to be one bit per task instead of one bit per lock. Also,
the value is thread-specific... so how would a potential waiter determine
whether the owner of a particular lock is running or not with this
method? Maybe I'm missing some core bit about TLS... are you talking
about pthread_key_create() and pthread_getspecific()?
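
My best guess at what you mean, as a purely hypothetical sketch (quite
possibly not what you have in mind) - the lock word grows a pointer to a
per-thread flag, and the kernel keeps that flag current on (de)schedule:

#include <stdatomic.h>

/* Hypothetical layout, not an existing ABI. */
struct ext_futex {
        atomic_uint val;                        /* ordinary futex word */
        _Atomic(unsigned int *) owner_running;  /* flag in the owner's TLS */
};

/* One flag per thread; the kernel would set/clear it on (de)schedule. */
static __thread unsigned int this_thread_running = 1;

/* Owner side, called right after acquiring f->val. */
static void publish_owner(struct ext_futex *f)
{
        atomic_store(&f->owner_running, &this_thread_running);
}

/* Waiter side: spin only while the owner appears to be on a CPU,
 * otherwise fall back to FUTEX_WAIT (not shown). */
static int owner_appears_running(struct ext_futex *f)
{
        unsigned int *flag = atomic_load(&f->owner_running);

        return flag && *flag;
}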

Thanks,

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
From: Peter Zijlstra on
On Tue, 2010-04-06 at 00:15 +0300, Avi Kivity wrote:
>
> An interesting (but perhaps difficult to achieve) optimization would be
> to spin in userspace.

System call overhead isn't that large, but if you want to amortize it you
can do a very small spin in userspace and then take the syscall, something
like:

try
spin
try
syscall

Complicating anything beyond that is just not worth it.
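
Something like this minimal sketch, say - the spin count and the simple
0/1/2 locking protocol are only for illustration:

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_MAX 100    /* arbitrary small spin count */

static void futex_lock(atomic_uint *f)
{
        unsigned int zero = 0;

        /* try */
        if (atomic_compare_exchange_strong(f, &zero, 1))
                return;

        /* spin briefly in userspace */
        for (int i = 0; i < SPIN_MAX; i++) {
                zero = 0;
                /* try again */
                if (atomic_compare_exchange_strong(f, &zero, 1))
                        return;
        }

        /* take the syscall: mark contended (2) and sleep in the kernel */
        while (atomic_exchange(f, 2) != 0)
                syscall(SYS_futex, f, FUTEX_WAIT, 2, NULL, NULL, 0);
}

static void futex_unlock(atomic_uint *f)
{
        /* only wake if someone may be sleeping */
        if (atomic_exchange(f, 0) == 2)
                syscall(SYS_futex, f, FUTEX_WAKE, 1, NULL, NULL, 0);
}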

From: Avi Kivity on
On 04/06/2010 01:59 AM, Darren Hart wrote:
>
>
>>>> I'd be interested in seeing runs where the average number of
>>>> waiters is 0.2, 0.5, 1, and 2, corresponding to moderate-to-bad
>>>> contention.
>>>> 25 average waiters on compute bound code means the application
>>>> needs to be rewritten, no amount of mutex tweaking will help it.
>>>
>>> Perhaps something like NR_CPUS threads would be of more interest?
>>
>> That seems artificial.
>
> How so? Several real world applications use one thread per CPU to
> dispatch work to, wait for events, etc.

Yes, but that's the best case for spinning. You could simply use a
userspace spinlock in this case.
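
For example, something as dumb as this (illustrative only, and only sane
when the threads never oversubscribe the CPUs):

#include <stdatomic.h>

/* A pure userspace spinlock: no syscalls, no sleeping, ever. */
struct uspinlock {
        atomic_flag locked;     /* initialize with ATOMIC_FLAG_INIT */
};

static void uspin_lock(struct uspinlock *l)
{
        while (atomic_flag_test_and_set_explicit(&l->locked,
                                                 memory_order_acquire))
                ;       /* burn cycles until the holder releases it */
}

static void uspin_unlock(struct uspinlock *l)
{
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
}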


>
>>>> Does the wakeup code select the spinning waiter, or just a random
>>>> waiter?
>>>
>>> The wakeup code selects the highest-priority task, in FIFO order, to
>>> wake up - however, under contention it is most likely going to go
>>> back to sleep, as another waiter will steal the lock out from under
>>> it. This locking strategy is unashamedly about as "unfair" as it gets.
>>
>> Best to avoid the wakeup if we notice the lock was stolen.
>
> You really can't do this precisely. You can read the futex value at
> various points along the wakeup path, but at some point you have to
> commit to waking a task, and there is still a race between the time
> you call wake_up_task() and the time the woken task is scheduled and
> attempts the cmpxchg itself.
>

The race is not a problem, as it's just an optimization. If you lose it
you get a spurious wakeup; if you win, you save some CPU cycles.


--
error compiling committee.c: too many arguments to function
