From: Mike Galbraith on
On Mon, 2010-07-12 at 07:45 -0400, Steven Rostedt wrote:
> On Sun, 2010-07-11 at 15:33 +0200, Mike Galbraith wrote:
> > On Sat, 2010-07-10 at 21:41 +0200, Mike Galbraith wrote:
>
> > diff --git a/kernel/futex.c b/kernel/futex.c
> > index a6cec32..ef489f3 100644
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -2255,7 +2255,14 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
> >  	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
> >  	futex_wait_queue_me(hb, &q, to);
> >
> > -	spin_lock(&hb->lock);
> > +	/*
> > +	 * Non-blocking synchronization point with futex_requeue().
> > +	 *
> > +	 * We dare not block here because this will alter PI state, possibly
> > +	 * before our waker finishes modifying same in wakeup_next_waiter().
> > +	 */
> > +	while(!spin_trylock(&hb->lock))
> > +		cpu_relax();
>
> I agree that this would work. But I wonder if this should have an:
>
> #ifdef PREEMPT_RT
> [...]
> #else
> spin_lock(&hb->lock);
> #endif
>
> around it. Or encapsulate this lock in a macro that does the same thing
> (just to keep the actual code cleaner)

Yeah, it should. I'll wait to see what Darren/others say about holding
the wakee's pi_lock across wakeup to plug it. If he submits something
along that line, I can bin this.
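For illustration only, a wrapper along the lines Steven suggests might look
roughly like this. Nothing below is from a posted patch; the helper name is
made up and the config symbol is assumed to be CONFIG_PREEMPT_RT:

	/*
	 * On -rt the hb->lock spinlock is a sleeping lock, so spin
	 * non-blockingly there to avoid touching our own PI state;
	 * mainline keeps the plain spin_lock().
	 */
	#ifdef CONFIG_PREEMPT_RT
	# define hb_lock_nosleep(hb)				\
		do {						\
			while (!spin_trylock(&(hb)->lock))	\
				cpu_relax();			\
		} while (0)
	#else
	# define hb_lock_nosleep(hb)	spin_lock(&(hb)->lock)
	#endif

The call site in futex_wait_requeue_pi() would then be a single
hb_lock_nosleep(hb), paired with the existing spin_unlock(&hb->lock).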

-Mike

From: Thomas Gleixner on
On Fri, 9 Jul 2010, Darren Hart wrote:

> The requeue_pi mechanism introduced proxy locking of the rtmutex. This creates
> a scenario where a task can wake-up, not knowing it has been enqueued on an
> rtmutex. In order to detect this, the task would have to be able to take either
> task->pi_blocked_on->lock->wait_lock and/or the hb->lock. Unfortunately,
> without already holding one of these, the pi_blocked_on variable can change
> from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
> to take a sleeping lock after wakeup or it could end up trying to block on two
> locks, the second overwriting a valid pi_blocked_on value. This obviously
> breaks the pi mechanism.
>
> This patch increases latency. While running the ltp pthread_cond_many test
> with which Michal reported the bug, I see double-digit hrtimer latencies
> (typically only on the first run after boot):
>
> kernel: hrtimer: interrupt took 75911 ns

Eewwww. There must be some more intelligent and less intrusive way to
detect this.

Thanks,

tglx
From: Darren Hart on
On 07/10/2010 12:41 PM, Mike Galbraith wrote:
> On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
>> The requeue_pi mechanism introduced proxy locking of the rtmutex. This creates
>> a scenario where a task can wake-up, not knowing it has been enqueued on an
>> rtmutex. In order to detect this, the task would have to be able to take either
>> task->pi_blocked_on->lock->wait_lock and/or the hb->lock. Unfortunately,
>> without already holding one of these, the pi_blocked_on variable can change
>> from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
>> to take a sleeping lock after wakeup or it could end up trying to block on two
>> locks, the second overwriting a valid pi_blocked_on value. This obviously
>> breaks the pi mechanism.
>
> copy/paste offline query/reply at Darren's request..
>
> On Sat, 2010-07-10 at 10:26 -0700, Darren Hart wrote:
> On 07/09/2010 09:32 PM, Mike Galbraith wrote:
>>> On Fri, 2010-07-09 at 13:05 -0700, Darren Hart wrote:
>>>
>>>> The core of the problem is that the proxy_lock blocks a task on a lock
>>>> the task knows nothing about. So when it wakes up inside of
>>>> futex_wait_requeue_pi, it immediately tries to block on hb->lock to
>>>> check why it woke up. This has the potential to block the task on two
>>>> locks (thus overwriting the pi_blocked_on). Any attempt preventing this
>>>> involves a lock, and ultimately the hb->lock. The only solution I see
>>>> is to make the hb->locks raw locks (thanks to Steven Rostedt for
>>>> original idea and batting this around with me in IRC).
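(Aside, not part of the original mail: the "lock the task knows nothing
about" is set up by the requeue side blocking the waiter by proxy. Roughly,
in futex_requeue(), the call looks like the following; the exact arguments
are recalled from that era's kernel/futex.c and may differ:)

	/*
	 * Block the still-sleeping waiter 'this->task' on the PI rtmutex.
	 * This sets this->task->pi_blocked_on from the requeuer's context,
	 * so if the waiter later wakes and blocks on a sleeping hb->lock,
	 * that second block would overwrite a still-valid pi_blocked_on.
	 */
	ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
					&this->rt_waiter, this->task, 1);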
>>>
>>> Hm, so wakee _was_ munging his own state after all.
>>>
>>> Out of curiosity, what's wrong with holding his pi_lock across the
>>> wakeup? He can _try_ to block, but can't until pi state is stable.
>>>
>>> I presume there's a big fat gotcha that's just not obvious to futex
>>> locking newbie :)

Nor to some of us that have been engrossed in futexes for the last
couple years! I discussed the pi_lock across the wakeup issue with
Thomas. While this fixes the problem for this particular failure case,
it doesn't protect against:

<tglx> assume the following:
<tglx> t1 is on the condvar
<tglx> t2 does the requeue dance and t1 is now blocked on the outer futex
<tglx> t3 takes hb->lock for a futex in the same bucket
<tglx> t1 wakes due to signal/timeout
<tglx> t1 blocks on hb->lock

You likely did not hit the above scenario because you only had one condvar,
so the hash_buckets were not heavily shared and you weren't likely to hit:

<tglx> t3 takes hb->lock for a futex in the same bucket


I'm going to roll up a patchset with your (Mike) spin_trylock patch and
run it through some tests. I'd still prefer a way to detect early wakeup
without having to grab the hb->lock(), but I haven't found it yet.

+	while(!spin_trylock(&hb->lock))
+		cpu_relax();
 	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
 	spin_unlock(&hb->lock);

Thanks,

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
From: Thomas Gleixner on


On Mon, 12 Jul 2010, Thomas Gleixner wrote:

> On Mon, 12 Jul 2010, Darren Hart wrote:
> > On 07/10/2010 12:41 PM, Mike Galbraith wrote:
> > > On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
> > > > > Out of curiosity, what's wrong with holding his pi_lock across the
> > > > > wakeup? He can _try_ to block, but can't until pi state is stable.
> > > > >
> > > > > I presume there's a big fat gotcha that's just not obvious to futex
> > > > > locking newbie :)
> >
> > Nor to some of us that have been engrossed in futexes for the last couple
> > years! I discussed the pi_lock across the wakeup issue with Thomas. While this
> > fixes the problem for this particular failure case, it doesn't protect
> > against:
> >
> > <tglx> assume the following:
> > <tglx> t1 is on the condvar
> > <tglx> t2 does the requeue dance and t1 is now blocked on the outer futex
> > <tglx> t3 takes hb->lock for a futex in the same bucket
> > <tglx> t1 wakes due to signal/timeout
> > <tglx> t1 blocks on hb->lock
> >
> > You likely did not hit the above scenario because you only had one
> > condvar, so the hash_buckets were not heavily shared and you weren't
> > likely to hit:
> >
> > <tglx> t3 takes hb->lock for a futex in the same bucket
> >
> >
> > I'm going to roll up a patchset with your (Mike) spin_trylock patch and run it
> > through some tests. I'd still prefer a way to detect early wakeup without
> > having to grab the hb->lock(), but I haven't found it yet.
> >
> > +	while(!spin_trylock(&hb->lock))
> > +		cpu_relax();
> >  	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
> >  	spin_unlock(&hb->lock);
>
> And this is nasty as it will create unbounded priority inversion :(
>
> We discussed another solution on IRC in the meantime:
>
> in futex_wait_requeue_pi()
>
> 	futex_wait_queue_me(hb, &q, to);
>
> 	raw_spin_lock(&current->pi_lock);
> 	if (current->pi_blocked_on) {
> 		/*
> 		 * We know that we can only be blocked on the outer futex
> 		 * so we can skip the early wakeup check
> 		 */
> 		raw_spin_unlock(&current->pi_lock);
> 		ret = 0;
> 	} else {
> 		current->pi_blocked_on = PI_WAKEUP_INPROGRESS;
> 		raw_spin_unlock(&current->pi_lock);
>
> 		spin_lock(&hb->lock);
> 		ret = handle_early_requeue_pi_wakeup();
> 		....
> 		spin_unlock(&hb->lock);
> 	}
>
> Now in the rtmutex magic we need in task_blocks_on_rt_mutex():
>
> 	raw_spin_lock(&task->pi_lock);
>
> 	/*
> 	 * Add big fat comment why this is only relevant to futex
> 	 * requeue_pi
> 	 */
>
> 	if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
> 		raw_spin_unlock(&task->pi_lock);
>
> 		/*
> 		 * Returning 0 here is fine. The requeue code is just going to
> 		 * move the futex_q to the other bucket, but that'll be fixed
> 		 * up in handle_early_requeue_pi_wakeup()
> 		 */
>
> 		return 0;

We might also return a sensible error code here and just remove the waiter
from all queues, which would then need to be handled in
handle_early_requeue_pi_wakeup() after acquiring hb->lock.
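For concreteness, a sketch of how the sentinel and this error-return variant
might fit together; the sentinel value and the choice of -EAGAIN are
assumptions made here, not something settled in the thread:

	/* A pointer value that can never be a real waiter. */
	#define PI_WAKEUP_INPROGRESS	((struct rt_mutex_waiter *) 1)

	/* In task_blocks_on_rt_mutex(), with task->pi_lock already held: */
	if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
		raw_spin_unlock(&task->pi_lock);
		/*
		 * Tell the requeue path the waiter woke early; it can drop
		 * this waiter and leave the rest of the cleanup to
		 * handle_early_requeue_pi_wakeup() under hb->lock.
		 */
		return -EAGAIN;
	}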

Thanks,

tglx
From: Mike Galbraith on
On Mon, 2010-07-12 at 22:40 +0200, Thomas Gleixner wrote:
> On Mon, 12 Jul 2010, Darren Hart wrote:
> > On 07/10/2010 12:41 PM, Mike Galbraith wrote:
> > > On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
> > > > > Out of curiosity, what's wrong with holding his pi_lock across the
> > > > > wakeup? He can _try_ to block, but can't until pi state is stable.
> > > > >
> > > > > I presume there's a big fat gotcha that's just not obvious to futex
> > > > > locking newbie :)
> >
> > Nor to some of us that have been engrossed in futexes for the last couple
> > years! I discussed the pi_lock across the wakeup issue with Thomas. While this
> > fixes the problem for this particular failure case, it doesn't protect
> > against:
> >
> > <tglx> assume the following:
> > <tglx> t1 is on the condvar
> > <tglx> t2 does the requeue dance and t1 is now blocked on the outer futex
> > <tglx> t3 takes hb->lock for a futex in the same bucket
> > <tglx> t1 wakes due to signal/timeout
> > <tglx> t1 blocks on hb->lock
> >
> > You likely did not hit the above scenario because you only had one
> > condvar, so the hash_buckets were not heavily shared and you weren't
> > likely to hit:
> >
> > <tglx> t3 takes hb->lock for a futex in the same bucket
> >
> >
> > I'm going to roll up a patchset with your (Mike) spin_trylock patch and run it
> > through some tests. I'd still prefer a way to detect early wakeup without
> > having to grab the hb->lock(), but I haven't found it yet.
> >
> > +	while(!spin_trylock(&hb->lock))
> > +		cpu_relax();
> >  	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
> >  	spin_unlock(&hb->lock);
>
> And this is nasty as it will create unbounded priority inversion :(

Oh ma gawd, _it's a train_ :>

> We discussed another solution on IRC in the meantime:
>
> in futex_wait_requeue_pi()
>
> 	futex_wait_queue_me(hb, &q, to);
>
> 	raw_spin_lock(&current->pi_lock);
> 	if (current->pi_blocked_on) {
> 		/*
> 		 * We know that we can only be blocked on the outer futex
> 		 * so we can skip the early wakeup check
> 		 */
> 		raw_spin_unlock(&current->pi_lock);
> 		ret = 0;
> 	} else {
> 		current->pi_blocked_on = PI_WAKEUP_INPROGRESS;
> 		raw_spin_unlock(&current->pi_lock);
>
> 		spin_lock(&hb->lock);
> 		ret = handle_early_requeue_pi_wakeup();
> 		....
> 		spin_unlock(&hb->lock);
> 	}
>
> Now in the rtmutex magic we need in task_blocks_on_rt_mutex():
>
> 	raw_spin_lock(&task->pi_lock);
>
> 	/*
> 	 * Add big fat comment why this is only relevant to futex
> 	 * requeue_pi
> 	 */
>
> 	if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
> 		raw_spin_unlock(&task->pi_lock);
>
> 		/*
> 		 * Returning 0 here is fine. The requeue code is just going to
> 		 * move the futex_q to the other bucket, but that'll be fixed
> 		 * up in handle_early_requeue_pi_wakeup()
> 		 */
>
> 		return 0;
> 	}
>
> Thanks,
>
> tglx