futex: protect against pi_blocked_on corruption during requeue PI [Kernel]

From: Thomas Gleixner on 13 Jul 2010 05:30

On Tue, 13 Jul 2010, Darren Hart wrote:

> Thanks to Thomas, Steven, and Mike for hashing this over with me. After an
> IRC discussion with Thomas, I put the following together. It resolves
> the issue for me; Mike, please test and let us know if it fixes it for
> you. A couple of points of discussion before we commit this:
>
> The use of the new state flag, PI_WAKEUP_INPROGRESS, is pretty ugly.
> Would a new task_pi_blocked_on_valid() method be preferred (in
> rtmutex.c)?
>
> The new WARN_ON() in task_blocks_on_rt_mutex() is complex. It didn't
> exist before and we've now closed this gap, should we just drop it?

We can simplify it to:

	WARN_ON(task->pi_blocked_on &&
		task->pi_blocked_on != PI_WAKEUP_INPROGRESS);

We check for != current and PI_WAKEUP_INPROGRESS just above.

> I've added a couple BUG_ON()s in futex_wait_requeue_pi() dealing with
> the race with requeue and q.lock_ptr. I'd like to leave this for the
> time being if nobody strongly objects.
> -
> /*
> - * In order for us to be here, we know our q.key == key2, and since
> - * we took the hb->lock above, we also know that futex_requeue() has
> - * completed and we no longer have to concern ourselves with a wakeup
> - * race with the atomic proxy lock acquition by the requeue code.
> + * Avoid races with requeue and trying to block on two mutexes
> + * (hb->lock and uaddr2's rtmutex) by serializing access to
> + * pi_blocked_on with pi_lock and setting PI_BLOCKED_ON_PENDING.
> + */
> + raw_spin_lock(&current->pi_lock);

Needs to be raw_spin_lock_irq()

> + if (current->pi_blocked_on) {
> + raw_spin_unlock(&current->pi_lock);
> + } else {
> + current->pi_blocked_on = (struct rt_mutex_waiter *)PI_WAKEUP_INPROGRESS;

#define PI_WAKEUP_INPROGRESS ((struct rt_mutex_waiter *) 1)

perhaps? That gets rid of all the type casts.
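
A minimal userspace sketch of the typed sentinel suggested above. It is only an illustration: `struct task` is a stand-in for the single `task_struct` field discussed here, and `pi_blocked_on_real_waiter()` is a hypothetical helper, not a function from the patch. With the macro already typed as a waiter pointer, comparisons need no `(long)` casts.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct rt_mutex_waiter; only the type matters. */
struct rt_mutex_waiter {
	int prio;
};

/* Sentinel as proposed in the mail: already typed as a waiter pointer. */
#define PI_WAKEUP_INPROGRESS ((struct rt_mutex_waiter *) 1)

/* Stand-in (assumption) for the one task_struct field under discussion. */
struct task {
	struct rt_mutex_waiter *pi_blocked_on;
};

/* Hypothetical helper: true only if the task is blocked on a real waiter. */
static int pi_blocked_on_real_waiter(const struct task *t)
{
	/* No casts needed: both sides are struct rt_mutex_waiter pointers. */
	return t->pi_blocked_on && t->pi_blocked_on != PI_WAKEUP_INPROGRESS;
}
```

The helper is never dereferencing the sentinel; it only compares pointer values, which is the whole point of reserving the value 1.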

> + raw_spin_unlock(&current->pi_lock);
> +
> + spin_lock(&hb->lock);

We need to clean up current->pi_blocked_on here. If we succeed in the
hb->lock fast path, then we might leak PI_WAKEUP_INPROGRESS to user space
and the next requeue will fail.
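
The cleanup requested above can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the eventual patch: `struct model_lock` and its helpers are hypothetical stand-ins for `pi_lock` and `hb->lock` that only track whether the lock is held, and retaking `pi_lock` for the cleanup is an assumption of the model.

```c
#include <assert.h>
#include <stddef.h>

struct rt_mutex_waiter { int prio; };
#define PI_WAKEUP_INPROGRESS ((struct rt_mutex_waiter *) 1)

/* Hypothetical lock stand-in: records only whether the lock is held. */
struct model_lock { int held; };

static void model_lock_acquire(struct model_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void model_lock_release(struct model_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Stand-in (assumption) for the task_struct fields involved. */
struct task {
	struct model_lock pi_lock;
	struct rt_mutex_waiter *pi_blocked_on;
};

/*
 * Model of the reviewed sequence. Returns 1 with hb_lock held if the
 * marker path was taken, 0 without touching hb_lock if the task was
 * already blocked on something (the "if" branch of the quoted hunk).
 */
static int lock_hb_marked(struct task *cur, struct model_lock *hb_lock)
{
	model_lock_acquire(&cur->pi_lock);
	if (cur->pi_blocked_on) {
		model_lock_release(&cur->pi_lock);
		return 0;
	}
	cur->pi_blocked_on = PI_WAKEUP_INPROGRESS;
	model_lock_release(&cur->pi_lock);

	model_lock_acquire(hb_lock);

	/* The cleanup: the marker must not outlive the lock acquisition. */
	model_lock_acquire(&cur->pi_lock);
	cur->pi_blocked_on = NULL;
	model_lock_release(&cur->pi_lock);
	return 1;
}
```

Without the final three statements, a task that took the fast path would keep the marker in `pi_blocked_on` after returning, which is the leak described above.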

> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index 23dd443..0399108 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -227,7 +227,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
> * reached or the state of the chain has changed while we
> * dropped the locks.
> */
> - if (!waiter || !waiter->task)
> + if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS || !waiter->task)
> goto out_unlock_pi;

Why do we need that check? Either the requeue succeeded, in which case
task->pi_blocked_on is set to the real waiter, or the wakeup won and
we are in no lock chain.

If we ever find a waiter with PI_WAKEUP_INPROGRESS set in
rt_mutex_adjust_prio_chain(), then it's a bug, nothing else.

> @@ -469,7 +493,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
> plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
>
> __rt_mutex_adjust_prio(owner);
> - if (owner->pi_blocked_on)
> + if (owner->pi_blocked_on &&
> + (long)owner->pi_blocked_on != PI_WAKEUP_INPROGRESS)

Again, that can never happen

> chain_walk = 1;
> raw_spin_unlock(&owner->pi_lock);
> }
> @@ -579,9 +604,11 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
>
> raw_spin_lock(&pendowner->pi_lock);
>
> - WARN_ON(!pendowner->pi_blocked_on);
> - WARN_ON(pendowner->pi_blocked_on != waiter);
> - WARN_ON(pendowner->pi_blocked_on->lock != lock);
> + if (!WARN_ON(!pendowner->pi_blocked_on) &&
> + !WARN_ON((long)pendowner->pi_blocked_on == PI_WAKEUP_INPROGRESS)) {

Ditto

> + WARN_ON(pendowner->pi_blocked_on != waiter);
> + WARN_ON(pendowner->pi_blocked_on->lock != lock);
> + }
>
> pendowner->pi_blocked_on = NULL;
>
> @@ -624,7 +651,8 @@ static void remove_waiter(struct rt_mutex *lock,
> }
> __rt_mutex_adjust_prio(owner);
>
> - if (owner->pi_blocked_on)
> + if (owner->pi_blocked_on &&
> + (long)owner->pi_blocked_on != PI_WAKEUP_INPROGRESS)
> chain_walk = 1;

Same here.

> raw_spin_unlock(&owner->pi_lock);
> @@ -658,7 +686,8 @@ void rt_mutex_adjust_pi(struct task_struct *task)
> raw_spin_lock_irqsave(&task->pi_lock, flags);
>
> waiter = task->pi_blocked_on;
> - if (!waiter || waiter->list_entry.prio == task->prio) {
> + if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS ||
> + waiter->list_entry.prio == task->prio) {

And here

> /*
> * Convert user-nice values [ -20 ... 0 ... 19 ]
> * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
> @@ -6377,7 +6379,8 @@ void task_setprio(struct task_struct *p, int prio)
> */
> if (unlikely(p == rq->idle)) {
> WARN_ON(p != rq->curr);
> - WARN_ON(p->pi_blocked_on);
> + WARN_ON(p->pi_blocked_on &&
> + (long)p->pi_blocked_on != PI_WAKEUP_INPROGRESS);

Yuck. Paranoia ? If we ever requeue idle, then .....

I'm going to clean this up and send out a new patch for Mike
to test.

Thanks,

tglx

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Thomas Gleixner on 13 Jul 2010 06:00

On Tue, 13 Jul 2010, Darren Hart wrote:
> diff --git a/kernel/futex.c b/kernel/futex.c
> index a6cec32..c92978d 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1336,6 +1336,9 @@ retry_private:
> requeue_pi_wake_futex(this, &key2, hb2);
> drop_count++;
> continue;
> + } else if (ret == -EAGAIN) {
> + /* Waiter woken by timeout or signal. */

This leaks the pi_state.

From: Thomas Gleixner on 13 Jul 2010 06:30

On Tue, 13 Jul 2010, Thomas Gleixner wrote:
> On Tue, 13 Jul 2010, Darren Hart wrote:
>
> > --- a/kernel/rtmutex.c
> > +++ b/kernel/rtmutex.c
> > @@ -227,7 +227,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
> > * reached or the state of the chain has changed while we
> > * dropped the locks.
> > */
> > - if (!waiter || !waiter->task)
> > + if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS || !waiter->task)
> > goto out_unlock_pi;
>
> Why do we need that check? Either the requeue succeeded, in which case
> task->pi_blocked_on is set to the real waiter, or the wakeup won and
> we are in no lock chain.
>
> If we ever find a waiter with PI_WAKEUP_INPROGRESS set in
> rt_mutex_adjust_prio_chain(), then it's a bug, nothing else.

Grrr, I'm wrong. If we take hb->lock in the fast path then something
else might try to boost us and trip over this :(
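
To illustrate why the check is needed after all, here is a small userspace reduction of the guard at the top of the chain walk. It is a sketch only: `struct task` and `chain_walk_may_continue()` are hypothetical names, and the real function does far more. The point is that the marker has to be tested before `waiter->task` is read, because the marker is not a valid pointer.

```c
#include <assert.h>
#include <stddef.h>

struct task;

/* Stand-in for the kernel's waiter; only the task back-pointer is modelled. */
struct rt_mutex_waiter {
	struct task *task;
};

#define PI_WAKEUP_INPROGRESS ((struct rt_mutex_waiter *) 1)

/* Stand-in (assumption) for the one task_struct field under discussion. */
struct task {
	struct rt_mutex_waiter *pi_blocked_on;
};

/*
 * Hypothetical reduction of the guard in the chain walk:
 * returns 1 if the walk may continue, 0 if it has to stop.
 */
static int chain_walk_may_continue(const struct task *t)
{
	struct rt_mutex_waiter *waiter = t->pi_blocked_on;

	/*
	 * Short-circuit evaluation keeps the marker from ever being
	 * dereferenced: waiter->task is read only for real waiters.
	 */
	if (!waiter || waiter == PI_WAKEUP_INPROGRESS || !waiter->task)
		return 0;
	return 1;
}
```

A booster that finds the marker simply stops, the same way it would for a task that is not blocked at all.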

This code causes braindamage. I really wonder whether we need to
remove it according to the "United Nations Convention against Torture
and Other Cruel, Inhuman or Degrading Treatment or Punishment".

> > @@ -6377,7 +6379,8 @@ void task_setprio(struct task_struct *p, int prio)
> > */
> > if (unlikely(p == rq->idle)) {
> > WARN_ON(p != rq->curr);
> > - WARN_ON(p->pi_blocked_on);
> > + WARN_ON(p->pi_blocked_on &&
> > + (long)p->pi_blocked_on != PI_WAKEUP_INPROGRESS);
>
> Yuck. Paranoia ? If we ever requeue idle, then .....

At least one which is bogus :)

Thanks,

tglx