CRED: Fix __task_cred()'s lockdep check and banner comment [Kernel]

Prev: SATA / AHCI: Do not play with the link PM during suspend to RAM (was: Re: HDD not suspending properly / dead on resume)
Next: [PATCH]hostap:hostap_hw.c Fix typo.

From: Paul E. McKenney on 2 Aug 2010 16:50

On Thu, Jul 29, 2010 at 12:45:55PM +0100, David Howells wrote:
> Fix __task_cred()'s lockdep check by removing the following validation
> condition:
>
> lockdep_tasklist_lock_is_held()
>
> as commit_creds() does not take the tasklist_lock, and nor do most of the
> functions that call it, so this check is pointless and it can prevent
> detection of the RCU lock not being held if the tasklist_lock is held.
>
> Instead, add the following validation condition:
>
> task->exit_state >= 0
>
> to permit the access if the target task is dead and therefore unable to change
> its own credentials.
>
>
> Fix __task_cred()'s comment to:
>
> (1) discard the bit that says that the caller must prevent the target task
> from being deleted. That shouldn't need saying.
>
> (2) Add a comment indicating the result of __task_cred() should not be passed
> directly to get_cred(), but rather than get_task_cred() should be used
> instead.
>
> Also put a note into the documentation to enforce this point there too.

Acked-by: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>

> Signed-off-by: David Howells <dhowells(a)redhat.com>
> Acked-by: Jiri Olsa <jolsa(a)redhat.com>
> Cc: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
> ---
>
> Documentation/credentials.txt | 3 +++
> include/linux/cred.h | 15 ++++++++++-----
> include/linux/sched.h | 1 +
> 3 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/credentials.txt b/Documentation/credentials.txt
> index a2db352..995baf3 100644
> --- a/Documentation/credentials.txt
> +++ b/Documentation/credentials.txt
> @@ -417,6 +417,9 @@ reference on them using:
> This does all the RCU magic inside of it. The caller must call put_cred() on
> the credentials so obtained when they're finished with.
>
> + [*] Note: The result of __task_cred() should not be passed directly to
> + get_cred() as this may race with commit_cred().
> +
> There are a couple of convenience functions to access bits of another task's
> credentials, hiding the RCU magic from the caller:
>
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index ce40cbc..4d2c395 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -274,13 +274,18 @@ static inline void put_cred(const struct cred *_cred)
> * @task: The task to query
> *
> * Access the objective credentials of a task. The caller must hold the RCU
> - * readlock.
> + * readlock or the task must be dead and unable to change its own credentials.
> *
> - * The caller must make sure task doesn't go away, either by holding a ref on
> - * task or by holding tasklist_lock to prevent it from being unlinked.
> + * The result of this function should not be passed directly to get_cred();
> + * rather get_task_cred() should be used instead.
> */
> -#define __task_cred(task) \
> - ((const struct cred *)(rcu_dereference_check((task)->real_cred, rcu_read_lock_held() || lockdep_tasklist_lock_is_held())))
> +#define __task_cred(task) \
> + ({ \
> + const struct task_struct *__t = (task); \
> + rcu_dereference_check(__t->real_cred, \
> + rcu_read_lock_held() || \
> + task_is_dead(__t)); \
> + })
>
> /**
> * get_current_cred - Get the current task's subjective credentials
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 747fcae..0478888 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -214,6 +214,7 @@ extern char ___assert_task_state[1 - 2*!!(
>
> #define task_is_traced(task) ((task->state & __TASK_TRACED) != 0)
> #define task_is_stopped(task) ((task->state & __TASK_STOPPED) != 0)
> +#define task_is_dead(task) ((task)->exit_state != 0)
> #define task_is_stopped_or_traced(task) \
> ((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)
> #define task_contributes_to_load(task) \
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Thomas Gleixner on 3 Aug 2010 13:50

On Tue, 3 Aug 2010, Linus Torvalds wrote:

> Added Oleg and Thomas to the participants.
>
> Oleg/Thomas: the whole thread is on lkml, but I'm quoting most of the
> relevant parts..
>
> On Tue, Aug 3, 2010 at 2:34 AM, David Howells <dhowells(a)redhat.com> wrote:
> >
> > A previous patch:
> >
> > � � � �commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
> > � � � �Author: David Howells <dhowells(a)redhat.com>
> > � � � �Date: � Thu Jul 29 12:45:55 2010 +0100
> > � � � �Subject: CRED: Fix __task_cred()'s lockdep check and banner comment
> >
> > fixed the lockdep checks on __task_cred(). �This has shown up a place in the
> > signalling code where a lock should be held - namely that
> > check_kill_permission() requires its callers to hold the RCU lock.
>
> It's not just check_kill_permission(), is it? I thought we could do
> the "for_each_process()" loops with just RCU, rather than holding the
> whole tasklist_lock? So I _think_ that getting the RCU read-lock would
> make it possible to get rid of the tasklist_lock in there too? At
> least in kill_something_info().
>
> Yes/no? What am I missing? This is an Oleg question, mainly.

Yes, almost all places in the kernel which hold tasklist_lock read
locked are safe with RCU. I started to work on that, but got
distracted.

Thanks,

tglx

From: Eric W. Biederman on 5 Aug 2010 03:20

Linus Torvalds <torvalds(a)linux-foundation.org> writes:

> Added Oleg and Thomas to the participants.
>
> Oleg/Thomas: the whole thread is on lkml, but I'm quoting most of the
> relevant parts..
>
> On Tue, Aug 3, 2010 at 2:34 AM, David Howells <dhowells(a)redhat.com> wrote:
>>
>> A previous patch:
>>
>> commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
>> Author: David Howells <dhowells(a)redhat.com>
>> Date: Thu Jul 29 12:45:55 2010 +0100
>> Subject: CRED: Fix __task_cred()'s lockdep check and banner comment
>>
>> fixed the lockdep checks on __task_cred(). This has shown up a place in the
>> signalling code where a lock should be held - namely that
>> check_kill_permission() requires its callers to hold the RCU lock.
>
> It's not just check_kill_permission(), is it? I thought we could do
> the "for_each_process()" loops with just RCU, rather than holding the
> whole tasklist_lock? So I _think_ that getting the RCU read-lock would
> make it possible to get rid of the tasklist_lock in there too? At
> least in kill_something_info().
>
> Yes/no? What am I missing? This is an Oleg question, mainly.

No. When we send a signal to multiple processes it needs to be an
atomic operation so that kill -KILL -pgrp won't let processes escape.
It is what posix specifies, it is what real programs expect, and it
is the useful semantic in userspace.

Just using rcu_read_lock() breaks that atomicity of sending a signal
to a process group, which means we can have new processes that escape
the signal. Even with the tasklist_lock we have to have a special
case in fork to ensure we don't add a process to a process group
while a signal is being delivered to it.

I have scratched my head a few times wondering if there is a way to
preserve the atomic guarantee without taking a global lock, but I
haven't found one I can wrap my head around yet.

>> success = 0;
>> retval = -ESRCH;
>> + rcu_read_lock();
>> do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
>> int err = group_send_sig_info(sig, info, p);
>> success |= !err;
>> retval = err;
>> } while_each_pid_task(pgrp, PIDTYPE_PGID, p);
>> + rcu_read_unlock();
>
> Ok, makes sense. At the same time, I do wonder if we shouldn't perhaps
> do this in the caller. The callers would seem to all want it - look at
> kill_something_info(), for example, where it really looks like it
> would be nicer to put that _whole_ function under rcu_read_lock() (in
> the caller itself). Or in kernel/exit.c, we have two calls
> back-to-back, so doing it in the caller would clean that up and do it
> just once.

Likely. At one point read_lock(&tasklist_lock) implied having the
rcu_read_lock(). As it no longer does in some cases we have a small
pile of places with outdated assumptions. I know I have been guilty a
few times of forgetting about the new rule that we have to take rcu_read_lock()
everywhere.

> Look here:
>
>> @@ -1261,6 +1263,7 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid)
>> int retval = 0, count = 0;
>> struct task_struct * p;
>>
>> + rcu_read_lock();
>> for_each_process(p) {
>> if (task_pid_vnr(p) > 1 &&
>> !same_thread_group(p, current)) {
>> @@ -1270,6 +1273,7 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid)
>> retval = err;
>> }
>> }
>> + rcu_read_unlock();
>> ret = count ? retval : -ESRCH;
>> }
>> read_unlock(&tasklist_lock);
>
> with those fragments, we now end up having rcu_read_lock() for _all_
> the three cases in kill_something_info(), but rather than do it once,
> we do it explicitly, and we do it in two different ways (two cases do
> it explicitly, the middle one does it implicitly when it calls
> __kill_pgrp_info().
>
> So I think the patch is correct, but at the same time I do get the
> feeling that we should just do it slightly differently.
>
> Comments?

With the tasklist_lock the rule in these functions is that the caller
will take the lock, so we probably make the rule the caller should
take the lock in the same scenarios for the rcu_read_lock(). Aka just
say:

read_lock(&tasklist_lock);
rcu_read_lock();

everywhere, that today we say just:

read_lock(&tasklist_lock);

It is what was meant when the code was written and the rcu-ized, and
it will certainly preserve my human intuition and general practical
reality that when you have the tasklist_lock you have the
rcu_read_lock.

*Shrug* I don't think it matters a whole lot.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Eric W. Biederman on 5 Aug 2010 16:20

Linus Torvalds <torvalds(a)linux-foundation.org> writes:

> On Thu, Aug 5, 2010 at 12:19 AM, Eric W. Biederman
> <ebiederm(a)xmission.com> wrote:
>>
>> No. When we send a signal to multiple processes it needs to be an
>> atomic operation so that kill -KILL -pgrp won't let processes escape.
>> It is what posix specifies, it is what real programs expect, and it
>> is the useful semantic in userspace.
>
> Ok. However, in that case, it's not really about the whole list
> traversal, it's a totally separate thing, and it's really sad that we
> end up using the (rather hot) tasklist_lock for something like that.
> With the dcache/inode locks basically going away, I think
> tasklist_lock ends up being one of the few hot locks left.

It is about the list traversal. In the process group case it is about
traversing the pid->tasks[PIDTYPE_PGID] hlist, which is also protected
by the tasklist_lock.

> Wouldn't it be much nicer to:
> - make it clear that all the "real" signal locking can rely on RCU
> - use a separate per-pgrp lock that ends up being the one that gives
> the signal _semantic_ meaning?
>
> That would automatically document why we get the lock too, which
> certainly isn't clear from the code as-is.
>
> The per-pgrp lock might be something as simple as a silly hash that
> just spreads out the process groups over some random number of simple
> spinlocks.

I think it is totally reasonable to add a per pid lock,
that would protect the pid->task[...] hlist. That would make
things clearer and finer grained without a lot of effort. Just
a little more struct pid bloat, and a little extra care in fork,
when we add to those lists.

Even with the per-pgrp lock we still need a lock on the global process
list for the kill -KILL -1 case. Which suggests that tasklist_lock is
still needed for part of kill_something_info.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Eric W. Biederman on 5 Aug 2010 17:30

Linus Torvalds <torvalds(a)linux-foundation.org> writes:

> On Thu, Aug 5, 2010 at 1:13 PM, Eric W. Biederman <ebiederm(a)xmission.com> wrote:
>>
>> I think it is totally reasonable to add a per pid lock,
>> that would protect the pid->task[...] hlist. That would make
>> things clearer and finer grained without a lot of effort. Just
>> a little more struct pid bloat, and a little extra care in fork,
>> when we add to those lists.
>
> Hmm. Have you taken a look at Nick Piggin's VFS scalability patches?
> They introduce this "RCU-safe hash chain lock", where each hashchain
> has a lock-bit in the low bit. I wonder if that would be the right
> thing to use?

Interesting. I haven't looked but a lock bit in the low bit of the
hlist head in struct pid would not add any space, and would not add
any extra cache line bounces. So that would just be a matter of
adding the extra locks and getting the lock ordering correct.

>> Even with the per-pgrp lock we still need a lock on the global process
>> list for the kill -KILL -1 case. Which suggests that tasklist_lock is
>> still needed for part of kill_something_info.
>
> Well, that -1 case is special anyway. The fact that we might want to
> use the tasklist_lock there is not very relevant, I think. That is
> _not_ a hotpath, really (at least not under any relevant loads, I'm
> sure you could make a silly benchmark of "kill(-1,0)").

I expect even signal deliver to process groups is relatively rare.

The interesting question is can we kill the tasklist_lock and/or the
tasklist. A quick grep shows that we have maybe 100 users of the tasklist
in the entire kernel. Poking a little deeper it looks like man of those
are connected to scheduling and are uses that today take the tasklist_lock
but would be fine with the rcu_read_lock().

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

|
Pages: 1
Prev: SATA / AHCI: Do not play with the link PM during suspend to RAM (was: Re: HDD not suspending properly / dead on resume)
Next: [PATCH]hostap:hostap_hw.c Fix typo.