From: Linus Torvalds on
Added Oleg and Thomas to the participants.

Oleg/Thomas: the whole thread is on lkml, but I'm quoting most of the
relevant parts..

On Tue, Aug 3, 2010 at 2:34 AM, David Howells <dhowells(a)redhat.com> wrote:
>
> A previous patch:
>
> � � � �commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
> � � � �Author: David Howells <dhowells(a)redhat.com>
> � � � �Date: � Thu Jul 29 12:45:55 2010 +0100
> � � � �Subject: CRED: Fix __task_cred()'s lockdep check and banner comment
>
> fixed the lockdep checks on __task_cred(). �This has shown up a place in the
> signalling code where a lock should be held - namely that
> check_kill_permission() requires its callers to hold the RCU lock.

It's not just check_kill_permission(), is it? I thought we could do
the "for_each_process()" loops with just RCU, rather than holding the
whole tasklist_lock? So I _think_ that getting the RCU read-lock would
make it possible to get rid of the tasklist_lock in there too? At
least in kill_something_info().

Yes/no? What am I missing? This is an Oleg question, mainly.

> It's may be that it would be better to add RCU read lock calls in
> group_send_sig_info() only, around the call to check_kill_permission(). �On the
> other hand, some of the callers are either holding the RCU read lock already,
> or have disabled interrupts, in which case, it's just extra overhead to do it
> in g_s_s_i().
>
> Without this patch, the following warning can occur:
>
> [ �140.173556] ===================================================
> [ �140.215379] [ INFO: suspicious rcu_dereference_check() usage. ]
> [ �140.216461] ---------------------------------------------------
> [ �140.217530] kernel/signal.c:660 invoked rcu_dereference_check() without protection!
> [ �140.218937]
> [ �140.218938] other info that might help us debug this:
> [ �140.218939]
> [ �140.220508]
> [ �140.220509] rcu_scheduler_active = 1, debug_locks = 1
> [ �140.221991] 1 lock held by init/1:
> [ �140.222668] �#0: �(tasklist_lock){.+.+..}, at: [<c104a0ac>] kill_something_info+0x7c/0x160
> [ �140.224709]
> [ �140.224711] stack backtrace:
> [ �140.225661] Pid: 1, comm: init Not tainted 2.6.35 #1
> [ �140.226576] Call Trace:
> [ �140.227111] �[<c103cca8>] ? printk+0x18/0x20
> [ �140.227908] �[<c1069884>] lockdep_rcu_dereference+0x94/0xb0
> [ �140.228931] �[<c104936a>] check_kill_permission+0x15a/0x170
> [ �140.229932] �[<c104a0ac>] ? kill_something_info+0x7c/0x160
> [ �140.230921] �[<c1049cca>] group_send_sig_info+0x1a/0x50
> [ �140.231866] �[<c1049d36>] __kill_pgrp_info+0x36/0x60
> [ �140.232780] �[<c104a0d0>] kill_something_info+0xa0/0x160
> [ �140.233740] �[<c10831c5>] ? __call_rcu+0xa5/0x110
> [ �140.234596] �[<c104b7ee>] sys_kill+0x5e/0x70
> [ �140.235387] �[<c10d1eee>] ? mntput_no_expire+0x1e/0xa0
> [ �140.236329] �[<c10bbd10>] ? __fput+0x170/0x220
> [ �140.257756] �[<c10bbdd9>] ? fput+0x19/0x20
> [ �140.258529] �[<c137ad94>] ? restore_all_notrace+0x0/0x18
> [ �140.259496] �[<c11bfb04>] ? trace_hardirqs_on_thunk+0xc/0x10
> [ �140.260531] �[<c137ad61>] syscall_call+0x7/0xb
> [ �144.627841] nfsd: last server has exited, flushing export cache
> [ �154.040420] Restarting system.
> [ �154.041061] machine restart
>
> Reported-by: Tetsuo Handa <penguin-kernel(a)i-love.sakura.ne.jp>
> Signed-off-by: David Howells <dhowells(a)redhat.com>
> ---
>
> �kernel/exit.c � | � �2 ++
> �kernel/signal.c | � �8 ++++++--
> �2 files changed, 8 insertions(+), 2 deletions(-)
>
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index ceffc67..7858ebf 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -773,6 +773,7 @@ static void forget_original_parent(struct task_struct *father)
>
> � � � �exit_ptrace(father);
>
> + � � � rcu_read_lock();
> � � � �write_lock_irq(&tasklist_lock);
> � � � �reaper = find_new_reaper(father);
>
> @@ -791,6 +792,7 @@ static void forget_original_parent(struct task_struct *father)
> � � � � � � � �reparent_leader(father, p, &dead_children);
> � � � �}
> � � � �write_unlock_irq(&tasklist_lock);
> + � � � rcu_read_unlock();
>
> � � � �BUG_ON(!list_empty(&father->children));
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 906ae5a..f771156 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -637,7 +637,7 @@ static inline bool si_fromuser(const struct siginfo *info)
>
> �/*
> �* Bad permissions for sending the signal
> - * - the caller must hold at least the RCU read lock
> + * - the caller must hold the RCU read lock
> �*/
> �static int check_kill_permission(int sig, struct siginfo *info,
> � � � � � � � � � � � � � � � � struct task_struct *t)
> @@ -1127,7 +1127,7 @@ struct sighand_struct *lock_task_sighand(struct task_struct *tsk, unsigned long
>
> �/*
> �* send signal info to all the members of a group
> - * - the caller must hold the RCU read lock at least
> + * - the caller must hold the RCU read lock
> �*/
> �int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
> �{
> @@ -1151,11 +1151,13 @@ int __kill_pgrp_info(int sig, struct siginfo *info, struct pid *pgrp)
>
> � � � �success = 0;
> � � � �retval = -ESRCH;
> + � � � rcu_read_lock();
> � � � �do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
> � � � � � � � �int err = group_send_sig_info(sig, info, p);
> � � � � � � � �success |= !err;
> � � � � � � � �retval = err;
> � � � �} while_each_pid_task(pgrp, PIDTYPE_PGID, p);
> + � � � rcu_read_unlock();

Ok, makes sense. At the same time, I do wonder if we shouldn't perhaps
do this in the caller. The callers would seem to all want it - look at
kill_something_info(), for example, where it really looks like it
would be nicer to put that _whole_ function under rcu_read_lock() (in
the caller itself). Or in kernel/exit.c, we have two calls
back-to-back, so doing it in the caller would clean that up and do it
just once.

Look here:

> @@ -1261,6 +1263,7 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid)
> � � � � � � � �int retval = 0, count = 0;
> � � � � � � � �struct task_struct * p;
>
> + � � � � � � � rcu_read_lock();
> � � � � � � � �for_each_process(p) {
> � � � � � � � � � � � �if (task_pid_vnr(p) > 1 &&
> � � � � � � � � � � � � � � � � � � � �!same_thread_group(p, current)) {
> @@ -1270,6 +1273,7 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid)
> � � � � � � � � � � � � � � � � � � � �retval = err;
> � � � � � � � � � � � �}
> � � � � � � � �}
> + � � � � � � � rcu_read_unlock();
> � � � � � � � �ret = count ? retval : -ESRCH;
> � � � �}
> � � � �read_unlock(&tasklist_lock);

with those fragments, we now end up having rcu_read_lock() for _all_
the three cases in kill_something_info(), but rather than do it once,
we do it explicitly, and we do it in two different ways (two cases do
it explicitly, the middle one does it implicitly when it calls
__kill_pgrp_info().

So I think the patch is correct, but at the same time I do get the
feeling that we should just do it slightly differently.

Comments?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on
On Thu, Aug 5, 2010 at 12:19 AM, Eric W. Biederman
<ebiederm(a)xmission.com> wrote:
>
> No. �When we send a signal to multiple processes it needs to be an
> atomic operation so that kill -KILL -pgrp won't let processes escape.
> It is what posix specifies, it is what real programs expect, and it
> is the useful semantic in userspace.

Ok. However, in that case, it's not really about the whole list
traversal, it's a totally separate thing, and it's really sad that we
end up using the (rather hot) tasklist_lock for something like that.
With the dcache/inode locks basically going away, I think
tasklist_lock ends up being one of the few hot locks left.

Wouldn't it be much nicer to:
- make it clear that all the "real" signal locking can rely on RCU
- use a separate per-pgrp lock that ends up being the one that gives
the signal _semantic_ meaning?

That would automatically document why we get the lock too, which
certainly isn't clear from the code as-is.

The per-pgrp lock might be something as simple as a silly hash that
just spreads out the process groups over some random number of simple
spinlocks.

> With the tasklist_lock the rule in these functions is that the caller
> will take the lock, so we probably make the rule the caller should
> take the lock in the same scenarios for the rcu_read_lock(). Aka just
> say:
>
> read_lock(&tasklist_lock);
> rcu_read_lock();
>
> everywhere, that today we say just:
>
> read_lock(&tasklist_lock);

I agree that we probably should have done that originally, in order to
not have these bugs show up later. However, I don't think it makes
sense any more, especially not if tasklist_lock isn't even a "real"
lock from a kernel internal consistency standpoint, but has a totally
secondary meaning.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on
On Thu, Aug 5, 2010 at 1:13 PM, Eric W. Biederman <ebiederm(a)xmission.com> wrote:
>
> I think it is totally reasonable to add a per pid lock,
> that would protect the pid->task[...] hlist. �That would make
> things clearer and finer grained without a lot of effort. �Just
> a little more struct pid bloat, and a little extra care in fork,
> when we add to those lists.

Hmm. Have you taken a look at Nick Piggin's VFS scalability patches?
They introduce this "RCU-safe hash chain lock", where each hashchain
has a lock-bit in the low bit. I wonder if that would be the right
thing to use?

> Even with the per-pgrp lock we still need a lock on the global process
> list for the kill -KILL -1 case. �Which suggests that tasklist_lock is
> still needed for part of kill_something_info.

Well, that -1 case is special anyway. The fact that we might want to
use the tasklist_lock there is not very relevant, I think. That is
_not_ a hotpath, really (at least not under any relevant loads, I'm
sure you could make a silly benchmark of "kill(-1,0)").

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/