From: Oleg Nesterov
On 06/28, Paul E. McKenney wrote:
>
> On Fri, Jun 25, 2010 at 11:55:48AM +0200, Oleg Nesterov wrote:
> > On 06/24, Paul E. McKenney wrote:
> > >
> > > So it is OK to skip some of the other threads in this case, even
> > > though they were present throughout the whole procedure?
> >
> > I think, yes. We can miss them in any case, they can go away before
> > while_each_thread(g, t) starts the scan.
> >
> > If g == group_leader (old or new), then we should notice this thread
> > at least.
> >
> > Otherwise we can miss them all, with or without next_thread_careful().
>
> Just to be sure that we are actually talking about the same scenario...
>
> Suppose that a task group is lead by 2908 and has member 2909, 2910,
> 2911, and 2912. Suppose that 2910 does pthread_exit() just as some
> other task is "ls"ing the relevant /proc entry. Is it really OK for
> "ls" to show 2909 but not 2911 and 2912, even though 2911 and 2912
> were alive and kicking the entire time?

Confused.

Let's return to

        do
                printk("%d\n", t->pid);
        while_each_thread(g, t);

for the moment.
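
(For reference, while_each_thread() in kernels of this era is essentially just

        #define while_each_thread(g, t) \
                while ((t = next_thread(t)) != g)

so if g itself disappears from ->thread_group, the loop's termination
condition can never become true.)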

In that case, if g != 2910 (the exiting thread) we will print all pids,
except we can miss 2910. With or without next_thread_careful().

Only if we start at g == 2910, then

        current code:         print 2910, then spin forever printing
                              other pids

        next_thread_careful:  stop printing when we notice that 2910
                              was unhashed.

So, yes, in this case we can miss all other threads.
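
For concreteness, the idea behind next_thread_careful() is roughly the
sketch below. This is only an illustration of the approach under
discussion (memory-ordering details omitted), not the actual patch; it
uses the existing next_thread() and pid_alive() helpers:

        static inline struct task_struct *
        next_thread_careful(struct task_struct *g, struct task_struct *t)
        {
                t = next_thread(t);
                /*
                 * If the thread we started from was unhashed, stop:
                 * returning g terminates a while_each_thread()-style
                 * loop, i.e. while ((t = next_thread_careful(g, t)) != g).
                 */
                if (likely(pid_alive(g)))
                        return t;
                return g;
        }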

As for "ls"ing the relevant /proc entry. proc_task_readdir() is complicated,
it can drop rcu lock, sleep, etc. But basically it mimics while_each_thread()
logic. Let's assume that proc_task_fill_cache() never fails.

proc_task_readdir() always starts at the group_leader, 2908. So, with or
without next_thread_careful() we can only miss the exiting 2910.

But (again, unless I missed something) the current code can race with exec,
and s/next_thread/next_thread_careful/ in first_tid() can fix the race.
(just in case, we can fix it differently).

But, of course, if you do "ls /proc/2910/task" instead of "ls /proc/2908/task"
you can miss _all_ threads if 2910 exits before proc_task_readdir() finds
its leader, 2908. Again, this is with or without next_thread_careful().


Paul, please let me know if I misunderstood your concerns, or if I missed
something.

Oleg.

From: Paul E. McKenney
On Tue, Jun 29, 2010 at 03:05:03PM +0200, Oleg Nesterov wrote:
> On 06/28, Paul E. McKenney wrote:
> >
> > On Fri, Jun 25, 2010 at 11:55:48AM +0200, Oleg Nesterov wrote:
> > > On 06/24, Paul E. McKenney wrote:
> > > >
> > > > So it is OK to skip some of the other threads in this case, even
> > > > though they were present throughout the whole procedure?
> > >
> > > I think, yes. We can miss them in any case, they can go away before
> > > while_each_thread(g, t) starts the scan.
> > >
> > > If g == group_leader (old or new), then we should notice this thread
> > > at least.
> > >
> > > Otherwise we can miss them all, with or without next_thread_careful().
> >
> > Just to be sure that we are actually talking about the same scenario...
> >
> > Suppose that a task group is lead by 2908 and has member 2909, 2910,
> > 2911, and 2912. Suppose that 2910 does pthread_exit() just as some
> > other task is "ls"ing the relevant /proc entry. Is it really OK for
> > "ls" to show 2909 but not 2911 and 2912, even though 2911 and 2912
> > were alive and kicking the entire time?
>
> Confused.
>
> Let's return to
>
>        do
>                printk("%d\n", t->pid);
>        while_each_thread(g, t);
>
> for the moment.
>
> In that case, if g != 2910 (the exiting thread) we will print all pids,
> except we can miss 2910. With or without next_thread_careful().
>
> Only if we start at g == 2910, then
>
>        current code:         print 2910, then spin forever printing
>                              other pids
>
>        next_thread_careful:  stop printing when we notice that 2910
>                              was unhashed.
>
> So, yes, in this case we can miss all other threads.
>
> As for "ls"ing the relevant /proc entry. proc_task_readdir() is complicated,
> it can drop rcu lock, sleep, etc. But basically it mimics while_each_thread()
> logic. Let's assume that proc_task_fill_cache() never fails.
>
> proc_task_readdir() always starts at the group_leader, 2908. So, with or
> without next_thread_careful() we can only miss the exiting 2910.
>
> But (again, unless I missed something) the current code can race with exec,
> and s/next_thread/next_thread_careful/ in first_tid() can fix the race.
> (just in case, we can fix it differently).
>
> But, of course, if you do "ls /proc/2910/task" instead of "ls /proc/2908/task"
> you can miss _all_ threads if 2910 exits before proc_task_readdir() finds
> its leader, 2908. Again, this is with or without next_thread_careful().
>
>
> Paul, please let me know if I misunderstood your concerns, or if I missed
> something.

Thank you very much for laying this out completely! I was having a hard
time believing that it was OK to miss threads in the "ls /proc/2910/task"
case. But of course similar issues can arise when running "ls" on a
directory with lots of files that are coming and going quickly in the
meantime, I guess. And if proc_task_fill_cache() fails, we can miss
tasks as well, correct?

Given all this, I believe that your fix really does work.

Thanx, Paul
From: Oleg Nesterov
On 06/29, Paul E. McKenney wrote:
>
> On Tue, Jun 29, 2010 at 03:05:03PM +0200, Oleg Nesterov wrote:
> >
> > Paul, please let me know if I misunderstood your concerns, or if I missed
> > something.
>
> Thank you very much for laying this out completely! I was having a hard
> time believing that it was OK to miss threads in the "ls /proc/2910/task"
> case. But of course similar issues can arise when running "ls" on a
> directory with lots of files that are coming and going quickly in the
> meantime, I guess.

Yes. And again, even if 2910 is not the group leader and it is exiting,
"ls /proc/2910/task" will work because proc_task_readdir() always starts
at 2910->group_leader == 2908.

It only fails if proc_task_readdir() can't find the leader; in this
particular case that just means 2910 no longer exists, and thus
/proc/2910/ is dead even if we can still find its dentry.

> And if proc_task_fill_cache() fails, we can miss
> tasks as well, correct?

Well, yes and no.

Sure, if proc_task_fill_cache() fails we haven't reported all threads.
But if /bin/ls does readdir() again after that, proc_task_readdir()
tries to continue starting from the last-pid-we-failed-to-report.
If there is no task with that pid, we start from the group_leader
and skip the number-of-already-reported-threads.

So, we have a lot of issues here; we can miss some threads because
"skip the number-of-already-reported-threads" can't really be accurate.


But, to clarify, this has almost nothing to do with the original problem.
Afaics, if we change first_tid() to use next_thread_careful() instead
of next_thread(), we close the purely theoretical race with exec, but that
is all. (And I am still not sure this race really exists; even if it
does, we can fix it without next_thread_careful.)
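
In terms of the first_tid() sketch above, that substitution is just the
following (again, only a sketch of the idea):

        pos = leader;
        while (nr-- > 0) {
                /* was: pos = next_thread(pos); */
                pos = next_thread_careful(leader, pos);
                if (pos == leader) {
                        pos = NULL;
                        goto out;
                }
        }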

> Given all this, I believe that your fix really does work.

Great. I'll send the patch once I inspect zap_threads() and
current_is_single_threaded() to figure out which changes they need.

Oleg.

From: Roland McGrath
> We can't do call_rcu(release_task),
> we can't take tasklist for writing in the softirq context. But even
> if we could, this can't help in fact or I missed something.

Ah, I had missed that constraint of call_rcu. (It's not mentioned directly
in the kerneldoc comment, and only comes up in a couple of brief mentions
in Documentation/RCU/.) I was thinking call_rcu would suffice because the
callback runs between rcu_read_lock critical sections, but I guess it
actually only runs after the pre-existing ones have completed and doesn't
prevent a new one from having started on another CPU. (And we can't use it
anyway.)
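
For reference, a minimal sketch of the pattern in question (illustrative
only; this is roughly what delayed_put_task_struct() in kernel/exit.c
already does). The callback runs from softirq context only after a grace
period, i.e. after the read-side critical sections that were already
running have finished, but new ones may start at any time, and the
callback cannot sleep or take tasklist_lock for writing:

        static void example_delayed_put(struct rcu_head *rhp)
        {
                struct task_struct *tsk =
                        container_of(rhp, struct task_struct, rcu);

                /* softirq context: no sleeping, no write-locking tasklist_lock */
                put_task_struct(tsk);
        }

        /* update side, instead of releasing the task immediately: */
        /*      call_rcu(&tsk->rcu, example_delayed_put);         */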


Thanks,
Roland
From: Roland McGrath
> The reason for my "destroying the old" and "forming the new" is the
> possibility of someone doing proc_task_readdir() when the group leader
> does exec(), which causes all to die, and then the new process does
> pthread_create(), forming a new thread group. Because proc_task_readdir()
> neither holds a lock nor stays in an RCU read-side critical section
> for the full /proc scan, the old group might really be destroyed from
> the reader's point of view.

I haven't tried to understand the /proc code. From the userland
perspective, there is one thing called a "process" with a single ID that
the kernel calls the TGID and everybody else calls the PID, and that
container survives execs regardless of which of its threads do them.
Listing /proc/TGID/task is the way userland (i.e. debuggers) enumerates all
the threads (e.g. for attaching them all with ptrace). It's of course
expected that threads will be coming and going, so userland expects to read
it several times, until there are no new threads in the list after it has
attached all the ones from the last read (so it would know if those ones
created any more).
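
For concreteness, a userland sketch of that procedure (a hypothetical
attach_all_threads() helper, not code from this thread; error handling
and the waitpid() for each stop are omitted):

        #include <dirent.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/ptrace.h>
        #include <sys/types.h>

        #define MAX_TIDS 4096

        static pid_t seen[MAX_TIDS];
        static int nseen;

        static int seen_before(pid_t tid)
        {
                for (int i = 0; i < nseen; i++)
                        if (seen[i] == tid)
                                return 1;
                return 0;
        }

        /* Re-read /proc/PID/task and attach every thread until a full
         * pass finds nothing new, so threads created in the meantime
         * (by threads we had not yet stopped) are caught as well. */
        static int attach_all_threads(pid_t pid)
        {
                char path[64];
                int new_found;

                snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);
                do {
                        /* reopen by name each pass, in case an exec changed the leader */
                        DIR *dir = opendir(path);
                        struct dirent *de;

                        if (!dir)
                                return -1;      /* the whole process is gone */
                        new_found = 0;
                        while ((de = readdir(dir)) != NULL) {
                                pid_t tid = (pid_t)atoi(de->d_name);

                                if (tid <= 0 || nseen >= MAX_TIDS || seen_before(tid))
                                        continue;
                                if (ptrace(PTRACE_ATTACH, tid, NULL, NULL) == 0) {
                                        seen[nseen++] = tid;
                                        new_found = 1;
                                }
                        }
                        closedir(dir);
                } while (new_found);

                return 0;
        }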

I can't quite tell but it sounds like you may be saying that this procedure
won't work with rewinding the same fd. After an exec, that fd may point to
a reaped former leader and yield no results. (Looking at the code now, it
looks like readdir will fail with the unusual error ENOENT in that case, so
userland could detect that case easily now.) To pick up the next iteration
of that procedure correctly, you'd have to reopen /proc/TGID/task by name
to get an fd associated with the new leader. That is the only thing I can
think of that is meaningful in userland terms and might be what you mean by
"destroying the old and forming the new". Is that it?

But it also sounds like you may be saying that the lack of locking in
proc_task_readdir means it could just punt out of its listing loop early at
any time that the task it just looked at is reaped. Is that right? That
is OK for userland if any given readdir call returns a short list--but not
if it returns a premature EOF. It looks to me like that's possible too.

If so, that is startling off hand, and breaks the userland expectation by
the "least astonishment" principle. (That is, you can sometimes get a
short listing showing a subset of threads that does not include some
threads you previously saw as alive and are still alive.) It can also
actually break the procedure I described above if one false EOF causes the
reader to miss a new thread it hasn't seen before, so it thinks it has
already stopped all the threads that are alive.

I don't know of anything in userland using that procedure. But it's
what I would have recommended if asked, before you brought this issue
up. (strace -p does a single pass and only intends to offer any
guarantee if you've already finished stopping the thing with SIGSTOP
first. gdb uses other means that amount to setting a breakpoint inside
pthread_create before reading libpthread's data structures from user
memory for the userland thread list without regard to the actual kernel
thread situation.)

I suppose we can just say that proc_task_readdir is entirely unreliable
unless you otherwise know that threads are not being reaped while you
read it, since that seems to be the truth of it. I would sure be
happier as a userland programmer if the kernel gave me something I
could rely on via some feasible race-noticing procedure akin to the one
above, but it's not the end of the world.


Thanks,
Roland