procfs: Do not release pid_ns->proc

Prev: [RFC][PATCH 0/8] perf pmu interface
Next: $100,000 OFFER

From: Louis Rilling on 21 Jun 2010 07:10

On 18/06/10 19:55 +0200, Oleg Nesterov wrote:
> On 06/18, Louis Rilling wrote:
> > > @@ -74,7 +74,7 @@ static int proc_get_sb(struct file_syste
> > > ei = PROC_I(sb->s_root->d_inode);
> > > if (!ei->pid) {
> > > rcu_read_lock();
> > > - ei->pid = get_pid(find_pid_ns(1, ns));
> > > + ei->pid = find_pid_ns(1, ns);
> >
> > I don't think that this is correct. IIUC, proc_delete_inode() calls put_pid() on
> > ei->pid.
>
> Yes,
>
> > So either a special case is added in proc_delete_inode(), or we try to
> > live with get_pid() here. I'm actually not sure that we can pretend that this
> > pid remains valid if we don't get_pid() here.
>
> But please see another change below,
>
> > > +static void proc_mntput(struct work_struct *work)
> > > {
> > > + struct pid_namespace *ns = container_of(work, struct pid_namespace, proc_put);
> > > +
> > > + PROC_I(ns->proc_mnt->mnt_sb->s_root->d_inode)->pid = NULL;
> > > mntput(ns->proc_mnt);
> > > }
>
> it clears ei->pid.
>
> We are called from free_pid_ns() path, this ->pid must not have any reference.
> Any get_pid() implies get_pid_ns().
>
> What do you think?

Hm, I didn't look close enough. Sorry about that. However, I'm still concerned
with this since this pid can have been freed right after container init's
release_task(), and I don't see how it is guaranteed that nobody still tries to
access this proc_mnt.

Thanks,

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes

From: Louis Rilling on 21 Jun 2010 07:20

On 18/06/10 19:55 +0200, Oleg Nesterov wrote:
> On 06/18, Louis Rilling wrote:
> >
> > On 18/06/10 18:08 +0200, Oleg Nesterov wrote:
> > >
> > > Not sure I ever understood this code. Certainly I can't say I understand
> > > it now. Still, do we really need this circle? I am almost sure the patch
> > > below is not right (and it wasn't tested at all), but could you take a
> > > look?
> >
> > I won't pretend understanding better than you do. Still I can try to give my 2
> > cents.
> >
> > Overall, I don't feel comfortable at being able to have a living proc_mnt
> > with a dead pid_ns.
>
> Yes, this should be fixed, I already realized this. work->func(ns) is
> called when ns is already fixed.
>
> Otherwise, nobody should use ns->proc_mount when ns is already dead/freed.

I meant the opposite. proc_mnt can be kept mounted somewhere, and accesses to it
will likely try to access (freed) pid_ns from it.

Thanks,

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes

From: Louis Rilling on 21 Jun 2010 07:20

On 18/06/10 18:27 +0200, Oleg Nesterov wrote:
> On 06/18, Louis Rilling wrote:
> >
> > On 17/06/10 23:36 +0200, Oleg Nesterov wrote:
> > > On 06/17, Eric W. Biederman wrote:
> > > >
> > > > The task->children isn't changed until __unhash_process() which runs
> > > > after flush_proc_task().
> > >
> > > Yes. But this is only the current implementation detail.
> > > It would be nice to cleanup the code so that EXIT_DEAD tasks are
> > > never sit in ->children list.
> > >
> > > > So we should be able to come up with
> > > > a variant of do_wait() that zap_pid_ns_processes can use that does
> > > > what we need.
> > >
> > > See above...
> > >
> > > Even if we modify do_wait() or add the new variant, how the caller
> > > can wait for EXIT_DEAD tasks? I don't think we want to modify
> > > release_task() to do __wake_up_parent() or something similar.
> >
> > Indeed, I was thinking about calling __wake_up_parent() from release_task()
> > once parent->children becomes empty.
> >
> > Not sure about the performance impact though. Maybe some WAIT_NO_CHILDREN flag
> > in parent->signal could limit it. But if EXIT_DEAD children are removed from
> > ->children before release_task(), I'm afraid that this becomes impossible.
>
> Thinking more, even the current do_wait() from zap_pid_ns_processes()
> is not really good. Suppose that some none-init thread is ptraced, then
> zap_pid_ns_processes() will hange until the tracer does do_wait() or
> exits.

Is this really a bad thing? If somebody ptraces a task in a pid namespace, that
sounds reasonable to have this namespace (and it's init task) pinned.

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes

From: Eric W. Biederman on 21 Jun 2010 09:00

Louis Rilling <Louis.Rilling(a)kerlabs.com> writes:

> On 18/06/10 18:27 +0200, Oleg Nesterov wrote:
>> On 06/18, Louis Rilling wrote:
>> >
>> > On 17/06/10 23:36 +0200, Oleg Nesterov wrote:
>> > > On 06/17, Eric W. Biederman wrote:
>> > > >
>> > > > The task->children isn't changed until __unhash_process() which runs
>> > > > after flush_proc_task().
>> > >
>> > > Yes. But this is only the current implementation detail.
>> > > It would be nice to cleanup the code so that EXIT_DEAD tasks are
>> > > never sit in ->children list.
>> > >
>> > > > So we should be able to come up with
>> > > > a variant of do_wait() that zap_pid_ns_processes can use that does
>> > > > what we need.
>> > >
>> > > See above...
>> > >
>> > > Even if we modify do_wait() or add the new variant, how the caller
>> > > can wait for EXIT_DEAD tasks? I don't think we want to modify
>> > > release_task() to do __wake_up_parent() or something similar.
>> >
>> > Indeed, I was thinking about calling __wake_up_parent() from release_task()
>> > once parent->children becomes empty.
>> >
>> > Not sure about the performance impact though. Maybe some WAIT_NO_CHILDREN flag
>> > in parent->signal could limit it. But if EXIT_DEAD children are removed from
>> > ->children before release_task(), I'm afraid that this becomes impossible.
>>
>> Thinking more, even the current do_wait() from zap_pid_ns_processes()
>> is not really good. Suppose that some none-init thread is ptraced, then
>> zap_pid_ns_processes() will hange until the tracer does do_wait() or
>> exits.
>
> Is this really a bad thing? If somebody ptraces a task in a pid namespace, that
> sounds reasonable to have this namespace (and it's init task) pinned.

Louis. Have you seen this problem hit without my setns patch?

I'm pretty certain that this hits because there are processes do_wait
does not wait for, in particular processes in a disjoint process tree.

So at this point I am really favoring killing the do_wait and making
this all asynchronous.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Louis Rilling on 21 Jun 2010 10:20

On 21/06/10 5:58 -0700, Eric W. Biederman wrote:
> Louis Rilling <Louis.Rilling(a)kerlabs.com> writes:
>
> > On 18/06/10 18:27 +0200, Oleg Nesterov wrote:
> >> On 06/18, Louis Rilling wrote:
> >> >
> >> > On 17/06/10 23:36 +0200, Oleg Nesterov wrote:
> >> > > On 06/17, Eric W. Biederman wrote:
> >> > > >
> >> > > > The task->children isn't changed until __unhash_process() which runs
> >> > > > after flush_proc_task().
> >> > >
> >> > > Yes. But this is only the current implementation detail.
> >> > > It would be nice to cleanup the code so that EXIT_DEAD tasks are
> >> > > never sit in ->children list.
> >> > >
> >> > > > So we should be able to come up with
> >> > > > a variant of do_wait() that zap_pid_ns_processes can use that does
> >> > > > what we need.
> >> > >
> >> > > See above...
> >> > >
> >> > > Even if we modify do_wait() or add the new variant, how the caller
> >> > > can wait for EXIT_DEAD tasks? I don't think we want to modify
> >> > > release_task() to do __wake_up_parent() or something similar.
> >> >
> >> > Indeed, I was thinking about calling __wake_up_parent() from release_task()
> >> > once parent->children becomes empty.
> >> >
> >> > Not sure about the performance impact though. Maybe some WAIT_NO_CHILDREN flag
> >> > in parent->signal could limit it. But if EXIT_DEAD children are removed from
> >> > ->children before release_task(), I'm afraid that this becomes impossible.
> >>
> >> Thinking more, even the current do_wait() from zap_pid_ns_processes()
> >> is not really good. Suppose that some none-init thread is ptraced, then
> >> zap_pid_ns_processes() will hange until the tracer does do_wait() or
> >> exits.
> >
> > Is this really a bad thing? If somebody ptraces a task in a pid namespace, that
> > sounds reasonable to have this namespace (and it's init task) pinned.
>
> Louis. Have you seen this problem hit without my setns patch?

Yes. I hit it with Kerrighed patches. I also have an ugly reproducer on
2.6.35-rc3 (see attachments). Ugly because I introduced artifical delays
in release_task(). I couldn't trigger the bug without it, probably because the
scheduler is too kind :)

I'm using memory poisoining (SLAB and DEBUG_SLAB) to make it easy to observe the
bug.

Example:
# ./proc_flush_task-bug-reproducer 1

>
> I'm pretty certain that this hits because there are processes do_wait
> does not wait for, in particular processes in a disjoint process tree.

Indeed do_wait() misses EXIT_DEAD children.

>
> So at this point I am really favoring killing the do_wait and making
> this all asynchronous.

Any idea about how to do it?

Thanks,

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes

First | Prev | Next | Last
Pages: 1 2 3 4 5
Prev: [RFC][PATCH 0/8] perf pmu interface
Next: $100,000 OFFER

procfs: Do not release pid_ns->proc_mnt too early