procfs: Do not release pid_ns->proc

Prev: [RFC][PATCH 0/8] perf pmu interface
Next: $100,000 OFFER

From: Pavel Emelyanov on 16 Jun 2010 12:10

> +void proc_new_task(struct task_struct *task)
> +{
> + struct pid *pid;
> + int i;
> +
> + if (!task->pid)
> + return;
> +
> + pid = task_pid(task);
> + for (i = 0; i <= pid->level; i++)
> + mntget(pid->numbers[i].ns->proc_mnt);

I feel I'm missing something significant, but this patch breaks
the mntget/mntput balance. Doesn't it?

> +}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Louis Rilling on 16 Jun 2010 12:20

On 16/06/10 20:04 +0400, Pavel Emelyanov wrote:
> > +void proc_new_task(struct task_struct *task)
> > +{
> > + struct pid *pid;
> > + int i;
> > +
> > + if (!task->pid)
> > + return;
> > +
> > + pid = task_pid(task);
> > + for (i = 0; i <= pid->level; i++)
> > + mntget(pid->numbers[i].ns->proc_mnt);
>
> I feel I'm missing something significant, but this patch breaks
> the mntget/mntput balance. Doesn't it?

Gah. Sorry, the part in proc_flush_task() is missing. Will resend ASAP.

Thanks,

Louis

>
> > +}
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo(a)vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Pavel Emelyanov on 17 Jun 2010 06:00

On 06/16/2010 08:34 PM, Louis Rilling wrote:
> [ Resending, hopefully with all pieces ]
>
> Detached tasks are not seen by zap_pid_ns_processes()->sys_wait4(), so
> that release_task()->proc_flush_task() of container init can be called
> before it is for some detached tasks in the namespace.
>
> Pin proc_mnt's in copy_process(), so that proc_flush_task() becomes safe
> whatever the ordering of tasks.

See one comment below.

> Signed-off-by: Louis Rilling <louis.rilling(a)kerlabs.com>
> ---
> fs/proc/base.c | 18 ++++++++++++++++++
> include/linux/proc_fs.h | 4 ++++
> kernel/fork.c | 1 +
> 3 files changed, 23 insertions(+), 0 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index acb7ef8..d6cdd91 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2663,6 +2663,23 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
> .setattr = proc_setattr,
> };
>
> +/*
> + * Pin all proc_mnt so that detached tasks can safely call proc_flush_task()
> + * after container init calls itself proc_flush_task().
> + */
> +void proc_new_task(struct task_struct *task)
> +{
> + struct pid *pid;
> + int i;
> +
> + if (!task->pid)
> + return;
> +
> + pid = task_pid(task);
> + for (i = 0; i <= pid->level; i++)
> + mntget(pid->numbers[i].ns->proc_mnt);

NAK. Pids live their own lives - task can change one, pid will
become orphan and will be destroyed, so you'll leak.

There's another big problem with proc mount - you can umount proc
and the namespace will have a stale pointer.

I think we should have a kern_mount-ed proc at the namespace createi

> +}
> +
> static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
> {
> struct dentry *dentry, *leader, *dir;
> @@ -2744,6 +2761,7 @@ void proc_flush_task(struct task_struct *task)
> upid = &pid->numbers[i];
> proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
> tgid->numbers[i].nr);
> + mntput(upid->ns->proc_mnt);
> }
>
> upid = &pid->numbers[pid->level];
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index 379eaed..f24faa1 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -104,6 +104,7 @@ struct vmcore {
>
> extern void proc_root_init(void);
>
> +void proc_new_task(struct task_struct *task);
> void proc_flush_task(struct task_struct *task);
>
> extern struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode,
> @@ -184,6 +185,9 @@ extern void dup_mm_exe_file(struct mm_struct *oldmm, struct mm_struct *newmm);
> #define proc_net_fops_create(net, name, mode, fops) ({ (void)(mode), NULL; })
> static inline void proc_net_remove(struct net *net, const char *name) {}
>
> +static inline void proc_new_task(struct task_struct *task)
> +{
> +}
> static inline void proc_flush_task(struct task_struct *task)
> {
> }
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b6cce14..c6c2874 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1281,6 +1281,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> total_forks++;
> spin_unlock(&current->sighand->siglock);
> write_unlock_irq(&tasklist_lock);
> + proc_new_task(p);
> proc_fork_connector(p);
> cgroup_post_fork(p);
> perf_event_fork(p);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Eric W. Biederman on 17 Jun 2010 09:50

Pavel Emelyanov <xemul(a)parallels.com> writes:

> On 06/16/2010 08:34 PM, Louis Rilling wrote:
>> [ Resending, hopefully with all pieces ]
>>
>> Detached tasks are not seen by zap_pid_ns_processes()->sys_wait4(), so
>> that release_task()->proc_flush_task() of container init can be called
>> before it is for some detached tasks in the namespace.

There are two ways we can go from here.

- Completely asynchronous children exiting.
- Waiting for all of our children to exit.

Your patch appears to be a weird middle ground, that is hard to
analyze, abusing the mount count as a thread count.

I have been weighing the options between them, and to me properly
waiting for all the processes to exit in zap_pid_ns_processes as we
currently try to do is in our advantage. It is simple and it means
one less cache line bounce for a write to global variable in the
much more common fork/exit path that we have to deal with.

The task->children isn't changed until __unhash_process() which runs
after flush_proc_task(). So we should be able to come up with
a variant of do_wait() that zap_pid_ns_processes can use that does
what we need.

Louis do you want to look at that?

>> Pin proc_mnt's in copy_process(), so that proc_flush_task() becomes safe
>> whatever the ordering of tasks.
>
> See one comment below.
>
>> Signed-off-by: Louis Rilling <louis.rilling(a)kerlabs.com>
>> ---
>> fs/proc/base.c | 18 ++++++++++++++++++
>> include/linux/proc_fs.h | 4 ++++
>> kernel/fork.c | 1 +
>> 3 files changed, 23 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index acb7ef8..d6cdd91 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -2663,6 +2663,23 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
>> .setattr = proc_setattr,
>> };
>>
>> +/*
>> + * Pin all proc_mnt so that detached tasks can safely call proc_flush_task()
>> + * after container init calls itself proc_flush_task().
>> + */
>> +void proc_new_task(struct task_struct *task)
>> +{
>> + struct pid *pid;
>> + int i;
>> +
>> + if (!task->pid)
>> + return;
>> +
>> + pid = task_pid(task);
>> + for (i = 0; i <= pid->level; i++)
>> + mntget(pid->numbers[i].ns->proc_mnt);
>
> NAK. Pids live their own lives - task can change one, pid will
> become orphan and will be destroyed, so you'll leak.

There is that nasty case in exec isn't there. Why we ever made it
part of the ABI that calling exec on a thread changes the pid of
that thread to the pid of the thread group is beyond me.

> There's another big problem with proc mount - you can umount proc
> and the namespace will have a stale pointer.

Not true. pid_ns->proc_mnt is an internal kernel mount. See
pid_ns_prepare_proc.

> I think we should have a kern_mount-ed proc at the namespace createi

I agree, and we mostly do. In my queue for the unsharing of the pid
namespace I have a patch that makes that more explicit.

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Louis Rilling on 17 Jun 2010 10:30

On Thu, Jun 17, 2010 at 06:41:49AM -0700, Eric W. Biederman wrote:
> Pavel Emelyanov <xemul(a)parallels.com> writes:
>
> > On 06/16/2010 08:34 PM, Louis Rilling wrote:
> >> [ Resending, hopefully with all pieces ]
> >>
> >> Detached tasks are not seen by zap_pid_ns_processes()->sys_wait4(), so
> >> that release_task()->proc_flush_task() of container init can be called
> >> before it is for some detached tasks in the namespace.
>
> There are two ways we can go from here.
>
> - Completely asynchronous children exiting.
> - Waiting for all of our children to exit.

Agreed.

>
> Your patch appears to be a weird middle ground, that is hard to
> analyze, abusing the mount count as a thread count.
>
> I have been weighing the options between them, and to me properly
> waiting for all the processes to exit in zap_pid_ns_processes as we
> currently try to do is in our advantage. It is simple and it means
> one less cache line bounce for a write to global variable in the
> much more common fork/exit path that we have to deal with.
>
> The task->children isn't changed until __unhash_process() which runs
> after flush_proc_task(). So we should be able to come up with
> a variant of do_wait() that zap_pid_ns_processes can use that does
> what we need.

Sounds correct.

>
> Louis do you want to look at that?

Give me a few days to look at that. IMHO my patch fixes the bug (see the
comment below), which was an emergency at work. Coding an improved fix is
lower priority :)

>
> >> Pin proc_mnt's in copy_process(), so that proc_flush_task() becomes safe
> >> whatever the ordering of tasks.
> >
> > See one comment below.
> >
> >> Signed-off-by: Louis Rilling <louis.rilling(a)kerlabs.com>
> >> ---
> >> fs/proc/base.c | 18 ++++++++++++++++++
> >> include/linux/proc_fs.h | 4 ++++
> >> kernel/fork.c | 1 +
> >> 3 files changed, 23 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/fs/proc/base.c b/fs/proc/base.c
> >> index acb7ef8..d6cdd91 100644
> >> --- a/fs/proc/base.c
> >> +++ b/fs/proc/base.c
> >> @@ -2663,6 +2663,23 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
> >> .setattr = proc_setattr,
> >> };
> >>
> >> +/*
> >> + * Pin all proc_mnt so that detached tasks can safely call proc_flush_task()
> >> + * after container init calls itself proc_flush_task().
> >> + */
> >> +void proc_new_task(struct task_struct *task)
> >> +{
> >> + struct pid *pid;
> >> + int i;
> >> +
> >> + if (!task->pid)
> >> + return;
> >> +
> >> + pid = task_pid(task);
> >> + for (i = 0; i <= pid->level; i++)
> >> + mntget(pid->numbers[i].ns->proc_mnt);
> >
> > NAK. Pids live their own lives - task can change one, pid will
> > become orphan and will be destroyed, so you'll leak.
>
> There is that nasty case in exec isn't there. Why we ever made it
> part of the ABI that calling exec on a thread changes the pid of
> that thread to the pid of the thread group is beyond me.

You're right that I forgot about de_thread(). However, de_thread() does not
replace a task pid with an arbitrary other pid. The new pid lives in the same
pid namespaces, so that proc_flush_task() will call mntput() on the same
proc_mnts as the ones on which proc_new_task() called mntget().

Thanks,

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes

| Next | Last
Pages: 1 2 3 4 5
Prev: [RFC][PATCH 0/8] perf pmu interface
Next: $100,000 OFFER

procfs: Do not release pid_ns->proc_mnt too early