From: Louis Rilling
On 08/07/10 21:39 -0700, Eric W. Biederman wrote:
>
> Currently it is possible to put proc_mnt before we have flushed the
> last process that will use the proc_mnt to flush its proc entries.
>
> This race is fixed by not flushing proc entries for dead pid
> namespaces, and calling pid_ns_release_proc unconditionally from
> zap_pid_ns_processes after the pid namespace has been declared dead.

One comment below.

>
> To ensure we don't unnecessarily leak any dcache entries with skipped
> flushes, pid_ns_release_proc flushes the entire proc_mnt when it is
> called.
>
> Signed-off-by: Eric W. Biederman <ebiederm(a)xmission.com>
> ---
>  fs/proc/base.c         |    9 +++++----
>  fs/proc/root.c         |    3 +++
>  kernel/pid_namespace.c |    1 +
>  3 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index acb7ef8..e9d84e1 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2742,13 +2742,14 @@ void proc_flush_task(struct task_struct *task)
>
>  	for (i = 0; i <= pid->level; i++) {
>  		upid = &pid->numbers[i];
> +
> +		/* Don't bother flushing dead pid namespaces */
> +		if (test_bit(PIDNS_DEAD, &upid->ns->flags))
> +			continue;
> +

IMHO, nothing prevents zap_pid_ns_processes() from setting PIDNS_DEAD and
calling pid_ns_release_proc() right here, between the test_bit() check and the
proc_flush_task_mnt() call below: zap_pid_ns_processes() does not wait for
EXIT_DEAD (self-reaping) children to be released.

Thanks,

Louis

>  		proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
>  					tgid->numbers[i].nr);
>  	}
> -
> -	upid = &pid->numbers[pid->level];
> -	if (upid->nr == 1)
> -		pid_ns_release_proc(upid->ns);
>  }
>
> static struct dentry *proc_pid_instantiate(struct inode *dir,
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index cfdf032..2298fdd 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -209,5 +209,8 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
>
>  void pid_ns_release_proc(struct pid_namespace *ns)
>  {
> +	/* Flush any cached proc dentries for this pid namespace */
> +	shrink_dcache_parent(ns->proc_mnt->mnt_root);
> +
>  	mntput(ns->proc_mnt);
>  }
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 92032d1..43dec5d 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -189,6 +189,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  		rc = sys_wait4(-1, NULL, __WALL, NULL);
>  	} while (rc != -ECHILD);
>  
> +	pid_ns_release_proc(pid_ns);
>  	acct_exit_ns(pid_ns);
>  	return;
> }
> --
> 1.6.5.2.143.g8cc62
>

--
Dr Louis Rilling                Kerlabs
Skype: louis.rilling            Batiment Germanium
Phone: (+33|0) 6 80 89 08 23    80 avenue des Buttes de Coesmes
http://www.kerlabs.com/         35700 Rennes
From: Louis Rilling
On 09/07/10 6:05 -0700, Eric W. Biederman wrote:
> Louis Rilling <Louis.Rilling(a)kerlabs.com> writes:
>
> > On 08/07/10 21:39 -0700, Eric W. Biederman wrote:
> >>
> >> Currently it is possible to put proc_mnt before we have flushed the
> >> last process that will use the proc_mnt to flush its proc entries.
> >>
> >> This race is fixed by not flushing proc entries for dead pid
> >> namespaces, and calling pid_ns_release_proc unconditionally from
> >> zap_pid_ns_processes after the pid namespace has been declared dead.
> >
> > One comment below.
> >
> >>
> >> To ensure we don't unnecessarily leak any dcache entries with skipped
> >> flushes, pid_ns_release_proc flushes the entire proc_mnt when it is
> >> called.
> >>
> >> Signed-off-by: Eric W. Biederman <ebiederm(a)xmission.com>
> >> ---
> >>  fs/proc/base.c         |    9 +++++----
> >>  fs/proc/root.c         |    3 +++
> >>  kernel/pid_namespace.c |    1 +
> >>  3 files changed, 9 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/fs/proc/base.c b/fs/proc/base.c
> >> index acb7ef8..e9d84e1 100644
> >> --- a/fs/proc/base.c
> >> +++ b/fs/proc/base.c
> >> @@ -2742,13 +2742,14 @@ void proc_flush_task(struct task_struct *task)
> >>
> >>  	for (i = 0; i <= pid->level; i++) {
> >>  		upid = &pid->numbers[i];
> >> +
> >> +		/* Don't bother flushing dead pid namespaces */
> >> +		if (test_bit(PIDNS_DEAD, &upid->ns->flags))
> >> +			continue;
> >> +
> >
> > IMHO, nothing prevents zap_pid_ns_processes() from setting PIDNS_DEAD and
> > calling pid_ns_release_proc() right here, between the test_bit() check and the
> > proc_flush_task_mnt() call below: zap_pid_ns_processes() does not wait for
> > EXIT_DEAD (self-reaping) children to be released.
>
> Good point. We need something, probably a lock, to prevent proc_mnt from
> going away here. We might do a little better if we were starting with
> a specific dentry; those at least have some RCU properties, but that isn't
> a big help.
>
> Hmm. Perhaps there is a way to completely restructure this flushing
> of dentries. It is just an optimization, after all, so that we don't get
> too many stale dentries building up.
>
> It might just be worth simply killing proc_flush_mnt altogether. I know
> the difference is measurable when we don't do the flushing, but perhaps
> there could be a work struct that periodically wakes up and smacks stale
> proc dentries.
>
> Right now I really don't think proc_flush_task is worth the hassle it
> causes.

Indeed, proc_flush_task() seems to be the only bad guy trying to access
pid_ns->proc_mnt after the death of the init process.

But I don't know enough about the performance impact of removing it.

Louis

>
> Grumble, grumble. More thinking to do.
>
> Eric
