From: Nick Piggin on
On Mon, Feb 22, 2010 at 07:53:39PM +0800, Miao Xie wrote:
> I'm sorry for replying this late.

No problem.


> on 2010-2-19 18:06, David Rientjes wrote:
> > On Fri, 19 Feb 2010, Nick Piggin wrote:
> >
> >>> guarantee_online_cpus() truly does require callback_mutex, the
> >>> cgroup_scan_tasks() iterator locking can protect changes in the cgroup
> >>> hierarchy but it doesn't protect a store to cs->cpus_allowed or for
> >>> hotplug.
> >>
> >> Right, but the callback_mutex was being removed by this patch.
> >>
> >
> > I was making the case for it to be readded :)
>
> But cgroup_mutex is held when someone changes cs->cpus_allowed or does hotplug,
> so I think callback_mutex is not necessary in this case.

So long as that's done consistently (and we should update the comments
too).


> >>> top_cpuset.cpus_allowed will always need to track cpu_active_map since
> >>> those are the schedulable cpus, it looks like that's initialized for SMP
> >>> and the cpu hotplug notifier does that correctly.
> >>>
> >>> I'm not sure what the logic is doing in cpuset_attach() where cs is the
> >>> cpuset to attach to:
> >>>
> >>> 	if (cs == &top_cpuset) {
> >>> 		cpumask_copy(cpus_attach, cpu_possible_mask);
> >>> 		to = node_possible_map;
> >>> 	}
> >>>
> >>> cpus_attach is properly protected by cgroup_lock, but using
> >>> node_possible_map here will set task->mems_allowed to node_possible_map
> >>> when the cpuset does not have memory_migrate enabled. This is the source
> >>> of your oops, I think.
> >>
> >> Could be, yes.
> >>
> >
> > I'd be interested to see if you still get the same oops with the patch at
> > the end of this email that fixes this logic.
>
> I think this patch can't fix this bug, because mems_allowed of tasks in the
> top group is set to node_possible_map by default, not when the task is
> attached.
>
> I made a new patch at the end of this email to fix it, but I have no machine
> to test it on right now. Who can test it for me?

I can test it for you, but I wonder about your barriers.

>
> ---
> diff --git a/init/main.c b/init/main.c
> index 4cb47a1..512ba15 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -846,7 +846,7 @@ static int __init kernel_init(void * unused)
> /*
> * init can allocate pages on any node
> */
> - set_mems_allowed(node_possible_map);
> + set_mems_allowed(node_states[N_HIGH_MEMORY]);
> /*
> * init can run on any cpu.
> */
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index ba401fa..e29b440 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -935,10 +935,12 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
> struct task_struct *tsk = current;
>
> tsk->mems_allowed = *to;
> + wmb();
>
> do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
>
> guarantee_online_mems(task_cs(tsk),&tsk->mems_allowed);
> + wmb();
> }
>
> /*

You always need to comment barriers (and use smp_ variants unless you're
doing mmio).
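
For example, something like this — purely a sketch of what I mean, and I'm
guessing that the intent of your wmb()s is to publish the updated
mems_allowed around the migration:

	tsk->mems_allowed = *to;
	/*
	 * Assumed intent: make the new mems_allowed visible to lockless
	 * readers before the migration starts. The comment should also
	 * name the read-side barrier this pairs with.
	 */
	smp_wmb();

	do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);

	guarantee_online_mems(task_cs(tsk), &tsk->mems_allowed);
	/* Same assumed ordering requirement for the narrowed mask. */
	smp_wmb();

If a barrier has no documented pairing on the reader side, that's usually a
sign it isn't actually ordering anything useful.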


> @@ -1391,11 +1393,10 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
>
> if (cs == &top_cpuset) {
> cpumask_copy(cpus_attach, cpu_possible_mask);
> - to = node_possible_map;
> } else {
> guarantee_online_cpus(cs, cpus_attach);
> - guarantee_online_mems(cs, &to);
> }
> + guarantee_online_mems(cs, &to);
>
> /* do per-task migration stuff possibly for each in the threadgroup */
> cpuset_attach_task(tsk, &to, cs);
> @@ -2090,15 +2091,19 @@ static int cpuset_track_online_cpus(struct notifier_block *unused_nb,
> static int cpuset_track_online_nodes(struct notifier_block *self,
> unsigned long action, void *arg)
> {
> + nodemask_t oldmems;
> +
> cgroup_lock();
> switch (action) {
> case MEM_ONLINE:
> - case MEM_OFFLINE:
> + oldmems = top_cpuset.mems_allowed;
> mutex_lock(&callback_mutex);
> top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
> mutex_unlock(&callback_mutex);
> - if (action == MEM_OFFLINE)
> - scan_for_empty_cpusets(&top_cpuset);
> + update_tasks_nodemask(&top_cpuset, &oldmems, NULL);
> + break;
> + case MEM_OFFLINE:
> + scan_for_empty_cpusets(&top_cpuset);
> break;
> default:
> break;
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index fbb6222..84c7f99 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -219,7 +219,7 @@ int kthreadd(void *unused)
> set_task_comm(tsk, "kthreadd");
> ignore_signals(tsk);
> set_cpus_allowed_ptr(tsk, cpu_all_mask);
> - set_mems_allowed(node_possible_map);
> + set_mems_allowed(node_states[N_HIGH_MEMORY]);
>
> current->flags |= PF_NOFREEZE | PF_FREEZER_NOSIG;
>
From: Nick Piggin on
On Fri, Feb 19, 2010 at 02:06:45AM -0800, David Rientjes wrote:
> On Fri, 19 Feb 2010, Nick Piggin wrote:
> > But it doesn't matter if stores are done under lock, if the loads are
> > not. masks can be multiple words, so there isn't any ordering between
> > reading half and old mask and half a new one that results in an invalid
> > state. AFAIKS.
> >
>
> It doesn't matter for MAX_NUMNODES > BITS_PER_LONG because
> task->mems_allowed only gets updated via cpuset_change_task_nodemask()
> where the added nodes are set and then the removed nodes are cleared. The
> side effect of this lockless access to task->mems_allowed is that we may
> have a small race between
>
> nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
>
> and
>
> tsk->mems_allowed = *newmems;
>
> but the penalty is that we get an allocation on a removed node, which
> isn't a big deal, especially since it was previously allowed.

If you have a concurrent reader without any synchronisation, then what
stops it from loading a word of the mask before stores to add the new
nodes and then loading another word of the mask after the stores to
remove the old nodes? (which can give an empty mask).
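
To make it concrete, here is a hypothetical two-word interleaving (node
numbers invented purely for illustration), where the only newly added node
lives in word 0 and the only node being removed lives in word 1:

	writer (set new bits, then clear old)	lockless reader

						load word 0 -> 0 (new bit not set yet)
	nodes_or(): set new node bit in word 0
	clear old node bit in word 1
						load word 1 -> 0 (old bit just cleared)

The reader ends up with an all-zero mems_allowed, even though the mask was
never empty at any point on the writer's side.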

From: David Rientjes on
On Mon, 22 Feb 2010, Nick Piggin wrote:

> If you have a concurrent reader without any synchronisation, then what
> stops it from loading a word of the mask before stores to add the new
> nodes and then loading another word of the mask after the stores to
> remove the old nodes? (which can give an empty mask).
>

Currently nothing, so we'll need a variant for configurations where the
size of nodemask_t is larger than we can atomically store.
From: David Rientjes on
On Tue, 23 Feb 2010, Miao Xie wrote:

> Sorry, could you explain what you are suggesting?
> I think it is hard to fix this problem by adding a variant, because it is
> hard to avoid loading a word of the mask before
>
> nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
>
> and then loading another word of the mask after
>
> tsk->mems_allowed = *newmems;
>
> unless we use a lock.
>
> Maybe we need a rw-lock to protect task->mems_allowed.
>

I meant that we need to define synchronization only for configurations
that do not do atomic nodemask_t stores; it's unnecessary otherwise.
We'll need to load and store tsk->mems_allowed via a helper function that
is defined to take the rwlock for such configs and only read/write the
nodemask for others.
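
Roughly the sort of thing I mean — the names are placeholders and this is an
untested sketch, not a proposal of the final interface:

	#if MAX_NUMNODES > BITS_PER_LONG
	/* Multi-word nodemask: loads and stores can tear, so serialize them. */
	static DEFINE_RWLOCK(mems_allowed_lock);

	static inline void read_mems_allowed(struct task_struct *tsk,
					     nodemask_t *mask)
	{
		read_lock(&mems_allowed_lock);
		*mask = tsk->mems_allowed;
		read_unlock(&mems_allowed_lock);
	}

	static inline void write_mems_allowed(struct task_struct *tsk,
					      const nodemask_t *mask)
	{
		write_lock(&mems_allowed_lock);
		tsk->mems_allowed = *mask;
		write_unlock(&mems_allowed_lock);
	}
	#else
	/* Single-word nodemask: an aligned load/store doesn't tear. */
	static inline void read_mems_allowed(struct task_struct *tsk,
					     nodemask_t *mask)
	{
		*mask = tsk->mems_allowed;
	}

	static inline void write_mems_allowed(struct task_struct *tsk,
					      const nodemask_t *mask)
	{
		tsk->mems_allowed = *mask;
	}
	#endif

Whether a single global rwlock would be acceptable or it would have to be
per-task is a separate question; the point is just that the locking
disappears entirely for the common small-nodemask configs.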
From: David Rientjes on
On Tue, 23 Feb 2010, Miao Xie wrote:

> >> /*
> >> @@ -1391,11 +1393,10 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
> >>
> >> if (cs == &top_cpuset) {
> >> cpumask_copy(cpus_attach, cpu_possible_mask);
> >> - to = node_possible_map;
> >> } else {
> >> guarantee_online_cpus(cs, cpus_attach);
> >> - guarantee_online_mems(cs, &to);
> >> }
> >> + guarantee_online_mems(cs, &to);
> >>
> >> /* do per-task migration stuff possibly for each in the threadgroup */
> >> cpuset_attach_task(tsk, &to, cs);
> >
> > Do we need to set cpus_attach to cpu_possible_mask? Why won't
> > cpu_active_mask suffice?
>
> If we set cpus_attach to cpu_possible_mask, we needn't do anything for tasks in the top_cpuset when
> doing CPU hotplug. If not, we will have to update cpus_allowed of all tasks in the top_cpuset.
>

CPU hotplug sets top_cpuset's cpus_allowed to cpu_active_mask by default,
regardless of what was onlined or offlined. cpus_attach in the context of
your patch (in cpuset_attach()) passes cpu_possible_mask to
set_cpus_allowed_ptr() if the task is being attached to top_cpuset; my
question was: why don't we pass cpu_active_mask instead? In other words, I
think we should do

	cpumask_copy(cpus_attach, cpu_active_mask);

when attaching to top_cpuset, as my patch did.