From: David Rientjes on
On Fri, 19 Feb 2010, Nick Piggin wrote:

> Hi,
>
> The patch cpuset,mm: update tasks' mems_allowed in time (58568d2) causes
> a regression uncovered by SGI. Basically it is allowing possible but not
> online nodes in the task_struct.mems_allowed nodemask (which is contrary
> to several comments still in kernel/cpuset.c), and that causes
> cpuset_mem_spread_node() to return an offline node to slab, causing an
> oops.
>
> Easy to reproduce if you have a machine with !online nodes.
>
> - mkdir /dev/cpuset
> - mount cpuset -t cpuset /dev/cpuset
> - echo 1 > /dev/cpuset/memory_spread_slab
>
> kernel BUG at
> /usr/src/packages/BUILD/kernel-default-2.6.32/linux-2.6.32/mm/slab.c:3271!
> bash[6885]: bugcheck! 0 [1]
> Pid: 6885, CPU 5, comm: bash
> psr : 00001010095a2010 ifs : 800000000000038b ip : [<a00000010020cf00>]
> Tainted: G W (2.6.32-0.6.8-default)
> ip is at ____cache_alloc_node+0x440/0x500

It seems like current->mems_allowed is not properly initialized, although
task_cs(current)->mems_allowed is initialized to node_states[N_HIGH_MEMORY].
See below.

> A simple bandaid is to skip !online nodes in cpuset_mem_spread_node().
> However I'm a bit worried about 58568d2.
>
> It is doing a lot of stuff. It is removing the callback_mutex from
> around several seemingly unrelated places (eg. from around
> guarantee_online_cpus(), which explicitly asks to be called with that
> lock held), and other places, so I don't know how it is not racy
> with hotplug.
>

guarantee_online_cpus() truly does require callback_mutex: the
cgroup_scan_tasks() iterator locking can protect against changes in the
cgroup hierarchy, but it doesn't protect against a store to
cs->cpus_allowed or against hotplug.

top_cpuset.cpus_allowed will always need to track cpu_active_map since
those are the schedulable cpus; it looks like that's initialized correctly
for SMP and kept up to date by the cpu hotplug notifier.
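
For reference, the hotplug path looks roughly like this (paraphrased and
trimmed from the 2.6.32 notifier from memory, so treat it as a sketch
rather than the exact code):

	static int cpuset_track_online_cpus(struct notifier_block *nb,
					    unsigned long phase, void *cpu)
	{
		switch (phase) {
		case CPU_ONLINE:
		case CPU_DOWN_PREPARE:
		case CPU_DOWN_FAILED:
			break;
		default:
			return NOTIFY_DONE;
		}

		cgroup_lock();
		mutex_lock(&callback_mutex);
		/* keep top_cpuset in sync with the schedulable cpus */
		cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
		mutex_unlock(&callback_mutex);
		scan_for_empty_cpusets(&top_cpuset);
		cgroup_unlock();

		return NOTIFY_OK;
	}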

I'm not sure what the logic is doing in cpuset_attach() where cs is the
cpuset to attach to:

	if (cs == &top_cpuset) {
		cpumask_copy(cpus_attach, cpu_possible_mask);
		to = node_possible_map;
	}

cpus_attach is properly protected by cgroup_lock, but using
node_possible_map here will set task->mems_allowed to node_possible_map
when the cpuset does not have memory_migrate enabled. This is the source
of your oops, I think.

> Then it also says that the fastpath doesn't use any locking, so the
> update-path first adds the newly allowed nodes, then removes the
> newly prohibited nodes. Unfortunately there are no barriers apparent
> (and none added), and cpumask/nodemask can be larger than one word,
> so it seems there could be races.
>

We can remove the store to tsk->mems_allowed in cpuset_migrate_mm()
because cpuset_change_task_nodemask() already does it under
task_lock(tsk).

cpuset_migrate_mm() does subsequently update the nodemask set by
cpuset_attach() when moving to top_cpuset, so the task doesn't get stuck
with node_possible_map, but that path is only called when memory_migrate is
enabled.

> It also seems like the exported cpuset_mems_allowed and
> cpuset_cpus_allowed APIs are just broken wrt hotplug because the
> hotplug lock is dropped before returning.
>

The usage of cpuset_cpus_allowed_locked() looks wrong in the scheduler as
well: its callers can't be holding callback_mutex, since that lock is only
declared at file scope in the cpuset code.
From: Nick Piggin on
On Thu, Feb 18, 2010 at 01:38:11PM -0800, David Rientjes wrote:
> On Fri, 19 Feb 2010, Nick Piggin wrote:
>
> > Hi,
> >
> > The patch cpuset,mm: update tasks' mems_allowed in time (58568d2) causes
> > a regression uncovered by SGI. Basically it is allowing possible but not
> > online nodes in the task_struct.mems_allowed nodemask (which is contrary
> > to several comments still in kernel/cpuset.c), and that causes
> > cpuset_mem_spread_node() to return an offline node to slab, causing an
> > oops.
> >
> > Easy to reproduce if you have a machine with !online nodes.
> >
> > - mkdir /dev/cpuset
> > - mount cpuset -t cpuset /dev/cpuset
> > - echo 1 > /dev/cpuset/memory_spread_slab
> >
> > kernel BUG at
> > /usr/src/packages/BUILD/kernel-default-2.6.32/linux-2.6.32/mm/slab.c:3271!
> > bash[6885]: bugcheck! 0 [1]
> > Pid: 6885, CPU 5, comm: bash
> > psr : 00001010095a2010 ifs : 800000000000038b ip : [<a00000010020cf00>]
> > Tainted: G W (2.6.32-0.6.8-default)
> > ip is at ____cache_alloc_node+0x440/0x500
>
> It seems like current->mems_allowed is not properly initialized, although
> task_cs(current)->mems_allowed is initialized to node_states[N_HIGH_MEMORY].
> See below.
>
> > A simple bandaid is to skip !online nodes in cpuset_mem_spread_node().
> > However I'm a bit worried about 58568d2.
> >
> > It is doing a lot of stuff. It is removing the callback_mutex from
> > around several seemingly unrelated places (eg. from around
> > guarantee_online_cpus(), which explicitly asks to be called with that
> > lock held), and other places, so I don't know how it is not racy
> > with hotplug.
> >
>
> guarantee_online_cpus() truly does require callback_mutex: the
> cgroup_scan_tasks() iterator locking can protect against changes in the
> cgroup hierarchy, but it doesn't protect against a store to
> cs->cpus_allowed or against hotplug.

Right, but the callback_mutex was being removed by this patch.

>
> top_cpuset.cpus_allowed will always need to track cpu_active_map since
> those are the schedulable cpus; it looks like that's initialized correctly
> for SMP and kept up to date by the cpu hotplug notifier.
>
> I'm not sure what the logic is doing in cpuset_attach() where cs is the
> cpuset to attach to:
>
> 	if (cs == &top_cpuset) {
> 		cpumask_copy(cpus_attach, cpu_possible_mask);
> 		to = node_possible_map;
> 	}
>
> cpus_attach is properly protected by cgroup_lock, but using
> node_possible_map here will set task->mems_allowed to node_possible_map
> when the cpuset does not have memory_migrate enabled. This is the source
> of your oops, I think.

Could be, yes.


> > Then it also says that the fastpath doesn't use any locking, so the
> > update-path first adds the newly allowed nodes, then removes the
> > newly prohibited nodes. Unfortunately there are no barriers apparent
> > (and none added), and cpumask/nodemask can be larger than one word,
> > so it seems there could be races.
> >
>
> We can remove the store to tsk->mems_allowed in cpuset_migrate_mm()
> because cpuset_change_task_nodemask() already does it under
> task_lock(tsk).

But it doesn't matter if the stores are done under lock if the loads are
not. Masks can be multiple words, so there isn't any ordering preventing a
reader from seeing half an old mask and half a new one, which results in
an invalid state. AFAIKS.
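
To illustrate with a hypothetical two-word mask (not actual kernel code,
just the shape of the race):

	/*
	 * With MAX_NUMNODES > BITS_PER_LONG the nodemask is an array of
	 * longs.  The writer switches from the old mask to the new one
	 * word at a time; a lockless reader copying the mask concurrently
	 * can observe word 0 of the new mask together with word 1 of the
	 * old one -- a combination that was never a valid mask.
	 */
	unsigned long mems_allowed[2];		/* shared */

	void update_mask(const unsigned long *new)
	{
		mems_allowed[0] = new[0];	/* a reader may see this... */
		mems_allowed[1] = new[1];	/* ...without yet seeing this */
	}

	void read_mask(unsigned long *snap)
	{
		snap[0] = mems_allowed[0];
		snap[1] = mems_allowed[1];
	}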

>
> cpuset_migrate_mm() does subsequently update the nodemask set by
> cpuset_attach() when moving to top_cpuset, so the task doesn't get stuck
> with node_possible_map, but that path is only called when memory_migrate
> is enabled.
>
> > It also seems like the exported cpuset_mems_allowed and
> > cpuset_cpus_allowed APIs are just broken wrt hotplug because the
> > hotplug lock is dropped before returning.
> >
>
> The usage of cpuset_cpus_allowed_locked() looks wrong in the scheduler as
> well: its callers can't be holding callback_mutex, since that lock is only
> declared at file scope in the cpuset code.

Well, it is exported as cpuset_lock(). And the scheduler has it covered
in all cases by the looks of it, except for select_task_rq(), which is
called by the wakeup code. We should stick WARN_ONs through the cpuset
code for mutexes not held when they should be.
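
Something like this at the top of each helper that documents its locking
(a sketch only; note mutex_is_locked() can only check that *somebody*
holds the mutex, not that the current task does):

	static void guarantee_online_cpus(const struct cpuset *cs,
					  struct cpumask *pmask)
	{
		WARN_ON_ONCE(!mutex_is_locked(&callback_mutex));
		/* ... existing body unchanged ... */
	}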

From: KAMEZAWA Hiroyuki on
On Fri, 19 Feb 2010 00:49:21 +1100
Nick Piggin <npiggin(a)suse.de> wrote:

> Hi,
>
> The patch cpuset,mm: update tasks' mems_allowed in time (58568d2) causes
> a regression uncovered by SGI. Basically it is allowing possible but not
> online nodes in the task_struct.mems_allowed nodemask (which is contrary
> to several comments still in kernel/cpuset.c), and that causes
> cpuset_mem_spread_node() to return an offline node to slab, causing an
> oops.
>
> Easy to reproduce if you have a machine with !online nodes.
>
> - mkdir /dev/cpuset
> - mount cpuset -t cpuset /dev/cpuset
> - echo 1 > /dev/cpuset/memory_spread_slab
>
> kernel BUG at
> /usr/src/packages/BUILD/kernel-default-2.6.32/linux-2.6.32/mm/slab.c:3271!
> bash[6885]: bugcheck! 0 [1]
> Pid: 6885, CPU 5, comm: bash
> psr : 00001010095a2010 ifs : 800000000000038b ip : [<a00000010020cf00>]
> Tainted: G W (2.6.32-0.6.8-default)
> ip is at ____cache_alloc_node+0x440/0x500
>
> unat: 0000000000000000 pfs : 000000000000038b rsc : 0000000000000003
> rnat: 0000000000283d85 bsps: 0000000000000001 pr : 99596aaa69aa6999
> ldrs: 0000000000000000 ccv : 0000000000000018 fpsr: 0009804c0270033f
> csd : 0000000000000000 ssd : 0000000000000000
> b0 : a00000010020cf00 b6 : a0000001004962c0 b7 : a000000100493240
> f6 : 000000000000000000000 f7 : 000000000000000000000
> f8 : 000000000000000000000 f9 : 000000000000000000000
> f10 : 000000000000000000000 f11 : 000000000000000000000
> r1 : a0000001015c6fc0 r2 : 000000000000e662 r3 : 000000000000fffe
> r8 : 000000000000005c r9 : 0000000000000000 r10 : 0000000000004000
> r11 : 0000000000000000 r12 : e000003c3904fcc0 r13 : e000003c39040000
> r14 : 000000000000e662 r15 : a00000010138ed88 r16 : ffffffffffff65c8
> r17 : a00000010138ed80 r18 : a0000001013c7ad0 r19 : a0000001013d3b60
> r20 : e00001b03afdfe18 r21 : 0000000000000001 r22 : e0000130030365c8
> r23 : e000013003040000 r24 : ffffffffffff0400 r25 : 00000000000068ef
> r26 : 00000000000068ef r27 : a0000001029621d0 r28 : 00000000000068f0
> r29 : 00000000000068f0 r30 : 00000000000068f0 r31 : 000000000000000a
>
> Call Trace:
> [<a000000100017a80>] show_stack+0x80/0xa0
> [<a0000001000180e0>] show_regs+0x640/0x920
> [<a000000100029a90>] die+0x190/0x2e0
> [<a000000100029c30>] die_if_kernel+0x50/0x80
> [<a000000100904af0>] ia64_bad_break+0x470/0x760
> [<a00000010000cb60>] ia64_native_leave_kernel+0x0/0x270
> [<a00000010020cf00>] ____cache_alloc_node+0x440/0x500
> [<a00000010020ffa0>] kmem_cache_alloc+0x360/0x660
>
> A simple bandaid is to skip !online nodes in cpuset_mem_spread_node().

I think that's good.
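
Something like the below, perhaps. (Untested sketch; it assumes
mems_allowed always intersects the online map, otherwise it would loop
forever.)

	int cpuset_mem_spread_node(void)
	{
		int node;

		/* keep rotating until we land on an online node */
		do {
			node = next_node(current->cpuset_mem_spread_rotor,
					 current->mems_allowed);
			if (node == MAX_NUMNODES)
				node = first_node(current->mems_allowed);
			current->cpuset_mem_spread_rotor = node;
		} while (!node_online(node));

		return node;
	}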


> However I'm a bit worried about 58568d2.
>
> It is doing a lot of stuff. It is removing the callback_mutex from
> around several seemingly unrelated places (eg. from around
> guarantee_online_cpus(), which explicitly asks to be called with that
> lock held), and other places, so I don't know how it is not racy
> with hotplug.

Because removing a pgdat has not been achieved yet. It was verrrry
difficult.. So, once a node becomes online, it never turns offline again.
But all pages on the node are removed; just the zonelists are rebuilt.
(The zonelist rebuild uses stop_machine_run().)

>
> Then it also says that the fastpath doesn't use any locking, so the
> update-path first adds the newly allowed nodes, then removes the
> newly prohibited nodes. Unfortunately there are no barriers apparent
> (and none added), and cpumask/nodemask can be larger than one word,
> so it seems there could be races.
>
Maybe. IMHO, the "newly allowed nodes" and "newly prohibited nodes" come
from a user's command, so we don't need to guarantee strict correctness
there.

So, our only concern is "don't access an offlined node", right?
But as I wrote, a node that has once been onlined will never be offlined,
so I think it's difficult to cause a panic.
My concern is the zonelist rather than the bitmap, but I have heard no
panic report about updates of it so far. (Maybe because "struct zone" is
never freed.)

> It also seems like the exported cpuset_mems_allowed and
> cpuset_cpus_allowed APIs are just broken wrt hotplug because the
> hotplug lock is dropped before returning.
>
About cpus: they can disappear... so yes, that should be fixed.

> I'd just like to get opinions or comments from people who know the
> code better before wading in too far myself. I'd be really keen on
> making the locking simpler, using seqlocks for fastpaths, etc.
>

Thanks,
-Kame

From: David Rientjes on
On Fri, 19 Feb 2010, KOSAKI Motohiro wrote:

> Personally, I'd prefer to just revert it at once rather than bandaid it.
> 58568d2 didn't introduce any new feature, so we can revert it without
> ABI breakage.
>

Revert a commit from more than six months ago when the fix is probably a
small patch in cpuset_attach()? I think we can do better than that.

This may not have introduced a new feature, but it was a worthwhile change
to avoid the old cpuset_update_task_memory_state() hooks in the mempolicy,
page allocator, etc. code paths that could block on callback_mutex while
iterating the hierarchy.
From: David Rientjes on
On Fri, 19 Feb 2010, Nick Piggin wrote:

> > guarantee_online_cpus() truly does require callback_mutex: the
> > cgroup_scan_tasks() iterator locking can protect against changes in the
> > cgroup hierarchy, but it doesn't protect against a store to
> > cs->cpus_allowed or against hotplug.
>
> Right, but the callback_mutex was being removed by this patch.
>

I was making the case for it to be re-added :)

> > top_cpuset.cpus_allowed will always need to track cpu_active_map since
> > those are the schedulable cpus; it looks like that's initialized
> > correctly for SMP and kept up to date by the cpu hotplug notifier.
> >
> > I'm not sure what the logic is doing in cpuset_attach() where cs is the
> > cpuset to attach to:
> >
> > 	if (cs == &top_cpuset) {
> > 		cpumask_copy(cpus_attach, cpu_possible_mask);
> > 		to = node_possible_map;
> > 	}
> >
> > cpus_attach is properly protected by cgroup_lock, but using
> > node_possible_map here will set task->mems_allowed to node_possible_map
> > when the cpuset does not have memory_migrate enabled. This is the source
> > of your oops, I think.
>
> Could be, yes.
>

I'd be interested to see if you still get the same oops with the patch at
the end of this email that fixes this logic.

> But it doesn't matter if the stores are done under lock if the loads are
> not. Masks can be multiple words, so there isn't any ordering preventing
> a reader from seeing half an old mask and half a new one, which results
> in an invalid state. AFAIKS.
>

It doesn't matter for MAX_NUMNODES > BITS_PER_LONG because
task->mems_allowed only gets updated via cpuset_change_task_nodemask(),
where the added nodes are set first and the removed nodes are cleared
afterwards. The side effect of this lockless access to task->mems_allowed
is that we may have a small race between

	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);

and

	tsk->mems_allowed = *newmems;

but the penalty is that we get an allocation on a removed node, which
isn't a big deal, especially since it was previously allowed.
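
The update path, paraphrased (trimmed from cpuset_change_task_nodemask(),
so this is a sketch rather than the exact code):

	static void cpuset_change_task_nodemask(struct task_struct *tsk,
						nodemask_t *newmems)
	{
		/* widen first: lockless readers see old | new, a superset */
		nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
		mpol_rebind_task(tsk, newmems);
		/* then narrow to exactly the new mask */
		tsk->mems_allowed = *newmems;
	}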

> Well, it is exported as cpuset_lock(). And the scheduler has it covered
> in all cases by the looks of it, except for select_task_rq(), which is
> called by the wakeup code. We should stick WARN_ONs through the cpuset
> code for mutexes not held when they should be.
>

A lot of the reliance on callback_mutex was removed because the strict
hierarchy walking and task membership is now guarded by cgroup_mutex
instead. Some of the comments in kernel/cpuset.c weren't updated, so they
still say callback_mutex when in reality they mean cgroup_mutex.
---
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1319,7 +1319,7 @@ static int fmeter_getrate(struct fmeter *fmp)
 	return val;
 }
 
-/* Protected by cgroup_lock */
+/* Protected by cgroup_mutex held on cpuset_attach() */
 static cpumask_var_t cpus_attach;
 
 /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
@@ -1390,8 +1390,12 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	struct cpuset *oldcs = cgroup_cs(oldcont);
 
 	if (cs == &top_cpuset) {
-		cpumask_copy(cpus_attach, cpu_possible_mask);
-		to = node_possible_map;
+		/*
+		 * top_cpuset.cpus_allowed and top_cpuset.mems_allowed are
+		 * protected by cgroup_lock which is already held here.
+		 */
+		cpumask_copy(cpus_attach, top_cpuset.cpus_allowed);
+		to = top_cpuset.mems_allowed;
 	} else {
 		guarantee_online_cpus(cs, cpus_attach);
 		guarantee_online_mems(cs, &to);