From: Peter Zijlstra on
On Wed, 2010-05-19 at 14:35 +0530, Amit K. Arora wrote:
> Problem : In a stress test where some heavy tests were running along with
> regular CPU offlining and onlining, a hang was observed. The system seems to
> be hung at a point where migration_call() tries to kill the migration_thread
> of the dying CPU, which just got moved to the current CPU. This migration
> thread does not get a chance to run (and die) since rt_throttled is set to 1
> on current, and it doesn't get cleared as the hrtimer which is supposed to
> reset the rt bandwidth (sched_rt_period_timer) is tied to the CPU being
> offlined.
>
> Solution : This patch pushes the killing of migration thread to "CPU_POST_DEAD"
> event. By then all the timers (including sched_rt_period_timer) should have got
> migrated (along with other callbacks).
>
> Alternate Solution considered : Another option considered was to
> increase the priority of the hrtimer cpu offline notifier, such that it
> gets to run before scheduler's migration cpu offline notifier. In this
> way we are sure that the timers will get migrated before migration_call
> tries to kill migration_thread. But, this can have some non-obvious
> implications, suggested Srivatsa.
>
> Testing : Without the patch the stress tests didn't last for even 12
> hours. And yes, the problem was reproducible. With the patch applied the
> tests ran successfully for more than 48 hours.
>
> Thanks!
> --
> Regards,
> Amit Arora
>
> Signed-off-by: Amit Arora <aarora(a)in.ibm.com>
> Signed-off-by: Gautham R Shenoy <ego(a)in.ibm.com>
> --
> diff -Nuarp linux-2.6.34.org/kernel/sched.c linux-2.6.34/kernel/sched.c
> --- linux-2.6.34.org/kernel/sched.c 2010-05-18 22:56:21.000000000 -0700
> +++ linux-2.6.34/kernel/sched.c 2010-05-18 22:58:31.000000000 -0700
> @@ -5942,14 +5942,26 @@ migration_call(struct notifier_block *nf
> cpu_rq(cpu)->migration_thread = NULL;
> break;
>
> + case CPU_POST_DEAD:
> + /*
> + Bring the migration thread down in CPU_POST_DEAD event,
> + since the timers should have got migrated by now and thus
> + we should not see a deadlock between trying to kill the
> + migration thread and the sched_rt_period_timer.
> + */

Faulty comment style that, please use:

/*
* text
* goes
* here
*/

> + cpuset_lock();
> + rq = cpu_rq(cpu);
> + kthread_stop(rq->migration_thread);
> + put_task_struct(rq->migration_thread);
> + rq->migration_thread = NULL;
> + cpuset_unlock();
> + break;
> +

The other problem is more urgent though, CPU_POST_DEAD runs outside of
the hotplug lock and thus the above becomes a race where we could
possible kill off the migration thread of a newly brought up cpu:

cpu0 - down 2
cpu1 - up 2 (allocs a new migration thread, and leaks the old one)
cpu0 - post_down 2 - frees the migration thread -- oops!


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra on
On Wed, 2010-05-19 at 17:43 +0530, Amit K. Arora wrote:
> Alternate Solution considered : Another option considered was to
> increase the priority of the hrtimer cpu offline notifier, such that it
> gets to run before scheduler's migration cpu offline notifier. In this
> way we are sure that the timers will get migrated before migration_call
> tries to kill migration_thread. But, this can have some non-obvious
> implications, suggested Srivatsa.


> On Wed, May 19, 2010 at 11:31:55AM +0200, Peter Zijlstra wrote:
> > The other problem is more urgent though, CPU_POST_DEAD runs outside of
> > the hotplug lock and thus the above becomes a race where we could
> > possible kill off the migration thread of a newly brought up cpu:
> >
> > cpu0 - down 2
> > cpu1 - up 2 (allocs a new migration thread, and leaks the old one)
> > cpu0 - post_down 2 - frees the migration thread -- oops!
>
> Ok. So, how about adding a check in CPU_UP_PREPARE event handling too ?
> The cpuset_lock will synchronize, and thus avoid race between killing of
> migration_thread in up_prepare and post_dead events.
>
> Here is the updated patch. If you don't like this one too, do you mind
> suggesting an alternate approach to tackle the problem ? Thanks !

Right, so this isn't pretty at all..

Ingo, the comment near the migration_notifier says that migration_call
should happen before all else, but can you see anything that would break
if we let the timer migration happen first?

Thomas?

> Signed-off-by: Amit Arora <aarora(a)in.ibm.com>
> Signed-off-by: Gautham R Shenoy <ego(a)in.ibm.com>
> --

For everybody who reads this, _please_ use three (3) dashes '-' to
separate the changelog from the patch, and left-align the changelog
(including all tags).

I seem to get more and more people sending patches with 2 dashes and
daft changelogs with whitespace stuffing which break my scripts.

> diff -Nuarp linux-2.6.34.org/kernel/sched.c linux-2.6.34/kernel/sched.c
> --- linux-2.6.34.org/kernel/sched.c 2010-05-18 22:56:21.000000000 -0700
> +++ linux-2.6.34/kernel/sched.c 2010-05-19 04:47:49.000000000 -0700
> @@ -5900,6 +5900,19 @@ migration_call(struct notifier_block *nf
>
> case CPU_UP_PREPARE:
> case CPU_UP_PREPARE_FROZEN:
> + cpuset_lock();
> + rq = cpu_rq(cpu);
> + /*
> + * Since we now kill migration_thread in CPU_POST_DEAD event,
> + * there may be a race here. So, lets cleanup the old
> + * migration_thread on the rq, if any.
> + */
> + if (unlikely(rq->migration_thread)) {
> + kthread_stop(rq->migration_thread);
> + put_task_struct(rq->migration_thread);
> + rq->migration_thread = NULL;
> + }
> + cpuset_unlock();
> p = kthread_create(migration_thread, hcpu, "migration/%d", cpu);
> if (IS_ERR(p))
> return NOTIFY_BAD;
> @@ -5942,14 +5955,34 @@ migration_call(struct notifier_block *nf
> cpu_rq(cpu)->migration_thread = NULL;
> break;
>
> + case CPU_POST_DEAD:
> + /*
> + * Bring the migration thread down in CPU_POST_DEAD event,
> + * since the timers should have got migrated by now and thus
> + * we should not see a deadlock between trying to kill the
> + * migration thread and the sched_rt_period_timer.
> + */
> + cpuset_lock();
> + rq = cpu_rq(cpu);
> + if (likely(rq->migration_thread)) {
> + /*
> + * Its possible that this CPU was onlined (from a
> + * different CPU) before we reached here and
> + * migration_thread was cleaned-up in the
> + * CPU_UP_PREPARE event handling.
> + */
> + kthread_stop(rq->migration_thread);
> + put_task_struct(rq->migration_thread);
> + rq->migration_thread = NULL;
> + }
> + cpuset_unlock();
> + break;
> +
> case CPU_DEAD:
> case CPU_DEAD_FROZEN:
> cpuset_lock(); /* around calls to cpuset_cpus_allowed_lock() */
> migrate_live_tasks(cpu);
> rq = cpu_rq(cpu);
> - kthread_stop(rq->migration_thread);
> - put_task_struct(rq->migration_thread);
> - rq->migration_thread = NULL;
> /* Idle task back to normal (off runqueue, low prio) */
> raw_spin_lock_irq(&rq->lock);
> update_rq_clock(rq);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Mike Galbraith on
On Thu, 2010-05-20 at 09:28 +0200, Peter Zijlstra wrote:
> On Wed, 2010-05-19 at 17:43 +0530, Amit K. Arora wrote:
> > Alternate Solution considered : Another option considered was to
> > increase the priority of the hrtimer cpu offline notifier, such that it
> > gets to run before scheduler's migration cpu offline notifier. In this
> > way we are sure that the timers will get migrated before migration_call
> > tries to kill migration_thread. But, this can have some non-obvious
> > implications, suggested Srivatsa.
>
>
> > On Wed, May 19, 2010 at 11:31:55AM +0200, Peter Zijlstra wrote:
> > > The other problem is more urgent though, CPU_POST_DEAD runs outside of
> > > the hotplug lock and thus the above becomes a race where we could
> > > possible kill off the migration thread of a newly brought up cpu:
> > >
> > > cpu0 - down 2
> > > cpu1 - up 2 (allocs a new migration thread, and leaks the old one)
> > > cpu0 - post_down 2 - frees the migration thread -- oops!
> >
> > Ok. So, how about adding a check in CPU_UP_PREPARE event handling too ?
> > The cpuset_lock will synchronize, and thus avoid race between killing of
> > migration_thread in up_prepare and post_dead events.
> >
> > Here is the updated patch. If you don't like this one too, do you mind
> > suggesting an alternate approach to tackle the problem ? Thanks !
>
> Right, so this isn't pretty at all..

Since the problem seems to stem from interfering with a critical thread,
how about create a SCHED_SYSTEM_CRITICAL flag ala SCHED_RESET_ON_FORK?

Not particularly beautiful, and completely untested (well, it compiles).

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6cc43e0..23da2cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -40,6 +40,8 @@
#define SCHED_IDLE 5
/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
#define SCHED_RESET_ON_FORK 0x40000000
+/* Can be ORed in to flag a thread as being system critical. Not inherited. */
+#define SCHED_SYSTEM_CRITICAL 0x20000000

#ifdef __KERNEL__

@@ -1240,6 +1242,9 @@ struct task_struct {
/* Revert to default priority/policy when forking */
unsigned sched_reset_on_fork:1;

+ /* System critical thread. Cleared on fork. */
+ unsigned sched_system_critical:1;
+
pid_t pid;
pid_t tgid;

diff --git a/kernel/sched.c b/kernel/sched.c
index 2d17e3b..caf8b95 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2511,6 +2511,11 @@ void sched_fork(struct task_struct *p, int clone_flags)
*/
p->prio = current->normal_prio;

+ /*
+ * System critical policy flag is not inherited.
+ */
+ p->sched_system_critical = 0;
+
if (!rt_prio(p->prio))
p->sched_class = &fair_sched_class;

@@ -4486,7 +4491,7 @@ static int __sched_setscheduler(struct task_struct *p, int policy,
unsigned long flags;
const struct sched_class *prev_class;
struct rq *rq;
- int reset_on_fork;
+ int reset_on_fork, system_critical;

/* may grab non-irq protected spin_locks */
BUG_ON(in_interrupt());
@@ -4494,10 +4499,12 @@ recheck:
/* double check policy once rq lock held */
if (policy < 0) {
reset_on_fork = p->sched_reset_on_fork;
+ system_critical = p->sched_system_critical;
policy = oldpolicy = p->policy;
} else {
reset_on_fork = !!(policy & SCHED_RESET_ON_FORK);
- policy &= ~SCHED_RESET_ON_FORK;
+ system_critical = !!(policy & SCHED_SYSTEM_CRITICAL);
+ policy &= ~(SCHED_RESET_ON_FORK|SCHED_SYSTEM_CRITICAL);

if (policy != SCHED_FIFO && policy != SCHED_RR &&
policy != SCHED_NORMAL && policy != SCHED_BATCH &&
@@ -4552,6 +4559,10 @@ recheck:
/* Normal users shall not reset the sched_reset_on_fork flag */
if (p->sched_reset_on_fork && !reset_on_fork)
return -EPERM;
+
+ /* Normal users shall not reset the sched_system_critical flag */
+ if (p->sched_system_critical && !system_critical)
+ return -EPERM;
}

if (user) {
@@ -4560,7 +4571,7 @@ recheck:
* Do not allow realtime tasks into groups that have no runtime
* assigned.
*/
- if (rt_bandwidth_enabled() && rt_policy(policy) &&
+ if (rt_bandwidth_enabled() && rt_policy(policy) && !system_critical &&
task_group(p)->rt_bandwidth.rt_runtime == 0)
return -EPERM;
#endif
@@ -4595,6 +4606,7 @@ recheck:
p->sched_class->put_prev_task(rq, p);

p->sched_reset_on_fork = reset_on_fork;
+ p->sched_system_critical = system_critical;

oldprio = p->prio;
prev_class = p->sched_class;
@@ -4712,9 +4724,13 @@ SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid)
p = find_process_by_pid(pid);
if (p) {
retval = security_task_getscheduler(p);
- if (!retval)
- retval = p->policy
- | (p->sched_reset_on_fork ? SCHED_RESET_ON_FORK : 0);
+ if (!retval) {
+ retval = p->policy;
+ if (p->sched_reset_on_fork)
+ retval |= SCHED_RESET_ON_FORK;
+ if (p->sched_system_critical)
+ retval |= SCHED_SYSTEM_CRITICAL;
+ }
}
rcu_read_unlock();
return retval;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 8afb953..4fb5ced 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -605,6 +605,7 @@ static void update_curr_rt(struct rq *rq)
struct sched_rt_entity *rt_se = &curr->rt;
struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
u64 delta_exec;
+ int system_critical = curr->sched_system_critical;

if (!task_has_rt_policy(curr))
return;
@@ -621,9 +622,13 @@ static void update_curr_rt(struct rq *rq)
curr->se.exec_start = rq->clock;
cpuacct_charge(curr, delta_exec);

- sched_rt_avg_update(rq, delta_exec);
+ /*
+ * System critical tasks do not contribute to bandwidth consumption,
+ * nor are they evicted when runtime is exceeded.
+ */
+ sched_rt_avg_update(rq, system_critical ? 0 : delta_exec);

- if (!rt_bandwidth_enabled())
+ if (!rt_bandwidth_enabled() || system_critical)
return;

for_each_sched_rt_entity(rt_se) {
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index b4e7431..6cbea9a 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -304,7 +304,7 @@ static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
cpu);
if (IS_ERR(p))
return NOTIFY_BAD;
- sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
+ sched_setscheduler_nocheck(p, SCHED_FIFO|SCHED_SYSTEM_CRITICAL, &param);
get_task_struct(p);
stopper->thread = p;
break;


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra on
On Sun, 2010-05-23 at 11:07 +0200, Mike Galbraith wrote:
> On Thu, 2010-05-20 at 09:28 +0200, Peter Zijlstra wrote:
> > On Wed, 2010-05-19 at 17:43 +0530, Amit K. Arora wrote:
> > > Alternate Solution considered : Another option considered was to
> > > increase the priority of the hrtimer cpu offline notifier, such that it
> > > gets to run before scheduler's migration cpu offline notifier. In this
> > > way we are sure that the timers will get migrated before migration_call
> > > tries to kill migration_thread. But, this can have some non-obvious
> > > implications, suggested Srivatsa.
> >
> >
> > > On Wed, May 19, 2010 at 11:31:55AM +0200, Peter Zijlstra wrote:
> > > > The other problem is more urgent though, CPU_POST_DEAD runs outside of
> > > > the hotplug lock and thus the above becomes a race where we could
> > > > possible kill off the migration thread of a newly brought up cpu:
> > > >
> > > > cpu0 - down 2
> > > > cpu1 - up 2 (allocs a new migration thread, and leaks the old one)
> > > > cpu0 - post_down 2 - frees the migration thread -- oops!
> > >
> > > Ok. So, how about adding a check in CPU_UP_PREPARE event handling too ?
> > > The cpuset_lock will synchronize, and thus avoid race between killing of
> > > migration_thread in up_prepare and post_dead events.
> > >
> > > Here is the updated patch. If you don't like this one too, do you mind
> > > suggesting an alternate approach to tackle the problem ? Thanks !
> >
> > Right, so this isn't pretty at all..
>
> Since the problem seems to stem from interfering with a critical thread,
> how about create a SCHED_SYSTEM_CRITICAL flag ala SCHED_RESET_ON_FORK?
>
> Not particularly beautiful, and completely untested (well, it compiles).

Nah, I'd rather we pull the migration thread out of SCHED_FIFO and
either schedule it explicit or add a sched_class for it. We need to do
that anyway once we go play with SCHED_DEADLINE.

But it would be very nice if we could simply order the timer and task
migration bits so that the whole problem doesn't exist in the first
place.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra on
On Mon, 2010-05-24 at 15:29 +0530, Amit K. Arora wrote:
> since _cpu_up() and _cpu_down() can never run in
> parallel, because of cpu_add_remove_lock.

Ah indeed. I guess your initial patch works then.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/