From: Minchan Kim on
On Wed, Jun 30, 2010 at 06:35:08PM +0900, KOSAKI Motohiro wrote:
>
> Sorry, I forgot to cc Luis. Resending.
>
>
> (intentional full quote)
>
> > From: Luis Claudio R. Goncalves <lclaudio(a)uudg.org>
> >
> > In a system under heavy load it was observed that even after the
> > oom-killer selects a task to die, the task may take a long time to die.
> >
> > Right after sending a SIGKILL to the task selected by the oom-killer
> > this task has its priority increased so that it can exit() soon,
> > freeing memory. That is accomplished by:
> >
> > /*
> > * We give our sacrificial lamb high priority and access to
> > * all the memory it needs. That way it should be able to
> > * exit() and clear out its resources quickly...
> > */
> > p->rt.time_slice = HZ;
> > set_tsk_thread_flag(p, TIF_MEMDIE);
> >
> > It sounds plausible to give the dying task an even higher priority, to
> > be sure it will be scheduled sooner and free the desired memory. It was
> > suggested on LKML to use SCHED_FIFO:1, the lowest RT priority, so that
> > this task won't interfere with any running RT task.
> >
> > If the dying task is already an RT task, leave it untouched.
> > Another good suggestion, implemented here, was to avoid boosting the
> > dying task's priority in the case of a mem_cgroup OOM.
> >
> > Signed-off-by: Luis Claudio R. Goncalves <lclaudio(a)uudg.org>
> > Cc: Minchan Kim <minchan.kim(a)gmail.com>
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>

Reviewed-by: Minchan Kim <minchan.kim(a)gmail.com>

It seems the code itself doesn't have a problem, so I give my reviewed-by.
But this patch might break fairness for normal processes in a corner case.
If keeping the system working is more important than fairness between
processes, it makes sense. But the scheduler guys might have a different
opinion.

So, at the least, we need ACKs from the scheduler guys.
Cc'ed Ingo, Peter, Thomas.
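
To illustrate the fairness concern: a SCHED_FIFO:1 task preempts every
SCHED_OTHER task on its CPU, regardless of nice level, until it sleeps or
exits. A minimal userspace sketch (hypothetical demo, not part of the patch):

	#include <sched.h>
	#include <stdio.h>

	int main(void)
	{
		struct sched_param param = { .sched_priority = 1 };

		/* Needs CAP_SYS_NICE (typically root). */
		if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
			perror("sched_setscheduler");
			return 1;
		}
		for (;;)
			;	/* normal tasks on this CPU starve here,
				 * RT throttling aside */
	}

The patch bets that the OOM victim only runs long enough to exit() and free
its memory, so the starvation window should be short.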

> > ---
> > mm/oom_kill.c | 34 +++++++++++++++++++++++++++++++---
> > 1 files changed, 31 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index b5678bf..0858b18 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -82,6 +82,24 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
> > #endif /* CONFIG_NUMA */
> >
> > /*
> > + * If this is a system OOM (not a memcg OOM) and the task selected to be
> > + * killed is not already running at high (RT) priorities, speed up the
> > + * recovery by boosting the dying task to the lowest FIFO priority.
> > + * That helps with the recovery and avoids interfering with RT tasks.
> > + */
> > +static void boost_dying_task_prio(struct task_struct *p,
> > + struct mem_cgroup *mem)
> > +{
> > + struct sched_param param = { .sched_priority = 1 };
> > +
> > + if (mem)
> > + return;
> > +
> > + if (!rt_task(p))
> > + sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> > +}
> > +
> > +/*
> > * The process p may have detached its own ->mm while exiting or through
> > * use_mm(), but one or more of its subthreads may still have a valid
> > * pointer. Return p, or any of its subthreads with a valid ->mm, with
> > @@ -421,7 +439,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> > }
> >
> > #define K(x) ((x) << (PAGE_SHIFT-10))
> > -static int oom_kill_task(struct task_struct *p)
> > +static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
> > {
> > p = find_lock_task_mm(p);
> > if (!p) {
> > @@ -434,9 +452,17 @@ static int oom_kill_task(struct task_struct *p)
> > K(get_mm_counter(p->mm, MM_FILEPAGES)));
> > task_unlock(p);
> >
> > - p->rt.time_slice = HZ;
> > +
> > set_tsk_thread_flag(p, TIF_MEMDIE);
> > force_sig(SIGKILL, p);
> > +
> > + /*
> > + * We give our sacrificial lamb high priority and access to
> > + * all the memory it needs. That way it should be able to
> > + * exit() and clear out its resources quickly...
> > + */
> > + boost_dying_task_prio(p, mem);
> > +
> > return 0;
> > }
> > #undef K
> > @@ -460,6 +486,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> > */
> > if (p->flags & PF_EXITING) {
> > set_tsk_thread_flag(p, TIF_MEMDIE);
> > + boost_dying_task_prio(p, mem);
> > return 0;
> > }
> >
> > @@ -489,7 +516,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> > }
> > } while_each_thread(p, t);
> >
> > - return oom_kill_task(victim);
> > + return oom_kill_task(victim, mem);
> > }
> >
> > /*
> > @@ -670,6 +697,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > */
> > if (fatal_signal_pending(current)) {
> > set_thread_flag(TIF_MEMDIE);
> > + boost_dying_task_prio(current, NULL);
> > return;
> > }
> >
> > --
> > 1.6.5.2
> >

--
Kind regards,
Minchan Kim
From: Andrew Morton on
On Wed, 30 Jun 2010 18:33:23 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com> wrote:

> +static void boost_dying_task_prio(struct task_struct *p,
> + struct mem_cgroup *mem)
> +{
> + struct sched_param param = { .sched_priority = 1 };
> +
> + if (mem)
> + return;
> +
> + if (!rt_task(p))
> + sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> +}

We can actually make `param' static here. That saves a teeny bit of
code and a little bit of stack. The oom-killer can be called when
we're using a lot of stack.
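
An illustrative sketch of the two forms (hypothetical helper names; the
codegen point is generic C, nothing specific to oom_kill.c):

	/* Automatic: param is rebuilt in the caller's stack frame and
	 * re-initialized on every call. */
	void boost_stack(struct task_struct *p)
	{
		struct sched_param param = { .sched_priority = 1 };

		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
	}

	/* Static: param is emitted once into .data and initialized at
	 * build time; the body only passes its address. */
	void boost_static(struct task_struct *p)
	{
		static struct sched_param param = { .sched_priority = 1 };

		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
	}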

But if we make that change we really should make the param arg to
sched_setscheduler_nocheck() be const. I did that (and was able to
convert lots of callers to use a static `param') but to complete the
job we'd need to chase through all the security goop, fixing up
security_task_setscheduler() and callees, and I got bored.
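
The const is what makes the static form safe to share. A hypothetical
illustration of the hazard it prevents:

	static struct sched_param param = { .sched_priority = 1 };

	/* Without const, a careless callee could scribble on *param and
	 * the change would silently persist into every later call: */
	void buggy_callee(struct sched_param *sp)
	{
		sp->sched_priority = MAX_RT_PRIO - 1;
	}

	/* With a const-qualified signature the same store is rejected
	 * at compile time: */
	void safe_callee(const struct sched_param *sp)
	{
		/* sp->sched_priority = ...;  -- compile error */
	}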


include/linux/sched.h | 2 +-
kernel/kthread.c | 2 +-
kernel/sched.c | 4 ++--
kernel/softirq.c | 4 +++-
kernel/stop_machine.c | 2 +-
kernel/workqueue.c | 2 +-
6 files changed, 9 insertions(+), 7 deletions(-)

diff -puN kernel/kthread.c~a kernel/kthread.c
--- a/kernel/kthread.c~a
+++ a/kernel/kthread.c
@@ -131,7 +131,7 @@ struct task_struct *kthread_create(int (
wait_for_completion(&create.done);

if (!IS_ERR(create.result)) {
- struct sched_param param = { .sched_priority = 0 };
+ static struct sched_param param = { .sched_priority = 0 };
va_list args;

va_start(args, namefmt);
diff -puN kernel/workqueue.c~a kernel/workqueue.c
--- a/kernel/workqueue.c~a
+++ a/kernel/workqueue.c
@@ -962,7 +962,7 @@ init_cpu_workqueue(struct workqueue_stru

static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
{
- struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+ static struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
struct workqueue_struct *wq = cwq->wq;
const char *fmt = is_wq_single_threaded(wq) ? "%s" : "%s/%d";
struct task_struct *p;
diff -puN kernel/stop_machine.c~a kernel/stop_machine.c
--- a/kernel/stop_machine.c~a
+++ a/kernel/stop_machine.c
@@ -291,7 +291,7 @@ repeat:
static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
unsigned long action, void *hcpu)
{
- struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
+ static struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
unsigned int cpu = (unsigned long)hcpu;
struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
struct task_struct *p;
diff -puN kernel/sched.c~a kernel/sched.c
--- a/kernel/sched.c~a
+++ a/kernel/sched.c
@@ -4570,7 +4570,7 @@ static bool check_same_owner(struct task
}

static int __sched_setscheduler(struct task_struct *p, int policy,
- struct sched_param *param, bool user)
+ const struct sched_param *param, bool user)
{
int retval, oldprio, oldpolicy = -1, on_rq, running;
unsigned long flags;
@@ -4734,7 +4734,7 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
* but our caller might not have that capability.
*/
int sched_setscheduler_nocheck(struct task_struct *p, int policy,
- struct sched_param *param)
+ const struct sched_param *param)
{
return __sched_setscheduler(p, policy, param, false);
}
diff -puN kernel/softirq.c~a kernel/softirq.c
--- a/kernel/softirq.c~a
+++ a/kernel/softirq.c
@@ -827,7 +827,9 @@ static int __cpuinit cpu_callback(struct
cpumask_any(cpu_online_mask));
case CPU_DEAD:
case CPU_DEAD_FROZEN: {
- struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+ static struct sched_param param = {
+ .sched_priority = MAX_RT_PRIO-1,
+ };

p = per_cpu(ksoftirqd, hotcpu);
per_cpu(ksoftirqd, hotcpu) = NULL;
diff -puN include/linux/sched.h~a include/linux/sched.h
--- a/include/linux/sched.h~a
+++ a/include/linux/sched.h
@@ -1924,7 +1924,7 @@ extern int task_curr(const struct task_s
extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
extern int sched_setscheduler_nocheck(struct task_struct *, int,
- struct sched_param *);
+ const struct sched_param *);
extern struct task_struct *idle_task(int cpu);
extern struct task_struct *curr_task(int cpu);
extern void set_curr_task(int cpu, struct task_struct *p);
_

From: KOSAKI Motohiro on
(cc to James)

> On Wed, 30 Jun 2010 18:33:23 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com> wrote:
>
> > +static void boost_dying_task_prio(struct task_struct *p,
> > + struct mem_cgroup *mem)
> > +{
> > + struct sched_param param = { .sched_priority = 1 };
> > +
> > + if (mem)
> > + return;
> > +
> > + if (!rt_task(p))
> > + sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> > +}
>
> We can actually make `param' static here. That saves a teeny bit of
> code and a little bit of stack. The oom-killer can be called when
> we're using a lot of stack.
>
> But if we make that change we really should make the param arg to
> sched_setscheduler_nocheck() be const. I did that (and was able to
> convert lots of callers to use a static `param') but to complete the
> job we'd need to chase through all the security goop, fixing up
> security_task_setscheduler() and callees, and I got bored.

ok, I've finished this work. I made two patches, one for the security tree
and one for core; the diffstat is below, followed by a sketch of the hook
change.

I'll post the patches as replies to this mail.


KOSAKI Motohiro (2):
security: add const to security_task_setscheduler()
sched: make the sched_param argument a static variable in some
sched_setscheduler() callers

include/linux/sched.h | 5 +++--
include/linux/security.h | 9 +++++----
kernel/irq/manage.c | 4 +++-
kernel/kthread.c | 2 +-
kernel/sched.c | 6 +++---
kernel/softirq.c | 4 +++-
kernel/stop_machine.c | 2 +-
kernel/trace/trace_selftest.c | 2 +-
kernel/watchdog.c | 2 +-
kernel/workqueue.c | 2 +-
security/commoncap.c | 2 +-
security/security.c | 4 ++--
security/selinux/hooks.c | 3 ++-
security/smack/smack_lsm.c | 2 +-
14 files changed, 28 insertions(+), 21 deletions(-)
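
The security-tree half presumably reduces to const-qualifying the hook and
its callees, along these lines (a sketch; the real patches follow in the
replies):

	/* include/linux/security.h */
	int security_task_setscheduler(struct task_struct *p, int policy,
				       const struct sched_param *lp);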

- kosaki

