From: Minchan Kim on 16 Jun 2010 11:40

On Wed, Jun 16, 2010 at 08:36:29PM +0900, KOSAKI Motohiro wrote:
> From: Luis Claudio R. Goncalves <lclaudio(a)uudg.org>
>
> In a system under heavy load it was observed that even after the
> oom-killer selects a task to die, the task may take a long time to die.
>
> Right after sending a SIGKILL to the task selected by the oom-killer,
> this task has its priority increased so that it can exit() soon,
> freeing memory. That is accomplished by:
>
> 	/*
> 	 * We give our sacrificial lamb high priority and access to
> 	 * all the memory it needs. That way it should be able to
> 	 * exit() and clear out its resources quickly...
> 	 */
> 	p->rt.time_slice = HZ;
> 	set_tsk_thread_flag(p, TIF_MEMDIE);
>
> It sounds plausible to give the dying task an even higher priority, to
> be sure it will be scheduled sooner and free the desired memory. It was
> suggested on LKML to use SCHED_FIFO:1, the lowest RT priority, so that
> this task won't interfere with any running RT task.
>
> If the dying task is already an RT task, leave it untouched.
> Another good suggestion, implemented here, was to avoid boosting the
> dying task's priority in case of a mem_cgroup OOM.
>
> Signed-off-by: Luis Claudio R. Goncalves <lclaudio(a)uudg.org>
> Cc: Minchan Kim <minchan.kim(a)gmail.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
> ---
>  mm/oom_kill.c |   38 +++++++++++++++++++++++++++++++++++---
>  1 files changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7e9942d..1ecfc7a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -82,6 +82,28 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
>  #endif /* CONFIG_NUMA */
>
>  /*
> + * If this is a system OOM (not a memcg OOM) and the task selected to be
> + * killed is not already running at high (RT) priorities, speed up the
> + * recovery by boosting the dying task to the lowest FIFO priority.
> + * That helps with the recovery and avoids interfering with RT tasks.
> + */
> +static void boost_dying_task_prio(struct task_struct *p,
> +				  struct mem_cgroup *mem)
> +{
> +	struct sched_param param = { .sched_priority = 1 };
> +
> +	if (mem)
> +		return;
> +
> +	if (rt_task(p)) {
> +		p->rt.time_slice = HZ;
> +		return;

I have a question from a long time ago: if we change rt.time_slice
_without_ calling setscheduler, is it effective? I mean, does the
scheduler pick the task up faster than other normal tasks?

> +	}
> +
> +	sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> +}
> +
> +/*

--
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
From: Luis Claudio R. Goncalves on 16 Jun 2010 16:00

On Thu, Jun 17, 2010 at 12:31:20AM +0900, Minchan Kim wrote:
| > 	/*
| > 	 * We give our sacrificial lamb high priority and access to
| > 	 * all the memory it needs. That way it should be able to
| > 	 * exit() and clear out its resources quickly...
| > 	 */
| > 	p->rt.time_slice = HZ;
| > 	set_tsk_thread_flag(p, TIF_MEMDIE);
...
| > +	if (rt_task(p)) {
| > +		p->rt.time_slice = HZ;
| > +		return;

I am not sure the code above will have any real effect for an RT task.
Kosaki-san, was this change motivated by test results or was it just a
code cleanup? I ask that out of curiosity.

| I have a question from a long time ago: if we change rt.time_slice
| _without_ calling setscheduler, is it effective? I mean, does the
| scheduler pick the task up faster than other normal tasks?

$ git log --pretty=oneline -Stime_slice mm/oom_kill.c
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Linux-2.6.12-rc2

This code ("time_slice = HZ;") has been around for quite a while and
probably comes from a time when having a big time slice was enough to be
sure you would be next in line. I would say sched_setscheduler is indeed
necessary.

Regards,
Luis
--
[ Luis Claudio R. Goncalves                    Red Hat - Realtime Team ]
[ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9  2696 7203 D980 A448 C8F8 ]
From: KOSAKI Motohiro on 16 Jun 2010 22:00

> On Thu, Jun 17, 2010 at 12:31:20AM +0900, Minchan Kim wrote:
> | > 	/*
> | > 	 * We give our sacrificial lamb high priority and access to
> | > 	 * all the memory it needs. That way it should be able to
> | > 	 * exit() and clear out its resources quickly...
> | > 	 */
> | > 	p->rt.time_slice = HZ;
> | > 	set_tsk_thread_flag(p, TIF_MEMDIE);
> ...
> | > +	if (rt_task(p)) {
> | > +		p->rt.time_slice = HZ;
> | > +		return;
>
> I am not sure the code above will have any real effect for an RT task.
> Kosaki-san, was this change motivated by test results or was it just a
> code cleanup? I ask that out of curiosity.

Just a cleanup. OK, I'll remove this dubious code.

> | I have a question from a long time ago: if we change rt.time_slice
> | _without_ calling setscheduler, is it effective? I mean, does the
> | scheduler pick the task up faster than other normal tasks?
>
> $ git log --pretty=oneline -Stime_slice mm/oom_kill.c
> 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Linux-2.6.12-rc2
>
> This code ("time_slice = HZ;") has been around for quite a while and
> probably comes from a time when having a big time slice was enough to
> be sure you would be next in line. I would say sched_setscheduler is
> indeed necessary.

OK.
From: KOSAKI Motohiro on 16 Jun 2010 22:00

> > +	struct sched_param param = { .sched_priority = 1 };
> > +
> > +	if (mem)
> > +		return;
> > +
> > +	if (rt_task(p)) {
> > +		p->rt.time_slice = HZ;
> > +		return;
>
> I have a question from a long time ago: if we change rt.time_slice
> _without_ calling setscheduler, is it effective? I mean, does the
> scheduler pick the task up faster than other normal tasks?

If p is SCHED_OTHER, it has no effect. If my understanding is correct,
it is only meaningful when p is SCHED_RR; that is the reason I moved
this assignment inside "if (rt_task())". But honestly, I have not
observed it working effectively, so I agree it can be removed, as Luis
mentioned.
From: KOSAKI Motohiro on 30 Jun 2010 05:40
Sorry, I forgot to cc Luis. Resend. (intentional full quote)

> From: Luis Claudio R. Goncalves <lclaudio(a)uudg.org>
>
> In a system under heavy load it was observed that even after the
> oom-killer selects a task to die, the task may take a long time to die.
>
> Right after sending a SIGKILL to the task selected by the oom-killer,
> this task has its priority increased so that it can exit() soon,
> freeing memory. That is accomplished by:
>
> 	/*
> 	 * We give our sacrificial lamb high priority and access to
> 	 * all the memory it needs. That way it should be able to
> 	 * exit() and clear out its resources quickly...
> 	 */
> 	p->rt.time_slice = HZ;
> 	set_tsk_thread_flag(p, TIF_MEMDIE);
>
> It sounds plausible to give the dying task an even higher priority, to
> be sure it will be scheduled sooner and free the desired memory. It was
> suggested on LKML to use SCHED_FIFO:1, the lowest RT priority, so that
> this task won't interfere with any running RT task.
>
> If the dying task is already an RT task, leave it untouched.
> Another good suggestion, implemented here, was to avoid boosting the
> dying task's priority in case of a mem_cgroup OOM.
>
> Signed-off-by: Luis Claudio R. Goncalves <lclaudio(a)uudg.org>
> Cc: Minchan Kim <minchan.kim(a)gmail.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
> ---
>  mm/oom_kill.c |   34 +++++++++++++++++++++++++++++++---
>  1 files changed, 31 insertions(+), 3 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index b5678bf..0858b18 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -82,6 +82,24 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
>  #endif /* CONFIG_NUMA */
>
>  /*
> + * If this is a system OOM (not a memcg OOM) and the task selected to be
> + * killed is not already running at high (RT) priorities, speed up the
> + * recovery by boosting the dying task to the lowest FIFO priority.
> + * That helps with the recovery and avoids interfering with RT tasks.
> + */
> +static void boost_dying_task_prio(struct task_struct *p,
> +				  struct mem_cgroup *mem)
> +{
> +	struct sched_param param = { .sched_priority = 1 };
> +
> +	if (mem)
> +		return;
> +
> +	if (!rt_task(p))
> +		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> +}
> +
> +/*
>   * The process p may have detached its own ->mm while exiting or through
>   * use_mm(), but one or more of its subthreads may still have a valid
>   * pointer. Return p, or any of its subthreads with a valid ->mm, with
> @@ -421,7 +439,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>  }
>
>  #define K(x) ((x) << (PAGE_SHIFT-10))
> -static int oom_kill_task(struct task_struct *p)
> +static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
>  {
>  	p = find_lock_task_mm(p);
>  	if (!p) {
> @@ -434,9 +452,17 @@ static int oom_kill_task(struct task_struct *p)
>  		       K(get_mm_counter(p->mm, MM_FILEPAGES)));
>  	task_unlock(p);
>
> -	p->rt.time_slice = HZ;
> +
>  	set_tsk_thread_flag(p, TIF_MEMDIE);
>  	force_sig(SIGKILL, p);
> +
> +	/*
> +	 * We give our sacrificial lamb high priority and access to
> +	 * all the memory it needs. That way it should be able to
> +	 * exit() and clear out its resources quickly...
> +	 */
> +	boost_dying_task_prio(p, mem);
> +
>  	return 0;
>  }
>  #undef K
> @@ -460,6 +486,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	 */
>  	if (p->flags & PF_EXITING) {
>  		set_tsk_thread_flag(p, TIF_MEMDIE);
> +		boost_dying_task_prio(p, mem);
>  		return 0;
>  	}
>
> @@ -489,7 +516,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  		}
>  	} while_each_thread(p, t);
>
> -	return oom_kill_task(victim, mem);
> +	return oom_kill_task(victim, mem);
>  }
>
>  /*
> @@ -670,6 +697,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 */
>  	if (fatal_signal_pending(current)) {
>  		set_thread_flag(TIF_MEMDIE);
> +		boost_dying_task_prio(current, NULL);
>  		return;
>  	}
>
> --
> 1.6.5.2