From: KAMEZAWA Hiroyuki on
On Wed, 10 Feb 2010 08:32:17 -0800 (PST)
David Rientjes <rientjes(a)google.com> wrote:

> Two VM sysctls, oom_dump_tasks and oom_kill_allocating_task, were
> implemented for very large systems to avoid excessively long tasklist
> scans. The former suppresses helpful diagnostic messages that are
> emitted for each thread group leader that are candidates for oom kill
> including their pid, uid, vm size, rss, oom_adj value, and name; this
> information is very helpful to users in understanding why a particular
> task was chosen for kill over others. The latter simply kills current,
> the task triggering the oom condition, instead of iterating through the
> tasklist looking for the worst offender.
>
> Both of these sysctls are combined into one for use on the aforementioned
> large systems: oom_kill_quick. This disables the now-default
> oom_dump_tasks and kills current whenever the oom killer is called.
>
> The oom killer rewrite is the perfect opportunity to combine both sysctls
> into one instead of carrying around the others for years to come for
> nothing else than legacy purposes.
>
> Signed-off-by: David Rientjes <rientjes(a)google.com>

Seems reasonable, but how old are these APIs? Is replacing them OK?

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com>


> ---
>  Documentation/sysctl/vm.txt |   44 +++++-------------------------------------
>  include/linux/oom.h         |    3 +-
>  kernel/sysctl.c             |   13 ++---------
>  mm/oom_kill.c               |    9 +++----
>  4 files changed, 14 insertions(+), 55 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -43,9 +43,8 @@ Currently, these files are in /proc/sys/vm:
> - nr_pdflush_threads
> - nr_trim_pages (only if CONFIG_MMU=n)
> - numa_zonelist_order
> -- oom_dump_tasks
> - oom_forkbomb_thres
> -- oom_kill_allocating_task
> +- oom_kill_quick
> - overcommit_memory
> - overcommit_ratio
> - page-cluster
> @@ -470,27 +469,6 @@ this is causing problems for your system/application.
>
> ==============================================================
>
> -oom_dump_tasks
> -
> -Enables a system-wide task dump (excluding kernel threads) to be
> -produced when the kernel performs an OOM-killing and includes such
> -information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and
> -name. This is helpful to determine why the OOM killer was invoked
> -and to identify the rogue task that caused it.
> -
> -If this is set to zero, this information is suppressed. On very
> -large systems with thousands of tasks it may not be feasible to dump
> -the memory state information for each one. Such systems should not
> -be forced to incur a performance penalty in OOM conditions when the
> -information may not be desired.
> -
> -If this is set to non-zero, this information is shown whenever the
> -OOM killer actually kills a memory-hogging task.
> -
> -The default value is 0.
> -
> -==============================================================
> -
> oom_forkbomb_thres
>
> This value defines how many children with a seperate address space a specific
> @@ -511,22 +489,12 @@ The default value is 1000.
>
> ==============================================================
>
> -oom_kill_allocating_task
> -
> -This enables or disables killing the OOM-triggering task in
> -out-of-memory situations.
> -
> -If this is set to zero, the OOM killer will scan through the entire
> -tasklist and select a task based on heuristics to kill. This normally
> -selects a rogue memory-hogging task that frees up a large amount of
> -memory when killed.
> -
> -If this is set to non-zero, the OOM killer simply kills the task that
> -triggered the out-of-memory condition. This avoids the expensive
> -tasklist scan.
> +oom_kill_quick
>
> -If panic_on_oom is selected, it takes precedence over whatever value
> -is used in oom_kill_allocating_task.
> +When enabled, this will always kill the task that triggered the oom killer, i.e.
> +the task that attempted to allocate memory that could not be found. It also
> +suppresses the tasklist dump to the kernel log whenever the oom killer is
> +called. Typically set on systems with an extremely large number of tasks.
>
> The default value is 0.
>
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -51,8 +51,7 @@ static inline void oom_killer_enable(void)
>  }
>  /* for sysctl */
>  extern int sysctl_panic_on_oom;
> -extern int sysctl_oom_kill_allocating_task;
> -extern int sysctl_oom_dump_tasks;
> +extern int sysctl_oom_kill_quick;
>  extern int sysctl_oom_forkbomb_thres;
>
>  #endif /* __KERNEL__*/
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -949,16 +949,9 @@ static struct ctl_table vm_table[] = {
>  		.proc_handler	= proc_dointvec,
>  	},
>  	{
> -		.procname	= "oom_kill_allocating_task",
> -		.data		= &sysctl_oom_kill_allocating_task,
> -		.maxlen		= sizeof(sysctl_oom_kill_allocating_task),
> -		.mode		= 0644,
> -		.proc_handler	= proc_dointvec,
> -	},
> -	{
> -		.procname	= "oom_dump_tasks",
> -		.data		= &sysctl_oom_dump_tasks,
> -		.maxlen		= sizeof(sysctl_oom_dump_tasks),
> +		.procname	= "oom_kill_quick",
> +		.data		= &sysctl_oom_kill_quick,
> +		.maxlen		= sizeof(sysctl_oom_kill_quick),
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec,
>  	},
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -32,9 +32,8 @@
>  #include <linux/security.h>
>
>  int sysctl_panic_on_oom;
> -int sysctl_oom_kill_allocating_task;
> -int sysctl_oom_dump_tasks;
>  int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
> +int sysctl_oom_kill_quick;
>  static DEFINE_SPINLOCK(zone_scan_lock);
>
>  /*
> @@ -397,7 +396,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>  	dump_stack();
>  	mem_cgroup_print_oom_info(mem, p);
>  	show_mem();
> -	if (sysctl_oom_dump_tasks)
> +	if (!sysctl_oom_kill_quick)
>  		dump_tasks(mem);
>  }
>
> @@ -604,9 +603,9 @@ static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
>  	struct task_struct *p;
>  	unsigned int points;
>
> -	if (sysctl_oom_kill_allocating_task)
> +	if (sysctl_oom_kill_quick)
>  		if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
> -				NULL, "Out of memory (oom_kill_allocating_task)"))
> +				NULL, "Out of memory (quick mode)"))
>  			return;
>  retry:
>  	/*
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo(a)kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont(a)kvack.org"> email(a)kvack.org </a>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: David Rientjes on
On Fri, 12 Feb 2010, KAMEZAWA Hiroyuki wrote:

> Seems reasonable, but how old are these APIs? Is replacing them OK?
>

I'm not concerned about /proc/sys/vm/oom_dump_tasks because it was
disabled by default and is now enabled by default (unless the user sets
this new /proc/sys/vm/oom_kill_quick). So existing users of
oom_dump_tasks will just have their write fail but will otherwise see
identical behavior to before.

/proc/sys/vm/oom_kill_allocating_task is different since it now requires
enabling /proc/sys/vm/oom_kill_quick, but I think there are so few users
(SGI originally requested it a couple years ago when we started scanning
the tasklist for CONSTRAINT_CPUSET in 2.6.24) and the side-effect of not
enabling it is minimal: it's just a long delay at oom kill time because
they must scan the tasklist. Therefore, I don't see it as a major problem
that will cause large disruptions; instead, I see it as a great opportunity
to get rid of one more sysctl without taking away functionality.
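
The mapping between the two old sysctls and the single new one can be
written out as a small userspace model. This is only a sketch derived from
the conditions in the quoted patch (dump the tasklist when oom_kill_quick
is zero, kill current when it is non-zero); struct oom_behavior and every
function name below are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the knobs; not kernel code. */
struct oom_behavior {
	bool dump_tasklist;	/* per-task dump goes to the kernel log */
	bool kill_current;	/* kill the allocating task, skip the scan */
};

/* Before the patch: two independent sysctls. */
static struct oom_behavior old_behavior(int oom_dump_tasks,
					int oom_kill_allocating_task)
{
	struct oom_behavior b = {
		.dump_tasklist = oom_dump_tasks != 0,
		.kill_current = oom_kill_allocating_task != 0,
	};
	return b;
}

/* After the patch: one sysctl drives both decisions. */
static struct oom_behavior new_behavior(int oom_kill_quick)
{
	struct oom_behavior b = {
		.dump_tasklist = oom_kill_quick == 0,
		.kill_current = oom_kill_quick != 0,
	};
	return b;
}

/*
 * Return the oom_kill_quick value (0 or 1) that reproduces an old
 * configuration exactly, or -1 if no value does.
 */
static int equivalent_quick_value(int oom_dump_tasks,
				  int oom_kill_allocating_task)
{
	struct oom_behavior want = old_behavior(oom_dump_tasks,
						oom_kill_allocating_task);

	for (int quick = 0; quick <= 1; quick++) {
		struct oom_behavior got = new_behavior(quick);

		if (got.dump_tasklist == want.dump_tasklist &&
		    got.kill_current == want.kill_current)
			return quick;
	}
	return -1;
}

/* Sanity checks for the model itself. */
static void check_model(void)
{
	assert(equivalent_quick_value(1, 0) == 0);	/* dump and scan */
	assert(equivalent_quick_value(0, 1) == 1);	/* no dump, kill current */
}
```

Under this model, oom_dump_tasks=1 with oom_kill_allocating_task=0
corresponds to oom_kill_quick=0, and oom_dump_tasks=0 with
oom_kill_allocating_task=1 corresponds to oom_kill_quick=1. The other two
combinations return -1 because one knob cannot express them.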

> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com>
>

Thanks!
From: KOSAKI Motohiro on
> Two VM sysctls, oom_dump_tasks and oom_kill_allocating_task, were
> implemented for very large systems to avoid excessively long tasklist
> scans. The former suppresses helpful diagnostic messages that are
> emitted for each thread group leader that are candidates for oom kill
> including their pid, uid, vm size, rss, oom_adj value, and name; this
> information is very helpful to users in understanding why a particular
> task was chosen for kill over others. The latter simply kills current,
> the task triggering the oom condition, instead of iterating through the
> tasklist looking for the worst offender.
>
> Both of these sysctls are combined into one for use on the aforementioned
> large systems: oom_kill_quick. This disables the now-default
> oom_dump_tasks and kills current whenever the oom killer is called.
>
> The oom killer rewrite is the perfect opportunity to combine both sysctls
> into one instead of carrying around the others for years to come for
> nothing else than legacy purposes.

"_quick" is always a bad sysctl name. Instead, turning oom_dump_tasks on
by default is better.

Plus, this patch creates an unnecessary compatibility issue.



From: David Rientjes on
On Mon, 15 Feb 2010, KOSAKI Motohiro wrote:

> "_quick" is always a bad sysctl name.

Why? It does exactly what it says: it kills current without doing an
expensive tasklist scan and suppresses the possibly long tasklist dump.
That's the oom killer's "quick mode."
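
As a rough illustration of what "quick mode" skips, here is a userspace
sketch. It is not the kernel code: struct task, struct oom_outcome, and
simulate_oom() are hypothetical names, the badness score is reduced to a
plain integer, the dump is modeled as one line per task, and killing
current is assumed to always succeed (in the patch, a failed kill of
current falls through to the normal scan).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task and its badness score. */
struct task {
	int pid;
	unsigned int points;
};

struct oom_outcome {
	int victim_pid;		/* -1 if there was nothing to kill */
	size_t tasks_scanned;	/* tasks visited while selecting a victim */
	size_t tasks_dumped;	/* lines the tasklist dump would print */
};

static struct oom_outcome simulate_oom(const struct task *tasks, size_t ntasks,
				       size_t current_idx, int oom_kill_quick)
{
	struct oom_outcome out = { .victim_pid = -1 };
	size_t worst = 0;

	/* A NULL array is only acceptable when it is also empty. */
	assert(tasks != NULL || ntasks == 0);

	if (ntasks == 0 || current_idx >= ntasks)
		return out;

	if (oom_kill_quick) {
		/* Quick mode: no scan, no dump, the allocating task dies. */
		out.victim_pid = tasks[current_idx].pid;
		return out;
	}

	/* Default mode: visit every task and keep the highest score. */
	for (size_t i = 0; i < ntasks; i++) {
		out.tasks_scanned++;
		if (tasks[i].points > tasks[worst].points)
			worst = i;
	}
	out.tasks_dumped = ntasks;	/* one dump line per task in this model */
	out.victim_pid = tasks[worst].pid;
	return out;
}
```

With oom_kill_quick set, both counters stay at zero no matter how many
tasks exist; with it cleared, both grow linearly with the task count,
which is the cost the sysctl is meant to avoid on very large systems.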

> Instead, turning oom_dump_tasks on
> by default is better.
>

It's now on by default and can be disabled by enabling oom_kill_quick.

> Plus, this patch creates an unnecessary compatibility issue.
>

It's the perfect opportunity when totally rewriting the oom killer to
combine two sysctls with the exact same users into one. Users will notice
that the tasklist is always dumped now (we're defaulting oom_dump_tasks
to be enabled), so there is no reason why we can't remove oom_dump_tasks,
we're just giving them a new way to disable it. oom_kill_allocating_task
no longer always means what it once did: with the mempolicy-constrained
oom rewrite, we now iterate the tasklist for such cases to kill a task.
So users need to reassess whether this should be set if all tasks on the
system are constrained by mempolicies, a typical configuration for
extremely large systems.
From: KOSAKI Motohiro on
> On Mon, 15 Feb 2010, KOSAKI Motohiro wrote:
>
> > "_quick" is always a bad sysctl name.
>
> Why? It does exactly what it says: it kills current without doing an
> expensive tasklist scan and suppresses the possibly long tasklist dump.
> That's the oom killer's "quick mode."

Because an administrator will think "_quick" implies "please always use it".
Plus, "quick" doesn't clearly describe what it does; oom_dump_tasks does.



> > Instead, turning oom_dump_tasks on
> > by default is better.
> >
>
> It's now on by default and can be disabled by enabling oom_kill_quick.
>
> > Plus, this patch creates an unnecessary compatibility issue.
> >
>
> It's the perfect opportunity when totally rewriting the oom killer to
> combine two sysctls with the exact same users into one. Users will notice
> that the tasklist is always dumped now (we're defaulting oom_dump_tasks
> to be enabled), so there is no reason why we can't remove oom_dump_tasks,
> we're just giving them a new way to disable it. oom_kill_allocating_task
> no longer always means what it once did: with the mempolicy-constrained
> oom rewrite, we now iterate the tasklist for such cases to kill a task.
> So users need to reassess whether this should be set if all tasks on the
> system are constrained by mempolicies, a typical configuration for
> extremely large systems.

No.
Your explanation doesn't answer why this change doesn't cause any compatibility
issue for _all_ users. A mere "opportunity" doesn't allow us to ignore real-world
users. I have made some incompatible patches too, but every one of them had an
unavoidable reason.


