oom-kill: add lowmem usage aware oom kill handling [Kernel]

Prev: perf top: losing events?
Next: x86: Unify reboot_type selection

From: KAMEZAWA Hiroyuki on 26 Jan 2010 19:00

On Tue, 26 Jan 2010 15:12:02 -0800
Andrew Morton <akpm(a)linux-foundation.org> wrote:

> On Mon, 25 Jan 2010 15:15:03 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com> wrote:

> > This patch does
> > - add sysctl for new bahavior.
> > - add CONSTRAINT_LOWMEM to oom's constraint type.
> > - pass constraint to __badness()
> > - change calculation based on constraint. If CONSTRAINT_LOWMEM,
> > use low_rss instead of vmsize.
> >
> > Changelog 2010/01/25
> > - showing extension_mask value in OOM kill main log header.
> > Changelog 2010/01/22:
> > - added sysctl
> > - fixed !CONFIG_MMU
> > - fixed fs/proc/base.c breakacge.
>
> It'd be nice to see some testing results for this. Perhaps "here's a
> test case and here's the before-and-after behaviour".
>
Hm. posting test case module is O.K ?
At leaset, I'll add what test was done and /var/log/message output to the log.

> I don't like the sysctl knob much.
me, too.

> Hardly anyone will know to enable
> it so the feature won't get much testing and this binary decision
> fractures the testing effort. It would be much better if we can get
> everyone running the same code. I mean, if there are certain workloads
> on certain machines with which the oom-killer doesn't behave correctly
> then fix it!
Yes, I think you're right. But "breaking current behaviro of our servers!"
arguments kills all proposal to this area and this oom-killer or vmscan is
a feature should be tested by real users. (I'll write fork-bomb detector
and RSS based OOM again.)

Then, I'd like to use sysctl. Distro/users can select default value of this
by /etc/sysctl.conf file, at least.

>
> Why was the '#include <linux/sysctl.h>" removed from sysctl.c?
>
> The patch adds a random newline to sysctl.c.
>
Sorry. my bad.

> It was never a good idea to add extern declarations to sysctl.c. It's
> better to add them to a subsystem-specific header file (ie:
> mm-sysctl.h) and then include that file from the mm files that define
> or use sysctl_foo, and include it into sysctl.c. Oh well.
>
Hmm. Okay. I'll consider about that.

Thanks,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Andrew Morton on 26 Jan 2010 19:30

On Wed, 27 Jan 2010 08:53:55 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com> wrote:

> > Hardly anyone will know to enable
> > it so the feature won't get much testing and this binary decision
> > fractures the testing effort. It would be much better if we can get
> > everyone running the same code. I mean, if there are certain workloads
> > on certain machines with which the oom-killer doesn't behave correctly
> > then fix it!
> Yes, I think you're right. But "breaking current behaviro of our servers!"
> arguments kills all proposal to this area and this oom-killer or vmscan is
> a feature should be tested by real users. (I'll write fork-bomb detector
> and RSS based OOM again.)

Well don't break their servers then ;)

What I'm not understanding is: why is it not possible to improve the
behaviour on the affected machines without affecting the behaviour on
other machines?

What are these "servers" to which you refer? x86_32 servers, I assume
- the patch shouldn't affect 64-bit machines. Why don't they also want
this treatment and in what way does the patch "break" them?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: KAMEZAWA Hiroyuki on 26 Jan 2010 20:10

On Tue, 26 Jan 2010 16:19:52 -0800
Andrew Morton <akpm(a)linux-foundation.org> wrote:

> On Wed, 27 Jan 2010 08:53:55 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com> wrote:
>
> > > Hardly anyone will know to enable
> > > it so the feature won't get much testing and this binary decision
> > > fractures the testing effort. It would be much better if we can get
> > > everyone running the same code. I mean, if there are certain workloads
> > > on certain machines with which the oom-killer doesn't behave correctly
> > > then fix it!
> > Yes, I think you're right. But "breaking current behaviro of our servers!"
> > arguments kills all proposal to this area and this oom-killer or vmscan is
> > a feature should be tested by real users. (I'll write fork-bomb detector
> > and RSS based OOM again.)
>
> Well don't break their servers then ;)
>
> What I'm not understanding is: why is it not possible to improve the
> behaviour on the affected machines without affecting the behaviour on
> other machines?
>

Now, /proc/<pid>/oom_score and /proc/<pid>/oom_adj are used by servers.
After this patch, badness() returns different value based on given context.
Changing format of them was an idea, but, as David said, using "RSS" values
will show unstable oom_score. So, I didn't modify oom_score (for this time).

To be honest, all my work are for guys who don't tweak oom_adj based on oom_score.
IOW, this is for usual novice people. And I don't wan't to break servers which
depends on oom black magic currently supported.

It may be better to show lowmem_rss via /proc/<pid>/statm or somewhere. But
I didn't do that because usual people doesn't check that in periodic and
tweak oom_adj.

For my customers, I don't like oom black magic. I'd like to recommend to
use memcg, of course ;) But lowmem oom cannot be handled by memcg, well.
So I started from this.

> What are these "servers" to which you refer?
Almost all servers/PCs/laptops which have multiple zones in memory layout.

> x86_32 servers, I assume
> - the patch shouldn't affect 64-bit machines. Why don't they also want
> this treatment and in what way does the patch "break" them?

Ah, explanation was not enough.

This patch depends on mm-add-lowmem-detection-logic.patch
The lowmem is
- ZONE_NORMAL in x86-32 which has HIGHMEM
- ZONE_DMA32 in x86-64 which has ZONE_NORMAL
- ZONE_DMA in x86-64 which doesn't have ZONE_NORMAL(memory < 4G)
- ZONE_DMA in ia64 which has ZONE_NORMAL(memory > 4G)
- no zone in ppc. all zone are DMA. (lowmem_zone=-1)

So, this affects x86-64 hosts, especially when it has 4 Gbytes of memory
and 32bit pci cards.

Thanks,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: David Rientjes on 27 Jan 2010 18:50

On Wed, 27 Jan 2010, KAMEZAWA Hiroyuki wrote:

> Yes, I think you're right. But "breaking current behaviro of our servers!"
> arguments kills all proposal to this area and this oom-killer or vmscan is
> a feature should be tested by real users.

Nobody has said we should discount lowmem rss when dealing with a GFP_DMA
allocation, it simply wasn't possible until the lowmem rss counters were
introduced in -mm. It would prevent the needless killing of innocent
tasks which would not allow the page allocation to succeed, so it's a good
feature to have. It doesn't need to be configurable at all, we just need
to find a way to introduce it into the heuristic without mangling oom_adj.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: David Rientjes on 27 Jan 2010 19:00

On Wed, 27 Jan 2010, KAMEZAWA Hiroyuki wrote:

> Now, /proc/<pid>/oom_score and /proc/<pid>/oom_adj are used by servers.

Nonsense, there are plenty of userspace applications such as udev that
tune their own oom_adj value on their own! oom_adj is used by anyone who
wants to define oom killer priority by polarizing the badness heuristic
for certain tasks to, for example, always prefer them or completely
disable oom killing for them.

> After this patch, badness() returns different value based on given context.
> Changing format of them was an idea, but, as David said, using "RSS" values
> will show unstable oom_score. So, I didn't modify oom_score (for this time).
>

That's a seperate issue: you cannot define the baseline of the heuristic
in terms of rss because it does not allow userspace to define when a task
has become "rogue", i.e. when it is consuming far more memory than
expected, because it is a dynamic value that depends on the state of the
VM at the time of oom. That is one of the two most popular reasons for
tuning oom_adj, the other aforementioned.

The issue with using lowmem rss for CONSTRAINT_LOWMEM is that it
misinterprets oom_adj values given to tasks; users will tune their oom_adj
based on global, system-wide ooms (or use /proc/pid/oom_score to reveal
the priority) and will never understand how it affects the value of a
resident page in lowmem for GFP_DMA allocations.

> To be honest, all my work are for guys who don't tweak oom_adj based on oom_score.
> IOW, this is for usual novice people. And I don't wan't to break servers which
> depends on oom black magic currently supported.
>

Why can't you simply create your own heuristic, seperate from badness(),
for CONSTRAINT_LOWMEM? Define the criteria that you see as important in
selecting a task in that scenario and then propose it as a seperate
function, there is no requirement that we must have a single heuristic
that works for all the various oom killer constraints. It would be
entirely appropriate to ignore oom_adj in that heuristic, as well, since
its not defined for such oom conditions (OOM_DISABLE is already taken care
of in the tasklist scan and needs no further support).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8
Prev: perf top: losing events?
Next: x86: Unify reboot_type selection