From: Alan Cox on
> > Ultimately it is policy. The kernel simply can't read minds.
> >
> If so, all heuristics other than vm_size should be purged, I think.
> ...Or victim should be just determined by the class of application
> user sets. oom_adj other than OOM_DISABLE, searching victim process
> by black magic are all garbage.

oom_adj by value makes sense as do some of the basic heuristics - but a
lot of the complexity I would agree is completely nonsensical.

There are folks who use oom_adj weightings to influence things (notably
embedded and desktop). The embedded world would actually benefit on the
whole if the oom_adj was an absolute value because they usually know
precisely what they want to die and in what order.
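
For illustration, a minimal sketch of how that weighting is applied from userspace today (the value below is just an example; the interface is /proc/<pid>/oom_adj, roughly -16..15 plus -17 for OOM_DISABLE):

#include <stdio.h>

/* Mark the current task as a preferred OOM victim. */
int main(void)
{
	FILE *f = fopen("/proc/self/oom_adj", "w");

	if (!f)
		return 1;
	/* Positive values shift the badness score up, negative values shift
	 * it down, and -17 (OOM_DISABLE) exempts the task entirely. */
	fprintf(f, "10\n");
	return fclose(f) ? 1 : 0;
}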

Alan
From: KAMEZAWA Hiroyuki on
Alan Cox wrote:
>> > Ultimately it is policy. The kernel simply can't read minds.
>> >
>> If so, all heuristics other than vm_size should be purged, I think.
>> ...Or victim should be just determined by the class of application
>> user sets. oom_adj other than OOM_DISABLE, searching victim process
>> by black magic are all garbage.
>
> oom_adj by value makes sense as do some of the basic heuristics - but a
> lot of the complexity I would agree is completely nonsensical.
>
> There are folks who use oom_adj weightings to influence things (notably
> embedded and desktop). The embedded world would actually benefit on the
> whole if the oom_adj was an absolute value because they usually know
> precisely what they want to die and in what order.
>
okay... I guess the cause of the problem Vedran ran into is this calculation:
==
	/*
	 * Processes which fork a lot of child processes are likely
	 * a good choice. We add half the vmsize of the children if they
	 * have an own mm. This prevents forking servers to flood the
	 * machine with an endless amount of children. In case a single
	 * child is eating the vast majority of memory, adding only half
	 * to the parents will make the child our kill candidate of choice.
	 */
	list_for_each_entry(child, &p->children, sibling) {
		task_lock(child);
		if (child->mm != mm && child->mm)
			points += child->mm->total_vm/2 + 1;
		task_unlock(child);
	}
==
This makes the task launcher (the first child of some daemon) the first victim.
And... I suspect this is not good for oom_adj; I think oom_adj is set per task
with regard to that task's own memory usage.
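
A made-up example of the effect (all numbers are invented, just to show the
shape of the problem): suppose a session manager / launcher that itself uses
little memory but has ten large children:

  launcher:            total_vm =  20000 pages
  each of 10 children: total_vm = 100000 pages

  points(launcher)  ~= 20000 + 10 * (100000/2 + 1) ~= 520000
  points(any child) ~= 100000

So the launcher ends up with the highest badness score even though it is not
the task actually using the memory.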

But I'm not sure why this code is still there. Does anyone remember the
history or the benefit of this calculation?

Thanks,
-Kame




From: David Rientjes on
On Sat, 30 Jan 2010, KAMEZAWA Hiroyuki wrote:

> okay... I guess the cause of the problem Vedran ran into is this calculation:
> ==
>	/*
>	 * Processes which fork a lot of child processes are likely
>	 * a good choice. We add half the vmsize of the children if they
>	 * have an own mm. This prevents forking servers to flood the
>	 * machine with an endless amount of children. In case a single
>	 * child is eating the vast majority of memory, adding only half
>	 * to the parents will make the child our kill candidate of choice.
>	 */
>	list_for_each_entry(child, &p->children, sibling) {
>		task_lock(child);
>		if (child->mm != mm && child->mm)
>			points += child->mm->total_vm/2 + 1;
>		task_unlock(child);
>	}
> ==
> This makes the task launcher (the first child of some daemon) the first victim.

That "victim", p, is passed to oom_kill_process() which does this:

	/* Try to kill a child first */
	list_for_each_entry(c, &p->children, sibling) {
		if (c->mm == p->mm)
			continue;
		if (!oom_kill_task(c))
			return 0;
	}
	return oom_kill_task(p);

which prevents your example of the task launcher from getting killed
unless it itself is using such an egregious amount of memory that its VM
size has caused the heuristic to select the daemon in the first place.
We only look at a single level of children, and attempt to kill one of
those children not sharing memory with the selected task first, so your
example is exaggerated for dramatic value.

The oom killer has been doing this for years and I haven't noticed a huge
surge in complaints about it killing X specifically because of that code
in oom_kill_process().
From: David Rientjes on
On Sat, 30 Jan 2010, KAMEZAWA Hiroyuki wrote:

> If so, all heuristics other than vm_size should be purged, I think.

I don't recall anybody disagreeing about removing some of the current
heuristics, but there is value in those beyond simply total_vm: we want to
penalize tasks that do not share any mems_allowed with the triggering
task, for example, otherwise it can lead to needless oom killing. Many
people believe we should keep the slight penalty for superuser tasks over
regular user tasks as well.

Auditing the badness() function is a worthwhile endeavor and I think you'd
be most successful if you tweaked the various penalties (runtime, nice,
capabilities, etc) to reflect how much each is valued in terms of VM size,
the baseline. I doubt anybody would defend simply dividing by 4 or
multiplying by 2 as being scientific.
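
For reference, the adjustments being discussed look roughly like this today
(a simplified sketch, not verbatim kernel code; exact constants and
conditions may differ):

	/* baseline: points = total_vm, plus half of each child's total_vm */

	/* long-running tasks and tasks that have consumed a lot of CPU are
	 * assumed to be more important, so their score is scaled down */
	if (cpu_time)
		points /= int_sqrt(cpu_time);
	if (run_time)
		points /= int_sqrt(int_sqrt(run_time));

	/* niced tasks are probably less important */
	if (task_nice(p) > 0)
		points *= 2;

	/* slight protection for superuser tasks */
	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
	    has_capability_noaudit(p, CAP_SYS_RESOURCE))
		points /= 4;

	/* killing a task whose mems_allowed doesn't intersect ours is less
	 * likely to help, so its score is reduced */
	if (!has_intersects_mems_allowed(p))
		points /= 8;

	/* finally oom_adj acts as a bit shift on the score */
	if (oom_adj > 0)
		points <<= oom_adj;
	else if (oom_adj < 0)
		points >>= -oom_adj;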
From: Vedran Furač on
Alan Cox wrote:

>> off by default. Problem is that it breaks java and some other stuff that
>> allocates much more memory than it needs. Very quickly Committed_AS hits
>> CommitLimit and one cannot allocate any more while there is plenty of
>> memory still unused.
>
> So how about you go and have a complain at the people who are causing
> your problem, rather than the kernel.

That would go completely unnoticed and ignored as long as overcommit
is enabled by default.

>>> theoretical limit, but you generally need more swap (it's one of the
>>> reasons why things like BSD historically have a '3 * memory' rule).
>> Say I have 8GB of memory and there's always some free, why would I need
>> swap?
>
> So that all the applications that allocate tons of address space and
> don't use it can swap when you hit that corner case, and as a result you
> don't need to go OOM. You should only get an OOM when you run out of
> memory + swap.

Yes, but unfortunately using swap makes the machine crawl with heavy disk IO
every time you go back to an application you haven't used for a few
hours. So recently more and more people have been disabling it completely,
with positive results.

>>> So sounds to me like a problem between the keyboard and screen (coupled
>> Unfortunately it is not. Give me ssh access to your computer (leave
>> overcommit on) and I'll kill your X with anything running on it.
>
> If you have overcommit on then you can cause stuff to get killed. Thats
> what the option enables.

s/stuff/wrong stuff/

> It's really very simple: overcommit off you must have enough RAM and swap
> to hold all allocations requested. Overcommit on - you don't need this
> but if you do use more than is available on the system something has to
> go.
>
> It's kind of like banking overcommit off is proper banking, overcommit
> on is modern western banking.

Hehe, yes and you know the consequences.

If you look at malloc(3), you will see this:

"This means that when malloc() returns non-NULL there is no guarantee
that the memory really is available. This is a really bad bug."
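
A tiny userspace demonstration of what that means in practice (a sketch; the
chunk size and count are arbitrary, and actually running it may well get
something OOM-killed):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK	(256UL * 1024 * 1024)	/* 256 MiB per allocation */
#define NCHUNKS	512			/* 128 GiB of address space in total */

int main(void)
{
	static char *chunks[NCHUNKS];
	int i;

	/* With overcommit enabled (mode 0 or 1) these calls typically all
	 * return non-NULL, even on a machine with far less than 128 GiB of
	 * RAM + swap, because no pages are actually allocated yet. */
	for (i = 0; i < NCHUNKS; i++) {
		chunks[i] = malloc(CHUNK);
		if (!chunks[i]) {
			printf("malloc refused after %d MiB\n", i * 256);
			return 1;
		}
	}
	printf("all %d allocations succeeded\n", NCHUNKS);

	/* Touching the memory is what makes it real; somewhere in this loop
	 * the OOM killer picks a victim instead of malloc() having returned
	 * NULL above. With overcommit mode 2 the program would instead have
	 * failed cleanly in the first loop. */
	for (i = 0; i < NCHUNKS; i++)
		memset(chunks[i], 1, CHUNK);

	return 0;
}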

So, if you don't want to change the OOM algorithm, why not fix this
bug instead? And after that, change the proc(5) manpage entry for
/proc/sys/vm/overcommit_memory into something like:

0: heuristic overcommit (enable this if you have memory problems with
some buggy software)
1: always overcommit, never check
2: always check, never overcommit (this is the default)

Regards,
Vedran


--
http://vedranf.net | a8e7a7783ca0d460fee090cc584adc12