From: Vedran Furač on
David Rientjes wrote:

> On Sat, 30 Jan 2010, Vedran Furac wrote:
>
>>> The oom killer has been doing this for years and I haven't noticed a huge
>>> surge in complaints about it killing X specifically because of that code
>>> in oom_kill_process().
>> Well you said it yourself, you won't see a surge because "oom killer has
>> been doing this *for years*". So you'll have a more or less constant number
>> of complaints over the years. Just google for: linux, random, kill, memory;
>
> You snipped the code segment where I demonstrated that the selected task
> for oom kill is not necessarily the one chosen to die: if there is a child
> with disjoint memory that is killable, it will be selected instead. If
> Xorg or sshd is being chosen for kill, then you should investigate why
> that is, but there is nothing random about how the oom killer chooses
> tasks to kill.

I know that it isn't random, but it sure looks that way to the end user,
and I use the word to emphasize the problem. As for me investigating,
that's simply not possible, as I am not a kernel hacker who understands
the code beyond the syntax level. I can only point to the problem in the
hope that someone will fix it.

> The facts that you're completely ignoring are that changing the heuristic
> baseline to rss is not going to prevent Xorg or sshd from being selected

In my tests a simple "ps -eo rss,command --sort rss" always showed the
culprit, but OK, find another approach to fixing the problem in the hope
of a positive review. It's just... I feel everything will be swept under
the carpet, with fingers in ears while singing "everything is fine".
Prove me wrong.

Regards,
Vedran


--
http://vedranf.net | a8e7a7783ca0d460fee090cc584adc12
From: KAMEZAWA Hiroyuki on
On Fri, 29 Jan 2010 13:07:01 -0800 (PST)
David Rientjes <rientjes(a)google.com> wrote:

> On Sat, 30 Jan 2010, KAMEZAWA Hiroyuki wrote:
>
> > okay...I guess the cause of the problem Vedran met came from
> > this calculation.
> > ==
> > /*
> >  * Processes which fork a lot of child processes are likely
> >  * a good choice. We add half the vmsize of the children if they
> >  * have an own mm. This prevents forking servers to flood the
> >  * machine with an endless amount of children. In case a single
> >  * child is eating the vast majority of memory, adding only half
> >  * to the parents will make the child our kill candidate of
> >  * choice.
> >  */
> > list_for_each_entry(child, &p->children, sibling) {
> >     task_lock(child);
> >     if (child->mm != mm && child->mm)
> >         points += child->mm->total_vm/2 + 1;
> >     task_unlock(child);
> > }
> > ==
> > This makes the task launcher (the first child of some daemon) the first
> > victim.
>
> That "victim", p, is passed to oom_kill_process() which does this:
>
> /* Try to kill a child first */
> list_for_each_entry(c, &p->children, sibling) {
>     if (c->mm == p->mm)
>         continue;
>     if (!oom_kill_task(c))
>         return 0;
> }
> return oom_kill_task(p);
>

Then, finally, is the per-process oom_adj (!= OOM_DISABLE) control ignored?
That seems broken.

I think all this children-parent logic is bad.

Thanks,
-Kame



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: David Rientjes on
On Mon, 1 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > > /*
> > >  * Processes which fork a lot of child processes are likely
> > >  * a good choice. We add half the vmsize of the children if they
> > >  * have an own mm. This prevents forking servers to flood the
> > >  * machine with an endless amount of children. In case a single
> > >  * child is eating the vast majority of memory, adding only half
> > >  * to the parents will make the child our kill candidate of
> > >  * choice.
> > >  */
> > > list_for_each_entry(child, &p->children, sibling) {
> > >     task_lock(child);
> > >     if (child->mm != mm && child->mm)
> > >         points += child->mm->total_vm/2 + 1;
> > >     task_unlock(child);
> > > }
> > > ==
> > > This makes the task launcher (the first child of some daemon) the first
> > > victim.
> >
> > That "victim", p, is passed to oom_kill_process() which does this:
> >
> > /* Try to kill a child first */
> > list_for_each_entry(c, &p->children, sibling) {
> >     if (c->mm == p->mm)
> >         continue;
> >     if (!oom_kill_task(c))
> >         return 0;
> > }
> > return oom_kill_task(p);
> >
>
> Then, finally, is the per-process oom_adj (!= OOM_DISABLE) control ignored?
> That seems broken.
>

No, oom_kill_task() returns 1 if the child has OOM_DISABLE set, meaning it
never gets killed and we continue iterating through the child list. If
there are no children with separate memory to kill, the selected task gets
killed. This prevents things like sshd or bash from getting killed unless
they are actually the memory leaker themselves.
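
As a rough illustration of the control flow described above, here is a
userspace sketch in plain C. It is not kernel code: struct sim_task, the
mm_id field, and the sim_* function names are hypothetical stand-ins
invented for this example, and SIM_OOM_DISABLE only mirrors the value of
OOM_DISABLE. It shows a child with OOM_DISABLE being skipped while the
loop continues to the next child with separate memory.

```c
#include <stddef.h>

/* Mirrors the kernel's OOM_DISABLE value; the name is local to this sketch. */
#define SIM_OOM_DISABLE (-17)

/* Hypothetical userspace stand-in for a task; not the kernel's task_struct. */
struct sim_task {
    int mm_id;                  /* 0 means "no mm"; equal ids share memory */
    int oom_adj;                /* SIM_OOM_DISABLE exempts the task */
    int killed;                 /* set to 1 once the task has been "killed" */
    struct sim_task **children; /* first-generation children */
    size_t nr_children;
};

/* Returns 0 if the task was killed, 1 if it had to be skipped. */
int sim_oom_kill_task(struct sim_task *t)
{
    if (t->mm_id == 0 || t->oom_adj == SIM_OOM_DISABLE)
        return 1;
    t->killed = 1;
    return 0;
}

/* Try children with separate memory first, then fall back to p itself. */
int sim_oom_kill_process(struct sim_task *p)
{
    for (size_t i = 0; i < p->nr_children; i++) {
        struct sim_task *c = p->children[i];

        if (c->mm_id == p->mm_id)
            continue;           /* shares memory with the parent */
        if (!sim_oom_kill_task(c))
            return 0;           /* a child was killed; stop here */
    }
    return sim_oom_kill_task(p);
}
```

With a parent whose children are, in order, an OOM_DISABLE task and a
leaker with its own memory, only the leaker ends up marked as killed.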

It would naturally be better to select the child with the highest
badness() score, but it only depends on the ordering of p->children at the
moment. That's because we only want to iterate through this potentially
long list once, but improvements in this area (as well as sane tweaks to
the heuristic) would certainly be welcome.
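
One way such an improvement might look, sketched in userspace C rather
than as a patch: pick the eligible child with the highest score in a
single pass over the list. Everything here is an assumption made for
illustration, including struct sim_child, its precomputed points field,
and sim_pick_worst_child(). The cost of actually computing a badness
score for every child is not modeled.

```c
#include <stddef.h>

/* Mirrors the kernel's OOM_DISABLE value; the name is local to this sketch. */
#define SIM_OOM_DISABLE (-17)

/* Hypothetical child record with an already-computed score. */
struct sim_child {
    int mm_id;              /* 0 means "no mm"; equal ids share memory */
    int oom_adj;            /* SIM_OOM_DISABLE exempts the child */
    unsigned long points;   /* assumed precomputed badness-like score */
};

/*
 * Single pass over the children: returns the index of the eligible child
 * with the highest score, or -1 if no child is eligible. On a tie the
 * earliest child in list order wins.
 */
long sim_pick_worst_child(const struct sim_child *kids, size_t n,
                          int parent_mm_id)
{
    long best = -1;

    for (size_t i = 0; i < n; i++) {
        if (kids[i].mm_id == 0 || kids[i].mm_id == parent_mm_id)
            continue;       /* no memory of its own to reclaim */
        if (kids[i].oom_adj == SIM_OOM_DISABLE)
            continue;       /* exempt from oom kill */
        if (best < 0 || kids[i].points > kids[best].points)
            best = (long)i;
    }
    return best;
}
```

Compared with killing the first eligible child, the result no longer
depends on the ordering of the list, only on the scores.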
From: David Rientjes on
On Sun, 31 Jan 2010, Vedran Furac wrote:

> > You snipped the code segment where I demonstrated that the selected task
> > for oom kill is not necessarily the one chosen to die: if there is a child
> > with disjoint memory that is killable, it will be selected instead. If
> > Xorg or sshd is being chosen for kill, then you should investigate why
> > that is, but there is nothing random about how the oom killer chooses
> > tasks to kill.
>
> I know that it isn't random, but it sure looks that way to the end user,
> and I use the word to emphasize the problem. As for me investigating,
> that's simply not possible, as I am not a kernel hacker who understands
> the code beyond the syntax level. I can only point to the problem in the
> hope that someone will fix it.
>

Disregarding for a moment the opportunity that userspace has to influence
the oom killer's selection, it really tends to favor killing the tasks
that are largest in size. The tasks that typically get the highest
badness score are those with the highest mm->total_vm; it's that simple.
There are definitely corner cases where the first-generation children have
a strong influence, but they are often killed either because they are
themselves a thread group leader with memory separate from the parent, or
because the oom killer kills a task with separate memory before the
selected task. It's completely natural for the oom killer to select bash,
for example, when in actuality it will kill a memory leaker that has a
high badness score as a result of the logic in oom_kill_process().
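
To make the arithmetic concrete, here is a small userspace sketch of the
part of the score discussed in this thread: the task's own total_vm plus
"total_vm/2 + 1" for each child that has its own mm. The struct and
function names are hypothetical, and the other adjustments the real
heuristic applies (including oom_adj) are deliberately left out, so treat
it as an approximation and not as the kernel's badness().

```c
#include <stddef.h>

/* Hypothetical stand-in holding only what this sketch needs. */
struct sim_mm_task {
    int mm_id;              /* 0 means "no mm"; equal ids share memory */
    unsigned long total_vm; /* size of the address space, in pages */
};

/*
 * Baseline score: own total_vm, plus half of each child's total_vm
 * (plus one) for children that have a separate mm.
 */
unsigned long sim_badness(const struct sim_mm_task *p,
                          const struct sim_mm_task *children, size_t n)
{
    unsigned long points;

    if (p->mm_id == 0)
        return 0;           /* nothing to reclaim from this task */

    points = p->total_vm;
    for (size_t i = 0; i < n; i++) {
        if (children[i].mm_id != 0 && children[i].mm_id != p->mm_id)
            points += children[i].total_vm / 2 + 1;
    }
    return points;
}
```

For example, a launcher with total_vm 1000 and two children of 4000 and
6000 pages, each with its own mm, scores 1000 + 2001 + 3001 = 6002, which
is slightly above the 6000 of its largest child taken alone. That is the
kind of case where the parent is selected and the kill then falls through
to one of its children.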

If you have specific logs that you'd like to show, please enable
/proc/sys/vm/oom_dump_tasks and respond with them in another message with
that data inline.
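
For anyone who prefers to flip that switch from a program and not from
the shell, a minimal sketch follows. The helper name sim_write_sysctl()
is made up for this example; the only thing taken from the message above
is the path /proc/sys/vm/oom_dump_tasks. Writing to it is assumed to
require root.

```c
#include <stdio.h>

/*
 * Write a string to a sysctl-style file.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
int sim_write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = (fputs(value, f) >= 0);
    if (fclose(f) != 0)     /* buffered write errors surface on close */
        ok = 0;
    return ok ? 0 : -1;
}
```

A call such as sim_write_sysctl("/proc/sys/vm/oom_dump_tasks", "1\n")
would then enable the task dump, after which the next oom kill should
leave the per-task data in the kernel log.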