From: David Rientjes on
On Fri, 30 Oct 2009, Vedran Furac wrote:

> > The problem you identified in http://pastebin.com/f3f9674a0, however, is a
> > forkbomb issue where the badness score should never have been so high for
> > kdeinit4 compared to "test". That's a direct consequence of adding the
> > total_vm of every disjoint child into the badness score for the
> > parent and then killing the children instead.
>
> Could you explain to me why ntpd invoked the oom killer? Its parent is
> init. Or syslog-ng?
>

Because it attempted an order-0 GFP_USER allocation and direct reclaim
could not free any pages.

The task that invoked the oom killer is simply the unlucky task that tried
an allocation that couldn't be satisfied through direct reclaim. It's
usually unrelated to the task chosen for kill unless
/proc/sys/vm/oom_kill_allocating_task is enabled (which SGI requested to
avoid excessively long tasklist scans).
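(Sketch, for illustration: the allocator-vs-victim distinction above can be modeled in a few lines of userspace Python. The task names, scores, and field names are all hypothetical; the real logic lives in the kernel's tasklist scan.)

```python
# Toy model (not kernel code): the task that triggers the oom killer is
# whichever one happened to allocate when reclaim failed; the victim is
# normally chosen by scanning the whole tasklist for the worst badness
# score, unless oom_kill_allocating_task is set.

def select_victim(tasks, allocating_task, oom_kill_allocating_task=False):
    """tasks: list of dicts with hypothetical 'name' and 'badness' fields."""
    if oom_kill_allocating_task:
        return allocating_task  # SGI's shortcut: skip the tasklist scan
    return max(tasks, key=lambda t: t["badness"])  # full scan

tasks = [
    {"name": "ntpd", "badness": 10},   # unlucky allocator
    {"name": "test", "badness": 900},  # actual memory hog
    {"name": "sshd", "badness": 5},
]

# ntpd invoked the oom killer, but "test" is killed
victim = select_victim(tasks, allocating_task=tasks[0])
```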

> > That's the problem, not using total_vm as a baseline. Replacing that with
> > rss is not going to solve the issue and reducing the user's ability to
> > specify a rough oom priority from userspace is simply not an option.
>
> OK then, if you have a solution, I would be glad to test your patch. I
> won't care much if you don't change total_vm as a baseline. Just make
> random killing history.
>

The only randomness is in which task with an mm different from the parent's
gets picked, in the order of the child list. Yes, that can be addressed by
iterating through the children more intelligently before killing one of them.
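(Sketch, for illustration: one deterministic version of that smarter iteration, with hypothetical task fields; the real fix would live in the kernel's child-scan loop.)

```python
# Instead of killing the first child whose mm differs from the parent's
# (which depends on child-list order), pick the worst-scoring such child
# deterministically.

def pick_child(parent, children):
    candidates = [c for c in children if c["mm"] != parent["mm"]]
    if not candidates:
        return parent  # no child with a separate address space to kill
    return max(candidates, key=lambda c: c["badness"])

parent = {"name": "kdeinit4", "mm": 1, "badness": 50}
children = [
    {"name": "konsole", "mm": 2, "badness": 40},
    {"name": "test",    "mm": 3, "badness": 900},  # the real hog
    {"name": "kwrite",  "mm": 4, "badness": 60},
]
# picks "test" regardless of child-list order
```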

Keep in mind that a heuristic as simple as this:

- kill the task that was started most recently by the same uid, or

- kill the task that was started most recently on the system if a root
task calls the oom killer,

would have yielded perfect results for your testcase but isn't necessarily
something that we'd ever want to see.
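(Sketch, for illustration only: the throwaway heuristic above, with hypothetical task fields. As said, it nails this one testcase; that does not make it desirable in general.)

```python
# Kill the most recently started task owned by the invoker's uid, or the
# most recently started task on the system if root invoked the oom killer.

def pick_most_recent(tasks, invoker_uid, invoker_is_root=False):
    if invoker_is_root:
        pool = tasks  # any task on the system is fair game
    else:
        pool = [t for t in tasks if t["uid"] == invoker_uid]
    return max(pool, key=lambda t: t["start_time"])  # most recently started

tasks = [
    {"name": "kdeinit4", "uid": 1000, "start_time": 100},
    {"name": "firefox",  "uid": 1000, "start_time": 200},
    {"name": "test",     "uid": 1000, "start_time": 999},  # just forked
]
```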
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: David Rientjes on
On Fri, 30 Oct 2009, Vedran Furac wrote:

> Well, you are a kernel hacker, not me. You know how the Linux mm works much
> better than I do. I just reported what I think is a big problem, one that
> needs to be solved ASAP (2.6.33).

The oom killer heuristics have not been changed recently, so why is this
suddenly a problem that needs to be immediately addressed? The heuristics
you've been referring to have been in use for at least three years.

> I'm afraid that we'll just talk a lot and nothing will be done, with the
> solution/fix postponed indefinitely. Not sure if you are interested, but
> I tested this on Windows XP as well, and nothing bad happens there; the
> system continues to function properly.
>

I'm totally sympathetic to testcases such as your own where the oom killer
seems to react in an undesirable way. I agree that it could do a much
better job at targeting "test" and killing it without negatively impacting
other tasks.

However, I don't think we can simply change the baseline (like the rss
change that has been added to -mm (??)) and consider it a major
improvement when it severely impacts how system administrators are able to
tune the badness heuristic from userspace via /proc/pid/oom_adj. I'm sure
you'd agree that user input is important in this matter and that we
should maximize that ability rather than make it more difficult. That's
my main criticism of the suggestions thus far (and, sorry, but I have to
look out for production server interests here: you can't take away our
ability to influence oom badness scoring just because other simple
heuristics may be more understandable).

> > Much better is to allow the user to decide at what point, regardless of
> > swap usage, their application is using much more memory than expected or
> > required. They can do that right now pretty well with /proc/pid/oom_adj
> > without this outlandish claim that they should be expected to know the rss
> > of their applications at the time of oom to effectively tune oom_adj.
>
> Believe me, barely a handful of developers use oom_adj for their
> applications, and probably almost none of the end users. What should they
> do: go to a console and set oom_adj every time they start an application?
> You cannot expect them to do that.
>

oom_adj is an extremely important part of our infrastructure, and although
the majority of Linux users may not use it (I know a number of opensource
programs that tune their own, however), we can't let go of our ability to
specify an oom killing priority.
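(Sketch, for context: kernels of this era applied oom_adj to the badness score as a bit shift, roughly like the following; this is simplified from memory of mm/oom_kill.c, so treat it as an approximation rather than the exact kernel code.)

```python
# /proc/pid/oom_adj shifts the computed badness score up or down;
# OOM_DISABLE (-17) exempts the task from oom killing entirely.

OOM_DISABLE = -17

def adjusted_badness(points, oom_adj):
    if oom_adj == OOM_DISABLE:
        return 0                   # never selected for kill
    if oom_adj > 0:
        return points << oom_adj   # more likely to be killed
    return points >> -oom_adj      # less likely to be killed

# e.g. a score of 1024 with oom_adj=-4 drops to 64, with oom_adj=+2 it
# becomes 4096, and with OOM_DISABLE the task is skipped entirely.
```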

There are no simple solutions to this problem: the model proposed thus
far, which has basically been to acknowledge that the oom killer is a bad
thing to encounter (but within that, some rationale was found that we can
react however we want??) and should be extremely easy to understand (just
kill the memory hog with the most resident RAM), is a non-starter.

What would be better, and what I think we'll end up with, is a
root-selectable heuristic, so that production servers and desktop machines
can use different heuristics to make oom kill selections. We already have
/proc/sys/vm/oom_kill_allocating_task, which I added 1-2 years ago to
address concerns specific to SGI and their enormously long tasklist scans.
This would be a variation on that idea: it would include different
simplistic behaviors (such as always killing the most memory-hogging task,
or killing the most recently started task by the same uid) and leave the
default heuristic much the same as it is currently.
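(Sketch, for illustration: a root-selectable policy knob might look, very roughly, like this userspace model. The policy names and task fields are hypothetical, standing in for a sysctl-like selector.)

```python
# A sysctl-like value chooses among simple victim-selection behaviors,
# with the current badness scan remaining the default.

def select_victim_by_policy(tasks, policy="badness", invoker_uid=None):
    if policy == "kill_biggest":          # most memory-hogging task
        return max(tasks, key=lambda t: t["rss"])
    if policy == "kill_newest_same_uid":  # most recently started by same uid
        pool = [t for t in tasks if t["uid"] == invoker_uid] or tasks
        return max(pool, key=lambda t: t["start_time"])
    return max(tasks, key=lambda t: t["badness"])  # default heuristic

tasks = [
    {"name": "mysqld", "rss": 200000, "badness": 500, "uid": 27,   "start_time": 10},
    {"name": "test",   "rss": 90000,  "badness": 900, "uid": 1000, "start_time": 999},
]
```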
From: Vedran Furač on
David Rientjes wrote:

> On Fri, 30 Oct 2009, Vedran Furac wrote:
>
>>> The problem you identified in http://pastebin.com/f3f9674a0, however, is a
>>> forkbomb issue where the badness score should never have been so high for
>>> kdeinit4 compared to "test". That's a direct consequence of adding the
>>> total_vm of every disjoint child into the badness score for the
>>> parent and then killing the children instead.
>> Could you explain to me why ntpd invoked the oom killer? Its parent is
>> init. Or syslog-ng?
>>
>
> Because it attempted an order-0 GFP_USER allocation and direct reclaim
> could not free any pages.
>
> The task that invoked the oom killer is simply the unlucky task that tried
> an allocation that couldn't be satisfied through direct reclaim. It's
> usually unrelated to the task chosen for kill unless
> /proc/sys/vm/oom_kill_allocating_task is enabled (which SGI requested to
> avoid excessively long tasklist scans).

Oh, well, I didn't know that. Maybe rephrasing that part of the output
would help eliminate future misinterpretation.

>> OK then, if you have a solution, I would be glad to test your patch. I
>> won't care much if you don't change total_vm as a baseline. Just make
>> random killing history.
>
> The only randomness is in which task with an mm different from the parent's
> gets picked, in the order of the child list. Yes, that can be addressed by
> iterating through the children more intelligently before killing one of them.
>
> Keep in mind that a heuristic as simple as this:
>
> - kill the task that was started most recently by the same uid, or
>
> - kill the task that was started most recently on the system if a root
> task calls the oom killer,
>
> would have yielded perfect results for your testcase but isn't necessarily
> something that we'd ever want to see.

Of course, I want an algorithm that works well in all possible situations.

Regards,

Vedran

From: Vedran Furač on
Andrea Arcangeli wrote:

> On Fri, Oct 30, 2009 at 03:41:12PM +0100, Vedran Furač wrote:
>> Oh... so this is because apps "reserve" (Committed_AS?) more than they
>> currently need.
>
> They don't actually reserve; they only end up "reserving" if overcommit is
> set to 2 (OVERCOMMIT_NEVER). Apps aren't reserving; more likely they
> simply avoid a flood of mmap calls when a single one is enough to map a
> huge MAP_PRIVATE region, like shared libs that you may only execute
> partially (this is why total_vm is usually much bigger than the real RAM
> mapped by pagetables, represented in rss). But those shared libs are
> 99% pageable and don't need to stay in swap or RAM, so
> overcommit accounting greatly overestimates the actual needs even if
> shared lib loading weren't 64bit-optimized (i.e. large and a single
> mapping).

Thanks for info!
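(Sketch, for illustration: if I understand the explanation above correctly, the gap between total_vm and rss can be shown with made-up page counts like these.)

```python
# Toy numbers: a mostly-untouched MAP_PRIVATE mapping (e.g. a large shared
# library) inflates total_vm, while rss only counts pages actually faulted
# in. All page counts here are invented for illustration.

mappings = [
    # (mapped_pages, touched_pages)
    (25600,  25600),  # heap: 100 MB, all touched
    (128000, 1280),   # big shared-lib mapping: 500 MB mapped, ~1% executed
]

total_vm = sum(mapped for mapped, _ in mappings)    # what badness scoring sees
rss      = sum(touched for _, touched in mappings)  # real RAM in pagetables

# total_vm ends up several times larger than rss, so commit accounting
# based on mapped size greatly overstates what the task actually needs.
```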

>> A the time of "malloc: Cannot allocate memory":
>>
>> CommitLimit: 3364440 kB
>> Committed_AS: 3240200 kB
>>
>> So probably everything is ok (and free is misleading). Overcommit is
>> unfortunately necessary if I want to be able to use all my memory.
>
> Add more swap.

I don't use swap. With current RAM prices, swap is history, at least
for desktops. I hate it when e.g. firefox gets swapped out if I don't use
it for a while. Removing swap decreased desktop latencies drastically,
and I don't care much if I lose 100MB of potential free memory that
could be used for disk cache...

Regards.

Vedran

From: David Rientjes on
On Fri, 30 Oct 2009, KAMEZAWA Hiroyuki wrote:

> > > - The kernel can't know whether a program is bad or not; it can only guess.
> >
> > Totally irrelevant, given your fourth point about /proc/pid/oom_adj. We
> > can tell the kernel what we'd like the oom killer behavior to be if
> > the situation arises.
> >
>
> My point is that the server cannot distinguish a memory leak from
> intentional memory usage, nothing more than that.
>

That's a different point. Today, we can influence the badness score of
any user thread to prioritize oom killing from userspace, and that can be
done regardless of whether there's a memory leaker, a fork bomber, etc.
Priority-based oom killing is important in production scenarios and
cannot be replaced by a heuristic that works every time if it cannot be
influenced by userspace.

A spike in memory consumption when a process is initially forked would be
classified as a memory leak in your quiet_time model.

> This summer, at lunch with an everyday Linux user, I was told
> "you enterprise guys don't consider desktop or laptop problems at all."
> Yes, I use only servers. My customers use servers, too. My first priority
> is always the server users.
> But this time, I wrote a reply to Vedran and tried to fix the desktop
> problem. Even if the current logic works well for servers, the "KDE/GNOME
> gets killed" problem seems serious. And this may be a problem for EMBEDDED
> people too, I guess.
>

You argued before that the problem wasn't specific to X (after I said you
could protect it very trivially with /proc/pid/oom_adj set to
OOM_DISABLE), but that's now your reasoning for rewriting the oom killer
heuristics?

> I can say the same thing about total_vm size. total_vm doesn't carry any
> useful information for the oom situation, and tweaking based on that
> not-useful parameter will make things worse.
>

Tweaking the heuristic will probably make it more convoluted and
overall worse, I agree. But it's a more stable baseline than rss from
which we can set oom killing priorities from userspace.
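(Sketch, for illustration: the stability point can be shown with a toy calculation. Numbers are made up, and oom_adj is modeled as the usual bit shift on the baseline.)

```python
# A userspace-tuned shift applied to a volatile baseline (rss moves with
# reclaim pressure) can flip the victim ordering between two moments,
# while the same shift on total_vm gives a stable ordering.

def score(baseline, oom_adj):
    return baseline << oom_adj if oom_adj > 0 else baseline >> -oom_adj

# Admin deprioritizes task A (oom_adj=-2) relative to task B (oom_adj=0).
a_total_vm, b_total_vm = 4000, 900
# total_vm baseline: A scores 1000 vs B's 900, so the ordering holds.

a_rss_before, a_rss_after = 3600, 3000  # A's rss shrinks under reclaim
b_rss = 900
# rss baseline: A scores 900 before and only 750 after, so the tuned
# priority flips depending on when the oom killer happens to fire.
```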