Memory overcommit [Kernel]

Prev: [PATCH -v6 00/13] ftrace for MIPS
Next: [Bug #14372] ath5k wireless not working after suspend-resume - eeepc

From: KAMEZAWA Hiroyuki on 30 Oct 2009 05:40

On Fri, 30 Oct 2009 02:10:37 -0700 (PDT)
David Rientjes <rientjes(a)google.com> wrote:

> > - The kernel can't know the program is bad or not. just guess it.
>
> Totally irrelevant, given your fourth point about /proc/pid/oom_adj. We
> can tell the kernel what we'd like the oom killer behavior should be if
> the situation arises.
>

My point is that the server cannot distinguish memory leak from intentional
memory usage. No other than that.

> > - Then, there is no "correct" OOM-Killer other than fork-bomb killer.
>
> Well of course there is, you're seeing this is a WAY too simplistic
> manner. If we are oom, we want to be able to influence how the oom killer
> behaves and respond to that situation. You are proposing that we change
> the baseline for how the oom killer selects tasks which we use CONSTANTLY
> as part of our normal production environment. I'd appreciate it if you'd
> take it a little more seriously.
>
Yes, I'm serious.

In this summer, at lunch with a daily linux user, I was said
"you, enterprise guys, don't consider desktop or laptop problem at all."
yes, I use only servers. My customer uses server, too. My first priority
is always on server users.
But, for this time, I wrote reply to Vedran and try to fix desktop problem.
Even if current logic works well for servers, "KDE/GNOME is killed" problem
seems to be serious. And this may be a problem for EMBEDED people, I guess.

> > - User has a knob as oom_adj. This is very strong.
> >
>
> Agreed.
>
This and memcg are very useful. But everone says "bad workaround" ;(
Maybe only servers can use these functions.

> > Then, there is only "reasonable" or "easy-to-understand" OOM-Kill.
> > "Current biggest memory eater is killed" sounds reasonable, easy to
> > understand. And if total_vm works well, overcommit_guess should catch it.
> > Please improve overcommit_guess if you want to stay on total_vm.
> >
>
> I don't necessarily want to stay on total_vm, but I also don't want to
> move to rss as a baseline, as you would probably agree.
>
I'll rewrite all. I'll not rely only on rss. There are several situations
and we need some more information than we have know. I'll have to implement
ways to gather information before chaging badness.

> We disagree about a very fundamental principle: you are coming from a
> perspective of always wanting to kill the biggest resident memory eater
> even for a single order-0 allocation that fails and I'm coming from a
> perspective of wanting to ensure that our machines know how the oom killer
> will react when it is used.
yes.

> Moving to rss reduces the ability of the user to specify an expected oom
> priority other than polarizing it by either
> disabling it completely with an oom_adj value of -17 or choosing the
> definite next victim with +15. That's my objection to it: the user cannot
> possibly be expected to predict what proportion of each application's
> memory will be resident at the time of oom.
>
I can say the same thing to total_vm size. total_vm size doesn't include any
good information for oom situation. And tweaking based on that not-useful
parameter will make things worse.

For oom_adj tweak, we may need other technique other than "shift".
If I've wrote oom_adj, I'll write it as

/proc/<pid>/guarantee_nooom_size

#echo 3G > /proc/<pid>/guarantee_nooom_size

Then, 3G bytes of this process's memory usage will not be accounted to badness.

I'm not sure I can add new interface or replace oom_adj, now.
But to do this, current chilren's score problem etc...should be fixed.

> I understand you want to totally rewrite the oom killer for whatever
> reason, but I think you need to spend a lot more time understanding the
> needs that the Linux community has for its behavior instead of insisting
> on your point of view.
>
yes, use more time. I don't think all of changes can be in quick work.

To be honest, this is a part of work to implement "custom oom handler" cgroup.
Before going further, I'd like to fix current problem.

Thanks,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Thomas Fjellstrom on 30 Oct 2009 06:50

On Fri October 30 2009, KAMEZAWA Hiroyuki wrote:
> On Fri, 30 Oct 2009 02:10:37 -0700 (PDT)
>
> David Rientjes <rientjes(a)google.com> wrote:
> > > - The kernel can't know the program is bad or not. just guess it.
> >
> > Totally irrelevant, given your fourth point about /proc/pid/oom_adj.
> > We can tell the kernel what we'd like the oom killer behavior should be
> > if the situation arises.
>
> My point is that the server cannot distinguish memory leak from
> intentional memory usage. No other than that.
>
> > > - Then, there is no "correct" OOM-Killer other than fork-bomb
> > > killer.
> >
> > Well of course there is, you're seeing this is a WAY too simplistic
> > manner. If we are oom, we want to be able to influence how the oom
> > killer behaves and respond to that situation. You are proposing that
> > we change the baseline for how the oom killer selects tasks which we
> > use CONSTANTLY as part of our normal production environment. I'd
> > appreciate it if you'd take it a little more seriously.
>
> Yes, I'm serious.
>
> In this summer, at lunch with a daily linux user, I was said
> "you, enterprise guys, don't consider desktop or laptop problem at all."
> yes, I use only servers. My customer uses server, too. My first priority
> is always on server users.
> But, for this time, I wrote reply to Vedran and try to fix desktop
> problem. Even if current logic works well for servers, "KDE/GNOME is
> killed" problem seems to be serious. And this may be a problem for
> EMBEDED people, I guess.

Whats worse is a friend of mine gets stuck with a useless machine for a
couple hours or more when oom tries to do its thing. It swap storms for
hours. Not a good thing imo.

[snip]

--
Thomas Fjellstrom
tfjellstrom(a)shaw.ca
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Vedran Furač on 30 Oct 2009 10:00

David Rientjes wrote:

> Ok, so this is the forkbomb problem by adding half of each child's
> total_vm into the badness score of the parent. We should address this
> completely seperately by addressing that specific part of the heuristic,
> not changing what we consider to be a baseline.
> thunderbird.
>
> You're making all these claims and assertions based _solely_ on the theory
> that killing the application with the most resident RAM is always the
> optimal solution. That's just not true, especially if we're just
> allocating small numbers of order-0 memory.

Well, you are kernel hacker, not me. You know how linux mm works much
more than I do. I just reported a, what I think is a big problem, which
needs to be solved ASAP (2.6.33). I'm afraid that we'll just talk much
and nothing will be done with solution/fix postponed indefinitely. Not
sure if you are interested, but I tested this on windowsxp also, and
nothing bad happens there, system continues to function properly.

For 2-3 years I had memory overcommit turn off. I didn't get any OOM,
but sometimes Java didn't work and it seems that because of some kernel
weirdness (or misunderstanding on my part) I couldn't use all the
available memory:

# echo 2 > /proc/sys/vm/overcommit_memory

# echo 95 > /proc/sys/vm/overcommit_ratio
% ./test /* malloc in loop as before */
malloc: Cannot allocate memory /* Great, no OOM, but: */

% free -m
total used free shared buffers cached
Mem: 3458 3429 29 0 102 1119
-/+ buffers/cache: 2207 1251

There's plenty of memory available. Shouldn't cache be automatically
dropped (this question was in my original mail, hence the subject)?

All this frustrated not only me, but a great number of users on our
local Croatian linux usenet newsgroup with some of them pointing that as
the reason they use solaris. And so on...

> Much better is to allow the user to decide at what point, regardless of
> swap usage, their application is using much more memory than expected or
> required. They can do that right now pretty well with /proc/pid/oom_adj
> without this outlandish claim that they should be expected to know the rss
> of their applications at the time of oom to effectively tune oom_adj.

Believe me, barely a few developers use oom_adj for their applications,
and probably almost none of the end users. What should they do, every
time they start an application, go to console and set the oom_adj. You
cannot expect them to do that.

> What would you suggest? A script that sits in a loop checking each task's
> current rss from /proc/pid/stat or their current oom priority though
> /proc/pid/oom_score and adjusting oom_adj preemptively just in case the
> oom killer is invoked in the next second?

:)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Vedran Furač on 30 Oct 2009 10:00

David Rientjes wrote:

> On Thu, 29 Oct 2009, Vedran Furac wrote:
>
>> But then you should rename OOM killer to TRIPK:
>> Totally Random Innocent Process Killer
>>
>
> The randomness here is the order of the child list when the oom killer
> selects a task, based on the badness score, and then tries to kill a child
> with a different mm before the parent.
>
> The problem you identified in http://pastebin.com/f3f9674a0, however, is a
> forkbomb issue where the badness score should never have been so high for
> kdeinit4 compared to "test". That's directly proportional to adding the
> scores of all disjoint child total_vm values into the badness score for
> the parent and then killing the children instead.

Could you explain me why ntpd invoked oom killer? Its parent is init. Or
syslog-ng?

> That's the problem, not using total_vm as a baseline. Replacing that with
> rss is not going to solve the issue and reducing the user's ability to
> specify a rough oom priority from userspace is simply not an option.

OK then, if you have a solution, I would be glad to test your patch. I
won't care much if you don't change total_vm as a baseline. Just make
random killing history.

Regards,

Vedran

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Thomas Fjellstrom on 30 Oct 2009 10:10

On Fri October 30 2009, Vedran Furač wrote:
> David Rientjes wrote:
> > Ok, so this is the forkbomb problem by adding half of each child's
> > total_vm into the badness score of the parent. We should address this
> > completely seperately by addressing that specific part of the
> > heuristic, not changing what we consider to be a baseline.
> > thunderbird.
> >
> > You're making all these claims and assertions based _solely_ on the
> > theory that killing the application with the most resident RAM is
> > always the optimal solution. That's just not true, especially if we're
> > just allocating small numbers of order-0 memory.
>
> Well, you are kernel hacker, not me. You know how linux mm works much
> more than I do. I just reported a, what I think is a big problem, which
> needs to be solved ASAP (2.6.33). I'm afraid that we'll just talk much
> and nothing will be done with solution/fix postponed indefinitely. Not
> sure if you are interested, but I tested this on windowsxp also, and
> nothing bad happens there, system continues to function properly.
>
> For 2-3 years I had memory overcommit turn off. I didn't get any OOM,
> but sometimes Java didn't work and it seems that because of some kernel
> weirdness (or misunderstanding on my part) I couldn't use all the
> available memory:
>
> # echo 2 > /proc/sys/vm/overcommit_memory
>
> # echo 95 > /proc/sys/vm/overcommit_ratio
> % ./test /* malloc in loop as before */
> malloc: Cannot allocate memory /* Great, no OOM, but: */
>
> % free -m
> total used free shared buffers cached
> Mem: 3458 3429 29 0 102 1119
> -/+ buffers/cache: 2207 1251
>
> There's plenty of memory available. Shouldn't cache be automatically
> dropped (this question was in my original mail, hence the subject)?
>
> All this frustrated not only me, but a great number of users on our
> local Croatian linux usenet newsgroup with some of them pointing that as
> the reason they use solaris. And so on...

I think this is the MOST serious issue related to the oom killer. For some
reason it refuses to drop pages before trying to kill. When it should drop
cache, THEN kill if needed.

> > Much better is to allow the user to decide at what point, regardless of
> > swap usage, their application is using much more memory than expected
> > or required. They can do that right now pretty well with
> > /proc/pid/oom_adj without this outlandish claim that they should be
> > expected to know the rss of their applications at the time of oom to
> > effectively tune oom_adj.
>
> Believe me, barely a few developers use oom_adj for their applications,
> and probably almost none of the end users. What should they do, every
> time they start an application, go to console and set the oom_adj. You
> cannot expect them to do that.
>
> > What would you suggest? A script that sits in a loop checking each
> > task's current rss from /proc/pid/stat or their current oom priority
> > though /proc/pid/oom_score and adjusting oom_adj preemptively just in
> > case the oom killer is invoked in the next second?
> >
> :)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> in the body of a message to majordomo(a)vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
Thomas Fjellstrom
tfjellstrom(a)shaw.ca
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10 11
Prev: [PATCH -v6 00/13] ftrace for MIPS
Next: [Bug #14372] ath5k wireless not working after suspend-resume - eeepc