From: KAMEZAWA Hiroyuki on
On Tue, 3 Nov 2009 12:49:52 -0800 (PST)
David Rientjes <rientjes(a)google.com> wrote:

> On Fri, 30 Oct 2009, KAMEZAWA Hiroyuki wrote:
>
> > > > - The kernel can't know whether a program is bad or not; it can only guess.
> > >
> > > Totally irrelevant, given your fourth point about /proc/pid/oom_adj. We
> > > can tell the kernel what we'd like the oom killer behavior to be if
> > > the situation arises.
> > >
> >
> > My point is that the server cannot distinguish a memory leak from intentional
> > memory usage. Nothing more than that.
> >
>
> That's a different point. Today, we can influence the badness score of
> any user thread to prioritize oom killing from userspace, and that can be
> done regardless of whether there's a memory leaker, a fork bomber, etc.
> Priority-based oom killing is important in production scenarios and
> cannot be replaced by a heuristic that works every time if it cannot be
> influenced by userspace.
>
I didn't remove oom_adj...

> A spike in memory consumption when a process is initially forked would be
> defined as a memory leaker in your quiet_time model.
>
I'll rewrite or drop quiet_time.

> > This summer, at lunch with an everyday Linux user, I was told,
> > "you enterprise guys don't consider desktop or laptop problems at all."
> > Yes, I use only servers. My customers use servers, too. My first priority
> > is always on server users.
> > But this time, I wrote a reply to Vedran and tried to fix the desktop problem.
> > Even if the current logic works well for servers, the "KDE/GNOME is killed"
> > problem seems to be serious. And this may be a problem for EMBEDDED people, I guess.
> >
>
> You argued before that the problem wasn't specific to X (after I said you
> could protect it very trivially with /proc/pid/oom_adj set to
> OOM_DISABLE), but that's now your reasoning for rewriting the oom killer
> heuristics?
>
That's one of the reasons. My customers always suffer from the "OOM-RANDOM-KILLER".
I mentioned the "lunch" story to say that I'm not working _only_
for servers.
OK?


> > I can say the same thing about total_vm size. total_vm size doesn't carry any
> > good information for the oom situation, and tweaking based on that not-useful
> > parameter will make things worse.
> >
>
> Tweaking the heuristic will probably make it more convoluted and
> overall worse, I agree. But it's a more stable baseline than rss from
> which we can set oom killing priorities from userspace.

- "rss < total_vm_size" always.
- oom_adj culculation is quite strong.
- total_vm of processes which maps hugetlb is very big ....but killing them
is no help for usual oom.

I recommend you add a "stable baseline" knob for user space, as I wrote.
My patch 6 adds a stable baseline bonus of 50% of VM size if run_time is
large enough.
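
Roughly, the idea is something like the sketch below (illustration only, not
the actual patch 6; the runtime threshold here is a made-up number):

#define QUIET_RUNTIME_SECS	30	/* assumed "run_time is large enough" */

static unsigned long stable_baseline_bonus(unsigned long total_vm_pages,
					   unsigned long runtime_secs)
{
	/*
	 * Once a task has run long enough, count 50% of its VM size as a
	 * stable, userspace-predictable part of the score.
	 */
	if (runtime_secs >= QUIET_RUNTIME_SECS)
		return total_vm_pages / 2;

	return 0;
}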

If users can estimate how their processes use memory, that will be a good thing.
I'll add something other than oom_adj (I'm not saying I'll drop oom_adj).

Thanks,
-Kame






From: David Rientjes on
On Wed, 4 Nov 2009, KAMEZAWA Hiroyuki wrote:

> > That's a different point. Today, we can influence the badness score of
> > any user thread to prioritize oom killing from userspace, and that can be
> > done regardless of whether there's a memory leaker, a fork bomber, etc.
> > Priority-based oom killing is important in production scenarios and
> > cannot be replaced by a heuristic that works every time if it cannot be
> > influenced by userspace.
> >
> I didn't remove oom_adj...
>

Right, but we must ensure that we have the same ability to influence a
priority-based oom killing scheme from userspace as we currently do with a
relatively static total_vm. total_vm may not be the optimal baseline, but
it does allow users to tune oom_adj specifically to identify tasks that
are using more memory than expected, and it is static enough not to depend
on something like rss, which is really hard to predict at the time of oom.
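
For reference, today's adjustment sits on top of that baseline roughly like
this (a simplified sketch written from memory, not a verbatim copy of
mm/oom_kill.c):

#define OOM_DISABLE	(-17)

static unsigned long adjusted_badness(unsigned long total_vm_pages, int oom_adj)
{
	unsigned long points = total_vm_pages;	/* stable baseline */

	if (oom_adj == OOM_DISABLE)
		return 0;			/* never selected */

	if (oom_adj > 0) {
		if (!points)
			points = 1;
		points <<= oom_adj;		/* more likely to be killed */
	} else if (oom_adj < 0) {
		points >>= -oom_adj;		/* less likely to be killed */
	}

	return points;
}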

That's actually my main goal in this discussion: to avoid losing any
ability of userspace to influence the priority of tasks being oom killed
(if you haven't noticed :).

> > Tweaking the heuristic will probably make it more convoluted and
> > overall worse, I agree. But it's a more stable baseline than rss from
> > which we can set oom killing priorities from userspace.
>
> - "rss < total_vm_size" always.

But rss is much more dynamic than total_vm, that's my point.

> - The oom_adj calculation is quite strong.
> - total_vm of processes which map hugetlb is very big... but killing them
>   is no help for a usual oom.
>
> I recommend you add a "stable baseline" knob for user space, as I wrote.
> My patch 6 adds a stable baseline bonus of 50% of VM size if run_time is
> large enough.
>

There's no clear relationship between VM size and runtime. The forkbomb
heuristic itself could easily return a badness of ULONG_MAX if one is
detected using runtime and number of children, as I earlier proposed, but
that doesn't seem helpful to factor into the scoring.
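
Something along those lines could look like this (hypothetical sketch only;
the threshold and the way children are counted are made up for illustration):

#include <limits.h>

#define FORKBOMB_CHILDREN	64	/* assumed threshold */

static unsigned long forkbomb_badness(unsigned int children_in_window,
				      unsigned long normal_points)
{
	/*
	 * children_in_window: children forked within some short, recent
	 * window (however that ends up being counted).
	 */
	if (children_in_window > FORKBOMB_CHILDREN)
		return ULONG_MAX;	/* always selected first */

	return normal_points;
}
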
From: KAMEZAWA Hiroyuki on
On Tue, 3 Nov 2009 17:58:04 -0800 (PST)
David Rientjes <rientjes(a)google.com> wrote:

> On Wed, 4 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > > That's a different point. Today, we can influence the badness score of
> > > any user thread to prioritize oom killing from userspace, and that can be
> > > done regardless of whether there's a memory leaker, a fork bomber, etc.
> > > Priority-based oom killing is important in production scenarios and
> > > cannot be replaced by a heuristic that works every time if it cannot be
> > > influenced by userspace.
> > >
> > I didn't remove oom_adj...
> >
>
> Right, but we must ensure that we have the same ability to influence a
> priority-based oom killing scheme from userspace as we currently do with a
> relatively static total_vm. total_vm may not be the optimal baseline, but
> it does allow users to tune oom_adj specifically to identify tasks that
> are using more memory than expected, and it is static enough not to depend
> on something like rss, which is really hard to predict at the time of oom.
>
> That's actually my main goal in this discussion: to avoid losing any
> ability of userspace to influence the priority of tasks being oom killed
> (if you haven't noticed :).
>
> > > Tweaking the heuristic will probably make it more convoluted and
> > > overall worse, I agree. But it's a more stable baseline than rss from
> > > which we can set oom killing priorities from userspace.
> >
> > - "rss < total_vm_size" always.
>
> But rss is much more dynamic than total_vm, that's my point.
>
My point and your point are different.

1. My whole concern is the "baseline for the heuristics".
2. Your whole concern is the "baseline for a knob, such as oom_adj".

OK? For selecting a victim in the kernel, a dynamic value is much more useful.
The current "Random kill" and "Kill multiple processes" behaviors are too bad.
Considering what the oom-killer is for, I think "1" is more important.

But I know what you want, so I offer a new knob which is not affected by RSS,
as I wrote in my previous mail.

Off-topic:
As memcg keeps getting better, using the OOM killer for resource control should
come to an end, I think. Maybe Fake-NUMA+cpuset works well for Google's systems,
but please consider using memcg.



> > - The oom_adj calculation is quite strong.
> > - total_vm of processes which map hugetlb is very big... but killing them
> >   is no help for a usual oom.
> >
> > I recommend you add a "stable baseline" knob for user space, as I wrote.
> > My patch 6 adds a stable baseline bonus of 50% of VM size if run_time is
> > large enough.
> >
>
> There's no clear relationship between VM size and runtime. The forkbomb
> heuristic itself could easily return a badness of ULONG_MAX if one is
> detected using runtime and number of children, as I earlier proposed, but
> that doesn't seem helpful to factor into the scoring.
>

Old processes are important, younger ones are not. But as I wrote, I'll drop
most of patch "6", so please forget about this part.

I'm now interested in a fork-bomb killer rather than a crazy badness calculation.

Thanks,
-Kame



From: David Rientjes on
On Wed, 4 Nov 2009, KAMEZAWA Hiroyuki wrote:

> My point and your point are different.
>
> 1. My whole concern is the "baseline for the heuristics".
> 2. Your whole concern is the "baseline for a knob, such as oom_adj".
>
> OK? For selecting a victim in the kernel, a dynamic value is much more useful.
> The current "Random kill" and "Kill multiple processes" behaviors are too bad.
> Considering what the oom-killer is for, I think "1" is more important.
>
> But I know what you want, so I offer a new knob which is not affected by RSS,
> as I wrote in my previous mail.
>
> Off-topic:
> As memcg keeps getting better, using the OOM killer for resource control should
> come to an end, I think. Maybe Fake-NUMA+cpuset works well for Google's systems,
> but please consider using memcg.
>

I understand what you're trying to do, and I agree with it for most
desktop systems. However, I think that admins should have a very strong
influence over which tasks the oom killer kills. It doesn't really matter
whether it's via oom_adj or not, and it's debatable whether an adjustment on a
static heuristic score is in our best interest in the first place. But we
must have an alternative so that our control over oom killing isn't lost.

I'd also like to open another topic for discussion if you're proposing
such sweeping changes: at what point do we allow ~__GFP_NOFAIL allocations
to fail even if order < PAGE_ALLOC_COSTLY_ORDER and defer killing
anything? We both agreed that it's not always in the best interest to
kill a task so that an allocation can succeed, so we need to define some
criteria to simply fail the allocation instead.
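
To make the question concrete, the criterion would have to look something like
the sketch below (illustration only, with stand-in definitions so it is
self-contained; the retry threshold is an assumption, not proposed code):

typedef unsigned int gfp_t;
#define GFP_NOFAIL_SKETCH		0x800u	/* stands in for __GFP_NOFAIL */
#define COSTLY_ORDER_SKETCH		3	/* stands in for PAGE_ALLOC_COSTLY_ORDER */
#define MAX_RECLAIM_RETRIES_SKETCH	16	/* assumed give-up point */

static int fail_instead_of_oom(gfp_t gfp_mask, unsigned int order,
			       unsigned int reclaim_retries)
{
	if (gfp_mask & GFP_NOFAIL_SKETCH)
		return 0;	/* caller cannot handle failure */

	if (order >= COSTLY_ORDER_SKETCH)
		return 1;	/* costly orders are already allowed to fail */

	/* small orders: only give up after enough reclaim attempts */
	return reclaim_retries > MAX_RECLAIM_RETRIES_SKETCH;
}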

> Old processes are important, younger ones are not. But as I wrote, I'll drop
> most of patch "6", so please forget about this part.
>
> I'm now interested in a fork-bomb killer rather than a crazy badness calculation.
>

Ok, great. Thanks.
From: KAMEZAWA Hiroyuki on
On Tue, 3 Nov 2009 19:10:34 -0800 (PST)
David Rientjes <rientjes(a)google.com> wrote:

> On Wed, 4 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > My point and your point are different.
> >
> > 1. My whole concern is the "baseline for the heuristics".
> > 2. Your whole concern is the "baseline for a knob, such as oom_adj".
> >
> > OK? For selecting a victim in the kernel, a dynamic value is much more useful.
> > The current "Random kill" and "Kill multiple processes" behaviors are too bad.
> > Considering what the oom-killer is for, I think "1" is more important.
> >
> > But I know what you want, so I offer a new knob which is not affected by RSS,
> > as I wrote in my previous mail.
> >
> > Off-topic:
> > As memcg keeps getting better, using the OOM killer for resource control should
> > come to an end, I think. Maybe Fake-NUMA+cpuset works well for Google's systems,
> > but please consider using memcg.
> >
>
> I understand what you're trying to do, and I agree with it for most
> desktop systems. However, I think that admins should have a very strong
> influence over which tasks the oom killer kills. It doesn't really matter
> whether it's via oom_adj or not, and it's debatable whether an adjustment on a
> static heuristic score is in our best interest in the first place. But we
> must have an alternative so that our control over oom killing isn't lost.
>
I won't go too quickly, so let's discuss and rewrite the patches more later.
I'll prepare a new version next week. For this week, I'll post
swap accounting and improve the fork-bomb detector.

> I'd also like to open another topic for discussion if you're proposing
> such sweeping changes: at what point do we allow ~__GFP_NOFAIL allocations
> to fail even if order < PAGE_ALLOC_COSTLY_ORDER and defer killing
> anything? We both agreed that it's not always in the best interest to
> kill a task so that an allocation can succeed, so we need to define some
> criteria to simply fail the allocation instead.
>
Yes, I think allocations themselves (order > 0) should fail more often before we
finally invoke the OOM killer. That tends toward a soft landing rather than the
oom-killer.

Thanks,
-Kame
