Memory overcommit [Kernel]


From: KAMEZAWA Hiroyuki on 28 Oct 2009 02:10

On Tue, 27 Oct 2009 22:13:44 -0700 (PDT)
David Rientjes <rientjes(a)google.com> wrote:

> Yep:
>
> [97137.724965] 917504 pages RAM
> [97137.724967] 69721 pages reserved
>
> (917504 - 69721) * 4K = ~3.23G
>
> > Then, considering the pmap kosaki shows,
> > I guess the killed ones had a big total_vm but not much real rss,
> > so killing them did not help with the oom.
> >
>
> echo 1 > /proc/sys/vm/oom_dump_tasks can confirm that.
>
yes.
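
The page arithmetic quoted above can be checked with a short sketch. It assumes 4 KiB pages (the "4K" in the quoted calculation); the function name usable_ram_gib is made up for illustration.

```python
PAGE_SIZE = 4096  # bytes; assumes the 4K page size used in the quoted calculation


def usable_ram_gib(total_pages, reserved_pages, page_size=PAGE_SIZE):
    """Return the RAM left after subtracting reserved pages, in GiB."""
    return (total_pages - reserved_pages) * page_size / 2**30


if __name__ == "__main__":
    # Page counts taken from the quoted kernel log lines.
    print(f"{usable_ram_gib(917504, 69721):.2f} GiB")
```

With the two page counts from the log this comes out at roughly 3.23 GiB, matching the figure in the quoted mail.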

> The bigger issue is making the distinction between killing a rogue task
> that is using much more memory than expected (the supposed current
> behavior, influenced from userspace by /proc/pid/oom_adj), and killing the
> task with the highest rss.

All kernel engineers know that the kernel can never tell whether a task uses
"more than expected" or not.
So the oom_adj workaround is used now (by some special users).
The OOM killer itself is also a workaround.
"No kill" would be best, but we know there tend to be memory leakers on bad
systems, and no system in this world is perfect.

From the kernel's point of view, there is no difference between a rogue task
and the one with the highest rss.
As a heuristic, "time" is used now, but it is not very trustworthy.

> The latter is definitely desired if we are
> allocating tons of memory but reduces the ability of the user to influence
> the badness score.
>

Yes, some more trustworthy values other than vmsize/rss/time would be appreciated.
I wonder whether recent memory consumption speed could be another key value.

Anyway, the current behavior of "killing X" is a bad thing.
We need some fixes.

Thanks,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: David Rientjes on 28 Oct 2009 02:20

On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:

> All kernel engineers know that the kernel can never tell whether a task uses
> "more than expected" or not.
> So the oom_adj workaround is used now (by some special users).
> The OOM killer itself is also a workaround.
> "No kill" would be best, but we know there tend to be memory leakers on bad
> systems, and no system in this world is perfect.
>

Right, and historically that has been addressed by considering total_vm
and adjusting it with oom_adj so that we can identify memory leaking tasks
through user-defined criteria.

> Yes, some more trustworthy values other than vmsize/rss/time would be appreciated.
> I wonder whether recent memory consumption speed could be another key value.
>

Sounds very logical.

> Anyway, the current behavior of "killing X" is a bad thing.
> We need some fixes.
>

You can easily protect X with OOM_DISABLE, as you know. I don't think we
need any X-specific heuristics added to the kernel, it looks like the
special cases have already polluted badness() enough.
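
As a rough illustration of the oom_adj mechanism mentioned above, the sketch below models how the adjustment was applied to a badness score in kernels of this era: the score is bit-shifted left for positive values and right for negative ones, and OOM_DISABLE (-17) exempts the task entirely. This is a simplified model under those assumptions, not the kernel's code, and the function name adjusted_badness is hypothetical.

```python
OOM_DISABLE = -17  # oom_adj value that exempts a task from the OOM killer


def adjusted_badness(points, oom_adj):
    """Simplified model of how oom_adj scales a badness score."""
    if oom_adj == OOM_DISABLE:
        return 0  # the task is never selected
    if oom_adj > 0:
        # A zero score is bumped to 1 so that the shift has an effect.
        return max(points, 1) << oom_adj
    # Zero or negative values shrink the score (a shift by 0 is a no-op).
    return points >> -oom_adj


if __name__ == "__main__":
    for adj in (OOM_DISABLE, -2, 0, 3):
        print(adj, adjusted_badness(1000, adj))
```

In practice the value is set by writing to /proc/pid/oom_adj, so protecting X this way is a userspace policy decision rather than a kernel heuristic.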

From: KAMEZAWA Hiroyuki on 28 Oct 2009 02:30

On Tue, 27 Oct 2009 23:17:41 -0700 (PDT)
David Rientjes <rientjes(a)google.com> wrote:

> On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:
>
> > All kernel engineers know that the kernel can never tell whether a task uses
> > "more than expected" or not.
> > So the oom_adj workaround is used now (by some special users).
> > The OOM killer itself is also a workaround.
> > "No kill" would be best, but we know there tend to be memory leakers on bad
> > systems, and no system in this world is perfect.
> >
>
> Right, and historically that has been addressed by considering total_vm
> and adjusting it with oom_adj so that we can identify memory leaking tasks
> through user-defined criteria.
>
> > Yes, some more trustworthy values other than vmsize/rss/time would be appreciated.
> > I wonder whether recent memory consumption speed could be another key value.
> >
>
> Sounds very logical.
>
> > Anyway, the current behavior of "killing X" is a bad thing.
> > We need some fixes.
> >
>
> You can easily protect X with OOM_DISABLE, as you know. I don't think we
> need any X-specific heuristics added to the kernel, it looks like the
> special cases have already polluted badness() enough.
>
It's _not_ special to X.

Almost any application that uses many dynamic libraries can be affected by this
total_vm accounting. And, as I explained to Vedran, a multi-threaded program like Java
can easily increase total_vm without using much anon_rss.
That is the reason I hate overcommit_memory. The size of the VM doesn't tell us anything.

Thanks,
-Kame
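
To see the gap between total_vm and rss described above for a single process, the following sketch parses the VmSize and VmRSS fields from text in the format of /proc/pid/status. The helper name vm_size_and_rss_kib is hypothetical, and the sample numbers are invented for illustration, not measured.

```python
def vm_size_and_rss_kib(status_text):
    """Extract VmSize and VmRSS (in KiB) from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            # The value is followed by the unit, e.g. "2097152 kB".
            fields[key] = int(rest.split()[0])
    return fields.get("VmSize", 0), fields.get("VmRSS", 0)


# Invented sample: a process with a large address space but little resident memory.
SAMPLE_STATUS = "Name:\tjava\nVmSize:\t 2097152 kB\nVmRSS:\t   65536 kB\n"

if __name__ == "__main__":
    size, rss = vm_size_and_rss_kib(SAMPLE_STATUS)
    print(f"VmSize={size} KiB, VmRSS={rss} KiB, ratio={size / rss:.0f}x")
```

On a Linux system the same function can be fed the contents of /proc/self/status or any other pid's status file.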


From: Hugh Dickins on 28 Oct 2009 04:20

On Tue, 27 Oct 2009, David Rientjes wrote:
>
> Not sure where the -stable reference came from, I don't think this is a
> candidate.

I agree with David, this is only one little piece of a messy puzzle,
there's no good reason to rush this into -stable.

> > + if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> > + has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
> > + has_capability_noaudit(p, CAP_SYS_RAWIO))
>
> Acked-by: David Rientjes <rientjes(a)google.com>

Acked-by: Hugh Dickins <hugh.dickins(a)tiscali.co.uk>

(as far as it goes: the whole thing of quartering badness here
because "we don't want to kill" and "important" is questionable;
but definitely much more open to argument both ways than sixteenthing).
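
The "quartering" discussed above can be pictured with a small model: a task holding any of the three capabilities in the patch fragment has its badness score divided by four, once. This is a sketch based only on the fragment and comments in this thread, not on the kernel source; the function name and the set-of-strings representation of capabilities are assumptions.

```python
# Capabilities named in the quoted patch fragment.
PRIVILEGED_CAPS = {"CAP_SYS_ADMIN", "CAP_SYS_RESOURCE", "CAP_SYS_RAWIO"}


def scale_for_capabilities(points, task_caps):
    """Divide the badness score by four if the task holds any listed capability."""
    if PRIVILEGED_CAPS & set(task_caps):
        return points // 4
    return points


if __name__ == "__main__":
    print(scale_for_capabilities(1000, {"CAP_SYS_RAWIO"}))
    print(scale_for_capabilities(1000, set()))
```

Holding two or three of the capabilities still divides the score only once in this model, which is the difference from the "sixteenthing" Hugh refers to.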

From: Vedran Furač on 28 Oct 2009 09:30

David Rientjes wrote:

> On Wed, 28 Oct 2009, Vedran Furac wrote:
>
>>> This is wrong; it doesn't "emulate oom" since oom_kill_process() always
>>> kills a child of the selected process instead if they do not share the
>>> same memory. The chosen task in that case is untouched.
>> OK, I stand corrected then. Thanks! But, while testing this I lost X
>> once again and "test" survived for some time (check the timestamps):
>>
>> http://pastebin.com/d5c9d026e
>>
>> - It started by killing gkrellm(!!!)
>> - Then I lost X (kdeinit4 I guess)
>> - Then 103 seconds after the killing started, it killed "test" - the
>> real culprit.
>>
>> I mean... how?!
>>
>
> Here are the five oom kills that occurred in your log, and notice that the
> first four times it kills a child and not the actual task as I explained:

Yes, but four times wrong.

> Those are practically happening simultaneously with very little memory
> being available between each oom kill. Only later is "test" killed:
>
> [97240.203228] Out of memory: kill process 5005 (test) score 256912 or a child
> [97240.206832] Killed process 5005 (test)
>
> Notice how the badness score is less than 1/4th of the others. So while
> you may find it to be hogging a lot of memory, there were others that
> consumed much more.
^^^^^^^^^^^^^^^^^^^^^

This is just wrong. I have 3.5GB of RAM, free says that 2GB are empty
(ignoring cache). Culprit then allocates all free memory (2GB). That
means it is using *more* than all other processes *together*. There
cannot be any other "that consumed much more".

> You can get a more detailed understanding of this by doing
>
> echo 1 > /proc/sys/vm/oom_dump_tasks
>
> before trying your testcase; it will show various information like the
> total_vm

Looking at total_vm (VIRT in top/vsize in ps?) is completely wrong. If I
sum up those numbers for every process running I would get:

% ps -eo pid,vsize,command | awk '{ SUM += $2 } END { print SUM/1024/1024 }'
14.7935

14GB. And I only have 3GB. I usually use exmap to get realistic numbers:

http://www.berthels.co.uk/exmap/doc.html
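
The awk one-liner above can be extended to sum rss alongside vsize, which makes the gap between the two visible. The sketch below works on text in the format of `ps -eo pid,vsize,rss,command` (both sizes reported in KiB); the function name and the sample rows are invented for illustration.

```python
def sum_vsize_and_rss_gib(ps_output):
    """Sum the vsize and rss columns of ps output and return both in GiB."""
    total_vsize_kib = 0
    total_rss_kib = 0
    for line in ps_output.splitlines()[1:]:  # skip the header row
        cols = line.split(None, 3)  # pid, vsize, rss, command (may contain spaces)
        if len(cols) < 3:
            continue
        total_vsize_kib += int(cols[1])
        total_rss_kib += int(cols[2])
    return total_vsize_kib / 1024**2, total_rss_kib / 1024**2


# Invented sample rows, not taken from the system discussed in this thread.
SAMPLE_PS = (
    "  PID     VSZ     RSS COMMAND\n"
    "    1 1048576    2048 init\n"
    " 5005 2097152 1048576 ./test --loop\n"
)

if __name__ == "__main__":
    vsize_gib, rss_gib = sum_vsize_and_rss_gib(SAMPLE_PS)
    print(f"vsize total: {vsize_gib:.2f} GiB, rss total: {rss_gib:.2f} GiB")
```

Summed rss still counts shared pages once per process, so it overstates real usage too, though far less than summed vsize.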

> and oom_adj value for each task at the time of oom (and the
> actual badness score is exported per-task via /proc/pid/oom_score in
> real-time). This will also include the rss and show what the end result
> would be in using that value as part of the heuristic on this particular
> workload compared to the current implementation.

Thanks, I'll try that... but I guess that using rss would yield better
results.

Regards,

Vedran
