From: KAMEZAWA Hiroyuki on
On Wed, 28 Oct 2009 11:47:55 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com> wrote:

> > 2. I started out running my mlock test program as root (later
> > switched to use "ulimit -l unlimited" first). But badness() reckons
> > CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
> > and CAP_SYS_RAWIO another reason to quarter your points: so running
> > as root makes you sixteen times less likely to be killed. Quartering
> > is anyway debatable, but sixteenthing seems utterly excessive to me.
> >
> > I moved the CAP_SYS_RAWIO test in with the others, so it does no
> > more than quartering; but is quartering appropriate anyway? I did
> > wonder if I was right to be "subverting" the fine-grained CAPs in
> > this way, but have since seen unrelated mail from one who knows
> > better, implying they're something of a fantasy, that su and sudo
> > are indeed what's used in the real world. Maybe this patch was okay.
>
> I agree quartering is debatable.
> At least, killing quartering is worthwhile for any user, and it can be pushed into -stable.
>
>
>
>
> From 27331555366c908a93c2cdd780b77e421869c5af Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
> Date: Wed, 28 Oct 2009 11:28:39 +0900
> Subject: [PATCH] oom: Mitigate super-user's bonus of oom-score
>
> Currently, the oom badness calculation applies the following bonuses:
> - A super-user process has its oom-score quartered
> - A CAP_SYS_RAWIO process (e.g. a database) also has its oom-score quartered
>
> The problem is that super-users have CAP_SYS_RAWIO too, so they get a
> sixteenthing bonus. That is obviously excessive and meaningless.
>
> This patch fixes it.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>

I'll pick this up into my series.

Thanks,
-Kame

> ---
> mm/oom_kill.c | 13 +++++--------
> 1 files changed, 5 insertions(+), 8 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ea2147d..40d323d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -152,18 +152,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> /*
> * Superuser processes are usually more important, so we make it
> * less likely that we kill those.
> - */
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> - has_capability_noaudit(p, CAP_SYS_RESOURCE))
> - points /= 4;
> -
> - /*
> - * We don't want to kill a process with direct hardware access.
> + *
> + * Plus, We don't want to kill a process with direct hardware access.
> * Not only could that mess up the hardware, but usually users
> * tend to only have this flag set on applications they think
> * of as important.
> */
> - if (has_capability_noaudit(p, CAP_SYS_RAWIO))
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> + has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
> + has_capability_noaudit(p, CAP_SYS_RAWIO))
> points /= 4;
>
> /*
> --
> 1.6.2.5
>
>
>
>
>
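The arithmetic behind the patch can be sketched in user space. The block below is only an illustration, not kernel code: `struct caps` and the two functions are hypothetical stand-ins for the has_capability_noaudit() checks in badness(), and only the "points /= 4" logic is taken from the diff above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the three has_capability_noaudit() results. */
struct caps {
	bool sys_admin;
	bool sys_resource;
	bool sys_rawio;
};

/* Pre-patch logic: two independent tests, so root is divided by 4 twice. */
unsigned long bonus_before(unsigned long points, struct caps c)
{
	if (c.sys_admin || c.sys_resource)
		points /= 4;
	if (c.sys_rawio)
		points /= 4;
	return points;
}

/* Post-patch logic: one combined test, so the score is quartered at most once. */
unsigned long bonus_after(unsigned long points, struct caps c)
{
	if (c.sys_admin || c.sys_resource || c.sys_rawio)
		points /= 4;
	return points;
}
```

For a task holding all three capabilities, a score of 1600 becomes 100 with the old logic and 400 with the new one. A task holding only CAP_SYS_RAWIO gets 400 either way.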

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: David Rientjes on
On Wed, 28 Oct 2009, Vedran Furac wrote:

> > This is wrong; it doesn't "emulate oom" since oom_kill_process() always
> > kills a child of the selected process instead if they do not share the
> > same memory. The chosen task in that case is untouched.
>
> OK, I stand corrected then. Thanks! But, while testing this I lost X
> once again and "test" survived for some time (check the timestamps):
>
> http://pastebin.com/d5c9d026e
>
> - It started by killing gkrellm(!!!)
> - Then I lost X (kdeinit4 I guess)
> - Then 103 seconds after the killing started, it killed "test" - the
> real culprit.
>
> I mean... how?!
>

Here are the five oom kills that occurred in your log, and notice that the
first four times it kills a child and not the actual task as I explained:

[97137.724971] Out of memory: kill process 21485 (VBoxSVC) score 1564940 or a child
[97137.725017] Killed process 21503 (VirtualBox)
[97137.864622] Out of memory: kill process 11141 (kdeinit4) score 1196178 or a child
[97137.864656] Killed process 11142 (klauncher)
[97137.888146] Out of memory: kill process 11141 (kdeinit4) score 1184308 or a child
[97137.888180] Killed process 11151 (ksmserver)
[97137.972875] Out of memory: kill process 11141 (kdeinit4) score 1146255 or a child
[97137.972888] Killed process 11224 (audacious2)

Those are practically happening simultaneously with very little memory
being available between each oom kill. Only later is "test" killed:

[97240.203228] Out of memory: kill process 5005 (test) score 256912 or a child
[97240.206832] Killed process 5005 (test)

Notice how the badness score is less than 1/4th of the others. So while
you may find it to be hogging a lot of memory, there were others that
consumed much more.
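To compare the scores in a log like this mechanically, a small parser is enough. This is a minimal sketch that assumes the exact "Out of memory: kill process ... score ... or a child" wording shown above; `struct oom_victim` and `parse_oom_line()` are names made up for the example.

```c
#include <stdio.h>

/* Hypothetical record for one "Out of memory: kill process" log line. */
struct oom_victim {
	int pid;
	char comm[64];
	unsigned long score;
};

/*
 * Parse a line such as
 *   [97240.203228] Out of memory: kill process 5005 (test) score 256912 or a child
 * Returns 0 on success, -1 if the line does not match.
 */
int parse_oom_line(const char *line, struct oom_victim *v)
{
	int n = sscanf(line,
		       "[%*f] Out of memory: kill process %d (%63[^)]) score %lu",
		       &v->pid, v->comm, &v->score);

	return n == 3 ? 0 : -1;
}
```

Feeding it the kdeinit4 and "test" lines gives 1146255 and 256912, and 4 * 256912 = 1027648 is still below 1146255.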

You can get a more detailed understanding of this by doing

echo 1 > /proc/sys/vm/oom_dump_tasks

before trying your testcase; it will show various information like the
total_vm and oom_adj value for each task at the time of oom (and the
actual badness score is exported per-task via /proc/pid/oom_score in
real-time). This will also include the rss and show what the end result
would be in using that value as part of the heuristic on this particular
workload compared to the current implementation.
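For watching the per-task value from a program rather than with cat, something along these lines should do. It is a sketch only: the helper names are invented here, and it assumes /proc/<pid>/oom_score contains a single non-negative integer.

```c
#include <stdio.h>

/* Read one integer from a procfs-style file; returns -1 on any failure. */
long read_score_file(const char *path)
{
	FILE *f = fopen(path, "r");
	long score;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &score) != 1)
		score = -1;
	fclose(f);
	return score;
}

/* Build the /proc/<pid>/oom_score path for a task and read it. */
long read_oom_score(int pid)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/%d/oom_score", pid);
	return read_score_file(path);
}
```

Calling read_oom_score() in a loop for the suspected hog and for kdeinit4 while the testcase runs shows how the two scores move relative to each other.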
--
From: David Rientjes on
On Wed, 28 Oct 2009, KOSAKI Motohiro wrote:

> I agree quartering is debatable.
> At least, killing quartering is worthwhile for any user, and it can be pushed into -stable.
>

Not sure where the -stable reference came from, I don't think this is a
candidate.

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ea2147d..40d323d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -152,18 +152,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> /*
> * Superuser processes are usually more important, so we make it
> * less likely that we kill those.
> - */
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> - has_capability_noaudit(p, CAP_SYS_RESOURCE))
> - points /= 4;
> -
> - /*
> - * We don't want to kill a process with direct hardware access.
> + *
> + * Plus, We don't want to kill a process with direct hardware access.
> * Not only could that mess up the hardware, but usually users
> * tend to only have this flag set on applications they think
> * of as important.
> */
> - if (has_capability_noaudit(p, CAP_SYS_RAWIO))
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> + has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
> + has_capability_noaudit(p, CAP_SYS_RAWIO))
> points /= 4;
>
> /*

Acked-by: David Rientjes <rientjes(a)google.com>
--
From: KAMEZAWA Hiroyuki on
On Tue, 27 Oct 2009 21:08:56 -0700 (PDT)
David Rientjes <rientjes(a)google.com> wrote:

> On Wed, 28 Oct 2009, Vedran Furac wrote:
>
> > > This is wrong; it doesn't "emulate oom" since oom_kill_process() always
> > > kills a child of the selected process instead if they do not share the
> > > same memory. The chosen task in that case is untouched.
> >
> > OK, I stand corrected then. Thanks! But, while testing this I lost X
> > once again and "test" survived for some time (check the timestamps):
> >
> > http://pastebin.com/d5c9d026e
> >
> > - It started by killing gkrellm(!!!)
> > - Then I lost X (kdeinit4 I guess)
> > - Then 103 seconds after the killing started, it killed "test" - the
> > real culprit.
> >
> > I mean... how?!
> >
>
> Here are the five oom kills that occurred in your log, and notice that the
> first four times it kills a child and not the actual task as I explained:
>
> [97137.724971] Out of memory: kill process 21485 (VBoxSVC) score 1564940 or a child
> [97137.725017] Killed process 21503 (VirtualBox)
> [97137.864622] Out of memory: kill process 11141 (kdeinit4) score 1196178 or a child
> [97137.864656] Killed process 11142 (klauncher)
> [97137.888146] Out of memory: kill process 11141 (kdeinit4) score 1184308 or a child
> [97137.888180] Killed process 11151 (ksmserver)
> [97137.972875] Out of memory: kill process 11141 (kdeinit4) score 1146255 or a child
> [97137.972888] Killed process 11224 (audacious2)
>
> Those are practically happening simultaneously with very little memory
> being available between each oom kill. Only later is "test" killed:
>
> [97240.203228] Out of memory: kill process 5005 (test) score 256912 or a child
> [97240.206832] Killed process 5005 (test)
>
> Notice how the badness score is less than 1/4th of the others. So while
> you may find it to be hogging a lot of memory, there were others that
> consumed much more.

This is not related to the child-parent problem.

Look at these numbers more closely.
==
[97137.709272] Active_anon:671487 active_file:82 inactive_anon:132316
[97137.709273] inactive_file:82 unevictable:50 dirty:0 writeback:0 unstable:0
[97137.709273] free:6122 slab:17179 mapped:30661 pagetables:8052 bounce:0
==

active_file + inactive_file is very low; almost all pages are anon.
But "mapped" (NR_FILE_MAPPED) is a little high. This implies that the remaining
file cache is mapped by many processes, OR that some megabytes of shmem are in use.

The number of pagetable pages is 8052, which means
8052 x (4096/8) x 4k bytes = ~16 Gbytes of mapped area.

Total available memory is close to active/inactive + slab:
(671487+82+132316+82+50+6122+17179+8052) = 835370 pages x 4k = ~3.2 Gbytes
(this system is swapless)

Then, considering the pmap kosaki showed,
I guess the killed tasks had a big total_vm but not much real rss,
so killing them did not help the oom.
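The two estimates above can be checked with a few lines of C. This is a sketch under the same assumptions as the formula: 4096-byte pages and 8-byte page table entries, with every pagetable page counted as a leaf page, so the result is an upper bound. The macro and function names are made up for the example.

```c
#define PAGE_SIZE_BYTES 4096ULL
#define PTE_SIZE_BYTES  8ULL	/* assumed entry size, as in the 4096/8 above */

/* Address space that nr_pagetable_pages leaf page tables could map. */
unsigned long long pagetable_coverage_bytes(unsigned long long nr_pagetable_pages)
{
	unsigned long long ptes_per_page = PAGE_SIZE_BYTES / PTE_SIZE_BYTES; /* 512 */

	return nr_pagetable_pages * ptes_per_page * PAGE_SIZE_BYTES;
}

/* Convert a page count from the oom report into bytes. */
unsigned long long pages_to_bytes(unsigned long long nr_pages)
{
	return nr_pages * PAGE_SIZE_BYTES;
}
```

pagetable_coverage_bytes(8052) is 16886267904 bytes (about 16 Gbytes), and pages_to_bytes(835370) is 3421675520 bytes (about 3.2 Gbytes).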

Thanks,
-Kame

--
From: David Rientjes on
On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:

> This is not related to the child-parent problem.
>
> Look at these numbers more closely.
> ==
> [97137.709272] Active_anon:671487 active_file:82 inactive_anon:132316
> [97137.709273] inactive_file:82 unevictable:50 dirty:0 writeback:0 unstable:0
> [97137.709273] free:6122 slab:17179 mapped:30661 pagetables:8052 bounce:0
> ==
>
> active_file + inactive_file is very low; almost all pages are anon.
> But "mapped" (NR_FILE_MAPPED) is a little high. This implies that the remaining
> file cache is mapped by many processes, OR that some megabytes of shmem are in use.
>
> The number of pagetable pages is 8052, which means
> 8052 x (4096/8) x 4k bytes = ~16 Gbytes of mapped area.
>
> Total available memory is close to active/inactive + slab:
> (671487+82+132316+82+50+6122+17179+8052) = 835370 pages x 4k = ~3.2 Gbytes
> (this system is swapless)
>

Yep:

[97137.724965] 917504 pages RAM
[97137.724967] 69721 pages reserved

(917504 - 69721) * 4K = ~3.23G
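The same check as a small helper, assuming 4K pages as above (the function name is made up for the example):

```c
/* Usable RAM in bytes from the "pages RAM" and "pages reserved" lines. */
unsigned long long usable_ram_bytes(unsigned long total_pages,
				    unsigned long reserved_pages)
{
	return (unsigned long long)(total_pages - reserved_pages) * 4096ULL;
}
```

usable_ram_bytes(917504, 69721) is 3472519168 bytes, i.e. 847783 pages, in line with the ~3.2 Gbytes estimated from the per-list counters.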

> Then, considering the pmap kosaki showed,
> I guess the killed tasks had a big total_vm but not much real rss,
> so killing them did not help the oom.
>

echo 1 > /proc/sys/vm/oom_dump_tasks can confirm that.

The bigger issue is making the distinction between killing a rogue task
that is using much more memory than expected (the supposed current
behavior, influenced from userspace by /proc/pid/oom_adj), and killing the
task with the highest rss. The latter is definitely desired if we are
allocating tons of memory but reduces the ability of the user to influence
the badness score.