From: Hugh Dickins on
On Tue, 27 Oct 2009, KAMEZAWA Hiroyuki wrote:
> Sigh, gnome-session has twice value of mmap(1G).
> Of course, gnome-session only uses 6M bytes of anon.
> I wonder this is because gnome-session has many children..but need to
> dig more. Does anyone has idea ?

When preparing KSM unmerge to handle OOM, I looked at how the precedent
was handled by running a little program which mmaps an anonymous region
of the same size as physical memory, then tries to mlock it. The
program was such an obvious candidate to be killed, I was shocked
by the poor decisions the OOM killer made. Usually I ran it with
mem=512M, with gnome and firefox active. Often the OOM killer killed
it right the first time, but went wrong when I tried it a second time
(I think that's because of what's already swapped out the first time).

I built up a patchset of fixes, but once I came to split them up for
submission, not one of them seemed entirely satisfactory; and Andrea's
fix to the KSM/mlock deadlock forced me to abandon even the first of
the patches (we've since then fixed the way munlocking behaves, so
in theory could revisit that; but Andrea disliked what I was trying
to do there in KSM for other reasons, so I've not touched it since).
I had to get on with KSM, so I set it all aside: none of the issues
was a recent regression.

I did briefly wonder about the reliance on total_vm which you're now
looking into, but didn't touch that at all. Let me describe those
issues which I did try but fail to fix - I've no more time to deal
with them now than then, but ought at least to mention them to you.

1. select_bad_process() tries to avoid killing another process while
there's still a TIF_MEMDIE, but its loop starts by skipping !p->mm
processes. However, p->mm is set to NULL well before p reaches
exit_mmap() to actually free the memory, and there may be significant
delays in between (I think exit_robust_list() gave me a hang at one
stage). So in practice, even when the OOM killer selects the right
process to kill, there can be lots of collateral damage from it not
waiting long enough for that process to give up its memory.

I tried to deal with that by moving the TIF_MEMDIE test up before
the p->mm test, but adding in a check on p->exit_state:
if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
!p->exit_state)
return ERR_PTR(-1UL);
But this is then liable to hang the system if there's some reason
why the selected process cannot proceed to free its memory (e.g.
the current KSM unmerge case). It needs to wait "a while", but
give up if no progress is made, instead of hanging: originally
I thought that setting PF_MEMALLOC more widely in page_alloc.c,
and giving up on the TIF_MEMDIE if it was waiting in PF_MEMALLOC,
would deal with that; but we cannot be sure that waiting of memory
is the only reason for a holdup there (in the KSM unmerge case it's
waiting for an mmap_sem, and there may well be other such cases).

2. I started out running my mlock test program as root (later
switched to use "ulimit -l unlimited" first). But badness() reckons
CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
and CAP_SYS_RAWIO another reason to quarter your points: so running
as root makes you sixteen times less likely to be killed. Quartering
is anyway debatable, but sixteenthing seems utterly excessive to me.

I moved the CAP_SYS_RAWIO test in with the others, so it does no
more than quartering; but is quartering appropriate anyway? I did
wonder if I was right to be "subverting" the fine-grained CAPs in
this way, but have since seen unrelated mail from one who knows
better, implying they're something of a fantasy, that su and sudo
are indeed what's used in the real world. Maybe this patch was okay.

3. badness() has a comment above it which says:
* 5) we try to kill the process the user expects us to kill, this
* algorithm has been meticulously tuned to meet the principle
* of least surprise ... (be careful when you change it)
But Andrea's 2.6.11 86a4c6d9e2e43796bb362debd3f73c0e3b198efa (later
refined by Kurt's 2.6.16 9827b781f20828e5ceb911b879f268f78fe90815)
adds plenty of surprise there, by trying to factor children into the
calculation. Intended to deal with forkbombs, but any reasonable
process whose purpose is to fork children (e.g. gnome-session)
becomes very vulnerable. And whereas badness() itself goes on to
refine the total_vm points by various adjustments peculiar to the
process in question, those refinements have been ignored when
adding the child's total_vm/2. (Andrea does remark that he'd
rather have rewritten badness() from scratch.)

I tried to fix this by moving the PF_OOM_ORIGIN (was PF_SWAPOFF)
part of the calculation up to select_bad_process(), making a
solo_badness() function which makes all those adjustments to
total_vm, then badness() itself a simple function adding half
the children's solo_badness()es to the process' own solo_badness().
But probably lots more needs doing - Andrea's rewrite?

4. In some cases those children are sharing exactly the same mm,
yet its total_vm is being added again and again to the points:
I had a nasty inner loop searching back to see if we'd already
counted this mm (but then, what if the different tasks sharing
the mm deserved different adjustments to the total_vm?).


I hope these notes help someone towards a better solution
(and be prepared to discover more on the way). I agree with
Vedran that the present behaviour is pretty unimpressive, and
I'm puzzled as to how people can have been tinkering with
oom_kill.c down the years without seeing any of this.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: David Rientjes on
On Tue, 27 Oct 2009, Hugh Dickins wrote:

> When preparing KSM unmerge to handle OOM, I looked at how the precedent
> was handled by running a little program which mmaps an anonymous region
> of the same size as physical memory, then tries to mlock it. The
> program was such an obvious candidate to be killed, I was shocked
> by the poor decisions the OOM killer made. Usually I ran it with
> mem=512M, with gnome and firefox active. Often the OOM killer killed
> it right the first time, but went wrong when I tried it a second time
> (I think that's because of what's already swapped out the first time).
>

The heuristics that the oom killer use in selecting a task seem to get
debated quite often.

What hasn't been mentioned is that total_vm does do a good job of
identifying tasks that are using far more memory than expected. That
seems to be the initial target: killing a rogue task that is hogging much
more memory than it should, probably because of a memory leak.

The latest approach seems to be focused more on killing the task that will
free the most resident memory. That certainly is understandable to avoid
killing additional tasks later and avoiding subsequent page allocations in
the short term, but doesn't help to kill the memory leaker.

There's advantages to either approach, but it depends on the contextual
goal of the oom killer when it's called: kill a rogue task that is
allocating more memory than expected, or kill a task that will free the
most memory.

> 1. select_bad_process() tries to avoid killing another process while
> there's still a TIF_MEMDIE, but its loop starts by skipping !p->mm
> processes. However, p->mm is set to NULL well before p reaches
> exit_mmap() to actually free the memory, and there may be significant
> delays in between (I think exit_robust_list() gave me a hang at one
> stage). So in practice, even when the OOM killer selects the right
> process to kill, there can be lots of collateral damage from it not
> waiting long enough for that process to give up its memory.
>
> I tried to deal with that by moving the TIF_MEMDIE test up before
> the p->mm test, but adding in a check on p->exit_state:
> if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> !p->exit_state)
> return ERR_PTR(-1UL);
> But this is then liable to hang the system if there's some reason
> why the selected process cannot proceed to free its memory (e.g.
> the current KSM unmerge case). It needs to wait "a while", but
> give up if no progress is made, instead of hanging: originally
> I thought that setting PF_MEMALLOC more widely in page_alloc.c,
> and giving up on the TIF_MEMDIE if it was waiting in PF_MEMALLOC,
> would deal with that; but we cannot be sure that waiting of memory
> is the only reason for a holdup there (in the KSM unmerge case it's
> waiting for an mmap_sem, and there may well be other such cases).
>

I've proposed an oom killer timeout in the past which adds a jiffies count
to struct task_struct and will defer killing other tasks until the
predefined time limit (we use 10*HZ) has been exceeded. The problem is
that even if you kill another task, it is highly unlikely that the expired
task will ever exit at that point and is still holding a substantial
amount of memory since it also had access to memory reserves and has still
failed to exit.

> 2. I started out running my mlock test program as root (later
> switched to use "ulimit -l unlimited" first). But badness() reckons
> CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
> and CAP_SYS_RAWIO another reason to quarter your points: so running
> as root makes you sixteen times less likely to be killed. Quartering
> is anyway debatable, but sixteenthing seems utterly excessive to me.
>
> I moved the CAP_SYS_RAWIO test in with the others, so it does no
> more than quartering; but is quartering appropriate anyway? I did
> wonder if I was right to be "subverting" the fine-grained CAPs in
> this way, but have since seen unrelated mail from one who knows
> better, implying they're something of a fantasy, that su and sudo
> are indeed what's used in the real world. Maybe this patch was okay.
>

I think someone (Nick?) proposed a patch at one time that removed most of
the heuristics from select_bad_process() other than total_vm of the task
and its children, mems_allowed intersection, and oom_adj.

> 4. In some cases those children are sharing exactly the same mm,
> yet its total_vm is being added again and again to the points:
> I had a nasty inner loop searching back to see if we'd already
> counted this mm (but then, what if the different tasks sharing
> the mm deserved different adjustments to the total_vm?).
>

oom_kill_process() may not kill the task selected by select_bad_process(),
it will first attempt to kill one of these children with a different mm.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Vedran Furač on
David Rientjes wrote:

> This is wrong; it doesn't "emulate oom" since oom_kill_process() always
> kills a child of the selected process instead if they do not share the
> same memory. The chosen task in that case is untouched.

OK, I stand corrected then. Thanks! But, while testing this I lost X
once again and "test" survived for some time (check the timestamps):

http://pastebin.com/d5c9d026e

- It started by killing gkrellm(!!!)
- Then I lost X (kdeinit4 I guess)
- Then 103 seconds after the killing started, it killed "test" - the
real culprit.

I mean... how?!


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: KAMEZAWA Hiroyuki on
On Tue, 27 Oct 2009 20:44:16 +0000 (GMT)
Hugh Dickins <hugh.dickins(a)tiscali.co.uk> wrote:

> On Tue, 27 Oct 2009, KAMEZAWA Hiroyuki wrote:
> > Sigh, gnome-session has twice value of mmap(1G).
> > Of course, gnome-session only uses 6M bytes of anon.
> > I wonder this is because gnome-session has many children..but need to
> > dig more. Does anyone has idea ?
>
> When preparing KSM unmerge to handle OOM, I looked at how the precedent
> was handled by running a little program which mmaps an anonymous region
> of the same size as physical memory, then tries to mlock it. The
> program was such an obvious candidate to be killed, I was shocked
> by the poor decisions the OOM killer made. Usually I ran it with
> mem=512M, with gnome and firefox active. Often the OOM killer killed
> it right the first time, but went wrong when I tried it a second time
> (I think that's because of what's already swapped out the first time).
>
> I built up a patchset of fixes, but once I came to split them up for
> submission, not one of them seemed entirely satisfactory; and Andrea's
> fix to the KSM/mlock deadlock forced me to abandon even the first of
> the patches (we've since then fixed the way munlocking behaves, so
> in theory could revisit that; but Andrea disliked what I was trying
> to do there in KSM for other reasons, so I've not touched it since).
> I had to get on with KSM, so I set it all aside: none of the issues
> was a recent regression.
>
> I did briefly wonder about the reliance on total_vm which you're now
> looking into, but didn't touch that at all. Let me describe those
> issues which I did try but fail to fix - I've no more time to deal
> with them now than then, but ought at least to mention them to you.
>
Okay, thank you for detailed information.


> 1. select_bad_process() tries to avoid killing another process while
> there's still a TIF_MEMDIE, but its loop starts by skipping !p->mm
> processes. However, p->mm is set to NULL well before p reaches
> exit_mmap() to actually free the memory, and there may be significant
> delays in between (I think exit_robust_list() gave me a hang at one
> stage). So in practice, even when the OOM killer selects the right
> process to kill, there can be lots of collateral damage from it not
> waiting long enough for that process to give up its memory.
>
Hmm.

> I tried to deal with that by moving the TIF_MEMDIE test up before
> the p->mm test, but adding in a check on p->exit_state:
> if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> !p->exit_state)
> return ERR_PTR(-1UL);
> But this is then liable to hang the system if there's some reason
> why the selected process cannot proceed to free its memory (e.g.
> the current KSM unmerge case). It needs to wait "a while", but
> give up if no progress is made, instead of hanging: originally
> I thought that setting PF_MEMALLOC more widely in page_alloc.c,
> and giving up on the TIF_MEMDIE if it was waiting in PF_MEMALLOC,
> would deal with that; but we cannot be sure that waiting of memory
> is the only reason for a holdup there (in the KSM unmerge case it's
> waiting for an mmap_sem, and there may well be other such cases).
>
ok, then, easy handling can't be a help.

> 2. I started out running my mlock test program as root (later
> switched to use "ulimit -l unlimited" first). But badness() reckons
> CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
> and CAP_SYS_RAWIO another reason to quarter your points: so running
> as root makes you sixteen times less likely to be killed. Quartering
> is anyway debatable, but sixteenthing seems utterly excessive to me.
>
I can't agree that part of heuristics, either.

> I moved the CAP_SYS_RAWIO test in with the others, so it does no
> more than quartering; but is quartering appropriate anyway? I did
> wonder if I was right to be "subverting" the fine-grained CAPs in
> this way, but have since seen unrelated mail from one who knows
> better, implying they're something of a fantasy, that su and sudo
> are indeed what's used in the real world. Maybe this patch was okay.
>
ok.



> 3. badness() has a comment above it which says:
> * 5) we try to kill the process the user expects us to kill, this
> * algorithm has been meticulously tuned to meet the principle
> * of least surprise ... (be careful when you change it)
> But Andrea's 2.6.11 86a4c6d9e2e43796bb362debd3f73c0e3b198efa (later
> refined by Kurt's 2.6.16 9827b781f20828e5ceb911b879f268f78fe90815)
> adds plenty of surprise there, by trying to factor children into the
> calculation. Intended to deal with forkbombs, but any reasonable
> process whose purpose is to fork children (e.g. gnome-session)
> becomes very vulnerable. And whereas badness() itself goes on to
> refine the total_vm points by various adjustments peculiar to the
> process in question, those refinements have been ignored when
> adding the child's total_vm/2. (Andrea does remark that he'd
> rather have rewritten badness() from scratch.)
>
> I tried to fix this by moving the PF_OOM_ORIGIN (was PF_SWAPOFF)
> part of the calculation up to select_bad_process(), making a
> solo_badness() function which makes all those adjustments to
> total_vm, then badness() itself a simple function adding half
> the children's solo_badness()es to the process' own solo_badness().
> But probably lots more needs doing - Andrea's rewrite?
>
> 4. In some cases those children are sharing exactly the same mm,
> yet its total_vm is being added again and again to the points:
> I had a nasty inner loop searching back to see if we'd already
> counted this mm (but then, what if the different tasks sharing
> the mm deserved different adjustments to the total_vm?).
>
>
> I hope these notes help someone towards a better solution
> (and be prepared to discover more on the way). I agree with
> Vedran that the present behaviour is pretty unimpressive, and
> I'm puzzled as to how people can have been tinkering with
> oom_kill.c down the years without seeing any of this.
>

Sorry, I usually don't use X on servers and almost all recent my OOM test
was done under memcg ;(
Thank you for your investigation. Maybe I'll need several steps.

Thanks,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: KOSAKI Motohiro on
> 2. I started out running my mlock test program as root (later
> switched to use "ulimit -l unlimited" first). But badness() reckons
> CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
> and CAP_SYS_RAWIO another reason to quarter your points: so running
> as root makes you sixteen times less likely to be killed. Quartering
> is anyway debatable, but sixteenthing seems utterly excessive to me.
>
> I moved the CAP_SYS_RAWIO test in with the others, so it does no
> more than quartering; but is quartering appropriate anyway? I did
> wonder if I was right to be "subverting" the fine-grained CAPs in
> this way, but have since seen unrelated mail from one who knows
> better, implying they're something of a fantasy, that su and sudo
> are indeed what's used in the real world. Maybe this patch was okay.

I agree quartering is debatable.
At least, killing quartering is worth for any user, and it can be push into -stable.




From 27331555366c908a93c2cdd780b77e421869c5af Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
Date: Wed, 28 Oct 2009 11:28:39 +0900
Subject: [PATCH] oom: Mitigate suer-user's bonus of oom-score

Currently, badness calculation code of oom contemplate following bonus.
- Super-user have quartering oom-score
- CAP_SYS_RAWIO process (e.g. database) also have quartering oom-score

The problem is, Super-users have CAP_SYS_RAWIO too. Then, they have
sixteenthing bonus. it's obviously too excessive and meaningless.

This patch fixes it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
---
mm/oom_kill.c | 13 +++++--------
1 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ea2147d..40d323d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -152,18 +152,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
/*
* Superuser processes are usually more important, so we make it
* less likely that we kill those.
- */
- if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
- has_capability_noaudit(p, CAP_SYS_RESOURCE))
- points /= 4;
-
- /*
- * We don't want to kill a process with direct hardware access.
+ *
+ * Plus, We don't want to kill a process with direct hardware access.
* Not only could that mess up the hardware, but usually users
* tend to only have this flag set on applications they think
* of as important.
*/
- if (has_capability_noaudit(p, CAP_SYS_RAWIO))
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+ has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
+ has_capability_noaudit(p, CAP_SYS_RAWIO))
points /= 4;

/*
--
1.6.2.5




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/