oom: give current access to memory reserves if it has been killed [Kernel]

Prev: [PATCH] Staging: quatech_usb2: fix coding style issues
Next: proc: don't take ->siglock for /proc/pid/oom_adj

From: David Rientjes on 30 Mar 2010 16:30

On Tue, 30 Mar 2010, Oleg Nesterov wrote:

> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -681,6 +681,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > }
> >
> > /*
> > + * If current has a pending SIGKILL, then automatically select it. The
> > + * goal is to allow it to allocate so that it may quickly exit and free
> > + * its memory.
> > + */
> > + if (fatal_signal_pending(current)) {
> > + __oom_kill_task(current);
>
> I am worried...
>
> Note that __oom_kill_task() does force_sig(SIGKILL) which assumes that
> ->sighand != NULL. This is not true if out_of_memory() is called after
> current has already passed exit_notify().
>

We have an even bigger problem if current is in the oom killer at
exit_notify() since it has already detached its ->mm in exit_mm() :)

> Hmm. looking at oom_kill.c... Afaics there are more problems with mt
> apllications. select_bad_process() does for_each_process() which can
> only see the group leaders. This is fine, but what if ->group_leader
> has already exited? In this case its ->mm == NULL, and we ignore the
> whole thread group.
>
> IOW, unless I missed something, it is very easy to hide the process
> from oom-kill:
>
> int main()
> {
> pthread_create(memory_hog_func);
> syscall(__NR_exit);
> }
>

The check for !p->mm was moved in the -mm tree (and the oom killer was
entirely rewritten in that tree, so I encourage you to work off of it
instead) with
oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch to
even after the check for PF_EXITING. This is set in the exit path before
the ->mm is detached so if the oom killer finds an already exiting task,
it will become a no-op since it should eventually free memory and avoids a
needless oom kill.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: David Rientjes on 31 Mar 2010 17:10

On Wed, 31 Mar 2010, Oleg Nesterov wrote:

> On 03/30, David Rientjes wrote:
> >
> > On Tue, 30 Mar 2010, Oleg Nesterov wrote:
> >
> > > Note that __oom_kill_task() does force_sig(SIGKILL) which assumes that
> > > ->sighand != NULL. This is not true if out_of_memory() is called after
> > > current has already passed exit_notify().
> >
> > We have an even bigger problem if current is in the oom killer at
> > exit_notify() since it has already detached its ->mm in exit_mm() :)
>
> Can't understand... I thought that in theory even kmalloc(1) can trigger
> oom.
>

__oom_kill_task() cannot be called on a task without an ->mm.

> > > IOW, unless I missed something, it is very easy to hide the process
> > > from oom-kill:
> > >
> > > int main()
> > > {
> > > pthread_create(memory_hog_func);
> > > syscall(__NR_exit);
> > > }
> > >
> >
> > The check for !p->mm was moved in the -mm tree (and the oom killer was
> > entirely rewritten in that tree, so I encourage you to work off of it
> > instead
>
> OK, but I guess this !p->mm check is still wrong for the same reason.
> In fact I do not understand why it is needed in select_bad_process()
> right before oom_badness() which checks ->mm too (and this check is
> equally wrong).
>

It prevents kthreads from being killed. We already identify tasks that
are in the exit path with PF_EXITING in select_bad_process() and chosen to
make the oom killer a no-op when it's not current so it can exit and free
its memory. If it is current, then we're ooming in the exit path and we
need to oom kill it so that it gets access to memory reserves so its no
longer blocking.

> > so if the oom killer finds an already exiting task,
> > it will become a no-op since it should eventually free memory and avoids a
> > needless oom kill.
>
> No, afaics, And this reminds that I already complained about this
> PF_EXITING check.
>
> Once again, p is the group leader. It can be dead (no ->mm, PF_EXITING
> is set) but it can have sub-threads. This means, unless I missed something,
> any user can trivially disable select_bad_process() forever.
>

The task is in the process of exiting and will do so if its not current,
otherwise it will get access to memory reserves since we're obviously oom
in the exit path. Thus, we'll be freeing that memory soon or recalling
the oom killer to kill additional tasks once those children have been
reparented (or one of its children was sacrificed).

>
> Well. Looks like, -mm has a lot of changes in oom_kill.c. Perhaps it
> would be better to fix these mt bugs first...
>
> Say, oom_forkbomb_penalty() does list_for_each_entry(tsk->children).
> Again, this is not right even if we forget about !child->mm check.
> This list_for_each_entry() can only see the processes forked by the
> main thread.
>

That's the intention.

> Likewise, oom_kill_process()->list_for_each_entry() is not right too.
>

Why?

> Hmm. Why oom_forkbomb_penalty() does thread_group_cputime() under
> task_lock() ? It seems, ->alloc_lock() is only needed for get_mm_rss().
>

Right, but we need to ensure that the check for !child->mm || child->mm ==
tsk->mm fails before adding in get_mm_rss(child->mm). It can race and
detach its mm prior to the dereference. It would be possible to move the
thread_group_cputime() out of this critical section, but I felt it was
better to do filter all tasks with child->mm == tsk->mm first before
unnecessarily finding the cputime for them.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: David Rientjes on 31 Mar 2010 19:50

On Thu, 1 Apr 2010, Oleg Nesterov wrote:

> > Why? You ignored this part:
> >
> > Say, right after exit_mm() we are doing acct_process(), and f_op->write()
> > needs a page. So, you are saying that in this case __page_cache_alloc()
> > can never trigger out_of_memory() ?
> >
> > why this is not possible?
> >
> > David, I am not arguing, I am asking.
>
> In case I wasn't clear...
>
> Yes, currently __oom_kill_task(p) is not possible if p->mm == NULL.
>
> But your patch adds
>
> if (fatal_signal_pending(current))
> __oom_kill_task(current);
>
> into out_of_memory().
>

Ok, and it's possible during the tasklist scan if current is PF_EXITING
and that gets passed to oom_kill_process(), so we need the following
patch. Can I have your acked-by and then I'll propose it to Andrew with a
follow-up that merges __oom_kill_task() into oom_kill_task() since it only
has one caller now anyway?

[ Both of these situations will be current since the oom killer is a
no-op whenever another task is found to be PF_EXITING and
oom_kill_process() wouldn't get called with any other thread unless
oom_kill_quick is enabled or its VM_FAULT_OOM in which cases we kill
current as well. ]

Thanks Oleg.
---
mm/oom_kill.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -459,7 +459,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* its children or threads, just set TIF_MEMDIE so it can die quickly
*/
if (p->flags & PF_EXITING) {
- __oom_kill_task(p);
+ set_tsk_thread_flag(p, TIF_MEMDIE);
return 0;
}

@@ -686,7 +686,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
* its memory.
*/
if (fatal_signal_pending(current)) {
- __oom_kill_task(current);
+ set_tsk_thread_flag(current, TIF_MEMDIE);
return;
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: David Rientjes on 1 Apr 2010 04:30

On Thu, 1 Apr 2010, Oleg Nesterov wrote:

> Why? You ignored this part:
>
> Say, right after exit_mm() we are doing acct_process(), and f_op->write()
> needs a page. So, you are saying that in this case __page_cache_alloc()
> can never trigger out_of_memory() ?
>
> why this is not possible?
>

It can, but the check for p->mm is sufficient since exit_notify() takes
write_lock_irq(&tasklist_lock) that the oom killer holds for read, so the
rule is that whenever we have a valid p->mm, we have a valid p->sighand
and can do force_sig() while under tasklist_lock. The only time we call
oom_kill_process() without holding a readlock on tasklist_lock is for
current during pagefault ooms and we know it's not exiting because it's in
the oom killer.

> > > OK, but I guess this !p->mm check is still wrong for the same reason.
> > > In fact I do not understand why it is needed in select_bad_process()
> > > right before oom_badness() which checks ->mm too (and this check is
> > > equally wrong).
> >
> > It prevents kthreads from being killed.
>
> No it doesn't, see use_mm(). See also another email I sent.
>

We cannot rely on oom_badness() to filter this task because we still
select it as our chosen task even with a badness score of 0 if !chosen, so
we must filter these threads ahead of time:

if (points > *ppoints || !chosen) {
chosen = p;
*ppoints = points;
}

Filtering on !p->mm prevents us from doing "if (points > *ppoints ||
(!chosen && p->mm))" because it's just cleaner and makes this rule
explicit.

Your point about p->mm being non-NULL for kthreads using use_mm() is
taken, we should probably just change the is_global_init() check in
select_bad_process() to p->flags & PF_KTHREAD and ensure we reject
oom_kill_process() for them.

> > The task is in the process of exiting and will do so if its not current,
> > otherwise it will get access to memory reserves since we're obviously oom
> > in the exit path. Thus, we'll be freeing that memory soon or recalling
> > the oom killer to kill additional tasks once those children have been
> > reparented (or one of its children was sacrificed).
>
> Just can't understand.
>
> OK, a bad user does
>
> int sleep_forever(void *)
> {
> pause();
> }
>
> int main(void)
> {
> pthread_create(sleep_forever);
> syscall(__NR_exit);
> }
>
> Now, every time select_bad_process() is called it will find this process
> and PF_EXITING is true, so it just returns ERR_PTR(-1UL). And note that
> this process is not going to exit.
>

Hmm, so it looks like we need to filter on !p->mm before checking for
PF_EXITING so that tasks that are EXIT_ZOMBIE won't make the oom killer
into a no-op.

> > > Say, oom_forkbomb_penalty() does list_for_each_entry(tsk->children).
> > > Again, this is not right even if we forget about !child->mm check.
> > > This list_for_each_entry() can only see the processes forked by the
> > > main thread.
> > >
> >
> > That's the intention.
>
> Why? shouldn't oom_badness() return the same result for any thread
> in thread group? We should take all childs into account.
>

oom_forkbomb_penalty() only cares about first-descendant children that
do not share the same memory, so we purposely penalize the parent so that
it is more biased to select for oom kill and then it will sacrifice these
threads in oom_kill_process().

> > > Hmm. Why oom_forkbomb_penalty() does thread_group_cputime() under
> > > task_lock() ? It seems, ->alloc_lock() is only needed for get_mm_rss().
> > >
> >
> > Right, but we need to ensure that the check for !child->mm || child->mm ==
> > tsk->mm fails before adding in get_mm_rss(child->mm). It can race and
> > detach its mm prior to the dereference.
>
> Oh, yes sure, I mentioned get_mm_rss() above.
>
> > It would be possible to move the
> > thread_group_cputime() out of this critical section,
>
> Yes, this is what I meant.
>

You could, but then you'd be calling thread_group_cputime() for all
threads even though they may not share the same ->mm as tsk.

> > but I felt it was
> > better to do filter all tasks with child->mm == tsk->mm first before
> > unnecessarily finding the cputime for them.
>
> Yes, but we can check child->mm == tsk->mm, call get_mm_counter() and drop
> task_lock().
>

We need task_lock() to ensure child->mm hasn't detached between the check
for child->mm == tsk->mm and get_mm_rss(child->mm). So I'm not sure what
you're trying to improve with this variation, it's a tradeoff between
calling thread_group_cputime() under task_lock() for a subset of a task's
threads when we already need to hold task_lock() anyway vs. calling it for
all threads unconditionally.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: David Rientjes on 1 Apr 2010 04:40

On Wed, 31 Mar 2010, Oleg Nesterov wrote:

> Probably something like the patch below makes sense. Note that
> "skip kernel threads" logic is wrong too, we should check PF_KTHREAD.
> Probably it is better to check it in select_bad_process() instead,
> near is_global_init().
>

is_global_init() will be true for p->flags & PF_KTHREAD.

> The new helper, find_lock_task_mm(), should be used by
> oom_forkbomb_penalty() too.
>
> dump_tasks() doesn't need it, it does do_each_thread(). Cough,
> __out_of_memory() and out_of_memory() call it without tasklist.
> We are going to panic() anyway, but still.
>

Indeed, good observation.

> Oleg.
>
> --- x/mm/oom_kill.c
> +++ x/mm/oom_kill.c
> @@ -129,6 +129,19 @@ static unsigned long oom_forkbomb_penalt
> (child_rss / sysctl_oom_forkbomb_thres) : 0;
> }
>
> +static find_lock_task_mm(struct task_struct *p)
> +{
> + struct task_struct *t = p;
> + do {
> + task_lock(t);
> + if (likely(t->mm && !(t->flags & PF_KTHREAD)))
> + return t;
> + task_unlock(t);
> + } while_each_thred(p, t);
> +
> + return NULL;
> +}
> +
> /**
> * oom_badness - heuristic function to determine which candidate task to kill
> * @p: task struct of which task we should calculate
> @@ -159,13 +172,9 @@ unsigned int oom_badness(struct task_str
> if (p->flags & PF_OOM_ORIGIN)
> return 1000;
>
> - task_lock(p);
> - mm = p->mm;
> - if (!mm) {
> - task_unlock(p);
> + p = find_lock_task_mm(p);
> + if (!p)
> return 0;
> - }
> -
> /*
> * The baseline for the badness score is the proportion of RAM that each
> * task's rss and swap space use.
> @@ -330,12 +339,6 @@ static struct task_struct *select_bad_pr
> *ppoints = 1000;
> }
>
> - /*
> - * skip kernel threads and tasks which have already released
> - * their mm.
> - */
> - if (!p->mm)
> - continue;
> if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> continue;

You can't do this for the reason I cited in another email, oom_badness()
returning 0 does not exclude a task from being chosen by
selcet_bad_process(), it will use that task if nothing else has been found
yet. We must explicitly filter it from consideration by checking for
!p->mm.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

| Next | Last
Pages: 1 2
Prev: [PATCH] Staging: quatech_usb2: fix coding style issues
Next: proc: don't take ->siglock for /proc/pid/oom_adj