From: David Rientjes on
On Thu, 1 Apr 2010, Oleg Nesterov wrote:

> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -459,7 +459,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> > * its children or threads, just set TIF_MEMDIE so it can die quickly
> > */
> > if (p->flags & PF_EXITING) {
> > - __oom_kill_task(p);
> > + set_tsk_thread_flag(p, TIF_MEMDIE);
>
> So, probably this makes sense anyway but not strictly necessary, up to you.
>

It matches the already-existing comment that only says we need to set
TIF_MEMDIE so it can quickly exit rather than call __oom_kill_task(), so
it seems worthwhile.

> > if (fatal_signal_pending(current)) {
> > - __oom_kill_task(current);
> > + set_tsk_thread_flag(current, TIF_MEMDIE);
>
> Yes, I think this fix is needed.
>

Ok, I'll add your acked-by and send this to Andrew with a follow-up that
consolidates __oom_kill_task() into oom_kill_task(), thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: David Rientjes on
On Thu, 1 Apr 2010, Oleg Nesterov wrote:

> > > @@ -159,13 +172,9 @@ unsigned int oom_badness(struct task_str
> > > if (p->flags & PF_OOM_ORIGIN)
> > > return 1000;
> > >
> > > - task_lock(p);
> > > - mm = p->mm;
> > > - if (!mm) {
> > > - task_unlock(p);
> > > + p = find_lock_task_mm(p);
> > > + if (!p)
> > > return 0;
> > > - }
> > > -
> > > /*
> > > * The baseline for the badness score is the proportion of RAM that each
> > > * task's rss and swap space use.
> > > @@ -330,12 +339,6 @@ static struct task_struct *select_bad_pr
> > > *ppoints = 1000;
> > > }
> > >
> > > - /*
> > > - * skip kernel threads and tasks which have already released
> > > - * their mm.
> > > - */
> > > - if (!p->mm)
> > > - continue;
> > > if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> > > continue;
> >
> > You can't do this for the reason I cited in another email, oom_badness()
> > returning 0 does not exclude a task from being chosen by
> > selcet_bad_process(), it will use that task if nothing else has been found
> > yet. We must explicitly filter it from consideration by checking for
> > !p->mm.
>
> Yes, you are right. OK, oom_badness() can never return points < 0,
> we can make it int and oom_badness() can return -1 if !mm. IOW,
>
> - unsigned int points;
> + int points;
> ...
>
> points = oom_badness(...);
> if (points >= 0 && (points > *ppoints || !chosen))
> chosen = p;
>

oom_badness() and its predecessor badness() in mainline never return
negative scores, so I don't see the value in doing this; just filter the
task in select_bad_process() with !p->mm as it has always been done.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: David Rientjes on
On Fri, 2 Apr 2010, Oleg Nesterov wrote:

> > > Yes, you are right. OK, oom_badness() can never return points < 0,
> > > we can make it int and oom_badness() can return -1 if !mm. IOW,
> > >
> > > - unsigned int points;
> > > + int points;
> > > ...
> > >
> > > points = oom_badness(...);
> > > if (points >= 0 && (points > *ppoints || !chosen))
> > > chosen = p;
> > >
> >
> > oom_badness() and its predecessor badness() in mainline never return
> > negative scores, so I don't see the value in doing this; just filter the
> > task in select_bad_process() with !p->mm as it has always been done.
>
> David, you continue to ignore my arguments ;) select_bad_process()
> must not filter out the tasks with ->mm == NULL.
>
> Once again:
>
> void *memory_hog_thread(void *arg)
> {
> for (;;)
> malloc(A_LOT);
> }
>
> int main(void)
> {
> pthread_create(memory_hog_thread, ...);
> syscall(__NR_exit, 0);
> }
>
> Now, even if we fix PF_EXITING check, select_bad_process() will always
> ignore this process. The group leader has ->mm == NULL.
>
> See?
>
> That is why I think we need something like find_lock_task_mm() in the
> pseudo-patch I sent.
>

I'm not ignoring your arguments, I think you're ignoring what I'm
responding to. I prefer to keep oom_badness() to be a positive range as
it always has been (and /proc/pid/oom_score has always used an unsigned
qualifier), so I disagree that we need to change oom_badness() to return
anything other than 0 for such tasks. We need to filter them explicitly
in select_bad_process() instead, so please do this there.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: David Rientjes on
On Fri, 2 Apr 2010, Oleg Nesterov wrote:

> > > David, you continue to ignore my arguments ;) select_bad_process()
> > > must not filter out the tasks with ->mm == NULL.
> > >
> > I'm not ignoring your arguments, I think you're ignoring what I'm
> > responding to.
>
> Ah, sorry, I misunderstood your replies.
>
> > I prefer to keep oom_badness() to be a positive range as
> > it always has been (and /proc/pid/oom_score has always used an unsigned
> > qualifier),
>
> Yes, I thought about /proc/pid/oom_score, but imho this is minor issue.
> We can s/%lu/%ld/ though, or just report 0 if oom_badness() returns -1.
> Or something.
>

Just have it return 0, meaning never kill, and then ensure "chosen" is
never set for an oom_badness() of 0, even if we don't have another task to
kill. That's how Documentation/filesystems/proc.txt describes it anyway.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: David Rientjes on
On Thu, 1 Apr 2010, Oleg Nesterov wrote:

> > > > > Say, oom_forkbomb_penalty() does list_for_each_entry(tsk->children).
> > > > > Again, this is not right even if we forget about !child->mm check.
> > > > > This list_for_each_entry() can only see the processes forked by the
> > > > > main thread.
> > > > >
> > > >
> > > > That's the intention.
> > >
> > > Why? shouldn't oom_badness() return the same result for any thread
> > > in thread group? We should take all childs into account.
> > >
> >
> > oom_forkbomb_penalty() only cares about first-descendant children that
> > do not share the same memory,
>
> I see, but the code doesn't really do this. I mean, it doesn't really
> see the first-descendant children, only those which were forked by the
> main thread.
>
> Look. We have a main thread M and the sub-thread T. T forks a lot of
> processes which use a lot of memory. These processes _are_ the first
> descendant children of the M+T thread group, they should be accounted.
> But M->children list is empty.
>
> oom_forkbomb_penalty() and oom_kill_process() should do
>
> t = tsk;
> do {
> list_for_each_entry(child, &t->children, sibling) {
> ... take child into account ...
> }
> } while_each_thread(tsk, t);
>
>

In this case, it seems more appropriate that we would penalize T and not M
since it's not necessarily responsible for the behavior of the children it
forks. T is the buggy/malicious program, not M.

> See the patch below. Yes, this is minor, but it is always good to avoid
> the unnecessary locks, and thread_group_cputime() is O(N).
>
> Not only for performance reasons. This allows to change the locking in
> thread_group_cputime() if needed without fear to deadlock with task_lock().
>
> Oleg.
>
> --- x/mm/oom_kill.c
> +++ x/mm/oom_kill.c
> @@ -97,13 +97,16 @@ static unsigned long oom_forkbomb_penalt
> return 0;
> list_for_each_entry(child, &tsk->children, sibling) {
> struct task_cputime task_time;
> - unsigned long runtime;
> + unsigned long runtime, this_rss;
>
> task_lock(child);
> if (!child->mm || child->mm == tsk->mm) {
> task_unlock(child);
> continue;
> }
> + this_rss = get_mm_rss(child->mm);
> + task_unlock(child);
> +
> thread_group_cputime(child, &task_time);
> runtime = cputime_to_jiffies(task_time.utime) +
> cputime_to_jiffies(task_time.stime);
> @@ -113,10 +116,9 @@ static unsigned long oom_forkbomb_penalt
> * get to execute at all in such cases anyway.
> */
> if (runtime < HZ) {
> - child_rss += get_mm_rss(child->mm);
> + child_rss += this_rss;
> forkcount++;
> }
> - task_unlock(child);
> }
>
> /*

This patch looks good, will you send it to Andrew with a changelog and
sign-off line? Also feel free to add:

Acked-by: David Rientjes <rientjes(a)google.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/