From: anfei on
On Mon, Mar 29, 2010 at 01:46:30PM +0200, Oleg Nesterov wrote:
> On 03/29, anfei wrote:
> >
> > On Sun, Mar 28, 2010 at 06:28:21PM +0200, Oleg Nesterov wrote:
> > > On 03/28, anfei wrote:
> > > >
> > > > Assume thread A and B are in the same group. If A runs into the oom,
> > > > and selects B as the victim, B won't exit because at least in exit_mm(),
> > > > it can not get the mm->mmap_sem semaphore which A has already got.
> > >
> > > I see. But still I can't understand. To me, the problem is not that
> > > B can't exit, the problem is that A doesn't know it should exit. All
> >
> > If B can exit, its memory will be freed,
>
> Which memory? I thought, we are talking about the memory used by ->mm ?
>
There are also some small kernel structs related to the task that can be
freed, but I think you are correct: the memory used by ->mm matters far
more, and it won't be freed even if B exits. So I agree with you on:
"
the problem is not that B can't exit, the problem is that A doesn't know
it should exit. All threads should exit and free ->mm. Even if B could
exit, this is not enough. And, to some extent, it doesn't matter if it
holds mmap_sem or not.
"

Thanks,
Anfei.

> Oleg.
>
From: anfei on
On Sun, Mar 28, 2010 at 02:21:01PM -0700, David Rientjes wrote:
> On Sun, 28 Mar 2010, Oleg Nesterov wrote:
>
> > I see. But still I can't understand. To me, the problem is not that
> > B can't exit, the problem is that A doesn't know it should exit. All
> > threads should exit and free ->mm. Even if B could exit, this is not
> > enough. And, to some extent, it doesn't matter if it holds mmap_sem
> > or not.
> >
> > Don't get me wrong. Even if I don't understand oom_kill.c the patch
> > looks obviously good to me, even from "common sense" pov. I am just
> > curious.
> >
> > So, my understanding is: we are going to kill the whole thread group
> > but TIF_MEMDIE is per-thread. Mark the whole thread group as TIF_MEMDIE
> > so that any thread can notice this flag and (say, __alloc_pages_slowpath)
> > fail asap.
> >
> > Is my understanding correct?
> >
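
For concreteness, the whole-thread-group marking Oleg describes would
amount to something like the following sketch (illustrative only, not code
from this thread; assumes the thread list is protected by tasklist_lock or
RCU):

	static void mark_thread_group_memdie(struct task_struct *p)
	{
		struct task_struct *t = p;

		/*
		 * Set TIF_MEMDIE on every thread in the victim's group so
		 * any of them can notice it in __alloc_pages_slowpath()
		 * and fail its allocation asap.
		 */
		do {
			set_tsk_thread_flag(t, TIF_MEMDIE);
		} while_each_thread(p, t);
	}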
>
> [Adding Mel Gorman <mel(a)csn.ul.ie> to the cc]
>
> The problem with this approach is that we could easily deplete all memory
> reserves if the oom killed task has an extremely large number of threads;
> there has always been only a single thread with TIF_MEMDIE set per cpuset
> or memcg, and for systems that don't run with cpusets or the memory
> controller, this has been limited to one thread with TIF_MEMDIE for the
> entire system.
>
> There's risk involved in suddenly allowing 1000 threads to have
> TIF_MEMDIE set, and the chances of fully depleting all allowed zones are
> much higher if they allocate memory prior to exit, for example.
>
> An alternative is to fail allocations if they are failable and the
> allocating task has a pending SIGKILL. It's better to preempt the oom
> killer since current is going to be exiting anyway and this avoids a
> needless kill.
>
I think this method is okay, but it can easily trigger another oom bug.
See select_bad_process():

	if (!p->mm)
		continue;

!p->mm is not always a disqualifying condition: e.g. "p" has been killed
and is exiting, and tsk->mm is set to NULL before the memory is released.
In a multi-threaded environment this happens much more often.
In __out_of_memory(), it panics if select_bad_process() returns NULL.
The simple way to fix it is to do what mem_cgroup_out_of_memory() does.
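
To make the window concrete, here is a simplified sketch of the ordering in
the exit path (modeled loosely on exit_mm(); abbreviated, not the exact
kernel code):

	static void exit_mm_sketch(struct task_struct *tsk)
	{
		struct mm_struct *mm = tsk->mm;

		task_lock(tsk);
		tsk->mm = NULL;		/* from here on, the oom killer sees !p->mm */
		task_unlock(tsk);

		mmput(mm);		/* the memory itself is only freed here */
	}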

So I think both of these patches are needed.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index afeab2a..9aae208 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -588,12 +588,8 @@ retry:
 	if (PTR_ERR(p) == -1UL)
 		return;
 
-	/* Found nothing?!?! Either we hang forever, or we panic. */
-	if (!p) {
-		read_unlock(&tasklist_lock);
-		dump_header(NULL, gfp_mask, order, NULL);
-		panic("Out of memory and no killable processes...\n");
-	}
+	if (!p)
+		p = current;
 
 	if (oom_kill_process(p, gfp_mask, order, points, NULL,
 			     "Out of memory"))


> That's possible if it's guaranteed that __GFP_NOFAIL allocations with a
> pending SIGKILL are granted ALLOC_NO_WATERMARKS to prevent them from
> endlessly looping while making no progress.
>
> Comments?
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1610,13 +1610,21 @@ try_next_zone:
> }
>
> static inline int
> -should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> +should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
> unsigned long pages_reclaimed)
> {
> /* Do not loop if specifically requested */
> if (gfp_mask & __GFP_NORETRY)
> return 0;
>
> + /* Loop if specifically requested */
> + if (gfp_mask & __GFP_NOFAIL)
> + return 1;
> +
> + /* Task is killed, fail the allocation if possible */
> + if (fatal_signal_pending(p))
> + return 0;
> +
> /*
> * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> * means __GFP_NOFAIL, but that may not be true in other
> @@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
> return 1;
>
> - /*
> - * Don't let big-order allocations loop unless the caller
> - * explicitly requests that.
> - */
> - if (gfp_mask & __GFP_NOFAIL)
> - return 1;
> -
> return 0;
> }
>
> @@ -1798,6 +1799,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> if (!in_interrupt() &&
> ((p->flags & PF_MEMALLOC) ||
> + (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) ||
> unlikely(test_thread_flag(TIF_MEMDIE))))
> alloc_flags |= ALLOC_NO_WATERMARKS;
> }
> @@ -1812,6 +1814,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> int migratetype)
> {
> const gfp_t wait = gfp_mask & __GFP_WAIT;
> + const gfp_t nofail = gfp_mask & __GFP_NOFAIL;
> struct page *page = NULL;
> int alloc_flags;
> unsigned long pages_reclaimed = 0;
> @@ -1876,7 +1879,7 @@ rebalance:
> goto nopage;
>
> /* Avoid allocations with no watermarks from looping endlessly */
> - if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> + if (test_thread_flag(TIF_MEMDIE) && !nofail)
> goto nopage;
>
> /* Try direct reclaim and then allocating */
> @@ -1888,6 +1891,10 @@ rebalance:
> if (page)
> goto got_pg;
>
> + /* Task is killed, fail the allocation if possible */
> + if (fatal_signal_pending(p) && !nofail)
> + goto nopage;
> +
> /*
> * If we failed to make any progress reclaiming, then we are
> * running out of options and have to consider going OOM
> @@ -1909,8 +1916,7 @@ rebalance:
> * made, there are no other options and retrying is
> * unlikely to help.
> */
> - if (order > PAGE_ALLOC_COSTLY_ORDER &&
> - !(gfp_mask & __GFP_NOFAIL))
> + if (order > PAGE_ALLOC_COSTLY_ORDER && !nofail)
> goto nopage;
>
> goto restart;
> @@ -1919,7 +1925,7 @@ rebalance:
>
> /* Check if we should retry the allocation */
> pages_reclaimed += did_some_progress;
> - if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> + if (should_alloc_retry(p, gfp_mask, order, pages_reclaimed)) {
> /* Wait for some write requests to complete then retry */
> congestion_wait(BLK_RW_ASYNC, HZ/50);
> goto rebalance;
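
For the caller-visible effect of the fail-fast check above, a hypothetical
example (not part of the patch): an ordinary failable allocation made by a
task that has already been SIGKILLed now returns NULL quickly instead of
looping, so the normal error path runs and the task can proceed to exit:

	static int fill_buffer_example(size_t size)
	{
		void *buf = kmalloc(size, GFP_KERNEL);

		if (!buf)
			return -ENOMEM;	/* a fatally-signalled task lands here quickly */
		/* ... use buf ... */
		kfree(buf);
		return 0;
	}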
From: David Rientjes on
On Mon, 29 Mar 2010, anfei wrote:

> I think this method is okay, but it can easily trigger another oom bug.
> See select_bad_process():
>
> 	if (!p->mm)
> 		continue;
>
> !p->mm is not always a disqualifying condition: e.g. "p" has been killed
> and is exiting, and tsk->mm is set to NULL before the memory is released.
> In a multi-threaded environment this happens much more often.
> In __out_of_memory(), it panics if select_bad_process() returns NULL.
> The simple way to fix it is to do what mem_cgroup_out_of_memory() does.
>

This is fixed by
oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch in
the -mm tree.

See
http://userweb.kernel.org/~akpm/mmotm/broken-out/oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index afeab2a..9aae208 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -588,12 +588,8 @@ retry:
> if (PTR_ERR(p) == -1UL)
> return;
>
> - /* Found nothing?!?! Either we hang forever, or we panic. */
> - if (!p) {
> - read_unlock(&tasklist_lock);
> - dump_header(NULL, gfp_mask, order, NULL);
> - panic("Out of memory and no killable processes...\n");
> - }
> + if (!p)
> + p = current;
>
> if (oom_kill_process(p, gfp_mask, order, points, NULL,
> "Out of memory"))

The reason p wasn't selected is that it fails to meet the criteria for
candidacy in select_bad_process(), not necessarily because of a race with
the !p->mm check that the -mm patch cited above fixes. It's quite
possible that current has an oom_adj value of OOM_DISABLE, for example,
in which case this would be wrong.
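
For illustration, the candidacy filtering being referred to looks roughly
like this sketch of the select_bad_process() loop (not the exact source;
oom_disabled() is a hypothetical stand-in for the oom_adj == OOM_DISABLE
test):

	static struct task_struct *select_bad_process_sketch(unsigned long *ppoints)
	{
		struct task_struct *p, *chosen = NULL;

		for_each_process(p) {
			if (!p->mm)
				continue;	/* no, or already detached, address space */
			if (oom_disabled(p))
				continue;	/* exempted: falling back to current would be wrong */
			/* ... compute badness points and remember the worst offender ... */
		}
		return chosen;
	}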
From: anfei on
On Mon, Mar 29, 2010 at 01:01:58PM -0700, David Rientjes wrote:
> On Mon, 29 Mar 2010, anfei wrote:
>
> > I think this method is okay, but it can easily trigger another oom bug.
> > See select_bad_process():
> >
> > 	if (!p->mm)
> > 		continue;
> >
> > !p->mm is not always a disqualifying condition: e.g. "p" has been killed
> > and is exiting, and tsk->mm is set to NULL before the memory is released.
> > In a multi-threaded environment this happens much more often.
> > In __out_of_memory(), it panics if select_bad_process() returns NULL.
> > The simple way to fix it is to do what mem_cgroup_out_of_memory() does.
> >
>
> This is fixed by
> oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch in
> the -mm tree.
>
> See
> http://userweb.kernel.org/~akpm/mmotm/broken-out/oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch
>
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index afeab2a..9aae208 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -588,12 +588,8 @@ retry:
> > if (PTR_ERR(p) == -1UL)
> > return;
> >
> > - /* Found nothing?!?! Either we hang forever, or we panic. */
> > - if (!p) {
> > - read_unlock(&tasklist_lock);
> > - dump_header(NULL, gfp_mask, order, NULL);
> > - panic("Out of memory and no killable processes...\n");
> > - }
> > + if (!p)
> > + p = current;
> >
> > if (oom_kill_process(p, gfp_mask, order, points, NULL,
> > "Out of memory"))
>
> The reason p wasn't selected is because it fails to meet the criteria for
> candidacy in select_bad_process(), not necessarily because of a race with
> the !p->mm check that the -mm patch cited above fixes. It's quite
> possible that current has an oom_adj value of OOM_DISABLE, for example,
> where this would be wrong.

I see. And what about changing mem_cgroup_out_of_memory() too?


diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0cb1ca4..9e89a29 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -510,8 +510,10 @@ retry:
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
-	if (!p)
-		p = current;
+	if (!p) {
+		read_unlock(&tasklist_lock);
+		panic("Out of memory and no killable processes...\n");
+	}
 
 	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
 			     "Memory cgroup out of memory"))
From: David Rientjes on
On Tue, 30 Mar 2010, anfei wrote:

> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index afeab2a..9aae208 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -588,12 +588,8 @@ retry:
> > > if (PTR_ERR(p) == -1UL)
> > > return;
> > >
> > > - /* Found nothing?!?! Either we hang forever, or we panic. */
> > > - if (!p) {
> > > - read_unlock(&tasklist_lock);
> > > - dump_header(NULL, gfp_mask, order, NULL);
> > > - panic("Out of memory and no killable processes...\n");
> > > - }
> > > + if (!p)
> > > + p = current;
> > >
> > > if (oom_kill_process(p, gfp_mask, order, points, NULL,
> > > "Out of memory"))
> >
> > The reason p wasn't selected is because it fails to meet the criteria for
> > candidacy in select_bad_process(), not necessarily because of a race with
> > the !p->mm check that the -mm patch cited above fixes. It's quite
> > possible that current has an oom_adj value of OOM_DISABLE, for example,
> > where this would be wrong.
>
> I see. And what about changing mem_cgroup_out_of_memory() too?
>

The memory controller is different: it must kill a task even if another
task is exiting, because the imposed limit has been reached.

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 0cb1ca4..9e89a29 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -510,8 +510,10 @@ retry:
> if (PTR_ERR(p) == -1UL)
> goto out;
>
> - if (!p)
> - p = current;
> + if (!p) {
> + read_unlock(&tasklist_lock);
> + panic("Out of memory and no killable processes...\n");
> + }
>
> if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> "Memory cgroup out of memory"))
>

This actually does appear to be necessary but for a different reason: if
current is unkillable because it has OOM_DISABLE, for example, then
oom_kill_process() will repeatedly fail and mem_cgroup_out_of_memory()
will infinitely loop.
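
A rough sketch of that loop, pieced together from the context lines of the
hunk quoted above (simplified, not the exact source):

	retry:
		p = select_bad_process(&points, mem);
		if (PTR_ERR(p) == -1UL)
			goto out;

		if (!p)
			p = current;	/* an unkillable (e.g. OOM_DISABLE) current ... */

		if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
				     "Memory cgroup out of memory"))
			goto retry;	/* ... makes this fail and retry forever */
	out:
		read_unlock(&tasklist_lock);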

Kame-san?