From: Balbir Singh on
* KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com> [2010-03-31 15:13:56]:

> On Tue, 30 Mar 2010 23:07:08 -0700 (PDT)
> David Rientjes <rientjes(a)google.com> wrote:
>
> > On Wed, 31 Mar 2010, KAMEZAWA Hiroyuki wrote:
> >
> > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > > index 0cb1ca4..9e89a29 100644
> > > > > --- a/mm/oom_kill.c
> > > > > +++ b/mm/oom_kill.c
> > > > > @@ -510,8 +510,10 @@ retry:
> > > > > if (PTR_ERR(p) == -1UL)
> > > > > goto out;
> > > > >
> > > > > - if (!p)
> > > > > - p = current;
> > > > > + if (!p) {
> > > > > + read_unlock(&tasklist_lock);
> > > > > + panic("Out of memory and no killable processes...\n");
> > > > > + }
> > > > >
> > > > > if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> > > > > "Memory cgroup out of memory"))
> > > > >
> > > >
> > > > This actually does appear to be necessary but for a different reason: if
> > > > current is unkillable because it has OOM_DISABLE, for example, then
> > > > oom_kill_process() will repeatedly fail and mem_cgroup_out_of_memory()
> > > > will infinitely loop.
> > > >
> > > > Kame-san?
> > > >
> > >
> > > When a memcg goes into OOM and it only has unkillable processes (OOM_DISABLE),
> > > we can do nothing. (We can't panic, because a container's death != system death.)
> > >
> > > Because memcg itself has a mutex+waitqueue for mutual exclusion of the OOM
> > > killer, I think the infinite loop will not be a critical problem for the
> > > whole system.
> > >
> > > And, now, memcg has oom-kill-disable + oom-kill-notifier features.
> > > So, if a memcg goes into OOM and there is no killable process, but oom-kill is
> > > not disabled by the memcg.....it means the system admin's misconfiguration.
> > >
> > > He can stop the infinite loop by hand, anyway.
> > > # echo 1 > ..../group_A/memory.oom_control
> > >
> >
> > Then we should be able to do this, since current is by definition
> > unkillable given that it was not found in select_bad_process(), right?
>
> To me, this patch is acceptable and seems reasonable.
>
> But I hadn't joined memcg development when this check was added,
> and I don't know why we kill current..
>

The reason for falling back to current was that we did not want to loop
forever, since looping stops forward progress: no error, no forward
progress. It made sense to oom kill the current process, so that the
cgroup admin could look at what went wrong.

> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c7ba5c9e8176704bfac0729875fa62798037584d
>
> Adding Balbir to CC. Maybe the situation has changed now.
> Because we can stop the infinite loop (by hand) and there are no rushing
> oom-kill callers, this change is acceptable.
>

"By hand" is not always possible if we have a large number of cgroups
(I've seen a setup with 2000 cgroups reported on the libcgroup ML).
2000 cgroups * the number of processes makes the situation complex. I
think using the OOM notifier is now another way of handling such a
situation.

--
Three Cheers,
Balbir
From: David Rientjes on
On Wed, 31 Mar 2010, KAMEZAWA Hiroyuki wrote:

> "By hand" includes "automatically with daemon program", of course.
>
> Hmm, in short, your opinion is "killing current is good for now" ?
>
> I have no strong opinion, here. (Because I'll recommend all customers to
> disable oom kill if they don't want any task to be killed automatically.)
>

I think there are a couple of options: either define threshold notifiers
with memory.usage_in_bytes so userspace can proactively address low
memory situations prior to oom, or use the oom notifier after setting
echo 1 > /dev/cgroup/blah/memory.oom_control to address those issues
in userspace as they happen. If userspace wants to defer back to the
kernel oom killer because it can't raise max_usage_in_bytes, then
echo 0 > /dev/cgroup/blah/memory.oom_control should take care of it
instantly. I'd rather see a misconfigured memcg with tasks that are
OOM_DISABLE but not memcg->oom_kill_disable be starved of memory than
panic the entire system.

Those are good options for users having to deal with low memory
situations; thanks for continuing to work on it!
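
For reference, the userspace side of such an oom notifier would look
roughly like this (a minimal sketch, assuming the eventfd-based
cgroup.event_control registration from the memcg oom notifier patches;
the /dev/cgroup/blah path matches the examples above and error handling
is mostly omitted):

#include <sys/eventfd.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[32];
	uint64_t count;
	int efd = eventfd(0, 0);
	int ofd = open("/dev/cgroup/blah/memory.oom_control", O_RDONLY);
	int cfd = open("/dev/cgroup/blah/cgroup.event_control", O_WRONLY);

	if (efd < 0 || ofd < 0 || cfd < 0)
		return 1;

	/* register: "<eventfd> <fd of memory.oom_control>" */
	snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
	if (write(cfd, buf, strlen(buf)) < 0)
		return 1;

	/* blocks until the memcg is about to oom */
	if (read(efd, &count, sizeof(count)) == sizeof(count)) {
		/*
		 * Handle it in userspace: raise the limit, kill a task,
		 * or write 0 to memory.oom_control to defer back to the
		 * kernel oom killer.
		 */
	}
	return 0;
}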
From: Mel Gorman on
On Sun, Mar 28, 2010 at 02:21:01PM -0700, David Rientjes wrote:
> On Sun, 28 Mar 2010, Oleg Nesterov wrote:
>
> > I see. But I still can't understand. To me, the problem is not that
> > B can't exit; the problem is that A doesn't know it should exit. All
> > threads should exit and free ->mm. Even if B could exit, this is not
> > enough. And, to some extent, it doesn't matter if it holds mmap_sem
> > or not.
> >
> > Don't get me wrong. Even if I don't understand oom_kill.c, the patch
> > looks obviously good to me, even from a "common sense" POV. I am just
> > curious.
> >
> > So, my understanding is: we are going to kill the whole thread group,
> > but TIF_MEMDIE is per-thread. Mark the whole thread group with
> > TIF_MEMDIE so that any thread can notice this flag and (in, say,
> > __alloc_pages_slowpath) fail asap.
> >
> > Is my understanding correct?
> >
>
> [Adding Mel Gorman <mel(a)csn.ul.ie> to the cc]
>

Sorry for the delay.

> The problem with this approach is that we could easily deplete all memory
> reserves if the oom killed task has an extremely large number of threads;
> there has always been only a single thread with TIF_MEMDIE set per cpuset
> or memcg, and for systems that don't run with cpusets or the memory
> controller, this has been limited to one thread with TIF_MEMDIE for the
> entire system.
>
> There's risk involved in suddenly allowing 1000 threads to have
> TIF_MEMDIE set, and the chances of fully depleting all allowed zones are
> much higher if they allocate memory prior to exit, for example.
>
> An alternative is to fail allocations if they are failable and the
> allocating task has a pending SIGKILL. It's better to preempt the oom
> killer, since current is going to be exiting anyway, and this avoids a
> needless kill.
>
> That's possible if it's guaranteed that __GFP_NOFAIL allocations with a
> pending SIGKILL are granted ALLOC_NO_WATERMARKS to prevent them from
> endlessly looping while making no progress.
>
> Comments?
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1610,13 +1610,21 @@ try_next_zone:
> }
>
> static inline int
> -should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> +should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
> unsigned long pages_reclaimed)
> {
> /* Do not loop if specifically requested */
> if (gfp_mask & __GFP_NORETRY)
> return 0;
>
> + /* Loop if specifically requested */
> + if (gfp_mask & __GFP_NOFAIL)
> + return 1;
> +

Meh, you could have preserved the comment but no biggie.

> + /* Task is killed, fail the allocation if possible */
> + if (fatal_signal_pending(p))
> + return 0;
> +

Seems reasonable. This will be checked on every major loop in the
allocator slow path.

> /*
> * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> * means __GFP_NOFAIL, but that may not be true in other
> @@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
> return 1;
>
> - /*
> - * Don't let big-order allocations loop unless the caller
> - * explicitly requests that.
> - */
> - if (gfp_mask & __GFP_NOFAIL)
> - return 1;
> -
> return 0;
> }
>
> @@ -1798,6 +1799,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> if (!in_interrupt() &&
> ((p->flags & PF_MEMALLOC) ||
> + (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) ||

This is a lot less clear. GFP_NOFAIL is rare so this is basically saying
that all threads with a fatal signal pending can ignore watermarks. This
is dangerous because if 1000 threads get killed, there is a possibility
of deadlocking the system.

Why not obey the watermarks and just not retry the loop later and fail
the allocation?

> unlikely(test_thread_flag(TIF_MEMDIE))))
> alloc_flags |= ALLOC_NO_WATERMARKS;
> }
> @@ -1812,6 +1814,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> int migratetype)
> {
> const gfp_t wait = gfp_mask & __GFP_WAIT;
> + const gfp_t nofail = gfp_mask & __GFP_NOFAIL;
> struct page *page = NULL;
> int alloc_flags;
> unsigned long pages_reclaimed = 0;
> @@ -1876,7 +1879,7 @@ rebalance:
> goto nopage;
>
> /* Avoid allocations with no watermarks from looping endlessly */
> - if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> + if (test_thread_flag(TIF_MEMDIE) && !nofail)
> goto nopage;
>
> /* Try direct reclaim and then allocating */
> @@ -1888,6 +1891,10 @@ rebalance:
> if (page)
> goto got_pg;
>
> + /* Task is killed, fail the allocation if possible */
> + if (fatal_signal_pending(p) && !nofail)
> + goto nopage;
> +

Again, I would expect this to be caught by should_alloc_retry().

> /*
> * If we failed to make any progress reclaiming, then we are
> * running out of options and have to consider going OOM
> @@ -1909,8 +1916,7 @@ rebalance:
> * made, there are no other options and retrying is
> * unlikely to help.
> */
> - if (order > PAGE_ALLOC_COSTLY_ORDER &&
> - !(gfp_mask & __GFP_NOFAIL))
> + if (order > PAGE_ALLOC_COSTLY_ORDER && !nofail)
> goto nopage;
>
> goto restart;
> @@ -1919,7 +1925,7 @@ rebalance:
>
> /* Check if we should retry the allocation */
> pages_reclaimed += did_some_progress;
> - if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> + if (should_alloc_retry(p, gfp_mask, order, pages_reclaimed)) {
> /* Wait for some write requests to complete then retry */
> congestion_wait(BLK_RW_ASYNC, HZ/50);
> goto rebalance;
>

I'm ok with the should_alloc_retry() change, but am a lot less ok with
ignoring watermarks just because a fatal signal is pending, and I think
the nofail changes to __alloc_pages_slowpath() are unnecessary as
should_alloc_retry() should end up failing the allocations.

--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab
From: David Rientjes on
On Fri, 2 Apr 2010, Mel Gorman wrote:

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1610,13 +1610,21 @@ try_next_zone:
> > }
> >
> > static inline int
> > -should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> > +should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
> > unsigned long pages_reclaimed)
> > {
> > /* Do not loop if specifically requested */
> > if (gfp_mask & __GFP_NORETRY)
> > return 0;
> >
> > + /* Loop if specifically requested */
> > + if (gfp_mask & __GFP_NOFAIL)
> > + return 1;
> > +
>
> Meh, you could have preserved the comment but no biggie.
>

I'll remember to preserve it when it's proposed.

> > + /* Task is killed, fail the allocation if possible */
> > + if (fatal_signal_pending(p))
> > + return 0;
> > +
>
> Seems reasonable. This will be checked on every major loop in the
> allocator slow path.
>
> > /*
> > * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> > * means __GFP_NOFAIL, but that may not be true in other
> > @@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> > if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
> > return 1;
> >
> > - /*
> > - * Don't let big-order allocations loop unless the caller
> > - * explicitly requests that.
> > - */
> > - if (gfp_mask & __GFP_NOFAIL)
> > - return 1;
> > -
> > return 0;
> > }
> >
> > @@ -1798,6 +1799,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> > if (!in_interrupt() &&
> > ((p->flags & PF_MEMALLOC) ||
> > + (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) ||
>
> This is a lot less clear. GFP_NOFAIL is rare so this is basically saying
> that all threads with a fatal signal pending can ignore watermarks. This
> is dangerous because if 1000 threads get killed, there is a possibility
> of deadlocking the system.
>

I don't quite understand the comment; this is only for __GFP_NOFAIL
allocations, which you say are rare, so a large number of threads won't
be doing this simultaneously.

> Why not obey the watermarks and just not retry the loop later and fail
> the allocation?
>

The above check for (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL))
essentially oom kills p without invoking the oom killer before direct
reclaim is invoked. We know it has a pending SIGKILL and wants to exit,
so we allow it to allocate beyond the min watermark to avoid costly
reclaim or needlessly killing another task.

> > unlikely(test_thread_flag(TIF_MEMDIE))))
> > alloc_flags |= ALLOC_NO_WATERMARKS;
> > }
> > @@ -1812,6 +1814,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > int migratetype)
> > {
> > const gfp_t wait = gfp_mask & __GFP_WAIT;
> > + const gfp_t nofail = gfp_mask & __GFP_NOFAIL;
> > struct page *page = NULL;
> > int alloc_flags;
> > unsigned long pages_reclaimed = 0;
> > @@ -1876,7 +1879,7 @@ rebalance:
> > goto nopage;
> >
> > /* Avoid allocations with no watermarks from looping endlessly */
> > - if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> > + if (test_thread_flag(TIF_MEMDIE) && !nofail)
> > goto nopage;
> >
> > /* Try direct reclaim and then allocating */
> > @@ -1888,6 +1891,10 @@ rebalance:
> > if (page)
> > goto got_pg;
> >
> > + /* Task is killed, fail the allocation if possible */
> > + if (fatal_signal_pending(p) && !nofail)
> > + goto nopage;
> > +
>
> Again, I would expect this to be caught by should_alloc_retry().
>

It is, but only after the oom killer is called. We don't want to
needlessly kill another task here when p has already been killed but may
not be PF_EXITING yet.
From: Mel Gorman on
On Sun, Apr 04, 2010 at 04:26:38PM -0700, David Rientjes wrote:
> On Fri, 2 Apr 2010, Mel Gorman wrote:
>
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1610,13 +1610,21 @@ try_next_zone:
> > > }
> > >
> > > static inline int
> > > -should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> > > +should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
> > > unsigned long pages_reclaimed)
> > > {
> > > /* Do not loop if specifically requested */
> > > if (gfp_mask & __GFP_NORETRY)
> > > return 0;
> > >
> > > + /* Loop if specifically requested */
> > > + if (gfp_mask & __GFP_NOFAIL)
> > > + return 1;
> > > +
> >
> > Meh, you could have preserved the comment but no biggie.
> >
>
> I'll remember to preserve it when it's proposed.
>
> > > + /* Task is killed, fail the allocation if possible */
> > > + if (fatal_signal_pending(p))
> > > + return 0;
> > > +
> >
> > Seems reasonable. This will be checked on every major loop in the
> > allocator slow path.
> >
> > > /*
> > > * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> > > * means __GFP_NOFAIL, but that may not be true in other
> > > @@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> > > if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
> > > return 1;
> > >
> > > - /*
> > > - * Don't let big-order allocations loop unless the caller
> > > - * explicitly requests that.
> > > - */
> > > - if (gfp_mask & __GFP_NOFAIL)
> > > - return 1;
> > > -
> > > return 0;
> > > }
> > >
> > > @@ -1798,6 +1799,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > > if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> > > if (!in_interrupt() &&
> > > ((p->flags & PF_MEMALLOC) ||
> > > + (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) ||
> >
> > This is a lot less clear. GFP_NOFAIL is rare so this is basically saying
> > that all threads with a fatal signal pending can ignore watermarks. This
> > is dangerous because if 1000 threads get killed, there is a possibility
> > of deadlocking the system.
> >
>
> I don't quite understand the comment; this is only for __GFP_NOFAIL
> allocations, which you say are rare, so a large number of threads won't
> be doing this simultaneously.
>
> > Why not obey the watermarks and just not retry the loop later and fail
> > the allocation?
> >
>
> The above check for (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL))
> essentially oom kills p without invoking the oom killer before direct
> reclaim is invoked. We know it has a pending SIGKILL and wants to exit,
> so we allow it to allocate beyond the min watermark to avoid costly
> reclaim or needlessly killing another task.
>

Sorry, I typoed.

GFP_NOFAIL is rare, but this is basically saying that all threads with a
fatal signal and using NOFAIL can ignore watermarks.

I don't think any caller in an exit path will be using GFP_NOFAIL, as
its most common user is filesystem-related, but it still feels
unnecessary to check this case on every call to the slow path.

> > > unlikely(test_thread_flag(TIF_MEMDIE))))
> > > alloc_flags |= ALLOC_NO_WATERMARKS;
> > > }
> > > @@ -1812,6 +1814,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > int migratetype)
> > > {
> > > const gfp_t wait = gfp_mask & __GFP_WAIT;
> > > + const gfp_t nofail = gfp_mask & __GFP_NOFAIL;
> > > struct page *page = NULL;
> > > int alloc_flags;
> > > unsigned long pages_reclaimed = 0;
> > > @@ -1876,7 +1879,7 @@ rebalance:
> > > goto nopage;
> > >
> > > /* Avoid allocations with no watermarks from looping endlessly */
> > > - if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> > > + if (test_thread_flag(TIF_MEMDIE) && !nofail)
> > > goto nopage;
> > >
> > > /* Try direct reclaim and then allocating */
> > > @@ -1888,6 +1891,10 @@ rebalance:
> > > if (page)
> > > goto got_pg;
> > >
> > > + /* Task is killed, fail the allocation if possible */
> > > + if (fatal_signal_pending(p) && !nofail)
> > > + goto nopage;
> > > +
> >
> > Again, I would expect this to be caught by should_alloc_retry().
> >
>
> It is, but only after the oom killer is called. We don't want to
> needlessly kill another task here when p has already been killed but may
> not be PF_EXITING yet.
>

Fair point. How about just checking before __alloc_pages_may_oom() is
called, then? That check would then be in a slower path, as sketched
below. I recognise this means that it is also only checked when direct
reclaim is failing, but there is at least one good reason for that.
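
Concretely, something like this (an untested sketch against the same
tree; p, nofail, and did_some_progress are as in the patch above):

	/*
	 * If we failed to make any progress reclaiming, then we are
	 * running out of options and have to consider going OOM
	 */
	if (!did_some_progress) {
		/*
		 * The task was killed while direct reclaim was failing;
		 * fail the allocation here rather than fall through to
		 * the OOM killer and shoot another task.
		 */
		if (fatal_signal_pending(p) && !nofail)
			goto nopage;

		/* ... existing __alloc_pages_may_oom() path ... */
	}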

With this change, processes that have been SIGKILLed may now fail
allocations that they might not have failed before. It would be
difficult to trigger, but here is one possible problem with this change:

1. System is borderline, with some thrashing
2. User starts a program that gobbles up lots of memory on page faults,
   thrashing the system further and annoying the user
3. User sends SIGKILL
4. Process is faulting and returns NULL because a fatal signal is pending
5. Fault path returns VM_FAULT_OOM
6. Arch-specific path (on x86, anyway) calls out_of_memory again because
   VM_FAULT_OOM was returned.

Ho hum, I hadn't thought about this before, but it's also possible that
a process that is faulting and gets oom-killed will trigger a cascading
OOM kill. If the system was heavily thrashing, it might mean a large
number of processes get killed.

--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab