From: David Rientjes on
On Mon, 29 Mar 2010, Oleg Nesterov wrote:

> Can't comment, I do not understand these subtleties.
>
> But I'd like to note that fatal_signal_pending() can be true when the
> process wasn't killed, but another thread does exit_group/exec.
>

I'm not sure there's a difference between whether a process was oom killed
and received a SIGKILL that way or whether exit_group(2) was used, so I
don't think we need to test for (p->signal->flags & SIGNAL_GROUP_EXIT)
here.

We do need to guarantee that exiting tasks can always get memory, which is
the responsibility of setting TIF_MEMDIE. The only thing this patch does
is defer calling the oom killer when a task has a pending SIGKILL and then
fail the allocation when it would otherwise repeat. Failing GFP_KERNEL
allocations that are under PAGE_ALLOC_COSTLY_ORDER is typically never done
and carries considerable risk, so it may make more sense to retry the
allocation with TIF_MEMDIE on the second iteration instead: in essence,
automatically selecting current for oom kill, regardless of other oom
killed tasks, if it already has a pending SIGKILL.
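
To make the retry idea concrete, here is a rough user-space sketch of the
behavior described above. It is not kernel code: struct model_task, struct
model_pool and the model_*() helpers are hypothetical names made up for
illustration, with the sigkill_pending and memdie fields standing in for
fatal_signal_pending() and TIF_MEMDIE. It assumes a single watermark that
only a memdie task may dip below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these types are the kernel's. */
struct model_task {
	bool sigkill_pending;	/* stands in for fatal_signal_pending(task) */
	bool memdie;		/* stands in for TIF_MEMDIE */
};

struct model_pool {
	int free_pages;		/* pages currently free */
	int min_pages;		/* reserve that ordinary tasks may not touch */
};

/* Take one page; only a memdie task may dip below the watermark. */
static bool model_try_alloc(struct model_pool *pool,
			    const struct model_task *t)
{
	assert(pool != NULL && t != NULL);

	int floor = t->memdie ? 0 : pool->min_pages;

	if (pool->free_pages <= floor)
		return false;
	pool->free_pages--;
	return true;
}

/*
 * Model of the oom path: if the caller already has a SIGKILL pending,
 * select it rather than another victim.  Returns true if it was selected.
 */
static bool model_out_of_memory(struct model_task *t)
{
	if (t->sigkill_pending) {
		t->memdie = true;
		return true;
	}
	/* A real oom killer would choose a victim here; omitted. */
	return false;
}

/* Slow path: first attempt, then the oom path, then a single retry. */
static bool model_alloc_slowpath(struct model_pool *pool,
				 struct model_task *t)
{
	if (model_try_alloc(pool, t))
		return true;
	model_out_of_memory(t);
	return model_try_alloc(pool, t);
}
```

In this model, a task with a pending SIGKILL that fails its first attempt
at the watermark is marked memdie and succeeds on the retry, while a task
without one still fails.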



oom: give current access to memory reserves if it has been killed

It's possible to livelock the page allocator if a thread holds mm->mmap_sem
and fails to make forward progress because the oom killer has selected another
thread sharing the same ->mm, which cannot exit until the semaphore is dropped.

The oom killer will not kill multiple tasks at the same time; each oom killed
task must exit before another task may be killed. Thus, if one thread is
holding mm->mmap_sem and cannot allocate memory, all threads sharing the same
->mm are blocked from exiting as well. In the oom kill case, that means the
thread holding mm->mmap_sem will never free additional memory since it cannot
get access to memory reserves and the thread that depends on it with access to
memory reserves cannot exit because it cannot acquire the semaphore. Thus,
the page allocator livelocks.

When the oom killer is called and current happens to have a pending SIGKILL,
this patch automatically selects it for kill so that it has access to memory
reserves and the better timeslice. Upon returning to the page allocator, its
allocation will hopefully succeed so it can quickly exit and free its memory.

Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: David Rientjes <rientjes@google.com>
---
mm/oom_kill.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -681,6 +681,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	}

 	/*
+	 * If current has a pending SIGKILL, then automatically select it.  The
+	 * goal is to allow it to allocate so that it may quickly exit and free
+	 * its memory.
+	 */
+	if (fatal_signal_pending(current)) {
+		__oom_kill_task(current);
+		return;
+	}
+
+	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */