From: Oleg Nesterov on
On 06/02, Oleg Nesterov wrote:
>
> On 06/02, Roland McGrath wrote:
> >
> > > when select_bad_process() finds the task P to kill it can participate
> > > in the core dump (sleep in exit_mm), but we should somehow inform the
> > > thread which actually dumps the core: P->mm->core_state->dumper.
> >
> > Perhaps it should simply do that: if you would choose P to oom-kill, and
> > P->mm->core_state!=NULL, then choose P->mm->core_state->dumper instead.
>
> ... to set TIF_MEMDIE which should be checked in elf_core_dump().
>
> Probably yes.

Well, nothing can protect mm->core_state, the dumper owns it. Of course
we can add the locking, but this is not nice.
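
For illustration, Roland's redirection could be sketched roughly as below.
This is a userspace model under stated assumptions, not kernel code: the
model_* types and model_pick_victim() are hypothetical, and the dumper is
simplified to a plain task pointer. It deliberately ignores the lifetime
problem just mentioned, since nothing in it protects core_state.

```c
#include <stddef.h>

/* Hypothetical stand-ins for task_struct, mm_struct and core_state. */
struct model_task;

struct model_core_state {
	struct model_task *dumper;	/* thread that writes the core file */
};

struct model_mm {
	struct model_core_state *core_state;	/* NULL when no dump is running */
};

struct model_task {
	int pid;
	struct model_mm *mm;
};

/*
 * If the chosen victim's mm is in the middle of a core dump, redirect
 * the kill to the dumping thread; otherwise keep the original choice.
 */
static struct model_task *model_pick_victim(struct model_task *p)
{
	if (p && p->mm && p->mm->core_state && p->mm->core_state->dumper)
		return p->mm->core_state->dumper;
	return p;
}
```

In this model the core_state pointer is read with no synchronization at
all, which is the objection raised above.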

And again, perhaps MMF_OOMKILLED can be useful anyway.

So, I think this would be the quickest and simplest fix for now.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: KOSAKI Motohiro on
> On 06/02, Roland McGrath wrote:
> >
> > > when select_bad_process() finds the task P to kill it can participate
> > > in the core dump (sleep in exit_mm), but we should somehow inform the
> > > thread which actually dumps the core: P->mm->core_state->dumper.
> >
> > Perhaps it should simply do that: if you would choose P to oom-kill, and
> > P->mm->core_state!=NULL, then choose P->mm->core_state->dumper instead.
>
> ... to set TIF_MEMDIE which should be checked in elf_core_dump().
>
> Probably yes.

Yep, probably. But may I add some explanation?

In the multi-threaded OOM case we have two problematic routines, coredump
and vmscan. Roland's idea can only solve the former.

But I would also like vmscan to exit quickly once the OOM kill has been
received. If other threads get stuck in vmscan trying to free additional
pages (this is very typical, because usually every thread makes some syscall
and eventually calls kmalloc etc.), recovering from OOM becomes very slow
even if it doesn't deadlock.

Unfortunately, vmscan needs a lot of refactoring before this idea can be
applied, so I didn't include such fixes.

I mean, I hope to implement a per-process OOM flag even if coredump doesn't
really need it.

So I created the MMF_OOM patch today. It is still just for discussion.

From f099e1ba6e7b5654b35b468c13e1ae9e4f182ea4 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
Date: Fri, 4 Jun 2010 18:56:56 +0900
Subject: [RFC][PATCH v2] oom: make coredump interruptible

If the OOM victim process is in the middle of a core dump, sending SIGKILL
is a no-op. Unfortunately, a core dump needs a relatively large amount of
memory, which means OOM vs. coredump can deadlock.

So the coredump logic should check whether the task has received SIGKILL
from the OOM killer.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
---
fs/binfmt_elf.c | 4 ++++
include/linux/sched.h | 1 +
mm/oom_kill.c | 3 ++-
3 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 535e763..2aca748 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -2038,6 +2038,10 @@ static int elf_core_dump(struct coredump_params *cprm)
 				page_cache_release(page);
 			} else
 				stop = !dump_seek(cprm->file, PAGE_SIZE);
+
+			/* The task needs to exit ASAP once it has been OOM-killed. */
+			if (test_bit(MMF_OOM_KILLED, &current->mm->flags))
+				stop = 1;
 			if (stop)
 				goto end_coredump;
 		}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8485aa2..53b7caa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -436,6 +436,7 @@ extern int get_dumpable(struct mm_struct *mm);
 #endif
 					/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
+#define MMF_OOM_KILLED		17	/* Killed by OOM */

 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2678a04..29850c4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -401,7 +401,6 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
 		K(p->mm->total_vm),
 		K(get_mm_counter(p->mm, MM_ANONPAGES)),
 		K(get_mm_counter(p->mm, MM_FILEPAGES)));
-	task_unlock(p);

 	/*
 	 * We give our sacrificial lamb high priority and access to
@@ -410,6 +409,8 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
 	 */
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);
+	set_bit(MMF_OOM_KILLED, &p->mm->flags);
+	task_unlock(p);

 	force_sig(SIGKILL, p);

--
1.6.5.2



From: Oleg Nesterov on
On 06/04, KOSAKI Motohiro wrote:
>
> > ... to set TIF_MEMDIE which should be checked in elf_core_dump().
> >
> > Probably yes.
>
> Yep, probably. But may I add some explanation?
>
> In the multi-threaded OOM case we have two problematic routines, coredump
> and vmscan. Roland's idea can only solve the former.
>
> But I would also like vmscan to exit quickly once the OOM kill has been
> received.

Yes, agreed. See another email from me, an MMF_ flag looks "obviously
useful" to me.

(I'd suggest adding a note to the changelog explaining
that the new flag makes sense even without the coredump problems.)

> @@ -410,6 +409,8 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
>  	 */
>  	p->rt.time_slice = HZ;
>  	set_tsk_thread_flag(p, TIF_MEMDIE);
> +	set_bit(MMF_OOM_KILLED, &p->mm->flags);
> +	task_unlock(p);

IIUC, we have find_lock_task_mm() above and thus we can trust p->mm?
(I am asking just in case, I lost the plot a bit.)

Ack or Reviewed, whichever you prefer.

Very minor nit.

> @@ -2038,6 +2038,10 @@ static int elf_core_dump(struct coredump_params *cprm)
>  				page_cache_release(page);
>  			} else
>  				stop = !dump_seek(cprm->file, PAGE_SIZE);
> +
> +			/* The task needs to exit ASAP once it has been OOM-killed. */
> +			if (test_bit(MMF_OOM_KILLED, &current->mm->flags))
> +				stop = 1;

Perhaps this check makes more sense at the start of the loop,
and there is no need to set "stop = 1" (this var is not visible
outside of "for (;;) {}" anyway). Cosmetic, up to you.
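
The placement suggested above could look roughly like the following. It
is a userspace model, not the real elf_core_dump() loop: struct
model_dump, model_dump_pages() and the kill_at parameter (which simulates
the OOM killer firing partway through) are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state the dump loop looks at. */
struct model_dump {
	bool oom_killed;	/* models MMF_OOM_KILLED on current->mm */
	size_t pages_written;
};

/*
 * Write up to npages pages, checking the flag at the top of every
 * iteration. No "stop" variable is needed for the OOM case: the loop
 * simply bails out before touching the next page.
 */
static size_t model_dump_pages(struct model_dump *d, size_t npages,
			       size_t kill_at)
{
	for (size_t i = 0; i < npages; i++) {
		if (i == kill_at)
			d->oom_killed = true;	/* simulated OOM kill */

		if (d->oom_killed)
			break;

		d->pages_written++;	/* stands in for dump_write() */
	}
	return d->pages_written;
}
```

Checking first also means a task that was already marked before the loop
started writes nothing at all.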

Oleg.

From: Oleg Nesterov on
On 06/04, Oleg Nesterov wrote:
>
> (I'd suggest adding a note to the changelog explaining
> that the new flag makes sense even without the coredump problems).

And may I ask you to add another note to the changelog?

> > @@ -410,6 +409,8 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
> >  	 */
> >  	p->rt.time_slice = HZ;
> >  	set_tsk_thread_flag(p, TIF_MEMDIE);
> > +	set_bit(MMF_OOM_KILLED, &p->mm->flags);

I think the changelog should explain that, if we race with fork(),
this flag can't leak into the child's new mm. mm_init() filters
the bits outside of MMF_INIT_MASK.

If we race with exec, it can't leak because mm_alloc() does
memset(0).
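
The fork/exec argument can be checked with a small userspace model. The
mask values below are assumptions chosen for illustration (only bit 17
for the new flag comes from the patch), and model_fork()/model_exec() are
hypothetical helpers that mimic just the flag handling described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout: dumpable in bits 0-1, dump filter in bits 2-8. */
#define MODEL_DUMPABLE_MASK	0x003UL
#define MODEL_DUMP_FILTER_MASK	0x1fcUL
#define MODEL_INIT_MASK		(MODEL_DUMPABLE_MASK | MODEL_DUMP_FILTER_MASK)
#define MODEL_OOM_KILLED	17	/* bit number used in the patch */

struct model_mm {
	unsigned long flags;
};

static void model_set_bit(int nr, struct model_mm *mm)
{
	assert(nr >= 0 && nr < 32);
	mm->flags |= 1UL << nr;
}

static bool model_test_bit(int nr, const struct model_mm *mm)
{
	return (mm->flags >> nr) & 1UL;
}

/* fork: the child keeps only the bits covered by the init mask. */
static struct model_mm model_fork(const struct model_mm *parent)
{
	struct model_mm child = { .flags = parent->flags & MODEL_INIT_MASK };
	return child;
}

/* exec: the new mm starts out zeroed, so no flag can carry over. */
static struct model_mm model_exec(void)
{
	struct model_mm fresh = { 0 };
	return fresh;
}
```

Because bit 17 lies outside MODEL_INIT_MASK, a parent marked as
OOM-killed produces a child whose flags contain only the dump settings.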

Oleg.

From: Oleg Nesterov on
On 06/04, Oleg Nesterov wrote:
>
> On 06/04, KOSAKI Motohiro wrote:
> >
> > In the multi-threaded OOM case we have two problematic routines, coredump
> > and vmscan. Roland's idea can only solve the former.
> >
> > But I would also like vmscan to exit quickly once the OOM kill has been
> > received.
>
> Yes, agreed. See another email from me, an MMF_ flag looks "obviously
> useful" to me.

Well. But somehow we forgot about the !coredumping case... Suppose
that select_bad_process() chooses the process P to kill and we have
other processes (not sub-threads) which share the same ->mm.

In that case I am not sure we should blindly set MMF_OOM_KILLED. Suppose
that we kill P and after that the "out-of-memory" condition goes away.
But its ->mm still has MMF_OOM_KILLED set, and it is still in use. Who will
clear this flag, and when?

Perhaps something like below makes sense for now.

Oleg.

--- x/fs/exec.c
+++ x/fs/exec.c
@@ -1594,6 +1594,7 @@ static inline int zap_threads(struct tas
 	spin_lock_irq(&tsk->sighand->siglock);
 	if (!signal_group_exit(tsk->signal)) {
 		mm->core_state = core_state;
+		set_bit(MMF_COREDUMP, &mm->flags);
 		nr = zap_process(tsk, exit_code);
 	}
 	spin_unlock_irq(&tsk->sighand->siglock);
--- x/fs/binfmt_elf.c
+++ x/fs/binfmt_elf.c
@@ -2028,6 +2028,9 @@ static int elf_core_dump(struct coredump
 			struct page *page;
 			int stop;

+			if (!test_bit(MMF_COREDUMP, &current->mm->flags))
+				goto end_coredump;
+
 			page = get_dump_page(addr);
 			if (page) {
 				void *kaddr = kmap(page);
--- x/mm/oom_kill.c
+++ x/mm/oom_kill.c
@@ -414,6 +414,7 @@ static void __oom_kill_task(struct task_
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);

+	clear_bit(MMF_COREDUMP, &p->mm->flags);
 	force_sig(SIGKILL, p);
 }
}
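
A rough userspace model of the inverted flag in this patch, for
illustration only. All model_* names and the oom_at parameter are
hypothetical; the point it tries to show is that the bit is set only
while a dump is in progress, so an mm that is not dumping never ends up
with a stale marker that someone would have to clear later.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_mm {
	bool coredump;	/* models MMF_COREDUMP */
};

/* Models zap_threads(): the dump is starting, so mark the mm. */
static void model_zap_threads(struct model_mm *mm)
{
	mm->coredump = true;
}

/* Models the OOM killer: revoke permission to keep dumping. */
static void model_oom_kill(struct model_mm *mm)
{
	mm->coredump = false;
}

/*
 * Models the elf_core_dump() page loop: continue only while the flag
 * is still set. oom_at simulates the OOM killer firing at that page.
 */
static size_t model_core_dump(struct model_mm *mm, size_t npages,
			      size_t oom_at)
{
	size_t written = 0;

	for (size_t i = 0; i < npages; i++) {
		if (i == oom_at)
			model_oom_kill(mm);

		if (!mm->coredump)
			break;

		written++;
	}
	return written;
}
```

In this model, calling model_oom_kill() on an mm that is not dumping
leaves the flag at false, so there is nothing to clean up if the
out-of-memory condition later goes away.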

