From: Peter Zijlstra on
On Thu, 2010-07-08 at 18:24 +0900, KOSAKI Motohiro wrote:
> > On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
> >
> > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > index f627779..4b3a1c7 100644
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > > bad_area_nosemaphore(regs, error_code, address);
> > > return;
> > > }
> > > - down_read(&mm->mmap_sem);
> > > + if (test_thread_flag(TIF_MEMDIE))
> > > + down_read_unfair(&mm->mmap_sem);
> > > + else
> > > + down_read(&mm->mmap_sem);
> > > } else {
> > > /*
> > > * The above down_read_trylock() might have succeeded in
> >
> > I still think adding that _unfair interface is asking for trouble.
>
> Can you please explain trouble that you worry? Why do we need to keep
> thread fairness when OOM case?

Just the whole concept of the unfair thing offends me ;-) I didn't
really look at the particular application in this case.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: KOSAKI Motohiro on
> On Thu, 2010-07-08 at 18:24 +0900, KOSAKI Motohiro wrote:
> > > On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
> > >
> > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > > index f627779..4b3a1c7 100644
> > > > --- a/arch/x86/mm/fault.c
> > > > +++ b/arch/x86/mm/fault.c
> > > > @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > > > bad_area_nosemaphore(regs, error_code, address);
> > > > return;
> > > > }
> > > > - down_read(&mm->mmap_sem);
> > > > + if (test_thread_flag(TIF_MEMDIE))
> > > > + down_read_unfair(&mm->mmap_sem);
> > > > + else
> > > > + down_read(&mm->mmap_sem);
> > > > } else {
> > > > /*
> > > > * The above down_read_trylock() might have succeeded in
> > >
> > > I still think adding that _unfair interface is asking for trouble.
> >
> > Can you please explain trouble that you worry? Why do we need to keep
> > thread fairness when OOM case?
>
> Just the whole concept of the unfair thing offends me ;-) I didn't
> really look at the particular application in this case.

I see.
Yup, I agree unfair thing concept is a bit ugly. If anyone have
alternative idea, I agree to choose that thing.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra on
On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
> What happens is we end up with a single thread in the oom loop (T1)
> that ends up killing a sibling thread (T2). That sibling thread will
> need to acquire the read side of the mmap_sem in the exit path. It's
> possible however that yet a different thread (T3) is in the middle of
> a virtual address space operation (mmap, munmap) and is enqueue to
> grab the write side of the mmap_sem behind yet another thread (T4)
> that is stuck in the OOM loop (behind T1) with mmap_sem held for read
> (like allocating a page for pagecache as part of a fault.
>
> T1 T2 T3 T4
> . . . .
> oom: . . .
> oomkill . . .
> ^ \ . . .
> /|\ ----> do_exit: . .
> | sleep in . .
> | read(mmap_sem) . .
> | \ . .
> | ----> mmap .
> | sleep in .
> | write(mmap_sem) .
> | \ .
> | ----> fault
> | holding read(mmap_sem)
> | oom
> | |
> | /
> \----------------------------------------------/

So what you do is use recursive locking to side-step a deadlock.
Recursive locking is poor taste and leads to very ill defined locking
rules.

One way to fix this is to have T4 wake from the oom queue and return an
allocation failure instead of insisting on going oom itself when T1
decides to take down the task.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: KOSAKI Motohiro on
> On Thu, 2010-07-08 at 03:39 -0700, Michel Lespinasse wrote:
> >
> >
> > One way to fix this is to have T4 wake from the oom queue and return an
> > allocation failure instead of insisting on going oom itself when T1
> > decides to take down the task.
> >
> > How would you have T4 figure out the deadlock situation ? T1 is taking down T2, not T4...
>
> If T2 and T4 share a mmap_sem they belong to the same process. OOM takes
> down the whole process by sending around signals of sorts (SIGKILL?), so
> if T4 gets a fatal signal while it is waiting to enter the oom thingy,
> have it abort and return an allocation failure.
>
> That alloc failure (along with a pending fatal signal) will very likely
> lead to the release of its mmap_sem (if not, there's more things to
> cure).
>
> At which point the cycle is broken an stuff continues as it was
> intended.

Now, I've reread current code. I think mmotm already have this.


T4 call out_of_memory and get TIF_MEMDIE
=========================================================
void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
int order, nodemask_t *nodemask)
{
(snip)
/*
* If current has a pending SIGKILL, then automatically select it. The
* goal is to allow it to allocate so that it may quickly exit and free
* its memory.
*/
if (fatal_signal_pending(current)) {
set_thread_flag(TIF_MEMDIE);
boost_dying_task_prio(current, NULL);
return;
}
==================================================================


alloc_pages immediately return if the task have TIF_MEMDIE
==================================================================
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, struct zone *preferred_zone,
int migratetype)
{
(snip)
/* Avoid allocations with no watermarks from looping endlessly */
if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
goto nopage;
==========================================================================


Thought?



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra on
On Thu, 2010-07-08 at 03:39 -0700, Michel Lespinasse wrote:
>
>
> One way to fix this is to have T4 wake from the oom queue and return an
> allocation failure instead of insisting on going oom itself when T1
> decides to take down the task.
>
> How would you have T4 figure out the deadlock situation ? T1 is taking down T2, not T4...

If T2 and T4 share a mmap_sem they belong to the same process. OOM takes
down the whole process by sending around signals of sorts (SIGKILL?), so
if T4 gets a fatal signal while it is waiting to enter the oom thingy,
have it abort and return an allocation failure.

That alloc failure (along with a pending fatal signal) will very likely
lead to the release of its mmap_sem (if not, there's more things to
cure).

At which point the cycle is broken an stuff continues as it was
intended.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/