From: Minchan Kim on
On Mon, May 31, 2010 at 3:35 PM, KOSAKI Motohiro
<kosaki.motohiro(a)jp.fujitsu.com> wrote:
> Hi
>
>> Hi, Kosaki.
>>
>> On Sat, May 29, 2010 at 12:59 PM, KOSAKI Motohiro
>> <kosaki.motohiro(a)jp.fujitsu.com> wrote:
>> > Hi
>> >
>> >> oom-killer: give the dying task rt priority (v3)
>> >>
>> >> Give the dying task RT priority so that it can be scheduled quickly and die,
>> >> freeing needed memory.
>> >>
>> >> Signed-off-by: Luis Claudio R. Gonçalves <lgoncalv(a)redhat.com>
>> >
>> > Almostly acceptable to me. but I have two requests,
>> >
>> > - need 1) force_sig() 2)sched_setscheduler() order as Oleg mentioned
>> > - don't boost priority if it's in mem_cgroup_out_of_memory()
>>
>> Why do you want to not boost priority if it's path of memcontrol?
>>
>> If it's path of memcontrol and CONFIG_CGROUP_MEM_RES_CTLR is enabled,
>> mem_cgroup_out_of_memory will select victim task in memcg.
>> So __oom_kill_task's target task would be in memcg, I think.
>
> Yep.
> But priority boost naturally makes CPU starvation for out of the group
> processes.
> It seems to break cgroup's isolation concept.
>
>> As you and memcg guys don't complain this, I would be missing something.
>> Could you explain it? :)
>
> So, My points are,
>
> 1) Usually priority boost is wrong idea. It have various side effect, but
>   system wide OOM is one of exception. In such case, all tasks aren't
>   runnable, then, the downside is acceptable.
> 2) memcg have OOM notification mechanism. If the admin need priority boost,
>   they can do it by their OOM-daemon.

Is it possible kill the hogging task immediately when the daemon send
kill signal?
I mean we can make OOM daemon higher priority than others and it can
send signal to normal process. but when is normal process exited after
receiving kill signal from OOM daemon? Maybe it's when killed task is
executed by scheduler. It's same problem again, I think.

Kame, Do you have an idea?

> Thanks.
>
>
>



--
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: KAMEZAWA Hiroyuki on
On Mon, 31 May 2010 16:05:48 +0900
Minchan Kim <minchan.kim(a)gmail.com> wrote:

> On Mon, May 31, 2010 at 3:35 PM, KOSAKI Motohiro
> <kosaki.motohiro(a)jp.fujitsu.com> wrote:
> > Hi
> >
> >> Hi, Kosaki.
> >>
> >> On Sat, May 29, 2010 at 12:59 PM, KOSAKI Motohiro
> >> <kosaki.motohiro(a)jp.fujitsu.com> wrote:
> >> > Hi
> >> >
> >> >> oom-killer: give the dying task rt priority (v3)
> >> >>
> >> >> Give the dying task RT priority so that it can be scheduled quickly and die,
> >> >> freeing needed memory.
> >> >>
> >> >> Signed-off-by: Luis Claudio R. Gonçalves <lgoncalv(a)redhat.com>
> >> >
> >> > Almostly acceptable to me. but I have two requests,
> >> >
> >> > - need 1) force_sig() 2)sched_setscheduler() order as Oleg mentioned
> >> > - don't boost priority if it's in mem_cgroup_out_of_memory()
> >>
> >> Why do you want to not boost priority if it's path of memcontrol?
> >>
> >> If it's path of memcontrol and CONFIG_CGROUP_MEM_RES_CTLR is enabled,
> >> mem_cgroup_out_of_memory will select victim task in memcg.
> >> So __oom_kill_task's target task would be in memcg, I think.
> >
> > Yep.
> > But priority boost naturally makes CPU starvation for out of the group
> > processes.
> > It seems to break cgroup's isolation concept.
> >
> >> As you and memcg guys don't complain this, I would be missing something.
> >> Could you explain it? :)
> >
> > So, My points are,
> >
> > 1) Usually priority boost is wrong idea. It have various side effect, but
> >   system wide OOM is one of exception. In such case, all tasks aren't
> >   runnable, then, the downside is acceptable.
> > 2) memcg have OOM notification mechanism. If the admin need priority boost,
> >   they can do it by their OOM-daemon.
>
> Is it possible kill the hogging task immediately when the daemon send
> kill signal?
> I mean we can make OOM daemon higher priority than others and it can
> send signal to normal process. but when is normal process exited after
> receiving kill signal from OOM daemon? Maybe it's when killed task is
> executed by scheduler. It's same problem again, I think.
>
> Kame, Do you have an idea?
>
This is just an idea and I have no implementaion, yet.

With memcg, oom situation can be recovered by "enlarging limit temporary".
Then, what the daemon has to do is

1. send signal (kill or other signal to abort for coredump.)
2. move a problematic task to a jail if necessary.
3. enlarge limit for indicating "Go"
4. After stabilization, reduce the limit.

This is the fastest. Admin has to think of extra-room or jails and
the daemon should be enough clever. But in most case, I think this works well.

Thanks,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Minchan Kim on
On Mon, May 31, 2010 at 4:25 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu(a)jp.fujitsu.com> wrote:
> On Mon, 31 May 2010 16:05:48 +0900
> Minchan Kim <minchan.kim(a)gmail.com> wrote:
>
>> On Mon, May 31, 2010 at 3:35 PM, KOSAKI Motohiro
>> <kosaki.motohiro(a)jp.fujitsu.com> wrote:
>> > Hi
>> >
>> >> Hi, Kosaki.
>> >>
>> >> On Sat, May 29, 2010 at 12:59 PM, KOSAKI Motohiro
>> >> <kosaki.motohiro(a)jp.fujitsu.com> wrote:
>> >> > Hi
>> >> >
>> >> >> oom-killer: give the dying task rt priority (v3)
>> >> >>
>> >> >> Give the dying task RT priority so that it can be scheduled quickly and die,
>> >> >> freeing needed memory.
>> >> >>
>> >> >> Signed-off-by: Luis Claudio R. Gonçalves <lgoncalv(a)redhat.com>
>> >> >
>> >> > Almostly acceptable to me. but I have two requests,
>> >> >
>> >> > - need 1) force_sig() 2)sched_setscheduler() order as Oleg mentioned
>> >> > - don't boost priority if it's in mem_cgroup_out_of_memory()
>> >>
>> >> Why do you want to not boost priority if it's path of memcontrol?
>> >>
>> >> If it's path of memcontrol and CONFIG_CGROUP_MEM_RES_CTLR is enabled,
>> >> mem_cgroup_out_of_memory will select victim task in memcg.
>> >> So __oom_kill_task's target task would be in memcg, I think.
>> >
>> > Yep.
>> > But priority boost naturally makes CPU starvation for out of the group
>> > processes.
>> > It seems to break cgroup's isolation concept.
>> >
>> >> As you and memcg guys don't complain this, I would be missing something.
>> >> Could you explain it? :)
>> >
>> > So, My points are,
>> >
>> > 1) Usually priority boost is wrong idea. It have various side effect, but
>> >   system wide OOM is one of exception. In such case, all tasks aren't
>> >   runnable, then, the downside is acceptable.
>> > 2) memcg have OOM notification mechanism. If the admin need priority boost,
>> >   they can do it by their OOM-daemon.
>>
>> Is it possible kill the hogging task immediately when the daemon send
>> kill signal?
>> I mean we can make OOM daemon higher priority than others and it can
>> send signal to normal process. but when is normal process exited after
>> receiving kill signal from OOM daemon? Maybe it's when killed task is
>> executed by scheduler. It's same problem again, I think.
>>
>> Kame, Do you have an idea?
>>
> This is just an idea and I have no implementaion, yet.
>
> With memcg, oom situation can be recovered by "enlarging limit temporary".
> Then, what the daemon has to do is
>
>  1. send signal (kill or other signal to abort for coredump.)
>  2. move a problematic task to a jail if necessary.
>  3. enlarge limit for indicating "Go"
>  4. After stabilization, reduce the limit.
>
> This is the fastest. Admin has to think of extra-room or jails and
> the daemon should be enough clever. But in most case, I think this works well.

I think it is very hard that how much we have to make extra-room since
we can't expect how many tasks are stuck to allocate memory.
But tend to agree that system-wide OOM problem is more important than
memcg's one.
And memcg's guy doesn't seem to have any problem. So I am not against
this patch any more.

Thanks, Kosaki and Kame.

> Thanks,
> -Kame
>
>



--
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Minchan Kim on
On Mon, May 31, 2010 at 3:51 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu(a)jp.fujitsu.com> wrote:
> On Mon, 31 May 2010 15:09:41 +0900
> Minchan Kim <minchan.kim(a)gmail.com> wrote:
>
>> On Mon, May 31, 2010 at 2:54 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu(a)jp.fujitsu.com> wrote:
>> > On Mon, 31 May 2010 14:46:05 +0900
>> > Minchan Kim <minchan.kim(a)gmail.com> wrote:
>> >
>> >> On Mon, May 31, 2010 at 2:04 PM, KAMEZAWA Hiroyuki
>> >> <kamezawa.hiroyu(a)jp.fujitsu.com> wrote:
>> >> > On Mon, 31 May 2010 14:01:03 +0900
>> >> > Minchan Kim <minchan.kim(a)gmail.com> wrote:
>> >> >
>> >> >> Hi, Kame.
>> >> >>
>> >> >> On Mon, May 31, 2010 at 9:21 AM, KAMEZAWA Hiroyuki
>> >> >> <kamezawa.hiroyu(a)jp.fujitsu.com> wrote:
>> >> >> > On Fri, 28 May 2010 13:48:26 -0300
>> >> >> > "Luis Claudio R. Goncalves" <lclaudio(a)uudg.org> wrote:
>> >> >> >>
>> >> >> >> oom-killer: give the dying task rt priority (v3)
>> >> >> >>
>> >> >> >> Give the dying task RT priority so that it can be scheduled quickly and die,
>> >> >> >> freeing needed memory.
>> >> >> >>
>> >> >> >> Signed-off-by: Luis Claudio R. Gonçalves <lgoncalv(a)redhat.com>
>> >> >> >>
>> >> >> >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> >> >> >> index 84bbba2..2b0204f 100644
>> >> >> >> --- a/mm/oom_kill.c
>> >> >> >> +++ b/mm/oom_kill.c
>> >> >> >> @@ -266,6 +266,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints)
>> >> >> >>   */
>> >> >> >>  static void __oom_kill_task(struct task_struct *p, int verbose)
>> >> >> >>  {
>> >> >> >> +     struct sched_param param;
>> >> >> >> +
>> >> >> >>       if (is_global_init(p)) {
>> >> >> >>               WARN_ON(1);
>> >> >> >>               printk(KERN_WARNING "tried to kill init!\n");
>> >> >> >> @@ -288,6 +290,8 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
>> >> >> >>        * exit() and clear out its resources quickly...
>> >> >> >>        */
>> >> >> >>       p->time_slice = HZ;
>> >> >> >> +     param.sched_priority = MAX_RT_PRIO-10;
>> >> >> >> +     sched_setscheduler(p, SCHED_FIFO, &param);
>> >> >> >>       set_tsk_thread_flag(p, TIF_MEMDIE);
>> >> >> >>
>> >> >> >
>> >> >> > BTW, how about the other threads which share mm_struct ?
>> >> >>
>> >> >> Could you elaborate your intention? :)
>> >> >>
>> >> >
>> >> > IIUC, the purpose of rising priority is to accerate dying thread to exit()
>> >> > for freeing memory AFAP. But to free memory, exit, all threads which share
>> >> > mm_struct should exit, too. I'm sorry if I miss something.
>> >>
>> >> How do we kill only some thread and what's the benefit of it?
>> >> I think when if some thread receives  KILL signal, the process include
>> >> the thread will be killed.
>> >>
>> > yes, so, if you want a _process_ die quickly, you have to acceralte the whole
>> > threads on a process. Acceralating a thread in a process is not big help.
>>
>> Yes.
>>
>> I see the code.
>> oom_kill_process is called by
>>
>> 1. mem_cgroup_out_of_memory
>> 2. __out_of_memory
>> 3. out_of_memory
>>
>>
>> (1,2) calls select_bad_process which select victim task in processes
>> by do_each_process.
>> But 3 isn't In case of  CONSTRAINT_MEMORY_POLICY, it kills current.
>> In only the case, couldn't we pass task of process, not one of thread?
>>
>
> Hmm, my point is that priority-acceralation is against a thread, not against a process.
> So, most of threads in memory-eater will not gain high priority even with this patch
> and works slowly.
> I have no objections to this patch. I just want to confirm the purpose. If this patch
> is for accelating exiting process by SIGKILL, it seems not enough.
> If an explanation as "acceralating all thread's priority in a process seems overkill"
> is given in changelog or comment, it's ok to me.

Okay. I got your point.
Kame's concern is proper.

Couldn't we raise priorities of whole threads of the task killed?

>
> Thanks,
> -Kame
>
>



--
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Luis Claudio R. Goncalves on
On Mon, May 31, 2010 at 03:51:02PM +0900, KAMEZAWA Hiroyuki wrote:
| On Mon, 31 May 2010 15:09:41 +0900
| Minchan Kim <minchan.kim(a)gmail.com> wrote:
| > On Mon, May 31, 2010 at 2:54 PM, KAMEZAWA Hiroyuki
| > <kamezawa.hiroyu(a)jp.fujitsu.com> wrote:
....
| > >> > IIUC, the purpose of rising priority is to accerate dying thread to exit()
| > >> > for freeing memory AFAP. But to free memory, exit, all threads which share
| > >> > mm_struct should exit, too. I'm sorry if I miss something.
| > >>
| > >> How do we kill only some thread and what's the benefit of it?
| > >> I think when if some thread receives �KILL signal, the process include
| > >> the thread will be killed.
| > >>
| > > yes, so, if you want a _process_ die quickly, you have to acceralte the whole
| > > threads on a process. Acceralating a thread in a process is not big help.
| >
| > Yes.
| >
| > I see the code.
| > oom_kill_process is called by
| >
| > 1. mem_cgroup_out_of_memory
| > 2. __out_of_memory
| > 3. out_of_memory
| >
| >
| > (1,2) calls select_bad_process which select victim task in processes
| > by do_each_process.
| > But 3 isn't In case of CONSTRAINT_MEMORY_POLICY, it kills current.
| > In only the case, couldn't we pass task of process, not one of thread?
| >
|
| Hmm, my point is that priority-acceralation is against a thread, not against a process.
| So, most of threads in memory-eater will not gain high priority even with this patch
| and works slowly.

This is a good point...

| I have no objections to this patch. I just want to confirm the purpose. If this patch
| is for accelating exiting process by SIGKILL, it seems not enough.

I understand (from the comments in the code) the badness calculation gives more
points to the siblings in a thread that have their own mm. I wonder if what you
are describing is not a corner case.

Again, your idea sounds like an interesting refinement to the patch. I am
just not sure this change should implemented now or in a second round of
changes.

| If an explanation as "acceralating all thread's priority in a process seems overkill"
| is given in changelog or comment, it's ok to me.

If my understanding of badness() is right, I wouldn't be ashamed of saying
that it seems to be _a bit_ overkill. But I may be wrong in my
interpretation.

While re-reading the code I noticed that in select_bad_process() we can
eventually bump on an already dying task, case in which we just wait for
the task to die and avoid killing other tasks. Maybe we could boost the
priority of the dying task here too.

Luis
--
[ Luis Claudio R. Goncalves Bass - Gospel - RT ]
[ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9 2696 7203 D980 A448 C8F8 ]

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/