oom-kill: give the dying task a higher priority [Kernel]


From: KOSAKI Motohiro on 28 May 2010 02:40

> * KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com> [2010-05-28 13:46:53]:
>
> > > * Luis Claudio R. Goncalves <lclaudio(a)uudg.org> [2010-05-28 00:51:47]:
> > >
> > > > @@ -382,6 +382,8 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> > > > */
> > > > static void __oom_kill_task(struct task_struct *p, int verbose)
> > > > {
> > > > + struct sched_param param;
> > > > +
> > > > if (is_global_init(p)) {
> > > > WARN_ON(1);
> > > > printk(KERN_WARNING "tried to kill init!\n");
> > > > @@ -413,8 +415,9 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
> > > > */
> > > > p->rt.time_slice = HZ;
> > > > set_tsk_thread_flag(p, TIF_MEMDIE);
> > > > -
> > > > force_sig(SIGKILL, p);
> > > > + param.sched_priority = MAX_RT_PRIO-1;
> > > > + sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> > > > }
> > > >
> > >
> > > I would like to understand the visible benefits of this patch. Have
> > > you seen an OOM-killed task really get bogged down? Should this task
> > > really be competing with other important tasks for run time?
> >
> > What do you mean by important? Until the OOM victim task exits completely, the system has no memory.
> > None of the important tasks can do anything.
> >
> > In almost all kernel subsystems, an automatic priority boost is a really bad idea because
> > it may break an RT task's deterministic behavior. But OOM is one of the exceptions. The determinism
> > was already broken by memory starvation.
> >
>
> I am still not convinced, especially if we are running under mem
> cgroup. Even setting SCHED_FIFO does not help, you could have other
> things like cpusets that might restrict the CPUs you can run on, or
> any other policy and we could end up contending anyway with other
> SCHED_FIFO tasks.

Ah, right you are. I had missed mem-cgroup.
But I think memcg doesn't need the following two boosts either. Can we get rid of them?

p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);

I mean we need to distinguish global OOM and memcg OOM, perhaps.

> > That's the reason I acked it.
>
> If we could show faster recovery from OOM or anything else, I would be
> more convinced.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: KAMEZAWA Hiroyuki on 28 May 2010 02:40

On Fri, 28 May 2010 11:57:01 +0530
Balbir Singh <balbir(a)linux.vnet.ibm.com> wrote:

> I am still not convinced, especially if we are running under mem
> cgroup. Even setting SCHED_FIFO does not help, you could have other
> things like cpusets that might restrict the CPUs you can run on, or
> any other policy and we could end up contending anyway with other
> SCHED_FIFO tasks.
>
> > That's the reason I acked it.
>
> If we could show faster recovery from OOM or anything else, I would be
> more convinced.
>
Off topic.

1. Run a daemon at the highest RT priority.
2. Disable OOM for a mem cgroup.
3. The daemon registers an oom-event-notifier for the mem cgroup.

When OOM happens:
4. The daemon receives an event, and then
a) enlarges the limit,
or
b) kills a task,
or
c) enlarges the limit temporarily and kills a task, then later reduces the limit again.

This is the fastest and most promising operation for memcg users.

memcg's OOM slowdown happens just because it's limited by a user configuration,
not by the system. That's a point to be considered.
The OOM situation can be fixed _immediately_ by enlarging the limit as an emergency mode.

If you have to wait for the end of a task, there will be a delay; it's unavoidable.
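
Steps 2 to 4 above can be sketched in userspace C, assuming the cgroup v1 memory controller files (memory.oom_control, cgroup.event_control, memory.limit_in_bytes) and an eventfd. The helper names and any cgroup path passed in are hypothetical, error handling is reduced to returning -1, and raising the daemon itself to RT priority (step 1) is left out. Treat it as a sketch under those assumptions, not a reference implementation.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Write string s to the file "<dir>/<name>"; 0 on success, -1 on failure. */
int write_cgroup_file(const char *dir, const char *name, const char *s)
{
	char path[512];

	assert(dir != NULL && name != NULL && s != NULL);
	int n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	int fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	size_t len = strlen(s);
	int ok = write(fd, s, len) == (ssize_t)len;
	close(fd);
	return ok ? 0 : -1;
}

/* Build the registration line "<event_fd> <oom_control_fd>".
 * Returns its length, or -1 if buf is too small. */
int build_event_control_line(char *buf, size_t len, int event_fd, int oom_fd)
{
	int n = snprintf(buf, len, "%d %d", event_fd, oom_fd);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Steps 2 and 3: disable the kernel OOM killer for the group, then register
 * an eventfd for its OOM events. Returns the eventfd, or -1 on failure. */
int register_oom_notifier(const char *cgroup_dir)
{
	char path[512], line[64];
	int n = snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	int oom_fd = open(path, O_RDONLY);
	if (oom_fd < 0)
		return -1;
	int efd = eventfd(0, 0);
	if (efd < 0) {
		close(oom_fd);
		return -1;
	}
	/* Writing "1" to memory.oom_control disables the OOM killer (step 2). */
	int ok = write_cgroup_file(cgroup_dir, "memory.oom_control", "1") == 0
	      && build_event_control_line(line, sizeof(line), efd, oom_fd) > 0
	      && write_cgroup_file(cgroup_dir, "cgroup.event_control", line) == 0;
	close(oom_fd);
	if (!ok) {
		close(efd);
		return -1;
	}
	return efd;
}

/* Step 4: block until the group reports OOM. Returns 0 once an event arrives. */
int wait_for_oom(int efd)
{
	uint64_t count;

	return read(efd, &count, sizeof(count)) == (ssize_t)sizeof(count) ? 0 : -1;
}

/* Option a): raise the limit; new_limit_bytes is chosen by the caller. */
int enlarge_limit(const char *cgroup_dir, unsigned long long new_limit_bytes)
{
	char value[32];

	snprintf(value, sizeof(value), "%llu", new_limit_bytes);
	return write_cgroup_file(cgroup_dir, "memory.limit_in_bytes", value);
}
```

A daemon built on this would call register_oom_notifier() once with the group's directory, loop on wait_for_oom(), and then apply a), b), or c). Option c) would pair enlarge_limit() with a kill(2) of the chosen task and a later write of the original limit.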

Thanks,
-Kame


From: Minchan Kim on 28 May 2010 04:00

On Fri, May 28, 2010 at 2:59 PM, KOSAKI Motohiro
<kosaki.motohiro(a)jp.fujitsu.com> wrote:
>> RT Task
>>
>> void non-RT-function()
>> {
>> system call();
>> buffer = malloc();
>> memset(buffer);
>> }
>> /*
>> * We make sure this function must be executed in some millisecond
>> */
>> void RT-function()
>> {
>> some calculation(); <- This doesn't have no dynamic characteristic
>> }
>> int main()
>> {
>> non-RT-function();
>> /* This function make sure RT-function cannot preempt by others */
>> set_RT_max_high_priority();
>> RT-function A();
>> set_normal_priority();
>> non-RT-function();
>> }
>>
>> We don't want realtime in whole function of the task. What we want is
>> just RT-function A.
>> Of course, current Linux cannot make perfectly sure RT-functionA can
>> not preempt by others.
>> That's because some interrupt or exception happen. But RT-function A
>> doesn't related to any dynamic characteristic. What can justify to
>> preempt RT-function A by other processes?
>
> As far as my observation, RT-function always have some syscall. because pure
> calculation doesn't need deterministic guarantee. But _if_ you are really
> using such priority design. I'm ok maximum NonRT priority instead maximum
> RT priority too.

Hmm. It's just an example, but it may not be a good example.
Let's change it to this.

void rt_function(void)
{
    int result = some_calculation(); /* no dynamic characteristic */
    *mmap_base = result;             /* mmap_base is mapped to a GPIO device */
}

Could we allow preemption of this RT function due to another task's
memory pressure?
Of course, Linux is not a hard-RT OS, I think. So I think it
is a policy problem.
If we think system memory pressure is more important than an RT task and
we _all_ agree on such a policy, we can allow it.

But I hope not.
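
The "raise to maximum RT priority, run the critical function, drop back" pattern from the example can be sketched in userspace C with sched_setscheduler(2). The names run_with_rt_priority and some_calculation are hypothetical stand-ins for the pseudocode, and the boost normally needs CAP_SYS_NICE, so the sketch reports whether it succeeded instead of assuming it did.

```c
#include <assert.h>
#include <sched.h>

/* Stand-in for the pure calculation in the example; the value is arbitrary. */
int some_calculation(void)
{
	return 6 * 7;
}

/* Run fn at the highest SCHED_FIFO priority, then restore the previous
 * policy. *boosted is set to 1 if the boost succeeded and to 0 if the
 * kernel refused it (typically EPERM); fn runs either way. */
int run_with_rt_priority(int (*fn)(void), int *boosted)
{
	struct sched_param old_param, rt_param;
	int old_policy = sched_getscheduler(0);

	assert(fn != NULL && boosted != NULL);
	*boosted = 0;
	if (old_policy >= 0 && sched_getparam(0, &old_param) == 0) {
		rt_param.sched_priority = sched_get_priority_max(SCHED_FIFO);
		if (sched_setscheduler(0, SCHED_FIFO, &rt_param) == 0)
			*boosted = 1;
	}

	int result = fn();	/* the section that should not be preempted */

	if (*boosted)
		sched_setscheduler(0, old_policy, &old_param);
	return result;
}
```

On Linux the highest SCHED_FIFO priority is 99, the same value the patch assigns with MAX_RT_PRIO-1. A task boosted this way and an OOM victim boosted by the patch would therefore sit at the same priority, which is the contention under discussion.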

--
Kind regards,
Minchan Kim

From: Luis Claudio R. Goncalves on 28 May 2010 09:00

On Fri, May 28, 2010 at 02:59:02PM +0900, KOSAKI Motohiro wrote:
| > RT Task
| >
| > void non-RT-function()
| > {
| > system call();
| > buffer = malloc();
| > memset(buffer);
| > }
| > /*
| > * We make sure this function must be executed in some millisecond
| > */
| > void RT-function()
| > {
| > some calculation(); <- This doesn't have no dynamic characteristic
| > }
| > int main()
| > {
| > non-RT-function();
| > /* This function make sure RT-function cannot preempt by others */
| > set_RT_max_high_priority();
| > RT-function A();
| > set_normal_priority();
| > non-RT-function();
| > }
| >
| > We don't want realtime in whole function of the task. What we want is
| > just RT-function A.
| > Of course, current Linux cannot make perfectly sure RT-functionA can
| > not preempt by others.
| > That's because some interrupt or exception happen. But RT-function A
| > doesn't related to any dynamic characteristic. What can justify to
| > preempt RT-function A by other processes?
|
| As far as my observation, RT-function always have some syscall. because pure
| calculation doesn't need deterministic guarantee. But _if_ you are really
| using such priority design. I'm ok maximum NonRT priority instead maximum
| RT priority too.

I confess I failed to distinguish memcg OOM from system OOM and used "in
case of OOM, kill the selected task as fast as you can" as the guideline.
If the exit code path is short, that shouldn't be a problem.

Maybe the right way to go would be giving the dying task the highest
priority inside that memcg to be sure that it will be the next process from
that memcg to be scheduled. Would that be reasonable?

| Luis, NonRT high priority break your use case? and if yes, can you please
| explain the reason?

Most of my tests are in realtime land, usually with preempt_rt kernels.
In this case, an RT priority will usually be necessary. But that is not the
general case, and I agree that a smoother (but not slower) solution could be
devised by just increasing the dying process's priority to be high enough for
the process set (memcg) it belongs to.
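
One way to read "high enough for the process set" is sketched below as a toy model in C. It is not kernel code: the function name, the array of priorities, and the convention that 0 means "not an RT task" are all assumptions made for illustration. It picks a priority one step above the highest RT priority found in the group, capped at the SCHED_FIFO maximum.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_RT_PRIO 99	/* highest SCHED_FIFO priority on Linux */

/* Given the RT priorities of the other tasks in the group (0 = not an RT
 * task), return a priority for the dying task that is strictly higher than
 * all of them, capped at MODEL_MAX_RT_PRIO. */
int victim_priority_for_group(const int *rt_prios, size_t n)
{
	int highest = 0;

	assert(rt_prios != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (rt_prios[i] > highest)
			highest = rt_prios[i];
	return highest < MODEL_MAX_RT_PRIO ? highest + 1 : MODEL_MAX_RT_PRIO;
}
```

When the group already contains a priority-99 task, the cap means the victim can only tie with it, so the dying task may still have to wait for other SCHED_FIFO tasks.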

What is the impact of a memcg OOM on the system as a whole? That should be
factored into the equation too.

Luis
--
[ Luis Claudio R. Goncalves Bass - Gospel - RT ]
[ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9 2696 7203 D980 A448 C8F8 ]


From: Minchan Kim on 28 May 2010 10:10

On Fri, May 28, 2010 at 09:53:05AM -0300, Luis Claudio R. Goncalves wrote:
> On Fri, May 28, 2010 at 02:59:02PM +0900, KOSAKI Motohiro wrote:
> | > RT Task
> | >
> | > void non-RT-function()
> | > {
> | > system call();
> | > buffer = malloc();
> | > memset(buffer);
> | > }
> | > /*
> | > * We make sure this function must be executed in some millisecond
> | > */
> | > void RT-function()
> | > {
> | > some calculation(); <- This doesn't have no dynamic characteristic
> | > }
> | > int main()
> | > {
> | > non-RT-function();
> | > /* This function make sure RT-function cannot preempt by others */
> | > set_RT_max_high_priority();
> | > RT-function A();
> | > set_normal_priority();
> | > non-RT-function();
> | > }
> | >
> | > We don't want realtime in whole function of the task. What we want is
> | > just RT-function A.
> | > Of course, current Linux cannot make perfectly sure RT-functionA can
> | > not preempt by others.
> | > That's because some interrupt or exception happen. But RT-function A
> | > doesn't related to any dynamic characteristic. What can justify to
> | > preempt RT-function A by other processes?
> |
> | As far as my observation, RT-function always have some syscall. because pure
> | calculation doesn't need deterministic guarantee. But _if_ you are really
> | using such priority design. I'm ok maximum NonRT priority instead maximum
> | RT priority too.
>
> I confess I failed to distinguish memcg OOM and system OOM and used "in
> case of OOM kill the selected task the faster you can" as the guideline.
> If the exit code path is short that shouldn't be a problem.
>
> Maybe the right way to go would be giving the dying task the biggest
> priority inside that memcg to be sure that it will be the next process from
> that memcg to be scheduled. Would that be reasonable?

Hmm. I can't understand your point.
What do you mean by failing to distinguish memcg and system OOM?

We already distinguish them with mem_cgroup_out_of_memory
(but we have to enable CONFIG_CGROUP_MEM_RES_CTLR).
So the task selected in select_bad_process is one of the memcg's tasks when
the memcg is under memory pressure.

Isn't that enough?
--
Kind regards,
Minchan Kim
