From: Oleg Nesterov on 13 Jun 2010 13:20

On 06/13, Oleg Nesterov wrote:
>
> On 06/13, KOSAKI Motohiro wrote:
> >
> > But, again, I have no objection to your patch. because I really hope to
> > fix coredump vs oom issue.
>
> Yes, I think this is important.

Oh. And another problem, vfork() is not interruptible too. This means
that the user can hide the memory hog from oom-killer.

But let's forget about oom.

Roland, any reason it should be uninterruptible? This doesn't look good
in any case. Perhaps the pseudo-patch below makes sense?

Oleg.

--- x/kernel/fork.c
+++ x/kernel/fork.c
@@ -1359,6 +1359,26 @@ struct task_struct * __cpuinit fork_idle
 	return task;
 }
 
+// ---------------------------------------------------
+// THIS SHOULD BE USED BY mm_release/coredump_wait/etc
+// ---------------------------------------------------
+void complete_vfork_done(struct task_struct *tsk)
+{
+	struct completion *vfork = xchg(tsk->vfork_done, NULL);
+	if (vfork)
+		complete(vfork);
+}
+
+static wait_for_vfork_done(struct task_struct *child, struct completion *vfork)
+{
+	if (!wait_for_completion_killable(vfork))
+		return;
+	if (xchg(child->vfork_done, NULL) != NULL)
+		return;
+	// the child has already read ->vfork_done and it should wake us up
+	wait_for_completion(vfork);
+}
+
 /*
  * Ok, this is the main fork-routine.
  *
@@ -1433,6 +1453,7 @@ long do_fork(unsigned long clone_flags,
 		if (clone_flags & CLONE_VFORK) {
 			p->vfork_done = &vfork;
 			init_completion(&vfork);
+			get_task_struct(p);
 		}
 
 		audit_finish_fork(p);
@@ -1462,7 +1483,8 @@ long do_fork(unsigned long clone_flags,
 
 		if (clone_flags & CLONE_VFORK) {
 			freezer_do_not_count();
-			wait_for_completion(&vfork);
+			wait_for_vfork_done(p, &vfork);
+			put_task_struct(p),
 			freezer_count();
 			tracehook_report_vfork_done(p, nr);
 		}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Roland McGrath on 13 Jun 2010 21:00

> Oh. And another problem, vfork() is not interruptible too. This means
> that the user can hide the memory hog from oom-killer.

I'm not sure there is really any danger like that, because of the
oom_kill_process "Try to kill a child first" logic. Eventually the vfork
child will be chosen and killed, and when it finally exits that will
release the vfork wait. So if the vfork parent is really the culprit,
it will then be subject to oom_kill_process sooner or later.

> But let's forget about oom.

Sure, but it reminds me to mention that vfork mm sharing is another
reason that having oom_kill set some persistent state in the mm seems
wrong. If a vfork child is chosen for oom_kill and killed, then it's
possible that will relieve the need (e.g. much memory was held
indirectly via its fd table or whatnot else that is not shared with the
parent via mm). So once the child is dead, there should not be any
lingering bits in the parent's mm.

> Roland, any reason it should be uninterruptible? This doesn't look good
> in any case. Perhaps the pseudo-patch below makes sense?

I've long thought that we should make a vfork parent SIGKILL-able.
(Of course the vfork wait can't be made interruptible by other signals,
since it must never do anything userish like signal handler setup until
the child has died or exec'd.)

I don't know off hand of any problem with your straightforward change.
But I don't have much confidence that there isn't any strange gotcha
waiting there due to some other kind of implicit assumption about vfork
parent blocks that we are overlooking at the moment. So I wouldn't
change this without more thorough auditing and thinking about everything
related to vfork.

Personally, what I've really been interested in is changing the vfork
wait to use some different kind of blocking entirely. My real
motivation for that is to let a vfork wait be morphed into and out of
TASK_TRACED, so a debugger can examine its registers and so forth.
That would entail letting the vfork/clone syscall return fully back to
the asm level so it could stop in a proper state some place like the
syscall-exit or notify-resume points. However, that has other wrinkles
on machines like sparc and ia64, where user_regset access can involve
user memory access. Since we can't allow those while the user memory is
still shared with the child, it might not really be practical at all.

Thanks,
Roland
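
For readers following the pseudo-patch: the handoff on ->vfork_done can be modeled in user space. What follows is only a sketch, not kernel code. It assumes C11 atomic_exchange() as a stand-in for the kernel's xchg(), and it replaces struct completion with a plain counter. The names completion_model, task_model, child_complete_vfork, parent_may_abandon_wait and run_both_orders are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct completion: counts complete() calls. */
struct completion_model {
	int done;
};

/* Hypothetical stand-in for the single task_struct field involved. */
struct task_model {
	_Atomic(struct completion_model *) vfork_done;
};

/*
 * Child side (the role of complete_vfork_done in the pseudo-patch):
 * claim the pointer first, then signal only if the claim succeeded.
 * Returns true if this call signalled the completion.
 */
static bool child_complete_vfork(struct task_model *tsk)
{
	struct completion_model *vfork =
		atomic_exchange(&tsk->vfork_done, (struct completion_model *)NULL);

	if (vfork == NULL)
		return false;	/* parent withdrew it; nothing to touch */
	vfork->done++;		/* models complete(vfork) */
	return true;
}

/*
 * Parent side, reached only after a fatal signal cut the killable wait short.
 * Returns true if the parent reclaimed the pointer and may return at once,
 * false if the child already claimed it, in which case the parent has to
 * fall back to the uninterruptible wait.
 */
static bool parent_may_abandon_wait(struct task_model *child)
{
	return atomic_exchange(&child->vfork_done,
			       (struct completion_model *)NULL) != NULL;
}

/* Walk through both possible orders of the two exchanges. */
static void run_both_orders(void)
{
	struct completion_model a = { 0 }, b = { 0 };
	struct task_model t1, t2;

	/* Order 1: the killed parent wins the race. */
	atomic_init(&t1.vfork_done, &a);
	assert(parent_may_abandon_wait(&t1));
	assert(!child_complete_vfork(&t1));
	assert(a.done == 0);	/* child never touches the completion */

	/* Order 2: the child wins; the parent sees NULL and must wait. */
	atomic_init(&t2.vfork_done, &b);
	assert(child_complete_vfork(&t2));
	assert(!parent_may_abandon_wait(&t2));
	assert(b.done == 1);
}
```

Whichever side wins the exchange owns the completion, so the child can never signal a completion that lives on the stack of a parent that has already returned. Note that the kernel's xchg() takes the address of the location, so a compilable version of the patch would presumably pass &tsk->vfork_done and &child->vfork_done. The get_task_struct()/put_task_struct() pair in the patch appears to be there for a related reason: the killed parent still dereferences the child after the child may have exited.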