A possible sys_wait* bug [Kernel]

Prev: invitation / inquiry
Next: [PATCH] x86, Calgary: Increase max PHB number

From: KOSAKI Motohiro on 30 Jun 2010 20:50

Hello, (cc to some core developers)

Are anyone tracking this issue? This seems security issue.

> One of our internal workloads ran into a problem with waitpid. A
> simple repro case is as follows:
>
>
> #include <sys/types.h>
> #include <sys/wait.h>
> #include <sys/time.h>
> #include <signal.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <errno.h>
> #include <assert.h>
> #include <sched.h>
>
> #define NUM_CPUS 4
>
> void *thread_code(void *args)
> {
> int j;
> int pid2;
> for (j = 0; j < 1000; j++) {
> pid2 = fork();
> if (pid2 == 0)
> while(1) { sleep(1000); }
> }
>
> while (1) {
> int status;
> if (waitpid(-1, &status, WNOHANG)) {
> printf("! %d\n", errno);
> }
>
> }
> exit(0);
>
> }
>
> /*
> * non-blocking waitpids in tight loop, with many children to go through,
> * done on multiple thread, so that they can "pass the torch" to eachother
> * and eliminate the window that a writer has to get in.
> *
> * This maximizes the holding of the tasklist_lock in read mode, starving
> * any attempts to take the lock in the write mode.
> */
> int main(int argc, char **argv)
> {
> int i;
> pthread_attr_t attr;
> pthread_t threads[NUM_CPUS];
> for (i = 0; i < NUM_CPUS; i++) {
> assert(!pthread_attr_init(&attr));
> assert(!pthread_create(&threads[i], &attr, thread_code));
> }
> while(1) { sleep(1000);}
> return 0;
> }
>
>
> Basically, it is possibly for readers to continuously hold
> tasklist_lock (theoretically forever, as they pass from one to other),
> preventing the writer from taking that lock. This typically causes a
> lockup on a CPU where a task is attempting to do a fork() or exit(),
> resulting in the NMI watchdog firing.
>
> Yes, WNOHANG is being used. And I agree that this is an inefficient
> use of wait(). However, I think it should be possible to produce the
> same effect without WNOHANG on sufficiently large number of threads:
> by having it so that at least one thread always has the reader lock.
>
> I think the most direct approach to the problem is to have the
> readers-writer locks be writer biased (i.e. as soon as a writer
> contends, we do not permit any new readers). However all suggestions
> are welcome.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo(a)vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Salman Qazi on 1 Jul 2010 01:10

On Wed, Jun 30, 2010 at 5:47 PM, KOSAKI Motohiro
<kosaki.motohiro(a)jp.fujitsu.com> wrote:
> Hello, �(cc to some core developers)
>
> Are anyone tracking this issue? This seems security issue.

Please explain why this is a security issue. This is not readily
apparent to me. As far as Google is concerned it is a low/medium
priority bug, as there are user space workarounds, at least for the
time being.

>
>
>> One of our internal workloads ran into a problem with waitpid. �A
>> simple repro case is as follows:
>>
>>
>> #include <sys/types.h>
>> #include <sys/wait.h>
>> #include <sys/time.h>
>> #include <signal.h>
>> #include <stdlib.h>
>> #include <stdio.h>
>> #include <errno.h>
>> #include <assert.h>
>> #include <sched.h>
>>
>> #define NUM_CPUS 4
>>
>> void *thread_code(void *args)
>> {
>> � � � � int j;
>> � � � � int pid2;
>> � � � � for (j = 0; j < 1000; j++) {
>> � � � � � � � � pid2 = fork();
>> � � � � � � � � if (pid2 == 0)
>> � � � � � � � � � � � � while(1) { sleep(1000); }
>> � � � � }
>>
>> � � � � while (1) {
>> � � � � � � � � int status;
>> � � � � � � � � if (waitpid(-1, &status, WNOHANG)) {
>> � � � � � � � � � � � � printf("! %d\n", errno);
>> � � � � � � � � }
>>
>> � � � � }
>> � � � � exit(0);
>>
>> }
>>
>> /*
>> �* non-blocking waitpids in tight loop, with many children to go through,
>> �* done on multiple thread, so that they can "pass the torch" to eachother
>> �* and eliminate the window that a writer has to get in.
>> �*
>> �* This maximizes the holding of the tasklist_lock in read mode, starving
>> �* any attempts to take the lock in the write mode.
>> �*/
>> int main(int argc, char **argv)
>> {
>> � � � � int i;
>> � � � � pthread_attr_t attr;
>> � � � � pthread_t threads[NUM_CPUS];
>> � � � � for (i = 0; i < NUM_CPUS; i++) {
>> � � � � � � � � assert(!pthread_attr_init(&attr));
>> � � � � � � � � assert(!pthread_create(&threads[i], &attr, thread_code));
>> � � � � }
>> � � � � while(1) { sleep(1000);}
>> � � � � return 0;
>> }
>>
>>
>> Basically, it is possibly for readers to continuously hold
>> tasklist_lock (theoretically forever, as they pass from one to other),
>> preventing the writer from taking that lock. �This typically causes a
>> lockup on a CPU where a task is attempting to do a fork() or exit(),
>> resulting in the NMI watchdog firing.
>>
>> Yes, WNOHANG is being used. �And I agree that this is an inefficient
>> use of wait(). �However, I think it should be possible to produce the
>> same effect without WNOHANG on sufficiently large number of threads:
>> by having it so that at least one thread always has the reader lock.
>>
>> I think the most direct approach to the problem is to have the
>> readers-writer locks be writer biased (i.e. as soon as a writer
>> contends, we do not permit any new readers). �However all suggestions
>> are welcome.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo(a)vger.kernel.org
>> More majordomo info at �http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at �http://www.tux.org/lkml/
>
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Roland McGrath on 1 Jul 2010 01:40

There are several other ways than wait that pathological user processes can
arrange to do tight loops taking tasklist_lock. So if you call that a DoS
risk then it's not solved with anything specific to the wait syscalls.

Using different rwlock behavior (at least for that lock) would be one way
to address it. Of course, changing the contention dynamics might have
other effects that you don't want (i.e. fork/exec/exit/reap drowning out
the reader uses, cascading to some bad pattern). I don't have an opinion
on that, and wish you good luck figuring it out.

In the long run, things have been moving to finer-grained locking and/or
less direct locking. We can imagine tasklist_lock might go this way one
day too, removing the general bottleneck (i.e. slicing it up into different
smaller, bottlenecks). As things stand today, it's a bit hard to see
exactly how we might get there e.g. for wait/exit with the synchronization
requirements for reparenting and all that. But it's clear that eventually
the big system-wide bottlenecks like tasklist_lock all get reduced. I
don't know of anything like this to expect to see any time soon, however.

Thanks,
Roland
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Oleg Nesterov on 1 Jul 2010 10:20

On 07/01, KOSAKI Motohiro wrote:
>
> > Basically, it is possibly for readers to continuously hold
> > tasklist_lock

Yes, this is the known problem.

Perhaps do_wait() is not the worst example. sys_kill(-1),
sys_ioprio_set() scan the global list.

> > I think the most direct approach to the problem is to have the
> > readers-writer locks be writer biased (i.e. as soon as a writer
> > contends, we do not permit any new readers).

I thought about this too, but this is deadlockable. At least,
read_lock(tasklist) should nest, and it should work in irq context.

We need the more fine-grained locking, but it is not clear to me what
should be done in the long term. Afaics, this is very nontrivial.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: KOSAKI Motohiro on 2 Jul 2010 02:20

> On 07/01, KOSAKI Motohiro wrote:
> >
> > > Basically, it is possibly for readers to continuously hold
> > > tasklist_lock
>
> Yes, this is the known problem.
>
> Perhaps do_wait() is not the worst example. sys_kill(-1),
> sys_ioprio_set() scan the global list.

Ah, I see.
Yup, Roland also pointed out this is NOT biggest risk, there are much
other way. My thinking coverage was too narrow. sorry.

> > > I think the most direct approach to the problem is to have the
> > > readers-writer locks be writer biased (i.e. as soon as a writer
> > > contends, we do not permit any new readers).
>
> I thought about this too, but this is deadlockable. At least,
> read_lock(tasklist) should nest, and it should work in irq context.
>
> We need the more fine-grained locking, but it is not clear to me what
> should be done in the long term. Afaics, this is very nontrivial.

Thank you for kindful explanation.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

|
Pages: 1
Prev: invitation / inquiry
Next: [PATCH] x86, Calgary: Increase max PHB number