2.6.35-rc3 deadlocks on semaphore operations [Kernel]

From: Luca Tettamanti on 21 Jun 2010 16:10

On Fri, Jun 18, 2010 at 09:49:44AM -0500, Christoph Lameter wrote:
> Can produce it with make-kpkg building a kernel.
[...]
> linux-2.6$ strace -p21561
> Process 21561 attached - interrupt to quit
> semop(32768, {{0, -1, SEM_UNDO}}, 1
>
> linux-2.6$ strace -p21751
> Process 21751 attached - interrupt to quit
> semop(32768, {{0, -1, SEM_UNDO}}, 1
>
> linux-2.6$ strace -p21792
> Process 21792 attached - interrupt to quit
> semop(32768, {{0, -1, SEM_UNDO}}, 1
>
> linux-2.6$ strace -p21793
> Process 21793 attached - interrupt to quit
> semop(32768, {{0, -1, SEM_UNDO}}, 1

Ah! I was trying to understand what was going on with apache... I see
the same symptoms with apache and prefork module: each child serves one
request and then just hangs until it's recycled.
Strace shows the same semop syscall.

# ipcs

------ Shared Memory Segments --------
key shmid owner perms bytes nattch status

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 65536      www-data   600        1

------ Message Queues --------
key msqid owner perms used-bytes messages

# cat /proc/sysvipc/sem
key  semid  perms  nsems  uid  gid  cuid  cgid  otime       ctime
0    65536  600    1      33   33   0     0     1277149940  1277149903

# /tmp/getall 65536 -v
getall <id> [-v]
found 1 semaphores.
0: 0 (cnt 4 zcnt 0)
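
The source of /tmp/getall is not part of this thread. As a rough sketch of how output like the above could be produced, the following C function queries a semaphore set with semctl(). The name sem_dump is hypothetical, and reading "cnt" as the GETNCNT value (tasks waiting for the semaphore to increase) is an assumption.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* On Linux the caller has to define union semun itself. */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/*
 * Hypothetical helper: print the value of every semaphore in the set
 * and, if verbose is non-zero, the number of waiters reported by
 * GETNCNT and GETZCNT.  Returns the number of semaphores in the set,
 * or -1 on error.
 */
int sem_dump(int semid, int verbose)
{
    struct semid_ds ds;
    union semun arg;
    unsigned short *vals;
    unsigned long i, nsems;

    /* IPC_STAT tells us how many semaphores the set contains. */
    arg.buf = &ds;
    if (semctl(semid, 0, IPC_STAT, arg) == -1)
        return -1;
    nsems = ds.sem_nsems;

    vals = calloc(nsems ? nsems : 1, sizeof(*vals));
    if (!vals)
        return -1;

    /* GETALL copies all current values into the array. */
    arg.array = vals;
    if (semctl(semid, 0, GETALL, arg) == -1) {
        free(vals);
        return -1;
    }

    printf("found %lu semaphores.\n", nsems);
    for (i = 0; i < nsems; i++) {
        if (verbose) {
            int ncnt = semctl(semid, (int)i, GETNCNT);
            int zcnt = semctl(semid, (int)i, GETZCNT);
            printf("%lu: %d (cnt %d zcnt %d)\n",
                   i, (int)vals[i], ncnt, zcnt);
        } else {
            printf("%lu: %d\n", i, (int)vals[i]);
        }
    }
    free(vals);
    return (int)nsems;
}
```

Under that reading, a value of 0 with cnt 4 would mean four tasks are blocked waiting to decrement the semaphore, which matches the blocked semop(..., -1, SEM_UNDO) calls seen in strace.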

Luca
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Manfred Spraul on 23 Jun 2010 12:30

Hi,

I think I found it:
Previously, queue.status was never IN_WAKEUP when the semaphore spinlock
was held.

The last patch changes that:
Now the change from IN_WAKEUP to the final result code happens after the
semaphore spinlock is dropped.
Thus a task can observe IN_WAKEUP even when it acquired the semaphore
spinlock.

As a result, semop() sometimes returned 1 (IN_WAKEUP) for a successful
operation.

Attached is a patch that should fix the bug.

--
Manfred
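
The attached patch is not reproduced in this thread. To illustrate the race described above, here is a small user-space model in C11, not the kernel code: a status word briefly holds the transient IN_WAKEUP marker before the final result is stored. The function names are hypothetical, and the value 1 for IN_WAKEUP is taken from the message above.

```c
#include <stdatomic.h>

#define IN_WAKEUP 1  /* transient marker; value as quoted in the message */

/*
 * Problematic pattern: a single read can catch the transient marker,
 * so the caller would see 1 instead of the real result.
 */
int read_status_naive(atomic_int *status)
{
    return atomic_load(status);
}

/*
 * Safer pattern: keep re-reading until the waker has replaced the
 * marker with the final result (0 or a negative error code).
 */
int read_status_final(atomic_int *status)
{
    int s = atomic_load(status);

    while (s == IN_WAKEUP)
        s = atomic_load(status);
    return s;
}
```

In this model the naive reader can hand IN_WAKEUP back to its caller, which corresponds to semop() returning 1 for an operation that actually succeeded.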

From: Luca Tettamanti on 23 Jun 2010 15:20

On Wed, Jun 23, 2010 at 6:29 PM, Manfred Spraul
<manfred(a)colorfullife.com> wrote:
> Hi,
>
> I think I found it:
> Previously, queue.status was never IN_WAKEUP when the semaphore spinlock was
> held.
>
> The last patch changes that:
> Now the change from IN_WAKEUP to the final result code happens after the
> semaphore spinlock is dropped.
> Thus a task can observe IN_WAKEUP even when it acquired the semaphore
> spinlock.
>
> As a result, semop() sometimes returned 1 (IN_WAKEUP) for a successful
> operation.
>
> Attached is a patch that should fix the bug.

Apache seems fine.

Tested-by: Luca Tettamanti <kronos.it(a)gmail.com>

thanks,
Luca

From: Christoph Lameter on 23 Jun 2010 16:30

On Wed, 23 Jun 2010, Manfred Spraul wrote:

> Attached is a patch that should fix the bug.

I have not seen the bug since I applied the fix.

From: Luca Tettamanti on 24 Jun 2010 15:30

On Wed, Jun 23, 2010 at 9:14 PM, Luca Tettamanti <kronos.it(a)gmail.com> wrote:
> On Wed, Jun 23, 2010 at 6:29 PM, Manfred Spraul
> <manfred(a)colorfullife.com> wrote:
>> Hi,
>>
>> I think I found it:
>> Previously, queue.status was never IN_WAKEUP when the semaphore spinlock was
>> held.
>>
>> The last patch changes that:
>> Now the change from IN_WAKEUP to the final result code happens after the
>> semaphore spinlock is dropped.
>> Thus a task can observe IN_WAKEUP even when it acquired the semaphore
>> spinlock.
>>
>> As a result, semop() sometimes returned 1 (IN_WAKEUP) for a successful
>> operation.
>>
>> Attached is a patch that should fix the bug.
>
> Apache seems fine.

Argh, "seems" was indeed appropriate. Manfred, your patch does
alleviate the problem, but something is still wrong. I noticed (I'm
developing an AJAX-heavy web app) that sometimes an apache worker
hangs; I can reproduce the problem with ab (apache benchmark) and a
high concurrency level (I'm testing with a concurrency of 100 and 10k
requests, and I get only 2-5 dropped requests). This does not happen
with 2.6.34.
Any idea on how I can debug this further?
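
One way to take apache out of the picture would be a small stand-alone stress test. The sketch below is only an assumption about what such a test could look like, not something posted in this thread: it forks several workers that repeatedly take and release one SysV semaphore with SEM_UNDO, the same kind of operation seen in the strace output earlier. The names stress_semop and sem_adjust are hypothetical.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <unistd.h>

/* On Linux the caller has to define union semun itself. */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Apply delta to semaphore 0 of the set, with SEM_UNDO. */
static int sem_adjust(int semid, short delta)
{
    struct sembuf op = { .sem_num = 0, .sem_op = delta,
                         .sem_flg = SEM_UNDO };
    return semop(semid, &op, 1);
}

/*
 * Fork nproc workers; each does iters lock/unlock pairs on one
 * semaphore.  A worker exits non-zero if semop() returns anything
 * but 0, and is killed by SIGALRM if it is still running (for
 * example, hung) after timeout_sec seconds.  Returns the number of
 * workers that finished cleanly, or -1 if the semaphore could not
 * be set up.
 */
int stress_semop(int nproc, int iters, unsigned timeout_sec)
{
    union semun arg;
    int semid, i, started = 0, clean = 0;

    semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid == -1)
        return -1;
    arg.val = 1;  /* start unlocked */
    if (semctl(semid, 0, SETVAL, arg) == -1) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }

    for (i = 0; i < nproc; i++) {
        pid_t pid = fork();

        if (pid == -1)
            break;
        if (pid == 0) {
            int n;

            alarm(timeout_sec);
            for (n = 0; n < iters; n++) {
                if (sem_adjust(semid, -1) != 0)
                    _exit(2);
                if (sem_adjust(semid, 1) != 0)
                    _exit(2);
            }
            _exit(0);
        }
        started++;
    }

    /* Count only workers that exited normally with status 0. */
    for (i = 0; i < started; i++) {
        int st;

        if (wait(&st) > 0 && WIFEXITED(st) && WEXITSTATUS(st) == 0)
            clean++;
    }
    semctl(semid, 0, IPC_RMID);
    return clean;
}
```

A call such as stress_semop(100, 10000, 60) returning fewer than 100 on one kernel but not on another would point at the semaphore code rather than at apache.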

Luca
