reiserfs deadlock [Kernel]

Prev: [PATCH] ssb: Fix CONFIG_SSB_SDIOHOST typo
Next: Input: serio - re-add thaw (and add freeze)

From: Alexander Beregalov on 3 Feb 2010 14:10

Hi Frederic

I do not have previous messages and do not know how to reproduce it.
Kernel was 2.6.33-rc5-00237-g9a3cbe3

INFO: task svscan:1395 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
svscan D f73094f0 5248 1395 1 0x00000000
f6565da0 00000046 00000007 f73094f0 c134e5fd f7309770 f73094f0 c104e4b4
f6411c84 f6411c84 00000246 f73094f0 f6565de4 c134e58a 00000000 00000002
00000000 c1110e38 f6411cac c1110e38 f6411cc0 f6335d18 f706bf14 f73094f0
Call Trace:
[<c134e5fd>] ? mutex_lock_nested+0x19d/0x2b0
[<c104e4b4>] ? trace_hardirqs_on_caller+0x124/0x170
[<c134e58a>] mutex_lock_nested+0x12a/0x2b0
[<c1110e38>] ? reiserfs_write_lock+0x28/0x40
[<c1110e38>] ? reiserfs_write_lock+0x28/0x40
[<c1110e38>] reiserfs_write_lock+0x28/0x40
[<c10f059a>] reiserfs_readdir_dentry+0x5a/0x720
[<c1042c75>] ? sched_clock_cpu+0x95/0x110
[<c104ca2b>] ? trace_hardirqs_off+0xb/0x10
[<c109fe10>] ? filldir+0x0/0xd0
[<c1042c75>] ? sched_clock_cpu+0x95/0x110
[<c10502d9>] ? __lock_acquire+0x509/0xa70
[<c104e1f2>] ? mark_held_locks+0x62/0x90
[<c134e361>] ? mutex_lock_killable_nested+0x201/0x300
[<c104e4b4>] ? trace_hardirqs_on_caller+0x124/0x170
[<c134e36b>] ? mutex_lock_killable_nested+0x20b/0x300
[<c10a0022>] ? vfs_readdir+0x52/0xa0
[<c10a0022>] ? vfs_readdir+0x52/0xa0
[<c109fe10>] ? filldir+0x0/0xd0
[<c10f0c77>] reiserfs_readdir+0x17/0x20
[<c10a0055>] vfs_readdir+0x85/0xa0
[<c10a0188>] sys_getdents+0x68/0xc0
[<c1002b10>] sysenter_do_call+0x12/0x36
2 locks held by svscan/1395:
#0: (&sb->s_type->i_mutex_key#5){+.+.+.}, at: [<c10a0022>]
vfs_readdir+0x52/0xa0
#1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110e38>]
reiserfs_write_lock+0x28/0x40
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Frederic Weisbecker on 3 Feb 2010 15:30

On Wed, Feb 03, 2010 at 10:08:57PM +0300, Alexander Beregalov wrote:
> On 3 February 2010 22:03, Alexander Beregalov <a.beregalov(a)gmail.com> wrote:
> > Hi Frederic
> >
> > I do not have previous messages and do not know how to reproduce it.
> > Kernel was 2.6.33-rc5-00237-g9a3cbe3
> >
>
> Hm, I have the same after reboot.
>
> Do you need me to do anything before I try to fsck ?

Yeah. Rebooting again makes your kernel soft lockup?

Usually such softlockup happens because we have a lock
inversion, in which case you should have a lockdep report
before the softlockup.

Otherwise this can also happen when we wait for an event
that needs the lock to complete but
that can not happen because we already have the lock.

Task A hold reiserfs lock and wait for event 1
Task B wants to complete event 1 but it need the reisers lock
for that => deadlock.

This can usually be found in a softlockup report: lots of
tasks are blocked on reiserfs_write_lock/mutex_lock
except one, and this one is important as it is probably
the waiter: the task that holds the lock and that is waiting
for another event (that in turn needs the lock to complete).

Having more reports could probably help us:

echo 100 > /proc/sys/kernel/hung_task_warnings

Hopefully you can still reproduce it :-s

Thanks a lot!

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Alexander Beregalov on 3 Feb 2010 17:50

On 3 February 2010 23:29, Frederic Weisbecker <fweisbec(a)gmail.com> wrote:
> On Wed, Feb 03, 2010 at 10:08:57PM +0300, Alexander Beregalov wrote:
>> On 3 February 2010 22:03, Alexander Beregalov <a.beregalov(a)gmail.com> wrote:
>> > Hi Frederic
>> >
>> > I do not have previous messages and do not know how to reproduce it.
>> > Kernel was 2.6.33-rc5-00237-g9a3cbe3
>> >
>>
>> Hm, I have the same after reboot.
>>
>> Do you need me to do anything before I try to fsck ?
>
>
> Yeah. Rebooting again makes your kernel soft lockup?
Yes, reboot does not help. I even can't login, agetty and sshd are frozen.

INFO: task sshd:1863 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sshd D 6f60ec44 6576 1863 1810 0x00000000
f633dd78 00000046 ffffffff 6f60ec44 0000000f f7306b30 f73068b0 00000000
f7306d84 7fffffff 00000000 f633de70 f633dde8 c134da45 00000000 f633dd8c
c104ca3b 00000000 7fffffff 0000000f 6f618f50 f73068b0 00000000 00000000
Call Trace:
[<c134da45>] schedule_timeout+0x125/0x1b0
[<c104ca3b>] ? trace_hardirqs_off+0xb/0x10
[<c1350152>] ? _raw_spin_unlock_irq+0x22/0x30
[<c104e4c4>] ? trace_hardirqs_on_caller+0x124/0x170
[<c104e51b>] ? trace_hardirqs_on+0xb/0x10
[<c134d7d0>] wait_for_common+0xd0/0x130
[<c1024850>] ? default_wake_function+0x0/0x10
[<c134d8c2>] wait_for_completion+0x12/0x20
[<c1039709>] call_usermodehelper_exec+0x89/0xb0
[<c1039471>] ? call_usermodehelper_setup+0x71/0xb0
[<c134d730>] ? wait_for_common+0x30/0x130
[<c10398e2>] __request_module+0xa2/0xf0
[<c10a6136>] ? new_inode+0x76/0x80
[<c13501cd>] ? _raw_spin_unlock+0x1d/0x20
[<c12cc89f>] __sock_create+0x18f/0x1f0
[<c107b22a>] ? might_fault+0x4a/0xa0
[<c12cc967>] sock_create+0x37/0x40
[<c12ccb1e>] sys_socket+0x3e/0x70
[<c12ccbb0>] sys_socketcall+0x60/0x270
[<c1002b43>] ? sysenter_exit+0xf/0x18
[<c11d5eb4>] ? trace_hardirqs_on_thunk+0xc/0x10
[<c1002b10>] sysenter_do_call+0x12/0x36
no locks held by sshd/1863.

No locks - what does it mean?

>
> Usually such softlockup happens because we have a lock
> inversion, in which case you should have a lockdep report
> before the softlockup.

No, I do not have it. 120 seconds after boot I see these messages on
the console,
no lockdep reports (lockdep is enabled).

>
> Otherwise this can also happen when we wait for an event
> that needs the lock to complete but
> that can not happen because we already have the lock.
>
> Task A hold reiserfs lock and wait for event 1
> Task B wants to complete event 1 but it need the reisers lock
> for that => deadlock.
>
> This can usually be found in a softlockup report: lots of
> tasks are blocked on reiserfs_write_lock/mutex_lock
> except one, and this one is important as it is probably
> the waiter: the task that holds the lock and that is waiting
> for another event (that in turn needs the lock to complete).
>
> Having more reports could probably help us:
>
> echo 100 > /proc/sys/kernel/hung_task_warnings

Ok, I will modify rc scripts to do it, as I can't login.

>
> Hopefully you can still reproduce it :-s
>
> Thanks a lot!
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Frederic Weisbecker on 3 Feb 2010 18:00

On Thu, Feb 04, 2010 at 01:43:53AM +0300, Alexander Beregalov wrote:
> On 3 February 2010 23:29, Frederic Weisbecker <fweisbec(a)gmail.com> wrote:
> > On Wed, Feb 03, 2010 at 10:08:57PM +0300, Alexander Beregalov wrote:
> >> On 3 February 2010 22:03, Alexander Beregalov <a.beregalov(a)gmail.com> wrote:
> >> > Hi Frederic
> >> >
> >> > I do not have previous messages and do not know how to reproduce it.
> >> > Kernel was 2.6.33-rc5-00237-g9a3cbe3
> >> >
> >>
> >> Hm, I have the same after reboot.
> >>
> >> Do you need me to do anything before I try to fsck ?
> >
> >
> > Yeah. Rebooting again makes your kernel soft lockup?
> Yes, reboot does not help. I even can't login, agetty and sshd are frozen.
>
> INFO: task sshd:1863 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> sshd D 6f60ec44 6576 1863 1810 0x00000000
> f633dd78 00000046 ffffffff 6f60ec44 0000000f f7306b30 f73068b0 00000000
> f7306d84 7fffffff 00000000 f633de70 f633dde8 c134da45 00000000 f633dd8c
> c104ca3b 00000000 7fffffff 0000000f 6f618f50 f73068b0 00000000 00000000
> Call Trace:
> [<c134da45>] schedule_timeout+0x125/0x1b0
> [<c104ca3b>] ? trace_hardirqs_off+0xb/0x10
> [<c1350152>] ? _raw_spin_unlock_irq+0x22/0x30
> [<c104e4c4>] ? trace_hardirqs_on_caller+0x124/0x170
> [<c104e51b>] ? trace_hardirqs_on+0xb/0x10
> [<c134d7d0>] wait_for_common+0xd0/0x130
> [<c1024850>] ? default_wake_function+0x0/0x10
> [<c134d8c2>] wait_for_completion+0x12/0x20
> [<c1039709>] call_usermodehelper_exec+0x89/0xb0
> [<c1039471>] ? call_usermodehelper_setup+0x71/0xb0
> [<c134d730>] ? wait_for_common+0x30/0x130
> [<c10398e2>] __request_module+0xa2/0xf0
> [<c10a6136>] ? new_inode+0x76/0x80
> [<c13501cd>] ? _raw_spin_unlock+0x1d/0x20
> [<c12cc89f>] __sock_create+0x18f/0x1f0
> [<c107b22a>] ? might_fault+0x4a/0xa0
> [<c12cc967>] sock_create+0x37/0x40
> [<c12ccb1e>] sys_socket+0x3e/0x70
> [<c12ccbb0>] sys_socketcall+0x60/0x270
> [<c1002b43>] ? sysenter_exit+0xf/0x18
> [<c11d5eb4>] ? trace_hardirqs_on_thunk+0xc/0x10
> [<c1002b10>] sysenter_do_call+0x12/0x36
> no locks held by sshd/1863.
>
> No locks - what does it mean?

This is the call_usermodehelper_exec path, so probably
the kernel tries to ask userspace to load a module, but
since the filesystem is locked up, this can't happen.

> >
> > Usually such softlockup happens because we have a lock
> > inversion, in which case you should have a lockdep report
> > before the softlockup.
>
> No, I do not have it. 120 seconds after boot I see these messages on
> the console,
> no lockdep reports (lockdep is enabled).

So this is probably this event waited thing.

> >
> > Otherwise this can also happen when we wait for an event
> > that needs the lock to complete but
> > that can not happen because we already have the lock.
> >
> > Task A hold reiserfs lock and wait for event 1
> > Task B wants to complete event 1 but it need the reisers lock
> > for that => deadlock.
> >
> > This can usually be found in a softlockup report: lots of
> > tasks are blocked on reiserfs_write_lock/mutex_lock
> > except one, and this one is important as it is probably
> > the waiter: the task that holds the lock and that is waiting
> > for another event (that in turn needs the lock to complete).
> >
> > Having more reports could probably help us:
> >
> > echo 100 > /proc/sys/kernel/hung_task_warnings
>
> Ok, I will modify rc scripts to do it, as I can't login.

Thanks!

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Alexander Beregalov on 4 Feb 2010 20:50

Hi Frederic

Does it help?

2 locks held by metalog/1558:
#0: (&sb->s_type->i_mutex_key#5){+.+.+.}, at: [<c106a501>]
generic_file_aio_write+0x41/0xb0
#1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110ed8>]
reiserfs_write_lock_once+0x28/0x50

2 locks held by svscan/1600:
#0: (&sb->s_type->i_mutex_key#5){+.+.+.}, at: [<c10a0112>]
vfs_readdir+0x52/0xa0
#1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110f28>]
reiserfs_write_lock+0x28/0x40

3 locks held by agetty/1868:
#0: (&p->cred_guard_mutex){+.+.+.}, at: [<c10972f4>]
prepare_bprm_creds+0x24/0x60
#1: (&sb->s_type->i_mutex_key#5){+.+.+.}, at: [<c109a4bb>]
do_lookup+0xbb/0x1c0
#2: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110ed8>]
reiserfs_write_lock_once+0x28/0x50

2 locks held by modprobe/1871:
#0: (&sb->s_type->i_mutex_key#5){+.+.+.}, at: [<c10a0112>]
vfs_readdir+0x52/0xa0
#1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110f28>]
reiserfs_write_lock+0x28/0x40

2 locks held by fcron/1872:
#0: (&sb->s_type->i_mutex_key#5){+.+.+.}, at: [<c109d903>]
do_filp_open+0x173/0x9f0
#1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110ed8>]
reiserfs_write_lock_once+0x28/0x50

3 locks held by ddclient/1588:
#0: (&sb->s_type->i_mutex_key#5){+.+.+.}, at: [<c1091417>]
do_truncate+0x57/0x90
#1: (&sb->s_type->i_alloc_sem_key#3){+.+...}, at: [<c10a65e6>]
notify_change+0x1d6/0x330
#2: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110ed8>]
reiserfs_write_lock_once+0x28/0x50

2 locks held by multilog/1615:
#0: (&sb->s_type->i_mutex_key#5){+.+.+.}, at: [<c106a501>]
generic_file_aio_write+0x41/0xb0
#1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110ed8>]
reiserfs_write_lock_once+0x28/0x50

2 locks held by nfsd/1734:
#0: (hash_sem){.+.+.+}, at: [<c1131a2d>] exp_readlock+0xd/0x10
#1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110f28>]
reiserfs_write_lock+0x28/0x40
2 locks held by nfsd/1735:
#0: (hash_sem){.+.+.+}, at: [<c1131a2d>] exp_readlock+0xd/0x10
#1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110f28>]
reiserfs_write_lock+0x28/0x40
2 locks held by nfsd/1736:
#0: (hash_sem){.+.+.+}, at: [<c1131a2d>] exp_readlock+0xd/0x10
#1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110f28>]
reiserfs_write_lock+0x28/0x40
2 locks held by nfsd/1737:
#0: (hash_sem){.+.+.+}, at: [<c1131a2d>] exp_readlock+0xd/0x10
#1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110f28>]
reiserfs_write_lock+0x28/0x40
2 locks held by nfsd/1738:
#0: (hash_sem){.+.+.+}, at: [<c1131a2d>] exp_readlock+0xd/0x10
#1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110f28>]
reiserfs_write_lock+0x28/0x40
2 locks held by nfsd/1739:
#0: (hash_sem){.+.+.+}, at: [<c1131a2d>] exp_readlock+0xd/0x10
#1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110f28>]
reiserfs_write_lock+0x28/0x40
2 locks held by nfsd/1740:
#0: (hash_sem){.+.+.+}, at: [<c1131a2d>] exp_readlock+0xd/0x10
#1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110f28>]
reiserfs_write_lock+0x28/0x40
2 locks held by nfsd/1741:
#0: (hash_sem){.+.+.+}, at: [<c1131a2d>] exp_readlock+0xd/0x10
#1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1110f28>]
reiserfs_write_lock+0x28/0x40
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

| Next | Last
Pages: 1 2
Prev: [PATCH] ssb: Fix CONFIG_SSB_SDIOHOST typo
Next: Input: serio - re-add thaw (and add freeze)