From: Thomas Gleixner on
On Mon, 8 Feb 2010, Andreas Mohr wrote:
>
> And then a cat current_clocksource managed to hang again.

Well, that's not surprising at all. If one task is stuck on clocksource_mutex,
then the next one will be stuck as well.

> (NOTE that the - now complete! - SysRq-T list does NOT show any backtraces
> of kwatchdog any more, only many other processes)
> Could it be that the (rather disruptive) NMI watchdog confuses the current state at
> change_clocksource and causes that stuff to get left with
> clocksource_mutex remaining taken?

Nope, the NMI watchdog is not involved. It merely tells us that the
task is stuck.

tglx

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Thomas Gleixner on
On Mon, 8 Feb 2010, Andreas Mohr wrote:
> > Nope, the NMI watchdog is not involved. It merely tells us that the
> > task is stuck.
>
> OK.
> And after that message debug_locks is zeroed and kwatchdog is gone
> from the process list (probably during debug_locks change).

Oh, no. kwatchdog is a run-once thread. It always exits after its work
is done. But I'm pretty confused about the NMI watchdog output.

EIP: 0060:[<c1045170>] EFLAGS: 00000082 CPU: 0
EIP is at timekeeping_forward_now+0x116/0x139

I don't see what might make the machine stuck here. Can you decode the
source line with "addr2line -e vmlinux c1045170" please ?

> I'll explain what I think might be happening:
> bootup switches to acpi_pm, timekeeping gets borked, NMI watchdog complains
> due to timekeeping issues, brutally yanks the waiting acpi_pm switchover
> (thereby NOT releasing clocksource_mutex),

No, the NMI watchdog does not yank anything. It just reports.

Thanks,

tglx
From: Thomas Gleixner on
On Mon, 8 Feb 2010, Andreas Mohr wrote:
> Hi,
>
> On Mon, Feb 08, 2010 at 11:06:58AM +0100, Thomas Gleixner wrote:
> > EIP: 0060:[<c1045170>] EFLAGS: 00000082 CPU: 0
> > EIP is at timekeeping_forward_now+0x116/0x139
> >
> > I don't see what might make the machine stuck here. Can you decode the
> > source line with "addr2line -e vmlinux c1045170" please ?
>
> And the winner is:
> /usr/src/linux-2.6.33-rc7/include/linux/math64.h:91
>
> static __always_inline u32
> __iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
> {
>         u32 ret = 0;
>
>         while (dividend >= divisor) {
>                 /* The following asm() prevents the compiler from
>                    optimising this loop into a modulo operation. */
>                 asm("" : "+rm"(dividend));
>
>                 dividend -= divisor;
>                 ret++;
>         }
>
>         *remainder = dividend;
>
>         return ret;
> }
>
>
> while ......
>
> Do I see a divisor == 0 here?? ;)

The only function that calls __iter_div_u64_rem() from
timekeeping_forward_now() is timespec_add_ns(), and it passes a
constant divisor:

static __always_inline void timespec_add_ns(struct timespec *a, u64 ns)
{
        a->tv_sec += __iter_div_u64_rem(a->tv_nsec + ns, NSEC_PER_SEC, &ns);
        a->tv_nsec = ns;
}

There goes the theory :)
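
A side note on the loop itself: with a constant non-zero divisor such as
NSEC_PER_SEC the loop always terminates, but its iteration count equals
the quotient, so its running time grows linearly with the dividend. The
following is only a userspace sketch, not kernel code. The name
iter_div_u64_rem, the iterations out-parameter and the assert() are
assumptions added for illustration, and the asm() barrier from the
original is left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace copy of the subtraction loop quoted above.
 * Hypothetical additions for illustration only:
 *   - the "iterations" out-parameter, which counts loop passes
 *   - the assert(), since a zero divisor would never terminate
 * The asm("") barrier of the kernel version is omitted here.
 */
static uint32_t iter_div_u64_rem(uint64_t dividend, uint32_t divisor,
                                 uint64_t *remainder, uint64_t *iterations)
{
        uint32_t ret = 0;
        uint64_t n = 0;

        assert(divisor != 0);

        while (dividend >= divisor) {
                dividend -= divisor;
                ret++;
                n++;
        }

        *remainder = dividend;
        *iterations = n;

        return ret;
}
```

For example, a dividend of 2500000000 with a divisor of 1000000000
returns 2 with a remainder of 500000000 after two passes through the
loop. Whether an unusually large ns value plays any role in the lockup
discussed here is not established by this thread.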

Which compiler version are you using ?

Can you please provide the disassembly of kernel/time/timekeeping.o ?

Thanks,

tglx
From: Thomas Gleixner on
On Mon, 8 Feb 2010, Thomas Gleixner wrote:
> On Mon, 8 Feb 2010, Andreas Mohr wrote:
> Which compiler version are you using ?
>
> Can you please provide the disassembly of kernel/time/timekeeping.o ?

Is that NMI watchdog hit fully reproducible ?

Thanks,

tglx
From: Thomas Gleixner on


On Mon, 8 Feb 2010, Andreas Mohr wrote:

> On Mon, Feb 08, 2010 at 09:51:05PM +0100, Andreas Mohr wrote:
> > Looks like it:
> > - another bootup also had lockup message
> > - all /var/log/dmesg* have lockup message, oldest is:
> > 2010-02-07 20:00 dmesg.4.gz
> >
> > Linux version 2.6.33-rc6 (root(a)note) (gcc version 4.3.4 (Debian 4.3.4-6)) #3 Sun Jan 31 23:47:51 CET 2010
>
> -rc4 and 2.6.32.3 don't show lockup message, instant bootup without any
> visible delay.
>
> Don't tell me I'm now supposed to try -rc5 and also rebuild -rc6 ;)
> If so, do tell early so that I have lots of time to get it built...

Well, we had better find out at which point the problem first showed
up. A bisect would be optimal.

Thanks,

tglx

