From: Yinghai Lu on
On 07/13/2010 04:27 PM, Yinghai Lu wrote:
> On 07/13/2010 03:00 PM, H. Peter Anvin wrote:
>> On 07/12/2010 07:59 PM, Yinghai Lu wrote:
>>> tip/master:
>>> system1: BIOS enabled x2apic, first kernel boot well, and when kexec second kernel will cause system instant reboot.
>>>
>>> system2: BIOS not enable x2apic, first kernel boot well and enable x2apic, and kexec second kernel well. but when kexec third kernel will case system instant reboot.
>>>
>>> linus' tree is ok.
>>>
>>> but for system2 if boot with nox2apic ,intr-remaping off, iommu off, the kexec loop test will pass.
>>>
>>> the problem looks start in recent two or three weeks.
>>>
>>> Any idea?
>>>
>>> bisecting will take a while, because the system post take a while everytime.
>>>
>>> Thanks
>>>
>>> Yinghai Lu
>>
>> OK, I found the bug... if you could test out the patch which will be
>> sent out shortly I would very much appreciate it.
>
> not sure if your patch is the offending one now.
>
> kL: kernel from linus tree
> kT1: kernel from tip
> kT2: kernel from tip with reverting your patch
>
> BIOS-->kL ---> kL ---> kL....always working
> BIOS-->kT1 ---> kT1 ---> kT1 : between second one and third one system reset instant...
> BIOS-->kT2 ---> kT2 ---> kT2 : between second one and third one system reset instant...
>
> BIOS-->kL ---> kL ---> kL ---> then kT1 ---> kT1 .... always working
> BIOS-->kL ---> kL ---> kL ---> then kT2 ---> kT2 .... always working
>

Bisecting points to this commit:

> git bisect good
58687acba59266735adb8ccd9b5b9aa2c7cd205b is the first bad commit
commit 58687acba59266735adb8ccd9b5b9aa2c7cd205b
Author: Don Zickus <dzickus(a)redhat.com>
Date: Fri May 7 17:11:44 2010 -0400

lockup_detector: Combine nmi_watchdog and softlockup detector

The new nmi_watchdog (which uses the perf event subsystem) is very
similar in structure to the softlockup detector. Using Ingo's
suggestion, I combined the two functionalities into one file:
kernel/watchdog.c.

Now both the nmi_watchdog (or hardlockup detector) and softlockup
detector sit on top of the perf event subsystem, which is run every
60 seconds or so to see if there are any lockups.

To detect hardlockups, cpus not responding to interrupts, I
implemented an hrtimer that runs 5 times for every perf event
overflow event. If that stops counting on a cpu, then the cpu is
most likely in trouble.

To detect softlockups, tasks not yielding to the scheduler, I used the
previous kthread idea that now gets kicked every time the hrtimer fires.
If the kthread isn't being scheduled neither is anyone else and the
warning is printed to the console.

I tested this on x86_64 and both the softlockup and hardlockup paths
work.

V2:
- cleaned up the Kconfig and softlockup combination
- surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI
- seperated out the softlockup case from perf event subsystem
- re-arranged the enabling/disabling nmi watchdog from proc space
- added cpumasks for hardlockup failure cases
- removed fallback to soft events if no PMU exists for hard events

V3:
- comment cleanups
- drop support for older softlockup code
- per_cpu cleanups
- completely remove software clock base hardlockup detector
- use per_cpu masking on hard/soft lockup detection
- #ifdef cleanups
- rename config option NMI_WATCHDOG to LOCKUP_DETECTOR
- documentation additions

V4:
- documentation fixes
- convert per_cpu to __get_cpu_var
- powerpc compile fixes

V5:
- split apart warn flags for hard and soft lockups

TODO:
- figure out how to make an arch-agnostic clock2cycles call
(if possible) to feed into perf events as a sample period

[fweisbec: merged conflict patch]

Signed-off-by: Don Zickus <dzickus(a)redhat.com>
Cc: Ingo Molnar <mingo(a)elte.hu>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Cyrill Gorcunov <gorcunov(a)gmail.com>
Cc: Eric Paris <eparis(a)redhat.com>
Cc: Randy Dunlap <randy.dunlap(a)oracle.com>
LKML-Reference: <1273266711-18706-2-git-send-email-dzickus(a)redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec(a)gmail.com>

:040000 040000 c99baa531fdcc45b1cc4d2d3257c9a848067961b 637cfd2034d694e3fdcb0eb0b52b705d71b5078a M Documentation
:040000 040000 0844d6f54293ec10af53a1d5ff64053dc9585a02 acb13a89b3f58130ef9677160e73b7121095da84 M arch
:040000 040000 9b7508dba6d0a76cbec9d6c7ed82820e8c4f2a97 8016330e23998f9dfdce2512556e8a795d66aa55 M include
:040000 040000 e6ec48f3f0314aff9a6a46706772ccd26d901830 ad70b3b8d21c8114096c8a5675393f1ab11457f5 M init
:040000 040000 a4456db9fbda918e06e68e573f18b51f388182db ace18da3199572a1fbc2c0800a2d65f22050ff8c M kernel
:040000 040000 120bb994855546e2e0003e54e3a382663994c00d 0e7721b41acd86ecae6ddf3c2aa6b836543aacb3 M lib
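
For readers following along, the hard-lockup check described in the commit message above (an hrtimer that fires several times per perf overflow, with trouble flagged when its count stops moving) can be pictured roughly as below. This is a toy Python model, not code from kernel/watchdog.c; the class, method and variable names are made up for illustration.

```python
# Toy model of the hard-lockup check described in the commit message.
# All names here are illustrative, not the kernel's actual identifiers.

class CpuWatchdog:
    def __init__(self):
        self.hrtimer_ticks = 0   # bumped by the (simulated) hrtimer callback
        self.ticks_seen = 0      # value observed at the previous overflow check

    def hrtimer_fired(self):
        """The timer interrupt ran, so the CPU is still servicing interrupts."""
        self.hrtimer_ticks += 1

    def nmi_check(self):
        """Run on each simulated perf overflow; True means 'looks hard-locked'."""
        stalled = self.hrtimer_ticks == self.ticks_seen
        self.ticks_seen = self.hrtimer_ticks
        return stalled


def simulate(periods):
    """periods: number of hrtimer ticks between successive overflow checks."""
    cpu = CpuWatchdog()
    results = []
    for ticks in periods:
        for _ in range(ticks):
            cpu.hrtimer_fired()
        results.append(cpu.nmi_check())
    return results


if __name__ == "__main__":
    # Two healthy periods (about 5 ticks each), then interrupts stop arriving.
    print(simulate([5, 5, 0]))
```

The only point of the sketch is that the check compares a counter against its previous value, so it needs both the timer and the overflow event to be running on the CPU in question.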


> git bisect log
git bisect start
# bad: [6058b92b74c529f7234b92492bf634f52707a8c0] Merge branch 'x86/setup'
git bisect bad 6058b92b74c529f7234b92492bf634f52707a8c0
# good: [1c5474a65bf15a4cb162dfff86d6d0b5a08a740c] Linux 2.6.35-rc5
git bisect good 1c5474a65bf15a4cb162dfff86d6d0b5a08a740c
# good: [f12813390bebee04bbd0a070592ce57648805493] Merge branch 'tracing/urgent'
git bisect good f12813390bebee04bbd0a070592ce57648805493
# bad: [e8eb3808c6bd8d78895f6b61d4a36d8346818aad] Merge branch 'x86/urgent'
git bisect bad e8eb3808c6bd8d78895f6b61d4a36d8346818aad
# good: [bb8beea5d4df37ccfb0359329dc0053a82f38501] Merge branch 'linus'
git bisect good bb8beea5d4df37ccfb0359329dc0053a82f38501
# bad: [24e5c8ccb4d187c7a05cb77c3ac004581ad16f26] Merge branch 'linus'
git bisect bad 24e5c8ccb4d187c7a05cb77c3ac004581ad16f26
# bad: [fbde9fccc1a9da261f9f786338af10edbbfb7eb8] Merge branch 'irq/core'
git bisect bad fbde9fccc1a9da261f9f786338af10edbbfb7eb8
# good: [a9a58f907d8650db1c650688cddbecfe481f91d7] Merge branch 'perf/core'
git bisect good a9a58f907d8650db1c650688cddbecfe481f91d7
# bad: [89d7ce2a2178e7f562f608b466a18c8c2ece87af] lockup_detector: Make BOOTPARAM_SOFTLOCKUP_PANIC depend on LOCKUP_DETECTOR
git bisect bad 89d7ce2a2178e7f562f608b466a18c8c2ece87af
# bad: [2508ce1845a3b256798532b2c6b7997c2dc6533b] lockup_detector: Remove old softlockup code
git bisect bad 2508ce1845a3b256798532b2c6b7997c2dc6533b
# bad: [58687acba59266735adb8ccd9b5b9aa2c7cd205b] lockup_detector: Combine nmi_watchdog and softlockup detector
git bisect bad 58687acba59266735adb8ccd9b5b9aa2c7cd205b
# good: [a9aa1d02de36b450990b0e25a88fc2ff1c3e6b94] Merge commit 'v2.6.34-rc7' into perf/nmi
git bisect good a9aa1d02de36b450990b0e25a88fc2ff1c3e6b94
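
If anyone wants to tabulate a log like the one above, here is a minimal Python sketch that pulls the good/bad verdicts out of the comment lines that "git bisect log" prints. The function names are hypothetical, and it assumes the "# good: [sha] subject" / "# bad: [sha] subject" layout shown in this mail.

```python
import re

# Matches the comment lines "git bisect log" emits, e.g.
#   # bad: [6058b92b...] Merge branch 'x86/setup'
MARK = re.compile(r"^# (good|bad): \[([0-9a-f]{40})\] (.*)$")


def summarize_bisect_log(text):
    """Return (verdict, short_sha, subject) tuples in log order."""
    steps = []
    for line in text.splitlines():
        m = MARK.match(line.strip())
        if m:
            verdict, sha, subject = m.groups()
            steps.append((verdict, sha[:12], subject))
    return steps


def last_bad(steps):
    """The most recent commit marked bad, or None if nothing was marked bad."""
    bad = [s for s in steps if s[0] == "bad"]
    return bad[-1] if bad else None


if __name__ == "__main__":
    sample = """\
git bisect start
# bad: [58687acba59266735adb8ccd9b5b9aa2c7cd205b] lockup_detector: Combine nmi_watchdog and softlockup detector
git bisect bad 58687acba59266735adb8ccd9b5b9aa2c7cd205b
# good: [a9aa1d02de36b450990b0e25a88fc2ff1c3e6b94] Merge commit 'v2.6.34-rc7' into perf/nmi
git bisect good a9aa1d02de36b450990b0e25a88fc2ff1c3e6b94
"""
    for step in summarize_bisect_log(sample):
        print(*step)
```

Once a bisect has converged, the last commit marked bad in the log is the one reported as the first bad commit, which is why last_bad() is enough for a quick sanity check.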





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Don Zickus on
On Wed, Jul 14, 2010 at 01:35:44PM -0700, Yinghai Lu wrote:
> bisecting said:
>
> > git bisect good
> 58687acba59266735adb8ccd9b5b9aa2c7cd205b is the first bad commit
> commit 58687acba59266735adb8ccd9b5b9aa2c7cd205b
> Author: Don Zickus <dzickus(a)redhat.com>
> Date: Fri May 7 17:11:44 2010 -0400

What do you mean by instant reboot? This code isn't really exercised
until the cpus come online.

I'll dig through the history of this thread to see if there is a boot log
or something to look at.

Cheers,
Don
From: Yinghai Lu on
On 07/14/2010 01:35 PM, Yinghai Lu wrote:
> bisecting said:
>
>> git bisect good
> 58687acba59266735adb8ccd9b5b9aa2c7cd205b is the first bad commit
> commit 58687acba59266735adb8ccd9b5b9aa2c7cd205b
> Author: Don Zickus <dzickus(a)redhat.com>
> Date: Fri May 7 17:11:44 2010 -0400
>
> lockup_detector: Combine nmi_watchdog and softlockup detector
>
> The new nmi_watchdog (which uses the perf event subsystem) is very
> similar in structure to the softlockup detector. Using Ingo's
> suggestion, I combined the two functionalities into one file:
> kernel/watchdog.c.
>
> Now both the nmi_watchdog (or hardlockup detector) and softlockup
> detector sit on top of the perf event subsystem, which is run every
> 60 seconds or so to see if there are any lockups.
>
> To detect hardlockups, cpus not responding to interrupts, I
> implemented an hrtimer that runs 5 times for every perf event
> overflow event. If that stops counting on a cpu, then the cpu is
> most likely in trouble.
>
> To detect softlockups, tasks not yielding to the scheduler, I used the
> previous kthread idea that now gets kicked every time the hrtimer fires.
> If the kthread isn't being scheduled neither is anyone else and the
> warning is printed to the console.
>
> I tested this on x86_64 and both the softlockup and hardlockup paths
> work.
>

With

# CONFIG_LOCKUP_DETECTOR is not set
# CONFIG_HARDLOCKUP_DETECTOR is not set

the kexec loop test passes.

That patch also breaks kexec/kdump on systems where the BIOS pre-enables x2apic.

Yinghai
From: Yinghai Lu on
On 07/14/2010 02:05 PM, Don Zickus wrote:
> On Wed, Jul 14, 2010 at 01:35:44PM -0700, Yinghai Lu wrote:
>> bisecting said:
>>
>>> git bisect good
>> 58687acba59266735adb8ccd9b5b9aa2c7cd205b is the first bad commit
>> commit 58687acba59266735adb8ccd9b5b9aa2c7cd205b
>> Author: Don Zickus <dzickus(a)redhat.com>
>> Date: Fri May 7 17:11:44 2010 -0400
>
> What do you mean by instant reboot? This code isn't really exercised
> until the cpus come online.

When I run kexec -e, I get

Starting kernel ...

and the second kernel should then start booting.

Instead, I get the BIOS POST.

Yinghai
From: Yinghai Lu on
On 07/14/2010 02:23 PM, Yinghai Lu wrote:
> On 07/14/2010 01:35 PM, Yinghai Lu wrote:
>> bisecting said:
>>
>>> git bisect good
>> 58687acba59266735adb8ccd9b5b9aa2c7cd205b is the first bad commit
>> commit 58687acba59266735adb8ccd9b5b9aa2c7cd205b
>> Author: Don Zickus <dzickus(a)redhat.com>
>> Date: Fri May 7 17:11:44 2010 -0400
>>
>> lockup_detector: Combine nmi_watchdog and softlockup detector
>>
>> The new nmi_watchdog (which uses the perf event subsystem) is very
>> similar in structure to the softlockup detector. Using Ingo's
>> suggestion, I combined the two functionalities into one file:
>> kernel/watchdog.c.
>>
>> Now both the nmi_watchdog (or hardlockup detector) and softlockup
>> detector sit on top of the perf event subsystem, which is run every
>> 60 seconds or so to see if there are any lockups.
>>
>> To detect hardlockups, cpus not responding to interrupts, I
>> implemented an hrtimer that runs 5 times for every perf event
>> overflow event. If that stops counting on a cpu, then the cpu is
>> most likely in trouble.
>>
>> To detect softlockups, tasks not yielding to the scheduler, I used the
>> previous kthread idea that now gets kicked every time the hrtimer fires.
>> If the kthread isn't being scheduled neither is anyone else and the
>> warning is printed to the console.
>>
>> I tested this on x86_64 and both the softlockup and hardlockup paths
>> work.
>>
>
> with
> # CONFIG_LOCKUP_DETECTOR is not set
> # CONFIG_HARDLOCKUP_DETECTOR is not set
>
> kexec loop test could passed.
>
> also that patch will break x2apic preenabled system 's kexec/kdump.

Before the combining patch, a kernel built with

CONFIG_DETECT_SOFTLOCKUP=y
CONFIG_NMI_WATCHDOG=y

has the same problem.

So the problem should come from NMI_WATCHDOG.
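
To double-check which of these options a given build actually has, something like the following Python sketch could be run against the kernel's .config text. The helper name and the option list are assumptions taken from the option names mentioned in this thread; other kernel versions may use different names.

```python
import re

# Options named in this thread; adjust the tuple for other kernels.
WATCHDOG_OPTIONS = (
    "LOCKUP_DETECTOR",
    "HARDLOCKUP_DETECTOR",
    "DETECT_SOFTLOCKUP",
    "NMI_WATCHDOG",
)

SET_RE = re.compile(r"^CONFIG_(\w+)=(.*)$")
UNSET_RE = re.compile(r"^# CONFIG_(\w+) is not set$")


def watchdog_state(config_text):
    """Map each option to its value ('y', 'm', ...), 'n' if unset, None if absent."""
    state = {name: None for name in WATCHDOG_OPTIONS}
    for line in config_text.splitlines():
        line = line.strip()
        m = SET_RE.match(line)
        if m and m.group(1) in state:
            state[m.group(1)] = m.group(2)
            continue
        m = UNSET_RE.match(line)
        if m and m.group(1) in state:
            state[m.group(1)] = "n"
    return state


if __name__ == "__main__":
    sample = (
        "CONFIG_DETECT_SOFTLOCKUP=y\n"
        "CONFIG_NMI_WATCHDOG=y\n"
        "# CONFIG_LOCKUP_DETECTOR is not set\n"
    )
    for name, value in watchdog_state(sample).items():
        print(name, value)
```

In practice the text would come from reading the .config file in the build directory; an option reported as None simply does not appear in that file.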

Yinghai

