kdump: extract log buffer and registers from vmcore on NMI button pressing [Kernel]


From: Vitaly Mayatskikh on 3 Jun 2010 05:10

At Wed, 2 Jun 2010 11:16:11 -0400, Vivek Goyal wrote:

> I am not sure what is the problem we are trying to solve here. If we are
> unable to capture the dump because second kernel did not boot due to
> some driver issue etc., above patch is not going to help either.
>
> If kernel has booted, then one should be able to capture the dump, filter
> it and look at the log buffers and cpu registers.
>
> Most of the failures I have seen in the capture kernel are that it was
> unable to boot due to either device issues or a failure in early boot.
> Once it has crossed those hurdles, capturing the dump is the easy part.
>
> How many times does it happen in the second kernel that the kernel is
> spinning in a loop and NMI can still get you information out?
>
> So can you please give some more information about what kind of failures
> while capturing the dump you are addressing by this patchset.

Obviously, this change doesn't help if the 2nd kernel is not able to
boot. But there are other problems which may prevent the vmcore from
being captured. For example, a machine may have RAM > HDD, so it can
save the vmcore only over the network. If the network fails (e.g., due
to bugs in NIC drivers or NFS, which is not so rare), and the dump
capture environment is non-interactive or doesn't have development
tools like `crash', there's no chance even to guess what has happened.

Other possible failures include a broken RAID controller, HDD, or RAM.
In such situations the NMI button is the last chance to see the old
log.

--
wbr, Vitaly
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Andi Kleen on 3 Jun 2010 05:40

Vitaly Mayatskikh <v.mayatskih(a)gmail.com> writes:
>
> Obviously, this change doesn't help if the 2nd kernel is not able to
> boot. But there are other problems which may prevent the vmcore from
> being captured. For example, a machine may have RAM > HDD, so it can
> save the vmcore only over the network. If the network fails (e.g., due
> to bugs in NIC drivers or NFS, which is not so rare), and the dump
> capture environment is non-interactive or doesn't have development
> tools like `crash', there's no chance even to guess what has happened.

In this case you don't need NMI; sysrq or some /sys trigger
is good enough.

NMI would be only needed if the crash kernel is completely
hosed too.

> Other possible failures include a broken RAID controller, HDD, or RAM.
> In such situations the NMI button is the last chance to see the old
> log.

The big problem is that the NMI is used by more and more subsystems,
and several of them tend to eat all NMIs, so the leftovers are less and
less. Overall I would not consider it reliable.

Also NMI buttons are not actually all that common.

I'm also not sure you really need the analysis in kernel space.

Why not have a user space program that does a quick analysis
of the previous vmcore and dumps a summary only? In fact
I suspect crash can already do that.
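
A minimal sketch of such a user-space summary, under several
assumptions that are not from this thread: the capture environment
ships Python and a makedumpfile build that supports `--dump-dmesg',
the old kernel's memory is exposed as /proc/vmcore, and the summary
should go to /dev/console. The function names and the output path
are hypothetical.

```python
import subprocess


def build_dmesg_command(vmcore, out_path):
    """Build the argv that asks makedumpfile to extract the log buffer.

    Assumes a makedumpfile build that supports --dump-dmesg.
    """
    return ["makedumpfile", "--dump-dmesg", vmcore, out_path]


def last_lines(text, count):
    """Return the final `count` lines of `text` (empty list if count <= 0)."""
    if count <= 0:
        return []
    return text.splitlines()[-count:]


def summarize(vmcore="/proc/vmcore",
              out_path="/tmp/vmcore-dmesg.txt",  # hypothetical location
              count=40,
              console="/dev/console"):
    """Extract the crashed kernel's log and print its tail to the console."""
    # Raises CalledProcessError if makedumpfile fails, so the caller
    # can fall back to some other reporting path.
    subprocess.run(build_dmesg_command(vmcore, out_path), check=True)
    with open(out_path, errors="replace") as log:
        tail = last_lines(log.read(), count)
    with open(console, "w") as con:
        con.write("\n".join(tail) + "\n")
```

Calling summarize() early in the capture environment's init script,
before any attempt to save the full vmcore over the network, would put
the tail of the old log on the console even if the later steps hang.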

-Andi

--
ak(a)linux.intel.com -- Speaking for myself only.

From: Vitaly Mayatskikh on 3 Jun 2010 08:40

At Thu, 03 Jun 2010 11:30:01 +0200, Andi Kleen wrote:

> > Obviously, this change doesn't help if the 2nd kernel is not able to
> > boot. But there are other problems which may prevent the vmcore from
> > being captured. For example, a machine may have RAM > HDD, so it can
> > save the vmcore only over the network. If the network fails (e.g., due
> > to bugs in NIC drivers or NFS, which is not so rare), and the dump
> > capture environment is non-interactive or doesn't have development
> > tools like `crash', there's no chance even to guess what has happened.
>
> In this case you don't need NMI; sysrq or some /sys trigger
> is good enough.

Yes, it can be enough if you can still log in. Also, the NMI part is
small and can be easily changed or removed.

> NMI would be only needed if the crash kernel is completely
> hosed too.

That's the case.

> > Other possible failures include a broken RAID controller, HDD, or RAM.
> > In such situations the NMI button is the last chance to see the old
> > log.
>
> The big problem is that the NMI is used by more and more subsystems,
> and several of them tend to eat all NMIs, so the leftovers are less and
> less. Overall I would not consider it reliable.

True. But as a last hope, when nothing else helps, it still may be
worth trying :)

> Also NMI buttons are not actually all that common.

True as well. This feature is generally not for desktop systems, but
for large servers running critical apps. Such servers usually have an
NMI button facility (directly on the front of the chassis or as a
function in remote console software).

> I'm also not sure you really need the analysis in kernel space.
>
> Why not have a user space program that does a quick analysis
> of the previous vmcore and dumps a summary only? In fact
> I suspect crash can already do that.

I agree, that's fine and usually is enough, if it's still possible to
log in to the system and run this utility. What about a scenario where
a console session is available for only 1 unit in the rack at a time,
the main kernel has crashed, and the dump capture environment is stuck?
The user attaches to that machine but cannot even log in, so the kdump
kernel is probably also semi-dead. He also doesn't see the analysis
dump produced by the utility, because he attached too late to see its
output.
--
wbr, Vitaly

From: Vitaly Mayatskikh on 4 Jun 2010 10:00

At Fri, 4 Jun 2010 12:15:19 +0200, Andi Kleen wrote:

> > As usual: for engineers who have to deal with it - yes, it is common.
>
> Well it would be better then to find out why that happens and fix it.
>
> Is this related to kexec driver problems?

This patchset does not fix any particular bug.
--
wbr, Vitaly
