From: Theodore Tso
On Mon, Sep 28, 2009 at 12:16:44PM -0700, Andy Isaacson wrote:
> After a hard lockup and reboot, my test box (running recent Linus git
> 851b147) came up with:
>
> [ 5.016854] EXT4-fs (sda1): mounted filesystem with ordered data mode
> [ 8.809125] EXT4-fs (sda1): internal journal on sda1:8
> [ 10.165239] EXT4-fs error (device sda1): ext4_lookup: deleted inode referenced: 524788
> [ 10.165286] Aborting journal on device sda1:8.
> [ 10.168111] EXT4-fs error (device sda1): ext4_journal_start_sb: Detected aborted journal
> [ 10.168169] EXT4-fs (sda1): Remounting filesystem read-only
> [ 10.171614] EXT4-fs (sda1): Remounting filesystem read-only

It would be useful to see what pathname is associated with inode 524788.

You can use debugfs to find this out. For example, to find a pathname
which points to inode 14666, you can do this:

# debugfs /dev/sda1
debugfs 1.41.9 (22-Aug-2009)
debugfs: ncheck 14666
Inode Pathname
14666 /grub/menu.lst

Also, please try the debugfs stat command and send me the output:

debugfs: stat <14666>
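
If it's more convenient, debugfs can also run a single command
non-interactively via its -R option, so you don't need an interactive
session; a sketch, using the inode number from your error message:

# debugfs -R 'ncheck 524788' /dev/sda1
# debugfs -R 'stat <524788>' /dev/sda1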

> 2. after a lockup the journal recovery should not fail.

I'm not sure it was a matter of the journal recovery failing. All we
know for certain is that the filesystem was corrupted after the lockup
and the subsequent remount. What caused the file system corruption is
open to question at the moment: it could have been caused by the
lockup; or it could have been a file that was deleted right around the
time of the lockup; or it could have been some completely random
filesystem corruption.

It would be useful to know whether the inode in question was
supposed to have been deleted. If it was, it would be useful to know
if the dtime reported by debugfs's stat was around the time of the
original lockup.
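
For instance, the timestamps can be pulled out in one step (a sketch;
524788 is the inode number from the error message above):

# debugfs -R 'stat <524788>' /dev/sda1 | grep -i time

debugfs prints ctime/mtime/dtime both as a raw hex value and as a
human-readable date, so the dtime can be compared directly against the
lockup time in your logs.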

- Ted

From: Theodore Tso
On Mon, Sep 28, 2009 at 02:28:38PM -0700, Andy Isaacson wrote:
>
> I've attached the complete output from "fsck -n /dev/sda1" and "stat
> <%d>" on each inode reported to be deleted.
>

So the large number of multiply-claimed-blocks messages is definitely
a clue:

> Multiply-claimed block(s) in inode 919422: 3704637
> Multiply-claimed block(s) in inode 928410: 3704637

> Multiply-claimed block(s) in inode 928622: 3703283
> Multiply-claimed block(s) in inode 943927: 3703283

> Multiply-claimed block(s) in inode 933307: 3702930
> Multiply-claimed block(s) in inode 943902: 3702930

What this indicates to me is that an inode table block was written to
the wrong location on disk. In fact, given the large number of inode
numbers involved, it looks like many inode table blocks were written
to the wrong location on disk.
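
For reference, you can work out which block group and inode table
block a given inode lives in from the geometry reported by
dumpe2fs -h; a sketch with placeholder values (substitute the actual
"Inodes per group", "Inode size", and "Block size" lines from your
filesystem):

INODE=524788
INODES_PER_GROUP=8192   # placeholder: "Inodes per group" from dumpe2fs
INODE_SIZE=256          # placeholder: "Inode size" from dumpe2fs
BLOCK_SIZE=4096         # placeholder: "Block size" from dumpe2fs
echo "block group: $(( (INODE - 1) / INODES_PER_GROUP ))"
echo "inode table block: $(( ((INODE - 1) % INODES_PER_GROUP) * INODE_SIZE / BLOCK_SIZE ))"

If two of the colliding inodes above turn out to sit at the same
relative offset within their respective inode table blocks, that would
be consistent with one block having been written over the other.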

So what happened with the file "/etc/rcS.d/S90mountdebugfs" is probably
_not_ that it was deleted on September 22nd, but rather that sometime
recently the inode table block containing inode #524788 was
overwritten by another inode table block, one containing a deleted
inode at that relative position in the inode table block.

This must have happened since the last successful boot, since with
/etc/rcS.d/S90mountdebugfs pointing at a deleted inode, any attempt to
boot the system after the corruption had taken place would have
resulted in catastrophe.

I'm surprised by how many inode table blocks apparently got
misdirected. Almost certainly some kind of hardware failure triggered
this. I'm not sure what caused it, but it does seem that your
filesystem has been toasted fairly badly.

At this point my advice would be to recover as much data from the disk
as you can, and to *not* run fsck or mount the filesystem read/write
until you are confident you have recovered all of the critical files
you care about, or have first made an image copy of the disk to a
backup hard drive using dd. If you're really curious we could look at
the dumpe2fs output and see if we can find a pattern in what might
have caused so many misdirected writes, but there's no guarantee we
would find the definitive root cause, and from a recovery perspective
it's probably faster and less risky to reinstall your system disk from
scratch.
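
A minimal sketch of such an image copy, assuming a scratch disk is
mounted at /backup with enough free space (the paths are placeholders):

# dd if=/dev/sda1 of=/backup/sda1.img bs=4M conv=noerror,sync

conv=noerror,sync tells dd to keep going past unreadable sectors and
pad them with zeroes so the image stays aligned; you can then point
fsck, debugfs, or dumpe2fs at the image (via a loop device if you want
to mount it) instead of risking the original disk.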

Good luck, and I'm sorry your file system has gotten so badly
disrupted.

- Ted