From: Tim Woodall on
Friday night (actually very early Saturday morning) I started getting
errors from my backups:

DUMP: dumping (Pass III) [directories]
DUMP: dumping (Pass IV) [regular files]
DUMP: read error from /dev/vg0/var-backup: Input/output error: [block 1385034, ext2blk 0]: count=173129
DUMP: read error from /dev/vg0/var-backup: Input/output error: [sector 1385034, ext2blk 0]: count=173129
....
DUMP: read error from /dev/vg0/var-backup: Input/output error: [sector 1385147, ext2blk 0]: count=173143
DUMP: read error from /dev/vg0/var-backup: Input/output error: [sector 1385148, ext2blk 0]: count=173143
DUMP: DUMP: DUMP: DUMP: mount: you must specify the filesystem type

and similar this morning:

/sbin/lvcreate -A n -L500M -s -nvar-backup /dev/vg0/var
Logical volume "var-backup" created
/sbin/e2fsck -p /dev/vg0/var-backup
/dev/vg0/var-backup: recovering journal
/dev/vg0/var-backup: clean, 1933/256000 files, 400751/512000 blocks
ssh -e none -i /root/.ssh/id_rsa_backup backup@dhcpdns 'mkdir -p /mnt/backup/dumps/mailserver.20080622.1'
mount | ssh -e none -i /root/.ssh/id_rsa_backup backup@dhcpdns 'cat >/mnt/backup/dumps/mailserver.20080622.1/mount.log'
/sbin/dump -z9 -1u -f - /dev/vg0/var-backup | ssh -e none -i /root/.ssh/id_rsa_backup backup@dhcpdns 'cat >/mnt/backup/dumps/mai
DUMP: Date of this level 1 dump: Sun Jun 22 02:32:12 2008
DUMP: Date of last level 0 dump: Sun Jun 1 02:38:52 2008
DUMP: Dumping /dev/vg0/var-backup (an unlisted file system) to standard output
DUMP: Label: none
DUMP: Writing 10 Kilobyte records
DUMP: Compressing output at compression level 9 (zlib)
DUMP: mapping (Pass I) [regular files]
DUMP: mapping (Pass II) [directories]
DUMP: estimated 1383159 blocks.
DUMP: Volume 1 started with block 1 at: Sun Jun 22 02:32:13 2008
DUMP: dumping (Pass III) [directories]
DUMP: dumping (Pass IV) [regular files]
DUMP: read error from /dev/vg0/var-backup: Input/output error: [block 1425368, ext2blk 0]: count=178171
DUMP: read error from /dev/vg0/var-backup: Input/output error: [sector 1425368, ext2blk 0]: count=178171
DUMP: read error from /dev/vg0/var-backup: Input/output error: [sector 1425369, ext2blk 0]: count=178171
...
DUMP: read error from /dev/vg0/var-backup: Input/output error: [sector 1425542, ext2blk 0]: count=178192
DUMP: DUMP: DUMP: DUMP: DUMP: DUMP: fopen on /dev/tty fails: No such device or address
DUMP: The ENTIRE dump is aborted.
mount: you must specify the filesystem type


But I can't find what's wrong.

Manually creating the snapshot and running
dump -0 -f /dev/null /dev/vg0/var-backup
works fine.

dd if=/dev/vg0/var of=/dev/null reads the entire partition OK. Ditto
dd if=/dev/vg0/var-backup of=/dev/null (although I think in this case
I'm really still mostly reading from /dev/vg0/var). If I create a
separate non-snapshot partition then that also reads OK.

The VG is on a RAID on /dev/hda2 and /dev/hdc2

I've done smartctl -t long /dev/hd[ac] and there are no errors. I do
notice that hdc is running hotter than hda - now the machine is mostly
idle again, hda is 25C while hdc is 43C. While running the tests they were
about 40C and 55C respectively. hdc is newer than hda - the original hdc
(bought at the same time as hda) failed fairly quickly.

smartctl says power-on hours are 8375 and 64324. (I don't believe that
64324 - that's more than 7 years - the tests say they were run at 14465
and 20459 lifetime hours which is more believable - the maxtor site says
the warranty expires on 25th November 2008 for /dev/hda and 9th July
2009 for /dev/hdc)

e2fsck -n -f /dev/vg0/var reports no errors.

I want to identify which disk is having problems before I shut down so I
can then pull that disk. What I really don't want is a problem shutting
down and then the raid getting rebuilt from the faulty disk to the good
disk. I know I've got backups from Friday but I'd rather not have to go
through the effort of restoring.

I'm about to try
dd if=/dev/hda of=/dev/null and likewise for /dev/hdc to see if that
flags anything. But is there anywhere else I should be looking? The
entire dump took two minutes so I don't think it's the snapshot volume
getting full.
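
Rather than inferring from the elapsed time, the fill level of the
snapshot can be read directly while the dump runs. The following is only
a minimal sketch: snapshot_too_full is a hypothetical helper, the 80%
threshold is arbitrary, and the snap_percent field name is assumed to be
accepted by the installed version of lvs.

```shell
#!/bin/sh
# Sketch: decide whether an LVM snapshot's copy-on-write area is nearly full.
# snapshot_too_full is a hypothetical helper, not part of LVM.

# Succeed (return 0) if the usage percentage in $1 is at or above the
# threshold in $2. Only the integer part of the percentage is compared,
# and an empty value is treated as 0.
snapshot_too_full() {
    used=${1%%.*}          # "37.52" -> "37"
    [ "${used:-0}" -ge "$2" ]
}

# Example use against the snapshot from this thread (needs root and LVM2;
# the snap_percent field name is an assumption about this lvs version):
#   pct=$(lvs --noheadings -o snap_percent /dev/vg0/var-backup | tr -d ' ')
#   snapshot_too_full "$pct" 80 && echo "snapshot nearly full" >&2
```

Running the commented lines in a loop alongside the backup would show
whether the 500M snapshot comes close to filling before the read errors
start.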

(I've also noticed that dump exits with 0 even when it says "The ENTIRE
dump is aborted")
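
Since the exit status cannot be trusted here, one possible workaround is
to capture dump's stderr and scan it for the messages seen above. This is
only a sketch: dump_log_ok is a hypothetical helper and the log path in
the usage comment is an assumption, not part of the real backup script.

```shell
#!/bin/sh
# Sketch: judge a dump run by its messages rather than its exit status.
# dump_log_ok is a hypothetical helper. It succeeds only if the captured
# stderr file named in $1 contains neither a read error nor the abort banner.
dump_log_ok() {
    # grep -q exits 0 when a match is found, so the result is inverted.
    ! grep -q -e 'read error from' -e 'The ENTIRE dump is aborted' "$1"
}

# Example use (the log path is an assumption):
#   /sbin/dump -0 -f /dev/null /dev/vg0/var-backup 2>/tmp/dump.err
#   dump_log_ok /tmp/dump.err || echo "dump failed despite exit status 0" >&2
```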

Tim.

--
God said, "div D = rho, div B = 0, curl E = - @B/@t, curl H = J + @D/@t,"
and there was light.

http://tjw.hn.org/ http://www.locofungus.btinternet.co.uk/
From: Andy Burns on
On 22/06/2008 12:29, Tim Woodall wrote:

> I'm about to try
> dd if=/dev/hda of=/dev/null and likewise for /dev/hdc to see if that
> flags anything.

That was my first thought, what next would depend on the results ...
From: Tim Woodall on
On Sun, 22 Jun 2008 12:41:36 +0100,
Andy Burns <usenet.april2008(a)adslpipe.co.uk> wrote:
> On 22/06/2008 12:29, Tim Woodall wrote:
>
>> I'm about to try
>> dd if=/dev/hda of=/dev/null and likewise for /dev/hdc to see if that
>> flags anything.
>
> That was my first thought, what next would depend on the results ...
Nothing :-(

Both disks have read from start to end without a murmur:

hda: Peaked at 47C
80293248+0 records in
80293248+0 records out
41110142976 bytes transferred in 842.769365 seconds (48779826 bytes/sec)

hdc: Peaked at 60C
80293248+0 records in
80293248+0 records out
41110142976 bytes transferred in 820.883806 seconds (50080343 bytes/sec)

Maybe whatever the problem was, my fiddling has sorted it out. But I
don't know what I might have done. We'll see tonight when the next
backup runs.

Tim.

From: Andrew Halliwell on
Andy Burns <usenet.april2008(a)adslpipe.co.uk> wrote:
> On 22/06/2008 12:29, Tim Woodall wrote:
>
>> I'm about to try
>> dd if=/dev/hda of=/dev/null and likewise for /dev/hdc to see if that
>> flags anything.
>
> That was my first thought, what next would depend on the results ...

badblocks perhaps? It is slightly more... thorough.
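
For reference, a read-only badblocks pass over both disks might look
like the sketch below. readonly_scan_cmd is a hypothetical helper that
only prints the command lines for review and does not touch the disks;
without -n or -w, badblocks runs its default read-only test.

```shell
#!/bin/sh
# Sketch: print (not run) a read-only badblocks command for each disk.
# readonly_scan_cmd is a hypothetical helper. badblocks without -n or -w
# performs a read-only test; -s shows progress and -v is verbose.
readonly_scan_cmd() {
    printf 'badblocks -s -v %s\n' "$1"
}

# The two RAID members named in this thread.
for disk in /dev/hda /dev/hdc; do
    readonly_scan_cmd "$disk"
done
```

The printed commands need root to run, and a full pass over each disk
will take at least as long as the dd reads did.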

--
| spike1(a)freenet.co.uk | Windows95 (noun): 32 bit extensions and a |
| | graphical shell for a 16 bit patch to an 8 bit |
| Andrew Halliwell BSc | operating system originally coded for a 4 bit |
| in |microprocessor, written by a 2 bit company, that|
| Computer Science | can't stand 1 bit of competition. |
From: google on
On Jun 22, 1:15 pm, Tim Woodall <devn...(a)woodall.me.uk> wrote:
>
> Maybe whatever the problem was, my fiddling has sorted it out. But I
> don't know what I might have done. We'll see tonight when the next
> backup runs.
>
No problems today.

Oh well. Looks like I'll just have to keep an eye on it.

Tim.