Raid 1 "disk" going bad -- how to determine which one. [Linux Help]


From: unruh on 5 Apr 2010 21:09

On 2010-04-05, Giampiero Gabbiani <Giampiero(a)Gabbiani.org> wrote:
> If you have I/O errors on files on a raid1 partition, it is likely that
> the problem is in the file system and NOT in the raid array.
>
> Are you using SOFTWARE raid (i.e. managed through mdadm)?
> If so, 'cat /proc/mdstat' should show which disk has failed.

Yes, I am using software raid, and I do see that all the errors are
coming from /dev/sdb3, so it is time to replace that disk. I did back up,
but I now have to try to remember how to set up the raid system.

>
> If not, and you are using a HW (or fake / ROM) raid, you should be able
> to see the state of the array from the BIOS.
>
> Regards
> Giampiero
>
>> On 2010-04-02, unruh <unruh(a)wormhole.physics.ubc.ca> wrote:
>>> I have a system in which I have two disks united into a raid 1, and one
>>> of the disks seems to be starting to go bad (a bunch of files have I/O
>>> errors on them).
>>> How do I determine which of the two disks is going bad so I can replace
>>> it? Fortunately I have a backup of the stuff (well, most of it) so I
>>> could reconstruct if necessary, but I do not want to chuck two disks if
>>> only one is bad.
>>
>> dmesg
>>
>
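
As a minimal sketch of the two suggestions above (dmesg and /proc/mdstat), the functions below read mdstat-formatted text and report which member the kernel has flagged as failed, marked "(F)", and whether any array is running degraded. The function names are made up for illustration, and the parsing assumes the usual one-line-per-array layout of /proc/mdstat. A disk that only throws I/O errors may not be flagged yet, in which case the device named in the dmesg errors (or `smartctl -H` on each disk, if smartmontools is installed) is the better clue.

```shell
#!/bin/sh
# Print the members of one md array that the kernel has flagged as failed.
# Reads mdstat-formatted text on stdin; $1 is the array name (e.g. md0).
md_failed_members() {
    awk -v array="$1" '
        $1 == array && $2 == ":" {
            for (i = 3; i <= NF; i++)
                if ($i ~ /\(F\)$/) {
                    name = $i
                    sub(/\[[0-9]+\]\(F\)$/, "", name)   # strip the "[1](F)" suffix
                    print name
                }
        }'
}

# Succeeds if any array in the mdstat text on stdin has a missing member,
# i.e. its status brackets contain "_" (for example [U_] or [_U]).
md_is_degraded() {
    grep -Eq '\[[U_]*_[U_]*\]'
}
```

Typical use would be `md_failed_members md0 < /proc/mdstat`, which prints nothing when no member is flagged.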

From: Giampiero Gabbiani on 10 Apr 2010 04:53

> Yes, I am using software raid, and I do see that all the errors are
> coming from /dev/sdb3, so it is time to replace that disk. I did back up,
> but I now have to try to remember how to set up the raid system.

OK, assuming that the raid1 array is named md0 and that it consists of the
two partitions sdaX and sdb3:

1) check the current situation of your array:

# cat /proc/mdstat

2) mark sdb3 as faulty and remove it from the array:

mdadm /dev/md0 --fail /dev/sdb3 --remove /dev/sdb3

3) check again:

# cat /proc/mdstat

you should see your array as degraded.

4) check that the faulty disk doesn't contain any system partition, then
halt the system and replace the faulty disk with a new one.

5) start the system and prepare the new disk by creating a partition on it
of AT LEAST the same size. It is VERY important that the new partition is
>= the old one in size.

6) assuming that the new partition is sdbY, add it to the array:

mdadm /dev/md0 --add /dev/sdbY

7) at this point the array will immediately start rebuilding onto the new
partition. You can monitor it with:

watch -n1 'cat /proc/mdstat'

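you should see something like the following (when rebuilding onto a
replaced disk, the progress line usually reads "recovery" rather than
"resync", and the status shows [2/1] [U_] until the rebuild completes):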

Personalities : [raid1]
md0 : active raid1 sdaX[0] sdbY[1]
      184859840 blocks [2/2] [UU]
      [======>..............]  resync = 33.1% (61296896/184859840) finish=34.3min speed=59895K/sec

unused devices: <none>

DO NOT HALT the system until the resync has finished.
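
Steps 5 to 7 above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names are hypothetical, the sizes come from `blockdev --getsize64`, and the wait loop simply watches /proc/mdstat for a resync or recovery line on any array, not just the one named. Run it as root and substitute your own device names.

```shell
#!/bin/sh
# Succeeds if the new partition is at least as large as the surviving member.
# $1 = size of the surviving member in bytes, $2 = size of the new partition.
new_partition_fits() {
    [ "$2" -ge "$1" ]
}

# Succeeds while the mdstat text on stdin shows a resync or recovery in progress.
md_rebuild_running() {
    grep -Eq '(resync|recovery) *= *[0-9.]+%'
}

# Hypothetical wrapper: add the new partition, then block until the rebuild ends.
# $1 = array (e.g. /dev/md0), $2 = surviving member, $3 = new partition.
readd_and_wait() {
    old=$(blockdev --getsize64 "$2") || return 1
    new=$(blockdev --getsize64 "$3") || return 1
    new_partition_fits "$old" "$new" || {
        echo "$3 is smaller than $2, refusing to add it" >&2
        return 1
    }
    mdadm "$1" --add "$3" || return 1
    sleep 5   # assumption: give the kernel a moment to start the rebuild
    while md_rebuild_running < /proc/mdstat; do
        sleep 10
    done
}
```

For example: `readd_and_wait /dev/md0 /dev/sdaX /dev/sdbY`. If your mdadm supports it, `mdadm --wait /dev/md0` should be able to replace the polling loop.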

Regards
Giampiero
