WDC HDD RAID Failure / Intel SATA controller (random, every 2-3 weeks) [Linux Hardware]

From: philo on 9 Dec 2009 08:35

news.tpi.pl wrote:
> I'm getting WDC HDDs dropping out of the RAID. It happens every 2-3 weeks,
> seemingly at random, and only one drive drops at a time, from a random
> array (MD1, MD2).
>
> SMART is clean for both drives. Neither the short nor the long SMART
> self-test reports errors on either drive.
> badblocks returns no errors in the non-destructive read-write and
> destructive write modes on either drive.

I'd go further than that and run the manufacturer's diagnostic on the
drive in question.

If the diagnostic finds any errors, obviously you will have to replace
the drive.

OTOH: Even if the manufacturer's diagnostic does not find any errors...
I'd err on the side of caution and replace the drive.

Obviously I assume all data are backed up!
>
> The error is always same:
> ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
>
> Full error logs are below. Yesterday SDB on MD2 failed. I started a rebuild
> that took more than 20 hours, and during the rebuild SDA on MD1 failed. So
> after the first rebuild finished I was forced to rebuild MD1.
>
> Kernel: 2.6.27.10-grsec-xxxx-grs-ipv4-64
>
> Kernel / libata bug? Any comments?
>
> FI

<snip>

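For readers comparing their own logs against the exception line quoted above, here is a minimal sketch that splits such a header into its hexadecimal fields. The function names `parse_ata_exception` and `header_is_all_zero` are hypothetical, and the pattern is assumed only from the one line quoted in the thread; the full logs were snipped, so this says nothing about the lines that followed it.

```python
import re

# Header format assumed from the single line quoted in the thread:
#   ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ATA_EXCEPTION_RE = re.compile(
    r"(?P<port>ata\d+(?:\.\d+)?): exception "
    r"Emask (?P<emask>0x[0-9a-fA-F]+) "
    r"SAct (?P<sact>0x[0-9a-fA-F]+) "
    r"SErr (?P<serr>0x[0-9a-fA-F]+) "
    r"action (?P<action>0x[0-9a-fA-F]+)"
)

HEX_FIELDS = ("emask", "sact", "serr", "action")


def parse_ata_exception(line):
    """Return the header fields as a dict, or None if the line does not match."""
    match = ATA_EXCEPTION_RE.search(line)
    if match is None:
        return None
    fields = {"port": match.group("port")}
    for name in HEX_FIELDS:
        # Values are logged in hex, so convert them to integers.
        fields[name] = int(match.group(name), 16)
    return fields


def header_is_all_zero(fields):
    """True when every numeric field in the header is zero, as in the thread."""
    return all(fields[name] == 0 for name in HEX_FIELDS)
```

Calling `parse_ata_exception` on the line from the original post gives port `ata2.00` with all four values zero. A header like that carries little information by itself, so the command and status lines logged right after it are usually the ones worth reading.
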
From: Hactar on 9 Dec 2009 14:07

In article <hfo93c$3gf$2(a)news.eternal-september.org>,
philo <philo(a)privacy.invalid> wrote:
> news.tpi.pl wrote:
> > I'm getting WDC HDDs dropping out of the RAID. It happens every 2-3 weeks,
> > seemingly at random, and only one drive drops at a time, from a random
> > array (MD1, MD2).
> >
> > SMART is clean for both drives. Neither the short nor the long SMART
> > self-test reports errors on either drive.
> > badblocks returns no errors in the non-destructive read-write and
> > destructive write modes on either drive.
>
> I'd go further than that and run the manufacturer's diagnostic on the
> drive in question.
>
> If the diagnostic finds any errors, obviously you will have to replace
> the drive.
>
> OTOH: Even if the manufacturer's diagnostic does not find any errors...
> I'd err on the side of caution and replace the drive.

So, no matter what the manufacturer's diagnostic says, you'd replace the
drive. Why bother running it at all? I replace my drives too when they
start to act up, because I figure that's the beginning of the end.

> Obviously I assume all data are backed up!

As it should be.

--
-eben QebWenE01R(a)vTerYizUonI.nOetP royalty.mine.nu:81

This message was created using recycled electrons.

From: philo on 9 Dec 2009 17:24

Hactar wrote:
> In article <hfo93c$3gf$2(a)news.eternal-september.org>,
> philo <philo(a)privacy.invalid> wrote:
>> news.tpi.pl wrote:
>>> I'm getting WDC HDDs dropping out of the RAID. It happens every 2-3 weeks,
>>> seemingly at random, and only one drive drops at a time, from a random
>>> array (MD1, MD2).
>>>
>>> SMART is clean for both drives. Neither the short nor the long SMART
>>> self-test reports errors on either drive.
>>> badblocks returns no errors in the non-destructive read-write and
>>> destructive write modes on either drive.
>> I'd go further than that and run the manufacturer's diagnostic on the
>> drive in question.
>>
>> If the diagnostic finds any errors, obviously you will have to replace
>> the drive.
>>
>> OTOH: Even if the manufacturer's diagnostic does not find any errors...
>> I'd err on the side of caution and replace the drive.
>
> So, no matter what the manufacturer's diagnostic says, you'd replace the
> drive. Why bother running it at all? I replace my drives too when they
> start to act up, because I figure that's the beginning of the end.
>

If the drive is going to be replaced under warranty,
the manufacturer will want the diagnostic error code.

But no matter what, I'd replace it.
I have seen drives that passed the manufacturer's diagnostic
but were definitely bad (rare, though).

>> Obviously I assume all data are backed up!
>
> As it should be.
>

From: Jon Solberg on 11 Dec 2009 07:48

On 2009-12-11, news.tpi.pl <pslawek> wrote:
>
> User "philo" <philo(a)privacy.invalid> wrote in message
> news:hfp82u$ucd$1(a)news.eternal-september.org...
>> Hactar wrote:
>>> In article <hfo93c$3gf$2(a)news.eternal-september.org>,
>>> philo <philo(a)privacy.invalid> wrote:
>>>> news.tpi.pl wrote:
>>>>
>>>> [snipped]
>>>>
>>>> Obviously I assume all data are backed up!
>>>
>>> As it should be.
>
> Yes, data is backed up.
>
> But I can't replace the drive (no manufacturer will take 2 HDDs back
> because of some bugs reported by the kernel, when the drives look
> 100% healthy and there are no errors).
>
> Any other ideas?

I can't help you with your original problem but, pretty please, don't
top post. It makes it unnecessarily hard to follow the thread.

See, for example,
http://www.google.se/#hl=sv&source=hp&q=why+top+posting+is+bad&btnG=Google-s%C3%B6kning&meta=&aq=f&oq=why+top+posting+is+bad&fp=af2e0ae02f7c4ab7
for more information on posting styles.

Thanks.

--
Jon Solberg (remove "nospam." from email address).

From: AZ Nomad on 11 Dec 2009 09:27

On Fri, 11 Dec 2009 13:35:35 +0100, news.tpi.pl <pslawek> wrote:
>Yes, data is backed up.

>But I can't replace the drive (no manufacturer will take 2 HDDs back because
>of some bugs reported by the kernel, when the drives look 100% healthy and
>there are no errors).

>Any other ideas?

Replace them one at a time. Tell WD that the drive is dead.
