From: Arno on
In comp.sys.ibm.pc.hardware.storage Mike Tomlinson <mike(a)none.invalid> wrote:
> In article <82cni4F42iU1(a)mid.individual.net>, Arno <me(a)privacy.net>
> writes

>>That sounds like BS to me. A soft pencil eraser cannot remove silver
>>sulfide, it is quite resilient.

> It's a technique that has been used on edge connectors for many years.

It works with a harder eraser, and it works for tin contacts with
a soft one. But it does not work for silver contacts; for that you
need an eraser with at least some abrasive (sand) in it.

Arno
--
Arno Wagner, Dr. sc. techn., Dipl. Inform., CISSP -- Email: arno(a)wagner.name
GnuPG: ID: 1E25338F FP: 0C30 5782 9D93 F785 E79C 0296 797F 6B50 1E25 338F
----
Cuddly UI's are the manifestation of wishful thinking. -- Dylan Evans
From: Sergey Kubushyn on
In sci.electronics.repair Arno <me(a)privacy.net> wrote:
> In comp.sys.ibm.pc.hardware.storage Sergey Kubushyn <ksi(a)koi8.net> wrote:
>> In sci.electronics.repair Arno <me(a)privacy.net> wrote:
>>> In comp.sys.ibm.pc.hardware.storage Sergey Kubushyn <ksi(a)koi8.net> wrote:
>>>> In sci.electronics.repair Franc Zabkar <fzabkar(a)iinternode.on.net> wrote:
>>>>> On Thu, 8 Apr 2010 14:03:39 -0700 (PDT), whit3rd <whit3rd(a)gmail.com>
>>>>> put finger to keyboard and composed:
>>>>>
>>>>>>On Apr 8, 12:11?am, Franc Zabkar <fzab...(a)iinternode.on.net> wrote:
>>>>>
>>>>>>> Is this the fallout from RoHS?
>>>>>>
>>>>>>Maybe not. There are other known culprits, like the drywall (gypsum
>>>>>>board,
>>>>>>sheetrock... whatever it's called in your region) that outgasses
>>>>>>hydrogen
>>>>>>sulphide. Some US construction of a few years ago is so bad with
>>>>>>this
>>>>>>toxic and corrosive gas emission that demolition of nearly-new
>>>>>>construction
>>>>>>is called for.
>>>>>>
>>>>>>Corrosion of nearby copper is one of the symptoms of the nasty
>>>>>>product.
>>>>>
>>>>> It's not just Russia that has this problem. The same issue comes up
>>>>> frequently at the HDD Guru forums.
>>>
>>>> I'm right here in the US and I had 3 of 3 WD 1TB drives fail at the same
>>>> time in RAID1, thus making the entire array dead. It is not that you can
>>>> simply buff that dark stuff off and you're good to go. The drive itself
>>>> tries to recover from failures by rewriting service info (remapping etc.)
>>>> but the connection is unreliable and it trashes the entire disk beyond
>>>> repair. Then you have that infamous "click of death"... BTW, it is not
>>>> just WD; others are also that bad.
>>>
>>> It is extremely unlikely for a slow chemical process to achieve this
>>> level of synchronicity. So unlikely, in fact, that it would be fair to
>>> call it impossible.
>>>
>>> Your array died from a different cause that would affect all drives
>>> simultaneously, such as a power spike.
>
>> Yes, they did not die from contact oxidation at that very same moment. I
>> cannot even tell whether they all died in the same month--that array
>> might've been running in degraded mode with one drive dead, then after some
>> time the second drive died but it was still running on the one remaining
>> drive. And only when the last one crossed the Styx did the entire array go
>> dead.
>
> Ah, I see. I did misunderstand that. It may still be something
> else, but the contacts are a possible explanation for that.

I don't think it is something else but everything is possible...

>> I don't use Windows so my machines are never turned off unless there
>> is a real need for this. And they are rarely updated once they are
>> up and running, so there are no reboots. Typical uptime is more than a
>> year.
>
> So your disks worked and then refused to restart? Or you are running
> a RAID1 without monitoring?

They failed during the weekly full backup. One of the file reads failed and
they entered that infinite loop of restarting themselves and retrying. The
root filesystem was also on that RAID1 array, so there was no other choice
than to reboot. And on that reboot all 3 drives failed to start with the same
"click of death" syndrome.

>> I don't know though how I could miss a degradation alert if there was any.
>
> Well, if it is Linux with mdadm, it only sends one email per
> degradation event in the default settings.

Yep, I probably missed it when shoveling through mountains of spam.
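
One of the additional measures will be making the mdadm monitor harder to
miss. A rough sketch (the mail address and delay are placeholders, not what
I actually run):

  # /etc/mdadm.conf -- where the monitor sends its mail
  MAILADDR raid-alerts@example.com

  # Run the monitor as a daemon; --test mails one message per array at
  # startup, so a broken mail path shows up right away.
  mdadm --monitor --scan --daemonise --delay=1800 --test

  # Or a periodic one-shot check from cron, which only generates alerts
  # when an array is actually degraded or missing spares.
  mdadm --monitor --scan --oneshot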

>> All 3 drives in the array simply failed to start after reboot. There were
>> some media errors reported before reboot but all drives somehow worked. Then
>> the system got rebooted and all 3 drives failed with the same "click of
>> death."
>
>> The mechanism here is not that oxidation itself killed the drives. It never
>> happens that way. It was the main cause of the failure, but the drives
>> actually committed suicide, the way a body's immune system kills the body
>> when overreacting to some kind of hemorrhagic fever.
>
>> The probable sequence is something like this:
>
>> - Drives run for a long time with the majority of the files never
>> accessed, so it doesn't matter whether the part of the disk where
>> they are stored is bad or not
>
> I run long smart selftest on all my drives (RAID or no) every
> 14 days to prevent that. Works well.
>
>> - When the system is rebooted RAID array assembly is performed
>
>> - While this assembly is being performed, a number of sectors on a
>> drive are found to be defective and the drive tries to remap them
>
>> - Such action involves rewriting service information
>
>> - Read/write operations are unreliable because of failing head
>> contacts so the service areas become filled with garbage
>
>> - Once the vital service information is damaged the drive is
>> essentially dead because its controller can not read vital data to
>> even start the disk
>
>> - The only hope for the controller to recover is to repeat the read
>> in the hope that it might somehow succeed. This is that infamous
>> "click of death" sound, when the drive tries to read the info again
>> and again. There is no way it can recover because that data is
>> trashed.
>
>> - Drives do NOT fail while they run, the failure happens on the next
>> reboot. The damage that would kill the drives on that reboot
>> happened way before that reboot though.
>
>> That suicide also can happen when some old file that was not accessed for
>> ages is read. That attempt triggers the suicide chain.
>
> Yes, that makes sense. However you should do surface scans on
> RAIDed disks regularly, e.g. by long SMART selftests. This will
> catch weak sectors early and other degradation as well.

I know, but I simply didn't think all 3 drives could fail... I thought I had
enough redundancy because I put not 2 but 3 drives in that RAID1... And I did
have something like a test in the regular weekly full backup, which reads all
the files (not the entire disk media, but at least all the files on it), and
it was that backup that triggered the disk suicide.

Anyway, lesson learned, and I'm taking additional measures now. It was not a
very good experience losing some of my work...
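
The main extra measure is a periodic whole-surface check instead of relying
on the backup happening to touch every file. Roughly (device names are
placeholders):

  # Kick off a long (full-surface) selftest on every array member...
  for d in /dev/sda /dev/sdb /dev/sdc; do smartctl -t long "$d"; done

  # ...and a few hours later check the selftest log and overall health.
  smartctl -l selftest /dev/sda
  smartctl -H /dev/sda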

BTW, I took a look at brand new WDC WD5000YS-01MPB1 drives, right out of the
sealed bags with silica gel, and all 4 of them had their contacts already
oxidized with a lot of black stuff. That makes me very suspicious that the
conspiracy theory might not be all that crazy--the oxidation seems to be
pre-applied by the manufacturer.

---
******************************************************************
* KSI(a)home KOI8 Net < > The impossible we do immediately. *
* Las Vegas NV, USA < > Miracles require 24-hour notice. *
******************************************************************
From: Rod Speed on
Sergey Kubushyn wrote:
[...]
> BTW, I took a look at brand new WDC WD5000YS-01MPB1 drives, right out
> of the sealed bags with silica gel, and all 4 of them had their
> contacts already oxidized with a lot of black stuff. That makes me
> very suspicious that the conspiracy theory might not be all that
> crazy--the oxidation seems to be pre-applied by the manufacturer.

MUCH more likely that someone fucked up in the factory.


From: Arno on
In comp.sys.ibm.pc.hardware.storage Sergey Kubushyn <ksi(a)koi8.net> wrote:
[...]
>>> That suicide also can happen when some old file that was not accessed for
>>> ages is read. That attempt triggers the suicide chain.
>>
>> Yes, that makes sense. However you should do surface scans on
>> RAIDed disks regularly, e.g. by long SMART selftests. This will
>> catch weak sectors early and other degradation as well.

> I know, but I simply didn't think all 3 drives could fail... I thought I had
> enough redundancy because I put not 2 but 3 drives in that RAID1... And I did
> have something like a test in the regular weekly full backup, which reads all
> the files (not the entire disk media, but at least all the files on it), and
> it was that backup that triggered the disk suicide.

> Anyway, lesson learned, and I'm taking additional measures now. It was not a
> very good experience losing some of my work...

Yes, I can imagine. I have my critical stuff also on a 3-way RAID1,
but with long SMART selftests every 2 weeks and 3 different drives,
two from WD and one from Samsung. One additional advantage of the
long SMART selftest is that with smartd you will get a warning
email on every failing test, i.e. one every two weeks. As an additional
safeguard you can also run a daily short test.
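
A setup like that looks roughly as follows in smartd.conf (device names,
mail address and the exact schedule are placeholders; here the long test
runs on the 1st and 15th at 03:00, close enough to every two weeks, and a
short test runs daily at 02:00):

  # /etc/smartd.conf
  #   -a          monitor health, attributes, error and selftest logs
  #   -m <addr>   mail a warning to this address when something trips
  #   -s REGEXP   test schedule, fields are type/month/day/weekday/hour
  #               (L = long selftest, S = short selftest)
  /dev/sda -a -m root@localhost -s (L/../(01|15)/./03|S/../.././02)
  /dev/sdb -a -m root@localhost -s (L/../(01|15)/./03|S/../.././02)
  /dev/sdc -a -m root@localhost -s (L/../(01|15)/./03|S/../.././02)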

> BTW, I took a look at brand new WDC WD5000YS-01MPB1 drives, right out of the
> sealed bags with silica gel, and all 4 of them had their contacts already
> oxidized with a lot of black stuff. That makes me very suspicious that the
> conspiracy theory might not be all that crazy--the oxidation seems to be
> pre-applied by the manufacturer.

Urgh. These bags are airtight. No way the problem happened on your
side then. My two-week-old WD5000AADS-00S9B0 looks fine on the top
of the PCB. I think I will have a look underneath later.

Arno
--
Arno Wagner, Dr. sc. techn., Dipl. Inform., CISSP -- Email: arno(a)wagner.name
GnuPG: ID: 1E25338F FP: 0C30 5782 9D93 F785 E79C 0296 797F 6B50 1E25 338F
----
Cuddly UI's are the manifestation of wishful thinking. -- Dylan Evans
From: Sergey Kubushyn on
In sci.electronics.repair Arno <me(a)privacy.net> wrote:
> In comp.sys.ibm.pc.hardware.storage Sergey Kubushyn <ksi(a)koi8.net> wrote:
> [...]
>>>> That suicide also can happen when some old file that was not accessed for
>>>> ages is read. That attempt triggers the suicide chain.
>>>
>>> Yes, that makes sense. However you should do surface scans on
>>> RAIDed disks regularly, e.g. by long SMART selftests. This will
>>> catch weak sectors early and other degradation as well.
>
>> I know, but I simply didn't think all 3 drives could fail... I thought I had
>> enough redundancy because I put not 2 but 3 drives in that RAID1... And I did
>> have something like a test in the regular weekly full backup, which reads all
>> the files (not the entire disk media, but at least all the files on it), and
>> it was that backup that triggered the disk suicide.
>
>> Anyway, lesson learned, and I'm taking additional measures now. It was not a
>> very good experience losing some of my work...
>
> Yes, I can imagine. I have my critical stuff also on a 3-way RAID1,
> but with long SMART selftests every 2 weeks and 3 different drives,
> two from WD and one from Samsung. One additional advantage of the
> long SMART selftest is that with smartd you will get a warning
> email on every failing test, i.e. one every two weeks. As an additional
> safeguard you can also run a daily short test.

No matter what you do, you cannot prevent an occasional disaster :( One
MUST remember that "backup" is not a noun but a verb in the imperative.

>> BTW, I took a look at brand new WDC WD5000YS-01MPB1 drives, right out of the
>> sealed bags with silica gel, and all 4 of them had their contacts already
>> oxidized with a lot of black stuff. That makes me very suspicious that the
>> conspiracy theory might not be all that crazy--the oxidation seems to be
>> pre-applied by the manufacturer.
>
> Urgh. These bags are airtight. No way the problem happened on your
> side then. My two-week-old WD5000AADS-00S9B0 looks fine on the top
> of the PCB. I think I will have a look underneath later.

Those 4 were fine on the top of the PCB. The black stuff was underneath, on
the pads that contact the springy head pins.

---
******************************************************************
* KSI(a)home KOI8 Net < > The impossible we do immediately. *
* Las Vegas NV, USA < > Miracles require 24-hour notice. *
******************************************************************