HDD problems causing Kernel panics in Linux/Debian? [Storage]


From: Ant on 6 Mar 2010 10:33

On 3/6/2010 2:22 AM PT, Pascal Hambourg typed:

> Hello,
>
> Ant wrote:
>> I was poking around to see why my old Linux/Debian box was occasionally
>> and randomly crashing with kernel panics. I read that the errors can be
>> found in /var/log/syslog (dmesg didn't show me anything related to
>> kernel panics that I could find):
>>
>> # cat /var/log/syslog
>> ...
>> Mar 4 23:12:07 foobar smartd[2647]: Device: /dev/hda, SMART Usage
>> Attribute: 194 Temperature_Celsius changed from 30 to 31
>> ...
>> Mar 5 15:11:31 foobar smartd[2610]: Device: /dev/hda, SMART Prefailure
>> Attribute: 1 Raw_Read_Error_Rate changed from 58 to 59
>> Mar 5 15:11:31 foobar smartd[2610]: Device: /dev/hda, SMART Usage
>> Attribute: 195 Hardware_ECC_Recovered changed from 58 to 59
>
> These are not errors but just (useless IMHO) notifications on SMART
> attribute changes.

Ah OK. Thanks. :)

>> foobar:/home/ant/download# smartctl -a /dev/hda
> [...]
>
> From this, hda seems to be perfectly healthy.

Thanks for the confirmation. :)
--
"We ants are runnin' the show! We're the lords of the earth!" --ANTZ
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: philpi(a)earthlink.netANT
( ) or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.

From: Ant on 6 Mar 2010 10:34

> It doesn't seem like this is the cause of your kernel panics. These are
> just informational messages, at worst warnings, but nothing more. The
> drive in question seems to be experiencing some communications error
> with your computer. If it's an IDE drive, then suggest looking into
> changing cables. If it's SATA, it's more rare for there to be
> communications errors, but not unthinkable. However, as the attribute
> says, it recovered from that error, so it's not a failure.

OK. That's good then.

> You might want to turn on core dump saves on the machine, if you haven't
> already done so.

How do I enable core dump saves for kernel panics?

From: Rod Speed on 6 Mar 2010 13:15

"Ant" <ant(a)zimage.comANT> wrote in message news:gfSdncBxmoq-ZgzWnZ2dnUVZ_uadnZ2d(a)earthlink.com...
>I was poking around to see why my old Linux/Debian box was occasionally and randomly crashing with kernel panics. I read that
>the errors can be found in /var/log/syslog (dmesg didn't show me anything related to kernel panics that I could find):
>
> # cat /var/log/syslog
> ...
> Mar 4 23:12:07 foobar smartd[2647]: Device: /dev/hda, SMART Usage Attribute: 194 Temperature_Celsius changed from 30
> to 31
> ...
> Mar 5 15:11:31 foobar smartd[2610]: Device: /dev/hda, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from
> 58 to 59
> Mar 5 15:11:31 foobar smartd[2610]: Device: /dev/hda, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from
> 58 to 59
> Mar 5 15:15:01 foobar /USR/SBIN/CRON[8815]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 15:17:01 foobar /USR/SBIN/CRON[11199]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
> Mar 5 15:25:01 foobar /USR/SBIN/CRON[20721]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 15:35:01 foobar /USR/SBIN/CRON[32588]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 15:45:01 foobar /USR/SBIN/CRON[12129]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 15:55:01 foobar /USR/SBIN/CRON[23947]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> < rebooted my crashed PC from its kernel panic >
> Mar 5 21:05:19 foobar syslogd 1.5.0#5: restart.
> ...
>
> I couldn't find anything similar before an earlier crash, like:
> ...
> Mar 5 05:17:01 foobar /USR/SBIN/CRON[26833]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
> Mar 5 05:25:01 foobar /USR/SBIN/CRON[29514]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 05:35:01 foobar /USR/SBIN/CRON[372]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 05:45:01 foobar /USR/SBIN/CRON[3772]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 05:55:01 foobar /USR/SBIN/CRON[7160]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 06:41:19 foobar syslogd 1.5.0#5: restart.
> ...
>
> # hdparm /dev/hda
>
> /dev/hda:
> multcount = 16 (on)
> IO_support = 1 (32-bit)
> unmaskirq = 1 (on)
> using_dma = 1 (on)
> keepsettings = 0 (off)
> readonly = 0 (off)
> readahead = 256 (on)
> geometry = 16383/255/63, sectors = 156301488, start = 0
> foobar:/home/ant/download# hdparm /dev/hda^C
> foobar:/home/ant/download# smartctl -a /dev/hda
> smartctl 5.40 2010-02-03 r3060 [i686-pc-linux-gnu] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF INFORMATION SECTION ===
> Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus family
> Device Model: ST380011A
> Serial Number: 4JV5P7LN
> Firmware Version: 8.01
> User Capacity: 80,026,361,856 bytes
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: 6
> ATA Standard is: ATA/ATAPI-6 T13 1410D revision 2
> Local Time is: Fri Mar 5 22:32:16 2010 PST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: ( 430) seconds.
> Offline data collection
> capabilities: (0x5b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> No Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 1) minutes.
> Extended self-test routine
> recommended polling time: ( 58) minutes.
>
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> 1 Raw_Read_Error_Rate 0x000f 060 056 006 Pre-fail Always - 40077017
> 3 Spin_Up_Time 0x0003 098 098 000 Pre-fail Always - 0
> 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 0
> 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
> 7 Seek_Error_Rate 0x000f 085 060 030 Pre-fail Always - 339834978
> 9 Power_On_Hours 0x0032 060 060 000 Old_age Always - 35554
> 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
> 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 289
> 194 Temperature_Celsius 0x0022 030 048 000 Old_age Always - 30
> 195 Hardware_ECC_Recovered 0x001a 060 055 000 Old_age Always - 40077017
> 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
> 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
> 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
> 200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
> 202 Data_Address_Mark_Errs 0x0032 100 253 000 Old_age Always - 0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
> # 1 Extended offline Completed without error 00% 31886 -
> # 2 Extended offline Completed without error 00% 22233 -
> # 3 Extended offline Completed without error 00% 18951 -
> # 4 Extended offline Completed without error 00% 18674 -
> # 5 Extended offline Completed without error 00% 15957 -
> # 6 Extended offline Completed without error 00% 14448 -
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
>
> Are those bad? Thank you in advance. :)

Yes, the reallocated sectors are much higher than I would continue to use with new hard drives so cheap.

From: Pascal Hambourg on 6 Mar 2010 14:41

Rod Speed wrote:
>> 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
>> 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
[...]
> Yes, the reallocated sectors are much higher than I would continue to
> use with new hard drives so cheap.

Huh? From the above, the drive has no reallocated sectors yet (raw value
0), and no pending (unreadable) sectors either.

PS: was it useful to quote the whole post just to comment on one line?
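
For anyone wanting to repeat that check on their own drive, here is a minimal Python sketch, not a definitive tool. It assumes the attribute table has the same ten-column layout as the smartctl 5.40 output quoted in this thread. The function names, and the choice of attributes 5, 197 and 198 as the sector counters to watch, are illustrative assumptions.

```python
import re

# Assumption: these are the attributes whose RAW_VALUE counts bad sectors.
SECTOR_COUNTERS = {5, 197, 198}

# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

def parse_attributes(text):
    """Return {attribute id: row dict} for every attribute row found in text."""
    rows = {}
    for line in text.splitlines():
        # Drop Usenet quote markers so quoted output can be pasted in directly.
        match = ROW.match(line.lstrip("> "))
        if match:
            rows[int(match.group(1))] = {
                "name": match.group(2),
                "value": int(match.group(3)),
                "thresh": int(match.group(5)),
                "raw": int(match.group(9)),
            }
    return rows

def problems(rows):
    """List attributes at or below threshold, or with a non-zero sector count."""
    found = []
    for attr_id, row in sorted(rows.items()):
        # A threshold of 0 means the drive never flags this attribute as failed.
        if row["thresh"] > 0 and row["value"] <= row["thresh"]:
            found.append(f"{row['name']}: value {row['value']} <= threshold {row['thresh']}")
        if attr_id in SECTOR_COUNTERS and row["raw"] > 0:
            found.append(f"{row['name']}: raw count {row['raw']}")
    return found

# Rows copied from the smartctl output quoted earlier in the thread.
SAMPLE = """\
> 1 Raw_Read_Error_Rate 0x000f 060 056 006 Pre-fail Always - 40077017
> 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
> 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
> 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
"""

if __name__ == "__main__":
    print(problems(parse_attributes(SAMPLE)))
```

On the rows quoted above it reports nothing, which matches the reading that the drive has no reallocated or pending sectors. The large raw numbers for Raw_Read_Error_Rate are deliberately ignored; only the normalized value against its threshold is compared.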

From: Yousuf Khan on 6 Mar 2010 14:58

Ant wrote:
>> You might want to turn on core dump saves on the machine, if you haven't
>> already done so.
>
> How do I enable core dump saves for kernel panics?

This might be a little old, or not entirely relevant to your distro.

HOWTO enable core-dumps - LinuxReviews - Mozilla Firefox
chrome://browser/content/browser.xul

Look this up for your own distro; it may have an easier way to do this,
depending on what tools it includes.

Yousuf Khan
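
A note for readers following up on this: ordinary core dumps (the `ulimit -c` kind) capture crashed processes, not kernel panics. Capturing a panic is usually done with kdump, which needs memory reserved at boot through the `crashkernel=` kernel parameter. The Python sketch below only checks whether that groundwork appears to be in place. It assumes a kernel built with kexec support, which exposes `/sys/kernel/kexec_crash_loaded`; the function names are illustrative, and the packaging and setup steps differ per distro.

```python
import re

def crashkernel_setting(cmdline):
    """Return the crashkernel= value from a kernel command line, or None."""
    match = re.search(r"(?:^|\s)crashkernel=(\S+)", cmdline)
    return match.group(1) if match else None

def read_first_line(path):
    """Return the first line of a file, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.readline().strip()
    except OSError:
        return None

if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with.
    cmdline = read_first_line("/proc/cmdline") or ""
    # Assumption: this file exists only on kernels built with kexec support.
    loaded = read_first_line("/sys/kernel/kexec_crash_loaded")
    print("crashkernel reservation:", crashkernel_setting(cmdline) or "not set")
    print("crash kernel loaded:", {"1": "yes", "0": "no"}.get(loaded, "unknown"))
```

If the reservation shows as "not set", no crash kernel can be loaded, so a panic will not leave a dump behind no matter what user-space settings are in place.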
