HDD problems causing Kernel panics in Linux/Debian? [Storage]

Prev: [fw] Oracle Drops Hitachi Data Storage Arrays
Next: Any way to revive a dropped 1.5TB Seagate drive? MUST GET IT TO SPIN UP

From: Rod Speed on 8 Mar 2010 20:11

Robert Nichols wrote
> Rod Speed wrote
>> Bob wrote
>>> Rod Speed wrote
>>>> Yousuf Khan wrote
>>>>> Pascal Hambourg wrote
>>>>>> Rod Speed wrote
>>>>>>> Pascal Hambourg wrote

>>>>>>>> Aren't pending sectors sectors waiting to be remapped because of read errors?

>>>>>>> That's not the same thing as unreadable sectors.

>>>>>> Aren't pending sectors unreadable, because of read errors?

>>>>> Pending sectors are still readable,

>>>> Some of them are unreadable, just not all of them are.

>>>>> but they are weak when it comes time to write to them. So they are pending remapping during the next write to
>>>>> them.

>>>> Even unreadable sectors are marked as pending until the write, so you can attempt to get what data is in them
>>>> before the write/reallocate.

>>> More importantly, it's so that you will continue to get an I/O
>>> error when you try to read the file that contains that sector.

>> That doesn't happen. If you get a good read, you won't get an I/O error.

>>> It would be a Very Bad Thing (tm) if you got whatever junk was in the reallocated sector

>> The normal read check stops that from happening, it
>> doesn't have to be flagged as a pending sector to get that.

>>> (remember, the drive could not recover the original data) with no error indication.

>> You will always get an error indication if it couldn't be read, and
>> don't need one if it could be read because the read success varies.

>>> Even if the bad sector is part of your file system's free space or is otherwise irrelevant, the drive has no way to
>>> know that.

>> Yes, but it doesn't need to know that because if it's part of the
>> free space it will normally only be written to, not read from.

>>> The drive has to keep the bad sector visible to the OS until you direct a write there,

>> No it does not. It could reallocate the sector on enough bad reads of that sector, and not bother about the data in
>> that sector.

>>> at which point the original content is no longer relevant.

> You misunderstand.

Nope, you do.

> If the bad sectors were actually reallocated, not just marked as "pending", prior to being rewritten, then there would
> be _no_ error upon reading

Wrong. The error indication is always returned on any read error.

That's an entirely separate matter from whether the sector is reallocated or not.

> (the new sector is, after all, good),

Not on the initial read it ain't.

> but the original data from the bad sector could not be there.

You don't know that either with a sector that doesn't always read bad.
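
[Editor's sketch] The disagreement above is easier to follow with a toy model of the bookkeeping. The Python below is only an illustration under stated assumptions: the class `ToyDrive` and its fields are hypothetical, it treats a bad sector as bad on every read (marginal sectors that sometimes read correctly are not modelled), and it remaps on the first write without re-testing the sector. It is not a description of any real drive firmware.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDrive:
    """Toy model of pending/reallocated bookkeeping; NOT real firmware."""
    bad: set = field(default_factory=set)       # sectors whose media is bad (assumed)
    pending: set = field(default_factory=set)   # flagged, waiting for a write
    remapped: set = field(default_factory=set)  # moved to a spare sector
    data: dict = field(default_factory=dict)    # LBA -> last written payload

    def read(self, lba):
        # A failed read always reports an error, whether or not the
        # sector is flagged as pending.
        if lba in self.bad and lba not in self.remapped:
            self.pending.add(lba)
            raise IOError(f"unrecoverable read error at LBA {lba}")
        return self.data.get(lba, b"")

    def write(self, lba, payload):
        # A write supplies fresh data, so the old content no longer
        # matters and the sector can be moved to a spare.
        if lba in self.pending:
            self.pending.discard(lba)
            self.remapped.add(lba)
        self.data[lba] = payload
```

In this model a read of the bad sector keeps failing until something writes to it; after the write the pending count drops and the reallocated count rises, which is the behaviour both sides describe for sectors that never read successfully.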

From: ANTant on 8 Mar 2010 20:13

>>> You don't need to, no disk access is possible after a kernel
>>> panic, hence no logging. The only thing you can do is to
>>> look at the screen or to enable the serial console output and
>>> log that on another machine.
>>>
>>> The reason no disk access is possible is simple: A kernel
>>> panic only happens when the kernel internal state is regarded
>>> as seriously corrupted. A disk access could then cause serious
>>> filesystem corruption (at least writing) and is therefore
>>> not done.
>
>> So Windows' blue screens with memory dumps are different?
>
> You can get a memory dump under Linux as well by using
> the magic SysRq key (if compiled in), but you cannot write
> to disk after a panic. It is a safety measure.

How about writing to a USB drive or something else?
--
"We are anthill men upon an anthill world." --Ray Bradbury
/\___/\
/ /\ /\ \ Phillip (Ant) @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links (AQFL): http://aqfl.net
\ _ / Please remove ANT if replying by e-mail.
( )

From: Arno on 8 Mar 2010 23:13

In comp.sys.ibm.pc.hardware.storage ANTant(a)zimage.com wrote:
>>>> You don't need to, no disk access is possible after a kernel
>>>> panic, hence no logging. The only thing you can do is to
>>>> look at the screen or to enable the serial console output and
>>>> log that on another machine.
>>>>
>>>> The reason no disk access is possible is simple: A kernel
>>>> panic only happens when the kernel internal state is regarded
>>>> as seriously corrupted. A disk access could then cause serious
>>>> filesystem corruption (at least writing) and is therefore
>>>> not done.
>>
>>> So Windows' blue screens with memory dumps are different?
>>
>> You can get a memory dump under Linux as well by using
>> the magic SysRq key (if compiled in), but you cannot write
>> to disk after a panic. It is a safety measure.

> How about writing to a USB drive or something else?

No. The problem is that a kernel panic is severe and it is
unknown which kernel internal structures are corrupted.
If any of the filesystem structures are corrupt, you could
do arbitrary damage to any accessible filesystem. Unix
(and Linux) has far higher data integrity requirements than
Windows ever had. You also have to take into account that
Unix systems are designed to be long-running, so the chance
of the next reboot being a power-failure is actually high.
As a consequence Unix filesystems are very robust to just
being disconnected at an arbitrary point in time. For that
reason, not writing anything when a kernel panic happens
has a pretty good chance of causing no filesystem
corruption, and data corruption only in open files.

On the other hand, the serial interface is simple, so console
output, including error messages, will still be written to it.
If you need that output, connect a different computer to
the serial port, activate the serial console and capture
its output. I have done this a number of times, mostly to
try out experimental kernels on a cluster, but also to debug
kernel panics.
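
[Editor's sketch] On the capturing machine, the logging side can be as simple as copying lines from the serial device into a file. The Python below is a minimal sketch under stated assumptions: the function name `capture` is made up for illustration, the device path and log file name in the comment are placeholders, and the port's baud rate is assumed to have been configured already (for example with `stty`) to match the panicking machine.

```python
import io
import time


def capture(stream, log, clock=time.time):
    """Copy console lines from a byte stream to a text log with timestamps.

    `stream` is any binary file-like object (for instance an opened serial
    device); `log` is a text file-like object. Returns the number of lines
    copied once the stream reports end-of-file.
    """
    count = 0
    for raw in iter(stream.readline, b""):
        # Console output may contain stray bytes after a crash, so decode leniently.
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        log.write(f"{clock():.3f} {text}\n")
        log.flush()  # keep the log current even if the capture is interrupted
        count += 1
    return count


# Hypothetical usage on the capturing machine (paths are assumptions):
#   with open("/dev/ttyS0", "rb", buffering=0) as port, \
#        open("console.log", "a") as log:
#       capture(port, log)
```

Flushing after every line is deliberate: the interesting output is usually the last few lines before the other machine stops responding.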

The kernel options (boot time kernel commandline) for enabling
the serial console are (copy&paste from kernel-parameters.txt):

console=        [KNL] Output console device and options.

        tty<n>  Use the virtual console device <n>.

        ttyS<n>[,options]
        ttyUSB0[,options]
                Use the specified serial port. The options are of
                the form "bbbbpnf", where "bbbb" is the baud rate,
                "p" is parity ("n", "o", or "e"), "n" is number of
                bits, and "f" is flow control ("r" for RTS or
                omit it). Default is "9600n8".

                See Documentation/serial-console.txt for more
                information. See
                Documentation/networking/netconsole.txt for an
                alternative.

        uart[8250],io,<addr>[,options]
        uart[8250],mmio,<addr>[,options]
                Start an early, polled-mode console on the 8250/16550
                UART at the specified I/O port or MMIO address,
                switching to the matching ttyS device later. The
                options are the same as for ttyS, above.
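
[Editor's sketch] To make the "bbbbpnf" option string concrete: these options belong to the `console=` boot parameter, so a typical command line entry looks like `console=ttyS0,115200n8`. The Python below is a small illustration that splits such an option string into its fields. The function name `parse_console_options` and the returned dictionary keys are invented for this example, and the regular expression is a simplification of what the kernel accepts, not a copy of its parser.

```python
import re

# baud rate, then optionally parity followed by optional bit count, then optional "r"
_OPTS = re.compile(
    r"^(?P<baud>\d+)(?:(?P<parity>[noe])(?P<bits>\d)?)?(?P<flow>r)?$"
)


def parse_console_options(opts="9600n8"):
    """Split a serial console option string of the form "bbbbpnf"."""
    m = _OPTS.match(opts)
    if m is None:
        raise ValueError(f"not a bbbbpnf option string: {opts!r}")
    return {
        "baud": int(m["baud"]),
        "parity": m["parity"] or "n",      # documented default is no parity
        "bits": int(m["bits"] or 8),       # documented default is 8 data bits
        "rts_flow": m["flow"] is not None, # "r" requests RTS flow control
    }


print(parse_console_options("115200n8"))
```

Both ends of the serial link have to agree on these settings, so whatever is passed to the kernel must also be configured on the machine that captures the output.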

Arno

--
Arno Wagner, Dr. sc. techn., Dipl. Inform., CISSP -- Email: arno(a)wagner.name
GnuPG: ID: 1E25338F FP: 0C30 5782 9D93 F785 E79C 0296 797F 6B50 1E25 338F
----
Cuddly UI's are the manifestation of wishful thinking. -- Dylan Evans

From: Ant on 9 Mar 2010 01:32

On 3/8/2010 8:13 PM PT, Arno typed:

> On the other hand, the serial interface is simple, so console
> output, including error messages, will still be written to it.
> If you need that output, connect a different computer to
> the serial port, activate the serial console and capture
> its output. I have done this a number of times, mostly to
> try out experimental kernels on a cluster, but also to debug
> kernel panics.
[snipped]

Can I use my old serial external dial-up modem for this?
--
"... Hey. Could we do that again? I know we haven't met, but I don't
want to be an ant. You know? I mean, it's like we go through life with
our antennae bouncing off one another, continously on ant autopilot,
with nothing really human required of us. 'Stop.' 'Go.' 'Walk here.'
'Drive there.' All action basically for survival. All communication
simply to keep this ant colony buzzing along in an efficient, polite
manner. 'Here's your change.' 'Paper or plastic?' 'Credit or debit?"'
'You want ketchup with that' I don't want a straw. I want real human
moments. I want to see you. I want you to see me. I don't want to give
that up. I don't want to be ant, you know?" "Yeah... yeah I know. I
don't want to be an ant either. Thanks for kinda, like, josteling me
there... I've been kinda on zombie autopilot lately. I don't feel like
an ant in my head, but I guess I probably look like one..." --Waking
Life movie.
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: philpi(a)earthlink.netANT
( ) or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.

From: Vlad_Inhaler on 9 Mar 2010 11:12

On Mar 9, 12:52 am, Arno <m...(a)privacy.net> wrote:
> In comp.sys.ibm.pc.hardware.storage Ant <a...(a)zimage.comant> wrote:
>
> > On 3/7/2010 9:20 AM PT, Arno typed:
>
> >> The reason no disk access is possible is simple: A kernel
> >> panic only happens when the kernel internal state is regarded
> >> as seriously corrupted. A disk access could then cause serious
> >> filesystem corruption (at least writing) and is therefore
> >> not done.
> > So Windows' blue screens with memory dumps are different?
>
> You can get a memory dump under Linux as well by using
> the magic SysRq key (if compiled in), but you cannot write
> to disk after a panic. It is a safety measure.
>
> Arno
> --
> Arno Wagner, Dr. sc. techn., Dipl. Inform., CISSP -- Email: a...(a)wagner.name
> GnuPG: ID: 1E25338F FP: 0C30 5782 9D93 F785 E79C 0296 797F 6B50 1E25 338F
> ----
> Cuddly UI's are the manifestation of wishful thinking. -- Dylan Evans

I would have no hesitation in creating a special partition for panic
dumps, hell - if standard Linux filesystems are that sensitive I'd
even make it VFAT or whatever else is necessary.
I have reproducible kernel hangs under a certain kind of load, they
are *not* temperature related and I have no way of working out what
the hell is going on. Oh, the machine is dual-boot and I don't have
these problems under XP.

Going further into that here would be hijacking this thread, and I
have tried that before now anyway without success.

Having some sensible way of taking dumps for further analysis would be
a really *good thing* - hell, I'd even put an additional old IDE drive
in there as a destination device if that was what it took. Sorry, but
that is a 'safety feature' I am not that happy with. Windows can do
it, mainframe OSs can do it . . .
