HDD not suspending properly / dead on resume [Kernel]

From: Stephan Diestelhorst on 9 Jul 2010 12:00

Hi,
I have an issue with suspend to RAM under I/O load on a disk. The symptom
is that the disk does not respond to requests when woken up, producing
only I/O errors on all tested kernels (the newest being 2.6.35-rc4, Ubuntu
mainline PPA build):

[ 1719.580169] sd 0:0:0:0: [sda] Unhandled error code
[ 1719.580174] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1719.580178] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 0f 51 e7 88 00 00 b0 00
[ 1719.580186] end_request: I/O error, dev sda, sector 257025928
[ 1719.580798] Aborting journal on device dm-1-8.
[ 1719.580912] EXT4-fs error (device dm-1) in ext4_reserve_inode_write: Journal has aborted
[ 1719.580959] EXT4-fs (dm-1): Remounting filesystem read-only
[ 1719.581004] sd 0:0:0:0: [sda] Unhandled error code
[ 1719.581007] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1719.581010] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 0f 51 a1 88 00 00 08 00
[ 1719.581016] end_request: I/O error, dev sda, sector 257008008
[ 1719.581026] Buffer I/O error on device dm-1, logical block 2129920
[ 1719.581027] lost page write due to I/O error on dm-1
[ 1719.581149]
[ 1719.581214] sd 0:0:0:0: [sda] Unhandled error code
[ 1719.581217] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1719.581220] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 0e 4d a1 88 00 00 08 00
[ 1719.581227] end_request: I/O error, dev sda, sector 239968648
[ 1719.581254] JBD2: I/O error detected when updating journal superblock for dm-1-8.
[ 1719.581268] journal commit I/O error
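
As a cross-check on the log above, the CDB bytes of each failed Write(10)
can be decoded by hand. The following is a minimal sketch, not part of the
original report; the function name decode_write10 is made up for
illustration. It assumes the standard WRITE(10) layout (opcode 0x2a, a
big-endian LBA in bytes 2 to 5, a big-endian transfer length in bytes 7
to 8).

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks). Hypothetical helper, for illustration only.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return lba, blocks


# The three CDBs copied from the log lines above.
for cdb in (
    "2a 00 0f 51 e7 88 00 00 b0 00",
    "2a 00 0f 51 a1 88 00 00 08 00",
    "2a 00 0e 4d a1 88 00 00 08 00",
):
    lba, blocks = decode_write10(cdb)
    print("LBA %d, %d blocks" % (lba, blocks))
```

The decoded LBAs (257025928, 257008008, 239968648) equal the sector numbers
in the matching end_request lines, so the errors refer to exactly the
writes shown in the CDBs.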

This can be triggered most reliably with multiple "direct" writes to
disk; I create the load with the attached script. When the issue is
triggered, suspend (through pm-suspend) takes very long.
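
The attached script is not reproduced in this archive. The following is a
rough stand-in, not the original script, showing one way to generate
several concurrent "direct" writes. It assumes Linux (os.O_DIRECT), a
filesystem that supports O_DIRECT, and a 1 MiB block size; all function
names, file names and defaults are made up for illustration.

```python
import mmap
import os
import threading

BLOCK_SIZE = 1 << 20  # 1 MiB; assumed to be a multiple of the logical block size


def aligned_block(size=BLOCK_SIZE):
    """Return a zero-filled, page-aligned buffer (O_DIRECT needs alignment)."""
    return mmap.mmap(-1, size)


def direct_writer(path, blocks, block_size=BLOCK_SIZE):
    """Write `blocks` blocks to `path`, bypassing the page cache."""
    buf = aligned_block(block_size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o600)
    try:
        for _ in range(blocks):
            os.write(fd, buf)
    finally:
        os.close(fd)
        buf.close()


def run_load(directory, writers=4, blocks=1024):
    """Run several direct writers concurrently and wait for all of them."""
    threads = [
        threading.Thread(
            target=direct_writer,
            args=(os.path.join(directory, "load-%d.bin" % i), blocks),
        )
        for i in range(writers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling run_load() on a directory of the affected disk and suspending
while it runs should approximate the scenario described above. Whether it
triggers the problem as reliably as the original script is untested.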

IMHO the interesting log output during suspend is:
[ 1668.150125] Suspending console(s) (use no_console_suspend to debug)
[ 1668.150460] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 1668.174958] sd 0:0:0:0: [sda] Stopping disk
[ 1668.198045] ACPI handle has no context!
[ 1668.199302] ohci_hcd 0000:00:14.5: PCI INT C disabled
[ 1668.199468] ohci_hcd 0000:00:13.1: PCI INT A disabled
[ 1668.199477] ohci_hcd 0000:00:13.0: PCI INT A disabled
[ 1668.199520] ehci_hcd 0000:00:12.2: PCI INT B disabled
[ 1668.199525] ohci_hcd 0000:00:12.1: PCI INT A disabled
[ 1668.199562] ohci_hcd 0000:00:12.0: PCI INT A disabled
[ 1668.210138] ehci_hcd 0000:00:13.2: PCI INT B disabled
[ 1668.300295] HDA Intel 0000:00:14.2: PCI INT A disabled
[ 1668.300301] HDA Intel 0000:01:00.1: PCI INT B disabled
[ 1668.300349] ACPI handle has no context!
[ 1669.700139] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1674.700125] ata1.00: qc timeout (cmd 0xec)
[ 1674.700136] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 1674.700139] ata1.00: revalidation failed (errno=-5)
[ 1675.230136] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1685.230125] ata1.00: qc timeout (cmd 0xec)
[ 1685.230137] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 1685.230140] ata1.00: revalidation failed (errno=-5)
[ 1685.230144] ata1: limiting SATA link speed to 1.5 Gbps
[ 1685.760137] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 1715.760126] ata1.00: qc timeout (cmd 0xec)
[ 1715.760137] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 1715.760139] ata1.00: revalidation failed (errno=-5)
[ 1715.760142] ata1.00: disabled
[ 1715.810216] ahci 0000:00:11.0: PCI INT A disabled
[ 1715.830154] PM: suspend of devices complete after 47679.847 msecs
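
To see where the suspend time goes, the gap between each "SATA link up"
and the following "qc timeout" can be computed from the timestamps. This
is a small sketch for readers of the log, assuming the dmesg line format
shown above; the helper name identify_waits is made up.

```python
import re

# "[ 1669.700139] message" -> (timestamp, message)
LINE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)")


def identify_waits(log_text):
    """Return seconds between each 'SATA link up' and the next 'qc timeout'."""
    waits = []
    link_up_at = None
    for raw in log_text.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue
        stamp, msg = float(m.group(1)), m.group(2)
        if "SATA link up" in msg:
            link_up_at = stamp
        elif "qc timeout" in msg and link_up_at is not None:
            waits.append(round(stamp - link_up_at, 3))
            link_up_at = None
    return waits


# Lines copied from the suspend log above.
LOG = """\
[ 1669.700139] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1674.700125] ata1.00: qc timeout (cmd 0xec)
[ 1675.230136] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1685.230125] ata1.00: qc timeout (cmd 0xec)
[ 1685.760137] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 1715.760126] ata1.00: qc timeout (cmd 0xec)
"""

waits = identify_waits(LOG)
print(waits, sum(waits))
```

The three IDENTIFY attempts wait about 5, 10 and 30 seconds, so roughly 45
of the 47.7 seconds reported for the device suspend are spent in these
timeouts.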

I've also attached the full dmesg, lspci -vv and smartctl -a
information.

Do you guys have any ideas here?

Many thanks,
Stephan
--
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelhorst(a)amd.com, Tel. +49 (0)351 448 356 719

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

From: Stephan Diestelhorst on 9 Jul 2010 17:50

I wrote:
> I have an issue with suspend to RAM and I/O load on a disk. Symptoms
> are that the disk does not respond to requests when woken up, producing
> only I/O errors on all tested kernels (newest 2.6.35-rc4 (Ubuntu
> mainline PPA build)):
>
<snip>

> This can be triggered most reliably with multiple "direct" writes to
> disk, I create the load with the attached script. If the issue is
> triggered, suspend (through pm-suspend) takes very long.

Attached now...

> IMHO the interesting log output during suspend is:
> [ 1674.700125] ata1.00: qc timeout (cmd 0xec)

Almighty Google suggested trying "pci=nomsi", which seems to have
cured the issue for me for now. Is that plausible? I'll keep this
under observation.

Thanks,
Stephan

From: Rafael J. Wysocki on 9 Jul 2010 18:00

On Friday, July 09, 2010, Stephan Diestelhorst wrote:
> I wrote:
> > I have an issue with suspend to RAM and I/O load on a disk. Symptoms
> > are that the disk does not respond to requests when woken up, producing
> > only I/O errors on all tested kernels (newest 2.6.35-rc4 (Ubuntu
> > mainline PPA build)):
> >
> <snip>
>
> > This can be triggered most reliably with multiple "direct" writes to
> > disk, I create the load with the attached script. If the issue is
> > triggered, suspend (through pm-suspend) takes very long.
>
> Attached now...
>
> > IMHO the interesting log output during suspend is:
> > [ 1674.700125] ata1.00: qc timeout (cmd 0xec)
>
> Almighty Google suggested trying "pci=nomsi", which seems to have
> cured the issue for me for now. Is that plausible? I'll keep this
> under observation.

Hmm. What does your /proc/interrupts look like?

Also, do you have a link to this "Google suggestion"?

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Stephan Diestelhorst on 9 Jul 2010 19:10

Rafael J. Wysocki wrote:
> On Friday, July 09, 2010, Stephan Diestelhorst wrote:
> > I wrote:
> > > I have an issue with suspend to RAM and I/O load on a disk. Symptoms
> > > are that the disk does not respond to requests when woken up, producing
> > > only I/O errors on all tested kernels (newest 2.6.35-rc4 (Ubuntu
> > > mainline PPA build)):
> > >
> > <snip>
> >
> > > This can be triggered most reliably with multiple "direct" writes to
> > > disk, I create the load with the attached script. If the issue is
> > > triggered, suspend (through pm-suspend) takes very long.
> >
> > > IMHO the interesting log output during suspend is:
> > > [ 1674.700125] ata1.00: qc timeout (cmd 0xec)
> >
> > Almighty Google suggested trying "pci=nomsi", which seems to have
> > cured the issue for me for now. Is that plausible? I'll keep this
> > under observation.
>
> Hmm. What does your /proc/interrupts look like?

This has been yet another red herring. After trying out the kernel
option three times with two different kernels, it failed yet again
with the same symptoms.

I have attached /proc/interrupts for 2.6.35-rc4, once with pci=nomsi
and once without, but again, I do not think this makes a difference :-/
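
The attachments are not part of this archive. For anyone comparing their
own /proc/interrupts with and without pci=nomsi, the sketch below picks
out the lines for one driver and reports whether the interrupt chip column
mentions MSI. It assumes the usual /proc/interrupts layout (an IRQ label
ending in ":", per-CPU counts, chip name, device names); the function
names are made up for illustration.

```python
def irq_lines(text, driver):
    """Return (irq, uses_msi, line) for each interrupt line naming `driver`."""
    found = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].endswith(":"):
            continue  # header row or blank line
        # Shared interrupts list several devices separated by ", ".
        names = [f.rstrip(",") for f in fields[1:]]
        if driver in names:
            uses_msi = any("MSI" in f for f in fields[1:])
            found.append((fields[0].rstrip(":"), uses_msi, line.strip()))
    return found


def read_proc_interrupts(path="/proc/interrupts"):
    """Read the current interrupt table (Linux only)."""
    with open(path) as f:
        return f.read()
```

For example, irq_lines(read_proc_interrupts(), "ahci") lists the AHCI
controller's interrupt lines, with the second element showing whether the
line is an MSI one.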

> Also, do you have a link to this "Google suggestion"?

It was some German forum, a guy with completely different HW, but the
same symptom. I thought trying out the option wouldn't hurt.

It may originally have come from, for example,
http://lkml.org/lkml/2008/12/20/3.

Stephan

From: Rafael J. Wysocki on 9 Jul 2010 20:10

On Saturday, July 10, 2010, Stephan Diestelhorst wrote:
> Rafael J. Wysocki wrote:
> > On Friday, July 09, 2010, Stephan Diestelhorst wrote:
> > > I wrote:
> > > > I have an issue with suspend to RAM and I/O load on a disk. Symptoms
> > > > are that the disk does not respond to requests when woken up, producing
> > > > only I/O errors on all tested kernels (newest 2.6.35-rc4 (Ubuntu
> > > > mainline PPA build)):
> > > >
> > > <snip>
> > >
> > > > This can be triggered most reliably with multiple "direct" writes to
> > > > disk, I create the load with the attached script. If the issue is
> > > > triggered, suspend (through pm-suspend) takes very long.
> > >
> > > > IMHO the interesting log output during suspend is:
> > > > [ 1674.700125] ata1.00: qc timeout (cmd 0xec)
> > >
> > > Almighty Google suggested trying "pci=nomsi", which seems to have
> > > cured the issue for me for now. Is that plausible? I'll keep this
> > > under observation.
> >
> > Hmm. What does your /proc/interrupts look like?
>
> This has been yet another red herring. After trying out the kernel
> option three times with two different kernels, it failed yet again
> with the same symptoms.

I thought it would be like that.

> I have attached /proc/interrupts for 2.6.35-rc4, once with pci=nomsi
> and once without, but again, I do not think this makes a difference :-/
>
> > Also, do you have a link to this "Google suggestion"?
>
> It was some German forum, a guy with completely different HW, but the
> same symptom. I thought trying out the option wouldn't hurt.
>
> It may originally have come from, for example,
> http://lkml.org/lkml/2008/12/20/3.

I have a box where this problem is kind of reproducible, but it happens _very_
rarely. Also, I can't reproduce it on demand by running suspend-resume in a
tight loop. Are you able to reproduce it more regularly?

Also, what kind of disk do you use?

Rafael