From: Marc Dionne on
On Thu, Jun 3, 2010 at 11:42 PM, Robert Hancock <hancockrwd(a)gmail.com> wrote:
> On 06/03/2010 07:21 PM, Marc Dionne wrote:
>>
>> On Wed, Jun 2, 2010 at 10:00 PM, Matthew Garrett<mjg59(a)srcf.ucam.org>
>>  wrote:
>>>
>>> On Wed, Jun 02, 2010 at 09:49:27PM -0400, Marc Dionne wrote:
>>>>
>>>>    Since somewhere early in the 2.6.35 merge window I've been seeing a
>>>> hang
>>>>    during boot.  The hang last roughly 30 seconds and is followed by
>>>> this:
>>>>
>>>>    ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
>>>>    ata4.00: failed command: READ FPDMA QUEUED
>>>>    ata4.00: cmd 60/08:00:00:00:00/00:00:00:00:00/40 tag 0 ncq 4096 in
>>>>             res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>>    ata4.00: status: { DRDY }
>>>>    ata4: hard resetting link
>>>>    ahci_is_device_present: status is: 80
>>>>    ahci_is_device_present: status(2) is: 3
>>>>    ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>>>    ata4.00: configured for UDMA/133
>>>>    ata4.00: device reported invalid CHS sector 0
>>>>    ata4: EH complete
>>>>
>>>>    The system works normally once booted.  I had some time to do a
>>>> bisect
>>>>    today and it cleanly ended up pointing to this commit:
>>>>        96d60303fd: ahci: Turn off DMA engines when there's no device
>>>> attached
>>>>
>>>>    Reverting the commit makes the system boot cleanly as before.  System
>>>> has
>>>>    an Intel ICH10 based board, in AHCI mode, disk is an Intel X25M SSD.
>>>
>>> Hm, interesting. Jeff, any immediate ideas? I don't see this behaviour
>>> on my ICH10. Is there actually a device on ata4?
>>>
>>> --
>>> Matthew Garrett | mjg59(a)srcf.ucam.org
>>
>> If it's helpful, attached are the dmesg output following a normal boot
>> and one with the hang.
>>
>> The only difference in the kernel used for those two boots is that the
>> return statement in ahci_start_engine following the call to
>> ahci_is_device_present is commented out for the normal boot case.
>> When the return statement is active, the 30 second hang occurs.
>
> Can you try and put a printk in where it's returning out of the
> ahci_start_engine method? I wonder if that's getting triggered somewhere
> unexpected..

Attached is a dmesg output with some printks added to show the status
when ahci_is_device_present causes ahci_start_engine to return early..

It looks to me like there's two passes through ahci_start_engine for
all ports, and during the first pass the port with the disk is showing
a status of 127, causing the function to return early and skipping the
start DMA part. This seems to cause the hang later on.

I noticed that during this first pass all the print_id fields for the
ports are still set to -1, so as a test I made ahci_start_engine only
do the check for ahci_is_device_present if print_id != -1, and this
makes the system boot normally.

Marc
From: Robert Hancock on
Fri, Jun 4, 2010 at 7:23 PM, Marc Dionne <marc.c.dionne(a)gmail.com> wrote:
> On Thu, Jun 3, 2010 at 11:42 PM, Robert Hancock <hancockrwd(a)gmail.com> wrote:
>> On 06/03/2010 07:21 PM, Marc Dionne wrote:
>>>
>>> On Wed, Jun 2, 2010 at 10:00 PM, Matthew Garrett<mjg59(a)srcf.ucam.org>
>>> �wrote:
>>>>
>>>> On Wed, Jun 02, 2010 at 09:49:27PM -0400, Marc Dionne wrote:
>>>>>
>>>>> � �Since somewhere early in the 2.6.35 merge window I've been seeing a
>>>>> hang
>>>>> � �during boot. �The hang last roughly 30 seconds and is followed by
>>>>> this:
>>>>>
>>>>> � �ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
>>>>> � �ata4.00: failed command: READ FPDMA QUEUED
>>>>> � �ata4.00: cmd 60/08:00:00:00:00/00:00:00:00:00/40 tag 0 ncq 4096 in
>>>>> � � � � � � res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>>> � �ata4.00: status: { DRDY }
>>>>> � �ata4: hard resetting link
>>>>> � �ahci_is_device_present: status is: 80
>>>>> � �ahci_is_device_present: status(2) is: 3
>>>>> � �ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>>>> � �ata4.00: configured for UDMA/133
>>>>> � �ata4.00: device reported invalid CHS sector 0
>>>>> � �ata4: EH complete
>>>>>
>>>>> � �The system works normally once booted. �I had some time to do a
>>>>> bisect
>>>>> � �today and it cleanly ended up pointing to this commit:
>>>>> � � � �96d60303fd: ahci: Turn off DMA engines when there's no device
>>>>> attached
>>>>>
>>>>> � �Reverting the commit makes the system boot cleanly as before. �System
>>>>> has
>>>>> � �an Intel ICH10 based board, in AHCI mode, disk is an Intel X25M SSD.
>>>>
>>>> Hm, interesting. Jeff, any immediate ideas? I don't see this behaviour
>>>> on my ICH10. Is there actually a device on ata4?
>>>>
>>>> --
>>>> Matthew Garrett | mjg59(a)srcf.ucam.org
>>>
>>> If it's helpful, attached are the dmesg output following a normal boot
>>> and one with the hang.
>>>
>>> The only difference in the kernel used for those two boots is that the
>>> return statement in ahci_start_engine following the call to
>>> ahci_is_device_present is commented out for the normal boot case.
>>> When the return statement is active, the 30 second hang occurs.
>>
>> Can you try and put a printk in where it's returning out of the
>> ahci_start_engine method? I wonder if that's getting triggered somewhere
>> unexpected..
>
> Attached is a dmesg output with some printks added to show the status
> when ahci_is_device_present causes ahci_start_engine to return early..
>
> It looks to me like there's two passes through ahci_start_engine for
> all ports, and during the first pass the port with the disk is showing
> a status of 127, causing the function to return early and skipping the
> start DMA part. �This seems to cause the hang later on.
>
> I noticed that during this first pass all the print_id fields for the
> ports are still set to -1, so as a test I made ahci_start_engine only
> do the check for ahci_is_device_present if print_id != -1, and this
> makes the system boot normally.

Is that the right dmesg? I didn't see where the extra output was. It
might be useful to put a WARN_ON(1); inside ahci_start_engine to see
the two code paths where it's being called..
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Marc Dionne on
On Fri, Jun 4, 2010 at 9:37 PM, Robert Hancock <hancockrwd(a)gmail.com> wrote:
> �Fri, Jun 4, 2010 at 7:23 PM, Marc Dionne <marc.c.dionne(a)gmail.com> wrote:
>> On Thu, Jun 3, 2010 at 11:42 PM, Robert Hancock <hancockrwd(a)gmail.com> wrote:
>>> On 06/03/2010 07:21 PM, Marc Dionne wrote:
>>>>
>>>> On Wed, Jun 2, 2010 at 10:00 PM, Matthew Garrett<mjg59(a)srcf.ucam.org>
>>>> �wrote:
>>>>>
>>>>> On Wed, Jun 02, 2010 at 09:49:27PM -0400, Marc Dionne wrote:
>>>>>>
>>>>>> � �Since somewhere early in the 2.6.35 merge window I've been seeing a
>>>>>> hang
>>>>>> � �during boot. �The hang last roughly 30 seconds and is followed by
>>>>>> this:
>>>>>>
>>>>>> � �ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
>>>>>> � �ata4.00: failed command: READ FPDMA QUEUED
>>>>>> � �ata4.00: cmd 60/08:00:00:00:00/00:00:00:00:00/40 tag 0 ncq 4096 in
>>>>>> � � � � � � res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>>>> � �ata4.00: status: { DRDY }
>>>>>> � �ata4: hard resetting link
>>>>>> � �ahci_is_device_present: status is: 80
>>>>>> � �ahci_is_device_present: status(2) is: 3
>>>>>> � �ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>>>>> � �ata4.00: configured for UDMA/133
>>>>>> � �ata4.00: device reported invalid CHS sector 0
>>>>>> � �ata4: EH complete
>>>>>>
>>>>>> � �The system works normally once booted. �I had some time to do a
>>>>>> bisect
>>>>>> � �today and it cleanly ended up pointing to this commit:
>>>>>> � � � �96d60303fd: ahci: Turn off DMA engines when there's no device
>>>>>> attached
>>>>>>
>>>>>> � �Reverting the commit makes the system boot cleanly as before. �System
>>>>>> has
>>>>>> � �an Intel ICH10 based board, in AHCI mode, disk is an Intel X25M SSD.
>>>>>
>>>>> Hm, interesting. Jeff, any immediate ideas? I don't see this behaviour
>>>>> on my ICH10. Is there actually a device on ata4?
>>>>>
>>>>> --
>>>>> Matthew Garrett | mjg59(a)srcf.ucam.org
>>>>
>>>> If it's helpful, attached are the dmesg output following a normal boot
>>>> and one with the hang.
>>>>
>>>> The only difference in the kernel used for those two boots is that the
>>>> return statement in ahci_start_engine following the call to
>>>> ahci_is_device_present is commented out for the normal boot case.
>>>> When the return statement is active, the 30 second hang occurs.
>>>
>>> Can you try and put a printk in where it's returning out of the
>>> ahci_start_engine method? I wonder if that's getting triggered somewhere
>>> unexpected..
>>
>> Attached is a dmesg output with some printks added to show the status
>> when ahci_is_device_present causes ahci_start_engine to return early..
>>
>> It looks to me like there's two passes through ahci_start_engine for
>> all ports, and during the first pass the port with the disk is showing
>> a status of 127, causing the function to return early and skipping the
>> start DMA part. �This seems to cause the hang later on.
>>
>> I noticed that during this first pass all the print_id fields for the
>> ports are still set to -1, so as a test I made ahci_start_engine only
>> do the check for ahci_is_device_present if print_id != -1, and this
>> makes the system boot normally.
>
> Is that the right dmesg? I didn't see where the extra output was. It
> might be useful to put a WARN_ON(1); inside ahci_start_engine to see
> the two code paths where it's being called..

It's the right dmesg, look for lines containing
"ahci_is_device_present" or "ahci_start_engine" - sorry, I should have
pointed it out.

I'll try adding a WARN_ON to get the code paths.

Marc
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Marc Dionne on
On Fri, Jun 4, 2010 at 9:37 PM, Robert Hancock <hancockrwd(a)gmail.com> wrote:
> Is that the right dmesg? I didn't see where the extra output was. It
> might be useful to put a WARN_ON(1); inside ahci_start_engine to see
> the two code paths where it's being called..

The first series of calls to ahci_start_engine have a stack trace that
looks like this:

Call Trace:
[<ffffffff81077002>] warn_slowpath_common+0x85/0x9d
[<ffffffff81077034>] warn_slowpath_null+0x1a/0x1c
[<ffffffff813572dd>] ahci_start_engine+0x92/0xa4
[<ffffffff8135736a>] ahci_start_port+0x7b/0x16c
[<ffffffff81357792>] ahci_port_resume+0x52/0x73
[<ffffffff81357901>] ahci_port_start+0x14e/0x164
[<ffffffff81341dc5>] ata_host_start+0xdf/0x184
[<ffffffff81357b35>] ? ahci_interrupt+0x0/0x5f3
[<ffffffff8134266f>] ata_host_activate+0x2a/0xdc
[<ffffffff81355aab>] ahci_init_one+0x9dd/0x9fe
[<ffffffff81170a9c>] ? __sysfs_add_one+0x38/0x9a
[<ffffffff812519bd>] ? kobject_get+0x1a/0x22
[<ffffffff81269891>] local_pci_probe+0x17/0x1b
[<ffffffff8126a63f>] pci_device_probe+0xca/0xfa
[<ffffffff81308fbe>] ? driver_sysfs_add+0x4c/0x71
[<ffffffff81309110>] driver_probe_device+0xac/0x13b
[<ffffffff813091fc>] __driver_attach+0x5d/0x81
[<ffffffff8130919f>] ? __driver_attach+0x0/0x81
[<ffffffff81308714>] bus_for_each_dev+0x53/0x88
[<ffffffff81308f70>] driver_attach+0x1e/0x20
[<ffffffff81308bc3>] bus_add_driver+0xb6/0x203
[<ffffffff813094ed>] driver_register+0x9e/0x10f
[<ffffffff8126a869>] __pci_register_driver+0x58/0xc8
[<ffffffff81b6582c>] ? ahci_init+0x0/0x1b
[<ffffffff81b65845>] ahci_init+0x19/0x1b
[<ffffffff81002069>] do_one_initcall+0x5e/0x159
[<ffffffff81b33713>] kernel_init+0x183/0x211
[<ffffffff810319a4>] kernel_thread_helper+0x4/0x10
[<ffffffff81b33590>] ? kernel_init+0x0/0x211
[<ffffffff810319a0>] ? kernel_thread_helper+0x0/0x10

The later calls:

Call Trace:
[<ffffffff81077002>] warn_slowpath_common+0x85/0x9d
[<ffffffff81077034>] warn_slowpath_null+0x1a/0x1c
[<ffffffff813572dd>] ahci_start_engine+0x92/0xa4
[<ffffffff81357a0d>] ahci_hardreset+0xc6/0x124
[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
[<ffffffff81342e08>] ? ata_std_prereset+0x0/0xb7
[<ffffffff8134aeca>] ata_do_reset+0x59/0x62
[<ffffffff8134c3fa>] ata_eh_reset+0x512/0xc99
[<ffffffff81356baa>] ? ahci_postreset+0x0/0x5d
[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
[<ffffffff81358498>] ? ahci_softreset+0x0/0x44
[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
[<ffffffff8134d9a1>] ? ata_eh_schedule_probe+0xc5/0xf3
[<ffffffff8134dca7>] ata_eh_recover+0x2d8/0xd6d
[<ffffffff81356baa>] ? ahci_postreset+0x0/0x5d
[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
[<ffffffff81358498>] ? ahci_softreset+0x0/0x44
[<ffffffff81342e08>] ? ata_std_prereset+0x0/0xb7
[<ffffffff81352ec9>] sata_pmp_error_handler+0xd3/0xa0a
[<ffffffff81067f73>] ? dequeue_entity+0x4a/0x183
[<ffffffff8108c1ae>] ? __cancel_work_timer+0xad/0x193
[<ffffffff8108395e>] ? lock_timer_base+0x2c/0x52
[<ffffffff810839ff>] ? try_to_del_timer_sync+0x7b/0x89
[<ffffffff81357943>] ahci_error_handler+0x2c/0x30
[<ffffffff8134f2c7>] ata_scsi_error+0x33d/0x697
[<ffffffff8131bc10>] scsi_error_handler+0xc2/0x3c5
[<ffffffff8131bb4e>] ? scsi_error_handler+0x0/0x3c5
[<ffffffff8108f451>] kthread+0x7f/0x87
[<ffffffff810319a4>] kernel_thread_helper+0x4/0x10
[<ffffffff8108f3d2>] ? kthread+0x0/0x87
[<ffffffff810319a0>] ? kernel_thread_helper+0x0/0x10

Marc
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Robert Hancock on
On Fri, Jun 4, 2010 at 8:01 PM, Marc Dionne <marc.c.dionne(a)gmail.com> wrote:
> On Fri, Jun 4, 2010 at 9:37 PM, Robert Hancock <hancockrwd(a)gmail.com> wrote:
>> Is that the right dmesg? I didn't see where the extra output was. It
>> might be useful to put a WARN_ON(1); inside ahci_start_engine to see
>> the two code paths where it's being called..
>
> The first series of calls to ahci_start_engine have a stack trace that
> looks like this:
>
> Call Trace:
> �[<ffffffff81077002>] warn_slowpath_common+0x85/0x9d
> �[<ffffffff81077034>] warn_slowpath_null+0x1a/0x1c
> �[<ffffffff813572dd>] ahci_start_engine+0x92/0xa4
> �[<ffffffff8135736a>] ahci_start_port+0x7b/0x16c
> �[<ffffffff81357792>] ahci_port_resume+0x52/0x73
> �[<ffffffff81357901>] ahci_port_start+0x14e/0x164
> �[<ffffffff81341dc5>] ata_host_start+0xdf/0x184
> �[<ffffffff81357b35>] ? ahci_interrupt+0x0/0x5f3
> �[<ffffffff8134266f>] ata_host_activate+0x2a/0xdc
> �[<ffffffff81355aab>] ahci_init_one+0x9dd/0x9fe
> �[<ffffffff81170a9c>] ? __sysfs_add_one+0x38/0x9a
> �[<ffffffff812519bd>] ? kobject_get+0x1a/0x22
> �[<ffffffff81269891>] local_pci_probe+0x17/0x1b
> �[<ffffffff8126a63f>] pci_device_probe+0xca/0xfa
> �[<ffffffff81308fbe>] ? driver_sysfs_add+0x4c/0x71
> �[<ffffffff81309110>] driver_probe_device+0xac/0x13b
> �[<ffffffff813091fc>] __driver_attach+0x5d/0x81
> �[<ffffffff8130919f>] ? __driver_attach+0x0/0x81
> �[<ffffffff81308714>] bus_for_each_dev+0x53/0x88
> �[<ffffffff81308f70>] driver_attach+0x1e/0x20
> �[<ffffffff81308bc3>] bus_add_driver+0xb6/0x203
> �[<ffffffff813094ed>] driver_register+0x9e/0x10f
> �[<ffffffff8126a869>] __pci_register_driver+0x58/0xc8
> �[<ffffffff81b6582c>] ? ahci_init+0x0/0x1b
> �[<ffffffff81b65845>] ahci_init+0x19/0x1b
> �[<ffffffff81002069>] do_one_initcall+0x5e/0x159
> �[<ffffffff81b33713>] kernel_init+0x183/0x211
> �[<ffffffff810319a4>] kernel_thread_helper+0x4/0x10
> �[<ffffffff81b33590>] ? kernel_init+0x0/0x211
> �[<ffffffff810319a0>] ? kernel_thread_helper+0x0/0x10
>
> The later calls:
>
> Call Trace:
> �[<ffffffff81077002>] warn_slowpath_common+0x85/0x9d
> �[<ffffffff81077034>] warn_slowpath_null+0x1a/0x1c
> �[<ffffffff813572dd>] ahci_start_engine+0x92/0xa4
> �[<ffffffff81357a0d>] ahci_hardreset+0xc6/0x124
> �[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
> �[<ffffffff81342e08>] ? ata_std_prereset+0x0/0xb7
> �[<ffffffff8134aeca>] ata_do_reset+0x59/0x62
> �[<ffffffff8134c3fa>] ata_eh_reset+0x512/0xc99
> �[<ffffffff81356baa>] ? ahci_postreset+0x0/0x5d
> �[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
> �[<ffffffff81358498>] ? ahci_softreset+0x0/0x44
> �[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
> �[<ffffffff8134d9a1>] ? ata_eh_schedule_probe+0xc5/0xf3
> �[<ffffffff8134dca7>] ata_eh_recover+0x2d8/0xd6d
> �[<ffffffff81356baa>] ? ahci_postreset+0x0/0x5d
> �[<ffffffff81357947>] ? ahci_hardreset+0x0/0x124
> �[<ffffffff81358498>] ? ahci_softreset+0x0/0x44
> �[<ffffffff81342e08>] ? ata_std_prereset+0x0/0xb7
> �[<ffffffff81352ec9>] sata_pmp_error_handler+0xd3/0xa0a
> �[<ffffffff81067f73>] ? dequeue_entity+0x4a/0x183
> �[<ffffffff8108c1ae>] ? __cancel_work_timer+0xad/0x193
> �[<ffffffff8108395e>] ? lock_timer_base+0x2c/0x52
> �[<ffffffff810839ff>] ? try_to_del_timer_sync+0x7b/0x89
> �[<ffffffff81357943>] ahci_error_handler+0x2c/0x30
> �[<ffffffff8134f2c7>] ata_scsi_error+0x33d/0x697
> �[<ffffffff8131bc10>] scsi_error_handler+0xc2/0x3c5
> �[<ffffffff8131bb4e>] ? scsi_error_handler+0x0/0x3c5
> �[<ffffffff8108f451>] kthread+0x7f/0x87
> �[<ffffffff810319a4>] kernel_thread_helper+0x4/0x10
> �[<ffffffff8108f3d2>] ? kthread+0x0/0x87
> �[<ffffffff810319a0>] ? kernel_thread_helper+0x0/0x10

I'm thinking that the hardreset and device classification makes more
sense as a place to do this anyways, since that's the point where we
really know if there's a device connected or not.. Tejun, any
thoughts?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/