From: Dave Airlie on
Hi guys,

I've been running an Intel SSD (the KS one) on my Dell XPS710 desktop
machine, with btrfs on it.

I'm not sure the btrfs oops isn't due to the disk/controller doing
something bad (almost guaranteed).

Attached the dmesg + config, using 2.6.34 + only drm patches.

Jeff I'd be interested in knowing what is happening to the disk before
btrfs oops.

Dave.
From: Robert Hancock on
On 05/31/2010 09:04 PM, Dave Airlie wrote:
> Hi guys,
>
> I've been running an Intel SSD (the KS one) on my Dell XPS710 desktop
> machine, with btrfs on it.
>
> I'm not sure the btrfs oops isn't due to the disk/controller doing
> something bad (almost guaranteed).
>
> Attached the dmesg + config, using 2.6.34 + only drm patches.
>
> Jeff I'd be interested in knowing what is happening to the disk before
> btrfs oops.

ata2: EH in SWNCQ mode,QC:qc_active 0x7FFFFE03 sactive 0x7FFFFE03
ata2: SWNCQ:qc_active 0xFE00 defer_bits 0x7FFF0003 last_issue_tag 0xf
dhfis 0x7E00 dmafis 0x200 sdbfis 0x0
ata2: ATA_REG 0x40 ERR_REG 0x0
ata2: tag : dhfis dmafis sdbfis sacitve
ata2: tag 0x9: 1 1 0 1
ata2: tag 0xa: 1 0 0 1
ata2: tag 0xb: 1 0 0 1
ata2: tag 0xc: 1 0 0 1
ata2: tag 0xd: 1 0 0 1
ata2: tag 0xe: 1 0 0 1
ata2: tag 0xf: 0 0 0 1
ata2.00: exception Emask 0x0 SAct 0x7ffffe03 SErr 0x1800000 action 0x6 frozen
ata2: SError: { LinkSeq TrStaTrns }

Last line is probably the most informative, SATA link sequence error and
transport state transition error. That's probably something bad
happening at the low level between the controller and drive. Is this
happening repeatedly?
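
For anyone wanting to check the SErr value against the decoded line by hand, here is a minimal sketch in Python. The bit positions are the standard SATA SError register layout and the short names are the ones libata prints; the table and the helper name decode_serror are only illustrative, not kernel code.

```python
# Bit positions follow the SATA SError register layout; the short names
# mirror the ones libata prints in its "SError: { ... }" line.
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of all flags set in an SError value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


# SErr value from the exception line above.
print(decode_serror(0x1800000))  # ['LinkSeq', 'TrStaTrns']
```

Bits 23 and 24 are the only ones set in 0x1800000, which matches the { LinkSeq TrStaTrns } line in the log.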
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Robert Hancock on
On Mon, May 31, 2010 at 11:02 PM, Dave Airlie <airlied(a)gmail.com> wrote:
> On Tue, Jun 1, 2010 at 2:59 PM, Robert Hancock <hancockrwd(a)gmail.com> wrote:
>> On 05/31/2010 09:04 PM, Dave Airlie wrote:
>>>
>>> Hi guys,
>>>
>>> I've been running an Intel SSD (the KS one) on my Dell XPS710 desktop
>>> machine, with btrfs on it.
>>>
>>> I'm not sure the btrfs oops isn't due to the disk/controller doing
>>> something bad (almost guaranteed).
>>>
>>> Attached the dmesg + config, using 2.6.34 + only drm patches.
>>>
>>> Jeff I'd be interested in knowing what is happening to the disk before
>>> btrfs oops.
>>
>> ata2: EH in SWNCQ mode,QC:qc_active 0x7FFFFE03 sactive 0x7FFFFE03
>> ata2: SWNCQ:qc_active 0xFE00 defer_bits 0x7FFF0003 last_issue_tag 0xf
>> dhfis 0x7E00 dmafis 0x200 sdbfis 0x0
>> ata2: ATA_REG 0x40 ERR_REG 0x0
>> ata2: tag : dhfis dmafis sdbfis sacitve
>> ata2: tag 0x9: 1 1 0 1
>> ata2: tag 0xa: 1 0 0 1
>> ata2: tag 0xb: 1 0 0 1
>> ata2: tag 0xc: 1 0 0 1
>> ata2: tag 0xd: 1 0 0 1
>> ata2: tag 0xe: 1 0 0 1
>> ata2: tag 0xf: 0 0 0 1
>> ata2.00: exception Emask 0x0 SAct 0x7ffffe03 SErr 0x1800000 action 0x6 frozen
>> ata2: SError: { LinkSeq TrStaTrns }
>>
>> Last line is probably the most informative, SATA link sequence error and
>> transport state transition error. That's probably something bad happening at
>> the low level between the controller and drive. Is this happening
>> repeatedly?
>>
>
> from another boot I do see another one.
>
> ata2: EH in SWNCQ mode,QC:qc_active 0x1FF sactive 0x1FF
> ata2: SWNCQ:qc_active 0x7F defer_bits 0x180 last_issue_tag 0x6
> dhfis 0x3F dmafis 0x8 sdbfis 0x0
> ata2: ATA_REG 0x41 ERR_REG 0x84
> ata2: tag : dhfis dmafis sdbfis sacitve
> ata2: tag 0x0: 1 0 0 1
> ata2: tag 0x1: 1 0 0 1
> ata2: tag 0x2: 1 0 0 1
> ata2: tag 0x3: 1 1 0 1
> ata2: tag 0x4: 1 0 0 1
> ata2: tag 0x5: 1 0 0 1
> ata2: tag 0x6: 0 0 0 1
> ata2.00: exception Emask 0x1 SAct 0x1ff SErr 0x3800000 action 0x6 frozen
>
> So yes, it seems to happen quite a bit. I'm wondering if SWNCQ is
> something I should be disabling for this controller.

Wouldn't hurt to try (swncq=0 module parameter). However, from some of
the later output in the log you posted, it seems that not only was
there some kind of hiccup resulting in a timeout, but later more
SError flags were raised (PHY ready change, CommWake, etc.) and the
drive seemed to stop responding entirely. That does tend to smell like
some kind of hardware problem to me.

btrfs exploding is presumably a kernel problem, of course..
From: Dave Airlie on
On Tue, Jun 1, 2010 at 2:59 PM, Robert Hancock <hancockrwd(a)gmail.com> wrote:
> On 05/31/2010 09:04 PM, Dave Airlie wrote:
>>
>> Hi guys,
>>
>> I've been running an Intel SSD (the KS one) on my Dell XPS710 desktop
>> machine, with btrfs on it.
>>
>> I'm not sure the btrfs oops isn't due to the disk/controller doing
>> something bad (almost guaranteed).
>>
>> Attached the dmesg + config, using 2.6.34 + only drm patches.
>>
>> Jeff I'd be interested in knowing what is happening to the disk before
>> btrfs oops.
>
> ata2: EH in SWNCQ mode,QC:qc_active 0x7FFFFE03 sactive 0x7FFFFE03
> ata2: SWNCQ:qc_active 0xFE00 defer_bits 0x7FFF0003 last_issue_tag 0xf
> dhfis 0x7E00 dmafis 0x200 sdbfis 0x0
> ata2: ATA_REG 0x40 ERR_REG 0x0
> ata2: tag : dhfis dmafis sdbfis sacitve
> ata2: tag 0x9: 1 1 0 1
> ata2: tag 0xa: 1 0 0 1
> ata2: tag 0xb: 1 0 0 1
> ata2: tag 0xc: 1 0 0 1
> ata2: tag 0xd: 1 0 0 1
> ata2: tag 0xe: 1 0 0 1
> ata2: tag 0xf: 0 0 0 1
> ata2.00: exception Emask 0x0 SAct 0x7ffffe03 SErr 0x1800000 action 0x6 frozen
> ata2: SError: { LinkSeq TrStaTrns }
>
> Last line is probably the most informative, SATA link sequence error and
> transport state transition error. That's probably something bad happening at
> the low level between the controller and drive. Is this happening
> repeatedly?
>

from another boot I do see another one.

ata2: EH in SWNCQ mode,QC:qc_active 0x1FF sactive 0x1FF
ata2: SWNCQ:qc_active 0x7F defer_bits 0x180 last_issue_tag 0x6
dhfis 0x3F dmafis 0x8 sdbfis 0x0
ata2: ATA_REG 0x41 ERR_REG 0x84
ata2: tag : dhfis dmafis sdbfis sacitve
ata2: tag 0x0: 1 0 0 1
ata2: tag 0x1: 1 0 0 1
ata2: tag 0x2: 1 0 0 1
ata2: tag 0x3: 1 1 0 1
ata2: tag 0x4: 1 0 0 1
ata2: tag 0x5: 1 0 0 1
ata2: tag 0x6: 0 0 0 1
ata2.00: exception Emask 0x1 SAct 0x1ff SErr 0x3800000 action 0x6 frozen

So yes, it seems to happen quite a bit. I'm wondering if SWNCQ is
something I should be disabling for this controller.
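
Unlike the first dump, this one has non-zero error bits in the ATA_REG / ERR_REG line. A rough sketch of decoding them in Python follows, assuming ATA_REG is the ATA status register and ERR_REG the ATA error register (the log itself does not say so). The bit names are the usual ATA ones; the tables are deliberately partial and the helper name decode_bits is made up for illustration.

```python
# Partial tables of the classic ATA status and error register bits.
# Assumption: ATA_REG in the log is the status register and ERR_REG is
# the error register; bits not listed here are simply ignored.
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 3: "DRQ", 0: "ERR"}
ERROR_BITS = {7: "ICRC", 6: "UNC", 4: "IDNF", 2: "ABRT"}


def decode_bits(value, names):
    """Return the names of the known bits set in value, highest bit first."""
    return [names[bit] for bit in sorted(names, reverse=True) if value & (1 << bit)]


# Values from the second dump (ATA_REG 0x41 ERR_REG 0x84).
print(decode_bits(0x41, STATUS_BITS))  # ['DRDY', 'ERR']
print(decode_bits(0x84, ERROR_BITS))   # ['ICRC', 'ABRT']

# Value from the first dump (ATA_REG 0x40 ERR_REG 0x0): ready, no error flagged.
print(decode_bits(0x40, STATUS_BITS))  # ['DRDY']
```

If that reading of the registers is right, ICRC plus ABRT would mean the drive aborted a command after seeing a CRC error on the interface, which would fit a link-level problem. That is only an interpretation of the raw bits, not a diagnosis.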

Dave.
From: Jeff Garzik on
On 05/31/2010 11:04 PM, Dave Airlie wrote:
> Hi guys,
>
> I've been running an Intel SSD (the KS one) on my Dell XPS710 desktop
> machine, with btrfs on it.
>
> I'm not sure the btrfs oops isn't due to the disk/controller doing
> something bad (almost guaranteed).

The btrfs oops may be poor handling of an I/O error thrown by the block
layer.

Root cause is definitely your SATA PHY throwing some hardware errors
from the transport layer (low level SATA packet transmission failures).
Everything else sorta falls apart after that.

First guesses are the usual suspects: cabling, temperature, power or
SATA ports on the [SATA controller | SATA device] going bad.

Disabling swncq will only improve things from the perspective of slowing
things down and giving the hardware less to do. swncq makes things
parallel, so forcing only one transaction at a time certainly increases
the chances of success by reducing complexity and serializing transactions.
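
To give a feel for how much was in flight when the first error hit, here is a small Python sketch that unpacks the tag bitmasks from the first dump quoted earlier in this thread. Reading "SWNCQ:qc_active" as the commands already issued to the drive and "defer_bits" as the commands the driver was still holding back is an assumption based on the field names; the helper name tags is hypothetical.

```python
def tags(mask):
    """Return the list of NCQ tag numbers whose bit is set in mask."""
    return [t for t in range(32) if mask & (1 << t)]


# Values from the first dump in this thread.
qc_active = 0x7FFFFE03  # all commands outstanding at the libata level
issued = 0xFE00         # "SWNCQ:qc_active": assumed to be issued to the drive
deferred = 0x7FFF0003   # "defer_bits": assumed to be held back by the driver

# The two SWNCQ masks together account for every outstanding command.
assert issued | deferred == qc_active
assert issued & deferred == 0

print(len(tags(qc_active)), len(tags(issued)), len(tags(deferred)))  # 24 7 17
print([hex(t) for t in tags(issued)])
```

The issued tags come out as 0x9 through 0xf, the same seven tags listed in the per-tag table of that dump. With swncq=0 only one command would be outstanding at a time.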

Jeff


