BUG in drivers/dma/ioat/dma


From: Chris Li on 28 Jun 2010 20:00

Hi Dan,

My Mac Pro hits this BUG every time it tries to load the ioatdma module.

This was first discovered in the FC 12 & 13 kernels. See Red Hat bug 605845:
https://bugzilla.redhat.com/show_bug.cgi?id=605845. I attached a picture
of the kernel panic to the bug.

The current git tree has it as well, though the line number of the BUG
has changed a little.

	/* when halted due to errors check for channel
	 * programming errors before advancing the completion state
	 */
	if (is_ioat_halted(status)) {
		u32 chanerr;

		chanerr = readl(chan->reg_base + IOAT_CHANERR_OFFSET);
		dev_err(to_dev(chan), "%s: Channel halted (%x)\n",
			__func__, chanerr);
		BUG_ON(is_ioat_bug(chanerr)); <---------------------------------
	}

The machine is a Mac Pro. The bug is 100% reproducible. If I blacklist the
ioatdma module, the kernel boots just fine.

Any suggestions? I am not afraid to try out patches.

Thanks

Chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Dan Williams on 28 Jun 2010 20:50

On 6/28/2010 4:50 PM, Chris Li wrote:
> Hi Dan,
>
> My Mac Pro hits this BUG every time it tries to load the ioatdma module.
>
> This was first discovered in the FC 12 & 13 kernels. See Red Hat bug 605845:
> https://bugzilla.redhat.com/show_bug.cgi?id=605845. I attached a picture
> of the kernel panic to the bug.
>
> The current git tree has it as well, though the line number of the BUG
> has changed a little.
>
>
> 	/* when halted due to errors check for channel
> 	 * programming errors before advancing the completion state
> 	 */
> 	if (is_ioat_halted(status)) {
> 		u32 chanerr;
>
> 		chanerr = readl(chan->reg_base + IOAT_CHANERR_OFFSET);
> 		dev_err(to_dev(chan), "%s: Channel halted (%x)\n",
> 			__func__, chanerr);
> 		BUG_ON(is_ioat_bug(chanerr)); <---------------------------------
> 	}
>
> The machine is a Mac Pro. The bug is 100% reproducible. If I blacklist the
> ioatdma module, the kernel boots just fine.
>
> Any suggestions? I am not afraid to try out patches.
>

Looks like that dev_err() did not make it to the console. The attached
patch should get us some more debug information. This will stop the
driver from making forward progress (applies to current -git). I
suspect this may be triggering from the driver self test, but to be safe
you should set CONFIG_NET_DMA=n and CONFIG_ASYNC_TX_DMA=n.

--
Dan

From: Chris Li on 29 Jun 2010 03:20

On Mon, Jun 28, 2010 at 5:45 PM, Dan Williams <dan.j.williams(a)intel.com> wrote:
> Looks like that dev_err() did not make it to the console. The attached
> patch should get us some more debug information. This will stop the driver
> from making forward progress (applies to current -git). I suspect this may
> be triggering from the driver self test, but to be safe you should set
> CONFIG_NET_DMA=n and CONFIG_ASYNC_TX_DMA=n.

I will try that tomorrow and get back to you.

Chris

From: Chris Li on 29 Jun 2010 19:30

On Mon, Jun 28, 2010 at 5:45 PM, Dan Williams <dan.j.williams(a)intel.com> wrote:
> Looks like that dev_err() did not make it to the console. The attached
> patch should get us some more debug information. This will stop the driver
> from making forward progress (applies to current -git). I suspect this may
> be triggering from the driver self test, but to be safe you should set
> CONFIG_NET_DMA=n and CONFIG_ASYNC_TX_DMA=n.

OK, with the patch the kernel no longer panics.

Here is the printk output from ioatdma.

ioatdma: Intel(R) QuickData Technology Driver 4.00
ioatdma 0000:00:0f.0: can't derive routing for PCI INT A
ioatdma 0000:00:0f.0: PCI INT A: no GSI
ioatdma 0000:00:0f.0: setting latency timer to 64
alloc irq_desc for 57 on node -1
alloc kstat_irqs on node -1
ioatdma 0000:00:0f.0: irq 57 for MSI/MSI-X
alloc irq_desc for 58 on node -1
alloc kstat_irqs on node -1
ioatdma 0000:00:0f.0: irq 58 for MSI/MSI-X
alloc irq_desc for 59 on node -1
alloc kstat_irqs on node -1
ioatdma 0000:00:0f.0: irq 59 for MSI/MSI-X
alloc irq_desc for 60 on node -1
alloc kstat_irqs on node -1
ioatdma 0000:00:0f.0: irq 60 for MSI/MSI-X
------------[ cut here ]------------
WARNING: at drivers/dma/ioat/dma_v2.c:289 ioat2_timer_event+0xbc/0x225 [ioatdma]()
Hardware name: MacPro3,1
0000:00:0f.0: ioat2_timer_event: Channel halted (10)
Modules linked in: ioatdma(+) dca fuse rfcomm sco bridge stp llc bnep
l2cap autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf
ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 uinput
e1000e snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep
snd_seq i5400_edac snd_seq_device snd_pcm snd_timer snd edac_core
btusb bluetooth rfkill soundcore i5k_amb i2c_i801 shpchp applesmc
snd_page_alloc iTCO_wdt iTCO_vendor_support input_polldev
firewire_ohci firewire_core crc_itu_t radeon ttm drm_kms_helper drm
i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.35-rc3+ #41
Call Trace:
<IRQ> [<ffffffff8104bdac>] warn_slowpath_common+0x85/0x9d
[<ffffffff8104be67>] warn_slowpath_fmt+0x46/0x48
[<ffffffff810100a5>] ? sched_clock+0x9/0xd
[<ffffffffa03ee4ed>] ioat2_timer_event+0xbc/0x225 [ioatdma]
[<ffffffff81069d76>] ? sched_clock_cpu+0xc3/0xce
[<ffffffff81058a6a>] run_timer_softirq+0x1d6/0x2a5
[<ffffffffa03ee431>] ? ioat2_timer_event+0x0/0x225 [ioatdma]
[<ffffffff8106cc08>] ? ktime_get+0x65/0xbe
[<ffffffff81051ddb>] __do_softirq+0xe9/0x1ae
[<ffffffff81070f70>] ? tick_program_event+0x2a/0x2c
[<ffffffff8100ab1c>] call_softirq+0x1c/0x30
[<ffffffff8100c18a>] do_softirq+0x46/0x83
[<ffffffff81051c48>] irq_exit+0x3b/0x7d
[<ffffffff81433638>] smp_apic_timer_interrupt+0x8d/0x9b
[<ffffffff8100a5d3>] apic_timer_interrupt+0x13/0x20
<EOI> [<ffffffff810115fd>] ? mwait_idle+0x7a/0x87
[<ffffffff810115af>] ? mwait_idle+0x2c/0x87
[<ffffffff81008c1f>] cpu_idle+0xaa/0xe4
[<ffffffff81427eb0>] start_secondary+0x253/0x294
---[ end trace 69aa12150c49792c ]---
ioatdma 0000:00:0f.0: Self-test copy timed out, disabling
ioatdma 0000:00:0f.0: Freeing 2 in use descriptors!
ioatdma 0000:00:0f.0: Intel(R) I/OAT DMA Engine init failed
ioatdma 0000:00:0f.0: can't derive routing for PCI INT A

I have attached the full dmesg in case you need it. Is it possible that
the Mac Pro is MSI only and ioatdma is not happy about that?

Chris

From: Dan Williams on 29 Jun 2010 20:00

On 6/29/2010 4:20 PM, Chris Li wrote:
> On Mon, Jun 28, 2010 at 5:45 PM, Dan Williams <dan.j.williams(a)intel.com> wrote:
>> Looks like that dev_err() did not make it to the console. The attached
>> patch should get us some more debug information. This will stop the driver
>> from making forward progress (applies to current -git). I suspect this may
>> be triggering from the driver self test, but to be safe you should set
>> CONFIG_NET_DMA=n and CONFIG_ASYNC_TX_DMA=n.
>
> OK, with the patch the kernel no longer panics.
>
> Here is the printk output from ioatdma.
>

Thanks.

[..]
> 0000:00:0f.0: ioat2_timer_event: Channel halted (10)

This says that we got an invalid chain address error when trying to
start the engine. If there was a driver problem with init I would have
expected to see reports from other systems. The attached patch will
print out what chain address we are setting. The hardware expects a
64-byte aligned address which should be guaranteed by the use of
pci_pool_alloc().
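
As a rough illustration of the two checks described here, the following is a minimal userspace sketch, not driver code. The macro and function names are made up for this example; the 0x10 bit is taken from the "Channel halted (10)" line in the log (printed with %x) and from the reading of it as an invalid chain address, and the 64-byte figure is the alignment stated above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: "Channel halted (10)" is hex, so the reported error bit is
 * 0x10, read in this thread as an invalid chain address. The macro name
 * is illustrative and is not copied from the driver headers. */
#define EXAMPLE_CHANERR_CHAIN_ADDR 0x10u

/* Chain address alignment the hardware expects, in bytes. */
#define EXAMPLE_CHAIN_ADDR_ALIGN 64u

/* Returns true if the channel error value has the chain-address bit set. */
static inline bool example_is_chain_addr_err(uint32_t chanerr)
{
	return (chanerr & EXAMPLE_CHANERR_CHAIN_ADDR) != 0;
}

/* Returns true if a descriptor chain address is 64-byte aligned,
 * i.e. its low six bits are all zero. */
static inline bool example_chain_addr_aligned(uint64_t addr)
{
	return (addr & (EXAMPLE_CHAIN_ADDR_ALIGN - 1)) == 0;
}
```

With the value from the log, example_is_chain_addr_err(0x10) is true. If the chain address printed by the debug patch passes example_chain_addr_aligned(), alignment can be ruled out as the cause.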

However, if you are up for another experiment, I'd like to see what
happens if you disable VT-d. Maybe it is a misconfigured iommu table
that is blocking the engine's access to memory?

> I attach the full dmesg in case you need it. Is it possible that
> the Mac Pro is MSI only and ioatdma is not happy about that?

Not really. MSI is the preferred mode of operation, and as I said
earlier, if something like this were broken I would expect reports from
other platforms.

--
Dan

BUG in drivers/dma/ioat/dma_v2.c:314