From: Chris Li on
On Tue, Jun 29, 2010 at 4:57 PM, Dan Williams <dan.j.williams(a)intel.com> wrote:
>> 0000:00:0f.0: ioat2_timer_event: Channel halted (10)
>
> This says that we got an invalid chain address error when trying to start
> the engine.  If there was a driver problem with init I would have expected
> to see reports from other systems.  The attached patch will print out what
> chain address we are setting.  The hardware expects a 64-byte aligned
> address which should be guaranteed by the use of pci_pool_alloc().

OK. I can't do this test remotely so I will get back to you tomorrow.

>
> However, if you are up for another experiment, I'd like to see what happens
> if you disable VT-d.  Maybe it is a misconfigured iommu table that is
> blocking the engine's access to memory?

You mean disable VT-d in kernel config or the BIOS?

BTW, I don't know how to disable VT-d in the Mac BIOS. It uses EFI, then
simulates a normal BIOS in Boot Camp mode to boot Linux.

Another stab in the dark: it is a Mac. It has some strange SMI interactions,
like the TSC drifting even after boot. I have noticed that in the past.

Chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Dan Williams on
[ copying David to see if I am barking up the wrong VT-d tree. This is
on a MacPro 3,1 according to dmesg so a 5400 series MCH ]

On 6/29/2010 6:07 PM, Chris Li wrote:
> On Tue, Jun 29, 2010 at 4:57 PM, Dan Williams<dan.j.williams(a)intel.com> wrote:
>>> 0000:00:0f.0: ioat2_timer_event: Channel halted (10)
>>
>> This says that we got an invalid chain address error when trying to start
>> the engine. If there was a driver problem with init I would have expected
>> to see reports from other systems. The attached patch will print out what
>> chain address we are setting. The hardware expects a 64-byte aligned
>> address which should be guaranteed by the use of pci_pool_alloc().
>
> OK. I can't do this test remotely so I will get back to you tomorrow.

I appreciate it!

>
>>
>> However, if you are up for another experiment, I'd like to see what happens
>> if you disable VT-d. Maybe it is a misconfigured iommu table that is
>> blocking the engine's access to memory?
>
> You mean disable VT-d in kernel config or the BIOS?

I was thinking in the BIOS, but appending iommu=off to the kernel
command-line should also do the trick.

> BTW, I don't know how to disable VT-d in the Mac BIOS. It uses EFI, then
> simulates a normal BIOS in Boot Camp mode to boot Linux.
>
> Another stab in the dark: it is a Mac. It has some strange SMI interactions,
> like the TSC drifting even after boot. I have noticed that in the past.

....but the failure is not intermittent, right?

Where it fell over is a pretty straightforward usage of the dma engine
and it is failing on the first transaction that the first channel issues
to memory. You should be able to 'modprobe ioatdma' after you boot and
watch it fail again if my suspicion is correct... if the signature
changes that would also be good to know.

--
Dan
From: Chris Li on
On Tue, Jun 29, 2010 at 9:17 PM, Dan Williams <dan.j.williams(a)intel.com> wrote:
> [ copying David to see if I am barking up the wrong VT-d tree.  This is on a
> MacPro 3,1 according to dmesg so a 5400 series MCH ]
>
> On 6/29/2010 6:07 PM, Chris Li wrote:
>>
>> On Tue, Jun 29, 2010 at 4:57 PM, Dan Williams<dan.j.williams(a)intel.com>
>> OK. I can't do this test remotely so I will get back to you tomorrow.

ioatdma: Intel(R) QuickData Technology Driver 4.00
ioatdma 0000:00:0f.0: can't derive routing for PCI INT A
ioatdma 0000:00:0f.0: PCI INT A: no GSI
ioatdma 0000:00:0f.0: setting latency timer to 64
alloc irq_desc for 57 on node -1
alloc kstat_irqs on node -1
ioatdma 0000:00:0f.0: irq 57 for MSI/MSI-X
alloc irq_desc for 58 on node -1
alloc kstat_irqs on node -1
ioatdma 0000:00:0f.0: irq 58 for MSI/MSI-X
alloc irq_desc for 59 on node -1
alloc kstat_irqs on node -1
ioatdma 0000:00:0f.0: irq 59 for MSI/MSI-X
alloc irq_desc for 60 on node -1
alloc kstat_irqs on node -1
ioatdma 0000:00:0f.0: irq 60 for MSI/MSI-X
ioatdma 0000:00:0f.0: ioat2_set_chainaddr: chainaddr: ffffe000
------------[ cut here ]------------
WARNING: at drivers/dma/ioat/dma_v2.c:289 ioat2_timer_event+0xbc/0x225 [ioatdma]()
Hardware name: MacPro3,1
0000:00:0f.0: ioat2_timer_event: Channel halted (10)
Modules linked in: ioatdma(+) dca fuse rfcomm sco bridge stp llc bnep
l2cap autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf
ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 uinput
snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq
snd_seq_device btusb i5400_edac snd_pcm bluetooth shpchp snd_timer snd
e1000e soundcore rfkill i2c_i801 edac_core iTCO_wdt snd_page_alloc
applesmc i5k_amb iTCO_vendor_support input_polldev firewire_ohci
firewire_core crc_itu_t radeon ttm drm_kms_helper drm i2c_algo_bit
i2c_core [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.35-rc3+ #41
Call Trace:
<IRQ> [<ffffffff8104bdac>] warn_slowpath_common+0x85/0x9d
[<ffffffff8104be67>] warn_slowpath_fmt+0x46/0x48
[<ffffffff810100a5>] ? sched_clock+0x9/0xd
[<ffffffffa03ef55b>] ioat2_timer_event+0xbc/0x225 [ioatdma]
[<ffffffff81069d76>] ? sched_clock_cpu+0xc3/0xce
[<ffffffff81058a6a>] run_timer_softirq+0x1d6/0x2a5
[<ffffffffa03ef49f>] ? ioat2_timer_event+0x0/0x225 [ioatdma]
[<ffffffff8106cc08>] ? ktime_get+0x65/0xbe
[<ffffffff81051ddb>] __do_softirq+0xe9/0x1ae
[<ffffffff81070f70>] ? tick_program_event+0x2a/0x2c
[<ffffffff8100ab1c>] call_softirq+0x1c/0x30
[<ffffffff8100c18a>] do_softirq+0x46/0x83
[<ffffffff81051c48>] irq_exit+0x3b/0x7d
[<ffffffff81433638>] smp_apic_timer_interrupt+0x8d/0x9b
[<ffffffff8100a5d3>] apic_timer_interrupt+0x13/0x20
<EOI> [<ffffffff810115fd>] ? mwait_idle+0x7a/0x87
[<ffffffff810115af>] ? mwait_idle+0x2c/0x87
[<ffffffff81008c1f>] cpu_idle+0xaa/0xe4
[<ffffffff81427eb0>] start_secondary+0x253/0x294
---[ end trace 19d8162e5c74f492 ]---
ioatdma 0000:00:0f.0: Self-test copy timed out, disabling
ioatdma 0000:00:0f.0: Freeing 2 in use descriptors!
ioatdma 0000:00:0f.0: Intel(R) I/OAT DMA Engine init failed
ioatdma 0000:00:0f.0: can't derive routing for PCI INT A

> I was thinking in the BIOS, but appending iommu=off to the kernel
> command-line should also do the trick.

iommu=off causes the kernel not to boot properly. BTW, that is why I lost
my machine remotely last night. Some SATA errors keep printing on
the console. Let me try to collect them once I reboot the machine again.

> ...but the failure is not intermittent, right?

It happens every time.

>
> Where it fell over is a pretty straightforward usage of the dma engine and
> it is failing on the first transaction that the first channel issues to
> memory.  You should be able to 'modprobe ioatdma' after you boot and watch
> it fail again if my suspicion is correct... if the signature changes that
> would also be good to know.

The delta seems to be this line:
ioatdma 0000:00:0f.0: ioat2_set_chainaddr: chainaddr: ffffe000
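The reported chain address can be checked against the 64-byte alignment requirement mentioned earlier in the thread. The snippet below is only an illustrative sketch; the helper name `is_aligned` and the constant `IOAT_CHAIN_ALIGN` are made up for this example and are not taken from the driver.

```python
# Illustrative check only; names here are hypothetical, not driver symbols.
IOAT_CHAIN_ALIGN = 64  # alignment the hardware expects, per the thread


def is_aligned(addr: int, alignment: int = IOAT_CHAIN_ALIGN) -> bool:
    """Return True if addr is a multiple of alignment (a power of two)."""
    return (addr & (alignment - 1)) == 0


chainaddr = 0xFFFFE000  # value printed by ioat2_set_chainaddr
print(hex(chainaddr), "aligned:", is_aligned(chainaddr))
```

Under that check the address is aligned (its low 13 bits are zero), so alignment alone does not appear to explain the halt.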


Chris
From: David Woodhouse on
On Wed, 2010-06-30 at 19:26 +0100, Chris Li wrote:
>
> The delta seems to be this line:
> ioatdma 0000:00:0f.0: ioat2_set_chainaddr: chainaddr: ffffe000

That's a reasonable address if the IOMMU is enabled. We start at 4GiB
and work down, so that's the second page given out (or the first 8KiB
chunk).
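The arithmetic behind that can be sketched as follows. This is a toy model that assumes 4KiB pages and a strictly top-down hand-out from 4GiB; it is not the real IOMMU allocator, and the function names are hypothetical.

```python
# Toy model of top-down address hand-out; not the actual IOMMU code.
PAGE_SIZE = 4096        # assumption: 4KiB pages
TOP = 4 * 1024 ** 3     # 4GiB, the starting point described above


def nth_page_from_top(n: int, top: int = TOP, page_size: int = PAGE_SIZE) -> int:
    """Base address of the n-th page handed out when working down from top."""
    return top - n * page_size


def pages_below_top(addr: int, top: int = TOP, page_size: int = PAGE_SIZE) -> int:
    """How many whole pages lie between addr and top."""
    return (top - addr) // page_size


print(hex(nth_page_from_top(1)))    # first page below 4GiB
print(hex(nth_page_from_top(2)))    # second page below 4GiB
print(pages_below_top(0xFFFFE000))  # pages between the chain address and 4GiB
```

In this model 0xffffe000 is exactly two pages (8KiB) below 4GiB, which matches both readings given above.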

It looks like the DMA is going AWOL causing the initialisation to
fail... but it's interesting that there are no DMA faults reported by
the IOMMU.

--
David Woodhouse Open Source Technology Centre
David.Woodhouse(a)intel.com Intel Corporation

From: Chris Li on
On Wed, Jun 30, 2010 at 11:26 AM, Chris Li <lkml(a)chrisli.org> wrote:
> iommu=off causes the kernel not to boot properly. BTW, that is why I lost
> my machine remotely last night. Some SATA errors keep printing on
> the console. Let me try to collect them once I reboot the machine again.

The errors were flooding the screen very fast. Now they have stopped. I typed
the last one in:

nommu_map_sg: overflow 270e3800+256 of device mask fffffff
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
sr 0:0:0:0: [sr0] CDB: Inquiry: 12 00 00 00 fe 00
ata1.00: cmd a0/01:00:00:fe:00/00/00:00:00:00:00/a0 tag 0 dma 16640 in
res 50/00:03:16:00:00/00:00:00:00:00:00/a0 Emask 0x40
(internal error)
ata1.00: status: {DRDY}
ata1.00: configured for UDMA/166
ata1: EH complete

I get lots of those.
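The nommu_map_sg line reads as "this buffer does not fit under the device's DMA mask". A rough sketch of that kind of check follows, assuming the address and mask are hexadecimal, the size (256) is decimal, and the mask is exactly as typed above. The helper `fits_dma_mask` is hypothetical and the exact comparison the kernel uses may differ.

```python
# Sketch of a DMA-mask range check; the helper name is hypothetical.
def fits_dma_mask(addr: int, size: int, mask: int) -> bool:
    """True if the last byte of [addr, addr + size) is reachable under mask."""
    return addr + size - 1 <= mask


addr = 0x270E3800   # address from the message (assumed hex)
size = 256          # size from the message (assumed decimal)
mask = 0xFFFFFFF    # mask as typed above: seven f's, i.e. 28 bits

print(fits_dma_mask(addr, size, mask))         # mask as reported
print(fits_dma_mask(addr, size, 0xFFFFFFFF))   # for comparison: a full 32-bit mask
```

With the mask as typed, the buffer lies above the addressable range, which is consistent with the overflow message; with a full 32-bit mask the same buffer would fit.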

Chris