x86/pci Oops with CONFIG_SND_HDA_INTEL


From: Yinghai on 19 May 2010 20:10

On 05/19/2010 03:47 PM, Graham Ramsey wrote:
> On 19/05/10 19:01, Yinghai wrote:
>> On 05/19/2010 10:16 AM, Graham Ramsey wrote:
>>
>>> On 19/05/10 17:44, Bjorn Helgaas wrote:
>>>
>>>> On Wednesday, May 19, 2010 09:13:24 am Graham Ramsey wrote:
>>>>
>>>>
>>>>> I am on x86_64 with the latest (v2.6.34) kernel. When I set
>>>>> CONFIG_SND_HDA_INTEL=y, it hangs at an early stage of boot with a kernel
>>>>> oops.
>>>>> When I use CONFIG_SND_HDA_INTEL=m, the machine boots, and I get the
>>>>> dmesg (below).
>>>>>
>>>>> I have bisected down to one commit that causes the problem:
>>>>>
>>>>> commit 3e3da00c01d050307e753fb7b3e84aefc16da0d0
>>>>> x86/pci: AMD one chain system to use pci read out res
>>>>> ...
>>>>>
>>>>>
>>>> I CC'd Yinghai, the author of that patch. That commit went in after
>>>> 2.6.33, so this is probably a regression between .33 and .34. Can
>>>> you open a report at https://bugzilla.kernel.org and respond to this
>>>> thread with the URL?
>>>>
>>>> Please attach the complete dmesg (with SND_HDA_INTEL=m) to the
>>>> bugzilla.
>>>>
>>>> Thanks a lot for your report!
>>>>
>>>>
>> Please send the boot log with pci=earlydump.
>>
>> It looks like your system has a very sick BIOS;
>>
>> the system has two HT chains.
>>
>> PCI: Probing PCI hardware (bus 00)
>> ...
>> PCI: Discovered primary peer bus 80 [IRQ]
>>
>>
>> rt to non-coherent only set one link:
>> node 0 link 0: io port [1000, ffffff]
>> TOM: 0000000080000000 aka 2048M
>> node 0 link 0: mmio [e0000000, efffffff]
>> node 0 link 0: mmio [a0000, bffff]
>> node 0 link 0: mmio [80000000, ffffffff]
>> bus: [00, ff] on node 0 link 0
>>
>> YH
>>
>>
> I have uploaded the full boot log (of a working kernel) to the bug, if that is OK:
>
> https://bugzilla.kernel.org/attachment.cgi?id=26444
>

Ah, that 80:01.0 is a standalone device; the system still has only one HT chain.

It is CRAZY that they can sell such poorly designed chips.

Actually, 3e3da00c fixes another bug on a system with one HT chain.

Jesse,
We have two options:
1. revert 3e3da00c
2. or use quirks to blacklist systems with VIA chipsets.

Please let me know which one you prefer.

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Jesse Barnes on 19 May 2010 20:30

On Wed, 19 May 2010 17:03:04 -0700
Yinghai <yinghai.lu(a)oracle.com> wrote:

> Ah, that 80:01.0 is a standalone device; the system still has only one HT chain.
>
> It is CRAZY that they can sell such poorly designed chips.
>
> Actually, 3e3da00c fixes another bug on a system with one HT chain.
>
> Jesse,
> We have two options:
> 1. revert 3e3da00c
> 2. or use quirks to blacklist systems with VIA chipsets.
>
> Please let me know which one you prefer.

I'm guessing these VIA chipsets are pretty common; how common is the
platform bug you fixed with 3e3da00c?

I'd rather quirk one platform than a whole bunch...

--
Jesse Barnes, Intel Open Source Technology Center

From: Yinghai on 19 May 2010 20:40

On 05/19/2010 05:22 PM, Jesse Barnes wrote:
> On Wed, 19 May 2010 17:03:04 -0700
> Yinghai <yinghai.lu(a)oracle.com> wrote:
>
>> Ah, that 80:01.0 is a standalone device; the system still has only one HT chain.
>>
>> It is CRAZY that they can sell such poorly designed chips.
>>
>> Actually, 3e3da00c fixes another bug on a system with one HT chain.
>>
>> Jesse,
>> We have two options:
>> 1. revert 3e3da00c
>> 2. or use quirks to blacklist systems with VIA chipsets.
>>
>> Please let me know which one you prefer.
>
> I'm guessing these VIA chipsets are pretty common; how common is the
> platform bug you fixed with 3e3da00c?

One AMD 64-bit laptop with FireWire. I cannot find the mail any more.

>
> I'd rather quirk one platform than a whole bunch...

Maybe you can revert that patch first.

Thanks

Yinghai

From: Bjorn Helgaas on 20 May 2010 13:10

> >>>> It looks like your system has a very sick BIOS;
> >>>>
> >>>> the system has two HT chains.
> >>>>
> >>>> PCI: Probing PCI hardware (bus 00)
> >>>> PCI: Discovered primary peer bus 80 [IRQ]
> >>>>
> >>>> rt to non-coherent only set one link:
> >>>> node 0 link 0: io port [1000, ffffff]
> >>>> TOM: 0000000080000000 aka 2048M
> >>>> node 0 link 0: mmio [e0000000, efffffff]
> >>>> node 0 link 0: mmio [a0000, bffff]
> >>>> node 0 link 0: mmio [80000000, ffffffff]
> >>>> bus: [00, ff] on node 0 link 0

> >> Ah, that 80:01.0 is a standalone device; the system still has only one HT chain.
> >> It is CRAZY that they can sell such poorly designed chips.
> >>
> >> Actually, 3e3da00c fixes another bug on a system with one HT chain.
> >>
> >> We have two options:
> >> 1. revert 3e3da00c
> >> 2. or use quirks to blacklist systems with VIA chipsets.

This is voodoo kernel development, and I don't think we should do it.

Can you explain the cause of Graham's oops? All I can see is that we
discovered a host bridge window of [mem 0x80000000-0xfcffffffff] to
bus 00, we did *not* find a bridge leading to bus 80, we found a device
on bus 80 that is inside the window forwarded to bus 00, so we moved
that device outside the window:

bus: 00 index 1 [mem 0x80000000-0xfcffffffff]
pci 0000:80:01.0: reg 10: [mem 0xfebfc000-0xfebfffff 64bit]
pci 0000:80:01.0: address space collision: [mem 0xfebfc000-0xfebfffff 64bit] conflicts with PCI Bus #00 [mem 0x80000000-0xfcffffffff]
pci 0000:80:01.0: BAR 0: set to [mem 0xfd00000000-0xfd00003fff 64bit]

I have no idea why this led to a page fault at ffffc90000078000:

BUG: unable to handle kernel paging request at ffffc90000078000
IP: [<ffffffffa0018d11>] azx_probe+0x3a2/0xa6a [snd_hda_intel]
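
The address space collision logged for 0000:80:01.0 comes down to interval arithmetic on the ranges the kernel printed. The following is a minimal sketch of that check, not kernel code: the helper names parse_mem_range and overlaps are made up for illustration, and the only inputs are the three log lines quoted in this message.

```python
import re

# Matches a resource such as "[mem 0x80000000-0xfcffffffff]" or
# "[mem 0xfebfc000-0xfebfffff 64bit]" as it appears in the quoted dmesg lines.
RANGE_RE = re.compile(r"\[mem 0x([0-9a-f]+)-0x([0-9a-f]+)")


def parse_mem_range(text):
    """Return (start, end) as integers for the first [mem ...] range in text.

    Hypothetical helper for illustration; the end address is inclusive,
    matching the way the kernel prints resources.
    """
    match = RANGE_RE.search(text)
    if match is None:
        raise ValueError("no [mem ...] range in: " + text)
    return int(match.group(1), 16), int(match.group(2), 16)


def overlaps(a, b):
    """True if two inclusive (start, end) ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


# Log lines copied from this thread.
window = parse_mem_range("bus: 00 index 1 [mem 0x80000000-0xfcffffffff]")
old_bar = parse_mem_range(
    "pci 0000:80:01.0: reg 10: [mem 0xfebfc000-0xfebfffff 64bit]")
new_bar = parse_mem_range(
    "pci 0000:80:01.0: BAR 0: set to [mem 0xfd00000000-0xfd00003fff 64bit]")

print(overlaps(window, old_bar))         # True: BIOS-assigned BAR is inside the bus 00 window
print(overlaps(window, new_bar))         # False: reassigned BAR starts just past the window
print(hex(old_bar[1] - old_bar[0] + 1))  # 0x4000: the BAR is 16 KiB
```

The sketch only shows why the kernel saw a conflict and where it moved the BAR; it says nothing about whether the device actually decodes the new address, which is the open question here.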

It looks to me like amd_bus.c just failed to discover the host bridge
to bus 80. If the BIOS can program the chipset to work that way, we
should be able to figure that out, too.

Graham, I think your "pci=earlydump" log is missing the KERN_DEBUG
output. It would be interesting to see that for the patched kernel
so we can compare it with 2.6.34.
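
One way to do that comparison, sketched under assumptions: strip the printk timestamps, keep only the PCI-related lines, and diff the two logs. The function names and the "pci" keyword filter are illustrative choices, and the two short sample logs are made-up inputs assembled from lines quoted in this thread with invented timestamps; they are not Graham's actual logs.

```python
import difflib
import re

# Leading "[    0.123456] " timestamp that printk adds when timestamps are enabled.
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def strip_timestamps(lines):
    """Drop the leading timestamp and trailing newline from each log line."""
    return [TIMESTAMP_RE.sub("", line.rstrip("\n")) for line in lines]


def diff_dmesg(old_lines, new_lines, keyword="pci"):
    """Return unified-diff lines, restricted to messages mentioning keyword."""
    def select(lines):
        return [l for l in strip_timestamps(lines) if keyword in l.lower()]

    return list(difflib.unified_diff(select(old_lines), select(new_lines),
                                     fromfile="old", tofile="new", lineterm=""))


# Hypothetical sample input: timestamps are invented placeholders.
old_log = [
    "[    0.500000] pci 0000:80:01.0: reg 10: [mem 0xfebfc000-0xfebfffff 64bit]\n",
]
new_log = [
    "[    0.612345] pci 0000:80:01.0: reg 10: [mem 0xfebfc000-0xfebfffff 64bit]\n",
    "[    0.700000] pci 0000:80:01.0: BAR 0: set to [mem 0xfd00000000-0xfd00003fff 64bit]\n",
]

for line in diff_dmesg(old_log, new_log):
    print(line)
```

With real logs, the lists would come from reading the two saved dmesg files with readlines(); lines that differ only in their timestamp drop out of the diff.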

Bjorn

From: Bjorn Helgaas on 2 Jun 2010 13:00

I think the basic problem is that Yinghai's patch broke your system,
and this is a regression between 2.6.33 and 2.6.34.

We could use a quirk like yours (which looks fine, BTW) to cover up
this regression, but I don't like that approach because other machines
are probably affected by the same issue, and we'd have to find and
fix them one-by-one.

I think it'd be better to figure out the problem with 3e3da00c01d
and fix or revert it. I said earlier that I wasn't in favor of just
reverting it, and I still don't like that option because it will
likely break something. But Yinghai didn't supply any details about
the system that 3e3da00c01d fixed, so I don't know how to fix things
so both that system and yours work.

I assume that 2.6.34 with 3e3da00c01d reverted will work fine even
without "pci=use_crs". Can you try that and attach the dmesg log?
