From: Bjorn Helgaas on
Oops, I added Yinghai to the CC: list, but I forgot to add
linux-pci(a)vger.kernel.org. Please add that on any future replies.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Bjorn Helgaas on
On Wednesday, May 12, 2010 12:14:32 pm Mike Travis wrote:
> Subject: [Patch 1/1] x86 pci: Add option to not assign BAR's if not already assigned
> From: Mike Habeck <habeck(a)sgi.com>
>
> The Linux kernel assigns BARs that a BIOS did not assign, most likely
> to handle broken BIOSes that didn't enumerate the devices correctly.
> On UV the BIOS purposely doesn't assign I/O BARs for certain devices/
> drivers we know don't use them (examples, LSI SAS, Qlogic FC, ...).
> We purposely don't assign these I/O BARs because I/O Space is a very
> limited resource. There is only 64k of I/O Space, and in a PCIe
> topology that space gets divided up into 4k chunks (this is due to
> the fact that a pci-to-pci bridge's I/O decoder is aligned at 4k)...
> Thus a system can have at most 16 cards with I/O BARs: (64k / 4k = 16)
>
> SGI needs to scale to >16 devices with I/O BARs. So by not assigning
> I/O BARs on devices we know don't use them, we can do that (iff the
> kernel doesn't go and assign these BARs that the BIOS purposely didn't
> assign).
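
For readers following the arithmetic in the changelog, here is a minimal
sketch of the 64k / 4k calculation. The macro and function names are made
up for illustration (they are not kernel symbols), and it assumes the
simple case of one bridge window per card.

```c
#include <assert.h>

/* Sizes taken from the changelog; the names are illustrative only. */
#define IO_SPACE_SIZE   0x10000u	/* 64k of I/O port space */
#define BRIDGE_IO_ALIGN 0x1000u		/* bridge I/O window granularity: 4k */

/* Round a BAR size up to the bridge window granularity. */
static unsigned int bridge_window_size(unsigned int bar_size)
{
	assert(bar_size > 0);
	return (bar_size + BRIDGE_IO_ALIGN - 1) & ~(BRIDGE_IO_ALIGN - 1);
}

/* How many cards can each sit behind a bridge with its own I/O window. */
static unsigned int max_io_windows(unsigned int bar_size)
{
	return IO_SPACE_SIZE / bridge_window_size(bar_size);
}
```

With an example BAR size of 0x80 bytes, max_io_windows(0x80) evaluates to
16: even a small I/O BAR costs its bridge a full 4k window.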

I don't quite understand this part. If you boot with "pci=nobar",
the BIOS doesn't assign BARs, Linux doesn't either, the drivers
don't need them -- everything works, and that makes sense so far.

Now, if you boot normally (without "pci=nobar"), what changes?
The BIOS situation is the same, but Linux tries to assign the
unassigned BARs. It may assign a few before running out of space,
but the drivers still don't need those BARs. What breaks?

> This patch will not assign a resource to a device BAR if that BAR was
> not assigned by the BIOS, and the kernel cmdline option 'pci=nobar'
> was specified. This patch is closely modeled after the 'pci=norom'
> option that currently exists in the tree.

Can't we figure out whether we need this ourselves? Using a command-
line option just guarantees that we'll forever be writing customer
advisories about this issue.

This issue is not specific to x86, so I don't really like having
the implementation be x86-specific.

Do we know anything about how other OSes handle this case of I/O
space exhaustion?

I'm a little bit nervous about Linux's current strategy of assigning
resources to things before we even know whether we're going to use
them. We don't support dynamic PCI resource reassignment, so maybe
we don't have any choice in this case, but generally I prefer the
lazy approach.

Bjorn

> Signed-off-by: Mike Habeck <habeck(a)sgi.com>
> Signed-off-by: Mike Travis <travis(a)sgi.com>
> ---
> Documentation/kernel-parameters.txt | 2 ++
> arch/x86/include/asm/pci_x86.h | 1 +
> arch/x86/pci/common.c | 20 ++++++++++++++++++++
> 3 files changed, 23 insertions(+)
>
> --- linux.orig/Documentation/kernel-parameters.txt
> +++ linux/Documentation/kernel-parameters.txt
> @@ -1935,6 +1935,8 @@ and is between 256 and 4096 characters.
> norom [X86] Do not assign address space to
> expansion ROMs that do not already have
> BIOS assigned address ranges.
> + nobar [X86] Do not assign address space to the
> + BARs that weren't assigned by the BIOS.
> irqmask=0xMMMM [X86] Set a bit mask of IRQs allowed to be
> assigned automatically to PCI devices. You can
> make the kernel exclude IRQs of your ISA cards
> --- linux.orig/arch/x86/include/asm/pci_x86.h
> +++ linux/arch/x86/include/asm/pci_x86.h
> @@ -30,6 +30,7 @@
> #define PCI_HAS_IO_ECS 0x40000
> #define PCI_NOASSIGN_ROMS 0x80000
> #define PCI_ROOT_NO_CRS 0x100000
> +#define PCI_NOASSIGN_BARS 0x200000
>
> extern unsigned int pci_probe;
> extern unsigned long pirq_table_addr;
> --- linux.orig/arch/x86/pci/common.c
> +++ linux/arch/x86/pci/common.c
> @@ -125,6 +125,23 @@ void __init dmi_check_skip_isa_align(voi
> static void __devinit pcibios_fixup_device_resources(struct pci_dev *dev)
> {
> 	struct resource *rom_r = &dev->resource[PCI_ROM_RESOURCE];
> +	struct resource *bar_r;
> +	int bar;
> +
> +	if (pci_probe & PCI_NOASSIGN_BARS) {
> +		/*
> +		 * If the BIOS did not assign the BAR, zero out the
> +		 * resource so the kernel doesn't attempt to assign
> +		 * it later on in pci_assign_unassigned_resources
> +		 */
> +		for (bar = 0; bar <= PCI_STD_RESOURCE_END; bar++) {
> +			bar_r = &dev->resource[bar];
> +			if (bar_r->start == 0 && bar_r->end != 0) {
> +				bar_r->flags = 0;
> +				bar_r->end = 0;
> +			}
> +		}
> +	}
>
> 	if (pci_probe & PCI_NOASSIGN_ROMS) {
> 		if (rom_r->parent)
> @@ -509,6 +526,9 @@ char * __devinit pcibios_setup(char *st
> 	} else if (!strcmp(str, "norom")) {
> 		pci_probe |= PCI_NOASSIGN_ROMS;
> 		return NULL;
> +	} else if (!strcmp(str, "nobar")) {
> +		pci_probe |= PCI_NOASSIGN_BARS;
> +		return NULL;
> 	} else if (!strcmp(str, "assign-busses")) {
> 		pci_probe |= PCI_ASSIGN_ALL_BUSSES;
> 		return NULL;
>
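
The loop added to pcibios_fixup_device_resources() can be tried outside
the kernel with a small userspace model. This is only a sketch: struct
model_resource, MODEL_STD_RESOURCE_END and model_clear_unassigned_bars()
are hypothetical stand-ins, not the kernel's struct resource or any real
kernel API. Note that, like the patch, the test does not look at the
resource type, so it applies to memory BARs as well as I/O BARs.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's types; NOT the real definitions. */
#define MODEL_STD_RESOURCE_END 5	/* BARs 0..5, as in PCI_STD_RESOURCE_END */

struct model_resource {
	unsigned long start;
	unsigned long end;
	unsigned long flags;
};

/*
 * Same test as the patch: a BAR that has a size (end != 0) but no
 * address (start == 0) is treated as "not assigned by the BIOS" and is
 * cleared so that a later assignment pass would skip it.
 * Returns the number of BARs cleared.
 */
static int model_clear_unassigned_bars(struct model_resource *res, size_t n)
{
	int cleared = 0;
	size_t bar;

	assert(res != NULL);
	for (bar = 0; bar < n && bar <= MODEL_STD_RESOURCE_END; bar++) {
		if (res[bar].start == 0 && res[bar].end != 0) {
			res[bar].flags = 0;
			res[bar].end = 0;
			cleared++;
		}
	}
	return cleared;
}
```

A BAR with start = 0x1000 is left alone, while one with start = 0 and
end = 0x7f has its flags and end zeroed. A second call clears nothing,
because the cleared entries no longer have end != 0.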
From: Bjorn Helgaas on
On Thursday, May 13, 2010 01:12:21 pm Mike Travis wrote:
>
> Bjorn Helgaas wrote:
> > On Wednesday, May 12, 2010 12:14:32 pm Mike Travis wrote:
> >> Subject: [Patch 1/1] x86 pci: Add option to not assign BAR's if not already assigned
> >> From: Mike Habeck <habeck(a)sgi.com>
> >>
> >> The Linux kernel assigns BARs that a BIOS did not assign, most likely
> >> to handle broken BIOSes that didn't enumerate the devices correctly.
> >> On UV the BIOS purposely doesn't assign I/O BARs for certain devices/
> >> drivers we know don't use them (examples, LSI SAS, Qlogic FC, ...).
> >> We purposely don't assign these I/O BARs because I/O Space is a very
> >> limited resource. There is only 64k of I/O Space, and in a PCIe
> >> topology that space gets divided up into 4k chunks (this is due to
> >> the fact that a pci-to-pci bridge's I/O decoder is aligned at 4k)...
> >> Thus a system can have at most 16 cards with I/O BARs: (64k / 4k = 16)
> >>
> >> SGI needs to scale to >16 devices with I/O BARs. So by not assigning
> >> I/O BARs on devices we know don't use them, we can do that (iff the
> >> kernel doesn't go and assign these BARs that the BIOS purposely didn't
> >> assign).
> >
> > I don't quite understand this part. If you boot with "pci=nobar",
> > the BIOS doesn't assign BARs, Linux doesn't either, the drivers
> > don't need them -- everything works, and that makes sense so far.
> >
> > Now, if you boot normally (without "pci=nobar"), what changes?
> > The BIOS situation is the same, but Linux tries to assign the
> > unassigned BARs. It may assign a few before running out of space,
> > but the drivers still don't need those BARs. What breaks?
>
> The problem arises because we run out of address spaces to assign.
>
> Say you have 24 cards, and the 1st 16 do not use I/O BARs. If
> you assign the available 16 address spaces to cards that may not
> need them, then the final 8 cards will not be available.

It sounds like your BIOS treats some devices specially, so I assumed
it would leave the first sixteen devices unassigned, but would assign
the last eight, including the bridge windows leading to them. In that
case, I would expect Linux to preserve the resources of the last
eight devices, since they're already assigned, and assign anything
left over to the first sixteen.

Are you saying that Linux clobbers the resources of the last eight
devices in the process of assigning the first sixteen? If so, I'd
say that's a Linux bug.

Or are the last eight hot-added cards that the BIOS never had a
chance to assign? That's definitely a problem.

> > Can't we figure out whether we need this ourselves? Using a command-
> > line option just guarantees that we'll forever be writing customer
> > advisories about this issue.
>
> Since this is so specific (the potential of having more than 16
> cards is something the customer would know about), I think it's
> better to err on the safe side. If a BIOS does not recognize an
> add-in card (for whatever reason) and does not assign the I/O BAR,
> then it would be up to the kernel to do that. Wouldn't you get more
> customer complaints about non-working I/O than about someone with
> more than 16 PCI cards not being able to use them all?

It feels specific now, but in five years, I bet it won't be so
unusual. I really don't want to force customers to figure out
when they need this.

> > This issue is not specific to x86, so I don't really like having
> > the implementation be x86-specific.
>
> We were going for as light a touch as possible, as there is no
> time to verify other arches. I'd be glad to submit a follow-on
> patch dealing with the generic case and depend on others for
> testing, if that's of interest.

It's of interest to me. I spend a lot of time pulling generic code
out of architecture-specific places. If there's stuff that we
know is generic from the beginning, we shouldn't make work for
ourselves by making it x86-specific.

> > I'm a little bit nervous about Linux's current strategy of assigning
> > resources to things before we even know whether we're going to use
> > them. We don't support dynamic PCI resource reassignment, so maybe
> > we don't have any choice in this case, but generally I prefer the
> > lazy approach.
>
> That's a great idea if it can work. Unfortunately, we are all tied
> to the way BIOS sets up the system, and for UV systems I don't think
> dynamic provisioning would work. There's too much infrastructure
> that all has to cooperate by the time the system is fully functional.

Like I said, we maybe don't have a choice in this case, but I'd like
to have a clearer understanding of the problem and how other OSes
deal with it before we start applying band-aids that will hurt when
we pull them off later.

Bjorn
From: Bjorn Helgaas on
On Thursday, May 13, 2010 01:38:24 pm Mike Habeck wrote:
> Bjorn Helgaas wrote:
> > On Wednesday, May 12, 2010 12:14:32 pm Mike Travis wrote:
> >> Subject: [Patch 1/1] x86 pci: Add option to not assign BAR's if not already assigned
> >> From: Mike Habeck <habeck(a)sgi.com>
> >>
> >> The Linux kernel assigns BARs that a BIOS did not assign, most likely
> >> to handle broken BIOSes that didn't enumerate the devices correctly.
> >> On UV the BIOS purposely doesn't assign I/O BARs for certain devices/
> >> drivers we know don't use them (examples, LSI SAS, Qlogic FC, ...).
> >> We purposely don't assign these I/O BARs because I/O Space is a very
> >> limited resource. There is only 64k of I/O Space, and in a PCIe
> >> topology that space gets divided up into 4k chunks (this is due to
> >> the fact that a pci-to-pci bridge's I/O decoder is aligned at 4k)...
> >> Thus a system can have at most 16 cards with I/O BARs: (64k / 4k = 16)
> >>
> >> SGI needs to scale to >16 devices with I/O BARs. So by not assigning
> >> I/O BARs on devices we know don't use them, we can do that (iff the
> >> kernel doesn't go and assign these BARs that the BIOS purposely didn't
> >> assign).
> >
> > I don't quite understand this part. If you boot with "pci=nobar",
> > the BIOS doesn't assign BARs, Linux doesn't either, the drivers
> > don't need them -- everything works, and that makes sense so far.
> >
> > Now, if you boot normally (without "pci=nobar"), what changes?
> > The BIOS situation is the same, but Linux tries to assign the
> > unassigned BARs. It may assign a few before running out of space,
> > but the drivers still don't need those BARs. What breaks?
>
> Nothing really breaks, it's more of a problem that the kernel uses
> up the rest of the I/O Space, and starts spitting out warning
> messages as it tries to assign the rest of the I/O BARs that the
> BIOS didn't assign, something like:
>
> pci 0010:03:00.0: BAR 5: can't allocate I/O resource [0x0-0x7f]
> pci 0012:05:00.0: BAR 5: can't allocate I/O resource [0x0-0x7f]
> ...

OK, that's what I would expect. Personally, I think I'd *like*
to have those messages. If 0010:03:00.0 is a device whose driver
depends on I/O space, the message will be a good clue as to why
the driver isn't working.

> And in using up all the I/O space, I think that could prevent a
> hotplug attach of a pci device requiring I/O space (although I
> believe most BIOSes pad the bridge decoders to support that).
> I'm not too familiar with how pci hotplug works on x86 so I may
> be wrong in what I just stated.

Yep, that's definitely a problem, and I don't have a good solution.

HP (and probably SGI) had a nice hardware solution for ia64 --
address translation across the host bridge, so each bridge could
have its own 64K I/O space. But I don't see that coming in the
x86 PC arena.

> > This issue is not specific to x86, so I don't really like having
> > the implementation be x86-specific.
>
> I agree this isn't an x86-specific issue but given the 'norom'
> cmdline option is basically doing the same thing (but for pci
> Expansion ROM BARs) this code was modeled after it.

IMHO, we should fix both.

Bjorn
From: Bjorn Helgaas on
[Re-added linux-pci, which got lost again somewhere.]

On Monday, May 31, 2010 05:12:00 am Mike Travis wrote:
> H. Peter Anvin wrote:
> > On 05/28/2010 10:10 AM, Mike Travis wrote:
> >> H. Peter Anvin wrote:
> >>> On 05/28/2010 09:53 AM, Mike Travis wrote:
> >>>> Any further consideration for this patch, or has it been rejected?

I'm disappointed that you didn't rework this to make it generic,
not x86-specific. That would be pretty easy and would remove
the need for somebody else to come and clean it up later.

> >>> Well, it's really up to Jesse, but as far as I can see, this patch is a
> >>> net loss of functionality and doesn't actually add anything. Without
> >>> this patch, some resources that were not assigned by BIOS will be
> >>> unreachable. With this patch, *all* resources that were not assigned by
> >>> BIOS will be unreachable...
> >>>
> >>> -hpa
> >>>
> >> Apparently you're missing the point of the patch? The patch is needed
> >> because BIOS is purposely not assigning I/O BAR's to devices that won't
> >> use them, freeing up the resource for devices that do need them. Which
> >> are the "all" resources that are not reachable?
> >
> > No, the patch isn't needed for those.
> >
> > Without your patch:
> >
> > - Devices assigned by BIOS remain assigned;
> > - Devices not assigned by BIOS get assigned until address space
> > exhausted.
> >
> > With your patch:
> >
> > - Devices assigned by BIOS remain assigned;
> > - Devices not assigned by BIOS never get assigned at all.
> >
> > What am I missing here?
>
> BIOS still assigns the MMIO BAR's so the devices are alive.

I'm sorry; I don't follow this. BIOS assigns MMIO BARs regardless
of whether we have your patch.
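
To make the with/without comparison above concrete, here is a coarse
model of the two behaviours. It is a sketch under stated assumptions,
not a description of the kernel's allocator: struct model_dev and
model_assign() are hypothetical, each device is assumed to need exactly
one window, and leftover windows are handed out first come, first served.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One device needing one I/O window; a deliberately coarse model. */
struct model_dev {
	bool bios_assigned;	/* firmware already gave it a window */
	bool assigned;		/* result after the modelled kernel pass */
};

/*
 * total_windows: windows the I/O space can hold (e.g. 16 for 64k / 4k).
 * nobar: true models booting with the proposed "pci=nobar".
 * Returns how many devices end up with a window.
 */
static size_t model_assign(struct model_dev *devs, size_t n,
			   size_t total_windows, bool nobar)
{
	size_t used = 0, i;

	assert(devs != NULL);

	/* Pass 1: BIOS assignments are preserved in both cases. */
	for (i = 0; i < n; i++) {
		devs[i].assigned = devs[i].bios_assigned;
		if (devs[i].assigned)
			used++;
	}
	assert(used <= total_windows);

	/* Pass 2: without nobar, hand out whatever is left, in order. */
	if (!nobar) {
		for (i = 0; i < n && used < total_windows; i++) {
			if (!devs[i].assigned) {
				devs[i].assigned = true;
				used++;
			}
		}
	}
	return used;
}
```

With 24 devices, 16 windows, and only the last eight devices assigned by
the BIOS (illustrative numbers), the model returns 8 with nobar and 16
without it. The BIOS-assigned devices come out the same either way; in
this model the only difference is how many windows remain free afterwards.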

I'm still having trouble reconciling the stated purpose, i.e., the
changelog, with the behavior. The changelog implies that the patch
is required to make >16 devices with I/O BARs work at all, but per
Mike Habeck, the patch just gets rid of some warnings and maybe helps
with hot-add of devices using I/O space.

Is there a deeper problem that happens if we exhaust I/O space?
Are we releasing device resources in pci_assign_unassigned_bridge_resources()
and then we fail to reassign even MMIO resources after we exhaust
I/O space?

Maybe a complete dmesg log showing the failure would be helpful. If
so, you could open a kernel.org bugzilla and reference it in your
changelog so we can take this issue into account in future PCI work.

Bjorn