VFIO driver: Non-privileged user level PCI drivers [Kernel]

Prev: ACPI patches for 2.6.35-merge
Next: ext4: Fix file fragmentation during large file write.

From: Greg KH on 11 Jun 2010 00:30

On Thu, Jun 10, 2010 at 06:58:37PM -0700, Tom Lyon wrote:
> On Thursday 10 June 2010 10:27:36 am Konrad Rzeszutek Wilk wrote:
> > > +EXPORT_SYMBOL(uiommu_fdget);
> >
> > EXPORT_SYMBOL_GPL
> > .. snip
> > > +EXPORT_SYMBOL(uiommu_put);
> >
> > ditto.
> >
>
> Is there a definitive explanation somewhere of when to use each?

For a driver like this, that is very tied to the way that the kernel
works, I would recommend the _GPL marking, like the UIO interface has.

But that's just me. :)

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Avi Kivity on 11 Jun 2010 01:00

On 06/11/2010 04:58 AM, Tom Lyon wrote:
> On Thursday 10 June 2010 10:27:36 am Konrad Rzeszutek Wilk wrote:
>
>>> +EXPORT_SYMBOL(uiommu_fdget);
>>>
>> EXPORT_SYMBOL_GPL
>> .. snip
>>
>>> +EXPORT_SYMBOL(uiommu_put);
>>>
>> ditto.
>>
>>
> Is there a definitive explanation somewhere of when to use each?
>

Always use _GPL unless you have a defensible reason why you shouldn't.
The kernel's license if GPL, exporting a symbol without GPL can be seen
as adding an exception to the license.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Tom Lyon on 11 Jun 2010 18:20

The inline comments are getting pretty hard to wade through, so I'm deleting some
of the lesser stuff - but I am incorporating into the code.

On Tuesday 08 June 2010 10:45:57 pm Michael S. Tsirkin wrote:
> On Tue, Jun 08, 2010 at 04:54:43PM -0700, Tom Lyon wrote:
> > On Tuesday 08 June 2010 03:38:44 pm Michael S. Tsirkin wrote:
> > > On Tue, Jun 08, 2010 at 02:21:52PM -0700, Tom Lyon wrote:
....
> > > > + /* reset to known state if we can */
> > > > + (void) pci_reset_function(vdev->pdev);
> > >
> > > We are opening the device - how can it not be in a known state?
> > If an alternate driver left it in a weird state.
>
> Don't we care if it fails then? I think we do ...
> > > And we should make sure (at open time) we *can* reset on close, fail binding if we can't.
> > How do you propose?
>
> Fail open if reset fails on open?
OK, it'll now fail open if reset fails.

[ bunch of stuff about MSI-X checking and IOMMUs and config registers...]

OK, here's the thing. The IOMMU API today does not do squat about dealing with
interrupts. Interrupts are special because the APIC addresses are not each in
their own page. Yes, the IOMMU hardware supports it (at least Intel), and there's
some Intel intr remapping code (not AMD), but it doesn't look like it is enough.

Therefore, we must not allow the user level driver to diddle the MSI or MSI-X
areas - either in config space or in the device memory space. If the device
doesn't have its MSI-X registers in nice page aligned areas, then it is not
"well-behaved" and it is S.O.L. The SR-IOV spec recommends that devices be
designed the well-behaved way.

When the code in vfio_pci_config speaks of "virtualization" it means that there
are fake registers which the user driver can read or write, but do not affect the
real registers. BARs are one case, MSI regs another. The PCI vendor and device
ID are virtual because SR-IOV doesn't supply them but I wanted the user driver
to find them in the same old place.

> > > > + case VFIO_DMA_MASK: /* set master mode and DMA mask */
> > > > + if (copy_from_user(&mask, uarg, sizeof mask))
> > > > + return -EFAULT;
> > > > + pci_set_master(pdev);
> > >
> > > This better be done by userspace when it sees fit.
> > > Otherwise device might corrupt userspace memory.
This ioctl is gone now - it was vestigial from the dma_sg_map interface.

[ Re: Hotplug and Suspend/Resume]
There are *plenty* of real drivers - brand new ones - which don't bother
with these today. Yeah, I can see adding them to the framework someday -
but if there's no urgent need then it is way down the priority list. Meanwhile,
the other uses beckon. And I never heard the Infiniband users complaining
about not having these things.

> > > > + pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
> > >
> > > If you map a 4G region, this will try to allocate 8Mbytes?
> > Yes. Ce la vie.
>
> First of all, this will fail - and the request is quite real with
> decent sized guests. Second, with appropriately sized allocs before failing
> it will stress the system pretty hard. Split it in chunks of 4K or something.
Changed to use vmalloc/vfree - don't need physical contiguity.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Michael S. Tsirkin on 13 Jun 2010 06:30

On Fri, Jun 11, 2010 at 03:15:53PM -0700, Tom Lyon wrote:
> [ bunch of stuff about MSI-X checking and IOMMUs and config registers...]
>
> OK, here's the thing. The IOMMU API today does not do squat about
> dealing with interrupts. Interrupts are special because the APIC
> addresses are not each in their own page. Yes, the IOMMU hardware
> supports it (at least Intel), and there's some Intel intr remapping
> code (not AMD), but it doesn't look like it is enough.

The iommu book from AMD seems to say that interrupt remapping table
address is taken from the device table entry. So hardware support seems
to be there, and to me it looks like it should be enough.
Need to look at the iommu/msi code some more to figure out
whether what linux does is handling this correctly -
if it doesn't we need to fix that.

> Therefore, we must not allow the user level driver to diddle the MSI
> or MSI-X areas - either in config space or in the device memory space.

It won't help.
Consider that you want to let a userspace driver control
the device with DMA capabilities.

So if there is a range of addresses that device
can write into that can break host, these writes
can be triggered by userspace. Limiting
userspace access to MSI registers won't help:
you need a way to protect host from the device.

> If the device doesn't have its MSI-X registers in nice page aligned
> areas, then it is not "well-behaved" and it is S.O.L. The SR-IOV spec
> recommends that devices be designed the well-behaved way.
>
> When the code in vfio_pci_config speaks of "virtualization" it means
> that there are fake registers which the user driver can read or write,
> but do not affect the real registers. BARs are one case, MSI regs
> another. The PCI vendor and device ID are virtual because SR-IOV
> doesn't supply them but I wanted the user driver to find them in the
> same old place.

Sorry, I still don't understand why do we bother. All this is already
implemented in userspace. Why can't we just use this existing userspace
implementation? It seems that all kernel needs to do is prevent
userspace from writing BARs.

Why can't we replace all this complexity with basically:

if (addr <= PCI_BASE_ADDRESS_5 && addr + len >= PCI_BASE_ADDRESS_0)
return -ENOPERM;

And maybe another register or two. Most registers should be fine.

> [ Re: Hotplug and Suspend/Resume]
> There are *plenty* of real drivers - brand new ones - which don't
> bother with these today. Yeah, I can see adding them to the framework
> someday - but if there's no urgent need then it is way down the
> priority list.

Well, for kernel drivers everything mostly works out of the box, it is
handled by the PCI subsystem. So some kind of framework will need to be
added for userspace drivers as well. And I suspect this issue won't be
fixable later without breaking applications.

> Meanwhile, the other uses beckon.

Which other uses? I thought the whole point was fixing
what's broken with current kvm implementation.
So it seems to be we should not rush it ignoring existing issues such as
hotplug.

> And I never heard
> the Infiniband users complaining about not having these things.

I did.

--
MST
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Tom Lyon on 17 Jun 2010 17:20

On Sunday 13 June 2010 03:23:39 am Michael S. Tsirkin wrote:
> On Fri, Jun 11, 2010 at 03:15:53PM -0700, Tom Lyon wrote:
> > [ bunch of stuff about MSI-X checking and IOMMUs and config registers...]
> >
> > OK, here's the thing. The IOMMU API today does not do squat about
> > dealing with interrupts. Interrupts are special because the APIC
> > addresses are not each in their own page. Yes, the IOMMU hardware
> > supports it (at least Intel), and there's some Intel intr remapping
> > code (not AMD), but it doesn't look like it is enough.
>
> The iommu book from AMD seems to say that interrupt remapping table
> address is taken from the device table entry. So hardware support seems
> to be there, and to me it looks like it should be enough.
> Need to look at the iommu/msi code some more to figure out
> whether what linux does is handling this correctly -
> if it doesn't we need to fix that.
>
> > Therefore, we must not allow the user level driver to diddle the MSI
> > or MSI-X areas - either in config space or in the device memory space.
>
> It won't help.
> Consider that you want to let a userspace driver control
> the device with DMA capabilities.
>
> So if there is a range of addresses that device
> can write into that can break host, these writes
> can be triggered by userspace. Limiting
> userspace access to MSI registers won't help:
> you need a way to protect host from the device.

OK, after more investigation, I realize you are right.
We definitely need the IOMMU protection for interrupts, and
if we have it, a lot of the code for config space protection is pointless.
It does seem that the Intel intr_remapping code does what we want
(accidentally) but that the AMD iommu code does not yet do any
interrupt remapping. Joerg - can you comment? On the roadmap?

I should have an AMD system w IOMMU in a couple of days, so I
can check this out.

>
> > If the device doesn't have its MSI-X registers in nice page aligned
> > areas, then it is not "well-behaved" and it is S.O.L. The SR-IOV spec
> > recommends that devices be designed the well-behaved way.
> >
> > When the code in vfio_pci_config speaks of "virtualization" it means
> > that there are fake registers which the user driver can read or write,
> > but do not affect the real registers. BARs are one case, MSI regs
> > another. The PCI vendor and device ID are virtual because SR-IOV
> > doesn't supply them but I wanted the user driver to find them in the
> > same old place.
>
> Sorry, I still don't understand why do we bother. All this is already
> implemented in userspace. Why can't we just use this existing userspace
> implementation? It seems that all kernel needs to do is prevent
> userspace from writing BARs.

I assume the userspace of which you speak is qemu? This is not what I'm
doing with vfio - I'm interested in the HPC networking model of direct
user space access to the network.

> Why can't we replace all this complexity with basically:
>
> if (addr <= PCI_BASE_ADDRESS_5 && addr + len >= PCI_BASE_ADDRESS_0)
> return -ENOPERM;
>
> And maybe another register or two. Most registers should be fine.
>
> > [ Re: Hotplug and Suspend/Resume]
> > There are *plenty* of real drivers - brand new ones - which don't
> > bother with these today. Yeah, I can see adding them to the framework
> > someday - but if there's no urgent need then it is way down the
> > priority list.
>
> Well, for kernel drivers everything mostly works out of the box, it is
> handled by the PCI subsystem. So some kind of framework will need to be
> added for userspace drivers as well. And I suspect this issue won't be
> fixable later without breaking applications.

Whatever works out of the box for the kernel drivers which don't implement
suspend/resume will work for the user level drivers which don't.
>
> > Meanwhile, the other uses beckon.
>
> Which other uses? I thought the whole point was fixing
> what's broken with current kvm implementation.
> So it seems to be we should not rush it ignoring existing issues such as
> hotplug.
Non-kvm cases. That don't care about suspend/resume.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Prev: ACPI patches for 2.6.35-merge
Next: ext4: Fix file fragmentation during large file write.