VFIO driver: Non-privileged user level PCI drivers [Kernel]

Prev: ACPI patches for 2.6.35-merge
Next: ext4: Fix file fragmentation during large file write.

From: Michael S. Tsirkin on 17 Jun 2010 18:00

On Thu, Jun 17, 2010 at 02:14:00PM -0700, Tom Lyon wrote:
> On Sunday 13 June 2010 03:23:39 am Michael S. Tsirkin wrote:
> > On Fri, Jun 11, 2010 at 03:15:53PM -0700, Tom Lyon wrote:
> > > [ bunch of stuff about MSI-X checking and IOMMUs and config registers...]
> > >
> > > OK, here's the thing. The IOMMU API today does not do squat about
> > > dealing with interrupts. Interrupts are special because the APIC
> > > addresses are not each in their own page. Yes, the IOMMU hardware
> > > supports it (at least Intel), and there's some Intel intr remapping
> > > code (not AMD), but it doesn't look like it is enough.
> >
> > The iommu book from AMD seems to say that interrupt remapping table
> > address is taken from the device table entry. So hardware support seems
> > to be there, and to me it looks like it should be enough.
> > Need to look at the iommu/msi code some more to figure out
> > whether what linux does is handling this correctly -
> > if it doesn't we need to fix that.
> >
> > > Therefore, we must not allow the user level driver to diddle the MSI
> > > or MSI-X areas - either in config space or in the device memory space.
> >
> > It won't help.
> > Consider that you want to let a userspace driver control
> > the device with DMA capabilities.
> >
> > So if there is a range of addresses that device
> > can write into that can break host, these writes
> > can be triggered by userspace. Limiting
> > userspace access to MSI registers won't help:
> > you need a way to protect host from the device.
>
> OK, after more investigation, I realize you are right.
> We definitely need the IOMMU protection for interrupts, and
> if we have it, a lot of the code for config space protection is pointless.
> It does seem that the Intel intr_remapping code does what we want
> (accidentally) but that the AMD iommu code does not yet do any
> interrupt remapping. Joerg - can you comment? On the roadmap?
>
> I should have an AMD system w IOMMU in a couple of days, so I
> can check this out.
>
> >
> > > If the device doesn't have its MSI-X registers in nice page aligned
> > > areas, then it is not "well-behaved" and it is S.O.L. The SR-IOV spec
> > > recommends that devices be designed the well-behaved way.
> > >
> > > When the code in vfio_pci_config speaks of "virtualization" it means
> > > that there are fake registers which the user driver can read or write,
> > > but do not affect the real registers. BARs are one case, MSI regs
> > > another. The PCI vendor and device ID are virtual because SR-IOV
> > > doesn't supply them but I wanted the user driver to find them in the
> > > same old place.
> >
> > Sorry, I still don't understand why do we bother. All this is already
> > implemented in userspace. Why can't we just use this existing userspace
> > implementation? It seems that all kernel needs to do is prevent
> > userspace from writing BARs.
>
> I assume the userspace of which you speak is qemu? This is not what I'm
> doing with vfio - I'm interested in the HPC networking model of direct
> user space access to the network.

In that case, I don't see the point of fake config space registers
at all. For virtualization we need them to run unmodified guest drivers,
but we have an implementation in qemu. If you write your own custom drivers
for HPC, just get -EPERM from kernel and handle this.

> > Why can't we replace all this complexity with basically:
> >
> > if (addr <= PCI_BASE_ADDRESS_5 && addr + len >= PCI_BASE_ADDRESS_0)
> > return -ENOPERM;
> >
> > And maybe another register or two. Most registers should be fine.
> >
> > > [ Re: Hotplug and Suspend/Resume]
> > > There are *plenty* of real drivers - brand new ones - which don't
> > > bother with these today. Yeah, I can see adding them to the framework
> > > someday - but if there's no urgent need then it is way down the
> > > priority list.
> >
> > Well, for kernel drivers everything mostly works out of the box, it is
> > handled by the PCI subsystem. So some kind of framework will need to be
> > added for userspace drivers as well. And I suspect this issue won't be
> > fixable later without breaking applications.
>
> Whatever works out of the box for the kernel drivers which don't implement
> suspend/resume will work for the user level drivers which don't.

How will hotplug/driver unload work then?

> >
> > > Meanwhile, the other uses beckon.
> >
> > Which other uses? I thought the whole point was fixing
> > what's broken with current kvm implementation.
> > So it seems to be we should not rush it ignoring existing issues such as
> > hotplug.
> Non-kvm cases.

Try describing these use-cases in more detail then.
And given that you want to use this driver for networking,
please copy netdev, not just linux-kernel.

> That don't care about suspend/resume.
>
>
>

Let's start with hotplug please. This is a really clear-cut issue:
all pci drivers get an event on hotplug and hotunplug, you need
such events for userspace and you need them on day 0, otherwise
userspace which ignores hotplug will get written, and you will
have to support it forever.

--
MST
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Joerg Roedel on 24 Jun 2010 08:30

On Thu, Jun 17, 2010 at 02:14:00PM -0700, Tom Lyon wrote:
> On Sunday 13 June 2010 03:23:39 am Michael S. Tsirkin wrote:

> OK, after more investigation, I realize you are right.
> We definitely need the IOMMU protection for interrupts, and
> if we have it, a lot of the code for config space protection is pointless.
> It does seem that the Intel intr_remapping code does what we want
> (accidentally) but that the AMD iommu code does not yet do any
> interrupt remapping. Joerg - can you comment? On the roadmap?

Work on this is planned, but not at a high priority by now. I can
re-prioritize this item if needed.

Joerg

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Michael S. Tsirkin on 24 Jun 2010 11:10

On Thu, Jun 24, 2010 at 02:22:19PM +0200, Joerg Roedel wrote:
> On Thu, Jun 17, 2010 at 02:14:00PM -0700, Tom Lyon wrote:
> > On Sunday 13 June 2010 03:23:39 am Michael S. Tsirkin wrote:
>
> > OK, after more investigation, I realize you are right.
> > We definitely need the IOMMU protection for interrupts, and
> > if we have it, a lot of the code for config space protection is pointless.
> > It does seem that the Intel intr_remapping code does what we want
> > (accidentally) but that the AMD iommu code does not yet do any
> > interrupt remapping. Joerg - can you comment? On the roadmap?
>
> Work on this is planned, but not at a high priority by now. I can
> re-prioritize this item if needed.
>
> Joerg

As a stop-gap measure, we could get by with a
portable API that let us figure out whether a given iommu supports
interrupt remapping.

--
MST
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Alex Williamson on 30 Jun 2010 02:20

On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> The VFIO "driver" is used to allow privileged AND non-privileged processes to
> implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> devices.

Hi Tom,

I found a few bugs. Patch below. The first chunk clears the
pci_config_map on close, otherwise we end up passing virtualized state
from one user to the next. The second is an off by one in the basic
perms. Finally, vfio_bar_fixup() needs an overhaul. It wasn't setting
the lower bits right and is allowing virtual writes of bits that aren't
aligned to the size. This section probably needs another pass or two of
refinement. Thanks,

Alex

Signed-off-by: Alex Williamson <alex.williamson(a)redhat.com>
---

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 96639e5..a0e8227 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -129,6 +129,10 @@ static int vfio_release(struct inode *inode, struct file *filep)
eventfd_ctx_put(vdev->ev_msi);
vdev->ev_irq = NULL;
}
+ if (vdev->pci_config_map) {
+ kfree(vdev->pci_config_map);
+ vdev->pci_config_map = NULL;
+ }
vfio_domain_unset(vdev);
/* reset to known state if we can */
(void) pci_reset_function(vdev->pdev);
diff --git a/drivers/vfio/vfio_pci_config.c b/drivers/vfio/vfio_pci_config.c
index c821b5d..f6e26b1 100644
--- a/drivers/vfio/vfio_pci_config.c
+++ b/drivers/vfio/vfio_pci_config.c
@@ -79,18 +79,18 @@ struct perm_bits {
static struct perm_bits pci_cap_basic_perm[] = {
{ 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
{ 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
- { 0, 0xFF00FFFF, }, /* 0x08 bist, htype, lat, cache */
- { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x0c bar */
+ { 0, 0, }, /* 0x08 class code & revision id */
+ { 0, 0xFF00FFFF, }, /* 0x0c bist, htype, lat, cache */
{ 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
{ 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
{ 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
{ 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
{ 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
- { 0, 0, }, /* 0x24 cardbus - not yet */
- { 0, 0, }, /* 0x28 subsys vendor & dev */
- { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x2c rom bar */
- { 0, 0, }, /* 0x30 capability ptr & resv */
- { 0, 0, }, /* 0x34 resv */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x24 bar */
+ { 0, 0, }, /* 0x28 cardbus - not yet */
+ { 0, 0, }, /* 0x2c subsys vendor & dev */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x30 rom bar */
+ { 0, 0, }, /* 0x34 capability ptr & resv */
{ 0, 0, }, /* 0x38 resv */
{ 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
};
@@ -318,30 +318,55 @@ static void vfio_virt_init(struct vfio_dev *vdev)
static void vfio_bar_fixup(struct vfio_dev *vdev)
{
struct pci_dev *pdev = vdev->pdev;
- int bar;
- u32 *lp;
- u32 len;
+ int bar, mem64 = 0;
+ u32 *lp = NULL;
+ u64 len = 0;

for (bar = 0; bar <= 5; bar++) {
- len = pci_resource_len(pdev, bar);
- lp = (u32 *)&vdev->vinfo.bar[bar * 4];
- if (len == 0) {
- *lp = 0;
- } else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
- *lp &= ~0x1;
- *lp = (*lp & ~(len-1)) |
- (*lp & ~PCI_BASE_ADDRESS_MEM_MASK);
- if (*lp & PCI_BASE_ADDRESS_MEM_TYPE_64)
- bar++;
- } else if (pci_resource_flags(pdev, bar) & IORESOURCE_IO) {
+ if (!mem64) {
+ len = pci_resource_len(pdev, bar);
+ lp = (u32 *)&vdev->vinfo.bar[bar * 4];
+ if (len == 0) {
+ *lp = 0;
+ continue;
+ }
+
+ len = ~(len - 1);
+ } else
+ len >>= 32;
+
+ if (*lp == ~0U)
+ *lp = (u32)len;
+ else
+ *lp &= (u32)len;
+
+ if (mem64) {
+ mem64 = 0;
+ continue;
+ }
+
+ if (pci_resource_flags(pdev, bar) & IORESOURCE_IO)
*lp |= PCI_BASE_ADDRESS_SPACE_IO;
- *lp = (*lp & ~(len-1)) |
- (*lp & ~PCI_BASE_ADDRESS_IO_MASK);
+ else {
+ *lp |= PCI_BASE_ADDRESS_SPACE_MEMORY;
+ if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
+ *lp |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+ if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM_64) {
+ *lp |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+ mem64 = 1;
+ }
}
}
+
lp = (u32 *)vdev->vinfo.rombar;
len = pci_resource_len(pdev, PCI_ROM_RESOURCE);
- *lp = *lp & PCI_ROM_ADDRESS_MASK & ~(len-1);
+ len = ~(len - 1);
+
+ if (*lp == ~PCI_ROM_ADDRESS_ENABLE)
+ *lp = (u32)len;
+ else
+ *lp = *lp & ((u32)len | PCI_ROM_ADDRESS_ENABLE);
+
vdev->vinfo.bardirty = 0;
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Michael S. Tsirkin on 30 Jun 2010 09:50

On Wed, Jun 30, 2010 at 12:14:12AM -0600, Alex Williamson wrote:
> On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> > The VFIO "driver" is used to allow privileged AND non-privileged processes to
> > implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> > devices.
>
> Hi Tom,
>
> I found a few bugs. Patch below. The first chunk clears the
> pci_config_map on close, otherwise we end up passing virtualized state
> from one user to the next. The second is an off by one in the basic
> perms. Finally, vfio_bar_fixup() needs an overhaul. It wasn't setting
> the lower bits right and is allowing virtual writes of bits that aren't
> aligned to the size. This section probably needs another pass or two of
> refinement. Thanks,
>
> Alex
>
> Signed-off-by: Alex Williamson <alex.williamson(a)redhat.com>

I still don't see why are we sticking all this emulation in kernel. It
is far from performance hotpath and can easily be emulated in userspace.
qemu does this, you can lift code from there if you like.
Maybe we need to protect the BARs from being manipulated by userspace,
but that should be all. No need for tables.

> ---
>
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 96639e5..a0e8227 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -129,6 +129,10 @@ static int vfio_release(struct inode *inode, struct file *filep)
> eventfd_ctx_put(vdev->ev_msi);
> vdev->ev_irq = NULL;
> }
> + if (vdev->pci_config_map) {
> + kfree(vdev->pci_config_map);
> + vdev->pci_config_map = NULL;
> + }
> vfio_domain_unset(vdev);
> /* reset to known state if we can */
> (void) pci_reset_function(vdev->pdev);
> diff --git a/drivers/vfio/vfio_pci_config.c b/drivers/vfio/vfio_pci_config.c
> index c821b5d..f6e26b1 100644
> --- a/drivers/vfio/vfio_pci_config.c
> +++ b/drivers/vfio/vfio_pci_config.c
> @@ -79,18 +79,18 @@ struct perm_bits {
> static struct perm_bits pci_cap_basic_perm[] = {
> { 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
> { 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
> - { 0, 0xFF00FFFF, }, /* 0x08 bist, htype, lat, cache */
> - { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x0c bar */
> + { 0, 0, }, /* 0x08 class code & revision id */
> + { 0, 0xFF00FFFF, }, /* 0x0c bist, htype, lat, cache */
> { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
> { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
> { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
> { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
> { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
> - { 0, 0, }, /* 0x24 cardbus - not yet */
> - { 0, 0, }, /* 0x28 subsys vendor & dev */
> - { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x2c rom bar */
> - { 0, 0, }, /* 0x30 capability ptr & resv */
> - { 0, 0, }, /* 0x34 resv */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x24 bar */
> + { 0, 0, }, /* 0x28 cardbus - not yet */
> + { 0, 0, }, /* 0x2c subsys vendor & dev */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x30 rom bar */
> + { 0, 0, }, /* 0x34 capability ptr & resv */
> { 0, 0, }, /* 0x38 resv */
> { 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
> };
> @@ -318,30 +318,55 @@ static void vfio_virt_init(struct vfio_dev *vdev)
> static void vfio_bar_fixup(struct vfio_dev *vdev)
> {
> struct pci_dev *pdev = vdev->pdev;
> - int bar;
> - u32 *lp;
> - u32 len;
> + int bar, mem64 = 0;
> + u32 *lp = NULL;
> + u64 len = 0;
>
> for (bar = 0; bar <= 5; bar++) {
> - len = pci_resource_len(pdev, bar);
> - lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> - if (len == 0) {
> - *lp = 0;
> - } else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
> - *lp &= ~0x1;
> - *lp = (*lp & ~(len-1)) |
> - (*lp & ~PCI_BASE_ADDRESS_MEM_MASK);
> - if (*lp & PCI_BASE_ADDRESS_MEM_TYPE_64)
> - bar++;
> - } else if (pci_resource_flags(pdev, bar) & IORESOURCE_IO) {
> + if (!mem64) {
> + len = pci_resource_len(pdev, bar);
> + lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> + if (len == 0) {
> + *lp = 0;
> + continue;
> + }
> +
> + len = ~(len - 1);
> + } else
> + len >>= 32;
> +
> + if (*lp == ~0U)
> + *lp = (u32)len;
> + else
> + *lp &= (u32)len;
> +
> + if (mem64) {
> + mem64 = 0;
> + continue;
> + }
> +
> + if (pci_resource_flags(pdev, bar) & IORESOURCE_IO)
> *lp |= PCI_BASE_ADDRESS_SPACE_IO;
> - *lp = (*lp & ~(len-1)) |
> - (*lp & ~PCI_BASE_ADDRESS_IO_MASK);
> + else {
> + *lp |= PCI_BASE_ADDRESS_SPACE_MEMORY;
> + if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
> + *lp |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> + if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM_64) {
> + *lp |= PCI_BASE_ADDRESS_MEM_TYPE_64;
> + mem64 = 1;
> + }
> }
> }
> +
> lp = (u32 *)vdev->vinfo.rombar;
> len = pci_resource_len(pdev, PCI_ROM_RESOURCE);
> - *lp = *lp & PCI_ROM_ADDRESS_MASK & ~(len-1);
> + len = ~(len - 1);
> +
> + if (*lp == ~PCI_ROM_ADDRESS_ENABLE)
> + *lp = (u32)len;
> + else
> + *lp = *lp & ((u32)len | PCI_ROM_ADDRESS_ENABLE);
> +
> vdev->vinfo.bardirty = 0;
> }
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Prev: ACPI patches for 2.6.35-merge
Next: ext4: Fix file fragmentation during large file write.