From: Avi Kivity on
On 05/05/2010 02:02 AM, Pankaj Thakkar wrote:
> 2. Hypervisor control: All control operations from the guest such as programming
> MAC address go through the hypervisor layer and hence can be subjected to
> hypervisor policies. The PF driver can be further used to put policy decisions
> like which VLAN the guest should be on.
>

Is this enforced? Since you pass the hardware through, you can't rely
on the guest actually doing this, yes?

> The plugin image is provided by the IHVs along with the PF driver and is
> packaged in the hypervisor. The plugin image is OS agnostic and can be loaded
> either into a Linux VM or a Windows VM. The plugin is written against the Shell
> API interface which the shell is responsible for implementing. The API
> interface allows the plugin to do TX and RX only by programming the hardware
> rings (along with things like buffer allocation and basic initialization). The
> virtual machine comes up in paravirtualized/emulated mode when it is booted.
> The hypervisor allocates the VF and other resources and notifies the shell of
> the availability of the VF. The hypervisor injects the plugin into memory
> location specified by the shell. The shell initializes the plugin by calling
> into a known entry point and the plugin initializes the data path. The control
> path is already initialized by the PF driver when the VF is allocated. At this
> point the shell switches to using the loaded plugin to do all further TX and RX
> operations. The guest networking stack does not participate in these operations
> and continues to function normally. All the control operations continue being
> trapped by the hypervisor and are directed to the PF driver as needed. For
> example, if the MAC address changes the hypervisor updates its internal state
> and changes the state of the embedded switch as well through the PF control
> API.
>

This is essentially a miniature network stack with its own mini
bonding layer, mini hotplug, and mini API, except s/API/ABI/. Is this a
correct view?

If so, the Linuxy approach would be to use the ordinary drivers and the
Linux networking API, and hide the bond setup using namespaces. The
bond driver, or perhaps a new, similar, driver can be enhanced to
propagate ethtool commands to its (hidden) components, and to have a
control channel with the hypervisor.

This would make the approach hypervisor agnostic, you're just pairing
two devices and presenting them to the rest of the stack as a single device.
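
Something along these lines, purely as an illustration (not an existing
driver, all names invented), is what I mean by propagating ethtool to the
hidden components:

#include <linux/netdevice.h>
#include <linux/ethtool.h>

/* The paired device keeps a pointer to whichever hidden lower device
 * (VF or emulated NIC) is currently active and forwards ethtool
 * queries to it. */
struct pair_priv {
	struct net_device *active;
};

static u32 pair_get_link(struct net_device *dev)
{
	struct pair_priv *p = netdev_priv(dev);
	const struct ethtool_ops *ops = p->active->ethtool_ops;

	if (ops && ops->get_link)
		return ops->get_link(p->active);
	return ethtool_op_get_link(dev);
}

static const struct ethtool_ops pair_ethtool_ops = {
	.get_link	= pair_get_link,
};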

> We have reworked our existing Linux vmxnet3 driver to accommodate NPA by
> splitting the driver into two parts: Shell and Plugin. The new split driver is
>

So the Shell would be the reworked or new bond driver, and Plugins would
be ordinary Linux network drivers.


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

From: Pankaj Thakkar on
On Tue, May 04, 2010 at 05:58:52PM -0700, Chris Wright wrote:
> Date: Tue, 4 May 2010 17:58:52 -0700
> From: Chris Wright <chrisw(a)sous-sol.org>
> To: Pankaj Thakkar <pthakkar(a)vmware.com>
> CC: "linux-kernel(a)vger.kernel.org" <linux-kernel(a)vger.kernel.org>,
> "netdev(a)vger.kernel.org" <netdev(a)vger.kernel.org>,
> "virtualization(a)lists.linux-foundation.org"
> <virtualization(a)lists.linux-foundation.org>,
> "pv-drivers(a)vmware.com" <pv-drivers(a)vmware.com>,
> Shreyas Bhatewara <sbhatewara(a)vmware.com>,
> "kvm(a)vger.kernel.org" <kvm(a)vger.kernel.org>
> Subject: Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
>
> * Pankaj Thakkar (pthakkar(a)vmware.com) wrote:
> > We intend to upgrade the upstreamed vmxnet3 driver to implement NPA so that
> > Linux users can exploit the benefits provided by passthrough devices in a
> > seamless manner while retaining the benefits of virtualization. The document
> > below tries to answer most of the questions which we anticipated. Please let us
> > know your comments and queries.
>
> How does the throughput, latency, and host CPU utilization for normal
> data path compare with say NetQueue?

NetQueue is really for scaling across multiple VMs. NPA allows similar scaling
and also helps improve the CPU efficiency for a single VM, since the hypervisor
is bypassed. Throughput-wise, both emulation and passthrough (NPA) can reach
line rate on 10GbE, but passthrough saves up to 40% CPU depending on the
workload. We did a demo at IDF 2009 comparing 8 VMs running on NetQueue with
8 VMs running on NPA (using Niantic), and we saw similar CPU efficiency gains.

>
> And does this obsolete your UPT implementation?

NPA and UPT share a lot of code in the hypervisor. UPT was adopted by only a
very limited set of IHVs, so NPA is our way forward to get all IHVs on board.

> How many cards actually support this NPA interface? What does it look
> like, i.e. where is the NPA specification? (AFAIK, we never got the UPT
> one).

We have it working internally with the Intel Niantic (10G) and Kawela (1G)
SR-IOV NICs. We are also working with an upcoming Broadcom 10G card and plan to
support other IHVs. Unlike UPT, we don't dictate the register sets or ring
layouts. Rather, we have guidelines: for example, the card should have an
embedded switch for inter-VF switching and should support programming (RX
filters, VLAN, etc.) through the PF driver rather than the VF driver.

> How do you handle hardware which has a more symmetric view of the
> SR-IOV world (SR-IOV is only a PCI specification, not a network driver
> specification)? Or hardware which has multiple functions per physical
> port (multiqueue, hw filtering, embedded switch, etc.)?

I am not sure what you mean by a symmetric view of the SR-IOV world.

NPA allows multi-queue VFs and requires an embedded switch currently. As far as
the PF driver is concerned we require IHVs to support all existing and upcoming
features like NetQueue, FCoE, etc. The PF driver is considered special and is
used to drive the traffic for the emulated/paravirtualized VMs and is also used
to program things on behalf of the VFs through the hypervisor. If the hardware
has multiple physical functions they are treated as separate adapters (with
their own set of VFs) and we require the embedded switch to maintain that
distinction as well.


> > NPA offers several benefits:
> > 1. Performance: Critical performance sensitive paths are not trapped and the
> > guest can directly drive the hardware without incurring virtualization
> > overheads.
>
> Can you demonstrate with data?

The setup is a 2.667 GHz Nehalem server running a SLES11 VM, talking to a
2.33 GHz Barcelona client box running RHEL 5.1. We ran netperf streams with a
16k message size over a 64k socket buffer between the server VM and the client,
using Intel Niantic 10G cards. In both cases (NPA and regular) the VM was CPU
saturated (used one full core).

TX: regular vmxnet3 = 3085.5 Mbps/GHz; NPA vmxnet3 = 4397.2 Mbps/GHz
RX: regular vmxnet3 = 1379.6 Mbps/GHz; NPA vmxnet3 = 2349.7 Mbps/GHz

We have similar results for other configurations; in general NPA is better in
terms of CPU cost and can save up to 40% of it.
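
The 40% figure follows directly from the Mbps/GHz numbers above; since both
runs are CPU saturated, Mbps/GHz is just the inverse of CPU cost per bit:

  \[ \text{CPU saving} = 1 - \frac{\text{regular Mbps/GHz}}{\text{NPA Mbps/GHz}} \]
  \[ \text{TX: } 1 - \tfrac{3085.5}{4397.2} \approx 30\%, \qquad
     \text{RX: } 1 - \tfrac{1379.6}{2349.7} \approx 41\% \]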

>
> > 2. Hypervisor control: All control operations from the guest such as programming
> > MAC address go through the hypervisor layer and hence can be subjected to
> > hypervisor policies. The PF driver can be further used to put policy decisions
> > like which VLAN the guest should be on.
>
> This can happen without NPA as well. VF simply needs to request
> the change via the PF (in fact, hw does that right now). Also, we
> already have a host side management interface via PF (see, for example,
> RTM_SETLINK IFLA_VF_MAC interface).
>
> What is control plane interface? Just something like a fixed register set?

All operations other than TX/RX go through the vmxnet3 shell to the vmxnet3
device emulation. So the control plane is really the vmxnet3 device emulation
as far as the guest is concerned.

>
> > 3. Guest Management: No hardware specific drivers need to be installed in the
> > guest virtual machine and hence no overheads are incurred for guest management.
> > All software for the driver (including the PF driver and the plugin) is
> > installed in the hypervisor.
>
> So we have a plugin per hardware VF implementation? And the hypervisor
> injects this code into the guest?

One guest-agnostic plugin per VF implementation. Yes, the plugin is injected
into the guest by the hypervisor.

> > The plugin image is provided by the IHVs along with the PF driver and is
> > packaged in the hypervisor. The plugin image is OS agnostic and can be loaded
> > either into a Linux VM or a Windows VM. The plugin is written against the Shell
>
> And it will need to be GPL AFAICT from what you've said thus far. It
> does sound worrisome, although I suppose hw firmware isn't particularly
> different.

Yes, it would be GPL, and we are thinking of enforcing the license in the
hypervisor as well as in the shell.

> How does the shell switch back to emulated mode for live migration?

The hypervisor sends a notification to the shell to switch out of passthrough,
quiesces the VF, and tears down the mapping between the VF and the guest. The
shell frees up the buffers and other resources on behalf of the plugin and
reinitializes the s/w vmxnet3 emulation plugin.
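
As a rough sketch of the shell side of that switch (all of the identifiers
below are invented for illustration; the real shell/plugin interface is not
posted yet):

#include <linux/netdevice.h>

struct plugin_ops {
	int  (*init)(void *state);
	void (*shutdown)(void *state);
};

struct shell {
	struct net_device *netdev;
	const struct plugin_ops *ops;	/* h/w plugin or s/w emulation plugin */
	void *plugin_state;
};

extern const struct plugin_ops vmxnet3_sw_plugin_ops;
void shell_free_rx_buffers(struct shell *sh);

static void shell_switch_to_emulation(struct shell *sh)
{
	/* Stop handing TX work to the hardware plugin. */
	netif_tx_disable(sh->netdev);

	/* The hypervisor has already quiesced the VF and torn down the
	 * VF<->guest mapping; let the plugin quiesce its rings too. */
	sh->ops->shutdown(sh->plugin_state);

	/* Release the buffers the shell allocated on the plugin's behalf. */
	shell_free_rx_buffers(sh);

	/* Re-initialize the s/w vmxnet3 emulation plugin and resume. */
	sh->ops = &vmxnet3_sw_plugin_ops;
	sh->ops->init(sh->plugin_state);
	netif_tx_wake_all_queues(sh->netdev);
}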

> Please make this shell API interface and the PF/VF requirments available.

We have an internal prototype working, but we are not yet ready to post the
patch to LKML. We are still in the process of making changes to our Windows
driver and want to ensure that we take into account all the changes that could
come out of that work.

Thanks,

-pankaj

From: Pankaj Thakkar on
On Wed, May 05, 2010 at 10:59:51AM -0700, Avi Kivity wrote:
> Date: Wed, 5 May 2010 10:59:51 -0700
> From: Avi Kivity <avi(a)redhat.com>
> To: Pankaj Thakkar <pthakkar(a)vmware.com>
> CC: "linux-kernel(a)vger.kernel.org" <linux-kernel(a)vger.kernel.org>,
> "netdev(a)vger.kernel.org" <netdev(a)vger.kernel.org>,
> "virtualization(a)lists.linux-foundation.org"
> <virtualization(a)lists.linux-foundation.org>,
> "pv-drivers(a)vmware.com" <pv-drivers(a)vmware.com>,
> Shreyas Bhatewara <sbhatewara(a)vmware.com>
> Subject: Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
>
> On 05/05/2010 02:02 AM, Pankaj Thakkar wrote:
> > 2. Hypervisor control: All control operations from the guest such as programming
> > MAC address go through the hypervisor layer and hence can be subjected to
> > hypervisor policies. The PF driver can be further used to put policy decisions
> > like which VLAN the guest should be on.
> >
>
> Is this enforced? Since you pass the hardware through, you can't rely
> on the guest actually doing this, yes?

We don't pass the whole VF to the guest. Only the BAR responsible for
TX/RX/interrupts is mapped into guest space. The interface between the shell
and the plugin only allows operations related to TX and RX, such as sending a
packet to the VF, allocating RX buffers, and indicating a packet up to the
shell. All control operations are handled by the shell, and the shell does what
the existing vmxnet3 driver does (touch a specific register and let the device
emulation do the work). When a VF is mapped to the guest the hypervisor knows
this and programs the h/w accordingly on behalf of the shell. So, for example,
if the MAC address is changed inside the guest, the shell writes to the
VMXNET3_REG_MAC{L|H} registers, which triggers the device emulation to read the
new MAC address and update its internal virtual port information for the
virtual switch; if a VF is mapped, it also programs the embedded switch RX
filters to reflect the new MAC address.
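
The register write itself is nothing NPA-specific; it is roughly what the
upstream vmxnet3 driver (drivers/net/vmxnet3) already does today:

#include "vmxnet3_int.h"

static void vmxnet3_write_mac_addr(struct vmxnet3_adapter *adapter, u8 *mac)
{
	u32 tmp;

	tmp = *(u32 *)mac;
	VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_MACL, tmp);

	tmp = (mac[5] << 8) | mac[4];
	VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_MACH, tmp);
}

The device emulation traps these register writes; when a VF is mapped it also
reprograms the embedded switch RX filters through the PF driver, so the guest
never touches the VF's own filter registers.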

>
> > The plugin image is provided by the IHVs along with the PF driver and is
> > packaged in the hypervisor. The plugin image is OS agnostic and can be loaded
> > either into a Linux VM or a Windows VM. The plugin is written against the Shell
> > API interface which the shell is responsible for implementing. The API
> > interface allows the plugin to do TX and RX only by programming the hardware
> > rings (along with things like buffer allocation and basic initialization). The
> > virtual machine comes up in paravirtualized/emulated mode when it is booted.
> > The hypervisor allocates the VF and other resources and notifies the shell of
> > the availability of the VF. The hypervisor injects the plugin into memory
> > location specified by the shell. The shell initializes the plugin by calling
> > into a known entry point and the plugin initializes the data path. The control
> > path is already initialized by the PF driver when the VF is allocated. At this
> > point the shell switches to using the loaded plugin to do all further TX and RX
> > operations. The guest networking stack does not participate in these operations
> > and continues to function normally. All the control operations continue being
> > trapped by the hypervisor and are directed to the PF driver as needed. For
> > example, if the MAC address changes the hypervisor updates its internal state
> > and changes the state of the embedded switch as well through the PF control
> > API.
> >
>
> This is essentially a miniature network stack with its own mini
> bonding layer, mini hotplug, and mini API, except s/API/ABI/. Is this a
> correct view?

To some extent, yes, but there is no complicated bonding, nor is there anything
like PCI hotplug. The shell interface is small and the OS always interacts with
the shell as the main driver. The plugin changes based on the underlying VF,
and the plugin itself is really small too: our vmxnet3 s/w plugin is about 1300
lines including whitespace and comments, and the Intel Kawela plugin is about
1100 lines including whitespace and comments. The design principle is to put
more of the complexity related to initialization/control into the PF driver
rather than into the plugin.

>
> If so, the Linuxy approach would be to use the ordinary drivers and the
> Linux networking API, and hide the bond setup using namespaces. The
> bond driver, or perhaps a new, similar, driver can be enhanced to
> propagate ethtool commands to its (hidden) components, and to have a
> control channel with the hypervisor.
>
> This would make the approach hypervisor agnostic, you're just pairing
> two devices and presenting them to the rest of the stack as a single device.
>
> We have reworked our existing Linux vmxnet3 driver to accommodate NPA by
> > splitting the driver into two parts: Shell and Plugin. The new split driver is
> >
>
> So the Shell would be the reworked or new bond driver, and Plugins would
> be ordinary Linux network drivers.

In NPA we do not rely on the guest OS to provide any of these services, like
bonding or PCI hotplug. We don't rely on the guest OS to unmap a VF and switch
a VM out of passthrough. In a bonding approach that becomes an issue: you can't
just yank a device out from underneath; you have to wait for the guest OS to
process the request and switch from the VF to the emulated device, and this
makes the hypervisor dependent on the guest OS. Also, we don't rely on the
presence of all the drivers inside the guest OS (be it Linux or Windows); the
ESX hypervisor carries all the plugins and the PF drivers and injects the right
one as needed. These plugins are guest agnostic and the IHVs do not have to
write plugins for different OSes.


Thanks,

-pankaj


From: Dmitry Torokhov on
On Wednesday 05 May 2010 01:09:48 pm Arnd Bergmann wrote:
> > > If you have any interest in developing this further, do:
> > >
> > > (1) move the limited VF drivers directly into the kernel tree,
> > > talk to them through a normal ops vector
> >
> > [PT] This assumes that all the VF drivers would always be available.
> > Also we have to support windows and our current design supports it
> > nicely in an OS agnostic manner.
>
> Your approach assumes that the plugin is always available, which has
> exactly the same implications.

Since plugin[s] are carried by the host they are indeed always
available.

--
Dmitry
From: Arnd Bergmann on
On Wednesday 05 May 2010 22:36:31 Dmitry Torokhov wrote:
>
> On Wednesday 05 May 2010 01:09:48 pm Arnd Bergmann wrote:
> > > > If you have any interest in developing this further, do:
> > > >
> > > > (1) move the limited VF drivers directly into the kernel tree,
> > > > talk to them through a normal ops vector
> > >
> > > [PT] This assumes that all the VF drivers would always be available.
> > > Also we have to support windows and our current design supports it
> > > nicely in an OS agnostic manner.
> >
> > Your approach assumes that the plugin is always available, which has
> > exactly the same implications.
>
> Since plugin[s] are carried by the host they are indeed always
> available.

But what makes you think that you can build code that can be linked
into arbitrary future kernel versions? The kernel does not define any
calling conventions that are stable across multiple versions or
configurations. For example, you'd have to provide different binaries
for each combination of

- 32/64 bit code
- gcc -mregparm=?
- lockdep
- tracepoints
- stackcheck
- NOMMU
- highmem
- whatever new gets merged

If you build the plugins only for specific versions of "enterprise" Linux
kernels, the code becomes really hard to debug and maintain.
If you wrap everything in your own version of the existing interfaces, your
code gets bloated to the point of being unmaintainable.

So I have to correct myself: this is very different from assuming the
driver is available in the guest, it's actually much worse.
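
To pick just one item from that list, even a trivial structure shared between
the shell and a prebuilt plugin changes layout with the kernel configuration
(illustration only):

#include <linux/spinlock.h>
#include <linux/types.h>

struct plugin_shared_queue {
	spinlock_t lock;	/* size depends on CONFIG_SMP, CONFIG_DEBUG_SPINLOCK
				 * and CONFIG_DEBUG_LOCK_ALLOC (lockdep) */
	u32 head;		/* so offsetof(..., head) differs between builds */
	u32 tail;		/* and a binary plugin cannot hard-code it */
};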

Arnd