From: Shirley Ma on
Hello Eddie,

On Wed, 2010-08-04 at 10:06 +0800, Dong, Eddie wrote:
> But zero-copy is a Linux generic feature that can be used by other
> VMMs as well if the BE service drivers want to incorporate. If we can
> make mp device VMM-agnostic (it may be not yet in current patch), that
> will help Linux more.

First, other VMMs support tun/tap, which provides most of the needed
functionality but not zero copy.

Second, the mp patch only supports zero copy for vhost right now, whereas
macvtap zero copy would not be limited to vhost alone.

Third, the current mp device doesn't fall back to copying on failure.

So you could extend the mp device to support all of these functions, but
its usage and functionality would then be similar to macvtap's. Then either
the mp device replaces macvtap in the future, or we enhance macvtap to
support zero copy. It is not necessary to have both in Linux.

I think it's better to implement zero copy in macvtap and tun/tap instead
of creating a new mp device.

Thanks
Shirley

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Xin, Xiaohui on
Herbert,
The v8 patches are modified mostly based on your comments about the
napi_gro_frags() interface. What do you think about the net core part
of the patches?
We know there are currently some comments about the mp device, such as
supporting zero copy in tun/tap and macvtap instead. Since there is no
decision on that yet, could you comment on the net core part first?
That part is the same regardless of which zero-copy approach is taken.

Thanks
Xiaohui

>-----Original Message-----
>From: linux-kernel-owner(a)vger.kernel.org [mailto:linux-kernel-owner(a)vger.kernel.org] On
>Behalf Of xiaohui.xin(a)intel.com
>Sent: Thursday, July 29, 2010 7:15 PM
>To: netdev(a)vger.kernel.org; kvm(a)vger.kernel.org; linux-kernel(a)vger.kernel.org;
>mst(a)redhat.com; mingo(a)elte.hu; davem(a)davemloft.net; herbert(a)gondor.apana.org.au;
>jdike(a)linux.intel.com
>Subject: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.
>
>We provide a zero-copy method by which the driver side may get external
>buffers to DMA into. Here, external means the driver does not allocate
>skb buffers from kernel space. Currently the external buffers come from
>the guest virtio-net driver.
>
>The idea is simple: pin the guest VM's user-space buffers and give the
>host NIC driver the chance to DMA to them directly.
>The patches are based on the vhost-net backend driver. We add a device
>which provides proto_ops (sendmsg/recvmsg) to vhost-net so it can
>send/receive directly to/from the NIC driver. A KVM guest that uses the
>vhost-net backend may bind any ethX interface on the host side to get
>copyless data transfer through the guest virtio-net frontend.
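For illustration only, a rough sketch of what such a proto_ops hookup can look
like; this is not the patch code, the mp_* names are made up, and the
prototypes follow the 2.6.3x-era struct proto_ops:

    #include <linux/net.h>
    #include <linux/socket.h>
    #include <linux/skbuff.h>

    /* Sketch: the mp device exposes a socket whose sendmsg/recvmsg go
     * straight to the zero-copy machinery, so vhost-net can drive it the
     * same way it drives a tap socket. */
    static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
                          struct msghdr *m, size_t total_len)
    {
            /* pin the guest pages described by m->msg_iov and queue them
             * for transmission; completion is reported asynchronously */
            return total_len;
    }

    static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
                          struct msghdr *m, size_t total_len, int flags)
    {
            /* return data that the NIC has already DMAed into guest pages */
            return 0;
    }

    static const struct proto_ops mp_socket_ops = {
            .family  = PF_UNSPEC,
            .sendmsg = mp_sendmsg,
            .recvmsg = mp_recvmsg,
    };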
>
>patch 01-10: net core and kernel changes.
>patch 11-13: new device as interface to manipulate external buffers.
>patch 14: for vhost-net.
>patch 15: An example of modifying a NIC driver to use napi_gro_frags().
>patch 16: An example of how to get guest buffers in a driver that uses
>          napi_gro_frags(). (A rough sketch of this rx pattern follows.)
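As a rough illustration of the rx pattern patches 15 and 16 build on (not the
patch code; the function name here is made up), a driver using
napi_gro_frags() attaches the received pages, which in this design may be
guest pages, to an skb obtained from NAPI:

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Sketch of a napi_gro_frags()-based receive path for one frame. */
    static void example_rx_frame(struct napi_struct *napi,
                                 struct page *page, unsigned int len)
    {
            struct sk_buff *skb = napi_get_frags(napi);

            if (!skb)
                    return;

            /* attach the page the NIC DMAed into as frag 0; no payload copy */
            skb_fill_page_desc(skb, 0, page, 0, len);
            skb->len      += len;
            skb->data_len += len;
            skb->truesize += len;

            napi_gro_frags(napi);
    }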
>
>The guest virtio-net driver submits multiple requests through the
>vhost-net backend driver to the kernel. The requests are queued and
>completed after the corresponding hardware actions are done.
>
>For read, user-space buffers are handed to the NIC driver for rx when
>a page constructor API is invoked; that is, NICs can allocate user
>buffers from a page constructor. We add a hook in the netif_receive_skb()
>function to intercept incoming packets and notify the zero-copy device.
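Conceptually, the hook looks something like the sketch below; dev_is_mpassthru()
is the helper mentioned later in the changelog, while mp_device_data_ready()
is a made-up name standing in for the actual notification call:

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Sketch only: early in netif_receive_skb(), divert packets whose
     * device is bound to an mp port, so the zero-copy device can complete
     * the guest's pending read with the pages the NIC DMAed into. */
    static inline bool mp_intercept_rx(struct sk_buff *skb)
    {
            if (!dev_is_mpassthru(skb->dev))
                    return false;

            mp_device_data_ready(skb);    /* hypothetical notification helper */
            return true;
    }

    /* and in netif_receive_skb():
     *
     *      if (mp_intercept_rx(skb))
     *              return NET_RX_SUCCESS;
     */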
>
>For write, the zero-copy device allocates a new host skb, puts the
>payload on skb_shinfo(skb)->frags, and copies the header to skb->data.
>The request remains pending until the skb is transmitted by the hardware.
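A minimal sketch of that write path (illustrative only, not the patch code):
only the header is copied into the linear area, and the pinned guest page
carrying the payload is attached as a fragment so the NIC DMAs it directly:

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <linux/string.h>

    static struct sk_buff *mp_build_tx_skb(struct net_device *dev,
                                           const void *hdr, unsigned int hdr_len,
                                           struct page *page, unsigned int off,
                                           unsigned int payload_len)
    {
            struct sk_buff *skb = netdev_alloc_skb(dev, hdr_len);

            if (!skb)
                    return NULL;

            /* small copy: header only, into skb->data */
            memcpy(skb_put(skb, hdr_len), hdr, hdr_len);

            /* zero copy: the pinned guest page becomes frag 0 */
            skb_fill_page_desc(skb, 0, page, off, payload_len);
            skb->len      += payload_len;
            skb->data_len += payload_len;
            skb->truesize += payload_len;

            return skb;
    }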
>
>We provide multiple submits and asynchronous notification to
>vhost-net too.
>
>Our goal is to improve the bandwidth and reduce the CPU usage.
>Exact performance data will be provided later.
>
>What we have not done yet:
> Performance tuning
>
>what we have done in v1:
> polish the RCU usage
> deal with write logging in asynchronous mode in vhost
> add notifier block for mp device
> rename page_ctor to mp_port in netdevice.h to make it look generic
> add mp_dev_change_flags() for mp device to change NIC state
> add CONFIG_VHOST_MPASSTHRU to limit usage when the module is not loaded
> a small fix for a missing dev_put() on failure
> use dynamic minor numbers instead of a static minor number
> a __KERNEL__ guard for mp_get_sock()
>
>what we have done in v2:
>
> remove most of the RCU usage, since the ctor pointer is only
> changed by the BIND/UNBIND ioctl, and during that time the NIC is
> stopped for a clean teardown (all outstanding requests are finished),
> so the ctor pointer cannot be raced into a wrong state.
>
> Replace struct vhost_notifier with struct kiocb.
> Let the vhost-net backend alloc/free the kiocbs and transfer them
> via sendmsg/recvmsg.
>
> use get_user_pages_fast() and set_page_dirty_lock() when read.
>
> Add some comments for netdev_mp_port_prep() and handle_mpassthru().
>
>what we have done in v3:
> the async write logging is rewritten
> a drafted synchronous write function for qemu live migration
> a limit on pages locked via get_user_pages_fast(), enforced through
> RLIMIT_MEMLOCK, to prevent DoS (sketched below)
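A sketch of how such a limit can be enforced around get_user_pages_fast()
(illustrative only; the helper name and the accounting counter are made up,
the rlimit() accessor is the helper available around these kernel versions,
and the get_user_pages_fast() prototype is the 2.6.3x-era one):

    #include <linux/mm.h>
    #include <linux/sched.h>

    /* Sketch: refuse to pin more guest pages than RLIMIT_MEMLOCK allows,
     * so an untrusted guest cannot lock down arbitrary amounts of memory. */
    static int mp_pin_pages(unsigned long uaddr, int nr_pages,
                            struct page **pages, atomic_t *locked_pages)
    {
            unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
            int got, i;

            /* account first, back out if the limit would be exceeded */
            if (atomic_add_return(nr_pages, locked_pages) > limit) {
                    atomic_sub(nr_pages, locked_pages);
                    return -ENOMEM;
            }

            got = get_user_pages_fast(uaddr, nr_pages, 1 /* write */, pages);
            if (got < nr_pages) {
                    for (i = 0; i < got; i++)
                            put_page(pages[i]);
                    atomic_sub(nr_pages, locked_pages);
                    return -EFAULT;
            }
            return got;
    }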
>
>
>what we have done in v4:
> add iocb completion callback from vhost-net to queue iocb in mp device
> replace vq->receiver by mp_sock_data_ready()
> remove code in the mp device that accesses vhost-net structures
> modify skb_reserve() to ignore host NIC driver reserved space
> rebase to the latest vhost tree
> split large patches into small pieces, especially for net core part.
>
>
>what we have done in v5:
> address Arnd Bergmann's comments
> -remove IFF_MPASSTHRU_EXCL flag in mp device
> -Add CONFIG_COMPAT macro
> -remove mp_release ops
> make dev_is_mpassthru() an inline func
> fix a bug in memory relinquish
> Applied to the current git (2.6.34-rc6) tree.
>
>what we have done in v6:
> move create_iocb() out of page_dtor, which may run in interrupt context
> -This removes the potential issue of taking a lock in interrupt context
> make the caches used by mp and vhost static, created/destroyed in the
> modules' init/exit functions.
> -This allows multiple mp guests to be created at the same time.
>
>what we have done in v7:
> some cleanup in preparation for supporting PS mode
>
>what we have done in v8:
> discard the modifications that pointed skb->data to the guest buffer directly.
> Add code to modify the driver to support napi_gro_frags(), per Herbert's
> comments, to support PS mode.
> Add mergeable buffer support in the mp device.
> Add GSO/GRO support in the mp device.
> Address comments from Eric Dumazet about cache line and RCU usage.
>
>
From: Shirley Ma on
Hello Xiaohui,

On Fri, 2010-08-06 at 17:23 +0800, xiaohui.xin(a)intel.com wrote:
> Our goal is to improve the bandwidth and reduce the CPU usage.
> Exact performance data will be provided later.

Have you had any performance data to share here yet? I tested my
experimental macvtap zero copy for TX only. The performance I have seen,
without any tuning (default settings), is below:

Before: netperf with a 16K message size over a 60-second run gives 7.5Gb/s
over an ixgbe 10GbE card. perf top shows:

2103.00 12.9% copy_user_generic_string
1541.00 9.4% handle_tx
1490.00 9.1% _raw_spin_unlock_irqrestore
1361.00 8.3% _raw_spin_lock_irqsave
1288.00 7.9% _raw_spin_lock
924.00 5.7% vhost_worker

After: netperf results over a 60-second run are 8.1Gb/s; perf output:

1093.00 9.9% _raw_spin_unlock_irqrestore
1048.00 9.5% handle_tx
934.00 8.5% _raw_spin_lock_irqsave
864.00 7.9% _raw_spin_lock
644.00 5.9% vhost_worker
387.00 3.5% use_mm

I am still working on collecting more data (latency, CPU
utilization, ...). I will let you know once I have all the data for
macvtap TX zero copy. Also, I found a vhost performance regression on the
new kernel with tuning: I used to get 9.4Gb/s, but now I can't reach it.

Shirley


From: Shirley Ma on
On Tue, 2010-08-10 at 18:23 -0700, Shirley Ma wrote:
> Also I found some vhost performance regression on the new
> kernel with tuning. I used to get 9.4Gb/s, now I couldn't get it.

I forgot to mention that the kernel I used was 2.6.36. I also found that
the native host bandwidth is limited to 8.0Gb/s, so the regression might
come from the device driver rather than vhost.

Shirley



From: Shirley Ma on
Hello Avi,

On Mon, 2010-08-02 at 19:32 +0300, Avi Kivity wrote:
> When are you seeing gup_fast() fall back to gup()? It should be at
> most
> once per page (when a guest starts up none of its pages are mapped,
> it
> faults them in on demand).

netperf/netserver latency results are pretty good for message sizes
between 1 byte and 512 bytes when I do a 64-byte small copy.

However, if I don't do any small copy, the ping RTT is unreasonably
large. Since we think it's better to copy small messages anyway, this
should not be an issue.
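In other words, the decision can be a simple length threshold. A minimal
sketch, where the cutoff value is only illustrative (it matches the 64 bytes
tested above, not a tuned constant):

    /* Sketch: copy short payloads into the skb, pin and zero-copy the rest,
     * so pin/unpin overhead never dominates small-packet RTT. */
    #define SMALL_COPY_LEN 64

    static inline bool use_zero_copy(size_t payload_len)
    {
            return payload_len > SMALL_COPY_LEN;
    }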

Thanks
Shirley
