From: Stephen Hemminger on
Still not sure this is a good idea for a couple of reasons:

1. We already have lots of special cases with skb's (frags and fraglist),
and skb's travel through a lot of different parts of the kernel. So any
new change like this creates lots of exposed points for new bugs. Look
at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
and ppp and ...

2. SKB's can have infinite lifetime in the kernel. If these buffers come from
a fixed size pool in an external device, they can easily all get tied up
if you have a slow listener. What happens then?
From: Xin, Xiaohui on
>-----Original Message-----
>From: Stephen Hemminger [mailto:shemminger(a)vyatta.com]
>Sent: Monday, June 07, 2010 7:14 AM
>To: Xin, Xiaohui
>Cc: netdev(a)vger.kernel.org; kvm(a)vger.kernel.org; linux-kernel(a)vger.kernel.org;
>mst(a)redhat.com; mingo(a)elte.hu; davem(a)davemloft.net; herbert(a)gondor.apana.org.au;
>jdike(a)linux.intel.com
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>Still not sure this is a good idea for a couple of reasons:
>
>1. We already have lots of special cases with skb's (frags and fraglist),
> and skb's travel through a lot of different parts of the kernel. So any
> new change like this creates lots of exposed points for new bugs. Look
> at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
> and ppp and ...
>

Yes, I agree with your concern to some extent. But in our case the skbs that use external pages have a short journey: they start in the NIC driver and are forwarded to the guest immediately. They do not travel into the host kernel stack or pass through any filters, which avoids many of the problems you mention here.

Another point is that we are trying to make the solution generic across different NIC drivers, so that many drivers can use this path without modification.
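
To make "skbs which use external pages" a little more concrete, here is a minimal plain-C sketch of the kind of descriptor involved. It is only an illustration under my own assumptions (the names and fields are hypothetical, not the ones in the patch): the data pages attached to an skb are owned externally, so the skb carries a destructor callback that hands the pages back to their owner when the skb is freed, which is also what lets an unmodified NIC driver fill such pages.

/* Hypothetical sketch only -- not the structure added by the patch.
 * Idea: data pages attached to an skb are owned by an external entity
 * (here, a guest buffer pool), so the skb carries a small descriptor
 * with a destructor that returns the pages on free.
 */
#include <stdio.h>

struct ext_page_desc {
	void *page;                               /* externally owned buffer  */
	void *owner;                              /* e.g. per-guest pool      */
	void (*dtor)(struct ext_page_desc *desc); /* return page to the owner */
};

static void guest_pool_return(struct ext_page_desc *desc)
{
	printf("returning page %p to pool %p\n", desc->page, desc->owner);
}

/* What a generic skb-free path would do in this model. */
static void release_ext_page(struct ext_page_desc *desc)
{
	if (desc && desc->dtor)
		desc->dtor(desc);
}

int main(void)
{
	char page[4096];
	struct ext_page_desc desc = { page, NULL, guest_pool_return };

	release_ext_page(&desc);
	return 0;
}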

>2. SKB's can have infinite lifetime in the kernel. If these buffers come from
> a fixed size pool in an external device, they can easily all get tied up
> if you have a slow listener. What happens then?

The pool is not fixed in size; its size is the number of usable buffers submitted by the guest virtio-net driver. The guest virtio-net driver checks how many buffers have been used and resubmits new ones. What does a slow listener mean here? A slow NIC driver?
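
As a rough model of that refill behaviour (a user-space sketch; the constants and names are assumptions, not taken from virtio-net or vhost): the pool only ever holds what the guest has posted, and the guest tops it up toward the virtqueue capacity whenever the number of unused buffers drops low.

/* Plain-C model of the refill loop described above; constants and
 * names are assumptions for illustration, not taken from the drivers.
 */
#include <stdio.h>

#define VQ_CAPACITY 256  /* assumed virtqueue size   */
#define LOW_WATER    64  /* assumed refill threshold */

struct rx_pool {
	unsigned int posted;    /* buffers the guest has submitted   */
	unsigned int consumed;  /* buffers the host has already used */
};

/* Guest side: check how many buffers remain and resubmit up to the
 * virtqueue capacity, so the pool never grows beyond the virtqueue. */
static void maybe_refill(struct rx_pool *p)
{
	unsigned int unused = p->posted - p->consumed;

	if (unused < LOW_WATER) {
		unsigned int add = VQ_CAPACITY - unused;

		p->posted += add;
		printf("resubmitted %u buffers, %u now outstanding\n",
		       add, p->posted - p->consumed);
	}
}

int main(void)
{
	struct rx_pool p = { .posted = 256, .consumed = 200 };

	maybe_refill(&p);  /* 56 unused < 64, so refill back to 256 */
	return 0;
}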

Thanks
Xiaohui
From: Xin, Xiaohui on
>-----Original Message-----
>From: kvm-owner(a)vger.kernel.org [mailto:kvm-owner(a)vger.kernel.org] On Behalf Of Andi
>Kleen
>Sent: Monday, June 07, 2010 3:51 PM
>To: Stephen Hemminger
>Cc: Xin, Xiaohui; netdev(a)vger.kernel.org; kvm(a)vger.kernel.org;
>linux-kernel(a)vger.kernel.org; mst(a)redhat.com; mingo(a)elte.hu; davem(a)davemloft.net;
>herbert(a)gondor.apana.org.au; jdike(a)linux.intel.com
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>Stephen Hemminger <shemminger(a)vyatta.com> writes:
>
>> Still not sure this is a good idea for a couple of reasons:
>>
>> 1. We already have lots of special cases with skb's (frags and fraglist),
>> and skb's travel through a lot of different parts of the kernel. So any
>> new change like this creates lots of exposed points for new bugs. Look
>> at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>> and ppp and ...
>>
>> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>> a fixed size pool in an external device, they can easily all get tied up
>> if you have a slow listener. What happens then?
>
>3. If they come from an internal pool what happens when the kernel runs
>low on memory? How is that pool balanced against other kernel
>memory users?
>
The size of the pool is currently limited by the virtqueue capacity.
If the virtqueue is really huge, then how to balance that memory against other users becomes a problem.
I have not yet worked out how to tune it dynamically...
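
To put a rough number on the memory question (back-of-the-envelope figures of my own, not from the patch): the pinned memory is bounded by the virtqueue size times the per-buffer size, so a 256-entry virtqueue of 4 KiB buffers pins about 1 MiB per receive queue, and a really huge virtqueue scales that up linearly.

/* Back-of-the-envelope bound on pool memory; sizes are assumptions. */
#include <stdio.h>

int main(void)
{
	unsigned long vq_entries = 256;   /* assumed virtqueue size */
	unsigned long buf_size   = 4096;  /* one page per buffer    */
	unsigned long bytes      = vq_entries * buf_size;

	/* 256 * 4096 = 1048576 bytes = 1 MiB pinned per receive queue. */
	printf("worst case: %lu KiB pinned per queue\n", bytes / 1024);
	return 0;
}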

>-Andi
>
>--
>ak(a)linux.intel.com -- Speaking for myself only.
From: Xin, Xiaohui on
>-----Original Message-----
>From: Mitchell Erblich [mailto:erblichs(a)earthlink.net]
>Sent: Monday, June 07, 2010 4:17 PM
>To: Andi Kleen
>Cc: Stephen Hemminger; Xin, Xiaohui; netdev(a)vger.kernel.org; kvm(a)vger.kernel.org;
>linux-kernel(a)vger.kernel.org; mst(a)redhat.com; mingo(a)elte.hu; davem(a)davemloft.net;
>herbert(a)gondor.apana.org.au; jdike(a)linux.intel.com
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>
>On Jun 7, 2010, at 12:51 AM, Andi Kleen wrote:
>
>> Stephen Hemminger <shemminger(a)vyatta.com> writes:
>>
>>> Still not sure this is a good idea for a couple of reasons:
>>>
>>> 1. We already have lots of special cases with skb's (frags and fraglist),
>>> and skb's travel through a lot of different parts of the kernel. So any
>>> new change like this creates lots of exposed points for new bugs. Look
>>> at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>>> and ppp and ...
>>>
>>> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>>> a fixed size pool in an external device, they can easily all get tied up
>>> if you have a slow listener. What happens then?
>>
>> 3. If they come from an internal pool what happens when the kernel runs
>> low on memory? How is that pool balanced against other kernel
>> memory users?
>>
>> -Andi
>>
>> --
>> ak(a)linux.intel.com -- Speaking for myself only.
>
>In general,
>
>When an internal pool is created/used, there SHOULD be a reason.
>Maybe, to keep allocation latency to a min, OR ...
>
The internal pool here is a collection of user buffers submitted
by the guest virtio-net driver. The guest puts buffers into it and
the host driver takes buffers from it. If the guest submits more
buffers than the driver currently needs, we need somewhere to hold
them; that is what the internal pool is for.
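
A minimal sketch of that arrangement (user-space C with hypothetical names): the guest pushes buffer descriptors into a bounded FIFO and the driver pops one whenever it needs somewhere to place a received packet; the FIFO is exactly the internal pool holding any surplus.

/* Plain-C model of the internal pool described above: a FIFO of
 * guest-submitted buffers.  Names and sizes are illustrative only.
 */
#include <stdio.h>

#define POOL_MAX 256   /* bounded by the virtqueue size */

struct buf_pool {
	void        *bufs[POOL_MAX];
	unsigned int head, tail, count;
};

/* Guest side: submit a receive buffer into the pool. */
static int pool_push(struct buf_pool *p, void *buf)
{
	if (p->count == POOL_MAX)
		return -1;                    /* guest submitted too many */
	p->bufs[p->tail] = buf;
	p->tail = (p->tail + 1) % POOL_MAX;
	p->count++;
	return 0;
}

/* Driver side: take a buffer when a packet arrives. */
static void *pool_pop(struct buf_pool *p)
{
	void *buf;

	if (p->count == 0)
		return NULL;                  /* guest is not keeping up */
	buf = p->bufs[p->head];
	p->head = (p->head + 1) % POOL_MAX;
	p->count--;
	return buf;
}

int main(void)
{
	static struct buf_pool pool;
	char page[4096];

	pool_push(&pool, page);
	printf("driver got buffer %p\n", pool_pop(&pool));
	return 0;
}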

>Now IMO,
>
>internal pool objects should have a ref count and
>if that count is 0, then under memory pressure and/or num
>of objects are above a high water mark, then they are freed,
>
>OR if there is a last reference age field, then the object is to be
>cleaned if dirty, then freed,
>
>Else, the pool is allowed to grow if the number of objects in the
>pool is below a set max (max COULD equal Infinity).

Thanks for the thoughts.

Basically, the size of the internal pool is not decided by the pool itself,
and adding or deleting objects in the pool is not the pool's job either.
Both are decided by the guest virtio-net driver and the vhost-net driver
together, driven by the guest's receive and resubmit rates.
The maximum size of the pool is limited by the number of virtqueue buffers.

Thanks
Xiaohui

>
>
>
>Mitchell Erblich
From: Ben Hutchings on
On Sun, 2010-06-20 at 21:59 +1000, Herbert Xu wrote:
> On Sun, Jun 20, 2010 at 02:47:19PM +0300, Michael S. Tsirkin wrote:
> >
> > Let's do this then. So far the virtio spec avoided making layout
> > assumptions, leaving guests lay out data as they see fit.
> > Isn't it possible to keep supporting this with zero copy for hardware
> > that can issue DMA at arbitrary addresses?
>
> I think you're mistaken with respect to what is being proposed.
> Raising 512 bytes isn't a hard constraint, it is merely an
> optimisation for Intel NICs because their PS mode can produce
> a head fragment of up to 512 bytes.
>
> If the guest didn't allocate 512 bytes it wouldn't be the end of
> the world, it'd just mean that we'd either copy whatever is in
> the head fragment, or we waste 4096-X bytes of memory where X
> is the number of bytes in the head.
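
As a worked example of that trade-off (my own numbers, assuming the guest posts 4096-byte pages and the NIC's PS-mode head fragment holds X = 256 bytes): either 4096 - 256 = 3840 bytes are wasted per packet, or the 256-byte head is copied instead.

/* Worked example of the head-fragment trade-off; sizes assumed. */
#include <stdio.h>

int main(void)
{
	unsigned int page_size = 4096;  /* guest posts whole pages         */
	unsigned int head_len  = 256;   /* X: bytes in the PS-mode head    */
					/* fragment (at most 512)          */

	/* Option 1: put the head in its own page and waste the rest. */
	printf("wasted per packet: %u bytes\n", page_size - head_len);

	/* Option 2: copy the head into the first data page instead,
	 * trading a small memcpy for no wasted memory. */
	printf("copied per packet: %u bytes\n", head_len);
	return 0;
}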

If I understand correctly what this 'PS mode' is (I haven't seen the
documentation for it), it is a feature that Microsoft requested from
hardware vendors for use in Hyper-V. As a result, the SFC9000 family
and presumably other controllers also implement something similar.

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
