From: Herbert Xu on
On Sun, Jun 06, 2010 at 04:13:48PM -0700, Stephen Hemminger wrote:
> Still not sure this is a good idea for a couple of reasons:
>
> 1. We already have lots of special cases with skb's (frags and fraglist),
> and skb's travel through a lot of different parts of the kernel. So any
> new change like this creates lots of exposed points for new bugs. Look
> at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
> and ppp and ...
>
> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
> a fixed size pool in an external device, they can easily all get tied up
> if you have a slow listener. What happens then?

I agree with Stephen on this.

FWIW I don't think we even need the external pages concept in
order to implement zero-copy receive (which I gather is the intent
here).

Here is one way to do it, simply construct a completely non-linear
packet in the driver, as you would if you were using the GRO frags
interface (grep for napi_gro_frags under drivers/net for examples).

This way you can transfer the entire contents of the packet without
copying through to the other side, provided that the host stack does
not modify the packet.

If the host side did modify the packet then we have to incur the
memory cost anyway.

IOW I think the only feature provided by the external pages
construct is allowing the skb->head area to be shared without
copying. I'm claiming that this can be done by simply making
skb->head empty.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert(a)gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Xin, Xiaohui on
>-----Original Message-----
>From: Herbert Xu [mailto:herbert(a)gondor.apana.org.au]
>Sent: Tuesday, June 08, 2010 1:28 PM
>To: Stephen Hemminger
>Cc: Xin, Xiaohui; netdev(a)vger.kernel.org; kvm(a)vger.kernel.org;
>linux-kernel(a)vger.kernel.org; mst(a)redhat.com; mingo(a)elte.hu; davem(a)davemloft.net;
>jdike(a)linux.intel.com
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>On Sun, Jun 06, 2010 at 04:13:48PM -0700, Stephen Hemminger wrote:
>> Still not sure this is a good idea for a couple of reasons:
>>
>> 1. We already have lots of special cases with skb's (frags and fraglist),
>> and skb's travel through a lot of different parts of the kernel. So any
>> new change like this creates lots of exposed points for new bugs. Look
>> at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>> and ppp and ...
>>
>> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>> a fixed size pool in an external device, they can easily all get tied up
>> if you have a slow listener. What happens then?
>
>I agree with Stephen on this.
>
>FWIW I don't think we even need the external pages concept in
>order to implement zero-copy receive (which I gather is the intent
>here).
>
>Here is one way to do it, simply construct a completely non-linear
>packet in the driver, as you would if you were using the GRO frags
>interface (grep for napi_gro_frags under drivers/net for examples).
>
I'm not sure if I understand your way correctly:
1) Does the way only deal with driver with SG feature? Since packet
is non-linear...

2) Is skb->data still pointing to guest user buffers?
If yes, how to avoid the modifications to net core change to skb?

3) In our way only parts of drivers need be modified to support zero-copy.
and here, need we modify all the drivers?

If I missed your idea, may you explain it in more detail?

>This way you can transfer the entire contents of the packet without
>copying through to the other side, provided that the host stack does
>not modify the packet.
>

>If the host side did modify the packet then we have to incur the
>memory cost anyway.
>
>IOW I think the only feature provided by the external pages
>construct is allowing the skb->head area to be shared without
>copying. I'm claiming that this can be done by simply making
>skb->head empty.
>
I think to make skb->head empty at first will cause more effort to pass the check with
skb header. Have I missed something here? I really make the skb->head NULL
just before kfree(skb) in skb_release_data(), it's done by callback we have made for skb.

Thanks
Xiaohui

>Cheers,
>--
>Visit Openswan at http://www.openswan.org/
>Email: Herbert Xu ~{PmV>HI~} <herbert(a)gondor.apana.org.au>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Herbert Xu on
On Wed, Jun 09, 2010 at 05:54:02PM +0800, Xin, Xiaohui wrote:
>
> I'm not sure if I understand your way correctly:
> 1) Does the way only deal with driver with SG feature? Since packet
> is non-linear...

No the hardware doesn't have to support SG. You just need to
place the entire packet contents in a page instead of skb->head.

> 2) Is skb->data still pointing to guest user buffers?
> If yes, how to avoid the modifications to net core change to skb?

skb->data would not point to guest user buffers. In the common
case the packet is not modified on its way to the guest so this
is not an issue.

In the rare case where it is modified, you only have to copy the
bits which are modified and the cost of that is inconsequential
since you have to write to that memory anyway.

> 3) In our way only parts of drivers need be modified to support zero-copy.
> and here, need we modify all the drivers?

If you're asking the portion of each driver supporting zero-copy
that needs to be modified, then AFAICS this doesn't change that
very much at all.

> I think to make skb->head empty at first will cause more effort to pass the check with
> skb header. Have I missed something here? I really make the skb->head NULL
> just before kfree(skb) in skb_release_data(), it's done by callback we have made for skb.

No I'm not suggesting you set it to NULL. It should have some
memory allocated, but skb_headlen(skb) should be zero.

Please have a look at how the napi_gro_frags interface works (e.g.,
in drivers/net/cxgb3/sge.c). This is exactly the model that I am
suggesting.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert(a)gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Xin, Xiaohui on
>-----Original Message-----
>From: Herbert Xu [mailto:herbert(a)gondor.apana.org.au]
>Sent: Friday, June 11, 2010 1:21 PM
>To: Xin, Xiaohui
>Cc: Stephen Hemminger; netdev(a)vger.kernel.org; kvm(a)vger.kernel.org;
>linux-kernel(a)vger.kernel.org; mst(a)redhat.com; mingo(a)elte.hu; davem(a)davemloft.net;
>jdike(a)linux.intel.com
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>On Wed, Jun 09, 2010 at 05:54:02PM +0800, Xin, Xiaohui wrote:
>>
>> I'm not sure if I understand your way correctly:
>> 1) Does the way only deal with driver with SG feature? Since packet
>> is non-linear...
>
>No the hardware doesn't have to support SG. You just need to
>place the entire packet contents in a page instead of skb->head.
>
>> 2) Is skb->data still pointing to guest user buffers?
>> If yes, how to avoid the modifications to net core change to skb?
>
>skb->data would not point to guest user buffers. In the common
>case the packet is not modified on its way to the guest so this
>is not an issue.
>
>In the rare case where it is modified, you only have to copy the
>bits which are modified and the cost of that is inconsequential
>since you have to write to that memory anyway.
>
>> 3) In our way only parts of drivers need be modified to support zero-copy.
>> and here, need we modify all the drivers?
>
>If you're asking the portion of each driver supporting zero-copy
>that needs to be modified, then AFAICS this doesn't change that
>very much at all.
>
>> I think to make skb->head empty at first will cause more effort to pass the check with
>> skb header. Have I missed something here? I really make the skb->head NULL
>> just before kfree(skb) in skb_release_data(), it's done by callback we have made for skb.
>
>No I'm not suggesting you set it to NULL. It should have some
>memory allocated, but skb_headlen(skb) should be zero.
>
>Please have a look at how the napi_gro_frags interface works (e.g.,
>in drivers/net/cxgb3/sge.c). This is exactly the model that I am
>suggesting.
>
>Cheers,
>--
>Visit Openswan at http://www.openswan.org/
>Email: Herbert Xu ~{PmV>HI~} <herbert(a)gondor.apana.org.au>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Herbert,
I explained what I think the thought in your mind here, please clarify if
something missed.

1) Modify driver from netdev_alloc_skb() to alloc user pages if dev is zero-copyed.
If the driver support PS mode, then modify alloc_page() too.
2) Add napi_gro_frags() in driver to receive the user pages instead of driver's receiving
function.
3) napi_gro_frags() will allocate small skb and pull the header data from
the first page to skb->data.

Is above the way what you have suggested?
I have thought something in detail about the way.

1) The first page will have an offset after the header is copied into allocated kernel skb.
The offset should be recalculated when the user page data is transferred to guest. This
may modify some of the gro code.

2) napi_gro_frags() may remove a page when it's data is totally be pulled, but we cannot
put a user page as normally. This may modify the gro code too.

3) When the user buffer returned to guest, some of them need to be appended a vnet header.
That means for some pages, the vnet header room should be reserved when allocated.
But we cannot know which one will be used as the first page when allocated. If we reserved vnet header for each page, since the set_skb_frag() in guest driver only use the offset 0 for second pages, then page data will be wrong.

4) Since the user buffer pages should be released, so we still need a dtor callback to do that, and then I still need a place to hold it. How do you think about to put it in skb_shinfo?

Currently I can only think of this.
How do you think about then?

Thanks
Xiaohui



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Xin, Xiaohui on
>-----Original Message-----
>From: netdev-owner(a)vger.kernel.org [mailto:netdev-owner(a)vger.kernel.org] On Behalf Of
>Xin, Xiaohui
>Sent: Saturday, June 12, 2010 5:31 PM
>To: Herbert Xu
>Cc: Stephen Hemminger; netdev(a)vger.kernel.org; kvm(a)vger.kernel.org;
>linux-kernel(a)vger.kernel.org; mst(a)redhat.com; mingo(a)elte.hu; davem(a)davemloft.net;
>jdike(a)linux.intel.com
>Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>>-----Original Message-----
>>From: Herbert Xu [mailto:herbert(a)gondor.apana.org.au]
>>Sent: Friday, June 11, 2010 1:21 PM
>>To: Xin, Xiaohui
>>Cc: Stephen Hemminger; netdev(a)vger.kernel.org; kvm(a)vger.kernel.org;
>>linux-kernel(a)vger.kernel.org; mst(a)redhat.com; mingo(a)elte.hu; davem(a)davemloft.net;
>>jdike(a)linux.intel.com
>>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>>
>>On Wed, Jun 09, 2010 at 05:54:02PM +0800, Xin, Xiaohui wrote:
>>>
>>> I'm not sure if I understand your way correctly:
>>> 1) Does the way only deal with driver with SG feature? Since packet
>>> is non-linear...
>>
>>No the hardware doesn't have to support SG. You just need to
>>place the entire packet contents in a page instead of skb->head.
>>
>>> 2) Is skb->data still pointing to guest user buffers?
>>> If yes, how to avoid the modifications to net core change to skb?
>>
>>skb->data would not point to guest user buffers. In the common
>>case the packet is not modified on its way to the guest so this
>>is not an issue.
>>
>>In the rare case where it is modified, you only have to copy the
>>bits which are modified and the cost of that is inconsequential
>>since you have to write to that memory anyway.
>>
>>> 3) In our way only parts of drivers need be modified to support zero-copy.
>>> and here, need we modify all the drivers?
>>
>>If you're asking the portion of each driver supporting zero-copy
>>that needs to be modified, then AFAICS this doesn't change that
>>very much at all.
>>
>>> I think to make skb->head empty at first will cause more effort to pass the check with
>>> skb header. Have I missed something here? I really make the skb->head NULL
>>> just before kfree(skb) in skb_release_data(), it's done by callback we have made for skb.
>>
>>No I'm not suggesting you set it to NULL. It should have some
>>memory allocated, but skb_headlen(skb) should be zero.
>>
>>Please have a look at how the napi_gro_frags interface works (e.g.,
>>in drivers/net/cxgb3/sge.c). This is exactly the model that I am
>>suggesting.
>>
>>Cheers,
>>--
>>Visit Openswan at http://www.openswan.org/
>>Email: Herbert Xu ~{PmV>HI~} <herbert(a)gondor.apana.org.au>
>>Home Page: http://gondor.apana.org.au/~herbert/
>>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>
>Herbert,
>I explained what I think the thought in your mind here, please clarify if
>something missed.
>
>1) Modify driver from netdev_alloc_skb() to alloc user pages if dev is zero-copyed.
> If the driver support PS mode, then modify alloc_page() too.
>2) Add napi_gro_frags() in driver to receive the user pages instead of driver's receiving
>function.
>3) napi_gro_frags() will allocate small skb and pull the header data from
>the first page to skb->data.
>
>Is above the way what you have suggested?
>I have thought something in detail about the way.
>
>1) The first page will have an offset after the header is copied into allocated kernel skb.
>The offset should be recalculated when the user page data is transferred to guest. This
>may modify some of the gro code.
>
>2) napi_gro_frags() may remove a page when it's data is totally be pulled, but we cannot
>put a user page as normally. This may modify the gro code too.
>
>3) When the user buffer returned to guest, some of them need to be appended a vnet header.
>That means for some pages, the vnet header room should be reserved when allocated.
>But we cannot know which one will be used as the first page when allocated. If we reserved
>vnet header for each page, since the set_skb_frag() in guest driver only use the offset 0 for
>second pages, then page data will be wrong.
>
>4) Since the user buffer pages should be released, so we still need a dtor callback to do that,
>and then I still need a place to hold it. How do you think about to put it in skb_shinfo?
>
>Currently I can only think of this.
>How do you think about then?
>
>Thanks
>Xiaohui

Herbert,
In this way, I think we should create 3 functions at least in drivers to allocate rx buffer, to receive the rx buffers, and to clean the rx buffers.

We can also have another way here. We can provide a function to only substitute
alloc_page(), and a function to release the pages when cleaning the rx buffers.
The skb for the rx buffer can be allocated in original way, and when pushing
the data to guest, the header data will be copied to guest buffer. In this way, we
should reserve sufficient room for the header in the first guest user buffers.
That need modifications to guest virtio-net kernel. And this way only suitable for
PS mode supported driver. Considered the advanced driver mostly has PS mode.
So it should be not a critical issue.

Thanks
Xiaohui


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/