From: Michael S. Tsirkin on
On Mon, Jun 28, 2010 at 01:08:07PM +0300, Michael S. Tsirkin wrote:
> Userspace virtio server has the following hack
> so guests rely on it, and we have to replicate it, too:
>
> Use port number to detect incoming IPv4 DHCP response packets,
> and fill in the checksum for these.
>
> The issue we are solving is that on linux guests, some apps
> that use recvmsg with AF_PACKET sockets, don't know how to
> handle CHECKSUM_PARTIAL;
> The interface to return the relevant information was added
> in 8dc4194474159660d7f37c495e3fc3f10d0db8cc,
> and older userspace does not use it.
> One important user of recvmsg with AF_PACKET is dhclient,
> so we add a work-around just for DHCP.
>
> Don't bother applying the hack to IPv6 as userspace virtio does not
> have a work-around for that - let's hope guests will do the right
> thing wrt IPv6.
>
> Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>

Tested-by: Laine Stump <laine(a)redhat.com>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Sridhar Samudrala on
On Mon, 2010-06-28 at 13:08 +0300, Michael S. Tsirkin wrote:
> Userspace virtio server has the following hack
> so guests rely on it, and we have to replicate it, too:
>
> Use port number to detect incoming IPv4 DHCP response packets,
> and fill in the checksum for these.
>
> The issue we are solving is that on linux guests, some apps
> that use recvmsg with AF_PACKET sockets, don't know how to
> handle CHECKSUM_PARTIAL;
> The interface to return the relevant information was added
> in 8dc4194474159660d7f37c495e3fc3f10d0db8cc,
> and older userspace does not use it.
> One important user of recvmsg with AF_PACKET is dhclient,
> so we add a work-around just for DHCP.
>
> Don't bother applying the hack to IPv6 as userspace virtio does not
> have a work-around for that - let's hope guests will do the right
> thing wrt IPv6.
>
> Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
> ---
>
> Dave, I'm going to put this patch on the vhost tree,
> no need for you to bother merging it - you'll get
> it with a pull request.
>
>
> drivers/vhost/net.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
> 1 files changed, 43 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index cc19595..03bba6a 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -24,6 +24,10 @@
> #include <linux/if_tun.h>
> #include <linux/if_macvlan.h>
>
> +#include <linux/ip.h>
> +#include <linux/udp.h>
> +#include <linux/netdevice.h>
> +
> #include <net/sock.h>
>
> #include "vhost.h"
> @@ -186,6 +190,44 @@ static void handle_tx(struct vhost_net *net)
> unuse_mm(net->dev.mm);
> }
>
> +static int peek_head(struct sock *sk)

This routine is doing more than just peeking the head of sk's receive
queue. May be this should be named similar to what qemu calls
'work_around_broken_dhclient()'
> +{
> + struct sk_buff *skb;
> +
> + lock_sock(sk);
> + skb = skb_peek(&sk->sk_receive_queue);
> + if (unlikely(!skb)) {
> + release_sock(sk);
> + return 0;
> + }
> + /* Userspace virtio server has the following hack so
> + * guests rely on it, and we have to replicate it, too: */
> + /* Use port number to detect incoming IPv4 DHCP response packets,
> + * and fill in the checksum. */
> +
> + /* The issue we are solving is that on linux guests, some apps
> + * that use recvmsg with AF_PACKET sockets, don't know how to
> + * handle CHECKSUM_PARTIAL;
> + * The interface to return the relevant information was added in
> + * 8dc4194474159660d7f37c495e3fc3f10d0db8cc,
> + * and older userspace does not use it.
> + * One important user of recvmsg with AF_PACKET is dhclient,
> + * so we add a work-around just for DHCP. */
> + if (skb->ip_summed == CHECKSUM_PARTIAL &&
> + skb_headlen(skb) >= skb_transport_offset(skb) +
> + sizeof(struct udphdr) &&
> + udp_hdr(skb)->dest == htons(68) &&
> + skb_network_header_len(skb) >= sizeof(struct iphdr) &&
> + ip_hdr(skb)->protocol == IPPROTO_UDP &&
> + skb->protocol == htons(ETH_P_IP)) {

Isn't it more logical to check for skb->protocol, followed by ip_hdr and
then udp_hdr?

> + skb_checksum_help(skb);
> + /* Restore ip_summed value: tun passes it to user. */
> + skb->ip_summed = CHECKSUM_PARTIAL;
> + }
> + release_sock(sk);
> + return 1;
> +}
> +
> /* Expects to be always run from workqueue - which acts as
> * read-size critical section for our kind of RCU. */
> static void handle_rx(struct vhost_net *net)
> @@ -222,7 +264,7 @@ static void handle_rx(struct vhost_net *net)
> vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> vq->log : NULL;
>
> - for (;;) {
> + while (peek_head(sock->sk)) {
> head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> ARRAY_SIZE(vq->iov),
> &out, &in,

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Michael S. Tsirkin on
On Mon, Jun 28, 2010 at 03:19:41PM -0700, Sridhar Samudrala wrote:
> On Mon, 2010-06-28 at 13:08 +0300, Michael S. Tsirkin wrote:
> > Userspace virtio server has the following hack
> > so guests rely on it, and we have to replicate it, too:
> >
> > Use port number to detect incoming IPv4 DHCP response packets,
> > and fill in the checksum for these.
> >
> > The issue we are solving is that on linux guests, some apps
> > that use recvmsg with AF_PACKET sockets, don't know how to
> > handle CHECKSUM_PARTIAL;
> > The interface to return the relevant information was added
> > in 8dc4194474159660d7f37c495e3fc3f10d0db8cc,
> > and older userspace does not use it.
> > One important user of recvmsg with AF_PACKET is dhclient,
> > so we add a work-around just for DHCP.
> >
> > Don't bother applying the hack to IPv6 as userspace virtio does not
> > have a work-around for that - let's hope guests will do the right
> > thing wrt IPv6.
> >
> > Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
> > ---
> >
> > Dave, I'm going to put this patch on the vhost tree,
> > no need for you to bother merging it - you'll get
> > it with a pull request.
> >
> >
> > drivers/vhost/net.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
> > 1 files changed, 43 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index cc19595..03bba6a 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -24,6 +24,10 @@
> > #include <linux/if_tun.h>
> > #include <linux/if_macvlan.h>
> >
> > +#include <linux/ip.h>
> > +#include <linux/udp.h>
> > +#include <linux/netdevice.h>
> > +
> > #include <net/sock.h>
> >
> > #include "vhost.h"
> > @@ -186,6 +190,44 @@ static void handle_tx(struct vhost_net *net)
> > unuse_mm(net->dev.mm);
> > }
> >
> > +static int peek_head(struct sock *sk)
>
> This routine is doing more than just peeking the head of sk's receive
> queue. May be this should be named similar to what qemu calls
> 'work_around_broken_dhclient()'
> > +{
> > + struct sk_buff *skb;
> > +
> > + lock_sock(sk);
> > + skb = skb_peek(&sk->sk_receive_queue);
> > + if (unlikely(!skb)) {
> > + release_sock(sk);
> > + return 0;
> > + }
> > + /* Userspace virtio server has the following hack so
> > + * guests rely on it, and we have to replicate it, too: */
> > + /* Use port number to detect incoming IPv4 DHCP response packets,
> > + * and fill in the checksum. */
> > +
> > + /* The issue we are solving is that on linux guests, some apps
> > + * that use recvmsg with AF_PACKET sockets, don't know how to
> > + * handle CHECKSUM_PARTIAL;
> > + * The interface to return the relevant information was added in
> > + * 8dc4194474159660d7f37c495e3fc3f10d0db8cc,
> > + * and older userspace does not use it.
> > + * One important user of recvmsg with AF_PACKET is dhclient,
> > + * so we add a work-around just for DHCP. */
> > + if (skb->ip_summed == CHECKSUM_PARTIAL &&
> > + skb_headlen(skb) >= skb_transport_offset(skb) +
> > + sizeof(struct udphdr) &&
> > + udp_hdr(skb)->dest == htons(68) &&
> > + skb_network_header_len(skb) >= sizeof(struct iphdr) &&
> > + ip_hdr(skb)->protocol == IPPROTO_UDP &&
> > + skb->protocol == htons(ETH_P_IP)) {
>
> Isn't it more logical to check for skb->protocol, followed by ip_hdr and
> then udp_hdr?


Yes, but then we'll only exit after checking them all.
My way we'll almost always exit after port check.

> > + skb_checksum_help(skb);
> > + /* Restore ip_summed value: tun passes it to user. */
> > + skb->ip_summed = CHECKSUM_PARTIAL;
> > + }
> > + release_sock(sk);
> > + return 1;
> > +}
> > +
> > /* Expects to be always run from workqueue - which acts as
> > * read-size critical section for our kind of RCU. */
> > static void handle_rx(struct vhost_net *net)
> > @@ -222,7 +264,7 @@ static void handle_rx(struct vhost_net *net)
> > vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> > vq->log : NULL;
> >
> > - for (;;) {
> > + while (peek_head(sock->sk)) {
> > head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > ARRAY_SIZE(vq->iov),
> > &out, &in,
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: David Miller on
From: "Michael S. Tsirkin" <mst(a)redhat.com>
Date: Tue, 29 Jun 2010 16:04:39 +0300

> Since using the module involves updating the management tools
> as well, if we go down this route it will be much less painful
> for everyone to do push it upstream.

Ok, you can make your case to Patrick McHardy and if he'll merge
it into his netfilter GIT tree I guess I'll have to take it :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Anthony Liguori on
On 06/29/2010 08:04 AM, Michael S. Tsirkin wrote:
> On Tue, Jun 29, 2010 at 12:36:47AM -0700, David Miller wrote:
>
>> From: "Michael S. Tsirkin"<mst(a)redhat.com>
>> Date: Mon, 28 Jun 2010 13:08:07 +0300
>>
>>
>>> Userspace virtio server has the following hack
>>> so guests rely on it, and we have to replicate it, too:
>>>
>>> Use port number to detect incoming IPv4 DHCP response packets,
>>> and fill in the checksum for these.
>>>
>>> The issue we are solving is that on linux guests, some apps
>>> that use recvmsg with AF_PACKET sockets, don't know how to
>>> handle CHECKSUM_PARTIAL;
>>> The interface to return the relevant information was added
>>> in 8dc4194474159660d7f37c495e3fc3f10d0db8cc,
>>> and older userspace does not use it.
>>> One important user of recvmsg with AF_PACKET is dhclient,
>>> so we add a work-around just for DHCP.
>>>
>>> Don't bother applying the hack to IPv6 as userspace virtio does not
>>> have a work-around for that - let's hope guests will do the right
>>> thing wrt IPv6.
>>>
>>> Signed-off-by: Michael S. Tsirkin<mst(a)redhat.com>
>>>
>> Yikes, this is awful too.
>>
>> Nothing in the kernel should be mucking around with procotol packets
>> like this by default. In particular, what the heck does port 67 mean?
>> Locally I can use it for whatever I want for my own purposes, I don't
>> have to follow the conventions for service ports as specified by the
>> IETF.
>>
>> But I can't have the packet checksum state be left alone for port 67
>> traffic on a box using virtio because you have this hack there.
>>
>> And yes it's broken on machines using the qemu thing, but at least the
>> hack there is restricted to userspace.
>>
> Yes, and I think it was a mistake to add the hack there. This is what
> prevented applications from using the new interface in the 3 years
> since it was first introduced.
>

It's far more complicated than that. dhclient is part of ISC's DHCP
package. They do not have a public SCM and instead require you to join
their Software Guild to get access to it.

This problem was identified in one distribution and the patch was pushed
upstream but because they did not have a public SCM, most other
distributions did not see the fix until it appeared in a release. ISC
has a pretty long release cycle historically.

ISC's had the fix for a long time but there was a 3-year gap in their
releases and since their SCM isn't public, users are stuck with the last
release.

This hack makes sense in QEMU as we have a few hacks like this to fix
broken guests. A primary use of virtualization is to run old
applications so it makes sense for us to do that.

I don't think it makes sense for vhost to do this. These guests are so
old that they don't have the requisite features to achieve really high
performance anyway.

I've always thought making vhost totally transparent was a bad idea and
this is one of the reasons. We can do a lot of ugly things in userspace
that we shouldn't be doing in the kernel.

Regards,

Anthony Liguori
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/