From: Michael Breuer on
On 1/18/2010 5:08 PM, Jarek Poplawski wrote:
> On Mon, Jan 18, 2010 at 04:39:24PM -0500, Michael Breuer wrote:
>
>> Some output from tc -s qdisc:
>>
> ...
>
>> After the conclusion of the test:
>> qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>> Sent 244900497 bytes 3416350 pkt (dropped 0, overlimits 0 requeues 0)
>> rate 0bit 0pps backlog 0b 0p requeues 0
>> qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>> Sent 564380 bytes 4708 pkt (dropped 0, overlimits 0 requeues 0)
>> rate 0bit 0pps backlog 0b 0p requeues 0
>>
> Great!
>
>
>>
>> (During the test, 8.9GB received; 232.9MB sent.)
>>
>> I also connected a second device through the wifi router. I was able
>> to ping that device w/o loss while DHCP packets were being dropped
>> to the other connected device.
>>
> Could you remind us if the problem is always with this first device?
> Btw, I wonder if you could test it skipping the (HP?) switch?
>
> Jarek P.
>
No - can be any device connected via the wifi router - wired or wireless.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Michael Breuer on
On 1/18/2010 5:17 PM, Jarek Poplawski wrote:
> On Mon, Jan 18, 2010 at 11:08:14PM +0100, Jarek Poplawski wrote:
>
>> Btw, I wonder if you could test it skipping the (HP?) switch?
>>
> If so, then of course don't forget to try tcpdump on the router.
>
> Jarek P.
>
Well - no.... but I'm not sure that would show anything.

Setup diagram:

Server->gb switch-> (100mb) wifi router -> devices
           |
      Win7 PC (gb)

The problem does not occur (at least I haven't been able to recreate it)
at 100Mb/s, and the wifi router doesn't do 1Gb/s. I drive the traffic from
the Win7 PC to the server. I've seen the loss when the only traffic
going through the wifi router was ping and DHCP. I've also never seen any
loss on a device directly attached to the gigabit switch. I can drive load
through the wifi router while driving load from the Win7 box, but I don't
see any TX packet loss unless a DHCP RELEASE/RENEW is in progress.

As there is no packet loss to devices not involved in the DHCP sequence
through the same path, I'm not really sure that the gigabit switch is
implicated.

As I don't have a standalone sniffer, I'm thinking that it might be
easier to instrument places where the TX packet could be dropped and see
at least whether it's getting to the card.

Given the circumstances of the TX drop, and that it was DHCP traffic
while under load that caused the oops rectified with the two patches,
I'm thinking that the packet loss is the current manifestation of
whatever the underlying problem is. Given the extra hop required to
break things, and given that a dhcp release/renew seems to trigger
things, I keep coming back to arp logic as being somehow implicated.

If arp is somehow involved, then I'd expect to see manifestations under
similar circumstances with other drivers. As the pskb_may_pull patch
stopped the crash, perhaps other drivers do suffer packet loss and it's
just not been widely noticed or attributed to the kernel - especially if
the network topology is a factor. I do know people at large enterprises
who have been complaining of what *could* be this same issue, however
they're currently blaming their switch vendors. As most traffic is TCP,
this is really only noticed by those few places deeply concerned with
latency. It's likely something altogether different, but then again,
maybe not.
From: Jarek Poplawski on
On Mon, Jan 18, 2010 at 05:25:43PM -0500, Michael Breuer wrote:
> On 1/18/2010 5:08 PM, Jarek Poplawski wrote:
> >Could you remind us if the problem is always with this first device?
> >Btw, I wonder if you could test it skipping the (HP?) switch?
> >
> >Jarek P.
> No - can be any device connected via the wifi router - wired or wireless.

Anyway, if it can't be reproduced with this Win7 box, or even with the
router itself getting a DHCP lease, then the router's behaviour looks
the most suspect.

Jarek P.
From: Michael Breuer on
On 1/18/2010 5:47 PM, Michael Breuer wrote:
> On 1/18/2010 5:17 PM, Jarek Poplawski wrote:
>> On Mon, Jan 18, 2010 at 11:08:14PM +0100, Jarek Poplawski wrote:
>>> Btw, I wonder if you could test it skipping the (HP?) switch?
>> If so, then of course don't forget to try tcpdump on the router.
>>
>> Jarek P.
> Well - no.... but I'm not sure that would show anything.
Ok - one last update for a while... not sure what's next. I put some
printk's into the sky2.c xmit logic: the packets are being handed to the
card, and the I/Os are completing successfully. So it would seem that
either the switch or the wifi router is dropping the packets. As
tcpdump doesn't show the packets arriving on the wifi router, I'm
leaning towards the switch. I ran wireshark on the Win7 box to see what
is coming off the switch. I did notice one thing that's visible to the
Win7 box but doesn't show up in wireshark on the Linux side: before every
successful DHCPOFFER, there's an XID message broadcast from the device.
I'm wondering why I don't see this on the Linux side:

The packet is from the mac of the device, dst ff:ff:ff:ff:ff:ff;
protocol eth:llc... hex packet: ffffffffffff001cccf39ff600060001af810100.
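
For reference, the hex dump above can be split into its fields with a few
lines of Python. This is only a sketch: it assumes the 20 bytes are a plain
802.3 frame (destination, source, length) followed by an 802.2 LLC header,
and the function name decode_llc_frame is made up for this example. Under
that assumption the control byte 0xaf is an LLC XID, and the last three
bytes look like the usual XID information field (format 0x81, class 0x01,
window 0).

```python
def decode_llc_frame(hex_frame):
    """Split an 802.3 frame carrying an 802.2 LLC header into its fields.

    Assumes an untagged frame: 6 bytes dst, 6 bytes src, 2 bytes length,
    then DSAP, SSAP, control, and whatever LLC payload follows.
    """
    raw = bytes.fromhex(hex_frame)

    def mac(octets):
        return ":".join(f"{octet:02x}" for octet in octets)

    control = raw[16]
    return {
        "dst": mac(raw[0:6]),
        "src": mac(raw[6:12]),
        "length": int.from_bytes(raw[12:14], "big"),
        "dsap": raw[14],
        "ssap": raw[15],
        "control": control,
        # XID is 0xAF with the poll/final bit clear, 0xBF with it set.
        "is_xid": (control & 0xEF) == 0xAF,
        "xid_info": raw[17:20].hex(),
    }


frame = decode_llc_frame("ffffffffffff001cccf39ff600060001af810100")
for name, value in frame.items():
    print(name, value)
```

The source address decodes to 00:1c:cc:f3:9f:f6 and the length field to 6,
which matches the six LLC bytes that follow it. As far as I know, access
points and bridges often send a frame of this shape on behalf of a newly
associated station so that upstream switches learn where its MAC lives, but
whether that is what happens here is a guess.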

Now I guess I've got some reading to do... I've got no idea what the
correct application of llc messages would be given my topology :(. I do
suspect that the llc stuff (or lack thereof under some conditions) is
causing the switch to fail to forward the dhcpoffer message. As the
dhcpoffer message is not broadcast, but directed to the remote mac
address and as that address is not connected directly to the switch, I'm
guessing that under some conditions whatever tells the switch how to
find the mac is missing. I'd guess that the wifi router should be
letting the switch know around the time it forwards the first arp and/or
DHCP broadcast message from the client... or maybe the linux box should
be doing something before the offer.
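
To make the forwarding guess above concrete, here is a minimal Python
sketch of how a learning switch typically decides where to send a frame.
It assumes textbook transparent-bridge behaviour; the class name, port
names and MAC of the server are hypothetical, and the real switch in this
setup may behave differently.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningSwitch:
    """Toy model of a transparent bridge with a MAC forwarding table."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src, dst):
        """Learn the source MAC, then return the list of output ports."""
        self.table[src] = in_port
        if dst == BROADCAST or dst not in self.table:
            # Broadcast or unknown unicast: flood to every other port.
            return sorted(self.ports - {in_port})
        out_port = self.table[dst]
        # A frame whose destination lives on the ingress port is filtered.
        return [] if out_port == in_port else [out_port]


switch = LearningSwitch(["server", "router", "win7"])
client = "00:1c:cc:f3:9f:f6"          # device MAC from the XID frame
server_mac = "00:00:00:00:00:01"      # hypothetical server MAC

# A broadcast from the client arrives via the router port and is flooded.
flooded = switch.receive("router", client, BROADCAST)
# The unicast reply from the server now goes only to the router port.
directed = switch.receive("server", server_mac, client)
# A stale entry pointing at the wrong port would misdirect the reply.
switch.table[client] = "win7"
misdirected = switch.receive("server", server_mac, client)
print(flooded, directed, misdirected)
```

In this model an unknown unicast destination is flooded rather than
dropped, so a missing table entry alone would not lose the offer; a stale
or wrong entry would send it out of the wrong port.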

So net-net, as far as my TX packet loss issue, sky2 is in the clear. If
something on the linux side should be informing the switch about
something then there may still be an issue. If the wifi router should be
doing something differently, then it's unfortunately likely a 2.4.37
kernel issue (That's what dd-wrt is using).



From: Jarek Poplawski on
On Tue, Jan 19, 2010 at 12:46:24AM -0500, Michael Breuer wrote:
> So net-net, as far as my TX packet loss issue, sky2 is in the clear. If
> something on the linux side should be informing the switch about
> something then there may still be an issue. If the wifi router should be
> doing something differently, then it's unfortunately likely a 2.4.37
> kernel issue (That's what dd-wrt is using).

IMHO, as long as there is no proof from a sniffer or some register dumps,
the switch and the router are more suspect than your NIC or Linux box.
And debugging those other devices isn't all that interesting from my
POV ;-)

Anyway, if you only want to get it working (instead of debugging),
it seems you might try moving the dhcp server to the router or maybe
even using two separate servers with their pools - unless I missed
something in your config.

Jarek P.