From: Michael Breuer on
On 1/18/2010 4:00 PM, Stephen Hemminger wrote:
> On Mon, 18 Jan 2010 15:56:45 -0500
> Michael Breuer<mbreuer(a)majjas.com> wrote:
>
>
>>>> 2. The dropped tx packet (DHCP) is a bit harder to recreate, but it
>>>> still happens.
>>>>
>>>>
> You might want to use tc filter rule to set priority of DHCP packets
> higher. This would cause them to be in a separate queue and eliminate
> the problem.
>
>
Ok - for fun, tried that - no change. Not sure I see why this might be a
factor. The packet loss happens when TX load is low and RX high.
Also, packets are only being dropped when traversing the router, not when
sent to the router itself. Keep in mind that pings to the router did not
lose packets; pings through the router did. The router was not under
load (traffic is being generated from a device connected via the 1Gb
switch, not the wifi router), and tcpdump on the router input port shows
the pings to the router, but not the ones through the router.
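
The exact rule that was tried is not given in the thread. As a rough sketch only, a rule of the kind Stephen describes could look like the following. It assumes the interface is eth0, that this machine is the DHCP server (so replies leave with UDP source port 67), and that a classful prio qdisc replaces the default pfifo_fast root, since pfifo_fast cannot take filters. The helper name dhcp_priority_commands is hypothetical, and the script only prints the commands.

```python
import shlex


def dhcp_priority_commands(dev="eth0"):
    """Build tc commands that steer DHCP server replies into the top prio band.

    Assumption: the device name and the use of a prio root qdisc are
    illustrative; they are not taken from the thread.
    """
    return [
        # Root prio qdisc; by default it creates bands 1:1 (highest) to 1:3.
        ["tc", "qdisc", "add", "dev", dev, "root", "handle", "1:", "prio"],
        # UDP (IP protocol 17) with source port 67, i.e. DHCP replies
        # leaving the server, is sent to band 1:1.
        ["tc", "filter", "add", "dev", dev, "parent", "1:", "protocol", "ip",
         "prio", "1", "u32",
         "match", "ip", "protocol", "17", "0xff",
         "match", "ip", "sport", "67", "0xffff",
         "flowid", "1:1"],
    ]


if __name__ == "__main__":
    # Print only; running these needs root and changes the live qdisc.
    for argv in dhcp_priority_commands("eth0"):
        print(shlex.join(argv))
```

A rule like this only reorders packets inside the qdisc, so it would not help if the loss happens after the qdisc has handed the packet to the driver.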

One added note, when I just tried this, the test data ended while the
packet loss was occurring. The DHCPOFFER packet loss did not clear until
about a minute after the throughput abated. I really think something is
getting hosed, and I'd put some weird interaction with the arp logic
high on the list of suspects. Not sure what else would be a factor when
looking at the extra hop on the same subnet.
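
The thread does not say how the ARP state was inspected. One possible way to test the suspicion, sketched here under assumptions, is to snapshot /proc/net/arp while the loss is occurring and look for neighbour entries that are not marked complete. The sample table and its addresses are made up, and the function names are hypothetical.

```python
ATF_COM = 0x02  # "completed entry" flag from <linux/if_arp.h>

# Hypothetical sample in the /proc/net/arp layout; addresses are made up.
SAMPLE = """\
IP address       HW type     Flags       HW address            Mask     Device
192.168.0.1      0x1         0x2         00:11:22:33:44:55     *        eth0
192.168.0.50     0x1         0x0         00:00:00:00:00:00     *        eth0
"""


def parse_arp_table(text):
    """Parse /proc/net/arp text into a list of dicts, one per neighbour."""
    entries = []
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        ip, _hw_type, flags, mac, _mask, dev = fields
        entries.append({
            "ip": ip,
            "mac": mac,
            "dev": dev,
            "complete": bool(int(flags, 16) & ATF_COM),
        })
    return entries


def incomplete_neighbours(text):
    """Return the IPs whose ARP entry is not flagged as complete."""
    return [e["ip"] for e in parse_arp_table(text) if not e["complete"]]


if __name__ == "__main__":
    print(incomplete_neighbours(SAMPLE))
```

On a live system the text would come from reading /proc/net/arp; comparing snapshots taken before, during, and after the loss window would show whether the wifi device's entry changes state.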
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Michael Breuer on
On 1/18/2010 4:25 PM, Jarek Poplawski wrote:
> On Mon, Jan 18, 2010 at 03:56:45PM -0500, Michael Breuer wrote:
>
>> On 1/18/2010 3:46 PM, Jarek Poplawski wrote:
>>
>>> On Mon, Jan 18, 2010 at 11:29:31AM -0500, Michael Breuer wrote:
>>>
>>>> Ok - up on the two patches, no DMAR. Some early observations:
>>>>
>>>> 1. There's an early on MMAP oops (see below). This happens once, at
>>>> the completion of the transition to runlevel 5 (I've seen it
>>>> entering runlevel 3 as well). This does not recur when runlevels are
>>>> subsequently changed. I do not see this when running with DMAR
>>>> enabled.
>>>>
>>> OK, you mentioned this oops (actually a warning only) happened during
>>> previous tests too.
>>>
>> Yes - don't know if it's significant or not. It's the only obvious
>> difference between DMAR and not.
>>
> OK, let's test (for as long as possible) whether it can break as badly
> as with DMAR.
>
>
>>>> 2. The dropped tx packet (DHCP) is a bit harder to recreate, but it
>>>> still happens.
>>>>
>>> Btw, I guess you improved the test because you didn't mention it here,
>>> even after my explicit question?:
>>> http://permalink.gmane.org/gmane.linux.network/149171
>>>
>> I had been focusing on the hangs - dhcp causing the initial crash
>> from December. After things stabilized with the af patch & skb may
>> pull I started noticing the dropped tx packets. I reported the TX
>> loss on the 16th of January after confirming the issue.
>>
> OK, but we need to establish some status quo after these patches
> before any new things (including DMAR), so I'd suggest testing this
> config for much longer and harder.
>
>
>>>> Interestingly, I initially saw no dropped packets
>>>> with ping - but after I went the DHCP route and eventually
>>>> reconnected, I could then cause dropped tx packets with ping. To
>>>> clarify:
>>>>
>>>> a) start throughput
>>>> b) ping device - no packet loss - this was true for the entire test run.
>>>> c) start throughput again
>>>> d) ping - no loss.
>>>> e) drop wifi on the device & restart - first attempt worked. Repeat
>>>> attempt yielded the dropped DHCPOFFER packets. After about 6 tries,
>>>> the device reconnected to wifi.
>>>> f) ping again (after the reconnection) - packet loss rate about 80%.
>>>> g) simultaneously ping the wifi router - no loss.
>>>> h) After a while, packets are no longer dropped during ping. If I
>>>> manage to cause the dhcp drop again, and then ping after the device
>>>> finally reconnects, packet loss is significant for a while (maybe 30
>>>> sec to a minute). Then things return to normal. Note that the packet
>>>> loss continues even if the reported throughput drops to nil.
>>>> i) I can't cause the initial packet loss at RX rates below about
>>>> 30,000KBPS (as reported by nethogs). At rates over 40,000KBPS I can
>>>> reproduce this on this set of patches & config about 60% of the
>>>> time.
>>>>
>>> I forgot to mention, but did you try to check if these lost ping
>>> packets are "being dropped somewhere after wireshark sees them and
>>> before hitting the wire" like DHCPOFFER? Aren't there any sky2
>>> warnings/resets while this happens?
>>>
>>> Jarek P.
>>>
>> Yes. There are no errors, and no statistics anywhere that I know to
>> look reflect the loss. Nothing in netstat; ethtool -S; etc. The only
>> loss reported is RX. The recent TX warnings/resets happened while
>> the machine was up for several days and while unattended and under
>> high RX load.
>>
> Please check "tc -s qdisc" each time as well.
>
> Jarek P
>

Some output from tc -s qdisc:

Before test:
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 35279532 bytes 291080 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 377308 bytes 3107 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0

During test (after initial observed packet loss):
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 123389424 bytes 1781403 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 400862 bytes 3250 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0

During test - while packet loss is occurring:
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 150518974 bytes 2138312 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 422003 bytes 3432 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0

After the conclusion of the test:
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 244900497 bytes 3416350 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 564380 bytes 4708 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0


During the test: 8.9GB received; 232.9MB sent.
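
The snapshots above can also be compared mechanically rather than by eye. A minimal sketch, assuming each `tc -s qdisc` dump has been saved as text; the function names are hypothetical, and the two sample strings are the eth0 entries from the "before test" and "while packet loss is occurring" snapshots.

```python
import re

# Header line of each qdisc entry, e.g. "qdisc pfifo_fast 0: dev eth0 root ..."
DEV_RE = re.compile(r"^qdisc \S+ \S+ dev (\S+)")
# Counter line, e.g. "Sent 1 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)"
SENT_RE = re.compile(
    r"Sent (\d+) bytes (\d+) pkt "
    r"\(dropped (\d+), overlimits (\d+) requeues (\d+)\)"
)
KEYS = ("bytes", "pkts", "dropped", "overlimits", "requeues")

BEFORE = """\
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2
0 0 1 1 1 1 1 1 1 1
Sent 35279532 bytes 291080 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
"""

DURING_LOSS = """\
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2
0 0 1 1 1 1 1 1 1 1
Sent 150518974 bytes 2138312 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
"""


def parse_tc_stats(text):
    """Return {device: {counter: value}} from `tc -s qdisc` output."""
    stats = {}
    dev = None
    for line in text.splitlines():
        line = line.strip()
        m = DEV_RE.match(line)
        if m:
            dev = m.group(1)
            continue
        m = SENT_RE.search(line)
        if m and dev is not None:
            stats[dev] = dict(zip(KEYS, map(int, m.groups())))
    return stats


def diff_stats(before, after):
    """Per-device counter deltas between two parsed snapshots."""
    return {
        dev: {k: after[dev][k] - before[dev][k] for k in KEYS}
        for dev in after
        if dev in before
    }


if __name__ == "__main__":
    delta = diff_stats(parse_tc_stats(BEFORE), parse_tc_stats(DURING_LOSS))
    print(delta["eth0"])
```

For these two snapshots the dropped, overlimits and requeues deltas are all zero, which matches the observation that the qdisc layer reports no loss.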

I also connected a second device through the wifi router. I was able to
ping that device w/o loss while DHCP packets were being dropped to the
other connected device.

Last note: just moved to 2.6.32.4 from .3 for this test (from git).

From: Jarek Poplawski on
On Mon, Jan 18, 2010 at 04:24:52PM -0500, Michael Breuer wrote:
> On 1/18/2010 4:00 PM, Stephen Hemminger wrote:
> >On Mon, 18 Jan 2010 15:56:45 -0500
> >Michael Breuer<mbreuer(a)majjas.com> wrote:
> >
> >>>>2. The dropped tx packet (DHCP) is a bit harder to recreate, but it
> >>>>still happens.
> >>>>
> >You might want to use tc filter rule to set priority of DHCP packets
> >higher. This would cause them to be in a separate queue and eliminate
> >the problem.
> >
> Ok - for fun, tried that - no change. Not sure I see why this might
> be a factor. The packet loss happens when TX load is low and RX
> high.
> Also, packets are only being dropped when traversing the router, not
> when sent to the router itself. Keep in mind that pings to the router
> did not lose packets; pings through the router did. The router was not
> under load (traffic is being generated from a device connected via
> the 1Gb switch, not the wifi router), and tcpdump on the router
> input port shows the pings to the router, but not the ones through
> the router.
>
> One added note, when I just tried this, the test data ended while
> the packet loss was occurring. The DHCPOFFER packet loss did not
> clear until about a minute after the throughput abated. I really
> think something is getting hosed, and I'd put some weird interaction
> with the arp logic high on the list of suspects. Not sure what else
> would be a factor when looking at the extra hop on the same subnet.

Good point! Actually, IIRC, your setup might be the problem: you seem
to have two switches on the path (I guess the router acts as a bridge for
the wireless devices), so I wonder if the issue isn't something between them.

Jarek P.
From: Jarek Poplawski on
On Mon, Jan 18, 2010 at 04:39:24PM -0500, Michael Breuer wrote:
> Some output from tc -s qdisc:
....
> After the conclusion of the test:
> qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
> Sent 244900497 bytes 3416350 pkt (dropped 0, overlimits 0 requeues 0)
> rate 0bit 0pps backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
> Sent 564380 bytes 4708 pkt (dropped 0, overlimits 0 requeues 0)
> rate 0bit 0pps backlog 0b 0p requeues 0

Great!

>
>
> During the test: 8.9GB received; 232.9MB sent.
>
> I also connected a second device through the wifi router. I was able
> to ping that device w/o loss while DHCP packets were being dropped
> to the other connected device.

Could you remind us if the problem is always with this first device?
Btw, I wonder if you could test it skipping the (HP?) switch?

Jarek P.
From: Jarek Poplawski on
On Mon, Jan 18, 2010 at 11:08:14PM +0100, Jarek Poplawski wrote:
> Btw, I wonder if you could test it skipping the (HP?) switch?

If so, then of course don't forget to try tcpdump on the router.

Jarek P.