From: Rick Jones on
pk <pk(a)pk.invalid> wrote:
> Rick Jones wrote:

> > Just for grins, I would suggest some explicit socket buffer size
> > settings. I don't know how that is done with iperf or nuttcp, but for
> > netperf it would be something along the lines of:
> >
> > netperf -H <remote> -- -m 64K -s <size> -S <size>
> >
> > I'd probably try something like:
> >
> > for s in 32 64 128 256
> > do
> > netperf -H <remote> -- -m 64K -s $s -S $s
> > done

> This is what I get (pretty much same results in both directions, so I'm only
> pasting one direction):

Argh! I left-off the 'K' !-(

that should be

netperf -H <remote> -- -m 64K -s ${s}K -S ${s}K

sorry about that.
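
So the whole loop, corrected, would be something like:

for s in 32 64 128 256
do
    netperf -H <remote> -- -m 64K -s ${s}K -S ${s}K
done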

> > Are there any TCP retransmissions recorded on the sender during
> > the tests?

> Yes, there is an average of ~85 retransmissions on the sending side when
> running the 4 netperf tests above (in total, not per test).

While the netperf tests above were fubar thanks to my forgetting the
'K', a TCP retransmission is generally not goodness when it comes to
making throughput. That will tend to keep the congestion window (aka
cwnd) suppressed. How much depends on the congestion control
algorithm being used.
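
If you are not sure which algorithm the sender is using, on a
reasonably recent Linux something like the following should tell you
(standard sysctl names, so treat it as a sketch rather than gospel):

sysctl net.ipv4.tcp_congestion_control
cat /proc/sys/net/ipv4/tcp_available_congestion_control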


> > What sort of interfaces are being used on the systems on either end?

> One end is physical ethernet, the other end is a KVM virtual machine
> bridged to an ethernet network (the guest is using the virtio
> driver). The KVM host is not very loaded, as this is the only VM
> running on it.

Just for paranoia: can you achieve "good" throughput from the VM to a
local system?
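
Something like running netserver on another box on the same LAN as
the KVM host and then, from the guest (<lan_host> being whatever that
local box is):

netperf -H <lan_host> -- -m 64K -s 256K -S 256K

should be enough to rule the VM/virtio path in or out.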

> > A lot of this will not apply to your situation, but some will:
> >
> > Some of my checklist items when presented with assertions of poor
> > network performance, in no particular order:
> >

> > *) Are there TCP retransmissions being registered in netstat
> > statistics on the sending system? Take a snapshot of netstat
> > -s -t from just before the transfer, and one from just after
> > and run it through beforeafter from
> > ftp://ftp.cup.hp.com/dist/networking/tools:
> >
> > netstat -s -t > before
> > transfer or wait 60 or so seconds if the transfer was already going
> > netstat -s -t > after
> > beforeafter before after > delta

> Yes, see above.

Definitely check that with the corrected netperf (or other test).
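
In other words, bracket one corrected run, something along these
lines:

netstat -s -t > before
netperf -H <remote> -- -m 64K -s 256K -S 256K
netstat -s -t > after
beforeafter before after > delta
grep -i retrans delta

and see how many retransmissions that single run accounts for.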

> > *) Are there packet drops registered in ethtool -S statistics on
> > either side of the transfer? Take snapshots in a manner
> > similar to that with netstat.

> No packet drops at the ethernet level.

Probably at the point where the LAN meets the WAN.
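
The same before/after trick works for the NIC statistics - eth0 below
is just a placeholder for whichever interface is actually carrying
the traffic:

ethtool -S eth0 > before_ethtool
transfer, or wait 60 or so seconds if the transfer was already going
ethtool -S eth0 > after_ethtool
beforeafter before_ethtool after_ethtool > delta_ethtool

Of course, drops in the LAN/WAN gear itself will only show up in
whatever statistics that gear keeps, not in the end-system counters.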

> > *) What is the latency between the two end points. Install netperf on
> > both sides, start netserver on one side and on the other side run:
> >
> > netperf -t TCP_RR -l 30 -H <remote>
> >
> > and invert the transaction/s rate to get the RTT latency. There
> > are caveats involving NIC interrupt coalescing settings defaulting
> > in favor of throughput/CPU util over latency:
> >
> > ftp://ftp.cup.hp.com/dist/networking/briefs/nic_latency_vs_tput.txt
> >
> > but when the connections are over a WAN latency is important and
> > may not be clouded as much by NIC settings.

> This pretty much gives me the same latency I measured earlier, i.e.
> 1/12.87 = 0.078 seconds (roughly 78 ms), which is consistent with what ping shows.

It will likely be somewhat higher under load.
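
Just to spell the inversion out (plain shell arithmetic, nothing
netperf-specific):

$ echo "scale=4; 1/12.87" | bc
.0777

so 12.87 transactions/s works out to roughly 78 milliseconds of
round-trip time - very much WAN territory, which is why the window
sizes matter so much here.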

> > This all leads into:
> >
> > *) What is the *effective* TCP (or other) window size for the
> > connection. One limit to the performance of a TCP bulk transfer
> > is:
> >
> > Tput <= W(eff)/RTT
> >
> > The effective window size will be the lesser of:
> >
> > a) the classic TCP window advertised by the receiver (the value in
> > the TCP header's window field shifted by the window scaling
> > factor exchanged during connection establishment - which is why
> > one wants to get traces including the connection establishment...)
> >
> > this will depend on whether/what the receiving application has
> > requested via a setsockopt(SO_RCVBUF) call and the sysctl limits
> > set in the OS. If the application does not call
> > setsockopt(SO_RCVBUF) then the Linux stack will "autotune" the
> > advertised window based on other sysctl limits in the OS.
> >
> > b) the computed congestion window on the sender - this will be
> > affected by the packet loss rate over the connection, hence the
> > interest in the netstat and ethtool stats.
> >
> > c) the quantity of data to which the sending TCP can maintain a
> > reference while waiting for it to be ACKnowledged by the
> > receiver - this will be akin to the classic TCP window case
> > above, but on the sending side, and concerning
> > setsockopt(SO_SNDBUF) and sysctl settings.
> >
> > d) the quantity of data the sending application is willing/able to
> > send at any one time before waiting for some sort of
> > application-level acknowledgement. FTP and rcp will just blast
> > all the data of the file into the socket as fast as the socket
> > will take it. scp has some application-layer "windowing" which
> > may cause it to put less data out onto the connection than TCP
> > might otherwise have permitted. NFS has the maximum number of
> > outstanding requests it will allow at one time acting as a
> > defacto "window" etc etc etc

> Not sure this helps, but tcpdumping during the nuttcp test, the
> peers initially advertise a window of 5840 and 5792
> (sender/receiver), both with window scaling == 2^9. As the test
> progresses, the receiver advertises bigger and bigger window sizes
> (scaled values: 8704, 11264, 13824, 16896 etc. up to 82944 at the
> end of the test).

Scale factor of 9 eh? That probably comes from setting the high end
of the [wr]mem to 16MB. Anyway, it shows how autotuning is willing to
grow things quite large indeed.
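
Back-of-the-envelope, using the Tput <= W(eff)/RTT relation above and
the ~78 ms RTT implied by your TCP_RR numbers: even the largest
window you saw advertised, 82944 bytes, only buys

82944 bytes * 8 bits/byte / 0.078 s = ~8.5 * 10^6 bits/s

call it 8.5 Mbit/s tops, while a window of only ~6 KB at that RTT
caps a single stream down around 0.6 Mbit/s.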

> The sender advertises 6144 from the second segment
> onwards and sticks to that.

Shawn Osterman's "tcptrace" can take-in a tcpdump trace and emit files
you can feed into xplot (or whatever replaces it) to show things like
cwnd (probably assuming an older congestion control algorithm) and
show where retransmissions likely took place and such. Sometimes it
is best to take the trace on the receiver, sometimes on the sender.
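
The basic recipe would be something like this - interface name, file
names and the tcpdump filter are all just placeholders:

tcpdump -i eth0 -s 0 -w transfer.pcap host <remote> &
transfer, or wait 60 or so seconds if the transfer was already going
kill %1
tcptrace -G transfer.pcap
xplot <connection>_tsg.xpl

The time-sequence graph (the _tsg.xpl file) is usually the first
place to stare at for retransmissions and window behaviour.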

rick jones

--
the road to hell is paved with business decisions...
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
From: pk on
Rick Jones wrote:

> pk <pk(a)pk.invalid> wrote:
>> Rick Jones wrote:
>
>> > Just for grins, I would suggest some explicit socket buffer size
>> > settings. I don't know how that is done with iperf or nuttcp, but for
>> > netperf it would be something along the lines of:
>> >
>> > netperf -H <remote> -- -m 64K -s <size> -S <size>
>> >
>> > I'd probably try something like:
>> >
>> > for s in 32 64 128 256
>> > do
>> > netperf -H <remote> -- -m 64K -s $s -S $s
>> > done
>
>> This is what I get (pretty much same results in both directions, so I'm
>> only pasting one direction):
>
> Argh! I left-off the 'K' !-(
>
> that should be
>
> netperf -H <remote> -- -m 64K -s ${s}K -S ${s}K
>
> sorry about that.

Right, so now we're getting again the "expected" results:

# for s in 32 64 128 256; do netperf -H 1x.x.x.x -p 5001 -- -P 5002 -m 64K -s $sK -S $sK; done
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 5002 AF_INET to 1x.x.x.x (1x.x.x.x) port 5002 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

1048576    2048  65536    10.30       0.52
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 5002 AF_INET to 1x.x.x.x (1x.x.x.x) port 5002 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

1048576    2048  65536    10.16       0.53
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 5002 AF_INET to 1x.x.x.x (1x.x.x.x) port 5002 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

1048576    2048  65536    10.88       0.49
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 5002 AF_INET to 1x.x.x.x (1x.x.x.x) port 5002 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

1048576    2048  65536    10.08       0.53

>> > Are there any TCP retransmissions recorded on the sender during
>> > the tests?
>
>> Yes, there is an average of ~85 retransmissions on the sending side when
>> running the 4 netperf tests above (in total, not per test).
>
> While the netperf tests above were fubar thanks to my forgetting the
> 'K', a TCP retransmission is generally not goodness when it comes to
> making throughput. That will tend to keep the congestion window (aka
> cwnd) suppressed. How much depends on the congestion control
> algorithm being used.

With the corrected tests, now I'm seeing about 30 retransmits per test
(including the 256K). The available congestion algorithms on the machine are
cubic and reno, both of which give roughly the same results.

>> > What sort of interfaces are being used on the systems on either end?
>
>> One end is physical ethernet, the other end is a KVM virtual machine
>> bridged to an ethernet network (the guest is using the virtio
>> driver). The KVM host is not very loaded, as this is the only VM
>> running on it.
>
> Just for paranoia, you can achieve "good" throughput from the VM to a
> local system?

Yes, both systems can (almost) max out the bandwidth in a single TCP
connection to another local machine.

>> The sender advertises 6144 from the second segment
>> onwards and sticks to that.
>
> Shawn Osterman's "tcptrace" can take-in a tcpdump trace and emit files
> you can feed into xplot (or whatever replaces it) to show things like
> cwnd (probably assuming an older congestion control algorithm) and
> show where retransmissions likely took place and such. Sometimes it
> is best to take the trace on the receiver, sometimes on the sender.

That is a very good suggestion, I will try it. I'm starting to think that
the problem is in some intermediate node or link.

Thank you again!
From: Rick Jones on
pk <pk(a)pk.invalid> wrote:
> Right, so now we're getting again the "expected" results:

> # for s in 32 64 128 256; do netperf -H 1x.x.x.x -p 5001 -- -P 5002 -m 64K -s $sK -S $sK; done

> ...

> 1048576    2048  65536    10.88       0.49
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 5002 AF_INET to 1x.x.x.x (1x.x.x.x) port 5002 AF_INET : demo
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec

> 1048576    2048  65536    10.08       0.53

I'm a little concerned that the send socket size is being reported as
2048 there.
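
If you want to see what the sending side is allowed to use, something
like:

sysctl net.core.wmem_max net.ipv4.tcp_wmem

on the sender would show the socket buffer limits in play (those are
the usual Linux sysctl names; adjust to taste).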

> >> > Are there any TCP retransmissions recorded on the sender during
> >> > the tests?


> With the corrected tests, now I'm seeing about 30 retransmits per
> test (including the 256K). The available congestion algorithms on
> the machine are cubic and reno, both of which give roughly the same
> results.

Pity there isn't a "Farragut" ("Damn the retransmissions full-speed
ahead!") module one can use for testing. I suppose though there is -
the UDP_STREAM test. I'd probably use 1460 as the value of -m - we
want to avoid IP fragmentation if we can. Likely as not the sending
side will report something close to local link-rate. It will be the
second line of data that will be of interest.

So, something like:

netperf -H 1.X.X.X.X -p 5001 -t UDP_STREAM -- -P 5002 -m 1460 -s 1M -S 1M

If that also happens to report about half a megabit, I might wonder if
you have bonded links in your path. If it reports more than half a
megabit, it is probably the packet losses holding things up.

Do be careful with that test. BTW, it may require you to add a -R 1
to the test-specific stuff. I had to put-in some code to cover the
backsides of people using netperf for functional testing and doing
link-down. They would do link-down on the test link(s) while netperf
UDP_STREAM was running, the stack would dutifully try to find another
way to get to the destination, and start pumping the UDP_STREAM
traffic out their default route - which happened to be their
production network... So for UDP_STREAM tests by default the socket
has SO_DONTROUTE set on it. It offended my sensibilities to cover the
backsides of folks doing something so stupid as functional testing on
systems without an airgap to their production networks, but there you
are... the life of a benchmark maintainer :)
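
Anyway, if the plain command gets tripped-up by that, the variant
with the test-specific -R 1 appended would be something like:

netperf -H <remote> -p 5001 -t UDP_STREAM -- -P 5002 -m 1460 -s 1M -S 1M -R 1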

> > Shawn Osterman's "tcptrace" can take-in a tcpdump trace and emit
> > files you can feed into xplot (or whatever replaces it) to show
> > things like cwnd (probably assuming an older congestion control
> > algorithm) and show where retransmissions likely took place and
> > such. Sometimes it is best to take the trace on the receiver,
> > sometimes on the sender.

> That is a very good suggestion, I will try it. I'm starting to think
> that the problem is in some intermediate node or link.

happy benchmarking,

rick jones
--
Process shall set you free from the need for rational thought.
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
From: pk on
Rick Jones wrote:

> Pity there isn't a "Farragut" ("Damn the retransmissions full-speed
> ahead!") module one can use for testing. I suppose though there is -
> the UDP_STREAM test. I'd probably use 1460 as the value of -m - we
> want to avoid IP fragmentation if we can. Likely as not the sending
> side will report something close to local link-rate. It will be the
> second line of data that will be of interest.
>
> So, something like:
>
> netperf -H 1.X.X.X.X -p 5001 -t UDP_STREAM -- -P 5002 -m 1460 -s 1M -S 1M
>
> If that also happens to report about half a megabit, I might wonder if
> you have bonded links in your path. If it reports more than half a
> megabit, it is probably the packet losses holding things up.

$ netperf -H 1x.x.x.x -p 5002 -t UDP_STREAM -- -P 5003 -m 1460 -s 1M -S 1M
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 5003 AF_INET to 1x.x.x.x (1x.x.x.x) port 5003 AF_INET : demo
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

2097152    1460   10.00     1352768      0    1580.01
2097152            10.00       65165            76.11

Ok, now I'm puzzled. 76 Mbit/sec certainly looks like a lot to me. The
bandwidth purchased from the colo at the slower of the two sites (the one
running the above test) should be around 10 Mbit... I'm not sure how to
interpret those results.

In a single isolated instance, I even got

UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 5003 AF_INET to 1x.x.x.x (1x.x.x.x) port 5003 AF_INET : demo
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

2097152    1460   10.00     1399679      0    1634.82
2097152            10.00      384976           449.65

which doesn't make any sense to me. Except for this last one, tests in both
directions show figures around 75-79 Mbit/sec, like the first one above.
From: Rick Jones on
pk <pk(a)pk.invalid> wrote:
> Rick Jones wrote:

> > Pity there isn't a "Farragut" ("Damn the retransmissions full-speed
> > ahead!") module one can use for testing. I suppose though there is -
> > the UDP_STREAM test. I'd probably use 1460 as the value of -m - we
> > want to avoid IP fragmentation if we can. Likely as not the sending
> > side will report something close to local link-rate. It will be the
> > second line of data that will be of interest.
> >
> > So, something like:
> >
> > netperf -H 1.X.X.X.X -p 5001 -t UDP_STREAM -- -P 5002 -m 1460 -s 1M -S 1M
> >
> > If that also happens to report about half a megabit, I might wonder if
> > you have bonded links in your path. If it reports more than half a
> > megabit, it is probably the packet losses holding things up.

> $ netperf -H 1x.x.x.x -p 5002 -t UDP_STREAM -- -P 5003 -m 1460 -s 1M -S 1M
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 5003 AF_INET to 1x.x.x.x (1x.x.x.x) port 5003 AF_INET : demo
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec

> 2097152    1460   10.00     1352768      0    1580.01
> 2097152            10.00       65165            76.11

> Ok, now I'm puzzled. 76 Mbit/sec certainly looks like a lot to me. The
> bandwidth purchased from the colo at the slower of the two sites
> (the one running the above test) should be around 10 Mbit... I'm not
> sure how to interpret those results.

The top line of numbers is the number of perceived-successful sendto()
calls, multiplied by the bytes for each sendto(), divided by the test
time (along with unit conversion to 10^6 bits per second - aka
megabits). The second line is what the receiver reported receiving
over the same interval.
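
Taking your first run as the worked example: 1352768 sends of 1460
bytes in 10 seconds is

1352768 * 1460 * 8 / 10 = ~1580 * 10^6 bits/s

on the sending side, while the 65165 messages the receiver reported
work out to

65165 * 1460 * 8 / 10 = ~76 * 10^6 bits/s

so something like 95% of what was sent was never received, or at
least never counted.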

> In a single isolated instance, I even got

> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 5003 AF_INET to 1x.x.x.x (1x.x.x.x) port 5003 AF_INET : demo
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec

> 2097152    1460   10.00     1399679      0    1634.82
> 2097152            10.00      384976           449.65

> which doesn't make any sense to me. Except for this last one, tests
> in both directions show figures around 75-79 Mbit/sec, like the first
> one above.

There may be some "anomalies" in the way the colo does bandwidth
throttling. And bandwidth throttling at/by the colo tosses-in a whole
new set of potential issues for the single-stream TCP performance...

To deal with the local (sending side) issue of reporting more than
local link-rate for sending, there was recently a patch from Andrew
Gallatin to have netperf set IP_RECVERR on Linux, which should at
least partially work-around things going awry in the Linux stack (the
flow control only works for certain relationships between the socket
buffer size and the driver transmit queue depth, or something along
those lines). That though will not address getting more than what the
colo claims to be giving you into the cloud.

One further thing you could do is add a global -F <filename> where
<filename> is a file with uncompressible data. I suppose there is a
small possibility there is something doing data compression, and that
would be a way to get around it.
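
Something along these lines (file name and size arbitrary) would do
it:

dd if=/dev/urandom of=fill.dat bs=1M count=16
netperf -H <remote> -p 5001 -F fill.dat -t UDP_STREAM -- -P 5002 -m 1460 -s 1M -S 1M

so netperf sends data pulled from that file rather than whatever
happens to be sitting in its buffers.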

If your colo agreement has total bytes transferred
limits/levels/charges, do be careful running netperf tests. It
wouldn't do to have a "free" benchmark cause a big colo bill...

rick jones
--
I don't interest myself in "why". I think more often in terms of
"when", sometimes "where"; always "how much." - Joubert
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...