From: pk on
Apologies for the only marginally topical question.

On a properly tuned system (which all modern linux systems should be), is a
single TCP connection supposed to be able to use all the available
bandwidth?

The issue I'm seeing is as follows. Testing between two computers over a
transatlantic WAN link (RTT about 80ms). iperf and nuttcp report roughly 0.5
Mbit/sec throughput, no matter how I tune the TCP settings on the two hosts.
As far as I know, recent Linux kernels should support TCP autotuning, so
messing around shouldn't even be necessary. But still, I tried to modify the
suggested settings to no avail (net.core.rmem_max = 16777216,
net.core.wmem_max = 16777216, net.ipv4.tcp_rmem = 4096 87380 16777216,
net.ipv4.tcp_wmem = 4096 65536 16777216. I also tried other settings as
found in articles dealing with TCP tuning, but basically they all ended up
modifying those values, although the numbers differed slightly. TCP window
scaling and selective acknowledgements are all enabled).
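
For reference, this is roughly how I set and checked them (same values as
above; the commands are the usual sysctl invocations, run as root on both
hosts):

sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# sanity check that scaling, SACK and receive autotuning are on
sysctl net.ipv4.tcp_window_scaling net.ipv4.tcp_sack net.ipv4.tcp_moderate_rcvbuf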

But, if I run multiple connections concurrently (like eg nuttcp -N2 or
more), each connection gets 1Mbit/sec throughput: nuttcp -N4 (running 4
concurrent threads) totals 2 Mbit/sec.
I would expect to be able to get the same 2 Mbit/sec with a single TCP
connection, too.

I'm not sure whether this is expected behavior, or if not, if there's
something else I forgot to check or where to look to debug this.
I have very limited knowledge of the intermediate network topology, other
than what's shown by traceroute/mtr.

Any pointers appreciated. Thanks.
From: Rick Jones on
pk <pk(a)pk.invalid> wrote:
> Apologies for the only marginally topical question.

> On a properly tuned system (which all modern linux systems should
> be), is a single TCP connection supposed to be able to use all the
> available bandwidth?

Maybe :)

> The issue I'm seeing is as follows. Testing between two computers
> over a transatlantic WAN link (RTT about 80ms). iperf and nuttcp
> report roughly 0.5 Mbit/sec throughput, no matter how I tune the TCP
> settings on the two hosts. As far as I know, recent Linux kernels
> should support TCP autotuning, so messing around shouldn't even be
> necessary. But still, I tried to modify the suggested settings to no
> avail (net.core.rmem_max = 16777216, net.core.wmem_max = 16777216,
> net.ipv4.tcp_rmem = 4096 87380 16777216, net.ipv4.tcp_wmem = 4096
> 65536 16777216. I also tried other settings as found in articles
> dealing with TCP tuning, but basically they all ended up modifying
> those values, although the numbers differed slightly. TCP window
> scaling and selective acknowledgements are all enabled).

net.core.[rw]mem_max only affect those apps making setsockopt() calls.

Autotuning is quite capable of allowing the window/socket buffer to
become very large indeed - larger than it needs to be, and *perhaps*
large enough to allow an intermediate queue to fill somewhere.
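
If you want to see what autotuning actually settles on while a transfer
is running, iproute2's ss ought to show it (this is from memory - check
your version's man page):

ss -tmi dst <remote>
# -t TCP sockets, -m socket memory (skmem), -i TCP info; watch the
# cwnd, rtt and buffer figures while the test is running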

Just for grins, I would suggest some explicit socket buffer size
settings. I don't know how that is done with iperf or nuttcp, but for
netperf it would be something along the lines of:

netperf -H <remote> -- -m 64K -s <size> -S <size>

I'd probably try something like:

for s in 32 64 128 256
do
netperf -H <remote> -- -m 64K -s $s -S $s
done
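
(If memory serves, the rough equivalents for iperf and nuttcp - please
check the man pages, I use netperf - would be:

iperf -c <remote> -w 128K      # -w sets the socket buffer / "window" size
nuttcp -w128 <remote>          # window size, in kilobytes
)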

> But, if I run multiple connections concurrently (like eg nuttcp -N2
> or more), each connection gets 1Mbit/sec throughput: nuttcp -N4
> (running 4 concurrent threads) totals 2 Mbit/sec.
> I would expect to be able to get the same 2 Mbit/sec with a single
> TCP connection, too.

Are there any TCP retransmissions recorded on the sender during the
tests?

What sort of interfaces are being used on the systems on either end?
Any chance that interrupt coalescing is getting in the way?

Might also be interesting to know if the 2 Mbit/s WAN link is a single
link "under the covers" or if it is some sort of bonding of lesser
bandwidth links.

> I'm not sure whether this is expected behavior, or if not, if
> there's something else I forgot to check or where to look to debug
> this. I have very limited knowledge of the intermediate network
> topology, other than what's shown by traceroute/mtr.

A lot of this will not apply to your situation, but some will:

Some of my checklist items when presented with assertions of poor
network performance, in no particular order:

*) Is *any one* CPU on either end of the transfer at or close to 100%
utilization? A given TCP connection cannot really take advantage
of more than the services of a single core in the system, so
average CPU utilization being low does not a priori mean things are
OK.
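
A quick way to look per-core rather than at the average (assuming the
sysstat package is installed; plain top with the '1' key toggle works
too):

mpstat -P ALL 1
# or: run top and press '1' to show each CPU separately during the test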

*) Are there TCP retransmissions being registered in netstat
statistics on the sending system? Take a snapshot of netstat -s -t
from just before the transfer, and one from just after and run it
through beforeafter from
ftp://ftp.cup.hp.com/dist/networking/tools:

netstat -s -t > before
transfer or wait 60 or so seconds if the transfer was already going
netstat -s -t > after
beforeafter before after > delta

*) Are there packet drops registered in ethtool -S statistics on
either side of the transfer? Take snapshots in a manner similar to
that with netstat.
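
Same before/after idea as with netstat, e.g.:

ethtool -S eth0 > before.ethtool
(run or continue the transfer)
ethtool -S eth0 > after.ethtool
diff before.ethtool after.ethtool    # or feed the pair to beforeafter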

*) Are there packet drops registered in the stats for the switch(es)
being traversed by the transfer? These would be retrieved via
switch-specific means.

*) What is the latency between the two end points? Install netperf on
both sides, start netserver on one side and on the other side run:

netperf -t TCP_RR -l 30 -H <remote>

and invert the transaction/s rate to get the RTT latency. There
are caveats involving NIC interrupt coalescing settings defaulting
in favor of throughput/CPU util over latency:

ftp://ftp.cup.hp.com/dist/networking/briefs/nic_latency_vs_tput.txt

but when the connections are over a WAN latency is important and
may not be clouded as much by NIC settings.
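
For example, a TCP_RR result of 12.5 transactions/s works out to
1/12.5 = 0.080 seconds, i.e. an 80 ms RTT (the single-byte
request/response payloads add essentially nothing to that).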

This all leads into:

*) What is the *effective* TCP (or other) window size for the
connection. One limit to the performance of a TCP bulk transfer
is:

Tput <= W(eff)/RTT

The effective window size will be the lesser of:

a) the classic TCP window advertised by the receiver (the value in
the TCP header's window field shifted by the window scaling
factor exchanged during connection establishment - which is why
one wants traces that include the connection establishment)

this will depend on whether/what the receiving application has
requested via a setsockopt(SO_RCVBUF) call and the sysctl limits
set in the OS. If the application does not call
setsockopt(SO_RCVBUF) then the Linux stack will "autotune" the
advertised window based on other sysctl limits in the OS.

b) the computed congestion window on the sender - this will be
affected by the packet loss rate over the connection, hence the
interest in the netstat and ethtool stats.

c) the quantity of data to which the sending TCP can maintain a
reference while waiting for it to be ACKnowledged by the
receiver - this will be akin to the classic TCP window case
above, but on the sending side, and concerning
setsockopt(SO_SNDBUF) and sysctl settings.

d) the quantity of data the sending application is willing/able to
send at any one time before waiting for some sort of
application-level acknowledgement. FTP and rcp will just blast
all the data of the file into the socket as fast as the socket
will take it. scp has some application-layer "windowing" which
may cause it to put less data out onto the connection than TCP
might otherwise have permitted. NFS has the maximum number of
outstanding requests it will allow at one time acting as a
de facto "window", and so on.
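
To put some of this thread's numbers into that inequality (just as an
illustration):

W needed for 2 Mbit/s at 80 ms:   2,000,000 b/s * 0.080 s = 160,000 bits
                                  = 20,000 bytes, i.e. about 20 KB
W implied by the 0.5 Mbit/s seen:   500,000 b/s * 0.080 s / 8 ~= 5,000 bytes

So if a single connection is stuck at 0.5 Mbit/s, *something* is holding
the effective window down around 5 KB - whether that is the advertised
window, the congestion window, or the sending side.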

rick jones
--
It is not a question of half full or empty - the glass has a leak.
The real question is "Can it be patched?"
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
From: unruh on
On 2010-06-15, pk <pk(a)pk.invalid> wrote:
> Apologies for the only marginally topical question.
>
> On a properly tuned system (which all modern linux systems should be), is a
> single TCP connection supposed to be able to use all the available
> bandwidth?

Depends what you mean by that.
>
> The issue I'm seeing is as follows. Testing between two computers over a
> transatlantic WAN link (RTT about 80ms). iperf and nuttcp report roughly 0.5
> Mbit/sec throughput, no matter how I tune the TCP settings on the two hosts.

So. It is the link. Why you would try to test things on a transatlantic
cable I have no idea.


> As far as I know, recent Linux kernels should support TCP autotuning, so
> messing around shouldn't even be necessary. But still, I tried to modify the
> suggested settings to no avail (net.core.rmem_max = 16777216,
> net.core.wmem_max = 16777216, net.ipv4.tcp_rmem = 4096 87380 16777216,
> net.ipv4.tcp_wmem = 4096 65536 16777216. I also tried other settings as
> found in articles dealing with TCP tuning, but basically they all ended up
> modifying those values, although the numbers differed slightly. TCP window
> scaling and selective acknowledgements are all enabled).
>
> But, if I run multiple connections concurrently (like eg nuttcp -N2 or
> more), each connection gets 1Mbit/sec throughput: nuttcp -N4 (running 4
> concurrent threads) totals 2 Mbit/sec.
> I would expect to be able to get the same 2 Mbit/sec with a single TCP
> connection, too.

Depends on what you are trying to do.
How about transferring a large file (e.g. the kernel) over that link all
at once, and see what throughput you get.
Remember that with the 80 ms RTT, any packet housekeeping will take far
longer than transferring the file will (at 2 Mb/s, 80 ms is the
equivalent of 20 KB of transferred data).
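
Something like this is enough for a rough single-stream number (scp adds
its own crypto and application-level windowing, so treat the result as a
lower bound; the kernel image path is just a convenient large-ish file):

scp /boot/vmlinuz-$(uname -r) user@remotehost:/tmp/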


>
> I'm not sure whether this is expected behavior, or if not, if there's
> something else I forgot to check or where to look to debug this.
> I have very limited knowledge of the intermediate network topology, other
> than what's shown by traceroute/mtr.
>
> Any pointers appreciated. Thanks.
From: D. Stussy on
"pk" <pk(a)pk.invalid> wrote in message news:2959923.OKlZ2vcFHj(a)xkzjympik...
> Apologies for the only marginally topical question.
>
> On a properly tuned system (which all modern linux systems should be), is a
> single TCP connection supposed to be able to use all the available
> bandwidth?

If it's the only network-bound process at the time, yes, there's no reason
it shouldn't.

If you wish to apportion the existing bandwidth among several network-bound
tasks, see the "tc" command from the IProute2 package (
http://www.linux-foundation.org/en/Net:Iproute2 ). The HTB qdisc is
exceptionally good at splitting the bandwidth as allocated, and, with the
"ceil[ing]" parameter, at lending any unused allocation to the classes that
can use it. However, it may not help if the issue concerns the bandwidth of
two instances of the same application. Don't forget that this requires
certain options enabled in the kernel.
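
A minimal sketch of the idea (the interface name, rates and port match
are only placeholders - adapt them to your setup):

# 2 Mbit/s total, split into two classes that may borrow up to the ceiling
tc qdisc add dev eth0 root handle 1: htb default 20
tc class add dev eth0 parent 1:  classid 1:1  htb rate 2mbit ceil 2mbit
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 1mbit ceil 2mbit
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 1mbit ceil 2mbit
# steer, say, traffic to port 5001 into class 1:10
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
   match ip dport 5001 0xffff flowid 1:10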



From: pk on
Rick Jones wrote:

> Just for grins, I would suggest some explicit socket buffer size
> settings. I don't know how that is done with iperf or nuttcp, but for
> netperf it would be something along the lines of:
>
> netperf -H <remote> -- -m 64K -s <size> -S <size>
>
> I'd probably try something like:
>
> for s in 32 64 128 256
> do
> netperf -H <remote> -- -m 64K -s $s -S $s
> done

This is what I get (pretty much same results in both directions, so I'm only
pasting one direction):

# for s in 32 64 128 256; do netperf -H 1x.x.x.x -p 5001 -- -P 5002 -m 64K -s $s -S $s; done
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 5002 AF_INET to 1x.x.x.x (1x.x.x.x) port 5002 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

   256    2048   65536    10.52       0.02
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 5002 AF_INET to 1x.x.x.x (1x.x.x.x) port 5002 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

   256    2048   65536    10.88       0.02
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 5002 AF_INET to 1x.x.x.x (1x.x.x.x) port 5002 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

   256    2048   65536    10.08       0.02
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 5002 AF_INET to 1x.x.x.x (1x.x.x.x) port 5002 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

   512    2048   65536    10.09       0.05

>> But, if I run multiple connections concurrently (like eg nuttcp -N2
>> or more), each connection gets 1Mbit/sec throughput: nuttcp -N4
>> (running 4 concurrent threads) totals 2 Mbit/sec.
>> I would expect to be able to get the same 2 Mbit/sec with a single
>> TCP connection, too.
>
> Are there any TCP retransmissions recorded on the sender during the
> tests?

Yes, there is an average of ~85 retransmissions on the sending side when
running the 4 netperf tests above (in total, not per test).
A curious thing I noticed is that all the retransmissions occur with the
-[sS] 32, 64 and 128 tests only; NO retransmissions ever occur when running
the -[sS] 256 test. I've run the tests many times, and the 256 one
consistently showed no retransmissions (and the others did).

> What sort of interfaces are being used on the systems on either end?

One end is physical ethernet, the other end is a KVM virtual machine bridged
to an ethernet network (the guest is using the virtio driver). The KVM host
is not very loaded, as this is the only VM running on it.

> Any chance that interrupt coalescing is getting in the way?

# ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off TX: off
stats-block-usecs: 999936
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 18
rx-frames: 6
rx-usecs-irq: 18
rx-frames-irq: 6

tx-usecs: 80
tx-frames: 20
tx-usecs-irq: 80
tx-frames-irq: 20

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

> Might also be interesting to know if the 2MBit/s WAN link is a single
> link "uinder the covers" or if it is some sort of bonding of lesser
> bandwidth links.

Yes, it might be interesting :)

>> I'm not sure whether this is expected behavior, or if not, if
>> there's something else I forgot to check or where to look to debug
>> this. I have very limited knowledge of the intermediate network
>> topology, other than what's shown by traceroute/mtr.
>
> A lot of this will not apply to your situation, but some will:
>
> Some of my checklist items when presented with assertions of poor
> network performance, in no particular order:
>
> *) Is *any one* CPU on either end of the transfer at or close to 100%
> utilization? A given TCP connection cannot really take advantage
> of more than the services of a single core in the system, so
> average CPU utilization being low does not a priori mean things are
> OK.

Neither box uses more than 0.3% with any test (one is a dual (virtual) CPU,
the other is a dual 4-core CPU).

> *) Are there TCP retransmissions being registered in netstat
> statistics on the sending system? Take a snapshot of netstat -s -t
> from just before the transfer, and one from just after and run it
> through beforeafter from
> ftp://ftp.cup.hp.com/dist/networking/tools:
>
> netstat -s -t > before
> transfer or wait 60 or so seconds if the transfer was already going
> netstat -s -t > after
> beforeafter before after > delta

Yes, see above.

> *) Are there packet drops registered in ethtool -S statistics on
> either side of the transfer? Take snapshots in a manner similar to
> that with netstat.

No packet drops at the ethernet level.

> *) Are there packet drops registered in the stats for the switch(es)
> being traversed by the transfer? These would be retrieved via
> switch-specific means.

No drops, at least in the switches I can access (ie those immediately
connected to the two hosts).

> *) What is the latency between the two end points? Install netperf on
> both sides, start netserver on one side and on the other side run:
>
> netperf -t TCP_RR -l 30 -H <remote>
>
> and invert the transaction/s rate to get the RTT latency. There
> are caveats involving NIC interrupt coalescing settings defaulting
> in favor of throughput/CPU util over latency:
>
> ftp://ftp.cup.hp.com/dist/networking/briefs/nic_latency_vs_tput.txt
>
> but when the connections are over a WAN latency is important and
> may not be clouded as much by NIC settings.

This pretty much gives me the same latency I measured earlier, ie 1/12.87 =
0.078 seconds (about 78 ms), which is consistent with what ping shows.

> This all leads into:
>
> *) What is the *effective* TCP (or other) window size for the
> connection. One limit to the performance of a TCP bulk transfer
> is:
>
> Tput <= W(eff)/RTT
>
> The effective window size will be the lesser of:
>
> a) the classic TCP window advertised by the receiver (the value in
> the TCP header's window field shifted by the window scaling
> factor exchanged during connection establishment - which is why
> one wants traces that include the connection establishment)
>
> this will depend on whether/what the receiving application has
> requested via a setsockopt(SO_RCVBUF) call and the sysctl limits
> set in the OS. If the application does not call
> setsockopt(SO_RCVBUF) then the Linux stack will "autotune" the
> advertised window based on other sysctl limits in the OS.
>
> b) the computed congestion window on the sender - this will be
> affected by the packet loss rate over the connection, hence the
> interest in the netstat and ethtool stats.
>
> c) the quantity of data to which the sending TCP can maintain a
> reference while waiting for it to be ACKnowledged by the
> receiver - this will be akin to the classic TCP window case
> above, but on the sending side, and concerning
> setsockopt(SO_SNDBUF) and sysctl settings.
>
> d) the quantity of data the sending application is willing/able to
> send at any one time before waiting for some sort of
> application-level acknowledgement. FTP and rcp will just blast
> all the data of the file into the socket as fast as the socket
> will take it. scp has some application-layer "windowing" which
> may cause it to put less data out onto the connection than TCP
> might otherwise have permitted. NFS has the maximum number of
> outstanding requests it will allow at one time acting as a
> de facto "window", and so on.

Not sure this helps, but tcpdumping during the nuttcp test, the peers
initially advertise a window of 5840 and 5792 (sender/receiver), both with
window scaling == 2^9. As the test progresses, the receiver advertises
bigger and bigger window sizes (scaled values: 8704, 11264, 13824, 16896
etc. up to 82944 at the end of the test). The sender advertises 6144 from
the second segment onwards and sticks to that.

During the netperf test, on the other hand, both parties stick to a window
size of 512, which produces many "TCP Window Full" segments in Wireshark
(roughly one segment in two).
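
Incidentally, if I plug those numbers into the W/RTT limit you mentioned,
they seem to line up (rough arithmetic, assuming the ~80 ms RTT): a
256-byte receive buffer allows at most 256 * 8 / 0.08 = 25.6 kbit/s, which
matches the 0.02 Mbit/s netperf reports, and the 512-byte case gives
51.2 kbit/s ~= 0.05 Mbit/s. By the same logic, sustaining 2 Mbit/s over
this path would need an effective window of about
2,000,000 * 0.08 / 8 = 20,000 bytes, so the question is what keeps the
single nuttcp/iperf connection down around the ~5 KB equivalent of
0.5 Mbit/s.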

Thanks.