From: Wietse Venema on
It's nice for a change to work with someone who provides actual
information, instead of that dork from last week who was just
venting his opinions and not giving people a chance to help.

In this case it looks like a Solaris TCP bug when sending data over
a connection that suffers from heavy packet loss. Apparently,
Solaris 10 stops retransmitting, and therefore the other side times
out.

I'm not wrapping the packets this time, so you will have to
widen your window to read this message comfortably.

Wietse

The three-way TCP handshake show no wscale options, so we don't
have to worry about borked firewalls mis-handling this:

09:16:40.786945 IP 213.207.90.2.59301 > 145.222.14.10.25: S 2466228028:2466228028(0) win 49640 <mss 1460,nop,nop,sackOK>
09:16:40.789806 IP 145.222.14.10.25 > 213.207.90.2.59301: S 3886146351:3886146351(0) ack 2466228029 win 5840 <mss 1380>
09:16:40.789829 IP 213.207.90.2.59301 > 145.222.14.10.25: . ack 1 win 49680

The handshake shows a round-trip time of 2.9ms. The receiver's MTU
is smaller than 1460, which suggests that his packets are encapsulated
in some other protocol. It does not matter for the problem at hand.

Next is a segment from the middle of transmission. The sender has
sent a full window up to byte 22233, but apparently there was major
packet loss after byte 5673.

09:16:51.869847 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win 16560
09:16:51.869861 IP 213.207.90.2.59301 > 145.222.14.10.25: P 20853:22233(1380) ack 137 win 49680
09:16:51.869874 IP 213.207.90.2.59301 > 145.222.14.10.25: . 5673:7053(1380) ack 137 win 49680
09:16:51.869976 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win 16560
09:16:51.869977 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win 16560
09:16:51.869978 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win 16560
09:16:51.870110 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win 16560
09:16:51.870111 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win 16560
09:16:51.870376 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win 16560
09:16:56.216513 IP 213.207.90.2.59301 > 145.222.14.10.25: . 5673:7053(1380) ack 137 win 49680
09:16:56.260383 IP 145.222.14.10.25 > 213.207.90.2.59301: . ack 7053 win 19320
09:16:56.260414 IP 213.207.90.2.59301 > 145.222.14.10.25: . 7053:8433(1380) ack 137 win 49680
09:16:56.260421 IP 213.207.90.2.59301 > 145.222.14.10.25: . 8433:9813(1380) ack 137 win 49680
09:16:56.263577 IP 145.222.14.10.25 > 213.207.90.2.59301: . ack 8433 win 22080
09:16:56.263588 IP 213.207.90.2.59301 > 145.222.14.10.25: . 9813:11193(1380) ack 137 win 49680
09:16:56.263610 IP 213.207.90.2.59301 > 145.222.14.10.25: . 11193:12573(1380) ack 137 win 49680
09:16:56.263844 IP 145.222.14.10.25 > 213.207.90.2.59301: . ack 9813 win 24840
09:16:56.263855 IP 213.207.90.2.59301 > 145.222.14.10.25: . 12573:13953(1380) ack 137 win 49680
09:16:56.263865 IP 213.207.90.2.59301 > 145.222.14.10.25: . 13953:15333(1380) ack 137 win 49680
09:16:56.266641 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 9813 win 24840
09:16:56.266776 IP 145.222.14.10.25 > 213.207.90.2.59301: . ack 11193 win 28980
09:16:56.266777 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 11193 win 28980
09:16:56.266800 IP 213.207.90.2.59301 > 145.222.14.10.25: . 15333:16713(1380) ack 137 win 49680

Sequences like this repeat through the entire session. This network
connection is so bad that your machine can send only 107656 bytes
in 85 seconds or 1.27 kbyte/s.

I notice there are many sequences like this:

09:16:51.869844 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win 16560
09:16:51.869845 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win 16560
09:16:51.869846 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win 16560
09:16:51.869847 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win 16560

and this:

09:16:51.869976 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win 16560
09:16:51.869977 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win 16560
09:16:51.869978 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win 16560

and this:

09:17:20.597737 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 40173 win 32767
09:17:20.597738 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 40173 win 32767
09:17:20.597739 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 40173 win 32767

And many more. This is consistent with network congestion. These
identical ACKs would have been sent several ms separated from each
other (remember, the round-trip time is about 3ms). Apparently the
ACKs were queued up behind other data that was clogging some pipe
somewere, and so they ended up traveling together.

Network congestion is also consistent with increased packet loss.

And now for the surprise: after 09:18:16.515592 the sender stops
retransmitting. That would be a sender TCP bug.

The first thing that happens next is 30s later, at 09:18:46.512567,
when the server reports a 421 timeout error and closes the connection.

09:18:16.512400 IP 213.207.90.2.59301 > 145.222.14.10.25: . 106413:107793(1380) ack 137 win 49680
09:18:16.512408 IP 213.207.90.2.59301 > 145.222.14.10.25: . 107793:109173(1380) ack 137 win 49680
09:18:16.512627 IP 145.222.14.10.25 > 213.207.90.2.59301: . ack 106413 win 32767
09:18:16.512644 IP 213.207.90.2.59301 > 145.222.14.10.25: . 109173:110553(1380) ack 137 win 49680
09:18:16.515557 IP 145.222.14.10.25 > 213.207.90.2.59301: . ack 107793 win 32767
09:18:16.515592 IP 213.207.90.2.59301 > 145.222.14.10.25: P 110553:111933(1380) ack 137 win 49680
09:18:16.515823 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 107793 win 32767
09:18:16.518354 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 107793 win 32767
[nothing for 30s]
09:18:46.512567 IP 145.222.14.10.25 > 213.207.90.2.59301: P 137:188(51) ack 107793 win 32767
09:18:46.513098 IP 145.222.14.10.25 > 213.207.90.2.59301: F 188:188(0) ack 107793 win 32767

The sender should not stop retransmitting like this. You may want
to provide the complete trace to the Solaris maintainers.

From: Martijn de Munnik on
On Thu, 21 Jan 2010 10:30:54 -0500 (EST), wietse(a)porcupine.org (Wietse
Venema) wrote:
> It's nice for a change to work with someone who provides actual
> information, instead of that dork from last week who was just
> venting his opinions and not giving people a chance to help.

I don't know how one can help me if I don't supply info ;)
>
> In this case it looks like a Solaris TCP bug when sending data over
> a connection that suffers from heavy packet loss. Apparently,
> Solaris 10 stops retransmitting, and therefore the other side times
> out.

Ok, this is useful help. I did reroute the message via a linux postfix
server and now it is successfully delivered. I try to find out if I can
solve the Solaris bug but at least our customer is happy now!
>
> I'm not wrapping the packets this time, so you will have to
> widen your window to read this message comfortably.

With your explanation I'm able to read the tcp dump, one day I will fully
understand it...
>
> Wietse
>
> The three-way TCP handshake show no wscale options, so we don't
> have to worry about borked firewalls mis-handling this:
>
> 09:16:40.786945 IP 213.207.90.2.59301 > 145.222.14.10.25: S
> 2466228028:2466228028(0) win 49640 <mss 1460,nop,nop,sackOK>
> 09:16:40.789806 IP 145.222.14.10.25 > 213.207.90.2.59301: S
> 3886146351:3886146351(0) ack 2466228029 win 5840 <mss 1380>
> 09:16:40.789829 IP 213.207.90.2.59301 > 145.222.14.10.25: . ack 1 win
49680
>
> The handshake shows a round-trip time of 2.9ms. The receiver's MTU
> is smaller than 1460, which suggests that his packets are encapsulated
> in some other protocol. It does not matter for the problem at hand.
>
> Next is a segment from the middle of transmission. The sender has
> sent a full window up to byte 22233, but apparently there was major
> packet loss after byte 5673.
>
> 09:16:51.869847 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win
> 16560
> 09:16:51.869861 IP 213.207.90.2.59301 > 145.222.14.10.25: P
> 20853:22233(1380) ack 137 win 49680
> 09:16:51.869874 IP 213.207.90.2.59301 > 145.222.14.10.25: .
> 5673:7053(1380) ack 137 win 49680
> 09:16:51.869976 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win
> 16560
> 09:16:51.869977 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win
> 16560
> 09:16:51.869978 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win
> 16560
> 09:16:51.870110 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win
> 16560
> 09:16:51.870111 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win
> 16560
> 09:16:51.870376 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win
> 16560
> 09:16:56.216513 IP 213.207.90.2.59301 > 145.222.14.10.25: .
> 5673:7053(1380) ack 137 win 49680
> 09:16:56.260383 IP 145.222.14.10.25 > 213.207.90.2.59301: . ack 7053 win
> 19320
> 09:16:56.260414 IP 213.207.90.2.59301 > 145.222.14.10.25: .
> 7053:8433(1380) ack 137 win 49680
> 09:16:56.260421 IP 213.207.90.2.59301 > 145.222.14.10.25: .
> 8433:9813(1380) ack 137 win 49680
> 09:16:56.263577 IP 145.222.14.10.25 > 213.207.90.2.59301: . ack 8433 win
> 22080
> 09:16:56.263588 IP 213.207.90.2.59301 > 145.222.14.10.25: .
> 9813:11193(1380) ack 137 win 49680
> 09:16:56.263610 IP 213.207.90.2.59301 > 145.222.14.10.25: .
> 11193:12573(1380) ack 137 win 49680
> 09:16:56.263844 IP 145.222.14.10.25 > 213.207.90.2.59301: . ack 9813 win
> 24840
> 09:16:56.263855 IP 213.207.90.2.59301 > 145.222.14.10.25: .
> 12573:13953(1380) ack 137 win 49680
> 09:16:56.263865 IP 213.207.90.2.59301 > 145.222.14.10.25: .
> 13953:15333(1380) ack 137 win 49680
> 09:16:56.266641 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 9813 win
> 24840
> 09:16:56.266776 IP 145.222.14.10.25 > 213.207.90.2.59301: . ack 11193
win
> 28980
> 09:16:56.266777 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 11193
win
> 28980
> 09:16:56.266800 IP 213.207.90.2.59301 > 145.222.14.10.25: .
> 15333:16713(1380) ack 137 win 49680
>
> Sequences like this repeat through the entire session. This network
> connection is so bad that your machine can send only 107656 bytes
> in 85 seconds or 1.27 kbyte/s.
>
> I notice there are many sequences like this:
>
> 09:16:51.869844 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win
> 16560
> 09:16:51.869845 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win
> 16560
> 09:16:51.869846 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win
> 16560
> 09:16:51.869847 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win
> 16560
>
> and this:
>
> 09:16:51.869976 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win
> 16560
> 09:16:51.869977 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win
> 16560
> 09:16:51.869978 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 5673 win
> 16560
>
> and this:
>
> 09:17:20.597737 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 40173
win
> 32767
> 09:17:20.597738 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 40173
win
> 32767
> 09:17:20.597739 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 40173
win
> 32767
>
> And many more. This is consistent with network congestion. These
> identical ACKs would have been sent several ms separated from each
> other (remember, the round-trip time is about 3ms). Apparently the
> ACKs were queued up behind other data that was clogging some pipe
> somewere, and so they ended up traveling together.
>
> Network congestion is also consistent with increased packet loss.
>
> And now for the surprise: after 09:18:16.515592 the sender stops
> retransmitting. That would be a sender TCP bug.
>
> The first thing that happens next is 30s later, at 09:18:46.512567,
> when the server reports a 421 timeout error and closes the connection.
>
> 09:18:16.512400 IP 213.207.90.2.59301 > 145.222.14.10.25: .
> 106413:107793(1380) ack 137 win 49680
> 09:18:16.512408 IP 213.207.90.2.59301 > 145.222.14.10.25: .
> 107793:109173(1380) ack 137 win 49680
> 09:18:16.512627 IP 145.222.14.10.25 > 213.207.90.2.59301: . ack 106413
win
> 32767
> 09:18:16.512644 IP 213.207.90.2.59301 > 145.222.14.10.25: .
> 109173:110553(1380) ack 137 win 49680
> 09:18:16.515557 IP 145.222.14.10.25 > 213.207.90.2.59301: . ack 107793
win
> 32767
> 09:18:16.515592 IP 213.207.90.2.59301 > 145.222.14.10.25: P
> 110553:111933(1380) ack 137 win 49680
> 09:18:16.515823 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 107793
win
> 32767
> 09:18:16.518354 IP 145.222.14.10.25 > 213.207.90.2.59301: P ack 107793
win
> 32767
> [nothing for 30s]
> 09:18:46.512567 IP 145.222.14.10.25 > 213.207.90.2.59301: P 137:188(51)
> ack 107793 win 32767
> 09:18:46.513098 IP 145.222.14.10.25 > 213.207.90.2.59301: F 188:188(0)
ack
> 107793 win 32767
>
> The sender should not stop retransmitting like this. You may want
> to provide the complete trace to the Solaris maintainers.

--
YoungGuns
Kasteleinenkampweg 7b
5222 AX 's-Hertogenbosch
T. 073 623 56 40
F. 073 623 56 39
www.youngguns.nl
KvK 18076568