From: Gary Smith on
> Have you disabled window scaling on your Postfix server? Lost connections
> are often the result of firewalls mangling "advanced" TCP features.
>
> - Disable window scaling
> - Disable ECN
>

I don't believe we have disabled any of the advanced features. That will give me something to do this weekend. I was thinking that maybe Wietse was right and that it's a conntrack issue, but changing ipvsadm to persistent has reduced the number of connections lost after the DATA command.
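
For reference, a minimal sketch of how the two features suggested above could be checked on the Postfix host. It assumes a Linux kernel that exposes net.ipv4.tcp_window_scaling and net.ipv4.tcp_ecn under /proc/sys; the helper name show_tcp_feature is made up for illustration. It only reports the current values, and the commented lines show how one might disable the features for a test.

```shell
#!/bin/sh
# Report the current state of the two TCP features discussed above.
# Assumes a Linux host; prints "unavailable" where the sysctl file is missing.
show_tcp_feature() {
    name="$1"
    file="/proc/sys/net/ipv4/$name"
    if [ -r "$file" ]; then
        printf '%s = %s\n' "$name" "$(cat "$file")"
    else
        printf '%s = unavailable\n' "$name"
    fi
}

show_tcp_feature tcp_window_scaling
show_tcp_feature tcp_ecn

# To disable them for a test (as root; not persistent across reboots):
#   sysctl -w net.ipv4.tcp_window_scaling=0
#   sysctl -w net.ipv4.tcp_ecn=0
```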

What I'm thinking is that there are some tweaks I need to make to the timeouts for connections being NAT'ed back through ipvsadm. For some reason I had assumed that iptables connection tracking and ipvsadm NAT tracking were interrelated, but the more I look, the less that appears to be the case. So it could be similar to what Wietse thought, just from a different angle.
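
A rough sketch of the ipvsadm commands that seem relevant to those timeouts. By default it only prints the commands; setting RUN=1 (as root, on the director) would execute them. The helper names run and ipvs_timeouts are hypothetical, and the three timeout values are placeholders, not recommendations.

```shell
#!/bin/sh
# Print (or, with RUN=1 and root privileges, execute) ipvsadm commands for
# inspecting and adjusting IPVS connection timeouts.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

ipvs_timeouts() {
    tcp="$1"; tcpfin="$2"; udp="$3"
    run ipvsadm -L --timeout                   # show current tcp/tcpfin/udp timeouts
    run ipvsadm -L -n -c                       # list tracked connections with expiry
    run ipvsadm --set "$tcp" "$tcpfin" "$udp"  # set new timeouts, in seconds
}

# Placeholder values only.
ipvs_timeouts 900 120 300
```

Note that iptables connection tracking keeps its own timeouts separately, so both places may need checking.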

While I'm in there I'll look at making sure all of the other settings are sane for the firewall boxes.

Gary-

From: Wietse Venema on
Gary Smith:
> > If the NAT assumes that everything is a web client and drops
> > connections after a few seconds, then Postfix will report lost
> > connections.
> >
> > If the NAT keeps connections open but it is a crappy box that can
> > maintain state for only 100 connections, then it will be forced
> > to drop connections, and Postfix will report lost connections.
>
> I was thinking that at first. The firewall has a high connection
> timeout and we tweaked up the connection tracking buckets pretty
> high, but still under the 4g of ram it has. The case that was
> pointed out failed after receiving a few mb in the first transmission
> and only a couple hundred k in the retries.

Is it too much trouble to show the records of a few connections?
You can anonymize host and address information.

Wietse

From: Wietse Venema on
Gary Smith:
> May 13 18:48:33 host01 postfix/smtpd[18110]: connect from sender[senderip]
> May 13 18:48:33 host01 postfix/smtpd[18110]: setting up TLS connection from sender[senderip]
> May 13 18:48:33 host01 postfix/smtpd[18110]: Anonymous TLS connection established from sender[senderip]: TLSv1 with cipher RC4-SHA (128/128 bits)
> May 13 18:48:37 host01 postfix/smtpd[18110]: B30AAAFE4F: sender[senderip]]
> May 13 18:48:42 host01 postfix/smtpd[18110]: lost connection after DATA (1723601 bytes) from sender[senderip]

This strongly suggests that you have a 10-second time limit
on the lifetime of NAT/VPS/whatever state.
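
To check whether the same interval shows up across many sessions, the time between "connect from" and "lost connection" can be computed per smtpd process. This is a minimal sketch, assuming syslog-style lines like the ones quoted above and that both events fall on the same day; the function name lost_durations is made up for illustration.

```shell
#!/bin/sh
# For each smtpd process, print the seconds elapsed between "connect from"
# and "lost connection after ...", reading Postfix syslog lines on stdin.
# Assumes both events happen on the same day (no midnight rollover handling).
lost_durations() {
    awk '
    function secs(t,  p) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
    / postfix\/smtpd\[[0-9]+\]: connect from / { start[$5] = secs($3) }
    / postfix\/smtpd\[[0-9]+\]: lost connection after / {
        if ($5 in start) {
            # $5 is the process tag, $9 the SMTP stage (RCPT, DATA, ...)
            print $5, $9, (secs($3) - start[$5]) "s"
            delete start[$5]
        }
    }'
}

# Usage (the log path is an assumption):
#   lost_durations < /var/log/maillog
```

For the session quoted above (connect at 18:48:33, lost at 18:48:42) this reports 9 seconds. If most lost sessions cluster around the same value, a fixed state timeout becomes more likely.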

Wietse

From: Gary Smith on
> This strongly suggests that you have a 10-second time limit
> on the lifetime of NAT/VPS/whatever state.
>
> Wietse

Makes complete sense. I will bounce it off the ipvsadm list, although they haven't tended to respond much recently.

BTW, I did notice, while analyzing some of the logs, that a good percentage of the lost connections were from "unknown" clients. I might be able to write off a number of these as spammers whose software doesn't disconnect cleanly. So I might be partly chasing a ghost.

May 13 04:08:33 host01 postfix/smtpd[10912]: lost connection after DATA from unknown[82.178.110.201]
May 13 04:08:34 host01 postfix/smtpd[10409]: lost connection after RCPT from unknown[109.96.25.206]
May 13 04:09:23 host01 postfix/smtpd[10301]: lost connection after RCPT from unknown[190.107.112.194]

[root tmp]# grep -c "lost connection after RCPT from" maillog
1646
[root tmp]# grep -c "lost connection after RCPT from unknown" maillog
1153
[root tmp]# grep -c "lost connection after DATA from" maillog
689
[root tmp]# grep -c "lost connection after DATA from unknown" maillog
465
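
The counts above work out to roughly 70% of the RCPT-stage losses (1153 of 1646) and 67% of the DATA-stage losses (465 of 689) coming from unknown clients. The same breakdown can be produced in one pass over the log. This is a sketch only, matching the same line shapes as the grep patterns above; the function name lost_summary is made up for illustration.

```shell
#!/bin/sh
# Tally "lost connection after <STAGE> from ..." lines per SMTP stage and
# count how many of them come from clients logged as unknown[...].
lost_summary() {
    awk '
    / lost connection after [A-Z]+ from / {
        stage = $9
        total[stage]++
        if ($11 ~ /^unknown\[/) unknown[stage]++
    }
    END {
        for (s in total)
            printf "%s total=%d unknown=%d (%.0f%%)\n",
                   s, total[s], unknown[s], 100 * unknown[s] / total[s]
    }' | sort
}

# Usage:
#   lost_summary < maillog
```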

Anyway, thanks to everyone for pointing me at where to look. I think the advanced TCP features, the timeout, and ipvsadm might be the biggest issues.

Gary-

From: Victor Duchovni on
On Fri, May 14, 2010 at 11:20:47AM -0700, Gary Smith wrote:

> May 13 04:08:33 host01 postfix/smtpd[10912]: lost connection after DATA from unknown[82.178.110.201]

Listed on SpamHaus XBL and PBL

> May 13 04:08:34 host01 postfix/smtpd[10409]: lost connection after RCPT from unknown[109.96.25.206]

Listed on SpamHaus XBL and PBL

> May 13 04:09:23 host01 postfix/smtpd[10301]: lost connection after RCPT from unknown[190.107.112.194]

Listed on SpamHaus XBL

Unless these listings postdate your log entries, you should probably
not allow these clients to get as far as "DATA".

reject_rbl_client zen.spamhaus.org

provided your traffic load is under the SpamHaus free access limit, and
your DNS is not slaved to an ISP or other public forwarder that handles
DNS for many different organizations.
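
To check by hand whether a client address is listed, the DNSBL query name is built by reversing the IPv4 octets and appending the list's zone. This is a minimal sketch; the helper name dnsbl_name is made up for illustration, and the actual lookup is left commented out because it needs working DNS that is not routed through a large public forwarder.

```shell
#!/bin/sh
# Build the DNS name used to query a DNSBL for an IPv4 address:
# the octets are reversed and the list's zone is appended.
dnsbl_name() {
    ip="$1"; zone="${2:-zen.spamhaus.org}"
    printf '%s\n' "$ip" |
        awk -F. -v z="$zone" 'NF == 4 { print $4 "." $3 "." $2 "." $1 "." z }'
}

dnsbl_name 82.178.110.201
# prints: 201.110.178.82.zen.spamhaus.org

# A listed address resolves to one or more 127.0.0.x answers;
# an unlisted one returns NXDOMAIN.
#   host "$(dnsbl_name 82.178.110.201)"
```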

--
Viktor.
