From: lundman on

Solaris 10, x86
sendmail-8.13.7

I am currently experiencing a problem on multiple servers.
clientmqueue has a handful of messages that it is attempting to
deliver to localhost:25. One of these messages fails, for whatever
reason, and sendmail -Ac receives a Broken Pipe. All remaining emails
in clientmqueue are then automatically deferred. This goes on until
the emails have expired and are bounced (or until I clean out the
offending email).

I would guess that one of my timeouts is set too low, which is why I
receive the timeout error and the abrupt disconnect. Naturally,
sendmail -Ac will then receive Broken Pipe. But it feels undesirable
that it would simply remember the problem with the connection to
localhost:25 and defer all other emails without trying. Can this be
disabled somehow? I am not using HostStatusDirectory (although we
were), nor StatusFile (although the default Sun submit.cf was).

Details:

Running /var/spool/clientmqueue/lB2JL9wt008760 (sequence 1 of 53)
iriiri21(a)censored... Connecting to [127.0.0.1] via relay...
[snip]
250 2.1.5 <iriiri21(a)censored>... Recipient ok
354 Enter mail, end with "." on a line by itself
myschool.co.kr: Name server timeout
anet.ne.jp: Name server timeout
>>> .
421 4.4.1 collect: read timeout on connection from localhost,
from=<customer(a)our-host>
>>> QUIT
iriiri21(a)censored... Deferred: 421 4.4.1 collect: read timeout on
connection from localhost, from=<customer(a)our-host>

Running /var/spool/clientmqueue/lB1BI2Ql026119 (sequence 2 of 53)
censored(a)email.com... Deferred

Running /var/spool/clientmqueue/lB1AQXfJ001015 (sequence 3 of 53)
censored(a)email.com... Deferred

Running /var/spool/clientmqueue/lB1ALu4c023269 (sequence 4 of 53)
censored(a)email.com... Deferred



truss details:

23344: =\r\n - - - - - - = _ N e x t P a r t _ 0 0 0 _ 0 0 A 7 _ 5
A 1
23344: 6 B C 8 3 . E 3 D D 7 4 3 E - -\r\n\r\n
23344: write(1, " > > > .\n", 6) = 6
23344: write(6, " .\r\n", 3) Err#32 EPIPE
23344: Received signal #13, SIGPIPE [ignored]
23344: read(7, 0x08279A28, 8192) = 99
23344: 4 2 1 4 . 4 . 1 c o l l e c t : r e a d t i m e o u
t
23344: o n c o n n e c t i o n f r o m l o c a l h o s t ,
f r
23344: o m = <

truss confirms that it does not read HostStatusDirectory or
StatusFile; it simply iterates over the remaining clientmqueue entries
and defers them. It then sleeps for 30 minutes and tries again. Since
the first email is the same problem email, it fails again, and all
remaining emails are deferred once more.


Now, it would seem my system has a timeout too lean, but fixing that
would only make the problem less likely to occur.

I can run two (or more) clientmqueue runners, but that, too, would
only make it less likely to occur.

Is there some way to disable this feature for submit.cf so that it does
not remember a previous connection failure to localhost:25? Even
though my configuration settings are most likely bad, it seems
undesirable that, whatever the reason for the failure, a single message
that will never pass delivery causes all other deliveries to localhost
to be skipped, and that they stay skipped until the problem is cleared
or all messages are bounced.



From: Per Hedeland on
In article
<f970ebf7-35a9-4d06-8ce1-cea7ab1e3184(a)d4g2000prg.googlegroups.com>
"lundman(a)lundman.net" <lundman(a)lundman.net> writes:
>
>I would guess that one of my timeouts is set too low, which is why I
>receive the timeout error and the abrupt disconnect. Naturally,
>sendmail -Ac will then receive Broken Pipe. But it feels undesirable
>that it would simply remember the problem with the connection to
>localhost:25 and defer all other emails without trying. Can this be
>disabled somehow? I am not using HostStatusDirectory (although we
>were), nor StatusFile (although the default Sun submit.cf was).

The "host status" is always cached during a queue run, the
HostStatusDirectory allows for remembering it *between* queue runs, and
StatusFile is not relevant at all here (it's effectively write-only as
far as sendmail is concerned).

>Running /var/spool/clientmqueue/lB2JL9wt008760 (sequence 1 of 53)
>iriiri21(a)censored... Connecting to [127.0.0.1] via relay...
>[snip]
>250 2.1.5 <iriiri21(a)censored>... Recipient ok
>354 Enter mail, end with "." on a line by itself
>myschool.co.kr: Name server timeout
>anet.ne.jp: Name server timeout
>>>> .
>421 4.4.1 collect: read timeout on connection from localhost,

>Now, it would seem my system has a timeout too lean, but fixing that
>would only make the problem less likely to occur.

Well, that depends - if you can limit the time "things" will take, you
can be sure that you don't exceed a timeout. And of course it is
"silly" that the delivery to localhost:25 by the MSP should ever take
long enough that the MTA times out. The "things" here are specifically
DNS lookups for canonicalization of addresses in headers - this is
pretty pointless to do in the MSP since the MTA will do it anyway
(unless you have disabled it there) - check out the section MESSAGE
SUBMISSION PROGRAM in cf/README for a way to handle this.
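
As a rough sketch of what that README section suggests (reproduced from
memory, so treat the exact feature argument and modifier as assumptions
and check them against the cf/README shipped with your sendmail
version), the MSP can be told to skip canonicalization in submit.mc:

```m4
dnl submit.mc fragment (sketch): stop the MSP from doing DNS
dnl canonicalization of addresses, leaving that work to the MTA.
dnl `canonify_hosts' still canonifies bare host names without a domain.
FEATURE(`nocanonify', `canonify_hosts')dnl
dnl `C' tells direct (command-line) submission not to canonify either.
define(`confDIRECT_SUBMISSION_MODIFIERS', `C')dnl
```

submit.cf then has to be rebuilt from submit.mc for the change to take
effect. The README's discussion of nocanonify lists the possible side
effects.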

HOWEVER, I would be very suspicious of why you have messages in your
clientmqueue with such problematic addresses in the headers in the first
place - the only thing there should normally be locally generated mail
from scripts or local users with "simplistic" MUAs. It seems somewhat
unlikely that they would generate mail with unresolvable domains - make
sure that you don't have some broken cgi script that allows spammers to
abuse your box (i.e. first of all check that the "problematic" message
is really legitimate).

>Is there some way to disable this feature for submit.cf so that it does
>not remember a previous connection failure to localhost:25?

If you finally want to go that route, you can set confTO_HOSTSTATUS to
0 in submit.mc, see cf/README.
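
A minimal sketch of that setting, assuming submit.cf is generated from
submit.mc in the usual way (confTO_HOSTSTATUS maps to the
Timeout.hoststatus option, whose documented default is 30m):

```m4
dnl submit.mc fragment (sketch): treat cached host status as stale
dnl immediately, so a failure on one message should not defer the rest
dnl of the queue run.
define(`confTO_HOSTSTATUS', `0')dnl
```

With this in place the MSP would be expected to retry the connection to
localhost:25 for every queued message instead of deferring the remainder
after the first failure.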

--Per Hedeland
per(a)hedeland.org



From: lundman on
On Dec 7, 7:22 am, p...(a)hedeland.org (Per Hedeland) wrote:
> In article
> <f970ebf7-35a9-4d06-8ce1-cea7ab1e3...(a)d4g2000prg.googlegroups.com>
>
> The "host status" is always cached during a queue run, the
> HostStatusDirectory allows for remembering it *between* queue runs, and
> StatusFile is not relevant at all here (it's effectively write-only as
> far as sendmail is concerned).

Ah interesting. It had to be an in-memory caching of some sort, I just
did not know the name it would use.



> Well, that depends - if you can limit the time "things" will take, you
> can be sure that you don't exceed a timeout. And of course it is
> "silly" that the delivery to localhost:25 by the MSP should ever take
> long enough that the MTA times out. The "things" here are specifically
> DNS lookups for canonicalization of addresses in headers - this is
> pretty pointless to do in the MSP since the MTA will do it anyway
> (unless you have disabled it there) - check out the section MESSAGE
> SUBMISSION PROGRAM in cf/README for a way to handle this.

So the MSP looks up hostnames; that does seem rather pointless. But
there is probably a side effect to disabling it in the MSP.


> HOWEVER, I would be very suspicious of why you have messages in your
> clientmqueue with such problematic addresses in the headers in the first
> place - the only thing there should normally be locally generated mail
> from scripts or local users with "simplistic" MUAs. It seems somewhat
> unlikely that they would generate mail with unresolvable domains - make
> sure that you don't have some broken cgi script that allows spammers to
> abuse your box (i.e. first of all check that the "problematic" message
> is really legitimate).

Oh it was spam, for sure. Which makes it worse in a way, that one bad
spam message will defer, and eventually bounce, all other legitimate
emails. Not thousands of spam, just one message :(


> If you finally want to go that route, you can set confTO_HOSTSTATUS to
> 0 in submit.mc, see cf/README.
>

It sounds like perhaps I do not want to go that route, judging from
your hints? It is nice to know the root cause. Admittedly, in this
situation it was a slow DNS resolver, but imagine that a specific email
triggers the MTA to core dump every time, creating Broken Pipes and,
from that, deferring all other emails. Insulting the programmers aside,
and granted that it would be so unlikely it isn't funny, wouldn't you
want to remove a problem that could bounce legitimate emails?

Thank you very much for your reply. I have increased the timeout, so at
the very least this should be unlikely to happen again. Now to
apologise to all customers...


Lund



From: Per Hedeland on
In article
<e8acc4e0-92f9-4314-b562-0625bdf1eeab(a)a35g2000prf.googlegroups.com>
"lundman(a)lundman.net" <lundman(a)lundman.net> writes:
>On Dec 7, 7:22 am, p...(a)hedeland.org (Per Hedeland) wrote:
>
>> Well, that depends - if you can limit the time "things" will take, you
>> can be sure that you don't exceed a timeout. And of course it is
>> "silly" that the delivery to localhost:25 by the MSP should ever take
>> long enough that the MTA times out. The "things" here are specifically
>> DNS lookups for canonicalization of addresses in headers - this is
>> pretty pointless to do in the MSP since the MTA will do it anyway
>> (unless you have disabled it there) - check out the section MESSAGE
>> SUBMISSION PROGRAM in cf/README for a way to handle this.
>
>So the MSP looks up hostnames; that does seem rather pointless. But
>there is probably a side effect to disabling it in the MSP.

Did you check cf/README?

>> HOWEVER, I would be very suspicious of why you have messages in your
>> clientmqueue with such problematic addresses in the headers in the first
>> place - the only thing there should normally be locally generated mail
>> from scripts or local users with "simplistic" MUAs. It seems somewhat
>> unlikely that they would generate mail with unresolvable domains - make
>> sure that you don't have some broken cgi script that allows spammers to
>> abuse your box (i.e. first of all check that the "problematic" message
>> is really legitimate).
>
>Oh it was spam, for sure.

So how did it end up in your clientmqueue?

>> If you finally want to go that route, you can set confTO_HOSTSTATUS to
>> 0 in submit.mc, see cf/README.
>
>It sounds like perhaps I do not want to go that route from your hints?

In general it's obviously sub-optimal, but I guess it's no real problem
in the MSP - e.g. AFAIK it will keep trying to connect to the MTA for
each message in the queue even if the MTA is actually down or something,
but it should be pretty cheap and maybe there's nothing else it could
spend CPU cycles on anyway... But I certainly think it's more of a hack
than the alternatives.

--Per Hedeland
per(a)hedeland.org