From: Mehul Ved on
On Mon, Dec 21, 2009 at 5:27 AM, Noel Jones <njones(a)megan.vbhcs.org> wrote:
> Your original complaint is that postfix is not sending or receiving mail.
>  The only evidence I see of that in the log file is
>
> Dec 19 21:21:42 www postfix/smtpd[18342]: NOQUEUE: reject: RCPT from
> unknown[190.2.225.89]: 550 5.1.1 <ndhe(a)hitechplast.net>: Recipient address
> rejected: User unknown in local recipient table;
> from=<moonlito7(a)riscy.monsanto.com> to=<ndhe(a)hitechplast.

Pasting more logs from previous day

> So...
> 1. TURN OFF VERBOSE LOGGING

I haven't enabled verbose logging, I checked both master.cf and
main.cf but there's no verbose logging enabled. Or am I missing
something?

> 2. show some evidence of postfix not sending or receiving mail.

logs after the problem occurred - http://pastebin.ca/1722318
logs before the problem occurred - http://pastebin.ca/1722322
I couldn't paste the whole log as the file is 50mb in size. Each day's
logs come out to 5MB+ too.

> 3. show the contents of master.cf

http://pastebin.ca/1722112

> I see that postdrop and smtpd log time differently.  Although this is
> somewhat annoying, it rarely leads to any operational problems.  What leads
> you to believe this causes a problem with accepting or delivering mail?
>
> Turn off verbose logging and show some more logs.  Don't show verbose logs
> unless you are specifically asked for them. 99.9% of postfix problems are
> easily diagnosed with normal logging.
> http://www.postfix.org/DEBUG_README.html
>
>>
>> $ postconf -n
>> smtp_use_tls = yes
>
> This parameter is deprecated.  Replace it with
> smtp_tls_security_level = may
> http://www.postfix.org/TLS_README.html
> but this won't cause problems except in rare cases where the receiving
> system offers STARTTLS but the receiver's TLS doesn't work.
>
>> smtpd_recipient_restrictions =
>> permit_sasl_authenticated,permit_mynetworks,reject_unauth_destination
>
> OK.
>
>> smtpd_use_tls = yes
>
> This parameter is deprecated.  Replace it with
> smtpd_tls_security_level = may
> http://www.postfix.org/TLS_README.html
> but this won't cause problems except in rare cases where the sender has a
> broken TLS implementation.

Thank you. Will change those.

>> $ qshape | head
>>                                         T  5 10 20 40  80 160 320 640 1280
>> 1280+
>>                                TOTAL 6598  3 15 17 45 125 293 372 366 1417
>>  3945
>>           www.progresspartners.co.in 5537  3 14 16 44 118 285 288 352 1222
>>  3195
>>                localhost.localdomain  668  0  0  0  0   0   0  57   0  136
>>   475
>
> Looks like a lot of deferred mail, even for localhost. Without corresponding
> logging, this is meaningless.

Not much going to deferred. It's stuck in active queue.


$ qshape active | head
T 5 10 20 40 80 160 320 640 1280 1280+
TOTAL 736 0 0 0 0 0 0 0 0 191 545
www.progresspartners.co.in 624 0 0 0 0 0 0 0 0 174 450
localhost.localdomain 106 0 0 0 0 0 0 0 0 15 91
saswatabanerjee.com 3 0 0 0 0 0 0 0 0 0 3
hitechplast.net 2 0 0 0 0 0 0 0 0 1 1
hitechplast.co.in 1 0 0 0 0 0 0 0 0 1 0

$ qshape -s active | head
T 5 10 20 40 80 160 320 640 1280 1280+
TOTAL 544 0 0 0 0 0 0 0 0 125 419
www.progresspartners.co.in 107 0 0 0 0 0 0 0 0 15 92
hitechplast.co.in 101 0 0 0 0 0 0 0 0 27 74
clearplastics.co.in 40 0 0 0 0 0 0 0 0 9 31
alerts.kotak.com 14 0 0 0 0 0 0 0 0 0 14
MAILER-DAEMON 12 0 0 0 0 0 0 0 0 7 5
gmail.com 10 0 0 0 0 0 0 0 0 3 7
progresspartners.co.in 10 0 0 0 0 0 0 0 0 2 8
hitechplast.net 7 0 0 0 0 0 0 0 0 4 3

$ qshape deferred | head
T 5 10 20 40 80 160 320 640 1280 1280+
TOTAL 6 0 0 0 0 0 0 0 0 3 3
castrol.co.in 2 0 0 0 0 0 0 0 0 0 2
ukrtel.net 1 0 0 0 0 0 0 0 0 1 0
kccworld.co.kr 1 0 0 0 0 0 0 0 0 0 1
alerts.kotak.com 1 0 0 0 0 0 0 0 0 1 0
technolution.com 1 0 0 0 0 0 0 0 0 1 0

$ qshape -s deferred | head
T 5 10 20 40 80 160 320 640 1280 1280+
TOTAL 6 0 0 0 0 0 0 0 0 3 3
MAILER-DAEMON 3 0 0 0 0 0 0 0 0 3 0
mipak.co.in 2 0 0 0 0 0 0 0 0 0 2
coatingspl.co.in 1 0 0 0 0 0 0 0 0 0 1

>> $ pflogsumm -d today /var/log/maillog
>>
>> Per-Hour Traffic Summary
>>     time          received  delivered   deferred    bounced     rejected
>>     --------------------------------------------------------------------
>>     0000-0100         207        361         14          3         70
>>     0100-0200         157         63          0         16        133
>>     0200-0300          83          0          0          0        239
>>     0300-0400          67          0          0          0        241
>>     0400-0500          64          0          0          0        177
>>     0500-0600          32          0          0          0        194
>>     0600-0700          51          0          0          0        217
>>     0700-0800          41          0          0          0        213
>>     0800-0900          46          0          0          0        189
>>     0900-1000         136          0          0          0        219
>>     1000-1100         304         91         14         10        251
>>     1100-1200         215          0          0          0        219
>>     1200-1300         331          0          0          0        470
>>     1300-1400         336         74          5          1        309
>>     1400-1500         359         95          2          1        330
>>     1500-1600         331          0          0          0        406
>>     1600-1700         302          0          0          0        363
>
> Curious.  This shows mail being received, but very few delivery attempts.
>  Logs?

Logs are above. And curiosly the delivery attempts only happen when I
restart postfix. It delivers mails for a couple of minutes and then
stops again.


> Are you using a content_filter or Mailscanner software?  Any upgrades run on
> this server recently?  Any obvious signs of postfix trouble?
> http://www.postfix.org/DEBUG_README.html#logging

I am running clamd for mail scanning.
There were no upgrades on the server.

From: Stan Hoeppner on
Mehul Ved put forth on 12/21/2009 5:01 AM:

> logs after the problem occurred - http://pastebin.ca/1722318
> logs before the problem occurred - http://pastebin.ca/1722322

You inbound for relay (I assume so due to the TLS):

Dec 19 00:03:18 www postfix/smtpd[23607]: connect from
c-24-34-20-2.hsd1.ma.comcast.net[24.34.20.2]
Dec 19 00:03:18 www postfix/smtpd[23607]: setting up TLS connection from
c-24-34-20-2.hsd1.ma.comcast.net[24.34.20.2]
Dec 19 00:03:19 www postfix/smtpd[23607]: TLS connection established from
c-24-34-20-2.hsd1.ma.comcast.net[24.34.20.2]: TLSv1 with cipher RC4-MD5 (128/128
bits)
Dec 19 00:03:20 www postfix/smtpd[23607]: 41A432CB8B2E:
client=c-24-34-20-2.hsd1.ma.comcast.net[24.34.20.2]
Dec 19 00:03:20 www postfix/cleanup[24661]: 41A432CB8B2E:
message-id=<20091218183320.41A432CB8B2E(a)www.progresspartners.co.in>
Dec 19 00:03:20 www postfix/smtpd[23607]: disconnect from
c-24-34-20-2.hsd1.ma.comcast.net[24.34.20.2]

Me inbound for relay:

Dec 20 15:17:42 greer postfix/smtpd[18910]: connect from
gffx.hardwarefreak.com[192.168.100.53]
Dec 20 15:17:42 greer postfix/smtpd[18910]: 4B64C6C439:
client=gffx.hardwarefreak.com[192.168.100.53]
Dec 20 15:17:42 greer postfix/cleanup[18912]: 4B64C6C439:
message-id=<4B2E9476.4090003(a)hardwarefreak.com>
Dec 20 15:17:42 greer postfix/qmgr[18646]: 4B64C6C439:
from=<stan(a)hardwarefreak.com>, size=2039, nrcpt=1 (queue active)
Dec 20 15:17:42 greer postfix/smtpd[18910]: disconnect from
gffx.hardwarefreak.com[192.168.100.53]
Dec 20 15:17:43 greer postfix/smtp[18913]: 4B64C6C439:
to=<postfix-users(a)postfix.org>, relay=mail.cloud9.net[168.100.1.3]:25, delay=1,
delays=0.04/0.02/0.52/0.43, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as
1FB851035EB)
Dec 20 15:17:43 greer postfix/qmgr[18646]: 4B64C6C439: removed

At the worst, it looks like your qmgr daemon isn't running, or isn't being
called, or, at the least, isn't being logged. Assuming it's the first, why
isn't your qmgr daemon running? If it's the 2nd, why isn't your cleanup process
handing the mail off to qmgr? If it's the 3rd, either you've got your logging
level set funky, _or_ the first is true: the qmgr daemon isn't running.

Or maybe you've got some special configuration I'm not familiar with that just
doesn't log qmngr messages.

--
Stan

From: Mehul Ved on
On Mon, Dec 21, 2009 at 5:40 PM, Stan Hoeppner <stan(a)hardwarefreak.com> wrote:
> At the worst, it looks like your qmgr daemon isn't running, or isn't being
> called, or, at the least, isn't being logged.  Assuming it's the first, why
> isn't your qmgr daemon running?  If it's the 2nd, why isn't your cleanup process
> handing the mail off to qmgr?  If it's the 3rd, either you've got your logging
> level set funky, _or_ the first is true:  the qmgr daemon isn't running..
>
> Or maybe you've got some special configuration I'm not familiar with that just
> doesn't log qmngr messages.

It has to be first or second because I see qmgr and cleanup working
for a minute or two if I restart postfix but again it stops without
any error or notice showing in the logs.
There's no special configuration or any changes made to the file which
would cause that.

This is something I found today
Dec 18 16:23:35 www postfix/pickup[23157]: fatal: watchdog timeout
Dec 19 00:52:16 www postfix/pickup[25514]: fatal: watchdog timeout
Dec 19 05:35:47 www postfix/smtp[27520]: fatal: watchdog timeout
Dec 19 06:38:43 www postfix/local[27540]: fatal: watchdog timeout
Dec 19 18:44:43 www postfix/smtp[30674]: fatal: watchdog timeout
Dec 19 18:46:35 www postfix/local[30718]: fatal: watchdog timeout
Dec 20 05:58:06 www postfix/local[30381]: fatal: watchdog timeout
Dec 20 10:46:21 www postfix/pickup[30721]: fatal: watchdog timeout
Dec 20 15:29:42 www postfix/local[31411]: fatal: watchdog timeout
Dec 20 15:33:50 www postfix/smtp[367]: fatal: watchdog timeout
Dec 20 20:12:54 www postfix/pickup[17393]: fatal: watchdog timeout
Dec 21 06:04:25 www postfix/local[939]: fatal: watchdog timeout
Dec 21 13:35:57 www postfix/pickup[16348]: fatal: watchdog timeout
Dec 21 16:50:13 www postfix/pickup[4898]: fatal: watchdog timeout

And the timing of the first problem coincides with when the problem started..
So, it turns out I was looking at the wrong thing. I need to find out
now as to why I am getting this error.

From: Mehul Ved on
On Mon, Dec 21, 2009 at 5:52 PM, Mehul Ved <mehul.n.ved(a)gmail.com> wrote:
> This is something I found today
> Dec 18 16:23:35 www postfix/pickup[23157]: fatal: watchdog timeout
> Dec 19 00:52:16 www postfix/pickup[25514]: fatal: watchdog timeout
> Dec 19 05:35:47 www postfix/smtp[27520]: fatal: watchdog timeout
> Dec 19 06:38:43 www postfix/local[27540]: fatal: watchdog timeout
> Dec 19 18:44:43 www postfix/smtp[30674]: fatal: watchdog timeout
> Dec 19 18:46:35 www postfix/local[30718]: fatal: watchdog timeout
> Dec 20 05:58:06 www postfix/local[30381]: fatal: watchdog timeout
> Dec 20 10:46:21 www postfix/pickup[30721]: fatal: watchdog timeout
> Dec 20 15:29:42 www postfix/local[31411]: fatal: watchdog timeout
> Dec 20 15:33:50 www postfix/smtp[367]: fatal: watchdog timeout
> Dec 20 20:12:54 www postfix/pickup[17393]: fatal: watchdog timeout
> Dec 21 06:04:25 www postfix/local[939]: fatal: watchdog timeout
> Dec 21 13:35:57 www postfix/pickup[16348]: fatal: watchdog timeout
> Dec 21 16:50:13 www postfix/pickup[4898]: fatal: watchdog timeout
>
> And the timing of the first problem coincides with when the problem started.
> So, it turns out I was looking at the wrong thing. I need to find out
> now as to why I am getting this error.

To add to that, when postfix is running, the server load spikes up to
around 120 as reported by top.

From: Stan Hoeppner on
Mehul Ved put forth on 12/21/2009 6:37 AM:
> On Mon, Dec 21, 2009 at 5:52 PM, Mehul Ved <mehul.n.ved(a)gmail.com> wrote:
>> This is something I found today
>> Dec 18 16:23:35 www postfix/pickup[23157]: fatal: watchdog timeout
>> Dec 19 00:52:16 www postfix/pickup[25514]: fatal: watchdog timeout
>> Dec 19 05:35:47 www postfix/smtp[27520]: fatal: watchdog timeout
>> Dec 19 06:38:43 www postfix/local[27540]: fatal: watchdog timeout
>> Dec 19 18:44:43 www postfix/smtp[30674]: fatal: watchdog timeout
>> Dec 19 18:46:35 www postfix/local[30718]: fatal: watchdog timeout
>> Dec 20 05:58:06 www postfix/local[30381]: fatal: watchdog timeout
>> Dec 20 10:46:21 www postfix/pickup[30721]: fatal: watchdog timeout
>> Dec 20 15:29:42 www postfix/local[31411]: fatal: watchdog timeout
>> Dec 20 15:33:50 www postfix/smtp[367]: fatal: watchdog timeout
>> Dec 20 20:12:54 www postfix/pickup[17393]: fatal: watchdog timeout
>> Dec 21 06:04:25 www postfix/local[939]: fatal: watchdog timeout
>> Dec 21 13:35:57 www postfix/pickup[16348]: fatal: watchdog timeout
>> Dec 21 16:50:13 www postfix/pickup[4898]: fatal: watchdog timeout
>>
>> And the timing of the first problem coincides with when the problem started.
>> So, it turns out I was looking at the wrong thing. I need to find out
>> now as to why I am getting this error.
>
> To add to that, when postfix is running, the server load spikes up to
> around 120 as reported by top.

This host is a web server that happens to run postfix correct? Not the other
way 'round? Does this web server run apache and php? Were both fully patched
up to the day the problems started?

I assume you see where I'm going with this.

--
Stan