From: John Marshall on
Environment: Sendmail 8.14.4 on FreeBSD 8.0

I posted here earlier this month asking for troubleshooting tips for
this problem. The problem is VERY intermittent and has only just
recurred (on a different system). This time I was armed with clues on
how to get gdb to attach to the stuck process and produce a backtrace.

I got a backtrace from the stuck queue runner processes and thought I'd
get a backtrace from a happy queue runner as well. They were identical!
When I exited gdb I noticed that the 'stuck' queue runner had woken up
and flushed its queues. I'm thinking that this smells like the
operating system.

PROCESS STATES BEFORE DEBUG (stuck runners are in 'I' state)

PID TT STAT TIME COMMAND
80298 ?? Ss 0:17.68 sendmail: accepting connections (sendmail)
80299 ?? I 0:46.62 sendmail: running queue: /var/spool/mqueue/qd1/df (sendmail)
80300 ?? I 0:08.83 sendmail: running queue: /var/spool/mqueue/mby/df (sendmail)
80301 ?? S 0:31.58 sendmail: running queue: /var/spool/mqueue/oz/df (sendmail)
80302 ?? S 0:30.71 sendmail: running queue: /var/spool/mqueue/rw2/df (sendmail)
80303 ?? S 0:33.29 sendmail: running queue: /var/spool/mqueue/hold/df (sendmail)
80304 ?? S 0:30.55 sendmail: running queue: /var/spool/mqueue/pgp/df (sendmail)

BACKTRACE OF STUCK PROCESS 80299

(gdb) bt
#0 0x28346547 in sigsuspend () from /lib/libc.so.7
#1 0x28344e98 in sigpause () from /lib/libc.so.7
#2 0x2833be3e in pause () from /lib/libc.so.7
#3 0x080cc7c8 in sleep ()
#4 0x08099c51 in run_work_group ()
#5 0x08099ebf in runqueue ()
#6 0x0805538d in main ()

BACKTRACE OF HAPPY PROCESS 80301

(gdb) bt
#0 0x28346547 in sigsuspend () from /lib/libc.so.7
#1 0x28344e98 in sigpause () from /lib/libc.so.7
#2 0x2833be3e in pause () from /lib/libc.so.7
#3 0x080cc7c8 in sleep ()
#4 0x08099c51 in run_work_group ()
#5 0x08099ebf in runqueue ()
#6 0x0805538d in main ()

PROCESS STATES AFTER DEBUG

PID TT STAT TIME COMMAND
80298 ?? Ss 0:17.69 sendmail: accepting connections (sendmail)
80299 ?? S 0:46.66 sendmail: running queue: /var/spool/mqueue/qd1/df (sendmail)
80300 ?? S 0:08.85 sendmail: running queue: /var/spool/mqueue/mby/df (sendmail)
80301 ?? S 0:31.60 sendmail: running queue: /var/spool/mqueue/oz/df (sendmail)
80302 ?? S 0:30.73 sendmail: running queue: /var/spool/mqueue/rw2/df (sendmail)
80303 ?? S 0:33.32 sendmail: running queue: /var/spool/mqueue/hold/df (sendmail)
80304 ?? S 0:30.58 sendmail: running queue: /var/spool/mqueue/pgp/df (sendmail)

Obviously the debugger woke up the soundly-sleeping processes.

--
John Marshall
From: Claus Aßmann on
John Marshall wrote:

> I got a backtrace from the stuck queue runner processes and thought I'd
> get a backtrace from a happy queue runner as well. They were identical!
> When I exited gdb I noticed that the 'stuck' queue runner had woken up
> and flushed its queues. I'm thinking that this smells like the
> operating system.

> BACKTRACE OF STUCK PROCESS 80299

> #0 0x28346547 in sigsuspend () from /lib/libc.so.7
> #1 0x28344e98 in sigpause () from /lib/libc.so.7
> #2 0x2833be3e in pause () from /lib/libc.so.7
> #3 0x080cc7c8 in sleep ()
> #4 0x08099c51 in run_work_group ()

sleep() should be one of these calls:

if (njobs == 0 && WorkGrp[wgrp].wg_lowqintvl < MIN_SLEEP_TIME)
sleep(MIN_SLEEP_TIME);
else if (WorkGrp[wgrp].wg_lowqintvl <= 0)
sleep(QueueIntvl > 0 ? QueueIntvl : MIN_SLEEP_TIME);
else
sleep(WorkGrp[wgrp].wg_lowqintvl);

Unless you have a really large value for one of these, the process
should continue after a while.

Note: on some systems sleep() has a problem when used in a
multi-threaded program. sendmail should not be compiled
with pthreads (only libmilter does). You could try the
workaround from libmilter/listener.c: look for MI_SLEEP().
From: John Marshall on
On 22 Feb 2010 00:45:22 GMT, Claus A?mann wrote:
> John Marshall wrote:
>> I got a backtrace from the stuck queue runner processes and thought I'd
>> get a backtrace from a happy queue runner as well. They were identical!
>> When I exited gdb I noticed that the 'stuck' queue runner had woken up
>> and flushed its queues. I'm thinking that this smells like the
>> operating system.
>
> sleep() should be one of these calls:
>
> if (njobs == 0 && WorkGrp[wgrp].wg_lowqintvl < MIN_SLEEP_TIME)
> sleep(MIN_SLEEP_TIME);
> else if (WorkGrp[wgrp].wg_lowqintvl <= 0)
> sleep(QueueIntvl > 0 ? QueueIntvl : MIN_SLEEP_TIME);
> else
> sleep(WorkGrp[wgrp].wg_lowqintvl);
>
> Unless you have a really large value for one of these, the process
> should continue after a while.

I'm not overriding any of the sendmail defaults, so this should always
be 5s (from MIN_SLEEP_TIME) or 1s (from QueueIntvl).

I've written this up and sent it to the FreeBSD-STABLE mailing list:

<http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055368.html>

--
John Marshall
From: John Marshall on
On Tue, 23 Feb 2010 02:41:07 +0000 (UTC), John Marshall wrote:
> On 22 Feb 2010 00:45:22 GMT, Claus A?mann wrote:
>> sleep() should be one of these calls:
-- snip --
>> Unless you have a really large value for one of these, the process
>> should continue after a while.
>
> I'm not overriding any of the sendmail defaults, so this should always
> be 5s (from MIN_SLEEP_TIME) or 1s (from QueueIntvl).

SLEEPING (HAPPY) QUEUE RUNNER - 8.14.4 Build on FreeBSD 8.0
(gdb) bt
#0 0x2833f6eb in sigsuspend () from /lib/libc.so.7
#1 0x2833e050 in sigpause () from /lib/libc.so.7
#2 0x28334fae in pause () from /lib/libc.so.7
#3 0x080cc975 in sleep ()
#4 0x08099d96 in run_work_group ()
#5 0x0809a016 in runqueue ()
#6 0x08055379 in main ()

SLEEPING (HAPPY) QUEUE RUNNER - 8.14.4 Hacked Build on FreeBSD 8.0
(gdb) bt
#0 0x283c43f7 in nanosleep () from /lib/libc.so.7
#1 0x080cbf7a in sleep ()
#2 0x08099c81 in run_work_group ()
#3 0x08099eef in runqueue ()
#4 0x080553bd in main ()

Is there any reason why we shouldn't be using FreeBSD's nanosleep(2) in
sendmail's sleep() in libsm/clock.c?

Sendmail's sleep() will use nanosleep if HAVE_NANOSLEEP is defined. The
only place I found that defined is in include/sm/conf.h, and only for
SOLARIS. I defined HAVE_NANOSLEEP for FreeBSD in include/sm/conf.h and
re-built to produce the above. Is there any reason systems other than
SOLARIS shouldn't use their nanosleep?

--
John Marshall
From: Claus Aßmann on
John Marshall wrote:

> Is there any reason why we shouldn't be using FreeBSD's nanosleep(2) in
> sendmail's sleep() in libsm/clock.c?

I don't know any. It's probably just because nobody ever
told sendmail.org about it (unfortunately sendmail 8 uses a
"hand coded" feature definition file and the developers
have lost access to most OSs and get very little feedback
-- except for the Sun version where someone at Sun always
provides us with the latest updates.)

If you can tell us which FreeBSD versions have a reliable
nanosleep() implementation, then we can add it to conf.h
(patches welcome :-)