From: A on
I am getting intermittent, unexpected results from waitpid on Solaris 9
running Perl 5.8.8.

Here is the scenario (the bare bones code is below).

Program_A, written in Perl, is invoked about a million times every
day. Most of the time, it invokes Program_B, which is written in C++,
using fork-exec. Program_A uses waitpid to get the exit code of
Program_B.
It works fine most of the time, but a few dozen times every day
waitpid apparently fails, and when it does, I get

$? is -1
$! is "No child processes"

In all of the cases I have investigated, the child process, Program_B,
started and completed gracefully with "exit(0)", and the pids in the
trace logs of both processes match.

The output from the code below in such a case is

Child pid=5196, exitCode=0xffffffff (No child processes)

Program_A itself is transient and short lived, and, depending on its
input, executes Program_B at most once.

What am I doing wrong?
How can I detect and correct this?

Thanks for your help.

# ------------------------------------------- begin code -------------------------------------------------
#!/usr/local/bin/perl

# program_A

my $cpid;
my $ec = undef;
my $em = undef;

sub getChildStatus
{
    my $tc = undef;
    my $tm = undef;
    my $r = undef;

    while ( 1 ) {
        $r = waitpid($cpid, 0);
        $tc = $?;
        $em = $!;
        last if ( -1 == $r || $r == $cpid );
        print STDERR "waitpid($cpid, 0) returned $r ( $? )\n";
    }
    if ( $cpid == $r ) {
        $ec = $tc;
        $em = $tm;
    }
}

sub sigCLDhandler
{
    my $sig = shift;
    print STDERR "caught SIG $sig\n";
    getChildStatus;
}


sub runIt
{
    my $oldSigCld = $SIG{CLD};
    local $SIG{CLD} = \&sigChldHandler;

    $cpid = fork;
    if ( ! defined $cpid ) {
        print STDERR "fork failed [ $! ]\n";
        return;
    }

    if ( 0 == $cpid ) {
        print STDERR "child pid $$ starting\n";

        exec program_B, .. .. ..;

        print STDERR "child pid $$: exec failed [$!], exiting with -1\n";
        exit(-1);
    } # 0 == $cpid i.e. the child

    getChildStatus; # only the parent reaches here
    $SIG{CLD} = $oldSigCld ;
} # runIt

#
# main
#
runIt;
if ( $ec ) {
    printf STDERR "Child pid=$cpid exitcode=%#08x msg=(%s)\n", $ec, $em;
}

# ------------------------------------------- end code -------------------------------------------------

From: xhoster on
"A" <ad_101(a)yahoo.com> wrote:
> What am I doing wrong?

You are mucking with $SIG{CLD} when, as far as I can tell, you have
no need to. getChildStatus (and the waitpid in it) can get called twice,
once from the signal handler and once from runIt. If it does get called
twice, the second time that child no longer exists, as it was already
waited on. Remove the $SIG{CLD} stuff.
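Here is a minimal sketch of runIt with all of the $SIG{CLD} handling removed. In this version waitpid is called exactly once, from the main line. Assumptions: run_it is a hypothetical name, and the running perl told to exit with status 3 ($^X -e 'exit 3') stands in for the elided "program_B, .. .. .." command. The status in $? is decoded the usual way: the exit code is $? >> 8, the terminating signal is $? & 127, and bit 128 flags a core dump.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;
use POSIX ();

# Sketch of runIt without any $SIG{CLD} handling (run_it is hypothetical).
# Nothing else in this program calls waitpid, so $? and $! still hold this
# call's results when they are copied.
sub run_it {
    my @cmd = @_;                 # ASSUMPTION: stands in for program_B, .. .. ..
    my $pid = fork;
    die "fork failed ($!)\n" unless defined $pid;

    if ($pid == 0) {              # the child
        exec { $cmd[0] } @cmd
            or print STDERR "exec failed in child $$ ($!)\n";
        POSIX::_exit(255);        # leave without flushing the parent's buffers
    }

    my $r = waitpid($pid, 0);     # only the parent reaches here
    my ($status, $err) = ($?, $!);
    die "waitpid($pid, 0) returned $r ($err)\n" unless $r == $pid;

    return {
        pid       => $pid,
        exit_code => $status >> 8,            # the value passed to exit()
        signal    => $status & 127,           # nonzero if killed by a signal
        core      => ($status & 128) ? 1 : 0,
    };
}

my $res = run_it($^X, '-e', 'exit 3');
printf STDERR "Child pid=%d exit=%d signal=%d\n",
    $res->{pid}, $res->{exit_code}, $res->{signal};
```

Because no handler ever reaps the child, there is no second waitpid that could leave -1 in $?.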

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
From: A on
On Feb 13, 3:44 pm, xhos...(a)gmail.com wrote:
>
> You are mucking with $SIG{CLD} when, as far as I can tell, you have
> no need to. getChildStatus (and the waitpid in it) can get called twice,
> once from the sig handler and once from the runIt. If it does get called
> twice, the second time that child no longer exists, as it was already
> waited on. Remove the $SIG{CLD} stuff.
>
> Xho
>

Thanks for your reply.

First, there's a typo in my original message.

The third line after the while(1) in getChildStatus should be
$tm = $!;
instead of
$em = $!;

Now, to your point that waitpid could get called twice.

Please note that the code is designed to guard against this: the
assignments to the globals $ec and $em are done if and only if waitpid
returns the matching pid. So, even if it is called twice, the second
waitpid returns -1 and getChildStatus returns without modifying the
globals.

Regarding your advice to remove the $SIG{CLD} stuff, there are 3 statements:

the first statement saves the existing handler,
the second statement installs the handler this routine needs,
and the last one re-installs the saved handler.

Which one(s) would you suggest I remove?

Yes, there's a deficiency (a bug, if you will) in the code: the saved
$SIG{CLD} handler should also be re-installed if fork fails, but that, I
think, is of no consequence to the problem at hand.

Thanks again.

From: xhoster on
"A" <ad_101(a)yahoo.com> wrote:
> On Feb 13, 3:44 pm, xhos...(a)gmail.com wrote:
> >
> > You are mucking with $SIG{CLD} when, as far as I can tell, you have
> > no need to. getChildStatus (and the waitpid in it) can get called
> > twice, once from the sig handler and once from the runIt. If it does
> > get called twice, the second time that child no longer exists, as it
> > was already waited on. Remove the $SIG{CLD} stuff.
> >
> > Xho
> >
>
> Thanks for your reply.
>
> First, there's a typo in my original message.
>
> The third line after the while(1) in getChildStatus should be
> $tm = $!;
> instead of
> $em = $!;
>
> Now, to the point that the waitpid could get called twice.
>
> Please note that the code is designed to guard against this, the
> assignments to the globals $ec and $em are done if and only if waitpid
> returns the matching pid.

The waitpid in one getChildStatus returns the expected pid and sets the
global $? and $!. Before that getChildStatus can copy them, the SIGCLD
handler runs and calls the other getChildStatus. Its waitpid returns -1
and overwrites the global $? and $! with its own values. For that call $r
does not satisfy the if, so it simply returns control to the first
getChildStatus. The first getChildStatus has the right pid recorded in $r
(a lexical, so it didn't get overwritten). However, it has the wrong $? and
$!, because those did get overwritten, and now those get recorded into your
$tc and $tm. That is where your exitcode of 0xffffffff (-1) comes from.
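The interleaving described above can be forced deterministically, which shows why checking $r alone does not help. Since 5.8, Perl defers signals and runs the Perl-level handler between operations. That means it can land between the main-line waitpid and the copy of $?.

This sketch makes several assumptions:

- It runs on a POSIX system.
- A SIGUSR1 sent to the process itself stands in for a SIGCLD that arrives at that unlucky moment.
- status_seen_by_main_line is a hypothetical helper.

The second handler shows an alternative if a handler really must stay: it localizes $? and $! inside the handler. perlipc's REAPER example localizes $! for the same reason.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;
use POSIX ();

# ASSUMPTION: SIGUSR1 sent to ourselves models a SIGCLD delivered after the
# main-line waitpid but before $? is copied. The handler reaps the same child
# again, like getChildStatus called from sigCLDhandler; that waitpid returns
# -1 and sets $? to -1.
sub status_seen_by_main_line {
    my ($localize) = @_;
    my $pid = fork;
    die "fork failed ($!)\n" unless defined $pid;
    POSIX::_exit(7) if $pid == 0;              # child: exit(7)

    my $ran = 0;
    local $SIG{USR1} = $localize
        ? sub { local ($?, $!); waitpid($pid, 0); $ran = 1 }  # restores $?
        : sub {                 waitpid($pid, 0); $ran = 1 }; # clobbers $?

    my $r = waitpid($pid, 0);                  # reaps the child: $? is 7 << 8
    kill 'USR1', $$;                           # the "late" signal
    my $spins = 0;
    $spins++ until $ran || $spins > 10_000_000;   # give Perl a chance to dispatch
    my $status = $?;                           # what $tc would receive
    return ($r == $pid ? 1 : 0, $status);
}

for my $localize (0, 1) {
    my ($reaped, $status) = status_seen_by_main_line($localize);
    printf "%s handler: main line reaped child=%s, then read \$? = %d\n",
        ($localize ? 'localized' : 'plain'), ($reaped ? 'yes' : 'no'), $status;
}
```

In both runs the main line gets the matching pid back. With the plain handler, though, the value it then reads from $? is the -1 left by the second waitpid. With the localized handler, it reads the child's real status (7 << 8).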

>
> On your advice to remove the $SIG{CLD}, there are 3 statements,
>
> the first statement saves the handler,
> the second statement installs the current one needed by this
> routine
> and the last one re-installs the saved handler.
>
> which one(s) would you suggest I remove?

Probably all of them, but it is not really possible to know from what you
give. We would need to see the code that set the original handler that is
getting saved and then restored. If the handler you inherit is necessary,
then why would it be safe to overwrite it with something else for even the
duration of this routine? On the other hand, if the handler you inherit is
not necessary, then what is the point of saving and re-installing it? If
there is no other code which installs a handler in the first place, then I'd
remove all three of those things. (And even if there is, remove at least
two; see below.)

> Yes, there's a deficiency (bug, if you will) in the code. The
> $SIG{CLD} should be re-installed if fork fails, but that I think, is
> of no consequence to the problem at hand.

Since you use local to install the handler, I think the old one will be
reinstalled upon fork failure anyway. Saving the old one and reinstalling
it explicitly seems unnecessary, assuming the local is doing its job.
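Here is a small sketch of the point about local: a handler installed with local $SIG{...} is undone when the enclosing sub exits by any route, including an early return like the "fork failed" branch. Assumptions: SIGUSR1 replaces SIGCLD only for illustration, and install_temporarily is a hypothetical stand-in for runIt.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# ASSUMPTION: SIGUSR1 instead of SIGCLD; install_temporarily() is hypothetical.
sub install_temporarily {
    my ($fork_failed) = @_;
    local $SIG{USR1} = sub { print STDERR "temporary handler\n" };
    return 'early' if $fork_failed;    # like the "fork failed" return in runIt
    return 'normal';
}                                      # local restores $SIG{USR1} on any exit

my $original = sub { print STDERR "original handler\n" };
$SIG{USR1} = $original;

for my $fork_failed (0, 1) {
    my $how = install_temporarily($fork_failed);
    printf "%s return: original handler %s\n", $how,
        ($SIG{USR1} == $original ? 'restored' : 'NOT restored');
}
```

Comparing the code references with == checks that %SIG holds the very same sub again, so no separate "my $oldSigCld" copy is needed.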

Xho

From: Mark on
On Feb 13, 11:22 am, "A" <ad_...(a)yahoo.com> wrote:
> I am getting intermittent unexpected result from waitpid on Solaris 9
>
> sub runIt
> {
> my $oldSigCld = $SIG{CLD};
> local $SIG{CLD} = \&sigChldHandler;

I think you meant sigCLDhandler here.
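This kind of typo is easy to miss because Perl accepts a reference to a subroutine that was never defined. As a result, \&sigChldHandler compiles cleanly even under strict, and the mistake only surfaces when Perl tries to run the handler. Below is a minimal sketch of a guard, assuming a hypothetical helper checked_handler that is not part of the original program. It relies on defined &$ref, which perlfunc documents for testing whether a sub has a body.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

sub sigCLDhandler { print STDERR "caught SIG $_[0]\n" }

# Hypothetical helper: refuse to install a handler whose sub has no body.
sub checked_handler {
    my ($ref) = @_;
    die "signal handler is not defined\n" unless defined &$ref;
    return $ref;
}

my $typo  = \&sigChldHandler;    # misspelled, as in the posted runIt
my $right = \&sigCLDhandler;

for my $h ($typo, $right) {
    print defined(&$h) ? "defined\n" : "undefined\n";
}

# Usage: local $SIG{CLD} = checked_handler(\&sigCLDhandler);
```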