From: Chris Vine on
On Wed, 16 Jun 2010 20:44:15 -0400
Jeff Layton <jlayton(a)redhat.com> wrote:
[snip]
> I stand corrected then. That's pretty close to the nfsd that I've been
> testing. I pulled down the nfsd init script and the only thing that
> looks substantially different is that it sends signals to nfsd to shut
> it down rather than just running "rpc.nfsd 0". That should work fine,
> however.
>
> Still I think the problem is basically something like what I've
> described. You ended up somehow with sockets on the sv_permsocks list
> that didn't hold lockd references. The way I described is one way that
> could occur. Another seems to be __write_ports_addxprt (which I think
> is clearly broken in light of this)...
>
> The root cause of this however is likely to be related to this
> problem:
>
> > Jun 15 16:07:18 laptop kernel: svc: failed to register lockdv3 RPC service (errno 110).
> > Jun 15 16:07:18 laptop kernel: lockd_up: makesock failed, error=-110
>
> ...which means that the kernel couldn't talk to portmap or rpcbind.
> Maybe it wasn't up at the time? Or a problem with firewalling?

My initial reaction was "of course it is up", but your mention of
portmap sent me investigating, with interesting results. I was going
to say that because the standard start-up script for nfsd (rc.nfsd)
checks whether rpc.portmap and rpc.statd are running, starts them if
they are not, and then starts exportfs, rpc.rquotad, rpc.nfsd and
rpc.mountd.
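
For reference, a process that is running is not necessarily accepting
requests yet. The following is only a minimal sketch, not anything taken
from rc.nfsd: it decodes the errno from the log above and probes whether
something accepts TCP connections on port 111, the well-known
portmap/rpcbind port. The helper names describe_errno and accepts_tcp
are made up for illustration, and using a TCP connect as a stand-in for
"portmap is ready" is an assumption.

```python
import errno
import os
import socket


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def accepts_tcp(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable and timed out all count as "not ready".
        return False


if __name__ == "__main__":
    # On x86 Linux, errno 110 is ETIMEDOUT ("Connection timed out").
    print(describe_errno(110))
    # Assumption: portmap/rpcbind listens on 127.0.0.1:111.
    print(accepts_tcp("127.0.0.1", 111))
```

A timeout (rather than "connection refused") would be consistent with
the registration attempt getting no answer at all within the allowed
time.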

However, if I start portmap and statd early on so they do not rely on
the nfsd start-up script, then nfsd starts fine, so it seems to be a
timing thing notwithstanding that they are all started (at user level)
sequentially and in the same thread/process.
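
If the start-up ordering is indeed racy, one user-level way to narrow
the window would be to wait until portmap actually answers before
starting rpc.nfsd. This is a hedged sketch only: the function name
wait_for_tcp, the deadline and the polling interval are all
hypothetical, and it assumes that a successful TCP connect to port 111
is a good enough readiness signal.

```python
import socket
import time


def wait_for_tcp(host, port, deadline_s=10.0, interval_s=0.2):
    """Poll until (host, port) accepts a TCP connection.

    Returns True as soon as a connection succeeds, or False once
    deadline_s seconds have passed without success.
    """
    stop_at = time.monotonic() + deadline_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            pass  # not listening yet; retry until the deadline
        if time.monotonic() >= stop_at:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Assumption: portmap/rpcbind on the local host, port 111.
    if wait_for_tcp("127.0.0.1", 111):
        print("portmap is answering; safe to start nfsd")
    else:
        print("portmap did not come up in time")
```

This does not address the kernel side of the problem; it only makes the
user-level sequence less dependent on timing.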

The timing problem does not arise on kernel 2.6.34 and earlier. Nor
does it arise on my Pentium uniprocessor machine with kernel 2.6.35-rc3,
so it could well be related to multiple cores/threads. It looks as if
something changed in the kernel in 2.6.35 that provokes the kernel bug
report when the timing is wrong. (If the timing is wrong, and if this
is a deficiency in the user tools rather than in the kernel, on which I
express no view, then I suppose it probably needs to be handled more
gracefully in the kernel.)

Chris


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jeff Layton on
On Thu, 17 Jun 2010 11:38:15 +0100
Chris Vine <chris(a)cvine.freeserve.co.uk> wrote:

> On Wed, 16 Jun 2010 20:44:15 -0400
> Jeff Layton <jlayton(a)redhat.com> wrote:
> [snip]
> > I stand corrected then. That's pretty close to the nfsd that I've been
> > testing. I pulled down the nfsd init script and the only thing that
> > looks substantially different is that it sends signals to nfsd to shut
> > it down rather than just running "rpc.nfsd 0". That should work fine,
> > however.
> >
> > Still I think the problem is basically something like what I've
> > described. You ended up somehow with sockets on the sv_permsocks list
> > that didn't hold lockd references. The way I described is one way that
> > could occur. Another seems to be __write_ports_addxprt (which I think
> > is clearly broken in light of this)...
> >
> > The root cause of this however is likely to be related to this
> > problem:
> >
> > > Jun 15 16:07:18 laptop kernel: svc: failed to register lockdv3 RPC service (errno 110).
> > > Jun 15 16:07:18 laptop kernel: lockd_up: makesock failed, error=-110
> >
> > ...which means that the kernel couldn't talk to portmap or rpcbind.
> > Maybe it wasn't up at the time? Or a problem with firewalling?
>
> My initial reaction was "of course it is up" but your mention of
> portmap sent me investigating with interesting results. I was going
> to say "of course it is up" because the standard start-up script for
> nfsd (rc.nfsd) checks whether rpc.portmap and rpc.statd are running, if
> not starts them, and then starts exportfs, rpc.rquotad, rpc.nfsd and
> rpc.mountd.
>
> However, if I start portmap and statd early on so they do not rely on
> the nfsd start-up script, then nfsd starts fine, so it seems to be a
> timing thing notwithstanding that they are all started (at user level)
> sequentially and in the same thread/process.
>
> The timing problem does not arise on kernel-2.6.34 and earlier. Nor
> does it arise on my pentium uniprocessor machine with kernel 2.6.35-rc3,
> so it could well be core/thread related. It looks as if something in
> the kernel has changed on that in 2.6.35 which provokes the kernel bug
> report if timing is wrong. (If timing is wrong and if this is a user
> tools rather than a kernel deficiency, and I express no view on that,
> then I suppose it probably needs to be handled more gracefully in the
> kernel.)
>
> Chris
>
>

The timing may help tickle the other bugs in nfsd startup/shutdown.
I've just sent another couple of patches to Bruce (and cc'ed you) that
I suspect may help with this. With those, it should always be the case
that an nfsd sv_permsocks entry holds a lockd reference.
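
To illustrate the invariant only: the following is a toy user-space
model, not kernel code and not what the patches do. The class ToyServer
and its methods are hypothetical; they merely borrow the names
sv_permsocks and lockd_up from the discussion. The idea sketched is
that a socket is put on the list only after a lockd reference has been
taken, so a failed lockd_up leaves nothing behind.

```python
class LockdUnavailable(Exception):
    """Stand-in for lockd_up() failing, e.g. with a timeout."""


class ToyServer:
    """Toy model: one lockd reference per entry on the permsocks list."""

    def __init__(self, lockd_can_start=True):
        self.lockd_can_start = lockd_can_start
        self.lockd_refs = 0
        self.permsocks = []

    def lockd_up(self):
        if not self.lockd_can_start:
            raise LockdUnavailable("could not register with portmap")
        self.lockd_refs += 1

    def lockd_down(self):
        assert self.lockd_refs > 0, "lockd reference count underflow"
        self.lockd_refs -= 1

    def add_socket(self, name):
        # Take the reference first; if that raises, the socket never
        # reaches the list, so the invariant cannot be broken here.
        self.lockd_up()
        self.permsocks.append(name)

    def remove_socket(self, name):
        self.permsocks.remove(name)
        self.lockd_down()

    def invariant_holds(self):
        return self.lockd_refs == len(self.permsocks)


if __name__ == "__main__":
    srv = ToyServer(lockd_can_start=False)
    try:
        srv.add_socket("tcp")
    except LockdUnavailable:
        pass
    print(srv.permsocks, srv.invariant_holds())
```

In this model a listed socket without a reference can only appear if
some path appends to the list without going through add_socket.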

It would be good if you could test the stack of patches in the
nfsd-error branch of my kernel.org git tree:

http://git.kernel.org/?p=linux/kernel/git/jlayton/linux.git;a=summary

Thanks,
--
Jeff Layton <jlayton(a)redhat.com>