From: Al Viro on
On Mon, May 24, 2010 at 01:06:31PM -0400, Trond Myklebust wrote:

> I believe that the answer is that most filehandle types include an
> encoding of the inode number of the export directory. In other words, as
> long as '/a' and '/b' are different directories, then they will result
> in the generation of different filehandles for /a/x and /b/x.
>
> It seems that is not always the case, though. According to the
> definition of mk_fsid(), it looks as if the 'FSID_UUID8' and
> 'FSID_UUID16' filehandle types only encode the uuid of the filesystem,
> and have no inode information. They will therefore not be able to
> distinguish between an export through '/a' or '/b'.
>
> Neil, Bruce, am I right?

Er? On server:

mount -t ext2 /dev/sdb1 /srv/nfs4
mount -t ext2 /dev/sda1 /srv/nfs4/a
mount -t ext2 /dev/sda1 /srv/nfs4/b

after that, /srv/nfs4/a and /srv/nfs4/b will have the *same* inode, never mind
the inode number. I really do mean the same filesystem mounted twice; if you
want to include the inode number of the mountpoint in the fsid, fine, turn the above into

mount -t ext2 /dev/sdb1 /srv/nfs4
mount -t ext2 /dev/sda1 /srv/nfs4/a
mount -t ext2 /dev/sda1 /srv/nfs4/b
mount -t ext2 /dev/sda3 /srv/nfs4/a/z
mount -t ext2 /dev/sda3 /srv/nfs4/b/z

At that point you have the same fs (ext2 from sda3) mounted on /srv/nfs4/a/z
and /srv/nfs4/b/z, with the same directory inode overmounted by it at both
mountpoints. Suppose your referral point is at /a/z/x and /b/z/x respectively,
and see the question upthread...
From: Trond Myklebust on
On Mon, 2010-05-24 at 20:08 +0100, Al Viro wrote:
> On Mon, May 24, 2010 at 01:06:31PM -0400, Trond Myklebust wrote:
>
> > I believe that the answer is that most filehandle types include an
> > encoding of the inode number of the export directory. In other words, as
> > long as '/a' and '/b' are different directories, then they will result
> > in the generation of different filehandles for /a/x and /b/x.
> >
> > It seems that is not always the case, though. According to the
> > definition of mk_fsid(), it looks as if the 'FSID_UUID8' and
> > 'FSID_UUID16' filehandle types only encode the uuid of the filesystem,
> > and have no inode information. They will therefore not be able to
> > distinguish between an export through '/a' or '/b'.
> >
> > Neil, Bruce, am I right?
>
> Er? On server:
>
> mount -t ext2 /dev/sdb1 /srv/nfs4
> mount -t ext2 /dev/sda1 /srv/nfs4/a
> mount -t ext2 /dev/sda1 /srv/nfs4/b
>
> after that, /srv/nfs4/a and /srv/nfs4/b will have the *same* inode, never mind
> the inode number. I really do mean the same filesystem mounted twice; if you
> want to include the inode number of the mountpoint in the fsid, fine, turn the above into
>
> mount -t ext2 /dev/sdb1 /srv/nfs4
> mount -t ext2 /dev/sda1 /srv/nfs4/a
> mount -t ext2 /dev/sda1 /srv/nfs4/b
> mount -t ext2 /dev/sda3 /srv/nfs4/a/z
> mount -t ext2 /dev/sda3 /srv/nfs4/b/z
>
> At that point you have the same fs (ext2 from sda3) mounted on /srv/nfs4/a/z
> and /srv/nfs4/b/z, with the same directory inode overmounted by it at both
> mountpoints. Suppose your referral point is at /a/z/x and /b/z/x respectively,
> and see the question upthread...

Sorry... I misunderstood you.

In cases like the above, the default behaviour of the server would be
to assign the same filehandles to those mount points. The administrator
can, however, make them different by using the 'fsid' export option to
manually assign different fsids to the different export points.
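
For instance, with a hypothetical /etc/exports fragment along these lines
(paths from Al's example above; the fsid numbers are invented):

/srv/nfs4/a/z  *(rw,fsid=1)
/srv/nfs4/b/z  *(rw,fsid=2)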

If the administrator does not do so, the client will automatically
group these exports into the same superblock, so, like the server, it
too is supposed to share the same inode for these different objects. It
will then use d_obtain_alias() to get a root dentry for that inode (see
nfs4_get_root()).
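
A minimal sketch of that pattern, loosely modelled on the current
nfs4_get_root() (simplified: fattr setup and most error handling elided,
so treat it as an illustration rather than the actual code):

	static struct dentry *sketch_get_root(struct super_block *sb,
					      struct nfs_fh *rootfh,
					      struct nfs_fattr *fattr)
	{
		/* look up (or instantiate) the inode behind the root fh */
		struct inode *inode = nfs_fhget(sb, rootfh, fattr);

		if (IS_ERR(inode))
			return ERR_CAST(inode);
		/* take *any* dentry alias of it; the same inode may be
		 * reachable through several export paths */
		return d_obtain_alias(inode);
	}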

Cheers
Trond

From: Al Viro on
On Mon, May 24, 2010 at 05:13:32PM -0400, Trond Myklebust wrote:

> Sorry... I misunderstood you.
>
> In cases like the above, the default behaviour of the server would be
> to assign the same filehandles to those mount points. The administrator
> can, however, make them different by using the 'fsid' export option to
> manually assign different fsids to the different export points.
>
> If the administrator does not do so, the client will automatically
> group these exports into the same superblock, so, like the server, it
> too is supposed to share the same inode for these different objects. It
> will then use d_obtain_alias() to get a root dentry for that inode (see
> nfs4_get_root()).

Yes, it will. So what will happen in nfs_follow_referral()? Note that
we check the rootpath returned by the server (whatever it ends up
being) against mnt_devname + the relative path from mnt_root to the
referral point. In this case it'll be /a/z or /b/z (depending on which
export the server selects when it sees the fsid) vs /a/z/x or /b/z/x
(depending on which one the client walks into). And the calls of
nfs4_proc_fs_locations() will get identical arguments whether the
client walks into /a/z/x or /b/z/x. So will the actual RPC requests
seen by the server, so it looks like in at least one of those cases we
will get a rootpath that is _not_ a prefix of what we are expecting,
stepping into
	if (strncmp(path, fs_path, strlen(fs_path)) != 0) {
		dprintk("%s: path %s does not begin with fsroot %s\n",
			__func__, path, fs_path);
		return -ENOENT;
	}
in nfs4_validate_fspath().
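
To illustrate, a standalone toy (userspace C, not kernel code; the
concrete strings are assumed from the example paths above) showing how
that check trips:

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		/* the path the client walked into */
		const char *path = "/a/z/x";
		/* the fs root the server happened to answer with, if it
		 * picked the /b/z export for the shared fsid */
		const char *fs_path = "/b/z";

		if (strncmp(path, fs_path, strlen(fs_path)) != 0) {
			/* the branch nfs4_validate_fspath() takes,
			 * failing the lookup with -ENOENT */
			printf("path %s does not begin with fsroot %s\n",
			       path, fs_path);
		}
		return 0;
	}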

Question regarding RFC3530: is it actually allowed to have the same fhandle
show up in two different locations in the server's namespace? If so, what
should a GETATTR with FS_LOCATIONS return for it?

Client question: what stops you from stack overflows in that area? The
call chains you've got are *deep*, and I really wonder what happens if
you hit a referral point while traversing a nested symlink: pathname
resolution (already several levels into recursion) calls ->follow_link(),
bounces down through nfs_do_refmount/nfs_follow_referral/try_location/
vfs_kern_mount/nfs4_referral_get_sb/nfs_follow_remote_path into
vfs_path_lookup, which will cheerfully add a few more loops like that.

Sure, the *total* nesting depth through symlinks is still limited to 8, but
that pile of stack frames is _MUCH_ fatter than what we normally have in
pathname resolution. You've suddenly added ~60 extra stack frames to the
worst-case stack footprint of pathname resolution. Don't try that
on sparc64, boys and girls; it won't be happy with an attempt to carve ~12Kb
extra out of its kernel stack... In fact, it's worse than just ~60 stack
frames - several will contain an on-stack struct nameidata, which
very definitely will _not_ fit into the minimal stack frame. That's about
160 bytes extra for each of those (up to 7).
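
(Back-of-envelope, assuming a minimum stack frame somewhere in the
176-192 byte range that the sparc64 ABI requires - an assumption, not a
figure from this thread: 60 frames comes to roughly 10-11.5Kb, and up to
7 on-stack struct nameidata at ~160 bytes each add another ~1.1Kb, which
lands right around the ~12Kb above.)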

Come to think of it, x86 variants might get rather upset about that kind
of treatment as well. Minimal stack frames are smaller, but so's the stack...
From: Al Viro on
On Tue, May 25, 2010 at 12:01:09AM +0100, Al Viro wrote:
> Client question: what stops you from stack overflows in that area? The
> call chains you've got are *deep*, and I really wonder what happens if
> you hit a referral point while traversing a nested symlink: pathname
> resolution (already several levels into recursion) calls ->follow_link(),
> bounces down through nfs_do_refmount/nfs_follow_referral/try_location/
> vfs_kern_mount/nfs4_referral_get_sb/nfs_follow_remote_path into
> vfs_path_lookup, which will cheerfully add a few more loops like that.
>
> Sure, the *total* nesting depth through symlinks is still limited to 8, but
> that pile of stack frames is _MUCH_ fatter than what we normally have in
> pathname resolution. You've suddenly added ~60 extra stack frames to the
> worst-case stack footprint of pathname resolution. Don't try that
> on sparc64, boys and girls; it won't be happy with an attempt to carve ~12Kb
> extra out of its kernel stack... In fact, it's worse than just ~60 stack
> frames - several will contain an on-stack struct nameidata, which
> very definitely will _not_ fit into the minimal stack frame. That's about
> 160 bytes extra for each of those (up to 7).

Actually, just what happens if you have a referral that eventually
resolves to a directory you have no permission to access? AFAICS, you'll
end up trying it on all alternates, since nfs_follow_referral() will cheerfully
keep trying one variant after another, getting -EACCES from each. Worse,
if there are nested referrals in it, you'll get every sequence of alternates
tried before you give up.
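
A toy model of that blow-up (userspace C; ALTERNATES and DEPTH are
invented numbers - the point is the pattern, where each level retries
every alternate for every choice made above it):

	#include <stdio.h>

	#define ALTERNATES 3	/* locations per referral (invented) */
	#define DEPTH	   4	/* nesting depth of referrals (invented) */

	static long attempts;

	static int try_referral(int depth)
	{
		int err = -13;	/* -EACCES */

		if (depth == DEPTH) {
			attempts++;	/* one more doomed lookup */
			return err;
		}
		/* like nfs_follow_referral(): on failure, cheerfully
		 * move on to the next alternate */
		for (int i = 0; i < ALTERNATES; i++)
			err = try_referral(depth + 1);
		return err;
	}

	int main(void)
	{
		try_referral(0);
		/* prints 81, i.e. ALTERNATES^DEPTH */
		printf("bottom-level attempts: %ld\n", attempts);
		return 0;
	}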

...o*O(at least it's merely exponential; Ackermann would be even more fun)
From: Neil Brown on
On Mon, 24 May 2010 12:59:03 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Mon, May 24, 2010 at 04:57:56PM +1000, Neil Brown wrote:
> >
> > Commit 1f36f774b22a0ceb7dd33eca626746c81a97b6a5 broke FS_REVAL_DOT semantics.
> >
> > In particular, before this patch, the command
> > ls -l
> > in an NFS-mounted directory would always check whether the directory on the
> > server had changed, and if so would flush and refill the pagecache for the
> > dir. After this patch, the same "ls -l" will repeatedly return stale data
> > until the cached attributes for the directory time out.
> >
> > The following patch fixes this by ensuring that d_revalidate is called by
> > do_last() when "." is being looked up.
> > link_path_walk() has already called d_revalidate, but in that case
> > LOOKUP_OPEN is not set, so nfs_lookup_verify_inode() chooses not to do any
> > validation.
> >
> > The patch restores the original behaviour.
> >
> > Cc: stable@kernel.org
> > Signed-off-by: NeilBrown <neilb@suse.de>
>
> Applied, but I really don't like the way you do it; note that e.g. foo/bar/.
> gets that revalidation as well, for no good reason. If anything, shouldn't
> we handle that thing at the _beginning_ of pathname resolution, not at
> the end? For now it'd do, and it's a genuine regression, but...
>
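
As a sketch, reconstructed from the description above rather than taken
from the actual patch, the check in do_last() would have roughly this
shape (FS_REVAL_DOT and the d_revalidate() signature as in current
kernels):

	/* at the end of the walk, when the final component was "." */
	if (nd->last_type == LAST_DOT &&
	    (nd->path.dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT)) {
		/* give NFS a chance to notice a changed directory */
		int status = nd->path.dentry->d_op->d_revalidate(
						nd->path.dentry, nd);
		if (status <= 0)
			return status ? status : -ESTALE;
	}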

Thanks.

I think I see what you mean by "at the beginning" - the problem path is
simply ".", and both the start and end of that are ".", so we can handle it at
either end... But I don't think there is any other special handling of the
'start' of a path, so I imagine it would be a fairly ugly special case.

We could avoid the extra GETATTR in "foo/bar/." by allowing NFS to keep some
state in the nameidata to record that it has valid attributes for a given
dentry, so if it sees the same dentry again it doesn't need to revalidate.
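
Purely as a sketch of that idea - the 'revalidated' field is invented
for illustration and nothing like it exists in struct nameidata today:

	/* hypothetical extra state in struct nameidata: */
	struct dentry *revalidated;	/* last dentry that passed
					 * ->d_revalidate() on this walk */

	/* ...and, before revalidating 'dentry' a second time: */
	if (nd->revalidated == dentry)
		return 1;		/* already known fresh */
	status = dentry->d_op->d_revalidate(dentry, nd);
	if (status > 0)
		nd->revalidated = dentry;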

I must confess, though, that I don't feel I understand VFS name lookup properly
any more. Since intents were added, it seems to have become much more obscure
and complex. I cannot help thinking that there must be a better way:
distinguish between the various cases at a higher level so we don't need as
many flags being passed around and interpreted by widely separate pieces of
code. I don't have a concrete proposal but I would certainly be interested
to work on one if there were any hope of real change.
Thoughts?

Thanks,
NeilBrown

> BTW, here's a question for nfs client folks: is it true that for any two
> pathnames on the _client_ resolving to pairs (mnt1, dentry) and (mnt2, dentry)
> respectively, nfs_devname(mnt1, dentry, ...) and nfs_devname(mnt2, dentry, ...)
> should yield strings that do not differ past the ':' (i.e. that the
> only possible difference is going to be in the spelling of the server name)?
