From: Miklos Szeredi on
On Thu, 24 Jun 2010, Andrew Lutomirski wrote:
> Wasn't the point that /proc/self/mounts (and presumably
> /proc/self/mountinfo) isn't scalable and we wanted a syscall to query
> it efficiently (and racelessly)?

The question was how to support statvfs() efficiently, and the only
thing missing there is f_flags which can easily be added to the
existing statfs() syscall.

A separate mount_info() syscall might possibly be useful, but that's
another story.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andreas Dilger on
On 2010-06-24, at 08:08, Andy Lutomirski wrote:
> Something like fsid but actually specified to uniquely identify a superblock. (Currently, fsid seems to be set by the filesystem, and nothing in particular ensures that two different filesystems couldn't have collisions.) We could guarantee (or have a flag guaranteeing) that (fsid, st_inode) actually uniquely identifies an inode.

I think the right solution for this issue is to (gradually) start enforcing the "uniqueness" of the UUID in the filesystem superblock. That is what it is supposed to be for. Using (fsid, st_inode) doesn't necessarily help anything, if "fsid" isn't unique, and the same "st_inode" number is used on two different mountpoints.

To start, tracking the UUID at mount time an printing a non-fatal error at mount time if the mounted UUID is not unique would help, as would having e.g. fsck track the UUIDs of the underlying filesystems and printing a non-fatal error if it hits a duplicate UUID.

At some point in the future, the kernel can be changed to refuse to mount a filesystem with a duplicate UUID. I believe mount.xfs already does this.

Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andreas Dilger on
On 2010-06-24, at 07:14, Nick Piggin wrote:
> This means glibc has to provide f_flag support by parsing /proc/mounts
> and stat(2)ing mount points. This is really slow, and /proc/mounts is
> hard for the kernel to provide.

Not only that, but if a mountpoint is broken (e.g. remote NFS server) then the glibc stat of all the mountpoints can hang the statvfs() call even if there is no interest in that particular filesystem.

> It's actually the last scalability bottleneck in the core vfs for dbench (samba) after my patches.
>
> Not only that, but it's racy.
>
> Other than types, other differences are:
> - statvfs(2) has is f_frsize, which seems fairly useless.

Actually, we were just lamenting the fact that f_frsize is currently broken, because Lustre wants to export the IO size as 1MB for good RPC performance, but the underlying blocksize is 4kB (ext3 blocksize). Similarly, NFS might want to export the rsize/wsize of 32kB or 64kB even if the underlying filesystem blocksize is smaller.

> - statvfs(2) has f_favail.
> - statfs(2) f_bsize is optimal transfer block, statvfs(2) f_bsize is fs
> block size. The latter could be useful for disk space algorithms.
> Both can be ill defned.

According to POSIX, "f_bsize" is the blocksize, but unfortunately this was botched in the earlier Linux implementations so currently they are both set to the same value, and using anything other than that breaks userspace programs that get them mixed up.

> - statvfs(2) lacks f_type.
>
> Is there anything more we should add here? Samba wants a capabilities
> field, with things like sparse files, quotas, compression, encryption,
> case preserving/sensitive.

It wouldn't be a bad idea, but then you could get into issues of what exactly the above flags mean. That said, I think it is better to have broad categories of features that may be slightly ill-defined than having nothing at all.

Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Nick Piggin on
On Thu, Jun 24, 2010 at 04:48:20PM +0200, Miklos Szeredi wrote:
> On Thu, 24 Jun 2010, Andrew Lutomirski wrote:
> > Wasn't the point that /proc/self/mounts (and presumably
> > /proc/self/mountinfo) isn't scalable and we wanted a syscall to query
> > it efficiently (and racelessly)?
>
> The question was how to support statvfs() efficiently, and the only
> thing missing there is f_flags which can easily be added to the
> existing statfs() syscall.
>
> A separate mount_info() syscall might possibly be useful, but that's
> another story.

Native statvfs() support is my motivation, but I am thinking that if
we are going to introduce a new syscall (or version rev the statfs
syscall somehow), then we should think hard about what else we can do.

More superblock info should be possible, more detailed info like like
related mounts will be costlier, so that may be better off as a
different syscall.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Nick Piggin on
On Thu, Jun 24, 2010 at 05:13:38PM -0600, Andreas Dilger wrote:
> On 2010-06-24, at 07:14, Nick Piggin wrote:
> > This means glibc has to provide f_flag support by parsing /proc/mounts
> > and stat(2)ing mount points. This is really slow, and /proc/mounts is
> > hard for the kernel to provide.
>
> Not only that, but if a mountpoint is broken (e.g. remote NFS server) then the glibc stat of all the mountpoints can hang the statvfs() call even if there is no interest in that particular filesystem.

Good point.


> > It's actually the last scalability bottleneck in the core vfs for dbench (samba) after my patches.
> >
> > Not only that, but it's racy.
> >
> > Other than types, other differences are:
> > - statvfs(2) has is f_frsize, which seems fairly useless.
>
> Actually, we were just lamenting the fact that f_frsize is currently broken, because Lustre wants to export the IO size as 1MB for good RPC performance, but the underlying blocksize is 4kB (ext3 blocksize). Similarly, NFS might want to export the rsize/wsize of 32kB or 64kB even if the underlying filesystem blocksize is smaller.
>
> > - statvfs(2) has f_favail.
> > - statfs(2) f_bsize is optimal transfer block, statvfs(2) f_bsize is fs
> > block size. The latter could be useful for disk space algorithms.
> > Both can be ill defned.
>
> According to POSIX, "f_bsize" is the blocksize, but unfortunately this was botched in the earlier Linux implementations so currently they are both set to the same value, and using anything other than that breaks userspace programs that get them mixed up.

So is "frsize" supposed to be the optimal block size, or what?
f_bsize AFAIKS should be filesystem allocation block size because
apparently some programs require it to calculate size of file on
disk.

If we can't change existing suboptimal legacy things, then let's
introduce new APIs that do the right thing. Apps that care will
eventually start using eg. a new syscall.

>
> > - statvfs(2) lacks f_type.
> >
> > Is there anything more we should add here? Samba wants a capabilities
> > field, with things like sparse files, quotas, compression, encryption,
> > case preserving/sensitive.
>
> It wouldn't be a bad idea, but then you could get into issues of what exactly the above flags mean. That said, I think it is better to have broad categories of features that may be slightly ill-defined than having nothing at all.

Yes it would be tricky. I don't want to add features that will just
be useless or go unused, but I don't want to change the syscall API
just to add f_flags, without looking at other possibilities.

Thanks,
Nick

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/