From: Andreas Dilger on
On 2010-06-29, at 19:16, David Howells wrote:
> Implement a pair of new system calls to provide extended and further extensible stat functions.
>
> The third of the associated patches provides these new system calls:
>
> struct xstat_dev {
> unsigned int major;
> unsigned int minor;
> };

Doesn't glibc use two 64-bit values for devices?

> struct xstat {
> unsigned int struct_version;
> #define XSTAT_STRUCT_VERSION 0

I dislike sequential "version" fields (which are "all or nothing"), and prefer the ext2/3/4-like "feature flags" that allow the caller to state what features and fields it expects and/or understands. This allows extensibility without unduly breaking compatibility.

> unsigned int st_mode;

Having a separate MODE flag would be great for "ls --color", since that is basically the only information that it needs that isn't already available in the readdir() output.

> unsigned int st_nlink;
> unsigned int st_uid;
> unsigned int st_gid;

In struct stat64 it uses "unsigned long" for both st_uid and st_gid. Having a 64-bit value here is useful for CIFS servers to be able to remap different UID domains into a 32-bit domain and a 32-bit UID. If you change this, please remember to reorder the fields for proper 64-bit alignment.

> unsigned int st_blksize;
> Does st_blksize really need to be 64 bits on a 64-bit system?

I don't think so, but adding a 32-bit padding couldn't hurt.

> unsigned long long st_ino;
> unsigned long long st_size;
> Should the inode number and data version number fields be 128-bit?

I wouldn't object to having a 128-bit st_ino field, since this is what Lustre will be using internally in the next release.

Similarly, _filesystems_ are not SO far from hitting the 64-bit size limit (a Lustre filesystem will likely hit 100PB ~= 2^57 bytes in the next year), so having a 128-bit st_size wouldn't be unreasonable, because...

What is also very convenient that I learned Solaris stat() does is it returns the device size in st_size for a block device file. This is very convenient, and avoids the morass of ioctls and "binary llseek guessing" used by libext2fs and libblkid to determine the size of a block device. Any reason not to add this into this new syscall?

> unsigned long long st_blocks;

If st_size is 128-bit (or has padding) then st_blocks should have the same.

> unsigned long long query_flags;

It is inconsistent to have all the other fields use the "st_" prefix, but "query_flags" and "struct_version" do not have this prefix.

> #define XSTAT_QUERY_AMC_TIMES 0x00000004ULL

Can these be split into separate ATIME, MTIME, CTIME flags?

> #define XSTAT_QUERY_CREATION_TIME 0x00000008ULL

It seems a bit inconsistent to call the field "st_btime" and the mask "CREATION_TIME". It would be more consistent (if somewhat less clear) to call the mask "BTIME". The struct definition should get short comments for each field to explain their meaning anyway, so "st_btime" can have "/* birth/creation time */".

> #define XSTAT_QUERY_INODE_GENERATION 0x00000020ULL

This is also a bit inconsistent with the "st_gen" field name.

> #define XSTAT_QUERY_DATA_VERSION 0x00000040ULL

It wouldn't be a bad idea to interleave these flags with each of the fields that they represent, to make it more clear what is included in each.

> #define XSTAT_QUERY__ORDINARY_SET 0x00000017ULL
> #define XSTAT_QUERY__GET_ANYWAY 0x0000007fULL

Could you provide some information what the semantic distinction between these is? It might be useful to have an "XSTAT_QUERY_LEGACY_STAT" mask that returns only the fields that are in the previous struct stat, unless that is what "ORDINARY_SET" means, in which case it should be renamed I think.

> #define XSTAT_QUERY__DEFINED_SET 0x0000007fULL

It is smart to have a "DEFINED_SET" mask that maps to the currently-understood fields. This ensures that applications compiled against a specific set of headers/struct will not request fields which they don't understand. It might be better to call this "XSTAT_QUERY_ALL" so that it is more easily understood and used by callers, instead of the incorrect "-1" or "~0" that some may be tempted to use if they don't understand what "__DEFINED_SET" means.


Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: David Howells on
Andreas Dilger <adilger(a)dilger.ca> wrote:

> Doesn't glibc use two 64-bit values for devices?

So it would seem. Does Linux need to, though?

> I dislike sequential "version" fields (which are "all or nothing"), and
> prefer the ext2/3/4-like "feature flags" that allow the caller to state what
> features and fields it expects and/or understands. This allows
> extensibility without unduly breaking compatibility.

My aim was to avoid the need to create new stat syscalls in the future by
making it possible to increment the version number you're asking for.

> > unsigned int st_uid;
> > unsigned int st_gid;
>
> In struct stat64 it uses "unsigned long" for both st_uid and st_gid. Having
> a 64-bit value here is useful for CIFS servers to be able to remap different
> UID domains into a 32-bit domain and a 32-bit UID. If you change this,
> please remember to reorder the fields for proper 64-bit alignment.

glibc, on the other hand, only supports 32-bits for these.

My thought was that I could add extension fields to provide access to the
remote username/UID/GID values as well as the local UID/GID (since the latter
are used by the permission checking routines in the local VFS).

> I wouldn't object to having a 128-bit st_ino field, since this is what
> Lustre will be using internally in the next release.

I wonder how best to represent 128-bit numbers.

unsigned long long long

gives:

include/linux/stat.h:151: error: 'long long long' is too long for GCC

so perhaps something like:

struct xstat_u128 { unsigned long long lsw, msw; };

however, I suspect the kernel will require a bit of reengineering to handle a
pgoff_t and loff_t of 128-bits.

> What is also very convenient that I learned Solaris stat() does is it
> returns the device size in st_size for a block device file. This is very
> convenient, and avoids the morass of ioctls and "binary llseek guessing"
> used by libext2fs and libblkid to determine the size of a block device. Any
> reason not to add this into this new syscall?

That's a separate problem. That can be implemented now by overriding getattr
on blockdev files. You could also set st_blocks and st_blksize to indicate
parameters of the blockdev - though that may upset df, I suppose.

> It is inconsistent to have all the other fields use the "st_" prefix, but
> "query_flags" and "struct_version" do not have this prefix.

They are a different sort of field (metametadata, I suppose). But I can add
that on if you'd prefer.

> It wouldn't be a bad idea to interleave these flags with each of the fields
> that they represent, to make it more clear what is included in each.

Or comments could be used for that.

> > #define XSTAT_QUERY__ORDINARY_SET 0x00000017ULL
> > #define XSTAT_QUERY__GET_ANYWAY 0x0000007fULL
>
> Could you provide some information what the semantic distinction between
> these is? It might be useful to have an "XSTAT_QUERY_LEGACY_STAT" mask that
> returns only the fields that are in the previous struct stat, unless that is
> what "ORDINARY_SET" means, in which case it should be renamed I think.

XSTAT_QUERY_LEGACY_STAT is XSTAT_QUERY__ORDINARY_SET. Is "legacy" an
appropriate appellation, though? They're the set most people expect to see
and want to use.

> > #define XSTAT_QUERY__DEFINED_SET 0x0000007fULL
>
> It is smart to have a "DEFINED_SET" mask that maps to the
> currently-understood fields. This ensures that applications compiled
> against a specific set of headers/struct will not request fields which they
> don't understand. It might be better to call this "XSTAT_QUERY_ALL" so that
> it is more easily understood and used by callers, instead of the incorrect
> "-1" or "~0" that some may be tempted to use if they don't understand what
> "__DEFINED_SET" means.

Passing -1 (or ULONGLONG_MAX) to get everything would be reasonable.

This should probably be an internal kernel constant.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: David Howells on
Christoph Hellwig <hch(a)infradead.org> wrote:

> The cost of adding a syscall is much smaller than adding all the wanking
> in your system call.

Simply, if inelegantly put.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Arnd Bergmann on
On Wednesday 30 June 2010, Christoph Hellwig wrote:
> The cost of adding a syscall is much smaller

Ack. No need for different struct layout version since we
can add another stat syscall every ten years.

> So adding a few fields of padding at the end for new members is fine,
> but doing overkill of versioning including queries for supported
> versions doesn't.

The ability to request and return a subset of the fields seems useful
regardless and it can be used to avoid the need for this kind of padding.
A sufficient amount of padding wouldn't be too bad either, but I guess
we should not have both the padding _and_ the option for extending the
structure after the padding.

With the padding, the 'size' argument can go away, though I'd argue that
even without the padding we can safely add extra fixed-length fields
when needed and not need a size argument.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jeff Layton on
On Wed, 30 Jun 2010 15:31:39 +0200
Arnd Bergmann <arnd(a)arndb.de> wrote:

> On Wednesday 30 June 2010, Christoph Hellwig wrote:
> > The cost of adding a syscall is much smaller
>
> Ack. No need for different struct layout version since we
> can add another stat syscall every ten years.
>
> > So adding a few fields of padding at the end for new members is fine,
> > but doing overkill of versioning including queries for supported
> > versions doesn't.
>
> The ability to request and return a subset of the fields seems useful
> regardless and it can be used to avoid the need for this kind of padding.
> A sufficient amount of padding wouldn't be too bad either, but I guess
> we should not have both the padding _and_ the option for extending the
> structure after the padding.
>
> With the padding, the 'size' argument can go away, though I'd argue that
> even without the padding we can safely add extra fixed-length fields
> when needed and not need a size argument.
>

Simply having a flags field seems sufficient to me too. I don't think
we need padding, version or a size. Just make it a "rule" that if you
add a new field that it has to go at the end of the struct and a new
flag has to go with it. The kernel will need to only fill out fields
that are requested and that it knows about.

In the event that we approach running out of flags, we could even use
the last flag as a "HAS_FLAGS2" flag, to add a new flags field at the
end. Ugly, but it would avoid the need for a new syscall. We can kick
that potential problem down the road though. With 64 flags to play
with, it likely won't be a problem for a while.

--
Jeff Layton <jlayton(a)redhat.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/