d_ino considered harmful [Kernel]


From: David Dillow on 17 Jun 2010 14:20

On Thu, 2010-06-17 at 14:04 -0400, J. R. Okajima wrote:
> David Dillow:
> > For example, our main Lustre scratch space has over 285 million files in
> > it, and using find -inum takes over 72 hours to walk the tree using
> :::
> > Using ne2scan -- which uses libext2fs and combines the inode scan and
> > the name lookup -- takes over 48 hours to generate a list of candidate
> > files for the purge example. With an optimized inode scan and the custom
> :::
>
> While I've never heard of ne2scan, I am interested in this simplified
> problem such as "find the pathname(s) from an inum in a huge fs."
> Is ne2scan essentially equivalent to "debugfs ncheck inum"?

Yes, except it does that for every live inode in the system. ne2scan is
extended for use on Lustre's backing store -- it parses the LOV object
map and displays the information -- so I'm not sure how usable it will
be on a plain ext{2,3,4} file system. It is based on e2scan in the
Oracle (nee Sun) Lustre version of e2fsprogs, so that could be a
starting point for you.

--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Valerie Aurora on 17 Jun 2010 15:00

On Fri, Jun 18, 2010 at 03:04:08AM +0900, J. R. Okajima wrote:
>
> David Dillow:
> > For example, our main Lustre scratch space has over 285 million files in
> > it, and using find -inum takes over 72 hours to walk the tree using
> :::
> > Using ne2scan -- which uses libext2fs and combines the inode scan and
> > the name lookup -- takes over 48 hours to generate a list of candidate
> > files for the purge example. With an optimized inode scan and the custom
> :::
>
> While I've never heard of ne2scan, I am interested in this simplified
> problem such as "find the pathname(s) from an inum in a huge fs."
> Is ne2scan essentially equivalent to "debugfs ncheck inum"?
>
> About Valerie's patch, as long as "ls -i" is useful/helpful,
> > + /* Use of d_ino without st_dev is always buggy. */
> is not true.

What I'm hearing again and again is that d_ino is useful to improve
performance. As Andreas put it to me, if d_ino is the same, the
referenced file may or may not be the same, but if it's different, the
files are definitely different. Only in well-controlled environments
known not to have submounts or bind mounts do people trust d_ino to be
from the same file system as the other entries in a directory.

I only submitted this patch half-seriously - mainly I wanted to find
out how people are using d_ino, and therefore what I need to do for
fallthru directory entries in union mounts.

In order to get the correct inode number for a directory entry
referring to a lower layer file or directory, we have to do a
->lookup() from the fs-specific readdir code (or else require that
fallthrus store an arbitrarily sized integer - which seriously
restricts the implementation). Now, doing a ->lookup() to get d_ino
makes no sense if we are using d_ino as a way to avoid the cost of
stat(), which is mainly the ->lookup(). And you definitely can't use
d_ino by itself in a union mount.

I'm inclined to save the trouble and just return 1 in d_ino for
fallthru directory entries, especially now that I've tested it
system-wide and had no obvious problems.

-VAL

From: Andreas Dilger on 17 Jun 2010 19:40

On 2010-06-16, at 13:54, Valerie Aurora wrote:
> On Wed, Jun 16, 2010 at 02:59:13PM -0400, Valerie Aurora wrote:
>> Who needs d_ino anyway? I am running a kernel with this patch -
>> Gnome, a browser, IRC, kernel compile, etc. and everything works.
>
> Gosh, maybe it would help to patch the currently used readdir instead
> of just old_readdir() (thanks, Arnd). And return 1 instead of 0 so ls
> doesn't think all files are deleted (thanks, Andreas).
>
> I'm running a kernel with the below patch and everything still works.
> Apparently "ls -i" is still using the bogus d_ino performance
> improvement mentioned here because it returns all 1's for inode
> number.

I don't see why the presence of d_ino is a "bogus" performance optimization. It is useful for some things, and replacing this with "1" by no means helps anything IMHO, and destroys some useful optimizations (e.g. finding which inodes may be hard links), so I'm against this patch.
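
As a rough illustration of the hard-link case (a sketch, not code from
findutils or from this thread): entries read from one directory can be
grouped by d_ino, and only entries that share a d_ino with another entry
need a follow-up stat(). The struct and function names are hypothetical,
and the sketch assumes the directory contains no mount points.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>

/* One directory entry as returned by readdir(): inode number and name. */
struct entry {
    ino_t d_ino;
    const char *name;
    int candidate;      /* set to 1 if another entry shares this d_ino */
};

static int cmp_ino(const void *a, const void *b)
{
    ino_t x = ((const struct entry *)a)->d_ino;
    ino_t y = ((const struct entry *)b)->d_ino;
    return (x > y) - (x < y);
}

/*
 * Sort entries by d_ino and flag those whose d_ino occurs more than once.
 * Flagged entries are only hard-link *candidates*; unflagged entries
 * cannot be linked to anything else in this directory.
 * Returns the number of flagged entries.
 */
size_t mark_hardlink_candidates(struct entry *e, size_t n)
{
    if (n == 0)
        return 0;
    assert(e != NULL);

    qsort(e, n, sizeof *e, cmp_ino);

    size_t flagged = 0;
    for (size_t i = 0; i < n; ) {
        size_t j = i + 1;
        while (j < n && e[j].d_ino == e[i].d_ino)
            j++;                        /* [i, j) share one d_ino */
        for (size_t k = i; k < j; k++) {
            e[k].candidate = (j - i > 1);
            flagged += (j - i > 1);
        }
        i = j;
    }
    return flagged;
}
```

Links that live in different directories would need a table spanning the
whole walk rather than a per-directory sort.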

> http://www.mail-archive.com/bug-findutils(a)gnu.org/msg02531.html
>
> -VAL
>
> commit 5902fd7b7407e059c5cea1bf1ea101a1ff8a6072
> Author: Valerie Aurora <vaurora(a)redhat.com>
> Date: Wed Jun 16 11:05:06 2010 -0700
>
> VFS: Always return 1 for d_ino
>
> Use of d_ino without the corresponding st_dev is always buggy in the
> presence of submounts, bind mounts, and union mounts. E.g., the d_ino
> of a mountpoint will be the inode number of the directory under the
> mountpoint, not the mounted directory. Correct code must call stat(),
> which returns the correct device ID and inode in st_dev and st_ino.
> Since no one should be using d_ino anyway, always return 1 to detect
> bugs.
>
> diff --git a/fs/readdir.c b/fs/readdir.c
> index dd3eae1..5ff8f10 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -91,11 +91,8 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset
>
>          if (buf->result)
>                  return -EINVAL;
> -        d_ino = ino;
> -        if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
> -                buf->result = -EOVERFLOW;
> -                return -EOVERFLOW;
> -        }
> +        /* Use of d_ino without st_dev is always buggy. */
> +        d_ino = 1;
>          buf->result++;
>          dirent = buf->dirent;
>          if (!access_ok(VERIFY_WRITE, dirent,
> @@ -172,11 +169,8 @@ static int filldir(void * __buf, const char * name, int namlen, loff_t offset,
>          buf->error = -EINVAL; /* only used if we fail.. */
>          if (reclen > buf->count)
>                  return -EINVAL;
> -        d_ino = ino;
> -        if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
> -                buf->error = -EOVERFLOW;
> -                return -EOVERFLOW;
> -        }
> +        /* Use of d_ino without st_dev is always buggy. */
> +        d_ino = 1;
>          dirent = buf->previous;
>          if (dirent) {
>                  if (__put_user(offset, &dirent->d_off))

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.


From: Andreas Dilger on 17 Jun 2010 21:50

On 2010-06-17, at 12:04, "J. R. Okajima" <hooanon05(a)yahoo.co.jp> wrote:

>
> I am interested in this simplified
> problem such as "find the pathname(s) from an inum in a huge fs."
> Is ne2scan essentially equivalent to "debugfs ncheck inum"?

The (n)e2scan program is essentially just an optimized ext3 inode
table scanner we wrote for Lustre that walks the inode table in order,
optimistically reads directory inode blocks (in disk offset order),
and matches the inode numbers to an incrementally built tree of parent
directories when the directory entries appear. Since the most common
case is that a parent has a lower inode number than its subdirectories,
there is rarely a need to keep whole subdirectories in memory. This
is fairly efficient when dumping the whole filesystem, since it makes
a single pass over the metadata, though it is inefficient when doing a
small subset of the filesystem.

As the name implies, it is very extN specific. For Lustre 2.0 we use a
different method to get O(1) FID (inode number) to pathname(s)
lookup. Each file stores an xattr with the {parent FID, filename}
tuples for each link to the file, whenever an inode is created,
linked, unlinked, or renamed.

In the common case, storing the filename and parent FID adds no
overhead to these operations since the inode needs to be written to
update the nlink count anyway, and the xattr can be stored in the
inode and does not generate extra IO unless there are more hard links
than can fit in the inode.

This allows doing optimized pathname generation for all links to a
file, and can in theory be used for any type of filesystem that has
efficient xattr storage.
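
A minimal sketch of how a pathname might be rebuilt from such
{parent FID, filename} records. This is not Lustre code: the in-memory
table stands in for the per-inode xattr, small integers stand in for FIDs,
ROOT_ID and all names are hypothetical, and only one link per file is
modelled. No directory is scanned; the sketch does one record lookup per
path component.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ROOT_ID 1u      /* hypothetical identifier of the root directory */

/* One {parent, filename} record, as it might be stored in an xattr. */
struct link_rec {
    unsigned parent;
    const char *name;   /* NULL means "no record for this id" */
};

/*
 * Build the absolute path of inode `id` by following parent records up
 * to the root, filling buf from the end and then shifting it to the front.
 * Returns 0 on success, -1 on an unknown id, a cycle, or a short buffer.
 */
int build_path(const struct link_rec *table, size_t n, unsigned id,
               char *buf, size_t buflen)
{
    assert(table != NULL && buf != NULL);
    if (buflen < 2)
        return -1;

    size_t pos = buflen - 1;
    size_t depth = 0;
    buf[pos] = '\0';

    while (id != ROOT_ID) {
        if (id >= n || table[id].name == NULL || ++depth > n)
            return -1;                  /* unknown inode or cycle */
        size_t len = strlen(table[id].name);
        if (len + 1 > pos)
            return -1;                  /* buffer too small */
        pos -= len;
        memcpy(buf + pos, table[id].name, len);
        buf[--pos] = '/';
        id = table[id].parent;
    }
    if (pos == buflen - 1)
        buf[--pos] = '/';               /* the root itself */

    memmove(buf, buf + pos, buflen - pos);
    return 0;
}
```

A file with several hard links would carry several records, and repeating
the walk for each record would yield every pathname of the file.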

From: J. R. Okajima on 17 Jun 2010 23:00

Andreas Dilger:
> As the name implies, it is very extN specific. For Lustre 2.0 we use a
> different method to get O(1) FID (inode number) to pathname(s)
> lookup. Each file stores an xattr with the {parent FID, filename}
> tuples for each link to the file, whenever an inode is created,
> linked, unlinked, or renamed.

Honestly speaking, this approach is the one which came to my mind when I
read David's mail. Andreas's approach and explanation are excellent, and
I admire them.

Thank you
J. R. Okajima
