fallthru: ext2 fallthru support [Kernel]

Prev: Fix typo in percpu.h
Next: x86: correctly wire up the newuname system call

From: Miklos Szeredi on 21 Apr 2010 05:40

On Wed, 21 Apr 2010, Jamie Lokier wrote:
> Hmm. I smell potential confusion for some otherwise POSIX-friendly
> userspaces.
>
> When I open /path/to/foo, call fstat (st_dev=2, st_ino=5678), and then
> keep the file open, then later do a readdir which includes foo
> (dir.st_dev=1, d_ino=1234), I'm going to immediately assume a rename
> or unlink happened, close the file, abort streaming from it, refresh
> the GUI windows, refresh application caches for that name entry, etc.
>
> Because in the POSIX world I think open files have stable inode
> numbers (as long as they are open), and I don't think that an open
> file can have it's name's d_ino not match the inode number unless it's
> a mount point, which my program would know about.
>
> This plays into inotify, where you have to know if you are monitoring
> every directory that contains a link to a file, to know if you need to
> monitor the file itself directly instead.
>
> Now I think it's fair enough that a union mount doesn't play all the
> traditional rules :-) C'est la vie.
>
> This mismatch of (dir.st_dev,d_ino) and st_ino strongly resembles a
> file-bind-mount. Like bind mounts, it's quite annoying for programs
> that like to assume they've seen all of a file's links when they've
> seen i_nlink of them.
>
> Bind mounts can be detected by looking in /proc/mounts. st_dev
> changing doesn't work because it can be a binding of the same
> filesystem.
>
> How would I go about detecting when a union mount's directory entry
> has similar behaviour, without calling stat() on each entry? Is it
> just a matter of recognising a particular filesystem name in
> /proc/mounts, or something more?

Detecting mount points is best done by comparing st_dev for the parent
directory with st_dev of the child. This is much simpler than parsing
/proc/mounts and will work for bind mounts as well as union mounts.

I think there's no question that union mounts might break apps (POSIX
or not). But I think there's hope that they are few and can easily be
fixed.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Jamie Lokier on 21 Apr 2010 06:00

Miklos Szeredi wrote:
> On Wed, 21 Apr 2010, Jamie Lokier wrote:
> > Hmm. I smell potential confusion for some otherwise POSIX-friendly
> > userspaces.
> >
> > When I open /path/to/foo, call fstat (st_dev=2, st_ino=5678), and then
> > keep the file open, then later do a readdir which includes foo
> > (dir.st_dev=1, d_ino=1234), I'm going to immediately assume a rename
> > or unlink happened, close the file, abort streaming from it, refresh
> > the GUI windows, refresh application caches for that name entry, etc.
> >
> > Because in the POSIX world I think open files have stable inode
> > numbers (as long as they are open), and I don't think that an open
> > file can have it's name's d_ino not match the inode number unless it's
> > a mount point, which my program would know about.
> >
> > This plays into inotify, where you have to know if you are monitoring
> > every directory that contains a link to a file, to know if you need to
> > monitor the file itself directly instead.
> >
> > Now I think it's fair enough that a union mount doesn't play all the
> > traditional rules :-) C'est la vie.
> >
> > This mismatch of (dir.st_dev,d_ino) and st_ino strongly resembles a
> > file-bind-mount. Like bind mounts, it's quite annoying for programs
> > that like to assume they've seen all of a file's links when they've
> > seen i_nlink of them.
> >
> > Bind mounts can be detected by looking in /proc/mounts. st_dev
> > changing doesn't work because it can be a binding of the same
> > filesystem.
> >
> > How would I go about detecting when a union mount's directory entry
> > has similar behaviour, without calling stat() on each entry? Is it
> > just a matter of recognising a particular filesystem name in
> > /proc/mounts, or something more?
>
> Detecting mount points is best done by comparing st_dev for the parent
> directory with st_dev of the child. This is much simpler than parsing
> /proc/mounts and will work for bind mounts as well as union mounts.

Sorry, no: That does not work for bind mounts. Both layers can have
the same st_dev. Nor does O_NOFOLLOW stop traversal in the middle of
a path, there is no handy O_NOCROSSMOUNTS, and no st_mode flag or
d_type to say it's a bind mount. Bind mounts are really a big pain
for i_nlink+inotify name counting.

Besides, calling stat() on every entry in a large directory to check
st_ino can be orders of magnitude slower than readdir() on a large
directory - especially with a cold cache. It is quicker, but much
more complicated, to parse /proc/mounts and apply arcane rules to find
the exceptions.

Can a union mount overlap two parts of the same filesystem?

> I think there's no question that union mounts might break apps (POSIX
> or not). But I think there's hope that they are few and can easily be
> fixed.

I agree, and union moint is a very useful feature that's worth
breaking a few apps for :-)

I'm curious if there's a clear way to go about it in this case, or
if it'll involve a certain amount of pattern recognition in /proc/mounts.

Basically I'm wondering if it's been thought about already.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Miklos Szeredi on 21 Apr 2010 06:20

On Wed, 21 Apr 2010, Jamie Lokier wrote:
> Sorry, no: That does not work for bind mounts. Both layers can have
> the same st_dev.

Okay.

> Nor does O_NOFOLLOW stop traversal in the middle of
> a path, there is no handy O_NOCROSSMOUNTS, and no st_mode flag or
> d_type to say it's a bind mount. Bind mounts are really a big pain
> for i_nlink+inotify name counting.

I'm confused. You are monitoring a specific file and would like to
know if something is happening to any of it's links, right?

Why do you need to know about bind mounts for that?

Count the number of times you encounter that d_ino and if that matches
i_nlink then every directory is monitored. Simple as that, no?

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Jamie Lokier on 21 Apr 2010 13:40

Miklos Szeredi wrote:
> On Wed, 21 Apr 2010, Jamie Lokier wrote:
> > Sorry, no: That does not work for bind mounts. Both layers can have
> > the same st_dev.
>
> Okay.
>
> > Nor does O_NOFOLLOW stop traversal in the middle of
> > a path, there is no handy O_NOCROSSMOUNTS, and no st_mode flag or
> > d_type to say it's a bind mount. Bind mounts are really a big pain
> > for i_nlink+inotify name counting.
>
> I'm confused. You are monitoring a specific file and would like to
> know if something is happening to any of it's links, right?

Not quite. I'm monitoring a million files (say), so I must use
directory watches for most of them. I need directory watches anyway,
when the semantic is "calling open on /path/to/file and reading would
return the same data", because renames and unlinks are also a way to
invalidate monitored file contents.

At a high level, what we're talking about is the ability to cache and
verify the the validity information derived from reading files in the
filesystem, in a manner which efficiently triggers invalidation only
on changes. Being able to answer, as quickly as possible, "if I read
this, that and other, will I get the same results as the last time I
did those operations, without having to actually do them to check".
There are many applications, provided the method is reliable.

> Why do you need to know about bind mounts for that?
>
> Count the number of times you encounter that d_ino and if that matches
> i_nlink then every directory is monitored. Simple as that, no?

When I see a file has i_nlink > 1, I must watch the file directly
using a file-watch (with inotify; polling with stat() with dnotify),
_unless_ I have seen all the links to that file.

When I've seen all the links to a file, I know that my directory
watches on the directories containing those links are sufficient to
detect changes to the file contents. That's because every
file change will get notified on at least one of those paths.

I learn that I've seen all the links by seeing d_ino during readdir as
you suggested, or by st_ino in the cases where I've not had reason to
readdir and I have needed to open the file or call stat.

Let's look at some bind mounts. One where st_ino doesn't work:

/dirA/file1 [hard link to inode 100, i_nlink = 2]
/dirA/bound [bind mount, has /dirA/file1 mounted on it]
/dirB/file2 [hard link to inode 100, i_nlink = 2]

If the program is asked to open /dirA/file1 and /dirA/bound at various
times, and never asked to readdir /dirA, it will have used fstat not
readdir, seen the same (st_dev,st_ino,i_nlink = 2), and _wrongly_
concluded that it is monitoring all directories containing paths to
the file.

To avoid that problem, it parses /proc/mounts and detects that
/dirA/bound does not contributed to the link count. This is faster
than calling readdir in all possible places that it can happen.

Another one, where readdir + d_ino doesn't work anyway:

/dirA/file1 [hard link to inode 100, i_nlink = 2]
/dirB/dirX [bind mount, has /dirA mounted on it]
/dirC/file2 [hard link to inode 100, i_nlink = 2]

This time the program is asked to open /dirA/file1 and
/dirB/dirX/file1 at various times. Suppose it aggressively calls
readdir on all of the places it goes near, and uses d_ino comparisons.

Bear in mind it can't hunt for /dirC because there may be millions of
directories; this is just an example.

Then it will see the same d_ino for /dirA/file1 and /dirB/dirX/file1,
and wrongly conclude that it is monitoring all directories containing
paths to the file.

So again, it must parse /proc/mounts to detect that everything found
under /dirB/dirX mirrors /dirA.

This is a bit more complicated by the fact that inotify/dnotify send
events to the watching dentry parent of the link used to access a
file, not necessarily the parent in the mounted path space.

Although this doesn't make the bind mount problem go away, this is
where union mounts complicate the picture more:

Ideally, the program may assume that d_ino and st_ino match as long as
the file is open (on any filesystem), or that the filesystem type is
in a whitelist of ones with stable inode numbers (most local
filesystems), and it's not a mountpoint. So when it's asked to open
at one path, and something else asks it to readdir at another path, it
could combine the information to learn when it's found all entries,
without having to use redundant readdirs and stats.

I'm thinking that I might have to detect union mounts specially in
/proc/mounts, now that they are a VFS feature, and disable a bunch of
assumptions about d_ino when seeing them. Hopefully it is possible to
unambiguously check for union mount points in /proc/mounts?

d_ino == directory's st_ino sounds neat. Maybe that will be enough,
as a special magical Linux rule. When reading a directory, it's cheap
to get the directory's st_ino with fstat(). It's possible to bind
mount a directory on it's _own_ child, so that st_ino == directory's
st_ino, but d_ino isn't affected so maybe that's the trick to use.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Valerie Aurora on 21 Apr 2010 17:40

On Wed, Apr 21, 2010 at 10:52:21AM +0100, Jamie Lokier wrote:
> Miklos Szeredi wrote:
> > Detecting mount points is best done by comparing st_dev for the parent
> > directory with st_dev of the child. This is much simpler than parsing
> > /proc/mounts and will work for bind mounts as well as union mounts.
>
> Sorry, no: That does not work for bind mounts. Both layers can have
> the same st_dev. Nor does O_NOFOLLOW stop traversal in the middle of
> a path, there is no handy O_NOCROSSMOUNTS, and no st_mode flag or
> d_type to say it's a bind mount. Bind mounts are really a big pain
> for i_nlink+inotify name counting.
>
> Besides, calling stat() on every entry in a large directory to check
> st_ino can be orders of magnitude slower than readdir() on a large
> directory - especially with a cold cache. It is quicker, but much
> more complicated, to parse /proc/mounts and apply arcane rules to find
> the exceptions.
>
> Can a union mount overlap two parts of the same filesystem?

No. Each layer must be a separate file system, the bottom must be
read-only, the top must be writable, and they must be unioned at their
mount points.

> > I think there's no question that union mounts might break apps (POSIX
> > or not). But I think there's hope that they are few and can easily be
> > fixed.
>
> I agree, and union moint is a very useful feature that's worth
> breaking a few apps for :-)
>
> I'm curious if there's a clear way to go about it in this case, or
> if it'll involve a certain amount of pattern recognition in /proc/mounts.

All it takes is looking for the "union" string in the mount options.

> Basically I'm wondering if it's been thought about already.

Not as much as it deserves. :) Do you have any thoughts about better
solutions?

Something to keep in mind is that most of the app issues are already
present with bind mounts. In many cases, if an app doesn't work with
union mounts, it's also not going to work with bind mounts. I think
you have a good point that we could use a more straightforward way to
say, "Hey, you can't use the normal st_dev/st_ino rules right now..."

-VAL
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3
Prev: Fix typo in percpu.h
Next: x86: correctly wire up the newuname system call