fanotify as syscalls [Kernel]

Prev: BUG kmalloc-64: Poison overwritten, INFO: Allocated in bdi_alloc_work+0x2b/0x100 age=175 cpu=1 pid=3514
Next: [PATCH 62/72] Blackfin: bf537-stamp: add adp5588 gpio resources

From: Eric Paris on 16 Sep 2009 18:40

On Wed, 2009-09-16 at 22:49 +0100, Jamie Lokier wrote:
> Eric Paris wrote:
> > On Wed, 2009-09-16 at 13:56 +0100, Jamie Lokier wrote:
> > > Alan Cox wrote:
> > > > > You can't rely on the name being non-racy, but you _can_ reliably
> > > > > invalidate application-level caches from the sequence of events
> > > > > including file writes, creates, renames, links, unlinks, mounts. And
> > > > > revalidate such caches by the absence of pending events.
> > > >
> > > > You can't however create the caches reliably because you've no idea if
> > > > you are referencing the right object in the first place - which is why
> > > > you want a handle in these cases. I see fanotify as a handle producing
> > > > addition to inotify, not as a replacement (plus some other bits around
> > > > open blocking for HSM etc)
> > >
> > > There are two sets of events getting mixed up here. Inode events -
> > > reads, writes, truncates, chmods; and directory events - renames,
> > > links, creates, unlinks.
> >
> > My understanding of you argument is that fanotify does not yet provide
> > all inotify events, namely those of directories operations and thus is
> > not suitable to wholesale replace everything inotify can do.
>
> Largely that, plus seeing fanotify look like it'll acquire some
> capabilities that would be useful with those inotify events and
> inotify not getting them. Bothered by the apparent direction of
> development, really.
>
> Btw, I'm not sure you can use inotify+fanotify together simultaneously
> in this way, which may be of benefit - caching might help the
> anti-malware-style access controls. I'll have to think carefully
> about ordering of some events, and using fanotify and inotify
> independently at the same time loses that ordering.

As soon as you say "ordering" you lose :)

>
> > I've already said that working towards that goal is something I plan
> > to pursue,
>
> Sorry, I missed that, just as I didn't find a reply to Evigny's "I
> need pids". And from another mail, I thought you were stopping at the
> things with file descriptors.
>
> > but for now, you still have inotify.
>
> That's right. And it sucks for subtrees, so that's why I'd like to
> absorb improvements on subtree inclusions, and exclusion nodes look
> useful too.
>
> > The mlocate/updatedb people ask me about fanotify and it's on the todo
> > list to allow global reception of of such events. The fd you get would
> > be of the dir where the event happened. They didn't care, and I haven't
> > decided if we would provide the path component like inotify does. Most
> > users are perfectly happy to stat everything in the dir.
>
> mlocate/updatedb is relatively low performance and of course wants to
> be system-wide. It's not looking so good if a user wants an indexer
> just on their /home, and the administrator does not want everyone else
> to pay the cost.
>
> But I think we're quite agreed on how useful subtrees would be.
> System-wide events won't be needed if we can monitor the / subtree
> to get the same effect, and that'll also sort out namespaces and chroots.
>
> Stat'ing every entry in a dir event. Thinking out loud:
>
> 1. Stat'ing everything in a dir just to find out which 1 file was
> deleted can be quite expensive for some uses (if you have a
> large dir and it happens a lot), and is unpleasant just because
> each change _should_ result in about O(1) work. Taste, style,
> elegance ;-)
>
> For POSIX filesystems, I don't see any logical problem with
> this, actually. You don't need to call stat()! It's enough to
> call readdir() and look at d_ino to track
> renames/links/creates/unlinks - assuming only directory-change
> events are relevant here.
>
> Just an unpleasant O(n) scaling with directory size.
>
> (Note that I ignore mount points not returning the correct
> d_ino, because apps can track the mount list and
> compensate; they should be doing this anyway).

I think we generally agree here.

>
> 2. updatedb-style indexing apps don't care about the
> readdir/stat-all-entries cost, because they don't need to read
> the directory after every change, they only need to do it once
> every 24 hours if any events were received in that interval!
>
> (Obviously this isn't the same for pseudo-real-time indexers.)

See, that's why the mlocate/updatedb contacted me. They don't want it
to be a 24 hour thing. They was pseudo real time without thrashing the
hd. Turns out, the interface is there, and the backend just needs
struct path information available to give it to them.

> For Samba-style caching, on the other hand, the cost of
> rescanning a large directory when one file is being read often
> and another file in it is changing often might be prohibitive,
> forcing it to use heuristics to decide when to monitor a
> directory and when not to to cache it, depending on directory
> size. I'd rather avoid that.

This seems like you are confusing the events that happen to a directory
(mv/rename) and the events that happen to a file (read/write.) (The
former I do not have code for, the latter I do) Which is surprising
since you so clearly delineate them later, so maybe I'm
misunderstanding.

> 3. Non-POSIX filesystems don't always have stable inode numbers.
> You can't tell that foo was renamed to bar by reading the
> directory and looking at d_ino, or by calling stat on each entry.
>
> You can assume stable inode numbers for inodes where there's an
> open file descriptor; that *might* be just enough to squeeze
> through the logic of a cache. I'm not sure right now.
>
> 4. You can't tell when file contents are changed from stat info.
>
> That means you have to receive an inode event, not a directory
> event for data changes, but that's not a problem of course - the
> name-used-for-access isn't useful for data changes anyway
> (except for debugging perhaps).
>
> 5. stat() doesn't tell you about xattr and ACL changes. xattrs can
> be large and slow to read on a whole directory. But as point 4,
> if attribute changes count as inode changes, there's no problem.
>
> 6. Calling stat() pulls a lot into cache that doesn't need to be in
> cache: all those inodes. But as I mentioned in points 1, 4 and
> 5, provided only directory name operations pass the directory to
> be scanned, and inode operations always pass the inode, you can
> use readdir() and avoid stat(), so the inodes don't have to be
> pulled into cache after all.
>
> Except for non-POSIX inode instability. Would be good to work
> out if that breaks the algorithm.
>
> In summary, calling readdir() and maybe stat/getxattr on each entry in
> a directory might be workable, but I'd rather it was avoidable.
> Simple apps may prefer to do it anyway - and let multiple events in a
> directory be merged as a result.
>
> While I'm here it would be nice to receive one event instead of two
> for operations which involve two paths: link, rename and bind mount.
> Having to pair up two events from inotify isn't helpful in any way.

What do you propose the format of the event should be. Is this
precluded in what's been proposed?

> Imho an API that satisfies everything we've talking about would let
> you specify which fields you want to receive in the event when you
> bind a listener. Not _everything_ is selectable of course, but
> whether you want:
>
> For inode events (data read/write, attribute/ACL/xattr changes):
>
> - Open file descriptor of the affected file [Optional].

Could be added and I've agreed to take a look. I'm just not sure
bringing back the major flaw of inotify is really moving us forward.

> - The inode number and device number (always?).

hmmm, if you have the fd you have both, if you choose to get a wd like
inotify, I say you're still own your own to do the magic map. I don't
want to copy tons of almost always useless data into userspace.

> - A way to identify the vfsmount (due to bind mounts making the
> device number insufficient to identify a directory; always?).

Now you want reliable path names? I need a vfsmount in kernel to open
the fd for userspace, but I don't see how that's translatable to
anything even remotely useable by userspace....

> For directory events (create/unlink/link/rename/reflink/mkdir/rmdir
> /mount/umount):
>
> - Same as inode above, for the object created/linked/deleted.
>
> - Same as inode above, for the directory containing the source name.
> - Source name [Optional].
> - Same as above, for the directory containing the target name
> - Target name [Optional]
>
> Source and target are the two names for
> rename/link/reflink/bind-mount operations. Otherwise there is
> only one name to include.
>
> Ironically, it begins to look a bit like netlink ;-)
>
> As you can see, I've made the open descriptors optional, and the names
> for directory events optional. For directory events, the object
> descriptor option should be independent from the source/target
> directory descriptor option.
>
> Add one more option: wait for Ack before file accessing process can
> proceed, or don't require Ack. That basically distinguishes inotify
> behaviour from fsnotify behaviour.

Those 2 things are completely independent. If you request read with
blocking you get read with blocking. If you request just read, you get
just read.

> It's not obvious, but that option's useful for directory events too,
> if you think about it: Think like an anti-malware or other access
> control manager, and ask: what if I have to block something which
> depends on the layout of files? Just as directory events are enough
> for caching, they are enough for complete access control of
> layout-dependent state too. For example, some line of text is no
> problem in a random file, but might be forbidden by the access manager
> from appearing in .bash_login, including by "mv harmless .bash_login".

I'd say they can realize the bad data when .bash_login is next opened
and deny access then :)

No, it's honestly a good idea, and one that is going to likely take
serious hook movement. A lot of these things can go on the todo list,
but don't sound like show stoppers to me....

> > It's hopefully feasible, but it's going to take some fsnotify hook
> > movements and possibly so arguments with Al to get the information I
> > want where I want it.
>
> That may, indeed, be a sticking point :-)

My silent comrade from suse is looking at moving some hooks as we speak
so hopefully directory events can get added shortly after a merge. I
don't think we should wait until every conceived (but not necessarily
needed) possibility is coded. We have things that work, meet needs, and
hopefully you'll agree leave us with a place to go in the future. Do
you?

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds on 17 Sep 2009 12:50

On Wed, 16 Sep 2009, Jamie Lokier wrote:
>
> I'd forgotten about Linus' strace argument. That's a good one.
>
> Of course everything should be a syscall by that argument :-)

Oh yes, everything _should_ be a syscall.

The problem is that many things are too "amorphous" to be system calls,
and don't have any sane generic semantics (ie they only act on a specific
device node). So we have ioctl's etc for those things.

And then we have page faults. I've long wished that from a system call
tracing standpoint we could show page faults as pseudo-system-calls (at
least as long as they happen from user space - trying to handle nesting is
not worth it). It would make it _so_ much more obvious what the
performance patterns are if you could just do

strace -ttT firefox

for the cold-cache case and you'd see where the time is really spent.

(yeah, yeah, you can get that kind of information other ways, but it's a
hell of a lot less convenient than just getting a nice trace with
timestamps).

> And strace can trace some ioctls and setsockopts. (But it's never
> pretty to see isatty() showing in strace as SNDCTL_TMR_TIMEBASE :-)

Yes, strace can fix things up, and show "send a packet" as "fanotify". But
it's nasty and hard.

Quite frankly, I have _never_ever_ seen a good reason for talking to the
kernel with some idiotic packet interface. It's just a fancy way to do
ioctl's, and everybody knows that ioctl's are bad and evil. Why are fancy
packet interfaces suddenly much better?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Arjan van de Ven on 17 Sep 2009 13:40

On Thu, 17 Sep 2009 09:40:16 -0700 (PDT)
Linus Torvalds <torvalds(a)linux-foundation.org> wrote:
>
> And then we have page faults. I've long wished that from a system
> call tracing standpoint we could show page faults as
> pseudo-system-calls (at least as long as they happen from user space
> - trying to handle nesting is not worth it). It would make it _so_
> much more obvious what the performance patterns are if you could just
> do
>
> strace -ttT firefox
>
> for the cold-cache case and you'd see where the time is really spent.
>
> (yeah, yeah, you can get that kind of information other ways, but
> it's a hell of a lot less convenient than just getting a nice trace
> with timestamps).

ohhh I should add pagefaults to timechart
good one.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Eric Paris on 17 Sep 2009 15:00

On Thu, 2009-09-17 at 09:40 -0700, Linus Torvalds wrote:
>
> On Wed, 16 Sep 2009, Jamie Lokier wrote:
> >
> > I'd forgotten about Linus' strace argument. That's a good one.
> >
> > Of course everything should be a syscall by that argument :-)
>
> Oh yes, everything _should_ be a syscall.

I rewrote the interface and prototyped out a working fanotify like so:

SYSCALL_DEFINE4(fanotify_init, unsigned int, flags, int, event_f_flags,
__u64, mask, int, priority)

int flags indicates - things like global or directed, fd's or wd's,
could include fail allow vs fail deny, O_CLOEXEC, O_NONBLOCK, etc
int event_f_flags - flags used when opening an fd for the listener
__u64 mask - in global mode the events of interest
int priority - the order fanotify listeners should be checked (so HSM
can be before AV scanners)

Do we need a timeout for access decisions? I left room for that in the
bind address, but we can't just leave room to spare with a syscall...

SYSCALL_DEFINE6(fanotify_modify_mark, int, fanotify_fd,
unsigned int, flags, int, fd,
const char __user *, pathname, __u64, mask,
__u64, ignored_mask)

int fanotify_fd - duh
int flags - add, remove, flush, events on child, event on subtree?
int fd - either fd to object or fd to dir for relative pathname
const char __user * pathname - either pathname or null if only use fd
__u64 mask - events this inode cares about
__u64 ignored_mask - events this inode does NOT care about

(not yet done, would someone like to comment?)
fanotify_response(int fanotify_fd, __u64 cookie, __u32 response);
__u64 cookie - which of our permission requests we are waiting on
__u32 response - allow, deny, wait longer

Could be done using write(), but I think the strace argument clearly
says that this should be a syscall that can be easily found and reported

(not settled in my mind)
int fanotify_ignore_sb(int fanotify_fd, unsigned int flags,
long f_type, fsid_t f_fsid)
int fanotify_fd - duh
unsigned int flags - f_type or fsid?
long f_type - statfs(2) f_type
fsid_t f_fsid - statfs(2) f_fsid

Reads from the fd would return data of this structure:

struct fanotify_event_metadata {
__u32 event_len;
__u32 vers;
__s32 fd;
__u32 mask;
__u32 f_flags;
__s32 pid;
__s32 uid;
__s32 tgid;
__u64 cookie;
} __attribute__((packed));

Thanks to event_len and vers, we could extend it to include

__u32 filename1_len,
char filename1[filename1_len]
__u32 filename2_len,
char filename2[filename2_len]

This can all take shape as that work is completed and I don't believe
should block merging.

Do my syscalls look pretty enough? I'm down to 3 or 4.
Jamie, you tend to agree that the interface and the event types are nice
enough that we can build out (if we get the right hooks in the right
places) everywhere we need to go?

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Andreas Gruenbacher on 17 Sep 2009 16:10

On Wednesday, 16 September 2009 14:17:08 Jamie Lokier wrote:
> Eric Paris wrote:
> > On Wed, 2009-09-16 at 08:52 +0100, Jamie Lokier wrote:
> > > Seriously, what does system-wide fanotify do when run from a
> > > chroot/namespace/cgroup, and a file outside them is accessed?
> >
> > At the moment an fanotify global listener is system wide. Truely system
> > wide. A gentleman from suse is looking rectify the problem so that if
> > run inside a namespace it stays inside the namespace. Note that this
> > particular little tidbit is not in the 8 patches I proposed. At the
> > moment those just include the UI and basic notification.
>
> I'll be really interested in the gentleman's solution.

I guess Eric meant me.

From my point of view, "global" events make no sense, and fanotify listeners
should register which directories they are interested in (e.g., include "/",
exclude "/proc"). This takes care of chroots and namespaces as well.

I think we want to register for events on objects rather than in the
namespace, i.e., for inodes visible in multiple places because of hardlinks
or bind mounts, we get the same kinds of events no matter which path is used.
(The path actually used would still show up in /proc/self/fd/x.) When moving
registered inodes, the registrations would move with them. This is how
inotify works, except that inotify watches are not recursive.

The difficulty with this is that in the worst case, this would require walking
the entire namespace and all cached inodes. I don't see how this could be
done for two reasons:

* First, we can't take the vfsmount_lock and dcache_lock for the entire time.

* Second, we would need to pin almost all the inodes, which is a clear no-go.

[Why pin? At least we would need to remember which objects a listener has
registered interest in, so we need to pin the inodes. We could still
allow unregistered directory inodes to be thrown out because we can
recreate their registration status from the parent. We can't recreate the
registration status of non-directories because of hardlinks, though.]

The only other idea I could come up with is to only allow recursive
registrations at mount points: instead of inodes, the vfsmounts would be
included or excluded (probably automatically including bind mounts). This has
one big drawback though: users would no longer be able to watch arbitrary
subtrees anymore. Privileged users could still arrange to watch almost all
subtrees with bind mounts (mount --bind /foo/bar /foo/bar).

Any ideas?

Thanks,
Andreas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10 11
Prev: BUG kmalloc-64: Poison overwritten, INFO: Allocated in bdi_alloc_work+0x2b/0x100 age=175 cpu=1 pid=3514
Next: [PATCH 62/72] Blackfin: bf537-stamp: add adp5588 gpio resources