From: Jamie Lokier on
Eric Paris wrote:
> On Tue, 2009-09-15 at 16:49 -0700, Linus Torvalds wrote:
> > And btw, I still want to know what's so wonderful about fanotify that we
> > would actually want yet-another-filesystem-notification-interface. So I'm
> > not sayying that I'll take a system call interface.
>
> The real thing that fanotify provides is an open fd with the event
> rather than some arbitrary 'watch descriptor' that userspace must
> somehow magically map back to data on disk. This means that it could be
> used to provide subtree notification, which inotify is completely
> incapable of doing.

That's a bit of a spurious claim.

- fanotify does not provide subtree notification in it's
present form. When it is extended to do that, why wouldn't
inotify be as well? That's an fsnotify feature, common to both.

- fanotify does not provide notification at all for some events that
you get with inotify. It is not a superset, so you can't use
fanotify to provide a subtree-capable equivalent to inotify. What
a mess when you need the combination of both features!

- fanotify requires you call readlink(/proc/fd/N) for every event to
get the path. It's not a particularly efficient way to get it,
especially when an apps wants to know if it's something in it's
region of interest but doesn't care about the actual path.
When an apps knows it needs the map back to to path, why make it
slow to get it? That "extensible data format" is being
underutilised...

- fanotify's descriptor may be race-prone as a way to get the subtree
used for access, because any of the parent directories could have
moved and even been deleted before the app calls
readlink(/proc/fd/N). I don't know if a _reliable_ way to track
changes in a subtree can be built on it. Maybe it can but it
appears this hasn't been analysed. It depends on
readlink(/proc/fd/N)'s behaviour when the dentry's have been
changed, among other things.

- Does the descriptor cause umount to fail when user does "do some
stuff in baz; umount baz", or does it serialise nicely? That's one
of inotify's nice features - it doesn't cause umounts to fail.

> And it can be used to provide system wide notification. We all know
> who wants that.

People who want to break out of chroot/namespace jails using the
conveniently provided open file descriptor? :-)

Seriously, what does system-wide fanotify do when run from a
chroot/namespace/cgroup, and a file outside them is accessed?

If the event is delivered with file desciptor, that's a security hole.
If it's not delivered, that sounds like working subtree support?

I'd expect anti-malware to want to be run inside VMs quite often...

Note that there's no such thing as "the real system root" any more.

> It provides an extensible data format which allows growth impossible in
> inotify. I don't know if anyone remember the inotify patches which
> wanted to overload the inotify cookie field for some other information,
> but inotify information extension is not reasonable or backwards
> compatible.

I agree with this (although that's what flags are for -- see clone).

I don't have a problem with the next interface being fanotify (despite
arguing a lot); I just want to see the next one being useful for the
things I would otherwise be proposing my own yet-another-interface
for. So we don't need a fourth one soon after the third due to
easily foreseen limitations.

> I've got private commitments for two very large anti malware companies,
> both of which unprotect and hack syscall tables in their customer's
> kernels, that they would like to move to an fanotify interface. Both
> Red Hat and Suse have expressed interest in these patches and have
> contributed to the patch set.
>
> The patch set is actually rather small (entire set of about 20 patches
> is 1800 lines) as it builds on the fsnotify work already in 2.6.31 to
> reuse code from inotify rather than reimplement the same things over and
> over (like we previously had with inotify and dnotify)

I don't have any problem with either of these, and _fs_notify
generally seems like an improvement. I don't have a problem with
fanotify either. For what it does, it's ok.

> Don't know what else to say.....

Answer questions about use-cases that you're not interested in? Why
block them? What about Evigny's request for an event without an open
fd - because he needs the pid information (inotify doesn't provide)
but not the fd?

Sorry to be so harsh. I'm really trying to make sure we don't repeat
the mistakes of dnotify and inotify, and end up with a third interface
which also is too restrictive (because it's good enough for your
anti-malware and HSM customers) so that a fourth interface will be
needed soon after.

I'd like to be able to use it from some applications to accelerate
userspace caching of things (faster Make, faster Samba) without
penalising all other applications touching unrelated parts of the
filesystem. The attitude "you can live with 10% slowdown" worries me.
I'm sure that can be fixed with a bit of care.

If the intention is to maintain fanotify and inotify side-by-side for
different uses (because fanotify returns open descriptors and blocks
the accessing process until acked), that's ok with me. It makes
sense. But then it's messy that neither offers a superset of the
other regarding which files and events are tracked.

If it's right that inotify has no room for extensibility (I'm not sure
about this), than it appears we already made a mess with dnotify and
inotify, so it would be a shame to repeat the same mistakes again.
Let's get the next one right, even it takes a bit longer, ok?

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Alan Cox on
> - fanotify does not provide subtree notification in it's
> present form. When it is extended to do that, why wouldn't
> inotify be as well? That's an fsnotify feature, common to both.

Because inotify gives you no reliable access to the object monitored as
the name passed back is not an object reference and is racy. Inotify is
fine for making pretty icons pop up on desktops and making file
selectors update, but it is somewhat inadequate for indexers and
completely useless for stuff like HSM.

> - fanotify requires you call readlink(/proc/fd/N) for every event to
> get the path. It's not a particularly efficient way to get it,

IFF you want the path, but the path isn't usually the most valuable bit.
Plus you'll find the readlink is extremely quick anyway.

> People who want to break out of chroot/namespace jails using the
> conveniently provided open file descriptor? :-)

chroot isn't a security model. You can already do this with AF_UNIX
sockets (and there are apps that intentionally use fchdir that way)

> I'd expect anti-malware to want to be run inside VMs quite often...

Inside of containers - unlikely. Inside of guests sure but thats not
going to relevant to fanotify()

> the accessing process until acked), that's ok with me. It makes
> sense. But then it's messy that neither offers a superset of the
> other regarding which files and events are tracked.

Agreed.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Arnd Bergmann on
On Tuesday 15 September 2009, Eric Paris wrote:
> fanotify_modify_mark_at() --- like inotify_add_watch and rm_watch
> fanotify_modify_mark_fd() --- same but with an fd instead of a path

I think these two can be merged into one without adding complexity,
in the same way that sys_utimensat can take a file descriptor or
a path or both.

> fanotify_response() --- userspace tells the kernel what to do if requested/allowed
> (could probably be done using write() to the fanotify fd)
> fanotify_exclude() --- a horrid syscall which might be better as an ioctl since it isn't strongly typed....

Please don't use an ioctl here. While ioctl is fine for character devices
and sort of fine for sockets, I think it would be very bad style to use
it on file descriptors that you get back from specialized syscalls like
fanotify_init. Do one or the other, but do it consistently.

Why is it not strongly typed anyway? Something like

int fanotify_ignore_sb(int fanotify_fd, unsigned int flag,
long f_type, fsid_t f_fsid);

would be type safe, although I think it would be better to only handle
one of the two cases. Can you think of a case that you can't handle if
you have to decide between them and only do one interface (f_type or fsid)?

Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jamie Lokier on
Alan Cox wrote:
> > - fanotify does not provide subtree notification in it's
> > present form. When it is extended to do that, why wouldn't
> > inotify be as well? That's an fsnotify feature, common to both.
>
> Because inotify gives you no reliable access to the object monitored as
> the name passed back is not an object reference and is racy. Inotify is
> fine for making pretty icons pop up on desktops and making file
> selectors update, but it is somewhat inadequate for indexers and
> completely useless for stuff like HSM.

That was my point. (Why do people keep not getting it?)

You can't rely on the name being non-racy, but you _can_ reliably
invalidate application-level caches from the sequence of events
including file writes, creates, renames, links, unlinks, mounts. And
revalidate such caches by the absence of pending events.

(There is one obscure case which inotify is missing, though, which
means it cannot detect file changes in certain cases with hard links.
I intend to fix that one.)

For that, an inode isn't useful, a descriptor isn't useful, a
directory descriptor/inode and pathname isn't useful, and file write
events by themselves aren't useful. None of them quite do it by
themselves.

But with the correct combination of events, you can maintain very
efficient application-level caching of file data / directory listing
and lookups / stat results you have previously read from the
filesystem. That's because the information you have previously
depended upon, including path lookups, are all notified as one sort of
inotify event or another when changed.

Which doesn't sound all that special until you realise you can very
quickly revalidate application-caches of any data structure calculated
from reading things from the filesystem, no matter how many
prerequisites or how complex the data structures, in a single system
call. Amortised over many revalidations if you have them in parallel.

That can apply to things like git, make, ccache, samba, rsync, httpd
path walks, and virtually any "web templating" framework. Of course
it takes userspace support as well, but that's where I'm coming from
regarding "acceleration" and the essential kernel infrastructure.

Clearly, I'm going to have to explain with working code :-)

> but it is somewhat inadequate for indexers

For indexers, the real inadequacy is the need to attach inotify
watches to every directory at system startup, and to stat() everything
to check it hasn't changed since the indexer was last running. Both
are very slow on a large directory tree. The former can be dealt with
using subtree watches (yes, even with hard links - I have proposed an
algorithm for this but I think nobody understood it ;-). The latter
needs filesystem support for a persistent change attribute.

> > - fanotify requires you call readlink(/proc/fd/N) for every event to
> > get the path. It's not a particularly efficient way to get it,
>
> IFF you want the path, but the path isn't usually the most valuable bit.
> Plus you'll find the readlink is extremely quick anyway.

I agree, you don't usually want the whole path.

So what was the point about fanotify making subtree tracking possible
with it's file descriptor, if not by readlink(/proc/fd/N)?
Descriptors don't tell you which subtree a file is in any better than
inotify watches. I.e. they do, if you track them and their containing
directories all individually.

> > People who want to break out of chroot/namespace jails using the
> > conveniently provided open file descriptor? :-)
>
> chroot isn't a security model. You can already do this with AF_UNIX
> sockets (and there are apps that intentionally use fchdir that way)

Ah, no. AF_UNIX works with explicit sender cooperation.

fanotify gives you access to files without sender cooperation, as it
intercepts every open().

> > I'd expect anti-malware to want to be run inside VMs quite often...
>
> Inside of containers - unlikely.

Why not? Some people run entire distributions in containiners, and
present them as VMs to the world for other people to admin.

> > the accessing process until acked), that's ok with me. It makes
> > sense. But then it's messy that neither offers a superset of the
> > other regarding which files and events are tracked.
>
> Agreed.

In the end this is my main gripe.

-- Jamie

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Evgeniy Polyakov on
On Tue, Sep 15, 2009 at 05:54:59PM -0400, Eric Paris (eparis(a)redhat.com) wrote:
> Nothing's impossible, but is netlink a square peg for this round hole?
> One of the great benefits of netlink, the attribute matching and
> filtering, although possibly useful isn't some panacea as we have to do
> that well before netlink to have anything like decent performance.
> Imagine every single fs event creating an skb and sending it with
> netlink only to have most of them dropped.

There is no problem with performance even with single IO per skb.
Consider usual send/recv calls which may end up with the same skb per
syscall - most of the overhead comes from data copy or syscall machinery
(for small writes) and not from allocation path.

I have a 3.5 years old performance graph at
http://www.ioremap.net/gallery/netlink_perf.png

which shows 400 MB/s of bandwidth for 4k writes, I'm pretty sure it is
limited by copy performance only.

> The only other benefit to netlink that I know of is the reasonable,
> easy, and clean addition of information later in time with backwards
> compatibility as needed. That's really cool, I admit, but with the
> limited amount of additional info that users have wanted out of inotify
> I think my data type extensibility should be enough.

I want alot from inotify which I'm afraid will not be easy with fanotify
either, but its existing model just does not allow its extension. I
would not be 100% sure that there will be no additional needs in a year
or so for fanotify.

> > Moreover you can implement a pool of working threads and
> > postpone all the work to them and appropriate event queues, which will
> > allow to use rlimits for the listeners and open files 'kind of' on
> > behalf of those processes.
>
> I'm sorry, I don't userstand. I don't see how worker threads help
> anything here. Can you explain what you are thinking?

I meant that it could be possible to postpone all the work of queueing,
event allocation, fd opening and population all be done on behalf of
some other threads in the system and only original process credentials
would be checked to satisfy various limits. In this case there will be
no questions in which context given fd was created and it is possible to
use async netlink nature.

I do not force you to do this of course, but there is already quite huge
infrastructure for similar tasks and it could be worth to
change/reconsider things to use existing models and not invent own.
Of course this is a matter of overall benefit.

> > But it is quite diferent from the approach you selected and which is
> > more obvious indeed. So if you ask a question whether fanotify should
> > use sockets or syscalls, I would prefer sockets
>
> I've heard someone else off list say this as well. I'm not certain why.
> I actually spent the day yesterday and have fanotify working over 5 new
> syscalls (good thing I wrote the code with separate back and and front
> ends for just this purpose) And I really don't hate it. I think 3
> might be enough.
>
> fanotify_init() ---- very much like inotify_init
> fanotify_modify_mark_at() --- like inotify_add_watch and rm_watch
> fanotify_modify_mark_fd() --- same but with an fd instead of a path

Those two can be combined I think.

> fanotify_response() --- userspace tells the kernel what to do if requested/allowed
> (could probably be done using write() to the fanotify fd)
> fanotify_exclude() --- a horrid syscall which might be better as an ioctl since it isn't strongly typed....

It all sounds good and simple, but what if you will need modify command
with new arguments? Instead of adding new typed option you will need to
add another syscall. I already did that for inotify but via ioctl and
pretty sure there will be such need for much wider fanotify some time in
the future.

> I don't see what's gained using netlink. I am already reusing the
> fsnotify code to do all my queuing. Someone help me understand the
> benefit of netlink and help me understand how we can reasonably meet the
> needs and I'll try to prototype it.
>
> 1) fd's must be opened in the recv process

Or just injected into registered process' fd table with appropriate
limit checks? In this case it can be done on behalf of whatever other
worker.

> 2) reliability, if loss must know on the send side

You have this knowledge at netlink sending time, but there is no way to
wait until 'fail' condition is removed like when you can block writing
into socket waiting for buffer space to become large enough.

And there is no way to tell how many listeners got message and how many
was dropped in multicast deliver except that there were drops.
This can be trivially extended though.

--
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/