fanotify as syscalls [Kernel]

Prev: BUG kmalloc-64: Poison overwritten, INFO: Allocated in bdi_alloc_work+0x2b/0x100 age=175 cpu=1 pid=3514
Next: [PATCH 62/72] Blackfin: bf537-stamp: add adp5588 gpio resources

From: Eric Paris on 14 Sep 2009 15:10

Long ago I implemented fanotify as basically a /dev interface using
ioctls(). Alan suggested I use a socket protocol and could then make
use of get/setsockopt() which although still not great is light years
better than ioctl. Currently the fanotify interface as I want to push
it to Linus and as I've been requesting comments on for the last 1.25
years is just that. It really makes no use of the networking system
other than bind() and setsockopt() and everyone tends to agree the
things I want to do can't reasonably be done using network hooks and a
'real' socket protocol. I like this interface, setsockopt() makes it so
easy to add new functionality as we flush out other users. fanotify as
it stands today has a number of groups who will port to it, has a nubmer
of advantages over inotify, and I have been told privately meets the
needs of the original group of people who paid me to work on it (two
very large anti-malware companies who currently unprotect and hack the
syscall table of their users)

Just this week I got another request to look at syscalls. So I did, I
haven't prototyped it, but I can do it with syscalls, they would look
like this:

int fanotify_init(int flags, int f_flags, __u64 mask, unsigned int priority);

int fanotify_add_mark(int fanotify_fd, char *path, __u64 mask, __u64 ignored_mask);
int fanotify_add_mark_fd(int fanotify_fd, int fd, __u64 mask, __u64 ignored_mask);
int fanotify_rm_mark(int fanotify_fd, char *path, __u64 mask);
int fanotify_rm_mark_fd(int fanotify_fd, int fd, __u64 mask);
Those above 4 could probably be squashed into 2 syscalls with an extra
flags field.

int fanotify_clear_marks(int fanotify_fd);

int fanotify_perm_response(int fanotify_fd, __u64 cookie, int response);

int fanotify_ignore_sb(int fanotify_fd, long f_type);
int fanotify_ignore_fsid(int fanotify_fd, fsid_t f_fsid);
These 2 are the most questionable, they would honestly only be used for
things that wanted system wide notification, I can't imagine that being
many things other than AV vendors. But they really need a way to
exclude notification when people open/close/read/write to /proc (which
is the point of the ignore_sb.)

Since I don't have a solution to subtree notification I don't know if it
will work in this syscall framework. I know people want subtree
notification and I'm willing to take a stab at it after the fscking all
notification is accepted. That's one of the main reasons I like
setsockopt over tons of syscalls. I can add a new one very easily. I
also can easily expand arguments by just creating a new sockopt. No
userspace headaches.

Are there demands that I convert to syscalls? Do I really gain anything
using 9 new inextensible syscalls over socket(), bind(), and 8
setsockopt() calls?

I'd like to send these patches along, so a ruling from on high would be
great....

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Evgeniy Polyakov on 15 Sep 2009 16:20

On Mon, Sep 14, 2009 at 03:08:15PM -0400, Eric Paris (eparis(a)redhat.com) wrote:
> Just this week I got another request to look at syscalls. So I did, I
> haven't prototyped it, but I can do it with syscalls, they would look
> like this:
>
> int fanotify_init(int flags, int f_flags, __u64 mask, unsigned int priority);

....

> Are there demands that I convert to syscalls? Do I really gain anything
> using 9 new inextensible syscalls over socket(), bind(), and 8
> setsockopt() calls?

In my personal opinion sockets are way much simpler and obvious than
syscalls. Also one should not edit hundred of arch-dependant headers
conflicting with other pity syscallers.

But implementing af_fanotify opens a door for zillions of other
af_something which can be implemented using existing infrastructure
namely netlink will solve likely 99% of potential usage cases.

And frankly I did not find it way too convincing that using netlink is
impossible in your scenario if some things will be simplified, namely
event merging. Moreover you can implement a pool of working threads and
postpone all the work to them and appropriate event queues, which will
allow to use rlimits for the listeners and open files 'kind of' on
behalf of those processes.

But it is quite diferent from the approach you selected and which is
more obvious indeed. So if you ask a question whether fanotify should
use sockets or syscalls, I would prefer sockets, but I still recommend
to rethink your decision to move away from netlink to be 100% sure that
it will not solve your needs.

--
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Eric Paris on 15 Sep 2009 18:00

On Wed, 2009-09-16 at 00:16 +0400, Evgeniy Polyakov wrote:
> On Mon, Sep 14, 2009 at 03:08:15PM -0400, Eric Paris (eparis(a)redhat.com) wrote:
> > Just this week I got another request to look at syscalls. So I did, I
> > haven't prototyped it, but I can do it with syscalls, they would look
> > like this:
> >
> > int fanotify_init(int flags, int f_flags, __u64 mask, unsigned int priority);
>
> ...
>
> > Are there demands that I convert to syscalls? Do I really gain anything
> > using 9 new inextensible syscalls over socket(), bind(), and 8
> > setsockopt() calls?
>
> In my personal opinion sockets are way much simpler and obvious than
> syscalls. Also one should not edit hundred of arch-dependant headers
> conflicting with other pity syscallers.
>
> But implementing af_fanotify opens a door for zillions of other
> af_something which can be implemented using existing infrastructure
> namely netlink will solve likely 99% of potential usage cases.
>
> And frankly I did not find it way too convincing that using netlink is
> impossible in your scenario if some things will be simplified, namely
> event merging.

Nothing's impossible, but is netlink a square peg for this round hole?
One of the great benefits of netlink, the attribute matching and
filtering, although possibly useful isn't some panacea as we have to do
that well before netlink to have anything like decent performance.
Imagine every single fs event creating an skb and sending it with
netlink only to have most of them dropped.

The only other benefit to netlink that I know of is the reasonable,
easy, and clean addition of information later in time with backwards
compatibility as needed. That's really cool, I admit, but with the
limited amount of additional info that users have wanted out of inotify
I think my data type extensibility should be enough.

> Moreover you can implement a pool of working threads and
> postpone all the work to them and appropriate event queues, which will
> allow to use rlimits for the listeners and open files 'kind of' on
> behalf of those processes.

I'm sorry, I don't userstand. I don't see how worker threads help
anything here. Can you explain what you are thinking?

> But it is quite diferent from the approach you selected and which is
> more obvious indeed. So if you ask a question whether fanotify should
> use sockets or syscalls, I would prefer sockets

I've heard someone else off list say this as well. I'm not certain why.
I actually spent the day yesterday and have fanotify working over 5 new
syscalls (good thing I wrote the code with separate back and and front
ends for just this purpose) And I really don't hate it. I think 3
might be enough.

fanotify_init() ---- very much like inotify_init
fanotify_modify_mark_at() --- like inotify_add_watch and rm_watch
fanotify_modify_mark_fd() --- same but with an fd instead of a path
fanotify_response() --- userspace tells the kernel what to do if requested/allowed
(could probably be done using write() to the fanotify fd)
fanotify_exclude() --- a horrid syscall which might be better as an ioctl since it isn't strongly typed....

If there is a strong need for syscalls I'm ready and willing. I'd love
to hear Linus say socket+setsockopt() is a merge blocker and I have to
go to syscalls if he sees it that way. If davem and friends say I'm not
networky enough to use socket()+setsockopt() I guess I have to look at
netlink (which I'm still not convinced buys us anything vs the crazy
complexity) or go with syscalls. I'm perfectly willing to admit this is
a /dev+ioctl type interface only implemented on socket+setsockopt(). If
that's a no go, someone say it now please.

> but I still recommend
> to rethink your decision to move away from netlink to be 100% sure that
> it will not solve your needs.

I don't see what's gained using netlink. I am already reusing the
fsnotify code to do all my queuing. Someone help me understand the
benefit of netlink and help me understand how we can reasonably meet the
needs and I'll try to prototype it.

1) fd's must be opened in the recv process
2) reliability, if loss must know on the send side

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds on 15 Sep 2009 20:00

On Tue, 15 Sep 2009, Eric Paris wrote:
>
> I don't see what's gained using netlink.

I'm personally not a big believer in netlink. What's the point, really? If
you are sending datagrams back-and-forth, go wild. But if it's more
structured than that, netlink has no actual upsides as far as I can tell.

Same goes for sockets in this case, actually. What's the upside?

I'll throw out a couple of upsides of actual system calls, people can feel
free to comment:

- things like 'strace' _work_ and the traces make sense, and you
generally see what the app is trying to do from the traces (sure, it
takes some time for strace to learn new system calls, but even when it
only gives a system call number, it's never any worse than some
"made-up packet interface".

- if you have a system call definition, it tends to be a much stricter
interface than "let's send some packets around with a network
interface".

- No unnecessary infrastructure.

That said, maybe the netlink/socket people can argue for their
standpoints.

(And btw, I still want to know what's so wonderful about fanotify that we
would actually want yet-another-filesystem-notification-interface. So I'm
not sayying that I'll take a system call interface. I just don't think
that hiding interfaces behind some random packet interface is at all any
better)

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Eric Paris on 15 Sep 2009 21:30

On Tue, 2009-09-15 at 16:49 -0700, Linus Torvalds wrote:

> And btw, I still want to know what's so wonderful about fanotify that we
> would actually want yet-another-filesystem-notification-interface. So I'm
> not sayying that I'll take a system call interface.

The real thing that fanotify provides is an open fd with the event
rather than some arbitrary 'watch descriptor' that userspace must
somehow magically map back to data on disk. This means that it could be
used to provide subtree notification, which inotify is completely
incapable of doing. And it can be used to provide system wide
notification. We all know who wants that.

It provides an extensible data format which allows growth impossible in
inotify. I don't know if anyone remember the inotify patches which
wanted to overload the inotify cookie field for some other information,
but inotify information extension is not reasonable or backwards
compatible.

fanotify also allows userspace to make access decisions and cache those
in the kernel. Useful for integrity checkers (anti-malware people) and
for hierarchical storage management people.

I've got private commitments for two very large anti malware companies,
both of which unprotect and hack syscall tables in their customer's
kernels, that they would like to move to an fanotify interface. Both
Red Hat and Suse have expressed interest in these patches and have
contributed to the patch set.

The patch set is actually rather small (entire set of about 20 patches
is 1800 lines) as it builds on the fsnotify work already in 2.6.31 to
reuse code from inotify rather than reimplement the same things over and
over (like we previously had with inotify and dnotify)

Don't know what else to say.....

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

| Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10 11
Prev: BUG kmalloc-64: Poison overwritten, INFO: Allocated in bdi_alloc_work+0x2b/0x100 age=175 cpu=1 pid=3514
Next: [PATCH 62/72] Blackfin: bf537-stamp: add adp5588 gpio resources