From: Eric Paris on
On Thu, 2009-09-17 at 22:07 +0200, Andreas Gruenbacher wrote:

> From my point of view, "global" events make no sense, and fanotify listeners
> should register which directories they are interested in (e.g., include "/",
> exclude "/proc"). This takes care of chroots and namespaces as well.

While I completely agree that most users don't want global events, the
antimalware vendors who today unprotect and hack the syscall table on
their unsuspecting customers' machines to intercept every read, write,
open, close, mmap, etc. syscall want EXACTLY that. They've been asking
for a way to get this information for quite some time now. The largest
vendors in this market have agreed that the interface (well, when it was a
socket interface that I talked about for so long) should meet their
needs.

Watching the subtree at / isn't any different or better, just harder and more
complex to implement. You still have to exclude /proc and /sys and
everything else, just as one must with a global listener. Still
though, this sounds like an issue for the f_type and f_fsid exclusion
syscall that I said I'm still not settled on, not an issue with the basis of
fanotify or with the 3 proposed syscalls.

Jamie, do you see a problem with what I have been asking for review on
or see a problem with extending it moving forward?

Linus, do you see the value of 'yet another notification scheme' ?

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andreas Gruenbacher on
On Friday, 18 September 2009 22:52:08 Eric Paris wrote:
> On Thu, 2009-09-17 at 22:07 +0200, Andreas Gruenbacher wrote:
> > From my point of view, "global" events make no sense, and fanotify
> > listeners should register which directories they are interested in (e.g.,
> > include "/", exclude "/proc"). This takes care of chroots and namespaces
> > as well.
>
> While I completely agree that most users don't want global events, the
> antimalware vendors who today unprotect and hack the syscall table on
> their unsuspecting customers' machines to intercept every read, write,
> open, close, mmap, etc. syscall want EXACTLY that.

I understand that "global" is what those guys get today for lack of a
reasonable mechanism, but it's not what anybody can be given by fanotify: it
conflicts with filesystem namespaces.

Consider running several "virtual machines" in separate namespaces on the same
kernel. With "global" you are forced to run the same global fanotify
listeners everywhere; with per-mount-point listeners, you can choose
between "global" and something more fine-grained by identifying which
vfsmounts you are interested in. (Filesystem namespaces correspond to
vfsmount hierarchies.)

> [...] You still have to exclude /proc and /sys and everything else.

Those are mount points, and so convenient to handle with a per-mount-point
mechanism. No additional kernel code needed.
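
As a rough userspace illustration of the include "/", exclude "/proc" idea,
the sketch below decides whether a path is watched by letting the longest
matching mount point win. It is only an assumption about how such rules could
be evaluated: struct watch_rule, path_is_under(), path_is_watched() and
example_rules are hypothetical names, and a real per-mount-point mechanism
would work on vfsmounts in the kernel rather than on path strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule: a mount point plus whether it is included or excluded. */
struct watch_rule {
    const char *mount_point;
    int include;            /* 1 = include, 0 = exclude */
};

/* Return 1 if `path` lies at or below directory `dir`, else 0. */
int path_is_under(const char *path, const char *dir)
{
    size_t n = strlen(dir);

    if (n == 1 && dir[0] == '/')
        return path[0] == '/';
    if (strncmp(path, dir, n) != 0)
        return 0;
    /* Reject "/procfoo" when dir is "/proc". */
    return path[n] == '\0' || path[n] == '/';
}

/* The longest matching mount point decides; no match means not watched. */
int path_is_watched(const struct watch_rule *rules, size_t nrules,
                    const char *path)
{
    size_t best_len = 0;
    int watched = 0;
    size_t i;

    assert(rules != NULL && path != NULL);
    for (i = 0; i < nrules; i++) {
        size_t len = strlen(rules[i].mount_point);

        if (path_is_under(path, rules[i].mount_point) && len >= best_len) {
            best_len = len;
            watched = rules[i].include;
        }
    }
    return watched;
}

/* Example rule set: include "/", exclude "/proc" and "/sys". */
const struct watch_rule example_rules[] = {
    { "/", 1 },
    { "/proc", 0 },
    { "/sys", 0 },
};
```

With these rules a path such as /home/user/file is watched, while anything
at or below /proc or /sys is not.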

> [...] Still though, this sounds like an issue for the f_type and f_fsid
> exclusion syscall I say I'm still not settled on.

Those are also obsolete with a per-mount-point mechanism.

Thanks,
Andreas
From: Eric Paris on
On Sat, 2009-09-19 at 00:00 +0200, Andreas Gruenbacher wrote:
> On Friday, 18 September 2009 22:52:08 Eric Paris wrote:
> > On Thu, 2009-09-17 at 22:07 +0200, Andreas Gruenbacher wrote:
> > > From my point of view, "global" events make no sense, and fanotify
> > > listeners should register which directories they are interested in (e.g.,
> > > include "/", exclude "/proc"). This takes care of chroots and namespaces
> > > as well.
> >
> > While I completely agree that most users don't want global events, the
> > antimalware vendors who today unprotect and hack the syscall table on
> > their unsuspecting customers' machines to intercept every read, write,
> > open, close, mmap, etc. syscall want EXACTLY that.
>
> I understand that "global" is what those guys get today for lack of a
> reasonable mechanism, but it's not what anybody can be given by fanotify: it
> conflicts with filesystem namespaces.
>
> Consider running several "virtual machines" in separate namespaces on the same
> kernel. With "global" you are forced to run the same global fanotify
> listeners everywhere; with per-mount-point listeners, you can choose
> between "global" and something more fine-grained by identifying which
> vfsmounts you are interested in. (Filesystem namespaces correspond to
> vfsmount hierarchies.)

Let me start by saying I am agreeing I should pursue subtree
notification. It's what I think everyone really wants. It's a great
idea, and I think you might have a simple way to get close. Clearly
these are avenues I'm willing and hoping to pursue. Also I say it
again, I believe the interface as proposed (except maybe some of my
exclusion stuff) is flexible enough to implement any of these ideas.
Does anyone disagree?

BUT to solve one of the main problems fanotify is intending to solve it
needs a way to be the 'fscking all notifier.' It needs to be the whole
damn system. I totally agree that what I have in my tree today (yet
unposted) restricting global notification (CAP_SYS_ADMIN) is highly
inadequate. If any root task in any namespace could easily hop on out
of its namespace using fanotify, that's a problem. No arguments from
me.

But there must be a way for fanotify to globally get everything. That's
one of the main points of fanotify. It needs to be a fscking all
notifier, even of things in a completely detached namespace. AV vendors
are going to get it. Their customers, our users, are going to load kernel
modules that do horrible things. These are the realities of the world
in which we live. Do we really throw tens or hundreds of thousands of our
users under the bus because we don't like the software they are using on
philosophical grounds?

I'm sure namespace people are calling me an idiot and telling me to stay in
my namespace. I want to stay in my namespace for 'most' root users, but
I need a way to get a global scanner. I want to know what the sanest
way is. And for people who feel it's insane, just don't compile it in.
I'll make global listeners a build option. But global listeners are an
absolute requirement. I was considering saying you needed CAP_SYS_ADMIN
and you needed current->nsproxy->mnt_ns == the original init task's
mnt_ns. Maybe this isn't a great way to determine if a task should be
allowed to use global listeners. Is there a better way to restrict it?

Think about your web hosting company. They sell 'cheap' VMs to
customers in a private namespace. The web hosting company wants to run an AV
scanner that scans every file on the computer: their files, their
customers' files, everything. Certainly we don't want the customer to
break out of their namespace. So, what is the sanest way (even if you hate
the idea so much you compile it out) to let the hosting company get
information about files in their customers' detached namespaces while not
letting their customers get information about each other?

-Eric

From: Andreas Gruenbacher on
On Saturday, 19 September 2009 5:04:31 Eric Paris wrote:
> Let me start by saying I am agreeing I should pursue subtree
> notification. It's what I think everyone really wants. It's a great
> idea, and I think you might have a simple way to get close. Clearly
> these are avenues I'm willing and hoping to pursue. Also I say it
> again, I believe the interface as proposed (except maybe some of my
> exclusion stuff) is flexible enough to implement any of these ideas.
> Does anyone disagree?

It does seem flexible enough. However, the current interface assumes "global"
listeners (the mask argument of fanotify_init):

    int fanotify_init(int flags, int f_flags, __u64 mask,
                      unsigned int priority);

Once subtree support is added, this parameter becomes obsolete. That's pretty
broken for a syscall yet to be introduced.

> BUT to solve one of the main problems fanotify is intending to solve it
> needs a way to be the 'fscking all notifier.' It needs to be the whole
> damn system.

Think of a system after boot, with a single global namespace. Whatever you
access by filename is reachable from the namespace root. At this point,
nothing more global exists. A listener can watch the mount points of
interest, and everything's fine.

What's a bit more tricky is to ensure that this listener will continue to
receive all events from whatever else is mounted anywhere, irrespective of
namespaces. I think we can get there.

By the way, Documentation/filesystems/sharedsubtree.txt describes how
filesystem namespaces work.

Thanks,
Andreas
From: Jamie Lokier on
Andreas Gruenbacher wrote:
> On Saturday, 19 September 2009 5:04:31 Eric Paris wrote:
> > Let me start by saying I am agreeing I should pursue subtree
> > notification. It's what I think everyone really wants. It's a great
> > idea, and I think you might have a simple way to get close. Clearly
> > these are avenues I'm willing and hoping to pursue. Also I say it
> > again, I believe the interface as proposed (except maybe some of my
> > exclusion stuff) is flexible enough to implement any of these ideas.
> > Does anyone disagree?
>
> It does seem flexible enough. However, the current interface assumes "global"
> listeners (the mask argument of fanotify_init):
>
>     int fanotify_init(int flags, int f_flags, __u64 mask,
>                       unsigned int priority);
>
> Once subtree support is added, this parameter becomes obsolete. That's pretty
> broken for a syscall yet to be introduced.
>
> > BUT to solve one of the main problems fanotify is intending to solve it
> > needs a way to be the 'fscking all notifier.' It needs to be the whole
> > damn system.
>
> Think of a system after boot, with a single global namespace. Whatever you
> access by filename is reachable from the namespace root. At this point,
> nothing more global exists. A listener can watch the mount points of
> interest, and everything's fine.
>
> What's a bit more tricky is to ensure that this listener will continue to
> receive all events from whatever else is mounted anywhere, irrespective of
> namespaces. I think we can get there.

I think so too, and that'd be a great all-round solution.

We _have_ to receive mount & umount events to do this. But even
inotify-style tracking needs those if it's to be accurate, so it's not
an additional burden.

It would be logical if fanotify could block and ack those in the same
way as it can block and ack other accesses (with the usual filtering
rules on which inodes trigger events, and which don't or are cached).

For example, to prevent "mount --bind innocent .bash_login", but also to
ensure the listener always knows what's mounted when another event occurs.

> By the way, Documentation/filesystems/sharedsubtree.txt describes how
> filesystem namespaces work.

Fortunately, after making a new namespace you can read the mounts in
the new namespace from /proc/self/mount* (I think) without having to
know anything about the shared subtree rules.
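
A minimal sketch of reading that list, assuming the usual mounts layout and
the glibc <mntent.h> helpers; count_mounts() and count_own_mounts() are
hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Count the entries in a mounts-format stream (same layout as
 * /proc/self/mounts). If fstype is non-NULL, count only entries of
 * that filesystem type.
 */
int count_mounts(FILE *stream, const char *fstype)
{
    struct mntent *ent;
    int count = 0;

    assert(stream != NULL);
    while ((ent = getmntent(stream)) != NULL) {
        if (fstype == NULL || strcmp(ent->mnt_type, fstype) == 0)
            count++;
    }
    return count;
}

/* Count the mounts visible in the calling process's own namespace. */
int count_own_mounts(void)
{
    FILE *f = setmntent("/proc/self/mounts", "r");
    int n;

    if (f == NULL)
        return -1;
    n = count_mounts(f, NULL);
    endmntent(f);
    return n;
}
```

A process that has just entered a new namespace could call
count_own_mounts(), or walk the same entries itself, to see what is mounted
there.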

So to follow monitoring/checking across all namespaces, it would (I
think) be enough to receive a fanotify "new namespace" event, and ack
that event to allow the CLONE_NEWNS to proceed. It's still tricky stuff
though.

-- Jamie