From: Joerg Roedel
On Wed, Mar 24, 2010 at 02:08:17PM +0200, Avi Kivity wrote:
> On 03/24/2010 01:59 PM, Joerg Roedel wrote:

> You can always provide the kernel and module paths as command line
> parameters. It just won't be transparently usable, but if you're using
> qemu from the command line, presumably you can live with that.

I don't want the tool for myself only. A typical perf user expects it
to work transparently.

>> Could be easily done using notifier chains already in the kernel.
>> Probably implemented with much less than 100 lines of additional code.
>
> And a userspace interface for that.

Not necessarily. The perf event is configured by userspace to measure
kvm system-wide. The kernel side of perf ensures that the measurement
stays system-wide even as VM instances are added. So in this case the
consumer of the notifier would be the perf kernel part. No userspace
interface required.
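To illustrate, here is a rough sketch of what I have in mind. The
kvm-side hook and the action values are made-up names, just to show
the shape of the idea, not existing kvm.ko interfaces:

	#include <linux/notifier.h>
	#include <linux/kvm_host.h>

	static RAW_NOTIFIER_HEAD(kvm_vm_chain);

	/* kvm.ko would call this on VM creation and teardown, e.g.
	 * with hypothetical KVM_VM_CREATED/KVM_VM_DESTROYED actions */
	int kvm_notify_vm_event(unsigned long action, struct kvm *kvm)
	{
		return raw_notifier_call_chain(&kvm_vm_chain, action, kvm);
	}

	/* the perf kernel part subscribes so that a system-wide
	 * measurement picks up VMs created after perf has started */
	static int perf_kvm_vm_event(struct notifier_block *nb,
				     unsigned long action, void *data)
	{
		/* data is the struct kvm of the affected guest;
		 * attach to or detach from it here */
		return NOTIFY_OK;
	}

	static struct notifier_block perf_kvm_nb = {
		.notifier_call = perf_kvm_vm_event,
	};

	/* registered once from the perf side:
	 * raw_notifier_chain_register(&kvm_vm_chain, &perf_kvm_nb); */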

> If we make an API, I'd like it to be generally useful.

That's hard to do at this point since we don't know what people will
use it for. We should keep it simple in the beginning and add new
features as they are requested and make sense in this context.

> It's a total headache. For example, we'd need security module hooks to
> determine access permissions. So far we managed to avoid that since kvm
> doesn't allow you to access any information beyond what you provided it
> directly.

Depends on how it is designed. A filesystem approach was already
mentioned. We could create /sys/kvm/ for example to expose information
about virtual machines to userspace. This would not require any new
security hooks.

> Copying the objects is a one time cost. If you run perf for more than a
> second or two, it would fetch and cache all of the data. It's really
> the same problem with non-guest profiling, only magnified a bit.

I don't think we can cache filesystem data of a running guest on the
host. It is too hard to keep such a cache coherent.

>>> Other userspaces can also provide this functionality, like they have to
>>> provide disk, network, and display emulation. The kernel is not a huge
>>> library.

If two userspaces run in parallel, which single instance can perf get a
list of guests from?

> kvm.ko has only a small subset of the information that is used to define
> a guest.

The subset is not small. It contains all guest vcpus and the complete
interrupt routing hardware emulation, and it even manages the guest's
memory.

Joerg

From: Joerg Roedel
On Wed, Mar 24, 2010 at 03:05:02PM +0200, Avi Kivity wrote:
> On 03/24/2010 02:50 PM, Joerg Roedel wrote:

>> I don't want the tool for myself only. A typical perf user expects it
>> to work transparently.
>
> A typical kvm user uses libvirt, so we can integrate it with that.

Someone who uses libvirt and virt-manager by default is probably not
interested in this feature at the same level as a kvm developer. And
developers tend not to use libvirt for low-level kvm development. A
number of developers have stated in this thread already that they would
appreciate a solution for guest enumeration that does not involve
libvirt.

> Someone needs to know about the new guest to fetch its symbols. Or do
> you want that part in the kernel too?

The samples will be tagged with the guest name (and some additional
information perf needs). Perf userspace can then access the symbols
through /sys/kvm/guest0/fs/...
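To make that concrete, the tag could look something like this; the
struct and field names are hypothetical, just to show what a tagged
sample would carry:

	#include <linux/types.h>

	/* hypothetical per-sample guest tag, not an existing
	 * perf ABI structure */
	struct perf_guest_sample_info {
		__u32	pid;		/* pid of the qemu process */
		char	name[64];	/* guest name, e.g. "guest0" */
	};

Perf userspace would take the name out of the sample and prepend
/sys/kvm/<name>/fs/ to the recorded binary path before opening it for
symbol resolution.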

>> Depends on how it is designed. A filesystem approach was already
>> mentioned. We could create /sys/kvm/ for example to expose information
>> about virtual machines to userspace. This would not require any new
>> security hooks.
>
> Who would set the security context on those files?

An approach like "the files are owned and only readable by the same
user that started the vm" might be a good start. That way a user can
measure their own guests and root can measure all of them.

> Plus, we need cgroup support so you can't see one container's guests
> from an unrelated container.

cgroup support is an issue, but we can solve that too. It is still
generally less complex than going through the whole libvirt-qemu-kvm
stack.

> Integration with qemu would allow perf to tell us that the guest is
> hitting the interrupt status register of a virtio-blk device in pci
> slot 5 (the information is already available through the kvm_mmio
> trace event, but only qemu can decode it).

Yeah, that would be interesting information. But it is more related to
tracing than to pmu measurements. The information you mention above is
probably better captured by an extension of trace-events to userspace.

Joerg

From: Joerg Roedel
On Wed, Mar 24, 2010 at 03:57:39PM +0200, Avi Kivity wrote:
> On 03/24/2010 03:46 PM, Joerg Roedel wrote:

>> Someone who uses libvirt and virt-manager by default is probably not
>> interested in this feature at the same level as a kvm developer. And
>> developers tend not to use libvirt for low-level kvm development. A
>> number of developers have stated in this thread already that they would
>> appreciate a solution for guest enumeration that does not involve
>> libvirt.
>
> So would I.

Great.

> But when I weigh the benefit of truly transparent system-wide perf
> integration for users who don't use libvirt but do use perf, versus
> the cost of transforming kvm from a single-process API to a
> system-wide API with all the complications that I've listed, it comes
> out in favour of not adding the API.

It's not a transformation, it's an extension. The current per-process
/dev/kvm stays mostly untouched. It's all about having something like
this:

$ cd /sys/kvm/guest0
$ ls -l
-r-------- 1 root root 0 2009-08-17 12:05 name
dr-x------ 1 root root 0 2009-08-17 12:05 fs
$ cat name
guest0
$ # ...

The fs/ directory is used as the mount point for the guest root fs.

>> The samples will be tagged with the guest name (and some additional
>> information perf needs). Perf userspace can then access the symbols
>> through /sys/kvm/guest0/fs/...
>
> I take that as a yes? So we need a virtio-serial client in the kernel
> (which might be exploitable by a malicious guest if buggy) and a
> fs-over-virtio-serial client in the kernel (also exploitable).

What I meant was: perf-kernel puts the guest name into every sample,
and perf-userspace later accesses /sys/kvm/guest_name/fs/ to resolve
the symbols. I leave the question of how the guest fs is exposed to the
host out of this discussion. We should discuss that separately.


>> An approach like "the files are owned and only readable by the same
>> user that started the vm" might be a good start. That way a user can
>> measure their own guests and root can measure all of them.
>
> That's not how sVirt works. sVirt isolates a user's VMs from each
> other, so if a guest breaks into qemu it can't break into other guests
> owned by the same user.

If a vm breaks into qemu it can access the host file system, which is
the bigger problem. In this case there is no isolation anymore. From
that context it can even kill other VMs of the same user, independent
of a hypothetical /sys/kvm/.

>> Yeah, that would be interesting information. But it is more related
>> to tracing than to pmu measurements. The information you mention
>> above is probably better captured by an extension of trace-events to
>> userspace.
>
> It's all related. You start with perf, see a problem with mmio, call up
> a histogram of mmio or interrupts or whatever, then zoom in on the
> misbehaving device.

Yes, but it's different from the implementation point of view. For the
user it surely all plays together.

Joerg

From: Joerg Roedel
On Wed, Mar 24, 2010 at 03:26:53PM +0000, Daniel P. Berrange wrote:
> On Wed, Mar 24, 2010 at 04:01:37PM +0100, Joerg Roedel wrote:
> > >> An approach like "the files are owned and only readable by the same
> > >> user that started the vm" might be a good start. That way a user can
> > >> measure their own guests and root can measure all of them.
> > >
> > > That's not how sVirt works. sVirt isolates a user's VMs from each
> > > other, so if a guest breaks into qemu it can't break into other guests
> > > owned by the same user.
> >
> > If a vm breaks into qemu it can access the host file system, which is
> > the bigger problem. In this case there is no isolation anymore. From
> > that context it can even kill other VMs of the same user, independent
> > of a hypothetical /sys/kvm/.
>
No it can't. With sVirt every single VM has a custom security label and
the policy only allows it access to disks/files with a matching label,
and prevents it attacking any other VMs or processes on the host. This
confines the scope of any exploit in QEMU to those resources the admin
has explicitly assigned to the guest.

Even better. So a guest which breaks out can't even access its own
/sys/kvm/ directory. Perfect, it doesn't need that access anyway.

Joerg

From: Joerg Roedel
On Wed, Mar 24, 2010 at 05:12:55PM +0200, Avi Kivity wrote:
> On 03/24/2010 05:01 PM, Joerg Roedel wrote:
>> $ cd /sys/kvm/guest0
>> $ ls -l
>> -r-------- 1 root root 0 2009-08-17 12:05 name
>> dr-x------ 1 root root 0 2009-08-17 12:05 fs
>> $ cat name
>> guest0
>> $ # ...
>>
>> The fs/ directory is used as the mount point for the guest root fs.
>
> The problem is /sys/kvm, not /sys/kvm/fs.

I am not tied to /sys/kvm. We could also use /proc/<pid>/kvm/, for
example. This would keep everything in the process space (except for
the global list of VMs, which we should have anyway).

>> What I meant was: perf-kernel puts the guest name into every sample,
>> and perf-userspace later accesses /sys/kvm/guest_name/fs/ to resolve
>> the symbols. I leave the question of how the guest fs is exposed to the
>> host out of this discussion. We should discuss that separately.
>
> How I see it: perf-kernel puts the guest pid into every sample, and
> perf-userspace uses that to resolve to a mountpoint served by fuse, or
> to a unix domain socket that serves the files.

We need a bit more information than just the qemu-pid, but yes, this
would also work out.
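On the userspace side that resolution step would be trivial either way;
roughly like this, with the mountpoint name made up and the pid taken
from the sample:

	#include <stdio.h>

	/* hypothetical: map the qemu pid carried in a sample plus the
	 * guest-side binary path to a host-side path served by the
	 * fuse mount (the mountpoint is an assumption) */
	static void guest_path(char *buf, size_t len,
			       int qemu_pid, const char *binary)
	{
		snprintf(buf, len, "/var/run/perf-guest-fs/%d%s",
			 qemu_pid, binary);
	}

From there perf can open the file and read the symbol table exactly as
it does for a host binary.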

>> If a vm breaks into qemu it can access the host file system, which is
>> the bigger problem. In this case there is no isolation anymore. From
>> that context it can even kill other VMs of the same user, independent
>> of a hypothetical /sys/kvm/.
>
> It cannot. sVirt labels the disk image and other files qemu needs with
> the appropriate label, and everything else is off limits. Even if you
> run the guest as root, it won't have access to other files.

See my reply to Daniel's email.

>> Yes, but it's different from the implementation point of view. For the
>> user it surely all plays together.
>
> We need qemu to cooperate for mmio tracing, and we can cooperate with
> qemu for symbol resolution. If it prevents adding another kernel API,
> that's a win from my POV.

That's true. Probably qemu can inject this information into the
kvm-trace-events stream.
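A record like the following, injected by qemu next to the raw kvm_mmio
event, would already be enough; the struct is hypothetical and only
shows the idea:

	#include <stdint.h>

	/* hypothetical annotation qemu could emit alongside a
	 * kvm_mmio trace event it has decoded */
	struct qemu_mmio_annotation {
		uint64_t gpa;		/* guest physical address of the access */
		char	 device[32];	/* e.g. "virtio-blk" */
		int	 pci_slot;	/* e.g. 5 */
		char	 reg[32];	/* e.g. "interrupt status" */
	};

With that, a tool can report "interrupt status register of the
virtio-blk device in pci slot 5" instead of a bare guest physical
address.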

Joerg
