Enhance perf to collect KVM guest os statistics from host side [Kernel]

From: Zhang, Yanmin on 22 Mar 2010 03:30

On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote:
> Nice progress!
>
> This bit:
>
> > 1) perf kvm top
> > [root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
> > --guestmodules=/home/ymzhang/guest/modules top
>
> Will really be painful to developers - to enter that long line while we
> have these things called 'computers' that ought to reduce human work. Also,
> it's incomplete, we need access to the guest system's binaries to do ELF
> symbol resolution and dwarf decoding.
Yes, I agree with you and Avi that we need the enhancement to be user-friendly.
One of my starting points is to keep the tool's dependencies on other
components low. Admins/developers could write script wrappers quickly if
perf has parameters to support the new capability.
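
A minimal sketch of such a wrapper, in Python. It only assembles the options
already shown in the command line above (--host, --guest, --guestkallsyms,
--guestmodules). The helper name and the assumption that one directory holds
files named 'kallsyms' and 'modules' are illustrative, not part of perf.

```python
import shlex
from pathlib import PurePosixPath


def perf_kvm_argv(guest_dir, subcommand="top", host=True):
    """Build the long 'perf kvm' command line from one guest directory.

    Assumption: guest_dir holds copies of the guest's kallsyms and
    modules listings, named 'kallsyms' and 'modules'.
    """
    guest_dir = PurePosixPath(guest_dir)
    argv = ["perf", "kvm"]
    if host:
        argv.append("--host")
    argv += [
        "--guest",
        f"--guestkallsyms={guest_dir / 'kallsyms'}",
        f"--guestmodules={guest_dir / 'modules'}",
        subcommand,
    ]
    return argv


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(perf_kvm_argv("/home/ymzhang/guest")))
```

The list form can be handed to subprocess.run() unchanged, which avoids
shell quoting problems with unusual paths.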

>
> So we really need some good, automatic way to get to the guest symbol space,
> so that if a developer types:
>
> perf kvm top
>
> Then the obvious thing happens by default. (which is to show the guest
> overhead)
>
> There's no technical barrier on the perf tooling side to implement all that:
> perf supports build-ids extensively and can deal with multiple symbol spaces -
> as long as it has access to it. The guest kernel could be ID-ed based on its
> /sys/kernel/notes and /sys/module/*/notes/.note.gnu.build-id build-ids.
I tried sshfs quickly. sshfs could mount the root filesystem of the guest os
nicely, and I could access the files quickly. However, it doesn't work when I
access /proc/ and /sys/ because sshfs/scp depend on the file size, while most
files under /proc/ and /sys/ report a size of 0.
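
To illustrate the size problem: a reader that trusts the reported size gets
nothing from such files, while a reader that loops until read() returns no
data gets the content. The sketch below shows only that local reading
pattern. It does not change how sshfs behaves, and the function name is made
up for the example.

```python
import os


def read_until_eof(path, chunk_size=4096):
    """Read a whole file without trusting st_size.

    Files under /proc and /sys usually report a size of 0 even though
    read() returns data, so keep reading until read() returns b"".
    """
    chunks = []
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            data = os.read(fd, chunk_size)
            if not data:  # empty read means end of file
                break
            chunks.append(data)
    finally:
        os.close(fd)
    return b"".join(chunks)
```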

>
> So some sort of --guestmount option would be the natural solution, which
> points to the guest system's root: and a Qemu enumeration of guest mounts
> (which would be off by default and configurable) from which perf can pick up
> the target guest all automatically. (obviously only under allowed permissions
> so that such access is secure)
If sshfs could access /proc/ and /sys correctly, here is a design:
--guestmount points to a directory which contains a list of sub-directories.
Every sub-directory's name is just the qemu process id of a guest os. The
admin/developer mounts every guest os instance's root directory on the
corresponding sub-directory.

Then, perf could access all files. It's possible because a guest os instance
happens to run as multiple threads within one process. One drawback is that
access to the guest os becomes slow or impossible when the guest os is very busy.
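
The layout described above can be sketched as follows. This is only an
illustration of the directory convention (one sub-directory per qemu pid,
each holding a mounted guest root). The function names are hypothetical and
no such option exists in perf at the time of this mail.

```python
from pathlib import Path


def enumerate_guests(guestmount):
    """Map qemu pid -> mounted guest root for each numeric sub-directory."""
    guests = {}
    for entry in Path(guestmount).iterdir():
        name = entry.name
        # Only directories whose name is a plain decimal pid are guests.
        if entry.is_dir() and name.isascii() and name.isdigit():
            guests[int(name)] = entry
    return guests


def guest_kallsyms(guestmount, qemu_pid):
    """Path to one guest's kallsyms under the proposed layout."""
    return Path(guestmount) / str(qemu_pid) / "proc" / "kallsyms"
```

With this convention a sample attributed to qemu pid 1234 would be resolved
against <guestmount>/1234/proc/kallsyms, provided the mount exposes /proc.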

>
> This would allow not just kallsyms access via $guest/proc/kallsyms but also
> gives us the full space of symbol features: access to the guest binaries for
> annotation and general symbol resolution, command/binary name identification,
> etc.
>
> Such a mount would obviously not broaden existing privileges - and as an
> additional control a guest would also have a way to indicate that it does not
> wish a guest mount at all.
>
> Unfortunately, in a previous thread the Qemu maintainer has indicated that he
> will essentially NAK any attempt to enhance Qemu to provide an easily
> discoverable, self-contained, transparent guest mount on the host side.
>
> No technical justification was given for that NAK, despite my repeated
> requests to articulate the exact security problems that such an approach
> would cause.
>
> If that NAK does not stand in that form then i'd like to know about it - it
> makes no sense for us to try to code up a solution against a standing
> maintainer NAK ...
>
> The other option is some sysadmin level hackery to NFS-mount the guest or so.
> This is a vastly inferior method that brings us back to the abysmal usability
> levels of OProfile:
>
> 1) it won't be guest transparent
> 2) has to be re-done for every guest image.
> 3) even if packaged it has to be gotten into every. single. Linux. distro. separately.
> 4) old Linux guests won't work out of box
>
> In other words: it's very inconvenient on multiple levels and won't ever happen
> on any reasonable enough scale to make a difference to Linux.
>
> Which is an unfortunate situation - and the ball is on the KVM/Qemu side so i
> can do little about it.
>
> Thanks,
>
> Ingo

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Ingo Molnar on 22 Mar 2010 06:40

* Joerg Roedel <joro(a)8bytes.org> wrote:

> > It can decide whether it exposes the files. Nor are there any "security
> > issues" to begin with.
>
> I am not talking about security. [...]

You were talking about security, in the portion of your mail that you snipped
out, and which i replied to:

> > 2. The guest can decide for its own if it wants to pass this
> > information to the host-perf. No security issues at all.

I understood that portion to mean what it says: your claim that your
proposal 'has no security issues at all', in contrast to my suggestion.

> [...] Security was sufficiently flamed about already.

All i saw was my suggestion to allow a guest to securely (and scalably and
conveniently) integrate/mount its filesystems to the host if both sides (both
the host and the guest) permit it, to make it easier for instrumentation to
pick up symbol details.

I.e. if a guest runs then its filesystem may be present on the host side as:

/guests/Fedora-G1/
/guests/Fedora-G1/proc/
/guests/Fedora-G1/usr/
/guests/Fedora-G1/.../

( This feature would be configurable and would be default-off, to maintain the
current status quo. )

i.e. it's a bit like sshfs or NFS or loopback block mounts, just in an
integrated and working fashion (sshfs doesn't work well with /proc for example)
and more guest transparent (obviously sshfs or NFS exports need per guest
configuration), and lower overhead than sshfs/NFS - i.e. without the
(unnecessary) networking overhead.

That suggestion was 'countered' by an unsubstantiated claim by Anthony that
this kind of usability feature would somehow be a 'security nightmare'.

In reality it is just an incremental, more usable, faster and more
guest-transparent form of what is already possible today via:

- loopback mounts on host
- NFS exports
- SMB exports
- sshfs
- (and other mechanisms)

I wish there was at least flaming about it - as flames tend to have at least
some specifics in them.

What i saw instead was a claim about a 'security nightmare', which, when i
asked for specifics, was followed by deafening silence. And you appear to have
repeated that claim here, unwilling to back it up with specifics.

Thanks,

Ingo

From: Ingo Molnar on 22 Mar 2010 07:00

* Joerg Roedel <joro(a)8bytes.org> wrote:

> On Sun, Mar 21, 2010 at 07:43:00PM +0100, Ingo Molnar wrote:
> > Having access to the actual executable files that include the symbols achieves
> > precisely that - with the additional robustness that all this functionality is
> > concentrated into the host, while the guest side is kept minimal (and
> > transparent).
>
> If you want to access the guest's file-system you need a piece of software
> running in the guest which gives you this access. But when you get an event
> this piece of software may not be runnable (if the guest is in an interrupt
> handler or any other non-preemptible code path). When the host finally gets
> access to the guest's filesystem again the source of that event may already
> be gone (process has exited, module unloaded...). The only way to solve that
> is to pass the event information to the guest immediately and let it collect
> the information we want.

The very same is true of profiling in the host space as well (KVM is nothing
special here, other than its unreasonable insistence on not enumerating
readily available information in a more usable way).

So are you suggesting a solution to a perf problem we already solved
differently? (and which i argue we solved in a better way)

We have solved that in the host space already (and quite elaborately so), and
not via your suggestion of moving symbol resolution to a different stage, but
by properly generating the right events to allow the post-processing stage to
see processes that have already exited, to robustly handle files that have
been rebuilt, etc.

From an instrumentation POV it is fundamentally better to acquire the right
data and delay any complexities to the analysis stage (the perf model) than to
complicate sampling (the oprofile dcookies model).
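
A toy sketch of that split, with made-up names (this is not perf's actual
data model): the recording side only appends raw mmap and sample events, and
address-to-DSO resolution happens afterwards from the recorded events, so it
still works when the sampled process is already gone.

```python
from collections import defaultdict


class Session:
    """Toy recording: store raw events now, resolve symbols later."""

    def __init__(self):
        self.maps = defaultdict(list)  # pid -> [(start, end, dso)]
        self.samples = []              # [(pid, ip)]

    def on_mmap(self, pid, start, length, dso):
        # Recorded when a file is mapped; keeps the address range and name.
        self.maps[pid].append((start, start + length, dso))

    def on_sample(self, pid, ip):
        # Sampling stays cheap: no lookup, just store the raw values.
        self.samples.append((pid, ip))

    def resolve(self):
        """Analysis stage: needs only the recorded events, not a live pid."""
        out = []
        for pid, ip in self.samples:
            hit = None
            for start, end, dso in self.maps.get(pid, []):
                if start <= ip < end:
                    hit = (dso, ip - start)  # later mappings override earlier
            out.append((pid, ip, hit))
        return out
```

A sample that falls outside every recorded mapping resolves to None, which
the analysis stage can report as unknown instead of failing at sample time.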

Your proposal of 'doing the symbol resolution in the guest context' is in
essence re-arguing a very similar point that oprofile lost. Did you really
intend to re-argue that point as well? If yes then please propose an
alternative implementation for everything that perf does wrt. symbol lookups.

What we propose for 'perf kvm' right now is simply a straight-forward
extension of the existing (and well working) symbol handling code to
virtualization.

> > You need to be aware of the fact that symbol resolution is a separate step
> > from call chain generation.
>
> Same concern as above applies to call-chain generation too.

Best would be if you demonstrated any problems of the perf symbol lookup code
you are aware of on the host side, as it has that exact design you are
criticising here. We are eager to fix any bugs in it.

If you claim that it's buggy then that should very much be demonstrable - no
need to go into theoretical arguments about it.

( You should be aware of the fact that perf currently works with 'processes
exiting prematurely' and similar scenarios just fine, so if you want to
demonstrate that it's broken you will probably need a different example. )

> > > How we speak to the guest was already discussed in this thread. My
> > > personal opinion is that going through qemu is an unnecessary step and
> > > we can solve that more clever and transparent for perf.
> >
> > Meaning exactly what?
>
> Avi was against that but I think it would make sense to give names to
> virtual machines (with a default, similar to network interface names). Then
> we can create a directory in /dev/ with that name (e.g. /dev/vm/fedora/).
> Inside the guest a (privileged) process can create some kind of named
> virt-pipe which results in a device file created in the guest's directory
> (perf could create /dev/vm/fedora/perf for example). This file is used for
> guest-host communication.

That is kind of half of my suggestion - the built-in enumeration of guests and a
guaranteed channel to them accessible to tools. (KVM already has its own
special channel so it's not like channels of communication are useless.)

The other half of my suggestion is that if we bring this thought to its
logical conclusion then we might as well walk the whole mile and not use
quirky, binary API single-channel pipes. I.e. we could use this convenient,
human-readable, structured, hierarchical abstraction to expose information in
a fine-grained, scalable way, which has a world-class implementation in Linux:
the 'VFS namespace'.

Thanks,

Ingo

From: Ingo Molnar on 22 Mar 2010 08:30

* Joerg Roedel <joro(a)8bytes.org> wrote:

> On Mon, Mar 22, 2010 at 11:59:27AM +0100, Ingo Molnar wrote:
> > Best would be if you demonstrated any problems of the perf symbol lookup code
> > you are aware of on the host side, as it has that exact design you are
> > criticising here. We are eager to fix any bugs in it.
> >
> > If you claim that it's buggy then that should very much be demonstrable - no
> > need to go into theoretical arguments about it.
>
> I am not claiming anything. I am just trying to imagine what your proposal
> will look like in practice and forgot that symbol resolution is done at a
> later point.
>
> But even with deferred symbol resolution we need more information from the
> guest than just the rip falling out of KVM. The guest needs to tell us about
> the process where the event happened (information that the host has about
> itself without any hassle) and which executable-files it was loaded from.

Correct - for full information we need a good paravirt perf integration of the
kernel bits to pass that through. (I.e. we want to 'integrate' the PID space
as well, at least within the perf notion of PIDs.)

Initially we can do without that as well.

> Probably. At least it is the solution that fits best into the current design
> of perf. But we should think about how this will be done. Raw disk access is
> no solution because we need to access virtual file-systems of the guest too.
> [...]

I never said anything about 'raw disk access'. Have you seen my proposal of
(optional) VFS namespace integration? (It can be found repeated the Nth time
in my mail you replied to)

Thanks,

Ingo

From: Zhang, Yanmin on 22 Mar 2010 23:20

On Mon, 2010-03-22 at 13:44 -0300, Arnaldo Carvalho de Melo wrote:
> Em Mon, Mar 22, 2010 at 03:24:47PM +0800, Zhang, Yanmin escreveu:
> > On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote:
> > > So some sort of --guestmount option would be the natural solution, which
> > > points to the guest system's root: and a Qemu enumeration of guest mounts
> > > (which would be off by default and configurable) from which perf can pick up
> > > the target guest all automatically. (obviously only under allowed permissions
> > > so that such access is secure)
> > If sshfs could access /proc/ and /sys correctly, here is a design:
> > --guestmount points to a directory which consists of a list of sub-directories.
> > Every sub-directory's name is just the qemu process id of guest os. Admin/developer
> > mounts every guest os instance's root directory to corresponding sub-directory.
> >
> > Then, perf could access all files. It's possible because guest os instance
> > happens to be multi-threading in a process. One of the defects is the accessing to
> > guest os becomes slow or impossible when guest os is very busy.
>
> If the MMAP events on the guest included a cookie that could later be
> used to query for the symtab of that DSO, we wouldn't need to access the
> guest FS at all, right?
It depends on specific sub commands. As for 'perf kvm top', developers want to see
the profiling immediately. Even with 'perf kvm record', developers also want to
see results quickly. At least I'm eager for the results when investigating
a performance issue.

>
> With build-ids and debuginfo-install like tools the symbol resolution
> could be performed by using the cookies (build-ids) as keys to get to
> the *-debuginfo packages with matching symtabs (and DWARF for source
> annotation, etc).
We can't be sure the guest os uses the same os images, and we may not know
where to find the original DVD images that were used to install the guest os.

Current perf does save build ids, both the kernel's and those of other
application libs/executables.
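
As a rough illustration of a build-id keyed lookup: debuginfo is commonly
stored under a .build-id directory that is split after the first two hex
digits of the id. The root directory and the suffix below are assumptions
chosen for the example, and nothing here says where a given guest's packages
can actually be found.

```python
from pathlib import PurePosixPath

HEX_DIGITS = set("0123456789abcdef")


def build_id_path(root, build_id_hex, suffix=""):
    """Return <root>/.build-id/<first 2 digits>/<remaining digits><suffix>."""
    build_id_hex = build_id_hex.lower()
    if len(build_id_hex) < 3 or not set(build_id_hex) <= HEX_DIGITS:
        raise ValueError("build-id must be a hex string of at least 3 digits")
    return (PurePosixPath(root) / ".build-id"
            / build_id_hex[:2] / (build_id_hex[2:] + suffix))
```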

>
> We have that for the kernel as:
>
> [acme@doppio linux-2.6-tip]$ l /sys/kernel/notes
> -r--r--r-- 1 root root 36 2010-03-22 13:14 /sys/kernel/notes
> [acme@doppio linux-2.6-tip]$ l /sys/module/ipv6/sections/.note.gnu.build-id
> -r--r--r-- 1 root root 4096 2010-03-22 13:38 /sys/module/ipv6/sections/.note.gnu.build-id
> [acme@doppio linux-2.6-tip]$
>
> That way we would cover DSOs being reinstalled in long running 'perf
> record' sessions too.
Supporting long-running sessions is one of perf's objectives.
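
For reference, a minimal sketch of pulling the build-id out of a notes blob
such as the content of /sys/kernel/notes. It follows the standard ELF note
layout (three 32-bit words, then name and descriptor, each padded to 4
bytes). The little-endian default is an assumption that fits x86 hosts. The
36-byte size in the listing above is consistent with a single note carrying
a 20-byte id: 12 bytes of header, 4 bytes of name, 20 bytes of descriptor.

```python
import struct

NT_GNU_BUILD_ID = 3  # note type used for GNU build-ids


def _align4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def find_build_id(notes, byteorder="<"):
    """Return the GNU build-id in an ELF notes blob as hex, or None."""
    off = 0
    while off + 12 <= len(notes):
        namesz, descsz, ntype = struct.unpack_from(byteorder + "III", notes, off)
        off += 12
        name = notes[off:off + namesz]
        off += _align4(namesz)
        desc = notes[off:off + descsz]
        off += _align4(descsz)
        # Accept only a complete descriptor owned by "GNU".
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\0" and len(desc) == descsz:
            return desc.hex()
    return None
```

The returned hex string is the key that a build-id based lookup would use to
match a guest kernel or module against the right symbol file.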

>
> This was discussed some time ago but would require help from the bits
> that load DSOs.
>
> build-ids then would be first class citizens.
>
> - Arnaldo
