Enhance perf to collect KVM guest os statistics from host side [Kernel]

Prev: + tmpfs-fix-oops-on-remounts-with-mpol=default.patch added to -mm tree
Next: [PATCH 5/5] doc: add the documentation for mpol=local

From: Zhang, Yanmin on 18 Mar 2010 03:50

On Thu, 2010-03-18 at 10:45 +0800, Zhang, Yanmin wrote:
> On Wed, 2010-03-17 at 17:26 +0800, Zhang, Yanmin wrote:
> > On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote:
> > > * Zhang, Yanmin <yanmin_zhang(a)linux.intel.com> wrote:
> > >
> > > > On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote:
> > > > > On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote:
> > > > > > On 03/16/2010 07:27 AM, Zhang, Yanmin wrote:
> > > > > > > From: Zhang, Yanmin<yanmin_zhang(a)linux.intel.com>
> > > > > > >
> > > > > > > Based on the discussion in KVM community, I worked out the patch to support
> > > > > > > perf to collect guest os statistics from host side. This patch is implemented
> > > > > > > with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a
> > > > > > > critical bug and provided good suggestions with other guys. I really appreciate
> > > > > > > their kind help.
> > > > > > >
> > > > > > > The patch adds new subcommand kvm to perf.
> > > > > > >
> > > > > > > perf kvm top
> > > > > > > perf kvm record
> > > > > > > perf kvm report
> > > > > > > perf kvm diff
> > > > > > >
> > > > > > > The new perf could profile guest os kernel except guest os user space, but it
> > > > > > > could summarize guest os user space utilization per guest os.
> > > > > > >
> > > > > > > Below are some examples.
> > > > > > > 1) perf kvm top
> > > > > > > [root(a)lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
> > > > > > > --guestmodules=/home/ymzhang/guest/modules top
> > > > > > >
> > > > > > >
> > > > > >
> > > > > Thanks for your kind comments.
> > > > >
> > > > > > Excellent, support for guest kernel != host kernel is critical (I can't
> > > > > > remember the last time I ran same kernels).
> > > > > >
> > > > > > How would we support multiple guests with different kernels?
> > > > > With the patch, 'perf kvm report --sort pid" could show
> > > > > summary statistics for all guest os instances. Then, use
> > > > > parameter --pid of 'perf kvm record' to collect single problematic instance data.
> > > > Sorry. I found currently --pid isn't process but a thread (main thread).
> > > >
> > > > Ingo,
> > > >
> > > > Is it possible to support a new parameter or extend --inherit, so 'perf
> > > > record' and 'perf top' could collect data on all threads of a process when
> > > > the process is running?
> > > >
> > > > If not, I need add a new ugly parameter which is similar to --pid to filter
> > > > out process data in userspace.
> > >
> > > Yeah. For maximum utility i'd suggest to extend --pid to include this, and
> > > introduce --tid for the previous, limited-to-a-single-task functionality.
> > >
> > > Most users would expect --pid to work like a 'late attach' - i.e. to work like
> > > strace -f or like a gdb attach.
> >
> > Thanks Ingo, Avi.
> >
> > I worked out below patch against tip/master of March 15th.
> >
> > Subject: [PATCH] Change perf's parameter --pid to process-wide collection
> > From: Zhang, Yanmin <yanmin_zhang(a)linux.intel.com>
> >
> > Change parameter -p (--pid) to real process pid and add -t (--tid) meaning
> > thread id. Now, --pid means perf collects the statistics of all threads of
> > the process, while --tid means perf just collect the statistics of that thread.
> >
> > BTW, the patch fixes a bug of 'perf stat -p'. 'perf stat' always configures
> > attr->disabled=1 if it isn't a system-wide collection. If there is a '-p'
> > and no forks, 'perf stat -p' doesn't collect any data. In addition, the
> > while(!done) in run_perf_stat consumes 100% single cpu time which has bad impact
> > on running workload. I added a sleep(1) in the loop.
> >
> > Signed-off-by: Zhang Yanmin <yanmin_zhang(a)linux.intel.com>
> Ingo,
>
> Sorry, the patch has bugs. I need do a better job and will work out 2
> separate patches against the 2 issues.

I worked out 3 new patches against tip/master tree of Mar. 17th.

1) Patch perf_stat: Fix the issue that perf doesn't enable counters when
target_pid != -1. Change the condition to fork/exec subcommand. If there
is a subcommand parameter, perf always fork/exec it. The usage example is:
#perf stat -a sleep 10
So this command could collect statistics for 10 seconds precisely. User
still could stop it by CTRL+C.

2) Patch perf_record: Fix the issue that when perf forks/exec a subcommand,
it should enable all counters after the new process is execing.Change the
condition to fork/exec subcommand. If there is a subcommand parameter,
perf always fork/exec it. The usage example is:
#perf record -f -a sleep 10
So this command could collect statistics for 10 seconds precisely. User
still could stop it by CTRL+C.

3) perf_pid: Change parameter --pid to process-wide collection. Add --tid
which means collecting thread-wide statistics. Usage example is:
#perf top -p 8888
#perf record -p 8888 -f sleep 10
#perf stat -p 8888 -f sleep 10

Arnaldo,

Pls. apply the 3 attached patches.

Yanmin

From: Ingo Molnar on 18 Mar 2010 04:10

* Zhang, Yanmin <yanmin_zhang(a)linux.intel.com> wrote:

> I worked out 3 new patches against tip/master tree of Mar. 17th.

Cool! Mind sending them as a series of patches instead of attachment? That
makes it easier to review them. Also, the Signed-off-by lines seem to be
missing plus we need a per patch changelog as well.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Zachary Amsden on 18 Mar 2010 04:50

On 03/17/2010 07:41 PM, Sheng Yang wrote:
> On Thursday 18 March 2010 13:22:28 Sheng Yang wrote:
>
>> On Thursday 18 March 2010 12:50:58 Zachary Amsden wrote:
>>
>>> On 03/17/2010 03:19 PM, Sheng Yang wrote:
>>>
>>>> On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote:
>>>>
>>>>> On 03/16/2010 11:28 PM, Sheng Yang wrote:
>>>>>
>>>>>> On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote:
>>>>>>
>>>>>>> On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote:
>>>>>>>
>>>>>>>> On 03/16/2010 09:48 AM, Zhang, Yanmin wrote:
>>>>>>>>
>>>>>>>>> Right, but there is a scope between kvm_guest_enter and really
>>>>>>>>> running in guest os, where a perf event might overflow. Anyway,
>>>>>>>>> the scope is very narrow, I will change it to use flag PF_VCPU.
>>>>>>>>>
>>>>>>>> There is also a window between setting the flag and calling 'int
>>>>>>>> $2' where an NMI might happen and be accounted incorrectly.
>>>>>>>>
>>>>>>>> Perhaps separate the 'int $2' into a direct call into perf and
>>>>>>>> another call for the rest of NMI handling. I don't see how it
>>>>>>>> would work on svm though - AFAICT the NMI is held whereas vmx
>>>>>>>> swallows it.
>>>>>>>>
>>>>>>>> I guess NMIs
>>>>>>>> will be disabled until the next IRET so it isn't racy, just tricky.
>>>>>>>>
>>>>>>> I'm not sure if vmexit does break NMI context or not. Hardware NMI
>>>>>>> context isn't reentrant till a IRET. YangSheng would like to double
>>>>>>> check it.
>>>>>>>
>>>>>> After more check, I think VMX won't remained NMI block state for
>>>>>> host. That's means, if NMI happened and processor is in VMX non-root
>>>>>> mode, it would only result in VMExit, with a reason indicate that
>>>>>> it's due to NMI happened, but no more state change in the host.
>>>>>>
>>>>>> So in that meaning, there _is_ a window between VMExit and KVM handle
>>>>>> the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI
>>>>>> handling code because "int $2" don't have effect to block following
>>>>>> NMI.
>>>>>>
>>>>>> And if the NMI sequence is not important(I think so), then we need to
>>>>>> generate a real NMI in current vmexit-after code. Seems let APIC send
>>>>>> a NMI IPI to itself is a good idea.
>>>>>>
>>>>>> I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to
>>>>>> replace "int $2". Something unexpected is happening...
>>>>>>
>>>>> You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't
>>>>> supposed to be able to.
>>>>>
>>>> Um? Why?
>>>>
>>>> Especially kernel is already using it to deliver NMI.
>>>>
>>> That's the only defined case, and it is defined because the vector field
>>> is ignore for DM_NMI. Vol 3A (exact section numbers may vary depending
>>> on your version).
>>>
>>> 8.5.1 / 8.6.1
>>>
>>> '100 (NMI) Delivers an NMI interrupt to the target processor or
>>> processors. The vector information is ignored'
>>>
>>> 8.5.2 Valid Interrupt Vectors
>>>
>>> 'Local and I/O APICs support 240 of these vectors (in the range of 16 to
>>> 255) as valid interrupts.'
>>>
>>> 8.8.4 Interrupt Acceptance for Fixed Interrupts
>>>
>>> '...; vectors 0 through 15 are reserved by the APIC (see also: Section
>>> 8.5.2, "Valid Interrupt Vectors")'
>>>
>>> So I misremembered, apparently you can deliver interrupts 0x10-0x1f, but
>>> vectors 0x00-0x0f are not valid to send via APIC or I/O APIC.
>>>
>> As you pointed out, NMI is not "Fixed interrupt". If we want to send NMI,
>> it would need a specific delivery mode rather than vector number.
>>
>> And if you look at code, if we specific NMI_VECTOR, the delivery mode would
>> be set to NMI.
>>
>> So what's wrong here?
>>
> OK, I think I understand your points now. You meant that these vectors can't
> be filled in vector field directly, right? But NMI is a exception due to
> DM_NMI. Is that your point? I think we agree on this.
>

Yes, I think we agree. NMI is the only vector in 0x0-0xf which can be
sent via self-IPI because the vector itself does not matter for NMI.

Zach
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Ingo Molnar on 19 Mar 2010 04:30

Nice progress!

This bit:

> 1) perf kvm top
> [root(a)lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
> --guestmodules=/home/ymzhang/guest/modules top

Will be really be painful to developers - to enter that long line while we
have these things called 'computers' that ought to reduce human work. Also,
it's incomplete, we need access to the guest system's binaries to do ELF
symbol resolution and dwarf decoding.

So we really need some good, automatic way to get to the guest symbol space,
so that if a developer types:

perf kvm top

Then the obvious thing happens by default. (which is to show the guest
overhead)

There's no technical barrier on the perf tooling side to implement all that:
perf supports build-ids extensively and can deal with multiple symbol spaces -
as long as it has access to it. The guest kernel could be ID-ed based on its
/sys/kernel/notes and /sys/module/*/notes/.note.gnu.build-id build-ids.

So some sort of --guestmount option would be the natural solution, which
points to the guest system's root: and a Qemu enumeration of guest mounts
(which would be off by default and configurable) from which perf can pick up
the target guest all automatically. (obviously only under allowed permissions
so that such access is secure)

This would allow not just kallsyms access via $guest/proc/kallsyms but also
gives us the full space of symbol features: access to the guest binaries for
annotation and general symbol resolution, command/binary name identification,
etc.

Such a mount would obviously not broaden existing privileges - and as an
additional control a guest would also have a way to indicate that it does not
wish a guest mount at all.

Unfortunately, in a previous thread the Qemu maintainer has indicated that he
will essentially NAK any attempt to enhance Qemu to provide an easily
discoverable, self-contained, transparent guest mount on the host side.

No technical justification was given for that NAK, despite my repeated
requests to particulate the exact security problems that such an approach
would cause.

If that NAK does not stand in that form then i'd like to know about it - it
makes no sense for us to try to code up a solution against a standing
maintainer NAK ...

The other option is some sysadmin level hackery to NFS-mount the guest or so.
This is a vastly inferior method that brings us back to the absymal usability
levels of OProfile:

1) it wont be guest transparent
2) has to be re-done for every guest image.
3) even if packaged it has to be gotten into every. single. Linux. distro. separately.
4) old Linux guests wont work out of box

In other words: it's very inconvenient on multiple levels and wont ever happen
on any reasonable enough scale to make a difference to Linux.

Which is an unfortunate situation - and the ball is on the KVM/Qemu side so i
can do little about it.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Ingo Molnar on 21 Mar 2010 14:50

* oerg Roedel <joro(a)8bytes.org> wrote:

> On Fri, Mar 19, 2010 at 09:21:22AM +0100, Ingo Molnar wrote:
> > Unfortunately, in a previous thread the Qemu maintainer has indicated that he
> > will essentially NAK any attempt to enhance Qemu to provide an easily
> > discoverable, self-contained, transparent guest mount on the host side.
> >
> > No technical justification was given for that NAK, despite my repeated
> > requests to particulate the exact security problems that such an approach
> > would cause.
> >
> > If that NAK does not stand in that form then i'd like to know about it - it
> > makes no sense for us to try to code up a solution against a standing
> > maintainer NAK ...
>
> I still think it is the best and most generic way to let the guest do the
> symbol resolution. [...]

Not really.

> [...] This has several advantages:
>
> 1. The guest knows best about its symbol space. So this would be
> extensible to other guest operating systems. A brave
> developer may even implement symbol passing for Windows or
> the BSDs ;-)

Having access to the actual executable files that include the symbols achieves
precisely that - with the additional robustness that all this functionality is
concentrated into the host, while the guest side is kept minimal (and
transparent).

> 2. The guest can decide for its own if it want to pass this
> inforamtion to the host-perf. No security issues at all.

It can decide whether it exposes the files. Nor are there any "security
issues" to begin with.

> 3. The guest can also pass us the call-chain and we don't need
> to care about complicated of fetching from the guest
> ourself.

You need to be aware of the fact that symbol resolution is a separate step
from call chain generation.

I.e. call-chains are a (entirely) separate issue, and could reasonably be done
in the guest or in the host.

It has no bearing on this symbol resolution question.

> 4. This way extensible to nested virtualization too.

Nested virtualization is actually already taken care of by the filesystem
solution via an existing method called 'subdirectories'. If the guest offers
sub-guests then those symbols will be exposed in a similar way via its own
'guest files' directory hierarchy.

I.e. if we have 'Guest-2' nested inside 'the 'Guest-Fedora-1' instance, we get:

/guests/
/guests/Guest-Fedora-1/etc/
/guests/Guest-Fedora-1/usr/

we'd also have:

/guests/Guest-Fedora-1/guests/Guest-2/

So this is taken care of automatically.

I.e. none of the four 'advantages' listed here are actually advantages over my
proposed solution, so your conclusion is subsequently flawed as well.

> How we speak to the guest was already discussed in this thread. My personal
> opinion is that going through qemu is an unnecessary step and we can solve
> that more clever and transparent for perf.

Meaning exactly what?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10 11 12
Prev: + tmpfs-fix-oops-on-remounts-with-mpol=default.patch added to -mm tree
Next: [PATCH 5/5] doc: add the documentation for mpol=local