Enhance perf to collect KVM guest os statistics from host side [Kernel]


From: Avi Kivity on 17 Mar 2010 06:10

On 03/17/2010 11:51 AM, Sheng Yang wrote:
>
>> I think you need DM_NMI for that to work correctly.
>>
>> An alternative is to call the NMI handler directly.
>>
> apic_send_IPI_self() already took care of APIC_DM_NMI.
>

So it does (though not for x2apic?). I don't see why it doesn't work.

> And NMI handler would block the following NMI?
>
>

It wouldn't - won't work without extensive changes.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Zachary Amsden on 17 Mar 2010 17:20

On 03/16/2010 11:28 PM, Sheng Yang wrote:
> On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote:
>
>> On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote:
>>
>>> On 03/16/2010 09:48 AM, Zhang, Yanmin wrote:
>>>
>>>> Right, but there is a scope between kvm_guest_enter and really running
>>>> in guest os, where a perf event might overflow. Anyway, the scope is
>>>> very narrow, I will change it to use flag PF_VCPU.
>>>>
>>> There is also a window between setting the flag and calling 'int $2'
>>> where an NMI might happen and be accounted incorrectly.
>>>
>>> Perhaps separate the 'int $2' into a direct call into perf and another
>>> call for the rest of NMI handling. I don't see how it would work on svm
>>> though - AFAICT the NMI is held whereas vmx swallows it.
>>>
>>> I guess NMIs
>>> will be disabled until the next IRET so it isn't racy, just tricky.
>>>
>> I'm not sure whether vmexit breaks NMI context or not. Hardware NMI context
>> isn't reentrant until an IRET. YangSheng would like to double check it.
>>
> After checking further, I think VMX doesn't keep the NMI-blocked state for the
> host. That means that if an NMI happens while the processor is in VMX non-root
> mode, it only results in a VMExit, with an exit reason indicating that it was
> caused by an NMI, but no further state change in the host.
>
> So in that sense, there _is_ a window between the VMExit and KVM handling the
> NMI. Moreover, I think we _can't_ stop re-entrance of the NMI handling code,
> because "int $2" doesn't block subsequent NMIs.
>
> And if the NMI ordering is not important (I think it isn't), then we need to
> generate a real NMI in the current post-vmexit code. Having the APIC send an
> NMI IPI to itself seems like a good idea.
>
> I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace
> "int $2". Something unexpected is happening...
>

You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't
supposed to be able to.

Zach

From: Zhang, Yanmin on 17 Mar 2010 22:50

On Wed, 2010-03-17 at 17:26 +0800, Zhang, Yanmin wrote:
> On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote:
> > * Zhang, Yanmin <yanmin_zhang(a)linux.intel.com> wrote:
> >
> > > On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote:
> > > > On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote:
> > > > > On 03/16/2010 07:27 AM, Zhang, Yanmin wrote:
> > > > > > From: Zhang, Yanmin<yanmin_zhang(a)linux.intel.com>
> > > > > >
> > > > > > Based on the discussion in KVM community, I worked out the patch to support
> > > > > > perf to collect guest os statistics from host side. This patch is implemented
> > > > > > with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a
> > > > > > critical bug and provided good suggestions with other guys. I really appreciate
> > > > > > their kind help.
> > > > > >
> > > > > > The patch adds new subcommand kvm to perf.
> > > > > >
> > > > > > perf kvm top
> > > > > > perf kvm record
> > > > > > perf kvm report
> > > > > > perf kvm diff
> > > > > >
> > > > > > The new perf can profile the guest OS kernel but not guest OS user space;
> > > > > > it can, however, summarize guest user-space utilization per guest OS.
> > > > > >
> > > > > > Below are some examples.
> > > > > > 1) perf kvm top
> > > > > > [root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
> > > > > > --guestmodules=/home/ymzhang/guest/modules top
> > > > > >
> > > > > >
> > > > >
> > > > Thanks for your kind comments.
> > > >
> > > > > Excellent, support for guest kernel != host kernel is critical (I can't
> > > > > remember the last time I ran same kernels).
> > > > >
> > > > > How would we support multiple guests with different kernels?
> > > > With the patch, 'perf kvm report --sort pid' could show
> > > > summary statistics for all guest os instances. Then, use the --pid
> > > > parameter of 'perf kvm record' to collect data for a single problematic instance.
> > > Sorry. I found that --pid currently selects a thread (the main thread), not a process.
> > >
> > > Ingo,
> > >
> > > Is it possible to support a new parameter or extend --inherit, so 'perf
> > > record' and 'perf top' could collect data on all threads of a process when
> > > the process is running?
> > >
> > > If not, I need to add a new ugly parameter, similar to --pid, to filter
> > > out process data in userspace.
> >
> > Yeah. For maximum utility i'd suggest to extend --pid to include this, and
> > introduce --tid for the previous, limited-to-a-single-task functionality.
> >
> > Most users would expect --pid to work like a 'late attach' - i.e. to work like
> > strace -f or like a gdb attach.
>
> Thanks Ingo, Avi.
>
> I worked out the patch below against tip/master of March 15th.
>
> Subject: [PATCH] Change perf's parameter --pid to process-wide collection
> From: Zhang, Yanmin <yanmin_zhang(a)linux.intel.com>
>
> Change parameter -p (--pid) to mean a real process pid and add -t (--tid) meaning
> thread id. Now, --pid means perf collects the statistics of all threads of
> the process, while --tid means perf collects the statistics of just that thread.
>
> BTW, the patch fixes a bug in 'perf stat -p'. 'perf stat' always configures
> attr->disabled=1 if it isn't a system-wide collection. If there is a '-p'
> and no forks, 'perf stat -p' doesn't collect any data. In addition, the
> while(!done) loop in run_perf_stat consumes 100% of one CPU, which has a bad
> impact on the running workload. I added a sleep(1) in the loop.
>
> Signed-off-by: Zhang Yanmin <yanmin_zhang(a)linux.intel.com>
Ingo,

Sorry, the patch has bugs. I need to do a better job and will work out two
separate patches for the two issues.

Yanmin
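
To illustrate the difference between the proposed --pid and --tid semantics,
here is a minimal Python sketch (not part of the patch) of how a tool could
expand a process ID into its thread IDs by reading /proc/<pid>/task on Linux.
The function names and the proc_root parameter are hypothetical, chosen only
for this example. It takes a one-time snapshot and does not follow threads
created later.

```python
from pathlib import Path
from typing import List, Optional


def thread_ids(pid: int, proc_root: str = "/proc") -> List[int]:
    """Return the sorted thread IDs of process `pid`.

    On Linux every thread of a process appears as a numeric directory
    under /proc/<pid>/task. `proc_root` is a hypothetical parameter that
    exists only so the function can be pointed at a fake tree in tests.
    """
    task_dir = Path(proc_root) / str(pid) / "task"
    return sorted(int(entry.name) for entry in task_dir.iterdir()
                  if entry.name.isdigit())


def targets(pid: Optional[int] = None, tid: Optional[int] = None,
            proc_root: str = "/proc") -> List[int]:
    """Resolve the two options to the list of threads to monitor.

    --tid selects exactly one thread; --pid expands to every thread
    that currently belongs to the process.
    """
    if tid is not None:
        return [tid]
    if pid is not None:
        return thread_ids(pid, proc_root)
    raise ValueError("either pid or tid must be given")
```

With this model, targets(tid=N) always yields a single thread, while
targets(pid=N) yields the main thread plus all of its siblings.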


From: Zachary Amsden on 18 Mar 2010 01:00

On 03/17/2010 03:19 PM, Sheng Yang wrote:
> On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote:
>
>> On 03/16/2010 11:28 PM, Sheng Yang wrote:
>>
>>> On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote:
>>>
>>>> On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote:
>>>>
>>>>> On 03/16/2010 09:48 AM, Zhang, Yanmin wrote:
>>>>>
>>>>>> Right, but there is a scope between kvm_guest_enter and really running
>>>>>> in guest os, where a perf event might overflow. Anyway, the scope is
>>>>>> very narrow, I will change it to use flag PF_VCPU.
>>>>>>
>>>>> There is also a window between setting the flag and calling 'int $2'
>>>>> where an NMI might happen and be accounted incorrectly.
>>>>>
>>>>> Perhaps separate the 'int $2' into a direct call into perf and another
>>>>> call for the rest of NMI handling. I don't see how it would work on
>>>>> svm though - AFAICT the NMI is held whereas vmx swallows it.
>>>>>
>>>>> I guess NMIs
>>>>> will be disabled until the next IRET so it isn't racy, just tricky.
>>>>>
>>>> I'm not sure whether vmexit breaks NMI context or not. Hardware NMI
>>>> context isn't reentrant until an IRET. YangSheng would like to double
>>>> check it.
>>>>
>>> After checking further, I think VMX doesn't keep the NMI-blocked state for
>>> the host. That means that if an NMI happens while the processor is in VMX
>>> non-root mode, it only results in a VMExit, with an exit reason indicating
>>> that it was caused by an NMI, but no further state change in the host.
>>>
>>> So in that sense, there _is_ a window between the VMExit and KVM handling
>>> the NMI. Moreover, I think we _can't_ stop re-entrance of the NMI handling
>>> code, because "int $2" doesn't block subsequent NMIs.
>>>
>>> And if the NMI ordering is not important (I think it isn't), then we need
>>> to generate a real NMI in the current post-vmexit code. Having the APIC
>>> send an NMI IPI to itself seems like a good idea.
>>>
>>> I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to
>>> replace "int $2". Something unexpected is happening...
>>>
>> You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't
>> supposed to be able to.
>>
> Um? Why?
>
> Especially kernel is already using it to deliver NMI.
>
>

That's the only defined case, and it is defined because the vector field
is ignored for DM_NMI. See Vol 3A (exact section numbers may vary depending
on your version).

8.5.1 / 8.6.1

'100 (NMI) Delivers an NMI interrupt to the target processor or
processors. The vector information is ignored'

8.5.2 Valid Interrupt Vectors

'Local and I/O APICs support 240 of these vectors (in the range of 16 to
255) as valid interrupts.'

8.8.4 Interrupt Acceptance for Fixed Interrupts

'...; vectors 0 through 15 are reserved by the APIC (see also: Section
8.5.2, "Valid Interrupt Vectors")'

So I misremembered, apparently you can deliver interrupts 0x10-0x1f, but
vectors 0x00-0x0f are not valid to send via APIC or I/O APIC.

Zach
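
The rules quoted above can be sketched as a small Python model (an
illustration only, not kernel code). It assumes the usual layout of the low
word of the APIC interrupt command register, with the vector in bits 7:0 and
the delivery mode in bits 10:8, and it covers only the two delivery modes
discussed here. The function names are hypothetical.

```python
APIC_DM_FIXED = 0b000  # delivery mode 000: fixed interrupt, vector is used
APIC_DM_NMI = 0b100    # delivery mode 100: NMI, vector is ignored


def vector_is_valid(delivery_mode: int, vector: int) -> bool:
    """Apply the rules quoted from the manual to one (mode, vector) pair."""
    if not 0 <= vector <= 0xFF:
        raise ValueError("vector must fit in 8 bits")
    if delivery_mode == APIC_DM_NMI:
        # The vector field is ignored, so any value is acceptable.
        return True
    if delivery_mode == APIC_DM_FIXED:
        # Vectors 0 through 15 are reserved; 16 through 255 are valid.
        return vector >= 16
    raise ValueError("only fixed and NMI delivery modes are modelled")


def icr_low(delivery_mode: int, vector: int) -> int:
    """Pack the vector (bits 7:0) and delivery mode (bits 10:8)."""
    if not vector_is_valid(delivery_mode, vector):
        raise ValueError("reserved vector for fixed delivery")
    return (delivery_mode << 8) | vector
```

In this model a fixed-mode send of vector 2 is rejected, while an NMI-mode
send with the same vector field is accepted because the field is not used.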

From: Huang, Zhiteng on 18 Mar 2010 01:30

Hi Avi, Ingo,

I've been following through this long thread since the very first email.

I'm a performance engineer whose job is to tune workloads running on top of KVM (and previously Xen). As a performance engineer, I desperately want a tool that can monitor the host and guests at the same time. Think about more than 100 guests, a mix of Linux and Windows, running together on a single system: being able to know what's happening is critical for performance analysis. Actually, I am the person who asked Yanmin to add the feature for CPU utilization breakdown (into host_usr, host_krn, guest_usr, guest_krn) so that I can monitor dozens of running guests. I haven't made this patch work on my system yet, but I _do_ think this patch is a very good start.

And finally, monitoring guests from the host is useful for users too (administrators and performance people like me). I really appreciate you guys' work and would love to provide feedback from my point of view if needed.

Regards,

HUANG, Zhiteng

Intel SSG/SSD/SPA/PRC Scalability Lab
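
The four-way utilization breakdown mentioned above can be sketched in Python
as follows. This is only an illustration of the bookkeeping, not how perf
implements it: each sample is assumed to carry two hypothetical booleans,
whether the CPU was running a guest and whether it was in user mode.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

BUCKETS = ("host_usr", "host_krn", "guest_usr", "guest_krn")


def bucket(in_guest: bool, user_mode: bool) -> str:
    """Map one sample to one of the four utilization buckets."""
    side = "guest" if in_guest else "host"
    level = "usr" if user_mode else "krn"
    return f"{side}_{level}"


def breakdown(samples: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Return the percentage of samples that fall into each bucket.

    `samples` is an iterable of (in_guest, user_mode) pairs. An empty
    input yields 0.0 for every bucket.
    """
    counts = Counter(bucket(g, u) for g, u in samples)
    total = sum(counts.values())
    return {name: (100.0 * counts[name] / total if total else 0.0)
            for name in BUCKETS}
```

Running the same tally per guest process would give the per-guest view
needed to watch dozens of guests at once.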

-----Original Message-----
From: kvm-owner(a)vger.kernel.org [mailto:kvm-owner(a)vger.kernel.org] On Behalf Of Avi Kivity
Sent: Wednesday, March 17, 2010 11:55 AM
To: Frank Ch. Eigler
Cc: Anthony Liguori; Ingo Molnar; Zhang, Yanmin; Peter Zijlstra; Sheng Yang; linux-kernel(a)vger.kernel.org; kvm(a)vger.kernel.org; Marcelo Tosatti; Joerg Roedel; Jes Sorensen; Gleb Natapov; Zachary Amsden; ziteng.huang(a)intel.com
Subject: Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side

On 03/17/2010 02:41 AM, Frank Ch. Eigler wrote:
> Hi -
>
> On Tue, Mar 16, 2010 at 06:04:10PM -0500, Anthony Liguori wrote:
>
>> [...]
>> The only way to really address this is to change the interaction.
>> Instead of running perf externally to qemu, we should support a perf
>> command in the qemu monitor that can then tie directly to the perf
>> tooling. That gives us the best possible user experience.
>>
> To what extent could this be solved with less crossing of
> isolation/abstraction layers, if the perfctr facilities were properly
> virtualized?
>

That's the more interesting (by far) usage model. In general guest
owners don't have access to the host, and host owners can't (and
shouldn't) change guests.

Monitoring guests from the host is useful for kvm developers, but less
so for users.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
