Enhance perf to collect KVM guest os statistics from host side [Kernel]

From: Sheng Yang on 17 Mar 2010 21:30

On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote:
> On 03/16/2010 11:28 PM, Sheng Yang wrote:
> > On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote:
> >> On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote:
> >>> On 03/16/2010 09:48 AM, Zhang, Yanmin wrote:
> >>>> Right, but there is a scope between kvm_guest_enter and really running
> >>>> in guest os, where a perf event might overflow. Anyway, the scope is
> >>>> very narrow, I will change it to use flag PF_VCPU.
> >>>
> >>> There is also a window between setting the flag and calling 'int $2'
> >>> where an NMI might happen and be accounted incorrectly.
> >>>
> >>> Perhaps separate the 'int $2' into a direct call into perf and another
> >>> call for the rest of NMI handling. I don't see how it would work on
> >>> svm though - AFAICT the NMI is held whereas vmx swallows it.
> >>>
> >>> I guess NMIs
> >>> will be disabled until the next IRET so it isn't racy, just tricky.
> >>
> >> I'm not sure if vmexit does break NMI context or not. Hardware NMI
> >> context isn't reentrant till a IRET. YangSheng would like to double
> >> check it.
> >
> > After more check, I think VMX won't remained NMI block state for host.
> > That's means, if NMI happened and processor is in VMX non-root mode, it
> > would only result in VMExit, with a reason indicate that it's due to NMI
> > happened, but no more state change in the host.
> >
> > So in that meaning, there _is_ a window between VMExit and KVM handle the
> > NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling
> > code because "int $2" don't have effect to block following NMI.
> >
> > And if the NMI sequence is not important(I think so), then we need to
> > generate a real NMI in current vmexit-after code. Seems let APIC send a
> > NMI IPI to itself is a good idea.
> >
> > I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to
> > replace "int $2". Something unexpected is happening...
>
> You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't
> supposed to be able to.

Um? Why?

Especially since the kernel already uses it to deliver NMIs.

--
regards
Yang, Sheng
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Sheng Yang on 18 Mar 2010 01:30

On Thursday 18 March 2010 12:50:58 Zachary Amsden wrote:
> On 03/17/2010 03:19 PM, Sheng Yang wrote:
> > On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote:
> >> On 03/16/2010 11:28 PM, Sheng Yang wrote:
> >>> On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote:
> >>>> On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote:
> >>>>> On 03/16/2010 09:48 AM, Zhang, Yanmin wrote:
> >>>>>> Right, but there is a scope between kvm_guest_enter and really
> >>>>>> running in guest os, where a perf event might overflow. Anyway, the
> >>>>>> scope is very narrow, I will change it to use flag PF_VCPU.
> >>>>>
> >>>>> There is also a window between setting the flag and calling 'int $2'
> >>>>> where an NMI might happen and be accounted incorrectly.
> >>>>>
> >>>>> Perhaps separate the 'int $2' into a direct call into perf and
> >>>>> another call for the rest of NMI handling. I don't see how it would
> >>>>> work on svm though - AFAICT the NMI is held whereas vmx swallows it.
> >>>>>
> >>>>> I guess NMIs
> >>>>> will be disabled until the next IRET so it isn't racy, just tricky.
> >>>>
> >>>> I'm not sure if vmexit does break NMI context or not. Hardware NMI
> >>>> context isn't reentrant till a IRET. YangSheng would like to double
> >>>> check it.
> >>>
> >>> After more check, I think VMX won't remained NMI block state for host.
> >>> That's means, if NMI happened and processor is in VMX non-root mode, it
> >>> would only result in VMExit, with a reason indicate that it's due to
> >>> NMI happened, but no more state change in the host.
> >>>
> >>> So in that meaning, there _is_ a window between VMExit and KVM handle
> >>> the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI
> >>> handling code because "int $2" don't have effect to block following
> >>> NMI.
> >>>
> >>> And if the NMI sequence is not important(I think so), then we need to
> >>> generate a real NMI in current vmexit-after code. Seems let APIC send a
> >>> NMI IPI to itself is a good idea.
> >>>
> >>> I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to
> >>> replace "int $2". Something unexpected is happening...
> >>
> >> You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't
> >> supposed to be able to.
> >
> > Um? Why?
> >
> > Especially kernel is already using it to deliver NMI.
>
> That's the only defined case, and it is defined because the vector field
> is ignore for DM_NMI. Vol 3A (exact section numbers may vary depending
> on your version).
>
> 8.5.1 / 8.6.1
>
> '100 (NMI) Delivers an NMI interrupt to the target processor or
> processors. The vector information is ignored'
>
> 8.5.2 Valid Interrupt Vectors
>
> 'Local and I/O APICs support 240 of these vectors (in the range of 16 to
> 255) as valid interrupts.'
>
> 8.8.4 Interrupt Acceptance for Fixed Interrupts
>
> '...; vectors 0 through 15 are reserved by the APIC (see also: Section
> 8.5.2, "Valid Interrupt Vectors")'
>
> So I misremembered, apparently you can deliver interrupts 0x10-0x1f, but
> vectors 0x00-0x0f are not valid to send via APIC or I/O APIC.

As you pointed out, NMI is not a "Fixed interrupt". If we want to send an NMI,
we need a specific delivery mode rather than a vector number.

And if you look at the code, when we specify NMI_VECTOR, the delivery mode is
set to NMI.

So what's wrong here?
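
To illustrate the mapping described above, here is a minimal standalone
sketch, not the kernel source. The constant values follow the ICR layout
(delivery mode in bits 8-10, destination shorthand in bits 18-19); the
constants are redefined locally and the helper name prepare_icr is made up
for this example.

```c
#include <stdint.h>

/* Redefined locally so the sketch stands alone; values follow the ICR layout. */
#define NMI_VECTOR      0x02u
#define APIC_DM_FIXED   0x00000u  /* delivery mode 000b */
#define APIC_DM_NMI     0x00400u  /* delivery mode 100b, bits 8-10 */
#define APIC_DEST_SELF  0x40000u  /* destination shorthand 01b, bits 18-19 */

/* Hypothetical helper: build the low ICR word for an IPI. */
static uint32_t prepare_icr(uint32_t shortcut, uint32_t vector)
{
	uint32_t icr = shortcut;

	if (vector == NMI_VECTOR)
		icr |= APIC_DM_NMI;            /* vector field stays zero; it is ignored */
	else
		icr |= APIC_DM_FIXED | vector; /* ordinary fixed-mode interrupt */
	return icr;
}
```

With these assumed values, a self-NMI never places vector 2 in the vector
field; only the delivery mode bits are set.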

--
regards
Yang, Sheng

From: Sheng Yang on 18 Mar 2010 01:50

On Thursday 18 March 2010 13:22:28 Sheng Yang wrote:
> On Thursday 18 March 2010 12:50:58 Zachary Amsden wrote:
> > On 03/17/2010 03:19 PM, Sheng Yang wrote:
> > > On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote:
> > >> On 03/16/2010 11:28 PM, Sheng Yang wrote:
> > >>> On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote:
> > >>>> On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote:
> > >>>>> On 03/16/2010 09:48 AM, Zhang, Yanmin wrote:
> > >>>>>> Right, but there is a scope between kvm_guest_enter and really
> > >>>>>> running in guest os, where a perf event might overflow. Anyway,
> > >>>>>> the scope is very narrow, I will change it to use flag PF_VCPU.
> > >>>>>
> > >>>>> There is also a window between setting the flag and calling 'int
> > >>>>> $2' where an NMI might happen and be accounted incorrectly.
> > >>>>>
> > >>>>> Perhaps separate the 'int $2' into a direct call into perf and
> > >>>>> another call for the rest of NMI handling. I don't see how it
> > >>>>> would work on svm though - AFAICT the NMI is held whereas vmx
> > >>>>> swallows it.
> > >>>>>
> > >>>>> I guess NMIs
> > >>>>> will be disabled until the next IRET so it isn't racy, just tricky.
> > >>>>
> > >>>> I'm not sure if vmexit does break NMI context or not. Hardware NMI
> > >>>> context isn't reentrant till a IRET. YangSheng would like to double
> > >>>> check it.
> > >>>
> > >>> After more check, I think VMX won't remained NMI block state for
> > >>> host. That's means, if NMI happened and processor is in VMX non-root
> > >>> mode, it would only result in VMExit, with a reason indicate that
> > >>> it's due to NMI happened, but no more state change in the host.
> > >>>
> > >>> So in that meaning, there _is_ a window between VMExit and KVM handle
> > >>> the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI
> > >>> handling code because "int $2" don't have effect to block following
> > >>> NMI.
> > >>>
> > >>> And if the NMI sequence is not important(I think so), then we need to
> > >>> generate a real NMI in current vmexit-after code. Seems let APIC send
> > >>> a NMI IPI to itself is a good idea.
> > >>>
> > >>> I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to
> > >>> replace "int $2". Something unexpected is happening...
> > >>
> > >> You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't
> > >> supposed to be able to.
> > >
> > > Um? Why?
> > >
> > > Especially kernel is already using it to deliver NMI.
> >
> > That's the only defined case, and it is defined because the vector field
> > is ignore for DM_NMI. Vol 3A (exact section numbers may vary depending
> > on your version).
> >
> > 8.5.1 / 8.6.1
> >
> > '100 (NMI) Delivers an NMI interrupt to the target processor or
> > processors. The vector information is ignored'
> >
> > 8.5.2 Valid Interrupt Vectors
> >
> > 'Local and I/O APICs support 240 of these vectors (in the range of 16 to
> > 255) as valid interrupts.'
> >
> > 8.8.4 Interrupt Acceptance for Fixed Interrupts
> >
> > '...; vectors 0 through 15 are reserved by the APIC (see also: Section
> > 8.5.2, "Valid Interrupt Vectors")'
> >
> > So I misremembered, apparently you can deliver interrupts 0x10-0x1f, but
> > vectors 0x00-0x0f are not valid to send via APIC or I/O APIC.
>
> As you pointed out, NMI is not "Fixed interrupt". If we want to send NMI,
> it would need a specific delivery mode rather than vector number.
>
> And if you look at code, if we specific NMI_VECTOR, the delivery mode would
> be set to NMI.
>
> So what's wrong here?

OK, I think I understand your point now. You meant that these vectors can't
be put in the vector field directly, right? But NMI is an exception because of
DM_NMI. Is that your point? I think we agree on this.
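
The rule we seem to agree on could be written down roughly like this. It is
only a sketch based on the manual text quoted above; the enum and the
function name ipi_request_is_defined are hypothetical and exist nowhere in
the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* ICR delivery-mode encodings: 000b is Fixed, 100b is NMI. */
enum delivery_mode { DM_FIXED = 0, DM_NMI = 4 };

/* Hypothetical check: is this (mode, vector) pair a defined IPI request? */
static bool ipi_request_is_defined(enum delivery_mode mode, uint8_t vector)
{
	if (mode == DM_NMI)
		return true;      /* the vector information is ignored */
	return vector >= 16;  /* vectors 0 through 15 are reserved for Fixed mode */
}
```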

--
regards
Yang, Sheng

From: Soeren Sandmann on 23 Mar 2010 09:20

Joerg Roedel <joro(a)8bytes.org> writes:

> On Mon, Mar 22, 2010 at 11:59:27AM +0100, Ingo Molnar wrote:
> > Best would be if you demonstrated any problems of the perf symbol lookup code
> > you are aware of on the host side, as it has that exact design you are
> > criticising here. We are eager to fix any bugs in it.
> >
> > If you claim that it's buggy then that should very much be demonstratable - no
> > need to go into theoretical arguments about it.
>
> I am not claiming anything. I just try to imagine how your proposal
> will look like in practice and forgot that symbol resolution is done at
> a later point.
> But even with defered symbol resolution we need more information from
> the guest than just the rip falling out of KVM. The guest needs to tell
> us about the process where the event happened (information that the host
> has about itself without any hassle) and which executable-files it was
> loaded from.

Slightly tangential, but there is another case that has some of the
same problems: profiling language runtimes other than C and C++, say
Python. At the moment profilers will generally tell you what is going
on inside the Python runtime, but not what the Python program itself
is doing.

To fix that problem, it seems like we need some way to have Python
export what is going on. Maybe the same mechanism could be used to
access what is going on in both QEMU and Python.
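
Purely as an illustration of what "export what is going on" might mean, here
is a minimal sketch: the runtime announces named address ranges, and the
profiler resolves sampled addresses against them. Every name in it
(code_region, export_region, resolve_sample) is hypothetical and is not an
existing perf, QEMU or Python interface.

```c
#include <stddef.h>
#include <stdint.h>

struct code_region {
	uint64_t start;    /* first address of the exported unit */
	uint64_t size;     /* length in bytes */
	const char *name;  /* e.g. a Python function name */
};

#define MAX_REGIONS 64
static struct code_region regions[MAX_REGIONS];
static size_t nr_regions;

/* Runtime side: announce a named region. Returns 0, or -1 if the table is full. */
static int export_region(uint64_t start, uint64_t size, const char *name)
{
	if (nr_regions == MAX_REGIONS)
		return -1;
	regions[nr_regions++] = (struct code_region){ start, size, name };
	return 0;
}

/* Profiler side: map a sampled address to a name, or NULL if unknown. */
static const char *resolve_sample(uint64_t addr)
{
	for (size_t i = 0; i < nr_regions; i++)
		if (addr >= regions[i].start &&
		    addr - regions[i].start < regions[i].size)
			return regions[i].name;
	return NULL;
}
```

A real mechanism would also have to cross the process (or guest) boundary
and cope with regions that are unloaded or reused, which this sketch ignores.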

Soren

From: Andi Kleen on 23 Mar 2010 09:50

Soeren Sandmann <sandmann(a)daimi.au.dk> writes:
>
> To fix that problem, it seems like we need some way to have python
> export what is going on. Maybe the same mechanism could be used to
> both access what is going on in qemu and python.

oprofile already has an interface to let JITs export
information about the JITed code. CPython is not a JIT,
but presumably one of the Python JITs could do it.

http://oprofile.sourceforge.net/doc/devel/index.html

I know it's not en vogue anymore and you won't be an approved
cool kid if you do, but you could just use oprofile?

Ok, presumably one would need to do a Python interface for this
first. I believe it's currently only implemented for Java and
Mono. I presume it might work today with IronPython on Mono.

IMHO it doesn't make sense to invent another interface for this,
although I'm sure someone will propose just that.

-Andi
--
ak(a)linux.intel.com -- Speaking for myself only.