From: Zhang, Yanmin
On Wed, 2010-06-09 at 11:33 +0300, Avi Kivity wrote:
> On 06/09/2010 06:30 AM, Zhang, Yanmin wrote:
> > From: Zhang, Yanmin<yanmin_zhang(a)linux.intel.com>
> >
> > Based on Ingo's idea, I implement a para virt interface for perf to support
> > statistics collection in guest os. That means we could run tool perf in guest
> > os directly.
> >
> > Great thanks to Peter Zijlstra. He is really the architect and gave me architecture
> > design suggestions. I also want to thank Yangsheng and LinMing for their generous
> > help.
> >
> > The design is:
> >
> > 1) Add a kvm_pmu whose callbacks mostly just call a hypercall to vmexit to the host kernel;
> > 2) Create a host perf_event per guest perf_event;
> > 3) The host kernel syncs perf_event count/overflow data changes to the guest perf_event
> > when processing perf_event overflows after an NMI arrives. The host kernel injects an NMI into the guest
> > kernel if a guest event overflows.
> > 4) The guest kernel goes through all enabled events on the current cpu and outputs data when they
> > overflow.
> > 5) No change in user space.
> >
>
Thanks for your kind comments.

> One thing that's missing is documentation of the guest/host ABI. It
> will be a requirement for inclusion, but it will also be a great help
> for review, so please provide it ASAP.
I will add such a document. It will include:
1) The definition of struct perf_event. Guest and host have to share the same
data structure, as the host kernel syncs data changes (count/overflows and
others if needed) back to the guest.
2) A list of all hypercalls.
3) The guest needs an NMI handler which checks all guest events.


>
> Please also split the guest and host parts into separate patches.
I will do so.

>
> >
> > -#define KVM_MAX_MMU_OP_BATCH 32
> >
>
> Keep that please.
I will do so.

>
> > +/* Operations for KVM_PERF_OP */
> > +#define KVM_PERF_OP_OPEN 1
> > +#define KVM_PERF_OP_CLOSE 2
> > +#define KVM_PERF_OP_ENABLE 3
> > +#define KVM_PERF_OP_DISABLE 4
> > +#define KVM_PERF_OP_START 5
> > +#define KVM_PERF_OP_STOP 6
> > +#define KVM_PERF_OP_READ 7
> >
>
> Where is the hypercall number for the perf hypercall?
I define it in kvm_para.h, like KVM_HC_MMU_OP.

>
> > +static bool kvm_reserve_pmc_hardware(void)
> > +{
> > + if (nmi_watchdog == NMI_LOCAL_APIC)
> > + disable_lapic_nmi_watchdog();
> > +
> > + return true;
> > +}
> > +
> > +static void kvm_release_pmc_hardware(void)
> > +{
> > + if (nmi_watchdog == NMI_LOCAL_APIC)
> > + enable_lapic_nmi_watchdog();
> > +}
> >
>
> Disabling the watchdog is unfortunate. Why is it necessary?
perf always uses NMI, so we disable the nmi_watchdog when a perf_event is
set up, in case they interfere with each other.

>
> > +
> > +static int kvm_pmu_enable(struct perf_event *event)
> > +{
> > + int ret;
> > + unsigned long addr = __pa(event);
> > +
> > + if (kvm_add_event(event))
> > + return -1;
> > +
> > + ret = kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_ENABLE,
> > + addr, (unsigned long) event->shadow);
> >
>
> This is suspicious. Passing virtual addresses to the host is always
> problematic. Or is event->shadow only used as an opaque cookie?
I use perf_event->shadow on both the host and the guest side.
1) On the host side, perf_event->shadow points to an area holding the page-mapping
information of the guest perf_event. The host kernel uses it to sync data changes
back to the guest.
2) On the guest side, perf_event->shadow saves the host-side virtual address of the
host perf_event: kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_OPEN, ...) returns that
virtual address. The guest shouldn't interpret it; it only passes it back to the host
kernel in the following hypercalls. That might be a security issue for the host kernel.
Originally I planned for the guest not to use perf_event->shadow: the host kernel would
maintain a per-vcpu event list keyed by the address of the guest perf_event. But
considering vcpu-thread migration across logical cpus, such a list needs locking and
the implementation becomes a little complicated.

I will double-check the list method, since the security issue is real.

>
> > +int __init kvm_init_hw_perf_events(void)
> > +{
> > + if (!kvm_para_available())
> > + return -1;
> > +
> > + x86_pmu.handle_irq = kvm_default_x86_handle_irq;
> > +
> > + pr_cont("KVM PARA PMU driver.\n");
> > + register_die_notifier(&kvm_perf_event_nmi_notifier);
> > +
> > + return 0;
> > +}
> >
>
> Need to detect the kvm pmu via its own cpuid bit.
Ok. I will add a feature bit, KVM_FEATURE_PARA_PERF, something like
KVM_FEATURE_CLOCKSOURCE.

>
> > +
> > +static int __kvm_hw_perf_event_init(struct perf_event *event)
> > +{
> > + struct hw_perf_event *hwc = &event->hw;
> > + int err;
> > + unsigned long result;
> > + unsigned long addr;
> > +
> > + err = 0;
> > + if (!atomic_inc_not_zero(&active_events)) {
> > + mutex_lock(&pmc_reserve_mutex);
> > + if (atomic_read(&active_events) == 0) {
> > + if (!kvm_reserve_pmc_hardware())
> > + err = -EBUSY;
> > + }
> > + if (!err)
> > + atomic_inc(&active_events);
> > + mutex_unlock(&pmc_reserve_mutex);
> > + if (err)
> > + return err;
> > + }
> > +
> > + event->destroy = kvm_hw_perf_event_destroy;
> > +
> > + hwc->idx = -1;
> > + hwc->last_cpu = -1;
> > + hwc->last_tag = ~0ULL;
> > +
> > + addr = __pa(event);
> > + result = kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_OPEN, addr, 0);
> > +
> > + if (result)
> > + event->shadow = (void *) result;
> > + else
> > + err = -1;
> >
>
> Ok, so event->shadow is never dereferenced. In that case better not
> make it a pointer at all, keep it unsigned long.
The host kernel also uses it.

>
> Note that you can run 32-bit guests on 64-bit hosts, so the cookie
> better not exceed 32 bits.
I forgot about 32-bit guests. I need to double-check it. It seems I have to implement
a list on the host side and stop using the cookie on the guest side.

>
> > +
> > +static void kvm_sync_event_to_guest(struct perf_event *event, int overflows)
> > +{
> > + struct hw_perf_event *hwc = &event->hw;
> > + struct kvmperf_event *kvmevent;
> > + int offset, len, data_len, copied, page_offset;
> > + struct page *event_page;
> > + void *shared_kaddr;
> > +
> > + kvmevent = event->shadow;
> > + offset = kvmevent->event_offset;
> > +
> > + /* Copy perf_event->count firstly */
> > + offset += offsetof(struct perf_event, count);
> > + if (offset < PAGE_SIZE) {
> > + event_page = kvmevent->event_page;
> > + page_offset = offset;
> > + } else {
> > + event_page = kvmevent->event_page2;
> > + page_offset = offset - PAGE_SIZE;
> > + }
> > + shared_kaddr = kmap_atomic(event_page, KM_USER0);
> > + *((atomic64_t *)(shared_kaddr + page_offset)) = event->count;
> >
>
> This copy will not be atomic on 32-bit hosts.
Right. But it shouldn't be a problem, as the vcpu thread stops when it vmexits to the
host to process events. In addition, only the current cpu in the guest accesses
perf_events linked to the current cpu.

>
> > +
> > + offset = kvmevent->event_offset;
> > + offset += offsetof(struct perf_event, hw);
> > + if (offset < PAGE_SIZE) {
> > + if (event_page == kvmevent->event_page2) {
> > + kunmap_atomic(shared_kaddr, KM_USER0);
> > + event_page = kvmevent->event_page;
> > + shared_kaddr = kmap_atomic(event_page, KM_USER0);
> > + }
> > + page_offset = offset;
> > + } else {
> > + if (event_page == kvmevent->event_page) {
> > + kunmap_atomic(shared_kaddr, KM_USER0);
> > + event_page = kvmevent->event_page2;
> > + shared_kaddr = kmap_atomic(event_page, KM_USER0);
> > + }
> > + page_offset = offset - PAGE_SIZE;
> > + }
> > +
> > + if (overflows)
> > + atomic_add(overflows, (atomic_t *)(shared_kaddr + page_offset));
> > +
> > + kunmap_atomic(shared_kaddr, KM_USER0);
> >
>
> Where is the actual event copied? I'm afraid it's hard to understand
> the code.
Sorry, I should have put more comments around the code.
When an event overflows, the host kernel syncs perf_event->count to the guest
perf_event->count, and increments the guest perf_event->hw.overflows.

+struct kvmperf_event {
+ unsigned int event_offset;
+ struct page *event_page;
+ struct page *event_page2;
+};

Actually, on the host side, perf_event->shadow points to a kvmperf_event.
kvmperf_event->event_page points to the physical page of the guest perf_event;
event_page2 points to the 2nd page if the perf_event data structure spans two pages.
event_offset is the offset of the perf_event within the 1st page.

>
> > +#if 0
> > + offset += offsetof(struct hw_perf_event, prev_count);
> > + data_len = sizeof(struct hw_perf_event) -
> > + offsetof(struct hw_perf_event, prev_count);
> > + if (event_page == kvmevent->event_page2) {
> > + page_offset += offsetof(struct hw_perf_event, prev_count);
> > + memcpy(shared_kaddr + page_offset,
> > + &hwc->prev_count, data_len);
> > + kunmap_atomic(shared_kaddr, KM_USER0);
> > +
> > + return;
> > + }
> > +
> > + copied = 0;
> > + if (offset < PAGE_SIZE) {
> > + len = PAGE_SIZE - offset;
> > + if (len > data_len)
> > + len = data_len;
> > + memcpy(shared_kaddr + offset,
> > + &hwc->prev_count, data_len);
> > + copied = len;
> > + page_offset = 0;
> > + } else
> > + page_offset = offset - PAGE_SIZE;
> > +
> > + kunmap_atomic(shared_kaddr, KM_USER0);
> > + len = data_len - copied;
> > + if (len) {
> > + /* Copy across pages */
> > + shared_kaddr = kmap_atomic(kvmevent->event_page2, KM_USER0);
> > + memcpy(shared_kaddr + page_offset,
> > + ((void *)&hwc->prev_count) + copied, len);
> > + kunmap_atomic(shared_kaddr, KM_USER0);
> > + }
> > +#endif
> >
>
> Maybe here, but the #if 0 doesn't help.
The code in the '#if 0' block tries to copy more data back to the guest perf_event.
I commented it out because that data is not really used by the guest.

>
> > +}
> > +
> >
> > +static struct perf_event *
> > +kvm_pv_perf_op_open(struct kvm_vcpu *vcpu, gpa_t addr)
> > +{
> > + int ret;
> > + struct perf_event *event;
> > + struct perf_event *host_event = NULL;
> > + struct kvmperf_event *shadow = NULL;
> > +
> > + event = kzalloc(sizeof(*event), GFP_KERNEL);
> > + if (!event)
> > + goto out;
> > +
> > + shadow = kzalloc(sizeof(*shadow), GFP_KERNEL);
> > + if (!shadow)
> > + goto out;
> > +
> > + shadow->event_page = gfn_to_page(vcpu->kvm, addr >> PAGE_SHIFT);
> > + shadow->event_offset = addr & ~PAGE_MASK;
> >
>
> Might truncate on 32-bit hosts. PAGE_MASK is 32-bit while addr is 64 bit.
The above code runs only on the host side. Is it possible for the host kernel to be
32-bit while the guest kernel is 64-bit? I really need to split the patch into host
and guest parts, as a single patch misleads you guys.

>
> > + if (shadow->event_offset + sizeof(struct perf_event) > PAGE_SIZE) {
> > + shadow->event_page2 = gfn_to_page(vcpu->kvm,
> > + (addr >> PAGE_SHIFT) + 1);
> > + }
> >
>
> My be simpler to make event_pages an array instead of two independent
> variables.
Ok, I will do so.

>
> > +
> > + ret = kvm_read_guest(vcpu->kvm, addr, event, sizeof(*event));
> > + if (ret)
> > + goto out;
> >
>
> I assume this is to check existence?
Calling kvm_read_guest here gets a copy of the guest perf_event, because it contains
perf_event.attr. The host needs the attr to create the host perf_event.

> It doesn't work well with memory
> hotplug. In general it's preferred to use
> kvm_read_guest()/kvm_write_guest() instead of gfn_to_page() during
> initialization to prevent pinning and allow for hotplug.
That's an issue. But the host kernel can't sleep when processing event overflows in
NMI context.

>
> > +
> > + /*
> > + * By default, we disable the host event. Later on, guest os
> > + * triggers a perf_event_attach to enable it
> > + */
> > + event->attr.disabled = 1;
> > + event->attr.inherit = 0;
> > + event->attr.enable_on_exec = 0;
> > + /*
> > + * We don't support exclude mode of user and kernel for guest os,
> > + * which mean we always collect both user and kernel for guest os
> > + */
> > + event->attr.exclude_user = 0;
> > + event->attr.exclude_kernel = 0;
> > + /* We always create a cpu context host perf event */
> > +
> > + host_event = perf_event_create_kernel_counter(&event->attr, -1,
> > + current->pid, kvm_perf_event_overflow);
> > +
> > + if (IS_ERR(host_event)) {
> > + host_event = NULL;
> > + goto out;
> > + }
> > + host_event->shadow = shadow;
> > +
> > +out:
> > + if (!host_event)
> > + kfree(shadow);
> > + kfree(event);
> > +
> > + return host_event;
> > +}
> > +
> >
> > +
> > +int kvm_pv_perf_op(struct kvm_vcpu *vcpu, int op_code, unsigned long a1,
> > + unsigned long a2, unsigned long *result)
> > +{
> > + unsigned long ret;
> > + struct perf_event *host_event;
> > + gpa_t addr;
> > +
> > + addr = (gpa_t)(a1);
> >
>
> A 32-bit guest has a 64-bit gpa_t but 32-bit ulongs, so gpas need to be
> passed in two hypercall arguments.
Ok. Originally I passed 2 parameters, like hc_gpa. I will redo it.

>
> > +
> > + switch(op_code) {
> > + case KVM_PERF_OP_OPEN:
> > + ret = (unsigned long) kvm_pv_perf_op_open(vcpu, addr);
> > + break;
> > + case KVM_PERF_OP_CLOSE:
> > + host_event = (struct perf_event *) a2;
> >
>
> First, you get truncation in a 32-bit guest. Second, you must validate
> the argument! The guest kernel can easily subvert the host by passing a
> bogus host_event.
You are right. I will implement the list method on the host side.

>
> It may be better to have the guest create an id to the host, and the
> host can simply look up the id in a table:
Perhaps the address of the guest perf_event is the best id.

>
> if (a2 >= KVM_PMU_MAX_EVENTS)
> bail out;
> host_event = vcpu->pmu.host_events[a2];
>
> > @@ -4052,6 +4054,16 @@ static unsigned long kvm_get_guest_ip(vo
> > return ip;
> > }
> >
> > +int kvm_notify_event_overflow(void)
> > +{
> > + if (percpu_read(current_vcpu)) {
> > + kvm_inject_nmi(percpu_read(current_vcpu));
> > + return 0;
> > + }
> > +
> > + return -1;
> > +}
> >
>
> This should really go through the APIC PERF LVT. No interface
> currently, but we are working on it.
>
> This way the guest can configure the perf interrupt to be NMI, INTR, or
> anything it likes.
>
Thanks for the pointer.

Yanmin


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Zhang, Yanmin
On Wed, 2010-06-09 at 11:59 +0300, Avi Kivity wrote:
> On 06/09/2010 06:30 AM, Zhang, Yanmin wrote:
> > From: Zhang, Yanmin<yanmin_zhang(a)linux.intel.com>
> >
> > Based on Ingo's idea, I implement a para virt interface for perf to support
> > statistics collection in guest os. That means we could run tool perf in guest
> > os directly.
> >
> > Great thanks to Peter Zijlstra. He is really the architect and gave me architecture
> > design suggestions. I also want to thank Yangsheng and LinMing for their generous
> > help.
> >
> > The design is:
> >
> > 1) Add a kvm_pmu whose callbacks mostly just calls hypercall to vmexit to host kernel;
> > 2) Create a host perf_event per guest perf_event;
> > 3) Host kernel syncs perf_event count/overflows data changes to guest perf_event
> > when processing perf_event overflows after NMI arrives. Host kernel inject NMI to guest
> > kernel if a guest event overflows.
> > 4) Guest kernel goes through all enabled event on current cpu and output data when they
> > overflows.
> > 5) No change in user space.
> >
>
> Other issues:
>
> - save/restore support for live migration
Well, it's a little hard to handle perf_events in the live-migration case.
I will check it.

> - some way to limit the number of open handles (comes automatically with
> the table approach I suggested earlier)
Current perf doesn't restrict the number of perf_events. The kernel does a rotation to collect
statistics from all perf_events. My patch just follows this style.
The table method might not be good, because of the following scenario:
a guest perf_event might be a per-task event on the guest side. When the guest application task
is migrated to another cpu, the perf_event peer on the host side should also be migrated to the
new vcpu thread. With the table method, we would need to rearrange the table when such an event
migration happens. The migration I mention here is not guest live migration.

I will double-check it.

Thanks,
Yanmin


From: Peter Zijlstra
On Wed, 2010-06-09 at 12:41 +0300, Avi Kivity wrote:
>
> >> Disabling the watchdog is unfortunate. Why is it necessary?
> >>
> > perf always uses NMI, so we disable the nmi_watchdog when a perf_event is
> > set up in case they might have impact.
> >
>
> Ok. Is that the case for the hardware pmus as well? If so it might be
> done in common code.

The x86 hardware pmu implementation disables the lapic watchdog too, but
recent kernels come with a watchdog implementation on top of perf; the
old lapic one will be deprecated.


From: Ingo Molnar

* Peter Zijlstra <peterz(a)infradead.org> wrote:

> On Wed, 2010-06-09 at 12:41 +0300, Avi Kivity wrote:
> >
> > >> Disabling the watchdog is unfortunate. Why is it necessary?
> > >>
> > > perf always uses NMI, so we disable the nmi_watchdog when a perf_event is
> > > set up in case they might have impact.
> > >
> >
> > Ok. Is that the case for the hardware pmus as well? If so it might be
> > done in common code.
>
> The x86 hardware pmu implementation disables the lapic watchdog too, but
> recent kernels come with a watchdog implementation on top of perf, the old
> lapic one will be deprecated.

Note, that code is in -tip, queued for v2.6.36.

Ingo
From: Zhang, Yanmin
On Wed, 2010-06-09 at 12:41 +0300, Avi Kivity wrote:
> On 06/09/2010 12:21 PM, Zhang, Yanmin wrote:
> >
> >> One thing that's missing is documentation of the guest/host ABI. It
> >> will be a requirement for inclusion, but it will also be a great help
> >> for review, so please provide it ASAP.
> >>
> > I will add such document. It will includes:
> > 1) Data struct perf_event definition. Guest os and host os have to share the same
> > data structure as host kernel will sync data (counte/overflows and others if needed)
> > changes back to guest os.
> > 2) A list of all hypercalls
> > 3) Guest need have NMI handler which checks all guest events.
> >
>
> Thanks. It may be easier to have just the documentation for the first
> few iterations so we can converge on a good interface, then update the
> code to match the interface.
I thought it over last night. Your input is important; I need to define a clear ABI.
On the guest side, I plan to use perf_event->shadow to point to another data area, such as:
struct guest_perf_counter {
u64 count;
atomic_t overflows;
};

So the host side just copies data into this area, and the guest then copies it into
its own perf_event.

The ABI becomes simpler than before, and kvm_sync_event_to_guest also becomes clearer.
The ABI mostly consists of the definitions of struct perf_event_attr and
guest_perf_counter, plus the hypercalls.


> >> Disabling the watchdog is unfortunate. Why is it necessary?
> >>
> > perf always uses NMI, so we disable the nmi_watchdog when a perf_event is
> > set up in case they might have impact.
> >
>
> Ok. Is that the case for the hardware pmus as well? If so it might be
> done in common code.
>
> >>> +
> >>> +static int kvm_pmu_enable(struct perf_event *event)
> >>> +{
> >>> + int ret;
> >>> + unsigned long addr = __pa(event);
> >>> +
> >>> + if (kvm_add_event(event))
> >>> + return -1;
> >>> +
> >>> + ret = kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_ENABLE,
> >>> + addr, (unsigned long) event->shadow);
> >>>
> >>>
> >> This is suspicious. Passing virtual addresses to the host is always
> >> problematic. Or event->shadow only used as an opaque cookie?
> >>
> > I use perf_event->shadow at both host and guest side.
> > 1) At host side, perf_event->shadow points to an area including the page
> > mapping information about guest perf_event. Host kernel uses it to sync data
> > changes back to guest;
> > 2) At guest side, perf_event->shadow save the virtual address of host
> > perf_event at host side. At guest side,
> > kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_OPEN, ...) return the virtual address.
> > Guest os shouldn't use it but using it to pass it back to host kernel in following
> > hypercalls. It might be a security issue for host kernel. Originally, I planed guest
> > os not to use perf_event->shadow. Host kernel maintains a per-vcpu event-related
> > list whose key is addr of guest perf_event. But considering the vcpu thread migration
> > on logical cpu, such list needs lock and implementation becomes a little complicated.
> >
> > I will double-check the list method as the security issue is there.
> >
>
> Besides the other concern, you cannot live migrate a host virtual
> address, since it changes from host to host. It's better to use a
> guest-chosen bounded small integer.
Ok. Perhaps a single u32 per guest-os instance is enough. So I will change the shadow
to point to a structure like the one below in the guest kernel:

struct guest_perf_counter {
u64 count;
atomic_t overflows;
};

struct guest_perf_shadow {
u32 id;
struct guest_perf_counter sync_data;
};

atomic_t guest_perf_id; /*Global id counter per guest os*/


>
> >> Need to detect the kvm pmu via its own cpuid bit.
> >>
> > Ok. I will add a feature, KVM_FEATURE_PARA_PERF, something like
> > bit KVM_FEATURE_CLOCKSOURCE.
> >
>
> Don't forget Documentation/kvm/cpuid.txt.
Thanks for your kind reminder.

> >> Ok, so event->shadow is never dereferenced. In that case better not
> >> make it a pointer at all, keep it unsigned long.
> >>
> > Host kernel also uses it.
> >
>
> I see. So put it in a union. Or perhaps not even in a union - what if
> a kvm guest is also acting as a kvm host?
My patch takes that into consideration. I compiled the kernel with host and guest support
enabled at the same time. perf_event->shadow is only accessed in specific scenarios,
i.e. in specific functions, and each of these functions is called either only by the
host kernel or only by the guest kernel.

>
>
> >
> >>
> >>> +
> >>> +static void kvm_sync_event_to_guest(struct perf_event *event, int overflows)
> >>> +{
> >>> + struct hw_perf_event *hwc = &event->hw;
> >>> + struct kvmperf_event *kvmevent;
> >>> + int offset, len, data_len, copied, page_offset;
> >>> + struct page *event_page;
> >>> + void *shared_kaddr;
> >>> +
> >>> + kvmevent = event->shadow;
> >>> + offset = kvmevent->event_offset;
> >>> +
> >>> + /* Copy perf_event->count firstly */
> >>> + offset += offsetof(struct perf_event, count);
> >>> + if (offset < PAGE_SIZE) {
> >>> + event_page = kvmevent->event_page;
> >>> + page_offset = offset;
> >>> + } else {
> >>> + event_page = kvmevent->event_page2;
> >>> + page_offset = offset - PAGE_SIZE;
> >>> + }
> >>> + shared_kaddr = kmap_atomic(event_page, KM_USER0);
> >>> + *((atomic64_t *)(shared_kaddr + page_offset)) = event->count;
> >>>
> >>>
> >> This copy will not be atomic on 32-bit hosts.
> >>
> > Right. But it shouldn't be a problem as vcpu thread stops when vmexit to
> > host to process events. In addition, only current cpu in guest accesses
> > perf_events linked to current cpu.
> >
>
> Ok. These restrictions should be documented.
Perhaps I should write them down as code comments in the patch.

>
> >>
> >>> +}
> >>> +
> >>>
> >>> +static struct perf_event *
> >>> +kvm_pv_perf_op_open(struct kvm_vcpu *vcpu, gpa_t addr)
> >>> +{
> >>> + int ret;
> >>> + struct perf_event *event;
> >>> + struct perf_event *host_event = NULL;
> >>> + struct kvmperf_event *shadow = NULL;
> >>> +
> >>> + event = kzalloc(sizeof(*event), GFP_KERNEL);
> >>> + if (!event)
> >>> + goto out;
> >>> +
> >>> + shadow = kzalloc(sizeof(*shadow), GFP_KERNEL);
> >>> + if (!shadow)
> >>> + goto out;
> >>> +
> >>> + shadow->event_page = gfn_to_page(vcpu->kvm, addr >> PAGE_SHIFT);
> >>> + shadow->event_offset = addr & ~PAGE_MASK;
> >>>
> >>>
> >> Might truncate on 32-bit hosts. PAGE_MASK is 32-bit while addr is 64 bit.
> >>
> > Above codes just run at host side. Is it possible that host kernel is 32 bit and
> > guest kernel is 64bits?
>
> No, guest bitness always <= host bitness. But gpa_t is 64-bit even on
> 32-bit host/guest, so you can't use PAGE_MASK on that.
I will check it.

>
> >>> +
> >>> + ret = kvm_read_guest(vcpu->kvm, addr, event, sizeof(*event));
> >>> + if (ret)
> >>> + goto out;
> >>>
> >>>
> >> I assume this is to check existence?
> >>
> > Here calling kvm_read_guest is to get a copy of guest perf_event as it has
> > perf_event.attr. Host need the attr to create host perf_event.
> >
> >
> >> It doesn't work well with memory
> >> hotplug. In general it's preferred to use
> >> kvm_read_guest()/kvm_write_guest() instead of gfn_to_page() during
> >> initialization to prevent pinning and allow for hotplug.
> >>
> > That's an issue. But host kernel couldn't go to sleep when processing event
> > overflows under NMI context.
> >
>
> You can set a bit in vcpu->requests and schedule the copying there.
> vcpu->requests is always checked before guest entry.
That becomes a little complicated, as we would need to record the overflowed events
in the vcpu. Let me check it again.

>
>
> >
> >> It may be better to have the guest create an id to the host, and the
> >> host can simply look up the id in a table:
> >>
> > Perhaps the address of guest perf_event is the best id.
> >
>
> That will need to be looked up in a hash table. A small id is best
> because it can be easily looked up by both guest and host.
Yes. I will use a u32 or atomic_t.

Thanks for your patience!

Yanmin

