From: Xiao Guangrong on


Marcelo Tosatti wrote:
> On Thu, Jul 01, 2010 at 09:55:56PM +0800, Xiao Guangrong wrote:
>> Combine guest pte read between guest pte walk and pte prefetch
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong(a)cn.fujitsu.com>
>> ---
>> arch/x86/kvm/paging_tmpl.h | 48 ++++++++++++++++++++++++++++++-------------
>> 1 files changed, 33 insertions(+), 15 deletions(-)
>
> Can't do this, it can miss invlpg:
>
> vcpu0                          vcpu1
> read guest ptes
>                                modify guest pte
>                                invlpg
> instantiate stale
> guest pte

Ah, oops, sorry :-(

>
> See how the pte is reread inside fetch with mmu_lock held.
>
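(For reference, the reread Marcelo mentions follows the pattern sketched
below; this is a minimal sketch with a simplified helper name, not the
exact tree code:

	/*
	 * Reread the guest pte while mmu_lock is held and compare it
	 * with the value walk_addr() cached; treat a changed pte, or
	 * a failed read, as stale.
	 */
	static bool gpte_still_valid(struct kvm_vcpu *vcpu, gpa_t pte_gpa,
				     pt_element_t cached)
	{
		pt_element_t curr;

		if (kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &curr,
					  sizeof(curr)))
			return false;

		return curr == cached;
	}

When this returns false, the shadow walk bails out instead of
instantiating the cached translation.)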

It looks like something is broken in the 'fetch' functions; this patch
will fix it.

Subject: [PATCH] KVM: MMU: fix last level broken in FNAME(fetch)

We read the guest pagetable levels outside of 'mmu_lock', so the host
mapping can sometimes become inconsistent. Consider this case:

VCPU0:                                  VCPU1:

Read the guest mapping; assume it is
GLV3 -> GLV2 -> GLV1 -> GFNA,
and in the host, the corresponding
mapping is
HLV3 -> HLV2 -> HLV1 (P=0)

                                        Write GLV1 and make the mapping
                                        point to GFNB (may occur in the
                                        pte_write or invlpg path)

Map GLV1 to GFNA

This issue only occurs at the last level of the indirect mapping: if a
middle-level mapping is changed, the shadow page is zapped and the change
is detected in the FNAME(fetch) path, but when the last level is mapped,
the gpte is not rechecked.

Fix it by also checking the last level.

Signed-off-by: Xiao Guangrong <xiaoguangrong(a)cn.fujitsu.com>
---
arch/x86/kvm/paging_tmpl.h | 32 +++++++++++++++++++++++++-------
1 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 3350c02..e617e93 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -291,6 +291,20 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		     gpte_to_gfn(gpte), pfn, true, true);
 }
 
+static bool FNAME(check_level_mapping)(struct kvm_vcpu *vcpu,
+				       struct guest_walker *gw, int level)
+{
+	pt_element_t curr_pte;
+	int r;
+
+	r = kvm_read_guest_atomic(vcpu->kvm, gw->pte_gpa[level - 1],
+				  &curr_pte, sizeof(curr_pte));
+	if (r || curr_pte != gw->ptes[level - 1])
+		return false;
+
+	return true;
+}
+
 /*
  * Fetch a shadow pte for a specific level in the paging hierarchy.
  */
@@ -304,11 +318,9 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 	u64 spte, *sptep = NULL;
 	int direct;
 	gfn_t table_gfn;
-	int r;
 	int level;
-	bool dirty = is_dirty_gpte(gw->ptes[gw->level - 1]);
+	bool dirty = is_dirty_gpte(gw->ptes[gw->level - 1]), check = true;
 	unsigned direct_access;
-	pt_element_t curr_pte;
 	struct kvm_shadow_walk_iterator iterator;
 
 	if (!is_present_gpte(gw->ptes[gw->level - 1]))
@@ -322,6 +334,12 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 		level = iterator.level;
 		sptep = iterator.sptep;
 		if (iterator.level == hlevel) {
+			if (check && level == gw->level &&
+			    !FNAME(check_level_mapping)(vcpu, gw, hlevel)) {
+				kvm_release_pfn_clean(pfn);
+				break;
+			}
+
 			mmu_set_spte(vcpu, sptep, access,
 				     gw->pte_access & access,
 				     user_fault, write_fault,
@@ -376,10 +394,10 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 		sp = kvm_mmu_get_page(vcpu, table_gfn, addr, level-1,
 				      direct, access, sptep);
 		if (!direct) {
-			r = kvm_read_guest_atomic(vcpu->kvm,
-						  gw->pte_gpa[level - 2],
-						  &curr_pte, sizeof(curr_pte));
-			if (r || curr_pte != gw->ptes[level - 2]) {
+			if (hlevel == level - 1)
+				check = false;
+
+			if (!FNAME(check_level_mapping)(vcpu, gw, level - 1)) {
 				kvm_mmu_put_page(sp, sptep);
 				kvm_release_pfn_clean(pfn);
 				sptep = NULL;
--
1.6.1.2



From: Avi Kivity on
On 07/02/2010 08:03 PM, Marcelo Tosatti wrote:
> On Thu, Jul 01, 2010 at 09:55:56PM +0800, Xiao Guangrong wrote:
>
>> Combine guest pte read between guest pte walk and pte prefetch
>>
>> Signed-off-by: Xiao Guangrong<xiaoguangrong(a)cn.fujitsu.com>
>> ---
>> arch/x86/kvm/paging_tmpl.h | 48 ++++++++++++++++++++++++++++++-------------
>> 1 files changed, 33 insertions(+), 15 deletions(-)
>>
> Can't do this, it can miss invlpg:
>
> vcpu0                          vcpu1
> read guest ptes
>                                modify guest pte
>                                invlpg
> instantiate stale
> guest pte
>
> See how the pte is reread inside fetch with mmu_lock held.
>

Note, this is fine if the pte is unsync, since vcpu0 will soon invlpg
it. It's only broken for sync ptes.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

From: Avi Kivity on
On 07/03/2010 01:31 PM, Xiao Guangrong wrote:
>
>> See how the pte is reread inside fetch with mmu_lock held.
>>
>>
> It looks like something is broken in the 'fetch' functions; this patch
> will fix it.
>
> Subject: [PATCH] KVM: MMU: fix last level broken in FNAME(fetch)
>
> We read the guest pagetable levels outside of 'mmu_lock', so the host
> mapping can sometimes become inconsistent. Consider this case:
>
> VCPU0:                                  VCPU1:
>
> Read the guest mapping; assume it is
> GLV3 -> GLV2 -> GLV1 -> GFNA,
> and in the host, the corresponding
> mapping is
> HLV3 -> HLV2 -> HLV1 (P=0)
>
>                                         Write GLV1 and make the mapping
>                                         point to GFNB (may occur in the
>                                         pte_write or invlpg path)
>
> Map GLV1 to GFNA
>
> This issue only occurs at the last level of the indirect mapping: if a
> middle-level mapping is changed, the shadow page is zapped and the change
> is detected in the FNAME(fetch) path, but when the last level is mapped,
> the gpte is not rechecked.
>
> Fix it by also checking the last level.
>
>

I don't really see what is fixed. We already check the gpte. What's
special about the new scenario?

> @@ -322,6 +334,12 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>  		level = iterator.level;
>  		sptep = iterator.sptep;
>  		if (iterator.level == hlevel) {
> +			if (check && level == gw->level &&
> +			    !FNAME(check_level_mapping)(vcpu, gw, hlevel)) {
> +				kvm_release_pfn_clean(pfn);
> +				break;
> +			}
> +
>

Now we check here...

>  			mmu_set_spte(vcpu, sptep, access,
>  				     gw->pte_access & access,
>  				     user_fault, write_fault,
> @@ -376,10 +394,10 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>  		sp = kvm_mmu_get_page(vcpu, table_gfn, addr, level-1,
>  				      direct, access, sptep);
>  		if (!direct) {
> -			r = kvm_read_guest_atomic(vcpu->kvm,
> -						  gw->pte_gpa[level - 2],
> -						  &curr_pte, sizeof(curr_pte));
> -			if (r || curr_pte != gw->ptes[level - 2]) {
> +			if (hlevel == level - 1)
> +				check = false;
> +
> +			if (!FNAME(check_level_mapping)(vcpu, gw, level - 1)) {
>

.... and here? Why?


(looking at the code, we have a call to kvm_host_page_size() on every
page fault, which takes mmap_sem... that's got to impact scaling)

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

From: Xiao Guangrong on


Avi Kivity wrote:
> On 07/03/2010 01:31 PM, Xiao Guangrong wrote:
>>
>>> See how the pte is reread inside fetch with mmu_lock held.
>>>
>>>
>> It looks like something is broken in the 'fetch' functions; this patch
>> will fix it.
>>
>> Subject: [PATCH] KVM: MMU: fix last level broken in FNAME(fetch)
>>
>> We read the guest pagetable levels outside of 'mmu_lock', so the host
>> mapping can sometimes become inconsistent. Consider this case:
>>
>> VCPU0:                                  VCPU1:
>>
>> Read the guest mapping; assume it is
>> GLV3 -> GLV2 -> GLV1 -> GFNA,
>> and in the host, the corresponding
>> mapping is
>> HLV3 -> HLV2 -> HLV1 (P=0)
>>
>>                                         Write GLV1 and make the mapping
>>                                         point to GFNB (may occur in the
>>                                         pte_write or invlpg path)
>>
>> Map GLV1 to GFNA
>>
>> This issue only occurs at the last level of the indirect mapping: if a
>> middle-level mapping is changed, the shadow page is zapped and the change
>> is detected in the FNAME(fetch) path, but when the last level is mapped,
>> the gpte is not rechecked.
>>
>> Fix it by also checking the last level.
>>
>>
>
> I don't really see what is fixed. We already check the gpte. What's
> special about the new scenario?
>

What I mean is: while we map the last level, we directly set the spte to
the pfn, but that pfn was resolved by walk_addr; by this time, the guest
mapping may have changed.
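Roughly, the last-level path in FNAME(fetch) has this shape (a simplified
sketch based on the diffs quoted above; the tail of the mmu_set_spte()
argument list is reconstructed here, so treat it as illustrative):

	if (iterator.level == hlevel) {
		/*
		 * pfn was resolved from the gptes that walk_addr() read
		 * outside mmu_lock; nothing rereads gw->ptes[0] on this
		 * path, so a concurrent guest pte write goes unnoticed.
		 */
		mmu_set_spte(vcpu, sptep, access, gw->pte_access & access,
			     user_fault, write_fault, dirty, ptwrite,
			     level, gw->gfn, pfn, false, true);
		break;
	}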

What does 'We already check the gpte' mean? I think I am missing something :-(

From: Avi Kivity on
On 07/03/2010 03:16 PM, Xiao Guangrong wrote:
>
> Avi Kivity wrote:
>
>> On 07/03/2010 01:31 PM, Xiao Guangrong wrote:
>>
>>>
>>>> See how the pte is reread inside fetch with mmu_lock held.
>>>>
>>>>
>>>>
>>> It looks like something is broken in the 'fetch' functions; this patch
>>> will fix it.
>>>
>>> Subject: [PATCH] KVM: MMU: fix last level broken in FNAME(fetch)
>>>
>>> We read the guest pagetable levels outside of 'mmu_lock', so the host
>>> mapping can sometimes become inconsistent. Consider this case:
>>>
>>> VCPU0:                                  VCPU1:
>>>
>>> Read the guest mapping; assume it is
>>> GLV3 -> GLV2 -> GLV1 -> GFNA,
>>> and in the host, the corresponding
>>> mapping is
>>> HLV3 -> HLV2 -> HLV1 (P=0)
>>>
>>>                                         Write GLV1 and make the mapping
>>>                                         point to GFNB (may occur in the
>>>                                         pte_write or invlpg path)
>>>
>>> Map GLV1 to GFNA
>>>
>>> This issue only occurs at the last level of the indirect mapping: if a
>>> middle-level mapping is changed, the shadow page is zapped and the change
>>> is detected in the FNAME(fetch) path, but when the last level is mapped,
>>> the gpte is not rechecked.
>>>
>>> Fix it by also checking the last level.
>>>
>>>
>>>
>> I don't really see what is fixed. We already check the gpte. What's
>> special about the new scenario?
>>
>>
> What I mean is: while we map the last level, we directly set the spte to
> the pfn, but that pfn was resolved by walk_addr; by this time, the guest
> mapping may have changed.
>
> What does 'We already check the gpte' mean? I think I am missing something :-(
>

	if (!direct) {
		r = kvm_read_guest_atomic(vcpu->kvm,
					  gw->pte_gpa[level - 2],
					  &curr_pte, sizeof(curr_pte));
		if (r || curr_pte != gw->ptes[level - 2]) {
			kvm_mmu_put_page(shadow_page, sptep);
			kvm_release_pfn_clean(pfn);
			sptep = NULL;
			break;
		}
	}

the code you moved... under what scenario is it not sufficient?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
