KVM: MMU: combine guest pte read between walk and pte prefetch [Kernel]

Prev: [PATCH padmux.c] Fix typo in kerneldoc
Next: New ACL format for better NFSv4 acl interoperability

From: Xiao Guangrong on 4 Jul 2010 23:00

Avi Kivity wrote:

>> Umm, if we move the check before the judgment, it'll check every level,
>> actually, only the opened mapping and the laster level need checked, so
>> for the performance reason, maybe it's better to keep two check-point.
>>
>
> What exactly are the conditions when you want to check?
>
> Perhaps we do need to check every level. A write to a PDE (or higher
> level) will clear the corresponding spte, but a fault immediately
> afterwards can re-establish the spte to point to the new page.
>

Looks into the code more carefully, maybe this code is wrong:

if (!direct) {
r = kvm_read_guest_atomic(vcpu->kvm,
- gw->pte_gpa[level - 2],
+ gw->pte_gpa[level - 1],
&curr_pte, sizeof(curr_pte));
- if (r || curr_pte != gw->ptes[level - 2]) {
+ if (r || curr_pte != gw->ptes[level - 1]) {
kvm_mmu_put_page(sp, sptep);
kvm_release_pfn_clean(pfn);
sptep = NULL;

It should check the 'level' mapping not 'level - 1', in the later description
i'll explain it.

Since the 'walk_addr' functions is out of mmu_lock protection, we should
handle the guest modify its mapping between 'walk_addr' and 'fetch'.
Now, let's consider every case may be happened while handle guest #PF.

One case is host handle guest written after 'fetch', just like this order:

VCPU 0 VCPU 1
walk_addr
guest modify its mapping
fetch
host handle this written(pte_write or invlpg)

This case is not broken anything, even if 'fetch' setup the wrong mapping, the
later written handler will fix it.

Another case is host handle guest written before 'fetch', like this:

CPU 0 VCPU 1
walk_addr
guest modify its mapping
host handle this written(pte_write or invlpg)
fetch

We should notice this case, since the later 'fetch' will setup the wrong mapping.

For example, the guest mapping which 'walk_addr' got is:

GPML4E -> GPDPE -> GPDE -> GPTE -> GFNA
(Just take small page for example, other mapping way is also applied)

And, for good to describe, we abstract 'fetch''s work:

for_each_shadow_entry(vcpu, addr, iterator) {
if (iterator.level == hlevel)
Mapping the later level

if (is_shadow_present_pte(*sptep) &&
!is_large_pte(*sptep)) <------ Point A
continue;

/* handle the non-present/wrong middle level */

find/create the corresponding sp <----- Point B

if (the guest mapping is modified) <----- Point C
break;
setup this level's mapping <----- Point D

}

[
Note: the later level means PTE, PDE if 2M page size, PDPE if 1G page size
the middle level means PDE, PDPE if not using large page size / PML4E
]

There are two cases:

1: Guest modify the middle level, for example, guest modify the GPDE.
a: the GPDE has corresponding sp entry, after VCPU1 handle this written,
the corresponding host mapping is like this:
HPML4E -> HPDPE -> HPDE
[ HPDE.P = 0 since VCPU1 written handler zapped it in pte_wirte ]

Under this case, it can broke Point A's judgment and Point C can detect
the guest mapping is modified, so it exits this function without setup the
mapping.

And, we should check the guest mapping at GPDE not GPTE since it's GPDE
modified not GPTE, it's the explanation for the front fix.

b: the GPDE not has corresponding sp entry(the area is firstly accessed),
corresponding host mapping is like this:
HPML4E -> HPDPE
[ HPDPE.P = 0]

under this case, VCPU1 happily write GPDE without #PF since the GPDE not has shadow
page, it's not write-protected.

while we handle HPDPE, we will create the shadow page for GPDE
at Point B, then the host mapping is like this:
HPML4E -> HPDPE -> HPDE
[ HPDE.P = 0 ]
it's the same as 1.a, so do the same work as 1.a

Note: there is a trick: we should put 'Point C' behind of 'Point B', since after we
create sp for GPDE, the later modify GPDE will cause #PF, it can ensure later
check is valid

2: Guest modify the later level.
Form 'fetch''s abstract, we can see the late level is not checked by 'Point C', if
guest modified the GPTE's mapping, the wrong-mapping will be setup by 'fetch', this
is just this path does

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Avi Kivity on 5 Jul 2010 04:30

On 07/05/2010 05:52 AM, Xiao Guangrong wrote:
>
> Avi Kivity wrote:
>
>
>>> Umm, if we move the check before the judgment, it'll check every level,
>>> actually, only the opened mapping and the laster level need checked, so
>>> for the performance reason, maybe it's better to keep two check-point.
>>>
>>>
>> What exactly are the conditions when you want to check?
>>
>> Perhaps we do need to check every level. A write to a PDE (or higher
>> level) will clear the corresponding spte, but a fault immediately
>> afterwards can re-establish the spte to point to the new page.
>>
>>
> Looks into the code more carefully, maybe this code is wrong:
>
>
> if (!direct) {
> r = kvm_read_guest_atomic(vcpu->kvm,
> - gw->pte_gpa[level - 2],
> + gw->pte_gpa[level - 1],
> &curr_pte, sizeof(curr_pte));
> - if (r || curr_pte != gw->ptes[level - 2]) {
> + if (r || curr_pte != gw->ptes[level - 1]) {
> kvm_mmu_put_page(sp, sptep);
> kvm_release_pfn_clean(pfn);
> sptep = NULL;
>
> It should check the 'level' mapping not 'level - 1', in the later description
> i'll explain it.
>

Right, this fixes the check for the top level, but it removes a check
from the bottom level.

We need to move this to the top of the loop so we check all levels. I
guess this is why you needed to add a new check point. But since we
loop at least glevels times, we don't need two check points.

> Since the 'walk_addr' functions is out of mmu_lock protection, we should
> handle the guest modify its mapping between 'walk_addr' and 'fetch'.
> Now, let's consider every case may be happened while handle guest #PF.
>
> One case is host handle guest written after 'fetch', just like this order:
>
> VCPU 0 VCPU 1
> walk_addr
> guest modify its mapping
> fetch
> host handle this written(pte_write or invlpg)
>
> This case is not broken anything, even if 'fetch' setup the wrong mapping, the
> later written handler will fix it.
>

Ok.

> Another case is host handle guest written before 'fetch', like this:
>
> CPU 0 VCPU 1
> walk_addr
> guest modify its mapping
> host handle this written(pte_write or invlpg)
> fetch
>
> We should notice this case, since the later 'fetch' will setup the wrong mapping.
>
> For example, the guest mapping which 'walk_addr' got is:
>
> GPML4E -> GPDPE -> GPDE -> GPTE -> GFNA
> (Just take small page for example, other mapping way is also applied)
>
> And, for good to describe, we abstract 'fetch''s work:
>
> for_each_shadow_entry(vcpu, addr, iterator) {
> if (iterator.level == hlevel)
> Mapping the later level
>
> if (is_shadow_present_pte(*sptep)&&
> !is_large_pte(*sptep)) <------ Point A
> continue;
>
> /* handle the non-present/wrong middle level */
>
> find/create the corresponding sp<----- Point B
>
> if (the guest mapping is modified)<----- Point C
> break;
> setup this level's mapping<----- Point D
>
> }
>
> [
> Note: the later level means PTE, PDE if 2M page size, PDPE if 1G page size
> the middle level means PDE, PDPE if not using large page size / PML4E
> ]
>
> There are two cases:
>
> 1: Guest modify the middle level, for example, guest modify the GPDE.
> a: the GPDE has corresponding sp entry, after VCPU1 handle this written,
> the corresponding host mapping is like this:
> HPML4E -> HPDPE -> HPDE
> [ HPDE.P = 0 since VCPU1 written handler zapped it in pte_wirte ]
>
> Under this case, it can broke Point A's judgment and Point C can detect
> the guest mapping is modified, so it exits this function without setup the
> mapping.
>
> And, we should check the guest mapping at GPDE not GPTE since it's GPDE
> modified not GPTE, it's the explanation for the front fix.
>

Agree.

> b: the GPDE not has corresponding sp entry(the area is firstly accessed),
> corresponding host mapping is like this:
> HPML4E -> HPDPE
> [ HPDPE.P = 0]
>
> under this case, VCPU1 happily write GPDE without #PF since the GPDE not has shadow
> page, it's not write-protected.
>
> while we handle HPDPE, we will create the shadow page for GPDE
> at Point B, then the host mapping is like this:
> HPML4E -> HPDPE -> HPDE
> [ HPDE.P = 0 ]
> it's the same as 1.a, so do the same work as 1.a
>
> Note: there is a trick: we should put 'Point C' behind of 'Point B', since after we
> create sp for GPDE, the later modify GPDE will cause #PF, it can ensure later
> check is valid
>
> 2: Guest modify the later level.
> Form 'fetch''s abstract, we can see the late level is not checked by 'Point C', if
> guest modified the GPTE's mapping, the wrong-mapping will be setup by 'fetch', this
> is just this path does
>
>

Ok. So moving the check to before point A, and s/level - 2/level - 1/
should work, yes?

Should be slightly simpler since we don't need to kvm_mmu_put_page(sp,
sptep) any more.

Thanks for the detailed explanation.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Xiao Guangrong on 5 Jul 2010 04:50

Avi Kivity wrote:

>> Looks into the code more carefully, maybe this code is wrong:
>>
>>
>> if (!direct) {
>> r = kvm_read_guest_atomic(vcpu->kvm,
>> - gw->pte_gpa[level - 2],
>> + gw->pte_gpa[level - 1],
>> &curr_pte, sizeof(curr_pte));
>> - if (r || curr_pte != gw->ptes[level - 2]) {
>> + if (r || curr_pte != gw->ptes[level - 1]) {
>> kvm_mmu_put_page(sp, sptep);
>> kvm_release_pfn_clean(pfn);
>> sptep = NULL;
>>
>> It should check the 'level' mapping not 'level - 1', in the later
>> description
>> i'll explain it.
>>
>
> Right, this fixes the check for the top level, but it removes a check
> from the bottom level.
>

We no need check the bottom level if guest not modify the bottom level,
if guest modify it, the bottom level is no-present, it also can broke
Point A's judgment and be checked by 'Point C'

> We need to move this to the top of the loop so we check all levels. I
> guess this is why you needed to add a new check point. But since we
> loop at least glevels times, we don't need two check points.
>

>
> Ok. So moving the check to before point A, and s/level - 2/level - 1/
> should work, yes?
>
> Should be slightly simpler since we don't need to kvm_mmu_put_page(sp,
> sptep) any more.

Yeah, it can work, but check all levels is really unnecessary, if guest not
modify the level, the check can be avoid.

This is why i choose two check-point, one is behind Point A's judgment, this
point checks the level which modified by guest, and another point is at mapping
last level point, this check is alway need.

>
> Thanks for the detailed explanation.
>

It's really my pleasure :-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Avi Kivity on 5 Jul 2010 05:10

On 07/05/2010 11:45 AM, Xiao Guangrong wrote:
>
>
> Avi Kivity wrote:
>
>>> Looks into the code more carefully, maybe this code is wrong:
>>>
>>>
>>> if (!direct) {
>>> r = kvm_read_guest_atomic(vcpu->kvm,
>>> - gw->pte_gpa[level - 2],
>>> + gw->pte_gpa[level - 1],
>>> &curr_pte, sizeof(curr_pte));
>>> - if (r || curr_pte != gw->ptes[level - 2]) {
>>> + if (r || curr_pte != gw->ptes[level - 1]) {
>>> kvm_mmu_put_page(sp, sptep);
>>> kvm_release_pfn_clean(pfn);
>>> sptep = NULL;
>>>
>>> It should check the 'level' mapping not 'level - 1', in the later
>>> description
>>> i'll explain it.
>>>
>>
>> Right, this fixes the check for the top level, but it removes a check
>> from the bottom level.
>>
>
> We no need check the bottom level if guest not modify the bottom level,
> if guest modify it, the bottom level is no-present,

Why? VCPU1 will call kvm_mmu_pte_write (or invlpg) and establishes the
HPTE. Then VCPU0 calls mmu_set_pte() and writes the old GPTE.

> it also can broke
> Point A's judgment and be checked by 'Point C'

Well, the 'continue' in point A means we skip the check. That's not good.

>> We need to move this to the top of the loop so we check all levels. I
>> guess this is why you needed to add a new check point. But since we
>> loop at least glevels times, we don't need two check points.
>>
>
>>
>> Ok. So moving the check to before point A, and s/level - 2/level - 1/
>> should work, yes?
>>
>> Should be slightly simpler since we don't need to kvm_mmu_put_page(sp,
>> sptep) any more.
>
> Yeah, it can work, but check all levels is really unnecessary, if guest not
> modify the level, the check can be avoid.
>
> This is why i choose two check-point, one is behind Point A's judgment, this
> point checks the level which modified by guest, and another point is at mapping
> last level point, this check is alway need.

I'm not convinced we can bypass the checks. Consider:

VCPU0 VCPU1

#PF
walk_addr
-> gpml4e0,gpdpe0,gpde0,gpte0

replace gpdpe0 with gpdpe1
#PF
walk_addr
-> gpml4e0,gpdpe1,gpde1,gpte1
fetch
-> establish hpml4e0,hpdpte1,hpde0,hpte1
fetch
read hpdpe1
if (present(hpdpe1))
continue;
....
write hpte0 using shadow hieratchy for hpte1

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Xiao Guangrong on 5 Jul 2010 05:20

Avi Kivity wrote:

>
> I'm not convinced we can bypass the checks. Consider:
>
>
> VCPU0 VCPU1
>
> #PF
> walk_addr
> -> gpml4e0,gpdpe0,gpde0,gpte0
>
> replace gpdpe0 with gpdpe1
> #PF
> walk_addr
> -> gpml4e0,gpdpe1,gpde1,gpte1
> fetch
> -> establish hpml4e0,hpdpte1,hpde0,hpte1
> fetch
> read hpdpe1
> if (present(hpdpe1))
> continue;
> ...
> write hpte0 using shadow hieratchy for hpte1
>

Ah, i missed this case, thanks for you point it out, i'll fix it in
the next version.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4
Prev: [PATCH padmux.c] Fix typo in kerneldoc
Next: New ACL format for better NFSv4 acl interoperability