From: Avi Kivity on
On 06/15/2010 04:55 PM, Dave Hansen wrote:
> In a previous patch, we removed the 'nr_to_scan' tracking.
> It was not being used to track the number of objects
> scanned, so we stopped using it entirely. Here, we
> start using it again.
>
> The theory here is simple; if we already have the refcount
> and the kvm->mmu_lock, then we should do as much work as
> possible under the lock. The downside is that we're less
> fair about the KVM instances from which we reclaim. Each
> call to mmu_shrink() will tend to "pick on" one instance,
> after which it gets moved to the end of the list and left
> alone for a while.
>

That also increases the latency hit, as well as a potential fault storm,
on that instance. Spreading out is less efficient, but smoother.

> If mmu_shrink() has already done a significant amount of
> scanning, the use of 'nr_to_scan' inside shrink_kvm_mmu()
> will also ensure that we do not over-reclaim when we have
> already done a lot of work in this call.
>
> In the end, this patch defines a "scan" as:
> 1. An attempt to acquire a refcount on a 'struct kvm'
> 2. freeing a kvm mmu page
>
> It would probably be ideal if we could expose some
> of the work done by kvm_mmu_remove_some_alloc_mmu_pages()
> as also counting as scanning, but I think we have churned
> enough for the moment.
>

It usually removes one page.

> Signed-off-by: Dave Hansen <dave(a)linux.vnet.ibm.com>
> ---
>
> linux-2.6.git-dave/arch/x86/kvm/mmu.c | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff -puN arch/x86/kvm/mmu.c~make-shrinker-more-aggressive arch/x86/kvm/mmu.c
> --- linux-2.6.git/arch/x86/kvm/mmu.c~make-shrinker-more-aggressive 2010-06-14 11:30:44.000000000 -0700
> +++ linux-2.6.git-dave/arch/x86/kvm/mmu.c 2010-06-14 11:38:04.000000000 -0700
> @@ -2935,8 +2935,10 @@ static int shrink_kvm_mmu(struct kvm *kv
>
> idx = srcu_read_lock(&kvm->srcu);
> spin_lock(&kvm->mmu_lock);
> - if (kvm->arch.n_used_mmu_pages > 0)
> - freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm);
> + while (nr_to_scan > 0 && kvm->arch.n_used_mmu_pages > 0) {
> + freed_pages += kvm_mmu_remove_some_alloc_mmu_pages(kvm);
> + nr_to_scan--;
> + }
>

What tree are you patching?

--
error compiling committee.c: too many arguments to function

From: Dave Hansen on
On Wed, 2010-06-16 at 12:24 +0300, Avi Kivity wrote:
> On 06/15/2010 04:55 PM, Dave Hansen wrote:
> > In a previous patch, we removed the 'nr_to_scan' tracking.
> > It was not being used to track the number of objects
> > scanned, so we stopped using it entirely. Here, we
> > start using it again.
> >
> > The theory here is simple; if we already have the refcount
> > and the kvm->mmu_lock, then we should do as much work as
> > possible under the lock. The downside is that we're less
> > fair about the KVM instances from which we reclaim. Each
> > call to mmu_shrink() will tend to "pick on" one instance,
> > after which it gets moved to the end of the list and left
> > alone for a while.
> >
>
> That also increases the latency hit, as well as a potential fault storm,
> on that instance. Spreading out is less efficient, but smoother.

This is probably something that we need to go back and actually measure.
My suspicion is that, when memory fills up and this shrinker is getting
called a lot, it will be naturally fair. That list gets shuffled around
enough, and mmu_shrink() called often enough that no VMs get picked on
too unfairly.

I'll go back and see if I can quantify this a bit, though.

I do worry about the case where you really have only a single CPU going
into reclaim and a very small number of VMs on the system. You're
basically guaranteeing that you'll throw away nr_to_scan of the poor
victim VM's pages, with no penalty on the other guy.
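
To make the "pick on one instance" behavior concrete, here is a heavily
simplified sketch of the shrink path I have in mind (the locking is
real, but the reference counting and error handling are elided, and
total_mmu_pages_in_use() is a made-up placeholder):

/*
 * Rough sketch only, not the literal patch: pin one kvm instance,
 * drain up to nr_to_scan pages from it, then rotate it to the tail
 * of vm_list so the next call picks on somebody else.
 */
static int mmu_shrink(int nr_to_scan, gfp_t gfp_mask)
{
	struct kvm *kvm;

	spin_lock(&kvm_lock);
	list_for_each_entry(kvm, &vm_list, vm_list) {
		/* the attempt to pin 'kvm' counts as one "scan" */
		nr_to_scan--;

		/* do as much work as we can on this one instance */
		shrink_kvm_mmu(kvm, nr_to_scan);

		/* rotate the victim to the back of the line */
		list_move_tail(&kvm->vm_list, &vm_list);
		break;
	}
	spin_unlock(&kvm_lock);

	/* the old shrinker API wants the remaining cache size back */
	return total_mmu_pages_in_use();	/* placeholder */
}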

> > If mmu_shrink() has already done a significant amount of
> > scanning, the use of 'nr_to_scan' inside shrink_kvm_mmu()
> > will also ensure that we do not over-reclaim when we have
> > already done a lot of work in this call.
> >
> > In the end, this patch defines a "scan" as:
> > 1. An attempt to acquire a refcount on a 'struct kvm'
> > 2. freeing a kvm mmu page
> >
> > It would probably be ideal if we could expose some
> > of the work done by kvm_mmu_remove_some_alloc_mmu_pages()
> > as also counting as scanning, but I think we have churned
> > enough for the moment.
>
> It usually removes one page.

Does it always just go right now and free it, or is there any real
scanning that has to go on?

> > diff -puN arch/x86/kvm/mmu.c~make-shrinker-more-aggressive arch/x86/kvm/mmu.c
> > --- linux-2.6.git/arch/x86/kvm/mmu.c~make-shrinker-more-aggressive 2010-06-14 11:30:44.000000000 -0700
> > +++ linux-2.6.git-dave/arch/x86/kvm/mmu.c 2010-06-14 11:38:04.000000000 -0700
> > @@ -2935,8 +2935,10 @@ static int shrink_kvm_mmu(struct kvm *kv
> >
> > idx = srcu_read_lock(&kvm->srcu);
> > spin_lock(&kvm->mmu_lock);
> > - if (kvm->arch.n_used_mmu_pages > 0)
> > - freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm);
> > + while (nr_to_scan > 0 && kvm->arch.n_used_mmu_pages > 0) {
> > + freed_pages += kvm_mmu_remove_some_alloc_mmu_pages(kvm);
> > + nr_to_scan--;
> > + }
> >
>
> What tree are you patching?

These applied to Linus's latest as of yesterday.

-- Dave

From: Avi Kivity on
On 06/16/2010 06:25 PM, Dave Hansen wrote:
>
>>> If mmu_shrink() has already done a significant amount of
>>> scanning, the use of 'nr_to_scan' inside shrink_kvm_mmu()
>>> will also ensure that we do not over-reclaim when we have
>>> already done a lot of work in this call.
>>>
>>> In the end, this patch defines a "scan" as:
>>> 1. An attempt to acquire a refcount on a 'struct kvm'
>>> 2. freeing a kvm mmu page
>>>
>>> It would probably be ideal if we could expose some
>>> of the work done by kvm_mmu_remove_some_alloc_mmu_pages()
>>> as also counting as scanning, but I think we have churned
>>> enough for the moment.
>>>
>> It usually removes one page.
>>
> Does it always just go right now and free it, or is there any real
> scanning that has to go on?
>

It picks a page from the tail of the LRU and frees it. There is very
little attempt to keep the LRU in LRU order, though.

We do need a scanner that looks at spte accessed bits if this isn't
going to result in performance losses.
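
For reference, the helper is tiny. It looks roughly like this
(reconstructed from memory, so treat the field names and the exact
return convention as approximate):

/*
 * Approximate shape of kvm_mmu_remove_some_alloc_mmu_pages(): grab the
 * shadow page sitting at the tail of the (loosely maintained) LRU list
 * and zap it. The return value counts freed pages, which can exceed
 * one when unsynced children go away along with the parent.
 */
static int kvm_mmu_remove_some_alloc_mmu_pages(struct kvm *kvm)
{
	struct kvm_mmu_page *sp;

	sp = container_of(kvm->arch.active_mmu_pages.prev,
			  struct kvm_mmu_page, link);
	return kvm_mmu_zap_page(kvm, sp);
}

Note that there is no walk and no accessed-bit check in there at all.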

>>> diff -puN arch/x86/kvm/mmu.c~make-shrinker-more-aggressive arch/x86/kvm/mmu.c
>>> --- linux-2.6.git/arch/x86/kvm/mmu.c~make-shrinker-more-aggressive 2010-06-14 11:30:44.000000000 -0700
>>> +++ linux-2.6.git-dave/arch/x86/kvm/mmu.c 2010-06-14 11:38:04.000000000 -0700
>>> @@ -2935,8 +2935,10 @@ static int shrink_kvm_mmu(struct kvm *kv
>>>
>>> idx = srcu_read_lock(&kvm->srcu);
>>> spin_lock(&kvm->mmu_lock);
>>> - if (kvm->arch.n_used_mmu_pages > 0)
>>> - freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm);
>>> + while (nr_to_scan > 0 && kvm->arch.n_used_mmu_pages > 0) {
>>> + freed_pages += kvm_mmu_remove_some_alloc_mmu_pages(kvm);
>>> + nr_to_scan--;
>>> + }
>>>
>>>
>> What tree are you patching?
>>
> These applied to Linus's latest as of yesterday.
>

Please patch against kvm.git master (or next, which is usually a few
not-yet-regression-tested patches ahead). This code has changed.

--
error compiling committee.c: too many arguments to function

From: Dave Hansen on
On Wed, 2010-06-16 at 08:25 -0700, Dave Hansen wrote:
> On Wed, 2010-06-16 at 12:24 +0300, Avi Kivity wrote:
> > On 06/15/2010 04:55 PM, Dave Hansen wrote:
> > > In a previous patch, we removed the 'nr_to_scan' tracking.
> > > It was not being used to track the number of objects
> > > scanned, so we stopped using it entirely. Here, we
> > > start using it again.
> > >
> > > The theory here is simple; if we already have the refcount
> > > and the kvm->mmu_lock, then we should do as much work as
> > > possible under the lock. The downside is that we're less
> > > fair about the KVM instances from which we reclaim. Each
> > > call to mmu_shrink() will tend to "pick on" one instance,
> > > after which it gets moved to the end of the list and left
> > > alone for a while.
> > >
> >
> > That also increases the latency hit, as well as a potential fault storm,
> > on that instance. Spreading out is less efficient, but smoother.
>
> This is probably something that we need to go back and actually measure.
> My suspicion is that, when memory fills up and this shrinker is getting
> called a lot, it will be naturally fair. That list gets shuffled around
> enough, and mmu_shrink() called often enough that no VMs get picked on
> too unfairly.
>
> I'll go back and see if I can quantify this a bit, though.

The shrink _query_ (mmu_shrink() with nr_to_scan=0) code is called
really, really often. Like 5,000-10,000 times a second during lots of
VM pressure. But, it's almost never called on to actually shrink
anything.

Over the 20 minutes or so that I tested, I saw about 700k calls to
mmu_shrink(). But, only 6 (yes, six) calls that had a non-zero
nr_to_scan. I'm not sure whether this is because of the .seeks argument
to the shrinker or what, but the slab code stays far, far away from
making mmu_shrink() do much real work.
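
For anyone not familiar with the old shrinker interface: nr_to_scan == 0
is purely a "how big is your cache?" query from vmscan, and only a
non-zero value asks for real work. A minimal sketch of the convention
(generic example with made-up helpers, not the KVM code):

/*
 * nr_to_scan == 0: just report the object count. This is the call
 * that fires thousands of times a second under pressure.
 * nr_to_scan > 0:  the rare request to actually free that many objects.
 */
static int example_shrink(int nr_to_scan, gfp_t gfp_mask)
{
	if (nr_to_scan == 0)
		return count_cached_objects();	/* query only, no reclaim */

	free_some_objects(nr_to_scan);		/* the real (rare) work */
	return count_cached_objects();
}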

That changes a few things. I bet all the contention we were seeing was
just from nr_to_scan=0 calls and not from actual shrink operations.
Perhaps we should just stop this set after patch 4.

Any thoughts?

-- Dave

From: Avi Kivity on
On 06/18/2010 06:49 PM, Dave Hansen wrote:
> On Wed, 2010-06-16 at 08:25 -0700, Dave Hansen wrote:
>
>> On Wed, 2010-06-16 at 12:24 +0300, Avi Kivity wrote:
>>
>>> On 06/15/2010 04:55 PM, Dave Hansen wrote:
>>>
>>>> In a previous patch, we removed the 'nr_to_scan' tracking.
>>>> It was not being used to track the number of objects
>>>> scanned, so we stopped using it entirely. Here, we
>>>> start using it again.
>>>>
>>>> The theory here is simple; if we already have the refcount
>>>> and the kvm->mmu_lock, then we should do as much work as
>>>> possible under the lock. The downside is that we're less
>>>> fair about the KVM instances from which we reclaim. Each
>>>> call to mmu_shrink() will tend to "pick on" one instance,
>>>> after which it gets moved to the end of the list and left
>>>> alone for a while.
>>>>
>>>>
>>> That also increases the latency hit, as well as a potential fault storm,
>>> on that instance. Spreading out is less efficient, but smoother.
>>>
>> This is probably something that we need to go back and actually measure.
>> My suspicion is that, when memory fills up and this shrinker is getting
>> called a lot, it will be naturally fair. That list gets shuffled around
>> enough, and mmu_shrink() called often enough that no VMs get picked on
>> too unfairly.
>>
>> I'll go back and see if I can quantify this a bit, though.
>>
> The shrink _query_ (mmu_shrink() with nr_to_scan=0) code is called
> really, really often. Like 5,000-10,000 times a second during lots of
> VM pressure. But, it's almost never called on to actually shrink
> anything.
>
> Over the 20 minutes or so that I tested, I saw about 700k calls to
> mmu_shrink(). But, only 6 (yes, six) calls that had a non-zero
> nr_to_scan. I'm not sure whether this is because of the .seeks argument
> to the shrinker or what, but the slab code stays far, far away from
> making mmu_shrink() do much real work.
>

Certainly seems so from vmscan.c.

> That changes a few things. I bet all the contention we were seeing was
> just from nr_to_scan=0 calls and not from actual shrink operations.
> Perhaps we should just stop this set after patch 4.
>

At the very least, we should re-measure things.

Even afterwards, we might reduce .seeks in return for making the
shrinker cleverer and eliminating the cap on mmu pages. But I'm afraid
the interface between vmscan and the shrinker is too simplistic;
sometimes we can trim pages without much cost (unreferenced pages), but
some pages are really critical for performance. To see real
improvement, we might need our own scanner.
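
For reference, the knob in question is the .seeks field in the shrinker
registration; KVM sets it well above the default precisely so that
vmscan goes easy on this cache (shown roughly as it appears in mmu.c,
so treat the exact value as approximate):

/*
 * A large .seeks tells vmscan that these objects are expensive to
 * recreate, so it computes proportionally less work for mmu_shrink().
 * Lowering it is what would make real nr_to_scan requests show up
 * more often; registered once at init with register_shrinker().
 */
static struct shrinker mmu_shrinker = {
	.shrink = mmu_shrink,
	.seeks = DEFAULT_SEEKS * 10,
};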

--
error compiling committee.c: too many arguments to function
