From: Balbir Singh on
* Avi Kivity <avi(a)redhat.com> [2010-06-15 10:12:44]:

> On 06/14/2010 08:16 PM, Balbir Singh wrote:
> >* Dave Hansen<dave(a)linux.vnet.ibm.com> [2010-06-14 10:09:31]:
> >
> >>On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote:
> >>>If you've got duplicate pages and you know
> >>>that they are duplicated and can be retrieved at a lower cost, why
> >>>wouldn't we go after them first?
> >>I agree with this in theory. But, the guest lacks the information about
> >>what is truly duplicated and what the costs are for itself and/or the
> >>host to recreate it. "Unmapped page cache" may be the best proxy that
> >>we have at the moment for "easy to recreate", but I think it's still too
> >>poor a match to make these patches useful.
> >>
> >That is why the policy (in the next set) will come from the host. As
> >to whether the data is truly duplicated, my experiments show up to 60%
> >of the page cache is duplicated.
>
> Isn't that incredibly workload dependent?
>
> We can't expect the host admin to know whether duplication will
> occur or not.
>

I was referring to the cache= policy we use based on the setup. I don't
think the duplication is too workload-specific. Moreover, we could use
aggressive policies and restrict page cache usage, or do it selectively
on ballooning. We could also add other options to make the ballooning
option truly optional, so that the system management software decides.
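The "policy comes from the host" idea can be sketched as a small helper in the
management software. This is a hypothetical illustration: the policy names and
the mapping from QEMU cache= modes are assumptions of this sketch, not part of
the patches under discussion.

```python
def trimming_policy(cache_mode: str) -> str:
    """Hypothetical host-side chooser: pick a guest page-cache policy
    from the cache= mode of the guest's disk.

    With a host-cached disk (writeback/writethrough) the guest pagecache
    is likely duplicated in the host, so trimming unmapped cache pages is
    plausible; with cache=none there is no host-side copy, so the guest
    cache should be left alone. Policy strings are illustrative only."""
    if cache_mode in ("writeback", "writethrough"):
        return "trim-unmapped-pagecache"
    if cache_mode in ("none", "directsync"):
        return "keep-guest-cache"
    return "default"
```

The point of routing this through management software is exactly the one made
above: the host admin (or tooling) knows the setup, while a guest on its own
does not.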

--
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Avi Kivity on
On 06/15/2010 10:49 AM, Balbir Singh wrote:
>
>> All we need is to select the right page to drop.
>>
>>
> Do we need to go down to the granularity of the individual page to
> drop? I think figuring out the class of pages, and making sure that we
> don't write our own reclaim logic but work with what we have to
> identify that class of pages, is a good start.
>

Well, the class of pages is 'pages that are duplicated on the host'.
Unmapped page cache pages are 'pages that might be duplicated on the
host'. IMO, that's not close enough.

>> How can the host tell if there is duplication? It may know it has
>> some pagecache, but it has no idea whether or to what extent guest
>> pagecache duplicates host pagecache.
>>
>>
> Well, it is possible in host user space; I, for example, use the
> memory cgroup, and through the stats I have a good idea of how much is
> duplicated. I am of course making an assumption, with my setup of the
> cached mode, that the data in the guest page cache and the page cache
> in the cgroup will be duplicated to a large extent. I did some trivial
> experiments, like dropping the data from the guest and looking at the
> cost of bringing it back in, and dropping the data from both guest and
> host and looking at the cost. I could see a difference.
>
> Unfortunately, I did not save the data, so I'll need to redo the
> experiment.
>
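The drop-and-reread experiment described above can be approximated from
userspace. A minimal sketch, assuming a Linux host (os.posix_fadvise is
Unix-only) and a scratch file; this is not the actual methodology behind the
quoted numbers:

```python
import os
import time

def timed_read(fd, length):
    """Read the whole file once and return elapsed seconds."""
    t0 = time.perf_counter()
    os.lseek(fd, 0, os.SEEK_SET)
    remaining = length
    while remaining:
        chunk = os.read(fd, min(remaining, 1 << 20))
        if not chunk:
            break
        remaining -= len(chunk)
    return time.perf_counter() - t0

def drop_and_reread(path):
    """Return (warm, cold) read times for one file.

    POSIX_FADV_DONTNEED asks the kernel to drop the file's clean cached
    pages, so the second read must be refilled from the backing store
    (or, inside a guest with a host-cached disk, from the host cache)."""
    fd = os.open(path, os.O_RDWR)
    try:
        length = os.fstat(fd).st_size
        os.fsync(fd)                      # make cached pages clean
        warm = timed_read(fd, length)     # served from local page cache
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        cold = timed_read(fd, length)     # pay the refill cost
        return warm, cold
    finally:
        os.close(fd)
```

Run inside a guest, the "cold" read may still be fast when the host caches the
disk image; comparing cold times with and without a host-side copy is the gist
of the cost difference Balbir reports.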

I'm sure we can detect it experimentally, but how do we do it
programmatically at run time (without dropping all the pages)?
Situations change, and I don't think we can infer from a few experiments
that we'll have a similar amount of sharing. The cost of an incorrect
decision is too high IMO (not that I think the kernel always chooses the
right pages now, but I'd like to avoid regressions from the
unvirtualized state).

btw, when running with a disk controller that has a very large cache, we
might also see duplication between "guest" and host. So, if this is a
good idea, it shouldn't be enabled just for virtualization, but for any
situation where we have a sizeable cache behind us.

>> It doesn't, really. The host only has aggregate information about
>> itself, and no information about the guest.
>>
>> Dropping duplicate pages would be good if we could identify them.
>> Even then, it's better to drop the page from the host, not the
>> guest, unless we know the same page is cached by multiple guests.
>>
>>
> On the exact pages to drop, please see my comments above on the class
> of pages to drop.
>

Well, we disagree about that. There is some value in dropping
duplicated pages (not always), but that's not what the patch does. It
drops unmapped pagecache pages, which may or may not be duplicated.

> There are reasons for wanting to get the host to cache the data
>

There are also reasons to get the guest to cache the data - it's more
efficient to access it in the guest.

> Unless the guest is using cache = none, the data will still hit the
> host page cache
> The host can do a better job of optimizing the writeouts
>

True, especially for non-raw storage. But even there we have to fsync
all the time to keep the metadata right.

>> But why would the guest voluntarily drop the cache? If there is no
>> memory pressure, dropping caches increases cpu overhead and latency
>> even if the data is still cached on the host.
>>
>>
> So, there are basically two approaches
>
> 1. First patch, proactive - enabled by a boot option
> 2. When ballooned, we try to (please NOTE try to) reclaim cached pages
> first. Failing which, we go after regular pages in the alloc_page()
> call in the balloon driver.
>

Doesn't that mean you may evict a recently-used (RU) mapped page ahead
of an LRU unmapped page, just in the hope that it is double-cached?

Maybe we need the guest and host to talk to each other about which pages
to keep.
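The "try cached pages first, failing which regular pages" fallback quoted
above can be sketched with a toy selector; the page representation and
function name are hypothetical, not from the balloon driver patch.

```python
def balloon_select(pages, target):
    """Toy balloon victim selection: prefer unmapped page-cache pages,
    and only fall back to other pages when not enough of those exist.
    Each page is a dict like {"id": 1, "mapped": False, "cache": True}."""
    preferred = [p for p in pages if p["cache"] and not p["mapped"]]
    fallback = [p for p in pages if not (p["cache"] and not p["mapped"])]
    chosen = preferred[:target]
    if len(chosen) < target:              # failing which, regular pages
        chosen += fallback[:target - len(chosen)]
    return chosen
```

Avi's objection, restated against this sketch: nothing here consults recency,
so a hot unmapped cache page can be picked ahead of a colder page purely on
the hope that it is double-cached.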

>>> 2. Drop the cache on either a special balloon option, again the host
>>> knows it caches that very same information, so it prefers to free that
>>> up first.
>>>
>> Dropping in response to pressure is good. I'm just not convinced
>> the patch helps in selecting the correct page to drop.
>>
>>
> That is why I've presented data on the experiments I've run and
> provided more arguments to backup the approach.
>

I'm still unconvinced, sorry.

--
error compiling committee.c: too many arguments to function

From: Avi Kivity on
On 06/15/2010 10:52 AM, Balbir Singh wrote:
>>>
>>> That is why the policy (in the next set) will come from the host. As
>>> to whether the data is truly duplicated, my experiments show up to 60%
>>> of the page cache is duplicated.
>>>
>> Isn't that incredibly workload dependent?
>>
>> We can't expect the host admin to know whether duplication will
>> occur or not.
>>
>>
> I was referring to the cache= policy we use based on the setup. I don't
> think the duplication is too workload specific. Moreover, we could use
> aggressive policies and restrict page cache usage or do it selectively
> on ballooning. We could also add other options to make the ballooning
> option truly optional, so that the system management software decides.
>

Consider a read-only workload that exactly fits in guest cache. Without
trimming, the guest will keep hitting its own cache, and the host will
see no access to the cache at all. So the host (assuming it is under
even low pressure) will evict those pages, and the guest will happily
use its own cache. If we start to trim, the guest will have to go to
disk. That's the best case.

Now for the worst case. A random access workload that misses the cache
on both guest and host. Now every page is duplicated, and trimming
guest pages allows the host to increase its cache, and potentially
reduce misses. In this case trimming duplicated pages works.

Real life will see a mix of this. Often used pages won't be duplicated,
and less often used pages may see some duplication, especially if the
host cache portion dedicated to the guest is bigger than the guest cache.

I can see that trimming duplicate pages helps, but (a) I'd like to be
sure they are duplicates and (b) often trimming them from the host is
better than trimming them from the guest.

Trimming from the guest is worthwhile if the pages are not used very
often (but enough that caching them in the host is worth it) and if the
host cache can serve more than one guest. If we can identify those
pages, we don't risk degrading best-case workloads (as defined above).

(Note that KSM to some extent identifies those pages, though it is a
bit expensive, and it doesn't share with the host pagecache.)

--
error compiling committee.c: too many arguments to function

From: Balbir Singh on
* Avi Kivity <avi(a)redhat.com> [2010-06-15 12:44:31]:

> On 06/15/2010 10:49 AM, Balbir Singh wrote:
> >
> >>All we need is to select the right page to drop.
> >>
> >Do we need to go down to the granularity of the individual page to
> >drop? I think figuring out the class of pages, and making sure that we
> >don't write our own reclaim logic but work with what we have to
> >identify that class of pages, is a good start.
>
> Well, the class of pages is 'pages that are duplicated on the
> host'. Unmapped page cache pages are 'pages that might be
> duplicated on the host'. IMO, that's not close enough.
>

Agreed, but what happens in reality with the code is that it drops
not-so-frequently-used cache (still reusing the reclaim mechanism),
while prioritizing cached memory.

> >>How can the host tell if there is duplication? It may know it has
> >>some pagecache, but it has no idea whether or to what extent guest
> >>pagecache duplicates host pagecache.
> >>
> >Well it is possible in host user space, I for example use memory
> >cgroup and through the stats I have a good idea of how much is duplicated.
> >I am of course making an assumption with my setup of the cached mode,
> >that the data in the guest page cache and page cache in the cgroup
> >will be duplicated to a large extent. I did some trivial experiments
> >like drop the data from the guest and look at the cost of bringing it
> >in and dropping the data from both guest and host and look at the
> >cost. I could see a difference.
> >
> >Unfortunately, I did not save the data, so I'll need to redo the
> >experiment.
>
> I'm sure we can detect it experimentally, but how do we do it
> programmatically at run time (without dropping all the pages)?
> Situations change, and I don't think we can infer from a few
> experiments that we'll have a similar amount of sharing. The cost
> of an incorrect decision is too high IMO (not that I think the
> kernel always chooses the right pages now, but I'd like to avoid
> regressions from the unvirtualized state).
>
> btw, when running with a disk controller that has a very large
> cache, we might also see duplication between "guest" and host. So,
> if this is a good idea, it shouldn't be enabled just for
> virtualization, but for any situation where we have a sizeable cache
> behind us.
>

It depends: once the disk controller has the data in its cache and the
pages in the guest are not-so-frequently-used, we can drop them. Please
remember we still use the LRU to identify these pages.

> >>It doesn't, really. The host only has aggregate information about
> >>itself, and no information about the guest.
> >>
> >>Dropping duplicate pages would be good if we could identify them.
> >>Even then, it's better to drop the page from the host, not the
> >>guest, unless we know the same page is cached by multiple guests.
> >>
> >On the exact pages to drop, please see my comments above on the class
> >of pages to drop.
>
> Well, we disagree about that. There is some value in dropping
> duplicated pages (not always), but that's not what the patch does.
> It drops unmapped pagecache pages, which may or may not be
> duplicated.
>
> >There are reasons for wanting to get the host to cache the data
>
> There are also reasons to get the guest to cache the data - it's
> more efficient to access it in the guest.
>
> >Unless the guest is using cache = none, the data will still hit the
> >host page cache
> >The host can do a better job of optimizing the writeouts
>
> True, especially for non-raw storage. But even there we have to
> fsync all the time to keep the metadata right.
>
> >>But why would the guest voluntarily drop the cache? If there is no
> >>memory pressure, dropping caches increases cpu overhead and latency
> >>even if the data is still cached on the host.
> >>
> >So, there are basically two approaches
> >
> >1. First patch, proactive - enabled by a boot option
> >2. When ballooned, we try to (please NOTE try to) reclaim cached pages
> >first. Failing which, we go after regular pages in the alloc_page()
> >call in the balloon driver.
>
> Doesn't that mean you may evict a recently-used (RU) mapped page ahead
> of an LRU unmapped page, just in the hope that it is double-cached?
>
> Maybe we need the guest and host to talk to each other about which
> pages to keep.
>

Yeah... I guess that falls into the domain of CMM (collaborative memory
management).

> >>>2. Drop the cache on either a special balloon option, again the host
> >>>knows it caches that very same information, so it prefers to free that
> >>>up first.
> >>Dropping in response to pressure is good. I'm just not convinced
> >>the patch helps in selecting the correct page to drop.
> >>
> >That is why I've presented data on the experiments I've run and
> >provided more arguments to backup the approach.
>
> I'm still unconvinced, sorry.
>

The reason for making this optional is to let the administrators
decide how they want to use the memory in the system. In some
situations it might be a big no-no to waste memory, in some cases it
might be acceptable.

--
Three Cheers,
Balbir
From: Balbir Singh on
* Avi Kivity <avi(a)redhat.com> [2010-06-15 12:54:31]:

> On 06/15/2010 10:52 AM, Balbir Singh wrote:
> >>>
> >>>That is why the policy (in the next set) will come from the host. As
> >>>to whether the data is truly duplicated, my experiments show up to 60%
> >>>of the page cache is duplicated.
> >>Isn't that incredibly workload dependent?
> >>
> >>We can't expect the host admin to know whether duplication will
> >>occur or not.
> >>
> >I was referring to the cache= policy we use based on the setup. I don't
> >think the duplication is too workload specific. Moreover, we could use
> >aggressive policies and restrict page cache usage or do it selectively
> >on ballooning. We could also add other options to make the ballooning
> >option truly optional, so that the system management software decides.
>
> Consider a read-only workload that exactly fits in guest cache.
> Without trimming, the guest will keep hitting its own cache, and the
> host will see no access to the cache at all. So the host (assuming
> it is under even low pressure) will evict those pages, and the guest
> will happily use its own cache. If we start to trim, the guest will
> have to go to disk. That's the best case.
>
> Now for the worst case. A random access workload that misses the
> cache on both guest and host. Now every page is duplicated, and
> trimming guest pages allows the host to increase its cache, and
> potentially reduce misses. In this case trimming duplicated pages
> works.
>
> Real life will see a mix of this. Often used pages won't be
> duplicated, and less often used pages may see some duplication,
> especially if the host cache portion dedicated to the guest is
> bigger than the guest cache.
>
> I can see that trimming duplicate pages helps, but (a) I'd like to
> be sure they are duplicates and (b) often trimming them from the
> host is better than trimming them from the guest.
>

Let's look at the behaviour with these patches.

The first patch is a proactive approach to keep more memory around.
Enabling the parameter implies we are OK paying the cost of some
overhead. My data shows that it leaves a significant amount of free
memory, with a small overhead (5% in my case). This brings us back to
what you can do with free memory.

The second patch shows no overhead and selectively tries to give back
free cache under memory pressure (as indicated by the balloon driver).
We've discussed the reasons for doing this:

1. In the situations where cache is duplicated, this should benefit
   us. Your contention is that we need to be specific about the
   duplication; that falls under the realm of CMM.
2. In the case of the slab cache, duplication does not matter; it is a
   free page that should ideally be reclaimed ahead of mapped pages.
   If the slab grows, it will get another new page.

What is the cost of (1)?

In the worst case, we select a non-duplicated page; but for us to
select it, it should be inactive, in which case we do I/O to bring the
page back.
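The worst-case cost argument can be put in expected-value terms. The numbers
below (Balbir's ~60% duplication figure, plus made-up host-hit and disk-read
costs) are illustrative only.

```python
def expected_refill_cost(p_duplicated, host_hit_cost, disk_cost):
    """Expected cost of re-faulting one trimmed page: with probability
    p_duplicated the host still caches it (cheap hit), otherwise the
    guest pays for real disk I/O."""
    return p_duplicated * host_hit_cost + (1 - p_duplicated) * disk_cost

# With 60% duplication, a 1-unit host hit, and a 100-unit disk read,
# the expected per-page refill cost is 0.6*1 + 0.4*100 = 40.6 units.
cost = expected_refill_cost(0.6, 1.0, 100.0)
```

Avi's counterpoint maps directly onto this sketch: p_duplicated is workload-
dependent and unknown at run time, so the expectation cannot be evaluated
reliably before deciding to trim.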

> Trimming from the guest is worthwhile if the pages are not used very
> often (but enough that caching them in the host is worth it) and if
> the host cache can serve more than one guest. If we can identify
> those pages, we don't risk degrading best-case workloads (as defined
> above).
>
> (note ksm to some extent identifies those pages, though it is a bit
> expensive, and doesn't share with the host pagecache).
>

I see that you are hinting at finding exact duplicates; I don't know
if the cost and complexity justify it. I hope more users can try the
patches with and without the boot parameter and provide additional
feedback.

--
Three Cheers,
Balbir