From: Avi Kivity
On 04/30/2010 08:52 PM, Jeremy Fitzhardinge wrote:
> On 04/30/2010 09:16 AM, Avi Kivity wrote:
>
>> Given that whenever frontswap fails you need to swap anyway, it is
>> better for the host to never fail a frontswap request and instead back
>> it with disk storage if needed. This way you avoid a pointless vmexit
>> when you're out of memory. Since it's disk backed it needs to be
>> asynchronous and batched.
>>
> I'd argue the opposite. There's no point in having the host do swapping
> on behalf of guests if guests can do it themselves; it's just a
> duplication of functionality.

The problem with relying on the guest to swap is that it's voluntary.
The guest may not be able to do it. When the hypervisor needs memory
and guests don't cooperate, it has to swap.

But I'm not suggesting that the host swap on behalf of the guest.
Rather, the guest swaps to (what it sees as) a device with a large
write-back cache; the host simply manages that cache.
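
To make that concrete, here is the shape of the idea (all names are made
up for illustration; this is a sketch, not a proposed implementation).
The guest sees an ordinary block device and swaps to it as usual; the
host keeps recently written sectors in memory and only writes them to
disk in the background or when it is short on memory, with no guest
cooperation required:

/* Illustrative sketch only -- not real code. */
#include <stdlib.h>
#include <string.h>

#define SECTOR_BYTES 4096ULL            /* assume page-sized sectors */

struct cached_sector {
        unsigned long long    sector;
        unsigned char         data[SECTOR_BYTES];
        int                   dirty;    /* not yet on the backing disk */
        struct cached_sector *next;
};

struct wb_swap_dev {
        struct cached_sector *cache;    /* host-memory write-back cache */
        /* hypothetical backing-store hook, e.g. pwrite() on a file */
        int (*disk_write)(unsigned long long sector, const void *buf);
};

/* Guest-visible write: lands in host memory, no disk I/O, and never
 * "fails" the way a frontswap put can. */
static void dev_write(struct wb_swap_dev *d, unsigned long long sector,
                      const void *buf)
{
        struct cached_sector *c;

        for (c = d->cache; c; c = c->next)
                if (c->sector == sector)
                        break;
        if (!c) {
                c = calloc(1, sizeof(*c));
                if (!c)
                        return; /* a real device would fall back to direct disk I/O */
                c->sector = sector;
                c->next = d->cache;
                d->cache = c;
        }
        memcpy(c->data, buf, SECTOR_BYTES);
        c->dirty = 1;
}

/* Host memory pressure: write dirty sectors out and mark them clean so
 * their memory can be reclaimed -- no guest involvement at all. */
static void dev_flush(struct wb_swap_dev *d)
{
        struct cached_sector *c;

        for (c = d->cache; c; c = c->next)
                if (c->dirty && d->disk_write(c->sector, c->data) == 0)
                        c->dirty = 0;
}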

> You end up having two IO paths for each
> guest, and the resulting problems in trying to account for the IO,
> rate-limit it, etc. If you can simply say "all guest disk IO happens
> via this single interface", it's much easier to manage.
>

With tmem you have to account for that memory, make sure it's
distributed fairly, claim it back when you need it (requiring guest
cooperation), live migrate and save/restore it. It's a much larger
change than introducing a write-back device for swapping (which has the
benefit of working with unmodified guests).

> If frontswap has value, it's because it's providing a new facility to
> guests that doesn't already exist and can't be easily emulated with
> existing interfaces.
>
> It seems to me the great strengths of the synchronous interface are:
>
> * it matches the needs of an existing implementation (tmem in Xen)
> * it is simple to understand within the context of the kernel code
> it's used in
>
> Simplicity is important, because it allows the mm code to be understood
> and maintained without having to have a deep understanding of
> virtualization.

If we use the existing paths, things are even simpler, and we match more
needs (hypervisors with dma engines, the ability to reclaim memory
without guest cooperation).

> One of the problems with CMM2 was that it puts a lot of
> intricate constraints on the mm code which can be easily broken, which
> would only become apparent in subtle edge cases in a CMM2-using
> environment. An additional async frontswap-like interface - while not as
> complex as CMM2 - still makes things harder for mm maintainers.
>

No doubt CMM2 is hard to swallow.

> The downside is that it may not match some implementation in which the
> get/put operations could take a long time (ie, physical IO to a slow
> mechanical device). But a general Linux principle is not to overdesign
> interfaces for hypothetical users, only for real needs.
>

> Do you think that you would be able to use frontswap in kvm if it were
> an async interface, but not otherwise? Or are you arguing a hypothetical?
>

For kvm (or Xen, with some modifications) all of the benefits of
frontswap/tmem can be achieved with the ordinary swap path. It would need
trim/discard support to avoid writing back freed data, but that's good
for flash as well.
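
Roughly, the hook would look something like this (purely illustrative,
hypothetical names, not actual mm code): when the guest frees a swap
slot it tells the device the data is dead, so a host write-back cache
can drop its copy instead of flushing it, and an SSD can skip the block
during garbage collection.

/* Illustrative sketch only. */
struct swap_backing_dev {
        /* discard/TRIM hook; offset and length in bytes for simplicity */
        void (*discard)(struct swap_backing_dev *dev,
                        unsigned long long offset, unsigned long long len);
};

static void swap_slot_freed(struct swap_backing_dev *dev,
                            unsigned long slot, unsigned long slot_bytes)
{
        /* the contents of this slot will never be read again */
        dev->discard(dev, (unsigned long long)slot * slot_bytes, slot_bytes);
}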

The advantages are:
- just works
- works with old, unmodified guests
- <1 exit/page (since it's batched)
- no extra overhead when the host has no free memory
- can use a DMA engine (since it's asynchronous)

>> At this point we're back with the ordinary swap API. Simply have your
>> host expose a device which is write cached by host memory, you'll have
>> all the benefits of frontswap with none of the disadvantages, and with
>> no changes to guest code.
>>
> Yes, that's comfortably within the "guests page themselves" model.
> Setting up a block device for the domain which is backed by pagecache
> (something we usually try hard to avoid) is pretty straightforward. But
> it doesn't work well for Xen unless the blkback domain is sized so that
> it has all of Xen's free memory in its pagecache.
>

Could be easily achieved with ballooning?

> That said, it does concern me that the host/hypervisor is left holding
> the bag on frontswapped pages. An evil/uncooperative/lazy guest can just
> pump a whole lot of pages into the frontswap pool and leave them there. I
> guess this is mitigated by the fact that the API is designed such that
> they can't update or read the data without also allowing the hypervisor
> to drop the page (updates can fail destructively, and reads are also
> destructive), so the guest can't use it as a clumsy extension of their
> normal dedicated memory.
>

Eventually you'll have to swap frontswap pages, or kill uncooperative
guests. At which point all of the simplicity is gone.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

From: Avi Kivity
On 04/30/2010 06:59 PM, Dan Magenheimer wrote:
>>
>>> experiencing a load spike, you increase load even more by making the
>>> guests swap. If you can just take some of their memory away, you can
>>> smooth that spike out. CMM2 and frontswap do that. The guests
>>> explicitly give up page contents that the hypervisor can then discard
>>> without first consulting the guest.
>>>
>> Frontswap does not do this. Once a page has been frontswapped, the host
>> is committed to retaining it until the guest releases it.
>>
> Dave or others can correct me if I am wrong, but I think CMM2 also
> handles dirty pages that must be retained by the hypervisor.

But those are the guest's pages in the first place, so that's not a new
commitment. CMM2 provides the hypervisor with alternatives to swapping a
page out; frontswap provides the guest with alternatives to swapping a
page out.

> The
> difference between CMM2 (for dirty pages) and frontswap is that
> CMM2 sets hints that can be handled asynchronously while frontswap
> provides explicit hooks that synchronously succeed/fail.
>

They are not directly comparable. In fact, for dirty pages CMM2 is
mostly a no-op - the host is still forced to swap them out if it wants
the memory back. CMM2 brings value for demand-zero or clean pages, which
the guest can restore without a swapin.

I think what CMM2 brings for dirty pages is the ability to discard them
once the host has swapped them out but the guest no longer needs them.
> In fact, Avi, CMM2 is probably a fairly good approximation of what
> the asynchronous interface you are suggesting might look like.
> In other words,

CMM2 is more directly comparable to ballooning than to
frontswap. Frontswap (and cleancache) work with storage that is
external to the guest, and say nothing about the guest's page itself.

> feasible but much much more complex than frontswap.
>

The swap API (e.g. the block layer) itself is an asynchronous batched
version of frontswap. The complexity in CMM2 comes from the fact that
it is communicating information about guest pages to the host, and from
the fact that communication is two-way and asynchronous in both directions.
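
To illustrate the difference in shape (a hypothetical sketch, not the
actual frontswap patch and not the actual block layer API):

struct page;                            /* opaque guest page */

/* Shape 1: synchronous, one page per call.  Each call is a trap to the
 * hypervisor, and success or failure is known before it returns. */
struct sync_swap_ops {
        int  (*put_page)(unsigned type, unsigned long offset, struct page *p);
        int  (*get_page)(unsigned type, unsigned long offset, struct page *p);
        void (*flush_page)(unsigned type, unsigned long offset);
};

/* Shape 2: asynchronous and batched, like a block device.  Many pages go
 * out in one submission; completion arrives later via a callback, so the
 * backend is free to use a DMA engine, or to hit the disk. */
struct swap_batch {
        struct page  **pages;
        unsigned long *offsets;
        unsigned int   nr;
        void         (*complete)(struct swap_batch *b, int error);
        void          *private_data;
};

int submit_swap_write(struct swap_batch *b);    /* queues and returns */
int submit_swap_read(struct swap_batch *b);

The block layer already gives us shape 2 for free.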


>
>> [frontswap is] really
>> not very different from a synchronous swap device.
>>
> Not to beat a dead horse, but there is a key difference:
> The size and availability of frontswap is entirely dynamic;
> any page-to-be-swapped can be rejected at any time even if
> a page was previously successfully swapped to the same index.
> Every other swap device is much more static so the swap code
> assumes a static device. Existing swap code can account for
> "bad blocks" on a static device, but this is far from sufficient
> to handle the dynamicity needed by frontswap.
>

Given that whenever frontswap fails you need to swap anyway, it is
better for the host to never fail a frontswap request and instead back
it with disk storage if needed. This way you avoid a pointless vmexit
when you're out of memory. Since it's disk backed it needs to be
asynchronous and batched.

At this point we're back with the ordinary swap API. Simply have your
host expose a device which is write cached by host memory, you'll have
all the benefits of frontswap with none of the disadvantages, and with
no changes to guest code.


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

From: Jeremy Fitzhardinge
On 04/30/2010 11:24 AM, Avi Kivity wrote:
>> I'd argue the opposite. There's no point in having the host do swapping
>> on behalf of guests if guests can do it themselves; it's just a
>> duplication of functionality.
>
> The problem with relying on the guest to swap is that it's voluntary.
> The guest may not be able to do it. When the hypervisor needs memory
> and guests don't cooperate, it has to swap.

Or fail whatever operation it's trying to do. You can only use
overcommit to fake unlimited resources for so long before you need a
government bailout.

>> You end up having two IO paths for each
>> guest, and the resulting problems in trying to account for the IO,
>> rate-limit it, etc. If you can simply say "all guest disk IO happens
>> via this single interface", it's much easier to manage.
>>
>
> With tmem you have to account for that memory, make sure it's
> distributed fairly, claim it back when you need it (requiring guest
> cooperation), live migrate and save/restore it. It's a much larger
> change than introducing a write-back device for swapping (which has
> the benefit of working with unmodified guests).

Well, with caveats. To be useful with migration, the backing store needs
to be shared like other storage, so you can't use a specific host-local
fast (SSD) swap device. And because the device is backed by pagecache
with delayed writes, it has much weaker integrity guarantees than a
normal device, so you need to be sure that the guests are only going to
use it for swap. Sure, these are deployment issues rather than code
ones, but they're still issues.

>> If frontswap has value, it's because it's providing a new facility to
>> guests that doesn't already exist and can't be easily emulated with
>> existing interfaces.
>>
>> It seems to me the great strengths of the synchronous interface are:
>>
>> * it matches the needs of an existing implementation (tmem in Xen)
>> * it is simple to understand within the context of the kernel code
>> it's used in
>>
>> Simplicity is important, because it allows the mm code to be understood
>> and maintained without having to have a deep understanding of
>> virtualization.
>
> If we use the existing paths, things are even simpler, and we match
> more needs (hypervisors with dma engines, the ability to reclaim
> memory without guest cooperation).

Well, you still can't reclaim memory; you can write it out to storage.
It may be cheaper/byte, but it's still a resource dedicated to the
guest. But that's just a consequence of allowing overcommit, and of how
far you're happy to allow it.

What kind of DMA engine do you have in mind? Are there practical
memory->memory DMA engines that would be useful in this context?

>>> At this point we're back with the ordinary swap API. Simply have your
>>> host expose a device which is write cached by host memory, you'll have
>>> all the benefits of frontswap with none of the disadvantages, and with
>>> no changes to guest code.
>>>
>> Yes, that's comfortably within the "guests page themselves" model.
>> Setting up a block device for the domain which is backed by pagecache
>> (something we usually try hard to avoid) is pretty straightforward. But
>> it doesn't work well for Xen unless the blkback domain is sized so that
>> it has all of Xen's free memory in its pagecache.
>>
>
> Could be easily achieved with ballooning?

It could be achieved with ballooning, but it isn't completely trivial.
It wouldn't work terribly well with a driver domain setup, unless all
the swap devices turned out to be backed by the same domain (which in
turn would need to know how to balloon in response to overall system
demand). The partitioning of the pagecache among the guests would be at
the mercy of the mm subsystem rather than subject to any specific QoS or
other per-domain policies you might want to put in place (maybe fiddling
around with [fm]advise could get you some control over that).
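
For example, a userspace backend backing one guest's swap with a file
could push that guest's share of the pagecache back down with something
like the sketch below, but it's purely advisory and does nothing for
dirty pages that still have to be written out first -- a hint, not a
per-domain policy knob:

#define _XOPEN_SOURCE 600
#include <fcntl.h>

/* Illustrative only: ask the kernel to drop clean cached pages for the
 * backing file.  len == 0 means "to the end of the file". */
static void shrink_guest_cache(int backing_fd)
{
        posix_fadvise(backing_fd, 0, 0, POSIX_FADV_DONTNEED);
}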

>
>> That said, it does concern me that the host/hypervisor is left holding
>> the bag on frontswapped pages. An evil/uncooperative/lazy guest can just
>> pump a whole lot of pages into the frontswap pool and leave them there. I
>> guess this is mitigated by the fact that the API is designed such that
>> they can't update or read the data without also allowing the hypervisor
>> to drop the page (updates can fail destructively, and reads are also
>> destructive), so the guest can't use it as a clumsy extension of their
>> normal dedicated memory.
>>
>
> Eventually you'll have to swap frontswap pages, or kill uncooperative
> guests. At which point all of the simplicity is gone.

Killing guests is pretty simple. Presumably the oom killer will get kvm
processes like anything else?

J

From: Dave Hansen
On Fri, 2010-04-30 at 08:59 -0700, Dan Magenheimer wrote:
> Dave or others can correct me if I am wrong, but I think CMM2 also
> handles dirty pages that must be retained by the hypervisor. The
> difference between CMM2 (for dirty pages) and frontswap is that
> CMM2 sets hints that can be handled asynchronously while frontswap
> provides explicit hooks that synchronously succeed/fail.

Once pages were dirtied (or I guess just slightly before), they stopped
being volatile, and I don't think the hypervisor could do anything with
them.
It could still swap them out like usual, but none of the CMM-specific
optimizations could be performed.

CC'ing Martin since he's the expert. :)

-- Dave

From: Dave Hansen
On Fri, 2010-04-30 at 10:13 +0300, Avi Kivity wrote:
> On 04/30/2010 04:45 AM, Dave Hansen wrote:
> >
> > A large portion of CMM2's gain came from the fact that you could take
> > memory away from guests without _them_ doing any work. If the system is
> > experiencing a load spike, you increase load even more by making the
> > guests swap. If you can just take some of their memory away, you can
> > smooth that spike out. CMM2 and frontswap do that. The guests
> > explicitly give up page contents that the hypervisor can then discard
> > without first consulting the guest.
> >
>
> Frontswap does not do this. Once a page has been frontswapped, the host
> is committed to retaining it until the guest releases it. It's really
> not very different from a synchronous swap device.
>
> I think cleancache allows the hypervisor to drop pages without the
> guest's immediate knowledge, but I'm not sure.

Gah. You're right. I'm reading the two threads and confusing the
concepts. I'm a bit less mystified why the discussion is revolving
around the swap device so much. :)

-- Dave
