From: Dan Magenheimer on
> > A large portion of CMM2's gain came from the fact that you could take
> > memory away from guests without _them_ doing any work. If the system
> is
> > experiencing a load spike, you increase load even more by making the
> > guests swap. If you can just take some of their memory away, you can
> > smooth that spike out. CMM2 and frontswap do that. The guests
> > explicitly give up page contents that the hypervisor does not have to
> > first consult with the guest before discarding.
>
> Frontswap does not do this. Once a page has been frontswapped, the
> host
> is committed to retaining it until the guest releases it.

Dave or others can correct me if I am wrong, but I think CMM2 also
handles dirty pages that must be retained by the hypervisor. The
difference between CMM2 (for dirty pages) and frontswap is that
CMM2 sets hints that can be handled asynchronously, while frontswap
provides explicit hooks that synchronously succeed or fail.
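
To make the synchronous contract concrete, here is a rough sketch of
the kind of ops table frontswap registers; the exact names and
signatures in the posted patch may differ, so treat this purely as
illustrative:

struct frontswap_ops {
        void (*init)(unsigned pool_id);
        /* returns 0 on success; any nonzero value means "rejected,
         * write this page to the real swap device instead".  The call
         * completes before the guest's swap path continues, which is
         * what makes the interface synchronous. */
        int (*put_page)(unsigned pool_id, pgoff_t offset, struct page *page);
        int (*get_page)(unsigned pool_id, pgoff_t offset, struct page *page);
        void (*flush_page)(unsigned pool_id, pgoff_t offset);
        void (*flush_area)(unsigned pool_id);
};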

In fact, Avi, CMM2 is probably a fairly good approximation of what
the asynchronous interface you are suggesting might look like.
In other words, feasible, but much, much more complex than frontswap.

> [frontswap is] really
> not very different from a synchronous swap device.

Not to beat a dead horse, but there is one key difference:
the size and availability of frontswap are entirely dynamic;
any page-to-be-swapped can be rejected at any time, even if
a page was previously swapped successfully to the same index.
Every other swap device is far more static, so the swap code
assumes a static device. Existing swap code can account for
"bad blocks" on a static device, but that is far from sufficient
to handle the dynamic behavior frontswap requires.
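
To illustrate what that means for the caller, here is a hedged sketch
of the swap write path with frontswap in front of it; the function
names (frontswap_put_page, normal_swap_writepage) are illustrative,
not the actual patch:

/* Sketch only: a frontswap store can be refused at any time, and a
 * refused page must still go to the backing swap device, so a static
 * "bad blocks" map cannot describe this behavior. */
int swap_writepage_sketch(swp_entry_t entry, struct page *page,
                          struct writeback_control *wbc)
{
        if (frontswap_put_page(swp_type(entry), swp_offset(entry), page) == 0)
                return 0;       /* the hypervisor took it this time */

        /* Rejected -- even though the same offset may have been
         * accepted a moment ago.  Fall back to the real device. */
        return normal_swap_writepage(page, wbc);
}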

> I think cleancache allows the hypervisor to drop pages without the
> guest's immediate knowledge, but I'm not sure.

Yes, cleancache can drop pages at any time because (as the
name implies) only clean pages can be put into cleancache.
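
For contrast, a hedged sketch of how a read path treats cleancache
(cleancache_get_page and issue_disk_read are illustrative names): a
miss is always legal, which is exactly why the host is free to drop
pages whenever it needs the memory.

/* A cleancache lookup is only ever an optimization. */
static int read_page_sketch(struct page *page)
{
        if (cleancache_get_page(page) == 0)
                return 0;               /* hit: page filled, no disk I/O */

        /* miss: the host may have dropped it, or never had it --
         * read from the backing store exactly as if cleancache
         * did not exist */
        return issue_disk_read(page);
}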

From: Avi Kivity on
On 04/30/2010 09:59 PM, Jeremy Fitzhardinge wrote:
> On 04/30/2010 11:24 AM, Avi Kivity wrote:
>
>>> I'd argue the opposite. There's no point in having the host do swapping
>>> on behalf of guests if guests can do it themselves; it's just a
>>> duplication of functionality.
>>>
>>
>> The problem with relying on the guest to swap is that it's voluntary.
>> The guest may not be able to do it. When the hypervisor needs memory
>> and guests don't cooperate, it has to swap.
>>
> Or fail whatever operation it's trying to do. You can only use
> overcommit to fake unlimited resources for so long before you need a
> government bailout.
>

Keep your commitment below RAM+swap and you'll be fine. We want to
overcommit RAM, not total storage.

>>> You end up having two IO paths for each
>>> guest, and the resulting problems in trying to account for the IO,
>>> rate-limit it, etc. If you can simply say "all guest disk IO happens
>>> via this single interface", it's much easier to manage.
>>>
>>>
>> With tmem you have to account for that memory, make sure it's
>> distributed fairly, claim it back when you need it (requiring guest
>> cooperation), live migrate and save/restore it. It's a much larger
>> change than introducing a write-back device for swapping (which has
>> the benefit of working with unmodified guests).
>>
> Well, with caveats. To be useful with migration the backing store needs
> to be shared like other storage, so you can't use a specific host-local
> fast (SSD) swap device.

Live migration of local storage is possible (qemu does it).

> And because the device is backed by pagecache
> with delayed writes, it has much weaker integrity guarantees than a
> normal device, so you need to be sure that the guests are only going to
> use it for swap. Sure, these are deployment issues rather than code
> ones, but they're still issues.
>

You advertise it as a disk with a write cache, so the guest is obliged to
flush the cache if it wants a guarantee. When it does, you flush your
cache as well. For swap, the guest will not issue any flushes. This is
already supported by qemu with cache=writeback.

I agree care is needed here. You don't want to use the device for
anything else.

>>> If frontswap has value, it's because its providing a new facility to
>>> guests that doesn't already exist and can't be easily emulated with
>>> existing interfaces.
>>>
>>> It seems to me the great strengths of the synchronous interface are:
>>>
>>> * it matches the needs of an existing implementation (tmem in Xen)
>>> * it is simple to understand within the context of the kernel code
>>> it's used in
>>>
>>> Simplicity is important, because it allows the mm code to be understood
>>> and maintained without having to have a deep understanding of
>>> virtualization.
>>>
>> If we use the existing paths, things are even simpler, and we match
>> more needs (hypervisors with dma engines, the ability to reclaim
>> memory without guest cooperation).
>>
> Well, you still can't reclaim memory; you can write it out to storage.
> It may be cheaper per byte, but it's still a resource dedicated to the
> guest. But that's just a consequence of allowing overcommit, and of
> how far you're happy to allow it.
>

In general you want to run on RAM. To maximise your RAM, you do things
like page sharing and ballooning. Both can fail, increasing the demand
for RAM. At that time you either kill a guest or swap to disk.

Consider frontswap/tmem on a bare-metal hypervisor cluster. Presumably
you give most of your free memory to guests. A node dies. Now you need
to start its guests on the surviving nodes, but you're at the mercy of
your guests to give up their tmem.

With an ordinary swap approach, you first flush cache to disk, and if
that's not sufficient you start paging out guest memory. You take a
performance hit but you keep your guests running.

> What kind of DMA engine do you have in mind? Are there practical
> memory->memory DMA engines that would be useful in this context?
>

I/OAT (driver ioatdma).

When you don't have a lot of memory free, you can also switch from
write cache to O_DIRECT, so you use the storage controller's DMA engine
to transfer pages to disk.
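
To make that concrete, a minimal userspace sketch of writing one block
with O_DIRECT instead of through the page cache; the 4096-byte
alignment is an assumption here, since the real requirement depends on
the device:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write one block to the backing file, bypassing the host page cache
 * so the storage controller DMAs the data itself. */
static int write_block_direct(const char *path, const void *data,
                              size_t len, off_t off)
{
        void *buf;
        ssize_t n;
        int fd = open(path, O_WRONLY | O_DIRECT);

        if (fd < 0)
                return -1;
        if (posix_memalign(&buf, 4096, len)) {  /* O_DIRECT wants aligned buffers */
                close(fd);
                return -1;
        }
        memcpy(buf, data, len);
        n = pwrite(fd, buf, len, off);          /* len and off must be aligned too */
        free(buf);
        close(fd);
        return n == (ssize_t)len ? 0 : -1;
}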

>>> Yes, that's comfortably within the "guests page themselves" model.
>>> Setting up a block device for the domain which is backed by pagecache
>>> (something we usually try hard to avoid) is pretty straightforward. But
>>> it doesn't work well for Xen unless the blkback domain is sized so that
>>> it has all of Xen's free memory in its pagecache.
>>>
>>>
>> Could be easily achieved with ballooning?
>>
> It could be achieved with ballooning, but it isn't completely trivial.
> It wouldn't work terribly well with a driver domain setup, unless all
> the swap-devices turned out to be backed by the same domain (which in
> turn would need to know how to balloon in response to overall system
> demand). The partitioning of the pagecache among the guests would be at
> the mercy of the mm subsystem rather than subject to any specific QoS or
> other per-domain policies you might want to put in place (maybe fiddling
> around with [fm]advise could get you some control over that).
>

See Documentation/cgroups/memory.txt.

>>> That said, it does concern me that the host/hypervisor is left holding
>>> the bag on frontswapped pages. An evil/uncooperative/lazy guest can just pump
>>> a whole lot of pages into the frontswap pool and leave them there. I
>>> guess this is mitigated by the fact that the API is designed such that
>>> they can't update or read the data without also allowing the hypervisor
>>> to drop the page (updates can fail destructively, and reads are also
>>> destructive), so the guest can't use it as a clumsy extension of their
>>> normal dedicated memory.
>>>
>>>
>> Eventually you'll have to swap frontswap pages, or kill uncooperative
>> guests. At which point all of the simplicity is gone.
>>
> Killing guests is pretty simple.

Migrating to a hypervisor that doesn't kill guests isn't.

> Presumably the oom killer will get kvm
> processes like anything else?
>

Yes. Of course, you want your management code never to allow this to
happen.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

From: Dan Magenheimer on
> Eventually you'll have to swap frontswap pages, or kill uncooperative
> guests. At which point all of the simplicity is gone.

OK, now I think I see the crux of the disagreement.

NO! Frontswap on Xen+tmem never *never* _never_ NEVER results
in host swapping. Host swapping is evil. Host swapping is
the root of most of the bad reputation that memory overcommit
has gotten from VMware customers. Host swapping can't be
avoided with some memory overcommit technologies (such as page
sharing), but frontswap on Xen+tmem CAN and DOES avoid it.

So, to summarize:

1) You agreed that a synchronous interface for frontswap makes
sense for swap-to-in-kernel-compressed-RAM because it is
truly swapping to RAM.
2) You have pointed out that an asynchronous interface for
frontswap makes more sense for KVM than a synchronous
interface, because KVM does host swapping. Then you said that
if you have an asynchronous interface anyway, the existing
swap code works just fine with no changes, so frontswap
is not needed at all... for KVM.
3) You have suggested that if Xen were more like KVM and required
host swapping, then Xen wouldn't need frontswap either.

BUT frontswap on Xen+tmem always truly swaps to RAM.

So there are two users of frontswap for which the synchronous
interface makes sense. I believe there may be more in the
future and you disagree but, as Jeremy said, "a general Linux
principle is not to overdesign interfaces for hypothetical users,
only for real needs." We have demonstrated there is a need
with at least two users so the debate is only whether the
number of users is two or more than two.

Frontswap is a very non-invasive patch and is very cleanly
layered, so that when neither of the intended "users" is present
it can be turned off in several different ways with zero overhead
(CONFIG'ed off) or extremely small overhead (frontswap_ops is
never set; or frontswap_ops is set but the underlying hypervisor
doesn't support it, so frontswap_poolid never gets set).
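
The "extremely small overhead" case is essentially a test-and-return;
a hedged sketch (the actual code in the patch may look different, and
the type/scope of frontswap_poolid here is an assumption):

static struct frontswap_ops *frontswap_ops;     /* NULL until a backend registers */
static int frontswap_poolid = -1;               /* set only once the hypervisor agrees */

int frontswap_put_page_hook(pgoff_t offset, struct page *page)
{
        if (!frontswap_ops || frontswap_poolid < 0)
                return -1;      /* disabled: caller goes straight to the swap device */
        return frontswap_ops->put_page(frontswap_poolid, offset, page);
}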

So... KVM doesn't need it and won't use it. Do you, Avi, have
any other objections as to why the frontswap patch shouldn't be
accepted as is for the users that DO need it and WILL use it?
From: Pavel Machek on

> So there are two users of frontswap for which the synchronous
> interface makes sense. I believe there may be more in the
> future and you disagree but, as Jeremy said, "a general Linux
> principle is not to overdesign interfaces for hypothetical users,
> only for real needs." We have demonstrated there is a need
> with at least two users so the debate is only whether the
> number of users is two or more than two.
>
> Frontswap is a very non-invasive patch and is very cleanly
> layered, so that when neither of the intended "users" is present
> it can be turned off in several different ways with zero overhead
> (CONFIG'ed off) or extremely small overhead (frontswap_ops is
> never set; or frontswap_ops is set but the underlying hypervisor
> doesn't support it, so frontswap_poolid never gets set).

Yet there are less invasive solutions available, like adding a 'trim'
operation to swap_ops.

So what needs to be said here is 'frontswap is XX times faster than
a swap_ops-based solution on workload YY'.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
From: Nitin Gupta on
On 05/01/2010 10:40 PM, Dan Magenheimer wrote:
>> Eventually you'll have to swap frontswap pages, or kill uncooperative
>> guests. At which point all of the simplicity is gone.
>
> OK, now I think I see the crux of the disagreement.
>
> NO! Frontswap on Xen+tmem never *never* _never_ NEVER results
> in host swapping. Host swapping is evil. Host swapping is
> the root of most of the bad reputation that memory overcommit
> has gotten from VMware customers. Host swapping can't be
> avoided with some memory overcommit technologies (such as page
> sharing), but frontswap on Xen+tmem CAN and DOES avoid it.
>

Why is host-level swapping evil? In the KVM case, a VM is just another
process, and the host will simply swap out its pages using the same
LRU-like scheme as for any other process, AFAIK.

Also, with frontswap the host cannot discard pages at any time, as is
the case with cleancache. So, while cleancache is obviously very
useful, the usefulness of frontswap remains doubtful.

IMHO, along with cleancache, we should just have in-memory
compressed swapping at the *host* level, i.e. no frontswap. I agree
that using the frontswap hooks it is easy to implement ramzswap
functionality, but I don't think it's worth replacing this driver
with frontswap hooks. This driver already has all the goodness:
an asynchronous interface, the ability to dynamically add/remove
ramzswap devices, etc. All that is lacking in this driver is a more
efficient 'discard' mechanism so we can free a page as soon as it
becomes unused.
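
One possible shape for such a notification, purely as a hypothetical
sketch: it assumes a per-slot callback were added to
block_device_operations, and both the callback name and
ramzswap_free_page() are illustrative, not something the current
driver provides.

/* Hypothetical: let the swap code tell ramzswap the moment a slot is
 * freed, so the compressed copy can be dropped immediately. */
static void ramzswap_slot_free_notify(struct block_device *bdev,
                                      unsigned long index)
{
        struct ramzswap *rzs = bdev->bd_disk->private_data;

        ramzswap_free_page(rzs, index);         /* illustrative helper */
}

static const struct block_device_operations ramzswap_devops = {
        .owner                  = THIS_MODULE,
        .swap_slot_free_notify  = ramzswap_slot_free_notify,   /* hypothetical hook */
};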

It should also be easy to extend this driver to allow sending pages
to the host using virtio (for KVM) or Xen hypercalls, if frontswap is
needed at all.

So, IMHO we can focus on cleancache development and add missing
parts to ramzswap driver.

Thanks,
Nitin