From: Dan Magenheimer on
> > So there are two users of frontswap for which the synchronous
> > interface makes sense. I believe there may be more in the
> > future and you disagree but, as Jeremy said, "a general Linux
> > principle is not to overdesign interfaces for hypothetical users,
> > only for real needs." We have demonstrated there is a need
> > with at least two users so the debate is only whether the
> > number of users is two or more than two.
> >
> > Frontswap is a very non-invasive patch and is very cleanly
> > layered so that if it is not in the presence of either of
> > the intended "users", it can be turned off in many different
> > ways with zero overhead (CONFIG'ed off) or extremely small overhead
> > (frontswap_ops is never set; or frontswap_ops is set but the
> > underlying hypervisor doesn't support it so frontswap_poolid
> > never gets set).
>
> Yet there are less invasive solutions available, like 'add trim
> operation to swap_ops'.

As Nitin pointed out much earlier in this thread:

"No: trim or discard is not useful"

I also think that trim does nothing to address the widely and
dynamically varying size that frontswap provides.

> So what needs to be said here is 'frontswap is XX times faster than
> swap_ops based solution on workload YY'.

Are you asking me to demonstrate that swap-to-hypervisor-RAM is
faster than swap-to-disk?

From: Avi Kivity on
On 05/01/2010 08:10 PM, Dan Magenheimer wrote:
>> Eventually you'll have to swap frontswap pages, or kill uncooperative
>> guests. At which point all of the simplicity is gone.
>>
> OK, now I think I see the crux of the disagreement.
>

Alas, I think we're pretty far from that.

> NO! Frontswap on Xen+tmem never *never* _never_ NEVER results
> in host swapping.

That's a bug. You're giving the guest memory without the means to take
it back. The result is that you have to _undercommit_ your memory
resources.

Consider a machine running a guest, with most of its memory free. You
give the memory via frontswap to the guest. The guest happily swaps to
frontswap, and uses the freed memory for something unswappable, like
mlock()ed memory or hugetlbfs.

Now the second node dies and you need memory to migrate your guests
into. But you can't, and the hypervisor is at the mercy of the guest
for getting its memory back; and the guest can't do it (at least not
quickly).

> Host swapping is evil. Host swapping is
> the root of most of the bad reputation that memory overcommit
> has gotten from VMware customers. Host swapping can't be
> avoided with some memory overcommit technologies (such as page
> sharing), but frontswap on Xen+tmem CAN and DOES avoid it.
>

In this case the guest expects that swapped-out memory will be slow
(since it was freed via the swap API; it will be slow if the host
happened to run out of tmem). So by storing this memory on disk you
aren't reducing performance beyond what you promised to the guest.

Swapping guest RAM will indeed cause a performance hit, but sometimes
you need to do it.

> So, to summarize:
>
> 1) You agreed that a synchronous interface for frontswap makes
> sense for swap-to-in-kernel-compressed-RAM because it is
> truly swapping to RAM.
>

Because the interface is internal to the kernel.

> 2) You have pointed out that an asynchronous interface for
> frontswap makes more sense for KVM than a synchronous
> interface, because KVM does host swapping.

kvm's host swapping is unrelated. Host swapping swaps guest-owned
memory; that's not what we want here. We want to cache guest swap in
RAM, and that's easily done by having a virtual disk cached in main
memory. We're simply presenting a disk with a large write-back cache to
the guest.

You could just as easily cache a block device in free RAM with Xen.
Have a tmem domain behave as the backend for your swap device. Use
ballooning to force tmem to disk, or to allow more cache when memory is
free.

Voila: you no longer depend on guests (you depend on the tmem domain,
but that's part of the host code), you don't need guest modifications,
so it works across a wider range of guests.

> Then you said
> if you have an asynchronous interface anyway, the existing
> swap code works just fine with no changes so frontswap
> is not needed at all... for KVM.
>

For any hypervisor which implements virtual disks with write-back cache
in host memory.

> 3) You have suggested that if Xen were more like KVM and required
> host-swapping, then Xen doesn't need frontswap either.
>

Host swapping is not a requirement.

> BUT frontswap on Xen+tmem always truly swaps to RAM.
>

AND that's a problem because it puts the hypervisor at the mercy of the
guest.

> So there are two users of frontswap for which the synchronous
> interface makes sense.

I believe there is only one. See below.

> I believe there may be more in the
> future and you disagree but, as Jeremy said, "a general Linux
> principle is not to overdesign interfaces for hypothetical users,
> only for real needs." We have demonstrated there is a need
> with at least two users so the debate is only whether the
> number of users is two or more than two.
>
> Frontswap is a very non-invasive patch and is very cleanly
> layered so that if it is not in the presence of either of
> the intended "users", it can be turned off in many different
> ways with zero overhead (CONFIG'ed off) or extremely small overhead
> (frontswap_ops is never set; or frontswap_ops is set but the
> underlying hypervisor doesn't support it so frontswap_poolid
> never gets set).
>

The problem is not the complexity of the patch itself. It's the fact
that it introduces a new external API. If we refactor swapping, that
stands in the way.

How much, that's up to the mm maintainers to say. If it isn't a problem
for them, fine (but I still think
swap-to-RAM-without-hypervisor-decommit is a bad idea).

> So... KVM doesn't need it and won't use it. Do you, Avi, have
> any other objections as to why the frontswap patch shouldn't be
> accepted as is for the users that DO need it and WILL use it?
>

Even ignoring the problems above (which are really hypervisor problems,
and the guest, which is what we're discussing here, shouldn't care if
the hypervisor paints itself into an oom), a synchronous single-page DMA
API is a bad idea. Look at the Xen network and block code: while they
eventually do a memory copy for every page they see, they try to batch
multiple pages into an exit and make the response asynchronous.

As an example, with a batched API you could save/restore the fpu context
and use sse for copying the memory, while with a single-page API you'd
probably lose out. Synchronous DMA, even for emulated hardware, is out
of place in 2010.
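
To make the contrast concrete, here is a rough sketch of the two
shapes of interface. The names and signatures are hypothetical
illustrations, not the proposed frontswap API or any existing Xen
interface:

#include <linux/types.h>
#include <linux/mm_types.h>

/*
 * Synchronous, one page per exit: the caller blocks until this
 * single page has been copied by the backend.
 */
int sync_put_page(unsigned type, pgoff_t offset, struct page *page);

/*
 * Batched and asynchronous: many pages are submitted in a single
 * exit and a completion callback fires when the backend is done.
 * The backend can save/restore fpu state once per batch and use
 * wide (e.g. sse) copies across all of the pages.
 */
struct swap_put_req {
	unsigned	type;
	pgoff_t		offset;
	struct page	*page;
};

int async_put_batch(struct swap_put_req *reqs, unsigned nr,
		    void (*complete)(struct swap_put_req *reqs,
				     unsigned nr, int err));

The exact signatures don't matter; the point is that the second form
amortizes the exit, the copy setup, and the completion handling over
many pages.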

--
error compiling committee.c: too many arguments to function

From: Dan Magenheimer on
> > NO! Frontswap on Xen+tmem never *never* _never_ NEVER results
> > in host swapping. Host swapping is evil. Host swapping is
> > the root of most of the bad reputation that memory overcommit
> > has gotten from VMware customers. Host swapping can't be
> > avoided with some memory overcommit technologies (such as page
> > sharing), but frontswap on Xen+tmem CAN and DOES avoid it.
>
> Why is host-level swapping evil? In the KVM case, a VM is just another
> process and the host will just swap out its pages using the same
> LRU-like scheme as with any other process, AFAIK.

The first problem is that you are simulating a fast resource
(RAM) with a resource that is orders of magnitude slower with
NO visibility to the user that suffers the consequences. A good
analogy (and no analogy is perfect) is if Linux discovers a 16MHz
80286 on a serial card in addition to the 32 3GHz cores on a
Nehalem box and, whenever the 32 cores are all busy, randomly
schedules a process on the 80286, while recording all CPU usage
data as if the 80286 is a "real" processor.... "Hmmm... why
did my compile suddenly run 100 times slower?"

The second problem is "double swapping": A guest may choose
a page to swap to "guest swap", but invisibly to the guest,
the host first must fetch it from "host swap". (This may
seem like it is easy to avoid... it is not and happens more
frequently than you might think.)

Third, host swapping makes live migration much more difficult.
Either the host swap disk must be accessible to all machines
or data sitting on a local disk MUST be migrated along with
RAM (which is not impossible but complicates live migration
substantially). Last I checked, VMware does not allow
page-sharing and live migration to both be enabled for the
same host.

If you talk to VMware customers (especially web-hosting services)
that have attempted to use overcommit technologies that require
host-swapping, you will find that they quickly become allergic
to memory overcommit and turn it off. The end users (users of
the VMs that inexplicably grind to a halt) complain loudly.
As a result, RAM has become a bottleneck in many many systems,
which ultimately reduces the utility of servers and the value
of virtualization.

> Also, with frontswap, the host cannot discard pages at any time, as
> is the case with cleancache.

True. But in the Xen+tmem implementation there are disincentives
for a guest to unnecessarily retain pages put into frontswap,
so the host doesn't need to care that it can't discard the pages
as the guest is "billed" for them anyway.

So far we've been avoiding hypervisor policy implementation
questions and focused on mechanism (because, after all, this
is a *Linux kernel* mailing list), but we can go there if
needed.

> IMHO, along with cleancache, we should just have in-memory
> compressed swapping at the *host* level, i.e. no frontswap. I agree
> that using frontswap hooks, it is easy to implement ramzswap
> functionality, but I think it's not worth replacing this driver
> with frontswap hooks. This driver already has all the goodness:
> asynchronous interface, ability to dynamically add/remove ramzswap
> devices, etc. All that is lacking in this driver is a more efficient
> 'discard' functionality so we can free a page as soon as it becomes
> unused.

The key element missing from ramzswap is that, with frontswap, EVERY
attempt to swap a page to RAM is evaluated and potentially rejected
by the "backend" (hypervisor). Further, frontswap requires no
additional per-guest system administration, whereas ramzswap must be
explicitly sized and configured. (How big should it be anyway?)
This level of dynamicity is important to optimally managing physical
memory in a rapidly changing virtual environment.
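
Schematically, the hook point in the swap-out path looks something
like the sketch below. It is simplified and the helper names are
illustrative stand-ins, not the literal patch; the property that
matters is that the backend may refuse any individual page, in which
case that page simply takes the normal path to the swap device:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

static int swap_out_one_page(struct page *page,
			     struct writeback_control *wbc)
{
	/*
	 * Offer the page to the backend (hypervisor or in-kernel
	 * compressed store).  The backend sees every single page
	 * and may accept or reject it based on its current policy
	 * and memory availability.
	 */
	if (frontswap_put_page(page) == 0) {
		/* Accepted: the page now lives in the backend's
		 * memory and no block I/O is issued at all. */
		set_page_writeback(page);
		unlock_page(page);
		end_page_writeback(page);
		return 0;
	}

	/* Rejected (or no backend registered): fall back to the
	 * ordinary asynchronous write to the swap device.
	 * swap_writepage_to_device() is a stand-in for that path. */
	return swap_writepage_to_device(page, wbc);
}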

> It should also be easy to extend this driver to allow sending pages
> to host using virtio (for KVM) or Xen hypercalls, if frontswap is
> needed at all.
>
> So, IMHO we can focus on cleancache development and add missing
> parts to ramzswap driver.

I'm certainly open to someone exploring this approach to see if
it works for swap-to-hypervisor-RAM. It has been my understanding
that Linus rejected the proposed discard hooks, without which
ramzswap doesn't even really work for swap-to-in-kernel-compressed-
RAM. However, I suspect that ramzswap, even with the discard hooks,
will not have the "dynamic range" useful for swap-to-hypervisor-RAM,
but frontswap will work fine for swap-to-in-kernel-compressed-RAM.
From: Avi Kivity on
On 05/02/2010 07:06 PM, Dan Magenheimer wrote:
>>> NO! Frontswap on Xen+tmem never *never* _never_ NEVER results
>>> in host swapping. Host swapping is evil. Host swapping is
>>> the root of most of the bad reputation that memory overcommit
>>> has gotten from VMware customers. Host swapping can't be
>>> avoided with some memory overcommit technologies (such as page
>>> sharing), but frontswap on Xen+tmem CAN and DOES avoid it.
>>>
>> Why is host-level swapping evil? In the KVM case, a VM is just another
>> process and the host will just swap out its pages using the same
>> LRU-like scheme as with any other process, AFAIK.
>>
> The first problem is that you are simulating a fast resource
> (RAM) with a resource that is orders of magnitude slower with
> NO visibility to the user that suffers the consequences. A good
> analogy (and no analogy is perfect) is if Linux discovers a 16MHz
> 80286 on a serial card in addition to the 32 3GHz cores on a
> Nehalem box and, whenever the 32 cores are all busy, randomly
> schedules a process on the 80286, while recording all CPU usage
> data as if the 80286 is a "real" processor.... "Hmmm... why
> did my compile suddenly run 100 times slower?"
>

It's bad, but it's better than ooming.

The same thing happens with vcpus: you run 10 guests on one core, and
if they all wake up, your cpu is suddenly 10x slower and has 30,000x
the interrupt latency (30ms vs. 1us, assuming 3ms timeslices). Your
disks become slower as well.

It's worse with memory, so you try to swap as a last resort. However,
swap is still faster than a crashed guest.


> The second problem is "double swapping": A guest may choose
> a page to swap to "guest swap", but invisibly to the guest,
> the host first must fetch it from "host swap". (This may
> seem like it is easy to avoid... it is not and happens more
> frequently than you might think.)
>

True. In fact when the guest and host use the same LRU algorithm, it
becomes even likelier. That's one of the things CMM2 addresses.

> Third, host swapping makes live migration much more difficult.
> Either the host swap disk must be accessible to all machines
> or data sitting on a local disk MUST be migrated along with
> RAM (which is not impossible but complicates live migration
> substantially).

kvm does live migration with swapping, and has no special code to
integrate them.

> Last I checked, VMware does not allow
> page-sharing and live migration to both be enabled for the
> same host.
>

Don't know about vmware, but kvm supports page sharing, swapping, and
live migration simultaneously.

> If you talk to VMware customers (especially web-hosting services)
> that have attempted to use overcommit technologies that require
> host-swapping, you will find that they quickly become allergic
> to memory overcommit and turn it off. The end users (users of
> the VMs that inexplicably grind to a halt) complain loudly.
> As a result, RAM has become a bottleneck in many many systems,
> which ultimately reduces the utility of servers and the value
> of virtualization.
>

Choosing the correct overcommit ratio is certainly not an easy task.
However, just hoping that memory will be available when you need it is
not a good solution.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

From: Dan Magenheimer on
> > OK, now I think I see the crux of the disagreement.
>
> Alas, I think we're pretty far from that.

Well, to be fair, I meant the disagreement over synchronous vs.
asynchronous.

> > NO! Frontswap on Xen+tmem never *never* _never_ NEVER results
> > in host swapping.
>
> That's a bug. You're giving the guest memory without the means to take
> it back. The result is that you have to _undercommit_ your memory
> resources.
>
> Consider a machine running a guest, with most of its memory free. You
> give the memory via frontswap to the guest. The guest happily swaps to
> frontswap, and uses the freed memory for something unswappable, like
> mlock()ed memory or hugetlbfs.
>
> Now the second node dies and you need memory to migrate your guests
> into. But you can't, and the hypervisor is at the mercy of the guest
> for getting its memory back; and the guest can't do it (at least not
> quickly).

Simple policies must exist and must be enforced by the hypervisor to ensure
this doesn't happen. Xen+tmem provides these policies and enforces them.
And it enforces them very _dynamically_ to constantly optimize
RAM utilization across multiple guests each with dynamically varying RAM
usage. Frontswap fits nicely into this framework.

> > Host swapping is evil. Host swapping is
> > the root of most of the bad reputation that memory overcommit
> > has gotten from VMware customers. Host swapping can't be
> > avoided with some memory overcommit technologies (such as page
> > sharing), but frontswap on Xen+tmem CAN and DOES avoid it.
>
> In this case the guest expects that swapped-out memory will be slow
> (since it was freed via the swap API; it will be slow if the host
> happened to run out of tmem). So by storing this memory on disk you
> aren't reducing performance beyond what you promised to the guest.
>
> Swapping guest RAM will indeed cause a performance hit, but sometimes
> you need to do it.

Huge performance hits that are completely inexplicable to a user
give virtualization a bad reputation. If the user (i.e. the guest
administrator, not the host administrator) can at least see "Hmmm...
I'm doing a lot of swapping, guess I'd better pay for more (virtual)
RAM", then the user's objections are greatly reduced.

> > So, to summarize:
> >
> > 1) You agreed that a synchronous interface for frontswap makes
> > sense for swap-to-in-kernel-compressed-RAM because it is
> > truly swapping to RAM.
>
> Because the interface is internal to the kernel.

Xen+tmem uses the SAME internal kernel interface. The Xen-specific
code which performs the Xen-specific stuff (hypercalls) is only in
the Xen-specific directory.
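
Schematically (the names and layout here are illustrative, not the
exact code), the split looks like this: the generic layer calls only
through an ops table, which costs a single pointer test when nothing
has registered -- the "frontswap_ops is never set" case -- and the
Xen side fills in that table with thin wrappers around tmem
hypercalls:

#include <linux/types.h>
#include <linux/mm.h>

/* Generic, hypervisor-agnostic side (mm/): */
struct frontswap_ops {
	int (*put_page)(unsigned type, pgoff_t offset, struct page *page);
	int (*get_page)(unsigned type, pgoff_t offset, struct page *page);
	void (*flush_page)(unsigned type, pgoff_t offset);
	void (*flush_area)(unsigned type);
};

static struct frontswap_ops *frontswap_ops;	/* stays NULL if unused */

static inline int frontswap_put(unsigned type, pgoff_t off,
				struct page *page)
{
	if (!frontswap_ops)	/* near-zero cost when no backend is set */
		return -1;
	return frontswap_ops->put_page(type, off, page);
}

/* Xen-specific side (e.g. drivers/xen/): thin hypercall wrappers.
 * xen_tmem_op() and TMEM_PUT_PAGE stand in for the real hypercall
 * machinery. */
static int xen_tmem_put_page(unsigned type, pgoff_t offset,
			     struct page *page)
{
	return xen_tmem_op(TMEM_PUT_PAGE, type, offset,
			   page_to_pfn(page));
}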

> > 2) You have pointed out that an asynchronous interface for
> > frontswap makes more sense for KVM than a synchronous
> > interface, because KVM does host swapping.
>
> kvm's host swapping is unrelated. Host swapping swaps guest-owned
> memory; that's not what we want here. We want to cache guest swap in
> RAM, and that's easily done by having a virtual disk cached in main
> memory. We're simply presenting a disk with a large write-back cache
> to the guest.

The missing part again is dynamicity. How large is the virtual
disk? Or are you proposing that disks can dramatically vary
in size across time? I suspect that would be a very big patch.
And you're talking about a disk that doesn't have all the
overhead of blockio, right?

> You could just as easily cache a block device in free RAM with Xen.
> Have a tmem domain behave as the backend for your swap device. Use
> ballooning to force tmem to disk, or to allow more cache when memory is
> free.

A block device of what size? Again, I don't think this will be
dynamic enough.

> Voila: you no longer depend on guests (you depend on the tmem domain,
> but that's part of the host code), you don't need guest modifications,
> so it works across a wider range of guests.

Ummm... no guest modifications, yet this special disk does everything
you've described above (and, to meet my dynamicity requirements,
varies in size as well)?

> > BUT frontswap on Xen+tmem always truly swaps to RAM.
>
> AND that's a problem because it puts the hypervisor at the mercy of the
> guest.

As I described in a separate reply, this is simply not true.

> > So there are two users of frontswap for which the synchronous
> > interface makes sense.
>
> I believe there is only one. See below.
>
> The problem is not the complexity of the patch itself. It's the fact
> that it introduces a new external API. If we refactor swapping, that
> stands in the way.

Could you please explicitly identify what you are referring
to as a new external API? The part that is different from
the "only one" internal user?

> Even ignoring the problems above (which are really hypervisor problems
> and the guest, which is what we're discussing here, shouldn't care if
> the hypervisor paints itself into an oom)

which it doesn't.

> a synchronous single-page DMA
> API is a bad idea. Look at the Xen network and block code: while they
> eventually do a memory copy for every page they see, they try to batch
> multiple pages into an exit and make the response asynchronous.

As noted VERY early in this thread, if/when it makes sense, frontswap
can do exactly the same thing by adding a buffering layer invisible
to the internal kernel interfaces.
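
For instance (purely a sketch with invented names, and with locking
omitted), a backend could copy each page into a small staging buffer,
return immediately, and only issue the expensive exit once the buffer
fills, all without the core swap code seeing anything other than the
same synchronous page-at-a-time hook:

#include <linux/mm.h>

#define BATCH_PAGES 32

static void *staging[BATCH_PAGES];	/* preallocated bounce buffers */
static pgoff_t staged_off[BATCH_PAGES];
static unsigned nr_staged;

static void flush_batch(void)
{
	if (nr_staged) {
		/* One exit/hypercall for the whole batch;
		 * backend_put_batch() is an invented name. */
		backend_put_batch(staging, staged_off, nr_staged);
		nr_staged = 0;
	}
}

/* The core swap code calls this synchronously, one page at a time,
 * and gets the page back immediately; the copy into a bounce buffer
 * is what keeps the batching invisible to it. */
static int buffered_put_page(pgoff_t offset, struct page *page)
{
	copy_page(staging[nr_staged], page_address(page));
	staged_off[nr_staged++] = offset;
	if (nr_staged == BATCH_PAGES)
		flush_batch();
	return 0;
}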

> As an example, with a batched API you could save/restore the fpu
> context and use sse for copying the memory, while with a single-page
> API you'd probably lose out. Synchronous DMA, even for emulated
> hardware, is out of place in 2010.

I think we agree that DMA makes sense when there is a lot of data to
copy and makes little sense when there is only a little (e.g. a
single page) to copy. So we need to understand what the tradeoff is.
Do you have any idea what the breakeven point is for your favorite
DMA engine, weighing the amount of data copied against the cost of:
1) locking the memory pages
2) programming the DMA engine
3) responding to the interrupt from the DMA engine
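
To frame the question, here is a back-of-the-envelope breakeven
model; every number in it is an assumed placeholder, not a
measurement of any real DMA engine:

#include <stdio.h>

int main(void)
{
	/* All values are hypothetical placeholders. */
	double setup_ns = 2000.0;  /* pin pages + program engine + take IRQ */
	double cpu_gbps = 4.0;     /* plain memcpy bandwidth in GB/s (~bytes/ns) */
	double dma_gbps = 8.0;     /* DMA engine bandwidth in GB/s (~bytes/ns) */

	/* DMA wins once:  bytes/cpu_gbps > setup_ns + bytes/dma_gbps  */
	double breakeven = setup_ns / (1.0 / cpu_gbps - 1.0 / dma_gbps);

	printf("breakeven at roughly %.0f bytes (~%.0f 4K pages)\n",
	       breakeven, breakeven / 4096.0);
	return 0;
}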

And the simple act of waiting to collect enough pages to "batch"
means none of those pages can be used until the last page is collected
and the DMA engine is programmed and the DMA is complete.
A page-at-a-time interface synchronously releases the pages
for other (presumably more important) needs and thus, when
memory is under extreme pressure, also reduces the probability
of a (guest) OOM.