From: Dan Magenheimer
> It's bad, but it's better than ooming.
>
> The same thing happens with vcpus: you run 10 guests on one core, if
> they all wake up, your cpu is suddenly 10x slower and has 30000x
> interrupt latency (30ms vs 1us, assuming 3ms timeslices). Your disks
> become slower as well.
>
> It's worse with memory, so you try to swap as a last resort. However,
> swap is still faster than a crashed guest.

Your analogy only holds when the host administrator is either
extremely greedy or stupid. My analogy only requires some
statistical bad luck: Multiple guests with peaks and valleys
of memory requirements happen to have their peaks align.

> > Third, host swapping makes live migration much more difficult.
> > Either the host swap disk must be accessible to all machines
> > or data sitting on a local disk MUST be migrated along with
> > RAM (which is not impossible but complicates live migration
> > substantially).
>
> kvm does live migration with swapping, and has no special code to
> integrate them.
> :
> Don't know about vmware, but kvm supports page sharing, swapping, and
> live migration simultaneously.

Hmmm... I'll bet I can break it pretty easily. I think the
case you raised that you thought would cause host OOM'ing
will cause kvm live migration to fail.

Or maybe not... when a guest is in the middle of a live migration,
I believe (in Xen), the entire guest memory allocation (possibly
excluding ballooned-out pages) must briefly be simultaneously in RAM
on BOTH the source and target machines. That is, live migration is
not "pipelined". Is this also true of KVM? If so, your
statement above is just waiting for a corner case to break it.
And if not, I expect you've got fault tolerance issues.

> > If you talk to VMware customers (especially web-hosting services)
> > that have attempted to use overcommit technologies that require
> > host-swapping, you will find that they quickly become allergic
> > to memory overcommit and turn it off. The end users (users of
> > the VMs that inexplicably grind to a halt) complain loudly.
> > As a result, RAM has become a bottleneck in many many systems,
> > which ultimately reduces the utility of servers and the value
> > of virtualization.
>
> Choosing the correct overcommit ratio is certainly not an easy task.
> However, just hoping that memory will be available when you need it is
> not a good solution.

Choosing the _optimal_ overcommit ratio is impossible without a
prescient knowledge of the workload in each guest. Hoping memory
will be available is certainly not a good solution, but if memory
is not available guest swapping is much better than host swapping.
And making RAM usage as dynamic as possible and live migration
as easy as possible are keys to maximizing the benefits (and
limiting the problems) of virtualization.

From: Pavel Machek

> > So what needs to be said here is 'frontswap is XX times faster than
> > swap_ops based solution on workload YY'.
>
> Are you asking me to demonstrate that swap-to-hypervisor-RAM is
> faster than swap-to-disk?

I would like comparison of swap-to-frontswap vs. swap-to-RAMdisk.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
From: Dan Magenheimer
> From: Pavel Machek [mailto:pavel(a)ucw.cz]
>
> > > So what needs to be said here is 'frontswap is XX times faster than
> > > swap_ops based solution on workload YY'.
> >
> > Are you asking me to demonstrate that swap-to-hypervisor-RAM is
> > faster than swap-to-disk?
>
> I would like comparison of swap-to-frontswap vs. swap-to-RAMdisk.
> Pavel

Well, it's not really apples-to-apples: swap-to-RAMdisk is copying
to a chunk of RAM with a known, permanently fixed size, so it SHOULD
be faster than swap-to-hypervisor, and should *definitely* be faster
than swap-to-in-kernel-compressed-RAM. But I suppose it is still an
interesting comparison. I'll see what I can do, though it will
probably take a couple of days to figure out how to measure it
(e.g. without accidentally measuring any swap-to-disk).
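
Roughly, the measurement I have in mind looks like the sketch below.
It's only a sketch: it assumes the machine's ONLY active swap device
is a RAM disk (e.g. mkswap/swapon on /dev/ram0) so that no
swap-to-disk sneaks into the numbers, and ALLOC_MB is a made-up knob
that has to exceed free RAM to force any swapping at all:

/* time page faults against a RAM-disk-backed swap device (sketch) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define ALLOC_MB 2048UL /* must exceed free RAM to force swapping */

int main(void)
{
        size_t page = sysconf(_SC_PAGESIZE);
        size_t len = ALLOC_MB << 20;
        char *buf = malloc(len);
        struct timespec t0, t1;
        size_t i;

        if (!buf)
                return 1;

        /* first pass: dirty every page, pushing earlier pages to swap */
        for (i = 0; i < len; i += page)
                buf[i] = 1;

        /* second pass: fault the early pages back in and time it */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < len / 2; i += page)
                buf[i]++;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("faulted %zu pages in %.3f s\n", (len / 2) / page,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        free(buf);
        return 0;
}

Run once against the RAM disk and once with frontswap enabled, the
second loop's time would be the comparison you're asking for.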
From: Avi Kivity
On 05/02/2010 08:06 PM, Dan Magenheimer wrote:
>
>>> NO! Frontswap on Xen+tmem never *never* _never_ NEVER results
>>> in host swapping.
>>>
>> That's a bug. You're giving the guest memory without the means to take
>> it back. The result is that you have to _undercommit_ your memory
>> resources.
>>
>> Consider a machine running a guest, with most of its memory free. You
>> give the memory via frontswap to the guest. The guest happily swaps to
>> frontswap, and uses the freed memory for something unswappable, like
>> mlock()ed memory or hugetlbfs.
>>
>> Now the second node dies and you need memory to migrate your guests
>> into. But you can't, and the hypervisor is at the mercy of the guest
>> for getting its memory back; and the guest can't do it (at least not
>> quickly).
>>
> Simple policies must exist and must be enforced by the hypervisor to ensure
> this doesn't happen. Xen+tmem provides these policies and enforces them.
> And it enforces them very _dynamically_ to constantly optimize
> RAM utilization across multiple guests each with dynamically varying RAM
> usage. Frontswap fits nicely into this framework.
>

Can you explain what "enforcing" means in this context? You loaned the
guest some pages; can you enforce their return?

>>> Host swapping is evil. Host swapping is
>>> the root of most of the bad reputation that memory overcommit
>>> has gotten from VMware customers. Host swapping can't be
>>> avoided with some memory overcommit technologies (such as page
>>> sharing), but frontswap on Xen+tmem CAN and DOES avoid it.
>>>
>> In this case the guest expects that swapped out memory will be slow
>> (since was freed via the swap API; it will be slow if the host happened
>> to run out of tmem). So by storing this memory on disk you aren't
>> reducing performance beyond what you promised to the guest.
>>
>> Swapping guest RAM will indeed cause a performance hit, but sometimes
>> you need to do it.
>>
> Huge performance hits that are completely inexplicable to a user
> give virtualization a bad reputation. If the user (i.e. guest,
> not host, administrator) can at least see "Hmmm... I'm doing a lot
> of swapping, guess I'd better pay for more (virtual) RAM", then
> the user objections are greatly reduced.
>

What you're saying is "don't overcommit". That's a good policy for some
scenarios but not for others. Note that it applies just as much to cpu
as to memory.

frontswap+tmem is not overcommit, it's undercommit. You have spare
memory, and you give it away. It isn't a replacement for overcommit.
However, without the means to reclaim this spare memory, it can result
in overcommit.


>>> So, to summarize:
>>>
>>> 1) You agreed that a synchronous interface for frontswap makes
>>> sense for swap-to-in-kernel-compressed-RAM because it is
>>> truly swapping to RAM.
>>>
>> Because the interface is internal to the kernel.
>>
> Xen+tmem uses the SAME internal kernel interface. The Xen-specific
> code which performs the Xen-specific stuff (hypercalls) is only in
> the Xen-specific directory.
>

This makes it an external interface.

>>> 2) You have pointed out that an asynchronous interface for
>>> frontswap makes more sense for KVM than a synchronous
>>> interface, because KVM does host swapping.
>>>
>> kvm's host swapping is unrelated. Host swapping swaps guest-owned
>> memory; that's not what we want here. We want to cache guest swap in
>> RAM, and that's easily done by having a virtual disk cached in main
>> memory. We're simply presenting a disk with a large write-back cache
>> to the guest.
>>
> The missing part again is dynamicity. How large is the virtual
> disk?

Exactly as large as the swap space which the guest would have in the
frontswap+tmem case.

> Or are you proposing that disks can dramatically vary
> in size across time?

Not needed, though I expect it is already supported (SAN volumes do grow).

> I suspect that would be a very big patch.
> And you're talking about a disk that doesn't have all the
> overhead of blockio, right?
>

If block layer overhead is a problem, go ahead and optimize it instead
of adding new interfaces to bypass it. Though I expect it wouldn't be
needed, and if any optimization needs to be done it is in the swap layer.

Optimizing swap has the additional benefit of improving performance on
flash-backed swap.

>> You could just as easily cache a block device in free RAM with Xen.
>> Have a tmem domain behave as the backend for your swap device. Use
>> ballooning to force tmem to disk, or to allow more cache when memory is
>> free.
>>
> A block device of what size? Again, I don't think this will be
> dynamic enough.
>

What happens when no tmem is available? You swap to a volume. That's
the disk size needed.

>> Voila: you no longer depend on guests (you depend on the tmem domain,
>> but that's part of the host code), you don't need guest modifications,
>> so it works across a wider range of guests.
>>
> Ummm... no guest modifications, yet this special disk does everything
> you've described above (and, to meet my dynamicity requirements,
> varies in size as well)?
>

Your dynamic swap is limited too. And no, no guest modifications.

>>> BUT frontswap on Xen+tmem always truly swaps to RAM.
>>>
>> AND that's a problem because it puts the hypervisor at the mercy of the
>> guest.
>>
> As I described in a separate reply, this is simply not true.
>

I still don't understand why.

>>> So there are two users of frontswap for which the synchronous
>>> interface makes sense.
>>>
>> I believe there is only one. See below.
>>
>> The problem is not the complexity of the patch itself. It's the fact
>> that it introduces a new external API. If we refactor swapping, that
>> stands in the way.
>>
> Could you please explicitly identify what you are referring
> to as a new external API? The part that is different from
> the "only one" internal user?
>

Something completely internal to the guest can be replaced by something
completely different. Something that talks to a hypervisor will need
those hooks forever to avoid regressions.

>> a synchronous single-page DMA
>> API is a bad idea. Look at the Xen network and block code: while they
>> eventually do a memory copy for every page they see, they try to batch
>> multiple pages into an exit, and make the response asynchronous.
>>
> As noted VERY early in this thread, if/when it makes sense, frontswap
> can do exactly the same thing by adding a buffering layer invisible
> to the internal kernel interfaces.
>

So, you take a synchronous copyful interface, add another copy to make
it into an asynchronous interface, instead of using the original
asynchronous copyless interface.
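
To make the contrast concrete, the two shapes look roughly like this
(illustrative names, not the actual patch; standalone typedefs so the
fragment is self-contained):

typedef unsigned long pgoff_t;  /* stands in for the kernel type */
struct page;                    /* opaque here */

/* synchronous, copyful, page-at-a-time (the frontswap model): each
 * call copies one page and reports success/failure before returning,
 * so the caller can immediately free the page or fall back to disk */
struct sync_swap_ops {
        int (*put_page)(unsigned type, pgoff_t offset, struct page *p);
        int (*get_page)(unsigned type, pgoff_t offset, struct page *p);
};

/* asynchronous, batched, copyless (the block/net-ring model): the
 * backend gets several page descriptors at once and signals
 * completion later; the pages stay busy until the callback fires */
struct swap_batch {
        struct page *pages[16];
        pgoff_t offsets[16];
        unsigned nr;
        void (*complete)(struct swap_batch *b, int error);
};

With the first, every page costs an exit and a copy before it can be
freed; with the second, the copy can be avoided entirely, at the price
of pages staying pinned until completion.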

>> As an example, with a batched API you could save/restore the fpu
>> context
>> and use sse for copying the memory, while with a single page API you'd
>> probably lose out. Synchronous DMA, even for emulated hardware, is out
>> of place in 2010.
>>
> I think we agree that DMA makes sense when there is a lot of data to
> copy and makes little sense when there is only a little (e.g. a
> single page) to copy. So I guess we need to understand what the
> tradeoff is. So, do you have any idea what the breakeven point is
> for your favorite DMA engine for amount of data copied vs
> 1) locking the memory pages
> 2) programming the DMA engine
> 3) responding to the interrupt from the DMA engine
>
> And the simple act of waiting to collect enough pages to "batch"
> means none of those pages can be used until the last page is collected
> and the DMA engine is programmed and the DMA is complete.
> A page-at-a-time interface synchronously releases the pages
> for other (presumably more important) needs and thus, when
> memory is under extreme pressure, also reduces the probability
> of a (guest) OOM.
>

When swapping out, Linux already batches pages in the block device's
request queue. Swapping out is inherently asynchronous and batched:
you're swapping out those pages _because_ you don't need them, and
you're never interested in swapping out a single page. Linux already
reserves memory for use during swapout. There's no need to re-solve
solved problems.

Swapping in is less simple; it is mostly synchronous (in some cases it
isn't: with many threads, or with the preswap patches (IIRC unmerged)).
You can always choose to copy if you don't have enough to justify dma.

The networking stack seems to think 4096 bytes is a good size for dma
(see net/core/user_dma.c, NET_DMA_DEFAULT_COPYBREAK).
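
The copybreak logic in miniature (a sketch; dma_copy() here is a
hypothetical stand-in for a real DMA engine submission):

#include <string.h>
#include <stddef.h>

#define COPYBREAK 4096  /* same default as NET_DMA_DEFAULT_COPYBREAK */

int dma_copy(void *dst, const void *src, size_t len);  /* hypothetical */

/* below the threshold a plain memcpy beats programming the DMA engine
 * and taking its completion interrupt; above it, DMA wins */
static void copy_pages(void *dst, const void *src, size_t len)
{
        if (len < COPYBREAK)
                memcpy(dst, src, len);
        else
                dma_copy(dst, src, len);
}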

--
error compiling committee.c: too many arguments to function

From: Avi Kivity
On 05/02/2010 08:22 PM, Dan Magenheimer wrote:
>> It's bad, but it's better than ooming.
>>
>> The same thing happens with vcpus: you run 10 guests on one core, if
>> they all wake up, your cpu is suddenly 10x slower and has 30000x
>> interrupt latency (30ms vs 1us, assuming 3ms timeslices). Your disks
>> become slower as well.
>>
>> It's worse with memory, so you try to swap as a last resort. However,
>> swap is still faster than a crashed guest.
>>
> Your analogy only holds when the host administrator is either
> extremely greedy or stupid.

10x vcpu overcommit is reasonable in some situations (VDI, powersave at
night). Even a 2x vcpu overcommit will cause a 10000x interrupt latency
degradation.
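
The arithmetic behind the quoted figures is just (N - 1) timeslices of
waiting; a toy model, assuming strict round-robin and ~1us native
interrupt latency:

#include <stdio.h>

int main(void)
{
        int nr_vcpus = 10;      /* 10 guests on one core, as quoted */
        double slice_ms = 3.0;  /* 3ms timeslices, as quoted */
        double native_us = 1.0; /* ~1us bare-metal interrupt latency */
        double worst_ms = (nr_vcpus - 1) * slice_ms;

        printf("worst case ~%.0f ms, ~%.0fx native\n",
               worst_ms, worst_ms * 1000.0 / native_us);
        return 0;
}

For 10 vcpus on one core this prints ~27 ms and ~27000x, which rounds
to the 30ms/30000x quoted above.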

> My analogy only requires some
> statistical bad luck: Multiple guests with peaks and valleys
> of memory requirements happen to have their peaks align.
>

Not sure I understand.

>>> Third, host swapping makes live migration much more difficult.
>>> Either the host swap disk must be accessible to all machines
>>> or data sitting on a local disk MUST be migrated along with
>>> RAM (which is not impossible but complicates live migration
>>> substantially).
>>>
>> kvm does live migration with swapping, and has no special code to
>> integrate them.
>> :
>> Don't know about vmware, but kvm supports page sharing, swapping, and
>> live migration simultaneously.
>>
> Hmmm... I'll bet I can break it pretty easily. I think the
> case you raised that you thought would cause host OOM'ing
> will cause kvm live migration to fail.
>
> Or maybe not... when a guest is in the middle of a live migration,
> I believe (in Xen), the entire guest memory allocation (possibly
> excluding ballooned-out pages) must briefly be simultaneously in RAM
> on BOTH the source and target machines. That is, live migration is
> not "pipelined". Is this also true of KVM?

No. The entire guest address space can be swapped out on the source and
target, less the pages being copied to or from the wire, and pages
actively accessed by the guest. Of course performance will suck if all
memory is swapped out.

> If so, your
> statement above is just waiting for a corner case to break it.
> And if not, I expect you've got fault tolerance issues.
>

Not that I'm aware of.

>>> If you talk to VMware customers (especially web-hosting services)
>>> that have attempted to use overcommit technologies that require
>>> host-swapping, you will find that they quickly become allergic
>>> to memory overcommit and turn it off. The end users (users of
>>> the VMs that inexplicably grind to a halt) complain loudly.
>>> As a result, RAM has become a bottleneck in many many systems,
>>> which ultimately reduces the utility of servers and the value
>>> of virtualization.
>>>
>> Choosing the correct overcommit ratio is certainly not an easy task.
>> However, just hoping that memory will be available when you need it is
>> not a good solution.
>>
> Choosing the _optimal_ overcommit ratio is impossible without a
> prescient knowledge of the workload in each guest. Hoping memory
> will be available is certainly not a good solution, but if memory
> is not available guest swapping is much better than host swapping.
>

You cannot rely on guest swapping.

> And making RAM usage as dynamic as possible and live migration
> as easy as possible are keys to maximizing the benefits (and
> limiting the problems) of virtualization.
>

That is why you need overcommit. You make things dynamic with page
sharing and ballooning and live migration, but at some point you need a
failsafe fallback. The only failsafe fallback I can see (where the host
doesn't rely on guests) is swapping.

As far as I can tell, frontswap+tmem increases the problem. You loan
the guest some memory without the means to take it back, which increases
memory pressure on the host. The result is that if you want to avoid
swapping (or are unable to) you need to undercommit host resources.
Instead of sum(guest mem) + reserve < (host mem), you need sum(guest mem
+ committed tmem) + reserve < (host mem). You need more host memory,
fewer guests, or to be prepared to swap if the worst happens.
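
In code form, the two admission checks differ only by the
committed-tmem term (illustrative field names, byte units):

#include <stdbool.h>
#include <stddef.h>

struct host {
        size_t host_mem, reserve;
        size_t sum_guest_mem;   /* total RAM promised to guests */
        size_t committed_tmem;  /* frontswap pages the host cannot
                                 * unilaterally reclaim */
};

/* without frontswap: sum(guest mem) + reserve < host mem */
static bool fits_plain(const struct host *h)
{
        return h->sum_guest_mem + h->reserve < h->host_mem;
}

/* with irrevocable tmem, the grant shrinks what you can commit:
 * sum(guest mem + committed tmem) + reserve < host mem */
static bool fits_with_tmem(const struct host *h)
{
        return h->sum_guest_mem + h->committed_tmem + h->reserve
                < h->host_mem;
}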

--
error compiling committee.c: too many arguments to function
