From: Dan Magenheimer on
> > My analogy only requires some
> > statistical bad luck: Multiple guests with peaks and valleys
> > of memory requirements happen to have their peaks align.
>
> Not sure I understand.

Virtualization is all about statistical multiplexing of fixed
resources. If all guests demand a resource simultaneously,
that is peak alignment == "bad luck".

(But, honestly, I don't even remember the point either of us
was trying to make here :-)

> > Or maybe not... when a guest is in the middle of a live migration,
> > I believe (in Xen), the entire guest memory allocation (possibly
> > excluding ballooned-out pages) must be simultaneously in RAM briefly
> > in BOTH the host and target machine. That is, live migration is
> > not "pipelined". Is this also true of KVM?
>
> No. The entire guest address space can be swapped out on the source and
> target, less the pages being copied to or from the wire, and pages
> actively accessed by the guest. Of course performance will suck if all
> memory is swapped out.

Will it suck to the point of eventually causing the live migration
to fail? Or will swap-storms effectively cause denial-of-service
for other guests?

Anyway, if live migration works fine with mostly-swapped-out guests
on KVM, that's great.

> > Choosing the _optimal_ overcommit ratio is impossible without
> > prescient knowledge of the workload in each guest. Hoping memory
> > will be available is certainly not a good solution, but if memory
> > is not available guest swapping is much better than host swapping.
>
> You cannot rely on guest swapping.

Frontswap only relies on the guest having an existing swap device,
defined in /etc/fstab like any normal Linux swap device. If this
is "relying on guest swapping", yes frontswap relies on guest swapping.

Or if you are referring to your "host can't force guest to
reclaim pages" argument, see the other thread.

> > And making RAM usage as dynamic as possible and live migration
> > as easy as possible are keys to maximizing the benefits (and
> > limiting the problems) of virtualization.
>
> That is why you need overcommit. You make things dynamic with page
> sharing and ballooning and live migration, but at some point you need a
> failsafe fallback. The only failsafe fallback I can see (where the
> host doesn't rely on guests) is swapping.

No fallback is required if the overcommitment is done intelligently.

> As far as I can tell, frontswap+tmem increases the problem. You loan
> the guest some memory without the means to take it back, this increases
> memory pressure on the host. The result is that if you want to avoid
> swapping (or are unable to) you need to undercommit host resources.
> Instead of sum(guest mem) + reserve < (host mem), you need
> sum(guest mem + committed tmem) + reserve < (host mem). You need more
> host memory, or fewer guests, or to be prepared to swap if the worst
> happens.

Your argument might make sense from a KVM perspective but is
not true of frontswap with Xen+tmem. With KVM, the host's
swap disk(s) can all be used as "slow RAM". With Xen, there is
no host swap disk. So, yes, the degree of potential memory
overcommitment is smaller with Xen+tmem than with KVM. In
order to avoid all the host problems with host-swapping,
frontswap+Xen+tmem intentionally limits the degree of memory
overcommitment... but this is just memory overcommitment done
intelligently.
From: Dan Magenheimer on
> > Simple policies must exist and must be enforced by the hypervisor to
> > ensure this doesn't happen. Xen+tmem provides these policies and
> > enforces them. And it enforces them very _dynamically_ to constantly
> > optimize RAM utilization across multiple guests each with dynamically
> > varying RAM usage. Frontswap fits nicely into this framework.
>
> Can you explain what "enforcing" means in this context? You loaned the
> guest some pages, can you enforce their return?

We're getting into hypervisor policy issues, but given that probably
nobody else is listening by now, I guess that's OK. ;-)

The enforcement is on the "put" side. The page is not loaned,
it is freely given, but only if the guest is within its
contractual limitations (e.g. within its predefined "maxmem").
If the guest chooses to never remove the pages from frontswap,
that's the guest's option, but that part of the guest's memory
allocation can never be used for anything else, so it is in the
guest's self-interest to "get" or "flush" the pages from frontswap.
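
To make the put-side policy concrete, here is a minimal standalone
sketch of the kind of check a hypervisor backend might perform; all
names (guest_ctx, tmem_frontswap_put, ...) are hypothetical and this
is not the actual Xen tmem code:

/* Illustrative sketch only -- not the actual Xen tmem implementation. */
#include <errno.h>
#include <string.h>

#define PAGE_SIZE 4096

struct guest_ctx {
	unsigned long maxmem_pages;	/* contractual limit ("maxmem") */
	unsigned long pages_in_use;	/* directly owned + frontswap pages */
};

/*
 * "Put" a page into frontswap on behalf of a guest.  The put is refused
 * (and the guest falls back to its own swap device) unless the guest is
 * still within its contractual allocation.
 */
static int tmem_frontswap_put(struct guest_ctx *g, void *dst, const void *src)
{
	if (g->pages_in_use + 1 > g->maxmem_pages)
		return -ENOMEM;		/* enforcement happens here */

	memcpy(dst, src, PAGE_SIZE);	/* page is now hypervisor-owned */
	g->pages_in_use++;		/* counted against the guest until it
					   "gets" or "flushes" the page */
	return 0;
}

The point is that nothing is ever "loaned": a put either succeeds
within the guest's limit or fails immediately, and a failed put just
means the page goes to the guest's own swap device.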

> > Huge performance hits that are completely inexplicable to a user
> > give virtualization a bad reputation. If the user (i.e. guest,
> > not host, administrator) can at least see "Hmmm... I'm doing a lot
> > of swapping, guess I'd better pay for more (virtual) RAM", then
> > the user objections are greatly reduced.
>
> What you're saying is "don't overcommit".

Not at all. I am saying "overcommit, but do it intelligently".

> That's a good policy for some
> scenarios but not for others. Note it applies equally well for cpu as
> well as memory.

Perhaps, but CPU overcommit has been a well-understood
part of computing for a very long time and users, admins,
and hosting providers all know how to recognize it and
deal with it. Not so with overcommitment of memory;
the only exposure to memory limitations is "my disk light
is flashing a lot, I'd better buy more RAM". Obviously,
this doesn't translate to virtualization very well.

And, as for your interrupt latency analogy, let's
revisit that if/when Xen or KVM support CPU overcommitment
for real-time-sensitive guests. Until then, your analogy
is misleading.

> frontswap+tmem is not overcommit, it's undercommit. You have spare
> memory, and you give it away. It isn't a replacement. However, without
> the means to reclaim this spare memory, it can result in overcommit.

But you are missing part of the magic: Once the memory
page is no longer directly addressable (AND this implies not
directly writable) by the guest, the hypervisor can do interesting
things with it, such as compression and deduplication.

As a result, the sum of pages used by all the guests can exceed
the total pages of RAM in the system. Thus overcommitment.
I agree that the degree of overcommitment is less than possible
with host-swapping, but none of the evil issues of host-swapping
happen. Again, this is "intelligent overcommitment". Other
existing forms are "overcommit and cross your fingers that bad
things don't happen."
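
To put illustrative numbers on it (made up for the example, not
measurements): take a 16GB host with 1GB reserved for the hypervisor,
running four guests. Give each guest 3GB of directly addressable RAM
(12GB total) and let each keep about 1.5GB of swapped-out pages in
frontswap (6GB of guest pages). If those frontswap pages compress and
deduplicate at roughly 2:1, they occupy only ~3GB of host RAM, so the
host uses 12GB + 3GB = 15GB of its available 15GB while the guests
collectively hold 12GB + 6GB = 18GB of pages -- more pages than the
host has RAM, with no host swapping anywhere.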

> > Xen+tmem uses the SAME internal kernel interface. The Xen-specific
> > code which performs the Xen-specific stuff (hypercalls) is only in
> > the Xen-specific directory.
>
> This makes it an external interface.
> :
> Something completely internal to the guest can be replaced by something
> completely different. Something that talks to a hypervisor will need
> those hooks forever to avoid regressions.

Uh, no. As I've said, everything about frontswap is entirely
optional, both at compile-time and run-time. A frontswap-enabled
guest is fully compatible with a hypervisor with no frontswap;
a frontswap-enabled hypervisor is fully compatible with a guest
with no frontswap. The only thing that is reserved forever is
a hypervisor-specific "hypercall number" which is not exposed in
the Linux kernel except in Xen-specific code. And, for Xen,
frontswap shares the same hypercall number with cleancache.

So, IMHO, you are being alarmist. This is not an "API
maintenance" problem for Linux.

> Exactly as large as the swap space which the guest would have in the
> frontswap+tmem case.
> :
> Not needed, though I expect it is already supported (SAN volumes do
> grow).
> :
> If block layer overhead is a problem, go ahead and optimize it instead
> of adding new interfaces to bypass it. Though I expect it wouldn't be
> needed, and if any optimization needs to be done it is in the swap
> layer.
> Optimizing swap has the additional benefit of improving performance on
> flash-backed swap.
> :
> What happens when no tmem is available? you swap to a volume. That's
> the disk size needed.
> :
> Your dynamic swap is limited too. And no, no guest modifications.

You keep saying you are going to implement all of the dynamic features
of frontswap with no changes to the guest and no copying and no
host-swapping. You are being disingenuous. VMware has had a lot
of people working on virtualization a lot longer than you or I have.
Don't you think they would have done this by now?

Frontswap exists today and is even shipping in real released products.
If you can work your magic (in Xen... I am not trying to claim
frontswap should work with KVM), please show us the code.

> So, you take a synchronous copyful interface, add another copy to make
> it into an asynchronous interface, instead of using the original
> asynchronous copyless interface.

"Add another copy" is not required any more than it is with the
other examples you cited.

The "original asynchronous copyless interface" works because DMA
for devices has been around for >40 years and has been greatly
refined. We're not talking about DMA to a device here, we're
talking about DMA from one place in RAM to another (i.e. from
guest RAM to hypervisor RAM). Do you have examples of DMA engines
that do page-size-ish RAM-to-RAM more efficiently than copying?
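
For a rough idea of the copy cost being discussed, a trivial userspace
microbenchmark like the one below (not from any patch; cache-hot, so it
flatters memcpy) gives a per-page RAM-to-RAM copy time to compare
against the setup cost of programming a DMA engine:

/* Rough sketch: time page-sized RAM-to-RAM copies. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE_SIZE 4096
#define ITERS (1L << 20)

int main(void)
{
	char *src = malloc(PAGE_SIZE), *dst = malloc(PAGE_SIZE);
	struct timespec t0, t1;
	long i;
	double ns;

	if (!src || !dst)
		return 1;
	memset(src, 0x5a, PAGE_SIZE);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++) {
		src[0] = (char)i;	/* keep copies from being optimized away */
		memcpy(dst, src, PAGE_SIZE);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.1f ns per %d-byte copy (%d)\n", ns / ITERS, PAGE_SIZE, dst[0]);
	free(src);
	free(dst);
	return 0;
}

(Compile with -lrt on older glibc.)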

> The networking stack seems to think 4096 bytes is a good size for dma
> (see net/core/user_dma.c, NET_DMA_DEFAULT_COPYBREAK).

Networking is device-to-RAM, not RAM-to-RAM.

> When swapping out, Linux already batches pages in the block device's
> request queue. Swapping out is inherently asynchronous and batched,
> you're swapping out those pages _because_ you don't need them, and
> you're never interested in swapping out a single page. Linux already
> reserves memory for use during swapout. There's no need to re-solve
> solved problems.

Swapping out is inherently asynchronous and batched because it was
designed for swapping to a device, while you are claiming that the
same _unchanged_ interface is suitable for swap-to-hypervisor-RAM
and at the same time saying that the block layer might need
to be "optimized" (apparently without code changes).

I'm not trying to re-solve a solved problem; frontswap solves a NEW
problem, with very little impact to existing code.

> Swapping in is less simple, it is mostly synchronous (in some cases it
> isn't: with many threads, or with the preswap patches (IIRC unmerged)).
> You can always choose to copy if you don't have enough to justify dma.

Do you have a pointer to these preswap patches?

From: Pavel Machek on

> > If block layer overhead is a problem, go ahead and optimize it instead
> > of adding new interfaces to bypass it. Though I expect it wouldn't be
> > needed, and if any optimization needs to be done it is in the swap
> > layer.
> > Optimizing swap has the additional benefit of improving performance on
> > flash-backed swap.
> > :
> > What happens when no tmem is available? you swap to a volume. That's
> > the disk size needed.
> > :
> > Your dynamic swap is limited too. And no, no guest modifications.
>
> You keep saying you are going to implement all of the dynamic features
> of frontswap with no changes to the guest and no copying and no
> host-swapping. You are being disingenuous. VMware has had a lot

I don't see why "no copying" is a requirement. I believe the
requirement should be "it is fast enough".
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
From: Martin Schwidefsky on
On Fri, 30 Apr 2010 09:08:00 -0700
Dave Hansen <dave(a)linux.vnet.ibm.com> wrote:

> On Fri, 2010-04-30 at 08:59 -0700, Dan Magenheimer wrote:
> > Dave or others can correct me if I am wrong, but I think CMM2 also
> > handles dirty pages that must be retained by the hypervisor. The
> > difference between CMM2 (for dirty pages) and frontswap is that
> > CMM2 sets hints that can be handled asynchronously while frontswap
> > provides explicit hooks that synchronously succeed/fail.
>
> Once pages were dirtied (or I guess just slightly before), they became
> volatile, and I don't think the hypervisor could do anything with them.
> It could still swap them out like usual, but none of the CMM-specific
> optimizations could be performed.

Well, almost correct :-)
A dirty page (or one that is about to become dirty) can be in one of two
CMMA states:
1) stable
This is the case for pages where the kernel is doing some operation on
the page that will make it dirty, e.g. I/O. Before the kernel can
allow the operation, the page has to be made stable. If the state
conversion to stable fails because the hypervisor removed the page,
the page needs to be deleted from the page cache and recreated from
scratch.
2) potentially-volatile
This state is used for page cache pages for which a writable mapping
exists. The page can be removed by the hypervisor as long as the
physical per-page dirty bit is not set. As soon as the bit is set, the
page is considered stable, although the CMMA state is still
potentially-volatile.

In both cases the only thing the hypervisor can do with a dirty page is
to swap it as usual.
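
As a toy model of the two rules above (purely illustrative C, not the
s390 guest page-state/ESSA interface; all names are made up):

/* Toy model of the CMMA rules described above. */
#include <stdbool.h>
#include <stdio.h>

enum cmma_state { CMMA_STABLE, CMMA_POT_VOLATILE };

struct guest_page {
	enum cmma_state state;
	bool host_discarded;	/* hypervisor reclaimed the frame */
	bool phys_dirty;	/* physical per-page dirty bit */
};

/*
 * Rule 1: before the kernel dirties a page (e.g. for I/O) it must make
 * the page stable; if that fails because the host discarded the page,
 * the caller has to drop it from the page cache and recreate it.
 */
static bool make_stable(struct guest_page *p)
{
	if (p->host_discarded)
		return false;
	p->state = CMMA_STABLE;
	return true;
}

/*
 * Rule 2: a potentially-volatile page may be discarded by the host only
 * while its physical dirty bit is clear; once the bit is set the page is
 * treated as stable even though its CMMA state is unchanged.
 */
static bool host_may_discard(const struct guest_page *p)
{
	return p->state == CMMA_POT_VOLATILE && !p->phys_dirty;
}

int main(void)
{
	struct guest_page p = { CMMA_POT_VOLATILE, false, false };

	printf("discardable before write: %d\n", host_may_discard(&p));
	p.phys_dirty = true;	/* guest writes through a writable mapping */
	printf("discardable after write:  %d\n", host_may_discard(&p));
	return make_stable(&p) ? 0 : 1;
}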

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.
