From: Dan Magenheimer on
> Stop right here. Instead of improving existing swap api, you just
> create one because it is less work.
>
> We do not want apis to accumulate; please just fix the existing one.

> If we added all the apis that worked when proposed, we'd have
> an unmaintainable mess by about 1996.
>
> Why can't frontswap just use existing swap api?

Hi Pavel!

The existing swap API as it stands is inadequate for an efficient
synchronous interface (e.g. for swapping to RAM). Nitin and I have
each independently found this to be true. But swap-to-RAM is very
useful in some cases (swap-to-kernel-compressed-RAM,
swap-to-hypervisor-RAM, and maybe others) that had not even been
conceived of when the existing swap API was designed, many years
ago, for swap-to-disk. Swap-to-RAM can relieve memory pressure
faster and more resource-efficiently than swap-to-device, but it
must assume that the RAM available for swap-to-RAM is dynamic
(not fixed in size). (And swap-to-SSD, when the SSD is an I/O
device on an I/O bus, is NOT the same as swap-to-RAM.)

In my opinion, frontswap is NOT a new API, but the simplest
possible extension of the existing swap API to allow for
efficient swap-to-RAM. Avi's comments about a new API
(as he explained later in the thread) refer to a new API
between kernel and hypervisor, which is essentially the
Transcendent Memory interface. Frontswap was separated from
the tmem dependency to enable Nitin's swap-to-kernel-compressed-RAM
and the possibility that there may be other interesting
swap-to-RAM uses.

Does this help?

Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Nitin Gupta on
On 04/27/2010 06:25 PM, Pavel Machek wrote:
>
>>> Can we extend it? Adding new APIs is easy, but harder to maintain in
>>> the long term.
>>
>> Umm... I think the difference between a "new" API and extending
>> an existing one here is a choice of semantics. As designed, frontswap
>> is an extremely simple, only-very-slightly-intrusive set of hooks that
>> allows swap pages to, under some conditions, go to pseudo-RAM instead
> ...
>> "Extending" the existing swap API, which has largely been untouched for
>> many years, seems like a significantly more complex and error-prone
>> undertaking that will affect nearly all Linux users with a likely long
>> bug tail. And, by the way, there is no existence proof that it
>> will be useful.
>
>> Seems like a no-brainer to me.
>
> Stop right here. Instead of improving existing swap api, you just
> create one because it is less work.
>
> We do not want apis to accumulate; please just fix the existing one.


I'm a bit confused: What do you mean by 'existing swap API'?
Frontswap simply hooks into swap_readpage() and swap_writepage() to
call frontswap_{get,put}_page() respectively. To avoid a hardcoded
implementation of these functions, it introduces struct frontswap_ops
so that custom implementations of the frontswap get/put/etc. functions
can be provided. This allows easy implementation of swap-to-hypervisor,
in-memory compressed swapping, etc. with a common set of hooks.

So how can the frontswap approach be seen as introducing a new API?

Thanks,
Nitin

From: Pavel Machek on
On Tue 2010-04-27 20:13:39, Nitin Gupta wrote:
> On 04/27/2010 06:25 PM, Pavel Machek wrote:
> >
> >>> Can we extend it? Adding new APIs is easy, but harder to maintain in
> >>> the long term.
> >>
> >> Umm... I think the difference between a "new" API and extending
> >> an existing one here is a choice of semantics. As designed, frontswap
> >> is an extremely simple, only-very-slightly-intrusive set of hooks that
> >> allows swap pages to, under some conditions, go to pseudo-RAM instead
> > ...
> >> "Extending" the existing swap API, which has largely been untouched for
> >> many years, seems like a significantly more complex and error-prone
> >> undertaking that will affect nearly all Linux users with a likely long
> >> bug tail. And, by the way, there is no existence proof that it
> >> will be useful.
> >
> >> Seems like a no-brainer to me.
> >
> > Stop right here. Instead of improving existing swap api, you just
> > create one because it is less work.
> >
> > We do not want apis to accumulate; please just fix the existing one.
>
>
> I'm a bit confused: What do you mean by 'existing swap API'?
> Frontswap simply hooks into swap_readpage() and swap_writepage() to
> call frontswap_{get,put}_page() respectively. To avoid a hardcoded
> implementation of these functions, it introduces struct frontswap_ops
> so that custom implementations of the frontswap get/put/etc. functions
> can be provided. This allows easy implementation of swap-to-hypervisor,
> in-memory compressed swapping, etc. with a common set of hooks.

Yes, and that set of hooks is a new API, right?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
From: Dave Hansen on
On Fri, 2010-04-30 at 09:43 -0700, Dan Magenheimer wrote:
> And, importantly, "have your host expose a device which is write
> cached by host memory"... you are implying that all guest swapping
> should be done to a device managed/controlled by the host? That
> eliminates guest swapping to directIO/SRIOV devices, doesn't it?

If you have a single swap device, sure. But I can also see a case
where you have a "fast" swap and a "slow" swap.

The part of the argument about frontswap that I like is the lack of
sizing exposed to the guest. When you're dealing with swap-only, you
are stuck adding or removing swap devices if you want to "grow/shrink"
the memory footprint. If the host (or whatever is backing the
frontswap) wants to change the sizes, they're fairly free to.

The part that bothers me is that it just pushes the problem
elsewhere. For KVM, we still have to figure out _somewhere_ what to do
with all those pages. It's nice that the host would have the freedom to
either swap or keep them around, but it doesn't really fix the problem.

I do see the lack of sizing exposed to the guest as being a bad thing,
too. Let's say we set aside 25% of system RAM to back a frontswap-type
device on a KVM host. The first time a user boots up their set of VMs
and 25% of their RAM is gone, they're going to start complaining,
despite the fact that their 25% smaller systems may end up being faster.

I think I'd be more convinced if we saw this thing actually get used
somehow. How is a ram-backed frontswap better than a /dev/ramX-backed
swap file in practice?

-- Dave

From: Jeremy Fitzhardinge on
On 04/30/2010 09:16 AM, Avi Kivity wrote:
> Given that whenever frontswap fails you need to swap anyway, it is
> better for the host to never fail a frontswap request and instead back
> it with disk storage if needed. This way you avoid a pointless vmexit
> when you're out of memory. Since it's disk backed it needs to be
> asynchronous and batched.

I'd argue the opposite. There's no point in having the host do swapping
on behalf of guests if guests can do it themselves; it's just a
duplication of functionality. You end up having two IO paths for each
guest, and the resulting problems in trying to account for the IO,
rate-limit it, etc. If you can simply say "all guest disk IO happens
via this single interface", it's much easier to manage.

If frontswap has value, it's because it's providing a new facility to
guests that doesn't already exist and can't be easily emulated with
existing interfaces.

It seems to me the great strengths of the synchronous interface are:

* it matches the needs of an existing implementation (tmem in Xen)
* it is simple to understand within the context of the kernel code
  it's used in

Simplicity is important, because it allows the mm code to be understood
and maintained without having to have a deep understanding of
virtualization. One of the problems with CMM2 was that it puts a lot of
intricate constraints on the mm code which can be easily broken, which
would only become apparent in subtle edge cases in a CMM2-using
environment. An additional async frontswap-like interface - while not as
complex as CMM2 - still makes things harder for mm maintainers.

The downside is that it may not match some implementation in which the
get/put operations could take a long time (i.e., physical IO to a slow
mechanical device). But a general Linux principle is not to overdesign
interfaces for hypothetical users, only for real needs.

Do you think that you would be able to use frontswap in kvm if it were
an async interface, but not otherwise? Or are you arguing a hypothetical?

> At this point we're back with the ordinary swap API. Simply have your
> host expose a device which is write cached by host memory, you'll have
> all the benefits of frontswap with none of the disadvantages, and with
> no changes to guest code.

Yes, that's comfortably within the "guests page themselves" model.
Setting up a block device for the domain which is backed by pagecache
(something we usually try hard to avoid) is pretty straightforward. But
it doesn't work well for Xen unless the blkback domain is sized so that
it has all of Xen's free memory in its pagecache.

That said, it does concern me that the host/hypervisor is left holding
the bag on frontswapped pages. An evil/uncooperative/lazy guest can just pump
a whole lot of pages into the frontswap pool and leave them there. I
guess this is mitigated by the fact that the API is designed such that
they can't update or read the data without also allowing the hypervisor
to drop the page (updates can fail destructively, and reads are also
destructive), so the guest can't use it as a clumsy extension of their
normal dedicated memory.

J
