From: Dan Magenheimer
Frontswap [PATCH 0/4] (was Transcendent Memory): overview

Patch applies to 2.6.34-rc5

In previous patch postings, frontswap was part of the Transcendent
Memory ("tmem") patchset. This patchset refocuses not on the underlying
technology (tmem) but on the functionality frontswap provides to Linux,
and adds a clean API so that this functionality can be delivered via a
Xen tmem driver OR completely independently of tmem.
For example: Nitin Gupta (of compcache and ramzswap fame) is implementing
an in-kernel compression "backend" for frontswap; some believe
frontswap will be a very nice interface for building RAM-like functionality
for pseudo-RAM devices such as SSD or phase-change memory; and a Pune
University team is looking at a backend for virtio (see OLS'2010).

A more complete description of frontswap can be found in the introductory
comment in mm/frontswap.c (in PATCH 2/4), which is included below
for convenience.

Note that an earlier version of this patch is now shipping in OpenSuSE 11.2
and will soon ship in a release of Oracle Enterprise Linux. The underlying
tmem technology is now shipping in Oracle VM 2.2 and was just released
in Xen 4.0 on April 15, 2010. (Search news.google.com for "Transcendent
Memory".)

Signed-off-by: Dan Magenheimer <dan.magenheimer(a)oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy(a)goop.org>

 include/linux/frontswap.h |   98 ++++++++++++++
 include/linux/swap.h      |    2
 include/linux/swapfile.h  |   13 +
 mm/Kconfig                |   16 ++
 mm/Makefile               |    1
 mm/frontswap.c            |  301 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_io.c              |   12 +
 mm/swap.c                 |    4
 mm/swapfile.c             |   58 +++++++-
 9 files changed, 496 insertions(+), 9 deletions(-)

Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device. The storage is assumed to be
a synchronous, concurrency-safe, page-oriented pseudo-RAM device (such as
Xen's Transcendent Memory, aka "tmem", or in-kernel compressed memory,
aka "zmem", or other RAM-like devices) which is not directly accessible
or addressable by the kernel and is of unknown and possibly time-varying
size. Such a pseudo-RAM device links itself to frontswap by setting the
frontswap_ops pointer appropriately; the functions it provides must
conform to the following policies:

An "init" prepares the pseudo-RAM to receive frontswap pages and returns
a non-negative pool id, used for all swap device numbers (aka "type").
A "put_page" will copy the page to pseudo-RAM and associate it with
the type and offset associated with the page. A "get_page" will copy the
page, if found, from pseudo-RAM into kernel memory, but will NOT remove
the page from pseudo-RAM. A "flush_page" will remove the page from
pseudo-RAM and a "flush_area" will remove ALL pages associated with the
swap type (e.g., like swapoff) and notify the pseudo-RAM device to refuse
further puts with that swap type.

Once a page is successfully put, a matching get on the page will always
succeed. So when the kernel finds itself in a situation where it needs
to swap out a page, it first attempts to use frontswap. If the put returns
non-zero, the data has been successfully saved to pseudo-RAM; a disk
write is avoided and, if the data is later read back, so is a disk read.
If a put returns zero, pseudo-RAM has rejected the data, and the page is
written to swap as usual.
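
To make that flow concrete, a rough sketch of the hook in the swap-out
path (the real hook is the mm/page_io.c hunk in PATCH 3/4; the helper
names here are illustrative, not the patch's symbols):

/* Illustrative sketch; the actual hook lives in mm/page_io.c. */
int swap_out_page(struct page *page)
{
        /* Try pseudo-RAM first; non-zero means the backend accepted it.
         * page_swap_type()/page_swap_offset() are hypothetical helpers
         * standing in for however the swap entry is decoded. */
        if (frontswap_put(page_swap_type(page), page_swap_offset(page),
                          page) != 0)
                return 0;       /* no disk write is issued at all */

        /* Rejected by the backend: fall back to the normal disk path. */
        return write_page_to_swap_device(page); /* hypothetical helper */
}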

Note that if a page is put and the page already exists in pseudo-RAM
(a "duplicate" put), either the put succeeds and the data is overwritten,
or the put fails AND the page is flushed. This ensures that stale data can
never be obtained from pseudo-RAM.
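
A sketch of that rule, assuming the ops structure sketched earlier and a
flag recording whether the offset was previously put (the patch tracks
similar state in mm/frontswap.c; exact details may differ):

/* Illustrative sketch of the duplicate-put rule. */
extern struct frontswap_ops frontswap_ops;      /* from the sketch above */

int frontswap_put_dup_sketch(unsigned type, unsigned long offset,
                             struct page *page, int put_earlier)
{
        int ret = frontswap_ops.put_page(type, offset, page);

        if (ret == 0 && put_earlier)
                /* Overwrite failed: flush the old copy so a later get
                 * can never return stale data. */
                frontswap_ops.flush_page(type, offset);
        return ret;
}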
From: Nitin Gupta
On 04/23/2010 08:22 PM, Avi Kivity wrote:
> On 04/23/2010 05:43 PM, Dan Magenheimer wrote:
>>>
>>> Perhaps I misunderstood. Isn't frontswap in front of the normal swap
>>> device? So we do have double swapping, first to frontswap (which is in
>>> memory, yes, but still a nonzero cost), then the normal swap device.
>>> The io subsystem is loaded with writes; you only save the reads.
>>> Better to swap to the hypervisor, and make it responsible for
>>> committing
>>> to disk on overcommit or keeping in RAM when memory is available. This
>>> way we avoid the write to disk if memory is in fact available (or at
>>> least defer it until later). This way you avoid both reads and writes
>>> if memory is available.
>>>
>> Each page is either in frontswap OR on the normal swap device,
>> never both. So, yes, both reads and writes are avoided if memory
>> is available and there is no write issued to the io subsystem if
>> memory is available. The is_memory_available decision is determined
>> by the hypervisor dynamically for each page when the guest attempts
>> a "frontswap_put". So, yes, you are indeed "swapping to the
>> hypervisor" but, at least in the case of Xen, the hypervisor
>> never swaps any memory to disk so there is never double swapping.
>>
>
> I see. So why not implement this as an ordinary swap device, with a
> higher priority than the disk device? This way we reuse an API and keep
> things asynchronous, instead of introducing a special-purpose API.
>

ramzswap is exactly this: an ordinary swap device which stores every page
in (compressed) memory, and it's enabled as the highest-priority swap. Currently
it stores these compressed chunks in guest memory itself, but it is not very
difficult to send these chunks out to the host/hypervisor using virtio.
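
(For illustration, enabling such a device ahead of disk swap boils down
to a swapon(2) call like the following; the device path is hypothetical,
and this is equivalent to "swapon -p 100 /dev/ramzswap0":)

#include <sys/swap.h>
#include <stdio.h>

int main(void)
{
        int prio = 100; /* outrank any disk-backed swap device */
        int flags = SWAP_FLAG_PREFER |
                    ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);

        /* /dev/ramzswap0 is hypothetical; use the actual device node. */
        if (swapon("/dev/ramzswap0", flags) != 0) {
                perror("swapon");
                return 1;
        }
        return 0;
}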

However, it suffers from unnecessary block I/O layer overhead and requires
weird hooks in the swap code, say to get notified when a swap slot is freed.
OTOH, the frontswap approach gets rid of any such artifacts and overheads.
(ramzswap: http://code.google.com/p/compcache/)

Thanks,
Nitin
From: Avi Kivity
On 04/24/2010 04:49 AM, Nitin Gupta wrote:
>
>> I see. So why not implement this as an ordinary swap device, with a
>> higher priority than the disk device? This way we reuse an API and keep
>> things asynchronous, instead of introducing a special-purpose API.
>>
>>
> ramzswap is exactly this: an ordinary swap device which stores every page
> in (compressed) memory, and it's enabled as the highest-priority swap. Currently
> it stores these compressed chunks in guest memory itself, but it is not very
> difficult to send these chunks out to the host/hypervisor using virtio.
>
> However, it suffers from unnecessary block I/O layer overhead and requires
> weird hooks in the swap code, say to get notified when a swap slot is freed.
>

Isn't that TRIM?

> OTOH, the frontswap approach gets rid of any such artifacts and overheads.
> (ramzswap: http://code.google.com/p/compcache/)
>

Maybe we should optimize these overheads instead. Swap has traditionally
been to slow devices, but swap-to-flash has the potential to make swap act
like an extension of RAM.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

From: Dan Magenheimer
> >> I see. So why not implement this as an ordinary swap device, with a
> >> higher priority than the disk device? This way we reuse an API and
> >> keep things asynchronous, instead of introducing a special-purpose API.
> >>
> >>
> > Because the swapping API doesn't adapt well to dynamic changes in
> > the size and availability of the underlying "swap" device, which
> > is very useful for swapping to a (bare-metal) hypervisor.
>
> Can we extend it? Adding new APIs is easy, but harder to maintain in
> the long term.

Umm... I think the difference between a "new" API and extending
an existing one here is a choice of semantics. As designed, frontswap
is an extremely simple, only-very-slightly-intrusive set of hooks that
allows swap pages, under some conditions, to go to pseudo-RAM instead
of an asynchronous disk-like device. It works today with at least
one "backend" (Xen tmem), is shipping today in real distros, and is
extremely easy to enable/disable via CONFIG or module... meaning
no impact on anyone other than those who choose to benefit from it.

"Extending" the existing swap API, which has largely been untouched for
many years, seems like a significantly more complex and error-prone
undertaking that will affect nearly all Linux users, with a likely long
bug tail. And, by the way, there is no existence proof that it
will be useful.

Seems like a no-brainer to me.

> Ok. For non-traditional RAM uses I really think an async API is
> needed. If the API is backed by a CPU, synchronous operation is fine,
> but once it isn't RAM, it can be all kinds of interesting things.

Well, we shall see. It may also be the case that the existing
asynchronous swap API will work fine for some non-traditional RAM,
and it may also be the case that frontswap works fine for some
non-traditional RAM. I agree there is fertile ground for exploration
here. But let's not allow our speculation about what may or may
not work in the future to halt forward progress of something that works
today.

> Note that even if you do give the page to the guest, you still control
> how it can access it, through the page tables. So for example you can
> easily compress a guest's pages without telling it about it; whenever it
> touches them you decompress them on the fly.

Yes, at a much larger, more invasive cost to the kernel. Frontswap
and cleancache and tmem are all well-layered for a good reason.

> >> I think it will be true in an overwhelming number of cases. Flash is
> >> new enough that most devices support scatter/gather.
> >>
> > I wasn't referring to hardware capability but to the availability
> > and timing constraints of the pages that need to be swapped.
> >
>
> I have a feeling we're talking past each other here.

Could be.

> Swap has no timing
> constraints; it is asynchronous and usually to slow devices.

What I was referring to is that the existing swap code DOES NOT
always have the ability to collect N scattered pages before
initiating an I/O write suitable for a device (such as an SSD)
that is optimized for writing N pages at a time. That is what
I meant by a timing constraint. See the references to page_cluster
in the swap code (and even that clustering is for contiguous pages, not
scattered ones).
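
(For reference: the clustering window is derived from the page_cluster
tunable, /proc/sys/vm/page-cluster, as 1 << page_cluster contiguous
pages. A toy illustration of the window size:)

#include <stdio.h>

/* Toy illustration: swap I/O is clustered over at most 1 << page_cluster
 * CONTIGUOUS swap offsets; scattered pages cannot be gathered this way. */
int main(void)
{
        for (unsigned pc = 0; pc <= 5; pc++)
                printf("page_cluster=%u -> up to %u contiguous pages per I/O\n",
                       pc, 1u << pc);
        return 0;
}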

Dan
From: Dan Magenheimer
> > No, ANY put_page can fail, and this is a critical part of the API
> > that provides all of the flexibility for the hypervisor and all
> > the guests. (See previous reply.)
>
> The guest isn't required to do any put_page()s. It can issue lots of
> them when memory is available, and keep them in the hypervisor forever.
> Failing new put_page()s isn't enough for a dynamic system; you need to
> be able to force the guest to give up some of its tmem.

Yes, indeed, this is true. That is why it is important for any
policy implemented behind frontswap to "bill" the guest if it
attempts to keep frontswap pages in the hypervisor forever,
and to prod the guest to reclaim them when it no longer needs
super-fast emergency swap space. The frontswap patch already includes
the kernel mechanism to enable this, and the prodding can be implemented
by a guest daemon (for which an existence proof already exists).
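
(The mechanism in question is a "shrink" entry point in mm/frontswap.c
that pulls pages back out of pseudo-RAM into ordinary swap. The sketch
below is illustrative; the names frontswap_curr_pages and
frontswap_shrink are assumptions for this sketch, not necessarily the
patch's exact symbols:)

/* Illustrative: how a guest daemon's prod might be serviced.
 * Both function names below are assumptions for this sketch. */
extern unsigned long frontswap_curr_pages(void);
extern void frontswap_shrink(unsigned long target_pages);

static void handle_hypervisor_prod(unsigned long keep_pct)
{
        unsigned long target = frontswap_curr_pages() * keep_pct / 100;

        /* Pages above "target" are pulled back from pseudo-RAM and
         * rewritten to the ordinary (disk) swap device. */
        frontswap_shrink(target);
}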

(While devil's advocacy is always welcome, frontswap is NOT a
cool academic science project where these issues have not been
considered or tested.)