| 	
From: Dan Magenheimer on 22 Apr 2010 09:50
Frontswap [PATCH 0/4] (was Transcendent Memory): overview

Patch applies to 2.6.34-rc5

In previous patch postings, frontswap was part of the Transcendent Memory ("tmem") patchset. This patchset refocuses not on the underlying technology (tmem) but instead on the useful functionality provided for Linux, and provides a clean API so that frontswap can provide this very useful functionality via a Xen tmem driver OR completely independent of tmem. For example: Nitin Gupta (of compcache and ramzswap fame) is implementing an in-kernel compression "backend" for frontswap; some believe frontswap will be a very nice interface for building RAM-like functionality for pseudo-RAM devices such as SSD or phase-change memory; and a Pune University team is looking at a backend for virtio (see OLS'2010).

A more complete description of frontswap can be found in the introductory comment in mm/frontswap.c (in PATCH 2/4) which is included below for convenience.

Note that an earlier version of this patch is now shipping in OpenSuSE 11.2 and will soon ship in a release of Oracle Enterprise Linux. Underlying tmem technology is now shipping in Oracle VM 2.2 and was just released in Xen 4.0 on April 15, 2010. (Search news.google.com for Transcendent Memory)

Signed-off-by: Dan Magenheimer <dan.magenheimer(a)oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy(a)goop.org>

 include/linux/frontswap.h |   98 ++++++++++++++
 include/linux/swap.h      |    2
 include/linux/swapfile.h  |   13 +
 mm/Kconfig                |   16 ++
 mm/Makefile               |    1
 mm/frontswap.c            |  301 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_io.c              |   12 +
 mm/swap.c                 |    4
 mm/swapfile.c             |   58 +++++++-
 9 files changed, 496 insertions(+), 9 deletions(-)

Frontswap is so named because it can be thought of as the opposite of a "backing" store for a swap device.
The storage is assumed to be a synchronous concurrency-safe page-oriented pseudo-RAM device (such as Xen's Transcendent Memory, aka "tmem", or in-kernel compressed memory, aka "zmem", or other RAM-like devices) which is not directly accessible or addressable by the kernel and is of unknown and possibly time-varying size. This pseudo-RAM device links itself to frontswap by setting the frontswap_ops pointer appropriately and the functions it provides must conform to certain policies as follows:

An "init" prepares the pseudo-RAM to receive frontswap pages and returns a non-negative pool id, used for all swap device numbers (aka "type"). A "put_page" will copy the page to pseudo-RAM and associate it with the type and offset associated with the page. A "get_page" will copy the page, if found, from pseudo-RAM into kernel memory, but will NOT remove the page from pseudo-RAM. A "flush_page" will remove the page from pseudo-RAM and a "flush_area" will remove ALL pages associated with the swap type (e.g., like swapoff) and notify the pseudo-RAM device to refuse further puts with that swap type.

Once a page is successfully put, a matching get on the page will always succeed. So when the kernel finds itself in a situation where it needs to swap out a page, it first attempts to use frontswap. If the put returns non-zero, the data has been successfully saved to pseudo-RAM and a disk write and, if the data is later read back, a disk read are avoided. If a put returns zero, pseudo-RAM has rejected the data, and the page can be written to swap as usual.

Note that if a page is put and the page already exists in pseudo-RAM (a "duplicate" put), either the put succeeds and the data is overwritten, or the put fails AND the page is flushed. This ensures stale data may never be obtained from pseudo-RAM.
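[Editor's illustration, not part of the posted patch.] The put/get/flush rules described above can be sketched as a small userspace model. Everything here is hypothetical: the `toy_*` names, the fixed-size array standing in for pseudo-RAM, and the `toy_accepting` switch that simulates the store refusing data. It follows the return convention stated in the description (non-zero means the put was accepted) and leaves out the "refuse further puts after flush_area" rule for brevity. It is not the code in mm/frontswap.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_SLOTS 8                 /* capacity of the toy pseudo-RAM, in pages */

struct toy_slot {                   /* one stored page, keyed by (type, offset) */
    bool used;
    unsigned type;
    unsigned long offset;
    unsigned char data[TOY_PAGE_SIZE];
};

static struct toy_slot toy_store[TOY_SLOTS];
static bool toy_accepting = true;   /* hypothetical: false = store refuses puts */

static struct toy_slot *toy_find(unsigned type, unsigned long offset)
{
    for (size_t i = 0; i < TOY_SLOTS; i++) {
        struct toy_slot *s = &toy_store[i];
        if (s->used && s->type == type && s->offset == offset)
            return s;
    }
    return NULL;
}

static struct toy_slot *toy_find_free(void)
{
    for (size_t i = 0; i < TOY_SLOTS; i++)
        if (!toy_store[i].used)
            return &toy_store[i];
    return NULL;
}

/* Returns 1 if the page was saved, 0 if it was rejected. */
static int toy_put_page(unsigned type, unsigned long offset, const void *page)
{
    struct toy_slot *s = toy_find(type, offset);

    if (!toy_accepting) {
        if (s)
            s->used = false;        /* failed duplicate put: drop the old copy */
        return 0;
    }
    if (!s)
        s = toy_find_free();
    if (!s)
        return 0;                   /* store is full: caller writes to disk */
    s->used = true;
    s->type = type;
    s->offset = offset;
    memcpy(s->data, page, TOY_PAGE_SIZE);   /* duplicate put overwrites */
    return 1;
}

/* Copies the page out if present; the stored copy is NOT removed. */
static int toy_get_page(unsigned type, unsigned long offset, void *page)
{
    struct toy_slot *s = toy_find(type, offset);

    if (!s)
        return 0;
    memcpy(page, s->data, TOY_PAGE_SIZE);
    return 1;
}

static void toy_flush_page(unsigned type, unsigned long offset)
{
    struct toy_slot *s = toy_find(type, offset);

    if (s)
        s->used = false;
}

static void toy_flush_area(unsigned type)
{
    for (size_t i = 0; i < TOY_SLOTS; i++)
        if (toy_store[i].used && toy_store[i].type == type)
            toy_store[i].used = false;
}

/* Walks through the rules; returns 0 if every check holds. */
int toy_demo(void)
{
    static unsigned char in[TOY_PAGE_SIZE], out[TOY_PAGE_SIZE];

    memset(in, 0xAB, sizeof in);
    assert(toy_put_page(0, 5, in) == 1);
    assert(toy_get_page(0, 5, out) == 1 && memcmp(in, out, sizeof in) == 0);
    assert(toy_get_page(0, 5, out) == 1);   /* get leaves the page in place */

    toy_accepting = false;
    assert(toy_put_page(0, 5, in) == 0);    /* duplicate put is rejected... */
    assert(toy_get_page(0, 5, out) == 0);   /* ...so the old copy is gone */
    toy_accepting = true;

    assert(toy_put_page(0, 9, in) == 1);
    toy_flush_page(0, 9);
    assert(toy_get_page(0, 9, out) == 0);

    assert(toy_put_page(1, 7, in) == 1);
    toy_flush_area(1);
    assert(toy_get_page(1, 7, out) == 0);
    return 0;
}
```

In this model a swap-out path would call `toy_put_page()` first and write the page to the swap device only when it returns 0, so a given page lives in exactly one of the two places.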
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ 	
From: Nitin Gupta on 23 Apr 2010 22:00
On 04/23/2010 08:22 PM, Avi Kivity wrote:
> On 04/23/2010 05:43 PM, Dan Magenheimer wrote:
>>>
>>> Perhaps I misunderstood. Isn't frontswap in front of the normal swap
>>> device? So we do have double swapping, first to frontswap (which is in
>>> memory, yes, but still a nonzero cost), then the normal swap device.
>>> The io subsystem is loaded with writes; you only save the reads.
>>> Better to swap to the hypervisor, and make it responsible for committing
>>> to disk on overcommit or keeping in RAM when memory is available. This
>>> way we avoid the write to disk if memory is in fact available (or at
>>> least defer it until later). This way you avoid both reads and writes
>>> if memory is available.
>>>
>> Each page is either in frontswap OR on the normal swap device,
>> never both. So, yes, both reads and writes are avoided if memory
>> is available and there is no write issued to the io subsystem if
>> memory is available. The is_memory_available decision is determined
>> by the hypervisor dynamically for each page when the guest attempts
>> a "frontswap_put". So, yes, you are indeed "swapping to the
>> hypervisor" but, at least in the case of Xen, the hypervisor
>> never swaps any memory to disk so there is never double swapping.
>>
>
> I see. So why not implement this as an ordinary swap device, with a
> higher priority than the disk device? this way we reuse an API and keep
> things asynchronous, instead of introducing a special purpose API.
>

ramzswap is exactly this: an ordinary swap device which stores every page in (compressed) memory and it's enabled as the highest priority swap. Currently, it stores these compressed chunks in guest memory itself but it is not very difficult to send these chunks out to host/hypervisor using virtio.

However, it suffers from unnecessary block I/O layer overhead and requires weird hooks in swap code, say to get notification when a swap slot is freed.
OTOH frontswap approach gets rid of any such artifacts and overheads.
(ramzswap: http://code.google.com/p/compcache/)

Thanks,
Nitin
From: Avi Kivity on 24 Apr 2010 14:30
On 04/24/2010 04:49 AM, Nitin Gupta wrote:
>
>> I see. So why not implement this as an ordinary swap device, with a
>> higher priority than the disk device? this way we reuse an API and keep
>> things asynchronous, instead of introducing a special purpose API.
>>
>>
> ramzswap is exactly this: an ordinary swap device which stores every page
> in (compressed) memory and its enabled as highest priority swap. Currently,
> it stores these compressed chunks in guest memory itself but it is not very
> difficult to send these chunks out to host/hypervisor using virtio.
>
> However, it suffers from unnecessary block I/O layer overhead and requires
> weird hooks in swap code, say to get notification when a swap slot is freed.
>

Isn't that TRIM?

> OTOH frontswap approach gets rid of any such artifacts and overheads.
> (ramzswap: http://code.google.com/p/compcache/)
>

Maybe we should optimize these overheads instead. Swap used to always be to slow devices, but swap-to-flash has the potential to make swap act like an extension of RAM.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
From: Dan Magenheimer on 24 Apr 2010 20:40
> >> I see. So why not implement this as an ordinary swap device, with a
> >> higher priority than the disk device? this way we reuse an API and
> >> keep things asynchronous, instead of introducing a special purpose API.
> >>
> > Because the swapping API doesn't adapt well to dynamic changes in
> > the size and availability of the underlying "swap" device, which
> > is very useful for swap to (bare-metal) hypervisor.
>
> Can we extend it? Adding new APIs is easy, but harder to maintain in
> the long term.

Umm... I think the difference between a "new" API and extending an existing one here is a choice of semantics. As designed, frontswap is an extremely simple, only-very-slightly-intrusive set of hooks that allows swap pages to, under some conditions, go to pseudo-RAM instead of an asynchronous disk-like device. It works today with at least one "backend" (Xen tmem), is shipping today in real distros, and is extremely easy to enable/disable via CONFIG or module... meaning no impact on anyone other than those who choose to benefit from it.

"Extending" the existing swap API, which has largely been untouched for many years, seems like a significantly more complex and error-prone undertaking that will affect nearly all Linux users with a likely long bug tail. And, by the way, there is no existence proof that it will be useful.

Seems like a no-brainer to me.

> Ok. For non traditional RAM uses I really think an async API is
> needed. If the API is backed by a cpu synchronous operation is fine,
> but once it isn't RAM, it can be all kinds of interesting things.

Well, we shall see. It may also be the case that the existing asynchronous swap API will work fine for some non traditional RAM; and it may also be the case that frontswap works fine for some non traditional RAM. I agree there is fertile ground for exploration here. But let's not allow our speculation on what may or may not work in the future halt forward progress of something that works today.

> Note that even if you do give the page to the guest, you still control
> how it can access it, through the page tables. So for example you can
> easily compress a guest's pages without telling it about it; whenever
> it touches them you decompress them on the fly.

Yes, at a much larger, more invasive cost to the kernel. Frontswap and cleancache and tmem are all well-layered for a good reason.

> >> I think it will be true in an overwhelming number of cases. Flash is
> >> new enough that most devices support scatter/gather.
> >>
> > I wasn't referring to hardware capability but to the availability
> > and timing constraints of the pages that need to be swapped.
> >
>
> I have a feeling we're talking past each other here.

Could be.

> Swap has no timing
> constraints, it is asynchronous and usually to slow devices.

What I was referring to is that the existing swap code DOES NOT always have the ability to collect N scattered pages before initiating an I/O write suitable for a device (such as an SSD) that is optimized for writing N pages at a time. That is what I meant by a timing constraint. See references to page_cluster in the swap code (and this is for contiguous pages, not scattered).

Dan
From: Dan Magenheimer on 24 Apr 2010 20:50
> > No, ANY put_page can fail, and this is a critical part of the API
> > that provides all of the flexibility for the hypervisor and all
> > the guests. (See previous reply.)
>
> The guest isn't required to do any put_page()s. It can issue lots of
> them when memory is available, and keep them in the hypervisor forever.
> Failing new put_page()s isn't enough for a dynamic system, you need to
> be able to force the guest to give up some of its tmem.

Yes, indeed, this is true. That is why it is important for any policy implemented behind frontswap to "bill" the guest if it is attempting to keep frontswap pages in the hypervisor forever and to prod the guest to reclaim them when it no longer needs super-fast emergency swap space. The frontswap patch already includes the kernel mechanism to enable this and the prodding can be implemented by a guest daemon (of which there already exists an existence proof).

(While devil's advocacy is always welcome, frontswap is NOT a cool academic science project where these issues have not been considered or tested.)
|
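[Editor's illustration, not part of the thread.] The two points in this exchange, that any put may be refused and that the guest can be prodded to hand pages back, can be modelled with a simple counter. All names below (`toy_pool`, `pool_put`, `pool_shrink`, `swap_out`) are hypothetical, as is the idea of a single numeric quota. The patch's actual reclaim mechanism and the guest daemon are not shown in the thread and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the hypervisor-side pool seen by one guest. */
struct toy_pool {
    size_t quota;   /* pages the hypervisor is currently willing to hold */
    size_t held;    /* pages this guest currently has in the pool */
};

/* Hypothetical counters for where swapped-out pages ended up. */
struct swap_stats {
    size_t to_pool;
    size_t to_disk;
};

/* Returns 1 if accepted, 0 if refused. The quota can change between calls,
 * so the caller must be prepared for any put to fail. */
static int pool_put(struct toy_pool *p)
{
    if (p->held >= p->quota)
        return 0;
    p->held++;
    return 1;
}

/* Guest-side reclaim, as a daemon might trigger it: pull pages back out of
 * the pool until at most `target` remain. Returns the number reclaimed. */
static size_t pool_shrink(struct toy_pool *p, size_t target)
{
    size_t reclaimed = 0;

    while (p->held > target) {
        p->held--;
        reclaimed++;
    }
    return reclaimed;
}

/* Swap-out path: try the pool first, fall back to disk on refusal. */
static void swap_out(struct toy_pool *p, struct swap_stats *st)
{
    if (pool_put(p))
        st->to_pool++;
    else
        st->to_disk++;
}

/* Returns 0 if every check holds. */
int pool_demo(void)
{
    struct toy_pool pool = { .quota = 3, .held = 0 };
    struct swap_stats st = { 0, 0 };

    for (int i = 0; i < 5; i++)
        swap_out(&pool, &st);
    assert(st.to_pool == 3 && st.to_disk == 2);

    pool.quota = 1;                 /* hypervisor tightens the quota */
    swap_out(&pool, &st);           /* new puts now fail... */
    assert(st.to_disk == 3);
    assert(pool.held == 3);         /* ...but pages already held remain */

    assert(pool_shrink(&pool, 1) == 2);   /* the guest gives two pages back */
    assert(pool.held == 1);
    return 0;
}
```

The model shows why refusing new puts alone does not shrink a guest's footprint: `held` only drops when the guest itself reclaims, which is the role the thread assigns to billing policy plus a guest daemon.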