From: Nitin Gupta on
On 04/24/2010 11:57 PM, Avi Kivity wrote:
> On 04/24/2010 04:49 AM, Nitin Gupta wrote:
>>
>>> I see. So why not implement this as an ordinary swap device, with a
>>> higher priority than the disk device? This way we reuse an API and keep
>>> things asynchronous, instead of introducing a special-purpose API.
>>>
>>>
>> ramzswap is exactly this: an ordinary swap device which stores every page
>> in (compressed) memory, and it is enabled as the highest-priority swap
>> device. Currently, it stores these compressed chunks in guest memory
>> itself, but it is not very difficult to send these chunks out to the
>> host/hypervisor using virtio.
>>
>> However, it suffers from unnecessary block I/O layer overhead and requires
>> weird hooks in swap code, say to get notification when a swap slot is
>> freed.
>>
>
> Isn't that TRIM?

No: TRIM (discard) is not sufficient here. The problem is that we require a
callback _as soon as_ a page (swap slot) is freed. Otherwise, stale data
quickly accumulates in memory, defeating the whole purpose of in-memory
compressed swap devices (like ramzswap).

Increasing the frequency of discards is not an option either:
- Creating discard bio requests itself needs memory, and these swap devices
come into the picture only under low-memory conditions.
- We would need to regularly scan swap_map to issue these discards. Increasing
the discard frequency thus means more frequent scanning, which would still not
be fast enough for ramzswap's needs.

>
>> OTOH, the frontswap approach gets rid of any such artifacts and overheads.
>> (ramzswap: http://code.google.com/p/compcache/)
>>
>
> Maybe we should optimize these overheads instead. Swap has always gone
> to slow devices, but swap-to-flash has the potential to make swap act
> like an extension of RAM.
>

Spending a lot of effort optimizing an overhead that can be avoided entirely
is probably not worth it.

Also, I think the choice of a synchronous API for frontswap and cleancache
is justified, as they are meant to send pages to host *RAM*. If you want to
use other devices such as SSDs, they should just be added as another swap
device, as we do currently -- they should not be used as frontswap storage
directly.
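To illustrate why synchronous is natural here: when the backend is host RAM,
a put is little more than a bounded page copy through a hypercall. A minimal
sketch, with hv_tmem_put() as a hypothetical stand-in for the actual
hypervisor call:

/* Guest-side put: hand one page to the hypervisor's tmem pool.
 * Synchronous, because the "device" is host RAM; it may fail at
 * any time if the host is short on memory, in which case the
 * page simply continues down the regular swap path. */
static int tmem_frontswap_put_page(unsigned type, pgoff_t offset,
                                   struct page *page)
{
	unsigned long pfn = page_to_pfn(page);

	return hv_tmem_put(type, offset, pfn) ? -1 : 0;
}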

Thanks,
Nitin
From: Avi Kivity on
On 04/25/2010 03:41 AM, Dan Magenheimer wrote:
>>> No, ANY put_page can fail, and this is a critical part of the API
>>> that provides all of the flexibility for the hypervisor and all
>>> the guests. (See previous reply.)
>>>
>> The guest isn't required to do any put_page()s. It can issue lots of
>> them when memory is available, and keep them in the hypervisor forever.
>> Failing new put_page()s isn't enough for a dynamic system; you need to
>> be able to force the guest to give up some of its tmem.
>>
> Yes, indeed, this is true. That is why it is important for any
> policy implemented behind frontswap to "bill" the guest if it
> is attempting to keep frontswap pages in the hypervisor forever
> and to prod the guest to reclaim them when it no longer needs
> super-fast emergency swap space. The frontswap patch already includes
> the kernel mechanism to enable this, and the prodding can be implemented
> by a guest daemon (for which an existence proof already exists).
>

In this case you could use the same mechanism to stop new put_page()s?

Frontswap seems like a reverse balloon, where the balloon is in
hypervisor space instead of guest space.
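For context, the put_page() failure semantics under discussion sit in the
guest's swap-out path, roughly as follows (a sketch of where the frontswap
hook lands, not the exact patch; swap_writepage_bio() is a hypothetical name
for the normal block-I/O path):

/* In swap_writepage(): offer the page to the hypervisor first;
 * on failure, fall through to the ordinary swap-out path. */
if (frontswap_put_page(page) == 0) {
	set_page_writeback(page);
	unlock_page(page);
	end_page_writeback(page);
	return 0;		/* page now lives in host tmem */
}
return swap_writepage_bio(page, wbc);

Because any put may fail, the guest never depends on tmem capacity, and the
hypervisor remains free to refuse pages at will.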

> (While devil's advocacy is always welcome, frontswap is NOT a
> cool academic science project where these issues have not been
> considered or tested.)
>


Good to know.

From: Avi Kivity on
On 04/25/2010 03:30 AM, Dan Magenheimer wrote:
>>>> I see. So why not implement this as an ordinary swap device, with a
>>>> higher priority than the disk device? This way we reuse an API and
>>>> keep things asynchronous, instead of introducing a special-purpose API.
>>>>
>>>>
>>> Because the swapping API doesn't adapt well to dynamic changes in
>>> the size and availability of the underlying "swap" device; such
>>> adaptability is very useful for swap to a (bare-metal) hypervisor.
>>>
>> Can we extend it? Adding new APIs is easy, but harder to maintain in
>> the long term.
>>
> Umm... I think the difference between a "new" API and extending
> an existing one here is a choice of semantics. As designed, frontswap
> is an extremely simple, only-very-slightly-intrusive set of hooks that
> allows swap pages to, under some conditions, go to pseudo-RAM instead
> of an asynchronous disk-like device. It works today with at least
> one "backend" (Xen tmem), is shipping today in real distros, and is
> extremely easy to enable/disable via CONFIG or module... meaning
> no impact on anyone other than those who choose to benefit from it.
>
> "Extending" the existing swap API, which has largely been untouched for
> many years, seems like a significantly more complex and error-prone
> undertaking that will affect nearly all Linux users with a likely long
> bug tail. And, by the way, there is no existence proof that it
> will be useful.
>
> Seems like a no-brainer to me.
>

My issue is with the API's synchronous nature. Both RAM and more exotic
memories can be driven with DMA instead of CPU copying; a synchronous
interface gives this up.
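The asynchronous style being argued for would look roughly like this: a put
that queues a descriptor (which a backend could satisfy with a DMA engine)
and reports the outcome through a completion callback. All names below are
illustrative:

/* Completion-based put: the backend may move the page by DMA;
 * the caller learns the result via ->done(). */
struct tmem_request {
	unsigned	type;
	pgoff_t		offset;
	struct page	*page;
	void		(*done)(struct tmem_request *req, int error);
};

/* Queues the transfer and returns immediately; ->done() later
 * reports success or failure (e.g. the host refusing the page). */
int tmem_put_page_async(struct tmem_request *req);

With a synchronous hook, by contrast, the CPU performs the copy itself and
stalls for its duration.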

>> Ok. For non-traditional RAM uses I really think an async API is
>> needed. If the API is backed by a CPU, a synchronous operation is fine,
>> but once it isn't RAM, it can be all kinds of interesting things.
>>
> Well, we shall see. It may also be the case that the existing
> asynchronous swap API will work fine for some non-traditional RAM,
> and it may also be the case that frontswap works fine for some
> non-traditional RAM. I agree there is fertile ground for exploration
> here. But let's not allow our speculation on what may or may
> not work in the future to halt forward progress of something that
> works today.
>

Let's not allow the urge to merge to prevent us from doing the right thing.

>
>
>> Note that even if you do give the page to the guest, you still control
>> how it can access it, through the page tables. So for example you can
>> easily compress a guest's pages without telling it about it; whenever
>> it
>> touches them you decompress them on the fly.
>>
> Yes, at a much larger, more invasive cost to the kernel. Frontswap,
> cleancache, and tmem are all well-layered for a good reason.
>

No need to change the kernel at all; the hypervisor controls the page
tables.
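To make the point concrete, the hypervisor could do this entirely in its own
fault path, along these lines (purely illustrative names; this is
hypervisor-side code, not the guest kernel):

/* The guest touched a page the hypervisor had compressed behind
 * its back: decompress it, restore the mapping, retry the access. */
static int handle_gpfn_fault(struct domain *d, unsigned long gpfn)
{
	struct zpage *z = find_compressed_page(d, gpfn);
	void *page;

	if (!z)
		return FAULT_NOT_OURS;	/* not a compressed page */

	page = alloc_host_page();
	decompress_page(z, page);
	map_guest_page(d, gpfn, page, P2M_RW);
	free_compressed_page(z);
	return FAULT_RETRY;		/* guest re-executes the access */
}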

>> Swap has no timing
>> constraints; it is asynchronous and usually goes to slow devices.
>>
> What I was referring to is that the existing swap code DOES NOT
> always have the ability to collect N scattered pages before
> initiating an I/O write suitable for a device (such as an SSD)
> that is optimized for writing N pages at a time. That is what
> I meant by a timing constraint. See the references to page_cluster
> in the swap code (and even that handles only contiguous pages, not
> scattered ones).
>

I see. Given that swap-to-flash will soon be way more common than
frontswap, it needs to be solved (either in flash or in the swap code).
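For illustration, one shape such batching could take on the write side. This
sketch borrows the block layer's explicit plugging interface
(blk_start_plug()/blk_finish_plug()) purely to show the idea, and
swap_write_batch() is a hypothetical wrapper; scattered slots still mean one
bio per page, but submission is batched so the device sees a single burst:

/* Queue N scattered swap pages under one plug so the block layer
 * can coalesce and dispatch them together. */
static void swap_write_batch(struct page **pages,
                             struct writeback_control *wbc, int nr)
{
	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);
	for (i = 0; i < nr; i++)
		swap_writepage(pages[i], wbc);	/* one bio per page */
	blk_finish_plug(&plug);
}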

From: Avi Kivity on
On 04/25/2010 04:12 PM, Dan Magenheimer wrote:
>>
>> In this case you could use the same mechanism to stop new put_page()s?
>>
> You are suggesting that the hypervisor communicate rapidly changing
> physical memory availability information to a userland daemon in each guest,
> and that each daemon communicate this information to its respective kernel
> to notify the kernel that hypervisor memory is not available?
>
> Seems very convoluted to me, and anyway it doesn't eliminate the need
> for a hook placed exactly where the frontswap_put hook is placed.
>

Yeah, it's pretty ugly. Balloons typically communicate without a daemon
too.

>> Seems frontswap is like a reverse balloon, where the balloon is in
>> hypervisor space instead of the guest space.
>>
> That's a reasonable analogy. Frontswap serves nicely as an
> emergency safety valve when a guest has given up (too) much of
> its memory via ballooning but unexpectedly has an urgent need
> that can't be serviced quickly enough by the balloon driver.
>

(or ordinary swap)
