From: Dan Magenheimer on
(I'll back down on the CMM2 comparisons until I can go
back and read the paper :-)

> >> [frontswap is] really
> >> not very different from a synchronous swap device.
> >>
> > Not to beat a dead horse, but there is a very key difference:
> > The size and availability of frontswap is entirely dynamic;
> > any page-to-be-swapped can be rejected at any time even if
> > a page was previously successfully swapped to the same index.
> > Every other swap device is much more static so the swap code
> > assumes a static device. Existing swap code can account for
> > "bad blocks" on a static device, but this is far from sufficient
> > to handle the dynamicity needed by frontswap.
>
> Given that whenever frontswap fails you need to swap anyway, it is
> better for the host to never fail a frontswap request and instead back
> it with disk storage if needed. This way you avoid a pointless vmexit
> when you're out of memory. Since it's disk backed it needs to be
> asynchronous and batched.
>
> At this point we're back with the ordinary swap API. Simply have your
> host expose a device which is write cached by host memory, you'll have
> all the benefits of frontswap with none of the disadvantages, and with
> no changes to the guest.

I think you are making a number of possibly false assumptions here:
1) The host [the frontswap backend may not even be a hypervisor]
2) can back it with disk storage [not if it is a bare-metal hypervisor]
3) avoid a pointless vmexit [no vmexit for a non-VMX (e.g. PV) guest]
4) when you're out of memory [how can this be determined outside of
the hypervisor?]

And, importantly, "have your host expose a device which is write
cached by host memory"... you are implying that all guest swapping
should be done to a device managed/controlled by the host? That
eliminates guest swapping to directIO/SRIOV devices, doesn't it?
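
To restate the "dynamicity" point above in code: here is a rough
userspace-style sketch (all names are illustrative; this is not the
actual frontswap patch) of what "any put may be rejected at any time"
means for the caller:

#include <stdbool.h>
#include <stdio.h>

struct page { char data[4096]; };

static long backend_free_pages = 1;   /* pretend the backend is nearly full */

/* A frontswap-style backend may accept or reject ANY page at store time,
 * depending on how much memory it happens to have at that moment. */
static bool backend_store(unsigned long offset, const struct page *p)
{
        (void)offset; (void)p;
        if (backend_free_pages <= 0)
                return false;          /* rejected: caller must cope with it */
        backend_free_pages--;
        return true;
}

static void device_write(unsigned long offset, const struct page *p)
{
        (void)p;
        printf("page %lu written to the real swap device\n", offset);
}

/* The swap-out path treats rejection as a normal, frequent event, not as
 * a "bad block": the same offset may well be accepted on the next try. */
static void swap_out_page(unsigned long offset, const struct page *p)
{
        if (backend_store(offset, p))
                printf("page %lu kept by the backend\n", offset);
        else
                device_write(offset, p);
}

int main(void)
{
        struct page p = { { 0 } };

        swap_out_page(1, &p);          /* accepted */
        swap_out_page(2, &p);          /* rejected -> falls back to disk */
        return 0;
}

A static swap device can't refuse individual writes this way, which is
why the existing swap code has no natural place to hang this behavior.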

Anyway, I think we can see now why frontswap might not be a good
match for a hosted hypervisor (KVM), but that doesn't make it
any less useful for a bare-metal hypervisor (and its usefulness for
in-kernel compressed swap, and for possible future pseudo-RAM
technologies, is still TBD).

Dan
From: Pavel Machek on
Hi!

> > Stop right here. Instead of improving the existing swap API, you just
> > create a new one because it is less work.
> >
> > We do not want APIs to accumulate; please just fix the existing one.
>
> > If we had added all the APIs that worked when they were proposed, we'd
> > have an unmaintainable mess by about 1996.
> >
> > Why can't frontswap just use the existing swap API?
>
> Hi Pavel!
>
> The existing swap API as it stands is inadequate for an efficient
> synchronous interface (e.g. for swapping to RAM). Both Nitin
> and I independently have found this to be true. But swap-to-RAM

So... how much slower is swapping to RAM over the current interface when
compared to the proposed interface, and how much slower is that than just
using the memory directly?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
From: Avi Kivity on
On 04/30/2010 07:43 PM, Dan Magenheimer wrote:
>> Given that whenever frontswap fails you need to swap anyway, it is
>> better for the host to never fail a frontswap request and instead back
>> it with disk storage if needed. This way you avoid a pointless vmexit
>> when you're out of memory. Since it's disk backed it needs to be
>> asynchronous and batched.
>>
>> At this point we're back with the ordinary swap API. Simply have your
>> host expose a device which is write cached by host memory, you'll have
>> all the benefits of frontswap with none of the disadvantages, and with
>> no changes to the guest.
>>
> I think you are making a number of possibly false assumptions here:
> 1) The host [the frontswap backend may not even be a hypervisor]
>

True. My remarks only apply to frontswap-to-hypervisor, for internally
consumed frontswap the situation is different.

> 2) can back it with disk storage [not if it is a bare-metal hypervisor]
>

So it seems a bare-metal hypervisor has less access to the bare metal
than a non-bare-metal hypervisor?

Seriously, leave the bare-metal FUD to Simon. People on this list know
that kvm and Xen have exactly the same access to the hardware (well
actually Xen needs to use privileged guests to access some of its hardware).

> 3) avoid a pointless vmexit [no vmexit for a non-VMX (e.g. PV) guest]
>

There's still an exit. It's much faster than a vmx/svm vmexit but still
nontrivial.

But why are we optimizing for five-year-old hardware?

> 4) when you're out of memory [how can this be determined outside of
> the hypervisor?]
>

It's determined by the hypervisor, same as with tmem. The guest swaps
to a virtual disk, the hypervisor places the data in RAM if it's
available, or on disk if it isn't. Write-back caching in all its glory.
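
As a toy model of that write-back caching (all names here are
hypothetical; this is not kvm or Xen code), the host side of such a
virtual swap device might look roughly like:

#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE   4096
#define CACHE_SLOTS 4                      /* tiny on purpose */

struct cache_slot {
        bool used;
        unsigned long sector;
        char data[PAGE_SIZE];
};

static struct cache_slot cache[CACHE_SLOTS];

static void disk_write(unsigned long sector, const void *buf)
{
        (void)sector; (void)buf;           /* stand-in for a real disk write */
}

static void disk_read(unsigned long sector, void *buf)
{
        (void)sector;
        memset(buf, 0, PAGE_SIZE);         /* stand-in for a real disk read */
}

/* Guest writes a swap page: keep it in host RAM if possible, otherwise
 * let it go to the backing disk. Either way the guest never sees a failure. */
static void host_swap_write(unsigned long sector, const void *buf)
{
        struct cache_slot *free_slot = NULL;

        for (int i = 0; i < CACHE_SLOTS; i++) {
                if (cache[i].used && cache[i].sector == sector) {
                        memcpy(cache[i].data, buf, PAGE_SIZE);
                        return;            /* overwrite the cached copy */
                }
                if (!cache[i].used && !free_slot)
                        free_slot = &cache[i];
        }
        if (free_slot) {
                free_slot->used = true;
                free_slot->sector = sector;
                memcpy(free_slot->data, buf, PAGE_SIZE);
                return;                    /* kept in host RAM */
        }
        disk_write(sector, buf);           /* cache full: back it with disk */
}

/* Guest reads a swap page: serve it from the RAM cache when present. */
static void host_swap_read(unsigned long sector, void *buf)
{
        for (int i = 0; i < CACHE_SLOTS; i++) {
                if (cache[i].used && cache[i].sector == sector) {
                        memcpy(buf, cache[i].data, PAGE_SIZE);
                        return;
                }
        }
        disk_read(sector, buf);
}

int main(void)
{
        char page[PAGE_SIZE] = "hello", out[PAGE_SIZE];

        host_swap_write(7, page);          /* guest swaps a page out */
        host_swap_read(7, out);            /* and gets it back from host RAM */
        return 0;
}

The guest never sees a rejection; running out of host memory just means
the write goes to the backing disk instead.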

> And, importantly, "have your host expose a device which is write
> cached by host memory"... you are implying that all guest swapping
> should be done to a device managed/controlled by the host? That
> eliminates guest swapping to directIO/SRIOV devices, doesn't it?
>

You can have multiple swap devices.

With SR-IOV, you'll see synchronous frontswap reduce throughput: SR-IOV
will swap with fewer than one exit per page and will DMA guest pages
directly, while frontswap/tmem will pay a one-exit-per-page hit (even if
no swap actually happens) plus the copy cost (if it does).

The API really, really wants to be asynchronous.
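
To put rough numbers on it (the constants below are purely illustrative
assumptions, not measurements):

#include <stdio.h>

/* Purely illustrative constants, not measurements. */
#define EXIT_COST_US  2.0          /* assumed cost of one guest exit        */
#define COPY_COST_US  1.0          /* assumed cost of copying one 4K page   */
#define BATCH         64           /* pages per request in the batched case */

int main(void)
{
        /* Synchronous, per-page hypercall: one exit plus one copy per page. */
        double sync_per_page  = EXIT_COST_US + COPY_COST_US;

        /* Asynchronous, batched, block-style submission: the exit is
         * amortized over the batch and the data can be DMAed, not copied. */
        double async_per_page = EXIT_COST_US / BATCH;

        printf("synchronous : %.2f us/page\n", sync_per_page);
        printf("async+batch : %.2f us/page\n", async_per_page);
        return 0;
}

Whatever the real constants turn out to be, the synchronous per-page exit
is the term that never amortizes.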

> Anyway, I think we can see now why frontswap might not be a good
> match for a hosted hypervisor (KVM), but that doesn't make it
> any less useful for a bare-metal hypervisor (and its usefulness for
> in-kernel compressed swap, and for possible future pseudo-RAM
> technologies, is still TBD).
>

In-kernel compressed swap does seem to be a good match for a synchronous
API. For future memory devices, or even bare-metal buzzword-compliant
hypervisors, I disagree. An asynchronous API is required for
efficiency, and they'll all have swap capability sooner or later (kvm,
vmware, and I believe xen 4 already do).
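
As a rough userspace sketch of why synchronous is fine for the
compressed-swap case (apart from zlib's compress(), the names here are
made up for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>                  /* for compress(); build with -lz */

#define PAGE_SIZE 4096

/* A synchronous compressed-swap store: the page is compressed into an
 * in-memory buffer and the call returns immediately. There is no device
 * to overlap with, so asynchrony buys nothing here. */
static void *store_compressed(const unsigned char *page, size_t *out_len)
{
        uLongf len = compressBound(PAGE_SIZE);
        unsigned char *buf = malloc(len);

        if (!buf)
                return NULL;
        if (compress(buf, &len, page, PAGE_SIZE) != Z_OK) {
                free(buf);
                return NULL;       /* caller falls back to the real swap path */
        }
        *out_len = len;            /* usually much less than PAGE_SIZE */
        return buf;
}

int main(void)
{
        unsigned char page[PAGE_SIZE] = { 0 };   /* trivially compressible */
        size_t len;
        void *stored = store_compressed(page, &len);

        if (stored) {
                printf("%d bytes stored as %zu bytes, synchronously\n",
                       PAGE_SIZE, len);
                free(stored);
        }
        return 0;
}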

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

From: Avi Kivity on
On 04/29/2010 05:42 PM, Dan Magenheimer wrote:
>>
>> Yes, and that set of hooks is new API, right?
>>
> Well, no, if you define API as "application programming interface"
> this is NOT exposed to userland. If you define API as a new
> in-kernel function call, yes, these hooks are a new API, but that
> is true of virtually any new code in the kernel. If you define
> API as some new interface between the kernel and a hypervisor,
> yes, this is a new API, but it is "optional" at several levels
> so that any hypervisor (e.g. KVM) can completely ignore it.
>

The concern is not with the hypervisor, but with Linux. More external
APIs reduce our flexibility to change things.

> So please let's not argue about whether the code is a "new API"
> or not, but instead consider whether the concept is useful or not
> and if useful, if there is or is not a cleaner way to implement it.
>

I'm convinced it's useful. The API is so close to a block device
(read/write with key/value vs read/write with sector/value) that we
should make the effort not to introduce a new API.
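
To make the comparison concrete, the two interfaces differ mainly in the
shape of the key and in who completes the operation. Illustrative
declarations only; neither struct is the actual proposed code:

/* frontswap-style: synchronous, keyed by (type, page offset); either call
 * may refuse (put) or miss (get) at any time, and the caller blocks for
 * the duration of the copy. */
struct page;

struct sync_kv_ops {
        int  (*put)(unsigned type, unsigned long offset, struct page *p);
        int  (*get)(unsigned type, unsigned long offset, struct page *p);
        void (*flush)(unsigned type, unsigned long offset);
};

/* block-style: keyed by sector, submitted asynchronously; completion is
 * reported later via a callback, and requests can be batched and merged. */
struct bio_like {
        unsigned long sector;
        struct page  *p;
        int           write;
        void        (*done)(struct bio_like *b, int error);
};

int submit(struct bio_like *b);    /* returns before the I/O has completed */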

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

From: Avi Kivity on
On 04/29/2010 09:59 PM, Avi Kivity wrote:
>
> I'm convinced it's useful. The API is so close to a block device
> (read/write with key/value vs read/write with sector/value) that we
> should make the effort not to introduce a new API.
>

Plus, of course, the asynchrony and batching of the block layer. Even
if you don't use a DMA engine, you improve performance by exiting once
per several dozen pages instead of once per page, perhaps enough to
allow the hypervisor to justify copying the memory with non-temporal moves.
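
A toy illustration of the batching point (the "hypercall" is just a
stand-in function and the batch size is arbitrary):

#include <stdio.h>

#define BATCH 64                   /* arbitrary illustrative batch size */

static int exits;

/* Stand-in for one expensive guest exit that hands the hypervisor a whole
 * list of pages at once. */
static void submit_batch(const unsigned long *offsets, int n)
{
        (void)offsets; (void)n;
        exits++;
}

static unsigned long pending[BATCH];
static int npending;

/* Queue pages and exit once per BATCH pages instead of once per page; the
 * hypervisor can then copy (or DMA) the whole batch in one go. */
static void queue_page(unsigned long offset)
{
        pending[npending++] = offset;
        if (npending == BATCH) {
                submit_batch(pending, npending);
                npending = 0;
        }
}

int main(void)
{
        for (unsigned long i = 0; i < 1024; i++)
                queue_page(i);
        if (npending)                      /* flush any partial batch */
                submit_batch(pending, npending);

        printf("1024 pages swapped out with %d exits\n", exits);
        return 0;
}

1024 pages go out with 16 exits instead of 1024.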

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
