From: Dan Magenheimer on
> My issue is with the API's synchronous nature. Both RAM and more exotic
> memories can be used with DMA instead of copying. A synchronous
> interface gives this up.
> :
> Let's not allow the urge to merge prevent us from doing the right
> thing.
> :
> I see. Given that swap-to-flash will soon be way more common than
> frontswap, it needs to be solved (either in flash or in the swap code).

While I admit that I started this whole discussion by implying
that frontswap (and cleancache) might be useful for SSDs, I think
we are going far astray here. Frontswap is synchronous for a
reason: It uses real RAM, but RAM that is not directly addressable
by a (guest) kernel. SSDs (at least today) are still I/O devices;
even though they may be very fast, they still live on a PCI (or
slower) bus and use DMA. Frontswap is not intended for use with
I/O devices.
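
For reference, the interface in question is just a handful of synchronous
per-page hooks. Roughly (a sketch; field names approximate, not verbatim
from the patch):

struct frontswap_ops {
	void (*init)(unsigned type);                       /* new swap area */
	int  (*put_page)(unsigned type, pgoff_t offset,
			 struct page *page);               /* copy page out */
	int  (*get_page)(unsigned type, pgoff_t offset,
			 struct page *page);               /* copy page back */
	void (*flush_page)(unsigned type, pgoff_t offset); /* slot freed */
	void (*flush_area)(unsigned type);                 /* swapoff */
};

Each hook completes, or fails, before the kernel proceeds; that is the
synchrony under discussion.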

Today's memory technologies are either RAM that can be addressed
by the kernel, or I/O devices that sit on an I/O bus. The
exotic memories that I am referring to may be a hybrid:
memory that is fast enough to live on a QPI/HyperTransport link,
but slow enough that you wouldn't want to randomly mix and
hand out to userland apps some pages from "exotic RAM" and some
pages from "normal RAM". Such memory makes no sense today
because OSes wouldn't know what to do with it. But it MAY
make sense with frontswap (and cleancache).

Nevertheless, frontswap works great today with a bare-metal
hypervisor. I think it stands on its own merits, regardless
of one's vision of future SSD/memory technologies.
From: Avi Kivity on
On 04/25/2010 04:37 PM, Dan Magenheimer wrote:
>> My issue is with the API's synchronous nature. Both RAM and more exotic
>> memories can be used with DMA instead of copying. A synchronous
>> interface gives this up.
>> :
>> Let's not allow the urge to merge prevent us from doing the right
>> thing.
>> :
>> I see. Given that swap-to-flash will soon be way more common than
>> frontswap, it needs to be solved (either in flash or in the swap code).
>>
> While I admit that I started this whole discussion by implying
> that frontswap (and cleancache) might be useful for SSDs, I think
> we are going far astray here. Frontswap is synchronous for a
> reason: It uses real RAM, but RAM that is not directly addressable
> by a (guest) kernel. SSDs (at least today) are still I/O devices;
> even though they may be very fast, they still live on a PCI (or
> slower) bus and use DMA. Frontswap is not intended for use with
> I/O devices.
>
> Today's memory technologies are either RAM that can be addressed
> by the kernel, or I/O devices that sit on an I/O bus. The
> exotic memories that I am referring to may be a hybrid:
> memory that is fast enough to live on a QPI/HyperTransport link,
> but slow enough that you wouldn't want to randomly mix and
> hand out to userland apps some pages from "exotic RAM" and some
> pages from "normal RAM". Such memory makes no sense today
> because OSes wouldn't know what to do with it. But it MAY
> make sense with frontswap (and cleancache).
>
> Nevertheless, frontswap works great today with a bare-metal
> hypervisor. I think it stands on its own merits, regardless
> of one's vision of future SSD/memory technologies.
>

Even when frontswapping to RAM on a bare-metal hypervisor it makes sense
to use an async API, in case you have a DMA engine on board.
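
A hypothetical sketch of what an async variant could look like (illustrative
only, not a concrete proposal):

/* put_page queues the copy and returns immediately; the backend invokes
 * the completion callback once the DMA (or other transport) finishes. */
typedef void (*frontswap_done_fn)(unsigned type, pgoff_t offset,
				  struct page *page, int err, void *data);

int frontswap_put_page_async(unsigned type, pgoff_t offset,
			     struct page *page,
			     frontswap_done_fn done, void *data);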

--
error compiling committee.c: too many arguments to function

From: Dan Magenheimer on
> > While I admit that I started this whole discussion by implying
> > that frontswap (and cleancache) might be useful for SSDs, I think
> > we are going far astray here. Frontswap is synchronous for a
> > reason: It uses real RAM, but RAM that is not directly addressable
> > by a (guest) kernel. SSDs (at least today) are still I/O devices;
> > even though they may be very fast, they still live on a PCI (or
> > slower) bus and use DMA. Frontswap is not intended for use with
> > I/O devices.
> >
> > Today's memory technologies are either RAM that can be addressed
> > by the kernel, or I/O devices that sit on an I/O bus. The
> > exotic memories that I am referring to may be a hybrid:
> > memory that is fast enough to live on a QPI/HyperTransport link,
> > but slow enough that you wouldn't want to randomly mix and
> > hand out to userland apps some pages from "exotic RAM" and some
> > pages from "normal RAM". Such memory makes no sense today
> > because OSes wouldn't know what to do with it. But it MAY
> > make sense with frontswap (and cleancache).
> >
> > Nevertheless, frontswap works great today with a bare-metal
> > hypervisor. I think it stands on its own merits, regardless
> > of one's vision of future SSD/memory technologies.
>
> Even when frontswapping to RAM on a bare-metal hypervisor it makes sense
> to use an async API, in case you have a DMA engine on board.

When pages are 2MB, this may be true. When pages are 4KB and
copied individually, it may take longer to program a DMA engine
than to just copy 4KB.
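
Rough numbers, purely illustrative rather than measured:

  memcpy of 4KB at ~8 GB/s:                   ~0.5 us
  DMA descriptor setup + completion handling: ~1-3 us
    (MMIO doorbell writes, cache management, interrupt or polling)

So per-page DMA only wins if the setup cost is amortized over many pages.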

But in any case, frontswap works fine on all existing machines
today. If/when most commodity CPUs have an asynchronous RAM DMA
engine, an asynchronous API may be appropriate. Or the existing
swap API might be appropriate. Or the synchronous frontswap API
may work fine too. Speculating further about non-existent
hardware that might exist in the (possibly far) future is irrelevant
to the proposed patch, which works today on all existing x86 hardware
and on shipping software.

From: Nitin Gupta on
On 04/25/2010 05:46 PM, Avi Kivity wrote:
> On 04/25/2010 06:11 AM, Nitin Gupta wrote:
>> On 04/24/2010 11:57 PM, Avi Kivity wrote:
>>
>>> On 04/24/2010 04:49 AM, Nitin Gupta wrote:
>>>
>>>>
>>>>> I see. So why not implement this as an ordinary swap device, with a
>>>>> higher priority than the disk device? This way we reuse an API and keep
>>>>> things asynchronous, instead of introducing a special-purpose API.
>>>>>
>>>> ramzswap is exactly this: an ordinary swap device which stores every
>>>> page in (compressed) memory and is enabled as the highest-priority swap
>>>> device. Currently, it stores these compressed chunks in guest memory
>>>> itself, but it is not very difficult to send these chunks out to the
>>>> host/hypervisor using virtio.
>>>>
>>>> However, it suffers from unnecessary block I/O layer overhead and
>>>> requires weird hooks in swap code, say to get a notification when a
>>>> swap slot is freed.
>>>>
>>>>
>>> Isn't that TRIM?
>>>
>> No: trim or discard is not useful. The problem is that we require a
>> callback _as soon as_ a page (swap slot) is freed. Otherwise, stale data
>> quickly accumulates in memory, defeating the whole purpose of in-memory
>> compressed swap devices (like ramzswap).
>>
>
> Doesn't flash have similar requirements? The earlier you discard, the
> likelier you are to reuse an erase block (or reduce the amount of copying).
>

No. We do not want to issue a discard for every page as soon as it is freed.
I'm not a flash expert, but I guess an erase is just too expensive to be
issued that frequently. OTOH, ramzswap needs a callback for every page, as
soon as it is freed.


>> Increasing the frequency of discards is also not an option:
>>  - Creating discard bio requests itself needs memory, and these swap
>>    devices come into the picture only under low-memory conditions.
>>
>
> That's fine, swap works under low memory conditions by using reserves.
>

OK, but all this bio allocation and block-layer overhead still seems
unnecessary and is easily avoidable. I think the frontswap code needs
cleanup, but at least it avoids all this bio overhead.
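
Schematically, the two paths compare like this (simplified, not the exact
call chains):

  swap device: swap_writepage() -> allocate bio -> submit_bio()
                 -> block layer/elevator -> driver -> completion interrupt
  frontswap:   swap_writepage() -> frontswap_put_page()
                 -> backend stores the page and returns synchronously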

>>  - We need to regularly scan swap_map to issue these discards. Increasing
>>    discard frequency also means more frequent scanning (which will still
>>    not be fast enough for ramzswap's needs).
>>
>
> How does frontswap do this? Does it maintain its own data structures?
>

frontswap simply calls frontswap_flush_page() in swap_entry_free(), i.e. as
soon as a swap slot is freed. No bio allocation, etc.
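
Schematically, the hook sits directly in the slot-free path (a simplified
sketch, not the exact patch):

static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
{
	...
	/* The slot is now free: tell frontswap to drop its copy right
	 * away, so stale pages never accumulate in the backend. */
	frontswap_flush_page(p->type, swp_offset(entry));
	...
}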

>>> Maybe we should optimize these overheads instead. Swap used to always
>>> be to slow devices, but swap-to-flash has the potential to make swap act
>>> like an extension of RAM.
>>>
>>>
>> Spending a lot of effort optimizing an overhead which can be completely
>> avoided is probably not worth it.
>>
>
> I'm not sure. Swap-to-flash will soon be everywhere. If it's slow,
> people will feel it a lot more than ramzswap slowness.
>

Optimizing swap-to-flash is surely desirable, but this problem is separate
from ramzswap or frontswap optimization. For the latter, I think dealing
with bios and going through the block layer is plain overhead.

>> Also, I think the choice of a synchronous-style API for frontswap and
>> cleancache is justified, as they want to send pages to host *RAM*. If you
>> want to use other devices like SSDs, then these should just be added as
>> another swap device, as we do currently -- they should not be used as
>> frontswap storage directly.
>>
>
Even for copying to RAM an async API is wanted, so you can DMA it
instead of copying.
>

Maybe incremental development is better? Stabilize and refine the existing
code, and gradually move to an async API if required in the future?

Thanks,
Nitin

From: Avi Kivity on
On 04/25/2010 06:29 PM, Dan Magenheimer wrote:
>>> While I admit that I started this whole discussion by implying
>>> that frontswap (and cleancache) might be useful for SSDs, I think
>>> we are going far astray here. Frontswap is synchronous for a
>>> reason: It uses real RAM, but RAM that is not directly addressable
>>> by a (guest) kernel. SSDs (at least today) are still I/O devices;
>>> even though they may be very fast, they still live on a PCI (or
>>> slower) bus and use DMA. Frontswap is not intended for use with
>>> I/O devices.
>>>
>>> Today's memory technologies are either RAM that can be addressed
>>> by the kernel, or I/O devices that sit on an I/O bus. The
>>> exotic memories that I am referring to may be a hybrid:
>>> memory that is fast enough to live on a QPI/HyperTransport link,
>>> but slow enough that you wouldn't want to randomly mix and
>>> hand out to userland apps some pages from "exotic RAM" and some
>>> pages from "normal RAM". Such memory makes no sense today
>>> because OSes wouldn't know what to do with it. But it MAY
>>> make sense with frontswap (and cleancache).
>>>
>>> Nevertheless, frontswap works great today with a bare-metal
>>> hypervisor. I think it stands on its own merits, regardless
>>> of one's vision of future SSD/memory technologies.
>>>
>> Even when frontswapping to RAM on a bare-metal hypervisor it makes sense
>> to use an async API, in case you have a DMA engine on board.
>>
> When pages are 2MB, this may be true. When pages are 4KB and
> copied individually, it may take longer to program a DMA engine
> than to just copy 4KB.
>

Of course, you have to use a batching API, like virtio or Xen's rings,
to avoid the overhead.
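
For example, a descriptor ring along these lines (a hypothetical sketch,
not something in the proposed patch):

/* The guest fills a ring of page descriptors and kicks the hypervisor
 * once per batch, instead of paying per-page setup costs. */
struct frontswap_req {
	u64 gpfn;	/* guest page frame to copy from/to */
	u64 offset;	/* swap slot */
	u32 type;	/* swap area */
	u32 flags;	/* PUT, GET or FLUSH */
};

struct frontswap_ring {
	u32 prod, cons;	/* producer/consumer indices */
	struct frontswap_req req[256];
};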

> But in any case, frontswap works fine on all existing machines
> today. If/when most commodity CPUs have an asynchronous RAM DMA
> engine, an asynchronous API may be appropriate. Or the existing
> swap API might be appropriate. Or the synchronous frontswap API
> may work fine too. Speculating further about non-existent
> hardware that might exist in the (possibly far) future is irrelevant
> to the proposed patch, which works today on all existing x86 hardware
> and on shipping software.
>

DMA engines are present on commodity hardware now:

http://en.wikipedia.org/wiki/I/O_Acceleration_Technology

I don't know if consumer machines have them, but servers certainly do:
modprobe ioatdma.
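
Driving a memcpy through the dmaengine API looks roughly like this (a
sketch: dma_map_*() setup and error handling omitted, and my_done is a
made-up completion callback):

dma_cap_mask_t mask;
struct dma_chan *chan;
struct dma_async_tx_descriptor *tx;

dma_cap_zero(mask);
dma_cap_set(DMA_MEMCPY, mask);
chan = dma_request_channel(mask, NULL, NULL);	/* e.g. an ioatdma channel */

tx = chan->device->device_prep_dma_memcpy(chan, dst_dma, src_dma,
					  PAGE_SIZE, DMA_PREP_INTERRUPT);
tx->callback = my_done;				/* runs on completion */
tx->tx_submit(tx);
dma_async_issue_pending(chan);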


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
