From: Dmitry Torokhov
On Mon, Apr 05, 2010 at 02:24:19PM -0700, Andrew Morton wrote:
> On Sun, 4 Apr 2010 14:52:02 -0700
> Dmitry Torokhov <dtor(a)vmware.com> wrote:
>
> > This is standalone version of VMware Balloon driver. Unlike previous
> > version, that tried to integrate VMware ballooning transport into virtio
> > subsystem, and use stock virtio_ballon driver, this one implements both
> > controlling thread/algorithm and hypervisor transport.
> >
> > We are submitting standalone driver because KVM maintainer (Avi Kivity)
> > expressed opinion (rightly) that our transport does not fit well into
> > virtqueue paradigm and thus it does not make much sense to integrate
> > with virtio.
> >
>
> I think I've forgotten what balloon drivers do. Are they as nasty a
> hack as I remember believing them to be?
>
> A summary of what this code sets out to do, and how it does it would be
> useful.
>

Jeremy provided a very good writeup; I will also expand the changelog in
the next version.

> Also please explain the applicability of this driver. Will xen use it?
> kvm? Out-of-tree code?

The driver is expected to be used on the VMware platform - mainly ESX.
Originally we tried to converge with KVM and use virtio and the stock
virtio_balloon driver, but Avi pointed out that our code emulating a
virtqueue was larger than the balloon code itself, and thus using virtio
did not make much sense.
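
Roughly, the controlling thread the driver implements looks like this
(a simplified sketch, not the exact code in the patch;
vmballoon_send_get_target(), vmballoon_inflate(), vmballoon_deflate()
and the b->size counter are illustrative stand-ins):

        static void vmballoon_work(struct work_struct *work)
        {
                struct vmballoon *b = &balloon;
                unsigned int target;

                /* Ask the hypervisor how many pages it wants us to hold. */
                if (vmballoon_send_get_target(b, &target)) {
                        if (b->size < target)
                                vmballoon_inflate(b);   /* allocate pages, hand PFNs to host */
                        else if (b->size > target)
                                vmballoon_deflate(b);   /* give pages back to the guest */
                }

                /* Poll the target again later. */
                queue_delayed_work(vmballoon_wq, &b->dwork, round_jiffies_relative(HZ));
        }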

>
> The code implements a user-visible API (in /proc, at least). Please
> fully describe the proposed interface(s) in the changelog so we can
> review and understand that proposal.

OK.

>
> >
> > ...
> >
> > +static bool vmballoon_send_start(struct vmballoon *b)
> > +{
> > + unsigned long status, dummy;
> > +
> > + STATS_INC(b->stats.start);
> > +
> > + status = VMWARE_BALLOON_CMD(START, VMW_BALLOON_PROTOCOL_VERSION, dummy);
> > + if (status == VMW_BALLOON_SUCCESS)
> > + return true;
> > +
> > + pr_debug("%s - failed, hv returns %ld\n", __func__, status);
>
> The code refers to something called "hv". I suspect that's stale?
>
> > + STATS_INC(b->stats.start_fail);
> > + return false;
> > +}
> > +
> > +static bool vmballoon_check_status(struct vmballoon *b, unsigned long status)
> > +{
> > + switch (status) {
> > + case VMW_BALLOON_SUCCESS:
> > + return true;
> > +
> > + case VMW_BALLOON_ERROR_RESET:
> > + b->reset_required = true;
> > + /* fall through */
> > +
> > + default:
> > + return false;
> > + }
> > +}
> > +
> > +static bool vmballoon_send_guest_id(struct vmballoon *b)
> > +{
> > + unsigned long status, dummy;
> > +
> > + status = VMWARE_BALLOON_CMD(GUEST_ID, VMW_BALLOON_GUEST_ID, dummy);
> > +
> > + STATS_INC(b->stats.guest_type);
> > +
> > + if (vmballoon_check_status(b, status))
> > + return true;
> > +
> > + pr_debug("%s - failed, hv returns %ld\n", __func__, status);
> > + STATS_INC(b->stats.guest_type_fail);
> > + return false;
> > +}
>
> The lack of comments makes it all a bit hard to take in.

OK, I will address the lack of comments.

>
> >
> > ...
> >
> > +static int __init vmballoon_init(void)
> > +{
> > + int error;
> > +
> > + /*
> > + * Check if we are running on VMware's hypervisor and bail out
> > + * if we are not.
> > + */
> > + if (!vmware_platform())
> > + return -ENODEV;
> > +
> > + vmballoon_wq = create_freezeable_workqueue("vmmemctl");
> > + if (!vmballoon_wq) {
> > + pr_err("failed to create workqueue\n");
> > + return -ENOMEM;
> > + }
> > +
> > + /* initialize global state */
> > + memset(&balloon, 0, sizeof(balloon));
>
> The memset seems to be unneeded.

OK.

>
> > + INIT_LIST_HEAD(&balloon.pages);
> > + INIT_LIST_HEAD(&balloon.refused_pages);
> > +
> > + /* initialize rates */
> > + balloon.rate_alloc = VMW_BALLOON_RATE_ALLOC_MAX;
> > + balloon.rate_free = VMW_BALLOON_RATE_FREE_MAX;
> > +
> > + INIT_DELAYED_WORK(&balloon.dwork, vmballoon_work);
> > +
> > + /*
> > + * Start balloon.
> > + */
> > + if (!vmballoon_send_start(&balloon)) {
> > + pr_err("failed to send start command to the host\n");
> > + error = -EIO;
> > + goto fail;
> > + }
> > +
> > + if (!vmballoon_send_guest_id(&balloon)) {
> > + pr_err("failed to send guest ID to the host\n");
> > + error = -EIO;
> > + goto fail;
> > + }
> > +
> > + error = vmballoon_procfs_init(&balloon);
> > + if (error)
> > + goto fail;
> > +
> > + queue_delayed_work(vmballoon_wq, &balloon.dwork, 0);
> > +
> > + return 0;
> > +
> > +fail:
> > + destroy_workqueue(vmballoon_wq);
> > + return error;
> > +}
> >
> > ...
> >
>
> Oh well, ho hum. Help is needed on working out what to do about this,
> please.
>
> Congrats on the new job, btw ;)

Thanks ;). BTW, please still send input stuff to my gmail address.

--
Dmitry

From: Dan Magenheimer
> > On 04/06/2010 01:17 AM, Andrew Morton wrote:
> > >> The basic idea of the driver is to allow a guest system to give up
> > >> memory it isn't using so it can be reused by other virtual
> machines (or
> > >> the host itself).
> > >>
> > > So... does this differ in any fundamental way from what
> hibernation
> > > does, via shrink_all_memory()?
> > >
> >
> > Just the _all_ bit, and the fact that we need to report the freed
> page
> > numbers to the hypervisor.
> >
>
> So... why not tweak that, rather than implementing some parallel
> thing?

I think Avi was being facetious ("_all_"). Hibernation assumes
everything in the machine is going to stop for awhile. Ballooning
assumes that the machine has lower memory need for awhile, but
is otherwise fully operational. Think of it as hot-plug memory
at a page granularity.

Historically, all OSes had a (relatively) fixed amount of memory
and, since it was fixed in size, there was no sense wasting any of it.
In a virtualized world, OSes should be trained to be much more
flexible, as one virtual machine's "waste" could/should be another
virtual machine's "want". Ballooning is currently the mechanism
for this; it places memory pressure on the OS to encourage it
to get by with less memory. Unfortunately, it is difficult even
within an OS to determine what memory is wasted and what memory
might be used imminently... because LRU is only an approximation of
the future. Hypervisors have an even more difficult problem, not
only because they must infer this information from external events,
but also because they can double the problem if they infer the
opposite of what the OS actually does.

As Jeremy mentioned, Transcendent Memory (and its Linux implementations
"cleancache" and "frontswap") allows a guest kernel to give up memory
for the broader good while still retaining a probability that it
can get the same data back quickly. This results in more memory
fluidity. Transcendent Memory ("tmem") still uses ballooning as
the mechanism to create memory pressure... it just provides an
insurance policy for that memory pressure.
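
To make the put/get idea concrete, it works roughly like this (the
function and pool names below are just illustrative, not the actual
cleancache/frontswap interfaces):

        /* When a clean page cache page is evicted, offer its contents to
         * the hypervisor.  The put may be refused; either way the guest
         * drops its copy and shrinks. */
        static void evict_clean_page(struct page *page, pgoff_t index)
        {
                tmem_put(POOL_PAGECACHE, index, page);
        }

        /* On a later read of the same offset, ask the hypervisor first and
         * fall back to the backing store if the data is gone. */
        static int read_page(struct page *page, pgoff_t index)
        {
                if (tmem_get(POOL_PAGECACHE, index, page) == 0)
                        return 0;               /* no disk I/O needed */
                return read_from_backing_store(page, index);
        }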

Avi will point out that it is not clear that KVM can make use of
or benefit from tmem, but we need not repeat that discussion here.

Dan

From: Dmitry Torokhov
On Mon, Apr 05, 2010 at 03:40:23PM -0700, Andrew Morton wrote:
> On Tue, 06 Apr 2010 01:26:11 +0300
> Avi Kivity <avi(a)redhat.com> wrote:
>
> > On 04/06/2010 01:17 AM, Andrew Morton wrote:
> > >> The basic idea of the driver is to allow a guest system to give up
> > >> memory it isn't using so it can be reused by other virtual machines (or
> > >> the host itself).
> > >>
> > > So... does this differ in any fundamental way from what hibernation
> > > does, via shrink_all_memory()?
> > >
> >
> > Just the _all_ bit, and the fact that we need to report the freed page
> > numbers to the hypervisor.
> >
>
> So... why not tweak that, rather than implementing some parallel thing?

I guess the main difference is that freeing memory is not the primary
goal; we want to make sure that the guest does not use some of its
memory without notifying the hypervisor first.
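
Roughly, each ballooned page goes through the hypervisor before we count
it (a simplified sketch; balloon_send_lock_page() and the size counter
are illustrative, not the exact names in the patch):

        static int balloon_inflate_one(struct vmballoon *b)
        {
                struct page *page = alloc_page(GFP_HIGHUSER |
                                               __GFP_NORETRY | __GFP_NOWARN);

                if (!page)
                        return -ENOMEM;

                /* The hypervisor must acknowledge the PFN before the page
                 * is treated as ballooned. */
                if (!balloon_send_lock_page(b, page_to_pfn(page))) {
                        __free_page(page);
                        return -EIO;
                }

                list_add(&page->lru, &b->pages);
                b->size++;
                return 0;
        }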

--
Dmitry

From: Andrew Morton
On Mon, 5 Apr 2010 16:03:48 -0700 (PDT)
Dan Magenheimer <dan.magenheimer(a)oracle.com> wrote:

> > > On 04/06/2010 01:17 AM, Andrew Morton wrote:
> > > >> The basic idea of the driver is to allow a guest system to give up
> > > >> memory it isn't using so it can be reused by other virtual
> > machines (or
> > > >> the host itself).
> > > >>
> > > > So... does this differ in any fundamental way from what
> > hibernation
> > > > does, via shrink_all_memory()?
> > > >
> > >
> > > Just the _all_ bit, and the fact that we need to report the freed
> > page
> > > numbers to the hypervisor.
> > >
> >
> > So... why not tweak that, rather than implementing some parallel
> > thing?
>
> I think Avi was being facetious ("_all_"). Hibernation assumes
> everything in the machine is going to stop for awhile. Ballooning
> assumes that the machine has lower memory need for awhile, but
> is otherwise fully operational.

shrink_all_memory() doesn't require that processes be stopped.

If the existing code doesn't exactly match virtualisation's
requirements, it can be changed.

> Think of it as hot-plug memory
> at a page granularity.

hotplug is different because it targets particular physical pages. For
this requirement any old page will do. Preferably one which won't be
needed soon, yes?

From: Jeremy Fitzhardinge
On 04/05/2010 03:17 PM, Andrew Morton wrote:
> On Mon, 05 Apr 2010 15:03:08 -0700
> Jeremy Fitzhardinge<jeremy(a)goop.org> wrote:
>
>
>> On 04/05/2010 02:24 PM, Andrew Morton wrote:
>>
>>> I think I've forgotten what balloon drivers do. Are they as nasty a
>>> hack as I remember believing them to be?
>>>
>>>
>> (I haven't looked at Dmitry's patch yet, so this is from the Xen
>> perspective.)
>>
>> In the simplest form, they just look like a driver which allocates a
>> pile of pages, and the underlying memory gets returned to the
>> hypervisor. When you want the memory back, it reattaches memory to the
>> pageframes and releases the memory back to the kernel. This allows a
>> virtual machine to shrink with respect to its original size.
>>
>> Going the other way - expanding beyond the memory allocation - is a bit
>> trickier because you need to get some new page structures from
>> somewhere. We don't do this in Xen yet, but I've done some experiments
>> with hotplug memory to implement this. Or a simpler approach is to fake
>> up some reserved E820 ranges to grow into.
>>
>>
> Lots of stuff for Dmitry to add to his changelog ;)
>
>
>>> A summary of what this code sets out to do, and how it does it would be
>>> useful.
>>>
>>> Also please explain the applicability of this driver. Will xen use it?
>>> kvm? Out-of-tree code?
>>>
>>>
>> The basic idea of the driver is to allow a guest system to give up
>> memory it isn't using so it can be reused by other virtual machines (or
>> the host itself).
>>
> So... does this differ in any fundamental way from what hibernation
> does, via shrink_all_memory()?
>

Note that we're using shrink and grow in opposite senses.
shrink_all_memory() is trying to free as much kernel memory as possible,
which to the virtual machine's host looks like the guest is growing
(since it has claimed more memory for its own use). A balloon "shrink"
appears to Linux as allocated memory (i.e., locking down memory within
Linux to make it available to the rest of the system).

The fact that shrink_all_memory() has much deeper insight into the
current state of the vm subsystem is interesting; it has much more to
work with than a simple alloc/free page. Does it actively try to
reclaim cold, unlikely-to-be-used stuff first? It appears it does, to
my mm/-naive eye.

I guess a way to use it in the short term is to have a loop of the form:

while (guest_size > target) {
        shrink_all_memory(guest_size - target);                 /* force pages to be free */
        while ((p = alloc_page(GFP_KERNEL | __GFP_NORETRY)))    /* vacuum up pages */
                release_page_to_hypervisor(p);
        /* twiddle thumbs */
}

....assuming the allocation would tend to pick up the pages that
shrink_all_memory just freed.

Or ideally, have a form of shrink_all_memory() which causes pages to
become unused, but rather than freeing them returns them to the caller.
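
Something like this, purely as a sketch (the function and its semantics
are hypothetical, not an existing API):

        /* Reclaim up to nr_pages, but hand the reclaimed pages to the
         * caller on a list instead of returning them to the allocator. */
        unsigned long shrink_memory_and_claim(unsigned long nr_pages,
                                              struct list_head *claimed);

        /* A balloon driver could then do: */
        LIST_HEAD(claimed);
        struct page *p, *tmp;

        shrink_memory_and_claim(guest_size - target, &claimed);
        list_for_each_entry_safe(p, tmp, &claimed, lru) {
                list_del(&p->lru);
                release_page_to_hypervisor(p);  /* same hypothetical helper as above */
        }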

And is there some way to get the vm subsystem to provide backpressure:
"I'm getting desperately short of memory!"? Experience has shown that
administrators often accidentally over-shrink their domains and
effectively kill them. Sometimes due to bad UI - entering the wrong
units - but also because they just don't know what the actual memory
demands are. Or they change over time.

Thanks,
J