From: Daniel P. Berrange on
On Fri, Jul 16, 2010 at 11:12:34AM -0400, Vivek Goyal wrote:
> On Fri, Jul 16, 2010 at 03:53:09PM +0100, Daniel P. Berrange wrote:
> > On Fri, Jul 16, 2010 at 10:35:36AM -0400, Vivek Goyal wrote:
> > > On Fri, Jul 16, 2010 at 03:15:49PM +0100, Daniel P. Berrange wrote:
> > > Secondly, just because some controller allows creation of a hierarchy does
> > > not mean that the hierarchy is being enforced. For example, with the memory
> > > controller, IIUC one needs to explicitly set "use_hierarchy" to enforce the
> > > hierarchy, otherwise it is effectively flat. So if libvirt is creating
> > > groups and putting machines in child groups thinking that it is not
> > > interfering with the admin's policy, that is not entirely correct.
> >
> > That is true, but 'use_hierarchy' at least provides admins with
> > the mechanism required to implement the necessary policy.
> >
> > > So how do we make progress here? I really want to see the blkio controller
> > > integrated with libvirt.
> > >
> > > About the issue of hierarchy, I can probably travel down the path of allowing
> > > creation of the hierarchy, but CFQ will treat it as flat. I don't like it,
> > > because it will force me to introduce variables like "use_hierarchy" once
> > > real hierarchical support comes in, but I guess I can live with that.
> > > (Anyway, the memory controller is already doing it.)
> > >
> > > There is another issue, though, and that is that by default every virtual
> > > machine goes into a group of its own. As of today, this can have
> > > severe performance penalties (depending on the workload) if a group is not
> > > driving enough IO (especially with group_isolation=1).
> > >
> > > I was thinking of a model where an admin moves the badly behaved virtual
> > > machines into a separate group and limits their IO.
> >
> > In the simple / normal case I imagine all guest VMs will be running
> > unrestricted I/O initially. Thus instead of creating the cgroup at time
> > of VM startup, we could create the cgroup only when the admin actually
> > sets an I/O limit.
>
> That makes sense. Run all the virtual machines by default in the root group
> and move a virtual machine out to a separate group of either lower weight
> (if the virtual machine is a bad one driving a lot of IO) or higher weight
> (if we want to give more IO bandwidth to this machine).
>
> > IIUC, this should maintain the one cgroup per guest
> > model, while avoiding the performance penalty in normal use. The caveat
> > of course is that this would require the blkio controller to have a dedicated
> > mount point, not shared with other controllers.
>
> Yes. Because for other controllers we seem to be putting virtual machines
> in separate cgroups by default at startup time. So it seems we will
> require a separate mount point here for blkio controller.
>
> > I think we might also
> > want this kind of model for net I/O, since we probably don't want to
> > create TC classes + net_cls groups for every VM the moment it starts
> > unless the admin has actually set a net I/O limit.
>
> Looks like it. Good, then the network controller and blkio controller can
> share this new mount point.

After thinking about this some more, there are a couple of problems with
this plan. For QEMU, 'vhostnet' (the in-kernel virtio network backend)
requires that QEMU be in the cgroup at startup time, otherwise the
vhost kernel thread won't end up in the right cgroup. For libvirt's LXC
container driver, moving the container in and out of cgroups at runtime
is pretty difficult because there is an arbitrary number of processes
running in the container. It would require moving all the container
processes between two cgroups in a race-free manner. So on second thoughts
I'm more inclined to stick with our current approach of putting all guests
into the appropriate cgroups at guest/container startup, even for blkio
and net_cls.
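
For illustration only, here is roughly what migrating a whole container
between cgroups would look like from user space; the cgroup paths and the
helper are made up, this is not libvirt code. It shows why the operation is
racy: the "tasks" file only takes one task per write, so anything forked
while the loop runs is missed and the copy has to be retried.

#include <stdio.h>
#include <stdlib.h>

static int move_all_tasks(const char *src_tasks, const char *dst_tasks)
{
    FILE *src = fopen(src_tasks, "r");
    FILE *dst = fopen(dst_tasks, "w");
    char pid[32];
    int moved = 0;

    if (!src || !dst) {
        if (src)
            fclose(src);
        if (dst)
            fclose(dst);
        return -1;
    }

    /* Each line of the source "tasks" file is one task ID. */
    while (fgets(pid, sizeof(pid), src)) {
        fprintf(dst, "%s", pid);  /* attach this task to the new cgroup */
        fflush(dst);              /* one pid per write() */
        moved++;
    }
    /*
     * Race window: anything forked inside the container after fopen(src)
     * is still in the old cgroup, so a caller has to loop until the source
     * "tasks" file reads back empty -- and even that can keep racing.
     */
    fclose(src);
    fclose(dst);
    return moved;
}

int main(void)
{
    /* Hypothetical cgroup paths, purely as an example. */
    int n = move_all_tasks("/cgroup/blkio/container-old/tasks",
                           "/cgroup/blkio/container-new/tasks");
    printf("moved %d tasks (may need to repeat)\n", n);
    return n < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}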

Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
From: Vivek Goyal on
On Tue, Jul 27, 2010 at 11:40:37AM +0100, Daniel P. Berrange wrote:
> On Fri, Jul 16, 2010 at 11:12:34AM -0400, Vivek Goyal wrote:
> > On Fri, Jul 16, 2010 at 03:53:09PM +0100, Daniel P. Berrange wrote:
> > > On Fri, Jul 16, 2010 at 10:35:36AM -0400, Vivek Goyal wrote:
> > > > On Fri, Jul 16, 2010 at 03:15:49PM +0100, Daniel P. Berrange wrote:
> > > > Secondly, just because some controller allows creation of hierarchy does
> > > > not mean that hierarchy is being enforced. For example, memory controller.
> > > > IIUC, one needs to explicitly set "use_hierarchy" to enforce hierarchy
> > > > otherwise effectively it is flat. So if libvirt is creating groups and
> > > > putting machines in child groups thinking that it is not interfering
> > > > with the admin's policy, that is not entirely correct.
> > >
> > > That is true, but 'use_hierarchy' at least provides admins with
> > > the mechanism required to implement the necessary policy.
> > >
> > > > So how do we make progress here. I really want to see blkio controller
> > > > integrated with libvirt.
> > > >
> > > > About the issue of hierarchy, I can probably travel down the path of allowing
> > > > creation of hierarchy but CFQ will treat it as flat. Though I don't like it
> > > > because it will force me to introduce variables like "use_hierarchy" once
> > > > real hierarchical support comes in but I guess I can live with that.
> > > > (Anyway memory controller is already doing it.).
> > > >
> > > > There is another issue though and that is by default every virtual
> > > > machine going into a group of its own. As of today, it can have
> > > > severe performance penalties (depending on the workload) if a group is not
> > > > driving enough IO (especially with group_isolation=1).
> > > >
> > > > I was thinking of a model where an admin moves the badly behaved virtual
> > > > machines into a separate group and limits their IO.
> > >
> > > In the simple / normal case I imagine all guest VMs will be running
> > > unrestricted I/O initially. Thus instead of creating the cgroup at time
> > > of VM startup, we could create the cgroup only when the admin actually
> > > sets an I/O limit.
> >
> > That makes sense. Run all the virtual machines by default in the root group
> > and move a virtual machine out to a separate group of either lower weight
> > (if the virtual machine is a bad one driving a lot of IO) or higher weight
> > (if we want to give more IO bandwidth to this machine).
> >
> > > IIUC, this should maintain the one cgroup per guest
> > > model, while avoiding the performance penalty in normal use. The caveat
> > > of course is that this would require the blkio controller to have a dedicated
> > > mount point, not shared with other controllers.
> >
> > Yes. Because for other controllers we seem to be putting virtual machines
> > in separate cgroups by default at startup time. So it seems we will
> > require a separate mount point here for blkio controller.
> >
> > > I think we might also
> > > want this kind of model for net I/O, since we probably don't want to
> > > create TC classes + net_cls groups for every VM the moment it starts
> > > unless the admin has actually set a net I/O limit.
> >
> > Looks like it. Good, then the network controller and blkio controller can
> > share this new mount point.
>
> After thinking about this some more there are a couple of problems with
> this plan. For QEMU the 'vhostnet' (the in kernel virtio network backend)
> requires that QEMU be in the cgroup at time of startup, otherwise the
> vhost kernel thread won't end up in the right cgroup.

Not sure why this limitation is there in vhostnet.

> For libvirt's LXC
> container driver, moving the container in & out of the cgroups at runtime
> is pretty difficult because there are an arbitrary number of processes
> running in the container.

So once a container is created, we don't have the capability to move it
between cgroups? One would need to shut down the container and relaunch it in
the desired cgroup.

> It would require moving all the container
> processes between two cgroups in a race free manner. So on second thoughts
> I'm more inclined to stick with our current approach of putting all guests
> into the appropriate cgroups at guest/container startup, even for blkio
> and netcls.

In the current code form, it is a bad idea from the "blkio" perspective. Very
often a virtual machine might not be driving enough IO, and we will see
decreased overall throughput. That's why I was preferring to move a virtual
machine out of the root cgroup only if required.
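
To make that concrete, here is a minimal sketch (cgroup v1 style; the mount
point, group name and weight are just examples, and this is not libvirt code)
of carving a virtual machine out into its own blkio group only when an admin
actually asks for a weight:

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    fprintf(f, "%s\n", val);
    return fclose(f);
}

/* Create blkio group <name>, give it <weight> (100-1000), move <pid> in.
 * Everything not explicitly isolated stays in the root group. */
static int isolate_vm(const char *name, const char *weight, const char *pid)
{
    char path[256];

    snprintf(path, sizeof(path), "/cgroup/blkio/%s", name);
    if (mkdir(path, 0755) && errno != EEXIST)
        return -1;

    snprintf(path, sizeof(path), "/cgroup/blkio/%s/blkio.weight", name);
    if (write_str(path, weight))
        return -1;

    snprintf(path, sizeof(path), "/cgroup/blkio/%s/tasks", name);
    return write_str(path, pid);
}

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <group> <weight> <pid>\n", argv[0]);
        return 1;
    }
    return isolate_vm(argv[1], argv[2], argv[3]) ? 1 : 0;
}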

I was also thinking of implementing a new tunable in CFQ, something like
"min_queue_depth". It would mean: don't idle on a group if we are not driving
at least min_queue_depth requests. The higher the "min_queue_depth", the lower
the isolation between groups. But this would take effect only if slice_idle=0,
and that would be done only on higher-end storage.
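
To spell out how I am thinking the rule would behave (this is only a
user-space model of the idea, not CFQ code, and the tunable does not exist
yet):

#include <stdbool.h>
#include <stdio.h>

struct group_state {
    int in_flight;  /* requests this group currently has on the device */
};

/* Model of the proposed check: with slice_idle=0 (fast storage), only keep
 * idling for isolation while the group drives at least min_queue_depth
 * requests; a higher threshold means less idling and weaker isolation. */
static bool min_depth_allows_idle(const struct group_state *grp,
                                  int slice_idle, int min_queue_depth)
{
    if (slice_idle != 0)
        return true;  /* tunable not in effect; existing idling logic decides */
    return grp->in_flight >= min_queue_depth;
}

int main(void)
{
    struct group_state lazy = { .in_flight = 1 };
    struct group_state busy = { .in_flight = 8 };

    printf("lazy group, depth 1: idle=%d\n", min_depth_allows_idle(&lazy, 0, 4));
    printf("busy group, depth 8: idle=%d\n", min_depth_allows_idle(&busy, 0, 4));
    return 0;
}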

IOW, I am experimenting with the above bits, but I certainly would not
recommend putting virtual machines/containers in their own blkio cgroup
by default.

How about not co-mounting blkio and net_cls? For networking, you can continue
to put each virtual machine in a cgroup of its own, and that should take care
of the vhostnet issue. For blkio, we would continue to put virtual machines in
the common root group.
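
For reference, not co-mounting just means giving each controller its own
cgroup (v1) hierarchy, roughly along these lines (mount points are arbitrary
examples; this needs root):

#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Mount a single cgroup controller on its own hierarchy. */
static void mount_controller(const char *ctrl, const char *where)
{
    mkdir(where, 0755);  /* ignore "already exists" for brevity */
    if (mount("cgroup", where, "cgroup", 0, ctrl))
        perror(where);
}

int main(void)
{
    /* net_cls gets per-VM groups; blkio keeps all guests in its root group. */
    mount_controller("net_cls", "/cgroup/net_cls");
    mount_controller("blkio", "/cgroup/blkio");
    return 0;
}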

For the container driver issue, we need to figure out how to move containers
between cgroups. Not sure how hard that is, though.

Thanks
Vivek
From: Vivek Goyal on
On Thu, Jul 08, 2010 at 10:57:13PM -0400, Munehiro Ikeda wrote:
> These RFC patches are a trial to add async (cached) write support to the
> blkio controller.
>
> The only test which has been done is to compile, boot, and check that write
> bandwidth seems prioritized when pages dirtied by two different processes in
> different cgroups are written back to a device simultaneously. I know this
> is the minimum (or less) test, but I posted this as an RFC because I would
> like to hear your opinions about the design direction at this early stage.
>
> Patches are for 2.6.35-rc4.
>
> This patch series consists of two chunks.
>
> (1) iotrack (patch 01/11 -- 06/11)
>
> This is functionality to track who dirtied a page, i.e., exactly which cgroup
> the process which dirtied the page belongs to. The blkio controller will read
> this info later and prioritize when the page is actually written to a block
> device. This work originated from Ryo Tsuruta and Hirokazu Takahashi and
> includes Andrea Righi's idea. It was posted as part of dm-ioband, which was
> one of the proposals for an IO controller.
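
As a rough illustration of the idea (not the actual patches; all names below
are invented), the bookkeeping amounts to tagging each page with the cgroup
that dirtied it and reading that tag back at writeback time:

#include <stdio.h>

/* Toy model only: remember, per page, which cgroup dirtied it, and recover
 * that cgroup when the page is finally written back, so the writeback IO
 * can be charged to the right group. */
struct page_tag {
    unsigned short blkio_cgroup_id;  /* 0 means root / unknown */
};

/* Model of the "set page dirty" path. */
static void iotrack_tag_page(struct page_tag *pg, unsigned short cgroup_id)
{
    pg->blkio_cgroup_id = cgroup_id;
}

/* Model of the writeback path picking the group to charge. */
static unsigned short iotrack_owner(const struct page_tag *pg)
{
    return pg->blkio_cgroup_id;
}

int main(void)
{
    struct page_tag page = { 0 };

    iotrack_tag_page(&page, 42);  /* a task in cgroup 42 dirties the page */
    printf("charge writeback to cgroup %u\n", iotrack_owner(&page));
    return 0;
}
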
>
>
> (2) blkio controller modification (07/11 -- 11/11)
>
> This is the main part of the blkio controller async write support.
> Currently async queues are device-wide and async write IOs are always treated
> as belonging to the root group. These patches make async queues per cfq_group
> per device in order to control them. Async writes are handled by the flush
> kernel thread. Because queue pointers are stored in cfq_io_context, the
> io_context of the thread has to have multiple cfq_io_contexts per device. So
> these patches make cfq_io_context per io_context per cfq_group, which means
> per io_context per cgroup per device.
>
>

Muuh,

You will require one more piece, and that is support for per-cgroup request
descriptors on the request queue. With writes, it is so easy to consume those
128 request descriptors.
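
To sketch what per-cgroup request descriptors would mean (a user-space model
only; the numbers and names are illustrative, not the actual block layer
code):

#include <stdbool.h>
#include <stdio.h>

#define NR_REQUESTS    128              /* device-wide default pool */
#define GROUP_SHARE(n) (NR_REQUESTS / (n))

struct group_rl {
    int allocated;  /* descriptors this group currently holds */
};

/* Instead of one shared pool, each group may only consume its own share,
 * so one cgroup doing heavy buffered writes cannot starve the others. */
static bool may_allocate(struct group_rl *grp, int nr_groups)
{
    if (grp->allocated >= GROUP_SHARE(nr_groups))
        return false;  /* caller would sleep until this group frees one */
    grp->allocated++;
    return true;
}

int main(void)
{
    struct group_rl writer = { 0 };
    int got = 0;

    while (may_allocate(&writer, 4))  /* with 4 groups, its share is 32 */
        got++;
    printf("write-heavy group got %d of %d descriptors\n", got, NR_REQUESTS);
    return 0;
}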

Thanks
Vivek
From: Munehiro Ikeda on
Vivek Goyal wrote, on 08/02/2010 04:58 PM:
> On Thu, Jul 08, 2010 at 10:57:13PM -0400, Munehiro Ikeda wrote:
>> These RFC patches are a trial to add async (cached) write support to the
>> blkio controller.
>>
>> The only test which has been done is to compile, boot, and check that write
>> bandwidth seems prioritized when pages dirtied by two different processes in
>> different cgroups are written back to a device simultaneously. I know this
>> is the minimum (or less) test, but I posted this as an RFC because I would
>> like to hear your opinions about the design direction at this early stage.
>>
>> Patches are for 2.6.35-rc4.
>>
>> This patch series consists of two chunks.
>>
>> (1) iotrack (patch 01/11 -- 06/11)
>>
>> This is functionality to track who dirtied a page, i.e., exactly which cgroup
>> the process which dirtied the page belongs to. The blkio controller will read
>> this info later and prioritize when the page is actually written to a block
>> device. This work originated from Ryo Tsuruta and Hirokazu Takahashi and
>> includes Andrea Righi's idea. It was posted as part of dm-ioband, which was
>> one of the proposals for an IO controller.
>>
>>
>> (2) blkio controller modification (07/11 -- 11/11)
>>
>> This is the main part of the blkio controller async write support.
>> Currently async queues are device-wide and async write IOs are always treated
>> as belonging to the root group. These patches make async queues per cfq_group
>> per device in order to control them. Async writes are handled by the flush
>> kernel thread. Because queue pointers are stored in cfq_io_context, the
>> io_context of the thread has to have multiple cfq_io_contexts per device. So
>> these patches make cfq_io_context per io_context per cfq_group, which means
>> per io_context per cgroup per device.
>>
>>
>
> Muuh,
>
> You will require one more piece and that is support for per cgroup request
> descriptors on request queue. With writes, it is so easy to consume those
> 128 request descriptors.

Hi Vivek,

Yes. Thank you for the comment.
I have two concerns about doing that.

(1) Technical concern:
If there is a fixed device-wide limit and there are many groups, the number
of request descriptors distributed to each group can be too few (for example,
a static split of the default 128 descriptors across, say, 32 groups would
leave only 4 per group). My only idea for this is to make the device-wide
limit flexible, but I'm not sure if that is the best approach or even
acceptable.

(2) Implementation concern:
Currently the limit is enforced by the generic block layer, which doesn't
know about grouping. The idea in my head to solve this is to add a new
interface to elevator_ops to ask the IO scheduler whether a new request
can be allocated.
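
The shape of such a hook might be something like this (purely a user-space
model with invented names, loosely patterned on the existing may_queue-style
callback; not a real kernel interface):

#include <stdbool.h>
#include <stdio.h>

struct model_queue { int per_group_limit; };
struct model_group { int allocated; };

/* Model of the proposed elevator_ops extension: the generic block layer
 * stays group-unaware and simply asks the IO scheduler, per (queue, group),
 * whether one more request may be allocated. */
struct model_elevator_ops {
    bool (*cgroup_may_queue_fn)(struct model_queue *q, struct model_group *g);
};

static bool cfq_like_cgroup_may_queue(struct model_queue *q,
                                      struct model_group *g)
{
    return g->allocated < q->per_group_limit;
}

static const struct model_elevator_ops eops = {
    .cgroup_may_queue_fn = cfq_like_cgroup_may_queue,
};

int main(void)
{
    struct model_queue q = { .per_group_limit = 32 };
    struct model_group full = { .allocated = 32 };
    struct model_group idle = { .allocated = 0 };

    printf("full group may queue: %d\n", eops.cgroup_may_queue_fn(&q, &full));
    printf("idle group may queue: %d\n", eops.cgroup_may_queue_fn(&q, &idle));
    return 0;
}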

Anyway, a simple RFC patch first, and then testing it, would be preferable,
I think.


Thanks,
Muuhh


--
IKEDA, Munehiro
NEC Corporation of America
m-ikeda(a)ds.jp.nec.com

From: Nauman Rafique on
On Tue, Aug 3, 2010 at 7:31 AM, Munehiro Ikeda <m-ikeda(a)ds.jp.nec.com> wrote:
> Vivek Goyal wrote, on 08/02/2010 04:58 PM:
>>
>> On Thu, Jul 08, 2010 at 10:57:13PM -0400, Munehiro Ikeda wrote:
>>>
>>> These RFC patches are a trial to add async (cached) write support to the
>>> blkio controller.
>>>
>>> The only test which has been done is to compile, boot, and check that write
>>> bandwidth seems prioritized when pages dirtied by two different processes in
>>> different cgroups are written back to a device simultaneously. I know this
>>> is the minimum (or less) test, but I posted this as an RFC because I would
>>> like to hear your opinions about the design direction at this early stage.
>>>
>>> Patches are for 2.6.35-rc4.
>>>
>>> This patch series consists of two chunks.
>>>
>>> (1) iotrack (patch 01/11 -- 06/11)
>>>
>>> This is functionality to track who dirtied a page, i.e., exactly which
>>> cgroup the process which dirtied the page belongs to. The blkio controller
>>> will read this info later and prioritize when the page is actually written
>>> to a block device. This work originated from Ryo Tsuruta and Hirokazu
>>> Takahashi and includes Andrea Righi's idea. It was posted as part of
>>> dm-ioband, which was one of the proposals for an IO controller.
>>>
>>>
>>> (2) blkio controller modification (07/11 -- 11/11)
>>>
>>> This is the main part of the blkio controller async write support.
>>> Currently async queues are device-wide and async write IOs are always
>>> treated as belonging to the root group. These patches make async queues
>>> per cfq_group per device in order to control them. Async writes are
>>> handled by the flush kernel thread. Because queue pointers are stored in
>>> cfq_io_context, the io_context of the thread has to have multiple
>>> cfq_io_contexts per device. So these patches make cfq_io_context per
>>> io_context per cfq_group, which means per io_context per cgroup per
>>> device.
>>>
>>>
>>
>> Muuh,
>>
>> You will require one more piece and that is support for per cgroup request
>> descriptors on request queue. With writes, it is so easy to consume those
>> 128 request descriptors.
>
> Hi Vivek,
>
> Yes. Thank you for the comment.
> I have two concerns about doing that.
>
> (1) Technical concern:
> If there is a fixed device-wide limit and there are many groups, the number
> of request descriptors distributed to each group can be too few. My only
> idea for this is to make the device-wide limit flexible, but I'm not sure
> if that is the best approach or even acceptable.
>
> (2) Implementation concern:
> Currently the limit is enforced by the generic block layer, which doesn't
> know about grouping. The idea in my head to solve this is to add a new
> interface to elevator_ops to ask the IO scheduler whether a new request
> can be allocated.

Muuhh,
We have already done the work of forward-porting the request descriptor
patch that Vivek had in his earlier patch sets. We have also taken care of
the two concerns you mentioned above. We have been testing it and getting
good numbers. So if you want, I can send the patch your way so it can be
included in this same patch series.

Thanks.

>
> Anyway, simple RFC patch first and testing it would be preferable,
> I think.
>
>
> Thanks,
> Muuhh
>
>
> --
> IKEDA, Munehiro
>  NEC Corporation of America
>    m-ikeda(a)ds.jp.nec.com
>