Have sane default values for cpusets [Kernel]

From: Balbir Singh on 12 May 2010 14:30

* Chris Friesen <cfriesen(a)nortel.com> [2010-05-12 12:04:08]:

> Something I haven't tried is whether or not it's possible to set
> ownership/permissions on subtrees within the cgroup hierarchy to allow
> unprivileged users/groups to control their own tasks (basically letting
> them prioritize their own jobs without being able to affect anything
> else). Does anyone know if this sort of thing is possible?

Yep, it is.
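
A minimal sketch of what such a delegation could look like from a privileged setup tool, assuming a cgroup v1 style directory that contains a "tasks" file. The function name delegate_cgroup and any path passed to it are hypothetical; only chown(2) is a real interface here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hand one cgroup directory to an unprivileged owner.
 * Assumption: the directory holds a "tasks" file (cgroup v1 layout).
 * Returns 0 on success, -1 on failure (errno is left as set by chown).
 */
int delegate_cgroup(const char *dir, uid_t uid, gid_t gid)
{
    char tasks[4096];
    int n;

    assert(dir != NULL);

    /* Owning the directory lets the user create child groups with mkdir. */
    if (chown(dir, uid, gid) != 0)
        return -1;

    /* Owning "tasks" lets the user move tasks into this group. */
    n = snprintf(tasks, sizeof(tasks), "%s/tasks", dir);
    if (n < 0 || (size_t)n >= sizeof(tasks))
        return -1;

    return chown(tasks, uid, gid) == 0 ? 0 : -1;
}
```

The caller would run this once per subtree root; control files that should stay under the administrator's control are simply left owned by root.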

--
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Paul Menage on 12 May 2010 15:10

What about the case where some subset of the parent's mems/cpus are
given to a child with the exclusive flag set?

Paul

On Wed, May 12, 2010 at 6:05 AM, Dhaval Giani <dhaval.giani(a)gmail.com> wrote:
> Hi folks,
>
> This is a patch (against a somewhat older kernel) which proposes to set
> a default value for a cpuset cgroup that is created. At this point in
> time, this is just half done since I would prefer some comments, and see
> if it is acceptable, and how.
>
> First the description of the patch.
>
> This patch basically sets up default values for a cpuset that is
> created. By default right now, cpuset.cpus_allowed and
> cpuset.mems_allowed are empty. This does not allow a task to be attached
> to the cpuset. This patch sets the default value of the cpus_allowed and
> mems_allowed to the same as that of the parent.
>
> TODO:
> 1. Set the value depending on the exclusive flags set in other cpusets.
>
> This does not break ABI since applications which were explicitly setting
> up the cpusets will still be setting them up anyway. And if someone was
> checking whether a cpuset was set up by checking the state of
> cpuset.cpus_allowed, then that check was broken and should be fixed.
>
> Now the motivation.
>
> Looking at it from an application programmer's point of view, when using
> cgroups, he does not want to care about unrelated subsystems and would
> only manipulate the subsystem which he is concerned with. But this is a
> decision that is not just limited to the application programmer. It is a
> decision that is very strongly dependent on the underlying system as
> well. Cgroups allows multiple subsystems to be mounted together, which
> then implies they have a common hierarchy.
>
> Now to take an example, consider a system where cpu and memory are
> mounted together, since the user wants to have the same hierarchy for
> both cpu and memory. Since the application cares only about memory, it
> manipulates all those values. But since they are mounted together, every
> time it creates a cgroup for a task, that task will also be moved to the
> corresponding cpu cgroup. The solution to this (and the one we
> recommend) is to mount all cgroups separately, but this is not always
> going to happen, because it is quite painful to do. If you use
> libcgroup, you need to add additional parameters to your configuration
> file. If you mount it manually, you have to specify multiple mount
> commands.
>
> Anyway, coming back to the original issue. Assume that the user's use
> case is a valid one, and mix cpusets into it. Now, if the application
> creates a cgroup for memory without knowing that the user has mounted
> cpusets in the same hierarchy, it is unable to attach a task to its
> newly created cgroup because the cpuset is not set up. Now the
> programmer is forced to know about cpusets as well.
>
> In order to handle this situation, libcgroup has an API which takes the
> parameters from the parent cgroup. But that is also broken. Consider
> this same example. If another child cgroup already has its
> cpu.rt_runtime_us parameter set up, then the create-from-parent API
> will fail since we tried to assign too much rt bandwidth to the new
> cgroup. So you can neither create a cgroup nor can you assign
> parameters from its parent.
>
> Now rt-cgroups handles this situation quite well. Since real-time is
> obviously a special case, the default is to have no rt bandwidth for
> that cgroup. Where cpusets goes wrong is in having *no* default values.
> So the question now is, do we expect to have this non-uniform policy in
> implementing subsystems, or do we enforce a policy to have sane defaults
> for subsystems if they prevent attaching "regular" tasks by default?
>
> Solving it in userspace is just adding another layer, and asking either
> libcgroup to have a lot of code for just one subsystem, or expecting the
> programmer to know about every subsystem, just in order to handle every
> corner case.
>
> Comments?
>
> Thanks!
> Dhaval
>
> ---
>  kernel/cpuset.c |   13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> Index: linux-2.6/kernel/cpuset.c
> ===================================================================
> --- linux-2.6.orig/kernel/cpuset.c
> +++ linux-2.6/kernel/cpuset.c
> @@ -1824,6 +1824,17 @@ static void cpuset_post_clone(struct cgr
>  }
>
>  /*
> + * Inherit the parent's cpus/mems values. Do not inherit the
> + * exclusivity flag
> + *
> + */
> +static void cpuset_inherit_parent_values(struct cpuset *child)
> +{
> +       cpumask_copy(child->cpus_allowed, child->parent->cpus_allowed);
> +       child->mems_allowed = child->parent->mems_allowed;
> +}
> +
> +/*
>   *     cpuset_create - create a cpuset
>   *     ss:     cpuset cgroup subsystem
>   *     cont:   control group that the new cpuset will be part of
> @@ -1860,6 +1871,8 @@ static struct cgroup_subsys_state *cpuse
>         cs->relax_domain_level = -1;
>
>         cs->parent = parent;
> +       cpuset_inherit_parent_values(cs);
> +
>         number_of_cpusets++;
>         return &cs->css;
>  }

From: Dhaval Giani on 12 May 2010 15:20

On Wed, May 12, 2010 at 9:01 PM, Paul Menage <menage(a)google.com> wrote:
> What about the case where some subset of the parent's mems/cpus are
> given to a child with the exclusive flag set?
>

As I mentioned in the TODO, it is still to be handled. But it should
simply exclude those mems/cpus which are exclusive. It was a bit more
involved than the effort I wanted to put in before gauging the
reactions.
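
The exclusion described here can be sketched in user space as follows. This is only a toy model, not the kernel code: it assumes at most 64 CPUs held in a plain bitmask, and the names toy_cpuset and toy_default_cpus are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy cpuset: one bit per CPU, at most 64 CPUs (assumption of this sketch). */
struct toy_cpuset {
    uint64_t cpus;      /* CPUs assigned to this set */
    int cpu_exclusive;  /* non-zero if siblings must not share these CPUs */
};

/*
 * Default CPUs for a newly created child: everything the parent has,
 * minus any CPU already claimed by a sibling with the exclusive flag set.
 */
uint64_t toy_default_cpus(const struct toy_cpuset *parent,
                          const struct toy_cpuset *siblings,
                          size_t nr_siblings)
{
    uint64_t reserved = 0;
    size_t i;

    assert(parent != NULL);
    assert(siblings != NULL || nr_siblings == 0);

    /* Collect the CPUs held by exclusive siblings only. */
    for (i = 0; i < nr_siblings; i++)
        if (siblings[i].cpu_exclusive)
            reserved |= siblings[i].cpus;

    return parent->cpus & ~reserved;
}
```

Note that the result can be empty when exclusive siblings already hold all of the parent's CPUs, in which case a freshly created child would still not accept tasks.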

Thanks!
Dhaval

From: Lennart Poettering on 12 May 2010 15:20

On Wed, 12.05.10 16:20, Peter Zijlstra (peterz(a)infradead.org) wrote:

>
> On Wed, 2010-05-12 at 16:13 +0200, Dhaval Giani wrote:
> > What you are saying is that an application
> > programmer who wants to just use memory cgroups should also care about
> > cpusets and just about countless other cgroup subsystems that can
> > exist.
>
> That's exactly what he says if he mounts them together.

Well, this is not realistic.

See Dhaval's patch against the background of systemd
(http://0pointer.de/blog/projects/systemd.html). When a service is
started in systemd, we create a cgroup for it, when it ends, we remove
it. While systemd does that to keep track of processes this has the nice
side effect that all services are properly (and without races) sorted
into different groups: if you start apache, then you get it into its own
group, if you started cups, you get your own group for that -- without
further configuration. Now, while the main reason to do that is for
keeping track of processes this is also useful to actually enforce
limits and suchlike on those groups and hence services. An admin can
choose to enforce limits on the groups systemd creates
for him, because most likely the grouping systemd does along service
lines is the one that matters the most.

I am not interested in making systemd aware of each and every controller
that exists and will exist in the future and encoding specific inheritance
rules for them. That is simply not possible: we'd have to add a lot of
logic to systemd that I simply don't want to maintain there, and I'd have
to constantly play catch-up with every controller that is added to the
kernel. However, if I don't have that in systemd as it stands now, and
an admin tells systemd to duplicate its groups tree in the cpuset
hierarchy, then systemd would fail to work. And that is not acceptable.

So, just for once, see this from the perspective of the people using
your code: if admins want to piggyback resource limiting onto the normal
systemd cgroup tree, then you make that impossible by having weird
inheritance rules that systemd would first have to learn. (And I am
sorry, but I refuse to teach those rules to systemd, anyway.)

What I am arguing here is basically that it is really important to allow
userspace code to create groups in hierarchies where controllers are
active that the userspace code does not know.

Also, it's completely stupid anyway to ask userspace code to implement
inheritance rules for each cgroup controller, if that algorithm could
just as well be implemented on the kernel side with minimal work, which
would then allow userspace to simply rely on "mkdir" to result in a
working subgroup.

Or let me say this in other words: if an "mkdir" is not enough to
create a working sub-cgroup then libcgroup would have to learn the
necessary inheritance rules and how to copy group params from the parent
to the child -- and that for each and every controller that exists and
will exist. If a new controller is added you'd have to patch libcgroup
and the kernel and make sure they always stay in sync. And that's just
crappy design, if you ask me, and doesn't scale.
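
To make the userspace burden concrete, here is a rough sketch of what a tool has to do today for cpusets alone: create the group, then copy the parent's values by hand. It assumes the cgroup v1 file names cpuset.cpus and cpuset.mems; the helper names inherit_param and create_child_group are hypothetical, and a real tool would need an equivalent rule per controller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/* Copy one control file from parent_dir to child_dir. 0 on success, -1 on error. */
int inherit_param(const char *parent_dir, const char *child_dir, const char *name)
{
    char path[4096], value[4096];
    FILE *f;
    int n;

    assert(parent_dir && child_dir && name);

    /* Read the parent's current value (a single line). */
    n = snprintf(path, sizeof(path), "%s/%s", parent_dir, name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(value, sizeof(value), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* Write the same value into the child's control file. */
    n = snprintf(path, sizeof(path), "%s/%s", child_dir, name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;
    if (fputs(value, f) == EOF) {
        fclose(f);
        return -1;
    }
    /* Buffered write errors surface at fclose. */
    return fclose(f) == 0 ? 0 : -1;
}

/* mkdir plus the cpuset-specific copying that mkdir alone does not do. */
int create_child_group(const char *parent_dir, const char *name)
{
    char child[4096];
    int n;

    assert(parent_dir && name);

    n = snprintf(child, sizeof(child), "%s/%s", parent_dir, name);
    if (n < 0 || (size_t)n >= sizeof(child))
        return -1;
    /* An already existing group is tolerated. */
    if (mkdir(child, 0755) != 0 && errno != EEXIST)
        return -1;

    if (inherit_param(parent_dir, child, "cpuset.mems") != 0)
        return -1;
    return inherit_param(parent_dir, child, "cpuset.cpus");
}
```

Everything after the mkdir call is the part that would become unnecessary if the kernel supplied usable defaults.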

Lennart

From: Dhaval Giani on 12 May 2010 15:30

On Wed, May 12, 2010 at 9:20 PM, Paul Menage <menage(a)google.com> wrote:
> On Wed, May 12, 2010 at 12:10 PM, Dhaval Giani <dhaval.giani(a)gmail.com> wrote:
>> On Wed, May 12, 2010 at 9:01 PM, Paul Menage <menage(a)google.com> wrote:
>>> What about the case where some subset of the parent's mems/cpus are
>>> given to a child with the exclusive flag set?
>>>
>>
>> As I mentioned in the TODO, it is still to be handled.
>
> Oops, sorry, just read the patch :-)
>
>> But it should
>> simply exclude those mems/cpus which are exclusive. It was a bit more
>> involved than the effort I wanted to put in before gauging the
>> reactions.
>
> I think the idea is reasonable - the only way that I could see it
> breaking someone would be code that currently does something like:
>
> mkdir A
> mkdir B
> echo 1 > A/mem_exclusive
> echo 1 > B/mem_exclusive
> echo $mems_for_a > A/mems
> echo $mems_for_b > B/mems
>
> The attempts to set the mem_exclusive flags would fail, since A and B
> would both have all of the parent's mems.
>

But would this not fail otherwise?
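
One way to reason about the question is a toy model of the exclusivity check. The sketch below assumes the rule "an exclusive set may not share a memory node with any sibling" and nothing more; it does not reproduce the kernel's validation code, and toy_memset and toy_set_mem_exclusive are invented names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy sibling cpuset: one bit per memory node (assumption of this sketch). */
struct toy_memset {
    uint64_t mems;      /* memory nodes assigned to this set */
    int mem_exclusive;  /* non-zero once the exclusive flag is set */
};

/*
 * Try to set mem_exclusive on sets[idx].
 * Assumed rule: an exclusive set must not overlap any sibling.
 * Returns 0 on success, -1 if the flag would create an overlap.
 */
int toy_set_mem_exclusive(struct toy_memset *sets, size_t nr, size_t idx)
{
    size_t i;

    assert(sets != NULL && idx < nr);

    for (i = 0; i < nr; i++)
        if (i != idx && (sets[i].mems & sets[idx].mems))
            return -1;

    sets[idx].mem_exclusive = 1;
    return 0;
}
```

Under this model, two siblings with empty mems can both be flagged exclusive because empty sets never overlap, whereas two siblings that each start with all of the parent's mems cannot. Whether the real kernel behaves the same way depends on its own checks.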

Dhaval