From: Nauman Rafique on
On Fri, Mar 5, 2010 at 6:13 AM, Vivek Goyal <vgoyal(a)redhat.com> wrote:
> On Fri, Mar 05, 2010 at 10:25:58AM +0800, Gui Jianfeng wrote:
>> Currently, IO Controller makes use of blkio.weight to assign weight for
>> all devices. Here a new user interface "blkio.weight_device" is introduced to
>> assign different weights for different devices. blkio.weight becomes the
>> default value for devices which are not configured by "blkio.weight_device"
>>
>> You can use the following format to assigned specific weight for a given
>> device:
>>
>> major:minor represents device number.
>>
>> And you can remove a specific weight as following:
>>
>> V1->V2 changes:
>> - use user interface "weight_device" instead of "policy" suggested by Vivek
>> - rename some struct suggested by Vivek
>> - rebase to 2.6-block "for-linus" branch
>> - remove an useless list_empty check pointed out by Li Zefan
>> - some trivial typo fix
>>
>> V2->V3 changes:
>> - Move policy_*_node() functions up to get rid of forward declarations
>> - rename related functions by adding prefix "blkio_"
>>
>
> Thanks for the changes Gui. Looks good to me.
>
> Acked-by: Vivek Goyal <vgoyal(a)redhat.com>
>
> Vivek
>
>> Signed-off-by: Gui Jianfeng <guijianfeng(a)cn.fujitsu.com>
>> ---
>> �block/blk-cgroup.c �| �236 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>> �block/blk-cgroup.h �| � 10 ++
>> �block/cfq-iosched.c | � �2 +-
>> �3 files changed, 246 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index c85d74c..8825e49 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -16,6 +16,7 @@
>> �#include <linux/module.h>
>> �#include <linux/err.h>
>> �#include "blk-cgroup.h"
>> +#include <linux/genhd.h>
>>
>> �static DEFINE_SPINLOCK(blkio_list_lock);
>> �static LIST_HEAD(blkio_list);
>> @@ -23,6 +24,32 @@ static LIST_HEAD(blkio_list);
>> �struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
>> �EXPORT_SYMBOL_GPL(blkio_root_cgroup);
>>
>> +static inline void blkio_policy_insert_node(struct blkio_cgroup *blkcg,
>> + � � � � � � � � � � � � � � � � � � � � struct blkio_policy_node *pn)
>> +{
>> + � � list_add(&pn->node, &blkcg->policy_list);
>> +}
>> +
>> +/* Must be called with blkcg->lock held */
>> +static inline void blkio_policy_delete_node(struct blkio_policy_node *pn)
>> +{
>> + � � list_del(&pn->node);
>> +}
>> +
>> +/* Must be called with blkcg->lock held */
>> +static struct blkio_policy_node *
>> +blkio_policy_search_node(const struct blkio_cgroup *blkcg, dev_t dev)
>> +{
>> + � � struct blkio_policy_node *pn;
>> +
>> + � � list_for_each_entry(pn, &blkcg->policy_list, node) {
>> + � � � � � � if (pn->dev == dev)
>> + � � � � � � � � � � return pn;
>> + � � }
>> +
>> + � � return NULL;
>> +}
>> +
>> �struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
>> �{
>> � � � return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),
>> @@ -128,6 +155,7 @@ blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
>> � � � struct blkio_group *blkg;
>> � � � struct hlist_node *n;
>> � � � struct blkio_policy_type *blkiop;
>> + � � struct blkio_policy_node *pn;
>>
>> � � � if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
>> � � � � � � � return -EINVAL;
>> @@ -136,7 +164,13 @@ blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
>> � � � spin_lock(&blkio_list_lock);
>> � � � spin_lock_irq(&blkcg->lock);
>> � � � blkcg->weight = (unsigned int)val;
>> +
>> � � � hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
>> + � � � � � � pn = blkio_policy_search_node(blkcg, blkg->dev);
>> +
>> + � � � � � � if (pn)
>> + � � � � � � � � � � continue;
>> +
>> � � � � � � � list_for_each_entry(blkiop, &blkio_list, list)
>> � � � � � � � � � � � blkiop->ops.blkio_update_group_weight_fn(blkg,
>> � � � � � � � � � � � � � � � � � � � blkcg->weight);
>> @@ -178,15 +212,208 @@ SHOW_FUNCTION_PER_GROUP(dequeue);
>>
>> �#ifdef CONFIG_DEBUG_BLK_CGROUP
>> �void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
>> - � � � � � � � � � � unsigned long dequeue)
>> + � � � � � � � � � � � � � � � � � � � � � unsigned long dequeue)
>> �{
>> � � � blkg->dequeue += dequeue;
>> �}
>> �EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_dequeue_stats);
>> �#endif
>>
>> +static int blkio_check_dev_num(dev_t dev)
>> +{
>> + � � int part = 0;
>> + � � struct gendisk *disk;
>> +
>> + � � disk = get_gendisk(dev, &part);
>> + � � if (!disk || part)
>> + � � � � � � return -ENODEV;
>> +
>> + � � return 0;
>> +}
>> +
>> +static int blkio_policy_parse_and_set(char *buf,
>> + � � � � � � � � � � � � � � � � � struct blkio_policy_node *newpn)
>> +{
>> + � � char *s[4], *p, *major_s = NULL, *minor_s = NULL;
>> + � � int ret;
>> + � � unsigned long major, minor, temp;
>> + � � int i = 0;
>> + � � dev_t dev;
>> +
>> + � � memset(s, 0, sizeof(s));
>> +
>> + � � while ((p = strsep(&buf, " ")) != NULL) {
>> + � � � � � � if (!*p)
>> + � � � � � � � � � � continue;
>> +
>> + � � � � � � s[i++] = p;
>> +
>> + � � � � � � /* Prevent from inputing too many things */
>> + � � � � � � if (i == 3)
>> + � � � � � � � � � � break;
>> + � � }
>> +
>> + � � if (i != 2)
>> + � � � � � � return -EINVAL;
>> +
>> + � � p = strsep(&s[0], ":");
>> + � � if (p != NULL)
>> + � � � � � � major_s = p;
>> + � � else
>> + � � � � � � return -EINVAL;
>> +
>> + � � minor_s = s[0];
>> + � � if (!minor_s)
>> + � � � � � � return -EINVAL;
>> +
>> + � � ret = strict_strtoul(major_s, 10, &major);
>> + � � if (ret)
>> + � � � � � � return -EINVAL;
>> +
>> + � � ret = strict_strtoul(minor_s, 10, &minor);
>> + � � if (ret)
>> + � � � � � � return -EINVAL;
>> +
>> + � � dev = MKDEV(major, minor);

I am not quite sure if exposing a mojor,minor number is the best
interface that can be exposed to user space. How about actual disk
names like sda, sdb, .. etc? The only problem I see there is that it
seems tricky to get to these disk names from within the block layer.
"struct request_queue" has a pointer to backing_dev which has a device
from which we can get major,minor. But in order to get to disk name,
we would have to call get_gendisk which can hold a semaphore. Is this
the reason for us going with major,minor as a user interface to
specify a disk? I bet there are good reasons for us not keeping a
pointer to "struct gendisk" from "struct request_queue". If we could
keep that pointer, our user interface could be very easily modified to
be the disk name like sda, sdb, etc.

>> +
>> + � � ret = blkio_check_dev_num(dev);
>> + � � if (ret)
>> + � � � � � � return ret;
>> +
>> + � � newpn->dev = dev;
>> +
>> + � � if (s[1] == NULL)
>> + � � � � � � return -EINVAL;
>> +
>> + � � ret = strict_strtoul(s[1], 10, &temp);
>> + � � if (ret || (temp < BLKIO_WEIGHT_MIN && temp > 0) ||
>> + � � � � temp > BLKIO_WEIGHT_MAX)
>> + � � � � � � return -EINVAL;
>> +
>> + � � newpn->weight = �temp;
>> +
>> + � � return 0;
>> +}
>> +
>> +unsigned int blkcg_get_weight(struct blkio_cgroup *blkcg,
>> + � � � � � � � � � � � � � dev_t dev)
>> +{
>> + � � struct blkio_policy_node *pn;
>> +
>> + � � pn = blkio_policy_search_node(blkcg, dev);
>> + � � if (pn)
>> + � � � � � � return pn->weight;
>> + � � else
>> + � � � � � � return blkcg->weight;
>> +}
>> +EXPORT_SYMBOL_GPL(blkcg_get_weight);
>> +
>> +static int blkiocg_weight_device_write(struct cgroup *cgrp, struct cftype *cft,
>> + � � � � � � � � � � � � � � � � � �const char *buffer)
>> +{
>> + � � int ret = 0;
>> + � � char *buf;
>> + � � struct blkio_policy_node *newpn, *pn;
>> + � � struct blkio_cgroup *blkcg;
>> + � � struct blkio_group *blkg;
>> + � � int keep_newpn = 0;
>> + � � struct hlist_node *n;
>> + � � struct blkio_policy_type *blkiop;
>> +
>> + � � buf = kstrdup(buffer, GFP_KERNEL);
>> + � � if (!buf)
>> + � � � � � � return -ENOMEM;
>> +
>> + � � newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
>> + � � if (!newpn) {
>> + � � � � � � ret = -ENOMEM;
>> + � � � � � � goto free_buf;
>> + � � }
>> +
>> + � � ret = blkio_policy_parse_and_set(buf, newpn);
>> + � � if (ret)
>> + � � � � � � goto free_newpn;
>> +
>> + � � blkcg = cgroup_to_blkio_cgroup(cgrp);
>> +
>> + � � spin_lock_irq(&blkcg->lock);
>> +
>> + � � pn = blkio_policy_search_node(blkcg, newpn->dev);
>> + � � if (!pn) {
>> + � � � � � � if (newpn->weight != 0) {
>> + � � � � � � � � � � blkio_policy_insert_node(blkcg, newpn);
>> + � � � � � � � � � � keep_newpn = 1;
>> + � � � � � � }
>> + � � � � � � spin_unlock_irq(&blkcg->lock);
>> + � � � � � � goto update_io_group;
>> + � � }
>> +
>> + � � if (newpn->weight == 0) {
>> + � � � � � � /* weight == 0 means deleteing a specific weight */
>> + � � � � � � blkio_policy_delete_node(pn);
>> + � � � � � � spin_unlock_irq(&blkcg->lock);
>> + � � � � � � goto update_io_group;
>> + � � }
>> + � � spin_unlock_irq(&blkcg->lock);
>> +
>> + � � pn->weight = newpn->weight;
>> +
>> +update_io_group:
>> + � � /* update weight for each cfqg */
>> + � � spin_lock(&blkio_list_lock);
>> + � � spin_lock_irq(&blkcg->lock);
>> +
>> + � � hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
>> + � � � � � � if (newpn->dev == blkg->dev) {
>> + � � � � � � � � � � list_for_each_entry(blkiop, &blkio_list, list)
>> + � � � � � � � � � � � � � � blkiop->ops.blkio_update_group_weight_fn(blkg,
>> + � � � � � � � � � � � � � � � � � � � � � � newpn->weight ?
>> + � � � � � � � � � � � � � � � � � � � � � � newpn->weight :
>> + � � � � � � � � � � � � � � � � � � � � � � blkcg->weight);
>> + � � � � � � }
>> + � � }
>> +
>> + � � spin_unlock_irq(&blkcg->lock);
>> + � � spin_unlock(&blkio_list_lock);
>> +
>> +free_newpn:
>> + � � if (!keep_newpn)
>> + � � � � � � kfree(newpn);
>> +free_buf:
>> + � � kfree(buf);
>> + � � return ret;
>> +}
>> +
>> +static int blkiocg_weight_device_read(struct cgroup *cgrp, struct cftype *cft,
>> + � � � � � � � � � � � � � � � � � struct seq_file *m)
>> +{
>> + � � struct blkio_cgroup *blkcg;
>> + � � struct blkio_policy_node *pn;
>> +
>> + � � seq_printf(m, "dev\tweight\n");
>> +
>> + � � blkcg = cgroup_to_blkio_cgroup(cgrp);
>> + � � if (list_empty(&blkcg->policy_list))
>> + � � � � � � goto out;
>> +
>> + � � spin_lock_irq(&blkcg->lock);
>> + � � list_for_each_entry(pn, &blkcg->policy_list, node) {
>> + � � � � � � seq_printf(m, "%u:%u\t%u\n", MAJOR(pn->dev),
>> + � � � � � � � � � � � �MINOR(pn->dev), pn->weight);
>> + � � }
>> + � � spin_unlock_irq(&blkcg->lock);
>> +out:
>> + � � return 0;
>> +}
>> +
>> �struct cftype blkio_files[] = {
>> � � � {
>> + � � � � � � .name = "weight_device",
>> + � � � � � � .read_seq_string = blkiocg_weight_device_read,
>> + � � � � � � .write_string = blkiocg_weight_device_write,
>> + � � � � � � .max_write_len = 256,
>> + � � },
>> + � � {
>> � � � � � � � .name = "weight",
>> � � � � � � � .read_u64 = blkiocg_weight_read,
>> � � � � � � � .write_u64 = blkiocg_weight_write,
>> @@ -220,6 +447,7 @@ static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
>> � � � struct blkio_group *blkg;
>> � � � void *key;
>> � � � struct blkio_policy_type *blkiop;
>> + � � struct blkio_policy_node *pn, *pntmp;
>>
>> � � � rcu_read_lock();
>> �remove_entry:
>> @@ -250,7 +478,12 @@ remove_entry:
>> � � � � � � � blkiop->ops.blkio_unlink_group_fn(key, blkg);
>> � � � spin_unlock(&blkio_list_lock);
>> � � � goto remove_entry;
>> +
>> �done:
>> + � � list_for_each_entry_safe(pn, pntmp, &blkcg->policy_list, node) {
>> + � � � � � � blkio_policy_delete_node(pn);
>> + � � � � � � kfree(pn);
>> + � � }
>> � � � free_css_id(&blkio_subsys, &blkcg->css);
>> � � � rcu_read_unlock();
>> � � � kfree(blkcg);
>> @@ -280,6 +513,7 @@ done:
>> � � � spin_lock_init(&blkcg->lock);
>> � � � INIT_HLIST_HEAD(&blkcg->blkg_list);
>>
>> + � � INIT_LIST_HEAD(&blkcg->policy_list);
>> � � � return &blkcg->css;
>> �}
>>
>> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
>> index 84bf745..c8144a2 100644
>> --- a/block/blk-cgroup.h
>> +++ b/block/blk-cgroup.h
>> @@ -22,6 +22,7 @@ struct blkio_cgroup {
>> � � � unsigned int weight;
>> � � � spinlock_t lock;
>> � � � struct hlist_head blkg_list;
>> + � � struct list_head policy_list; /* list of blkio_policy_node */
>> �};
>>
>> �struct blkio_group {
>> @@ -43,6 +44,15 @@ struct blkio_group {
>> � � � unsigned long sectors;
>> �};
>>
>> +struct blkio_policy_node {
>> + � � struct list_head node;
>> + � � dev_t dev;
>> + � � unsigned int weight;
>> +};
>> +
>> +extern unsigned int blkcg_get_weight(struct blkio_cgroup *blkcg,
>> + � � � � � � � � � � � � � � � � �dev_t dev);
>> +
>> �typedef void (blkio_unlink_group_fn) (void *key, struct blkio_group *blkg);
>> �typedef void (blkio_update_group_weight_fn) (struct blkio_group *blkg,
>> � � � � � � � � � � � � � � � � � � � � � � � unsigned int weight);
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index dee9d93..6945e9b 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -954,7 +954,6 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
>> � � � if (!cfqg)
>> � � � � � � � goto done;
>>
>> - � � cfqg->weight = blkcg->weight;
>> � � � for_each_cfqg_st(cfqg, i, j, st)
>> � � � � � � � *st = CFQ_RB_ROOT;
>> � � � RB_CLEAR_NODE(&cfqg->rb_node);
>> @@ -971,6 +970,7 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
>> � � � sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
>> � � � blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
>> � � � � � � � � � � � � � � � � � � � MKDEV(major, minor));
>> + � � cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
>>
>> � � � /* Add group on cfqd list */
>> � � � hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
>> --
>> 1.5.4.rc3
>>
>>
>>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Nauman Rafique on
On Tue, Mar 9, 2010 at 12:16 PM, Vivek Goyal <vgoyal(a)redhat.com> wrote:
> On Mon, Mar 08, 2010 at 11:39:54AM -0800, Nauman Rafique wrote:
> [..]
>> >> +static int blkio_policy_parse_and_set(char *buf,
>> >> + � � � � � � � � � � � � � � � � � struct blkio_policy_node *newpn)
>> >> +{
>> >> + � � char *s[4], *p, *major_s = NULL, *minor_s = NULL;
>> >> + � � int ret;
>> >> + � � unsigned long major, minor, temp;
>> >> + � � int i = 0;
>> >> + � � dev_t dev;
>> >> +
>> >> + � � memset(s, 0, sizeof(s));
>> >> +
>> >> + � � while ((p = strsep(&buf, " ")) != NULL) {
>> >> + � � � � � � if (!*p)
>> >> + � � � � � � � � � � continue;
>> >> +
>> >> + � � � � � � s[i++] = p;
>> >> +
>> >> + � � � � � � /* Prevent from inputing too many things */
>> >> + � � � � � � if (i == 3)
>> >> + � � � � � � � � � � break;
>> >> + � � }
>> >> +
>> >> + � � if (i != 2)
>> >> + � � � � � � return -EINVAL;
>> >> +
>> >> + � � p = strsep(&s[0], ":");
>> >> + � � if (p != NULL)
>> >> + � � � � � � major_s = p;
>> >> + � � else
>> >> + � � � � � � return -EINVAL;
>> >> +
>> >> + � � minor_s = s[0];
>> >> + � � if (!minor_s)
>> >> + � � � � � � return -EINVAL;
>> >> +
>> >> + � � ret = strict_strtoul(major_s, 10, &major);
>> >> + � � if (ret)
>> >> + � � � � � � return -EINVAL;
>> >> +
>> >> + � � ret = strict_strtoul(minor_s, 10, &minor);
>> >> + � � if (ret)
>> >> + � � � � � � return -EINVAL;
>> >> +
>> >> + � � dev = MKDEV(major, minor);
>>
>> I am not quite sure if exposing a mojor,minor number is the best
>> interface that can be exposed to user space. How about actual disk
>> names like sda, sdb, .. etc? The only problem I see there is that it
>> seems tricky to get to these disk names from within the block layer.
>> "struct request_queue" has a pointer to backing_dev which has a device
>> from which we can get major,minor. But in order to get to disk name,
>> we would have to call get_gendisk which can hold a semaphore. Is this
>> the reason for us going with major,minor as a user interface to
>> specify a disk? I bet there are good reasons for us not keeping a
>> pointer to "struct gendisk" from "struct request_queue". If we could
>> keep that pointer, our user interface could be very easily modified to
>> be the disk name like sda, sdb, etc.
>
> Hi Nauman,
>
> Do we really store a device name in "struct gendisk"? IIUC, a disk is
> identified by its major and minor number and then there can be multiple
> device files pointing to same disk.
>
> So I have a disk /dev/sdc in my system and I created another alias to
> same disk using mknod and mounted the disk using the alias.
>
> mknod /dev/sdc-alias b 8 32
> mount /dev/sdc-alias /mnt
>
> If that's the case, there is no way gendisk can store the pathname.
> Instead, device file has inode associated with it, and there we store
> major, minor number of disk, and using that we operate on disk/partition.

You are right, you can create more aliases to point the same device.
But there is one name stored in disk_name field of struct gendisk. And
I guess this is the same name that you will see if you do "ls
/sys/block/". block layer exposes all its sysfs variables going
through the disk names. For example, if you have to switch a scheduler
on a block device, you would use the name like sdb and do "echo cfq >
/sys/block/sdb/queue/scheduler". The same holds true if you want to
change a io scheduler specific tunable, e.g.
/sys/block/sdb/queue/iosched/back_seek_max.

My point is that all per device interfaces in the io cgroup category
should use the same device names; this includes all the stats
reporting, interfaces to set weights, and so on. And not come up with
a different way of identifying devices.

>
> If that's the case, then major/minor number based interface for blkio
> makes sense. Because we also need to export stats regarding the disk
> time consumed by cgroup on a particular device, the only unique identifier
> of the disk seems to be {major,minor} pair and multiple block device
> files can be pointing to same disk. Because it is many to one mapping, it
> will not be possible to reverse map it.
>
> So I guess we need to continue to handle rules and stats using major/minor
> numbers. One improvement probably we can do and that is allow setting
> rules both by major/minor number and device file path. But internally
> cgroup code will map device file path to major minor numbers and rules
> will be displayed against major/minor number and not original device path.
>
> Thanks
> Vivek
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Chad Talbott on
On Tue, Mar 9, 2010 at 4:33 PM, Gui Jianfeng <guijianfeng(a)cn.fujitsu.com> wrote:
> Vivek Goyal wrote:
>> On Tue, Mar 09, 2010 at 09:52:06AM +0800, Gui Jianfeng wrote:
>> I am not very sure but device name/path interface might turn out to be
>> more intutive.
>
> I don't think using an inode path name as interface is a good idea. Because, one
> can create new file to point to the same device. Also, pathname could be removed
> or renamed by user.
> So, i think device number is a better choice.

Sorry to butt in, but I don't think anyone is suggesting trying to
remember names from device nodes. I think the idea is to use the
names derived from the subsystem - such as 'sda' which is reflected in
/sys and dmesg.

Chad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Chad Talbott on
On Wed, Mar 10, 2010 at 7:30 AM, Vivek Goyal <vgoyal(a)redhat.com> wrote:
> This still leaves the issue of reaching a gendisk object from request
> queue. Looking into it.

It looks like we have that pairing way back in blk_register_queue()
which takes a gendisk. Is there any reason we don't hold onto the
gendisk there? Eyeballing add_disk() and unlink_gendisk() seems to
confirm that gendisk lifetime spans request_queue.

Nauman and I were also wondering why blkio_group and blkio_policy_node
store a dev_t, rather than a direct pointer to gendisk. dev_t seems
more like a userspace<->kernel interface than an inside-the-kernel
interface.

Chad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Manuel Benitez on
On a closely related topic, I've just recently made a change to one of
my branches that exposes the blkio.time and blkio.sectors information
for the root cgroup. These stats would not show because the major and
minor information for the root blkio_croup structures is zero. This
information is not available at the when the root blkio_cgroup
structures are instantiated, so they are left without major and minor
information.

I have a simple fix that updates the major and minor information for
the root structures at a later time. It looks something like this:

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index cd79be0..b34c952 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -956,6 +956,11 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *c
return NULL;;

cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+ if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+ sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+ cfqg->blkg.dev = MKDEV(major, minor);
+ goto done;
+ }
if (cfqg || !create)
goto done;

If folks think that this would be of interest, I can submit a formal
patch. If someone can suggest a better way to do it that doesn't
require extensive changes elsewhere, I'm open to working that up as
well.

-Ricky

On Wed, Mar 10, 2010 at 12:31 PM, Vivek Goyal <vgoyal(a)redhat.com> wrote:
> On Wed, Mar 10, 2010 at 01:03:36PM -0500, Vivek Goyal wrote:
>> On Wed, Mar 10, 2010 at 09:38:35AM -0800, Chad Talbott wrote:
>> > On Wed, Mar 10, 2010 at 7:30 AM, Vivek Goyal <vgoyal(a)redhat.com> wrote:
>> > > This still leaves the issue of reaching a gendisk object from request
>> > > queue. Looking into it.
>> >
>> > It looks like we have that pairing way back in blk_register_queue()
>> > which takes a gendisk. �Is there any reason we don't hold onto the
>> > gendisk there? �Eyeballing add_disk() and unlink_gendisk() seems to
>> > confirm that gendisk lifetime spans request_queue.
>> >
>>
>> Yes, looking at the code, it looks like gendisk and request_queue object's
>> lifetime is same and probably we can store a pointer to gendisk in
>> request_queue at blk_register_queue() time. And then use this pointer to
>> retrieve gendisk->disk_name to report stats.
>>
>
> Well, gendisk and request_queue have little different life span. Following
> seems to be the sequence a block driver follows.
>
> � � � �blk_init_queue()
> � � � �alloc_disk() and add_disk()
> � � � �device_removed
> � � � �del_gendisk()
> � � � �blk_cleanup_queue()
>
> So first we cleaup the gendisk structure and later driver calls to cleanup
> the request queue.
>
>> > Nauman and I were also wondering why blkio_group and blkio_policy_node
>> > store a dev_t, rather than a direct pointer to gendisk. �dev_t seems
>> > more like a userspace<->kernel interface than an inside-the-kernel
>> > interface.
>> >
>>
>> blkio_policy_node currently can't store a pointer to gendisk because there
>> is no mechanism to call back into blkio if device is removed. So if we
>> implement something so that once device is removed, blkio layer gets a
>> callback and we cleanup any state/rules associated with that device, then
>> I think we should be able to store the pointer to gendisk.
>>
>> I am still trying to figure out how elevator/ioscheduler state is cleaned
>> up if a device is removed while some IO is happening to it.
>>
>
> So blk_cleanup_queue() will do this. That means few things.
>
> - We can't store pointers to gendisk in blkio_policy_node or blkio_group
> �because gendisk might have gone away but request queue is still there.
> �May be one can try saving a pointer and taking a reference, but I guess
> �that becomes littles complicated.
>
> - If we are using disk name for rules and reporting stats, then we also
> �need to make sure that these rules are cleared from cgroups once device
> �has disappeared. Otherwise, following might happen.
>
> � � � �- Create a rule for sda (x,y) for cgroup test1. x,y are major and
> � � � � �minor numbers.
> � � � �- sda goes away. Rules still remains in blkio cgroup.
> � � � �- Another device gets plugged in and i guess following can happen.
> � � � � � � � �- device name is different but dev_t is same as sda.
> � � � � � � � �- device name is same (sda) but device number is
> � � � � � � � � �different.
>
> � � � � � � � �In both the cases a user will be confused with stale rules
> � � � � � � � �in cgroups.
>
> �Cleaning up cgroups rules can get little complicated. I guess we need to
> �create a function in blkio-cgroup.c to traverse through all the cgroups
> �and cleanup any blkio_policy_nodes belonging to device going away.
>
> In a nutshell, it probably is doable. You are welcome to write a patch. At
> the same time I am not against deivce major/minor number based interface,
> because it keeps things little simple.
>
> Thanks
> Vivek
>
> -
>> OTOH, Gui, may be one can use blk_lookup_devt() to lookup the dev_t of a
>> device using the disk name (sda). I just noticed it while reading the
>> code.
>>
>> Thanks
>> Vivek
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo(a)vger.kernel.org
> More majordomo info at �http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at �http://www.tux.org/lkml/
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/