From: Gui Jianfeng on
Vivek Goyal wrote:
> Hi,
>
> This is V3 of the group_idle and CFQ IOPS mode implementation patchset. Since
> V2 I have cleaned up the code a bit to clarify the confusion around which
> cases we charge by time slice and which cases we charge by number of
> requests.
>
> What's the problem
> ------------------
> On high-end storage (tested on an HP EVA storage array with 12 SATA disks in
> RAID 5), CFQ's model of dispatching requests from a single queue at a
> time (sequential readers, sync writers, etc.) becomes a bottleneck. Often we
> don't drive enough request queue depth to keep all the disks busy and we
> suffer a lot in terms of overall throughput.
>
> All these problems primarily originate from two things: idling on each cfq
> queue, and the quantum (dispatching only a limited number of requests from a
> single queue while not allowing dispatch from other queues). Once you set
> slice_idle=0 and raise quantum to a higher value, most of CFQ's problems on
> higher-end storage disappear.
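Both knobs are per-device CFQ tunables under sysfs. A minimal sketch of the
tuning described above, using sdb as a stand-in for the actual device:

echo 0 > /sys/block/sdb/queue/iosched/slice_idle   # stop idling for more IO from a single cfq queue
echo 8 > /sys/block/sdb/queue/iosched/quantum      # let one queue dispatch more requests at a time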
>
> This problem also becomes visible in the IO controller, where one creates
> multiple groups and gets fairness but lower overall throughput. In the
> following table, I am running an increasing number of sequential readers
> (1, 2, 4, 8) in 8 groups of weight 100 to 800.
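The eight groups are just blkio cgroups with linearly increasing weights; set
up by hand (iostest does this internally) it amounts to roughly the following,
where the mount point matches the logs later in this thread and the group
names are made up:

mount -t cgroup -o blkio none /cgroup/blkio                # skip if already mounted
for i in 1 2 3 4 5 6 7 8; do
        mkdir -p /cgroup/blkio/grp$i
        echo $((i * 100)) > /cgroup/blkio/grp$i/blkio.weight   # weights 100..800
done
echo 1 > /sys/block/sdb/queue/iosched/group_isolation      # as in the runs below, sdb being an example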
>
> Kernel=2.6.35-rc5-iops+
> GROUPMODE=1 NRGRP=8
> DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
> Workload=bsr iosched=cfq Filesz=512M bs=4K
> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[bsr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
> --- --- -- ---------------------------------------------------------------
> bsr 3 1 6186 12752 16568 23068 28608 35785 42322 48409 213701
> bsr 3 2 5396 10902 16959 23471 25099 30643 37168 42820 192461
> bsr 3 4 4655 9463 14042 20537 24074 28499 34679 37895 173847
> bsr 3 8 4418 8783 12625 19015 21933 26354 29830 36290 159249
>
> Notice that overall throughput is just around 160MB/s with 8 sequential
> readers in each group.
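The "bsr" jobs iostest launches are ordinary buffered sequential readers. A
hand-rolled approximation of one group's NR=8 run could look like the sketch
below; this is only a guess at the workload shape, not the actual
iostest-generated job file, and the group/file names are invented:

echo $$ > /cgroup/blkio/grp1/tasks     # run the fio job below from inside group 1
cat > bsr-grp1.fio <<EOF
[bsr-grp1]
directory=/mnt/iostestmnt/fio
rw=read
bs=4k
size=512m
numjobs=8
runtime=30
time_based
group_reporting
EOF
fio bsr-grp1.fio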
>
> With this patch set applied, I set slice_idle=0 and re-ran the same test.
>
> Kernel=2.6.35-rc5-iops+
> GROUPMODE=1 NRGRP=8
> DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
> Workload=bsr iosched=cfq Filesz=512M bs=4K
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[bsr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
> --- --- -- ---------------------------------------------------------------
> bsr 3 1 6523 12399 18116 24752 30481 36144 42185 48894 219496
> bsr 3 2 10072 20078 29614 38378 46354 52513 58315 64833 320159
> bsr 3 4 11045 22340 33013 44330 52663 58254 63883 70990 356520
> bsr 3 8 12362 25860 37920 47486 61415 47292 45581 70828 348747
>
> Notice how overall throughput has shot up to 348MB/s while retaining the
> ability to do IO control.
>
> So this is not the default mode. The new tunable, group_idle, allows one to
> set slice_idle=0 to disable some of CFQ's features and primarily use the
> group service differentiation feature.
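With the patchset applied, the IOPS-mode setup used for the second table above
then boils down to something like this (assuming group_idle sits next to the
other CFQ tunables, sdb again being an example device):

echo 0 > /sys/block/sdb/queue/iosched/slice_idle   # give up per-queue idling and its fairness
echo 8 > /sys/block/sdb/queue/iosched/group_idle   # still idle on the group, so service differentiation survives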
>
> If you have thoughts on other ways of solving the problem, I am all ears.

Hi Vivek

Would you attach your fio job config file?

Thanks
Gui

>
> Thanks
> Vivek
>
>
From: Gui Jianfeng on
Vivek Goyal wrote:
> On Thu, Jul 22, 2010 at 03:08:00PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> [..]
>> Hi Vivek
>>
>> Would you attach your fio job config file?
>>
>
> Hi Gui,
>
> I have written a fio-based test script, "iostest", to do cgroup and other
> IO scheduler testing more smoothly, and I am using that. I am attaching the
> compressed script with this mail. Try it out; if it works for you and you
> find it useful, I can think of hosting a git tree somewhere.
>
> I used the following command lines for the tests above.
>
> # iostest <block-device> -G -w bsr -m 8 -c --nrgrp 8 --total
>
> With slice idle disabled.
>
> # iostest <block-device> -G -w bsr -m 8 -c --nrgrp 8 --total -I 0

That's cool! Very helpful, I'll try it.

Thanks,
Gui

>
> Thanks
> Vivek

--
Regards
Gui Jianfeng
From: Gui Jianfeng on
Vivek Goyal wrote:
> [..]

Hi Vivek,

I did some tests on a single SATA disk on my desktop. With the patches
applied, no regression has shown up so far, and there is some performance
improvement in the "Direct Random Reader" (drr) case. Here are some numbers
from my box.

Vanilla kernel:

Blkio is already mounted at /cgroup/blkio. Unmounting it
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
GROUPMODE=1 NRGRP=4
Will run workloads for increasing number of threads upto a max of 4
Starting test for [drr] with set=1 numjobs=1 filesz=512M bs=32k runtime=30
Starting test for [drr] with set=1 numjobs=2 filesz=512M bs=32k runtime=30
Starting test for [drr] with set=1 numjobs=4 filesz=512M bs=32k runtime=30
Finished test for workload [drr]
Host=localhost.localdomain Kernel=2.6.35-rc4-Vivek-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[drr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drr 1 1 761 761 762 760 3044
drr 1 2 185 420 727 1256 2588
drr 1 4 180 371 588 863 2002


Patched kernel:

Blkio is already mounted at /cgroup/blkio. Unmounting it
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
GROUPMODE=1 NRGRP=4
Will run workloads for increasing number of threads upto a max of 4
Starting test for [drr] with set=1 numjobs=1 filesz=512M bs=32k runtime=30
Starting test for [drr] with set=1 numjobs=2 filesz=512M bs=32k runtime=30
Starting test for [drr] with set=1 numjobs=4 filesz=512M bs=32k runtime=30
Finished test for workload [drr]
Host=localhost.localdomain Kernel=2.6.35-rc4-Vivek-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[drr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drr 1 1 323 671 1030 1378 3402
drr 1 2 165 391 686 1144 2386
drr 1 4 185 373 612 873 2043
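For what it's worth, "drr" here is a direct (O_DIRECT) random reader; a rough
standalone fio equivalent of the numjobs=4 row would be something like the
following (only an approximation of what iostest generates, file name
invented):

cat > drr.fio <<EOF
[drr]
directory=/mnt/iostestmnt/fio
rw=randread
direct=1
bs=32k
size=512m
numjobs=4
runtime=30
time_based
EOF
fio drr.fio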

Thanks
Gui

>
> Thanks
> Vivek
>
From: Gui Jianfeng on
Vivek Goyal wrote:
> On Mon, Jul 26, 2010 at 02:58:16PM +0800, Gui Jianfeng wrote:
>
> [..]
>> Hi Vivek,
>>
>> I did some tests on a single SATA disk on my desktop. With the patches
>> applied, no regression has shown up so far, and there is some performance
>> improvement in the "Direct Random Reader" (drr) case. Here are some numbers
>> from my box.
>>
>
> Thanks for testing, Gui. "iostest" seems to be working for you. If you had
> to make any fixes to get it working on your box, do send those to me, and I
> can commit them in my internal git tree.

Hi Vivek,

I didn't modify iostest at all; I just upgraded fio to 1.42.

Gui

>
> After running the script, you can also run "iostest -R <result-dir>", which
> will generate a report. It will not have all these "Starting test..." lines
> and looks nicer.
>
> Good to know that you don't see any regressions on a SATA disk in your
> cgroup testing with this patchset. The little improvement in "drr" might be
> because, with the existing slice_idle=0 behavior, we can still do some extra
> idling on the service tree, and the first patch in the series (V4) gets rid
> of that.
>
> Thanks
> Vivek
