From: Vivek Goyal on

Hi,

This is V4 of the patches for group_idle and CFQ group charge accounting in
terms of IOPS. Not much has changed since V3; this is mostly more testing and
a rebase on top of the for-2.6.36 branch of the block tree.

What's the problem
------------------
On high end storage (I tested on an HP EVA storage array with 12 SATA disks in
RAID 5), CFQ's model of dispatching requests from a single queue at a
time (sequential readers, sync writers, etc.) becomes a bottleneck.
Often we don't drive enough request queue depth to keep all the disks busy,
and we suffer a lot in terms of overall throughput.

These problems primarily originate from two things: idling on each cfq
queue, and the quantum (dispatching only a limited number of requests from a
single queue while not allowing dispatch from other queues in the meantime).
Once you set slice_idle=0 and raise quantum to a higher value, most of CFQ's
problems on higher end storage disappear.
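For reference, on a stock CFQ (without this series) that workaround is just
two writes to the iosched tunables under sysfs. A minimal sketch in Python,
assuming the device is sdb and is using the cfq scheduler (the device name
and the quantum value of 16 are illustrative):

import os

# Illustrative device; substitute the block device backing your test LUN.
IOSCHED = "/sys/block/sdb/queue/iosched"

def set_tunable(name, value):
    """Write a CFQ iosched tunable and read it back for confirmation."""
    path = os.path.join(IOSCHED, name)
    with open(path, "w") as f:
        f.write(str(value))
    with open(path) as f:
        print(name, "=", f.read().strip())

set_tunable("slice_idle", 0)   # stop idling on individual cfq queues
set_tunable("quantum", 16)     # 16 is just an example of "a higher value"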

This problem also becomes visible in the IO controller, where one creates
multiple groups and gets fairness, but overall throughput is lower. In
the following table, I am running an increasing number of sequential readers
(1, 2, 4, 8) in 8 groups of weights 100 to 800 (a sketch of the cgroup setup
follows the first set of results below).

Kernel=2.6.35-blktree-group_idle+
GROUPMODE=1 NRGRP=8 DEV=/dev/dm-3
Workload=bsr iosched=cfq Filesz=512M bs=4K
gi=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
--- --- -- ---------------------------------------------------------------
bsr 1 1 6519 12742 16801 23109 28694 35988 43175 49272 216300
bsr 1 2 5522 10922 17174 22554 24151 30488 36572 42021 189404
bsr 1 4 4593 9620 13120 21405 25827 28097 33029 37335 173026
bsr 1 8 3622 8277 12557 18296 21775 26022 30760 35713 157022


Notice that overall throughput is just around 160 MB/s with 8 sequential readers
in each group.
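For reference, the cgroup setup behind the cgrp1..cgrp8 columns amounts to
roughly the following sketch in Python (the /cgroup/blkio mount point and the
group names are illustrative, not the exact paths my test script uses;
blkio.weight is the per-group proportional weight knob):

import os

BLKIO_ROOT = "/cgroup/blkio"   # assumed cgroup-v1 blkio mount point

def make_group(name, weight):
    """Create a blkio cgroup and give it a proportional weight."""
    path = os.path.join(BLKIO_ROOT, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "blkio.weight"), "w") as f:
        f.write(str(weight))
    return path

def add_task(group_path, pid):
    """Move a process into the group so its IO is accounted there."""
    with open(os.path.join(group_path, "tasks"), "w") as f:
        f.write(str(pid))

# Eight groups with weights 100..800, matching cgrp1..cgrp8 above.
groups = [make_group("cgrp%d" % i, i * 100) for i in range(1, 9)]

# Each sequential reader is then started and its pid written into its group,
# e.g. add_task(groups[0], reader_pid) for a reader running in cgrp1.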

With this patch set applied, I set slice_idle=0 and re-ran the same test.

Kernel=2.6.35-blktree-group_idle+
GROUPMODE=1 NRGRP=8 DEV=/dev/dm-3
Workload=bsr iosched=cfq Filesz=512M bs=4K
gi=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
--- --- -- ---------------------------------------------------------------
bsr 1 1 6652 12341 17335 23856 28740 36059 42833 48487 216303
bsr 1 2 10168 20292 29827 38363 45746 52842 60071 63957 321266
bsr 1 4 11176 21763 32713 42970 53222 58613 63598 69296 353351
bsr 1 8 11750 23718 34102 47144 56975 63613 69000 69666 375968

Notice how overall throughput has shot up to 350-370 MB/s while retaining the
ability to do IO control.

So this is not the default mode. The new group_idle tunable allows one to
set slice_idle=0 to disable some of CFQ's features and rely primarily on the
group service differentiation feature.
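In sysfs terms, switching a device into this mode is just the following
(again only a sketch; the device name is illustrative, and group_idle=8
mirrors the runs above):

# Keep idling only at the group level: per-queue idling off, group idling on.
for name, value in (("slice_idle", 0), ("group_idle", 8)):
    with open("/sys/block/sdb/queue/iosched/%s" % name, "w") as f:
        f.write(str(value))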

By default nothing should change for CFQ and this change should be fairly
low risk.

Thanks
Vivek