From: Christoph Hellwig on
On Wed, Jul 21, 2010 at 03:06:18PM -0400, Vivek Goyal wrote:
> On high end storage (I got an HP EVA storage array with 12 SATA disks in
> RAID 5),

That's actually quite low end storage for a server these days :)

> So this is not the default mode. This new tunable, group_idle, allows one to
> set slice_idle=0 to disable some of the CFQ features and primarily use the
> group service differentiation feature.

While this is better than before, needing a sysfs tweak to get any
performance out of any kind of server-class hardware is still pretty
horrible. And slice_idle=0 is not exactly the most obvious parameter
I would look for either. So having some way to automatically disable
this mode based on hardware characteristics would be really useful,
and if that's not possible, at least make sure it's very obviously
documented and easily found using web searches.
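
For reference, the tweak in question boils down to a couple of per-device CFQ
sysfs knobs; a minimal sketch, assuming /dev/sdb is the array LUN, CFQ is the
active scheduler, and the quantum value is just an illustrative choice:

  echo cfq > /sys/block/sdb/queue/scheduler
  echo 0   > /sys/block/sdb/queue/iosched/slice_idle   # stop per-queue idling
  echo 16  > /sys/block/sdb/queue/iosched/quantum      # allow a deeper dispatch batch per queue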

Btw, what effect does slice_idle=0 with your patches have on single SATA
disk and single SSD setups?

From: Vivek Goyal on
On Thu, Jul 22, 2010 at 01:56:02AM -0400, Christoph Hellwig wrote:
> On Wed, Jul 21, 2010 at 03:06:18PM -0400, Vivek Goyal wrote:
> > On high end storage (I got an HP EVA storage array with 12 SATA disks in
> > RAID 5),
>
> That's actually quite low end storage for a server these days :)
>

Yes it is. Just that this is the best I got access to. :-)

> > So this is not the default mode. This new tunable, group_idle, allows one to
> > set slice_idle=0 to disable some of the CFQ features and primarily use the
> > group service differentiation feature.
>
> While this is better than before, needing a sysfs tweak to get any
> performance out of any kind of server-class hardware is still pretty
> horrible. And slice_idle=0 is not exactly the most obvious parameter
> I would look for either. So having some way to automatically disable
> this mode based on hardware characteristics would be really useful,

An IO scheduler able to change its behavior based on the underlying storage
properties would be the ideal and most convenient thing. For that we will need
some kind of auto-tuning feature in CFQ, where we monitor the ongoing
IO (for sequentiality, for block size) and then try to make some
predictions about the storage properties.

Auto-tuning is a little hard to implement, though. So I thought that as a first
step we can make sure things work reasonably well with the help of tunables and
then look into auto-tuning the stuff.

I was actually thinking of writing a user space utility which can issue
some specific IO patterns to the disk/lun and set up some IO scheduler
tunables automatically.
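
A very rough sketch of what such a utility could do, purely as an illustration
(nothing like this exists yet; the device name, values and the rotational-flag
shortcut are all made up, and a real tool would issue actual IO patterns rather
than trusting the hint):

  #!/bin/sh
  # Hypothetical CFQ auto-tuner sketch; /dev/sdb and all values are examples.
  DEV=${1:-sdb}
  Q=/sys/block/$DEV/queue

  # Crude stand-in for real probing: trust the rotational hint. A RAID LUN of
  # rotating disks still reports itself as rotational, which is exactly why
  # the real utility would need to probe with IO patterns instead.
  if [ "$(cat $Q/rotational)" = "0" ]; then
          echo 0  > $Q/iosched/slice_idle   # idling buys nothing here
          echo 16 > $Q/iosched/quantum
  else
          echo 8  > $Q/iosched/slice_idle   # keep stock behavior for a single disk
  fi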

> and if that's not possible, at least make sure it's very obviously
> documented and easily found using web searches.

Sure. I think I will create a new file, Documentation/block/cfq-iosched.txt,
and document this new mode there. Because this mode is primarily useful
for group scheduling, I will also add some info in
Documentation/cgroups/blkio-controller.txt.

>
> Btw, what effect does slice_idle=0 with your patches have on single SATA
> disk and single SSD setups?

I am not expecting any major effect of IOPS mode on a non-group setup on
any kind of storage.

IOW, currently if one sets slice_idle=0 in CFQ, then we kind of become almost
like deadline (with some differences here and there). The notion of ioprio
almost disappears, except that in some cases you can still see some
service differentiation among queues of different prio levels.

With this patchset, one would switch to IOPS mode with slice_idle=0. We
will still show deadline-ish behavior. The only difference will be that
there will be no service differentiation among ioprio levels.

I am not bothering with fixing it currently, because in slice_idle=0 mode the
notion of ioprio is so weak and unpredictable that I think it is not worth
fixing at this point in time. If somebody is looking for service
differentiation with slice_idle=0, using cgroups might turn out to be a
better bet.
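
As a very rough illustration of that route with the existing blkio controller
(group names and weights below are made up; blkio.weight accepts 100-1000):

  mkdir -p /cgroup/blkio
  mount -t cgroup -o blkio none /cgroup/blkio
  mkdir /cgroup/blkio/high /cgroup/blkio/low
  echo 800 > /cgroup/blkio/high/blkio.weight
  echo 100 > /cgroup/blkio/low/blkio.weight
  echo 1 > /sys/block/sdb/queue/iosched/group_isolation   # full isolation per group
  echo $$ > /cgroup/blkio/high/tasks                      # current shell joins the high group

Any IO started from that shell is then accounted to the high-weight group.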

In summary, in a non-cgroup setup with slice_idle=0, one should not see a
significant change with this patchset on any kind of storage. With
slice_idle=0, CFQ stops idling and achieves much better throughput, and
even in IOPS mode it will continue doing that.

The difference is primarily visible to cgroup users, where we get better
accounting done in IOPS mode and are able to provide service differentiation
among groups in a more predictable manner.

Thanks
Vivek
From: Vivek Goyal on
On Thu, Jul 22, 2010 at 03:08:00PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > Hi,
> >
> > This is V3 of the group_idle and CFQ IOPS mode implementation patchset. Since V2
> > I have cleaned up the code a bit to clarify the lingering confusion about
> > which cases charge time slice and which cases charge number of
> > requests.
> >
> > What's the problem
> > ------------------
> > On high end storage (I got an HP EVA storage array with 12 SATA disks in
> > RAID 5), CFQ's model of dispatching requests from a single queue at a
> > time (sequential readers/write sync writers etc), becomes a bottleneck.
> > Often we don't drive enough request queue depth to keep all the disks busy
> > and suffer a lot in terms of overall throughput.
> >
> > All these problems primarily originate from two things: idling on a
> > per-cfq-queue basis, and the quantum (dispatching a limited number of requests
> > from a single queue while not allowing dispatch from other queues). Once
> > you set slice_idle=0 and raise the quantum, most of CFQ's problems on
> > higher end storage disappear.
> >
> > This problem also becomes visible in IO controller where one creates
> > multiple groups and gets the fairness but overall throughput is less. In
> > the following table, I am running increasing number of sequential readers
> > (1,2,4,8) in 8 groups of weight 100 to 800.
> >
> > Kernel=2.6.35-rc5-iops+
> > GROUPMODE=1 NRGRP=8
> > DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
> > Workload=bsr iosched=cfq Filesz=512M bs=4K
> > group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> > =========================================================================
> > AVERAGE[bsr] [bw in KB/s]
> > -------
> > job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
> > --- --- -- ---------------------------------------------------------------
> > bsr 3 1 6186 12752 16568 23068 28608 35785 42322 48409 213701
> > bsr 3 2 5396 10902 16959 23471 25099 30643 37168 42820 192461
> > bsr 3 4 4655 9463 14042 20537 24074 28499 34679 37895 173847
> > bsr 3 8 4418 8783 12625 19015 21933 26354 29830 36290 159249
> >
> > Notice that overall throughput is just around 160MB/s with 8 sequential readers
> > in each group.
> >
> > With this patch set, I set slice_idle=0 and re-ran the same test.
> >
> > Kernel=2.6.35-rc5-iops+
> > GROUPMODE=1 NRGRP=8
> > DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
> > Workload=bsr iosched=cfq Filesz=512M bs=4K
> > group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> > =========================================================================
> > AVERAGE[bsr] [bw in KB/s]
> > -------
> > job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
> > --- --- -- ---------------------------------------------------------------
> > bsr 3 1 6523 12399 18116 24752 30481 36144 42185 48894 219496
> > bsr 3 2 10072 20078 29614 38378 46354 52513 58315 64833 320159
> > bsr 3 4 11045 22340 33013 44330 52663 58254 63883 70990 356520
> > bsr 3 8 12362 25860 37920 47486 61415 47292 45581 70828 348747
> >
> > Notice how overall throughput has shot up to 348MB/s while retaining the ability
> > to do IO control.
> >
> > So this is not the default mode. This new tunable, group_idle, allows one to
> > set slice_idle=0 to disable some of the CFQ features and primarily use the
> > group service differentiation feature.
> >
> > If you have thoughts on other ways of solving the problem, I am all
> > ears.
>
> Hi Vivek
>
> Would you attach your fio job config file?
>

Hi Gui,

I have written a fio-based test script, "iostest", to be able to do
cgroup and other IO scheduler testing more smoothly, and I am using
that. I am attaching the compressed script with this mail. Try it out,
and if it works for you and you find it useful, I can think of hosting a
git tree somewhere.

I used the following command lines for the tests above.

# iostest <block-device> -G -w bsr -m 8 -c --nrgrp 8 --total

And with slice idle disabled:

# iostest <block-device> -G -w bsr -m 8 -c --nrgrp 8 --total -I 0
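
If you want a quick approximation without the script, something along these
lines should generate one group's worth of the bsr (buffered sequential read)
workload. The job file below is only a sketch matching the Filesz/bs settings
above, not what iostest actually emits; iostest additionally creates the
cgroups and moves each job into its own group.

  cat > bsr.fio << 'EOF'
  [global]
  directory=/mnt/iostestmnt/fio
  rw=read
  bs=4k
  size=512M
  ioengine=sync
  direct=0
  group_reporting

  [bsr-reader]
  numjobs=8
  EOF
  fio bsr.fio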

Thanks
Vivek
From: Vivek Goyal on
On Thu, Jul 22, 2010 at 01:56:02AM -0400, Christoph Hellwig wrote:
> On Wed, Jul 21, 2010 at 03:06:18PM -0400, Vivek Goyal wrote:
> > On high end storage (I got an HP EVA storage array with 12 SATA disks in
> > RAID 5),
>
> That's actually quite low end storage for a server these days :)
>
> > So this is not the default mode. This new tunable, group_idle, allows one to
> > set slice_idle=0 to disable some of the CFQ features and primarily use the
> > group service differentiation feature.
>
> While this is better than before, needing a sysfs tweak to get any
> performance out of any kind of server-class hardware is still pretty
> horrible. And slice_idle=0 is not exactly the most obvious parameter
> I would look for either. So having some way to automatically disable
> this mode based on hardware characteristics would be really useful,
> and if that's not possible, at least make sure it's very obviously
> documented and easily found using web searches.
>
> Btw, what effect does slice_idle=0 with your patches have on single SATA
> disk and single SSD setups?

Well, after responding to your mail in the morning, I realized that my answer
was convoluted and not very clear.

That forced me to change the patch a bit. With the new patches (yet to be
posted), the answer to your question is that nothing will change for single
SATA disk or SSD setups with slice_idle=0 and my patches.

Why? CFQ is using two different algorithms for cfq queue and cfq group
scheduling. This IOPS mode will only affect group scheduling and not
the cfqq scheduling.

So switching to IOPS mode should not change anything for non-cgroup users on
any kind of storage. It will impact only group scheduling users, who will start
seeing fairness among groups in terms of IOPS and not time. Of course,
slice_idle needs to be set to 0 only on high end storage, so that we get
fairness among groups in IOPS while at the same time achieving the full
potential of the storage box.

Thanks
Vivek
From: Christoph Hellwig on
To me this sounds like slice_idle=0 is the right default then, as it
gives useful behaviour for all systems Linux runs on. Setups with
more than a few spindles are for sure more common than setups making
use of cgroups, especially given that cgroups are more of a high end
feature you'd rarely use on a single SATA spindle anyway. So setting
a parameter to make this useful sounds like the much better option.

Especially given that the block cgroup code doesn't work particularly
well in the presence of barriers, which are on for any kind of real-life
production setup anyway.
