From: Vivek Goyal on
On Sat, Oct 03, 2009 at 07:56:18AM +0200, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote:
> > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
> >
> > > If you could do a cleaned up version of your overload patch based on
> > > this:
> > >
> > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> > >
> > > then lets take it from there.
>

> Note to self: build the darn thing after last minute changes.
>
> Block: Delay overloading of CFQ queues to improve read latency.
>
> Introduce a maximum dispatch delay timestamp, and stamp it when:
> 1. we encounter a known seeky or possibly new sync IO queue.
> 2. the current queue may go idle and we're draining async IO.
> 3. we have sync IO in flight and are servicing an async queue.
> 4. we are not the sole user of the disk.
> Disallow exceeding the quantum if any of these events has occurred recently.
>

So it looks like the primary issue is that we have done a lot of
dispatch from the async queue, and if some sync queue comes in now, it
will experience latencies.

For an ongoing seeky sync queue the issue will be solved to some
extent, because previously we did not choose to idle for that queue but
now we will; hence the async queue will not get a chance to overload
the dispatch queue.

For the sync queues where we choose not to enable idling, we will still
see the latencies. Instead of timestamping on all the above events, can
we just keep track of the last sync request completed in the system, and
not allow the async queue to flood/overload the dispatch queue within a
certain time limit of that last sync completion? This gives the sync
queue a buffer period to come back and submit more requests without
suffering large latencies.
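
Something like the following is what I have in mind (completely untested
sketch; ->last_sync_end is a made-up field and the window size is
arbitrary, I just reused the sync slice here):

	/* in cfq_completed_request(): remember when sync IO last finished */
	if (cfq_cfqq_sync(cfqq))
		cfqd->last_sync_end = jiffies;

	/* in cfq_dispatch_requests(): while inside the buffer window, cap
	 * async dispatch at the plain quantum, i.e. no overloading */
	if (!cfq_cfqq_sync(cfqq) &&
	    time_before(jiffies, cfqd->last_sync_end + cfqd->cfq_slice[1]) &&
	    cfqq->dispatched >= max_dispatch)
		return 0;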

Thanks
Vivek


> Protect this behavioral change with a "desktop_dispatch" knob and default
> it to "on", providing an easy means of regression verification prior to
> hate-mail dispatch :) to the CC list.
>
> Signed-off-by: Mike Galbraith <efault(a)gmx.de>
> Cc: Jens Axboe <jens.axboe(a)oracle.com>
> Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
> Cc: Andrew Morton <akpm(a)linux-foundation.org>
> ... others who let a somewhat hacky tweak slip by
>
> ---
> block/cfq-iosched.c | 45 +++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 41 insertions(+), 4 deletions(-)
>
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c
> +++ linux-2.6/block/cfq-iosched.c
> @@ -174,6 +174,9 @@ struct cfq_data {
> unsigned int cfq_slice_async_rq;
> unsigned int cfq_slice_idle;
> unsigned int cfq_desktop;
> + unsigned int cfq_desktop_dispatch;
> +
> + unsigned long desktop_dispatch_ts;
>
> struct list_head cic_list;
>
> @@ -1283,6 +1286,7 @@ static int cfq_dispatch_requests(struct
> struct cfq_data *cfqd = q->elevator->elevator_data;
> struct cfq_queue *cfqq;
> unsigned int max_dispatch;
> + unsigned long delay;
>
> if (!cfqd->busy_queues)
> return 0;
> @@ -1297,19 +1301,26 @@ static int cfq_dispatch_requests(struct
> /*
> * Drain async requests before we start sync IO
> */
> - if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> + if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
> + cfqd->desktop_dispatch_ts = jiffies;
> return 0;
> + }
>
> /*
> * If this is an async queue and we have sync IO in flight, let it wait
> */
> - if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> + if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
> + cfqd->desktop_dispatch_ts = jiffies;
> return 0;
> + }
>
> max_dispatch = cfqd->cfq_quantum;
> if (cfq_class_idle(cfqq))
> max_dispatch = 1;
>
> + if (cfqd->busy_queues > 1)
> + cfqd->desktop_dispatch_ts = jiffies;
> +
> /*
> * Does this cfqq already have too much IO in flight?
> */
> @@ -1327,6 +1338,16 @@ static int cfq_dispatch_requests(struct
> return 0;
>
> /*
> + * Don't start overloading until we've been alone for a bit.
> + */
> + if (cfqd->cfq_desktop_dispatch) {
> + delay = cfqd->desktop_dispatch_ts + cfq_slice_sync;
> +
> + if (time_before(jiffies, delay))
> + return 0;
> + }
> +
> + /*
> * we are the only queue, allow up to 4 times of 'quantum'
> */
> if (cfqq->dispatched >= 4 * max_dispatch)
> @@ -1942,7 +1963,7 @@ static void
> cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> struct cfq_io_context *cic)
> {
> - int old_idle, enable_idle;
> + int old_idle, enable_idle, seeky = 0;
>
> /*
> * Don't idle for async or idle io prio class
> @@ -1950,10 +1971,20 @@ cfq_update_idle_window(struct cfq_data *
> if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
> return;
>
> + if (cfqd->hw_tag) {
> + if (CIC_SEEKY(cic))
> + seeky = 1;
> + /*
> + * If seeky or incalculable seekiness, delay overloading.
> + */
> + if (seeky || !sample_valid(cic->seek_samples))
> + cfqd->desktop_dispatch_ts = jiffies;
> + }
> +
> enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>
> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> - (!cfqd->cfq_desktop && cfqd->hw_tag && CIC_SEEKY(cic)))
> + (!cfqd->cfq_desktop && seeky))
> enable_idle = 0;
> else if (sample_valid(cic->ttime_samples)) {
> if (cic->ttime_mean > cfqd->cfq_slice_idle)
> @@ -2483,6 +2514,9 @@ static void *cfq_init_queue(struct reque
> cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
> cfqd->cfq_slice_idle = cfq_slice_idle;
> cfqd->cfq_desktop = 1;
> + cfqd->cfq_desktop_dispatch = 1;
> +
> + cfqd->desktop_dispatch_ts = INITIAL_JIFFIES;
> cfqd->hw_tag = 1;
>
> return cfqd;
> @@ -2553,6 +2587,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd-
> SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
> SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
> SHOW_FUNCTION(cfq_desktop_show, cfqd->cfq_desktop, 0);
> +SHOW_FUNCTION(cfq_desktop_dispatch_show, cfqd->cfq_desktop_dispatch, 0);
> #undef SHOW_FUNCTION
>
> #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
> @@ -2585,6 +2620,7 @@ STORE_FUNCTION(cfq_slice_async_store, &c
> STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
> UINT_MAX, 0);
> STORE_FUNCTION(cfq_desktop_store, &cfqd->cfq_desktop, 0, 1, 0);
> +STORE_FUNCTION(cfq_desktop_dispatch_store, &cfqd->cfq_desktop_dispatch, 0, 1, 0);
> #undef STORE_FUNCTION
>
> #define CFQ_ATTR(name) \
> @@ -2601,6 +2637,7 @@ static struct elv_fs_entry cfq_attrs[] =
> CFQ_ATTR(slice_async_rq),
> CFQ_ATTR(slice_idle),
> CFQ_ATTR(desktop),
> + CFQ_ATTR(desktop_dispatch),
> __ATTR_NULL
> };
>
>
From: Corrado Zoccolo on
On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal(a)redhat.com> wrote:
> On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
>> In fact I think that the 'rotating' flag name is misleading.
>> All the checks we are doing are actually checking if the device truly
>> supports multiple parallel operations, and this feature is shared by
>> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
>> NCQ-enabled SATA disks.
>>
>
> While we are at it, what happens to the notion of task priority on SSDs?
This is not changed by the proposed patch w.r.t. current CFQ.
> Without idling there is no continuous time slice and there is no
> fairness. So is ioprio out of the window for SSDs?
I don't have NCQ-enabled SSDs here, so I can't test it, but it seems to
me that the way in which queues are sorted in the rr tree may still
provide some sort of fairness and service differentiation between
priorities, in terms of number of IOs.
Non-NCQ SSDs, instead, will still have the idle window enabled, so it
is not an issue for them.
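
To illustrate the rr tree point: the service tree key is already biased
by priority, roughly like this (paraphrasing cfq_service_tree_add() and
cfq_slice_offset() from memory, so take the details with a grain of
salt):

	/* smaller offset for higher priority, so a high priority queue is
	 * requeued nearer the front and gets dispatched more often */
	rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;

	static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
					      struct cfq_queue *cfqq)
	{
		return (cfqd->busy_queues - 1) *
			(cfq_prio_slice(cfqd, 1, 0) -
			 cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
	}

Even without idling, that keying still hands out dispatch opportunities
in priority order, which is where the per-IO differentiation would come
from.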
>
> On SSDs, will it make more sense to provide fairness in terms of number
> of IOs or size of IO, and not in terms of time slices?
Not on all SSDs. There are still ones that have a non-negligible
penalty on non-sequential access patterns (hopefully only the ones
without NCQ; if we find otherwise, then we will have to benchmark
access time in the I/O scheduler to select the best policy). For those,
time-based scheduling may still be needed.

Thanks,
Corrado

>
> Thanks
> Vivek
From: Jens Axboe on
On Sat, Oct 03 2009, Corrado Zoccolo wrote:
> Hi,
> On Sat, Oct 3, 2009 at 11:00 AM, Mike Galbraith <efault(a)gmx.de> wrote:
> > On Sat, 2009-10-03 at 09:24 +0200, Jens Axboe wrote:
> >
> >> After shutting down the computer yesterday, I was thinking a bit about
> >> this issue and how to solve it without incurring too much delay. If we
> >> add a stricter control of the depth, that may help. So instead of
> >> allowing up to max_quantum (or larger) depths, only allow gradual build
> >> up of the depth the farther we get away from a dispatch from the sync IO
> >> queues. For example, when switching to an async or seeky sync queue,
> >> initially allow just 1 in flight. For the next round, if there still
> >> hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue
> >> again, immediately drop to 1.
> >>
>
> I would limit just async I/O. Seeky sync queues are automatically
> throttled by being sync, and already have high latency, so we
> shouldn't increase it artificially. I think, instead, that we should
> send multiple seeky requests (possibly coming from different queues)
> at once. They will help especially with raid devices, where the seeks
> for requests going to different disks will happen in parallel.
>
Async is the prime offender, definitely.
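
Something along these lines is the shape I have in mind (not even
compile tested; ->async_depth and ->last_ramp are invented fields):

	/* in cfq_dispatch_requests(), replacing the fixed async quantum */
	if (cfq_cfqq_sync(cfqq)) {
		cfqd->async_depth = 1;		/* sync IO seen: drop to 1 */
	} else {
		if (cfqq->dispatched >= cfqd->async_depth)
			return 0;		/* this round's depth used up */
		if (time_after(jiffies,
			       cfqd->last_ramp + cfqd->cfq_slice_idle)) {
			/* another quiet period without sync IO: double up */
			cfqd->async_depth = min(2 * cfqd->async_depth,
						4 * cfqd->cfq_quantum);
			cfqd->last_ramp = jiffies;
		}
	}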

> >> It could tie in with (or partly replace) the overload feature. The key
> >> to good latency and decent throughput is knowing when to allow queue
> >> build up and when not to.
> >
> > Hm. Starting at 1 sounds a bit thin (like IDLE), multiple iterations to
> > build/unleash any sizable IO, but that's just my gut talking.
> >
> On the other hand, sending 1 write first and then waiting for it to
> complete before submitting new ones will allow more merges, so the
> subsequent requests will be bigger and thus more efficient.

Usually async writes stack up very quickly, so as long as you don't
drain completely, the merging will happen automagically anyway.

--
Jens Axboe

From: Ryo Tsuruta on
Hi,

Munehiro Ikeda <m-ikeda(a)ds.jp.nec.com> wrote:
> Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> > Before finishing this mail, will throw a whacky idea in the ring. I was
> > going through the request based dm-multipath paper. Will it make sense
> > to implement request based dm-ioband? So basically we implement all the
> > group scheduling in CFQ and let dm-ioband implement a request function
> > to take the request and break it back into bios. This way we can keep
> > all the group control at one place and also meet most of the requirements.
> >
> > So request based dm-ioband will have a request in hand once that request
> > has passed group control and prio control. Because dm-ioband is a device
> > mapper target, one can put it on higher level devices (practically moving
> > CFQ up to the higher level device), and provide fairness there. One can also
> > put it on those SSDs which don't use an IO scheduler (this is kind of forcing
> > them to use the IO scheduler).
> >
> > I am sure there will be many issues, but one big issue I can think of is that
> > CFQ thinks that there is one device beneath it and dispatches requests
> > from one queue (in case of idling), and that would kill parallelism at
> > the higher layer; throughput will suffer on many of the dm/md configurations.
> >
> > Thanks
> > Vivek
>
> As long as CFQ is used, your idea is reasonable to me. But how about
> other IO schedulers? In my understanding, one of the keys to guaranteeing
> group isolation in your patch is to have per-group IO scheduler internal
> queues even with the as, deadline, and noop schedulers. I think this is a
> great idea, and implementing generic code for all IO schedulers was the
> conclusion we reached when we had so many IO scheduler specific proposals.
> If we still need per-group IO scheduler internal queues with
> request-based dm-ioband, we have to modify the elevator layer. That seems
> out of scope for dm.
> I might be missing something...

IIUC, the request-based device-mapper cannot break a request back
into bios, so it cannot work with block devices which don't use the
IO scheduler.

How about adding a callback function to the higher level controller?
CFQ calls it when the active queue runs out of time; the higher level
controller then uses it as a trigger or a hint to switch IO groups, so
I think a time-based controller could be implemented at a higher level.
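
A minimal sketch of the kind of hook I mean (the names are illustrative
only, nothing like this exists today):

	/* registered on the queue by the higher level controller */
	struct io_slice_notifier {
		void (*slice_expired)(struct request_queue *q, void *data);
		void *data;
	};

	/* in CFQ, when the active queue's time slice runs out */
	if (timed_out && q->slice_notifier)
		q->slice_notifier->slice_expired(q, q->slice_notifier->data);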

My requirements for an IO controller are:
- Implemented as a higher level controller, located at the block
layer, where bios are grabbed in generic_make_request().
- Can work with any type of IO scheduler.
- Can work with any type of block device.
- Support for multiple policies: proportional weight, max rate, time
based, and so on.

The IO controller mini-summit will be held next week, and I'm looking
forward to meeting you all and discussing the IO controller.
https://sourceforge.net/apps/trac/ioband/wiki/iosummit

Thanks,
Ryo Tsuruta
From: Vivek Goyal on
On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote:
> Hi,
>
> Munehiro Ikeda <m-ikeda(a)ds.jp.nec.com> wrote:
> > Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> > > Before finishing this mail, will throw a whacky idea in the ring. I was
> > > going through the request based dm-multipath paper. Will it make sense
> > > to implement request based dm-ioband? So basically we implement all the
> > > group scheduling in CFQ and let dm-ioband implement a request function
> > > to take the request and break it back into bios. This way we can keep
> > > all the group control at one place and also meet most of the requirements.
> > >
> > > So request based dm-ioband will have a request in hand once that request
> > > has passed group control and prio control. Because dm-ioband is a device
> > > mapper target, one can put it on higher level devices (practically moving
> > > CFQ up to the higher level device), and provide fairness there. One can also
> > > put it on those SSDs which don't use an IO scheduler (this is kind of forcing
> > > them to use the IO scheduler).
> > >
> > > I am sure there will be many issues, but one big issue I can think of is that
> > > CFQ thinks that there is one device beneath it and dispatches requests
> > > from one queue (in case of idling), and that would kill parallelism at
> > > the higher layer; throughput will suffer on many of the dm/md configurations.
> > >
> > > Thanks
> > > Vivek
> >
> > As long as CFQ is used, your idea is reasonable to me. But how about
> > other IO schedulers? In my understanding, one of the keys to guaranteeing
> > group isolation in your patch is to have per-group IO scheduler internal
> > queues even with the as, deadline, and noop schedulers. I think this is a
> > great idea, and implementing generic code for all IO schedulers was the
> > conclusion we reached when we had so many IO scheduler specific proposals.
> > If we still need per-group IO scheduler internal queues with
> > request-based dm-ioband, we have to modify the elevator layer. That seems
> > out of scope for dm.
> > I might be missing something...
>
> IIUC, the request-based device-mapper cannot break a request back
> into bios, so it cannot work with block devices which don't use the
> IO scheduler.
>

I think the current request-based multipath driver does not do it, but
couldn't it be implemented so that requests are broken back into bios?
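
In principle the target's request_fn could just walk the request and
resubmit its bios to the underlying device, something like this (very
rough; it ignores completion handling and bio ownership, and
underlying_bdev stands for whatever device the target maps to):

	struct bio *bio;

	/* the request has already passed group and prio control; hand its
	 * bios down to a device that does not use an IO scheduler */
	__rq_for_each_bio(bio, rq) {
		bio->bi_bdev = underlying_bdev;
		generic_make_request(bio);
	}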

Anyway, I don't feel too strongly about this approach, as it might
introduce more serialization at the higher layer.

> How about adding a callback function to the higher level controller?
> CFQ calls it when the active queue runs out of time; the higher level
> controller then uses it as a trigger or a hint to switch IO groups, so
> I think a time-based controller could be implemented at a higher level.
>

Adding a callback should not be a big issue. But that means you are
planning to run only one group at the higher layer at a time, and I
think that's the problem, because then we are introducing serialization
at the higher layer. So for any higher level device mapper target which
has multiple physical disks under it, we might underutilize those disks
even more and take a big hit on overall throughput.

The whole point of doing proportional weight at the lower layer is
optimal usage of the system.

> My requirements for an IO controller are:
> - Implemented as a higher level controller, located at the block
> layer, where bios are grabbed in generic_make_request().

How are you planning to handle the issue of buffered writes Andrew raised?

> - Can work with any type of IO scheduler.
> - Can work with any type of block device.
> - Support for multiple policies: proportional weight, max rate, time
> based, and so on.
>
> The IO controller mini-summit will be held in next week, and I'm
> looking forard to meet you all and discuss about IO controller.
> https://sourceforge.net/apps/trac/ioband/wiki/iosummit

Is there a new version of dm-ioband now where you have solved the issue
of sync/async dispatch within a group? Before meeting at the mini-summit,
I am trying to run some tests and come up with numbers so that we have a
clearer picture of the pros/cons.

Thanks
Vivek