From: Jeff Moyer on
Vivek Goyal <vgoyal(a)redhat.com> writes:

> - Currently in CFQ there are many situations where we don't know how
> much time slice has been consumed by a queue. For example, all
> the random reader/writer queues where we don't idle on
> individual queues and we expire the queue immediately
> after the request dispatch.
>
> - In this case the time consumed by a queue is just the cost of a
> memory copy operation. Meaningful time measurement is possible only
> if we idle on a queue and allow dispatch from it for a significant
> amount of time.
>
> - As of today, in such cases we calculate the time since the
> dispatch from the queue started and charge all of that time.
> Generally this rounds to 1 jiffy, but in some cases it can
> be more. For example, we may be driving a high request queue
> depth while the driver is busy and does not ask for new
> requests for 8-10 jiffies. In such cases, the active queue
> gets charged very unfairly.
>
> - So fundamentally, the whole notion of charging for a time slice
> is valid only if we have been idling on the queue. Otherwise,
> in an NCQ queue there might be other requests on the queue
> and we cannot do the time slice calculation.
>
> - This patch tweaks the slice charging logic a bit so that
> in the cases where we can't know the amount of time, we
> start charging in terms of the number of requests dispatched
> (IOPS). This practically switches the CFQ fairness model to
> fairness in terms of IOPS with slice_idle=0.
>
> - As of today this will primarily be useful only with the
> group_idle patches, so that we get fairness in terms of
> IOPS across groups. The idea is that on fast storage
> one can run CFQ with slice_idle=0 and still get the IO
> controller working without losing too much
> throughput.

I'm not fluent in the cgroup code, my apologies for that. However, just
trying to make sense of this is giving me a headache. Now, in some
cases you are using IOPS *in place of* jiffies. How are we to know
which is which and in what cases?

It sounds like this is addressing an important problem, but I'm having a
hard time picking out what that problem is. Is this problem noticeable
for competing sync-noidle workloads (competing between groups, that is)?
If not, then what?

Thanks,
Jeff

> Signed-off-by: Vivek Goyal <vgoyal(a)redhat.com>
> ---
> block/cfq-iosched.c | 24 +++++++++++++++++++++---
> 1 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 7982b83..f44064c 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -896,16 +896,34 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
> * if there are mutiple queues in the group, each can dispatch
> * a single request on seeky media and cause lots of seek time
> * and group will never know it.
> + *
> + * If drive is NCQ and we are driving deep queue depths, then
> + * it is not reasonable to charge the slice since dispatch
> + * started because this time will include time taken by all
> + * the other requests in the queue.
> + *
> + * Actually there is no reasonable way to know the disk time
> + * here and we need to come up with some approximation. If
> + * disk is non NCQ, we should be driving request queue depth
> + * 1, then charge for time since dispatch start and this will
> + * account for seek time properly on seeky media. If request
> + * queue depth is high, then charge for number of requests
> + * dispatched from the queue. This will sort of becoming
> + * charging in terms of IOPS.
> */
> - slice_used = max_t(unsigned, (jiffies - cfqq->dispatch_start),
> - 1);
> + if (cfqq->cfqd->hw_tag == 0)
> + slice_used = max_t(unsigned,
> + (jiffies - cfqq->dispatch_start), 1);
> + else
> + slice_used = cfqq->slice_dispatch;
> } else {
> slice_used = jiffies - cfqq->slice_start;
> if (slice_used > cfqq->allocated_slice)
> slice_used = cfqq->allocated_slice;
> }
>
> - cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
> + cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
> + cfqq->slice_dispatch);
> return slice_used;
> }
From: Vivek Goyal on
On Mon, Jul 19, 2010 at 02:47:20PM -0400, Jeff Moyer wrote:
> Vivek Goyal <vgoyal(a)redhat.com> writes:
>
> > - Currently in CFQ there are many situations where don't know how
> > much time slice has been consumed by a queue. For example, all
> > the random reader/writer queues where we don't idle on
> > individual queues and we expire the queue either immediately
> > after the request dispatch.
> >
> > - In this case time consumed by a queue is just a memory copy
> > operation. Actually time measurement is possible only if we
> > idle on a queue and allow dispatch from a queue for significant
> > amount of time.
> >
> > - As of today, in such cases we calculate the time since the
> > dispatch from the queue started and charge all that time.
> > Generally this rounds to 1 jiffy but in some cases it can
> > be more. For example, if we are driving high request queue
> > depth and driver is too busy and does not ask for new
> > requests for 8-10 jiffies. In such cases, the active queue
> > gets charged very unfairly.
> >
> > - So fundamentally, whole notion of charging for time slice
> > is valid only if we have been idling on the queue. Otherwise
> > in an NCQ queue, there might be other requests on the queue
> > and we can not do the time slice calculation.
> >
> > - This patch tweaks the slice charging logic a bit so that
> > in the cases where we can't know the amount of time, we
> > start charging in terms of number of requests dispatched
> > (IOPS). This practically switching CFQ fairness model to
> > fairness in terms of IOPS with slice_idle=0.
> >
> > - As of today this will primarily be useful only with
> > group_idle patches so that we get fairness in terms of
> > IOPS across groups. The idea is that on fast storage
> > one can run CFQ with slice_idle=0 and still get IO
> > controller working without losing too much of
> > throughput.
>
> I'm not fluent in the cgroup code, my apologies for that. However, just
> trying to make sense of this is giving me a headache. Now, in some
> cases you are using IOPS *in place of* jiffies. How are we to know
> which is which and in what cases?

Yes, it is mixed now for the default CFQ case. Wherever we don't have
the capability to determine slice_used, we charge IOPS.

For the slice_idle=0 case, we should charge IOPS almost all the time,
though if there is a workload where a single cfqq can keep the request
queue saturated, the current code will still charge in terms of time.

I agree that this is a little confusing. Maybe in the case of
slice_idle=0 we can always charge in terms of IOPS.
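
A rough, untested sketch of what that could look like in the non-idling
branch of cfq_cfqq_slice_usage() (purely illustrative; it reuses
cfqq->slice_dispatch from this patch and the existing cfq_slice_idle
tunable in cfq_data):

	if (!cfqq->cfqd->cfq_slice_idle)
		/* IOPS mode: charge one unit per request dispatched */
		slice_used = max_t(unsigned, cfqq->slice_dispatch, 1);
	else
		/* time mode: charge wall clock time since dispatch start */
		slice_used = max_t(unsigned,
				   (jiffies - cfqq->dispatch_start), 1);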

>
> It sounds like this is addressing an important problem, but I'm having a
> hard time picking out what that problem is. Is this problem noticable
> for competing sync-noidle workloads (competing between groups, that is)?
> If not, then what?

I noticed the problem with competing workloads in different groups. With
slice_idle=0, we will drive a full queue depth of 32. Sometimes, when we
hit a high queue depth, say 32, the driver did not ask for new requests
for a few jiffies. So, say for 10-12 ms, requests only completed and new
requests did not get issued. In that case, all of this 10-12 ms gets
charged to the active queue, even though the active queue did not get
to dispatch more than one request. This queue was just unfortunate to be
active at that time. Higher-weight queues often run into this situation
because CFQ tries to keep them as the active queue more often.

So if you are driving a full queue depth, where the NCQ request queue
has requests pending from multiple queues and groups, you have no way to
measure the time. My impression is that on fast devices we can no longer
stick to the model of measuring time. If we switch to an IOPS model,
then we can drive deeper request queue depths and keep the device
saturated, while at the same time achieving group IO control.
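
To put some made-up numbers on it: with 32 requests outstanding from,
say, 8 queues and a completion burst arriving after 10 jiffies, there is
no sensible way to attribute those 10 jiffies to any single queue.
Charging each queue its dispatched count instead (4 requests each in
this example) is at least well defined and independent of when the
driver happens to ask for more work.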

Thanks
Vivek

>
> Thanks,
> Jeff
>
> > Signed-off-by: Vivek Goyal <vgoyal(a)redhat.com>
> > ---
> > block/cfq-iosched.c | 24 +++++++++++++++++++++---
> > 1 files changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index 7982b83..f44064c 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -896,16 +896,34 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
> > * if there are mutiple queues in the group, each can dispatch
> > * a single request on seeky media and cause lots of seek time
> > * and group will never know it.
> > + *
> > + * If drive is NCQ and we are driving deep queue depths, then
> > + * it is not reasonable to charge the slice since dispatch
> > + * started because this time will include time taken by all
> > + * the other requests in the queue.
> > + *
> > + * Actually there is no reasonable way to know the disk time
> > + * here and we need to come up with some approximation. If
> > + * disk is non NCQ, we should be driving request queue depth
> > + * 1, then charge for time since dispatch start and this will
> > + * account for seek time properly on seeky media. If request
> > + * queue depth is high, then charge for number of requests
> > + * dispatched from the queue. This will sort of becoming
> > + * charging in terms of IOPS.
> > */
> > - slice_used = max_t(unsigned, (jiffies - cfqq->dispatch_start),
> > - 1);
> > + if (cfqq->cfqd->hw_tag == 0)
> > + slice_used = max_t(unsigned,
> > + (jiffies - cfqq->dispatch_start), 1);
> > + else
> > + slice_used = cfqq->slice_dispatch;
> > } else {
> > slice_used = jiffies - cfqq->slice_start;
> > if (slice_used > cfqq->allocated_slice)
> > slice_used = cfqq->allocated_slice;
> > }
> >
> > - cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
> > + cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
> > + cfqq->slice_dispatch);
> > return slice_used;
> > }
From: Divyesh Shah on
On Mon, Jul 19, 2010 at 11:58 AM, Vivek Goyal <vgoyal(a)redhat.com> wrote:
> Yes it is mixed now for default CFQ case. Whereever we don't have the
> capability to determine the slice_used, we charge IOPS.
>
> For slice_idle=0 case, we should charge IOPS almost all the time. Though
> if there is a workload where single cfqq can keep the request queue
> saturated, then current code will charge in terms of time.
>
> I agree that this is little confusing. May be in case of slice_idle=0
> we can always charge in terms of IOPS.

I agree with Jeff that this is very confusing. Also, there is
absolutely no guarantee against one job ending up getting charged in
IOPS while other jobs continue getting charged in time for their IOs.
Depending on the speed of the disk, this could be a huge advantage or
disadvantage for the cgroup being charged in IOPS.

It should be black or white, time or IOPS, and it should be very clearly
called out, not just in code comments but in the Documentation too.
From: Vivek Goyal on
On Mon, Jul 19, 2010 at 01:32:24PM -0700, Divyesh Shah wrote:
> On Mon, Jul 19, 2010 at 11:58 AM, Vivek Goyal <vgoyal(a)redhat.com> wrote:
> > Yes it is mixed now for default CFQ case. Whereever we don't have the
> > capability to determine the slice_used, we charge IOPS.
> >
> > For slice_idle=0 case, we should charge IOPS almost all the time. Though
> > if there is a workload where single cfqq can keep the request queue
> > saturated, then current code will charge in terms of time.
> >
> > I agree that this is little confusing. May be in case of slice_idle=0
> > we can always charge in terms of IOPS.
>
> I agree with Jeff that this is very confusing. Also there are
> absolutely no bets that one job may end up getting charged in IOPs for
> this behavior while other jobs continue getting charged in timefor
> their IOs. Depending on the speed of the disk, this could be a huge
> advantage or disadvantage for the cgroup being charged in IOPs.
>
> It should be black or white, time or IOPs and also very clearly called
> out not just in code comments but in the Documentation too.

Ok, how about always charging in IOPS when slice_idle=0?

So on fast devices, the admin or a user-space tool can set slice_idle=0,
and CFQ starts doing accounting in IOPS instead of time. On slow devices
we continue to run with slice_idle=8 and nothing changes.

Personally I feel that it is hard to sustain time-based logic on high-end
devices and still get good throughput. We could make CFQ a dual-mode kind
of scheduler which is capable of doing accounting both in terms of time
and in terms of IOPS. When slice_idle != 0, we do accounting in terms of
time and it is the same CFQ as today. When slice_idle=0, CFQ starts
accounting in terms of IOPS.
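
To make the IOPS mode concrete with made-up numbers: suppose groups A
(weight 100) and B (weight 200) are both continuously backlogged on a
fast device with slice_idle=0. Every time a queue is expired it is
charged its dispatched request count instead of elapsed jiffies, and
that charge feeds the same weighted group accounting as the time charge
does today. So by the time both groups have received equal weighted
service, A will have dispatched roughly half as many requests as B
(say 100 vs 200), no matter how long each individual request took at
the device. With slice_idle=8 the charge is still the measured time
slice and nothing changes.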

I think this change should bring us one step closer to our goal of one
IO scheduler for all devices.

Jens, what do you think?

Thanks
Vivek
From: Corrado Zoccolo on
On Mon, Jul 19, 2010 at 10:44 PM, Vivek Goyal <vgoyal(a)redhat.com> wrote:
> On Mon, Jul 19, 2010 at 01:32:24PM -0700, Divyesh Shah wrote:
>> On Mon, Jul 19, 2010 at 11:58 AM, Vivek Goyal <vgoyal(a)redhat.com> wrote:
>> > Yes it is mixed now for default CFQ case. Whereever we don't have the
>> > capability to determine the slice_used, we charge IOPS.
>> >
>> > For slice_idle=0 case, we should charge IOPS almost all the time. Though
>> > if there is a workload where single cfqq can keep the request queue
>> > saturated, then current code will charge in terms of time.
>> >
>> > I agree that this is little confusing. May be in case of slice_idle=0
>> > we can always charge in terms of IOPS.
>>
>> I agree with Jeff that this is very confusing. Also there are
>> absolutely no bets that one job may end up getting charged in IOPs for
>> this behavior while other jobs continue getting charged in timefor
>> their IOs. Depending on the speed of the disk, this could be a huge
>> advantage or disadvantage for the cgroup being charged in IOPs.
>>
>> It should be black or white, time or IOPs and also very clearly called
>> out not just in code comments but in the Documentation too.
>
> Ok, how about always charging in IOPS when slice_idle=0?
>
> So on fast devices, admin/user space tool, can set slice_idle=0, and CFQ
> starts doing accounting in IOPS instead of time. On slow devices we
> continue to run with slice_idle=8 and nothing changes.
>
> Personally I feel that it is hard to sustain time based logic on high end
> devices and still get good throughput. We could make CFQ a dual mode kind
> of scheduler which is capable of doing accouting both in terms of time as
> well as IOPS. When slice_idle !=0, we do accounting in terms of time and
> it will be same CFQ as of today. When slice_idle=0, CFQ starts accounting
> in terms of IOPS.
There is another mode in which cfq can operate: for NCQ SSDs, it
basically ignores slice_idle and operates as if it were 0.
This mode should also be handled as an IOPS-counting mode.
SSD mode, though, differs from rotational mode in its definition of
"seekiness", and we should consider whether this mode is also
appropriate for the other hardware where slice_idle=0 is beneficial.
>
> I think this change should bring us one step closer to our goal of one
> IO sheduler for all devices.

I think this is an interesting instance of a more general problem: cfq
needs a cost function applicable to all requests on any hardware. The
current function is a concrete one (measured time), but unfortunately
it is not always applicable, because:
- for fast hardware the resolution is too coarse (this can be fixed
using higher resolution timers)
- for hardware that allows parallel dispatching, we can't measure the
cost of a single request (can we try something like average cost of
the requests executed in parallel?).
IOPS, instead, is a synthetic cost measure. It is a simplified model
that will approximate some devices (SSDs) better than others
(multi-spindle rotational disks). But if we want to go down the
synthetic path, we can have more complex measures that also take into
account other parameters, such as the sequentiality of the requests,
their size and so on, all parameters that may still have some impact
on high-end devices.
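
Just to give a flavour of the kind of synthetic measure I mean (purely
illustrative; the function name, constants and weighting below are made
up, only the blk_rq_pos()/blk_rq_sectors() helpers are real):

/*
 * Illustrative only: cost of one request in abstract service units.
 * One base unit per request (pure IOPS), plus a size term and a seek
 * penalty when the request does not follow the previous one.
 */
static unsigned int cfq_synthetic_cost(struct request *rq, sector_t last_end)
{
	unsigned int cost = 1;				/* base cost per IOP */

	cost += blk_rq_sectors(rq) / 256;		/* ~1 extra unit per 128KB */
	if (blk_rq_pos(rq) != last_end)			/* non-sequential: seek penalty */
		cost += 4;

	return cost;
}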

Thanks,
Corrado
>
> Jens, what do you think?
>
> Thanks
> Vivek