block: Implement a blk_yield function to voluntarily give up the I/O scheduler. [Kernel]

Prev: [2.6.35-rc1 regression] tg3 vpd r/w failed
Next: perf: add perfconfig option to specify buildid cache dir

From: Jeff Moyer on 1 Jun 2010 16:10

Vivek Goyal <vgoyal(a)redhat.com> writes:

> On Tue, May 18, 2010 at 02:20:18PM -0400, Jeff Moyer wrote:
>> This patch implements a blk_yield to allow a process to voluntarily
>> give up its I/O scheduler time slice. This is desirable for those processes
>> which know that they will be blocked on I/O from another process, such as
>> the file system journal thread. Following patches will put calls to blk_yield
>> into jbd and jbd2.
>>
>> Signed-off-by: Jeff Moyer <jmoyer(a)redhat.com>
>> ---
>> block/blk-core.c | 13 +++++
>> block/blk-settings.c | 6 +++
>> block/cfq-iosched.c | 112 +++++++++++++++++++++++++++++++++++++++++++++-
>> block/elevator.c | 8 +++
>> include/linux/blkdev.h | 4 ++
>> include/linux/elevator.h | 3 +
>> 6 files changed, 144 insertions(+), 2 deletions(-)
>>

[...]

> @@ -2138,7 +2142,8 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> * ok to wait for this request to complete.
> */
> if (cfqq->cfqg->nr_cfqq == 1 && RB_EMPTY_ROOT(&cfqq->sort_list)
> - && cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
> + && cfqq->dispatched && !cfq_cfqq_yield(cfqq) &&
> + cfq_should_idle(cfqd, cfqq)) {
> cfqq = NULL;

> I think if we just place cfq_cfqq_yield(cfqq), check above it, we don't
> need above code modification?

Yep.

> This might result in some group losing fairness in certain scenarios, but
> I guess we will tackle it once we face it. For the time being if fsync
> thread is giving up cpu, journald commits will come in root group and
> there might not be a point in wasting time idling on this group.

ok.

[...]

>> @@ -2191,6 +2199,98 @@ keep_queue:
>> return cfqq;
>> }
>>
>> +static void cfq_yield(struct request_queue *q, struct task_struct *tsk)
>> +{
>> + struct cfq_data *cfqd = q->elevator->elevator_data;
>> + struct cfq_io_context *cic, *new_cic;
>> + struct cfq_queue *cfqq, *sync_cfqq, *async_cfqq, *new_cfqq;
>> +
>> + cic = cfq_cic_lookup(cfqd, current->io_context);
>> + if (!cic)
>> + return;
>> +
>> + task_lock(tsk);
>> + new_cic = cfq_cic_lookup(cfqd, tsk->io_context);
>> + atomic_long_inc(&tsk->io_context->refcount);
>> + task_unlock(tsk);
>> + if (!new_cic)
>> + goto out_dec;
>> +
>> + spin_lock_irq(q->queue_lock);
>> +
>> + /*
>> + * This is primarily called to ensure that the long synchronous
>> + * time slice does not prevent other I/O happenning (like journal
>> + * commits) while we idle waiting for it. Thus, check to see if the
>> + * current cfqq is the sync cfqq for this process.
>> + */
>> + cfqq = cic_to_cfqq(cic, 1);
>> + if (!cfqq)
>> + goto out_unlock;
>> +
>> + if (cfqd->active_queue != cfqq)
>> + goto out_unlock;
>
> Why does yielding queue has to be active queue? Could it be possible that
> a queue submitted a bunch of IO and did yield. It is possible that
> yielding queue is not active queue. Even in that case we will like to mark
> cfqq yielding and expire queue after dispatch of pending requests?

You're absolutely right, this should not be a limitation in the code.

>> +
>> + sync_cfqq = cic_to_cfqq(new_cic, 1);
>> + async_cfqq = cic_to_cfqq(new_cic, 0);
>> + if (!sync_cfqq && !async_cfqq)
>> + goto out_unlock;
>> + if (!RB_EMPTY_ROOT(&sync_cfqq->sort_list))
>> + new_cfqq = sync_cfqq;
>> + else
>> + new_cfqq = async_cfqq;
>> +
>
> Why is it mandatory that target context has already the queue setup. If
> there is no queue setup, can't we just yield the slice and let somebody
> else run?

The only callers of the yield method are yielding to a kernel thread
that is always going to exist. I didn't see a need to add any
complexity here for a case that doesn't yet exist.

> Oh, I guess you want to keep the slice and idle till target queue
> dispatches some requests and sets up context and a queue? But even that
> will not help as you have anyway missed the yield event?

No, see above.

>> + /*
>> + * If we are currently servicing the SYNC_NOIDLE_WORKLOAD, and we
>> + * are idling on the last queue in that workload, *and* the average
>> + * think time is larger thank the remaining slice time, go ahead
>> + * and yield the queue. Otherwise, don't yield so that fsync-heavy
>> + * workloads don't starve out the sync-noidle workload.
>> + */
>> + if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
>> + (!sample_valid(cfqq->service_tree->ttime_samples) ||
>> + cfqq->slice_end - jiffies > cfqq->service_tree->ttime_mean)) {
>> + /*
>
> This is confusing. The yielding queue's stats are already part of service
> tree's think time. So if yielding queue has done bunch of IO, then service
> tree's mean think time will be low and we will not yield? I guess that's
> the reason you are trying to check average number of queues in following
> code.

Sorry for confusing you. Your deduction is correct. Is there a way I
can write this that would be less confusing to you?

>> + * If there's only a single sync-noidle queue on average,
>> + * there's no point in idling in order to not starve the
>> + * non-existent other queues.
>> + */
>> + if (cfq_group_busy_queues_wl(SYNC_NOIDLE_WORKLOAD, cfqd,
>> + cfqq->cfqg) > 1)
>
> Does cfq_group_busy_queues_wl() give average number of queues. Were you
> looking for cfqg->busy_queues_avg?

Yes, thanks again.

>
>> + goto out_unlock;
>> + }
>> +
>> + cfq_log_cfqq(cfqd, cfqq, "yielding queue");
>> +
>> + /*
>> + * If there are other requests pending, just mark the queue as
>> + * yielding and give up our slice after the last request is
>> + * dispatched. Or, if there are no requests currently pending
>> + * on the selected queue, keep our time slice until such a time
>> + * as the new queue has something to do. It will then preempt
>> + * this queue (see cfq_should_preempt).
>> + */
>> + if (!RB_EMPTY_ROOT(&cfqq->sort_list) ||
>> + RB_EMPTY_ROOT(&new_cfqq->sort_list)) {
>> + cfqq->yield_to = new_cic;
>
> Where are you clearing yield_to flag? Can't seem to find it.

yield_to is not a flag. ;-) It is not cleared as it is only valid if
the cfq_cfqq_yield flag is set. If you would find it more sensible, I
can certainly add code to clear it.

>> + cfq_mark_cfqq_yield(cfqq);
>> + goto out_unlock;
>> + }
>> +
>> + /*
>> + * At this point, we know that the target queue has requests pending.
>> + * Yield the io scheduler.
>> + */
>> + __cfq_slice_expired(cfqd, cfqq, 1);
>> + __cfq_set_active_queue(cfqd, new_cfqq);
>
> What happens to cfq_group considerations? So we can yield slices across
> groups. That does not seem right. If existing group has requests on other
> service trees, these should be processed first.
>
> Can we handle all this logic in select_queue(). I think it might turn out
> to be simpler. I mean in this function if we just mark the existing queue
> as yielding and also setup who to yield the queue to and let
> select_queue() take the decision whether to idle or choose new queue etc.
> I guess it should be possible and code might turn out to be simpler to
> read.

I'll give that a shot, thanks for the suggestion. My concern is that,
when employing cgroups, you may never actually yield the queue depending
on where the file system's journal thread ends up.

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Jeff Moyer on 23 Jun 2010 11:00

Andrew Morton <akpm(a)linux-foundation.org> writes:

> On Tue, 22 Jun 2010 17:35:00 -0400 Jeff Moyer <jmoyer(a)redhat.com> wrote:
>
>> This patch implements a blk_yield to allow a process to voluntarily
>> give up its I/O scheduler time slice. This is desirable for those processes
>> which know that they will be blocked on I/O from another process, such as
>> the file system journal thread. Following patches will put calls to blk_yield
>> into jbd and jbd2.
>>
>
> I'm looking through this patch series trying to find the
> analysis/description of the cause for this (bad) performance problem.
> But all I'm seeing is implementation stuff :( It's hard to review code
> with your eyes shut.

Sorry about that, Andrew. The problem case is (for example) iozone when
run with small file sizes (up to 8MB) configured to fsync after each
file is written. Because the iozone process is issuing synchronous
writes, it is put onto CFQ's SYNC service tree. The significance of
this is that CFQ will idle for up to 8ms waiting for requests on such
queues. So, what happens is that the iozone process will issue, say,
64KB worth of write I/O. That I/O will just land in the page cache.
Then, the iozone process does an fsync which forces those I/Os to disk
as synchronous writes. Then, the file system's fsync method is invoked,
and for ext3/4, it calls log_start_commit followed by log_wait_commit.
Because those synchronous writes were forced out in the context of the
iozone process, CFQ will now idle on iozone's cfqq waiting for more I/O.
However, iozone's progress is gated by the journal thread, now.

So, I tried two approaches to solving the problem. The first, which
Christoph brought up again in this thread, was to simply mark all
journal I/O as BIO_RW_META, which would cause the iozone process' cfqq
to be preempted when the journal issued its I/O. However, Vivek pointed
out that this was bad for interactive performance.

The second approach, of which this is the fourth iteration, was to allow
the file system to explicitly tell the I/O scheduler that it is waiting
on I/O from another process.

Does that help? Let me know if you have any more questions, and thanks
a ton for looking at this, Andrew. I appreciate it.

The comments I've elided from my response make perfect sense, so I'll
address them in the next posting.

>> };
>> #define CFQ_RB_ROOT (struct cfq_rb_root) { .rb = RB_ROOT, .left = NULL, \
>> - .count = 0, .min_vdisktime = 0, }
>> + .count = 0, .min_vdisktime = 0, .last_expiry = 0UL, \
>> + .last_pid = (pid_t)-1, }
>
> May as well leave the 0 and NULL fields unmentioned (ie: don't do
> crappy stuff because the old code did crappy stuff!)

I don't actually understand why you take issue with this.

>> +{
>> + struct cfq_data *cfqd = q->elevator->elevator_data;
>> + struct cfq_io_context *cic, *new_cic;
>> + struct cfq_queue *cfqq;
>> +
>> + cic = cfq_cic_lookup(cfqd, current->io_context);
>> + if (!cic)
>> + return;
>> +
>> + task_lock(tsk);
>> + new_cic = cfq_cic_lookup(cfqd, tsk->io_context);
>> + atomic_long_inc(&tsk->io_context->refcount);
>
> How do we know tsk has an io_context? Use get_io_context() and check
> its result?

I'll fix that up. It works now only by luck (and the fact that there's
a good chance the journal thread has an i/o context).

>> + task_unlock(tsk);
>> + if (!new_cic)
>> + goto out_dec;
>> +
>> + spin_lock_irq(q->queue_lock);
>> +
>> + cfqq = cic_to_cfqq(cic, 1);
>> + if (!cfqq)
>> + goto out_unlock;
>> +
>> + /*
>> + * If we are currently servicing the SYNC_NOIDLE_WORKLOAD, and we
>> + * are idling on the last queue in that workload, *and* there are no
>> + * potential dependent readers running currently, then go ahead and
>> + * yield the queue.
>> + */
>
> Comment explains the code, but doesn't explain the *reason* for the code.

Actually, it explains more than just what the code does. It would be
difficult for one to divine that the code actually only really cares
about breaking up a currently running dependent reader. I'll see if I
can make that more clear.

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Jeff Moyer on 25 Jun 2010 13:00

Vivek Goyal <vgoyal(a)redhat.com> writes:

> On Tue, Jun 22, 2010 at 05:35:00PM -0400, Jeff Moyer wrote:
>
> [..]
>> @@ -1614,6 +1620,15 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> cfq_clear_cfqq_wait_request(cfqq);
>> cfq_clear_cfqq_wait_busy(cfqq);
>>
>> + if (!cfq_cfqq_yield(cfqq)) {
>> + struct cfq_rb_root *st;
>> + st = service_tree_for(cfqq->cfqg,
>> + cfqq_prio(cfqq), cfqq_type(cfqq));
>> + st->last_expiry = jiffies;
>> + st->last_pid = cfqq->pid;
>> + }
>> + cfq_clear_cfqq_yield(cfqq);
>
> Jeff, I think cfqq is still on service tree at this point of time. If yes,
> then we can simply use cfqq->service_tree, instead of calling
> service_tree_for().

Yup.

> No clearing of cfqq->yield_to field?

Nope. Again, it's not required, but if you really want me to, I'll add
it.

> [..]
>> /*
>> * Select a queue for service. If we have a current active queue,
>> * check whether to continue servicing it, or retrieve and set a new one.
>> @@ -2187,6 +2232,10 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
>> * have been idling all along on this queue and it should be
>> * ok to wait for this request to complete.
>> */
>> + if (cfq_cfqq_yield(cfqq) &&
>> + cfq_should_yield_now(cfqq, &new_cfqq))
>> + goto expire;
>> +
>
> I think we can get rid of this condition here and move the yield check
> above outside above if condition. This if condition waits for request to
> complete from this queue and waits for queue to get busy before slice
> expiry. If we have decided to yield the queue, there is no point in
> waiting for next request for queue to get busy.

Yeah, this is a vestige of the older code layout. Thanks, this cleans
things up nicely.

>> + cfq_log_cfqq(cfqd, cfqq, "yielding queue to %d", tsk->pid);
>> + cfqq->yield_to = new_cic;
>
> We are stashing away a pointer to cic without taking reference?

There is no reference counting on the cic.

>> @@ -3123,6 +3234,13 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>> if (!cfqq)
>> return false;
>>
>> + /*
>> + * If the active queue yielded its timeslice to this queue, let
>> + * it preempt.
>> + */
>> + if (cfq_cfqq_yield(cfqq) && RQ_CIC(rq) == cfqq->yield_to)
>> + return true;
>> +
>
> I think we need to again if if we are sync-noidle workload then allow
> preemption only if no dependent read is currently on, otherwise
> sync-noidle service tree loses share.

I think you mean don't yield if there is a dependent reader. Yeah,
makes sense.

> This version looks much simpler than previous one and is much easier
> to understand. I will do some testing on friday and provide you feedback.

Great, thanks again for the review!

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Jeff Moyer on 25 Jun 2010 16:00

Jens Axboe <axboe(a)kernel.dk> writes:

> On 25/06/10 18.51, Jeff Moyer wrote:
>>>> + cfq_log_cfqq(cfqd, cfqq, "yielding queue to %d", tsk->pid);
>>>> + cfqq->yield_to = new_cic;
>>>
>>> We are stashing away a pointer to cic without taking reference?
>>
>> There is no reference counting on the cic.
>
> Not on the cic itself, but on the io context it belongs to. So you
> need to grab a reference to that, if you are stowing a reference
> to the cic.

OK, easy enough. Thanks!

Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

|
Pages: 1
Prev: [2.6.35-rc1 regression] tg3 vpd r/w failed
Next: perf: add perfconfig option to specify buildid cache dir