From: Christoph Hellwig on
On Sat, Jun 26, 2010 at 08:10:45PM +1000, Nick Piggin wrote:
> But I'm sure apps can submit fsyncs much faster than once every
> few ms, like small database transactions.

fsync / O_SYNC should be irrelevant for the idling logic. Once those
return to the user, the data must have made it to the disk, and with our
barrier implementation that implies fully draining any outstanding I/O
on the device.
From: Nick Piggin on
On Sat, Jun 26, 2010 at 11:27:59AM +0200, Christoph Hellwig wrote:
> On Sat, Jun 26, 2010 at 07:25:56PM +1000, Nick Piggin wrote:
> > Biggest thing is multiple small files operations like on the same
> > directory. Best case I measured back when doing AS io scheduler
> > versus deadline was about 100x improvement on a uncached kernel
> > grep workload when competing with a streaming writeout (the writeout
> > probably ended up going somewhat slower naturally, but it is fairer).
>
> As I mentioned below I absolutely see the case for reads. A normal
> grep basically is a dependent-read kind of workload. But for writes
> it should either be O_SYNC-style workloads or batched I/O.

Sorry, I missed that. OK, that may well be true. One would hope
it was benchmarked before being merged.

But I'm sure apps can submit fsyncs much faster than once every
few ms, like small database transactions.
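
Something like the sketch below is all it takes: a tiny "transaction"
loop that appends a record and fsyncs it every time. (Purely
illustrative; file name and sizes are made up.)

/*
 * Illustration only: a small-transaction style workload that issues an
 * fsync after every tiny append. File name and record size are made up.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char rec[512];
	int i, fd = open("txlog.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(rec, 'x', sizeof(rec));

	for (i = 0; i < 10000; i++) {
		if (write(fd, rec, sizeof(rec)) != sizeof(rec)) {
			perror("write");
			break;
		}
		if (fsync(fd) < 0) {	/* every "commit" waits for the disk */
			perror("fsync");
			break;
		}
	}
	close(fd);
	return 0;
}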

From: Jens Axboe on
On Sat, Jun 26 2010, Christoph Hellwig wrote:
> > - Stop idling on all the WRITE_SYNC IO. There is no reasonable way to
> > tell whether there will be more IO or not from the application. This will
> > impact direct writes, O_SYNC writes and fsync().
> >
> > If a direct IO application is submitting writes with a delay in between,
> > it can be starved out in the presence of competing workloads.
>
> So what application does this?

It isn't about apps having a small delay between submissions;
not sure where Vivek gets that from. Even if you submit back
to back, there's still a very small window where the io scheduler
will switch to something else if there's other io pending. This
happens instantly when the sync request finishes: if we don't
idle for any given request, then of course we go and service
someone else with pending io.

The whole idling/anticipation is all about knowing when to
stall the queue very briefly for sync io, allowing that single
sync stream to make good progress for a while before switching
to something else. This switching back and forth potentially
destroys throughput for the O_DIRECT writer, especially for
disks with write-through caching.
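
(For illustration only, the kind of O_DIRECT writer I have in mind is a
back-to-back stream of aligned writes, roughly like the sketch below;
file name and sizes are made up.)

/*
 * Illustration only: an O_DIRECT writer streaming back-to-back aligned
 * writes -- the single sync stream that idling is meant to keep going.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ	4096

int main(void)
{
	void *buf;
	long i;
	int fd = open("stream.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign(&buf, BLKSZ, BLKSZ)) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}
	memset(buf, 0, BLKSZ);

	/* the next write is issued as soon as the previous one completes */
	for (i = 0; i < 256 * 1024; i++) {
		if (write(fd, buf, BLKSZ) != BLKSZ) {
			perror("write");
			break;
		}
	}
	free(buf);
	close(fd);
	return 0;
}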

Christoph, you seem not to agree with the concept of idling. As
Nick writes in another reply, the difference in performance
when you get it right is staggering. We don't idle the disk
for kicks and laughs. I won't rule out bugs, both in the
handling and in the signalling of idling. But as a concept, it
is definitely sound and proven.

--
Jens Axboe

From: Christoph Hellwig on
On Sat, Jun 26, 2010 at 01:20:55PM +0200, Jens Axboe wrote:
> The whole idling/anticipation is all about knowing when to
> stall the queue very briefly for sync io, allowing that single
> sync stream to make good progress for a while before switching
> to something else. This switching back and forth potentially
> destroys throughput for the O_DIRECT writer, especially for
> disks with write-through caching.
>
> Christoph, you seem not to agree with the concept of idling.

I'm still trying to understand the use case. Basically CFQ gets
worse results for any workload I'm looking at, be that from the
filesystem developer's point of view or the virtualization point
of view. And it tends to get worse the more intelligence is added
to CFQ.

If we could actually narrow down what it's supposed to help
with into useful benchmarks that can be trivially reproduced, a
lot of things would be easier.
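
(To make that concrete: the read half of Nick's grep example could
presumably be approximated by a dependent-read walk like the sketch
below, run cold-cache against a directory of small files while a
streaming writer runs on the same disk. Directory name and buffer size
are made up.)

/*
 * Illustration only: a dependent-read walk in the spirit of an uncached
 * grep -- each file is read to the end before the next one is opened,
 * so every read waits on the previous completion.
 */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char path[4096], buf[65536];
	struct dirent *de;
	DIR *dir = opendir("testdir");

	if (!dir) {
		perror("opendir");
		return 1;
	}
	while ((de = readdir(dir)) != NULL) {
		int fd;

		if (de->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path), "testdir/%s", de->d_name);
		fd = open(path, O_RDONLY);
		if (fd < 0)
			continue;
		while (read(fd, buf, sizeof(buf)) > 0)
			;	/* next read issued only after this one returns */
		close(fd);
	}
	closedir(dir);
	return 0;
}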

From: Jeff Moyer on
Vivek Goyal <vgoyal(a)redhat.com> writes:

> On Fri, Jun 25, 2010 at 01:03:20PM +0200, Christoph Hellwig wrote:
>> On Wed, Jun 23, 2010 at 09:44:20PM -0400, Vivek Goyal wrote:
>> I see the point of this logic for reads where various workloads have
>> dependent reads that might be close to each other, but I don't really
>> see any point for writes.
>>
>> > So it looks like the fsync path will do a bunch of IO and then wait for the
>> > jbd thread to finish the work. In this case idling is a waste of time.
>>
>> Given that ->writepage already does WRITE_SYNC_PLUG I/O, which includes
>> REQ_NOIDLE, I'm still confused why we still have that issue.
>
> In its current form, cfq honors REQ_NOIDLE conditionally, and that's why we
> still have the issue. If you look at cfq_completed_request(), we continue
> to idle in the following two cases.
>
> - If we classified the queue as SYNC_WORKLOAD.
> - If there is another random read/write happening on the sync-noidle service
> tree.
>
> SYNC_WORKLOAD means that cfq thinks this particular queue is doing sequential
> IO. For random IO queues, we don't idle on each individual queue but on a
> group of queues.
>
> In Jeff's testing, the fsync thread/queue is sometimes viewed as a sequential
> workload and goes on the SYNC_WORKLOAD tree. In that case, even if the request
> is REQ_NOIDLE, we will continue to idle, hence the fsync issue.

I'm now testing OCFS2, and I'm seeing performance that is not great
(even with the blk_yield patches applied). What happens is that we
successfully yield the queue to the journal thread, but then idle on the
journal thread (even though REQ_NOIDLE was set).

So, can we just get rid of idling when REQ_NOIDLE is set?

Vivek sent me this patch to test, and it got rid of the performance
issue for the fsync workload. Can we discuss its merits?

Thanks,
Jeff

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c	2010-06-25 15:57:33.832125786 -0400
+++ linux-2.6/block/cfq-iosched.c	2010-06-25 15:59:19.788876361 -0400
@@ -318,6 +318,7 @@
 	CFQ_CFQQ_FLAG_split_coop,	/* shared cfqq will be splitted */
 	CFQ_CFQQ_FLAG_deep,		/* sync cfqq experienced large depth */
 	CFQ_CFQQ_FLAG_wait_busy,	/* Waiting for next request */
+	CFQ_CFQQ_FLAG_group_idle,	/* This queue is doing group idle */
 };
 
 #define CFQ_CFQQ_FNS(name)						\
@@ -347,6 +348,7 @@
 CFQ_CFQQ_FNS(split_coop);
 CFQ_CFQQ_FNS(deep);
 CFQ_CFQQ_FNS(wait_busy);
+CFQ_CFQQ_FNS(group_idle);
 #undef CFQ_CFQQ_FNS
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
@@ -1613,6 +1615,7 @@
 
 	cfq_clear_cfqq_wait_request(cfqq);
 	cfq_clear_cfqq_wait_busy(cfqq);
+	cfq_clear_cfqq_group_idle(cfqq);
 
 	/*
 	 * If this cfqq is shared between multiple processes, check to
@@ -3176,6 +3179,13 @@
 	if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
 		return true;
 
+	/*
+	 * If we were doing group_idle and we got a new request in the same
+	 * group, preempt the queue
+	 */
+	if (cfq_cfqq_group_idle(cfqq))
+		return true;
+
 	if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
 		return false;
 
@@ -3271,6 +3281,7 @@
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
 
 	cfq_log_cfqq(cfqd, cfqq, "insert_request");
+	cfq_clear_cfqq_group_idle(cfqq);
 	cfq_init_prio_data(cfqq, RQ_CIC(rq)->ioc);
 
 	rq_set_fifo_time(rq, jiffies + cfqd->cfq_fifo_expire[rq_is_sync(rq)]);
@@ -3416,10 +3427,12 @@
 		 * SYNC_NOIDLE_WORKLOAD idles at the end of the tree
 		 * only if we processed at least one !rq_noidle request
 		 */
-		if (cfqd->serving_type == SYNC_WORKLOAD
-		    || cfqd->noidle_tree_requires_idle
-		    || cfqq->cfqg->nr_cfqq == 1)
+		if (cfqd->noidle_tree_requires_idle)
+			cfq_arm_slice_timer(cfqd);
+		else if (cfqq->cfqg->nr_cfqq == 1) {
+			cfq_mark_cfqq_group_idle(cfqq);
 			cfq_arm_slice_timer(cfqd);
+		}
 	}
 }

