ext3/4: enhance fsync performance when using cfq [Kernel]

Prev: [Bug #15518] CONFIG_NO_BOOTMEM=y breaks boot on 32bit
Next: [Bug #15655] corrupt ext3 fs and partial freeze

From: Vivek Goyal on 7 Apr 2010 17:50

On Wed, Apr 07, 2010 at 05:18:12PM -0400, Jeff Moyer wrote:
> Hi again,
>
> So, here's another stab at fixing this. This patch is very much an RFC,
> so do not pull it into anything bound for Linus. ;-) For those new to
> this topic, here is the original posting: http://lkml.org/lkml/2010/4/1/344
>
> The basic problem is that, when running iozone on smallish files (up to
> 8MB in size) and including fsync in the timings, deadline outperforms
> CFQ by a factor of about 5 for 64KB files, and by about 10% for 8MB
> files. From examining the blktrace data, it appears that iozone will
> issue an fsync() call, and will have to wait until it's CFQ timeslice
> has expired before the journal thread can run to actually commit data to
> disk.
>
> The approach below puts an explicit call into the filesystem-specific
> fsync code to yield the disk so that the jbd[2] process has a chance to
> issue I/O. This bring performance of CFQ in line with deadline.
>
> There is one outstanding issue with the patch that Vivek pointed out.
> Basically, this could starve out the sync-noidle workload if there is a
> lot of fsync-ing going on. I'll address that in a follow-on patch. For
> now, I wanted to get the idea out there for others to comment on.
>
> Thanks a ton to Vivek for spotting the problem with the initial
> approach, and for his continued review.
>

Thanks Jeff. Conceptually this appraoch makes lot of sense to me. Higher
layers explicitly telling CFQ not to idle/yield the slice.

My firefox timing test is perfoming much better now.

real 0m15.957s
user 0m0.608s
sys 0m0.165s

real 0m12.984s
user 0m0.602s
sys 0m0.148s

real 0m13.057s
user 0m0.624s
sys 0m0.145s

So we got to take care of two issues now.

- Make it work with dm/md devices also. Somehow shall have to propogate
this yield semantic down the stack.

- The issue of making sure we don't yield if we are servicing sync-noidle
tree and there is other IO going on which relies on sync-noidle tree
idling (as you have already mentioned).

[..]
> +static void cfq_yield(struct request_queue *q)
> +{
> + struct cfq_data *cfqd = q->elevator->elevator_data;
> + struct cfq_io_context *cic;
> + struct cfq_queue *cfqq;
> + unsigned long flags;
> +
> + cic = cfq_cic_lookup(cfqd, current->io_context);
> + if (!cic)
> + return;
> +
> + spin_lock_irqsave(q->queue_lock, flags);
> +
> + /*
> + * This is primarily called to ensure that the long synchronous
> + * time slice does not prevent other I/O happenning (like journal
> + * commits) while we idle waiting for it. Thus, check to see if the
> + * current cfqq is the sync cfqq for this process.
> + */
> + cfqq = cic_to_cfqq(cic, 1);
> + if (!cfqq)
> + goto out_unlock;
> +
> + if (cfqd->active_queue != cfqq)
> + goto out_unlock;
> +
> + cfq_log_cfqq(cfqd, cfqq, "yielding queue");
> +
> + __cfq_slice_expired(cfqd, cfqq, 1);
> + __blk_run_queue(cfqd->queue);

I think it would be good if we also check that cfqq is empty or not. If cfqq
is not empty we don't want to give up slice? But this is a minor point. At
least in case of fsync, we seem to be waiting for all the iozone request
to finish and then calling yield. So cfqq should be empty at this point of
time.

Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Jens Axboe on 8 Apr 2010 07:10

On Wed, Apr 07 2010, Jeff Moyer wrote:
> Hi again,
>
> So, here's another stab at fixing this. This patch is very much an RFC,
> so do not pull it into anything bound for Linus. ;-) For those new to
> this topic, here is the original posting: http://lkml.org/lkml/2010/4/1/344
>
> The basic problem is that, when running iozone on smallish files (up to
> 8MB in size) and including fsync in the timings, deadline outperforms
> CFQ by a factor of about 5 for 64KB files, and by about 10% for 8MB
> files. From examining the blktrace data, it appears that iozone will
> issue an fsync() call, and will have to wait until it's CFQ timeslice
> has expired before the journal thread can run to actually commit data to
> disk.
>
> The approach below puts an explicit call into the filesystem-specific
> fsync code to yield the disk so that the jbd[2] process has a chance to
> issue I/O. This bring performance of CFQ in line with deadline.
>
> There is one outstanding issue with the patch that Vivek pointed out.
> Basically, this could starve out the sync-noidle workload if there is a
> lot of fsync-ing going on. I'll address that in a follow-on patch. For
> now, I wanted to get the idea out there for others to comment on.
>
> Thanks a ton to Vivek for spotting the problem with the initial
> approach, and for his continued review.

I like the concept, it's definitely useful (and your results amply
demonstrate that). I was thinking if there was a way in through the ioc
itself, rather than bdi -> queue and like you are doing. But I can't
think of a nice way to do it, so this is probably as good as it gets.

> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index dee9d93..a76ccd1 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -2065,7 +2065,7 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
> cfqd->serving_group = cfqg;
>
> /* Restore the workload type data */
> - if (cfqg->saved_workload_slice) {
> + if (cfqg && cfqg->saved_workload_slice) {
> cfqd->workload_expires = jiffies + cfqg->saved_workload_slice;
> cfqd->serving_type = cfqg->saved_workload;
> cfqd->serving_prio = cfqg->saved_serving_prio;

Unrelated change?

> +static void cfq_yield(struct request_queue *q)
> +{
> + struct cfq_data *cfqd = q->elevator->elevator_data;
> + struct cfq_io_context *cic;
> + struct cfq_queue *cfqq;
> + unsigned long flags;
> +
> + cic = cfq_cic_lookup(cfqd, current->io_context);
> + if (!cic)
> + return;
> +
> + spin_lock_irqsave(q->queue_lock, flags);

spin_lock_irq() is sufficient here.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Jens Axboe on 8 Apr 2010 07:10

On Wed, Apr 07 2010, Vivek Goyal wrote:
> On Wed, Apr 07, 2010 at 05:18:12PM -0400, Jeff Moyer wrote:
> > Hi again,
> >
> > So, here's another stab at fixing this. This patch is very much an RFC,
> > so do not pull it into anything bound for Linus. ;-) For those new to
> > this topic, here is the original posting: http://lkml.org/lkml/2010/4/1/344
> >
> > The basic problem is that, when running iozone on smallish files (up to
> > 8MB in size) and including fsync in the timings, deadline outperforms
> > CFQ by a factor of about 5 for 64KB files, and by about 10% for 8MB
> > files. From examining the blktrace data, it appears that iozone will
> > issue an fsync() call, and will have to wait until it's CFQ timeslice
> > has expired before the journal thread can run to actually commit data to
> > disk.
> >
> > The approach below puts an explicit call into the filesystem-specific
> > fsync code to yield the disk so that the jbd[2] process has a chance to
> > issue I/O. This bring performance of CFQ in line with deadline.
> >
> > There is one outstanding issue with the patch that Vivek pointed out.
> > Basically, this could starve out the sync-noidle workload if there is a
> > lot of fsync-ing going on. I'll address that in a follow-on patch. For
> > now, I wanted to get the idea out there for others to comment on.
> >
> > Thanks a ton to Vivek for spotting the problem with the initial
> > approach, and for his continued review.
> >
>
> Thanks Jeff. Conceptually this appraoch makes lot of sense to me. Higher
> layers explicitly telling CFQ not to idle/yield the slice.
>
> My firefox timing test is perfoming much better now.
>
> real 0m15.957s
> user 0m0.608s
> sys 0m0.165s
>
> real 0m12.984s
> user 0m0.602s
> sys 0m0.148s
>
> real 0m13.057s
> user 0m0.624s
> sys 0m0.145s
>
> So we got to take care of two issues now.
>
> - Make it work with dm/md devices also. Somehow shall have to propogate
> this yield semantic down the stack.

The way that Jeff set it up, it's completely parallel to eg congestion
or unplugging. So that should be easily doable.

> - The issue of making sure we don't yield if we are servicing sync-noidle
> tree and there is other IO going on which relies on sync-noidle tree
> idling (as you have already mentioned).

Agree.

> > + /*
> > + * This is primarily called to ensure that the long synchronous
> > + * time slice does not prevent other I/O happenning (like journal
> > + * commits) while we idle waiting for it. Thus, check to see if the
> > + * current cfqq is the sync cfqq for this process.
> > + */
> > + cfqq = cic_to_cfqq(cic, 1);
> > + if (!cfqq)
> > + goto out_unlock;
> > +
> > + if (cfqd->active_queue != cfqq)
> > + goto out_unlock;
> > +
> > + cfq_log_cfqq(cfqd, cfqq, "yielding queue");
> > +
> > + __cfq_slice_expired(cfqd, cfqq, 1);
> > + __blk_run_queue(cfqd->queue);
>
> I think it would be good if we also check that cfqq is empty or not. If cfqq
> is not empty we don't want to give up slice? But this is a minor point. At
> least in case of fsync, we seem to be waiting for all the iozone request
> to finish and then calling yield. So cfqq should be empty at this point of
> time.

That's a bit of a problem. Say the queue has one request left, it'll
service that and the start idling. But we already got this hint that we
should yield, but as it happened before we reached the idling state,
we'll not yield. Perhaps it would be better to mark the cfqq as yielding
if it isn't ready to yield just yet.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Vivek Goyal on 8 Apr 2010 10:00

On Thu, Apr 08, 2010 at 01:00:45PM +0200, Jens Axboe wrote:
> On Wed, Apr 07 2010, Jeff Moyer wrote:
> > Hi again,
> >
> > So, here's another stab at fixing this. This patch is very much an RFC,
> > so do not pull it into anything bound for Linus. ;-) For those new to
> > this topic, here is the original posting: http://lkml.org/lkml/2010/4/1/344
> >
> > The basic problem is that, when running iozone on smallish files (up to
> > 8MB in size) and including fsync in the timings, deadline outperforms
> > CFQ by a factor of about 5 for 64KB files, and by about 10% for 8MB
> > files. From examining the blktrace data, it appears that iozone will
> > issue an fsync() call, and will have to wait until it's CFQ timeslice
> > has expired before the journal thread can run to actually commit data to
> > disk.
> >
> > The approach below puts an explicit call into the filesystem-specific
> > fsync code to yield the disk so that the jbd[2] process has a chance to
> > issue I/O. This bring performance of CFQ in line with deadline.
> >
> > There is one outstanding issue with the patch that Vivek pointed out.
> > Basically, this could starve out the sync-noidle workload if there is a
> > lot of fsync-ing going on. I'll address that in a follow-on patch. For
> > now, I wanted to get the idea out there for others to comment on.
> >
> > Thanks a ton to Vivek for spotting the problem with the initial
> > approach, and for his continued review.
>
> I like the concept, it's definitely useful (and your results amply
> demonstrate that). I was thinking if there was a way in through the ioc
> itself, rather than bdi -> queue and like you are doing. But I can't
> think of a nice way to do it, so this is probably as good as it gets.
>

I think, one issue with ioc based approach will be that it will then call
yield operation on all the devices in the system where this context has ever
done any IO. With bdi based approach this call will remain limited to
a smaller set of devices.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Jens Axboe on 8 Apr 2010 10:10

On Thu, Apr 08 2010, Vivek Goyal wrote:
> On Thu, Apr 08, 2010 at 01:00:45PM +0200, Jens Axboe wrote:
> > On Wed, Apr 07 2010, Jeff Moyer wrote:
> > > Hi again,
> > >
> > > So, here's another stab at fixing this. This patch is very much an RFC,
> > > so do not pull it into anything bound for Linus. ;-) For those new to
> > > this topic, here is the original posting: http://lkml.org/lkml/2010/4/1/344
> > >
> > > The basic problem is that, when running iozone on smallish files (up to
> > > 8MB in size) and including fsync in the timings, deadline outperforms
> > > CFQ by a factor of about 5 for 64KB files, and by about 10% for 8MB
> > > files. From examining the blktrace data, it appears that iozone will
> > > issue an fsync() call, and will have to wait until it's CFQ timeslice
> > > has expired before the journal thread can run to actually commit data to
> > > disk.
> > >
> > > The approach below puts an explicit call into the filesystem-specific
> > > fsync code to yield the disk so that the jbd[2] process has a chance to
> > > issue I/O. This bring performance of CFQ in line with deadline.
> > >
> > > There is one outstanding issue with the patch that Vivek pointed out.
> > > Basically, this could starve out the sync-noidle workload if there is a
> > > lot of fsync-ing going on. I'll address that in a follow-on patch. For
> > > now, I wanted to get the idea out there for others to comment on.
> > >
> > > Thanks a ton to Vivek for spotting the problem with the initial
> > > approach, and for his continued review.
> >
> > I like the concept, it's definitely useful (and your results amply
> > demonstrate that). I was thinking if there was a way in through the ioc
> > itself, rather than bdi -> queue and like you are doing. But I can't
> > think of a nice way to do it, so this is probably as good as it gets.
> >
>
> I think, one issue with ioc based approach will be that it will then call
> yield operation on all the devices in the system where this context has ever
> done any IO. With bdi based approach this call will remain limited to
> a smaller set of devices.

Oh, you'd want the bdi as well. And as I said, I don't think it was
workable, just trying to think it over and consider potentially other
ways to accomplish this.

At one point I had a patch that did the equivalant of this yield on
being scheduled out on the CPU side, which is probably why I was in the
ioc mindset.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

| Next | Last
Pages: 1 2 3 4 5 6
Prev: [Bug #15518] CONFIG_NO_BOOTMEM=y breaks boot on 32bit
Next: [Bug #15655] corrupt ext3 fs and partial freeze