From: Jeff Moyer on
Jens Axboe <axboe(a)kernel.dk> writes:

> On 21/06/10 20.52, Jeff Moyer wrote:
>> Jens Axboe <axboe(a)kernel.dk> writes:
>>
>>> On 2010-06-21 11:48, Christoph Hellwig wrote:
>>>> Now how do we use these flags in the block layer?
>>>>
>>>> - REQ_META
>>>>
>>>> The only place where we ever use this flag is inside the
>>>> cfq scheduler. In cfq_choose_req we use it to give a meta
>>>> request priority over one that doesn't have it. But before
>>>> that we already do the same preference check with rw_is_sync,
>>>> which evaluates to true for requests with that are either
>>>> reads or have REQ_SYNC set. So for reads the REQ_META flag
>>>> here effectively is a no-op, and for writes it gives less
>>>> priority than REQ_SYNC.
>>>> In addition to that we use it to account for pending metadata
>>>> requests in cfq_rq_enqueued/cfq_remove_request which gets
>>>> checked in cfq_should_preempt to give priority to a meta
>>>> request if the other queue doesn't have any pending meta
>>>> requests. But again this priority comes after a similar
>>>> check for sync requests that checks if the other queue has
>>>> been marked to have sync requests pending.
>>>
>>> It's also annotation for blktrace, so you can tell which parts of the IO
>>> is meta data etc. The scheduler impact is questionable, I doubt it makes
>>> a whole lot of difference.
>>
>> Really? Even after I showed the performance impact of setting that bit
>> for journal I/O?
>>
>> http://lkml.org/lkml/2010/4/1/344
>
> It's definitely a win in some cases, as you showed there as well.
> My initial testing a long time ago had some nice benefits too. So
> perhaps the above wasn't worded very well, I always worry that we
> have regressions doing boosts for things like that. But given that
> meta data is something that needs to be done before we get to the
> real data, bumping priority generally seems like a good thing to do.

Oh, I'm not arguing for that approach. I just wanted to make it clear
that it can and does have a noticible impact.

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Christoph Hellwig on
On Mon, Jun 21, 2010 at 08:56:55PM +0200, Jens Axboe wrote:
> FWIW, Windows marks meta data writes and they go out with FUA set
> on SATA disks. And SATA firmware prioritizes FUA writes, it's essentially
> a priority bit as well as a platter access bit. So at least we have some
> one else using a meta data boost. I agree that it would be a lot more
> trivial to add the annotations if they didn't have scheduler impact
> as well, but I still think it's a sane thing.

And we still disable the FUA bit in libata unless people set a
non-standard module option..

> >> Reads are sync by nature in the block layer, so they don't get that
> >> special annotation.
> >
> > Well, we do give them this special annotation in a few places, but we
> > don't actually use it.
>
> For unplugging?

We use the explicit unplugging, yes - but READ_META also includes
REQ_SYNC which is not used anywhere.

> > But that leaves the question why disabling the idling logical for
> > data integrity ->writepage is fine? This gets called from ->fsync
> > or O_SYNC writes and will have the same impact as O_DIRECT writes.
>
> We have never enabled idling for those. O_SYNC should get a nice
> boost too, it just needs to be benchmarked and tested and then
> there would be no reason not to add it.

We've only started using any kind of sync tag last year in ->writepage in
commit a64c8610bd3b753c6aff58f51c04cdf0ae478c18 "block_write_full_page:
Use synchronous writes for WBC_SYNC_ALL writebacks", switching from
WRITE_SYNC to WRITE_SYNC_PLUG a bit later in commit
6e34eeddf7deec1444bbddab533f03f520d8458c "block_write_full_page: switch
synchronous writes to use WRITE_SYNC_PLUG"

Before that we used plain WRITE, which had the normal idling logic.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jens Axboe on
On 21/06/10 21.14, Christoph Hellwig wrote:
> On Mon, Jun 21, 2010 at 08:56:55PM +0200, Jens Axboe wrote:
>> FWIW, Windows marks meta data writes and they go out with FUA set
>> on SATA disks. And SATA firmware prioritizes FUA writes, it's essentially
>> a priority bit as well as a platter access bit. So at least we have some
>> one else using a meta data boost. I agree that it would be a lot more
>> trivial to add the annotations if they didn't have scheduler impact
>> as well, but I still think it's a sane thing.
>
> And we still disable the FUA bit in libata unless people set a
> non-standard module option..

Yes, but that's a separate thing. The point is that boosting meta data
IO is done by others as well. That we don't fully do the same on the
hw side is a different story. That doesn't mean that the io scheduler
boost isn't useful.

>>>> Reads are sync by nature in the block layer, so they don't get that
>>>> special annotation.
>>>
>>> Well, we do give them this special annotation in a few places, but we
>>> don't actually use it.
>>
>> For unplugging?
>
> We use the explicit unplugging, yes - but READ_META also includes
> REQ_SYNC which is not used anywhere.

That does look superfluous.

>>> But that leaves the question why disabling the idling logical for
>>> data integrity ->writepage is fine? This gets called from ->fsync
>>> or O_SYNC writes and will have the same impact as O_DIRECT writes.
>>
>> We have never enabled idling for those. O_SYNC should get a nice
>> boost too, it just needs to be benchmarked and tested and then
>> there would be no reason not to add it.
>
> We've only started using any kind of sync tag last year in ->writepage in
> commit a64c8610bd3b753c6aff58f51c04cdf0ae478c18 "block_write_full_page:
> Use synchronous writes for WBC_SYNC_ALL writebacks", switching from
> WRITE_SYNC to WRITE_SYNC_PLUG a bit later in commit
> 6e34eeddf7deec1444bbddab533f03f520d8458c "block_write_full_page: switch
> synchronous writes to use WRITE_SYNC_PLUG"
>
> Before that we used plain WRITE, which had the normal idling logic.

Plain write does not idle.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Christoph Hellwig on
On Mon, Jun 21, 2010 at 09:16:33PM +0200, Jens Axboe wrote:
> Yes, but that's a separate thing. The point is that boosting meta data
> IO is done by others as well. That we don't fully do the same on the
> hw side is a different story. That doesn't mean that the io scheduler
> boost isn't useful.

Well, the FUA bit is useful for writes. And I agree that we need to
prioritize log I/O metadata. On the other hand at least for XFS most
other metadata writes actually are asynchronous - there's no need to
prioritize those.

> > We use the explicit unplugging, yes - but READ_META also includes
> > REQ_SYNC which is not used anywhere.
>
> That does look superfluous.

The above should READ_SYNC, but the same still applies.

> > We've only started using any kind of sync tag last year in ->writepage in
> > commit a64c8610bd3b753c6aff58f51c04cdf0ae478c18 "block_write_full_page:
> > Use synchronous writes for WBC_SYNC_ALL writebacks", switching from
> > WRITE_SYNC to WRITE_SYNC_PLUG a bit later in commit
> > 6e34eeddf7deec1444bbddab533f03f520d8458c "block_write_full_page: switch
> > synchronous writes to use WRITE_SYNC_PLUG"
> >
> > Before that we used plain WRITE, which had the normal idling logic.
>
> Plain write does not idle.

Well, whatever logic we use for normal files. Note that all the commits
above were introduced in 2.6.29. That's where the horrible fsync
latency regressions due to the idling logic that Jeff is fighting were
introduced. So there's defintively something wrong going on here.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Vivek Goyal on
On Mon, Jun 21, 2010 at 12:04:06PM +0200, Jens Axboe wrote:

[..]
> > Now how do we use these flags in the block layer?
> >
> > - REQ_META
> >
> > The only place where we ever use this flag is inside the
> > cfq scheduler. In cfq_choose_req we use it to give a meta
> > request priority over one that doesn't have it. But before
> > that we already do the same preference check with rw_is_sync,
> > which evaluates to true for requests with that are either
> > reads or have REQ_SYNC set. So for reads the REQ_META flag
> > here effectively is a no-op, and for writes it gives less
> > priority than REQ_SYNC.
> > In addition to that we use it to account for pending metadata
> > requests in cfq_rq_enqueued/cfq_remove_request which gets
> > checked in cfq_should_preempt to give priority to a meta
> > request if the other queue doesn't have any pending meta
> > requests. But again this priority comes after a similar
> > check for sync requests that checks if the other queue has
> > been marked to have sync requests pending.
>
> It's also annotation for blktrace, so you can tell which parts of the IO
> is meta data etc. The scheduler impact is questionable, I doubt it makes
> a whole lot of difference.

In my testing in the past, this was helping if lots of sequential readers
are running in a system (with 100ms slice each) and if there is another
reader doing small file reads. Without meta data based preemption check,
latency of opening small file used to be very high (Because all the sync
sequntial reader gets to consume their 100ms slices first).

Situation should be somewhat better now after corrado's changes of
reducing slice length dynamically. But I suspect that we will still
experience significantly reduced latency for small file reads in presece
of multiple sequntial reads going on.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/