From: Christoph Hellwig on
Hi Jens,

I'm trying to understand the "sync" and "meta" flags we add to bios /
requests and the more I look into the more I get confused. Let's start
with the simple bits:

- REQ_META

Only really used for READ_META. There's a WRITE_META that I
added a while ago, but the actual users of it are gone again
now.

The READ_META is used by ext3 and ext4 for reading inodes,
and for all reads of directory blocks. XFS has a place that
sets READ_META, but it's unreachable now. GFS2 on the other
hand uses REQ_META directly, and ORs it into just about every
other kind of READ/WRITE request: WRITE_BARRIER,
WRITE_SYNC_PLUG, WRITE, READ_SYNC.

- REQ_SYNC

Used for READ_SYNC / WRITE_SYNC_PLUG / WRITE_SYNC /
WRITE_ODIRECT_PLUG / WRITE_BARRIER and the SWRITE variants.

READ_SYNC is used by GFS2 in a few spots, but always together
with REQ_META, and by jfs and xfs for the code reading the log
during recovery.
(S)WRITE_SYNC_PLUG is used for ->writepage instances with
the WB_SYNC_ALL sync_mode, and random pieces of journal /
metadata I/O in gfs2 and jbd(2) as well as fsync_buffers_list().
Note that the SWRITE_SYNC_PLUG uses actually degenerate to WRITE
in ll_rw_block.c due to what seems to be a bug in that code.
WRITE_SYNC is used by xfs for the log write if barriers are not
supported, by jfs for all log writes, by gfs2 for at least some
log writes, by btrfs for data writeout in some cases.
SWRITE_SYNC is entirely unused.
WRITE_ODIRECT_PLUG is used for O_DIRECT writes, and
WRITE_BARRIER is used for anything using barriers.

- REQ_NOIDLE

Used for WRITE_SYNC_PLUG / WRITE_SYNC / WRITE_BARRIER and
SWRITE variants. See the description above for details.

- REQ_UNPLUG

Used for READ_SYNC, WRITE_SYNC and WRITE_BARRIER.

Now how do we use these flags in the block layer?

- REQ_META

The only place where we ever use this flag is inside the
cfq scheduler. In cfq_choose_req we use it to give a meta
request priority over one that doesn't have it. But before
that we already do the same preference check with rw_is_sync,
which evaluates to true for requests with that are either
reads or have REQ_SYNC set. So for reads the REQ_META flag
here effectively is a no-op, and for writes it gives less
priority than REQ_SYNC.
In addition to that we use it to account for pending metadata
requests in cfq_rq_enqueued/cfq_remove_request which gets
checked in cfq_should_preempt to give priority to a meta
request if the other queue doesn't have any pending meta
requests. But again this priority comes after a similar
check for sync requests that checks if the other queue has
been marked to have sync requests pending.

- REQ_SYNC

This one is used in quite a few places, with many of them
obsfucated by macros like rw_is_sync, rq_is_sync and
cfq_bio_sync. In general all uses seem to imply giving
a write request the same priority as a read request and
treat it as synchronous. I could not spot a place where
it actually has any effect on reads.

- REQ_NOIDLE

Only used inside the cfq I/O scheduler in one place:
cfq_completed_request. Only used for READ or synchronous
request. I can't figure out what's actually going on in
details here.

- REQ_UNPLUG

Used to call __generic_unplug_device at the end of
__make_reques, and one probably wrong usage in drbd (but given
what a big piece of junk drbd is I refuse to step into that).

That's how far I got to understand this.

Now the big questions:

What eactly is the REQ_META flag for? Even after going
through all the things I mentioned above I don't quite
understand it. It's only used for reads, and doesn't seem
to give any major priority. Should it be used for all metadata
reads or just some? Currently it's not actually very widely
used.

Should we allow REQ_META on writes, but ignore it except for
blktrace? If the answer above is we want to tag all metadata
reads as REQ_META that would allow easily spotting all metadata
I/O in blktrace.

Is there any point in keeping the READ_SYNC? If the users
really want to keep the direct unplugging semantics we can
just write READ | REQ_UNPLUG, but even that seems to be wrong
for the log recvoery callers.

Similarly is there a point in keeping WRITE_SYNC? Most callers
really want WRITE_SYNC_PLUG semantics, and the current naming
is extremtly confusing. I'd much rather see the callers
specify explicitly using REQ_UNPLUG if they want to unplug
the queue.

Why do O_DIRECT writes not want to set REQ_NOIDLE (and that
exactly does REQ_NOIDLE mean anyway). It's the only sync writes
that do not set it, so if this special case went away we
could get rid of the flag and key it off REQ_SYNC.

Do we really need to keep the SWRITE flags? I'd rather
make that an additional flag to ll_rw_block, or even better
split ll_rw_block into two helpers for the guaranteed locking
or not case.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jens Axboe on
On 2010-06-21 11:48, Christoph Hellwig wrote:
> Hi Jens,
>
> I'm trying to understand the "sync" and "meta" flags we add to bios /
> requests and the more I look into the more I get confused. Let's start
> with the simple bits:
>
> - REQ_META
>
> Only really used for READ_META. There's a WRITE_META that I
> added a while ago, but the actual users of it are gone again
> now.
>
> The READ_META is used by ext3 and ext4 for reading inodes,
> and for all reads of directory blocks. XFS has a place that
> sets READ_META, but it's unreachable now. GFS2 on the other
> hand uses REQ_META directly, and ORs it into just about every
> other kind of READ/WRITE request: WRITE_BARRIER,
> WRITE_SYNC_PLUG, WRITE, READ_SYNC.
>
> - REQ_SYNC
>
> Used for READ_SYNC / WRITE_SYNC_PLUG / WRITE_SYNC /
> WRITE_ODIRECT_PLUG / WRITE_BARRIER and the SWRITE variants.
>
> READ_SYNC is used by GFS2 in a few spots, but always together
> with REQ_META, and by jfs and xfs for the code reading the log
> during recovery.
> (S)WRITE_SYNC_PLUG is used for ->writepage instances with
> the WB_SYNC_ALL sync_mode, and random pieces of journal /
> metadata I/O in gfs2 and jbd(2) as well as fsync_buffers_list().
> Note that the SWRITE_SYNC_PLUG uses actually degenerate to WRITE
> in ll_rw_block.c due to what seems to be a bug in that code.
> WRITE_SYNC is used by xfs for the log write if barriers are not
> supported, by jfs for all log writes, by gfs2 for at least some
> log writes, by btrfs for data writeout in some cases.
> SWRITE_SYNC is entirely unused.
> WRITE_ODIRECT_PLUG is used for O_DIRECT writes, and
> WRITE_BARRIER is used for anything using barriers.
>
> - REQ_NOIDLE
>
> Used for WRITE_SYNC_PLUG / WRITE_SYNC / WRITE_BARRIER and
> SWRITE variants. See the description above for details.
>
> - REQ_UNPLUG
>
> Used for READ_SYNC, WRITE_SYNC and WRITE_BARRIER.
>
> Now how do we use these flags in the block layer?
>
> - REQ_META
>
> The only place where we ever use this flag is inside the
> cfq scheduler. In cfq_choose_req we use it to give a meta
> request priority over one that doesn't have it. But before
> that we already do the same preference check with rw_is_sync,
> which evaluates to true for requests with that are either
> reads or have REQ_SYNC set. So for reads the REQ_META flag
> here effectively is a no-op, and for writes it gives less
> priority than REQ_SYNC.
> In addition to that we use it to account for pending metadata
> requests in cfq_rq_enqueued/cfq_remove_request which gets
> checked in cfq_should_preempt to give priority to a meta
> request if the other queue doesn't have any pending meta
> requests. But again this priority comes after a similar
> check for sync requests that checks if the other queue has
> been marked to have sync requests pending.

It's also annotation for blktrace, so you can tell which parts of the IO
is meta data etc. The scheduler impact is questionable, I doubt it makes
a whole lot of difference.

> - REQ_SYNC
>
> This one is used in quite a few places, with many of them
> obsfucated by macros like rw_is_sync, rq_is_sync and
> cfq_bio_sync. In general all uses seem to imply giving
> a write request the same priority as a read request and
> treat it as synchronous. I could not spot a place where
> it actually has any effect on reads.

Reads are sync by nature in the block layer, so they don't get that
special annotation.

> - REQ_NOIDLE
>
> Only used inside the cfq I/O scheduler in one place:
> cfq_completed_request. Only used for READ or synchronous
> request. I can't figure out what's actually going on in
> details here.

For a sync request, CFQ will idle the disk waiting for another one that
is close to the completed request. This is what keeps up high throughput
for read or sync write workloads, if you have more than one stream
going.

> Now the big questions:
>
> What eactly is the REQ_META flag for? Even after going
> through all the things I mentioned above I don't quite
> understand it. It's only used for reads, and doesn't seem
> to give any major priority. Should it be used for all metadata
> reads or just some? Currently it's not actually very widely
> used.
>
> Should we allow REQ_META on writes, but ignore it except for
> blktrace? If the answer above is we want to tag all metadata
> reads as REQ_META that would allow easily spotting all metadata
> I/O in blktrace.

For analysis purposes, annotating all meta data IOs as such would be
beneficial (reads as well as writes). As mentioned above, the current
scheduler impact isn't huge. There may be some interesting test and
benchmarks there for improving that part.

> Is there any point in keeping the READ_SYNC? If the users
> really want to keep the direct unplugging semantics we can
> just write READ | REQ_UNPLUG, but even that seems to be wrong
> for the log recvoery callers.
>
> Similarly is there a point in keeping WRITE_SYNC? Most callers
> really want WRITE_SYNC_PLUG semantics, and the current naming
> is extremtly confusing. I'd much rather see the callers
> specify explicitly using REQ_UNPLUG if they want to unplug
> the queue.

So a large part of that problem is the overloaded meaning of sync. For
some cases it means "unplug on issue", and for others it means that the
IO itself is syncronous. The other nasty bit is the implicit plugging
that happens behind the back of the submitter, but that's an issue to
tackle separately. I'd suggest avoiding unnecessary churn in naming of
those.

> Why do O_DIRECT writes not want to set REQ_NOIDLE (and that
> exactly does REQ_NOIDLE mean anyway). It's the only sync writes
> that do not set it, so if this special case went away we
> could get rid of the flag and key it off REQ_SYNC.

See above for NOIDLE. You kill O_DIRECT write throughput if you don't
idle at the end of a write, if you have other activity on the disk.

> Do we really need to keep the SWRITE flags? I'd rather
> make that an additional flag to ll_rw_block, or even better
> split ll_rw_block into two helpers for the guaranteed locking
> or not case.

No, lets kill those.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Christoph Hellwig on
On Mon, Jun 21, 2010 at 12:04:06PM +0200, Jens Axboe wrote:
> It's also annotation for blktrace, so you can tell which parts of the IO
> is meta data etc. The scheduler impact is questionable, I doubt it makes
> a whole lot of difference.

> For analysis purposes, annotating all meta data IOs as such would be
> beneficial (reads as well as writes). As mentioned above, the current
> scheduler impact isn't huge. There may be some interesting test and
> benchmarks there for improving that part.

As mentioned in the previous mail the use of the flag is not very
wide spread. Currently it's only ext3/ext3 inodes and directories as well
as all metadata I/O in gfs2 that gets marked this way. And I'd be much more
comfortable to add more annotations if it didn't also have some form
of schedule impact. The I/O schedules in general and cfq in particular
have caused us far too many issues with such subtile differences.

> > This one is used in quite a few places, with many of them
> > obsfucated by macros like rw_is_sync, rq_is_sync and
> > cfq_bio_sync. In general all uses seem to imply giving
> > a write request the same priority as a read request and
> > treat it as synchronous. I could not spot a place where
> > it actually has any effect on reads.
>
> Reads are sync by nature in the block layer, so they don't get that
> special annotation.

Well, we do give them this special annotation in a few places, but we
don't actually use it.

> So a large part of that problem is the overloaded meaning of sync. For
> some cases it means "unplug on issue", and for others it means that the
> IO itself is syncronous. The other nasty bit is the implicit plugging
> that happens behind the back of the submitter, but that's an issue to
> tackle separately. I'd suggest avoiding unnecessary churn in naming of
> those.

Well, the current naming is extremly confusing. The best thing I could
come up with is to completely drop READ_SYNC and WRITE_SYNC and just
pass REQ_UNPLUG explicitly together with READ / WRITE_SYNC_PLUG.
There's only 5 respective 8 users of them, so explicitly documenting
our intentions there seems useful. Especially if we want to add more
_META annotation in which case the simple READ_* / WRITE_* macros
don't do anymore either. Similarly it might be useful to remove
READ_META/WRITE_META and replace them with explicit | REQ_META, which
is just about as short and a lot more descriptive, especially for
synchronous metadata writes.

> > Why do O_DIRECT writes not want to set REQ_NOIDLE (and that
> > exactly does REQ_NOIDLE mean anyway). It's the only sync writes
> > that do not set it, so if this special case went away we
> > could get rid of the flag and key it off REQ_SYNC.
>
> See above for NOIDLE. You kill O_DIRECT write throughput if you don't
> idle at the end of a write, if you have other activity on the disk.

Ok, makes sense. Would you mind taking a patch to kill the
WRITE_ODIRECT_PLUG and just do a

/*
* O_DIRECT writes are synchronous, but we must not disable the
* idling logic in CFQ to avoid killing performance.
*/
if (rw & WRITE)
rw |= REQ_SYNC;

But that leaves the question why disabling the idling logical for
data integrity ->writepage is fine? This gets called from ->fsync
or O_SYNC writes and will have the same impact as O_DIRECT writes.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jens Axboe on
On 21/06/10 20.52, Jeff Moyer wrote:
> Jens Axboe <axboe(a)kernel.dk> writes:
>
>> On 2010-06-21 11:48, Christoph Hellwig wrote:
>>> Now how do we use these flags in the block layer?
>>>
>>> - REQ_META
>>>
>>> The only place where we ever use this flag is inside the
>>> cfq scheduler. In cfq_choose_req we use it to give a meta
>>> request priority over one that doesn't have it. But before
>>> that we already do the same preference check with rw_is_sync,
>>> which evaluates to true for requests with that are either
>>> reads or have REQ_SYNC set. So for reads the REQ_META flag
>>> here effectively is a no-op, and for writes it gives less
>>> priority than REQ_SYNC.
>>> In addition to that we use it to account for pending metadata
>>> requests in cfq_rq_enqueued/cfq_remove_request which gets
>>> checked in cfq_should_preempt to give priority to a meta
>>> request if the other queue doesn't have any pending meta
>>> requests. But again this priority comes after a similar
>>> check for sync requests that checks if the other queue has
>>> been marked to have sync requests pending.
>>
>> It's also annotation for blktrace, so you can tell which parts of the IO
>> is meta data etc. The scheduler impact is questionable, I doubt it makes
>> a whole lot of difference.
>
> Really? Even after I showed the performance impact of setting that bit
> for journal I/O?
>
> http://lkml.org/lkml/2010/4/1/344

It's definitely a win in some cases, as you showed there as well.
My initial testing a long time ago had some nice benefits too. So
perhaps the above wasn't worded very well, I always worry that we
have regressions doing boosts for things like that. But given that
meta data is something that needs to be done before we get to the
real data, bumping priority generally seems like a good thing to do.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jeff Moyer on
Jens Axboe <axboe(a)kernel.dk> writes:

> On 2010-06-21 11:48, Christoph Hellwig wrote:
>> Now how do we use these flags in the block layer?
>>
>> - REQ_META
>>
>> The only place where we ever use this flag is inside the
>> cfq scheduler. In cfq_choose_req we use it to give a meta
>> request priority over one that doesn't have it. But before
>> that we already do the same preference check with rw_is_sync,
>> which evaluates to true for requests with that are either
>> reads or have REQ_SYNC set. So for reads the REQ_META flag
>> here effectively is a no-op, and for writes it gives less
>> priority than REQ_SYNC.
>> In addition to that we use it to account for pending metadata
>> requests in cfq_rq_enqueued/cfq_remove_request which gets
>> checked in cfq_should_preempt to give priority to a meta
>> request if the other queue doesn't have any pending meta
>> requests. But again this priority comes after a similar
>> check for sync requests that checks if the other queue has
>> been marked to have sync requests pending.
>
> It's also annotation for blktrace, so you can tell which parts of the IO
> is meta data etc. The scheduler impact is questionable, I doubt it makes
> a whole lot of difference.

Really? Even after I showed the performance impact of setting that bit
for journal I/O?

http://lkml.org/lkml/2010/4/1/344

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/