From: Jan Kara on
Hi,

> On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
> >
> > The problem with not issuing a cache flush when you have dirty meta
> > data or data is that it does not have any tie to the state of the
> > volatile write cache of the target storage device.
>
> We track whether there are any metadata updates associated with
> the inode already; if there are, we force a journal commit, and this
> implies a barrier operation.
>
> The case we're talking about here is one where either (a) there is no
> journal, or (b) there have been no metadata updates (I'm simplifying a
> little here; in fact we track whether there have been fdatasync()- vs
> fsync()- worthy metadata updates), and so there hasn't been a journal
> commit to do the cache flush.
>
> In this case, we want to track when the last fsync() was issued,
> versus when the data blocks for a particular inode were last pushed
> out to disk.
>
> To use an example I used as motivation for why we might want an
> fsync2(int fd[], int flags[], int num) syscall, consider the situation
> of:
>
> fsync(control_fd);
> fdatasync(data_fd);
>
> The first fsync() will have executed a cache flush operation. So when
> we do the fdatasync() (assuming that no metadata needs to be flushed
> out to disk), there is no need for the cache flush operation.
>
> If we had an enhanced fsync command, we would also be able to
> eliminate a second journal commit in the case where data_fd also had
> some metadata that needed to be flushed out to disk.
The current implementation already avoids the journal commit for
fdatasync(data_fd). We remember the ID of the transaction in which the inode
metadata was last updated and do not force a transaction commit if that
transaction has already committed. Thus the first fsync might force a
transaction commit, but the second fdatasync likely won't.
We could actually improve the scheme to work for data as well. I wrote
proof-of-concept patches (attached) and they nicely avoid the second barrier
when doing:
echo "aaa" >file1; echo "aaa" >file2; fsync file2; fsync file1

Ted, would you be interested in something like this?

Honza
--
Jan Kara <jack(a)suse.cz>
SuSE CR Labs
From: Darrick J. Wong on
On Wed, Jul 21, 2010 at 07:16:09PM +0200, Jan Kara wrote:
> Hi,
>
> > On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
> > >
> > > The problem with not issuing a cache flush when you have dirty meta
> > > data or data is that it does not have any tie to the state of the
> > > volatile write cache of the target storage device.
> >
> > We track whether there are any metadata updates associated with
> > the inode already; if there are, we force a journal commit, and this
> > implies a barrier operation.
> >
> > The case we're talking about here is one where either (a) there is no
> > journal, or (b) there have been no metadata updates (I'm simplifying a
> > little here; in fact we track whether there have been fdatasync()- vs
> > fsync()- worthy metadata updates), and so there hasn't been a journal
> > commit to do the cache flush.
> >
> > In this case, we want to track when the last fsync() was issued,
> > versus when the data blocks for a particular inode were last pushed
> > out to disk.
> >
> > To use an example I used as motivation for why we might want an
> > fsync2(int fd[], int flags[], int num) syscall, consider the situation
> > of:
> >
> > fsync(control_fd);
> > fdatasync(data_fd);
> >
> > The first fsync() will have executed a cache flush operation. So when
> > we do the fdatasync() (assuming that no metadata needs to be flushed
> > out to disk), there is no need for the cache flush operation.
> >
> > If we had an enhanced fsync command, we would also be able to
> > eliminate a second journal commit in the case where data_fd also had
> > some metadata that needed to be flushed out to disk.
> The current implementation already avoids the journal commit for
> fdatasync(data_fd). We remember the ID of the transaction in which the inode
> metadata was last updated and do not force a transaction commit if that
> transaction has already committed. Thus the first fsync might force a
> transaction commit, but the second fdatasync likely won't.
> We could actually improve the scheme to work for data as well. I wrote
> proof-of-concept patches (attached) and they nicely avoid the second barrier
> when doing:
> echo "aaa" >file1; echo "aaa" >file2; fsync file2; fsync file1
>
> Ted, would you be interested in something like this?

Well... on my fsync-happy workloads, this seems to cut the barrier count down
by about 20%, and speeds it up by about 20%.

I also have a patch to ext4_sync_files that batches the fsync requests together
for a further 20% decrease in barrier IOs, which makes it run another 20%
faster. I'll send that one out shortly, though I've not safety-tested it at
all.

--D
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Darrick J. Wong on
On Tue, Aug 03, 2010 at 05:01:52AM -0400, Christoph Hellwig wrote:
> On Mon, Aug 02, 2010 at 05:09:39PM -0700, Darrick J. Wong wrote:
> > Well... on my fsync-happy workloads, this seems to cut the barrier count down
> > by about 20%, and speeds it up by about 20%.
>
> Care to share the test case for this? I'd be especially interested in
> how it behaves with non-draining barriers / cache flushes in fsync.

Sure. When I run blktrace with the ffsb profile, I get these results:

With Jan's patch:
barriers   transactions/sec
16212      206
15625      201
10442      269
10870      266
15658      201

Without Jan's patch:
barriers   transactions/sec
20855      177
20963      177
20340      174
20908      177

The two ~270 results are a little odd... if we ignore them, the net gain with
Jan's patch is about a 25% reduction in barriers issued and about a 15%
increase in tps. (If we don't, it's ~30% and ~30%, respectively.) That said,
I was running mkfs between runs, so it's possible that the disk layout could
have shifted a bit. If I turn off the fsync parts of the ffsb profile, the
barrier counts drop to about a couple every second or so, which means that
Jan's patch doesn't have much of an effect. But it does help if someone is
hammering on the filesystem with fsync.

The ffsb profile is attached below.

--D

-----------

time=300
alignio=1
directio=1

[filesystem0]
location=/mnt/
num_files=100000
num_dirs=1000

reuse=1
# File sizes range from 1kB to 1MB.
size_weight 1KB 10
size_weight 2KB 15
size_weight 4KB 16
size_weight 8KB 16
size_weight 16KB 15
size_weight 32KB 10
size_weight 64KB 8
size_weight 128KB 4
size_weight 256KB 3
size_weight 512KB 2
size_weight 1MB 1

create_blocksize=1048576
[end0]

[threadgroup0]
num_threads=64

readall_weight=4
create_fsync_weight=2
delete_weight=1

append_weight=1
append_fsync_weight=1
stat_weight=1
create_weight=1
writeall_weight=1
writeall_fsync_weight=1
open_close_weight=1


write_size=64KB
write_blocksize=512KB

read_size=64KB
read_blocksize=512KB

[stats]
enable_stats=1
enable_range=1

msec_range 0.00 0.01
msec_range 0.01 0.02
msec_range 0.02 0.05
msec_range 0.05 0.10
msec_range 0.10 0.20
msec_range 0.20 0.50
msec_range 0.50 1.00
msec_range 1.00 2.00
msec_range 2.00 5.00
msec_range 5.00 10.00
msec_range 10.00 20.00
msec_range 20.00 50.00
msec_range 50.00 100.00
msec_range 100.00 200.00
msec_range 200.00 500.00
msec_range 500.00 1000.00
msec_range 1000.00 2000.00
msec_range 2000.00 5000.00
msec_range 5000.00 10000.00
[end]
[end0]