improve the performance of large sequential write NFS workloads [Kernel]

Prev: -tip: origin tree boot crash
Next: patch staging-ramzswap-remove-arm-specific-d-cache-hack.patch added to gregkh-2.6 tree

From: Steve Rago on 23 Dec 2009 23:40

On Thu, 2009-12-24 at 00:44 +0100, Trond Myklebust wrote:

> > #2 is the difficult one. If you wait for memory pressure, you could
> > have waited too long, because depending on the latency of the commit,
> > you could run into low-memory situations. Then mayhem ensues, the
> > oom-killer gets cranky (if you haven't disabled it), and stuff starts
> > failing and/or hanging. So you need to be careful about setting the
> > threshold for generating a commit so that the client doesn't run out of
> > memory before the server can respond.
>
> Right, but this is why we have limits on the total number of dirty pages
> that can be kept in memory. The NFS unstable writes don't significantly
> change that model, they just add an extra step: once all the dirty data
> has been transmitted to the server, your COMMIT defines a
> synchronisation point after which you know that the data you just sent
> is all on disk. Given a reasonable NFS server implementation, it will
> already have started the write out of that data, and so hopefully the
> COMMIT operation itself will run reasonably quickly.

Right. The trick is to do this with the best performance possible.

>
> Any userland application with basic data integrity requirements will
> have the same expectations. It will write out the data and then fsync()
> at regular intervals. I've never heard of any expectations from
> filesystem and VM designers that applications should be required to
> fine-tune the length of those intervals in order to achieve decent
> performance.

Agreed, except that the more you call fsync(), the more you are stalling
the writing, so application designers must use fsync() judiciously.
Otherwise they'd just use synchronous writes. (Apologies if I sound
like Captain Obvious.)

Thanks,

Steve

>
> Cheers
> Trond
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo(a)vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Wu Fengguang on 24 Dec 2009 00:30

On Wed, Dec 23, 2009 at 09:32:44PM +0800, Jan Kara wrote:
> On Wed 23-12-09 03:43:02, Christoph Hellwig wrote:
> > On Tue, Dec 22, 2009 at 01:35:39PM +0100, Jan Kara wrote:
> > > > nfsd_sync:
> > > > [take i_mutex]
> > > > filemap_fdatawrite => can also be blocked, but less a problem
> > > > [drop i_mutex]
> > > > filemap_fdatawait
> > > >
> > > > Maybe it's a dumb question, but what's the purpose of i_mutex here?
> > > > For correctness or to prevent livelock? I can imagine some livelock
> > > > problem here (current implementation can easily wait for extra
> > > > pages), however not too hard to fix.
> > > Generally, most filesystems take i_mutex during fsync to
> > > a) avoid all sorts of livelocking problems
> > > b) serialize fsyncs for one inode (mostly for simplicity)
> > > I don't see what advantage would it bring that we get rid of i_mutex
> > > for fdatawait - only that maybe writers could proceed while we are
> > > waiting but is that really the problem?
> >
> > It would match what we do in vfs_fsync for the non-nfsd path, so it's
> > a no-brainer to do it. In fact I did switch it over to vfs_fsync a
> > while ago but that go reverted because it caused deadlocks for
> > nfsd_sync_dir which for some reason can't take the i_mutex (I'd have to
> > check the archives why).
> >
> > Here's a RFC patch to make some more sense of the fsync callers in nfsd,
> > including fixing up the data write/wait calling conventions to match the
> > regular fsync path (which might make this a -stable candidate):
> The patch looks good to me from general soundness point of view :).
> Someone with more NFS knowledge should tell whether dropping i_mutex for
> fdatawrite_and_wait is fine for NFS.

I believe it's safe to drop i_mutex for fdatawrite_and_wait().
Because NFS
1) client: collect all unstable pages (which server ACKed that have
reach its page cache)
2) client: send COMMIT
3) server: fdatawrite_and_wait(), which makes sure pages in 1) get cleaned
4) client: put all pages collected in 1) to clean state

So there's no need to take i_mutex to prevent concurrent
write/commits.

If someone else concurrently truncate and then extend i_size, the NFS
verf will be changed and thus client will resend the pages? (whether
it should overwrite the pages is another problem..)

Thanks,
Fengguang

>
> > Index: linux-2.6/fs/nfsd/vfs.c
> > ===================================================================
> > --- linux-2.6.orig/fs/nfsd/vfs.c 2009-12-23 09:32:45.693170043 +0100
> > +++ linux-2.6/fs/nfsd/vfs.c 2009-12-23 09:39:47.627170082 +0100
> > @@ -769,45 +769,27 @@ nfsd_close(struct file *filp)
> > }
> >
> > /*
> > - * Sync a file
> > - * As this calls fsync (not fdatasync) there is no need for a write_inode
> > - * after it.
> > + * Sync a directory to disk.
> > + *
> > + * This is odd compared to all other fsync callers because we
> > + *
> > + * a) do not have a file struct available
> > + * b) expect to have i_mutex already held by the caller
> > */
> > -static inline int nfsd_dosync(struct file *filp, struct dentry *dp,
> > - const struct file_operations *fop)
> > +int
> > +nfsd_sync_dir(struct dentry *dentry)
> > {
> > - struct inode *inode = dp->d_inode;
> > - int (*fsync) (struct file *, struct dentry *, int);
> > + struct inode *inode = dentry->d_inode;
> > int err;
> >
> > - err = filemap_fdatawrite(inode->i_mapping);
> > - if (err == 0 && fop && (fsync = fop->fsync))
> > - err = fsync(filp, dp, 0);
> > - if (err == 0)
> > - err = filemap_fdatawait(inode->i_mapping);
> > + WARN_ON(!mutex_is_locked(&inode->i_mutex));
> >
> > + err = filemap_write_and_wait(inode->i_mapping);
> > + if (err == 0 && inode->i_fop->fsync)
> > + err = inode->i_fop->fsync(NULL, dentry, 0);
> > return err;
> > }
> >
> > -static int
> > -nfsd_sync(struct file *filp)
> > -{
> > - int err;
> > - struct inode *inode = filp->f_path.dentry->d_inode;
> > - dprintk("nfsd: sync file %s\n", filp->f_path.dentry->d_name.name);
> > - mutex_lock(&inode->i_mutex);
> > - err=nfsd_dosync(filp, filp->f_path.dentry, filp->f_op);
> > - mutex_unlock(&inode->i_mutex);
> > -
> > - return err;
> > -}
> > -
> > -int
> > -nfsd_sync_dir(struct dentry *dp)
> > -{
> > - return nfsd_dosync(NULL, dp, dp->d_inode->i_fop);
> > -}
> > -
> > /*
> > * Obtain the readahead parameters for the file
> > * specified by (dev, ino).
> > @@ -1011,7 +993,7 @@ static int wait_for_concurrent_writes(st
> >
> > if (inode->i_state & I_DIRTY) {
> > dprintk("nfsd: write sync %d\n", task_pid_nr(current));
> > - err = nfsd_sync(file);
> > + err = vfs_fsync(file, file->f_path.dentry, 0);
> > }
> > last_ino = inode->i_ino;
> > last_dev = inode->i_sb->s_dev;
> > @@ -1180,7 +1162,7 @@ nfsd_commit(struct svc_rqst *rqstp, stru
> > return err;
> > if (EX_ISSYNC(fhp->fh_export)) {
> > if (file->f_op && file->f_op->fsync) {
> > - err = nfserrno(nfsd_sync(file));
> > + err = nfserrno(vfs_fsync(file, file->f_path.dentry, 0));
> > } else {
> > err = nfserr_notsupp;
> > }
> --
> Jan Kara <jack(a)suse.cz>
> SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Steve Rago on 24 Dec 2009 09:50

On Thu, 2009-12-24 at 09:21 +0800, Wu Fengguang wrote:

> > Commits and writes on the same inode need to be serialized for
> > consistency (write can change the data and metadata; commit [fsync]
> > needs to provide guarantees that the written data are stable). The
> > performance problem arises because NFS writes are fast (they generally
> > just deposit data into the server's page cache), but commits can take a
>
> Right.
>
> > long time, especially if there is a lot of cached data to flush to
> > stable storage.
>
> "a lot of cached data to flush" is not likely with pdflush, since it
> roughly send one COMMIT per 4MB WRITEs. So in average each COMMIT
> syncs 4MB at the server side.

Maybe on paper, but empirically I see anywhere from one commit per 8MB
to one commit per 64 MB.

>
> Your patch adds another pre-pdlush async write logic, which greatly
> reduced the number of COMMITs by pdflush. Can this be the major factor
> of the performance gain?

My patch removes pdflush from the picture almost entirely. See my
comments below.

>
> Jan has been proposing to change the pdflush logic from
>
> loop over dirty files {
> writeback 4MB
> write_inode
> }
> to
> loop over dirty files {
> writeback all its dirty pages
> write_inode
> }
>
> This should also be able to reduce the COMMIT numbers. I wonder if
> this (more general) approach can achieve the same performance gain.

The pdflush mechanism is fine for random writes and small sequential
writes, because it promotes concurrency -- instead of the application
blocking while it tries to write and commit its data, the application
can go on doing other more useful things, and the data gets flushed in
the background. There is also a benefit if the application makes
another modification to a page that is already dirty, because then
multiple modifications are coalesced into a single write.

However, the pdflush mechanism is wrong for large sequential writes
(like a backup stream, for example). First, there is no concurrency to
exploit -- the application is only going to dirty more pages, so
removing the need for it to block writing the pages out only adds to the
problem of memory pressure. Second, the application is not going to go
back and modify a page it has already written, so leaving it in the
cache for someone else to write provides no additional benefit.

Note that this assumes the application actually cares about the
consistency of its data and will call fsync() when it is done. If the
application doesn't call fsync(), then it doesn't matter when the pages
are written to backing store, because the interface makes no guarantees
in this case.

Thanks,

Steve

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Wu Fengguang on 25 Dec 2009 01:00

On Thu, Dec 24, 2009 at 08:04:41PM +0800, Trond Myklebust wrote:
> On Thu, 2009-12-24 at 10:52 +0800, Wu Fengguang wrote:
> > Trond,
> >
> > On Thu, Dec 24, 2009 at 03:12:54AM +0800, Trond Myklebust wrote:
> > > On Wed, 2009-12-23 at 19:05 +0100, Jan Kara wrote:
> > > > On Wed 23-12-09 15:21:47, Trond Myklebust wrote:
> > > > > @@ -474,6 +482,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> > > > > }
> > > > >
> > > > > spin_lock(&inode_lock);
> > > > > + /*
> > > > > + * Special state for cleaning NFS unstable pages
> > > > > + */
> > > > > + if (inode->i_state & I_UNSTABLE_PAGES) {
> > > > > + int err;
> > > > > + inode->i_state &= ~I_UNSTABLE_PAGES;
> > > > > + spin_unlock(&inode_lock);
> > > > > + err = commit_unstable_pages(inode, wait);
> > > > > + if (ret == 0)
> > > > > + ret = err;
> > > > > + spin_lock(&inode_lock);
> > > > > + }
> > > > I don't quite understand this chunk: We've called writeback_single_inode
> > > > because it had some dirty pages. Thus it has I_DIRTY_DATASYNC set and a few
> > > > lines above your chunk, we've called nfs_write_inode which sent commit to
> > > > the server. Now here you sometimes send the commit again? What's the
> > > > purpose?
> > >
> > > We no longer set I_DIRTY_DATASYNC. We only set I_DIRTY_PAGES (and later
> > > I_UNSTABLE_PAGES).
> > >
> > > The point is that we now do the commit only _after_ we've sent all the
> > > dirty pages, and waited for writeback to complete, whereas previously we
> > > did it in the wrong order.
> >
> > Sorry I still don't get it. The timing used to be:
> >
> > write 4MB ==> WRITE block 0 (ie. first 512KB)
> > WRITE block 1
> > WRITE block 2
> > WRITE block 3 ack from server for WRITE block 0 => mark 0 as unstable (inode marked need-commit)
> > WRITE block 4 ack from server for WRITE block 1 => mark 1 as unstable
> > WRITE block 5 ack from server for WRITE block 2 => mark 2 as unstable
> > WRITE block 6 ack from server for WRITE block 3 => mark 3 as unstable
> > WRITE block 7 ack from server for WRITE block 4 => mark 4 as unstable
> > ack from server for WRITE block 5 => mark 5 as unstable
> > write_inode ==> COMMIT blocks 0-5
> > ack from server for WRITE block 6 => mark 6 as unstable (inode marked need-commit)
> > ack from server for WRITE block 7 => mark 7 as unstable
> >
> > ack from server for COMMIT blocks 0-5 => mark 0-5 as clean
> >
> > write_inode ==> COMMIT blocks 6-7
> >
> > ack from server for COMMIT blocks 6-7 => mark 6-7 as clean
> >
> > Note that the first COMMIT is submitted before receiving all ACKs for
> > the previous writes, hence the second COMMIT is necessary. It seems
> > that your patch does not improve the timing at all.
>
> That would indicate that we're cycling through writeback_single_inode()
> more than once. Why?

Yes. The above sequence can happen for a 4MB sized dirty file.
The first COMMIT is done by L547, while the second COMMIT will be
scheduled either by __mark_inode_dirty(), or scheduled by L583
(depending on the time ACKs for L543 but missed L547 arrives:
if an ACK missed L578, the inode will be queued into b_dirty list,
but if any ACK arrives between L547 and L578, the inode will enter
b_more_io_wait, which is a to-be-introduced new dirty list).

537 dirty = inode->i_state & I_DIRTY;
538 inode->i_state |= I_SYNC;
539 inode->i_state &= ~I_DIRTY;
540
541 spin_unlock(&inode_lock);
542
==> 543 ret = do_writepages(mapping, wbc);
544
545 /* Don't write the inode if only I_DIRTY_PAGES was set */
546 if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
==> 547 int err = write_inode(inode, wait);
548 if (ret == 0)
549 ret = err;
550 }
551
552 if (wait) {
553 int err = filemap_fdatawait(mapping);
554 if (ret == 0)
555 ret = err;
556 }
557
558 spin_lock(&inode_lock);
559 inode->i_state &= ~I_SYNC;
560 if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
561 if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
562 /*
563 * We didn't write back all the pages. nfs_writepages()
564 * sometimes bales out without doing anything.
565 */
566 inode->i_state |= I_DIRTY_PAGES;
567 if (wbc->nr_to_write <= 0) {
568 /*
569 * slice used up: queue for next turn
570 */
571 requeue_io(inode);
572 } else {
573 /*
574 * somehow blocked: retry later
575 */
576 requeue_io_wait(inode);
577 }
==> 578 } else if (inode->i_state & I_DIRTY) {
579 /*
580 * At least XFS will redirty the inode during the
581 * writeback (delalloc) and on io completion (isize).
582 */
==> 583 requeue_io_wait(inode);

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Wu Fengguang on 25 Dec 2009 02:40

On Thu, Dec 24, 2009 at 10:49:40PM +0800, Steve Rago wrote:
>
> On Thu, 2009-12-24 at 09:21 +0800, Wu Fengguang wrote:
>
> > > Commits and writes on the same inode need to be serialized for
> > > consistency (write can change the data and metadata; commit [fsync]
> > > needs to provide guarantees that the written data are stable). The
> > > performance problem arises because NFS writes are fast (they generally
> > > just deposit data into the server's page cache), but commits can take a
> >
> > Right.
> >
> > > long time, especially if there is a lot of cached data to flush to
> > > stable storage.
> >
> > "a lot of cached data to flush" is not likely with pdflush, since it
> > roughly send one COMMIT per 4MB WRITEs. So in average each COMMIT
> > syncs 4MB at the server side.
>
> Maybe on paper, but empirically I see anywhere from one commit per 8MB
> to one commit per 64 MB.

Thanks for the data. It seems that your CPU works faster than network,
so that non of the NFS writes (submitted by L543) return by the time we
try to COMMIT at L547.

543 ret = do_writepages(mapping, wbc);
544
545 /* Don't write the inode if only I_DIRTY_PAGES was set */
546 if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
547 int err = write_inode(inode, wait);

Thus pdflush is able to do several rounds of do_writepages() before
write_inode() can actually collect some pages to be COMMITed.

> >
> > Your patch adds another pre-pdlush async write logic, which greatly
> > reduced the number of COMMITs by pdflush. Can this be the major factor
> > of the performance gain?
>
> My patch removes pdflush from the picture almost entirely. See my
> comments below.

Yes for sequential async writes, so I said "pre-pdflush" :)

> >
> > Jan has been proposing to change the pdflush logic from
> >
> > loop over dirty files {
> > writeback 4MB
> > write_inode
> > }
> > to
> > loop over dirty files {
> > writeback all its dirty pages
> > write_inode
> > }
> >
> > This should also be able to reduce the COMMIT numbers. I wonder if
> > this (more general) approach can achieve the same performance gain.
>
> The pdflush mechanism is fine for random writes and small sequential
> writes, because it promotes concurrency -- instead of the application
> blocking while it tries to write and commit its data, the application
> can go on doing other more useful things, and the data gets flushed in
> the background. There is also a benefit if the application makes
> another modification to a page that is already dirty, because then
> multiple modifications are coalesced into a single write.

Right.

> However, the pdflush mechanism is wrong for large sequential writes
> (like a backup stream, for example). First, there is no concurrency to
> exploit -- the application is only going to dirty more pages, so
> removing the need for it to block writing the pages out only adds to the
> problem of memory pressure. Second, the application is not going to go
> back and modify a page it has already written, so leaving it in the
> cache for someone else to write provides no additional benefit.

Well, in general pdflush does more good than bad, that's why we need it.
The above two reasons are about "pdflush is not as helpful", but not
that it is wrong.

That said, I do agree to limit the per-file dirty pages for NFS -- because
it tends to flush before simple stat/read operations, which could be costly.

> Note that this assumes the application actually cares about the
> consistency of its data and will call fsync() when it is done. If the
> application doesn't call fsync(), then it doesn't matter when the pages
> are written to backing store, because the interface makes no guarantees
> in this case.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4 5
Prev: -tip: origin tree boot crash
Next: patch staging-ramzswap-remove-arm-specific-d-cache-hack.patch added to gregkh-2.6 tree