From: Trond Myklebust on
On Wed, 2009-12-23 at 19:05 +0100, Jan Kara wrote:
> On Wed 23-12-09 15:21:47, Trond Myklebust wrote:
> > @@ -474,6 +482,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> > }
> >
> > spin_lock(&inode_lock);
> > + /*
> > + * Special state for cleaning NFS unstable pages
> > + */
> > + if (inode->i_state & I_UNSTABLE_PAGES) {
> > + int err;
> > + inode->i_state &= ~I_UNSTABLE_PAGES;
> > + spin_unlock(&inode_lock);
> > + err = commit_unstable_pages(inode, wait);
> > + if (ret == 0)
> > + ret = err;
> > + spin_lock(&inode_lock);
> > + }
> I don't quite understand this chunk: We've called writeback_single_inode
> because it had some dirty pages. Thus it has I_DIRTY_DATASYNC set and a few
> lines above your chunk, we've called nfs_write_inode which sent commit to
> the server. Now here you sometimes send the commit again? What's the
> purpose?

We no longer set I_DIRTY_DATASYNC. We only set I_DIRTY_PAGES (and later
I_UNSTABLE_PAGES).

The point is that we now do the commit only _after_ we've sent all the
dirty pages, and waited for writeback to complete, whereas previously we
did it in the wrong order.

Cheers
Trond
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Trond Myklebust on
On Wed, 2009-12-23 at 15:16 -0500, Steve Rago wrote:
> On Wed, 2009-12-23 at 19:39 +0100, Jan Kara wrote:
> > On Tue 22-12-09 11:20:15, Steve Rago wrote:
> > >
> > > On Tue, 2009-12-22 at 13:25 +0100, Jan Kara wrote:
> > > > > I originally spent several months playing with the balance_dirty_pages
> > > > > algorithm. The main drawback is that it affects more than the inodes
> > > > > that the caller is writing and that the control of what to do is too
> > > > Can you be more specific here please?
> > >
> > > Sure; balance_dirty_pages() will schedule writeback by the flusher
> > > thread once the number of dirty pages exceeds dirty_background_ratio.
> > > The flusher thread calls writeback_inodes_wb() to flush all dirty inodes
> > > associated with the bdi. Similarly, the process dirtying the pages will
> > > call writeback_inodes_wbc() when its bdi threshold has been exceeded.
> > > The first problem is that these functions process all dirty inodes with
> > > the same backing device, which can lead to excess (duplicate) flushing
> > > of the same inode. Second, there is no distinction between pages that
> > > need to be committed and pages that have commits pending in
> > > NR_UNSTABLE_NFS/BDI_RECLAIMABLE (a page that has a commit pending won't
> > > be cleaned any faster by sending more commits). This tends to overstate
> > > the amount of memory that can be cleaned, leading to additional commit
> > > requests. Third, these functions generate a commit for each set of
> > > writes they do, which might not be appropriate. For background writing,
> > > you'd like to delay the commit as long as possible.
> > Ok, I get it. Thanks for explanation. The problem with more writing
> > threads bites us also for ordinary SATA drives (the IO pattern and thus
> > throughput gets worse and worse the more threads do writes). The plan is to
> > let only flusher thread do the IO and throttled thread in
> > balance_dirty_pages just waits for flusher thread to do the work. There
> > were even patches for this floating around but I'm not sure what's happened
> > to them. So that part of the problem should be easy to solve.
> > Another part is about sending commits - if we have just one thread doing
> > flushing, we have no problems with excessive commits for one inode. You're
> > right that we may want to avoid sending commits for background writeback
> > but until we send the commit, pages are just accumulating in the unstable
> > state, aren't they? So we might want to periodically send the commit for
> > the inode anyway to get rid of those pages. So from this point of view,
> > sending commit after each writepages call does not seem like a so bad idea
> > - although it might be more appropriate to send it some time after the
> > writepages call when we are not close to dirty limit so that server has
> > more time to do more natural "unforced" writeback...
>
> When to send the commit is a complex question to answer. If you delay
> it long enough, the server's flusher threads will have already done most
> of the work for you, so commits can be cheap, but you don't have access
> to the necessary information to figure this out. You can't delay it too
> long, though, because the unstable pages on the client will grow too
> large, creating memory pressure. I have a second patch, which I haven't
> posted yet, that adds feedback piggy-backed on the NFS write response,
> which allows the NFS client to free pages proactively. This greatly
> reduces the need to send commit messages, but it extends the protocol
> (in a backward-compatible manner), so it could be hard to convince
> people to accept.

There are only 2 cases when the client should send a COMMIT:
 1. When it hits a synchronisation point (i.e. when the user calls
    f/sync(), or close(), or when the user sets/clears a file
    lock).
 2. When memory pressure causes the VM to want to free up those
    pages that are marked as clean but unstable.

We should never be sending COMMIT in any other situation, since that
would imply that the client somehow has better information on how to
manage dirty pages on the server than the server's own VM.

Cheers
Trond
From: Trond Myklebust on
On Wed, 2009-12-23 at 18:13 -0500, Steve Rago wrote:
> On Wed, 2009-12-23 at 22:49 +0100, Trond Myklebust wrote:
>
> > > When to send the commit is a complex question to answer. If you delay
> > > it long enough, the server's flusher threads will have already done most
> > > of the work for you, so commits can be cheap, but you don't have access
> > > to the necessary information to figure this out. You can't delay it too
> > > long, though, because the unstable pages on the client will grow too
> > > large, creating memory pressure. I have a second patch, which I haven't
> > > posted yet, that adds feedback piggy-backed on the NFS write response,
> > > which allows the NFS client to free pages proactively. This greatly
> > > reduces the need to send commit messages, but it extends the protocol
> > > (in a backward-compatible manner), so it could be hard to convince
> > > people to accept.
> >
> > There are only 2 cases when the client should send a COMMIT:
> > 1. When it hits a synchronisation point (i.e. when the user calls
> >    f/sync(), or close(), or when the user sets/clears a file
> >    lock).
> > 2. When memory pressure causes the VM to want to free up those
> >    pages that are marked as clean but unstable.
> >
> > We should never be sending COMMIT in any other situation, since that
> > would imply that the client somehow has better information on how to
> > manage dirty pages on the server than the server's own VM.
> >
> > Cheers
> > Trond
>
> #2 is the difficult one. If you wait for memory pressure, you could
> have waited too long, because depending on the latency of the commit,
> you could run into low-memory situations. Then mayhem ensues, the
> oom-killer gets cranky (if you haven't disabled it), and stuff starts
> failing and/or hanging. So you need to be careful about setting the
> threshold for generating a commit so that the client doesn't run out of
> memory before the server can respond.

Right, but this is why we have limits on the total number of dirty pages
that can be kept in memory. The NFS unstable writes don't significantly
change that model, they just add an extra step: once all the dirty data
has been transmitted to the server, your COMMIT defines a
synchronisation point after which you know that the data you just sent
is all on disk. Given a reasonable NFS server implementation, it will
already have started the write out of that data, and so hopefully the
COMMIT operation itself will run reasonably quickly.

Any userland application with basic data integrity requirements will
have the same expectations. It will write out the data and then fsync()
at regular intervals. I've never heard of any expectations from
filesystem and VM designers that applications should be required to
fine-tune the length of those intervals in order to achieve decent
performance.

Cheers
Trond
From: Trond Myklebust on
On Thu, 2009-12-24 at 10:52 +0800, Wu Fengguang wrote:
> Trond,
>
> On Thu, Dec 24, 2009 at 03:12:54AM +0800, Trond Myklebust wrote:
> > On Wed, 2009-12-23 at 19:05 +0100, Jan Kara wrote:
> > > On Wed 23-12-09 15:21:47, Trond Myklebust wrote:
> > > > @@ -474,6 +482,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> > > > }
> > > >
> > > > spin_lock(&inode_lock);
> > > > + /*
> > > > + * Special state for cleaning NFS unstable pages
> > > > + */
> > > > + if (inode->i_state & I_UNSTABLE_PAGES) {
> > > > + int err;
> > > > + inode->i_state &= ~I_UNSTABLE_PAGES;
> > > > + spin_unlock(&inode_lock);
> > > > + err = commit_unstable_pages(inode, wait);
> > > > + if (ret == 0)
> > > > + ret = err;
> > > > + spin_lock(&inode_lock);
> > > > + }
> > > I don't quite understand this chunk: We've called writeback_single_inode
> > > because it had some dirty pages. Thus it has I_DIRTY_DATASYNC set and a few
> > > lines above your chunk, we've called nfs_write_inode which sent commit to
> > > the server. Now here you sometimes send the commit again? What's the
> > > purpose?
> >
> > We no longer set I_DIRTY_DATASYNC. We only set I_DIRTY_PAGES (and later
> > I_UNSTABLE_PAGES).
> >
> > The point is that we now do the commit only _after_ we've sent all the
> > dirty pages, and waited for writeback to complete, whereas previously we
> > did it in the wrong order.
>
> Sorry I still don't get it. The timing used to be:
>
> write 4MB ==> WRITE block 0 (ie. first 512KB)
>               WRITE block 1
>               WRITE block 2
>               WRITE block 3     ack from server for WRITE block 0 => mark 0 as unstable (inode marked need-commit)
>               WRITE block 4     ack from server for WRITE block 1 => mark 1 as unstable
>               WRITE block 5     ack from server for WRITE block 2 => mark 2 as unstable
>               WRITE block 6     ack from server for WRITE block 3 => mark 3 as unstable
>               WRITE block 7     ack from server for WRITE block 4 => mark 4 as unstable
>                                 ack from server for WRITE block 5 => mark 5 as unstable
> write_inode ==> COMMIT blocks 0-5
>                                 ack from server for WRITE block 6 => mark 6 as unstable (inode marked need-commit)
>                                 ack from server for WRITE block 7 => mark 7 as unstable
>
>                                 ack from server for COMMIT blocks 0-5 => mark 0-5 as clean
>
> write_inode ==> COMMIT blocks 6-7
>
>                                 ack from server for COMMIT blocks 6-7 => mark 6-7 as clean
>
> Note that the first COMMIT is submitted before receiving all ACKs for
> the previous writes, hence the second COMMIT is necessary. It seems
> that your patch does not improve the timing at all.

That would indicate that we're cycling through writeback_single_inode()
more than once. Why?

Trond

From: Trond Myklebust on
On Fri, 2009-12-25 at 13:56 +0800, Wu Fengguang wrote:
> On Thu, Dec 24, 2009 at 08:04:41PM +0800, Trond Myklebust wrote:
> > That would indicate that we're cycling through writeback_single_inode()
> > more than once. Why?
>
> Yes. The above sequence can happen for a 4MB sized dirty file.
> The first COMMIT is done by L547, while the second COMMIT will be
> scheduled either by __mark_inode_dirty() or by L583, depending on
> when the ACKs for the L543 writes that missed the L547 commit arrive:
> if an ACK arrives only after the L578 check, the inode will be queued
> into the b_dirty list, but if any ACK arrives between L547 and L578,
> the inode will enter b_more_io_wait (a to-be-introduced new dirty
> list).
>
> 537 dirty = inode->i_state & I_DIRTY;
> 538 inode->i_state |= I_SYNC;
> 539 inode->i_state &= ~I_DIRTY;
> 540
> 541 spin_unlock(&inode_lock);
> 542
> ==> 543 ret = do_writepages(mapping, wbc);
> 544
> 545 /* Don't write the inode if only I_DIRTY_PAGES was set */
> 546 if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> ==> 547 int err = write_inode(inode, wait);
> 548 if (ret == 0)
> 549 ret = err;
> 550 }
> 551
> 552 if (wait) {
> 553 int err = filemap_fdatawait(mapping);
> 554 if (ret == 0)
> 555 ret = err;
> 556 }
> 557
> 558 spin_lock(&inode_lock);
> 559 inode->i_state &= ~I_SYNC;
> 560 if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> 561 if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> 562 /*
> 563 * We didn't write back all the pages. nfs_writepages()
> 564 * sometimes bales out without doing anything.
> 565 */
> 566 inode->i_state |= I_DIRTY_PAGES;
> 567 if (wbc->nr_to_write <= 0) {
> 568 /*
> 569 * slice used up: queue for next turn
> 570 */
> 571 requeue_io(inode);
> 572 } else {
> 573 /*
> 574 * somehow blocked: retry later
> 575 */
> 576 requeue_io_wait(inode);
> 577 }
> ==> 578 } else if (inode->i_state & I_DIRTY) {
> 579 /*
> 580 * At least XFS will redirty the inode during the
> 581 * writeback (delalloc) and on io completion (isize).
> 582 */
> ==> 583 requeue_io_wait(inode);

Hi Fengguang,

Apologies for having taken time over this. Do you see any improvement
with the appended variant instead? It adds a new address_space_operation
in order to do the commit. Furthermore, it ignores the commit request if
the caller is just doing a WB_SYNC_NONE background flush, waiting
instead for the ensuing WB_SYNC_ALL request...

Cheers
Trond
--------------------------------------------------------------------------------------------------------
VFS: Add a new inode state: I_UNSTABLE_PAGES

From: Trond Myklebust <Trond.Myklebust(a)netapp.com>

Add a new inode state to enable the vfs to commit the nfs unstable pages to
stable storage once the write back of dirty pages is done.

Signed-off-by: Trond Myklebust <Trond.Myklebust(a)netapp.com>
---

fs/fs-writeback.c | 27 +++++++++++++++++++++++++--
fs/nfs/file.c | 1 +
fs/nfs/inode.c | 16 ----------------
fs/nfs/internal.h | 3 ++-
fs/nfs/super.c | 2 --
fs/nfs/write.c | 29 ++++++++++++++++++++++++++++-
include/linux/fs.h | 9 +++++++++
7 files changed, 65 insertions(+), 22 deletions(-)


diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 49bc1b8..24bc817 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -388,6 +388,17 @@ static int write_inode(struct inode *inode, int sync)
}

/*
+ * Commit the NFS unstable pages.
+ */
+static int commit_unstable_pages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ if (mapping->a_ops && mapping->a_ops->commit_unstable_pages)
+ return mapping->a_ops->commit_unstable_pages(mapping, wbc);
+ return 0;
+}
+
+/*
* Wait for writeback on an inode to complete.
*/
static void inode_wait_for_writeback(struct inode *inode)
@@ -474,6 +485,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
}

spin_lock(&inode_lock);
+ /*
+ * Special state for cleaning NFS unstable pages
+ */
+ if (inode->i_state & I_UNSTABLE_PAGES) {
+ int err;
+ inode->i_state &= ~I_UNSTABLE_PAGES;
+ spin_unlock(&inode_lock);
+ err = commit_unstable_pages(mapping, wbc);
+ if (ret == 0)
+ ret = err;
+ spin_lock(&inode_lock);
+ }
inode->i_state &= ~I_SYNC;
if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) {
@@ -481,7 +504,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* More pages get dirtied by a fast dirtier.
*/
goto select_queue;
- } else if (inode->i_state & I_DIRTY) {
+ } else if (inode->i_state & (I_DIRTY|I_UNSTABLE_PAGES)) {
/*
* At least XFS will redirty the inode during the
* writeback (delalloc) and on io completion (isize).
@@ -1050,7 +1073,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)

spin_lock(&inode_lock);
if ((inode->i_state & flags) != flags) {
- const int was_dirty = inode->i_state & I_DIRTY;
+ const int was_dirty = inode->i_state & (I_DIRTY|I_UNSTABLE_PAGES);

inode->i_state |= flags;

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 6b89132..67e50ac 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -526,6 +526,7 @@ const struct address_space_operations nfs_file_aops = {
.migratepage = nfs_migrate_page,
.launder_page = nfs_launder_page,
.error_remove_page = generic_error_remove_page,
+ .commit_unstable_pages = nfs_commit_unstable_pages,
};

/*
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index faa0918..8341709 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -97,22 +97,6 @@ u64 nfs_compat_user_ino64(u64 fileid)
return ino;
}

-int nfs_write_inode(struct inode *inode, int sync)
-{
- int ret;
-
- if (sync) {
- ret = filemap_fdatawait(inode->i_mapping);
- if (ret == 0)
- ret = nfs_commit_inode(inode, FLUSH_SYNC);
- } else
- ret = nfs_commit_inode(inode, 0);
- if (ret >= 0)
- return 0;
- __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
- return ret;
-}
-
void nfs_clear_inode(struct inode *inode)
{
/*
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 29e464d..7bb326f 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -211,7 +211,6 @@ extern int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask);
extern struct workqueue_struct *nfsiod_workqueue;
extern struct inode *nfs_alloc_inode(struct super_block *sb);
extern void nfs_destroy_inode(struct inode *);
-extern int nfs_write_inode(struct inode *,int);
extern void nfs_clear_inode(struct inode *);
#ifdef CONFIG_NFS_V4
extern void nfs4_clear_inode(struct inode *);
@@ -253,6 +252,8 @@ extern int nfs4_path_walk(struct nfs_server *server,
extern void nfs_read_prepare(struct rpc_task *task, void *calldata);

/* write.c */
+extern int nfs_commit_unstable_pages(struct address_space *mapping,
+ struct writeback_control *wbc);
extern void nfs_write_prepare(struct rpc_task *task, void *calldata);
#ifdef CONFIG_MIGRATION
extern int nfs_migrate_page(struct address_space *,
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index ce907ef..805c1a0 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -265,7 +265,6 @@ struct file_system_type nfs_xdev_fs_type = {
static const struct super_operations nfs_sops = {
.alloc_inode = nfs_alloc_inode,
.destroy_inode = nfs_destroy_inode,
- .write_inode = nfs_write_inode,
.statfs = nfs_statfs,
.clear_inode = nfs_clear_inode,
.umount_begin = nfs_umount_begin,
@@ -334,7 +333,6 @@ struct file_system_type nfs4_referral_fs_type = {
static const struct super_operations nfs4_sops = {
.alloc_inode = nfs_alloc_inode,
.destroy_inode = nfs_destroy_inode,
- .write_inode = nfs_write_inode,
.statfs = nfs_statfs,
.clear_inode = nfs4_clear_inode,
.umount_begin = nfs_umount_begin,
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index d171696..187f3a9 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req)
spin_unlock(&inode->i_lock);
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
- __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
+ mark_inode_unstable_pages(inode);
}

static int
@@ -1406,11 +1406,38 @@ int nfs_commit_inode(struct inode *inode, int how)
}
return res;
}
+
+int nfs_commit_unstable_pages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct inode *inode = mapping->host;
+ int flags = FLUSH_SYNC;
+ int ret;
+
+ /* Don't commit if this is just a non-blocking flush */
+ if (wbc->sync_mode != WB_SYNC_ALL) {
+ mark_inode_unstable_pages(inode);
+ return 0;
+ }
+ if (wbc->nonblocking)
+ flags = 0;
+ ret = nfs_commit_inode(inode, flags);
+ if (ret > 0)
+ return 0;
+ return ret;
+}
+
#else
static inline int nfs_commit_list(struct inode *inode, struct list_head *head, int how)
{
return 0;
}
+
+int nfs_commit_unstable_pages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ return 0;
+}
#endif

long nfs_sync_mapping_wait(struct address_space *mapping, struct writeback_control *wbc, int how)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9147ca8..ea0b7a3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -602,6 +602,8 @@ struct address_space_operations {
int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
unsigned long);
int (*error_remove_page)(struct address_space *, struct page *);
+ int (*commit_unstable_pages)(struct address_space *,
+ struct writeback_control *);
};

/*
@@ -1635,6 +1637,8 @@ struct super_operations {
#define I_CLEAR 64
#define __I_SYNC 7
#define I_SYNC (1 << __I_SYNC)
+#define __I_UNSTABLE_PAGES 9
+#define I_UNSTABLE_PAGES (1 << __I_UNSTABLE_PAGES)

#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

@@ -1649,6 +1653,11 @@ static inline void mark_inode_dirty_sync(struct inode *inode)
__mark_inode_dirty(inode, I_DIRTY_SYNC);
}

+static inline void mark_inode_unstable_pages(struct inode *inode)
+{
+ __mark_inode_dirty(inode, I_UNSTABLE_PAGES);
+}
+
/**
* inc_nlink - directly increment an inode's link count
* @inode: inode
