From: Steve Rago

On Wed, 2009-12-23 at 19:39 +0100, Jan Kara wrote:
> On Tue 22-12-09 11:20:15, Steve Rago wrote:
> >
> > On Tue, 2009-12-22 at 13:25 +0100, Jan Kara wrote:
> > > > I originally spent several months playing with the balance_dirty_pages
> > > > algorithm. The main drawback is that it affects more than the inodes
> > > > that the caller is writing and that the control of what to do is too
> > > Can you be more specific here please?
> >
> > Sure; balance_dirty_pages() will schedule writeback by the flusher
> > thread once the number of dirty pages exceeds dirty_background_ratio.
> > The flusher thread calls writeback_inodes_wb() to flush all dirty inodes
> > associated with the bdi. Similarly, the process dirtying the pages will
> > call writeback_inodes_wbc() when its bdi threshold has been exceeded.
> > The first problem is that these functions process all dirty inodes with
> > the same backing device, which can lead to excess (duplicate) flushing
> > of the same inode. Second, there is no distinction between pages that
> > need to be committed and pages that have commits pending in
> > NR_UNSTABLE_NFS/BDI_RECLAIMABLE (a page that has a commit pending won't
> > be cleaned any faster by sending more commits). This tends to overstate
> > the amount of memory that can be cleaned, leading to additional commit
> > requests. Third, these functions generate a commit for each set of
> > writes they do, which might not be appropriate. For background writing,
> > you'd like to delay the commit as long as possible.
> Ok, I get it. Thanks for explanation. The problem with more writing
> threads bites us also for ordinary SATA drives (the IO pattern and thus
> throughput gets worse and worse the more threads do writes). The plan is to
> let only flusher thread do the IO and throttled thread in
> balance_dirty_pages just waits for flusher thread to do the work. There
> were even patches for this floating around but I'm not sure what's happened
> to them. So that part of the problem should be easy to solve.
> Another part is about sending commits - if we have just one thread doing
> flushing, we have no problems with excessive commits for one inode. You're
> right that we may want to avoid sending commits for background writeback
> but until we send the commit, pages are just accumulating in the unstable
> state, aren't they? So we might want to periodically send the commit for
> the inode anyway to get rid of those pages. So from this point of view,
> sending a commit after each writepages call does not seem like such a bad idea
> - although it might be more appropriate to send it some time after the
> writepages call when we are not close to dirty limit so that server has
> more time to do more natural "unforced" writeback...

When to send the commit is a complex question to answer. If you delay
it long enough, the server's flusher threads will have already done most
of the work for you, so commits can be cheap, but you don't have access
to the necessary information to figure this out. You can't delay it too
long, though, because the unstable pages on the client will grow too
large, creating memory pressure. I have a second patch, which I haven't
posted yet, that adds feedback piggy-backed on the NFS write response,
which allows the NFS client to free pages proactively. This greatly
reduces the need to send commit messages, but it extends the protocol
(in a backward-compatible manner), so it could be hard to convince
people to accept it.
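
To illustrate the trade-off, here is a rough user-space sketch of the
client-side decision (hypothetical names, not code from either patch):
delay the COMMIT so the server's own flusher can clean the pages, but
always commit at sync points and once unstable pages pass a high-water
mark.

#include <stdbool.h>

/*
 * Illustrative model only, not kernel code and not part of the posted
 * patches.  All names and thresholds are made up.
 */
struct commit_state {
	long unstable_pages;	/* written to server, awaiting COMMIT */
	long commit_thresh;	/* high-water mark for unstable pages */
	bool sync_point;	/* fsync()/close()/lock forces a commit */
};

static bool should_send_commit(const struct commit_state *s)
{
	if (s->sync_point)
		return true;	/* correctness: must commit now */

	/* otherwise delay, but guard against unstable-page build-up */
	return s->unstable_pages > s->commit_thresh;
}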

>
> > > > Part of the patch does implement a heuristic write-behind. See where
> > > > nfs_wb_eager() is called.
> > > I believe that if we had per-bdi dirty_background_ratio and set it low
> > > for NFS's bdi, then the write-behind logic would not be needed
> > > (essentially the flusher thread should submit the writes to the server
> > > early).
> > >
> > Maybe so, but you still need something to prevent the process that is
> > dirtying pages from continuing, because a process can always write to
> > memory faster than writing to disk/network, so the flusher won't be able
> > to keep up.
> Yes, I agree that part is needed. But Fengguang already had patches in
> that direction if my memory serves me well.
>
> So to recap: If we block tasks in balance_dirty_pages until unstable
> pages are committed and make just one thread do the writing, what else is
> missing to make you happy? :)
> Honza

As long as the performance improves substantially, I'll be happy.

Part of the problem that isn't addressed by your summary is the
synchronous writes. With the eager writeback patch, these are removed
[see the short-circuit in wb_priority()]. I would have expected that
change to be controversial, but I've not heard any complaints (yet). If
the process or the bdi flusher is writing and committing regularly, then
pages should be recycled quickly and the change shouldn't matter, but
I'd need to run my systemtap scripts to make sure.

Steve


From: Steve Rago

On Wed, 2009-12-23 at 22:49 +0100, Trond Myklebust wrote:

> > When to send the commit is a complex question to answer. If you delay
> > it long enough, the server's flusher threads will have already done most
> > of the work for you, so commits can be cheap, but you don't have access
> > to the necessary information to figure this out. You can't delay it too
> > long, though, because the unstable pages on the client will grow too
> > large, creating memory pressure. I have a second patch, which I haven't
> > posted yet, that adds feedback piggy-backed on the NFS write response,
> > which allows the NFS client to free pages proactively. This greatly
> > reduces the need to send commit messages, but it extends the protocol
> > (in a backward-compatible manner), so it could be hard to convince
> > people to accept it.
>
> There are only 2 cases when the client should send a COMMIT:
> 1. When it hits a synchronisation point (i.e. when the user calls
> f/sync(), or close(), or when the user sets/clears a file
> lock).
> 2. When memory pressure causes the VM to want to free up those
> pages that are marked as clean but unstable.
>
> We should never be sending COMMIT in any other situation, since that
> would imply that the client somehow has better information on how to
> manage dirty pages on the server than the server's own VM.
>
> Cheers
> Trond

#2 is the difficult one. If you wait for memory pressure, you could
have waited too long, because depending on the latency of the commit,
you could run into low-memory situations. Then mayhem ensues, the
oom-killer gets cranky (if you haven't disabled it), and stuff starts
failing and/or hanging. So you need to be careful about setting the
threshold for generating a commit so that the client doesn't run out of
memory before the server can respond.
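
A back-of-the-envelope way to think about that threshold (illustrative
names and numbers only, not from any patch): leave enough headroom below
the limit to absorb whatever the writer can dirty during one COMMIT
round trip.

#include <stdio.h>

/*
 * Illustrative user-space model of picking a commit threshold with
 * headroom for the COMMIT latency.  All numbers are made up.
 */
int main(void)
{
	long limit_pages    = 262144;	/* unstable-page budget: 1GB of 4KB pages */
	long dirty_rate_pps = 25600;	/* writer dirties ~100MB/s = 25600 pages/s */
	double commit_rtt_s = 2.0;	/* worst-case COMMIT round trip, seconds */

	/* pages the writer can dirty while one COMMIT is in flight */
	long headroom = (long)(dirty_rate_pps * commit_rtt_s);
	long commit_thresh = limit_pages - headroom;

	if (commit_thresh < 0)
		commit_thresh = 0;	/* latency too high: commit as soon as possible */

	printf("send COMMIT once unstable pages exceed %ld (headroom %ld)\n",
	       commit_thresh, headroom);
	return 0;
}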

Steve


From: Wu Fengguang
On Tue, Dec 22, 2009 at 08:35:39PM +0800, Jan Kara wrote:
> > 2) NFS commit stops the pipeline because it sleeps and waits inside i_mutex,
> > which blocks all other NFSDs trying to write/writeback the inode.
> >
> > nfsd_sync:
> > take i_mutex
> > filemap_fdatawrite
> > filemap_fdatawait
> > drop i_mutex
> I believe this is unrelated to the problem Steve is trying to solve.
> When we get to doing sync writes the performance is busted, so we'd better
> not get there (unless the user asked for it, of course).

Yes, the first priority is always to reduce the COMMITs and the number of
writeback pages they submit under WB_SYNC_ALL. And I guess the
"increase write chunk beyond 128MB" patches can serve that goal well.

The i_mutex should impact NFS write performance for a single big copy in
this way: pdflush submits many (4MB write, 1 commit) pairs; because the
write and the commit each take i_mutex, this effectively limits the
server-side IO queue depth to <=4MB: the next 4MB of dirty data won't
reach the page cache until the previous 4MB is completely synced to disk.

There are two kinds of inefficiency here:
- the small queue depth
- the interleaved use of CPU/DISK:
        loop {
                write 4MB       => normally only CPU
                writeback 4MB   => mostly disk
        }

When writing many small dirty files _plus_ one big file, there will
still be interleaved write/writeback: the 4MB write will be broken
into 8 NFS writes with the default wsize=524288. So there may be one
nfsd doing COMMIT and another 7 nfsds waiting for the big file's i_mutex.
All 8 nfsds are "busy" and the pipeline is destroyed. Just a possibility.

> > If filemap_fdatawait() can be moved out of i_mutex (or just remove
> > the lock), we solve the root problem:
> >
> > nfsd_sync:
> > [take i_mutex]
> > filemap_fdatawrite => can also be blocked, but less a problem
> > [drop i_mutex]
> > filemap_fdatawait
> >
> > Maybe it's a dumb question, but what's the purpose of i_mutex here?
> > For correctness or to prevent livelock? I can imagine some livelock
> > problem here (current implementation can easily wait for extra
> > pages), however not too hard to fix.
> Generally, most filesystems take i_mutex during fsync to
> a) avoid all sorts of livelocking problems
> b) serialize fsyncs for one inode (mostly for simplicity)
> I don't see what advantage it would bring to get rid of i_mutex
> for fdatawait - only that maybe writers could proceed while we are
> waiting, but is that really the problem?

The i_mutex at least has some performance impact. Another one would be
the WB_SYNC_ALL. All are related to the COMMIT/sync write behavior.

Are there some other _direct_ causes?

Thanks,
Fengguang
From: Wu Fengguang
Hi Martin,

On Tue, Dec 22, 2009 at 09:01:46PM +0800, Martin Knoblauch wrote:
> ----- Original Message ----
>
> > From: Wu Fengguang <fengguang.wu(a)intel.com>
> > To: Steve Rago <sar(a)nec-labs.com>
> > Cc: Peter Zijlstra <peterz(a)infradead.org>; "linux-nfs(a)vger.kernel.org" <linux-nfs(a)vger.kernel.org>; "linux-kernel(a)vger.kernel.org" <linux-kernel(a)vger.kernel.org>; "Trond.Myklebust(a)netapp.com" <Trond.Myklebust(a)netapp.com>; jens.axboe <jens.axboe(a)oracle.com>
> > Sent: Tue, December 22, 2009 2:59:07 AM
> > Subject: Re: [PATCH] improve the performance of large sequential write NFS workloads
> >
>
> [big snip]
>
> >
> > In general it's reasonable to keep NFS per-file nr_dirty low, however
> > questionable to do per-file nr_writeback throttling. This does not
> > work well with the global limits - eg. when there are many dirty
> > files, the summed-up nr_writeback will still grow out of control.
> > And it's more likely to impact user visible responsiveness than
> > a global limit. But my opinion can be biased -- me have a patch to
> > do global NFS nr_writeback limit ;)
> >
>
>
> is that "NFS: introduce writeback wait queue", which you sent me a while ago and I did not test until now :-( ?

Yes it is - I've been puzzled by a bumpy NFS write problem, and
the information in this thread seems to be helpful for understanding it :)

Thanks,
Fengguang
---

NFS: introduce writeback wait queue

The generic writeback routines are departing from congestion_wait()
in preference to get_request_wait(), i.e. waiting on the block queues.

Introduce the missing writeback wait queue for NFS, otherwise its
writeback pages may grow out of control.

In particular, balance_dirty_pages() will exit after it pushes
write_chunk pages into the PG_writeback page pool, _OR_ when the
background writeback work quits. The latter is new behavior: the
background work can quit not only (normally) after dropping below the
background threshold, but also when it finds _zero_ dirty pages to
write. That case lets the number of PG_writeback pages grow unbounded
if it is not explicitly limited.

CC: Jens Axboe <jens.axboe(a)oracle.com>
CC: Chris Mason <chris.mason(a)oracle.com>
CC: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
CC: Peter Staubach <staubach(a)redhat.com>
CC: Trond Myklebust <Trond.Myklebust(a)netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu(a)intel.com>
---

The wait time and network throughput vary a lot! This is a major problem.
It means nfs_end_page_writeback() is not called smoothly over time,
even when there are plenty of PG_writeback pages on the client side.

[ 397.828509] write_bandwidth: comm=nfsiod pages=192 time=16ms
[ 397.850976] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 403.065244] write_bandwidth: comm=nfsiod pages=192 time=5212ms
[ 403.549134] write_bandwidth: comm=nfsiod pages=1536 time=144ms
[ 403.570717] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 403.595749] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 403.622171] write_bandwidth: comm=nfsiod pages=192 time=24ms
[ 403.651779] write_bandwidth: comm=nfsiod pages=192 time=28ms
[ 403.680543] write_bandwidth: comm=nfsiod pages=192 time=24ms
[ 403.712572] write_bandwidth: comm=nfsiod pages=192 time=28ms
[ 403.751552] write_bandwidth: comm=nfsiod pages=192 time=36ms
[ 403.785979] write_bandwidth: comm=nfsiod pages=192 time=28ms
[ 403.823995] write_bandwidth: comm=nfsiod pages=192 time=36ms
[ 403.858970] write_bandwidth: comm=nfsiod pages=192 time=32ms
[ 403.880786] write_bandwidth: comm=nfsiod pages=192 time=16ms
[ 403.902732] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 403.925925] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 403.952044] write_bandwidth: comm=nfsiod pages=258 time=24ms
[ 403.974006] write_bandwidth: comm=nfsiod pages=192 time=16ms
[ 403.995989] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 405.031049] write_bandwidth: comm=nfsiod pages=192 time=1032ms
[ 405.257635] write_bandwidth: comm=nfsiod pages=1536 time=192ms
[ 405.279069] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 405.300843] write_bandwidth: comm=nfsiod pages=192 time=16ms
[ 405.326031] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 405.350843] write_bandwidth: comm=nfsiod pages=192 time=24ms
[ 405.375160] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 409.331015] write_bandwidth: comm=nfsiod pages=192 time=3952ms
[ 409.587928] write_bandwidth: comm=nfsiod pages=1536 time=152ms
[ 409.610068] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 409.635736] write_bandwidth: comm=nfsiod pages=192 time=24ms

# vmmon -d 1 nr_writeback nr_dirty nr_unstable

nr_writeback nr_dirty nr_unstable
11227 41463 38044
11227 41463 38044
11227 41463 38044
11227 41463 38044
11045 53987 6490
11033 53120 8145
11195 52143 10886
11211 52144 10913
11211 52144 10913
11211 52144 10913
11056 56887 3876
11062 55298 8155
11214 54485 9838
11225 54461 9852
11225 54461 9852
11225 54461 4582
22342 35535 7823

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
0 0 9 92 0 0| 0 0 | 66B 306B| 0 0 |1003 377
0 1 39 60 0 1| 0 0 | 90k 1361k| 0 0 |1765 1599
0 15 12 43 0 31| 0 0 |2292k 34M| 0 0 | 12k 16k
0 0 16 84 0 0| 0 0 | 132B 306B| 0 0 |1003 376
0 0 43 57 0 0| 0 0 | 66B 306B| 0 0 |1004 376
0 7 25 55 0 13| 0 0 |1202k 18M| 0 0 |7331 8921
0 8 21 55 0 15| 0 0 |1195k 18M| 0 0 |5382 6579
0 0 38 62 0 0| 0 0 | 66B 306B| 0 0 |1002 371
0 0 33 67 0 0| 0 0 | 66B 306B| 0 0 |1003 376
0 14 20 41 0 24| 0 0 |1621k 24M| 0 0 |8549 10k
0 5 31 55 0 9| 0 0 | 769k 11M| 0 0 |4444 5180
0 0 18 82 0 0| 0 0 | 66B 568B| 0 0 |1004 377
0 1 41 54 0 3| 0 0 | 184k 2777k| 0 0 |2609 2619
1 13 22 43 0 22| 0 0 |1572k 23M| 0 0 |8138 10k
0 11 9 59 0 20| 0 0 |1861k 27M| 0 0 |9576 13k
0 5 23 66 0 5| 0 0 | 540k 8122k| 0 0 |2816 2885


fs/nfs/client.c | 2
fs/nfs/write.c | 92 +++++++++++++++++++++++++++++++-----
include/linux/nfs_fs_sb.h | 1
3 files changed, 84 insertions(+), 11 deletions(-)

--- linux.orig/fs/nfs/write.c 2009-11-06 09:52:23.000000000 +0800
+++ linux/fs/nfs/write.c 2009-11-06 09:52:27.000000000 +0800
@@ -187,11 +187,65 @@ static int wb_priority(struct writeback_
* NFS congestion control
*/

+#define NFS_WAIT_PAGES (1024L >> (PAGE_CACHE_SHIFT - 10))
int nfs_congestion_kb;

-#define NFS_CONGESTION_ON_THRESH (nfs_congestion_kb >> (PAGE_SHIFT-10))
-#define NFS_CONGESTION_OFF_THRESH \
- (NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+/*
+ * SYNC requests will block on (2*limit) and wakeup on (2*limit-NFS_WAIT_PAGES)
+ * ASYNC requests will block on (limit) and wakeup on (limit - NFS_WAIT_PAGES)
+ * In this way SYNC writes will never be blocked by ASYNC ones.
+ */
+
+static void nfs_set_congested(long nr, long limit,
+ struct backing_dev_info *bdi)
+{
+ if (nr > limit && !test_bit(BDI_async_congested, &bdi->state))
+ set_bdi_congested(bdi, BLK_RW_ASYNC);
+ else if (nr > 2 * limit && !test_bit(BDI_sync_congested, &bdi->state))
+ set_bdi_congested(bdi, BLK_RW_SYNC);
+}
+
+static void nfs_wait_contested(int is_sync,
+ struct backing_dev_info *bdi,
+ wait_queue_head_t *wqh)
+{
+ int waitbit = is_sync ? BDI_sync_congested : BDI_async_congested;
+ DEFINE_WAIT(wait);
+
+ if (!test_bit(waitbit, &bdi->state))
+ return;
+
+ for (;;) {
+ prepare_to_wait(&wqh[is_sync], &wait, TASK_UNINTERRUPTIBLE);
+ if (!test_bit(waitbit, &bdi->state))
+ break;
+
+ io_schedule();
+ }
+ finish_wait(&wqh[is_sync], &wait);
+}
+
+static void nfs_wakeup_congested(long nr, long limit,
+ struct backing_dev_info *bdi,
+ wait_queue_head_t *wqh)
+{
+ if (nr < 2*limit - min(limit/8, NFS_WAIT_PAGES)) {
+ if (test_bit(BDI_sync_congested, &bdi->state)) {
+ clear_bdi_congested(bdi, BLK_RW_SYNC);
+ smp_mb__after_clear_bit();
+ }
+ if (waitqueue_active(&wqh[BLK_RW_SYNC]))
+ wake_up(&wqh[BLK_RW_SYNC]);
+ }
+ if (nr < limit - min(limit/8, NFS_WAIT_PAGES)) {
+ if (test_bit(BDI_async_congested, &bdi->state)) {
+ clear_bdi_congested(bdi, BLK_RW_ASYNC);
+ smp_mb__after_clear_bit();
+ }
+ if (waitqueue_active(&wqh[BLK_RW_ASYNC]))
+ wake_up(&wqh[BLK_RW_ASYNC]);
+ }
+}

static int nfs_set_page_writeback(struct page *page)
{
@@ -201,11 +255,9 @@ static int nfs_set_page_writeback(struct
struct inode *inode = page->mapping->host;
struct nfs_server *nfss = NFS_SERVER(inode);

- if (atomic_long_inc_return(&nfss->writeback) >
- NFS_CONGESTION_ON_THRESH) {
- set_bdi_congested(&nfss->backing_dev_info,
- BLK_RW_ASYNC);
- }
+ nfs_set_congested(atomic_long_inc_return(&nfss->writeback),
+ nfs_congestion_kb >> (PAGE_SHIFT-10),
+ &nfss->backing_dev_info);
}
return ret;
}
@@ -216,8 +268,11 @@ static void nfs_end_page_writeback(struc
struct nfs_server *nfss = NFS_SERVER(inode);

end_page_writeback(page);
- if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
- clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
+ nfs_wakeup_congested(atomic_long_dec_return(&nfss->writeback),
+ nfs_congestion_kb >> (PAGE_SHIFT-10),
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
}

static struct nfs_page *nfs_find_and_lock_request(struct page *page)
@@ -309,19 +364,34 @@ static int nfs_writepage_locked(struct p

int nfs_writepage(struct page *page, struct writeback_control *wbc)
{
+ struct inode *inode = page->mapping->host;
+ struct nfs_server *nfss = NFS_SERVER(inode);
int ret;

ret = nfs_writepage_locked(page, wbc);
unlock_page(page);
+
+ nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
+
return ret;
}

-static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data)
+static int nfs_writepages_callback(struct page *page,
+ struct writeback_control *wbc, void *data)
{
+ struct inode *inode = page->mapping->host;
+ struct nfs_server *nfss = NFS_SERVER(inode);
int ret;

ret = nfs_do_writepage(page, wbc, data);
unlock_page(page);
+
+ nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
+
return ret;
}

--- linux.orig/include/linux/nfs_fs_sb.h 2009-11-06 09:22:30.000000000 +0800
+++ linux/include/linux/nfs_fs_sb.h 2009-11-06 09:52:27.000000000 +0800
@@ -108,6 +108,7 @@ struct nfs_server {
struct nfs_iostats * io_stats; /* I/O statistics */
struct backing_dev_info backing_dev_info;
atomic_long_t writeback; /* number of writeback pages */
+ wait_queue_head_t writeback_wait[2];
int flags; /* various flags */
unsigned int caps; /* server capabilities */
unsigned int rsize; /* read size */
--- linux.orig/fs/nfs/client.c 2009-11-06 09:22:30.000000000 +0800
+++ linux/fs/nfs/client.c 2009-11-06 09:52:27.000000000 +0800
@@ -991,6 +991,8 @@ static struct nfs_server *nfs_alloc_serv
INIT_LIST_HEAD(&server->master_link);

atomic_set(&server->active, 0);
+ init_waitqueue_head(&server->writeback_wait[BLK_RW_SYNC]);
+ init_waitqueue_head(&server->writeback_wait[BLK_RW_ASYNC]);

server->io_stats = nfs_alloc_iostats();
if (!server->io_stats) {
From: Wu Fengguang
Trond,

On Thu, Dec 24, 2009 at 03:12:54AM +0800, Trond Myklebust wrote:
> On Wed, 2009-12-23 at 19:05 +0100, Jan Kara wrote:
> > On Wed 23-12-09 15:21:47, Trond Myklebust wrote:
> > > @@ -474,6 +482,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> > > }
> > >
> > > spin_lock(&inode_lock);
> > > + /*
> > > + * Special state for cleaning NFS unstable pages
> > > + */
> > > + if (inode->i_state & I_UNSTABLE_PAGES) {
> > > + int err;
> > > + inode->i_state &= ~I_UNSTABLE_PAGES;
> > > + spin_unlock(&inode_lock);
> > > + err = commit_unstable_pages(inode, wait);
> > > + if (ret == 0)
> > > + ret = err;
> > > + spin_lock(&inode_lock);
> > > + }
> > I don't quite understand this chunk: We've called writeback_single_inode
> > because it had some dirty pages. Thus it has I_DIRTY_DATASYNC set and a few
> > lines above your chunk, we've called nfs_write_inode which sent commit to
> > the server. Now here you sometimes send the commit again? What's the
> > purpose?
>
> We no longer set I_DIRTY_DATASYNC. We only set I_DIRTY_PAGES (and later
> I_UNSTABLE_PAGES).
>
> The point is that we now do the commit only _after_ we've sent all the
> dirty pages, and waited for writeback to complete, whereas previously we
> did it in the wrong order.

Sorry I still don't get it. The timing used to be:

write 4MB   ==>  WRITE block 0 (ie. first 512KB)
                 WRITE block 1
                 WRITE block 2
                 WRITE block 3      ack from server for WRITE block 0 => mark 0 as unstable (inode marked need-commit)
                 WRITE block 4      ack from server for WRITE block 1 => mark 1 as unstable
                 WRITE block 5      ack from server for WRITE block 2 => mark 2 as unstable
                 WRITE block 6      ack from server for WRITE block 3 => mark 3 as unstable
                 WRITE block 7      ack from server for WRITE block 4 => mark 4 as unstable
                                    ack from server for WRITE block 5 => mark 5 as unstable
write_inode ==>  COMMIT blocks 0-5
                                    ack from server for WRITE block 6 => mark 6 as unstable (inode marked need-commit)
                                    ack from server for WRITE block 7 => mark 7 as unstable

                                    ack from server for COMMIT blocks 0-5 => mark 0-5 as clean

write_inode ==>  COMMIT blocks 6-7

                                    ack from server for COMMIT blocks 6-7 => mark 6-7 as clean

Note that the first COMMIT is submitted before receiving all ACKs for
the previous writes, hence the second COMMIT is necessary. It seems
that your patch does not improve the timing at all.
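
Put differently, to make one COMMIT cover all eight blocks, the commit
would have to be deferred until the last WRITE ack has arrived. A
minimal sketch of that condition (hypothetical helpers only, not code
from your patch):

#include <stdbool.h>

struct inode_io_state {
	long writes_in_flight;	/* WRITEs sent but not yet acked */
	long unstable_pages;	/* acked WRITEs still awaiting COMMIT */
};

/*
 * Only send the COMMIT once every outstanding WRITE has been
 * acknowledged, so a single COMMIT covers blocks 0-7 in the timeline
 * above and the second COMMIT becomes unnecessary.
 */
static bool ready_to_commit(const struct inode_io_state *s)
{
	return s->writes_in_flight == 0 && s->unstable_pages > 0;
}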

Thanks,
Fengguang
