From: Wu Fengguang on
On Mon, Jul 19, 2010 at 09:11:30PM +0800, Mel Gorman wrote:
> There are a number of cases where pages get cleaned but two of concern
> to this patch are;
> o When dirtying pages, processes may be throttled to clean pages if
> dirty_ratio is not met.
> o Pages belonging to inodes dirtied longer than
> dirty_writeback_centisecs get cleaned.
>
> The problem for reclaim is that dirty pages can reach the end of the LRU
> if pages are being dirtied slowly so that neither the throttling cleans
> them or a flusher thread waking periodically.
>
> Background flush is already cleaning old or expired inodes first but the
> expire time is too far in the future at the time of page reclaim. To mitigate
> future problems, this patch wakes flusher threads to clean 1.5 times the
> number of dirty pages encountered by reclaimers. The reasoning is that pages
> were being dirtied at a roughly constant rate recently so if N dirty pages
> were encountered in this scan block, we are likely to see roughly N dirty
> pages again soon so try keep the flusher threads ahead of reclaim.
>
> This is unfortunately very hand-wavy but there is not really a good way of
> quantifying how bad it is when reclaim encounters dirty pages other than
> "down with that sort of thing". Similarly, there is not an obvious way of
> figuring how what percentage of dirty pages are old in terms of LRU-age and
> should be cleaned. Ideally, the background flushers would only be cleaning
> pages belonging to the zone being scanned but it's not clear if this would
> be of benefit (less IO) or not (potentially less efficient IO if an inode
> is scattered across multiple zones).
>
> Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
> ---
> mm/vmscan.c | 18 +++++++++++-------
> 1 files changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bc50937..5763719 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -806,6 +806,8 @@ restart_dirty:
> }
>
> if (PageDirty(page)) {
> + nr_dirty++;
> +
> /*
> * If the caller cannot writeback pages, dirty pages
> * are put on a separate list for cleaning by either
> @@ -814,7 +816,6 @@ restart_dirty:
> if (!reclaim_can_writeback(sc, page)) {
> list_add(&page->lru, &dirty_pages);
> unlock_page(page);
> - nr_dirty++;
> goto keep_dirty;
> }
>
> @@ -933,13 +934,16 @@ keep_dirty:
> VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> }
>
> + /*
> + * If reclaim is encountering dirty pages, it may be because
> + * dirty pages are reaching the end of the LRU even though
> + * the dirty_ratio may be satisified. In this case, wake
> + * flusher threads to pro-actively clean some pages
> + */
> + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);

Ah it's very possible that nr_dirty==0 here! Then you are hitting the
number of dirty pages down to 0 whether or not pageout() is called.

Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
normally a small number, much smaller than MAX_WRITEBACK_PAGES.
The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
for efficiency. And it seems good to let the flusher write much more
than nr_dirty pages to safeguard a reasonable large
vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
update the comments.

Thanks,
Fengguang

> if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
> - /*
> - * Wakeup a flusher thread to clean at least as many dirty
> - * pages as encountered by direct reclaim. Wait on congestion
> - * to throttle processes cleaning dirty pages
> - */
> - wakeup_flusher_threads(nr_dirty);
> + /* Throttle direct reclaimers cleaning pages */
> congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> /*
> --
> 1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Wu Fengguang on
> > > @@ -933,13 +934,16 @@ keep_dirty:
> > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > }
> > >
> > > + /*
> > > + * If reclaim is encountering dirty pages, it may be because
> > > + * dirty pages are reaching the end of the LRU even though
> > > + * the dirty_ratio may be satisified. In this case, wake
> > > + * flusher threads to pro-actively clean some pages
> > > + */
> > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> >
> > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > number of dirty pages down to 0 whether or not pageout() is called.
> >
>
> True, this has been fixed to only wakeup flusher threads when this is
> the file LRU, dirty pages have been encountered and the caller has
> sc->may_writepage.

OK.

> > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > for efficiency.
> > And it seems good to let the flusher write much more
> > than nr_dirty pages to safeguard a reasonable large
> > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > update the comments.
> >
>
> Ok, the reasoning had been to flush a number of pages that was related
> to the scanning rate but if that is inefficient for the flusher, I'll
> use MAX_WRITEBACK_PAGES.

It would be better to pass something like (nr_dirty * N).
MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
obviously too large as a parameter. When the batch size is increased
to 128MB, the writeback code may be improved somehow to not exceed the
nr_pages limit too much.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Wu Fengguang on
On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > > }
> > > > >
> > > > > + /*
> > > > > + * If reclaim is encountering dirty pages, it may be because
> > > > > + * dirty pages are reaching the end of the LRU even though
> > > > > + * the dirty_ratio may be satisified. In this case, wake
> > > > > + * flusher threads to pro-actively clean some pages
> > > > > + */
> > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > >
> > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > >
> > >
> > > True, this has been fixed to only wakeup flusher threads when this is
> > > the file LRU, dirty pages have been encountered and the caller has
> > > sc->may_writepage.
> >
> > OK.
> >
> > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > > for efficiency.
> > > > And it seems good to let the flusher write much more
> > > > than nr_dirty pages to safeguard a reasonable large
> > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > > update the comments.
> > > >
> > >
> > > Ok, the reasoning had been to flush a number of pages that was related
> > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > use MAX_WRITEBACK_PAGES.
> >
> > It would be better to pass something like (nr_dirty * N).
> > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > obviously too large as a parameter. When the batch size is increased
> > to 128MB, the writeback code may be improved somehow to not exceed the
> > nr_pages limit too much.
> >
>
> What might be a useful value for N? 1.5 appears to work reasonably well
> to create a window of writeback ahead of the scanner but it's a bit
> arbitrary.

I'd recommend N to be a large value. It's no longer relevant now since
we'll call the flusher to sync some range containing the target page.
The flusher will then choose an N large enough (eg. 4MB) for efficient
IO. It needs to be a large value, otherwise the vmscan code will
quickly run into dirty pages again..

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Wu Fengguang on
On Tue, Jul 27, 2010 at 09:35:13PM +0800, Mel Gorman wrote:
> On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote:
> > On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> > > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > > > > }
> > > > > > >
> > > > > > > + /*
> > > > > > > + * If reclaim is encountering dirty pages, it may be because
> > > > > > > + * dirty pages are reaching the end of the LRU even though
> > > > > > > + * the dirty_ratio may be satisified. In this case, wake
> > > > > > > + * flusher threads to pro-actively clean some pages
> > > > > > > + */
> > > > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > > >
> > > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > > > >
> > > > >
> > > > > True, this has been fixed to only wakeup flusher threads when this is
> > > > > the file LRU, dirty pages have been encountered and the caller has
> > > > > sc->may_writepage.
> > > >
> > > > OK.
> > > >
> > > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > > > > for efficiency.
> > > > > > And it seems good to let the flusher write much more
> > > > > > than nr_dirty pages to safeguard a reasonable large
> > > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > > > > update the comments.
> > > > > >
> > > > >
> > > > > Ok, the reasoning had been to flush a number of pages that was related
> > > > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > > > use MAX_WRITEBACK_PAGES.
> > > >
> > > > It would be better to pass something like (nr_dirty * N).
> > > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > > > obviously too large as a parameter. When the batch size is increased
> > > > to 128MB, the writeback code may be improved somehow to not exceed the
> > > > nr_pages limit too much.
> > > >
> > >
> > > What might be a useful value for N? 1.5 appears to work reasonably well
> > > to create a window of writeback ahead of the scanner but it's a bit
> > > arbitrary.
> >
> > I'd recommend N to be a large value. It's no longer relevant now since
> > we'll call the flusher to sync some range containing the target page.
> > The flusher will then choose an N large enough (eg. 4MB) for efficient
> > IO. It needs to be a large value, otherwise the vmscan code will
> > quickly run into dirty pages again..
> >
>
> Ok, I took the 4MB at face value to be a "reasonable amount that should
> not cause congestion".

Under memory pressure, the disk should be busy/congested anyway.
The big 4MB adds much work, however many of the pages may need to be
synced in the near future anyway. It also requires more time to do
the bigger IO, hence adding some latency, however the latency should
be a small factor comparing to the IO queue time (which will be long
for a busy disk).

Overall expectation is, the more efficient IO, the more progress :)

> The end result is
>
> #define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> #define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> static inline long nr_writeback_pages(unsigned long nr_dirty)
> {
> return laptop_mode ? 0 :
> min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> }
>
> nr_writeback_pages(nr_dirty) is what gets passed to
> wakeup_flusher_threads(). Does that seem sensible?

If you plan to keep wakeup_flusher_threads(), a simpler form may be
sufficient, eg.

laptop_mode ? 0 : (nr_dirty * 16)

On top of this, we may write another patch to convert the
wakeup_flusher_threads(bdi, nr_pages) call to some
bdi_start_inode_writeback(inode, offset) call, to start more oriented
writeback.

When talking the 4MB optimization, I was referring to the internal
implementation of bdi_start_inode_writeback(). Sorry for the missing
context in the previous email.

It may need a big patch to implement bdi_start_inode_writeback().
Would you like to try it, or leave the task to me?

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Wu Fengguang on
> If you plan to keep wakeup_flusher_threads(), a simpler form may be
> sufficient, eg.
>
> laptop_mode ? 0 : (nr_dirty * 16)

This number is not sensitive because the writeback code may well round
it up to some more IO efficient value (currently 4MB). AFAIK the
nr_pages parameters passed by all existing flusher callers are some
rule-of-thumb value, and far from being an exact number.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/