From: Mel Gorman on
On Mon, Jul 19, 2010 at 10:19:34AM -0400, Christoph Hellwig wrote:
> On Mon, Jul 19, 2010 at 02:11:26PM +0100, Mel Gorman wrote:
> > As the call-chain for writing anonymous pages is not expected to be deep
> > and they are not cleaned by flusher threads, anonymous pages are still
> > written back in direct reclaim.
>
> While it is not quite as deep as it skips the filesystem allocator and
> extent mapping code it can still be quite deep for swap given that it
> still has to traverse the whole I/O stack. Probably not worth worrying
> about now, but we need to keep an eye on it.
>

Agreed that we need to keep an eye on it. If this ever becomes a
problem, we're going to need to consider a flusher for anonymous pages.
If you look at the figures, we are still doing a lot of writeback of
anonymous pages. Granted, the layout of swap sucks anyway, but it's
something to keep in the back of our minds.

> The patch looks fine to me anyway.
>

Thanks.

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
From: Mel Gorman on
On Tue, Jul 20, 2010 at 12:14:20AM +0200, Johannes Weiner wrote:
> Hi Mel,
>
> On Mon, Jul 19, 2010 at 02:11:26PM +0100, Mel Gorman wrote:
> > @@ -406,7 +461,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
> > return PAGE_SUCCESS;
> > }
>
> Did you forget to delete the worker code from pageout() which is now
> in write_reclaim_page()?
>

Damn, a snarl from the final rebase when collapsing patches together,
which I missed when re-reading. Sorry :(

> > - return PAGE_CLEAN;
> > + return write_reclaim_page(page, mapping, sync_writeback);
> > }
> >
> > /*
> > @@ -639,6 +694,9 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> > pagevec_free(&freed_pvec);
> > }
> >
> > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> > +#define MAX_SWAP_CLEAN_WAIT 50
> > +
> > /*
> > * shrink_page_list() returns the number of reclaimed pages
> > */
> > @@ -646,13 +704,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > struct scan_control *sc,
> > enum pageout_io sync_writeback)
> > {
> > - LIST_HEAD(ret_pages);
> > LIST_HEAD(free_pages);
> > - int pgactivate = 0;
> > + LIST_HEAD(putback_pages);
> > + LIST_HEAD(dirty_pages);
> > + int pgactivate;
> > + int dirty_isolated = 0;
> > + unsigned long nr_dirty;
> > unsigned long nr_reclaimed = 0;
> >
> > + pgactivate = 0;
> > cond_resched();
> >
> > +restart_dirty:
> > + nr_dirty = 0;
> > while (!list_empty(page_list)) {
> > enum page_references references;
> > struct address_space *mapping;
> > @@ -741,7 +805,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > }
> > }
> >
> > - if (PageDirty(page)) {
> > + if (PageDirty(page)) {
> > + /*
> > + * If the caller cannot writeback pages, dirty pages
> > + * are put on a separate list for cleaning by either
> > + * a flusher thread or kswapd
> > + */
> > + if (!reclaim_can_writeback(sc, page)) {
> > + list_add(&page->lru, &dirty_pages);
> > + unlock_page(page);
> > + nr_dirty++;
> > + goto keep_dirty;
> > + }
> > +
> > if (references == PAGEREF_RECLAIM_CLEAN)
> > goto keep_locked;
> > if (!may_enter_fs)
> > @@ -852,13 +928,39 @@ activate_locked:
> > keep_locked:
> > unlock_page(page);
> > keep:
> > - list_add(&page->lru, &ret_pages);
> > + list_add(&page->lru, &putback_pages);
> > +keep_dirty:
> > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > }
> >
> > + if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
> > + /*
> > + * Wakeup a flusher thread to clean at least as many dirty
> > + * pages as encountered by direct reclaim. Wait on congestion
> > + * to throttle processes cleaning dirty pages
> > + */
> > + wakeup_flusher_threads(nr_dirty);
> > + congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +
> > + /*
> > + * As lumpy reclaim and memcg targets specific pages, wait on
> > + * them to be cleaned and try reclaim again.
> > + */
> > + if (sync_writeback == PAGEOUT_IO_SYNC ||
> > + sc->mem_cgroup != NULL) {
> > + dirty_isolated++;
> > + list_splice(&dirty_pages, page_list);
> > + INIT_LIST_HEAD(&dirty_pages);
> > + goto restart_dirty;
> > + }
> > + }
>
> I think it would turn out more natural to just return dirty pages on
> page_list and have the whole looping logic in shrink_inactive_list().
>
> Mixing dirty pages with other 'please try again' pages is probably not
> so bad anyway, it means we could retry all temporary unavailable pages
> instead of twiddling thumbs over that particular bunch of pages until
> the flushers catch up.
>
> What do you think?
>

It's worth considering! It won't be very tidy, but it's workable. The
reason it is not tidy is that dirty pages and pages that could not be
paged out will be on the same list, so the whole lot will need to be
recycled. We'd record in scan_control that there were pages that need to
be retried and loop based on that value. That is manageable though.
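
To make that concrete, the retry loop in the caller might look something
like this (an untested sketch; shrink_list_with_retry() and the
sc->nr_retry_pages field are hypothetical illustrations, not part of the
posted patch):

	/*
	 * Sketch: shrink_page_list() leaves pages it could not reclaim
	 * (dirty or otherwise temporarily unavailable) on page_list and
	 * counts them in a hypothetical sc->nr_retry_pages; the caller
	 * owns the retry loop.
	 */
	static unsigned long shrink_list_with_retry(struct list_head *page_list,
						    struct scan_control *sc)
	{
		unsigned long nr_reclaimed = 0;
		int retries = MAX_SWAP_CLEAN_WAIT;

		do {
			sc->nr_retry_pages = 0;	/* hypothetical field */
			nr_reclaimed += shrink_page_list(page_list, sc,
							 PAGEOUT_IO_SYNC);
			if (!sc->nr_retry_pages)
				break;

			/* ask flushers to clean what was skipped, then wait */
			wakeup_flusher_threads(sc->nr_retry_pages);
			congestion_wait(BLK_RW_ASYNC, HZ/10);
		} while (--retries);

		return nr_reclaimed;
	}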

The reason why I did it this way was because of lumpy reclaim and memcg
requiring specific pages. I considered lumpy reclaim to be the more common
case. It potentially removes a large number of contiguous pages from the
LRU. If some of those are dirty and it then selects more contiguous ranges
for reclaim, I'd worry that lumpy reclaim would thrash the system even
worse than it currently does when the system is under load. Hence, this
wait-and-retry loop is done instead of returning and isolating more pages.

For memcg, the concern was different. It depends on flusher threads to
clean its pages: kswapd does not operate on its list, and it cannot clean
pages itself because the stack may overflow. If the memcg has many dirty
pages, one process in the container could isolate all the dirty pages on
the list, forcing others to reclaim clean pages regardless of age. That
could be very disruptive, so this loop throttles processes that encounter
dirty pages instead of letting them isolate more.

For lumpy, I don't think we should return and isolate more pages; it's
too disruptive. For memcg, it could possibly benefit, but there is a
nasty corner case if the container is mostly dirty - it depends on how
memcg handles dirty_ratio, I guess.

Is it worth it at this point?

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
From: Mel Gorman on
On Wed, Jul 21, 2010 at 09:01:11PM +0900, KAMEZAWA Hiroyuki wrote:
> > <SNIP>
> >
> > /*
> > - * If we are direct reclaiming for contiguous pages and we do
> > + * If specific pages are needed such as with direct reclaiming
> > + * for contiguous pages or for memory containers and we do
> > * not reclaim everything in the list, try again and wait
> > - * for IO to complete. This will stall high-order allocations
> > - * but that should be acceptable to the caller
> > + * for IO to complete. This will stall callers that require
> > + * specific pages but it should be acceptable to the caller
> > */
> > - if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> > - sc->lumpy_reclaim_mode) {
> > - congestion_wait(BLK_RW_ASYNC, HZ/10);
> > + if (sc->may_writepage && !current_is_kswapd() &&
> > + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> > + int dirty_retry = MAX_SWAP_CLEAN_WAIT;
>
> Hmm, ok. I see what will happen to memcg.

Thanks

> But, hmm, memcg will have to choose whether to enter this routine based
> on the result of the 1st memory reclaim.
>

It has the option of ignoring pages being dirtied, but I worry that the
container could fill up with dirty pages while waiting for the flushers
to do something.

> >
> > - /*
> > - * The attempt at page out may have made some
> > - * of the pages active, mark them inactive again.
> > - */
> > - nr_active = clear_active_flags(&page_list, NULL);
> > - count_vm_events(PGDEACTIVATE, nr_active);
> > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> > + congestion_wait(BLK_RW_ASYNC, HZ/10);
> >
>
> Is the congestion_wait required? Where does the congestion happen?
> I'm sorry if you already have some other trick in another patch.
>

It's to wait for some of the IO to occur. congestion_wait() backs off
until the backing device is no longer congested or the timeout expires,
which gives the flusher threads a chance to make progress before the
retry.

> > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > + /*
> > + * The attempt at page out may have made some
> > + * of the pages active, mark them inactive again.
> > + */
> > + nr_active = clear_active_flags(&page_list, NULL);
> > + count_vm_events(PGDEACTIVATE, nr_active);
> > +
> > + nr_reclaimed += shrink_page_list(&page_list, sc,
> > + PAGEOUT_IO_SYNC, &nr_dirty);
> > + }
>
> Just a question. Does this PAGEOUT_IO_SYNC have some meaning?
>

Yes. In the pageout path it will wait for pages that are currently under
writeback to finish before trying to reclaim them.
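
Paraphrasing the relevant check in shrink_page_list() (from memory; this
is a sketch of the behaviour, not the exact code):

	if (PageWriteback(page)) {
		/*
		 * The sync pass waits for in-flight writeback to
		 * complete so the page can be reclaimed on this pass;
		 * the async pass just skips the page.
		 */
		if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
			wait_on_page_writeback(page);
		else
			goto keep_locked;
	}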

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
From: Mel Gorman on
On Wed, Jul 21, 2010 at 04:28:44PM +0200, Johannes Weiner wrote:
> On Wed, Jul 21, 2010 at 02:38:57PM +0100, Mel Gorman wrote:
> > Here is an updated version. Thanks very much
> >
> > ==== CUT HERE ====
> > vmscan: Do not writeback filesystem pages in direct reclaim
> >
> > When memory is under enough pressure, a process may enter direct
> > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > encountered during the scan, this page is written to backing storage using
> > mapping->writepage. This can result in very deep call stacks, particularly
> > if the target storage or filesystem is complex. It has already been observed
> > on XFS that the stack overflows, but the problem is not XFS-specific.
> >
> > This patch prevents direct reclaim writing back filesystem pages by checking
> > if current is kswapd or the page is anonymous before writing back. If the
> > dirty pages cannot be written back, they are placed back on the LRU lists
> > for either background writing by the BDI threads or kswapd. If in direct
> > lumpy reclaim and dirty pages are encountered, the process will stall for
> > the background flusher before trying to reclaim the pages again.
> >
> > As the call-chain for writing anonymous pages is not expected to be deep
> > and they are not cleaned by flusher threads, anonymous pages are still
> > written back in direct reclaim.
> >
> > Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
> > Acked-by: Rik van Riel <riel(a)redhat.com>
>
> Cool!
>
> Except for one last tiny thing...
>
> > @@ -858,7 +872,7 @@ keep:
> >
> > free_page_list(&free_pages);
> >
> > - list_splice(&ret_pages, page_list);
>
> This will lose all retry pages forever, I think.
>

Above this is

while (!list_empty(page_list)) {
...
}

page_list should be empty at that point, and keep_locked is already
putting the pages on ret_pages, so I think it's ok.

> > + *nr_still_dirty = nr_dirty;
> > count_vm_events(PGACTIVATE, pgactivate);
> > return nr_reclaimed;
> > }
>
> Otherwise,
> Reviewed-by: Johannes Weiner <hannes(a)cmpxchg.org>
>

Thanks!

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
From: Mel Gorman on
On Wed, Jul 21, 2010 at 04:39:56PM +0200, Johannes Weiner wrote:
> On Wed, Jul 21, 2010 at 03:31:19PM +0100, Mel Gorman wrote:
> > On Wed, Jul 21, 2010 at 04:28:44PM +0200, Johannes Weiner wrote:
> > > On Wed, Jul 21, 2010 at 02:38:57PM +0100, Mel Gorman wrote:
> > > > @@ -858,7 +872,7 @@ keep:
> > > >
> > > > free_page_list(&free_pages);
> > > >
> > > > - list_splice(&ret_pages, page_list);
> > >
> > > This will lose all retry pages forever, I think.
> > >
> >
> > Above this is
> >
> > while (!list_empty(page_list)) {
> > ...
> > }
> >
> > page_list should be empty and keep_locked is putting the pages on ret_pages
> > already so I think it's ok.
>
> But ret_pages is function-local. Putting them back on the then-empty
> page_list is to give them back to the caller; otherwise they are lost
> in a dead stack slot.
>

Bah, you're right, it is repaired now. /me slaps self. Thanks
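
For reference, the repaired tail of shrink_page_list() keeps the splice
so the caller gets the unreclaimed pages back for putback (roughly, using
the names from the quoted hunk):

	free_page_list(&free_pages);

	/* give pages we could not reclaim back to the caller */
	list_splice(&ret_pages, page_list);

	*nr_still_dirty = nr_dirty;
	count_vm_events(PGACTIVATE, pgactivate);
	return nr_reclaimed;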

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab