From: Christoph Hellwig
On Mon, Jul 19, 2010 at 02:11:26PM +0100, Mel Gorman wrote:
> As the call-chain for writing anonymous pages is not expected to be deep
> and they are not cleaned by flusher threads, anonymous pages are still
> written back in direct reclaim.

While it is not quite as deep, since it skips the filesystem allocator
and extent-mapping code, it can still be quite deep for swap, given
that it still has to traverse the whole I/O stack. Probably not worth
worrying about now, but we need to keep an eye on it.
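
For reference, the swap writeout path from reclaim is roughly the
following; an abbreviated, illustrative sketch only, not an exhaustive
trace:

	shrink_page_list()
	  pageout()
	    swap_writepage()			/* swap_aops->writepage */
	      get_swap_bio() + submit_bio()
	        generic_make_request()
	          block layer, stacking drivers (md/dm), disk driver

so there is still a full block stack below reclaim even without a
filesystem in the picture.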

The patch looks fine to me anyway.

From: Rik van Riel
On 07/19/2010 09:11 AM, Mel Gorman wrote:
> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem is complex. It has already been observed
> on XFS that the stack overflows, but the problem is not XFS-specific.
>
> This patch prevents direct reclaim writing back filesystem pages by checking
> if current is kswapd or the page is anonymous before writing back. If the
> dirty pages cannot be written back, they are placed back on the LRU lists
> for either background writing by the BDI threads or kswapd. If in direct
> lumpy reclaim and dirty pages are encountered, the process will stall for
> the background flusher before trying to reclaim the pages again.
>
> As the call-chain for writing anonymous pages is not expected to be deep
> and they are not cleaned by flusher threads, anonymous pages are still
> written back in direct reclaim.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Acked-by: Rik van Riel <riel@redhat.com>
From: Johannes Weiner
Hi Mel,

On Mon, Jul 19, 2010 at 02:11:26PM +0100, Mel Gorman wrote:
> @@ -406,7 +461,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
> return PAGE_SUCCESS;
> }

Did you forget to delete the worker code from pageout(), which is now
in write_reclaim_page()?

> - return PAGE_CLEAN;
> + return write_reclaim_page(page, mapping, sync_writeback);
> }
>
> /*
> @@ -639,6 +694,9 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> pagevec_free(&freed_pvec);
> }
>
> +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> +#define MAX_SWAP_CLEAN_WAIT 50
> +
> /*
> * shrink_page_list() returns the number of reclaimed pages
> */
> @@ -646,13 +704,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> struct scan_control *sc,
> enum pageout_io sync_writeback)
> {
> - LIST_HEAD(ret_pages);
> LIST_HEAD(free_pages);
> - int pgactivate = 0;
> + LIST_HEAD(putback_pages);
> + LIST_HEAD(dirty_pages);
> + int pgactivate;
> + int dirty_isolated = 0;
> + unsigned long nr_dirty;
> unsigned long nr_reclaimed = 0;
>
> + pgactivate = 0;
> cond_resched();
>
> +restart_dirty:
> + nr_dirty = 0;
> while (!list_empty(page_list)) {
> enum page_references references;
> struct address_space *mapping;
> @@ -741,7 +805,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> }
> }
>
> - if (PageDirty(page)) {
> + if (PageDirty(page)) {
> + /*
> + * If the caller cannot writeback pages, dirty pages
> + * are put on a separate list for cleaning by either
> + * a flusher thread or kswapd
> + */
> + if (!reclaim_can_writeback(sc, page)) {
> + list_add(&page->lru, &dirty_pages);
> + unlock_page(page);
> + nr_dirty++;
> + goto keep_dirty;
> + }
> +
> if (references == PAGEREF_RECLAIM_CLEAN)
> goto keep_locked;
> if (!may_enter_fs)
> @@ -852,13 +928,39 @@ activate_locked:
> keep_locked:
> unlock_page(page);
> keep:
> - list_add(&page->lru, &ret_pages);
> + list_add(&page->lru, &putback_pages);
> +keep_dirty:
> VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> }
>
> + if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
> + /*
> + * Wakeup a flusher thread to clean at least as many dirty
> + * pages as encountered by direct reclaim. Wait on congestion
> + * to throttle processes cleaning dirty pages
> + */
> + wakeup_flusher_threads(nr_dirty);
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> + /*
> + * As lumpy reclaim and memcg targets specific pages, wait on
> + * them to be cleaned and try reclaim again.
> + */
> + if (sync_writeback == PAGEOUT_IO_SYNC ||
> + sc->mem_cgroup != NULL) {
> + dirty_isolated++;
> + list_splice(&dirty_pages, page_list);
> + INIT_LIST_HEAD(&dirty_pages);
> + goto restart_dirty;
> + }
> + }
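
As an aside, reclaim_can_writeback() itself is not quoted anywhere in
this thread; going by the changelog ("checking if current is kswapd or
the page is anonymous"), it presumably boils down to something like the
following sketch (sc is unused here; the real helper may also look at
sc->mem_cgroup):

	static bool reclaim_can_writeback(struct scan_control *sc,
						struct page *page)
	{
		/*
		 * Anonymous pages are not cleaned by flusher threads, so
		 * reclaim must still write them itself; file pages are
		 * deferred to kswapd or the flushers to keep the direct
		 * reclaim stack shallow.
		 */
		return !page_is_file_cache(page) || current_is_kswapd();
	}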

I think it would turn out more natural to just return dirty pages on
page_list and have the whole looping logic in shrink_inactive_list().

Mixing dirty pages with the other 'please try again' pages is probably
not so bad anyway; it means we could retry all temporarily unavailable
pages instead of twiddling our thumbs over that particular bunch of
pages until the flushers catch up.
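
Concretely, as an untested sketch of the idea (dirty_retry and nr_dirty
are made-up names), shrink_page_list() would leave the dirty pages on
page_list and only report their count, and the caller would loop:

	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
					&nr_dirty);
	while (nr_dirty && dirty_retry--) {
		/* kick the flushers for pages reclaim refused to write */
		wakeup_flusher_threads(nr_dirty);
		congestion_wait(BLK_RW_ASYNC, HZ/10);
		nr_reclaimed += shrink_page_list(&page_list, sc,
						PAGEOUT_IO_SYNC, &nr_dirty);
	}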

What do you think?
From: KAMEZAWA Hiroyuki
On Wed, 21 Jul 2010 12:52:50 +0100
Mel Gorman <mel@csn.ul.ie> wrote:


> ==== CUT HERE ====
> [PATCH] vmscan: Do not writeback filesystem pages in direct reclaim
>
> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem is complex. It has already been observed
> on XFS that the stack overflows, but the problem is not XFS-specific.
>
> This patch prevents direct reclaim writing back filesystem pages by checking
> if current is kswapd or the page is anonymous before writing back. If the
> dirty pages cannot be written back, they are placed back on the LRU lists
> for either background writing by the BDI threads or kswapd. If in direct
> lumpy reclaim and dirty pages are encountered, the process will stall for
> the background flusher before trying to reclaim the pages again.
>
> As the call-chain for writing anonymous pages is not expected to be deep
> and they are not cleaned by flusher threads, anonymous pages are still
> written back in direct reclaim.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6587155..e3a5816 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -323,6 +323,51 @@ typedef enum {
> PAGE_CLEAN,
> } pageout_t;
>
> +int write_reclaim_page(struct page *page, struct address_space *mapping,
> + enum pageout_io sync_writeback)
> +{
> + int res;
> + struct writeback_control wbc = {
> + .sync_mode = WB_SYNC_NONE,
> + .nr_to_write = SWAP_CLUSTER_MAX,
> + .range_start = 0,
> + .range_end = LLONG_MAX,
> + .nonblocking = 1,
> + .for_reclaim = 1,
> + };
> +
> + if (!clear_page_dirty_for_io(page))
> + return PAGE_CLEAN;
> +
> + SetPageReclaim(page);
> + res = mapping->a_ops->writepage(page, &wbc);
> + if (res < 0)
> + handle_write_error(mapping, page, res);
> + if (res == AOP_WRITEPAGE_ACTIVATE) {
> + ClearPageReclaim(page);
> + return PAGE_ACTIVATE;
> + }
> +
> + /*
> + * Wait on writeback if requested to. This happens when
> + * direct reclaiming a large contiguous area and the
> + * first attempt to free a range of pages fails.
> + */
> + if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
> + wait_on_page_writeback(page);
> +
> + if (!PageWriteback(page)) {
> + /* synchronous write or broken a_ops? */
> + ClearPageReclaim(page);
> + }
> + trace_mm_vmscan_writepage(page,
> + page_is_file_cache(page),
> + sync_writeback == PAGEOUT_IO_SYNC);
> + inc_zone_page_state(page, NR_VMSCAN_WRITE);
> +
> + return PAGE_SUCCESS;
> +}
> +
> /*
> * pageout is called by shrink_page_list() for each dirty page.
> * Calls ->writepage().
> @@ -367,46 +412,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
> if (!may_write_to_queue(mapping->backing_dev_info))
> return PAGE_KEEP;
>
> - if (clear_page_dirty_for_io(page)) {
> - int res;
> - struct writeback_control wbc = {
> - .sync_mode = WB_SYNC_NONE,
> - .nr_to_write = SWAP_CLUSTER_MAX,
> - .range_start = 0,
> - .range_end = LLONG_MAX,
> - .nonblocking = 1,
> - .for_reclaim = 1,
> - };
> -
> - SetPageReclaim(page);
> - res = mapping->a_ops->writepage(page, &wbc);
> - if (res < 0)
> - handle_write_error(mapping, page, res);
> - if (res == AOP_WRITEPAGE_ACTIVATE) {
> - ClearPageReclaim(page);
> - return PAGE_ACTIVATE;
> - }
> -
> - /*
> - * Wait on writeback if requested to. This happens when
> - * direct reclaiming a large contiguous area and the
> - * first attempt to free a range of pages fails.
> - */
> - if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
> - wait_on_page_writeback(page);
> -
> - if (!PageWriteback(page)) {
> - /* synchronous write or broken a_ops? */
> - ClearPageReclaim(page);
> - }
> - trace_mm_vmscan_writepage(page,
> - page_is_file_cache(page),
> - sync_writeback == PAGEOUT_IO_SYNC);
> - inc_zone_page_state(page, NR_VMSCAN_WRITE);
> - return PAGE_SUCCESS;
> - }
> -
> - return PAGE_CLEAN;
> + return write_reclaim_page(page, mapping, sync_writeback);
> }
>
> /*
> @@ -639,18 +645,25 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> pagevec_free(&freed_pvec);
> }
>
> +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> +#define MAX_SWAP_CLEAN_WAIT 50
> +
> /*
> * shrink_page_list() returns the number of reclaimed pages
> */
> static unsigned long shrink_page_list(struct list_head *page_list,
> struct scan_control *sc,
> - enum pageout_io sync_writeback)
> + enum pageout_io sync_writeback,
> + unsigned long *nr_still_dirty)
> {
> - LIST_HEAD(ret_pages);
> LIST_HEAD(free_pages);
> - int pgactivate = 0;
> + LIST_HEAD(putback_pages);
> + LIST_HEAD(dirty_pages);
> + int pgactivate;
> + unsigned long nr_dirty = 0;
> unsigned long nr_reclaimed = 0;
>
> + pgactivate = 0;
> cond_resched();
>
> while (!list_empty(page_list)) {
> @@ -741,7 +754,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> }
> }
>
> - if (PageDirty(page)) {
> + if (PageDirty(page)) {
> + /*
> + * Only kswapd can writeback filesystem pages to
> + * avoid risk of stack overflow
> + */
> + if (page_is_file_cache(page) && !current_is_kswapd()) {
> + list_add(&page->lru, &dirty_pages);
> + unlock_page(page);
> + nr_dirty++;
> + goto keep_dirty;
> + }
> +
> if (references == PAGEREF_RECLAIM_CLEAN)
> goto keep_locked;
> if (!may_enter_fs)
> @@ -852,13 +876,19 @@ activate_locked:
> keep_locked:
> unlock_page(page);
> keep:
> - list_add(&page->lru, &ret_pages);
> + list_add(&page->lru, &putback_pages);
> +keep_dirty:
> VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> }
>
> free_page_list(&free_pages);
>
> - list_splice(&ret_pages, page_list);
> + if (nr_dirty) {
> + *nr_still_dirty = nr_dirty;
> + list_splice(&dirty_pages, page_list);
> + }
> + list_splice(&putback_pages, page_list);
> +
> count_vm_events(PGACTIVATE, pgactivate);
> return nr_reclaimed;
> }
> @@ -1245,6 +1275,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> unsigned long nr_active;
> unsigned long nr_anon;
> unsigned long nr_file;
> + unsigned long nr_dirty;
>
> while (unlikely(too_many_isolated(zone, file, sc))) {
> congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1293,26 +1324,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>
> spin_unlock_irq(&zone->lru_lock);
>
> - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> + &nr_dirty);
>
> /*
> - * If we are direct reclaiming for contiguous pages and we do
> + * If specific pages are needed such as with direct reclaiming
> + * for contiguous pages or for memory containers and we do
> * not reclaim everything in the list, try again and wait
> - * for IO to complete. This will stall high-order allocations
> - * but that should be acceptable to the caller
> + * for IO to complete. This will stall callers that require
> + * specific pages but it should be acceptable to the caller
> */
> - if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> - sc->lumpy_reclaim_mode) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + if (sc->may_writepage && !current_is_kswapd() &&
> + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> + int dirty_retry = MAX_SWAP_CLEAN_WAIT;

Hmm, ok. I see what will happen to memcg.
But, hmm, memcg will have to decide whether to enter this routine based
on the result of the first memory reclaim.

>
> - /*
> - * The attempt at page out may have made some
> - * of the pages active, mark them inactive again.
> - */
> - nr_active = clear_active_flags(&page_list, NULL);
> - count_vm_events(PGDEACTIVATE, nr_active);
> + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
>
Is the congestion_wait() required? Where does the congestion happen?
I'm sorry if you already have some other trick in another patch.

> - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> + /*
> + * The attempt at page out may have made some
> + * of the pages active, mark them inactive again.
> + */
> + nr_active = clear_active_flags(&page_list, NULL);
> + count_vm_events(PGDEACTIVATE, nr_active);
> +
> + nr_reclaimed += shrink_page_list(&page_list, sc,
> + PAGEOUT_IO_SYNC, &nr_dirty);
> + }
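
As an aside, the arithmetic behind the "up to 5 seconds" comment:
MAX_SWAP_CLEAN_WAIT is 50 iterations and congestion_wait(BLK_RW_ASYNC,
HZ/10) waits up to 100ms each, so the loop stalls for at most
50 * 100ms = 5 seconds.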

Just a question: does this PAGEOUT_IO_SYNC have any meaning?

Thanks,
-Kame

From: Johannes Weiner
On Wed, Jul 21, 2010 at 02:38:57PM +0100, Mel Gorman wrote:
> Here is an updated version. Thanks very much
>
> ==== CUT HERE ====
> vmscan: Do not writeback filesystem pages in direct reclaim
>
> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem is complex. It has already been observed
> on XFS that the stack overflows, but the problem is not XFS-specific.
>
> This patch prevents direct reclaim writing back filesystem pages by checking
> if current is kswapd or the page is anonymous before writing back. If the
> dirty pages cannot be written back, they are placed back on the LRU lists
> for either background writing by the BDI threads or kswapd. If in direct
> lumpy reclaim and dirty pages are encountered, the process will stall for
> the background flusher before trying to reclaim the pages again.
>
> As the call-chain for writing anonymous pages is not expected to be deep
> and they are not cleaned by flusher threads, anonymous pages are still
> written back in direct reclaim.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>

Cool!

Except for one last tiny thing...

> @@ -858,7 +872,7 @@ keep:
>
> free_page_list(&free_pages);
>
> - list_splice(&ret_pages, page_list);

This will lose all retry pages forever, I think.

> + *nr_still_dirty = nr_dirty;
> count_vm_events(PGACTIVATE, pgactivate);
> return nr_reclaimed;
> }
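
Presumably the intent was to keep the splice next to the new
accounting. A sketch of the fixed tail, using the putback list name
from the earlier revision (treat the names as illustrative):

	free_page_list(&free_pages);

	/* Put the "try again later" pages back for the caller to retry */
	list_splice(&putback_pages, page_list);

	*nr_still_dirty = nr_dirty;
	count_vm_events(PGACTIVATE, pgactivate);
	return nr_reclaimed;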

Otherwise,
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>