From: Mel Gorman on
On Wed, Jul 28, 2010 at 04:46:54PM +0800, Wu Fengguang wrote:
> The wait_on_page_writeback() call inside pageout() is virtually dead code.
>
> shrink_inactive_list()
> shrink_page_list(PAGEOUT_IO_ASYNC)
> pageout(PAGEOUT_IO_ASYNC)
> shrink_page_list(PAGEOUT_IO_SYNC)
> pageout(PAGEOUT_IO_SYNC)
>
> Because shrink_page_list/pageout(PAGEOUT_IO_SYNC) is always called after
> a preceding shrink_page_list/pageout(PAGEOUT_IO_ASYNC), the first
> pageout(ASYNC) converts dirty pages into writeback pages, and the second
> shrink_page_list(SYNC) waits for those writeback pages to become clean
> before calling pageout(SYNC). The second shrink_page_list(SYNC) can
> hardly run into dirty pages for pageout(SYNC) unless some race condition
> occurs.
>

It's possible for the second call to run into dirty pages as there is a
congestion_wait() call between the first shrink_page_list() call and the
second. That's a big window.

> And the page-by-page waiting behavior of pageout(SYNC) will lead to very
> long stall times if it runs into a range of dirty pages.

True, but this is also lumpy reclaim, which depends on a contiguous
range of pages. It's better for it to wait on the selected range of
pages, which is known to contain at least one old page, than to
excessively scan and reclaim newer pages.

> So it's a bad
> idea anyway to call wait_on_page_writeback() inside pageout().
>

I recognise that you are probably thinking of the stall-due-to-fork problem
but I'd expect the patch that raises the bar for <= PAGE_ALLOC_COSTLY_ORDER
to be sufficient. If not, I think it still makes sense to call
wait_on_page_writeback() for > PAGE_ALLOC_COSTLY_ORDER.


> CC: Andy Whitcroft <apw(a)shadowen.org>
> Signed-off-by: Wu Fengguang <fengguang.wu(a)intel.com>
> ---
> mm/vmscan.c | 13 ++-----------
> 1 file changed, 2 insertions(+), 11 deletions(-)
>
> --- linux-next.orig/mm/vmscan.c 2010-07-28 16:22:21.000000000 +0800
> +++ linux-next/mm/vmscan.c 2010-07-28 16:23:35.000000000 +0800
> @@ -324,8 +324,7 @@ typedef enum {
> * pageout is called by shrink_page_list() for each dirty page.
> * Calls ->writepage().
> */
> -static pageout_t pageout(struct page *page, struct address_space *mapping,
> - enum pageout_io sync_writeback)
> +static pageout_t pageout(struct page *page, struct address_space *mapping)
> {
> /*
> * If the page is dirty, only perform writeback if that write
> @@ -384,14 +383,6 @@ static pageout_t pageout(struct page *pa
> return PAGE_ACTIVATE;
> }
>
> - /*
> - * Wait on writeback if requested to. This happens when
> - * direct reclaiming a large contiguous area and the
> - * first attempt to free a range of pages fails.
> - */
> - if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
> - wait_on_page_writeback(page);
> -
> if (!PageWriteback(page)) {
> /* synchronous write or broken a_ops? */
> ClearPageReclaim(page);
> @@ -727,7 +718,7 @@ static unsigned long shrink_page_list(st
> goto keep_locked;
>
> /* Page is dirty, try to write it out here */
> - switch (pageout(page, mapping, sync_writeback)) {
> + switch (pageout(page, mapping)) {
> case PAGE_KEEP:
> goto keep_locked;
> case PAGE_ACTIVATE:
>

--
Mel Gorman
Part-time PhD Student                       Linux Technology Center
University of Limerick                      IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Wu Fengguang on
On Wed, Jul 28, 2010 at 05:10:33PM +0800, Mel Gorman wrote:
> On Wed, Jul 28, 2010 at 04:46:54PM +0800, Wu Fengguang wrote:
> > The wait_on_page_writeback() call inside pageout() is virtually dead code.
> >
> > shrink_inactive_list()
> > shrink_page_list(PAGEOUT_IO_ASYNC)
> > pageout(PAGEOUT_IO_ASYNC)
> > shrink_page_list(PAGEOUT_IO_SYNC)
> > pageout(PAGEOUT_IO_SYNC)
> >
> > Because shrink_page_list/pageout(PAGEOUT_IO_SYNC) is always called after
> > a preceding shrink_page_list/pageout(PAGEOUT_IO_ASYNC), the first
> > pageout(ASYNC) converts dirty pages into writeback pages, and the second
> > shrink_page_list(SYNC) waits for those writeback pages to become clean
> > before calling pageout(SYNC). The second shrink_page_list(SYNC) can
> > hardly run into dirty pages for pageout(SYNC) unless some race condition
> > occurs.
> >
>
> It's possible for the second call to run into dirty pages as there is a
> congestion_wait() call between the first shrink_page_list() call and the
> second. That's a big window.

OK, there is a <=0.1s time window. But then what about the data set size?
After the first shrink_page_list(ASYNC), there will be hardly any pages
left in the page_list except for the pages already under writeback and
other unreclaimable pages. So hitting the second pageout(SYNC) still
requires a race: some unreclaimable pages must become reclaimable+dirty
within that 0.1s window.

> > And the page-by-page waiting behavior of pageout(SYNC) will lead to very
> > long stall times if it runs into a range of dirty pages.
>
> True, but this is also lumpy reclaim, which depends on a contiguous
> range of pages. It's better for it to wait on the selected range of
> pages, which is known to contain at least one old page, than to
> excessively scan and reclaim newer pages.
>
> > So it's a bad
> > idea anyway to call wait_on_page_writeback() inside pageout().
> >
>
> I recognise that you are probably thinking of the stall-due-to-fork problem
> but I'd expect the patch that raises the bar for <= PAGE_ALLOC_COSTLY_ORDER
> to be sufficient. If not, I think it still makes sense to call
> wait_on_page_writeback() for > PAGE_ALLOC_COSTLY_ORDER.

The main intention of this patch is to remove semi-dead code.
I'm less disturbed by the long stall time now with the previous patch ;)

Thanks,
Fengguang
From: Mel Gorman on
On Wed, Jul 28, 2010 at 05:30:31PM +0800, Wu Fengguang wrote:
> On Wed, Jul 28, 2010 at 05:10:33PM +0800, Mel Gorman wrote:
> > On Wed, Jul 28, 2010 at 04:46:54PM +0800, Wu Fengguang wrote:
> > > The wait_on_page_writeback() call inside pageout() is virtually dead code.
> > >
> > > shrink_inactive_list()
> > > shrink_page_list(PAGEOUT_IO_ASYNC)
> > > pageout(PAGEOUT_IO_ASYNC)
> > > shrink_page_list(PAGEOUT_IO_SYNC)
> > > pageout(PAGEOUT_IO_SYNC)
> > >
> > > Because shrink_page_list/pageout(PAGEOUT_IO_SYNC) is always called after
> > > a preceding shrink_page_list/pageout(PAGEOUT_IO_ASYNC), the first
> > > pageout(ASYNC) converts dirty pages into writeback pages, and the second
> > > shrink_page_list(SYNC) waits for those writeback pages to become clean
> > > before calling pageout(SYNC). The second shrink_page_list(SYNC) can
> > > hardly run into dirty pages for pageout(SYNC) unless some race condition
> > > occurs.
> > >
> >
> > It's possible for the second call to run into dirty pages as there is a
> > congestion_wait() call between the first shrink_page_list() call and the
> > second. That's a big window.
>
> OK, there is a <=0.1s time window.

Ok, "big" was an exaggeration for IO, but during this window the page can
also be refaulted. If it was unmapped, it can get dirtied again.

> But then what about the data set size?
> After the first shrink_page_list(ASYNC), there will be hardly any pages
> left in the page_list except for the pages already under writeback and
> other unreclaimable pages. So hitting the second pageout(SYNC) still
> requires a race: some unreclaimable pages must become reclaimable+dirty
> within that 0.1s window.
>

We are hitting this window because otherwise the trace points would not
be reporting sync IO in pageout(). Taken from an ftrace-based report:

Direct reclaims                              1176
Direct reclaim pages scanned               184337
Direct reclaim write file async I/O          2317
Direct reclaim write anon async I/O         35551
Direct reclaim write file sync I/O           1817
Direct reclaim write anon sync I/O          15920

For the last line to have a positive value, we must have called
pageout(PAGEOUT_IO_ASYNC) and then hit a dirty page during the
pageout(PAGEOUT_IO_SYNC) call.

Here is one fairly plausible scenario where we end up waiting on
writeback despite the previous pageout() call.

shrink_inactive_list()
shrink_page_list(PAGEOUT_IO_ASYNC)
Check PageWriteback
Unmap page (set_page_dirty(), if the PTE was dirty)
pageout(PAGEOUT_IO_ASYNC, IO starts, page in writeback)
call congestion_wait()

During this 0.1s window, the process references the page and faults it
in. As this is lumpy reclaim, the page could have been young even though
it was physically located near an old page.

shrink_page_list(PAGEOUT_IO_SYNC)
Check PageWriteback (let's assume writeback has completed for this example)
Unmap page again (dirtying the page again, if the PTE was dirty)
pageout(PAGEOUT_IO_SYNC, IO starts, wait on writeback this time)

> > > And the page-by-page waiting behavior of pageout(SYNC) will lead to very
> > > long stall times if it runs into a range of dirty pages.
> >
> > True, but this is also lumpy reclaim, which depends on a contiguous
> > range of pages. It's better for it to wait on the selected range of
> > pages, which is known to contain at least one old page, than to
> > excessively scan and reclaim newer pages.
> >
> > > So it's a bad
> > > idea anyway to call wait_on_page_writeback() inside pageout().
> > >
> >
> > I recognise that you are probably thinking of the stall-due-to-fork problem
> > but I'd expect the patch that raises the bar for <= PAGE_ALLOC_COSTLY_ORDER
> > to be sufficient. If not, I think it still makes sense to call
> > wait_on_page_writeback() for > PAGE_ALLOC_COSTLY_ORDER.
>
> The main intention of this patch is to remove semi-dead code.
> I'm less disturbed by the long stall time now with the previous patch ;)
>

Unfortunately, while the code may not currently be doing the most
efficient thing with respect to lumpy reclaim, it's not dead either :/

From: KOSAKI Motohiro on
> On Wed, Jul 28, 2010 at 04:46:54PM +0800, Wu Fengguang wrote:
> > The wait_on_page_writeback() call inside pageout() is virtually dead code.
> >
> > shrink_inactive_list()
> > shrink_page_list(PAGEOUT_IO_ASYNC)
> > pageout(PAGEOUT_IO_ASYNC)
> > shrink_page_list(PAGEOUT_IO_SYNC)
> > pageout(PAGEOUT_IO_SYNC)
> >
> > Because shrink_page_list/pageout(PAGEOUT_IO_SYNC) is always called after
> > a preceding shrink_page_list/pageout(PAGEOUT_IO_ASYNC), the first
> > pageout(ASYNC) converts dirty pages into writeback pages, and the second
> > shrink_page_list(SYNC) waits for those writeback pages to become clean
> > before calling pageout(SYNC). The second shrink_page_list(SYNC) can
> > hardly run into dirty pages for pageout(SYNC) unless some race condition
> > occurs.
> >
>
> It's possible for the second call to run into dirty pages as there is a
> congestion_wait() call between the first shrink_page_list() call and the
> second. That's a big window.
>
> > And the page-by-page waiting behavior of pageout(SYNC) will lead to very
> > long stall times if it runs into a range of dirty pages.
>
> True, but this is also lumpy reclaim, which depends on a contiguous
> range of pages. It's better for it to wait on the selected range of
> pages, which is known to contain at least one old page, than to
> excessively scan and reclaim newer pages.

Today I was able to reproduce Andres's issue, and I disagree with this
opinion.
The root cause is that congestion_wait() means "wait until the IO
congestion clears", but if the system has plenty of dirty pages, the
flusher threads issue IO continuously, so the IO congestion is not
cleared for a long time. Eventually congestion_wait(BLK_RW_ASYNC, HZ/10)
becomes equivalent to sleep(HZ/10).

I would propose the following patch instead.

And I've found that synchronous lumpy reclaim has a more serious
problem. I would like to explain it in another mail.

Thanks.



From 0266fb2c23aef659cd4e89fccfeb464f23257b74 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
Date: Tue, 27 Jul 2010 14:36:44 +0900
Subject: [PATCH] vmscan: synchronous lumpy reclaim should not call congestion_wait()

congestion_wait() means "wait until the number of requests in the IO
queue drops below the congestion threshold".
That said, if the system has plenty of dirty pages, the flusher threads
push new requests to the IO queue continuously, so the IO queue does not
clear its congestion status for a long time. Thus congestion_wait(HZ/10)
is almost equivalent to schedule_timeout(HZ/10).

On a system with 512MB of memory, DEF_PRIORITY means a 128kB scan per
shrink_inactive_list() call, i.e. up to 4096 calls in total. 4096 stalls
of 0.1 seconds each add up to an insanely long stall. That shouldn't
happen.

On the other hand, synchronous lumpy reclaim doesn't need this
congestion_wait() at all: shrink_page_list(PAGEOUT_IO_SYNC) ends up
calling wait_on_page_writeback(), which provides sufficient waiting.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
---
mm/vmscan.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 97170eb..2aa16eb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1304,8 +1304,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
*/
if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
sc->lumpy_reclaim_mode) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
-
/*
* The attempt at page out may have made some
* of the pages active, mark them inactive again.
--
1.6.5.2




From: Mel Gorman on
On Wed, Jul 28, 2010 at 06:43:41PM +0900, KOSAKI Motohiro wrote:
> > On Wed, Jul 28, 2010 at 04:46:54PM +0800, Wu Fengguang wrote:
> > > The wait_on_page_writeback() call inside pageout() is virtually dead code.
> > >
> > > shrink_inactive_list()
> > > shrink_page_list(PAGEOUT_IO_ASYNC)
> > > pageout(PAGEOUT_IO_ASYNC)
> > > shrink_page_list(PAGEOUT_IO_SYNC)
> > > pageout(PAGEOUT_IO_SYNC)
> > >
> > > Because shrink_page_list/pageout(PAGEOUT_IO_SYNC) is always called after
> > > a preceding shrink_page_list/pageout(PAGEOUT_IO_ASYNC), the first
> > > pageout(ASYNC) converts dirty pages into writeback pages, and the second
> > > shrink_page_list(SYNC) waits for those writeback pages to become clean
> > > before calling pageout(SYNC). The second shrink_page_list(SYNC) can
> > > hardly run into dirty pages for pageout(SYNC) unless some race condition
> > > occurs.
> > >
> >
> > It's possible for the second call to run into dirty pages as there is a
> > congestion_wait() call between the first shrink_page_list() call and the
> > second. That's a big window.
> >
> > > And the page-by-page waiting behavior of pageout(SYNC) will lead to very
> > > long stall times if it runs into a range of dirty pages.
> >
> > True, but this is also lumpy reclaim, which depends on a contiguous
> > range of pages. It's better for it to wait on the selected range of
> > pages, which is known to contain at least one old page, than to
> > excessively scan and reclaim newer pages.
>
> Today I was able to reproduce Andres's issue, and I disagree with this
> opinion.

Is Andres's issue not covered by the patch "vmscan: raise the bar to
PAGEOUT_IO_SYNC stalls" because wait_on_page_writeback() was the
main problem?

> The root cause is that congestion_wait() means "wait until the IO
> congestion clears", but if the system has plenty of dirty pages, the
> flusher threads issue IO continuously, so the IO congestion is not
> cleared for a long time. Eventually congestion_wait(BLK_RW_ASYNC, HZ/10)
> becomes equivalent to sleep(HZ/10).
>
> I would propose the following patch instead.
>
> And I've found that synchronous lumpy reclaim has a more serious
> problem. I would like to explain it in another mail.
>
> Thanks.
>
>
>
> From 0266fb2c23aef659cd4e89fccfeb464f23257b74 Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
> Date: Tue, 27 Jul 2010 14:36:44 +0900
> Subject: [PATCH] vmscan: synchronous lumpy reclaim should not call congestion_wait()
>
> congestion_wait() means "wait until the number of requests in the IO
> queue drops below the congestion threshold".
> That said, if the system has plenty of dirty pages, the flusher threads
> push new requests to the IO queue continuously, so the IO queue does not
> clear its congestion status for a long time. Thus congestion_wait(HZ/10)
> is almost equivalent to schedule_timeout(HZ/10).
>
> On a system with 512MB of memory, DEF_PRIORITY means a 128kB scan per
> shrink_inactive_list() call, i.e. up to 4096 calls in total. 4096 stalls
> of 0.1 seconds each add up to an insanely long stall. That shouldn't
> happen.
>
> On the other hand, synchronous lumpy reclaim doesn't need this
> congestion_wait() at all: shrink_page_list(PAGEOUT_IO_SYNC) ends up
> calling wait_on_page_writeback(), which provides sufficient waiting.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>

I think the final paragraph makes a lot of sense. If a lumpy reclaimer is
going to get stalled on wait_on_page_writeback(), it should be a sufficient
throttling mechanism.

Will test.

> ---
> mm/vmscan.c | 2 --
> 1 files changed, 0 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 97170eb..2aa16eb 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1304,8 +1304,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> */
> if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> sc->lumpy_reclaim_mode) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> -
> /*
> * The attempt at page out may have made some
> * of the pages active, mark them inactive again.
