From: Mel Gorman
On Mon, Jul 19, 2010 at 10:21:45AM -0400, Christoph Hellwig wrote:
> On Mon, Jul 19, 2010 at 02:11:29PM +0100, Mel Gorman wrote:
> > From: Wu Fengguang <fengguang.wu(a)intel.com>
> >
> > A background flush work may run forever, so it's reasonable for it to
> > mimic the kupdate behavior of syncing old/expired inodes first.
> >
> > This behavior also makes sense from the perspective of page reclaim.
> > File pages are added to the inactive list and promoted if referenced
> > after one recycling. If not referenced, it's very easy for pages to be
> > cleaned from reclaim context, which is inefficient in terms of IO. If
> > background flush is cleaning pages, it's best it cleans old pages to
> > help minimise IO from reclaim.
>
> Yes, we absolutely do this.

Do you mean we absolutely want to do this?

> Wu, do you have an improved version of the patch
> pending or should we put it in this version for now?
>

Some insight into how the other writeback changes being floated around
might affect the number of dirty pages reclaim encounters would also be
helpful. The tracepoints are there for people to measure it, but any help
interpreting the results would be useful.

--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Mel Gorman
On Thu, Jul 22, 2010 at 04:52:10PM +0800, Wu Fengguang wrote:
> > Some insight on how the other writeback changes that are being floated
> > around might affect the number of dirty pages reclaim encounters would also
> > be helpful.
>
> Here is an interesting related problem about the wait_on_page_writeback() call
> inside shrink_page_list():
>
> http://lkml.org/lkml/2010/4/4/86
>
> The problem is, wait_on_page_writeback() is called too early in the
> direct reclaim path, which blocks many random/unrelated processes when
> some slow (USB stick) writeback is on the way.
>
> A simple dd can easily create a big range of dirty pages in the LRU
> list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
> typical desktop, which triggers the lumpy reclaim mode and hence
> wait_on_page_writeback().
>

Lumpy reclaim is for high-order allocations. A simple dd should not be
triggering it regularly unless there was a lot of forking going on at the
same time. Also, how would a random or unrelated process get blocked on
writeback unless they were also doing high-order allocations? What was the
source of the high-order allocations?

> I proposed this patch at the time, which was confirmed to solve the problem:
>
> --- linux-next.orig/mm/vmscan.c 2010-06-24 14:32:03.000000000 +0800
> +++ linux-next/mm/vmscan.c 2010-07-22 16:12:34.000000000 +0800
> @@ -1650,7 +1650,7 @@ static void set_lumpy_reclaim_mode(int p
> */
> if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> sc->lumpy_reclaim_mode = 1;
> - else if (sc->order && priority < DEF_PRIORITY - 2)
> + else if (sc->order && priority < DEF_PRIORITY / 2)
> sc->lumpy_reclaim_mode = 1;
> else
> sc->lumpy_reclaim_mode = 0;
>
>
> However KOSAKI and Minchan raised concerns about raising the bar.
> I guess this new patch is more problem oriented and acceptable:
>
> --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800
> +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800
> @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
> count_vm_events(PGDEACTIVATE, nr_active);
>
> nr_freed += shrink_page_list(&page_list, sc,
> - PAGEOUT_IO_SYNC);
> + priority < DEF_PRIORITY / 3 ?
> + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
> }
>

I'm not seeing how this helps. It delays when lumpy reclaim waits on IO
to clean contiguous ranges of pages.

I'll read that full thread as I wasn't aware of it before.

From: Mel Gorman
On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote:
> > I guess this new patch is more problem oriented and acceptable:
> >
> > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800
> > +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800
> > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
> > count_vm_events(PGDEACTIVATE, nr_active);
> >
> > nr_freed += shrink_page_list(&page_list, sc,
> > - PAGEOUT_IO_SYNC);
> > + priority < DEF_PRIORITY / 3 ?
> > + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
> > }
> >
> > nr_reclaimed += nr_freed;
>
> This one looks better:
> ---
> vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
>
> Fix "system goes totally unresponsive with many dirty/writeback pages"
> problem:
>
> http://lkml.org/lkml/2010/4/4/86
>
> The root cause is, wait_on_page_writeback() is called too early in the
> direct reclaim path, which blocks many random/unrelated processes when
> some slow (USB stick) writeback is on the way.
>

So, if lumpy reclaim is a factor, what's the bet that it's
high-order-but-low-cost allocations such as those for fork() that are
getting caught by this since [78dc583d: vmscan: low order lumpy reclaim
also should use PAGEOUT_IO_SYNC] was introduced?

That could manifest to the user as stalls when creating new processes under
heavy IO. I would be surprised if it froze the entire system, but certainly
any new work would feel very slow.

> A simple dd can easily create a big range of dirty pages in the LRU
> list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
> typical desktop, which triggers the lumpy reclaim mode and hence
> wait_on_page_writeback().
>

which triggers the lumpy reclaim mode for high-order allocations.

lumpy reclaim mode is not something that is triggered just because priority
is high.

I think there is a second possibility for causing stalls as well that is
unrelated to lumpy reclaim. Once dirty_limit is reached, new page faults may
also result in stalls. If it is taking a long time to writeback dirty data,
random processes could be getting stalled just because they happened to dirty
data at the wrong time. This would be the case if the main dirtying process
(e.g. dd) is not calling sync and dropping pages it's no longer using.

> In Andreas' case, 512MB/1024 = 512KB, this is way too low compared to
> the 22MB writeback and 190MB dirty pages. There can easily be a
> continuous range of 512KB dirty/writeback pages in the LRU, which will
> trigger the wait logic.
>
> To make it worse, when there are 50MB writeback pages and USB 1.1 is
> writing them at 1MB/s, wait_on_page_writeback() may get stuck for up to 50
> seconds.
>
> So only enter sync write&wait when priority goes below DEF_PRIORITY/3,
> or 6.25% LRU. As the default dirty throttle ratio is 20%, sync write&wait
> will hardly be triggered by pure dirty pages.
>
> Signed-off-by: Wu Fengguang <fengguang.wu(a)intel.com>
> ---
> mm/vmscan.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800
> +++ linux-next/mm/vmscan.c 2010-07-22 17:03:47.000000000 +0800
> @@ -1206,7 +1206,7 @@ static unsigned long shrink_inactive_lis
> * but that should be acceptable to the caller
> */
> if (nr_freed < nr_taken && !current_is_kswapd() &&
> - sc->lumpy_reclaim_mode) {
> + sc->lumpy_reclaim_mode && priority < DEF_PRIORITY / 3) {
> congestion_wait(BLK_RW_ASYNC, HZ/10);
>

This will also delay waiting on congestion for really high-order
allocations such as huge pages or video decoder buffers, which really
should be stalling. How about the following compile-tested diff?
It takes the cost of the high-order allocation into account and the
priority when deciding whether to synchronously wait or not.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c7e57c..d652e0c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1110,6 +1110,48 @@ static int too_many_isolated(struct zone *zone, int file,
}

/*
+ * Returns true if the caller should stall on congestion and retry to clean
+ * the list of pages synchronously.
+ *
+ * If we are direct reclaiming for contiguous pages and we do not reclaim
+ * everything in the list, try again and wait for IO to complete. This
+ * will stall high-order allocations but that should be acceptable to
+ * the caller
+ */
+static inline bool should_reclaim_stall(unsigned long nr_taken,
+ unsigned long nr_freed,
+ int priority,
+ struct scan_control *sc)
+{
+ int lumpy_stall_priority;
+
+ /* kswapd should not stall on sync IO */
+ if (current_is_kswapd())
+ return false;
+
+ /* Only stall on lumpy reclaim */
+ if (!sc->lumpy_reclaim_mode)
+ return false;
+
+ /* If we have reclaimed everything on the isolated list, no stall */
+ if (nr_freed == nr_taken)
+ return false;
+
+ /*
+ * For high-order allocations, there are two stall thresholds.
+ * High-cost allocations stall immediately whereas lower
+ * order allocations such as stacks require the scanning
+ * priority to be much higher before stalling
+ */
+ if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+ lumpy_stall_priority = DEF_PRIORITY;
+ else
+ lumpy_stall_priority = DEF_PRIORITY / 3;
+
+ return priority <= lumpy_stall_priority;
+}
+
+/*
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
*/
@@ -1199,14 +1241,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
nr_scanned += nr_scan;
nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);

- /*
- * If we are direct reclaiming for contiguous pages and we do
- * not reclaim everything in the list, try again and wait
- * for IO to complete. This will stall high-order allocations
- * but that should be acceptable to the caller
- */
- if (nr_freed < nr_taken && !current_is_kswapd() &&
- sc->lumpy_reclaim_mode) {
+ /* Check if we should synchronously wait for writeback */
+ if (should_reclaim_stall(nr_taken, nr_freed, priority, sc)) {
congestion_wait(BLK_RW_ASYNC, HZ/10);

/*



From: Mel Gorman
On Fri, Jul 23, 2010 at 05:45:15PM +0800, Wu Fengguang wrote:
> On Thu, Jul 22, 2010 at 06:48:23PM +0800, Mel Gorman wrote:
> > On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote:
> > > > I guess this new patch is more problem oriented and acceptable:
> > > >
> > > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800
> > > > +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800
> > > > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
> > > > count_vm_events(PGDEACTIVATE, nr_active);
> > > >
> > > > nr_freed += shrink_page_list(&page_list, sc,
> > > > - PAGEOUT_IO_SYNC);
> > > > + priority < DEF_PRIORITY / 3 ?
> > > > + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
> > > > }
> > > >
> > > > nr_reclaimed += nr_freed;
> > >
> > > This one looks better:
> > > ---
> > > vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
> > >
> > > Fix "system goes totally unresponsive with many dirty/writeback pages"
> > > problem:
> > >
> > > http://lkml.org/lkml/2010/4/4/86
> > >
> > > The root cause is, wait_on_page_writeback() is called too early in the
> > > direct reclaim path, which blocks many random/unrelated processes when
> > > some slow (USB stick) writeback is on the way.
> > >
> >
> > So, if lumpy reclaim is a factor, what's the bet that it's
> > high-order-but-low-cost allocations such as those for fork() that are
> > getting caught by this since [78dc583d: vmscan: low order lumpy reclaim
> > also should use PAGEOUT_IO_SYNC] was introduced?
>
> Sorry I'm a bit confused by your wording..
>

After reading the thread, I realised that fork() stalling could be a
factor. That commit allows lumpy reclaim and PAGEOUT_IO_SYNC to be used for
high-order allocations such as those used by fork(). It might have been an
oversight to allow order-1 to use PAGEOUT_IO_SYNC too easily.

> > That could manifest to the user as stalls when creating new processes under
> > heavy IO. I would be surprised if it froze the entire system, but certainly
> > any new work would feel very slow.
> >
> > > A simple dd can easily create a big range of dirty pages in the LRU
> > > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
> > > typical desktop, which triggers the lumpy reclaim mode and hence
> > > wait_on_page_writeback().
> > >
> >
> > which triggers the lumpy reclaim mode for high-order allocations.
>
> Exactly. Changelog updated.
>
> > lumpy reclaim mode is not something that is triggered just because priority
> > is high.
>
> Right.
>
> > I think there is a second possibility for causing stalls as well that is
> > unrelated to lumpy reclaim. Once dirty_limit is reached, new page faults may
> > also result in stalls. If it is taking a long time to writeback dirty data,
> > random processes could be getting stalled just because they happened to dirty
> > data at the wrong time. This would be the case if the main dirtying process
> > (e.g. dd) is not calling sync and dropping pages it's no longer using.
>
> The dirty_limit throttling will slow down the dirty process to the
> writeback throughput. If a process is dirtying files on sda (HDD),
> it will be throttled at 80MB/s. If another process is dirtying files
> on sdb (USB 1.1), it will be throttled at 1MB/s.
>

It will slow down the dirty process doing the dd, but can it also slow
down other processes that just happened to dirty pages at the wrong
time?

> So dirty throttling will slow things down. However the slow down
> should be smooth (a series of 100ms stalls instead of a sudden 10s
> stall), and won't impact random processes (which do no read/write IO
> at all).
>

Ok.

> > > In Andreas' case, 512MB/1024 = 512KB, this is way too low compared to
> > > the 22MB writeback and 190MB dirty pages. There can easily be a
> > > continuous range of 512KB dirty/writeback pages in the LRU, which will
> > > trigger the wait logic.
> > >
> > > To make it worse, when there are 50MB writeback pages and USB 1.1 is
> > > writing them at 1MB/s, wait_on_page_writeback() may get stuck for up to 50
> > > seconds.
> > >
> > > So only enter sync write&wait when priority goes below DEF_PRIORITY/3,
> > > or 6.25% LRU. As the default dirty throttle ratio is 20%, sync write&wait
> > > will hardly be triggered by pure dirty pages.
> > >
> > > Signed-off-by: Wu Fengguang <fengguang.wu(a)intel.com>
> > > ---
> > > mm/vmscan.c | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800
> > > +++ linux-next/mm/vmscan.c 2010-07-22 17:03:47.000000000 +0800
> > > @@ -1206,7 +1206,7 @@ static unsigned long shrink_inactive_lis
> > > * but that should be acceptable to the caller
> > > */
> > > if (nr_freed < nr_taken && !current_is_kswapd() &&
> > > - sc->lumpy_reclaim_mode) {
> > > + sc->lumpy_reclaim_mode && priority < DEF_PRIORITY / 3) {
> > > congestion_wait(BLK_RW_ASYNC, HZ/10);
> > >
> >
> > This will also delay waiting on congestion for really high-order
> > allocations such as huge pages or video decoder buffers, which
> > really should be stalling.
>
> I absolutely agree that high order allocators should be somehow throttled.
>
> However given that one can easily create a large _continuous_ range of
> dirty LRU pages, letting someone bump all the way through the range
> sounds a bit cruel...
>
> > How about the following compile-tested diff?
> > It takes the cost of the high-order allocation into account and the
> > priority when deciding whether to synchronously wait or not.
>
> Very nice patch. Thanks!
>

Will you be picking it up or should I? The changelog should be more or less
the same as yours; consider it

Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>

It'd be nice if the original tester is still knocking around and willing
to confirm the patch resolves his/her problem. I am running this patch on
my desktop at the moment and it does feel a little smoother but it might be
my imagination. I had trouble with odd stalls that I never pinned down and
was attributing to the machine being commonly heavily loaded but I haven't
noticed them today.

It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
should use PAGEOUT_IO_SYNC]

Thanks

> <SNIP>
