From: Wu Fengguang on
On Thu, Jul 22, 2010 at 06:48:23PM +0800, Mel Gorman wrote:
> On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote:
> > > I guess this new patch is more problem oriented and acceptable:
> > >
> > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800
> > > +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800
> > > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
> > > count_vm_events(PGDEACTIVATE, nr_active);
> > >
> > > nr_freed += shrink_page_list(&page_list, sc,
> > > - PAGEOUT_IO_SYNC);
> > > + priority < DEF_PRIORITY / 3 ?
> > > + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
> > > }
> > >
> > > nr_reclaimed += nr_freed;
> >
> > This one looks better:
> > ---
> > vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
> >
> > Fix "system goes totally unresponsive with many dirty/writeback pages"
> > problem:
> >
> > http://lkml.org/lkml/2010/4/4/86
> >
> > The root cause is that wait_on_page_writeback() is called too early in the
> > direct reclaim path, which blocks many random/unrelated processes when
> > some slow (USB stick) writeback is under way.
> >
>
> So, what's the bet if lumpy reclaim is a factor that it's
> high-order-but-low-cost such as fork() that are getting caught by this since
> [78dc583d: vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC]
> was introduced?

Sorry, I'm a bit confused by your wording..

> That could manifest to the user as stalls creating new processes when under
> heavy IO. I would be surprised it would freeze the entire system but certainly
> any new work would feel very slow.
>
> > A simple dd can easily create a big range of dirty pages in the LRU
> > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
> > typical desktop, which triggers the lumpy reclaim mode and hence
> > wait_on_page_writeback().
> >
>
> which triggers the lumpy reclaim mode for high-order allocations.

Exactly. Changelog updated.

> lumpy reclaim mode is not something that is triggered just because priority
> is high.

Right.

> I think there is a second possibility for causing stalls as well that is
> unrelated to lumpy reclaim. Once dirty_limit is reached, new page faults may
> also result in stalls. If it is taking a long time to writeback dirty data,
> random processes could be getting stalled just because they happened to dirty
> data at the wrong time. This would be the case if the main dirtying process
> (e.g. dd) is not calling sync and dropping pages it's no longer using.

The dirty_limit throttling will slow down the dirty process to the
writeback throughput. If a process is dirtying files on sda (HDD),
it will be throttled at 80MB/s. If another process is dirtying files
on sdb (USB 1.1), it will be throttled at 1MB/s.

So dirty throttling will slow things down. However, the slowdown
should be smooth (a series of 100ms stalls instead of a sudden 10s
stall), and won't impact random processes (which do no read/write IO
at all).
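
To illustrate the shape of that throttling, here is a minimal sketch,
loosely modeled on balance_dirty_pages(). bdi_over_limit() and
writeback_chunk() are illustrative stand-ins, not the real kernel API:

/*
 * Sketch: throttle a dirtying task at roughly the device's
 * writeback throughput, in short repeated pauses.
 */
static void throttle_dirtier(struct backing_dev_info *bdi)
{
	for (;;) {
		/* below this device's share of the dirty limit? */
		if (!bdi_over_limit(bdi))
			break;
		/* kick async writeback for a chunk of this device's pages */
		writeback_chunk(bdi);
		/* pause ~100ms, then re-check */
		congestion_wait(BLK_RW_ASYNC, HZ / 10);
	}
}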

> > In Andreas' case, 512MB/1024 = 512KB; this is way too low compared to
> > the 22MB of writeback and 190MB of dirty pages. There can easily be a
> > continuous range of 512KB of dirty/writeback pages in the LRU, which will
> > trigger the wait logic.
> >
> > To make it worse, when there are 50MB of writeback pages and USB 1.1 is
> > writing them at 1MB/s, wait_on_page_writeback() may get stuck for up to
> > 50 seconds.
> >
> > So only enter sync write&wait when priority goes below DEF_PRIORITY/3,
> > or 6.25% of the LRU. As the default dirty throttle ratio is 20%, sync
> > write&wait will hardly be triggered by pure dirty pages.
> >
> > Signed-off-by: Wu Fengguang <fengguang.wu(a)intel.com>
> > ---
> > mm/vmscan.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800
> > +++ linux-next/mm/vmscan.c 2010-07-22 17:03:47.000000000 +0800
> > @@ -1206,7 +1206,7 @@ static unsigned long shrink_inactive_lis
> > * but that should be acceptable to the caller
> > */
> > if (nr_freed < nr_taken && !current_is_kswapd() &&
> > - sc->lumpy_reclaim_mode) {
> > + sc->lumpy_reclaim_mode && priority < DEF_PRIORITY / 3) {
> > congestion_wait(BLK_RW_ASYNC, HZ/10);
> >
>
> This will also delay waiting on congestion for really high-order
> allocations such as huge pages, some video decoders and the like,
> which really should be stalling.

I absolutely agree that high-order allocators should be somehow throttled.

However, given that one can easily create a large _continuous_ range of
dirty LRU pages, letting someone bump all the way through the range
sounds a bit cruel..

> How about the following compile-tested diff?
> It takes both the cost of the high-order allocation and the priority
> into account when deciding whether to synchronously wait or not.

Very nice patch. Thanks!

Cheers,
Fengguang

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9c7e57c..d652e0c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1110,6 +1110,48 @@ static int too_many_isolated(struct zone *zone, int file,
> }
>
> /*
> + * Returns true if the caller should stall on congestion and retry to clean
> + * the list of pages synchronously.
> + *
> + * If we are direct reclaiming for contiguous pages and we do not reclaim
> + * everything in the list, try again and wait for IO to complete. This
> + * will stall high-order allocations but that should be acceptable to
> + * the caller
> + */
> +static inline bool should_reclaim_stall(unsigned long nr_taken,
> + unsigned long nr_freed,
> + int priority,
> + struct scan_control *sc)
> +{
> + int lumpy_stall_priority;
> +
> + /* kswapd should not stall on sync IO */
> + if (current_is_kswapd())
> + return false;
> +
> + /* Only stall on lumpy reclaim */
> + if (!sc->lumpy_reclaim_mode)
> + return false;
> +
> > + /* If we have reclaimed everything on the isolated list, no stall */
> + if (nr_freed == nr_taken)
> + return false;
> +
> + /*
> + * For high-order allocations, there are two stall thresholds.
> > + * High-cost allocations stall immediately, whereas lower
> + * order allocations such as stacks require the scanning
> + * priority to be much higher before stalling
> + */
> + if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> + lumpy_stall_priority = DEF_PRIORITY;
> + else
> + lumpy_stall_priority = DEF_PRIORITY / 3;
> +
> + return priority <= lumpy_stall_priority;
> +}
> +
> +/*
> * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
> * of reclaimed pages
> */
> @@ -1199,14 +1241,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
> nr_scanned += nr_scan;
> nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
>
> - /*
> - * If we are direct reclaiming for contiguous pages and we do
> - * not reclaim everything in the list, try again and wait
> - * for IO to complete. This will stall high-order allocations
> - * but that should be acceptable to the caller
> - */
> - if (nr_freed < nr_taken && !current_is_kswapd() &&
> - sc->lumpy_reclaim_mode) {
> > + /* Check if we should synchronously wait for writeback */
> + if (should_reclaim_stall(nr_taken, nr_freed, priority, sc)) {
> congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> /*
>
>
From: Wu Fengguang on
On Fri, Jul 23, 2010 at 06:57:19PM +0800, Mel Gorman wrote:
> On Fri, Jul 23, 2010 at 05:45:15PM +0800, Wu Fengguang wrote:
> > On Thu, Jul 22, 2010 at 06:48:23PM +0800, Mel Gorman wrote:
> > > On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote:
> > > > > I guess this new patch is more problem oriented and acceptable:
> > > > >
> > > > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800
> > > > > +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800
> > > > > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
> > > > > count_vm_events(PGDEACTIVATE, nr_active);
> > > > >
> > > > > nr_freed += shrink_page_list(&page_list, sc,
> > > > > - PAGEOUT_IO_SYNC);
> > > > > + priority < DEF_PRIORITY / 3 ?
> > > > > + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
> > > > > }
> > > > >
> > > > > nr_reclaimed += nr_freed;
> > > >
> > > > This one looks better:
> > > > ---
> > > > vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
> > > >
> > > > Fix "system goes totally unresponsive with many dirty/writeback pages"
> > > > problem:
> > > >
> > > > http://lkml.org/lkml/2010/4/4/86
> > > >
> > > > The root cause is that wait_on_page_writeback() is called too early in the
> > > > direct reclaim path, which blocks many random/unrelated processes when
> > > > some slow (USB stick) writeback is under way.
> > > >
> > >
> > > So, what's the bet if lumpy reclaim is a factor that it's
> > > high-order-but-low-cost such as fork() that are getting caught by this since
> > > [78dc583d: vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC]
> > > was introduced?
> >
> > Sorry, I'm a bit confused by your wording..
> >
>
> After reading the thread, I realised that fork() stalling could be a
> factor. That commit allows lumpy reclaim and PAGEOUT_IO_SYNC to be used for
> high-order allocations such as those used by fork(). It might have been an
> oversight to allow order-1 to use PAGEOUT_IO_SYNC too easily.

That reads much clearer. Thanks! I have the same feeling, hence the
proposed patch.

> > > That could manifest to the user as stalls creating new processes when under
> > > heavy IO. I would be surprised it would freeze the entire system but certainly
> > > any new work would feel very slow.
> > >
> > > > A simple dd can easily create a big range of dirty pages in the LRU
> > > > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
> > > > typical desktop, which triggers the lumpy reclaim mode and hence
> > > > wait_on_page_writeback().
> > > >
> > >
> > > which triggers the lumpy reclaim mode for high-order allocations.
> >
> > Exactly. Changelog updated.
> >
> > > lumpy reclaim mode is not something that is triggered just because priority
> > > is high.
> >
> > Right.
> >
> > > I think there is a second possibility for causing stalls as well that is
> > > unrelated to lumpy reclaim. Once dirty_limit is reached, new page faults may
> > > also result in stalls. If it is taking a long time to writeback dirty data,
> > > random processes could be getting stalled just because they happened to dirty
> > > data at the wrong time. This would be the case if the main dirtying process
> > > (e.g. dd) is not calling sync and dropping pages it's no longer using.
> >
> > The dirty_limit throttling will slow down the dirty process to the
> > writeback throughput. If a process is dirtying files on sda (HDD),
> > it will be throttled at 80MB/s. If another process is dirtying files
> > on sdb (USB 1.1), it will be throttled at 1MB/s.
> >
>
> It will slow down the dirty process doing the dd, but can it also slow
> down other processes that just happened to dirty pages at the wrong
> time?

For the case of a heavy dirtier (dd) and concurrent light dirtiers
(some random processes), the light dirtiers won't be easily throttled.
task_dirty_limit() handles that case well. It gives light dirtiers a
higher threshold than heavy dirtiers, so that only the latter will be
dirty throttled.
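
A simplified sketch of that idea follows. The real task_dirty_limit()
uses a proportional estimate of each task's recent dirtying share; the
up-to-1/8 scaling below matches my reading of it, so treat the details
as indicative rather than authoritative:

/*
 * Sketch: scale the global dirty limit down by up to 1/8 in
 * proportion to this task's share of recently dirtied pages.
 * Heavy dirtiers reach their (lower) threshold first; light
 * dirtiers stay close to the full limit.
 */
static unsigned long task_limit_sketch(struct task_struct *tsk,
				       unsigned long dirty_limit)
{
	long numerator, denominator;
	unsigned long reduction = dirty_limit >> 3;

	/* this task's fraction of recent dirtying (illustrative call) */
	task_dirties_fraction(tsk, &numerator, &denominator);

	reduction = reduction * numerator / denominator;
	return dirty_limit - reduction;
}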

> > So dirty throttling will slow things down. However, the slowdown
> > should be smooth (a series of 100ms stalls instead of a sudden 10s
> > stall), and won't impact random processes (which do no read/write IO
> > at all).
> >
>
> Ok.
>
> > > > In Andreas' case, 512MB/1024 = 512KB; this is way too low compared to
> > > > the 22MB of writeback and 190MB of dirty pages. There can easily be a
> > > > continuous range of 512KB of dirty/writeback pages in the LRU, which will
> > > > trigger the wait logic.
> > > >
> > > > To make it worse, when there are 50MB of writeback pages and USB 1.1 is
> > > > writing them at 1MB/s, wait_on_page_writeback() may get stuck for up to
> > > > 50 seconds.
> > > >
> > > > So only enter sync write&wait when priority goes below DEF_PRIORITY/3,
> > > > or 6.25% of the LRU. As the default dirty throttle ratio is 20%, sync
> > > > write&wait will hardly be triggered by pure dirty pages.
> > > >
> > > > Signed-off-by: Wu Fengguang <fengguang.wu(a)intel.com>
> > > > ---
> > > > mm/vmscan.c | 4 ++--
> > > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800
> > > > +++ linux-next/mm/vmscan.c 2010-07-22 17:03:47.000000000 +0800
> > > > @@ -1206,7 +1206,7 @@ static unsigned long shrink_inactive_lis
> > > > * but that should be acceptable to the caller
> > > > */
> > > > if (nr_freed < nr_taken && !current_is_kswapd() &&
> > > > - sc->lumpy_reclaim_mode) {
> > > > + sc->lumpy_reclaim_mode && priority < DEF_PRIORITY / 3) {
> > > > congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > >
> > >
> > > This will also delay waiting on congestion for really high-order
> > > allocations such as huge pages, some video decoders and the like,
> > > which really should be stalling.
> >
> > I absolutely agree that high-order allocators should be somehow throttled.

> > However, given that one can easily create a large _continuous_ range of
> > dirty LRU pages, letting someone bump all the way through the range
> > sounds a bit cruel..

Hmm. If such a large range of dirty pages is approaching the end of the
LRU, it means the LRU lists are being scanned pretty fast, indicating a
busy system and/or high memory pressure. So it seems reasonable to be
cruel to really high-order allocators -- they won't perform well under
memory pressure anyway, and would only make things worse.

> > > How about the following compile-tested diff?
> > > It takes both the cost of the high-order allocation and the priority
> > > into account when deciding whether to synchronously wait or not.
> >
> > Very nice patch. Thanks!
> >
>
> Will you be picking it up or should I? The changelog should be more or less
> the same as yours and consider it
>
> Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>

Thanks. I'll post the patch.

> It'd be nice if the original tester is still knocking around and willing
> to confirm the patch resolves his/her problem. I am running this patch on
> my desktop at the moment and it does feel a little smoother but it might be
> my imagination. I had trouble with odd stalls that I never pinned down and
> was attributing to the machine being commonly heavily loaded but I haven't
> noticed them today.

Great. Just added CC to Andreas Mohr.

> It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
> logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
> should use PAGEOUT_IO_SYNC]

And Minchan, he has been following this issue too :)

Thanks,
Fengguang
From: Wu Fengguang on
Hi Minchan,

On Thu, Jul 22, 2010 at 11:34:40PM +0800, Minchan Kim wrote:
> Hi, Wu.
> Thanks for Cced me.
>
> AFAIR, we discussed this by private mail and didn't reach a conclusion.
> Let's start from the beginning.

OK.

> On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote:
> > > I guess this new patch is more problem oriented and acceptable:
> > >
> > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800
> > > +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800
> > > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
> > > count_vm_events(PGDEACTIVATE, nr_active);
> > >
> > > nr_freed += shrink_page_list(&page_list, sc,
> > > - PAGEOUT_IO_SYNC);
> > > + priority < DEF_PRIORITY / 3 ?
> > > + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
> > > }
> > >
> > > nr_reclaimed += nr_freed;
> >
> > This one looks better:
> > ---
> > vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
> >
> > Fix "system goes totally unresponsive with many dirty/writeback pages"
> > problem:
> >
> > http://lkml.org/lkml/2010/4/4/86
> >
> > The root cause is that wait_on_page_writeback() is called too early in the
> > direct reclaim path, which blocks many random/unrelated processes when
> > some slow (USB stick) writeback is under way.
> >
> > A simple dd can easily create a big range of dirty pages in the LRU
> > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
> > typical desktop, which triggers the lumpy reclaim mode and hence
> > wait_on_page_writeback().
>
> I see an OOM message. The order is zero.

OOM after applying this patch? It's not an obvious consequence.

> How does lumpy reclaim work?
> For lumpy reclaim to kick in, we have to meet priority < 10 and sc->order > 0.
>
> Please clarify the problem.

This patch tries to respect the lumpy reclaim logic, and only raises
the bar for sync writeback and IO wait. With Mel's change, it's only
doing so for (order <= PAGE_ALLOC_COSTLY_ORDER) allocations. Hopefully
this will limit unexpected side effects.
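
For a rough picture of where the two bars sit (assuming DEF_PRIORITY = 12
and PAGE_ALLOC_COSTLY_ORDER = 3, as in mainline at the time; this is an
illustration, not code from the patch):

/*
 * At priority p, one pass scans roughly lru_size >> p pages.
 *
 *   order > 3 (costly):  may stall at any priority <= DEF_PRIORITY
 *   order <= 3:          stalls only at priority <= DEF_PRIORITY/3 = 4,
 *                        i.e. once a pass covers lru_size >> 4,
 *                        which is 1/16 = 6.25% of the LRU
 */
static unsigned long scan_target(unsigned long lru_size, int priority)
{
	return lru_size >> priority;
}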

> >
> > In Andreas' case, 512MB/1024 = 512KB; this is way too low compared to
> > the 22MB of writeback and 190MB of dirty pages. There can easily be a
>
> What are the 22MB and 190MB?

The numbers are adapted from the OOM dmesg in
http://lkml.org/lkml/2010/4/4/86 . The OOM is order 0 and GFP_KERNEL.

> It would be better to explain in more detail.
> I think the description has to be clear as a summary of the problem
> without needing the above link.

Good suggestion. I'll try.

> Thanks for taking up this problem again. :)

Heh, I'm actually feeling guilty for the long delay!

Thanks,
Fengguang
From: Wu Fengguang on
> For the case of a heavy dirtier (dd) and concurrent light dirtiers
> (some random processes), the light dirtiers won't be easily throttled.
> task_dirty_limit() handles that case well. It gives light dirtiers a
> higher threshold than heavy dirtiers, so that only the latter will be
> dirty throttled.

The caveat is that the real dirty throttling threshold is not exactly
the value specified by vm.dirty_ratio or vm.dirty_bytes. Instead it's
some value slightly lower than that, and the real value differs for
each process, which is a nice trick to throttle heavy dirtiers first.
If I remember right, that was invented by Peter and Andrew.
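
As a worked example (with the same caveat: the up-to-1/8 reduction
reflects my reading of task_dirty_limit(), so take the numbers as
indicative), with a global limit of 200MB:

	heavy dirtier (share ~= 1):  200MB - 200MB/8 ~= 175MB
	light dirtier (share ~= 0):  ~200MB

so the heavy dirtier hits its threshold and gets throttled well before
any light dirtier does.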

Thanks,
Fengguang
From: Minchan Kim on
On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> sorry for the delay.
>
> > Will you be picking it up or should I? The changelog should be more or less
> > the same as yours and consider it
> >
> > Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
> >
> > It'd be nice if the original tester is still knocking around and willing
> > to confirm the patch resolves his/her problem. I am running this patch on
> > my desktop at the moment and it does feel a little smoother but it might be
> > my imagination. I had trouble with odd stalls that I never pinned down and
> > was attributing to the machine being commonly heavily loaded but I haven't
> > noticed them today.
> >
> > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
> > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
> > should use PAGEOUT_IO_SYNC]
>
> My review didn't find any bug. However, I think the original thread has
> too many guesses, and we need to know a way to reproduce the issue and
> confirm it.
>
> At least, we need three confirmations.
> o Is the original issue still there?
> o Is DEF_PRIORITY/3 the best value?

I agree. Wu, how did you determine DEF_PRIORITY/3 of the LRU?
I guess the system has 512MB of memory and 22MB of writeback pages,
so you may have chosen it to skip at most 32MB of writeback pages.
Is that right?

And I have a question about your comment below.

"As the default dirty throttle ratio is 20%, sync write&wait
will hardly be triggered by pure dirty pages"

I am not sure exactly what you mean, but at least DEF_PRIORITY/3 seems
to be related to dirty_ratio, which can always be changed by the admin.
Then do we have to determine the magic value (DEF_PRIORITY/3) in
proportion to dirty_ratio?

--
Kind regards,
Minchan Kim