From: Mel Gorman on
On Thu, Jul 22, 2010 at 08:57:34AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 21 Jul 2010 15:27:10 +0100
> Mel Gorman <mel(a)csn.ul.ie> wrote:
>
> > On Wed, Jul 21, 2010 at 09:01:11PM +0900, KAMEZAWA Hiroyuki wrote:
>
> > > But, hmm, memcg will have to select to enter this routine based on
> > > the result of 1st memory reclaim.
> > >
> >
> > It has the option of ignoring pages being dirtied but I worry that the
> > container could be filled with dirty pages waiting for flushers to do
> > something.
>
> I'll prepare dirty_ratio for memcg. It's not easy but requested by I/O cgroup
> guys, too...
>

I can see why it might be difficult. Dirty pages are not counted on a
per-container basis, so it would require either additional infrastructure
to count them or a lot of scanning.
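
For illustration only, the sort of hook per-container accounting would need
is roughly the following, based on account_page_dirtied() in
mm/page-writeback.c. mem_cgroup_inc_dirty() is hypothetical, invented here
just to show where the counting would have to happen:

	void account_page_dirtied(struct page *page, struct address_space *mapping)
	{
		if (mapping_cap_account_dirty(mapping)) {
			__inc_zone_page_state(page, NR_FILE_DIRTY);
			__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
			mem_cgroup_inc_dirty(page);	/* hypothetical per-memcg counter */
			task_dirty_inc(current);
			task_io_account_write(PAGE_CACHE_SIZE);
		}
	}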

>
> >
> > > >
> > > > - /*
> > > > - * The attempt at page out may have made some
> > > > - * of the pages active, mark them inactive again.
> > > > - */
> > > > - nr_active = clear_active_flags(&page_list, NULL);
> > > > - count_vm_events(PGDEACTIVATE, nr_active);
> > > > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> > > > + congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > >
> > >
> > > Is congestion_wait() required? Where does the congestion happen?
> > > I'm sorry if you already have some other trick in another patch.
> > >
> >
> > It's to wait for the IO to occur.
> >
>
> 1 tick penalty seems too large. I hope we can have some waitqueue in future.
>

congestion_wait() puts the caller on a waitqueue that is woken when
congestion clears. I didn't measure it this time around but I doubt it
waits the full HZ/10 much of the time.
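
For reference, congestion_wait() is roughly the following (a sketch of the
mm/backing-dev.c implementation around this time, details vary by version).
The waiter is woken early by clear_bdi_congested() when congestion clears,
so HZ/10 is only an upper bound:

	long congestion_wait(int sync, long timeout)
	{
		long ret;
		DEFINE_WAIT(wait);
		wait_queue_head_t *wqh = &congestion_wqh[sync];

		/* sleep until woken by clear_bdi_congested() or the timeout expires */
		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
		ret = io_schedule_timeout(timeout);
		finish_wait(wqh, &wait);
		return ret;
	}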

> > > > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > > > + /*
> > > > + * The attempt at page out may have made some
> > > > + * of the pages active, mark them inactive again.
> > > > + */
> > > > + nr_active = clear_active_flags(&page_list, NULL);
> > > > + count_vm_events(PGDEACTIVATE, nr_active);
> > > > +
> > > > + nr_reclaimed += shrink_page_list(&page_list, sc,
> > > > + PAGEOUT_IO_SYNC, &nr_dirty);
> > > > + }
> > >
> > > Just a question. Does this PAGEOUT_IO_SYNC have some meaning?
> > >
> >
> > Yes, in pageout it will wait on pages currently being written back to be
> > cleaned before trying to reclaim them.
> >
> Hmm. IIUC, this routine is called only when !current_is_kswapd() and
> pageout is done only when current_is_kswapd(). So, this seems ....
> Wrong ?
>

Both direct reclaim and kswapd can reach shrink_inactive_list:

Direct reclaim
  do_try_to_free_pages
    -> shrink_zones
      -> shrink_zone
        -> shrink_list
          -> shrink_inactive_list   <--- the routine in question

Kswapd
  balance_pgdat
    -> shrink_zone
      -> shrink_list
        -> shrink_inactive_list

pageout() is still called by direct reclaim if the page is anon, so it
will synchronously wait on those if PAGEOUT_IO_SYNC is set. Either anon
or file pages that are currently being written back will be waited on in
shrink_page_list() if PAGEOUT_IO_SYNC is set.
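
For reference, the writeback check in shrink_page_list() is roughly the
following (a trimmed sketch of the current code, so details may differ
slightly):

	if (PageWriteback(page)) {
		/*
		 * With PAGEOUT_IO_SYNC, wait for in-flight writeback to
		 * complete so the page can be reclaimed on this pass;
		 * otherwise keep the page and move on.
		 */
		if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
			wait_on_page_writeback(page);
		else
			goto keep_locked;
	}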

So it still has meaning. Did I miss something?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
From: Mel Gorman on
On Mon, Jul 26, 2010 at 04:29:35PM +0800, Wu Fengguang wrote:
> > ==== CUT HERE ====
> > vmscan: Do not writeback filesystem pages in direct reclaim
> >
> > When memory is under enough pressure, a process may enter direct
> > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > encountered during the scan, this page is written to backing storage using
> > mapping->writepage. This can result in very deep call stacks, particularly
> > if the target storage or filesystem are complex. It has already been observed
> > on XFS that the stack overflows but the problem is not XFS-specific.
> >
> > This patch prevents direct reclaim writing back filesystem pages by checking
> > if current is kswapd or the page is anonymous before writing back. If the
> > dirty pages cannot be written back, they are placed back on the LRU lists
> > for either background writing by the BDI threads or kswapd. If in direct
> > lumpy reclaim and dirty pages are encountered, the process will stall for
> > the background flusher before trying to reclaim the pages again.
> >
> > As the call-chain for writing anonymous pages is not expected to be deep
> > and they are not cleaned by flusher threads, anonymous pages are still
> > written back in direct reclaim.
>
> This is also a good step towards reducing pageout() calls. For better
> IO performance the flusher threads should take more work from pageout().
>

This is true for better IO performance all right, but reclaim does
require specific pages to be cleaned. The strict requirement is when
lumpy reclaim is involved; the looser requirement is that some pages
within a particular zone be cleaned.

> > Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
> > Acked-by: Rik van Riel <riel(a)redhat.com>
> > ---
> > mm/vmscan.c | 55 +++++++++++++++++++++++++++++++++++++++----------------
> > 1 files changed, 39 insertions(+), 16 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 6587155..45d9934 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -139,6 +139,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
> > #define scanning_global_lru(sc) (1)
> > #endif
> >
> > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */
> > +#define MAX_SWAP_CLEAN_WAIT 50
> > +
> > static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> > struct scan_control *sc)
> > {
> > @@ -644,11 +647,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> > */
> > static unsigned long shrink_page_list(struct list_head *page_list,
> > struct scan_control *sc,
> > - enum pageout_io sync_writeback)
> > + enum pageout_io sync_writeback,
> > + unsigned long *nr_still_dirty)
> > {
> > LIST_HEAD(ret_pages);
> > LIST_HEAD(free_pages);
> > int pgactivate = 0;
> > + unsigned long nr_dirty = 0;
> > unsigned long nr_reclaimed = 0;
> >
> > cond_resched();
> > @@ -742,6 +747,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > }
> >
> > if (PageDirty(page)) {
> > + /*
> > + * Only kswapd can writeback filesystem pages to
> > + * avoid risk of stack overflow
> > + */
> > + if (page_is_file_cache(page) && !current_is_kswapd()) {
> > + nr_dirty++;
> > + goto keep_locked;
> > + }
> > +
> > if (references == PAGEREF_RECLAIM_CLEAN)
> > goto keep_locked;
> > if (!may_enter_fs)
> > @@ -858,7 +872,7 @@ keep:
> >
> > free_page_list(&free_pages);
> >
> > - list_splice(&ret_pages, page_list);
> > + *nr_still_dirty = nr_dirty;
> > count_vm_events(PGACTIVATE, pgactivate);
> > return nr_reclaimed;
> > }
> > @@ -1245,6 +1259,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> > unsigned long nr_active;
> > unsigned long nr_anon;
> > unsigned long nr_file;
> > + unsigned long nr_dirty;
> >
> > while (unlikely(too_many_isolated(zone, file, sc))) {
> > congestion_wait(BLK_RW_ASYNC, HZ/10);
> > @@ -1293,26 +1308,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >
> > spin_unlock_irq(&zone->lru_lock);
> >
> > - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> > + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> > + &nr_dirty);
> >
> > /*
> > - * If we are direct reclaiming for contiguous pages and we do
> > + * If specific pages are needed such as with direct reclaiming
> > + * for contiguous pages or for memory containers and we do
> > * not reclaim everything in the list, try again and wait
> > - * for IO to complete. This will stall high-order allocations
> > - * but that should be acceptable to the caller
> > + * for IO to complete. This will stall callers that require
> > + * specific pages but it should be acceptable to the caller
> > */
> > - if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> > - sc->lumpy_reclaim_mode) {
> > - congestion_wait(BLK_RW_ASYNC, HZ/10);
> > + if (sc->may_writepage && !current_is_kswapd() &&
> > + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> > + int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> >
> > - /*
> > - * The attempt at page out may have made some
> > - * of the pages active, mark them inactive again.
> > - */
> > - nr_active = clear_active_flags(&page_list, NULL);
> > - count_vm_events(PGDEACTIVATE, nr_active);
> > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> > + congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> It needs good luck for the flusher threads to "happen to" sync the
> dirty pages in our page_list.

That is why I'm expecting the "shrink oldest inode" patchset to help. It
still requires a certain amount of luck but callers that encounter dirty
pages will be delayed.

It's also because a certain amount of luck is required that the last patch
in the series aims at reducing the number of dirty pages encountered by
reclaim. The closer that is to 0, the less important the timing of flusher
threads is.

> I'd rather take the logic as "there are
> too many dirty pages, shrink them to avoid some future pageout() calls
> and/or congestion_wait() stalls".
>

What do you mean by shrink them? They cannot be reclaimed until they are
clean.

> So the loop is likely to repeat MAX_SWAP_CLEAN_WAIT times. Let's remove it?
>

This loop only applies to direct reclaimers in lumpy reclaim mode and
memory containers. Both need specific pages to be cleaned and freed.
Hence, the loop is to stall them and wait on flusher threads up to a
point. Otherwise they can cause a reclaim storm of clean pages that
can't be used.

Current tests have not indicated MAX_SWAP_CLEAN_WAIT is regularly reached
but I am inferring this from timing data rather than a direct measurement.
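
For reference, the worst case implied by the constants in the patch,
assuming congestion_wait() times out on every iteration, is

	MAX_SWAP_CLEAN_WAIT * HZ/10 = 50 * ~100ms ~= 5 seconds

per shrink_inactive_list() call, which is what the "waits up to 5 seconds"
comment refers to.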

> > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > + /*
> > + * The attempt at page out may have made some
> > + * of the pages active, mark them inactive again.
> > + */
> > + nr_active = clear_active_flags(&page_list, NULL);
> > + count_vm_events(PGDEACTIVATE, nr_active);
> > +
> > + nr_reclaimed += shrink_page_list(&page_list, sc,
> > + PAGEOUT_IO_SYNC, &nr_dirty);
>
> This shrink_page_list() won't be called at all if nr_dirty==0 and
> pageout() was called. This is a change of behavior. It can also be
> fixed by removing the loop.
>

The whole patch is a change of behaviour but in this case it also makes
sense to focus on just the dirty pages. The first shrink_page_list()
decided that the pages could not be unmapped and reclaimed, probably
because they were referenced. This is not likely to change during the loop.

Testing with a version of the patch that processed the full list added
significant stalls when sync writeback was involved. Test completion time
tripled in one case, implying that this loop was continually reaching
MAX_SWAP_CLEAN_WAIT.

The intention of this loop is "wait on dirty pages to be cleaned" and
it's a change of behaviour, but one that makes sense and testing
indicates it's a good idea.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
From: Mel Gorman on
On Mon, Jul 26, 2010 at 07:19:53PM +0800, Wu Fengguang wrote:
> On Mon, Jul 26, 2010 at 05:12:27PM +0800, Mel Gorman wrote:
> > On Mon, Jul 26, 2010 at 04:29:35PM +0800, Wu Fengguang wrote:
> > > > ==== CUT HERE ====
> > > > vmscan: Do not writeback filesystem pages in direct reclaim
> > > >
> > > > When memory is under enough pressure, a process may enter direct
> > > > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > > > encountered during the scan, this page is written to backing storage using
> > > > mapping->writepage. This can result in very deep call stacks, particularly
> > > > if the target storage or filesystem are complex. It has already been observed
> > > > on XFS that the stack overflows but the problem is not XFS-specific.
> > > >
> > > > This patch prevents direct reclaim writing back filesystem pages by checking
> > > > if current is kswapd or the page is anonymous before writing back. If the
> > > > dirty pages cannot be written back, they are placed back on the LRU lists
> > > > for either background writing by the BDI threads or kswapd. If in direct
> > > > lumpy reclaim and dirty pages are encountered, the process will stall for
> > > > the background flusher before trying to reclaim the pages again.
> > > >
> > > > As the call-chain for writing anonymous pages is not expected to be deep
> > > > and they are not cleaned by flusher threads, anonymous pages are still
> > > > written back in direct reclaim.
> > >
> > > This is also a good step towards reducing pageout() calls. For better
> > > IO performance the flusher threads should take more work from pageout().
> > >
> >
> > This is true for better IO performance all right, but reclaim does
> > require specific pages to be cleaned. The strict requirement is when
> > lumpy reclaim is involved; the looser requirement is that some pages
> > within a particular zone be cleaned.
>
> Good point, I missed the lumpy reclaim requirement. It seems necessary
> to add a call to the flusher thread to writeback a specific inode range
> (that contains the current dirty page). This is a more reliable way to
> ensure both the strict and looser requirements: the current dirty page
> will guaranteed to be synced, and the inode will have good opportunity
> to contain more dirty pages in the zone, which can be freed quickly if
> tagged PG_reclaim.
>

I'm not sure about an inode range. The window being considered is quite small
and we might select ranges that are too small to be useful. However, taking
the inodes into account makes sense. If wakeup_flusher_threads() took a list
of unique inodes that own dirty pages encountered by reclaim, it could then
move those inodes to the head of the queue rather than depending just on
expired inodes.
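
Purely to illustrate the idea, a hypothetical interface might look like the
sketch below; neither the function nor the way reclaim would collect the
inodes exists today:

	/*
	 * Hypothetical: reclaim collects the inodes owning the dirty pages it
	 * skipped and asks the flusher threads to service those inodes ahead
	 * of the merely-expired ones.
	 */
	void wakeup_flusher_threads_inodes(struct list_head *inode_list);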

> > > > <SNIP>
> > > >
> > > > @@ -1293,26 +1308,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> > > >
> > > > spin_unlock_irq(&zone->lru_lock);
> > > >
> > > > - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> > > > + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> > > > + &nr_dirty);
> > > >
> > > > /*
> > > > - * If we are direct reclaiming for contiguous pages and we do
> > > > + * If specific pages are needed such as with direct reclaiming
> > > > + * for contiguous pages or for memory containers and we do
> > > > * not reclaim everything in the list, try again and wait
> > > > - * for IO to complete. This will stall high-order allocations
> > > > - * but that should be acceptable to the caller
> > > > + * for IO to complete. This will stall callers that require
> > > > + * specific pages but it should be acceptable to the caller
> > > > */
> > > > - if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> > > > - sc->lumpy_reclaim_mode) {
> > > > - congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > + if (sc->may_writepage && !current_is_kswapd() &&
> > > > + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> > > > + int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> > > >
> > > > - /*
> > > > - * The attempt at page out may have made some
> > > > - * of the pages active, mark them inactive again.
> > > > - */
> > > > - nr_active = clear_active_flags(&page_list, NULL);
> > > > - count_vm_events(PGDEACTIVATE, nr_active);
> > > > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> > > > + congestion_wait(BLK_RW_ASYNC, HZ/10);
> > >
> > > It needs good luck for the flusher threads to "happen to" sync the
> > > dirty pages in our page_list.
> >
> > That is why I'm expecting the "shrink oldest inode" patchset to help. It
> > still requires a certain amount of luck but callers that encounter dirty
> > pages will be delayed.
> >
> > It's also because a certain amount of luck is required that the last patch
> > in the series aims at reducing the number of dirty pages encountered by
> > reclaim. The closer that is to 0, the less important the timing of flusher
> > threads is.
>
> OK.
>
> > > I'd rather take the logic as "there are
> > > too many dirty pages, shrink them to avoid some future pageout() calls
> > > and/or congestion_wait() stalls".
> > >
> >
> > What do you mean by shrink them? They cannot be reclaimed until they are
> > clean.
>
> I mean we are freeing much more than nr_dirty pages. In this sense we
> are shrinking the number of dirty pages. Note that we are calling
> wakeup_flusher_threads(nr_dirty), however the real synced pages will
> be much more than nr_dirty, that is reasonable good behavior.
>

Ok.

> > > So the loop is likely to repeat MAX_SWAP_CLEAN_WAIT times. Let's remove it?
> > >
> >
> > This loop only applies to direct reclaimers in lumpy reclaim mode and
> > memory containers. Both need specific pages to be cleaned and freed.
> > Hence, the loop is to stall them and wait on flusher threads up to a
> > point. Otherwise they can cause a reclaim storm of clean pages that
> > can't be used.
>
> Agreed. We could call the flusher to sync the inode explicitly, as
> recommended above. This will clean and free (with PG_reclaim) the page
> in seconds. With reasonable waits here we may avoid reclaim storm
> effectively.
>

I'll follow this suggestion as a new patch.

> > Current tests have not indicated MAX_SWAP_CLEAN_WAIT is regularly reached
> > but I am inferring this from timing data rather than a direct measurement.
> >
> > > > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > > > + /*
> > > > + * The attempt at page out may have made some
> > > > + * of the pages active, mark them inactive again.
> > > > + */
> > > > + nr_active = clear_active_flags(&page_list, NULL);
> > > > + count_vm_events(PGDEACTIVATE, nr_active);
> > > > +
> > > > + nr_reclaimed += shrink_page_list(&page_list, sc,
> > > > + PAGEOUT_IO_SYNC, &nr_dirty);
> > >
> > > This shrink_page_list() won't be called at all if nr_dirty==0 and
> > > pageout() was called. This is a change of behavior. It can also be
> > > fixed by removing the loop.
> > >
> >
> > The whole patch is a change of behaviour but in this case it also makes
> > sense to focus on just the dirty pages. The first shrink_page_list()
> > decided that the pages could not be unmapped and reclaimed, probably
> > because they were referenced. This is not likely to change during the loop.
>
> Agreed.
>
> > Testing with a version of the patch that processed the full list added
> > significant stalls when sync writeback was involved. Test completion time
> > tripled in one case, implying that this loop was continually reaching
> > MAX_SWAP_CLEAN_WAIT.
>
> I'm OK with the change actually, this removes one not-that-user-friendly
> wait_on_page_writeback().
>
> > The intention of this loop is "wait on dirty pages to be cleaned" and
> > it's a change of behaviour, but one that makes sense and testing
> > indicates it's a good idea.
>
> I mean, this loop may be unwinded. And we may need another loop to
> sync the inodes that contains the dirty pages.
>

I'm not quite sure what you mean here but I think it might tie into the
idea of passing a list of inodes to the flusher threads. Let's see what
that ends up looking like.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
From: Mel Gorman on
On Thu, Aug 05, 2010 at 03:59:37PM +0900, KOSAKI Motohiro wrote:
>
> again, very sorry for the delay.
>

No problem.

> > When memory is under enough pressure, a process may enter direct
> > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > encountered during the scan, this page is written to backing storage using
> > mapping->writepage. This can result in very deep call stacks, particularly
> > if the target storage or filesystem are complex. It has already been observed
> > on XFS that the stack overflows but the problem is not XFS-specific.
> >
> > This patch prevents direct reclaim writing back filesystem pages by checking
> > if current is kswapd or the page is anonymous before writing back. If the
> > dirty pages cannot be written back, they are placed back on the LRU lists
> > for either background writing by the BDI threads or kswapd. If in direct
> > lumpy reclaim and dirty pages are encountered, the process will stall for
> > the background flusher before trying to reclaim the pages again.
> >
> > As the call-chain for writing anonymous pages is not expected to be deep
> > and they are not cleaned by flusher threads, anonymous pages are still
> > written back in direct reclaim.
> >
> > Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
> > Acked-by: Rik van Riel <riel(a)redhat.com>
> > Reviewed-by: Johannes Weiner <hannes(a)cmpxchg.org>
> > ---
> > mm/vmscan.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++------------
> > 1 files changed, 54 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d83812a..2d2b588 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -139,6 +139,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
> > #define scanning_global_lru(sc) (1)
> > #endif
> >
> > +/* Direct lumpy reclaim waits up to five seconds for background cleaning */
> > +#define MAX_SWAP_CLEAN_WAIT 50
> > +
> > static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> > struct scan_control *sc)
> > {
> > @@ -645,11 +648,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> > */
> > static unsigned long shrink_page_list(struct list_head *page_list,
> > struct scan_control *sc,
> > - enum pageout_io sync_writeback)
> > + enum pageout_io sync_writeback,
> > + unsigned long *nr_still_dirty)
> > {
> > LIST_HEAD(ret_pages);
> > LIST_HEAD(free_pages);
> > int pgactivate = 0;
> > + unsigned long nr_dirty = 0;
> > unsigned long nr_reclaimed = 0;
> >
> > cond_resched();
> > @@ -743,6 +748,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > }
> >
> > if (PageDirty(page)) {
> > + /*
> > + * Only kswapd can writeback filesystem pages to
> > + * avoid risk of stack overflow
> > + */
> > + if (page_is_file_cache(page) && !current_is_kswapd()) {
> > + nr_dirty++;
> > + goto keep_locked;
> > + }
> > +
> > if (references == PAGEREF_RECLAIM_CLEAN)
> > goto keep_locked;
> > if (!may_enter_fs)
> > @@ -860,6 +874,8 @@ keep:
> > free_page_list(&free_pages);
> >
> > list_splice(&ret_pages, page_list);
> > +
> > + *nr_still_dirty = nr_dirty;
> > count_vm_events(PGACTIVATE, pgactivate);
> > return nr_reclaimed;
> > }
> > @@ -1242,12 +1258,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> > struct scan_control *sc, int priority, int file)
> > {
> > LIST_HEAD(page_list);
> > + LIST_HEAD(putback_list);
> > unsigned long nr_scanned;
> > unsigned long nr_reclaimed = 0;
> > unsigned long nr_taken;
> > unsigned long nr_active;
> > unsigned long nr_anon;
> > unsigned long nr_file;
> > + unsigned long nr_dirty;
> >
> > while (unlikely(too_many_isolated(zone, file, sc))) {
> > congestion_wait(BLK_RW_ASYNC, HZ/10);
> > @@ -1296,28 +1314,49 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >
> > spin_unlock_irq(&zone->lru_lock);
> >
> > - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> > + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> > + &nr_dirty);
> >
> > /*
> > - * If we are direct reclaiming for contiguous pages and we do
> > + * If specific pages are needed such as with direct reclaiming
> > + * for contiguous pages or for memory containers and we do
> > * not reclaim everything in the list, try again and wait
> > - * for IO to complete. This will stall high-order allocations
> > - * but that should be acceptable to the caller
> > + * for IO to complete. This will stall callers that require
> > + * specific pages but it should be acceptable to the caller
> > */
> > - if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> > - sc->lumpy_reclaim_mode) {
> > - congestion_wait(BLK_RW_ASYNC, HZ/10);
> > + if (sc->may_writepage && !current_is_kswapd() &&
> > + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
> > + int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> >
> > - /*
> > - * The attempt at page out may have made some
> > - * of the pages active, mark them inactive again.
> > - */
> > - nr_active = clear_active_flags(&page_list, NULL);
> > - count_vm_events(PGDEACTIVATE, nr_active);
> > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > + struct page *page, *tmp;
> > +
> > + /* Take off the clean pages marked for activation */
> > + list_for_each_entry_safe(page, tmp, &page_list, lru) {
> > + if (PageDirty(page) || PageWriteback(page))
> > + continue;
> > +
> > + list_del(&page->lru);
> > + list_add(&page->lru, &putback_list);
> > + }
> > +
> > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
>
> ditto.
> wakeup_flusher_threads(0) is not correct?
>

It's correct. In laptop mode, once the disk has to spin up anyway, we may
as well clean everything.
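
For reference, wakeup_flusher_threads() treats 0 as "write back everything";
a trimmed sketch of the fs/fs-writeback.c logic (details vary by version):

	void wakeup_flusher_threads(long nr_pages)
	{
		/* 0 means "all dirty pages", which is what laptop_mode wants */
		if (nr_pages == 0)
			nr_pages = global_page_state(NR_FILE_DIRTY) +
				   global_page_state(NR_UNSTABLE_NFS);

		/* ... each BDI with dirty IO is then handed a writeback work item ... */
	}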

> And, when the flusher thread still hasn't started IO, this loop doesn't
> have a proper wait. Do we need wait_on_page_dirty() or something?
> (similar to wait_on_page_writeback)
>

If IO is not started on the correct pages, the flusher threads will be
rekicked for more work and another attempt is made at shrink_page_list.

>
>
> > + congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> As we discussed, congestion_wait() doesn't work well if a slow storage
> device is connected.
>

I currently support the removal of this congestion_wait(), but it belongs
in its own patch.

>
> >
> > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> > + /*
> > + * The attempt at page out may have made some
> > + * of the pages active, mark them inactive again.
> > + */
> > + nr_active = clear_active_flags(&page_list, NULL);
> > + count_vm_events(PGDEACTIVATE, nr_active);
> > +
> > + nr_reclaimed += shrink_page_list(&page_list, sc,
> > + PAGEOUT_IO_SYNC, &nr_dirty);
>
> After my patch, retrying after a PAGEOUT_IO_SYNC failure is not a good idea.
> Can we remove this loop?
>

Such a removal belongs in the series related to lowering the latency of
lumpy reclaim. This patch is just about preventing dirty file pages being
written back by direct reclaim.

>
> > + }
> > }
> >
> > + list_splice(&putback_list, &page_list);
> > +
> > local_irq_disable();
> > if (current_is_kswapd())
> > __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
> > --
> > 1.7.1
> >
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab