vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages [Kernel]

Prev: vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages
Next: writeback: sync old inodes first in background writeback

From: Mel Gorman on 27 Jul 2010 10:50

On Tue, Jul 27, 2010 at 10:34:23PM +0800, Wu Fengguang wrote:
> > If you plan to keep wakeup_flusher_threads(), a simpler form may be
> > sufficient, eg.
> >
> > laptop_mode ? 0 : (nr_dirty * 16)
>
> This number is not sensitive because the writeback code may well round
> it up to some more IO efficient value (currently 4MB). AFAIK the
> nr_pages parameters passed by all existing flusher callers are some
> rule-of-thumb value, and far from being an exact number.
>

I get that it's a rule of thumb but decided I would still pass in some value
related to nr_dirty that was bounded in some manner. Currently, that bound
is 4MB but maybe it should have been bound to MAX_WRITEBACK_PAGES (which is
4MB for x86, but could be anything depending on the base page size).

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mel Gorman on 31 Jul 2010 06:40

On Fri, Jul 30, 2010 at 03:06:01PM -0700, Andrew Morton wrote:
> On Fri, 30 Jul 2010 14:37:00 +0100
> Mel Gorman <mel(a)csn.ul.ie> wrote:
>
> > There are a number of cases where pages get cleaned but two of concern
> > to this patch are;
> > o When dirtying pages, processes may be throttled to clean pages if
> > dirty_ratio is not met.
>
> Ambiguous. I assume you meant "if dirty_ratio is exceeded".
>

Yes.

> > o Pages belonging to inodes dirtied longer than
> > dirty_writeback_centisecs get cleaned.
> >
> > The problem for reclaim is that dirty pages can reach the end of the LRU if
> > pages are being dirtied slowly so that neither the throttling or a flusher
> > thread waking periodically cleans them.
> >
> > Background flush is already cleaning old or expired inodes first but the
> > expire time is too far in the future at the time of page reclaim. To mitigate
> > future problems, this patch wakes flusher threads to clean 4M of data -
> > an amount that should be manageable without causing congestion in many cases.
> >
> > Ideally, the background flushers would only be cleaning pages belonging
> > to the zone being scanned but it's not clear if this would be of benefit
> > (less IO) or not (potentially less efficient IO if an inode is scattered
> > across multiple zones).
> >
>
> Sigh. We have sooo many problems with writeback and latency. Read
> https://bugzilla.kernel.org/show_bug.cgi?id=12309 and weep.

You aren't joking.

> Everyone's
> running away from the issue and here we are adding code to solve some
> alleged stack-overflow problem which seems to be largely a non-problem,
> by making changes which may worsen our real problems.
>

As it is, filesystems are beginnning to ignore writeback from direct
reclaim - such as xfs and btrfs. I'm lead to believe that ext3
effectively ignores writeback from direct reclaim although I don't have
access to code at the moment to double check (am on the road). So either
way, we are going to be facing this problem so the VM might as well be
aware of it :/

> direct-reclaim wants to write a dirty page because that page is in the
> zone which the caller wants to allcoate from! Telling the flusher
> threads to perform generic writeback will sometimes cause them to just
> gum the disk up with pages from different zones, making it even
> harder/slower to allocate a page from the zones we're interested in,
> no?
>

It's a possibility, but it can happen anyway if the filesystem is ignoring
writeback requests from direct reclaim. I considered passing in the zone to
flusher threads to clean nr_pages from a given zone but then worried about
getting caught by the "poor IO pattern" people and what happened if two
zones needed cleaning with a single inodes pages in both.

> If/when that happens, the problem will be rare, subtle, will take a
> long time to get reported and will take years to understand and fix and
> will probably be reported in the monster bug report which everyone's
> hiding from anyway.
>

With the second patch reducing the number of dirty pages encountered by page
reclaim, I'm hoping there will be some impact on latency. I'll be back online
properly Tuesday and will try reproducing some of the problems in that bug
and see can I spot an underlying cause of some sort.

Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mel Gorman on 5 Aug 2010 10:20

On Thu, Aug 05, 2010 at 03:45:24PM +0900, KOSAKI Motohiro wrote:
>
> sorry for the _very_ delayed review.
>

Not to worry.

> > <SNIP>
> > +/*
> > + * When reclaim encounters dirty data, wakeup flusher threads to clean
> > + * a maximum of 4M of data.
> > + */
> > +#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> > +#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> > +static inline long nr_writeback_pages(unsigned long nr_dirty)
> > +{
> > + return laptop_mode ? 0 :
> > + min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> > +}
>
> ??
>
> As far as I remembered, Hannes pointed out wakeup_flusher_threads(0) is
> incorrect. can you fix this?
>

It's behaving as it should, see http://lkml.org/lkml/2010/7/20/151

>
>
> > +
> > static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> > struct scan_control *sc)
> > {
> > @@ -649,12 +661,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> > static unsigned long shrink_page_list(struct list_head *page_list,
> > struct scan_control *sc,
> > enum pageout_io sync_writeback,
> > + int file,
> > unsigned long *nr_still_dirty)
> > {
> > LIST_HEAD(ret_pages);
> > LIST_HEAD(free_pages);
> > int pgactivate = 0;
> > unsigned long nr_dirty = 0;
> > + unsigned long nr_dirty_seen = 0;
> > unsigned long nr_reclaimed = 0;
> >
> > cond_resched();
> > @@ -748,6 +762,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > }
> >
> > if (PageDirty(page)) {
> > + nr_dirty_seen++;
> > +
> > /*
> > * Only kswapd can writeback filesystem pages to
> > * avoid risk of stack overflow
> > @@ -875,6 +891,18 @@ keep:
> >
> > list_splice(&ret_pages, page_list);
> >
> > + /*
> > + * If reclaim is encountering dirty pages, it may be because
> > + * dirty pages are reaching the end of the LRU even though the
> > + * dirty_ratio may be satisified. In this case, wake flusher
> > + * threads to pro-actively clean up to a maximum of
> > + * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> > + * !may_writepage indicates that this is a direct reclaimer in
> > + * laptop mode avoiding disk spin-ups
> > + */
> > + if (file && nr_dirty_seen && sc->may_writepage)
> > + wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
>
> Umm..
> I don't think this guessing is so acculate. following is brief of
> current isolate_lru_pages().
>
>
> static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> struct list_head *src, struct list_head *dst,
> unsigned long *scanned, int order, int mode, int file)
> {
> for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> __isolate_lru_page(page, mode, file))
>
> if (!order)
> continue;
>
> /*
> * Attempt to take all pages in the order aligned region
> * surrounding the tag page. Only take those pages of
> * the same active state as that tag page. We may safely
> * round the target page pfn down to the requested order
> * as the mem_map is guarenteed valid out to MAX_ORDER,
> * where that page is in a different zone we will detect
> * it from its zone id and abort this block scan.
> */
> for (; pfn < end_pfn; pfn++) {
> struct page *cursor_page;
> (snip)
> }
>
> (This was unchanged since initial lumpy reclaim commit)
>

I think what you are pointing out is that when lumpy-reclaiming from the anon
LRU, there may be file pages on the page_list being shrinked. In that case, we
might miss an opportunity to wake the flusher threads when it was appropriate.

Is that accurate or have you another concern?

> That said, merely order-1 isolate_lru_pages(ISOLATE_INACTIVE) makes pfn
> neighbor search. then, we might found dirty pages even though the page
> don't stay in end of lru.
>
> What do you think?
>

For low-order lumpy reclaim, I think it should only be necessary to wake
the flusher threads when scanning the file LRU. While there may be file
pages lumpy reclaimed while scanning the anon list, I think we would
have to show it was a common and real problem before adding the
necessary accounting and checks.

>
> > +
> > *nr_still_dirty = nr_dirty;
> > count_vm_events(PGACTIVATE, pgactivate);
> > return nr_reclaimed;
> > @@ -1315,7 +1343,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> > spin_unlock_irq(&zone->lru_lock);
> >
> > nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> > - &nr_dirty);
> > + file, &nr_dirty);
> >
> > /*
> > * If specific pages are needed such as with direct reclaiming
> > @@ -1351,7 +1379,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> > count_vm_events(PGDEACTIVATE, nr_active);
> >
> > nr_reclaimed += shrink_page_list(&page_list, sc,
> > - PAGEOUT_IO_SYNC, &nr_dirty);
> > + PAGEOUT_IO_SYNC, file,
> > + &nr_dirty);
> > }
> > }
> >
> > --
> > 1.7.1
> >
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev |
Pages: 1 2
Prev: vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages
Next: writeback: sync old inodes first in background writeback