From: Wu Fengguang
On Tue, Jul 27, 2010 at 10:40:26PM +0800, Mel Gorman wrote:
> On Tue, Jul 27, 2010 at 10:34:23PM +0800, Wu Fengguang wrote:
> > > If you plan to keep wakeup_flusher_threads(), a simpler form may be
> > > sufficient, eg.
> > >
> > > laptop_mode ? 0 : (nr_dirty * 16)
> >
> > This number is not sensitive because the writeback code may well round
> > it up to some more IO-efficient value (currently 4MB). AFAIK the
> > nr_pages parameters passed by all existing flusher callers are
> > rule-of-thumb values, far from exact numbers.
> >
>
> I get that it's a rule of thumb but decided I would still pass in some value
> related to nr_dirty that was bounded in some manner.
> Currently, that bound is 4MB but maybe it should have been bound to
> MAX_WRITEBACK_PAGES (which is 4MB for x86, but could be anything
> depending on the base page size).

I see your worry that a much bigger base page size could make

vmscan batch size > writeback batch size

and it's a legitimate worry.
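
For reference, the bound you describe would look something like this
(illustrative sketch only, not the actual patch; MAX_WRITEBACK_PAGES
currently lives in fs/fs-writeback.c, so making it visible to
mm/vmscan.c is an assumption here):

	/* cap the wakeup request at the writeback batch size */
	long nr_to_write = laptop_mode ? 0 :
		min_t(long, nr_dirty * 16, MAX_WRITEBACK_PAGES);

	/* nr_pages == 0 asks the flushers to write back everything */
	wakeup_flusher_threads(nr_to_write);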

Thanks,
Fengguang
From: Andrew Morton
On Fri, 30 Jul 2010 14:37:00 +0100
Mel Gorman <mel(a)csn.ul.ie> wrote:

> There are a number of cases where pages get cleaned but two of concern
> to this patch are:
> o When dirtying pages, processes may be throttled to clean pages if
> dirty_ratio is not met.

Ambiguous. I assume you meant "if dirty_ratio is exceeded".

> o Pages belonging to inodes dirtied longer than
> dirty_writeback_centisecs get cleaned.
>
> The problem for reclaim is that dirty pages can reach the end of the LRU if
> pages are being dirtied slowly so that neither the throttling nor a flusher
> thread waking periodically cleans them.
>
> Background flush is already cleaning old or expired inodes first but the
> expire time is too far in the future at the time of page reclaim. To mitigate
> future problems, this patch wakes flusher threads to clean 4M of data -
> an amount that should be manageable without causing congestion in many cases.
>
> Ideally, the background flushers would only be cleaning pages belonging
> to the zone being scanned but it's not clear if this would be of benefit
> (less IO) or not (potentially less efficient IO if an inode is scattered
> across multiple zones).
>

Sigh. We have sooo many problems with writeback and latency. Read
https://bugzilla.kernel.org/show_bug.cgi?id=12309 and weep. Everyone's
running away from the issue and here we are adding code to solve some
alleged stack-overflow problem which seems to be largely a non-problem,
by making changes which may worsen our real problems.

direct-reclaim wants to write a dirty page because that page is in the
zone which the caller wants to allocate from! Telling the flusher
threads to perform generic writeback will sometimes cause them to just
gum the disk up with pages from different zones, making it even
harder/slower to allocate a page from the zones we're interested in,
no?

If/when that happens, the problem will be rare, subtle, will take a
long time to get reported and will take years to understand and fix and
will probably be reported in the monster bug report which everyone's
hiding from anyway.

From: Trond Myklebust
On Fri, 2010-07-30 at 15:06 -0700, Andrew Morton wrote:
> On Fri, 30 Jul 2010 14:37:00 +0100
> Mel Gorman <mel(a)csn.ul.ie> wrote:
>
> > There are a number of cases where pages get cleaned but two of concern
> > to this patch are:
> > o When dirtying pages, processes may be throttled to clean pages if
> > dirty_ratio is not met.
>
> Ambiguous. I assume you meant "if dirty_ratio is exceeded".
>
> > o Pages belonging to inodes dirtied longer than
> > dirty_writeback_centisecs get cleaned.
> >
> > The problem for reclaim is that dirty pages can reach the end of the LRU if
> > pages are being dirtied slowly so that neither the throttling nor a flusher
> > thread waking periodically cleans them.
> >
> > Background flush is already cleaning old or expired inodes first but the
> > expire time is too far in the future at the time of page reclaim. To mitigate
> > future problems, this patch wakes flusher threads to clean 4M of data -
> > an amount that should be manageable without causing congestion in many cases.
> >
> > Ideally, the background flushers would only be cleaning pages belonging
> > to the zone being scanned but it's not clear if this would be of benefit
> > (less IO) or not (potentially less efficient IO if an inode is scattered
> > across multiple zones).
> >
>
> Sigh. We have sooo many problems with writeback and latency. Read
> https://bugzilla.kernel.org/show_bug.cgi?id=12309 and weep. Everyone's
> running away from the issue and here we are adding code to solve some
> alleged stack-overflow problem which seems to be largely a non-problem,
> by making changes which may worsen our real problems.
>
> direct-reclaim wants to write a dirty page because that page is in the
> zone which the caller wants to allocate from! Telling the flusher
> threads to perform generic writeback will sometimes cause them to just
> gum the disk up with pages from different zones, making it even
> harder/slower to allocate a page from the zones we're interested in,
> no?
>
> If/when that happens, the problem will be rare, subtle, will take a
> long time to get reported and will take years to understand and fix and
> will probably be reported in the monster bug report which everyone's
> hiding from anyway.

There is that, and then there are issues with the VM simply lying to the
filesystems.

See https://bugzilla.kernel.org/show_bug.cgi?id=16056

Which basically boils down to the following: kswapd tells the filesystem
that it is quite safe to do GFP_KERNEL allocations in pageouts and as
part of try_to_release_page().

In the case of pageouts, it does set the 'WB_SYNC_NONE', 'nonblocking'
and 'for_reclaim' flags in the writeback_control struct, and so the
filesystem has at least some hint that it should do non-blocking i/o.
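
For reference, this is roughly what pageout() in mm/vmscan.c hands the
filesystem today (paraphrased, so take the exact fields as an
approximation):

	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,
		.nr_to_write	= SWAP_CLUSTER_MAX,
		.range_start	= 0,
		.range_end	= LLONG_MAX,
		.nonblocking	= 1,
		.for_reclaim	= 1,
	};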

However if you trust the GFP_KERNEL flag in try_to_release_page() then
the kernel can and will deadlock, and so I had to add in a hack
specifically to tell the NFS client not to trust that flag if it comes
from kswapd.
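
The hack boils down to something like this (a simplified sketch, not
the exact fs/nfs/file.c code; it assumes current_is_kswapd() from
linux/swap.h):

	static int nfs_release_page(struct page *page, gfp_t gfp)
	{
		struct address_space *mapping = page->mapping;

		/* Only do IO if gfp really is a superset of GFP_KERNEL... */
		if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL) {
			int how = FLUSH_SYNC;

			/* ...and never let kswapd wait on commit completion */
			if (current_is_kswapd())
				how = 0;
			nfs_commit_inode(mapping->host, how);
		}
		/* the sketch conservatively refuses to release the page */
		return 0;
	}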

Trond

From: Wu Fengguang
> Sigh. We have sooo many problems with writeback and latency. Read
> https://bugzilla.kernel.org/show_bug.cgi?id=12309 and weep. Everyone's
> running away from the issue and here we are adding code to solve some
> alleged stack-overflow problem which seems to be largely a non-problem,
> by making changes which may worsen our real problems.

This looks like some vmscan/writeback interaction issue.

Firstly, the CFQ IO scheduler can already prevent read IO from being
delayed by lots of ASYNC write IO; see commits 365722bb/8e2967555
from late 2009.

Reading a big file in an idle system:
680897928 bytes (681 MB) copied, 15.8986 s, 42.8 MB/s

Reading a big file while doing sequential writes to another file:
680897928 bytes (681 MB) copied, 27.6007 s, 24.7 MB/s
680897928 bytes (681 MB) copied, 25.6592 s, 26.5 MB/s

So CFQ offers reasonable read performance under heavy writeback.

Secondly, I can only feel the responsiveness lag when there is
memory pressure _in addition to_ heavy writeback.

cp /dev/zero /tmp

No lags.

usemem 1g --sleep 1000

Still no lags.

usemem 1g --sleep 1000

Still no lags.

usemem 1g --sleep 1000

Begin to feel lags at times. My desktop has 4G memory and no swap
space. So the lags are correlated with page reclaim pressure.

The above symptoms match very well the problems addressed by the
patches posted by KOSAKI and me:

- vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
- vmscan: synchronous lumpy reclaim don't call congestion_wait()

However kernels as early as 2.6.18 are reported to have the problem,
so there may be more hidden issues.

Thanks,
Fengguang
From: Wu Fengguang
> Sigh. We have sooo many problems with writeback and latency. Read
> https://bugzilla.kernel.org/show_bug.cgi?id=12309 and weep. Everyone's
> running away from the issue and here we are adding code to solve some
> alleged stack-overflow problem which seems to be largely a non-problem,
> by making changes which may worsen our real problems.

I'm sweeping through bug 12309. Most people report some data writes,
though relatively few explicitly state that memory pressure is another
necessary condition.

One interesting report is #3. Thomas reported the same slowdown
_without_ any IO. He was able to narrow down the bug to somewhere
between 2.6.20.21 and 2.6.22.19. I searched through the git history and found
a congestion_wait() in commit 232ea4d69d (throttle_vm_writeout():
don't loop on GFP_NOFS and GFP_NOIO allocations) which was later
removed by commit 369f2389e7 (writeback: remove unnecessary wait in
throttle_vm_writeout()).

How can the congestion_wait(HZ/10) be a problem? Because it
unconditionally goes to sleep. So if no IO is underway, it effectively
becomes a schedule_timeout(HZ/10): there are no IO completion events
to wake it up early.
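
For reference, congestion_wait() is roughly the following (simplified
from mm/backing-dev.c); the only early wakeup comes from
clear_bdi_congested() on the IO completion path:

	long congestion_wait(int sync, long timeout)
	{
		long ret;
		DEFINE_WAIT(wait);
		wait_queue_head_t *wqh = &congestion_wqh[sync];

		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
		/* with no IO in flight, this sleeps the full timeout */
		ret = io_schedule_timeout(timeout);
		finish_wait(wqh, &wait);
		return ret;
	}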

Thanks,
Fengguang