Avoid the use of congestion_wait under zone pressure [Kernel]

Prev: microblaze: Fix Makefile to delete build generated files
Next: [GIT PULL] MFD fixes for 2.6.34

From: Mel Gorman on 12 Mar 2010 04:20

On Thu, Mar 11, 2010 at 03:41:24PM -0800, Andrew Morton wrote:
> On Mon, 8 Mar 2010 11:48:20 +0000
> Mel Gorman <mel(a)csn.ul.ie> wrote:
>
> > Under memory pressure, the page allocator and kswapd can go to sleep using
> > congestion_wait(). In two of these cases, it may not be the appropriate
> > action as congestion may not be the problem.
>
> clear_bdi_congested() is called each time a write completes and the
> queue is below the congestion threshold.
>

Where you appear to get a kicking is if you want on "congestion" but no
writes are involved. In that case you potentially sleep for the whole timeout
waiting on an event that is not going to occur.

> So if the page allocator or kswapd call congestion_wait() against a
> non-congested queue, they'll wake up on the very next write completion.
>
> Hence the above-quoted claim seems to me to be a significant mis-analysis and
> perhaps explains why the patchset didn't seem to help anything?
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mel Gorman on 12 Mar 2010 05:50

On Fri, Mar 12, 2010 at 02:05:26AM -0500, Andrew Morton wrote:
> On Fri, 12 Mar 2010 07:39:26 +0100 Christian Ehrhardt <ehrhardt(a)linux.vnet.ibm.com> wrote:
>
> >
> >
> > Andrew Morton wrote:
> > > On Mon, 8 Mar 2010 11:48:20 +0000
> > > Mel Gorman <mel(a)csn.ul.ie> wrote:
> > >
> > >> Under memory pressure, the page allocator and kswapd can go to sleep using
> > >> congestion_wait(). In two of these cases, it may not be the appropriate
> > >> action as congestion may not be the problem.
> > >
> > > clear_bdi_congested() is called each time a write completes and the
> > > queue is below the congestion threshold.
> > >
> > > So if the page allocator or kswapd call congestion_wait() against a
> > > non-congested queue, they'll wake up on the very next write completion.
> >
> > Well the issue came up in all kind of loads where you don't have any
> > writes at all that can wake up congestion_wait.
> > Thats true for several benchmarks, but also real workload as well e.g. A
> > backup job reading almost all files sequentially and pumping out stuff
> > via network.
>
> Why is reclaim going into congestion_wait() at all if there's heaps of
> clean reclaimable pagecache lying around?
>
> (I don't thing the read side of the congestion_wqh[] has ever been used, btw)
>

I believe it's a race albeit one that has been there a long time.

In __alloc_pages_direct_reclaim, a process does approximately the
following

1. Enters direct reclaim
2. Calls cond_reched()
3. Drain pages if necessary
4. Attempt to allocate a page

Between steps 2 and 3, it's possible to have reclaimed the pages but
another process allocate them. It then proceeds and decides try again
but calls congestion_wait() before it loops around.

Plenty of read cache reclaimed but no forward progress.

> > > Hence the above-quoted claim seems to me to be a significant mis-analysis and
> > > perhaps explains why the patchset didn't seem to help anything?
> >
> > While I might have misunderstood you and it is a mis-analysis in your
> > opinion, it fixes a -80% Throughput regression on sequential read
> > workloads, thats not nothing - its more like absolutely required :-)
> >
> > You might check out the discussion with the subject "Performance
> > regression in scsi sequential throughput (iozone) due to "e084b -
> > page-allocator: preserve PFN ordering when __GFP_COLD is set"".
> > While the original subject is misleading from todays point of view, it
> > contains a lengthy discussion about exactly when/why/where time is lost
> > due to congestion wait with a lot of traces, counters, data attachments
> > and such stuff.
>
> Well if we're not encountering lots of dirty pages in reclaim then we
> shouldn't be waiting for writes to retire, of course.
>
> But if we're not encountering lots of dirty pages in reclaim, we should
> be reclaiming pages, normally.
>

We probably are.

> I could understand reclaim accidentally going into congestion_wait() if
> it hit a large pile of pages which are unreclaimable for reasons other
> than being dirty, but is that happening in this case?
>

Probably not. It's almost certainly the race I described above.

> If not, we broke it again.
>

We were broken with respect to this in the first place. That
cond_reched() is badly placed and waiting on congestion when congestion
might not be involved is also a bit odd.

It's possible that Christian's specific problem would also be addressed
by the following patch. Christian, willing to test?

It still feels a bit unnatural though that the page allocator waits on
congestion when what it really cares about is watermarks. Even if this
patch works for Christian, I think it still has merit so will kick it a
few more times.

==== CUT HERE ====
page-allocator: Attempt page allocation immediately after direct reclaim

After a process completes direct reclaim it calls cond_resched() as
potentially it has been running a long time. When it wakes up, it
attempts to allocate a page. There is a large window during which
another process can allocate the pages reclaimed by direct reclaim. This
patch attempts to allocate a page immediately after direct reclaim but
will still go to sleep afterwards if its quantum has expired.

Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
---
mm/page_alloc.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a8182c8..973b7fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1721,8 +1721,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
lockdep_clear_current_reclaim_state();
p->flags &= ~PF_MEMALLOC;

- cond_resched();
-
if (order != 0)
drain_all_pages();

@@ -1731,6 +1729,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
zonelist, high_zoneidx,
alloc_flags, preferred_zone,
migratetype);
+
+ cond_resched();
+
return page;
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mel Gorman on 15 Mar 2010 08:40

On Fri, Mar 12, 2010 at 09:37:55AM -0500, Andrew Morton wrote:
> On Fri, 12 Mar 2010 13:15:05 +0100 Christian Ehrhardt <ehrhardt(a)linux.vnet.ibm.com> wrote:
>
> > > It still feels a bit unnatural though that the page allocator waits on
> > > congestion when what it really cares about is watermarks. Even if this
> > > patch works for Christian, I think it still has merit so will kick it a
> > > few more times.
> >
> > In whatever way I can look at it watermark_wait should be supperior to
> > congestion_wait. Because as Mel points out waiting for watermarks is
> > what is semantically correct there.
>
> If a direct-reclaimer waits for some thresholds to be achieved then what
> task is doing reclaim?
>
> Ultimately, kswapd.

Well, not quite. The direct reclaimer will still wake up after a timeout
and try again regardless of whether watermarks have been met or not. The
intention is to back after after direct reclaim has failed. Granted, the
window during which a direct reclaim finishes and an allocation attempt
occurs is unnecessarily large. This may be addressed by the patch that
changes where cond_resched() is called.

> This will introduce a hard dependency upon kswapd
> activity. This might introduce scalability problems. And latency
> problems if kswapd if off doodling with a slow device (say), or doing a
> journal commit. And perhaps deadlocks if kswapd tries to take a lock
> which one of the waiting-for-watermark direct relcaimers holds.
>

What lock could they be holding? Even if that is the case, the direct
reclaimers do not wait indefinitily.

> Generally, kswapd is an optional, best-effort latency optimisation
> thing and we haven't designed for it to be a critical service.
> Probably stuff would break were we to do so.
>

No disagreements there.

> This is one of the reasons why we avoided creating such dependencies in
> reclaim. Instead, what we do when a reclaimer is encountering lots of
> dirty or in-flight pages is
>
> msleep(100);
>
> then try again. We're waiting for the disks, not kswapd.
>
> Only the hard-wired 100 is a bit silly, so we made the "100" variable,
> inversely dependent upon the number of disks and their speed. If you
> have more and faster disks then you sleep for less time.
>
> And that's what congestion_wait() does, in a very simplistic fashion.
> It's a facility which direct-reclaimers use to ratelimit themselves in
> inverse proportion to the speed with which the system can retire writes.
>

The problem being hit is when a direct reclaimer goes to sleep waiting
on congestion when in reality there were not lots of dirty or in-flight
pages. It goes to sleep for the wrong reasons and doesn't get woken up
again until the timeout expires.

Bear in mind that even if congestion clears, it just means that dirty
pages are now clean although I admit that the next direct reclaim it
does is going to encounter clean pages and should succeed.

Lets see how the other patch that changes when cond_reched() gets called
gets on. If it also works out, then it's harder to justify this patch.
If it doesn't work out then it'll need to be kicked another few times.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mel Gorman on 16 Mar 2010 06:20

On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
> On Mon, 15 Mar 2010 13:34:50 +0100
> Christian Ehrhardt <ehrhardt(a)linux.vnet.ibm.com> wrote:
>
> > c) If direct reclaim did reasonable progress in try_to_free but did not
> > get a page, AND there is no write in flight at all then let it try again
> > to free up something.
> > This could be extended by some kind of max retry to avoid some weird
> > looping cases as well.
> >
> > d) Another way might be as easy as letting congestion_wait return
> > immediately if there are no outstanding writes - this would keep the
> > behavior for cases with write and avoid the "running always in full
> > timeout" issue without writes.
>
> They're pretty much equivalent and would work. But there are two
> things I still don't understand:
>

Unfortunately, this regression is very poorly understood. I haven't been able
to reproduce it locally and while Christian has provided various debugging
information, it still isn't clear why the problem occurs now.

> 1: Why is direct reclaim calling congestion_wait() at all? If no
> writes are going on there's lots of clean pagecache around so reclaim
> should trivially succeed. What's preventing it from doing so?
>

Memory pressure I think. The workload involves 16 processes (see
http://lkml.org/lkml/2009/12/7/237). I suspect they are all direct reclaimers
and some processes are getting their pages stolen before they have a
chance to allocate them. It's knowing that adding a small amount of
memory "fixes" this problem.

> 2: This is, I think, new behaviour. A regression. What caused it?
>

Short answer, I don't know.

Longer answer. Initially, this was reported as being caused by commit e084b2d:
page-allocator: preserve PFN ordering when __GFP_COLD is set but it was never
established why and reverting it was unpalatable because it fixed another
performance problem. According to Christian, the controller does nothing
with the merging of IO requests and he was very sure about this. As all the
patch does is change the order that pages are returned in and the timing
slightly due to differences in cache hotness, although the fact that such
a small change could make a big difference in reclaim later was surprising.
There were other bugs that might have complicated this such as errors in free
page counters but they were fixed up and the problem still did not go away.

It was after much debugging that it was found that direct reclaim was
returning, the subsequent allocation attempt failed and congestion_wait()
was called but without dirty pages, congestion or writes, it waits for
the full timeout. congestion_wait() was also being called a lot more
frequently so something was causing reclaim to fail more frequently
(http://lkml.org/lkml/2009/12/18/150). Again, I couldn't figure out why
e084b2d would make a difference.

Later, it got even worse because patches e084b2d and 5f8dcc21 had to be
reverted in 2.6.33 to "resolve" the problem. 5f8dcc21 was more plausible as it
affected how many pages were on the per-cpu lists but making it behave like
2.6.32 did not help the situation. Again, it looked like a very small timing
problem but it could not be isolated exactly why reclaim would fail. Again,
other bugs were found and fixed but made no difference.

What lead to this patch was recognising we could enter congestion_wait()
and wait the entire timeout because no writes were in progress or dirty
pages to be cleaned. As what was really of interest was watermarks in
this path, the patch intended to make the page allocator care about
watermarks instead of congestion. We know it was treating symptoms
rather than understanding the underlying problem but I was somewhat at a
loss to explain why small changes in timing made such a large
difference.

Any new insight is welcome.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mel Gorman on 18 Mar 2010 13:50

| Next | Last
Pages: 1 2
Prev: microblaze: Fix Makefile to delete build generated files
Next: [GIT PULL] MFD fixes for 2.6.34