From: Andrew Morton on
On Mon, 8 Mar 2010 11:48:20 +0000
Mel Gorman <mel(a)csn.ul.ie> wrote:

> Under memory pressure, the page allocator and kswapd can go to sleep using
> congestion_wait(). In two of these cases, it may not be the appropriate
> action as congestion may not be the problem.

clear_bdi_congested() is called each time a write completes and the
queue is below the congestion threshold.

So if the page allocator or kswapd call congestion_wait() against a
non-congested queue, they'll wake up on the very next write completion.

Hence the above-quoted claim seems to me to be a significant mis-analysis and
perhaps explains why the patchset didn't seem to help anything?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andrew Morton on
On Fri, 12 Mar 2010 07:39:26 +0100 Christian Ehrhardt <ehrhardt(a)linux.vnet.ibm.com> wrote:

>
>
> Andrew Morton wrote:
> > On Mon, 8 Mar 2010 11:48:20 +0000
> > Mel Gorman <mel(a)csn.ul.ie> wrote:
> >
> >> Under memory pressure, the page allocator and kswapd can go to sleep using
> >> congestion_wait(). In two of these cases, it may not be the appropriate
> >> action as congestion may not be the problem.
> >
> > clear_bdi_congested() is called each time a write completes and the
> > queue is below the congestion threshold.
> >
> > So if the page allocator or kswapd call congestion_wait() against a
> > non-congested queue, they'll wake up on the very next write completion.
>
> Well, the issue came up in all kinds of loads where you don't have any
> writes at all that could wake up congestion_wait().
> That's true for several benchmarks, but also for real workloads, e.g. a
> backup job reading almost all files sequentially and pumping the data
> out over the network.

Why is reclaim going into congestion_wait() at all if there's heaps of
clean reclaimable pagecache lying around?

(I don't think the read side of the congestion_wqh[] has ever been used, btw)

> > Hence the above-quoted claim seems to me to be a significant mis-analysis and
> > perhaps explains why the patchset didn't seem to help anything?
>
> While I might have misunderstood you and it is a mis-analysis in your
> opinion, it fixes a -80% throughput regression on sequential read
> workloads - that's not nothing, it's more like absolutely required :-)
>
> You might check out the discussion with the subject "Performance
> regression in scsi sequential throughput (iozone) due to "e084b -
> page-allocator: preserve PFN ordering when __GFP_COLD is set"".
> While the original subject is misleading from today's point of view, it
> contains a lengthy discussion about exactly when/why/where time is lost
> due to congestion wait with a lot of traces, counters, data attachments
> and such stuff.

Well if we're not encountering lots of dirty pages in reclaim then we
shouldn't be waiting for writes to retire, of course.

But if we're not encountering lots of dirty pages in reclaim, we should
be reclaiming pages, normally.

I could understand reclaim accidentally going into congestion_wait() if
it hit a large pile of pages which are unreclaimable for reasons other
than being dirty, but is that happening in this case?

If not, we broke it again.
From: Andrew Morton on
On Fri, 12 Mar 2010 13:15:05 +0100 Christian Ehrhardt <ehrhardt(a)linux.vnet.ibm.com> wrote:

> > It still feels a bit unnatural though that the page allocator waits on
> > congestion when what it really cares about is watermarks. Even if this
> > patch works for Christian, I think it still has merit so will kick it a
> > few more times.
>
> Whichever way I look at it, watermark_wait should be superior to
> congestion_wait, because, as Mel points out, waiting for watermarks is
> what is semantically correct there.

If a direct-reclaimer waits for some thresholds to be achieved then what
task is doing reclaim?

Ultimately, kswapd. This will introduce a hard dependency upon kswapd
activity. This might introduce scalability problems. And latency
problems if kswapd is off doodling with a slow device (say), or doing a
journal commit. And perhaps deadlocks if kswapd tries to take a lock
which one of the waiting-for-watermark direct reclaimers holds.

Generally, kswapd is an optional, best-effort latency optimisation
thing and we haven't designed for it to be a critical service.
Probably stuff would break were we to do so.


This is one of the reasons why we avoided creating such dependencies in
reclaim. Instead, what we do when a reclaimer is encountering lots of
dirty or in-flight pages is

msleep(100);

then try again. We're waiting for the disks, not kswapd.

Only the hard-wired 100 is a bit silly, so we made the "100" variable,
inversely dependent upon the number of disks and their speed. If you
have more and faster disks then you sleep for less time.

And that's what congestion_wait() does, in a very simplistic fashion.
It's a facility which direct-reclaimers use to ratelimit themselves in
inverse proportion to the speed with which the system can retire writes.
From: Andrew Morton on
On Mon, 15 Mar 2010 13:34:50 +0100
Christian Ehrhardt <ehrhardt(a)linux.vnet.ibm.com> wrote:

> c) If direct reclaim made reasonable progress in try_to_free but did not
> get a page, AND there is no write in flight at all, then let it try again
> to free something up.
> This could be extended with some kind of max-retry cap to avoid weird
> looping cases as well.
>
> d) Another way might be as easy as letting congestion_wait return
> immediately if there are no outstanding writes - this would keep the
> behavior for cases with writes and avoid the "always running into the
> full timeout" issue when there are none.

They're pretty much equivalent and would work. But there are two
things I still don't understand:

1: Why is direct reclaim calling congestion_wait() at all? If no
writes are going on there's lots of clean pagecache around so reclaim
should trivially succeed. What's preventing it from doing so?

2: This is, I think, new behaviour. A regression. What caused it?
