From: Mel Gorman on
Sorry for the long delay, I got side-tracked on other bugs.

This is a follow-on series from the series "Avoid overflowing of stack
during page reclaim". It eliminates writeback requiring a filesystem from
direct reclaim and follows on by reducing the amount of IO required from
page reclaim to mitigate any corner cases from the modification.

Changelog since V3
o Distinguish between file and anon related IO from page reclaim
o Allow anon writeback from reclaim context
o Sync old inodes first in background writeback
o Pre-emptively clean pages when dirty pages are encountered on the LRU
o Rebase to 2.6.35-rc5

Changelog since V2
o Add acks and reviewed-bys
o Do not lock multiple pages at the same time for writeback as it's unsafe
o Drop the clean_page_list function. It alters timing with very little
  benefit. Without the contiguous writing, it doesn't do much to simplify
  the subsequent patches either
o Throttle processes that encounter dirty pages in direct reclaim. Instead
  wakeup flusher threads to clean the number of pages encountered that were
  dirty

Changelog since V1
o Merge with series that reduces stack usage in page reclaim in general
o Allow memcg to writeback pages as they are not expected to overflow stack
o Drop the contiguous-write patch for the moment

There is a problem in the stack depth usage of page reclaim. Particularly
during direct reclaim, it is possible to overflow the stack if it calls into
the filesystem's writepage function. This patch series begins by preventing
writeback from direct reclaim and allowing btrfs and xfs to writeback from
kswapd context. As this is a potentially large change, the remainder of
the series aims to reduce any filesystem writeback from page reclaim and
depend more on background flush.

The first patch in the series is a roll-up of what should currently be
in mmotm. It's provided for convenience of testing.

Patches 2 and 3 note that it is important to distinguish between file and anon
page writeback from page reclaim as they use stack to different depths. They
update the trace points and scripts appropriately, noting which mmotm patch
they should be merged with.

Patch 4 prevents direct reclaim writing out filesystem pages while still
allowing writeback of anon pages, which are in less danger of stack overflow
and do not have something like background flush to clean them.
For filesystem pages, flusher threads are asked to clean the number of
pages encountered, the caller waits on congestion and puts the pages back
on the LRU. For lumpy reclaim, the caller will wait for a time calling the
flusher multiple times waiting on dirty pages to be written out before trying
to reclaim the dirty pages a second time. This increases the responsibility
of kswapd somewhat because it's now cleaning pages on behalf of direct
reclaimers but unlike background flushers, kswapd knows what zone pages
need to be cleaned from. As it is async IO, it should not cause kswapd to
stall (at least until the queue is congested) but the order that pages are
reclaimed on the LRU is altered. Dirty pages that would have been reclaimed
by direct reclaimers are getting another lap on the LRU. The dirty pages
could have been put on a dedicated list but this increased counter overhead
and the number of lists and it is unclear if it is necessary.
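
As a rough illustration of the policy described for patch 4, the following is
a toy Python model and not kernel code. The Page class, the function name and
the returned counters are all hypothetical; it only shows that dirty file
pages are skipped and put back while dirty anon pages may still be written.

```python
from dataclasses import dataclass


@dataclass
class Page:
    is_file: bool  # True for a filesystem-backed page, False for anon
    dirty: bool


def direct_reclaim_pass(pages):
    """Toy model of one direct reclaim pass (hypothetical, not kernel code).

    Returns counters describing what happened to the scanned pages.
    """
    stats = {"reclaimed": 0, "anon_written": 0,
             "put_back": 0, "flusher_request": 0}
    for page in pages:
        if not page.dirty:
            # Clean pages can be reclaimed directly.
            stats["reclaimed"] += 1
        elif page.is_file:
            # Dirty file page: do not call into the filesystem from
            # direct reclaim. Put it back on the LRU for another lap.
            stats["put_back"] += 1
            stats["flusher_request"] += 1
        else:
            # Dirty anon page: writeback from reclaim is still allowed.
            stats["anon_written"] += 1
    # flusher_request is the number of pages the flusher threads are
    # asked to clean, equal to the dirty file pages encountered.
    return stats
```

The model leaves out congestion waits and the lumpy reclaim retry, both of
which depend on kernel state that a sketch like this cannot represent.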

Patches 5 and 6 revert changes on XFS and btrfs that ignore writeback from
reclaim context, which is a relatively recent change. extX could be modified
to allow kswapd to writeback but it is a relatively deep change. There may
be some collision with items in the filesystem git trees but it is expected
to be trivial to resolve.

Patch 7 makes background flush behave more like kupdate by syncing old or
expired inodes first as implemented by Wu Fengguang. As filesystem pages are
added onto the inactive queue and only promoted if referenced, it makes sense
to write old pages first to reduce the chances of page reclaim initiating IO.
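
The ordering idea in patch 7 can be sketched in a few lines of Python. This
is only an illustration: the function name, the (name, dirtied_when) tuples
and the fallback to all dirty inodes when none have expired are assumptions
made for the example, not a description of the actual patch.

```python
def pick_inodes_to_flush(dirty_inodes, now, expire_after):
    """Return inode names in the order a background flush might visit them.

    dirty_inodes: list of (name, dirtied_when) tuples (hypothetical layout).
    Expired inodes are returned oldest first. If none have expired, fall
    back to every dirty inode, still oldest first (an assumption).
    """
    cutoff = now - expire_after
    oldest_first = sorted(dirty_inodes, key=lambda item: item[1])
    expired = [name for name, when in oldest_first if when <= cutoff]
    if expired:
        return expired
    return [name for name, _ in oldest_first]
```

For example, with inodes dirtied at times 100, 10 and 50, now=120 and
expire_after=30, only the inodes dirtied at 10 and 50 are returned, in that
order.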

Patch 8 notes that dirty pages can still be found at the end of the LRU.
If a number of them are encountered, it's reasonable to assume that a similar
number of dirty pages will be discovered in the very near future as that was
the dirtying pattern at the time. The patch pre-emptively kicks the background
flusher to clean a number of pages, creating feedback from page reclaim to
the background flusher that is based on scanning rates. Christoph has described
discussions on this patch as a "band-aid" but Rik liked the idea and the
patch does have interesting results so is worth a closer look.
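
The feedback loop described for patch 8 might be modelled as below. This is a
hedged Python sketch, not the kernel implementation: scan_lru_tail and the
wake_flusher callback are hypothetical names, and the sketch assumes the
request is exactly the number of dirty pages seen, which the real patch may
not do.

```python
def scan_lru_tail(dirty_flags, wake_flusher):
    """Count dirty pages seen in one scan and ask the flusher to clean
    a similar number (toy model with hypothetical names).

    dirty_flags: iterable of bools, True where the scanned page is dirty.
    wake_flusher: callable taking the number of pages to clean.
    """
    nr_dirty = sum(1 for is_dirty in dirty_flags if is_dirty)
    if nr_dirty:
        # Dirty pages at the end of the LRU suggest more will follow,
        # so request cleaning ahead of the next scan.
        wake_flusher(nr_dirty)
    return nr_dirty
```

Because the request size follows the number of dirty pages encountered, the
amount of work handed to the flusher rises and falls with what reclaim finds
while scanning.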

I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each machine
had 3G of RAM and the CPUs were

X86: Intel P4 2 core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP

Each used a single disk and the onboard IO controller. Dirty ratio was left
at 20. Tests on an earlier series indicated that moving to 40 did not make
much difference. The filesystem used for all tests was XFS.

Four kernels are compared.

traceonly-v4r7 is the first 3 patches of this series
nodirect-v4r7 is the first 6 patches
flusholdest-v4r7 makes background flush behave like kupdated (patch 1-7)
flushforward-v4r7 pre-emptively cleans pages when encountered on the LRU (patch 1-8)

The results of each test are broken up into two parts. The first part is a
report based on the ftrace postprocessing script in patch 4 and reports on
direct reclaim and kswapd activity. The second part reports what percentage
of time was spent in direct reclaim and with kswapd awake.

To work out the percentage of time spent in direct reclaim, I used
/usr/bin/time to get the User + Sys CPU time. The stalled time was taken
from the post-processing script. The total time is (User + Sys + Stall)
and obviously the percentage is of stalled over total time.
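
The calculation above can be written out as a short Python sketch. The
function and argument names are hypothetical; the inputs are the User + Sys
CPU time in seconds and the stall time in milliseconds as reported by the
post-processing script.

```python
def direct_reclaim_percentage(user_sys_seconds, stall_ms):
    """Percentage of (User + Sys + Stall) time spent stalled in direct reclaim."""
    stall_seconds = stall_ms / 1000.0
    total = user_sys_seconds + stall_seconds
    if total == 0:
        return 0.0
    return 100.0 * stall_seconds / total
```

With illustrative inputs of 90 seconds of User + Sys time and 10000 ms of
stall, the total is 100 seconds and the result is 10%.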

I am omitting the actual performance results simply because they are not
interesting with very few significant changes.

kernbench
=========

No writeback from reclaim initiated and no performance change of significance.

IOzone
======

No writeback from reclaim initiated and no performance change of significance.


SysBench
========

The results were based on a read/write test and, as the machine is
under-provisioned for this type of test, the figures are very unstable, with
variances up to 15%, so they are not reported. Part of the problem is that
larger thread counts push the test into swap as the memory is insufficient,
which destabilises results further. I could tune for this, but it was reclaim
that was important.

X86
                                          traceonly-v4r7     nodirect-v4r7  flusholdest-v4r7 flushforward-v4r7
Direct reclaims                                       18                25                 6               196
Direct reclaim pages scanned                        1615              1662               605             22233
Direct reclaim write file async I/O                   40                 0                 0                 0
Direct reclaim write anon async I/O                    0                 0                13                 9
Direct reclaim write file sync I/O                     0                 0                 0                 0
Direct reclaim write anon sync I/O                     0                 0                 0                 0
Wake kswapd requests                              171039            401450            313156             90960
Kswapd wakeups                                       685               532               611               262
Kswapd pages scanned                            14272338          12209663          13799001           5230124
Kswapd reclaim write file async I/O               581811             23047             23795               759
Kswapd reclaim write anon async I/O               189590            124947            114948             42906
Kswapd reclaim write file sync I/O                     0                 0                 0                 0
Kswapd reclaim write anon sync I/O                     0                 0                 0                 0
Time stalled direct reclaim (ms)                    0.00              0.91              0.92              1.31
Time kswapd awake (ms)                           1079.32           1039.42           1194.82           1091.06

User/Sys Time Running Test (seconds)             1312.24           1241.37           1308.16           1253.15
Percentage Time Spent Direct Reclaim               0.00%             0.00%             0.00%             0.00%
Total Elapsed Time (seconds)                     8411.28           7471.15           8292.18           8170.16
Percentage Time kswapd Awake                       3.45%             0.00%             0.00%             0.00%

Dirty file pages from X86 were not much of a problem to begin with and the
patches eliminate them as expected. What is interesting is nodirect-v4r7
made such a large difference to the amount of filesystem pages that had
to be written back. Apparently, background flush must have been doing a
better job getting them cleaned in time and the direct reclaim stalls are
harmful overall. Waking background threads for dirty pages made a very large
difference to the number of pages written back. With all patches applied,
just 759 filesystem pages were written back in comparison to 581811 in the
vanilla kernel and overall the number of pages scanned was reduced.

X86-64
                                          traceonly-v4r7     nodirect-v4r7  flusholdest-v4r7 flushforward-v4r7
Direct reclaims                                      795              1662              2131              6459
Direct reclaim pages scanned                      204900            127300            291647            317035
Direct reclaim write file async I/O                53763                 0                 0                 0
Direct reclaim write anon async I/O                 1256               730              6114                20
Direct reclaim write file sync I/O                    10                 0                 0                 0
Direct reclaim write anon sync I/O                     0                 0                 0                 0
Wake kswapd requests                              690850           1457411           1713379           1648469
Kswapd wakeups                                      1683              1353              1275              1171
Kswapd pages scanned                            17976327          15711169          16501926          12634291
Kswapd reclaim write file async I/O               818222             26560             42081              6311
Kswapd reclaim write anon async I/O               245442            218708            209703            205254
Kswapd reclaim write file sync I/O                     0                 0                 0                 0
Kswapd reclaim write anon sync I/O                     0                 0                 0                 0
Time stalled direct reclaim (ms)                   13.50             41.19             69.56             51.32
Time kswapd awake (ms)                           2243.53           2515.34           2767.58           2607.94

User/Sys Time Running Test (seconds)              687.69            650.83            653.28            640.38
Percentage Time Spent Direct Reclaim               0.01%             0.00%             0.00%             0.00%
Total Elapsed Time (seconds)                     6954.05           6472.68           6508.28           6211.11
Percentage Time kswapd Awake                       0.04%             0.00%             0.00%             0.00%

Direct reclaim of filesystem pages is eliminated as expected. Again, the
overall number of pages that need to be written back by page reclaim is
reduced. Flushing just the oldest inode was not much of a help in terms
of how many pages needed to be written back from reclaim but pre-emptively
waking flusher threads helped a lot.

Oddly, more time was spent in direct reclaim with the patches as a greater
number of anon pages needed to be written back. It's possible this was
due to the test making more forward progress as indicated by the shorter
running time.

PPC64
                                          traceonly-v4r7     nodirect-v4r7  flusholdest-v4r7 flushforward-v4r7
Direct reclaims                                     1517             34527             32365             51973
Direct reclaim pages scanned                      144496           2041199           1950282           3137493
Direct reclaim write file async I/O                28147                 0                 0                 0
Direct reclaim write anon async I/O                  463             25258             10894                 0
Direct reclaim write file sync I/O                     7                 0                 0                 0
Direct reclaim write anon sync I/O                     0                 1                 0                 0
Wake kswapd requests                             1126060           6578275           6281512           6649558
Kswapd wakeups                                       591               262               229               247
Kswapd pages scanned                            16522849          12277885          11076027           7614475
Kswapd reclaim write file async I/O              1302640             50301             43308              8658
Kswapd reclaim write anon async I/O               150876            146600            159229            134919
Kswapd reclaim write file sync I/O                     0                 0                 0                 0
Kswapd reclaim write anon sync I/O                     0                 0                 0                 0
Time stalled direct reclaim (ms)                   32.28            481.52            535.15            342.97
Time kswapd awake (ms)                           1694.00           4789.76           4426.42           4309.49

User/Sys Time Running Test (seconds)             1294.96            1264.5           1254.92           1216.92
Percentage Time Spent Direct Reclaim               0.03%             0.00%             0.00%             0.00%
Total Elapsed Time (seconds)                     8876.80           8446.49           7644.95           7519.83
Percentage Time kswapd Awake                       0.05%             0.00%             0.00%             0.00%

Direct reclaim filesystem writes are eliminated but the scan rates went way
up. It implies that direct reclaim was spinning quite a bit and finding
clean pages allowing the test to complete 22 minutes faster. Flushing
oldest inodes helped but pre-emptively waking background flushers helped
more in terms of the number of pages cleaned by page reclaim.

Stress HighAlloc
================

This test builds a large number of kernels simultaneously so that the total
workload is 1.5 times the size of RAM. It then attempts to allocate all of
RAM as huge pages. The metric is the percentage of memory allocated under
load (Pass 1), a second attempt under load (Pass 2) and when the kernel
compiles have finished and the system is quiet (At Rest). The patches have
little impact on the success rates.

X86
                                          traceonly-v4r7     nodirect-v4r7  flusholdest-v4r7 flushforward-v4r7
Direct reclaims                                      623               607               611               491
Direct reclaim pages scanned                      126515            117477            142502             91649
Direct reclaim write file async I/O                  896                 0                 0                 0
Direct reclaim write anon async I/O                35286             27508             35688             24819
Direct reclaim write file sync I/O                   580                 0                 0                 0
Direct reclaim write anon sync I/O                 13932             12301             15203             11509
Wake kswapd requests                                1561              1650              1618              1152
Kswapd wakeups                                       183               209               211                79
Kswapd pages scanned                             9391908           9144543          11418802           6959545
Kswapd reclaim write file async I/O                92730              7073              8215               807
Kswapd reclaim write anon async I/O               946499            831573           1164240            833063
Kswapd reclaim write file sync I/O                     0                 0                 0                 0
Kswapd reclaim write anon sync I/O                     0                 0                 0                 0
Time stalled direct reclaim (ms)                 4653.17           4193.28           5292.97           6954.96
Time kswapd awake (ms)                           4618.67           3787.74           4856.45          55704.90

User/Sys Time Running Test (seconds)             2103.48           2161.14              2131           2160.01
Percentage Time Spent Direct Reclaim               0.33%             0.00%             0.00%             0.00%
Total Elapsed Time (seconds)                     6996.43           6405.43           7584.74           8904.53
Percentage Time kswapd Awake                       0.80%             0.00%             0.00%             0.00%

Total time running the test was increased unfortunately but this was
the only instance where it occurred. Similar story as elsewhere otherwise -
filesystem direct writes are eliminated and overall filesystem writes from
page reclaim are significantly reduced to almost negligible levels (0.01%
of pages scanned by kswapd resulted in a filesystem write for the full
series in comparison to 0.99% in the vanilla kernel).
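
The two percentages quoted above can be reproduced from the X86 table. The
small Python sketch below assumes "filesystem write" means the "Kswapd
reclaim write file async I/O" row and that traceonly-v4r7 is the kernel
referred to as vanilla; the function name is hypothetical.

```python
def file_write_ratio(file_writes, pages_scanned):
    """Percentage of scanned pages that resulted in a filesystem write."""
    return 100.0 * file_writes / pages_scanned


# Values taken from the X86 Stress HighAlloc table.
vanilla = file_write_ratio(92730, 9391908)      # traceonly-v4r7
full_series = file_write_ratio(807, 6959545)    # flushforward-v4r7
print(f"{vanilla:.2f}% {full_series:.2f}%")     # prints "0.99% 0.01%"
```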

X86-64
                                          traceonly-v4r7     nodirect-v4r7  flusholdest-v4r7 flushforward-v4r7
Direct reclaims                                     1275              1300              1222              1224
Direct reclaim pages scanned                      156940            152253            148993            148726
Direct reclaim write file async I/O                 2472                 0                 0                 0
Direct reclaim write anon async I/O                29281             26887             28073             26283
Direct reclaim write file sync I/O                  1943                 0                 0                 0
Direct reclaim write anon sync I/O                 11777              9258             10256              8510
Wake kswapd requests                                4865             12895              1185              1176
Kswapd wakeups                                       869               757               789               822
Kswapd pages scanned                            41664053          30419872          29602438          42603986
Kswapd reclaim write file async I/O               550544             16092             12775              4414
Kswapd reclaim write anon async I/O              2409931           1964446           1779486           1667076
Kswapd reclaim write file sync I/O                     0                 0                 0                 0
Kswapd reclaim write anon sync I/O                     0                 0                 0                 0
Time stalled direct reclaim (ms)                 8908.93           7920.53           6192.17           5926.47
Time kswapd awake (ms)                           6045.11           5486.48           3945.35           3367.01

User/Sys Time Running Test (seconds)             2813.44           2818.17            2801.8           2803.61
Percentage Time Spent Direct Reclaim               0.21%             0.00%             0.00%             0.00%
Total Elapsed Time (seconds)                    11217.45          10286.90           8534.22           8332.84
Percentage Time kswapd Awake                       0.03%             0.00%             0.00%             0.00%

Unlike X86, total time spent on the test was significantly reduced and like
elsewhere, filesystem IO due to reclaim is way down.

PPC64
                                          traceonly-v4r7     nodirect-v4r7  flusholdest-v4r7 flushforward-v4r7
Direct reclaims                                      665               709               652               663
Direct reclaim pages scanned                      145630            125161            116556            124718
Direct reclaim write file async I/O                  946                 0                 0                 0
Direct reclaim write anon async I/O                26983             23160             28531             23360
Direct reclaim write file sync I/O                   596                 0                 0                 0
Direct reclaim write anon sync I/O                 17517             13635             16114             13121
Wake kswapd requests                                 271               302               299               278
Kswapd wakeups                                       181               164               158               172
Kswapd pages scanned                            68789711          68058349          54613548          64905996
Kswapd reclaim write file async I/O               159196             20569             17538              2475
Kswapd reclaim write anon async I/O              2311178           1962398           1811115           1829023
Kswapd reclaim write file sync I/O                     0                 0                 0                 0
Kswapd reclaim write anon sync I/O                     0                 0                 0                 0
Time stalled direct reclaim (ms)                13784.95          12895.39          11132.26          11785.26
Time kswapd awake (ms)                          13331.51          12603.74          10956.18          11479.22

User/Sys Time Running Test (seconds)             3567.03           2730.23           2682.86           2668.08
Percentage Time Spent Direct Reclaim               0.33%             0.00%             0.00%             0.00%
Total Elapsed Time (seconds)                    15282.74          14347.67          12614.61          13386.85
Percentage Time kswapd Awake                       0.08%             0.00%             0.00%             0.00%

Similar story, the test completed faster and page reclaim IO is down.

Overall, the patches seem to help. Reclaim activity is reduced while test
times are generally improved. A big concern with V3 was that direct reclaim
not being able to write pages could lead to unexpected behaviour. This
series mitigates that risk by reducing the amount of IO initiated by page
reclaim making it a rarer event.

Mel Gorman (7):
  MMOTM MARKER
  vmscan: tracing: Update trace event to track if page reclaim IO is
    for anon or file pages
  vmscan: tracing: Update post-processing script to distinguish between
    anon and file IO from page reclaim
  vmscan: Do not writeback filesystem pages in direct reclaim
  fs,btrfs: Allow kswapd to writeback pages
  fs,xfs: Allow kswapd to writeback pages
  vmscan: Kick flusher threads to clean pages when reclaim is
    encountering dirty pages

Wu Fengguang (1):
  writeback: sync old inodes first in background writeback

 .../trace/postprocess/trace-vmscan-postprocess.pl |   89 +++++++++-----
 Makefile                                          |    2 +-
 fs/btrfs/disk-io.c                                |   21 +----
 fs/btrfs/inode.c                                  |    6 -
 fs/fs-writeback.c                                 |   19 +++-
 fs/xfs/linux-2.6/xfs_aops.c                       |   15 ---
 include/trace/events/vmscan.h                     |    8 +-
 mm/vmscan.c                                       |  121 ++++++++++++++++++-
 8 files changed, 195 insertions(+), 86 deletions(-)
