From: Mel Gorman on
On Fri, Jun 11, 2010 at 12:29:12PM -0400, Christoph Hellwig wrote:
> On Tue, Jun 08, 2010 at 10:28:14AM +0100, Mel Gorman wrote:
> > > - we also need to care about ->releasepage. At least for XFS it
> > > can end up in the same deep allocator chain as ->writepage because
> > > it does all the extent state conversions, even if it doesn't
> > > start I/O.
> >
> > Dang.
> >
> > > I haven't managed yet to decode the ext4/btrfs codepaths
> > > for ->releasepage yet to figure out how they release a page that
> > > covers a delayed allocated or unwritten range.
> > >
> >
> > If ext4/btrfs are also very deep call-chains and this series is going more
> > or less the right direction, then avoiding calling ->releasepage from direct
> > reclaim is one, somewhat unfortunate, option. The second is to avoid it on
> > a per-filesystem basis for direct reclaim using PF_MEMALLOC to detect
> > reclaimers and PF_KSWAPD to tell the difference between direct
> > reclaimers and kswapd.
>
> I went through this a bit more and I can't actually hit that code in
> XFS ->releasepage anymore. I've also audited the caller and can't see
> how we could theoretically hit it anymore. Do the VM gurus know a case
> where we would call ->releasepage on a page that's actually dirty and
> hasn't been through block_invalidatepage before?
>

Not a clue I'm afraid as I haven't dealt much with the interactions
between VM and FS in the past. Nick?
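
For reference, the kind of per-filesystem check I was suggesting above would
be something along these lines. It is an illustrative sketch only, not taken
from any tree, and example_releasepage() is a made-up name:

#include <linux/mm.h>
#include <linux/sched.h>                /* current, PF_MEMALLOC, PF_KSWAPD */
#include <linux/buffer_head.h>          /* try_to_free_buffers() */

/*
 * Illustrative sketch only: a ->releasepage that refuses the deep
 * conversion work for direct reclaimers (PF_MEMALLOC set, PF_KSWAPD
 * clear) while still doing it for kswapd, which runs on a shallow stack.
 */
static int example_releasepage(struct page *page, gfp_t gfp_mask)
{
        if ((current->flags & PF_MEMALLOC) && !(current->flags & PF_KSWAPD)) {
                /* direct reclaim: the stack may already be deep, punt */
                return 0;
        }

        /* ... filesystem-specific conversion work would go here ... */
        return try_to_free_buffers(page);
}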

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
From: Mel Gorman on
On Tue, Jun 15, 2010 at 04:00:11PM +0200, Andrea Arcangeli wrote:
> Hi Mel,
>
> I know lots of people doesn't like direct reclaim,

It's not direct reclaim that is the problem per se, it's direct reclaim
calling ->writepage and splicing two potentially deep call chains
together.
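
To make that concrete, the distinction the series leans on can be sketched
as a trivial helper. The name is made up and this is not the actual patch;
it only captures the idea that kswapd enters reclaim from an almost empty
stack while a direct reclaimer inherits whatever stack the allocating task
has already consumed:

#include <linux/swap.h>                 /* current_is_kswapd() */

/*
 * Made-up helper name, sketch only: whether this reclaim context should
 * be allowed to hand a dirty page to ->writepage.
 */
static bool reclaim_may_writepage(void)
{
        return current_is_kswapd();
}

Something like this would be consulted in the shrink_page_list() dirty-page
handling before calling pageout().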

> but I personally do
> and I think if memory pressure is hard enough we should eventually
> enter direct reclaim full force including ->writepage to avoid false
> positive OOM failures.

Be that as it may, filesystems with deep call paths in their ->writepage
are ignoring both kswapd and direct reclaim, so on XFS and btrfs, for
example, this "full force" effect is never reached.

> Transparent hugepage allocation in fact won't
> even wakeup kswapd that would be insist to create hugepages and shrink
> an excessive amount of memory (especially before memory compaction was
> merged, it shall be tried again but if memory compaction fails in
> kswapd context, definitely kswapd should immediately stop and not go
> ahead trying the create hugepages the blind way, kswapd
> order-awareness the blind way is surely detrimental and pointless).
>

kswapd does end up freeing a lot of memory in response to lumpy reclaim
because it also tries to restore the watermarks for a high-order page. This
is disruptive to the system and something I'm going to revisit, but it's
a separate topic for another discussion. I can see why transparent
hugepage support would not want this disruptive effect to occur, whereas
it might make sense when resizing the hugepage pool.

> When memory pressure is low, not going into ->writepage may be
> beneficial from latency prospective too. (but again it depends how
> much it matters to go in LRU and how beneficial is the cache, to know
> if it's worth taking clean cache away even if hotter than dirty cache)
>
> About the stack overflow did you ever got any stack-debug error?

Not an error. I got a report from Dave Chinner though and it's what kicked
off this whole routine in the first place. I've been recording stack
usage figures but not reporting them. In reclaim I'm getting to about 5K
deep, but this was on simple storage and XFS was ignoring attempts by
reclaim to write back.

http://lkml.org/lkml/2010/4/13/121

Here is one of my own stack traces though:

Depth Size Location (49 entries)
----- ---- --------
0) 5064 304 get_page_from_freelist+0x2e4/0x722
1) 4760 240 __alloc_pages_nodemask+0x15f/0x6a7
2) 4520 48 kmem_getpages+0x61/0x12c
3) 4472 96 cache_grow+0xca/0x272
4) 4376 80 cache_alloc_refill+0x1d4/0x226
5) 4296 64 kmem_cache_alloc+0x129/0x1bc
6) 4232 16 mempool_alloc_slab+0x16/0x18
7) 4216 144 mempool_alloc+0x56/0x104
8) 4072 16 scsi_sg_alloc+0x48/0x4a [scsi_mod]
9) 4056 96 __sg_alloc_table+0x58/0xf8
10) 3960 32 scsi_init_sgtable+0x37/0x8f [scsi_mod]
11) 3928 32 scsi_init_io+0x24/0xce [scsi_mod]
12) 3896 48 scsi_setup_fs_cmnd+0xbc/0xc4 [scsi_mod]
13) 3848 144 sd_prep_fn+0x1d3/0xc13 [sd_mod]
14) 3704 64 blk_peek_request+0xe2/0x1a6
15) 3640 96 scsi_request_fn+0x87/0x522 [scsi_mod]
16) 3544 32 __blk_run_queue+0x88/0x14b
17) 3512 48 elv_insert+0xb7/0x254
18) 3464 48 __elv_add_request+0x9f/0xa7
19) 3416 128 __make_request+0x3f4/0x476
20) 3288 192 generic_make_request+0x332/0x3a4
21) 3096 64 submit_bio+0xc4/0xcd
22) 3032 80 _xfs_buf_ioapply+0x222/0x252 [xfs]
23) 2952 48 xfs_buf_iorequest+0x84/0xa1 [xfs]
24) 2904 32 xlog_bdstrat+0x47/0x4d [xfs]
25) 2872 64 xlog_sync+0x21a/0x329 [xfs]
26) 2808 48 xlog_state_release_iclog+0x9b/0xa8 [xfs]
27) 2760 176 xlog_write+0x356/0x506 [xfs]
28) 2584 96 xfs_log_write+0x5a/0x86 [xfs]
29) 2488 368 xfs_trans_commit_iclog+0x165/0x2c3 [xfs]
30) 2120 80 _xfs_trans_commit+0xd8/0x20d [xfs]
31) 2040 240 xfs_iomap_write_allocate+0x247/0x336 [xfs]
32) 1800 144 xfs_iomap+0x31a/0x345 [xfs]
33) 1656 48 xfs_map_blocks+0x3c/0x40 [xfs]
34) 1608 256 xfs_page_state_convert+0x2c4/0x597 [xfs]
35) 1352 64 xfs_vm_writepage+0xf5/0x12f [xfs]
36) 1288 32 __writepage+0x17/0x34
37) 1256 288 write_cache_pages+0x1f3/0x2f8
38) 968 16 generic_writepages+0x24/0x2a
39) 952 64 xfs_vm_writepages+0x4f/0x5c [xfs]
40) 888 16 do_writepages+0x21/0x2a
41) 872 48 writeback_single_inode+0xd8/0x2f4
42) 824 112 writeback_inodes_wb+0x41a/0x51e
43) 712 176 wb_writeback+0x13d/0x1b7
44) 536 128 wb_do_writeback+0x150/0x167
45) 408 80 bdi_writeback_task+0x43/0x117
46) 328 48 bdi_start_fn+0x76/0xd5
47) 280 96 kthread+0x82/0x8a
48) 184 184 kernel_thread_helper+0x4/0x10

XFS, as you can see, is quite deep there. Now consider if
get_page_from_freelist() there had entered direct reclaim and then tried
to write back a page. That's the problem being worried about.


> We've
> plenty of instrumentation and ->writepage definitely runs with irq
> enable, so if there's any issue, it can't possibly be unnoticed. The
> worry about stack overflow shall be backed by numbers.
>
> You posted lots of latency numbers (surely latency will improve but
> it's only safe approach on light memory pressure, on heavy pressure
> it'll early-oom not to call ->writepage, and if cache is very
> important and system has little ram, not going in lru order may also
> screw fs-cache performance),

I also haven't been able to trigger a new OOM as a result of the patch
but maybe I'm missing something. To trigger an OOM, the bulk of the LRU
would have to be dirty and the direct reclaimer making no further
progress but if the bulk of the LRU has been dirtied like this, are we
not already in trouble?

We could have it that direct reclaimers kick the flusher threads when they
encounter dirty pages and then go to sleep, but this will increase latency,
and considering the number of dirty pages direct reclaimers should be
seeing, I'm not sure it's necessary.

> but I didn't see any max-stack usage hard
> numbers, to back the claim that we're going to overflow.
>

I hadn't posted them because they had been posted previously and I
didn't think they were that interesting, as the stack usage itself wasn't
being disputed.

> In any case I'd prefer to be able to still call ->writepage if memory
> pressure is high (at some point when priority going down and
> collecting clean cache doesn't still satisfy the allocation),

Well, kswapd is still writing pages if the pressure is high enough that
the flusher threads are not doing it and a direct reclaimer will wait on
congestion_wait() if the pressure gets high enough (PRIORITY < 2).

> during
> allocations in direct reclaim and increase the THREAD_SIZE than doing
> this purely for stack reasons as the VM will lose reliability if we
> forbid ->writepage at all in direct reclaim.

Well, we've lost that particular reliability already on btrfs and xfs
because they are ignoring the VM, and increasing THREAD_SIZE would
increase the order used for stack allocations, which causes problems of
its own.

The VM would lose a lot of reliability if we weren't throttling tasks that
dirty pages in the fault path, but because we are doing that, I don't
currently believe we are losing reliability by not writing back pages in
direct reclaim.
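
The throttling I mean is the balance_dirty_pages() path. As a rough
illustration only (the helper name is made up and this is not the actual
fault-path code), a task that dirties a file-backed page through a write
fault ends up rate-limited against the dirty thresholds:

#include <linux/mm.h>
#include <linux/writeback.h>            /* balance_dirty_pages_ratelimited() */

/*
 * Rough illustration: the faulting task pays for the page it dirtied, so
 * dirtiers cannot outrun the flusher threads and fill the LRU with dirty
 * pages behind reclaim's back.
 */
static void throttle_fault_dirtier(struct page *page)
{
        struct address_space *mapping = page_mapping(page);

        set_page_dirty(page);
        if (mapping)
                balance_dirty_pages_ratelimited(mapping);
}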

> Throttling on kswapd is
> possible but it's probably less efficient and on the stack we know
> exactly which kind of memory we should allocate, kswapd doesn't and it
> works global.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
From: Mel Gorman on
On Wed, Jun 16, 2010 at 01:08:00AM +1000, Nick Piggin wrote:
> On Tue, Jun 15, 2010 at 03:51:34PM +0100, Mel Gorman wrote:
> > On Tue, Jun 15, 2010 at 04:00:11PM +0200, Andrea Arcangeli wrote:
> > > When memory pressure is low, not going into ->writepage may be
> > > beneficial from latency prospective too. (but again it depends how
> > > much it matters to go in LRU and how beneficial is the cache, to know
> > > if it's worth taking clean cache away even if hotter than dirty cache)
> > >
> > > About the stack overflow did you ever got any stack-debug error?
> >
> > Not an error. Got a report from Dave Chinner though and it's what kicked
> > off this whole routine in the first place. I've been recording stack
> > usage figures but not reporting them. In reclaim I'm getting to about 5K
> > deep but this was on simple storage and XFS was ignoring attempts for
> > reclaim to writeback.
> >
> > http://lkml.org/lkml/2010/4/13/121
> >
> > Here is one of my own stack traces though:
> >
> > [49-entry stack trace snipped; it appears in full earlier in the thread]
> >
> > XFS, as you can see, is quite deep there. Now consider if
> > get_page_from_freelist() there had entered direct reclaim and then tried
> > to write back a page. That's the problem being worried about.
>
> It would be a problem because it should be !__GFP_IO at that point so
> something would be seriously broken if it called ->writepage again.
>

True, ignore this as Christoph's example makes more sense.
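
For completeness, the gating Nick is pointing at is roughly the following,
paraphrased from the shrink_page_list() dirty-page handling rather than
copied from it (the helper name is made up):

#include <linux/mm.h>

/*
 * Paraphrased sketch: reclaim only hands a dirty page to ->writepage when
 * the allocation that triggered reclaim allows filesystem activity (or,
 * for swapcache pages, block I/O).  A nested allocation made from inside
 * ->writepage is GFP_NOFS/GFP_NOIO, so it can never recurse back into
 * ->writepage.
 */
static bool reclaim_may_enter_fs(struct page *page, gfp_t gfp_mask)
{
        return (gfp_mask & __GFP_FS) ||
               (PageSwapCache(page) && (gfp_mask & __GFP_IO));
}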

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
From: Mel Gorman on
On Tue, Jun 15, 2010 at 06:14:19PM +0200, Andrea Arcangeli wrote:
> On Tue, Jun 15, 2010 at 04:38:38PM +0100, Mel Gorman wrote:
> > That is pretty much what Dave is claiming here at
> > http://lkml.org/lkml/2010/4/13/121 where if mempool_alloc_slab() needed
>
> This stack trace shows writepage called by shrink_page_list... that
> contradict Christoph's claim that xfs already won't writepage if
> invoked by direct reclaim.
>

See this

STATIC int
xfs_vm_writepage(
        struct page             *page,
        struct writeback_control *wbc)
{
        int                     error;
        int                     need_trans;
        int                     delalloc, unmapped, unwritten;
        struct inode            *inode = page->mapping->host;

        trace_xfs_writepage(inode, page, 0);

        /*
         * Refuse to write the page out if we are called from reclaim
         * context.
         *
         * This is primarily to avoid stack overflows when called from deep
         * used stacks in random callers for direct reclaim, but disabling
         * reclaim for kswap is a nice side-effect as kswapd causes rather
         * suboptimal I/O patters, too.
         *
         * This should really be done by the core VM, but until that happens
         * filesystems like XFS, btrfs and ext4 have to take care of this
         * by themselves.
         */
        if (current->flags & PF_MEMALLOC)
                goto out_fail;


> > to allocate a page and writepage was entered, there would have been
> > a problem.
>
> There can't be a problem if a page wasn't available in mempool because
> we can't nest two writepage on top of the other or it'd deadlock on fs
> locks and this is the reason of GFP_NOFS, like noticed in the email.
>

Indeed, this is another case where we wouldn't have blown the stack, just
come dangerously close. As Dave pointed out, we might have been in trouble
if the storage had been more complicated, but there isn't specific proof,
just a lot of strong evidence.

My 5K example is poor, I'll admit, but the storage is also a bit simple:
just one disk, no md, networking or anything else. This is why the data
I showed focused on how many dirty pages were being encountered during LRU
scanning, on stalls and the like, rather than on the stack usage itself.

> Surely this shows the writepage going very close to the stack
> size... probably not enough to trigger the stack detector but close
> enough to worry! Agreed.
>
> I think we just need to switch stack on do_try_to_free_pages to solve
> it, and not just writepage or the filesystems.
>

Again, we're missing the code to do it, and I'm missing data showing that
not writing pages in direct reclaim is really a bad idea.

> > Broken or not, it's what some of them are doing to avoid stack
> > overflows. Worse, they are ignoring both kswapd and direct reclaim when they
> > only really needed to ignore direct reclaim. With this series at least, the
> > check for PF_MEMALLOC in ->writepage can be removed
>
> I don't get how we end up in xfs_buf_ioapply above though if xfs
> writepage is a noop on PF_MEMALLOC. Definitely PF_MEMALLOC is set
> before try_to_free_pages but in the above trace writepage still runs
> and submit the I/O.
>
> > This series would at least allow kswapd to turn dirty pages into clean
> > ones so it's an improvement.
>
> Not saying it's not an improvement, but still it's not necessarily the
> right direction.
>
> > Other than a lack of code to do it :/
>
> ;)
>
> > If you really feel strongly about this, you could follow on the series
> > by extending clean_page_list() to switch stack if !kswapd.
> >
> > This has actually been the case for a while. I vaguely recall FS people
>
> Again not what looks like from the stack trace. Also grepping for
> PF_MEMALLOC in fs/xfs shows nothing.

fs/xfs/linux-2.6/xfs_aops.c

> In fact it's ext4_write_inode
> that skips the write if PF_MEMALLOC is set, not writepage apparently
> (only did a quick grep so I might be wrong). I suspect
> ext4_write_inode is the case I just mentioned about slab shrink, not
> ->writepage ;).
>

After grepping through fs/, it was only xfs and btrfs that I saw
specifically disabling writepage from reclaim context.

> inodes are small, it's no big deal to keep an inode pinned and not
> slab-reclaimable because dirty, while skipping real writepage in
> memory pressure could really open a regression in oom false positives!
> One pagecache much bigger than one inode and there can be plenty more
> dirty pagecache than inodes.
>
> > i.e. when direct reclaim encounters N dirty pages, unconditionally ask the
> > flusher threads to clean that number of pages, throttle by waiting for them
> > to be cleaned, reclaim them if they get cleaned or otherwise scan more pages
> > on the LRU.
>
> Not bad at all... throttling is what makes it safe too. Problem is all
> the rest that isn't solved by this and could be solved with a stack
> switch, that's my main reason for considering this a ->writepage only
> hack not complete enough to provide a generic solution for reclaim
> issues ending up in fs->dm->iscsi/bio. I also suspect xfs is more hog
> than others (might not be a coicidence the 7k happens with xfs
> writepage) and could be lightened up a bit by looking into it.
>

Other than the whole "lacking the code" thing, it's still not clear that
writing from direct reclaim is absolutely necessary for VM stability,
considering it's been ignored today by at least two filesystems. I can add
the throttling logic (roughly sketched below) if it'd make you happier,
but I know it'd be at least two weeks before I could start from scratch on
a stack-switch-based solution, and that would be a PITA considering I'm
not convinced it's necessary :)
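
Roughly, the throttling I have in mind is along these lines; the name is
made up and it is only a sketch of the idea, not a patch:

#include <linux/swap.h>                 /* current_is_kswapd() */
#include <linux/writeback.h>            /* wakeup_flusher_threads() */
#include <linux/backing-dev.h>          /* congestion_wait() */
#include <linux/jiffies.h>              /* HZ */

/*
 * Sketch: having encountered nr_dirty dirty pages on the LRU, a direct
 * reclaimer asks the flusher threads to clean roughly that many pages and
 * then waits briefly instead of calling ->writepage itself.  The pages
 * are re-examined on the next pass over the LRU.
 */
static void throttle_direct_reclaim(unsigned long nr_dirty)
{
        if (!nr_dirty || current_is_kswapd())
                return;

        wakeup_flusher_threads(nr_dirty);
        congestion_wait(BLK_RW_ASYNC, HZ / 10);
}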

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab