From: Christoph Hellwig on
On Tue, Jun 15, 2010 at 05:08:50PM +0200, Andrea Arcangeli wrote:
> On Tue, Jun 15, 2010 at 10:43:42AM -0400, Christoph Hellwig wrote:
> > Other callers of ->writepage are fine because they come from a
> > controlled environment with relatively little stack usage. The problem
> > with direct reclaim is that we splice multiple stack hogs on top of
> > each other.
>
> It's not like we're doing a stack-recursive algorithm in the kernel.
> These have to be "controlled hogs", so we must have space to run 4 or
> 5 of them on top of each other; that's the whole point.

We're not doing a full recursion. We're splicing a code path that
normally could use the full stack (fs writeback / block I/O) into a
random other code path that could also use the full stack, and adding
some quite stack-heavy allocator / reclaim code in between.

>
> I'm aware that ->writepage can run on any alloc_pages, but frankly I
> don't see a whole lot of difference between regular kernel code paths
> and msync. Sure, they can be at higher stack usage, but not like with
> only 1000 bytes left.

msync does not use any significant amount of stack:

0xc01f53b3 sys_msync [vmlinux]: 40
0xc022b165 vfs_fsync [vmlinux]: 12
0xc022b053 vfs_fsync_range [vmlinux]: 24
0xc01d7e63 filemap_write_and_wait_range [vmlinux]: 28
0xc01d7df3 __filemap_fdatawrite_range [vmlinux]: 56

and then we already enter ->writepages. Direct reclaim on the other
hand can happen from a context that is already, say, 4 or 6 kilobytes
into stack usage. And the callchain from kmalloc() into ->writepage
alone adds another 0.7k of stack usage. There's not much left for
the filesystem after this.

> If you don't throttle against kswapd, or if even kswapd can't turn a
> dirty page into a clean one, you can get OOM false positives. Anything
> is better than that. (Provided you have proper stack instrumentation to
> notice when there is a risk of a stack overflow; it's been ages since I
> last saw a stack overflow debug detector report.)

I've never seen the stack overflow detector trigger on this, but I've
seen lots of real-life stack overflows on the mailing lists. End
users normally don't run with it enabled, and most testing workloads
don't seem to hit direct reclaim enough to actually trigger this
reproducibly.

> Also note, there's nothing that prevents us from switching the stack
> to something else the moment we enter direct reclaim. It doesn't need
> to be physically contiguous. Just allocate a couple of 4k pages and
> switch to them every time a new hog starts in VM context. The only
> real complexity is in the stack unwind, but if the irq stack can cope
> with it, surely the stack unwind can cope with more "special" stacks
> too.

Which is a lot more complicated than offloading the page cleaning
from direct reclaim to dedicated threads - be that the flusher threads
or kswapd.

> Ignoring ->writepage on VM invocations can at best only hide VM
> inefficiencies, with the downside of breaking the VM in corner cases
> under heavy VM pressure.

It allows the system to survive when direct reclaim is invoked,
instead of crashing with a stack overflow. And at least in my testing
the VM seems to cope rather well with not being able to write out
filesystem pages from direct reclaim. That doesn't mean that this
behaviour can't be further improved on.
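
For reference, the way a filesystem opts out of this is a tiny check
at the top of ->writepage. A minimal sketch of the pattern
(illustrative only, not the verbatim code of any filesystem):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
#include <linux/writeback.h>

static int example_writepage(struct page *page,
                             struct writeback_control *wbc)
{
        /*
         * Called from reclaim?  The stack may already be several
         * kilobytes deep, so refuse to write and let the flusher
         * threads or kswapd clean the page from a shallow stack.
         */
        if (current->flags & PF_MEMALLOC) {
                redirty_page_for_writepage(wbc, page);
                unlock_page(page);
                return 0;
        }

        /* ... the normal writeback path follows here ... */
        return 0;
}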

From: Andrea Arcangeli on
On Tue, Jun 15, 2010 at 11:25:26AM -0400, Christoph Hellwig wrote:
> hand can happen from a context that is already, say, 4 or 6 kilobytes
> into stack usage. And the callchain from kmalloc() into ->writepage

Mel's stack trace of 5k still wasn't realistic, as it doesn't call
writepage there. I was just asking about the 6k example vs msync.

Plus, shrinking the dcache/inodes may also invoke I/O and end up with
all those hogs.

> I've never seen the stack overflow detector trigger on this, but I've
> seen lots of real-life stack overflows on the mailing lists. End
> users normally don't run with it enabled, and most testing workloads
> don't seem to hit direct reclaim enough to actually trigger this
> reproducibly.

How do you know it's a stack overflow if it's not the stack overflow
detector firing before the fact? Couldn't it usually be bad RAM too?

> Which is a lot more complicated than offloading the page cleaning
> from direct reclaim to dedicated threads - be that the flusher threads
> or kswapd.

More complicated for sure. But I surely like that more than vetoing
->writepage from VM context, especially if it's a fs decision; the fs
shouldn't be the one to decide that.

> It allows the system to survive when direct reclaim is invoked,
> instead of crashing with a stack overflow. And at least in my testing
> the VM seems to cope rather well with not being able to write out
> filesystem pages from direct reclaim. That doesn't mean that this
> behaviour can't be further improved on.

Agreed. Surely it seems to work ok for me too, but it may hide VM
issues and it makes the VM less reliable against potential
false-positive OOMs. It's better if we just teach the VM to switch
stacks before invoking the freeing methods, so that it automatically
solves dcache/icache collection ending up writing data, etc.

Then if we don't want to call ->writepage we won't do it for other
reasons, but we can solve this in a generic and reliable way that
covers not just ->writepage but all sources of I/O, including swapout
over iscsi, the VFS, etc.
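
To make that concrete, below is a very rough x86-64 sketch of the
idea. This is entirely hypothetical code (none of these helpers exist
in the tree), and as said before the real work would be teaching the
unwinder, and everything else that inspects the stack, about the
extra stacks:

#include <linux/mm.h>
#include <linux/vmalloc.h>

#define RECLAIM_STACK_SIZE      (2 * PAGE_SIZE)

/*
 * Hypothetical: run the reclaim core on a dedicated stack, in the
 * spirit of the existing irq stacks.  The backing pages need not be
 * physically contiguous, so plain vmalloc() is enough.
 */
static void run_reclaim_on_own_stack(void (*fn)(void))
{
        void *stack = vmalloc(RECLAIM_STACK_SIZE);

        if (!stack) {
                fn();           /* fall back to the current stack */
                return;
        }

        asm volatile("mov %%rsp, %%rbx\n\t"     /* save old stack pointer */
                     "mov %0, %%rsp\n\t"        /* switch to the new stack */
                     "call *%1\n\t"             /* run the reclaim core */
                     "mov %%rbx, %%rsp"         /* switch back */
                     : /* no outputs */
                     : "r" (stack + RECLAIM_STACK_SIZE), "r" (fn)
                     : "rbx", "rax", "rcx", "rdx", "rsi", "rdi",
                       "r8", "r9", "r10", "r11", "memory", "cc");

        vfree(stack);
}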
From: Andrea Arcangeli on
On Tue, Jun 15, 2010 at 04:38:38PM +0100, Mel Gorman wrote:
> That is pretty much what Dave is claiming here at
> http://lkml.org/lkml/2010/4/13/121 where if mempool_alloc_slab() needed

This stack trace shows writepage being called by shrink_page_list...
that contradicts Christoph's claim that xfs already won't call
writepage if invoked by direct reclaim.

> to allocate a page and writepage was entered, there would have been
> a problem.

There can't be a problem if a page wasn't available in the mempool,
because we can't nest two writepage calls on top of each other or we'd
deadlock on fs locks; this is the reason for GFP_NOFS, as noted in the
email.
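
Schematically (a minimal illustration, not from any particular
filesystem):

#include <linux/gfp.h>
#include <linux/slab.h>

/*
 * An allocation made from filesystem writeback context.  GFP_NOFS is
 * GFP_KERNEL without __GFP_FS: direct reclaim may still run, but it
 * must not call back into filesystem code such as ->writepage, since
 * we may already hold the fs locks a nested writepage would need.
 */
static void *fs_context_alloc(size_t size)
{
        return kmalloc(size, GFP_NOFS);
}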

Surely this shows writepage getting very close to the stack
limit... probably not enough to trigger the stack detector, but close
enough to worry! Agreed.

I think we just need to switch stacks in do_try_to_free_pages to solve
it, rather than only in writepage or the filesystems.

> Broken or not, it's what some of them are doing to avoid stack
> overflows. Worse, they are ignoring both kswapd and direct reclaim
> when they only really needed to ignore direct reclaim. With this
> series at least, the check for PF_MEMALLOC in ->writepage can be
> removed.

I don't get how we end up in xfs_buf_ioapply above, though, if xfs
writepage is a no-op under PF_MEMALLOC. PF_MEMALLOC is definitely set
before try_to_free_pages, but in the above trace writepage still runs
and submits the I/O.

> This series would at least allow kswapd to turn dirty pages into clean
> ones so it's an improvement.

I'm not saying it's not an improvement, but it's still not necessarily
the right direction.

> Other than a lack of code to do it :/

;)

> If you really feel strongly about this, you could follow up on the
> series by extending clean_page_list() to switch stacks if !kswapd.
>
> This has actually been the case for a while. I vaguely recall FS people

Again, that's not what it looks like from the stack trace. Also
grepping for PF_MEMALLOC in fs/xfs shows nothing. In fact it's
ext4_write_inode that skips the write if PF_MEMALLOC is set, not
writepage, apparently (I only did a quick grep so I might be wrong).
I suspect ext4_write_inode is the case I just mentioned about slab
shrink, not ->writepage ;).

Inodes are small; it's no big deal to keep an inode pinned and not
slab-reclaimable because it's dirty, while skipping real writepage
under memory pressure could really open a regression in OOM false
positives! One pagecache page is much bigger than one inode, and
there can be plenty more dirty pagecache than inodes.

> i.e. when direct reclaim encounters N dirty pages, unconditionally ask the
> flusher threads to clean that number of pages, throttle by waiting for them
> to be cleaned, reclaim them if they get cleaned or otherwise scan more pages
> on the LRU.

Not bad at all... the throttling is what makes it safe, too. The
problem is all the rest that isn't solved by this but could be solved
with a stack switch; that's my main reason for considering this a
->writepage-only hack, not complete enough to provide a generic
solution for reclaim issues ending up in fs->dm->iscsi/bio. I also
suspect xfs is more of a stack hog than the others (it might not be a
coincidence that the 7k happens with xfs writepage) and could be
lightened up a bit by looking into it.
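
For what it's worth, the throttling Mel describes maps to very little
code. A rough sketch of the policy (the placement and thresholds are
illustrative, but wakeup_flusher_threads() and congestion_wait() are
existing interfaces):

#include <linux/backing-dev.h>
#include <linux/blkdev.h>
#include <linux/writeback.h>

/*
 * Instead of writing pages ourselves from direct reclaim, ask the
 * flusher threads to clean them from their own (shallow) stacks and
 * throttle until the I/O has had a chance to make progress.
 */
static void throttle_on_dirty_pages(unsigned long nr_dirty)
{
        if (!nr_dirty)
                return;

        wakeup_flusher_threads(nr_dirty);       /* clean this many pages */
        congestion_wait(BLK_RW_ASYNC, HZ / 10); /* wait for writeback */
}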
From: Christoph Hellwig on
On Tue, Jun 15, 2010 at 06:14:19PM +0200, Andrea Arcangeli wrote:
> On Tue, Jun 15, 2010 at 04:38:38PM +0100, Mel Gorman wrote:
> > That is pretty much what Dave is claiming here at
> > http://lkml.org/lkml/2010/4/13/121 where if mempool_alloc_slab() needed
>
> This stack trace shows writepage being called by shrink_page_list...
> that contradicts Christoph's claim that xfs already won't call
> writepage if invoked by direct reclaim.

We only recently did that - before that we tried to get the VM fixed
multiple times but finally had to bite the bullet and follow ext4 and
btrfs in that regard.

> Again, that's not what it looks like from the stack trace. Also
> grepping for PF_MEMALLOC in fs/xfs shows nothing. In fact it's
> ext4_write_inode that skips the write if PF_MEMALLOC is set, not
> writepage, apparently (I only did a quick grep so I might be wrong).
> I suspect ext4_write_inode is the case I just mentioned about slab
> shrink, not ->writepage ;).

ext4 in fact does not check PF_MEMALLOC but simply refuses to write
out anything in ->writepage in most cases. There is a corner case,
when the page doesn't have any buffers attached, where it would write
out data without actually calling the allocator. I suspect this code
is actually a leftover, as we don't normally strip buffers from a page
that had them before.
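
From memory, the shape of that logic is roughly the following (a
simplified sketch, not ext4's actual code; page_needs_allocation() is
a hypothetical stand-in for the real walk over the page's buffers):

#include <linux/buffer_head.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Hypothetical stand-in for ext4's check whether any buffer on the
 * page is delayed-allocated or unwritten, i.e. whether writing the
 * page would have to call into the block allocator.
 */
static bool page_needs_allocation(struct page *page)
{
        return true;    /* placeholder: assume allocation is needed */
}

static int ext4_style_writepage(struct page *page,
                                struct writeback_control *wbc)
{
        /*
         * Writing this page would require block allocation (and
         * possibly journaling): refuse, redirty, and leave the work
         * to ->writepages, which runs from a safe context.
         */
        if (page_has_buffers(page) && page_needs_allocation(page)) {
                redirty_page_for_writepage(wbc, page);
                unlock_page(page);
                return 0;
        }

        /*
         * All blocks are already mapped: the page could be written
         * without entering the allocator, e.g. via
         * block_write_full_page() with a no-allocation get_block.
         */
        return 0;
}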

> inodes are small, it's no big deal to keep an inode pinned and not
> slab-reclaimable because dirty, while skipping real writepage in
> memory pressure could really open a regression in oom false positives!
> One pagecache much bigger than one inode and there can be plenty more
> dirty pagecache than inodes.

At least for XFS, ->write_inode is really simple these days. For a
synchronous writeout, which won't happen from these paths, it logs the
inode, which is far less harmful than the whole allocator code; for a
non-blocking writeout it only adds the inode to the delayed write
queue, which doesn't call into the I/O stack at all.
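
In outline, and only as a sketch of the behaviour just described (the
branch bodies are placeholders, not the actual XFS code):

#include <linux/fs.h>
#include <linux/writeback.h>

static int write_inode_sketch(struct inode *inode,
                              struct writeback_control *wbc)
{
        if (wbc->sync_mode == WB_SYNC_ALL) {
                /*
                 * Synchronous writeout: commit a transaction logging
                 * the inode.  No allocator calls, no deep I/O stack.
                 */
                return 0;       /* placeholder for the log commit */
        }

        /*
         * Non-blocking writeout: only mark the inode's buffer for
         * delayed write; the actual I/O is issued later from
         * xfsbufd's own stack.
         */
        return 0;               /* placeholder for the delwri queueing */
}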

From: Christoph Hellwig on
On Tue, Jun 15, 2010 at 05:45:16PM +0200, Andrea Arcangeli wrote:
> On Tue, Jun 15, 2010 at 11:25:26AM -0400, Christoph Hellwig wrote:
> > hand can happen from a context that is already, say, 4 or 6 kilobytes
> > into stack usage. And the callchain from kmalloc() into ->writepage
>
> Mel's stack trace of 5k still wasn't realistic, as it doesn't call
> writepage there. I was just asking about the 6k example vs msync.

FYI, here is the most recent one, which Michael Monnerie reported
after he hit it on a production machine. It's what finally prompted us
to add the check in ->writepage:

[21877.948005] BUG: scheduling while atomic: rsync/2345/0xffff8800
[21877.948005] Modules linked in: af_packet nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 ramzswap xvmalloc lzo_decompress lzo_compress loop dm_mod reiserfs xfs exportfs xennet xenblk cdrom
[21877.948005] Pid: 2345, comm: rsync Not tainted 2.6.31.12-0.2-xen #1
[21877.948005] Call Trace:
[21877.949649] [<ffffffff800119b9>] try_stack_unwind+0x189/0x1b0
[21877.949659] [<ffffffff8000f466>] dump_trace+0xa6/0x1e0
[21877.949666] [<ffffffff800114c4>] show_trace_log_lvl+0x64/0x90
[21877.949676] [<ffffffff80011513>] show_trace+0x23/0x40
[21877.949684] [<ffffffff8046b92c>] dump_stack+0x81/0x9e
[21877.949695] [<ffffffff8003f398>] __schedule_bug+0x78/0x90
[21877.949702] [<ffffffff8046c97c>] thread_return+0x1d7/0x3fb
[21877.949709] [<ffffffff8046cf85>] schedule_timeout+0x195/0x200
[21877.949717] [<ffffffff8046be2b>] wait_for_common+0x10b/0x230
[21877.949726] [<ffffffff8046c09b>] wait_for_completion+0x2b/0x50
[21877.949768] [<ffffffffa009e741>] xfs_buf_iowait+0x31/0x80 [xfs]
[21877.949894] [<ffffffffa009ea30>] _xfs_buf_read+0x70/0x80 [xfs]
[21877.949992] [<ffffffffa009ef8b>] xfs_buf_read_flags+0x8b/0xd0 [xfs]
[21877.950089] [<ffffffffa0091ab9>] xfs_trans_read_buf+0x1e9/0x320 [xfs]
[21877.950174] [<ffffffffa005b278>] xfs_btree_read_buf_block+0x68/0xe0 [xfs]
[21877.950232] [<ffffffffa005b99e>] xfs_btree_lookup_get_block+0x8e/0x110 [xfs]
[21877.950281] [<ffffffffa005c0af>] xfs_btree_lookup+0xdf/0x4d0 [xfs]
[21877.950329] [<ffffffffa0042b77>] xfs_alloc_lookup_eq+0x27/0x50 [xfs]
[21877.950361] [<ffffffffa0042f09>] xfs_alloc_fixup_trees+0x249/0x370 [xfs]
[21877.950397] [<ffffffffa0044c30>] xfs_alloc_ag_vextent_near+0x4e0/0x9a0 [xfs]
[21877.950432] [<ffffffffa00451f5>] xfs_alloc_ag_vextent+0x105/0x160 [xfs]
[21877.950471] [<ffffffffa0045bb4>] xfs_alloc_vextent+0x3b4/0x4b0 [xfs]
[21877.950504] [<ffffffffa0058da8>] xfs_bmbt_alloc_block+0xf8/0x210 [xfs]
[21877.950550] [<ffffffffa005e3b7>] xfs_btree_split+0xc7/0x720 [xfs]
[21877.950597] [<ffffffffa005ef8c>] xfs_btree_make_block_unfull+0x15c/0x1c0 [xfs]
[21877.950643] [<ffffffffa005f3ff>] xfs_btree_insrec+0x40f/0x5c0 [xfs]
[21877.950689] [<ffffffffa005f651>] xfs_btree_insert+0xa1/0x1b0 [xfs]
[21877.950748] [<ffffffffa005325e>] xfs_bmap_add_extent_delay_real+0x82e/0x12a0 [xfs]
[21877.950787] [<ffffffffa00540f4>] xfs_bmap_add_extent+0x424/0x450 [xfs]
[21877.950833] [<ffffffffa00573f3>] xfs_bmapi+0xda3/0x1320 [xfs]
[21877.950879] [<ffffffffa007c248>] xfs_iomap_write_allocate+0x1d8/0x3f0 [xfs]
[21877.950953] [<ffffffffa007d089>] xfs_iomap+0x2c9/0x300 [xfs]
[21877.951021] [<ffffffffa009a1b8>] xfs_map_blocks+0x38/0x60 [xfs]
[21877.951108] [<ffffffffa009b93a>] xfs_page_state_convert+0x3fa/0x720 [xfs]
[21877.951204] [<ffffffffa009bde4>] xfs_vm_writepage+0x84/0x160 [xfs]
[21877.951301] [<ffffffff800e3603>] pageout+0x143/0x2b0
[21877.951308] [<ffffffff800e514e>] shrink_page_list+0x26e/0x650
[21877.951314] [<ffffffff800e5803>] shrink_inactive_list+0x2d3/0x7c0
[21877.951320] [<ffffffff800e5d4b>] shrink_list+0x5b/0x110
[21877.951325] [<ffffffff800e5f71>] shrink_zone+0x171/0x250
[21877.951330] [<ffffffff800e60d3>] shrink_zones+0x83/0x120
[21877.951336] [<ffffffff800e620e>] do_try_to_free_pages+0x9e/0x380
[21877.951342] [<ffffffff800e6607>] try_to_free_pages+0x77/0xa0
[21877.951349] [<ffffffff800dbfa3>] __alloc_pages_slowpath+0x2d3/0x5c0
[21877.951355] [<ffffffff800dc3e1>] __alloc_pages_nodemask+0x151/0x160
[21877.951362] [<ffffffff800d44b7>] __page_cache_alloc+0x27/0x50
[21877.951368] [<ffffffff800d68ca>] grab_cache_page_write_begin+0x9a/0xe0
[21877.951376] [<ffffffff8014bdfe>] block_write_begin+0xae/0x120
[21877.951396] [<ffffffffa009ac24>] xfs_vm_write_begin+0x34/0x50 [xfs]
[21877.951482] [<ffffffff800d4b31>] generic_perform_write+0xc1/0x1f0
[21877.951489] [<ffffffff800d5d00>] generic_file_buffered_write+0x90/0x160
[21877.951512] [<ffffffffa00a4711>] xfs_write+0x521/0xb60 [xfs]
[21877.951624] [<ffffffffa009fb80>] xfs_file_aio_write+0x70/0xa0 [xfs]
[21877.951711] [<ffffffff80118c42>] do_sync_write+0x102/0x160
[21877.951718] [<ffffffff80118fc8>] vfs_write+0xd8/0x1c0
[21877.951723] [<ffffffff8011995b>] sys_write+0x5b/0xa0
[21877.951729] [<ffffffff8000c868>] system_call_fastpath+0x16/0x1b
[21877.951736] [<00007fc41b0fab10>] 0x7fc41b0fab10
[21877.951750] BUG: unable to handle kernel paging request at 0000000108743280
[21877.951755] IP: [<ffffffff80034832>] dequeue_task+0x72/0x110
[21877.951766] PGD 31c6f067 PUD 0
[21877.951770] Thread overran stack, or stack corrupted
