From: Andrea Arcangeli
On Tue, Jun 15, 2010 at 10:11:22AM -0400, Christoph Hellwig wrote:
> On Tue, Jun 15, 2010 at 04:00:11PM +0200, Andrea Arcangeli wrote:
> > collecting clean cache doesn't still satisfy the allocation), during
> > allocations in direct reclaim and increase the THREAD_SIZE than doing
> > this purely for stack reasons as the VM will lose reliability if we
>
> This basically means doubling the stack size, as you can splice
> together two extremely stack-hungry codepaths in the worst case. Do
> you really want order-2 stack allocations?

If we were forbidden to call ->writepage just because of stack
overflow, then yes: I don't think order-2 stack allocations are a big
deal with memory compaction, and allowing ->writepage only from kernel
threads looks like too limiting a design. ->writepage is also called
by the pagecache layer, msync etc., not just by kswapd.

But let's defer this until we have some resemblance of hard numbers
for worst-case stack usage measured during the aforementioned
workload. I didn't read all the details, as I'm quite against this
design, but I didn't see any stack usage numbers or any sign of the
stack-overflow debug check triggering. I'd suggest measuring the
maximum stack usage first and worrying later.

And if ->writepage is a stack hog in some fs, I'd rather see
->writepage made less stack hungry (with a proper warning at runtime
when the debug option is enabled) than vetoed. The VM itself shouldn't
be a stack hog in the first place. I don't see a particular reason why
->writepage should be so stack hungry compared to the rest of the
kernel; it just has to do I/O. If it requires complex data structures
it should kmalloc those and stay light on the stack like everybody
else.
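
To make this concrete, here is a minimal sketch of the shape I mean,
for a hypothetical "examplefs" (every name below is invented for the
example, it is not any real filesystem's code): the scratch state goes
on the heap, so the ->writepage frame itself costs only a few
pointers.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/slab.h>
#include <linux/writeback.h>

/* a few hundred bytes of mapping/extent scratch state */
struct examplefs_wp_state {
	u64 blocks[64];
	u32 flags[64];
};

/* hypothetical: the filesystem's real writeback path */
int examplefs_do_writepage(struct page *page, struct writeback_control *wbc,
			   struct examplefs_wp_state *st);

static int examplefs_writepage(struct page *page,
			       struct writeback_control *wbc)
{
	/* heap allocation keeps the stack footprint to a few pointers */
	struct examplefs_wp_state *st = kmalloc(sizeof(*st), GFP_NOFS);
	int ret;

	if (!st) {
		/* don't dig deeper under memory pressure, retry later */
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

	ret = examplefs_do_writepage(page, wbc, st);
	kfree(st);
	return ret;
}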

If anything, I'm more worried about slab shrinking than ->writepage,
as that enters the VFS layer and then the low-level fs to collect the
dentries, inodes etc.
From: Christoph Hellwig
On Tue, Jun 15, 2010 at 04:22:19PM +0200, Andrea Arcangeli wrote:
> If we were forbidden to call ->writepage just because of stack
> overflow, then yes: I don't think order-2 stack allocations are a big
> deal with memory compaction, and allowing ->writepage only from kernel
> threads looks like too limiting a design. ->writepage is also called
> by the pagecache layer, msync etc., not just by kswapd.

Other callers of ->writepage are fine because they come from a
controlled environment with relatively little stack usage. The problem
with direct reclaim is that we splice multiple stack hogs on top of
each other.

Direct reclaim can be entered from any point that does memory
allocations, including allocations made precisely because the caller's
stack "quota" is almost used up. Let's look at a worst-case scenario:

We're in a deep stack codepath, say

(1) core_sys_select, which has to kmalloc the array if it doesn't
fit in its large on-stack buffer (see the sketch after this list).
All fine so far; it stays within its stack quota.
(2) That code now calls into the slab allocator, which doesn't find free
space in the large slab, and then calls into kmem_getpages, adding
more stack usage.
(3) That calls into alloc_pages_exact_node, which adds the stack usage
of the page allocator.
(4) There are no free pages left in the zone, so direct reclaim is
invoked, adding the stack usage of the reclaim code, which currently
is quite heavy.
(5) Direct reclaim calls into foofs ->writepage. foofs_writepage
notices the page is delayed-allocated and needs to convert it.
It now has to start a transaction, then call the extent management
code to convert the extent, which calls into the space management
code, which calls into the buffer cache for the metadata buffers,
which needs to submit a bio to read/write the metadata.
(6) The metadata buffer goes through submit_bio and the block layer
code. Because we're doing a synchronous writeout it gets dispatched
directly rather than merely being queued for later.
(7) For extra fun, add a few remapping layers for RAID or similar to
add to the stack usage.
(8) The low-level block driver is iSCSI or something similar, so after
going through the SCSI layer, adding more stack, it now goes through
the networking layer with the TCP and IPv4 (if you're unlucky, IPv6)
code.
(9) We finally end up in the low-level networking driver (except that
we would have long since overflowed the stack).

And for extra fun:

(10) Just when we're way down that stack an IRQ comes in on the CPU
we're executing on. Because we don't enable irqstacks for the only
sensible stack configuration (yeah, still bitter about the patch
for that getting ignored) it is handled on the same stack, on top of
all of the above.
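
For reference, the pattern step (1) refers to looks roughly like the
following. This is a generic sketch of the on-stack-buffer-with-
kmalloc-fallback idiom, not the literal fs/select.c code; the
GFP_KERNEL fallback is exactly the allocation that can drop into
direct reclaim with much of the caller's stack quota already spent.

#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/string.h>

#define ON_STACK_BYTES	256	/* arbitrary stack "quota" for this sketch */

static int process_request(const void *req, size_t len)
{
	char stack_buf[ON_STACK_BYTES];
	char *buf = stack_buf;

	if (len > sizeof(stack_buf)) {
		/* this GFP_KERNEL allocation may enter direct reclaim */
		buf = kmalloc(len, GFP_KERNEL);
		if (!buf)
			return -ENOMEM;
	}

	memcpy(buf, req, len);
	/* ... act on the request using buf as scratch space ... */

	if (buf != stack_buf)
		kfree(buf);
	return 0;
}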


And note that the above does not happen only with ext4/btrfs/xfs,
which have delayed allocation. It can also happen with every other
filesystem, just a lot less likely: when writing to a file through
shared mmaps we still have to call the block allocator from
->writepage in ext2/ext3/reiserfs/etc.

And seriously, if the VM isn't stopped from calling ->writepage from
reclaim context we FS people will simply ignore any ->writepage from
reclaim context. Been there, done that and never again.
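
For the record, "ignoring ->writepage from reclaim context" looks
roughly like the sketch below (hypothetical foofs again, simplified
rather than anyone's verbatim upstream code). Both kswapd and direct
reclaim run with PF_MEMALLOC set, so the page simply gets redirtied
and left to the flusher threads, which write it back from a shallow,
known stack.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
#include <linux/writeback.h>

/* hypothetical: the filesystem's normal writeback path */
int foofs_do_writepage(struct page *page, struct writeback_control *wbc);

static int foofs_writepage(struct page *page, struct writeback_control *wbc)
{
	/* kswapd and direct reclaim both run with PF_MEMALLOC set */
	if (current->flags & PF_MEMALLOC) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

	return foofs_do_writepage(page, wbc);
}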

Just wondering, what filesystems do your hugepage testing systems use?
If it's any of the ext4/btrfs/xfs above you're already seeing the
filesystem refuse ->writepage from both kswapd and direct reclaim,
so Mel's series will allow us to reclaim pages from more contexts
than before.
From: Rik van Riel
On 06/15/2010 10:51 AM, Mel Gorman wrote:
> On Tue, Jun 15, 2010 at 04:00:11PM +0200, Andrea Arcangeli wrote:
>> Hi Mel,
>>
>> I know lots of people don't like direct reclaim,
>
> It's not direct reclaim that is the problem per se, it's direct reclaim
> calling writepage and splicing two potentially deep call chains
> together.

I have talked to Mel on IRC, and the above means:

"calling alloc_pages from an already deep stack frame,
and then going into direct reclaim"

That explanation would have been helpful in email :)

--
All rights reversed
From: Andrea Arcangeli
On Tue, Jun 15, 2010 at 10:43:42AM -0400, Christoph Hellwig wrote:
> Other callers of ->writepage are fine because they come from a
> controlled environment with relatively little stack usage. The problem
> with direct reclaim is that we splice multiple stack hogs on top of
> each other.

It's not like we're running a stack-recursive algorithm in the kernel.
These have to be "controlled hogs", so we must have space to run 4 or
5 of them on top of each other; that's the whole point.

I'm aware that ->writepage can run from any alloc_pages, but frankly I
don't see a whole lot of difference between regular kernel code paths
and msync. Sure, they can be at higher stack usage, but not down to
only 1000 bytes left.

> And seriously, if the VM isn't stopped from calling ->writepage from
> reclaim context we FS people will simply ignore any ->writepage from
> reclaim context. Been there, done that and never again.
>
> Just wondering, what filesystems do your hugepage testing systems use?
> If it's any of the ext4/btrfs/xfs above you're already seeing the
> filesystem refuse ->writepage from both kswapd and direct reclaim,
> so Mel's series will allow us to reclaim pages from more contexts
> than before.

A fs ignoring ->writepage during memory pressure (even from kswapd) is
broken; this is not up to the fs to decide. I'm using ext4 for most of
my testing and it works OK, but that doesn't make it right (in fact,
if performance declines without that hack, it may prove the VM needs
fixing; it doesn't justify the hack).

If you don't throttle against kswapd, or if even kswapd can't turn a
dirty page into a clean one, you can get OOM false positives. Anything
is better than that (provided you have proper stack instrumentation to
notice when there is a risk of stack overflow; it's been ages since I
last saw a stack-overflow debug detector report).
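
To be clear about what I mean by instrumentation: a crude runtime
check could look like the sketch below. The helper name and threshold
are invented for the example; CONFIG_DEBUG_STACK_USAGE and the ftrace
stack tracer already do this job properly, this is just to illustrate
the idea. The address of a local variable stands in for the stack
pointer.

#include <linux/kernel.h>
#include <linux/sched.h>

#define STACK_WARN_LEFT	1024	/* warn when less than 1KB remains */

static inline void warn_if_stack_low(const char *where)
{
	unsigned long sp = (unsigned long)&sp;	/* rough stack pointer */
	unsigned long base = (unsigned long)task_stack_page(current);
	unsigned long left = sp - base;		/* stack grows down on x86 */

	if (left < STACK_WARN_LEFT)
		WARN_ONCE(1, "%s: only %lu bytes of stack left\n",
			  where, left);
}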

Irq stacks must be enabled, but that isn't about direct reclaim; it's
about irqs in general and their potential nesting with softirq calls
too.

Also note that there's nothing preventing us from switching the stack
to something else the moment we enter direct reclaim. It doesn't need
to be physically contiguous: just allocate a couple of 4k pages and
switch to them every time a new hog starts in VM context. The only
real complexity is in the stack unwind, but if the unwinder can cope
with irqstacks it can surely cope with more "special" stacks too.

Ignoring ->writepage on VM invocations can at best only hide VM
inefficiencies, with the downside of breaking the VM in corner cases
under heavy VM pressure.

Crippling the kernel by vetoing ->writepage looks very wrong to me,
but I'd be totally supportive of a "special" writepage stack or a
special iscsi stack etc.
From: Nick Piggin
On Tue, Jun 15, 2010 at 03:51:34PM +0100, Mel Gorman wrote:
> On Tue, Jun 15, 2010 at 04:00:11PM +0200, Andrea Arcangeli wrote:
> > When memory pressure is low, not going into ->writepage may be
> > beneficial from a latency perspective too (but again, it depends how
> > much it matters to go in LRU order and how beneficial the cache is,
> > to know whether it's worth taking clean cache away even if it's
> > hotter than the dirty cache).
> >
> > About the stack overflow, did you ever get any stack-debug error?
>
> Not an error. Got a report from Dave Chinner though, and it's what
> kicked off this whole exercise in the first place. I've been recording
> stack usage figures but not reporting them. In reclaim I'm getting to
> about 5K deep, but this was on simple storage and XFS was ignoring
> attempts by reclaim to write back pages.
>
> http://lkml.org/lkml/2010/4/13/121
>
> Here is one of my own stack traces though:
>
> Depth Size Location (49 entries)
> ----- ---- --------
> 0) 5064 304 get_page_from_freelist+0x2e4/0x722
> 1) 4760 240 __alloc_pages_nodemask+0x15f/0x6a7
> 2) 4520 48 kmem_getpages+0x61/0x12c
> 3) 4472 96 cache_grow+0xca/0x272
> 4) 4376 80 cache_alloc_refill+0x1d4/0x226
> 5) 4296 64 kmem_cache_alloc+0x129/0x1bc
> 6) 4232 16 mempool_alloc_slab+0x16/0x18
> 7) 4216 144 mempool_alloc+0x56/0x104
> 8) 4072 16 scsi_sg_alloc+0x48/0x4a [scsi_mod]
> 9) 4056 96 __sg_alloc_table+0x58/0xf8
> 10) 3960 32 scsi_init_sgtable+0x37/0x8f [scsi_mod]
> 11) 3928 32 scsi_init_io+0x24/0xce [scsi_mod]
> 12) 3896 48 scsi_setup_fs_cmnd+0xbc/0xc4 [scsi_mod]
> 13) 3848 144 sd_prep_fn+0x1d3/0xc13 [sd_mod]
> 14) 3704 64 blk_peek_request+0xe2/0x1a6
> 15) 3640 96 scsi_request_fn+0x87/0x522 [scsi_mod]
> 16) 3544 32 __blk_run_queue+0x88/0x14b
> 17) 3512 48 elv_insert+0xb7/0x254
> 18) 3464 48 __elv_add_request+0x9f/0xa7
> 19) 3416 128 __make_request+0x3f4/0x476
> 20) 3288 192 generic_make_request+0x332/0x3a4
> 21) 3096 64 submit_bio+0xc4/0xcd
> 22) 3032 80 _xfs_buf_ioapply+0x222/0x252 [xfs]
> 23) 2952 48 xfs_buf_iorequest+0x84/0xa1 [xfs]
> 24) 2904 32 xlog_bdstrat+0x47/0x4d [xfs]
> 25) 2872 64 xlog_sync+0x21a/0x329 [xfs]
> 26) 2808 48 xlog_state_release_iclog+0x9b/0xa8 [xfs]
> 27) 2760 176 xlog_write+0x356/0x506 [xfs]
> 28) 2584 96 xfs_log_write+0x5a/0x86 [xfs]
> 29) 2488 368 xfs_trans_commit_iclog+0x165/0x2c3 [xfs]
> 30) 2120 80 _xfs_trans_commit+0xd8/0x20d [xfs]
> 31) 2040 240 xfs_iomap_write_allocate+0x247/0x336 [xfs]
> 32) 1800 144 xfs_iomap+0x31a/0x345 [xfs]
> 33) 1656 48 xfs_map_blocks+0x3c/0x40 [xfs]
> 34) 1608 256 xfs_page_state_convert+0x2c4/0x597 [xfs]
> 35) 1352 64 xfs_vm_writepage+0xf5/0x12f [xfs]
> 36) 1288 32 __writepage+0x17/0x34
> 37) 1256 288 write_cache_pages+0x1f3/0x2f8
> 38) 968 16 generic_writepages+0x24/0x2a
> 39) 952 64 xfs_vm_writepages+0x4f/0x5c [xfs]
> 40) 888 16 do_writepages+0x21/0x2a
> 41) 872 48 writeback_single_inode+0xd8/0x2f4
> 42) 824 112 writeback_inodes_wb+0x41a/0x51e
> 43) 712 176 wb_writeback+0x13d/0x1b7
> 44) 536 128 wb_do_writeback+0x150/0x167
> 45) 408 80 bdi_writeback_task+0x43/0x117
> 46) 328 48 bdi_start_fn+0x76/0xd5
> 47) 280 96 kthread+0x82/0x8a
> 48) 184 184 kernel_thread_helper+0x4/0x10
>
> XFS as you can see is quite deep there. Now consider if
> get_page_from_freelist() there had entered direct reclaim and then
> tried to write back a page. That's the problem being worried about.

It would be a problem, because the allocation should be !__GFP_IO at
that point, so something would be seriously broken if it called
->writepage again.
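
In other words, the gating lives in the gfp mask. A condensed sketch
of the shape of the check reclaim does before touching a dirty page
(simplified, not the verbatim vmscan.c):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/swap.h>

static bool reclaim_may_writepage(struct page *page, gfp_t gfp_mask)
{
	/*
	 * File pages need __GFP_FS before reclaim will enter the
	 * filesystem; swapcache pages only need __GFP_IO. An allocation
	 * issued deep in the fs/block path clears these flags, so
	 * reclaim never calls ->writepage underneath it.
	 */
	bool may_enter_fs = (gfp_mask & __GFP_FS) ||
			    (PageSwapCache(page) && (gfp_mask & __GFP_IO));

	return may_enter_fs;
}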
