From: Christoph Hellwig on
On Tue, Jun 15, 2010 at 05:30:44PM +0100, Mel Gorman wrote:
> After grepping through fs/, it was only xfs and btrfs that I saw were
> specifically disabling writepage from reclaim context.

ext4 doesn't specifically disable writeback from reclaim context, but
in a rather convoluted way disables basically all writeback through
->writepage. The only thing allowed is overwrites of already allocated
blocks.

In addition to that, reiserfs also refuses to write back pages
from reclaim context if they require a transaction, which is the case
if the file was written to through mmap.
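
For reference, the shape of both checks is roughly as follows. This is
a sketch only: page_needs_allocation(), page_needs_transaction() and
example_get_block are made-up stand-ins for the real per-filesystem
logic.

static int example_writepage(struct page *page,
			     struct writeback_control *wbc)
{
	/* ext4-style: never allocate blocks from ->writepage; only
	 * overwrites of already allocated blocks are written here. */
	if (page_needs_allocation(page))
		goto redirty;

	/* reiserfs-style: refuse work that would need a transaction
	 * when called from direct reclaim (PF_MEMALLOC is set). */
	if (page_needs_transaction(page) &&
	    (current->flags & PF_MEMALLOC))
		goto redirty;

	return block_write_full_page(page, example_get_block, wbc);

redirty:
	/* Leave the page dirty for the flusher threads to handle. */
	redirty_page_for_writepage(wbc, page);
	unlock_page(page);
	return 0;
}
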
From: Christoph Hellwig on
On Tue, Jun 15, 2010 at 06:14:19PM +0200, Andrea Arcangeli wrote:
> Again, that's not what it looks like from the stack trace. Also,
> grepping for PF_MEMALLOC in fs/xfs shows nothing. In fact it's
> ext4_write_inode that skips the write if PF_MEMALLOC is set, not
> writepage apparently (I only did a quick grep so I might be wrong).
> I suspect ext4_write_inode is the case I just mentioned about slab
> shrink, not ->writepage ;).
>
> Inodes are small, so it's no big deal to keep an inode pinned and not
> slab-reclaimable because it's dirty, while skipping real writepage
> under memory pressure could really open a regression in OOM false
> positives! One pagecache page is much bigger than one inode, and
> there can be plenty more dirty pagecache than dirty inodes.

Btw, those comments in ext3/ext4 don't make much sense. The only
time iput_final ever calls into ->write_inode is when the filesystem
is being unmounted, which never happens with PF_MEMALLOC set.
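
The check under discussion looks roughly like this (paraphrased from
the 2.6.3x ext3/ext4 sources, so treat the details as approximate):

int ext4_write_inode(struct inode *inode, int wait)
{
	/* Skip the write entirely when called from memory reclaim,
	 * e.g. inode cache shrinking, leaving the inode dirty. */
	if (current->flags & PF_MEMALLOC)
		return 0;

	/* ... normal inode writeback follows ... */
}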

From: Nick Piggin on
On Tue, Jun 15, 2010 at 12:49:49PM -0400, Rik van Riel wrote:
> On 06/15/2010 12:26 PM, Christoph Hellwig wrote:
> >On Tue, Jun 15, 2010 at 05:45:16PM +0200, Andrea Arcangeli wrote:
> >[21877.951204] [<ffffffffa009bde4>] xfs_vm_writepage+0x84/0x160 [xfs]
> >[21877.951301] [<ffffffff800e3603>] pageout+0x143/0x2b0
> >[21877.951308] [<ffffffff800e514e>] shrink_page_list+0x26e/0x650
> >[21877.951314] [<ffffffff800e5803>] shrink_inactive_list+0x2d3/0x7c0
> >[21877.951320] [<ffffffff800e5d4b>] shrink_list+0x5b/0x110
> >[21877.951325] [<ffffffff800e5f71>] shrink_zone+0x171/0x250
> >[21877.951330] [<ffffffff800e60d3>] shrink_zones+0x83/0x120
> >[21877.951336] [<ffffffff800e620e>] do_try_to_free_pages+0x9e/0x380
> >[21877.951342] [<ffffffff800e6607>] try_to_free_pages+0x77/0xa0
> >[21877.951349] [<ffffffff800dbfa3>] __alloc_pages_slowpath+0x2d3/0x5c0
> >[21877.951355] [<ffffffff800dc3e1>] __alloc_pages_nodemask+0x151/0x160
> >[21877.951362] [<ffffffff800d44b7>] __page_cache_alloc+0x27/0x50
> >[21877.951368] [<ffffffff800d68ca>] grab_cache_page_write_begin+0x9a/0xe0
> >[21877.951376] [<ffffffff8014bdfe>] block_write_begin+0xae/0x120
> >[21877.951396] [<ffffffffa009ac24>] xfs_vm_write_begin+0x34/0x50 [xfs]
>
> This is already in a filesystem. Why does ->writepage get
> called a second time? Shouldn't this have a gfp_mask
> without __GFP_FS set?

No, we're allowed to use __GFP_FS with i_mutex held.

>
> >[21877.951482] [<ffffffff800d4b31>] generic_perform_write+0xc1/0x1f0
> >[21877.951489] [<ffffffff800d5d00>] generic_file_buffered_write+0x90/0x160
> >[21877.951512] [<ffffffffa00a4711>] xfs_write+0x521/0xb60 [xfs]
> >[21877.951624] [<ffffffffa009fb80>] xfs_file_aio_write+0x70/0xa0 [xfs]
> >[21877.951711] [<ffffffff80118c42>] do_sync_write+0x102/0x160
> >[21877.951718] [<ffffffff80118fc8>] vfs_write+0xd8/0x1c0
> >[21877.951723] [<ffffffff8011995b>] sys_write+0x5b/0xa0
> >[21877.951729] [<ffffffff8000c868>] system_call_fastpath+0x16/0x1b
> >[21877.951736] [<00007fc41b0fab10>] 0x7fc41b0fab10
> >[21877.951750] BUG: unable to handle kernel paging request at 0000000108743280
> >[21877.951755] IP: [<ffffffff80034832>] dequeue_task+0x72/0x110
> >[21877.951766] PGD 31c6f067 PUD 0
> >[21877.951770] Thread overran stack, or stack corrupted
> >
>
>
> --
> All rights reversed
From: Christoph Hellwig on
On Tue, Jun 15, 2010 at 12:49:49PM -0400, Rik van Riel wrote:
> This is already in a filesystem. Why does ->writepage get
> called a second time? Shouldn't this have a gfp_mask
> without __GFP_FS set?

Why would it? GFP_NOFS is not for all filesystem code, but only for
code where we can't re-enter the filesystem due to deadlock potential.

Except for a few filesystems that have transactions open inside
->aio_write, no one uses GFP_NOFS from that path.
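
To illustrate the distinction (a sketch only, not code from any real
filesystem; example_start_transaction() and example_stop_transaction()
are made-up stand-ins for a journal start/stop):

static void example_alloc_contexts(size_t size)
{
	void *buf, *tmp;

	/* Ordinary write path: no transaction or fs lock held that
	 * reclaim could deadlock on, so __GFP_FS is fine and reclaim
	 * is free to call back into ->writepage. */
	buf = kmalloc(size, GFP_KERNEL);

	example_start_transaction();

	/* With a transaction open, reclaim re-entering the filesystem
	 * could block on that same transaction: use GFP_NOFS so
	 * reclaim never calls back into the fs. */
	tmp = kmalloc(size, GFP_NOFS);

	example_stop_transaction();

	kfree(tmp);
	kfree(buf);
}
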
From: Rik van Riel on
On 06/15/2010 12:26 PM, Christoph Hellwig wrote:
> On Tue, Jun 15, 2010 at 05:45:16PM +0200, Andrea Arcangeli wrote:
>> On Tue, Jun 15, 2010 at 11:25:26AM -0400, Christoph Hellwig wrote:
>>> hand can happen from context that already is, say, 4 or 6 kilobytes
>>> into stack usage. And the callchain from kmalloc() into ->writepage
>>
>> Mel's stack trace of 5k was still not realistic, as it doesn't call
>> writepage there. I was just asking about the 6k example vs msync.
>
> FYI here is the most recent one that Michael Monnerie reported after he
> hit it on a production machine. It's what finally prompted us to add
> the check in ->writepage:
>
> [21877.948005] BUG: scheduling while atomic: rsync/2345/0xffff8800
> [21877.948005] Modules linked in: af_packet nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 ramzswap xvmalloc lzo_decompress lzo_compress loop dm_mod reiserfs xfs exportfs xennet xenblk cdrom
> [21877.948005] Pid: 2345, comm: rsync Not tainted 2.6.31.12-0.2-xen #1
> [21877.948005] Call Trace:
> [21877.949649] [<ffffffff800119b9>] try_stack_unwind+0x189/0x1b0
> [21877.949659] [<ffffffff8000f466>] dump_trace+0xa6/0x1e0
> [21877.949666] [<ffffffff800114c4>] show_trace_log_lvl+0x64/0x90
> [21877.949676] [<ffffffff80011513>] show_trace+0x23/0x40
> [21877.949684] [<ffffffff8046b92c>] dump_stack+0x81/0x9e
> [21877.949695] [<ffffffff8003f398>] __schedule_bug+0x78/0x90
> [21877.949702] [<ffffffff8046c97c>] thread_return+0x1d7/0x3fb
> [21877.949709] [<ffffffff8046cf85>] schedule_timeout+0x195/0x200
> [21877.949717] [<ffffffff8046be2b>] wait_for_common+0x10b/0x230
> [21877.949726] [<ffffffff8046c09b>] wait_for_completion+0x2b/0x50
> [21877.949768] [<ffffffffa009e741>] xfs_buf_iowait+0x31/0x80 [xfs]
> [21877.949894] [<ffffffffa009ea30>] _xfs_buf_read+0x70/0x80 [xfs]
> [21877.949992] [<ffffffffa009ef8b>] xfs_buf_read_flags+0x8b/0xd0 [xfs]
> [21877.950089] [<ffffffffa0091ab9>] xfs_trans_read_buf+0x1e9/0x320 [xfs]
> [21877.950174] [<ffffffffa005b278>] xfs_btree_read_buf_block+0x68/0xe0 [xfs]
> [21877.950232] [<ffffffffa005b99e>] xfs_btree_lookup_get_block+0x8e/0x110 [xfs]
> [21877.950281] [<ffffffffa005c0af>] xfs_btree_lookup+0xdf/0x4d0 [xfs]
> [21877.950329] [<ffffffffa0042b77>] xfs_alloc_lookup_eq+0x27/0x50 [xfs]
> [21877.950361] [<ffffffffa0042f09>] xfs_alloc_fixup_trees+0x249/0x370 [xfs]
> [21877.950397] [<ffffffffa0044c30>] xfs_alloc_ag_vextent_near+0x4e0/0x9a0 [xfs]
> [21877.950432] [<ffffffffa00451f5>] xfs_alloc_ag_vextent+0x105/0x160 [xfs]
> [21877.950471] [<ffffffffa0045bb4>] xfs_alloc_vextent+0x3b4/0x4b0 [xfs]
> [21877.950504] [<ffffffffa0058da8>] xfs_bmbt_alloc_block+0xf8/0x210 [xfs]
> [21877.950550] [<ffffffffa005e3b7>] xfs_btree_split+0xc7/0x720 [xfs]
> [21877.950597] [<ffffffffa005ef8c>] xfs_btree_make_block_unfull+0x15c/0x1c0 [xfs]
> [21877.950643] [<ffffffffa005f3ff>] xfs_btree_insrec+0x40f/0x5c0 [xfs]
> [21877.950689] [<ffffffffa005f651>] xfs_btree_insert+0xa1/0x1b0 [xfs]
> [21877.950748] [<ffffffffa005325e>] xfs_bmap_add_extent_delay_real+0x82e/0x12a0 [xfs]
> [21877.950787] [<ffffffffa00540f4>] xfs_bmap_add_extent+0x424/0x450 [xfs]
> [21877.950833] [<ffffffffa00573f3>] xfs_bmapi+0xda3/0x1320 [xfs]
> [21877.950879] [<ffffffffa007c248>] xfs_iomap_write_allocate+0x1d8/0x3f0 [xfs]
> [21877.950953] [<ffffffffa007d089>] xfs_iomap+0x2c9/0x300 [xfs]
> [21877.951021] [<ffffffffa009a1b8>] xfs_map_blocks+0x38/0x60 [xfs]
> [21877.951108] [<ffffffffa009b93a>] xfs_page_state_convert+0x3fa/0x720 [xfs]
> [21877.951204] [<ffffffffa009bde4>] xfs_vm_writepage+0x84/0x160 [xfs]
> [21877.951301] [<ffffffff800e3603>] pageout+0x143/0x2b0
> [21877.951308] [<ffffffff800e514e>] shrink_page_list+0x26e/0x650
> [21877.951314] [<ffffffff800e5803>] shrink_inactive_list+0x2d3/0x7c0
> [21877.951320] [<ffffffff800e5d4b>] shrink_list+0x5b/0x110
> [21877.951325] [<ffffffff800e5f71>] shrink_zone+0x171/0x250
> [21877.951330] [<ffffffff800e60d3>] shrink_zones+0x83/0x120
> [21877.951336] [<ffffffff800e620e>] do_try_to_free_pages+0x9e/0x380
> [21877.951342] [<ffffffff800e6607>] try_to_free_pages+0x77/0xa0
> [21877.951349] [<ffffffff800dbfa3>] __alloc_pages_slowpath+0x2d3/0x5c0
> [21877.951355] [<ffffffff800dc3e1>] __alloc_pages_nodemask+0x151/0x160
> [21877.951362] [<ffffffff800d44b7>] __page_cache_alloc+0x27/0x50
> [21877.951368] [<ffffffff800d68ca>] grab_cache_page_write_begin+0x9a/0xe0
> [21877.951376] [<ffffffff8014bdfe>] block_write_begin+0xae/0x120
> [21877.951396] [<ffffffffa009ac24>] xfs_vm_write_begin+0x34/0x50 [xfs]

This is already in a filesystem. Why does ->writepage get
called a second time? Shouldn't this have a gfp_mask
without __GFP_FS set?

> [21877.951482] [<ffffffff800d4b31>] generic_perform_write+0xc1/0x1f0
> [21877.951489] [<ffffffff800d5d00>] generic_file_buffered_write+0x90/0x160
> [21877.951512] [<ffffffffa00a4711>] xfs_write+0x521/0xb60 [xfs]
> [21877.951624] [<ffffffffa009fb80>] xfs_file_aio_write+0x70/0xa0 [xfs]
> [21877.951711] [<ffffffff80118c42>] do_sync_write+0x102/0x160
> [21877.951718] [<ffffffff80118fc8>] vfs_write+0xd8/0x1c0
> [21877.951723] [<ffffffff8011995b>] sys_write+0x5b/0xa0
> [21877.951729] [<ffffffff8000c868>] system_call_fastpath+0x16/0x1b
> [21877.951736] [<00007fc41b0fab10>] 0x7fc41b0fab10
> [21877.951750] BUG: unable to handle kernel paging request at 0000000108743280
> [21877.951755] IP: [<ffffffff80034832>] dequeue_task+0x72/0x110
> [21877.951766] PGD 31c6f067 PUD 0
> [21877.951770] Thread overran stack, or stack corrupted
>


--
All rights reversed