From: Jens Axboe on
On 2010-06-16 07:11, Christoph Hellwig wrote:
> On Wed, Jun 16, 2010 at 11:20:24AM +0900, KAMEZAWA Hiroyuki wrote:
>> BTW, copy_from_user/copy_to_user is _real_ problem, I'm afraid following
>> much more than memcg.
>>
>> handle_mm_fault()
>> -> handle_pte_fault()
>> -> do_wp_page()
>> -> balance_dirty_pages_ratelimited()
>> -> balance_dirty_pages()
>> -> writeback_inodes_wbc()
>> -> writeback_inodes_wb()
>> -> writeback_sb_inodes()
>> -> writeback_single_inode()
>> -> do_writepages()
>> -> generic_writepages()
>> -> write_cache_pages() // use on-stack pagevec.
>> -> writepage()
>
> Yes, this is a massive issue. Strangely enough I just wondered about
> this callstack as balance_dirty_pages is the only place calling into the
> per-bdi/sb writeback code directly instead of offloading it to the
> flusher threads. It's something that should be fixed rather quickly
> IMHO. write_cache_pages and other bits of this writeback code can use
> quite large amounts of stack.

I've had the same thought as well: balance_dirty_pages() should just signal
the flusher thread to do writeback instead. Much cleaner than doing the
cleaning from that point.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: KAMEZAWA Hiroyuki on
On Wed, 16 Jun 2010 01:06:40 -0400
Christoph Hellwig <hch(a)infradead.org> wrote:

> On Wed, Jun 16, 2010 at 09:17:55AM +0900, KAMEZAWA Hiroyuki wrote:
> > yes. It's only called from
> > - page fault
> > - add_to_page_cache()
> >
> > I think we'll see no stack problem. Now, memcg doesn't wake up kswapd for
> > reclaiming memory, so it needs direct writeback.
>
> The page fault code should be fine, but add_to_page_cache can be called
> with quite deep stacks. Two examples are grab_cache_page_write_begin
> which already was part of one of the stack overflows mentioned in this
> thread, or find_or_create_page which can be called via
> _xfs_buf_lookup_pages, which can be called from under the whole XFS
> allocator, or via grow_dev_page which might have a similarly deep
> stack for users of the normal buffer cache. Although for
> find_or_create_page we usually should not have __GFP_FS set in the
> gfp_mask.
>

Hmm. ok, then, memory cgroup needs some care.

BTW, why does xfs_buf_create() use GFP_KERNEL even though it can block?
The memory cgroup only limits pages for users; it doesn't intend to limit
kernel pages. If this buffer is not for user-visible page cache but for an
internal structure, I'll have to add code for ignoring the memory cgroup
check when gfp_mask doesn't have __GFP_MOVABLE.


Thanks,
-Kame

From: Christoph Hellwig on
On Thu, Jun 17, 2010 at 09:25:38AM +0900, KAMEZAWA Hiroyuki wrote:
>
> BTW, why does xfs_buf_create() use GFP_KERNEL even though it can block?
> The memory cgroup only limits pages for users; it doesn't intend to
> limit kernel pages.

You mean xfs_buf_allocate? It doesn't in the end. It goes through the
xfs kmem helper, which clears __GFP_FS if we're currently inside a
filesystem transaction (PF_FSTRANS is set) or if a caller specifically
requested it to be disabled even without that by passing the
XBF_DONT_BLOCK flag.

From: KAMEZAWA Hiroyuki on
On Thu, 17 Jun 2010 02:16:47 -0400
Christoph Hellwig <hch(a)infradead.org> wrote:

> On Thu, Jun 17, 2010 at 09:25:38AM +0900, KAMEZAWA Hiroyuki wrote:
> >
> > BTW, why does xfs_buf_create() use GFP_KERNEL even though it can block?
> > The memory cgroup only limits pages for users; it doesn't intend to
> > limit kernel pages.
>
> You mean xfs_buf_allocate? It doesn't in the end. It goes through the
> xfs kmem helper, which clears __GFP_FS if we're currently inside a
> filesystem transaction (PF_FSTRANS is set) or if a caller specifically
> requested it to be disabled even without that by passing the
> XBF_DONT_BLOCK flag.
>
Ah, sorry. My question was wrong.

If xfs_buf_allocate() is not for pages on the LRU but for kernel memory,
the memory cgroup has no reason to charge against it, because we can't
reclaim memory which is not on the LRU.

Then, I wonder if I have to add the following check

	if (!(gfp_mask & __GFP_RECLAIMABLE)) {
		/* Ignore this; we only charge reclaimable memory on the LRU. */
		return 0;
	}

to mem_cgroup_charge_cache(), which is the hook for accounting page cache.


Thanks,
-Kame

From: Andrew Morton on
On Tue, 29 Jun 2010 12:34:46 +0100
Mel Gorman <mel(a)csn.ul.ie> wrote:

> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem is complex. Stack overflows have
> already been observed on XFS, but the problem is not XFS-specific.
>
> This patch prevents direct reclaim from writing back pages by not setting
> may_writepage in scan_control. Instead, dirty pages are placed back on the
> LRU lists for either background writing by the BDI threads or kswapd. If
> direct lumpy reclaim encounters dirty pages, the process will stall for
> the background flusher before trying to reclaim the pages again.
>
> Memory control groups do not have a kswapd-like thread, nor do their pages
> get direct reclaimed from the page allocator. Instead, memory control group
> pages are reclaimed when the quota is being exceeded or the group is being
> shrunk. As the entry points into page reclaim are not expected to be deep
> call chains, memcg is still allowed to write back dirty pages.

I already had "[PATCH 01/14] vmscan: Fix mapping use after free" and
I'll send that in for 2.6.35.

I grabbed [02/14] up to [11/14]. Including "[PATCH 06/14] vmscan: kill
prev_priority completely", grumpyouallsuck.

I wimped out at this one, "Do not writeback pages in direct reclaim". It
really is a profound change and needs a bit more thought, discussion, and,
if possible, testing designed to explore possible pathologies.
