From: Minchan Kim on
On Wed, Jul 7, 2010 at 5:27 AM, Johannes Weiner <hannes(a)cmpxchg.org> wrote:
> On Tue, Jul 06, 2010 at 04:25:39PM +0100, Mel Gorman wrote:
>> On Tue, Jul 06, 2010 at 08:24:57PM +0900, Minchan Kim wrote:
>> > but it is still a problem in the case of a swap file.
>> > That's because swapout to a swapfile causes a filesystem writepage,
>> > which could overflow the kernel stack.
>>
>> I don't *think* this is a problem unless I missed where writing out to
>> swap enters the filesystem code. I'll double check.
>
> It bypasses the fs.  On swapon, the blocks are resolved
> (mm/swapfile.c::setup_swap_extents) and then the writeout path uses
> bios directly (mm/page_io.c::swap_writepage).
>
> (GFP_NOFS still includes __GFP_IO, so allows swapping)
>
>        Hannes

Thanks, Hannes. You're right.
Extents would be resolved by setup_swap_extents.
Sorry for the confusion, Mel.

It was just my guess about what Kosaki meant, but he may have been telling a different story.
Ignore me.
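
For reference, a minimal sketch of what that swapon-time resolution looks
like (illustrative only: resolve_swapfile_blocks() is a made-up name, and the
real mm/swapfile.c::setup_swap_extents() also handles holes, alignment and
discontiguous extents; bmap() and add_swap_extent() are the real helpers it
builds on):

/*
 * Illustrative sketch, not the kernel code as merged: ask the filesystem
 * once, at swapon, where each page-sized chunk of the swapfile lives on
 * disk and record it, so writeout never needs the filesystem again.
 */
#include <linux/fs.h>
#include <linux/swap.h>

static int resolve_swapfile_blocks(struct swap_info_struct *sis,
				   struct inode *inode)
{
	unsigned blkbits = inode->i_blkbits;
	unsigned blocks_per_page = PAGE_SIZE >> blkbits;
	sector_t probe_block = 0;
	sector_t last_block = i_size_read(inode) >> blkbits;
	unsigned long page_no = 0;
	int ret;

	while (probe_block + blocks_per_page <= last_block) {
		/* One bmap() call per page at swapon time, never at writeout */
		sector_t first_block = bmap(inode, probe_block);

		if (first_block == 0)
			return -EINVAL;	/* hole: cannot swap to it */

		/*
		 * Record the page -> on-disk block mapping so writeout can
		 * later build bios directly. add_swap_extent() does this
		 * in the real code.
		 */
		ret = add_swap_extent(sis, page_no, 1,
				      first_block >> (PAGE_SHIFT - blkbits));
		if (ret < 0)
			return ret;

		probe_block += blocks_per_page;
		page_no++;
	}
	return 0;
}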

--
Kind regards,
Minchan Kim
From: Mel Gorman on
On Wed, Jul 07, 2010 at 07:28:14AM +0900, Minchan Kim wrote:
> On Wed, Jul 7, 2010 at 5:27 AM, Johannes Weiner <hannes(a)cmpxchg.org> wrote:
> > On Tue, Jul 06, 2010 at 04:25:39PM +0100, Mel Gorman wrote:
> >> On Tue, Jul 06, 2010 at 08:24:57PM +0900, Minchan Kim wrote:
> >> > but it is still a problem in the case of a swap file.
> >> > That's because swapout to a swapfile causes a filesystem writepage,
> >> > which could overflow the kernel stack.
> >>
> >> I don't *think* this is a problem unless I missed where writing out to
> >> swap enters the filesystem code. I'll double check.
> >
> > It bypasses the fs.  On swapon, the blocks are resolved
> > (mm/swapfile.c::setup_swap_extents) and then the writeout path uses
> > bios directly (mm/page_io.c::swap_writepage).
> >
> > (GFP_NOFS still includes __GFP_IO, so allows swapping)
> >
> >        Hannes
>
> Thanks, Hannes. You're right.
> Extents would be resolved by setup_swap_extents.
> Sorry for the confusion, Mel.
>

No confusion. I was 99.99999% certain this was the case and had tested with
a few BUG_ON()s just in case, but confirmation is helpful. Thanks, both.

What I have now is direct writeback for anon pages. For file pages, be it from
kswapd or direct reclaim, I kick writeback pre-emptively by an amount based
on the dirty pages encountered, because monitoring with systemtap indicated
that a large percentage of the dirty file pages were at the end of the LRU
lists (bad). Initial tests show that page reclaim writeback from kswapd is
reduced by 97% with this sort of pre-emptive kicking of flusher threads,
based on these figures from sysbench.

                                      traceonly-v4r1  stackreduce-v4r1  flushforward-v4r4
Direct reclaims                                  621               710              30928
Direct reclaim pages scanned                  141316            141184            1912093
Direct reclaim write file async I/O            23904             28714                  0
Direct reclaim write anon async I/O              716               918                 88
Direct reclaim write file sync I/O                 0                 0                  0
Direct reclaim write anon sync I/O                 0                 0                  0
Wake kswapd requests                          713250            735588            5626413
Kswapd wakeups                                  1805              1498                641
Kswapd pages scanned                        17065538          15605327            9524623
Kswapd reclaim write file async I/O           715768            617225              23938 <-- Wooo
Kswapd reclaim write anon async I/O           218003            214051             198746
Kswapd reclaim write file sync I/O                 0                 0                  0
Kswapd reclaim write anon sync I/O                 0                 0                  0
Time stalled direct reclaim (ms)                9.87             11.63             315.30
Time kswapd awake (ms)                       1884.91           2088.23            3542.92

This is "good" IMO because file IO from page reclaim is frowned upon because
of poor IO patterns. There isn't a launder process I can kick for anon pages
to get overall reclaim IO down but it's not clear it's worth it at this
juncture because AFAIK, IO to swap blows anyway. The biggest plus is that
direct reclaim still not call into the filesystem with my current series so
stack overflows are less of a heartache. As the number of pages encountered
for filesystem writeback are reduced, it's also less of a problem for memcg.

The direct reclaim stall latency increases because of congestion_wait
throttling, but the overall test completes 602 seconds faster, or by 8%
(figures not included). Scanning rates go up but, with the reduced time to
completion, on balance I think it works out.

Andrew has picked up some of the series, but I have another modification
to the tracepoints to differentiate between anon and file IO, which I now
think is a very important distinction as the flushers work on one but not the
other. I also must rebase onto an mmotm based on 2.6.35-rc4 before re-posting
the series but, broadly speaking, I think we are going in the right direction
without needing stack-switching tricks.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
From: Christoph Hellwig on
On Tue, Jul 06, 2010 at 10:27:58PM +0200, Johannes Weiner wrote:
> It bypasses the fs. On swapon, the blocks are resolved
> (mm/swapfile.c::setup_swap_extents) and then the writeout path uses
> bios directly (mm/page_io.c::swap_writepage).
>
> (GFP_NOFS still includes __GFP_IO, so allows swapping)

Exactly. Note that while the stack problem for swap writeout isn't
as bad as for filesystems, since the whole allocator / extent-map footprint
is missing, it can still be an issue. We still splice the whole block
I/O stack footprint onto a random stack that might already be filled up a lot.
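
To make that concrete, here is a rough sketch of the shape of the swap
writeout path (hedged: the sketch_* and swap_slot_to_sector() names are
invented stand-ins; the real code is mm/page_io.c::swap_writepage() with
map_swap_page() doing the lookup). The point is that the only stack consumed
below the VM here is the block layer's:

/*
 * Illustrative sketch only. swap_slot_to_sector() stands in for
 * map_swap_page(), which walks the extent list built at swapon, so no
 * filesystem code (and none of its stack) is involved here.
 */
#include <linux/bio.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/swap.h>

/* On-disk location in page-sized units, from the swapon-time extents */
static sector_t swap_slot_to_sector(struct page *page,
				    struct block_device **bdev);
/* Completion handler, like end_swap_bio_write() in the real code */
static void sketch_end_swap_write(struct bio *bio, int err);

static int sketch_swap_writepage(struct page *page,
				 struct writeback_control *wbc)
{
	struct bio *bio = bio_alloc(GFP_NOIO, 1);

	if (!bio) {
		unlock_page(page);
		return -ENOMEM;
	}

	bio->bi_sector = swap_slot_to_sector(page, &bio->bi_bdev) *
			 (PAGE_SIZE >> 9);
	bio->bi_end_io = sketch_end_swap_write;
	bio_add_page(bio, page, PAGE_SIZE, 0);

	set_page_writeback(page);
	unlock_page(page);
	submit_bio(WRITE, bio);		/* straight into the block layer */
	return 0;
}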
From: Wu Fengguang on
Hi Mel,

> Second, using systemtap, I was able to see that file-backed dirty
> pages have a tendency to be near the end of the LRU even though they
> are a small percentage of the overall pages in the LRU. I'm hoping
> to figure out why this is as it would make avoiding writeback a lot
> less controversial.

Your intuition is correct -- the current background writeback logic
fails to write older inodes first. Under heavy load the background
writeback job may run forever, totally ignoring the time order of
inode->dirtied_when. This is probably why you see lots of dirty pages
near the end of the LRU.

Here is an old patch for fixing this. Sorry for being late. I'll
pick up and refresh the patch series ASAP. (I made the mistake last
year of posting too many patches at one time; I'll break them up into
more manageable pieces.)

[PATCH 31/45] writeback: sync old inodes first in background writeback
<https://kerneltrap.org/mailarchive/linux-fsdevel/2009/10/7/6476313>
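
As a rough illustration of the idea behind that patch (simplified; the
struct and function names below are made up and this is not the real
fs/fs-writeback.c list handling or locking), background writeback would pick
inodes by dirtied_when so the oldest dirty data gets written first:

/*
 * Illustrative sketch: move inodes whose dirtied_when is older than a
 * cutoff onto the I/O queue ahead of anything newer, instead of letting
 * a long-running background job ignore the time order entirely.
 */
#include <linux/jiffies.h>
#include <linux/list.h>

struct sketch_inode {
	struct list_head i_list;
	unsigned long dirtied_when;	/* jiffies when the inode went dirty */
};

static void queue_old_inodes_first(struct list_head *dirty,
				   struct list_head *io,
				   unsigned long older_than_this)
{
	struct sketch_inode *inode, *tmp;

	list_for_each_entry_safe(inode, tmp, dirty, i_list) {
		/* Expired inodes go to the writeback queue; younger ones wait */
		if (time_before(inode->dirtied_when, older_than_this))
			list_move_tail(&inode->i_list, io);
	}
}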

Thanks,
Fengguang
From: Mel Gorman on
On Tue, Jul 06, 2010 at 09:15:33PM -0400, Christoph Hellwig wrote:
> On Wed, Jul 07, 2010 at 01:24:58AM +0100, Mel Gorman wrote:
> > What I have now is direct writeback for anon pages. For file pages, be it from
> > kswapd or direct reclaim, I kick writeback pre-emptively by an amount based
> > on the dirty pages encountered, because monitoring with systemtap indicated
> > that a large percentage of the dirty file pages were at the end of the LRU
> > lists (bad). Initial tests show that page reclaim writeback from kswapd is
> > reduced by 97% with this sort of pre-emptive kicking of flusher threads,
> > based on these figures from sysbench.
>
> That sounds like yet another band-aid to me. Instead it would be much
> better to not have so many file pages at the end of the LRU by tuning the
> flusher threads and the VM better.
>

Do you mean "so many dirty file pages"? I'm going to assume you do.

How do you suggest tuning this? The modification I tried was "if N dirty
pages are found during a SWAP_CLUSTER_MAX scan of pages, assume an average
dirtying density of at least that over the time those pages were inserted on
the LRU and, in response, ask the flushers to flush 1.5X that many pages".
This roughly responds to the conditions as they are encountered and is based
on scanning rates instead of time. It seemed like a reasonable option.
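
A sketch of that heuristic, for concreteness (kick_flushers_for_dirty_batch()
is a made-up name and this is not the actual patch; wakeup_flusher_threads()
is the real interface for poking the flusher threads):

/*
 * Illustrative sketch: if a SWAP_CLUSTER_MAX-sized scan of the LRU found
 * nr_dirty dirty pages, assume a similar dirtying density behind them and
 * ask the flusher threads to clean roughly 1.5X that many pages.
 */
#include <linux/swap.h>
#include <linux/writeback.h>

static void kick_flushers_for_dirty_batch(unsigned long nr_dirty)
{
	if (!nr_dirty)
		return;

	/* Flush 1.5X the dirty pages encountered in this batch of scanning */
	wakeup_flusher_threads(nr_dirty + nr_dirty / 2);
}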

Based on what I've seen, we are generally below the dirty_ratio and the
flushers are behaving as expected so there is little tuning available there. As
new dirty pages are added to the inactive list, they are allowed to reach the
bottom of the LRU before the periodic sync kicks in. From what I can tell,
it's already the case that flusher threads are cleaning the oldest inodes
first and I'd expect there to be a rough correlation between oldest inode
and oldest pages.

We could reduce the dirty_ratio, but people already complain about workloads
not being allowed to dirty enough pages. We could decrease the sync
interval for the flusher threads, but then IO might be started sooner than it
should be, and it might be unnecessary if the system is under no memory pressure.

Alternatives?

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab