Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible [Kernel]

Prev: [PROPOSAL - FIRST POST] NMI & register clash handling infrastructure
Next: Using "page credits" as a solution for common thrashing scenarios

From: Mel Gorman on 8 Jun 2010 05:30

On Tue, Jun 08, 2010 at 05:08:11AM -0400, Christoph Hellwig wrote:
> On Tue, Jun 08, 2010 at 10:02:19AM +0100, Mel Gorman wrote:
> > seeky patterns. The second is that direct reclaim calling the filesystem
> > splices two potentially deep call paths together and potentially overflows
> > the stack on complex storage or filesystems. This series is an early draft
> > at tackling both of these problems and is in three stages.
>
> Btw, one more thing came up when I discussed the issue again with Dave
> recently:
>
> - we also need to care about ->releasepage. At least for XFS it
> can end up in the same deep allocator chain as ->writepage because
> it does all the extent state conversions, even if it doesn't
> start I/O.

Dang.

> I haven't managed yet to decode the ext4/btrfs codepaths
> for ->releasepage yet to figure out how they release a page that
> covers a delayed allocated or unwritten range.
>

If ext4/btrfs are also very deep call-chains and this series is going more
or less the right direction, then avoiding calling ->releasepage from direct
reclaim is one, somewhat unfortunate, option. The second is to avoid it on
a per-filesystem basis for direct reclaim using PF_MEMALLOC to detect
reclaimers and PF_KSWAPD to tell the difference between direct
reclaimers and kswapd.

Either way, these pages could be treated similar to dirty pages on the
dirty_pages list.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mel Gorman on 9 Jun 2010 06:00

On Wed, Jun 09, 2010 at 11:52:11AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 8 Jun 2010 10:02:19 +0100
> > <SNIP>
> >
> > Patch 5 writes out contiguous ranges of pages where possible using
> > a_ops->writepages. When writing a range, the inode is pinned and the page
> > lock released before submitting to writepages(). This potentially generates
> > a better IO pattern and it should avoid a lock inversion problem within the
> > filesystem that wants the same page lock held by the VM. The downside with
> > writing ranges is that the VM may not be generating more IO than necessary.
> >
> > Patch 6 prevents direct reclaim writing out pages at all and instead dirty
> > pages are put back on the LRU. For lumpy reclaim, the caller will briefly
> > wait on dirty pages to be written out before trying to reclaim the dirty
> > pages a second time.
> >
> > The last patch increases the responsibility of kswapd somewhat because
> > it's now cleaning pages on behalf of direct reclaimers but kswapd seemed
> > a better fit than background flushers to clean pages as it knows where the
> > pages needing cleaning are. As it's async IO, it should not cause kswapd to
> > stall (at least until the queue is congested) but the order that pages are
> > reclaimed on the LRU is altered. Dirty pages that would have been reclaimed
> > by direct reclaimers are getting another lap on the LRU. The dirty pages
> > could have been put on a dedicated list but this increased counter overhead
> > and the number of lists and it is unclear if it is necessary.
> >
> > <SNIP>
>
> My concern is how memcg should work. IOW, what changes will be necessary for
> memcg to work with the new vmscan logic as no-direct-writeback.
>

At worst, memcg waits on background flushers to clean their pages but
obviously this could lead to stalls in containers if it happened to be full
of dirty pages.

Do you have test scenarios already setup for functional and performance
regression testing of containers? If so, can you run tests with this series
and see what sort of impact you find? I haven't done performance testing
with containers to date so I don't know what the expected values are.

> Maybe an ideal solution will be
> - support buffered I/O tracking in I/O cgroup.
> - flusher threads should work with I/O cgroup.
> - memcg itself should support dirty ratio. and add a trigger to kick flusher
> threads for dirty pages in a memcg.
> But I know it's a long way.
>

I'm not very familiar with memcg I'm afraid or its requirements so I am
having trouble guessing which of these would behave the best. You could take
a gamble on having memcg doing writeback in direct reclaim but you may run
into the same problem of overflowing stacks.

I'm not sure how a flusher thread would work just within a cgroup. It
would have to do a lot of searching to find the pages it needs
considering that it's looking at inodes rather than pages.

One possibility I guess would be to create a flusher-like thread if a direct
reclaimer finds that the dirty pages in the container are above the dirty
ratio. It would scan and clean all dirty pages in the container LRU on behalf
of dirty reclaimers.

Another possibility would be to have kswapd work in containers.
Specifically, if wakeup_kswapd() is called with a cgroup that it's added
to a list. kswapd gives priority to global reclaim but would
occasionally check if there is a container that needs kswapd on a
pending list and if so, work within the container. Is there a good
reason why kswapd does not work within container groups?

Finally, you could just allow reclaim within a memcg do writeback. Right
now, the check is based on current_is_kswapd() but I could create a helper
function that also checked for sc->mem_cgroup. Direct reclaim from the
page allocator never appears to work within a container group (which
raises questions in itself such as why a process in a container would
reclaim pages outside the container?) so it would remain safe.

> How the new logic works with memcg ? Because memcg doesn't trigger kswapd,
> memcg has to wait for a flusher thread make pages clean ?

Right now, memcg has to wait for a flusher thread to make pages clean.

> Or memcg should have kswapd-for-memcg ?
>
> Is it okay to call writeback directly when !scanning_global_lru() ?
> memcg's reclaim routine is only called from specific positions, so, I guess
> no stack problem.

It's a judgement call from you really. I see that direct reclaimers do
not set mem_cgroup so it's down to - are you reasonably sure that all
the paths that reclaim based on a container are not deep? I looked
around for a while and the bulk appeared to be in the fault path so I
would guess "yes" but as I'm not familiar with the memcg implementation
I'll have missed a lot.

> But we just have I/O pattern problem.

True.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mel Gorman on 9 Jun 2010 21:20

On Thu, Jun 10, 2010 at 09:38:42AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 9 Jun 2010 10:52:00 +0100
> Mel Gorman <mel(a)csn.ul.ie> wrote:
>
> > On Wed, Jun 09, 2010 at 11:52:11AM +0900, KAMEZAWA Hiroyuki wrote:
>
> > > > <SNIP>
> > >
> > > My concern is how memcg should work. IOW, what changes will be necessary for
> > > memcg to work with the new vmscan logic as no-direct-writeback.
> > >
> >
> > At worst, memcg waits on background flushers to clean their pages but
> > obviously this could lead to stalls in containers if it happened to be full
> > of dirty pages.
> >
>
> yes.
>

I'd like to have a better intuitive idea of how bad this really is. My current
understanding is that we don't know how bad it really is but it "feels vaguely
bad". I am focusing on the global scenario right now but only because I know
it's important and that any complex IO or filesystem can overflow the stack.

> > Do you have test scenarios already setup for functional and performance
> > regression testing of containers? If so, can you run tests with this series
> > and see what sort of impact you find? I haven't done performance testing
> > with containers to date so I don't know what the expected values are.
> >
> Maybe kernbench is enough.

I think kernbench only stresses reclaim in very specific scenarios. I wouldn't
think it is reliable for this sort of evaluation. I've been using sysbench
and the high-order stress allocation to evaulate IO and lumpy reclaim. I'm
missing a test on stack-usage because I lack a test situation with complex
IO (I lack the resources to setup such a thing) or a complex FS (because I
haven't set up one).

> I think it does enough write and malloc.
> 'Limit' size for test depends on your host. I sometimes does this on
> 8cpu SMP box.
>
> # mount -t cgroup none /cgroups -o memory
> # mkdir /cgroups/A
> # echo $$ > /cgroups/A
> # echo 300M > /cgroups/memory.limit_in_bytes
> # make -j 8 or make -j 16
>

That sort of scenario would be barely pushed by kernbench. For a single
kernel build, it's about 250-400M depending on the .config but it's still
a bit unreliable. Critically, it's not the sort of workload that would have
lots of long-lived mappings that would hurt a workload a lot if it was being
paged out.

> Comparing size of swap and speed will be interesting.
> (Above 300M is enough small because my test machine has 24G memory.)
>
> Or
> # mount -t cgroup none /cgroups -o memory
> # mkdir /cgroups/A
> # echo $$ > /cgroups/A
> # echo 50M > /cgroups/memory.limit_in_bytes
> # dd if=/dev/zero of=./tmpfile bs=65536 count=100000
>

That would push it more, but you're still talking about short-lived
processes with a lowish footprint that might not hurt in an obvious
manner in a writeback situation.

A more reliable measure would be sysbench sized to the size of the container
I imagined.

> or some. When I tested the original patch for "avoiding writeback" by
> Dave Chinner, I saw 2 ooms in 10 tests.
> If not patched, I never see OOM.
>

My patches build on Dave's approach somewhat by waiting in lumpy reclaim
for the IO to happen and in the general case by batching the dirty pages
together for kswapd.

> > > Maybe an ideal solution will be
> > > - support buffered I/O tracking in I/O cgroup.
> > > - flusher threads should work with I/O cgroup.
> > > - memcg itself should support dirty ratio. and add a trigger to kick flusher
> > > threads for dirty pages in a memcg.
> > > But I know it's a long way.
> > >
> >
> > I'm not very familiar with memcg I'm afraid or its requirements so I am
> > having trouble guessing which of these would behave the best. You could take
> > a gamble on having memcg doing writeback in direct reclaim but you may run
> > into the same problem of overflowing stacks.
> >
>
> maybe.
>

Maybe it would be reasonable as a starting point but we'd have to be
very careful of the stack usage figures? I'm leaning towards this
approach to start with.

I'm preparing another release that takes my two most important patches
about reclaim but also reduces usage in page relcaim (a combination of
two previously released series). In combination, it might be ok for the
memcg paths to reclaim pages from a stack perspective although the IO
pattern might still blow.

> > I'm not sure how a flusher thread would work just within a cgroup. It
> > would have to do a lot of searching to find the pages it needs
> > considering that it's looking at inodes rather than pages.
> >
>
> yes. So, I(we) need some way for coloring inode for selectable writeback.
> But people in this area are very nervous about performance (me too ;), I've
> not found the answer yet.
>

I worry that too much targetting of writing back a specific inode would
have other consequences.

>
> > One possibility I guess would be to create a flusher-like thread if a direct
> > reclaimer finds that the dirty pages in the container are above the dirty
> > ratio. It would scan and clean all dirty pages in the container LRU on behalf
> > of dirty reclaimers.
> >
>
> Yes, that's possible. But Andrew recommends not to do add more threads. So,
> I'll use workqueue if necessary.
>

I also was not happy with adding more threads unless we had to. In a
sense, I preferred adding logic to kswapd that switched between
reclaiming for global and containers but too much of how it behaves
depends on "how many containers are there"

In this sense, I would lean more torwards letting containers write back
pages in reclaim and see what the stack usage looks like.

> > Another possibility would be to have kswapd work in containers.
> > Specifically, if wakeup_kswapd() is called with a cgroup that it's added
> > to a list. kswapd gives priority to global reclaim but would
> > occasionally check if there is a container that needs kswapd on a
> > pending list and if so, work within the container. Is there a good
> > reason why kswapd does not work within container groups?
> >
>
> One reason is node v.s. memcg.
> Because memcg doesn't limit memory placement, a container can contain pages
> from the all nodes. So,it's a bit problem which node's kswapd we should run .
> (but yes, maybe small problem.)

I would hope it's small. I would expect a correlation between containers
and the nodes they have access to.

> Another is memory-reclaim-prioirty between memcg.
> (I don't want to add such a knob...)
>
> Maybe it's time to consider about that.
> Now, we're using kswapd for softlimit. I think similar hints for kswapd
> should work. yes.
>
> > Finally, you could just allow reclaim within a memcg do writeback. Right
> > now, the check is based on current_is_kswapd() but I could create a helper
> > function that also checked for sc->mem_cgroup. Direct reclaim from the
> > page allocator never appears to work within a container group (which
> > raises questions in itself such as why a process in a container would
> > reclaim pages outside the container?) so it would remain safe.
> >
>
> isolate_lru_pages() for memcg finds only pages in a memcg ;)
>

Ok.

>
> > > How the new logic works with memcg ? Because memcg doesn't trigger kswapd,
> > > memcg has to wait for a flusher thread make pages clean ?
> >
> > Right now, memcg has to wait for a flusher thread to make pages clean.
> >
>
> ok.
>
>
> > > Or memcg should have kswapd-for-memcg ?
> > >
> > > Is it okay to call writeback directly when !scanning_global_lru() ?
> > > memcg's reclaim routine is only called from specific positions, so, I guess
> > > no stack problem.
> >
> > It's a judgement call from you really. I see that direct reclaimers do
> > not set mem_cgroup so it's down to - are you reasonably sure that all
> > the paths that reclaim based on a container are not deep?
>
> One concerns is add_to_page_cache(). If it's called in deep stack, my assumption
> is wrong.
>

But I wouldn't expect that in general to be very deep. Maybe I'm wrong.

> > I looked
> > around for a while and the bulk appeared to be in the fault path so I
> > would guess "yes" but as I'm not familiar with the memcg implementation
> > I'll have missed a lot.
> >
> > > But we just have I/O pattern problem.
> >
> > True.
> >
>
> Okay, I'll consider about how to kick kswapd via memcg or flusher-for-memcg.
> Please go ahead as you want. I love good I/O pattern, too.
>

For the moment, I'm strongly leaning towards allowing memcg to write
back pages. The IO pattern might not be great, but it would be in line
with current behaviour. The critical question is really "is it possible
to overflow the stack?".

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mel Gorman on 11 Jun 2010 08:40

On Thu, Jun 10, 2010 at 10:57:49PM -0700, Andrew Morton wrote:
> On Tue, 8 Jun 2010 10:02:19 +0100 Mel Gorman <mel(a)csn.ul.ie> wrote:
>
> > To summarise, there are two big problems with page reclaim right now. The
> > first is that page reclaim uses a_op->writepage to write a back back
> > under the page lock which is inefficient from an IO perspective due to
> > seeky patterns.
>
> No it isn't. If we have a pile of file-contiguous, disk-contiguous
> dirty pages on the tail of the LRU then the single writepage()s will
> work just fine due to request merging.
>

Ok, I was under the mistaken impression that filesystems wanted to be
given ranges of pages where possible. Considering that there has been no
reaction to the patch in question from the filesystem people cc'd, I'll
drop the problem for now.

>
>
> Look. This is getting very frustrating. I keep saying the same thing
> and keep getting ignored. Once more:
>
> WE BROKE IT!
>
> PLEASE STOP WRITING CODE!
>
> FIND OUT HOW WE BROKE IT!
>
> Loud enough yet?
>

Yep. I've started a new series of tests that capture the trace points
during each test to get some data on how many dirty pages are really
being written back. They takes a long time to complete unfortunately.

> It used to be the case that only very small amounts of IO occurred in
> page reclaim - the vast majority of writeback happened within
> write()->balance_dirty_pages(). Then (and I think it was around 2.6.12)
> we broke it, and page reclaim started doing lots of writeout.
>

Ok, I'll work out exactly how many dirty pages are being written back
then. The data I have at the moment covers the whole test, so I cannot
be certain if all the writeback happened during one stress test or
whether it's a comment event.

> So the thing to do is to either find out how we broke it and see if it
> can be repaired, or change the VM so that it doesn't do so much
> LRU-based writeout. Rather than fiddling around trying to make the
> we-broke-it code run its brokenness faster.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mel Gorman on 11 Jun 2010 14:20

On Fri, Jun 11, 2010 at 12:30:26PM -0400, Christoph Hellwig wrote:
> On Fri, Jun 11, 2010 at 01:33:20PM +0100, Mel Gorman wrote:
> > Ok, I was under the mistaken impression that filesystems wanted to be
> > given ranges of pages where possible. Considering that there has been no
> > reaction to the patch in question from the filesystem people cc'd, I'll
> > drop the problem for now.
>
> Yes, we'd prefer them if possible. Then again we'd really prefer to
> get as much I/O as possible from the flusher threads, and not kswapd.
>

Ok, for the moment I'll put it on the maybe pile and drop it from the
series. We can revisit if a use is found for it and we're happy that
there wasn't some other bug leaving dirty pages on the LRU for too long.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

| Next | Last
Pages: 1 2
Prev: [PROPOSAL - FIRST POST] NMI & register clash handling infrastructure
Next: Using "page credits" as a solution for common thrashing scenarios