From: Christoph Hellwig on
On Tue, Jun 08, 2010 at 10:02:19AM +0100, Mel Gorman wrote:
> seeky patterns. The second is that direct reclaim calling the filesystem
> splices two potentially deep call paths together and potentially overflows
> the stack on complex storage or filesystems. This series is an early draft
> at tackling both of these problems and is in three stages.

Btw, one more thing came up when I discussed the issue again with Dave
recently:

- we also need to care about ->releasepage. At least for XFS it
can end up in the same deep allocator chain as ->writepage because
it does all the extent state conversions, even if it doesn't
start I/O. I haven't yet managed to decode the ext4/btrfs codepaths
for ->releasepage to figure out how they release a page that
covers a delayed-allocated or unwritten range.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: KAMEZAWA Hiroyuki on
On Tue, 8 Jun 2010 10:02:19 +0100
Mel Gorman <mel(a)csn.ul.ie> wrote:

> I finally got a chance last week to visit the topic of direct reclaim
> avoiding writing out pages. As it came up during discussions the last
> time, I also had a stab at making the VM write ranges of pages instead
> of individual pages. I am not proposing this for merging yet; I want to see
> what people think of this general direction and whether we can agree that it
> is the right one or not.
>
> To summarise, there are two big problems with page reclaim right now. The
> first is that page reclaim uses a_ops->writepage to write a page back
> under the page lock which is inefficient from an IO perspective due to
> seeky patterns. The second is that direct reclaim calling the filesystem
> splices two potentially deep call paths together and potentially overflows
> the stack on complex storage or filesystems. This series is an early draft
> at tackling both of these problems and is in three stages.
>
> The first 4 patches are a forward-port of trace points that are partly
> based on trace points defined by Larry Woodman but never merged. They trace
> parts of kswapd, direct reclaim, LRU page isolation and page writeback. The
> tracepoints can be used to evaluate what is happening within reclaim and
> whether things are getting better or worse. They do not have to be part of
> the final series but might be useful during discussion.
>
> Patch 5 writes out contiguous ranges of pages where possible using
> a_ops->writepages. When writing a range, the inode is pinned and the page
> lock released before submitting to writepages(). This potentially generates
> a better IO pattern and it should avoid a lock inversion problem within the
> filesystem that wants the same page lock held by the VM. The downside of
> writing ranges is that the VM may generate more IO than necessary.
>
> Patch 6 prevents direct reclaim from writing out pages at all; instead, dirty
> pages are put back on the LRU. For lumpy reclaim, the caller will briefly
> wait on dirty pages to be written out before trying to reclaim the dirty
> pages a second time.
>
> The last patch increases the responsibility of kswapd somewhat because
> it's now cleaning pages on behalf of direct reclaimers but kswapd seemed
> a better fit than background flushers to clean pages as it knows where the
> pages needing cleaning are. As it's async IO, it should not cause kswapd to
> stall (at least until the queue is congested) but the order that pages are
> reclaimed on the LRU is altered. Dirty pages that would have been reclaimed
> by direct reclaimers are getting another lap on the LRU. The dirty pages
> could have been put on a dedicated list but this increased counter overhead
> and the number of lists and it is unclear if it is necessary.
>
> The series has survived performance and stress testing, particularly around
> high-order allocations on X86, X86-64 and PPC64. The results of the tests
> showed that while lumpy reclaim had a slightly lower success rate when
> allocating huge pages, the rates were still very acceptable, reclaim was
> a lot less disruptive and allocation latency was lower.
>
> Comments?
>

My concern is how memcg should work. IOW, what changes will be necessary for
memcg to work with the new no-direct-writeback vmscan logic.

Maybe an ideal solution will be
- support buffered I/O tracking in I/O cgroup.
- flusher threads should work with I/O cgroup.
- memcg itself should support a dirty ratio, and add a trigger to kick flusher
threads for dirty pages in a memcg.
But I know it's a long way.

How does the new logic work with memcg? Because memcg doesn't trigger kswapd,
does memcg have to wait for a flusher thread to make pages clean?
Or should memcg have a kswapd-for-memcg?

Is it okay to call writeback directly when !scanning_global_lru()?
memcg's reclaim routine is only called from specific places, so I guess
there is no stack problem. But we still have the I/O pattern problem.

Thanks,
-Kame

From: KAMEZAWA Hiroyuki on
On Wed, 9 Jun 2010 10:52:00 +0100
Mel Gorman <mel(a)csn.ul.ie> wrote:

> On Wed, Jun 09, 2010 at 11:52:11AM +0900, KAMEZAWA Hiroyuki wrote:

> > > <SNIP>
> >
> > My concern is how memcg should work. IOW, what changes will be necessary for
> > memcg to work with the new no-direct-writeback vmscan logic.
> >
>
> At worst, memcg waits on background flushers to clean their pages but
> obviously this could lead to stalls in containers if it happened to be full
> of dirty pages.
>
yes.

> Do you have test scenarios already setup for functional and performance
> regression testing of containers? If so, can you run tests with this series
> and see what sort of impact you find? I haven't done performance testing
> with containers to date so I don't know what the expected values are.
>
Maybe kernbench is enough. I think it does enough writes and mallocs.
The 'limit' size for the test depends on your host. I sometimes do this on
an 8cpu SMP box.

# mount -t cgroup none /cgroups -o memory
# mkdir /cgroups/A
# echo $$ > /cgroups/A/tasks
# echo 300M > /cgroups/A/memory.limit_in_bytes
# make -j 8 or make -j 16

Comparing swap usage and speed will be interesting.
(The 300M above is small enough because my test machine has 24G of memory.)

Or
# mount -t cgroup none /cgroups -o memory
# mkdir /cgroups/A
# echo $$ > /cgroups/A/tasks
# echo 50M > /cgroups/A/memory.limit_in_bytes
# dd if=/dev/zero of=./tmpfile bs=65536 count=100000

or something like that. When I tested the original patch for "avoiding writeback" by
Dave Chinner, I saw 2 OOMs in 10 tests.
Unpatched, I never saw an OOM.



> > Maybe an ideal solution will be
> > - support buffered I/O tracking in I/O cgroup.
> > - flusher threads should work with I/O cgroup.
> > - memcg itself should support a dirty ratio, and add a trigger to kick flusher
> > threads for dirty pages in a memcg.
> > But I know it's a long way.
> >
>
> I'm not very familiar with memcg I'm afraid or its requirements so I am
> having trouble guessing which of these would behave the best. You could take
> a gamble on having memcg doing writeback in direct reclaim but you may run
> into the same problem of overflowing stacks.
>
maybe.

> I'm not sure how a flusher thread would work just within a cgroup. It
> would have to do a lot of searching to find the pages it needs
> considering that it's looking at inodes rather than pages.
>
yes. So, I (we) need some way of coloring inodes for selectable writeback.
But people in this area are very nervous about performance (me too ;), and I've
not found the answer yet.


> One possibility I guess would be to create a flusher-like thread if a direct
> reclaimer finds that the dirty pages in the container are above the dirty
> ratio. It would scan and clean all dirty pages in the container LRU on behalf
> of direct reclaimers.
>
Yes, that's possible. But Andrew recommends not adding more threads. So,
I'll use a workqueue if necessary.

> Another possibility would be to have kswapd work in containers.
> Specifically, if wakeup_kswapd() is called with a cgroup that it's added
> to a list. kswapd gives priority to global reclaim but would
> occasionally check if there is a container that needs kswapd on a
> pending list and if so, work within the container. Is there a good
> reason why kswapd does not work within container groups?
>
One reason is node vs. memcg.
Because memcg doesn't limit memory placement, a container can contain pages
from all nodes. So, it's a bit of a problem which node's kswapd we should run.
(but yes, maybe a small problem.)
Another is memory-reclaim-priority between memcgs.
(I don't want to add such a knob...)

Maybe it's time to consider that.
Now, we're using kswapd for the softlimit. I think similar hints for kswapd
should work, yes.

> Finally, you could just allow reclaim within a memcg to do writeback. Right
> now, the check is based on current_is_kswapd() but I could create a helper
> function that also checked for sc->mem_cgroup. Direct reclaim from the
> page allocator never appears to work within a container group (which
> raises questions in itself such as why a process in a container would
> reclaim pages outside the container?) so it would remain safe.
>
isolate_lru_pages() for memcg finds only pages in a memcg ;)


> > How does the new logic work with memcg? Because memcg doesn't trigger kswapd,
> > does memcg have to wait for a flusher thread to make pages clean?
>
> Right now, memcg has to wait for a flusher thread to make pages clean.
>
ok.


> > Or should memcg have a kswapd-for-memcg?
> >
> > Is it okay to call writeback directly when !scanning_global_lru()?
> > memcg's reclaim routine is only called from specific places, so I guess
> > there is no stack problem.
>
> It's a judgement call from you really. I see that direct reclaimers do
> not set mem_cgroup so it's down to - are you reasonably sure that all
> the paths that reclaim based on a container are not deep?

One concern is add_to_page_cache(). If it's called from a deep stack, my assumption
is wrong.

> I looked
> around for a while and the bulk appeared to be in the fault path so I
> would guess "yes" but as I'm not familiar with the memcg implementation
> I'll have missed a lot.
>
> > But we still have the I/O pattern problem.
>
> True.
>

Okay, I'll consider how to kick kswapd via memcg or a flusher-for-memcg.
Please go ahead as you want. I love a good I/O pattern, too.

Thanks,
-Kame

From: KAMEZAWA Hiroyuki on
On Thu, 10 Jun 2010 02:10:35 +0100
Mel Gorman <mel(a)csn.ul.ie> wrote:
> > # mount -t cgroup none /cgroups -o memory
> > # mkdir /cgroups/A
> > # echo $$ > /cgroups/A/tasks
> > # echo 300M > /cgroups/A/memory.limit_in_bytes
> > # make -j 8 or make -j 16
> >
>
> That sort of scenario would be barely pushed by kernbench. For a single
> kernel build, it's about 250-400M depending on the .config but it's still
> a bit unreliable. Critically, it's not the sort of workload that would have
> lots of long-lived mappings that would hurt a workload a lot if it was being
> paged out.

You're right. My excuse is that my concern is usually the amount of
swap-out and OOM under rapid/heavy pressure, because those are easily visible to
users. So, I use a short-lived process test.

> Maybe it would be reasonable as a starting point but we'd have to be
> very careful of the stack usage figures? I'm leaning towards this
> approach to start with.
>
> I'm preparing another release that takes my two most important patches
> about reclaim but also reduces stack usage in page reclaim (a combination of
> two previously released series). In combination, it might be ok for the
> memcg paths to reclaim pages from a stack perspective although the IO
> pattern might still blow.

sounds nice.

> > > I'm not sure how a flusher thread would work just within a cgroup. It
> > > would have to do a lot of searching to find the pages it needs
> > > considering that it's looking at inodes rather than pages.
> > >
> >
> > yes. So, I (we) need some way of coloring inodes for selectable writeback.
> > But people in this area are very nervous about performance (me too ;), and I've
> > not found the answer yet.
> >
>
> I worry that too much targeting of writing back a specific inode would
> have other consequences.

I personally think this (writeback scheduling) is a job for the I/O cgroup.
So, I guess what memcg can do is dirty-ratio limiting, at most. The user has to
set a well-balanced combination of memory+I/O cgroups.
Sorry for mixing up the story.


> > Okay, I'll consider how to kick kswapd via memcg or a flusher-for-memcg.
> > Please go ahead as you want. I love a good I/O pattern, too.
> >
>
> For the moment, I'm strongly leaning towards allowing memcg to write
> back pages. The IO pattern might not be great, but it would be in line
> with current behaviour. The critical question is really "is it possible
> to overflow the stack?".
>

Because I don't use XFS, I don't have a reliable answer now. But, at least,
memcg's memory reclaim will never be called on top of do_select(), which
uses 1000 bytes of stack.

We have to consider a long-term fix for I/O patterns under memcg, but
please do the global-reclaim update first. We did it that way when splitting
the LRU into ANON and FILE. I don't want memcg to be a burden on making
vmscan.c better.

Thanks,
-Kame

From: Andrew Morton on
On Tue, 8 Jun 2010 10:02:19 +0100 Mel Gorman <mel(a)csn.ul.ie> wrote:

> To summarise, there are two big problems with page reclaim right now. The
> first is that page reclaim uses a_ops->writepage to write a page back
> under the page lock which is inefficient from an IO perspective due to
> seeky patterns.

No it isn't. If we have a pile of file-contiguous, disk-contiguous
dirty pages on the tail of the LRU then the single writepage()s will
work just fine due to request merging.



Look. This is getting very frustrating. I keep saying the same thing
and keep getting ignored. Once more:

WE BROKE IT!

PLEASE STOP WRITING CODE!

FIND OUT HOW WE BROKE IT!

Loud enough yet?

It used to be the case that only very small amounts of IO occurred in
page reclaim - the vast majority of writeback happened within
write()->balance_dirty_pages(). Then (and I think it was around 2.6.12)
we broke it, and page reclaim started doing lots of writeout.

So the thing to do is to either find out how we broke it and see if it
can be repaired, or change the VM so that it doesn't do so much
LRU-based writeout. Rather than fiddling around trying to make the
we-broke-it code run its brokenness faster.