From: KOSAKI Motohiro on
This week, I've been testing some IO-congested workloads, and I have
probably reproduced Andreas's issue.

So, I would like to explain how the current lumpy reclaim works and why
it sucks so much.


1. isolate_lru_pages() currently has the following pfn-neighbor grabbing logic.

	for (; pfn < end_pfn; pfn++) {
		(snip)
		if (__isolate_lru_page(cursor_page, mode, file) == 0) {
			list_move(&cursor_page->lru, dst);
			mem_cgroup_del_lru(cursor_page);
			nr_taken++;
			nr_lumpy_taken++;
			if (PageDirty(cursor_page))
				nr_lumpy_dirty++;
			scan++;
		} else {
			if (mode == ISOLATE_BOTH &&
			    page_count(cursor_page))
				nr_lumpy_failed++;
		}
	}

__isolate_lru_page() failure is mainly caused by one of the following reasons:
(1) the page has already been freed and is in the buddy allocator.
(2) the page is used for some non-user-process purpose.
(3) the page is unevictable (e.g. mlocked).

Cases (2) and (3) have a very different characteristic from (1). Lumpy
reclaim means 'contiguous physical memory reclaiming'. That is, if we are
attempting an order-9 reclaim, reclaiming 512 pages and reclaiming 511
pages are completely different: the former means lumpy reclaim succeeded,
the latter means it failed. So, once (2) or (3) occurs, that pfn range has
lost any possibility of lumpy reclaim success. We should then stop the
pfn-neighbor search immediately and move on to the next LRU page (i.e. we
should use a 'break' statement instead of 'continue').
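
A minimal sketch of the change I mean (untested, illustration only):

 		} else {
 			if (mode == ISOLATE_BOTH &&
 			    page_count(cursor_page))
 				nr_lumpy_failed++;
+			/*
+			 * contiguity is already lost, so give up on
+			 * this pfn range and go back to the LRU scan
+			 */
+			break;
 		}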

2. The synchronous lumpy reclaim condition is insane.

Currently, synchronous lumpy reclaim is invoked under the following
condition:

	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
			sc->lumpy_reclaim_mode) {

but "nr_reclaimed < nr_taken" is pretty stupid. if isolated pages have
much dirty pages, pageout() only issue first 113 IOs.
(if io queue have >113 requests, bdi_write_congested() return true and
may_write_to_queue() return false)

So, we haven't call ->writepage(), congestion_wait() and wait_on_page_writeback()
are surely stupid.
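
(The 113 comes from the queue congestion threshold: with the default
nr_requests of 128, the queue is marked congested at 128 - 128/8 + 1 = 113
requests. From block/blk-core.c of this era, quoted from memory:)

	static inline int queue_congestion_on_threshold(struct request_queue *q)
	{
		return q->nr_requests - (q->nr_requests / 8) + 1;
	}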


3. pageout() is intended to be an asynchronous API, but it doesn't work that way.

pageout() calls ->writepage with wbc->nonblocking=1, because if the system
has the default vm.dirty_ratio (i.e. 20), we have 80% clean memory;
getting stuck on one page is stupid, and we should scan as many pages as
possible, as quickly as possible.

HOWEVER, the block layer ignores this argument. If a slow USB memory
device is connected to the system, ->writepage() can sleep for a long
time, because submit_bio() calls get_request_wait() unconditionally, and
there is no bonus there for PF_MEMALLOC tasks.
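
(For reference, pageout() sets up its writeback_control roughly like this
in mm/vmscan.c of this era; quoted from memory, so treat it as a sketch:)

	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,
		.nr_to_write	= SWAP_CLUSTER_MAX,
		.range_start	= 0,
		.range_end	= LLONG_MAX,
		.nonblocking	= 1,	/* ignored once submit_bio()
					 * blocks in get_request_wait() */
		.for_reclaim	= 1,
	};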


4. Synchronous lumpy reclaim calls clear_active_flags(), but that is also silly.

page_check_references() now ignores the pte young bit while we are
processing lumpy reclaim. So in almost all cases, PageActive() means "the
swap device is full". Therefore, waiting for IO and retrying pageout()
are just silly.


In Andreas's case, congestion_wait() and get_request_wait() are the root
cause. The other issues become problematic for higher-order lumpy reclaim.


Now I'm preparing some patches, and I can probably send them tomorrow.


From: Mel Gorman on
On Wed, Jul 28, 2010 at 08:40:21PM +0900, KOSAKI Motohiro wrote:
> This week, I've been testing some IO-congested workloads, and I have
> probably reproduced Andreas's issue.
>
> So, I would like to explain how the current lumpy reclaim works and
> why it sucks so much.
>
>
> 1. isolate_lru_pages() currently has the following pfn-neighbor
> grabbing logic.
>
> 	for (; pfn < end_pfn; pfn++) {
> 		(snip)
> 		if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> 			list_move(&cursor_page->lru, dst);
> 			mem_cgroup_del_lru(cursor_page);
> 			nr_taken++;
> 			nr_lumpy_taken++;
> 			if (PageDirty(cursor_page))
> 				nr_lumpy_dirty++;
> 			scan++;
> 		} else {
> 			if (mode == ISOLATE_BOTH &&
> 			    page_count(cursor_page))
> 				nr_lumpy_failed++;
> 		}
> 	}
>
> __isolate_lru_page() failure is mainly caused by one of the following
> reasons:
> (1) the page has already been freed and is in the buddy allocator.
> (2) the page is used for some non-user-process purpose.
> (3) the page is unevictable (e.g. mlocked).
>
> Cases (2) and (3) have a very different characteristic from (1). Lumpy
> reclaim means 'contiguous physical memory reclaiming'. That is, if we
> are attempting an order-9 reclaim, reclaiming 512 pages and reclaiming
> 511 pages are completely different.

Yep, and this can occur quite regularly. Judging from the ftrace
results, contig_failed is frequently positive, although whether this is
due to the page being about to be freed or due to (2), I don't know.

> The former means lumpy reclaim succeeded, the latter means it failed.
> So, once (2) or (3) occurs, that pfn range has lost any possibility of
> lumpy reclaim success. We should then stop the pfn-neighbor search
> immediately and move on to the next LRU page (i.e. we should use a
> 'break' statement instead of 'continue').
>

Easy enough to do.

> 2. The synchronous lumpy reclaim condition is insane.
>
> Currently, synchronous lumpy reclaim is invoked under the following
> condition:
>
> 	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> 			sc->lumpy_reclaim_mode) {
>
> But "nr_reclaimed < nr_taken" is pretty stupid. If the isolated pages
> contain many dirty pages, pageout() only issues the first 113 IOs
> (once the IO queue has >113 requests, bdi_write_congested() returns
> true and may_write_to_queue() returns false).
>
> So when we haven't even called ->writepage(), congestion_wait() and
> wait_on_page_writeback() are surely stupid.
>

This is somewhat intentional though. See the comment

	/*
	 * Synchronous reclaim is performed in two passes,
	 * first an asynchronous pass over the list to
	 * start parallel writeback, and a second synchronous
	 * pass to wait for the IO to complete......

If not all pages on the list were reclaimed, it means that some of them
were dirty, but most should now be queued for writeback (possibly not
all, if congested). The intention is to loop a second time, waiting for
that writeback to complete before continuing on.
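
The second pass looks roughly like this in shrink_inactive_list()
(2.6.35-era mm/vmscan.c; quoted from memory, so treat it as a sketch):

	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
			sc->lumpy_reclaim_mode) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/*
		 * The attempt at page out may have made some
		 * of the pages active, mark them inactive again.
		 */
		nr_active = clear_active_flags(&page_list, NULL);
		count_vm_events(PGDEACTIVATE, nr_active);

		nr_reclaimed += shrink_page_list(&page_list, sc,
						PAGEOUT_IO_SYNC);
	}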

> 3. pageout() is intended to be an asynchronous API, but it doesn't
> work that way.
>
> pageout() calls ->writepage with wbc->nonblocking=1, because if the
> system has the default vm.dirty_ratio (i.e. 20), we have 80% clean
> memory; getting stuck on one page is stupid, and we should scan as
> many pages as possible, as quickly as possible.
>
> HOWEVER, the block layer ignores this argument. If a slow USB memory
> device is connected to the system, ->writepage() can sleep for a long
> time, because submit_bio() calls get_request_wait() unconditionally,
> and there is no bonus there for PF_MEMALLOC tasks.
>

Is this not a problem in the writeback layer rather than pageout()
specifically?

>
> 4. Synchronous lumpy reclaim calls clear_active_flags(), but that is
> also silly.
>
> page_check_references() now ignores the pte young bit while we are
> processing lumpy reclaim. So in almost all cases, PageActive() means
> "the swap device is full". Therefore, waiting for IO and retrying
> pageout() are just silly.
>

try_to_unmap() also obeys reference bits. If you remove the call to
clear_active_flags(), then pageout should pass TTU_IGNORE_ACCESS to
try_to_unmap(). I had a patch to do this, but it didn't improve
high-order allocation success rates any, so I dropped it.

> In Andreas's case, congestion_wait() and get_request_wait() are the
> root cause. The other issues become problematic for higher-order
> lumpy reclaim.
>
> Now I'm preparing some patches, and I can probably send them tomorrow.
>

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
From: Andrew Morton on
On Wed, 28 Jul 2010 20:40:21 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> 3. pageout() is intended to be an asynchronous API, but it doesn't
> work that way.
>
> pageout() calls ->writepage with wbc->nonblocking=1, because if the
> system has the default vm.dirty_ratio (i.e. 20), we have 80% clean
> memory; getting stuck on one page is stupid, and we should scan as
> many pages as possible, as quickly as possible.
>
> HOWEVER, the block layer ignores this argument. If a slow USB memory
> device is connected to the system, ->writepage() can sleep for a long
> time, because submit_bio() calls get_request_wait() unconditionally,
> and there is no bonus there for PF_MEMALLOC tasks.

The idea is that vmscan doesn't call ->writepage if the underlying
queue is congested: may_write_to_queue()->bdi_queue_congested() should
return false, and we skip the write.

If that logic is broken, then that would explain a few things...
From: KOSAKI Motohiro on
> On Wed, 28 Jul 2010 20:40:21 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
>
> > 3. pageout() is intended to be an asynchronous API, but it doesn't
> > work that way.
> >
> > pageout() calls ->writepage with wbc->nonblocking=1, because if the
> > system has the default vm.dirty_ratio (i.e. 20), we have 80% clean
> > memory; getting stuck on one page is stupid, and we should scan as
> > many pages as possible, as quickly as possible.
> >
> > HOWEVER, the block layer ignores this argument. If a slow USB memory
> > device is connected to the system, ->writepage() can sleep for a
> > long time, because submit_bio() calls get_request_wait()
> > unconditionally, and there is no bonus there for PF_MEMALLOC tasks.
>
> The idea is that vmscan doesn't call ->writepage if the underlying
> queue is congested: may_write_to_queue()->bdi_queue_congested() should
> return false, and we skip the write.
>
> If that logic is broken, then that would explain a few things...

We already have that check in may_write_to_queue(), but kswapd and zone
reclaim have PF_SWAPWRITE and therefore ignore queue congestion. (btw, I
believe zone reclaim shouldn't use PF_SWAPWRITE.) So kswapd gets stuck in
get_request_wait() frequently.
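
(For reference, the check looks roughly like this in 2.6.35-era
mm/vmscan.c; quoted from memory, so treat it as a sketch:)

	static int may_write_to_queue(struct backing_dev_info *bdi)
	{
		if (current->flags & PF_SWAPWRITE)
			return 1;	/* kswapd and zone reclaim bypass
					 * the congestion check entirely */
		if (!bdi_write_congested(bdi))
			return 1;
		if (bdi == current->backing_dev_info)
			return 1;
		return 0;
	}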

The following commit explains why kswapd has to ignore queue congestion...

commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
Author: akpm <akpm>
Date:   Sun Dec 22 01:07:33 2002 +0000

    [PATCH] Give kswapd writeback higher priority than pdflush

    The `low latency page reclaim' design works by preventing page
    allocators from blocking on request queues (and by preventing them
    from blocking against writeback of individual pages, but that is
    immaterial here).

    This has a problem under some situations.  pdflush (or a write(2)
    caller) could be saturating the queue with highmem pages.  This
    prevents anyone from writing back ZONE_NORMAL pages.  We end up
    doing enormous amounts of scanning.

And the following commit introduced a hard limit on the IO queue and
changed vmscan writeout behavior a lot, if my understanding is correct.


commit 082cf69eb82681f4eacb3a5653834c7970714bef
Author: Jens Axboe <axboe@suse.de>
Date:   Tue Jun 28 16:35:11 2005 +0200

    [PATCH] ll_rw_blk: prevent huge request allocations

    Currently we cap request allocations at q->nr_requests, but we allow
    a batching io context to allocate up to 32 more (default setting).
    This can flood the queue with request allocations, with only a few
    batching processes. The real fix would be to limit the number of
    batchers, but as that isn't currently tracked, I suggest we just cap
    the maximum number of allocated requests to eg 50% over the limit.

    This was observed in real life, users typically see this as vmstat
    bo numbers going off the wall with seconds of no queueing afterwards.
    Behaviour this bursty is not beneficial.

    Signed-off-by: Jens Axboe <axboe@suse.de>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff --git a/drivers/block/ll_rw_blk.c b/drivers/block/ll_rw_blk.c
index 234fdcf..6c98cf0 100644
--- a/drivers/block/ll_rw_blk.c
+++ b/drivers/block/ll_rw_blk.c
@@ -1912,6 +1912,15 @@ static struct request *get_request(request_queue_t *q, int rw, struct bio *bio,
 	}
 
 get_rq:
+	/*
+	 * Only allow batching queuers to allocate up to 50% over the defined
+	 * limit of requests, otherwise we could have thousands of requests
+	 * allocated with any setting of ->nr_requests
+	 */
+	if (rl->count[rw] >= (3 * q->nr_requests / 2)) {
+		spin_unlock_irq(q->queue_lock);
+		goto out;
+	}
 	rl->count[rw]++;
 	rl->starved[rw] = 0;
 	if (rl->count[rw] >= queue_congestion_on_threshold(q))



So, I think we still have the highmem issue, and therefore I think kswapd
writeback still needs higher priority than the flusher. Am I missing
something?



From: KOSAKI Motohiro on
> On Wed, Jul 28, 2010 at 08:40:21PM +0900, KOSAKI Motohiro wrote:
> > This week, I've been testing some IO-congested workloads, and I
> > have probably reproduced Andreas's issue.
> >
> > So, I would like to explain how the current lumpy reclaim works and
> > why it sucks so much.
> >
> >
> > 1. isolate_lru_pages() currently has the following pfn-neighbor
> > grabbing logic.
> >
> > 	for (; pfn < end_pfn; pfn++) {
> > 		(snip)
> > 		if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > 			list_move(&cursor_page->lru, dst);
> > 			mem_cgroup_del_lru(cursor_page);
> > 			nr_taken++;
> > 			nr_lumpy_taken++;
> > 			if (PageDirty(cursor_page))
> > 				nr_lumpy_dirty++;
> > 			scan++;
> > 		} else {
> > 			if (mode == ISOLATE_BOTH &&
> > 			    page_count(cursor_page))
> > 				nr_lumpy_failed++;
> > 		}
> > 	}
> >
> > __isolate_lru_page() failure is mainly caused by one of the
> > following reasons:
> > (1) the page has already been freed and is in the buddy allocator.
> > (2) the page is used for some non-user-process purpose.
> > (3) the page is unevictable (e.g. mlocked).
> >
> > Cases (2) and (3) have a very different characteristic from (1).
> > Lumpy reclaim means 'contiguous physical memory reclaiming'. That
> > is, if we are attempting an order-9 reclaim, reclaiming 512 pages
> > and reclaiming 511 pages are completely different.
>
> Yep, and this can occur quite regularly. Judging from the ftrace
> results, contig_failed is frequently positive, although whether this
> is due to the page being about to be freed or due to (2), I don't
> know.
>
> > The former means lumpy reclaim succeeded, the latter means it
> > failed. So, once (2) or (3) occurs, that pfn range has lost any
> > possibility of lumpy reclaim success. We should then stop the
> > pfn-neighbor search immediately and move on to the next LRU page
> > (i.e. we should use a 'break' statement instead of 'continue').
> >
>
> Easy enough to do.

Yup.


> > 2. The synchronous lumpy reclaim condition is insane.
> >
> > Currently, synchronous lumpy reclaim is invoked under the following
> > condition:
> >
> > 	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> > 			sc->lumpy_reclaim_mode) {
> >
> > But "nr_reclaimed < nr_taken" is pretty stupid. If the isolated
> > pages contain many dirty pages, pageout() only issues the first 113
> > IOs (once the IO queue has >113 requests, bdi_write_congested()
> > returns true and may_write_to_queue() returns false).
> >
> > So when we haven't even called ->writepage(), congestion_wait() and
> > wait_on_page_writeback() are surely stupid.
> >
>
> This is somewhat intentional though. See the comment
>
> 	/*
> 	 * Synchronous reclaim is performed in two passes,
> 	 * first an asynchronous pass over the list to
> 	 * start parallel writeback, and a second synchronous
> 	 * pass to wait for the IO to complete......
>
> If not all pages on the list were reclaimed, it means that some of
> them were dirty, but most should now be queued for writeback (possibly
> not all, if congested). The intention is to loop a second time,
> waiting for that writeback to complete before continuing on.

May I explain a bit more? Generally, whether retrying is worthwhile
depends on the success ratio. Currently, shrink_page_list() can't free a
page in the following situations:

1. trylock_page() fails
2. the page is unevictable
3. this is zone reclaim and the page is mapped
4. PageWriteback() is true and this is not synchronous lumpy reclaim
5. the page is swap-backed and swap is full
6. add_to_swap() fails (note: this fails more often than you might
   expect, because it uses GFP_NOMEMALLOC)
7. the page is dirty and the gfp mask doesn't have __GFP_IO / __GFP_FS
8. the page is pinned
9. the IO queue is congested
10. pageout() started IO, but it has not finished yet

So, (4) and (10) are perfectly good reasons to wait. (1) and (8) might be
solved by sleeping a while, but they are unrelated to IO congestion, and
they might not be solved at all; that only works by luck, and I don't
like to depend on luck. (9) can be solved by waiting for IO, but
congestion_wait() is NOT the correct wait: congestion_wait() means "sleep
until one or more block devices in the system are no longer congested".
That said, if the system has two or more disks, congestion_wait() doesn't
work well for the synchronous-lumpy-reclaim purpose. (btw, desktop users
often use USB storage devices.) (2), (3), (5), (6) and (7) can't be
solved by waiting at all. It's just silly.

On the other hand, synchronous lumpy reclaim works fine in the following
scenario:

1. shrink_page_list(PAGEOUT_IO_ASYNC) is called
2. pageout() kicks off IO
3. we wait via wait_on_page_writeback()
4. the application touches the page again, and the page becomes dirty
   again
5. the IO finishes and wakes up the reclaiming thread
6. pageout() is called again
7. wait_on_page_writeback() is called again
8. OK, the high-order reclaim succeeded

So, I'd like to narrow the conditions for invoking synchronous lumpy
reclaim.
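
As a rough sketch of the direction (hypothetical: the nr_writeback
counter below does not exist and would have to be reported back by
shrink_page_list()):

	/*
	 * Only fall back to synchronous lumpy reclaim when the async
	 * pass actually left pages under writeback, i.e. cases (4)
	 * and (10) above. A plain "nr_reclaimed < nr_taken" also
	 * fires for unevictable, pinned, swap-full etc. pages, where
	 * waiting cannot help.
	 */
	if (nr_writeback > 0 && nr_reclaimed < nr_taken &&
			!current_is_kswapd() && sc->lumpy_reclaim_mode) {
		...
	}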


>
> > 3. pageout() is intended to be an asynchronous API, but it doesn't
> > work that way.
> >
> > pageout() calls ->writepage with wbc->nonblocking=1, because if the
> > system has the default vm.dirty_ratio (i.e. 20), we have 80% clean
> > memory; getting stuck on one page is stupid, and we should scan as
> > many pages as possible, as quickly as possible.
> >
> > HOWEVER, the block layer ignores this argument. If a slow USB memory
> > device is connected to the system, ->writepage() can sleep for a
> > long time, because submit_bio() calls get_request_wait()
> > unconditionally, and there is no bonus there for PF_MEMALLOC tasks.
>
> Is this not a problem in the writeback layer rather than pageout()
> specifically?

Well, outside pageout(), probably only XFS does PF_MEMALLOC writeout,
because PF_MEMALLOC is enabled in only very limited situations. But I
don't know the XFS details at all, so I can't speak to that area...




> > 4. Synchronous lumpy reclaim calls clear_active_flags(), but that
> > is also silly.
> >
> > page_check_references() now ignores the pte young bit while we are
> > processing lumpy reclaim. So in almost all cases, PageActive() means
> > "the swap device is full". Therefore, waiting for IO and retrying
> > pageout() are just silly.
> >
>
> try_to_unmap() also obeys reference bits. If you remove the call to
> clear_active_flags(), then pageout should pass TTU_IGNORE_ACCESS to
> try_to_unmap(). I had a patch to do this, but it didn't improve
> high-order allocation success rates any, so I dropped it.

I think this is an unrelated issue. Actually, page_referenced() is
called before try_to_unmap(), and page_referenced() drops the pte young
bit. This logic has a very narrow race, but I don't think it is a big
matter in practice.

And, as I said, PageActive() means a retry is not meaningful; a full
swap device usually doesn't clear up even if we wait a while.


Thanks.
