From: KOSAKI Motohiro on
Hi Trond,

> There is that, and then there are issues with the VM simply lying to the
> filesystems.
>
> See https://bugzilla.kernel.org/show_bug.cgi?id=16056
>
> Which basically boils down to the following: kswapd tells the filesystem
> that it is quite safe to do GFP_KERNEL allocations in pageouts and as
> part of try_to_release_page().
>
> In the case of pageouts, it does set the 'WB_SYNC_NONE', 'nonblocking'
> and 'for_reclaim' flags in the writeback_control struct, and so the
> filesystem has at least some hint that it should do non-blocking i/o.
>
> However if you trust the GFP_KERNEL flag in try_to_release_page() then
> the kernel can and will deadlock, and so I had to add in a hack
> specifically to tell the NFS client not to trust that flag if it comes
> from kswapd.

Could you please elaborate on your issue a bit more? The vmscan logic is, briefly:

if (PageDirty(page))
        pageout(page);
if (page_has_private(page))
        try_to_release_page(page, sc->gfp_mask);

So I'm interested in why NFS needs to write back again in ->releasepage() even
though pageout() already called ->writepage() and that was successful.

In other words, the gfp_mask argument of try_to_release_page() is expected to be
passed down to the kmalloc()/alloc_page() family, and the page allocator already
takes care of the PF_MEMALLOC flag.
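
(For reference, this is my understanding of the allocator side, paraphrased
from mm/page_alloc.c from memory, so the exact code may differ a little: a
PF_MEMALLOC caller gets ALLOC_NO_WATERMARKS, so its allocations may dip below
the watermarks instead of recursing into reclaim.)

        /* Paraphrased sketch of gfp_to_alloc_flags(), mm/page_alloc.c */
        static inline int gfp_to_alloc_flags(gfp_t gfp_mask)
        {
                int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;

                /* ... __GFP_HIGH and !__GFP_WAIT handling omitted ... */

                if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
                        /*
                         * PF_MEMALLOC callers (kswapd, tasks already doing
                         * direct reclaim) may ignore the watermarks rather
                         * than recurse into reclaim.
                         */
                        if (!in_interrupt() &&
                            ((current->flags & PF_MEMALLOC) ||
                             unlikely(test_thread_flag(TIF_MEMDIE))))
                                alloc_flags |= ALLOC_NO_WATERMARKS;
                }
                return alloc_flags;
        }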

So my question is: what additional work do you want from the VM folks?
Could you please share the NFS design and tell us what we should do?


By the way, another question: recently, Xiaotian Feng posted the "swap over nfs -v21"
patch series, which has a new memory reservation framework. Does that help you?




From: KOSAKI Motohiro on
Hi

> The problem that I am seeing is that the try_to_release_page() needs to
> be told to act as a non-blocking call when the process is kswapd, just
> like the pageout() call.
>
> Currently, the sc->gfp_mask is set to GFP_KERNEL, which normally means
> that the call may wait on I/O to complete. However, what I'm seeing in
> the bugzilla above is that if kswapd waits on an RPC call, then the
> whole VM may gum up: typically, the traces show that the socket layer
> cannot allocate memory to hold the RPC reply from the server, and so it
> is kicking kswapd to have it reclaim some pages, however kswapd is stuck
> in try_to_release_page() waiting for that same I/O to complete, hence
> the deadlock...

Ah, I see. So, as far as I understand, you mean:
- The socket layer uses GFP_ATOMIC, so it never calls try_to_free_pages().
  IOW, kswapd is the only thread reclaiming memory.
- kswapd gets stuck in ->releasepage().
- In the usual case, another thread would call kmalloc(GFP_KERNEL) and do
  foreground reclaim, which would get kswapd unstuck; but in your case there
  is no such thread.

Hm, interesting.

In the short term, the current NFS fix (checking PF_MEMALLOC in nfs_wb_page())
seems to be the best way; it has no side effects, if my understanding is correct.
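
Just to make sure we are talking about the same idea, here is a hand-written
sketch (not the actual patch; the function below is made up for illustration):

        /*
         * Sketch only: a PF_MEMALLOC caller (kswapd, or a task already in
         * direct reclaim) must not wait for the very I/O it is supposed to
         * make progress for, so skip the blocking writeback path for it.
         */
        static int nfs_release_page_sketch(struct page *page, gfp_t gfp)
        {
                if ((gfp & GFP_KERNEL) == GFP_KERNEL &&
                    !(current->flags & PF_MEMALLOC))
                        nfs_wb_page(page->mapping->host, page); /* may block on RPC */

                /* ... then decide whether the private data can be freed ... */
                return 0;
        }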


> IOW: I think kswapd at least should be calling try_to_release_page()
> with a gfp-flag of '0' to avoid deadlocking on I/O.

Hmmm.
A gfp mask of 0 has a much stronger meaning than what NFS requires: there is
no reason to forbid grabbing a mutex, calling cond_resched(), etc.

[digging old git history]

Ho hum...

The old commit log says that passing gfp-flag=0 broke XFS, but current XFS
doesn't use the gfp_mask argument. Hm.


============================================================
commit 68678e2fc6cfdfd013a2513fe416726f3c05b28d
Author: akpm <akpm>
Date: Tue Sep 10 18:09:08 2002 +0000

[PATCH] pass the correct flags to aops->releasepage()

Restore the gfp_mask in the VM's call to a_ops->releasepage(). We can
block in there again, and XFS (at least) can use that.

BKrev: 3d7e35445skDsKDFM6rdiwTY-5elsw

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ed1ec3..89d801e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -208,7 +208,7 @@ shrink_list(struct list_head *page_list, int nr_pages,
* Otherwise, leave the page on the LRU so it is swappable.
*/
if (PagePrivate(page)) {
- if (!try_to_release_page(page, 0))
+ if (!try_to_release_page(page, gfp_mask))
goto keep_locked;
if (!mapping && page_count(page) == 1)
goto free_it;
============================================================

Now, the gfp_mask of try_to_release_page() is used in two places:

btrfs: btrfs_releasepage() (checks __GFP_WAIT; sketched below)
nfs: nfs_release_page() (checks (gfp & GFP_KERNEL) == GFP_KERNEL)
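
From memory, the btrfs check looks roughly like this (paraphrased, not copied
from the source, so the details may differ a bit):

        /* fs/btrfs/extent_io.c, try_release_extent_mapping(), reached via
         * btrfs_releasepage(): only try the more expensive extent-map
         * dropping when the caller says it may wait. */
        if (mask & __GFP_WAIT) {
                /* walk and drop cached extent maps covering this page */
                ...
        }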

Probably btrfs could remove that __GFP_WAIT check from try_release_extent_mapping()
because it doesn't sleep, I dunno. If so, we could change the argument back to 0
again, but I'm not sure that would be worth doing.

Chris, can you tell us how btrfs handles the gfp_mask argument of ->releasepage()?



By the way, we VM folks need to think more about the kswapd design. kswapd often
sleeps now, but Trond's bug report says that the waiting itself can potentially
cause a deadlock. Perhaps that is only a hypothetical concern, but it needs some
consideration...




From: KOSAKI Motohiro on

Sorry for the _very_ delayed review.

> There are a number of cases where pages get cleaned but two of concern
> to this patch are;
> o When dirtying pages, processes may be throttled to clean pages if
> dirty_ratio is not met.
> o Pages belonging to inodes dirtied longer than
> dirty_writeback_centisecs get cleaned.
>
> The problem for reclaim is that dirty pages can reach the end of the LRU if
> pages are being dirtied slowly so that neither the throttling or a flusher
> thread waking periodically cleans them.
>
> Background flush is already cleaning old or expired inodes first but the
> expire time is too far in the future at the time of page reclaim. To mitigate
> future problems, this patch wakes flusher threads to clean 4M of data -
> an amount that should be manageable without causing congestion in many cases.
>
> Ideally, the background flushers would only be cleaning pages belonging
> to the zone being scanned but it's not clear if this would be of benefit
> (less IO) or not (potentially less efficient IO if an inode is scattered
> across multiple zones).
>
> Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
> ---
> mm/vmscan.c | 33 +++++++++++++++++++++++++++++++--
> 1 files changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2d2b588..c4c81bc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -142,6 +142,18 @@ static DECLARE_RWSEM(shrinker_rwsem);
> /* Direct lumpy reclaim waits up to five seconds for background cleaning */
> #define MAX_SWAP_CLEAN_WAIT 50
>
> +/*
> + * When reclaim encounters dirty data, wakeup flusher threads to clean
> + * a maximum of 4M of data.
> + */
> +#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> +#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> +static inline long nr_writeback_pages(unsigned long nr_dirty)
> +{
> + return laptop_mode ? 0 :
> + min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> +}

??

As far as I remember, Hannes pointed out that wakeup_flusher_threads(0) is
incorrect. Can you fix this?
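
For the record, what I believe Hannes meant (paraphrased from
fs/fs-writeback.c from memory, details may differ): nr_pages == 0 means
"write back everything", not "write back nothing", so in laptop_mode the
call above would do exactly the opposite of what its comment intends.

        void wakeup_flusher_threads(long nr_pages)
        {
                struct backing_dev_info *bdi;

                /* 0 is a special case: write back all dirty pages */
                if (!nr_pages)
                        nr_pages = global_page_state(NR_FILE_DIRTY) +
                                   global_page_state(NR_UNSTABLE_NFS);
                ...
        }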



> +
> static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> struct scan_control *sc)
> {
> @@ -649,12 +661,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> static unsigned long shrink_page_list(struct list_head *page_list,
> struct scan_control *sc,
> enum pageout_io sync_writeback,
> + int file,
> unsigned long *nr_still_dirty)
> {
> LIST_HEAD(ret_pages);
> LIST_HEAD(free_pages);
> int pgactivate = 0;
> unsigned long nr_dirty = 0;
> + unsigned long nr_dirty_seen = 0;
> unsigned long nr_reclaimed = 0;
>
> cond_resched();
> @@ -748,6 +762,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> }
>
> if (PageDirty(page)) {
> + nr_dirty_seen++;
> +
> /*
> * Only kswapd can writeback filesystem pages to
> * avoid risk of stack overflow
> @@ -875,6 +891,18 @@ keep:
>
> list_splice(&ret_pages, page_list);
>
> + /*
> + * If reclaim is encountering dirty pages, it may be because
> + * dirty pages are reaching the end of the LRU even though the
> + * dirty_ratio may be satisified. In this case, wake flusher
> + * threads to pro-actively clean up to a maximum of
> + * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> + * !may_writepage indicates that this is a direct reclaimer in
> + * laptop mode avoiding disk spin-ups
> + */
> + if (file && nr_dirty_seen && sc->may_writepage)
> + wakeup_flusher_threads(nr_writeback_pages(nr_dirty));

Umm...
I don't think this guess is very accurate. The following is a brief of the
current isolate_lru_pages():


static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
                struct list_head *src, struct list_head *dst,
                unsigned long *scanned, int order, int mode, int file)
{
        for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
                __isolate_lru_page(page, mode, file);

                if (!order)
                        continue;

                /*
                 * Attempt to take all pages in the order aligned region
                 * surrounding the tag page. Only take those pages of
                 * the same active state as that tag page. We may safely
                 * round the target page pfn down to the requested order
                 * as the mem_map is guarenteed valid out to MAX_ORDER,
                 * where that page is in a different zone we will detect
                 * it from its zone id and abort this block scan.
                 */
                for (; pfn < end_pfn; pfn++) {
                        struct page *cursor_page;
                        (snip)
                }

(This has been unchanged since the initial lumpy reclaim commit.)

That said, even an order-1 isolate_lru_pages(ISOLATE_INACTIVE) call does this
pfn-neighbour search, so we might find dirty pages even though those pages are
not at the end of the LRU.

What do you think?


> +
> *nr_still_dirty = nr_dirty;
> count_vm_events(PGACTIVATE, pgactivate);
> return nr_reclaimed;
> @@ -1315,7 +1343,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> spin_unlock_irq(&zone->lru_lock);
>
> nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC,
> - &nr_dirty);
> + file, &nr_dirty);
>
> /*
> * If specific pages are needed such as with direct reclaiming
> @@ -1351,7 +1379,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> count_vm_events(PGDEACTIVATE, nr_active);
>
> nr_reclaimed += shrink_page_list(&page_list, sc,
> - PAGEOUT_IO_SYNC, &nr_dirty);
> + PAGEOUT_IO_SYNC, file,
> + &nr_dirty);
> }
> }
>
> --
> 1.7.1
>


