From: Christoph Hellwig on
I don't see a patch in this set which refuses writeback from the memcg
context, which we identified as having a large stack footprint in the
discussion of the last patch set.

Meanwhile I've submitted a patch to xfs to allow writeback from kswapd reclaim,
and just prevent it from direct and memcg reclaim.
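
The shape of such a check is a guard at the top of the filesystem's
->writepage: if the caller is in reclaim (PF_MEMALLOC set) but is not kswapd
(PF_KSWAPD clear), redirty the page and return without writing it. A minimal
sketch of that pattern - not the actual xfs patch, and example_writepage() is
only an illustrative stand-in for a real ->writepage implementation:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
#include <linux/writeback.h>

static int example_writepage(struct page *page, struct writeback_control *wbc)
{
	/*
	 * Refuse writeback from direct and memcg reclaim, which may already
	 * be deep into the stack; kswapd (PF_KSWAPD) is allowed through.
	 */
	if ((current->flags & (PF_MEMALLOC | PF_KSWAPD)) == PF_MEMALLOC) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

	/* the filesystem's normal writeback path would continue here */
	return 0;
}

The point of testing task flags is that it costs nothing on the normal
writeback path and needs no new plumbing into ->writepage.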

Btw, it might be worth also allowing kswapd to do writeout on ext4,
but doing that will be a bit more complicated than the btrfs and xfs
variants as the code is rather convoluted.

--
From: Mel Gorman on
On Tue, Jun 29, 2010 at 08:37:22AM -0400, Christoph Hellwig wrote:
> I don't see a patch in this set which refuses writeback from the memcg
> context, which we identified as having a large stack footprint in the
> discussion of the last patch set.
>

It wasn't clear to me what the right approach was there, and I should
have noted that in the intro. The last note I have on it is this message
http://kerneltrap.org/mailarchive/linux-kernel/2010/6/17/4584087 which might
avoid the deep stack usage but I wasn't 100% sure. As kswapd doesn't clean
pages for memcg, I left memcg being able to direct writeback to see what
the memcg people preferred.
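
For reference, the writeback gate in the current series is this small helper
in mm/vmscan.c - it is the function that the untested patch later in this
thread narrows to kswapd only:

/* kswapd and memcg can writeback as they are unlikely to overflow stack */
static inline bool reclaim_can_writeback(struct scan_control *sc)
{
	return current_is_kswapd() || sc->mem_cgroup != NULL;
}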

> Meanwhile I've submitted a patch to xfs to allow writeback from kswapd reclaim,
> and just prevent it from direct and memcg reclaim.
>

Good stuff.

> Btw, it might be worth also allowing kswapd to do writeout on ext4,
> but doing that will be a bit more complicated than the btrfs and xfs
> variants as the code is rather convoluted.
>

Fully agreed. I looked into it and got caught in its twisty web, so I
postponed it until this much can be finalised, agreed upon or rejected -
all prerequisites to making the ext4 work worthwhile.

--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab
--
From: KAMEZAWA Hiroyuki on
On Tue, 29 Jun 2010 13:51:43 +0100
Mel Gorman <mel(a)csn.ul.ie> wrote:

> On Tue, Jun 29, 2010 at 08:37:22AM -0400, Christoph Hellwig wrote:
> > I don't see a patch in this set which refuses writeback from the memcg
> > context, which we identified as having a large stack footprint in the
> > discussion of the last patch set.
> >
>
> It wasn't clear to me what the right approach was there, and I should
> have noted that in the intro. The last note I have on it is this message
> http://kerneltrap.org/mailarchive/linux-kernel/2010/6/17/4584087 which might
> avoid the deep stack usage but I wasn't 100% sure. As kswapd doesn't clean
> pages for memcg, I left memcg being able to direct writeback to see what
> the memcg people preferred.
>

Hmm. If some filesystems don't support direct ->writeback, memcg shouldn't
depend on it. If so, memcg should depend on some writeback thread (such as kswapd).
OK.

Then, my concern here is which kswapd we should wake up and how it can stop.
IOW, how kswapd can know that a memcg has some remaining writeback and is stuck on it.

One idea is here. (This patch will not work... it is not tested at all.)
If we can have a "victim page list" and kswapd can depend on it to know
"which pages should be written", kswapd can know when it should work.

CPU usage by memcg will be a new problem... but...

==
Add a new LRU "CLEANING" and make kswapd launder it.
This patch also changes PG_reclaim behavior. The new PG_reclaim works as:
- If PG_reclaim is set, a page is on the CLEANING list.

And when kswapd launders a page:
- issue a writeback. (I'm thinking whether I should put this
cleaned page back to the CLEANING lru and free it later.)
- if it can free directly, free it.
This just uses the current shrink_list().

Maybe this patch itself includes many bad points...

---
fs/proc/meminfo.c | 2
include/linux/mm_inline.h | 9 ++
include/linux/mmzone.h | 7 ++
mm/filemap.c | 3
mm/internal.h | 1
mm/page-writeback.c | 1
mm/page_io.c | 1
mm/swap.c | 31 ++-------
mm/vmscan.c | 153 +++++++++++++++++++++++++++++++++++++++++++++-
9 files changed, 176 insertions(+), 32 deletions(-)

Index: mmotm-0611/include/linux/mmzone.h
===================================================================
--- mmotm-0611.orig/include/linux/mmzone.h
+++ mmotm-0611/include/linux/mmzone.h
@@ -85,6 +85,7 @@ enum zone_stat_item {
NR_INACTIVE_FILE, /* " " " " " */
NR_ACTIVE_FILE, /* " " " " " */
NR_UNEVICTABLE, /* " " " " " */
+ NR_CLEANING, /* " " " " " */
NR_MLOCK, /* mlock()ed pages found and moved off LRU */
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
@@ -133,6 +134,7 @@ enum lru_list {
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
LRU_UNEVICTABLE,
+ LRU_CLEANING,
NR_LRU_LISTS
};

@@ -155,6 +157,11 @@ static inline int is_unevictable_lru(enu
return (l == LRU_UNEVICTABLE);
}

+static inline int is_cleaning_lru(enum lru_list l)
+{
+ return (l == LRU_CLEANING);
+}
+
enum zone_watermarks {
WMARK_MIN,
WMARK_LOW,
Index: mmotm-0611/include/linux/mm_inline.h
===================================================================
--- mmotm-0611.orig/include/linux/mm_inline.h
+++ mmotm-0611/include/linux/mm_inline.h
@@ -56,7 +56,10 @@ del_page_from_lru(struct zone *zone, str
enum lru_list l;

list_del(&page->lru);
- if (PageUnevictable(page)) {
+ if (PageReclaim(page)) {
+ ClearPageReclaim(page);
+ l = LRU_CLEANING;
+ } else if (PageUnevictable(page)) {
__ClearPageUnevictable(page);
l = LRU_UNEVICTABLE;
} else {
@@ -81,7 +84,9 @@ static inline enum lru_list page_lru(str
{
enum lru_list lru;

- if (PageUnevictable(page))
+ if (PageReclaim(page)) {
+ lru = LRU_CLEANING;
+ } else if (PageUnevictable(page))
lru = LRU_UNEVICTABLE;
else {
lru = page_lru_base_type(page);
Index: mmotm-0611/fs/proc/meminfo.c
===================================================================
--- mmotm-0611.orig/fs/proc/meminfo.c
+++ mmotm-0611/fs/proc/meminfo.c
@@ -65,6 +65,7 @@ static int meminfo_proc_show(struct seq_
"Active(file): %8lu kB\n"
"Inactive(file): %8lu kB\n"
"Unevictable: %8lu kB\n"
+ "Cleaning: %8lu kB\n"
"Mlocked: %8lu kB\n"
#ifdef CONFIG_HIGHMEM
"HighTotal: %8lu kB\n"
@@ -114,6 +115,7 @@ static int meminfo_proc_show(struct seq_
K(pages[LRU_ACTIVE_FILE]),
K(pages[LRU_INACTIVE_FILE]),
K(pages[LRU_UNEVICTABLE]),
+ K(pages[LRU_CLEANING]),
K(global_page_state(NR_MLOCK)),
#ifdef CONFIG_HIGHMEM
K(i.totalhigh),
Index: mmotm-0611/mm/swap.c
===================================================================
--- mmotm-0611.orig/mm/swap.c
+++ mmotm-0611/mm/swap.c
@@ -118,8 +118,8 @@ static void pagevec_move_tail(struct pag
zone = pagezone;
spin_lock(&zone->lru_lock);
}
- if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
- int lru = page_lru_base_type(page);
+ if (PageLRU(page)) {
+ int lru = page_lru(page);
list_move_tail(&page->lru, &zone->lru[lru].list);
pgmoved++;
}
@@ -131,27 +131,6 @@ static void pagevec_move_tail(struct pag
pagevec_reinit(pvec);
}

-/*
- * Writeback is about to end against a page which has been marked for immediate
- * reclaim. If it still appears to be reclaimable, move it to the tail of the
- * inactive list.
- */
-void rotate_reclaimable_page(struct page *page)
-{
- if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
- !PageUnevictable(page) && PageLRU(page)) {
- struct pagevec *pvec;
- unsigned long flags;
-
- page_cache_get(page);
- local_irq_save(flags);
- pvec = &__get_cpu_var(lru_rotate_pvecs);
- if (!pagevec_add(pvec, page))
- pagevec_move_tail(pvec);
- local_irq_restore(flags);
- }
-}
-
static void update_page_reclaim_stat(struct zone *zone, struct page *page,
int file, int rotated)
{
@@ -235,10 +214,16 @@ void lru_cache_add_lru(struct page *page
{
if (PageActive(page)) {
VM_BUG_ON(PageUnevictable(page));
+ VM_BUG_ON(PageReclaim(page));
ClearPageActive(page);
} else if (PageUnevictable(page)) {
VM_BUG_ON(PageActive(page));
+ VM_BUG_ON(PageReclaim(page));
ClearPageUnevictable(page);
+ } else if (PageReclaim(page)) {
+ VM_BUG_ON(PageReclaim(page));
+ VM_BUG_ON(PageUnevictable(page));
+ ClearPageReclaim(page);
}

VM_BUG_ON(PageLRU(page) || PageActive(page) || PageUnevictable(page));
Index: mmotm-0611/mm/filemap.c
===================================================================
--- mmotm-0611.orig/mm/filemap.c
+++ mmotm-0611/mm/filemap.c
@@ -560,9 +560,6 @@ EXPORT_SYMBOL(unlock_page);
*/
void end_page_writeback(struct page *page)
{
- if (TestClearPageReclaim(page))
- rotate_reclaimable_page(page);
-
if (!test_clear_page_writeback(page))
BUG();

Index: mmotm-0611/mm/internal.h
===================================================================
--- mmotm-0611.orig/mm/internal.h
+++ mmotm-0611/mm/internal.h
@@ -259,3 +259,4 @@ extern u64 hwpoison_filter_flags_mask;
extern u64 hwpoison_filter_flags_value;
extern u64 hwpoison_filter_memcg;
extern u32 hwpoison_filter_enable;
+
Index: mmotm-0611/mm/page-writeback.c
===================================================================
--- mmotm-0611.orig/mm/page-writeback.c
+++ mmotm-0611/mm/page-writeback.c
@@ -1252,7 +1252,6 @@ int clear_page_dirty_for_io(struct page

BUG_ON(!PageLocked(page));

- ClearPageReclaim(page);
if (mapping && mapping_cap_account_dirty(mapping)) {
/*
* Yes, Virginia, this is indeed insane.
Index: mmotm-0611/mm/page_io.c
===================================================================
--- mmotm-0611.orig/mm/page_io.c
+++ mmotm-0611/mm/page_io.c
@@ -60,7 +60,6 @@ static void end_swap_bio_write(struct bi
imajor(bio->bi_bdev->bd_inode),
iminor(bio->bi_bdev->bd_inode),
(unsigned long long)bio->bi_sector);
- ClearPageReclaim(page);
}
end_page_writeback(page);
bio_put(bio);
Index: mmotm-0611/mm/vmscan.c
===================================================================
--- mmotm-0611.orig/mm/vmscan.c
+++ mmotm-0611/mm/vmscan.c
@@ -364,6 +364,12 @@ static pageout_t pageout(struct page *pa
if (!may_write_to_queue(mapping->backing_dev_info))
return PAGE_KEEP;

+ if (!current_is_kswapd()) {
+ /* pass this page to kswapd. */
+ SetPageReclaim(page);
+ return PAGE_KEEP;
+ }
+
if (clear_page_dirty_for_io(page)) {
int res;
struct writeback_control wbc = {
@@ -503,6 +509,8 @@ void putback_lru_page(struct page *page)

redo:
ClearPageUnevictable(page);
+ /* This function never puts pages to CLEANING queue */
+ ClearPageReclaim(page);

if (page_evictable(page, NULL)) {
/*
@@ -883,6 +891,8 @@ int __isolate_lru_page(struct page *page
* page release code relies on it.
*/
ClearPageLRU(page);
+ /* when someone isolates this page, clear reclaim status */
+ ClearPageReclaim(page);
ret = 0;
}

@@ -1020,7 +1030,7 @@ static unsigned long isolate_pages_globa
* # of pages of each types and clearing any active bits.
*/
static unsigned long count_page_types(struct list_head *page_list,
- unsigned int *count, int clear_active)
+ unsigned int *count, int clear_actives)
{
int nr_active = 0;
int lru;
@@ -1076,6 +1086,7 @@ int isolate_lru_page(struct page *page)
int lru = page_lru(page);
ret = 0;
ClearPageLRU(page);
+ ClearPageReclaim(page);

del_page_from_lru_list(zone, page, lru);
}
@@ -1109,6 +1120,103 @@ static int too_many_isolated(struct zone
return isolated > inactive;
}

+/* only called by kswapd to do I/O and put back clean pages to its LRU */
+static void shrink_cleaning_list(struct zone *zone)
+{
+ LIST_HEAD(page_list);
+ struct list_head *src;
+ struct pagevec pvec;
+ unsigned long nr_pageout;
+ unsigned long nr_cleaned;
+ struct scan_control sc = {
+ .gfp_mask = GFP_KERNEL,
+ .may_unmap = 1,
+ .may_swap = 1,
+ .nr_to_reclaim = ULONG_MAX,
+ .swappiness = vm_swappiness,
+ .order = 1,
+ .mem_cgroup = NULL,
+ };
+
+ pagevec_init(&pvec, 1);
+ lru_add_drain();
+
+ src = &zone->lru[LRU_CLEANING].list;
+ nr_pageout = 0;
+ nr_cleaned = 0;
+ spin_lock_irq(&zone->lru_lock);
+ do {
+ unsigned int count[NR_LRU_LISTS] = {0,};
+ unsigned int nr_anon, nr_file, nr_taken, check_clean, nr_freed;
+ unsigned long nr_scan;
+
+ if (list_empty(src))
+ goto done;
+
+ check_clean = max((unsigned long)SWAP_CLUSTER_MAX,
+ zone_page_state(zone, NR_CLEANING)/8);
+ /* we do global-only */
+ nr_taken = isolate_lru_pages(check_clean,
+ src, &page_list, &nr_scan,
+ 0, ISOLATE_BOTH, 0);
+ zone->pages_scanned += nr_scan;
+ __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
+ if (nr_taken == 0)
+ goto done;
+ __mod_zone_page_state(zone, NR_CLEANING, -nr_taken);
+ spin_unlock_irq(&zone->lru_lock);
+ /*
+ * Because PG_reclaim flag is deleted by isolate_lru_page(),
+ * we can count correct value
+ */
+ count_page_types(&page_list, count, 0);
+ nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+ nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+
+ nr_freed = shrink_page_list(&page_list, &sc, PAGEOUT_IO_ASYNC);
+ /*
+ * Put back any unfreeable pages.
+ */
+ while (!list_empty(&page_list)) {
+ int lru;
+ struct page *page;
+
+ page = lru_to_page(&page_list);
+ VM_BUG_ON(PageLRU(page));
+ list_del(&page->lru);
+ if (unlikely(!page_evictable(page, NULL))) {
+ spin_unlock_irq(&zone->lru_lock);
+ putback_lru_page(page);
+ spin_lock_irq(&zone->lru_lock);
+ continue;
+ }
+ SetPageLRU(page);
+ lru = page_lru(page);
+ add_page_to_lru_list(zone, page, lru);
+ if (!pagevec_add(&pvec, page)) {
+ spin_unlock_irq(&zone->lru_lock);
+ __pagevec_release(&pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
+ }
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
+ nr_pageout += nr_taken - nr_freed;
+ nr_cleaned += nr_freed;
+ if (nr_pageout > SWAP_CLUSTER_MAX) {
+ /* there are remaining I/Os */
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ nr_pageout /= 2;
+ }
+ } while(nr_cleaned < SWAP_CLUSTER_MAX);
+done:
+ spin_unlock_irq(&zone->lru_lock);
+ pagevec_release(&pvec);
+ return;
+}
+
/*
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
@@ -1736,6 +1844,9 @@ static bool shrink_zones(int priority, s
sc->nodemask) {
if (!populated_zone(zone))
continue;
+
+ if (current_is_kswapd())
+ shrink_cleaning_list(zone);
/*
* Take care memory controller reclaiming has small influence
* to global LRU.
@@ -2222,6 +2333,42 @@ out:
return sc.nr_reclaimed;
}

+static void launder_pgdat(pg_data_t *pgdat)
+{
+ struct zone *zone;
+ int i;
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+
+ zone = &pgdat->node_zones[i];
+ if (!populated_zone(zone))
+ continue;
+ if (!zone_page_state(zone, NR_CLEANING))
+ continue;
+ shrink_cleaning_list(zone);
+ }
+}
+
+/*
+ * Find a zone which has cleaning list.
+ */
+static int need_to_cleaning_node(pg_data_t *pgdat)
+{
+ int i;
+ struct zone *zone;
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+
+ zone = &pgdat->node_zones[i];
+ if (!populated_zone(zone))
+ continue;
+ if (zone_page_state(zone, NR_CLEANING))
+ break;
+ }
+ return (i != MAX_NR_ZONES);
+}
+
+
/*
* The background pageout daemon, started as a kernel thread
* from the init process.
@@ -2275,7 +2422,9 @@ static int kswapd(void *p)
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
new_order = pgdat->kswapd_max_order;
pgdat->kswapd_max_order = 0;
- if (order < new_order) {
+ if (need_to_cleaning_node(pgdat)) {
+ launder_pgdat(pgdat);
+ } else if (order < new_order) {
/*
* Don't sleep if someone wants a larger 'order'
* allocation

--
From: Mel Gorman on
On Wed, Jun 30, 2010 at 09:14:11AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 29 Jun 2010 13:51:43 +0100
> Mel Gorman <mel(a)csn.ul.ie> wrote:
>
> > On Tue, Jun 29, 2010 at 08:37:22AM -0400, Christoph Hellwig wrote:
> > > I don't see a patch in this set which refuses writeback from the memcg
> > > context, which we identified as having a large stack footprint in the
> > > discussion of the last patch set.
> > >
> >
> > It wasn't clear to me what the right approach was there, and I should
> > have noted that in the intro. The last note I have on it is this message
> > http://kerneltrap.org/mailarchive/linux-kernel/2010/6/17/4584087 which might
> > avoid the deep stack usage but I wasn't 100% sure. As kswapd doesn't clean
> > pages for memcg, I left memcg being able to direct writeback to see what
> > the memcg people preferred.
> >
>
> Hmm. If some filesystems don't support direct ->writeback,

This is not strictly true. All of them support ->writeback, but some of
them are ignoring it from the reclaim context - right now xfs, btrfs and
extN - so it's a sizable cross-section of the filesystems we care about.

> memcg shouldn't
> depend on it. If so, memcg should depend on some writeback thread (such as kswapd).
> OK.
>
> Then, my concern here is which kswapd we should wake up and how it can stop.

And also what the consequences are of kswapd being occupied with containers
instead of the global lists for a time.

> IOW, how kswapd can know that a memcg has some remaining writeback and is stuck on it.
>

Another possibility for memcg would be to visit Andrea's suggestion on
switching stacks in more detail. I still haven't gotten around to this as
PhD stuff is sucking up piles of my time.

> One idea is here. (This patch will not work... it is not tested at all.)
> If we can have a "victim page list" and kswapd can depend on it to know
> "which pages should be written", kswapd can know when it should work.
>
> CPU usage by memcg will be a new problem... but...
>
> ==
> Add a new LRU "CLEANING" and make kswapd launder it.
> This patch also changes PG_reclaim behavior. The new PG_reclaim works as:
> - If PG_reclaim is set, a page is on the CLEANING list.
>
> And when kswapd launders a page:
> - issue a writeback. (I'm thinking whether I should put this
> cleaned page back to the CLEANING lru and free it later.)
> - if it can free directly, free it.
> This just uses the current shrink_list().
>
> Maybe this patch itself includes many bad points...
>
> ---
> fs/proc/meminfo.c | 2
> include/linux/mm_inline.h | 9 ++
> include/linux/mmzone.h | 7 ++
> mm/filemap.c | 3
> mm/internal.h | 1
> mm/page-writeback.c | 1
> mm/page_io.c | 1
> mm/swap.c | 31 ++-------
> mm/vmscan.c | 153 +++++++++++++++++++++++++++++++++++++++++++++-
> 9 files changed, 176 insertions(+), 32 deletions(-)
>
> Index: mmotm-0611/include/linux/mmzone.h
> ===================================================================
> --- mmotm-0611.orig/include/linux/mmzone.h
> +++ mmotm-0611/include/linux/mmzone.h
> @@ -85,6 +85,7 @@ enum zone_stat_item {
> NR_INACTIVE_FILE, /* " " " " " */
> NR_ACTIVE_FILE, /* " " " " " */
> NR_UNEVICTABLE, /* " " " " " */
> + NR_CLEANING, /* " " " " " */
> NR_MLOCK, /* mlock()ed pages found and moved off LRU */
> NR_ANON_PAGES, /* Mapped anonymous pages */
> NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> @@ -133,6 +134,7 @@ enum lru_list {
> LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
> LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
> LRU_UNEVICTABLE,
> + LRU_CLEANING,
> NR_LRU_LISTS
> };
>
> @@ -155,6 +157,11 @@ static inline int is_unevictable_lru(enu
> return (l == LRU_UNEVICTABLE);
> }
>
> +static inline int is_cleaning_lru(enum lru_list l)
> +{
> + return (l == LRU_CLEANING);
> +}
> +

Nit - LRU_CLEAN_PENDING might be clearer, as CLEANING implies it is currently
being cleaned (implying it's the same as NR_WRITEBACK) or is definitely dirty
(implying it's the same as NR_DIRTY).

> enum zone_watermarks {
> WMARK_MIN,
> WMARK_LOW,
> Index: mmotm-0611/include/linux/mm_inline.h
> ===================================================================
> --- mmotm-0611.orig/include/linux/mm_inline.h
> +++ mmotm-0611/include/linux/mm_inline.h
> @@ -56,7 +56,10 @@ del_page_from_lru(struct zone *zone, str
> enum lru_list l;
>
> list_del(&page->lru);
> - if (PageUnevictable(page)) {
> + if (PageReclaim(page)) {
> + ClearPageReclaim(page);
> + l = LRU_CLEANING;
> + } else if (PageUnevictable(page)) {
> __ClearPageUnevictable(page);
> l = LRU_UNEVICTABLE;
> } else {

One point of note is that having an LRU cleaning list will alter the aging
of pages quite a bit.

A slightly greater concern is that clean pages can be temporarily "lost"
on the cleaning list. If a direct reclaimer moves pages to the LRU_CLEANING
list, it's no longer considering those pages even if a flusher thread
happened to clean those pages before kswapd had a chance. Let's say under
heavy memory pressure a lot of pages are being dirtied and encountered on
the LRU list. They move to LRU_CLEANING where dirty balancing starts making
sure they get cleaned but are no longer being reclaimed.

Of course, I might be wrong but it's not a trivial direction to take.

> @@ -81,7 +84,9 @@ static inline enum lru_list page_lru(str
> {
> enum lru_list lru;
>
> - if (PageUnevictable(page))
> + if (PageReclaim(page)) {
> + lru = LRU_CLEANING;
> + } else if (PageUnevictable(page))
> lru = LRU_UNEVICTABLE;
> else {
> lru = page_lru_base_type(page);
> Index: mmotm-0611/fs/proc/meminfo.c
> ===================================================================
> --- mmotm-0611.orig/fs/proc/meminfo.c
> +++ mmotm-0611/fs/proc/meminfo.c
> @@ -65,6 +65,7 @@ static int meminfo_proc_show(struct seq_
> "Active(file): %8lu kB\n"
> "Inactive(file): %8lu kB\n"
> "Unevictable: %8lu kB\n"
> + "Cleaning: %8lu kB\n"
> "Mlocked: %8lu kB\n"
> #ifdef CONFIG_HIGHMEM
> "HighTotal: %8lu kB\n"
> @@ -114,6 +115,7 @@ static int meminfo_proc_show(struct seq_
> K(pages[LRU_ACTIVE_FILE]),
> K(pages[LRU_INACTIVE_FILE]),
> K(pages[LRU_UNEVICTABLE]),
> + K(pages[LRU_CLEANING]),
> K(global_page_state(NR_MLOCK)),
> #ifdef CONFIG_HIGHMEM
> K(i.totalhigh),
> Index: mmotm-0611/mm/swap.c
> ===================================================================
> --- mmotm-0611.orig/mm/swap.c
> +++ mmotm-0611/mm/swap.c
> @@ -118,8 +118,8 @@ static void pagevec_move_tail(struct pag
> zone = pagezone;
> spin_lock(&zone->lru_lock);
> }
> - if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> - int lru = page_lru_base_type(page);
> + if (PageLRU(page)) {
> + int lru = page_lru(page);
> list_move_tail(&page->lru, &zone->lru[lru].list);
> pgmoved++;
> }
> @@ -131,27 +131,6 @@ static void pagevec_move_tail(struct pag
> pagevec_reinit(pvec);
> }
>
> -/*
> - * Writeback is about to end against a page which has been marked for immediate
> - * reclaim. If it still appears to be reclaimable, move it to the tail of the
> - * inactive list.
> - */
> -void rotate_reclaimable_page(struct page *page)
> -{
> - if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
> - !PageUnevictable(page) && PageLRU(page)) {
> - struct pagevec *pvec;
> - unsigned long flags;
> -
> - page_cache_get(page);
> - local_irq_save(flags);
> - pvec = &__get_cpu_var(lru_rotate_pvecs);
> - if (!pagevec_add(pvec, page))
> - pagevec_move_tail(pvec);
> - local_irq_restore(flags);
> - }
> -}
> -
> static void update_page_reclaim_stat(struct zone *zone, struct page *page,
> int file, int rotated)
> {
> @@ -235,10 +214,16 @@ void lru_cache_add_lru(struct page *page
> {
> if (PageActive(page)) {
> VM_BUG_ON(PageUnevictable(page));
> + VM_BUG_ON(PageReclaim(page));
> ClearPageActive(page);
> } else if (PageUnevictable(page)) {
> VM_BUG_ON(PageActive(page));
> + VM_BUG_ON(PageReclaim(page));
> ClearPageUnevictable(page);
> + } else if (PageReclaim(page)) {
> + VM_BUG_ON(PageReclaim(page));
> + VM_BUG_ON(PageUnevictable(page));
> + ClearPageReclaim(page);
> }
>
> VM_BUG_ON(PageLRU(page) || PageActive(page) || PageUnevictable(page));
> Index: mmotm-0611/mm/filemap.c
> ===================================================================
> --- mmotm-0611.orig/mm/filemap.c
> +++ mmotm-0611/mm/filemap.c
> @@ -560,9 +560,6 @@ EXPORT_SYMBOL(unlock_page);
> */
> void end_page_writeback(struct page *page)
> {
> - if (TestClearPageReclaim(page))
> - rotate_reclaimable_page(page);
> -
> if (!test_clear_page_writeback(page))
> BUG();
>
> Index: mmotm-0611/mm/internal.h
> ===================================================================
> --- mmotm-0611.orig/mm/internal.h
> +++ mmotm-0611/mm/internal.h
> @@ -259,3 +259,4 @@ extern u64 hwpoison_filter_flags_mask;
> extern u64 hwpoison_filter_flags_value;
> extern u64 hwpoison_filter_memcg;
> extern u32 hwpoison_filter_enable;
> +
> Index: mmotm-0611/mm/page-writeback.c
> ===================================================================
> --- mmotm-0611.orig/mm/page-writeback.c
> +++ mmotm-0611/mm/page-writeback.c
> @@ -1252,7 +1252,6 @@ int clear_page_dirty_for_io(struct page
>
> BUG_ON(!PageLocked(page));
>
> - ClearPageReclaim(page);
> if (mapping && mapping_cap_account_dirty(mapping)) {
> /*
> * Yes, Virginia, this is indeed insane.
> Index: mmotm-0611/mm/page_io.c
> ===================================================================
> --- mmotm-0611.orig/mm/page_io.c
> +++ mmotm-0611/mm/page_io.c
> @@ -60,7 +60,6 @@ static void end_swap_bio_write(struct bi
> imajor(bio->bi_bdev->bd_inode),
> iminor(bio->bi_bdev->bd_inode),
> (unsigned long long)bio->bi_sector);
> - ClearPageReclaim(page);
> }
> end_page_writeback(page);
> bio_put(bio);
> Index: mmotm-0611/mm/vmscan.c
> ===================================================================
> --- mmotm-0611.orig/mm/vmscan.c
> +++ mmotm-0611/mm/vmscan.c
> @@ -364,6 +364,12 @@ static pageout_t pageout(struct page *pa
> if (!may_write_to_queue(mapping->backing_dev_info))
> return PAGE_KEEP;
>
> + if (!current_is_kswapd()) {
> + /* pass this page to kswapd. */
> + SetPageReclaim(page);
> + return PAGE_KEEP;
> + }
> +
> if (clear_page_dirty_for_io(page)) {
> int res;
> struct writeback_control wbc = {
> @@ -503,6 +509,8 @@ void putback_lru_page(struct page *page)
>
> redo:
> ClearPageUnevictable(page);
> + /* This function never puts pages to CLEANING queue */
> + ClearPageReclaim(page);
>
> if (page_evictable(page, NULL)) {
> /*
> @@ -883,6 +891,8 @@ int __isolate_lru_page(struct page *page
> * page release code relies on it.
> */
> ClearPageLRU(page);
> + /* when someone isolates this page, clear reclaim status */
> + ClearPageReclaim(page);
> ret = 0;
> }
>
> @@ -1020,7 +1030,7 @@ static unsigned long isolate_pages_globa
> * # of pages of each types and clearing any active bits.
> */
> static unsigned long count_page_types(struct list_head *page_list,
> - unsigned int *count, int clear_active)
> + unsigned int *count, int clear_actives)
> {
> int nr_active = 0;
> int lru;
> @@ -1076,6 +1086,7 @@ int isolate_lru_page(struct page *page)
> int lru = page_lru(page);
> ret = 0;
> ClearPageLRU(page);
> + ClearPageReclaim(page);
>
> del_page_from_lru_list(zone, page, lru);
> }
> @@ -1109,6 +1120,103 @@ static int too_many_isolated(struct zone
> return isolated > inactive;
> }
>
> +/* only called by kswapd to do I/O and put back clean pages to its LRU */
> +static void shrink_cleaning_list(struct zone *zone)
> +{
> + LIST_HEAD(page_list);
> + struct list_head *src;
> + struct pagevec pvec;
> + unsigned long nr_pageout;
> + unsigned long nr_cleaned;
> + struct scan_control sc = {
> + .gfp_mask = GFP_KERNEL,
> + .may_unmap = 1,
> + .may_swap = 1,
> + .nr_to_reclaim = ULONG_MAX,
> + .swappiness = vm_swappiness,
> + .order = 1,
> + .mem_cgroup = NULL,
> + };
> +
> + pagevec_init(&pvec, 1);
> + lru_add_drain();
> +
> + src = &zone->lru[LRU_CLEANING].list;
> + nr_pageout = 0;
> + nr_cleaned = 0;
> + spin_lock_irq(&zone->lru_lock);
> + do {
> + unsigned int count[NR_LRU_LISTS] = {0,};
> + unsigned int nr_anon, nr_file, nr_taken, check_clean, nr_freed;
> + unsigned long nr_scan;
> +
> + if (list_empty(src))
> + goto done;
> +
> + check_clean = max((unsigned long)SWAP_CLUSTER_MAX,
> + zone_page_state(zone, NR_CLEANING)/8);
> + /* we do global-only */
> + nr_taken = isolate_lru_pages(check_clean,
> + src, &page_list, &nr_scan,
> + 0, ISOLATE_BOTH, 0);
> + zone->pages_scanned += nr_scan;
> + __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
> + if (nr_taken == 0)
> + goto done;
> + __mod_zone_page_state(zone, NR_CLEANING, -nr_taken);
> + spin_unlock_irq(&zone->lru_lock);
> + /*
> + * Because PG_reclaim flag is deleted by isolate_lru_page(),
> + * we can count correct value
> + */
> + count_page_types(&page_list, count, 0);
> + nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> + nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> + __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
> + __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
> +
> + nr_freed = shrink_page_list(&page_list, &sc, PAGEOUT_IO_ASYNC);

So, at this point the isolated pages are cleaned and put back, which is
fine. If they were already clean, they get freed, which is also fine. But
direct reclaimers do not call this function, so they could be missing
clean and freeable pages, which worries me.

> + /*
> + * Put back any unfreeable pages.
> + */
> + while (!list_empty(&page_list)) {
> + int lru;
> + struct page *page;
> +
> + page = lru_to_page(&page_list);
> + VM_BUG_ON(PageLRU(page));
> + list_del(&page->lru);
> + if (unlikely(!page_evictable(page, NULL))) {
> + spin_unlock_irq(&zone->lru_lock);
> + putback_lru_page(page);
> + spin_lock_irq(&zone->lru_lock);
> + continue;
> + }
> + SetPageLRU(page);
> + lru = page_lru(page);
> + add_page_to_lru_list(zone, page, lru);
> + if (!pagevec_add(&pvec, page)) {
> + spin_unlock_irq(&zone->lru_lock);
> + __pagevec_release(&pvec);
> + spin_lock_irq(&zone->lru_lock);
> + }
> + }
> + __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
> + __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
> + nr_pageout += nr_taken - nr_freed;
> + nr_cleaned += nr_freed;
> + if (nr_pageout > SWAP_CLUSTER_MAX) {
> + /* there are remaining I/Os */
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> + nr_pageout /= 2;
> + }
> + } while(nr_cleaned < SWAP_CLUSTER_MAX);
> +done:
> + spin_unlock_irq(&zone->lru_lock);
> + pagevec_release(&pvec);
> + return;
> +}
> +
> /*
> * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
> * of reclaimed pages
> @@ -1736,6 +1844,9 @@ static bool shrink_zones(int priority, s
> sc->nodemask) {
> if (!populated_zone(zone))
> continue;
> +
> + if (current_is_kswapd())
> + shrink_cleaning_list(zone);
> /*
> * Take care memory controller reclaiming has small influence
> * to global LRU.
> @@ -2222,6 +2333,42 @@ out:
> return sc.nr_reclaimed;
> }
>
> +static void launder_pgdat(pg_data_t *pgdat)
> +{
> + struct zone *zone;
> + int i;
> +
> + for (i = 0; i < MAX_NR_ZONES; i++) {
> +
> + zone = &pgdat->node_zones[i];
> + if (!populated_zone(zone))
> + continue;
> + if (!zone_page_state(zone, NR_CLEANING))
> + continue;
> + shrink_cleaning_list(zone);
> + }
> +}
> +
> +/*
> + * Find a zone which has cleaning list.
> + */
> +static int need_to_cleaning_node(pg_data_t *pgdat)
> +{
> + int i;
> + struct zone *zone;
> +
> + for (i = 0; i < MAX_NR_ZONES; i++) {
> +
> + zone = &pgdat->node_zones[i];
> + if (!populated_zone(zone))
> + continue;
> + if (zone_page_state(zone, NR_CLEANING))
> + break;
> + }
> + return (i != MAX_NR_ZONES);
> +}
> +
> +
> /*
> * The background pageout daemon, started as a kernel thread
> * from the init process.
> @@ -2275,7 +2422,9 @@ static int kswapd(void *p)
> prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> new_order = pgdat->kswapd_max_order;
> pgdat->kswapd_max_order = 0;
> - if (order < new_order) {
> + if (need_to_cleaning_node(pgdat)) {
> + launder_pgdat(pgdat);
> + } else if (order < new_order) {
> /*
> * Don't sleep if someone wants a larger 'order'
> * allocation

I see the direction you are thinking of but I have big concerns about clean
pages getting delayed for too long on the LRU_CLEANING list before kswapd
puts them back in the right place. I think a safer direction would be for
memcg people to investigate Andrea's "switch stack" suggestion.

In the meantime, for my own series, memcg now treats dirty pages similarly to
lumpy reclaim. It asks flusher threads to clean pages but stalls waiting
for those pages to be cleaned for a time. This is an untested patch on top
of the current series.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5c4f08b..81c6fbe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -367,10 +367,10 @@ int write_reclaim_page(struct page *page, struct address_space *mapping,
return PAGE_SUCCESS;
}

-/* kswapd and memcg can writeback as they are unlikely to overflow stack */
+/* For now, only kswapd can writeback as it will not overflow stack */
static inline bool reclaim_can_writeback(struct scan_control *sc)
{
- return current_is_kswapd() || sc->mem_cgroup != NULL;
+ return current_is_kswapd();
}

/*
@@ -900,10 +900,11 @@ keep_dirty:
congestion_wait(BLK_RW_ASYNC, HZ/10);

/*
- * As lumpy reclaim targets specific pages, wait on them
- * to be cleaned and try reclaim again for a time.
+ * As lumpy reclaim and memcg targets specific pages, wait on
+ * them to be cleaned and try reclaim again.
*/
- if (sync_writeback == PAGEOUT_IO_SYNC) {
+ if (sync_writeback == PAGEOUT_IO_SYNC ||
+ sc->mem_cgroup != NULL) {
dirty_isolated++;
list_splice(&dirty_pages, page_list);
INIT_LIST_HEAD(&dirty_pages);
--
From: KAMEZAWA Hiroyuki on
On Fri, 2 Jul 2010 15:26:43 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com> wrote:

> > I see the direction you are thinking of but I have big concerns about clean
> > pages getting delayed for too long on the LRU_CLEANING list before kswapd
> > puts them back in the right place. I think a safer direction would be for
> > memcg people to investigate Andrea's "switch stack" suggestion.
> >
> Hmm, I may have to consider that. My concern is that the IRQ switch-stack works
> well just because there is no task switch in an IRQ routine. (I'm sorry if I misunderstand.)
>
OK, I'll think about this first.

Thanks,
-Kame

--