From: Wu Fengguang on
On Wed, Mar 31, 2010 at 02:00:52PM +0800, KOSAKI Motohiro wrote:
> > KOSAKI-san,
> >
> > On Wed, Mar 31, 2010 at 01:38:12PM +0800, KOSAKI Motohiro wrote:
> > > > On Tue, Mar 30, 2010 at 02:08:53PM +0800, KOSAKI Motohiro wrote:
> > > > > Hi
> > > > >
> > > > > > Commit 84b18490d1f1bc7ed5095c929f78bc002eb70f26 introduces a regression.
> > > > > > With it, our tmpfs test always oom. The test has a lot of rotated anon
> > > > > > pages and cause percent[0] zero. Actually the percent[0] is a very small
> > > > > > value, but our calculation round it to zero. The commit makes vmscan
> > > > > > completely skip anon pages and cause oops.
> > > > > > An option is if percent[x] is zero in get_scan_ratio(), forces it
> > > > > > to 1. See below patch.
> > > > > > But the offending commit still changes behavior. Without the commit, we scan
> > > > > > all pages if priority is zero, below patch doesn't fix this. Don't know if
> > > > > > It's required to fix this too.
> > > > >
> > > > > Can you please post your /proc/meminfo and reproduce program? I'll digg it.
> > > > >
> > > > > Very unfortunately, this patch isn't acceptable. In past time, vmscan
> > > > > had similar logic, but 1% swap-out made lots bug reports.
> > > > if 1% is still big, how about below patch?
> > >
> > > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > > <1% seems no good reclaim rate.
> > >
> > > perhaps I'll take your patch for stable tree. but we need to attack the root
> > > cause. iow, I guess we need to fix scan ratio equation itself.
> >
> > I tend to regard this patch as a general improvement for both
> > .33-stable and .34.
> >
> > I do agree with you that it's desirable to do more test&analyze and
> > check further for possibly hidden problems.
>
> Yeah, I don't want ignore .33-stable too. if I can't find the root cause
> in 2-3 days, I'll revert guilty patch anyway.

OK, thank you!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andrew Morton on
On Wed, 31 Mar 2010 15:00:52 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com> wrote:

> > KOSAKI-san,
> >
> > On Wed, Mar 31, 2010 at 01:38:12PM +0800, KOSAKI Motohiro wrote:
> > > > On Tue, Mar 30, 2010 at 02:08:53PM +0800, KOSAKI Motohiro wrote:
> > > > > Hi
> > > > >
> > > > > > Commit 84b18490d1f1bc7ed5095c929f78bc002eb70f26 introduces a regression.
> > > > > > With it, our tmpfs test always oom. The test has a lot of rotated anon
> > > > > > pages and cause percent[0] zero. Actually the percent[0] is a very small
> > > > > > value, but our calculation round it to zero. The commit makes vmscan
> > > > > > completely skip anon pages and cause oops.
> > > > > > An option is if percent[x] is zero in get_scan_ratio(), forces it
> > > > > > to 1. See below patch.
> > > > > > But the offending commit still changes behavior. Without the commit, we scan
> > > > > > all pages if priority is zero, below patch doesn't fix this. Don't know if
> > > > > > It's required to fix this too.
> > > > >
> > > > > Can you please post your /proc/meminfo and reproduce program? I'll digg it.
> > > > >
> > > > > Very unfortunately, this patch isn't acceptable. In past time, vmscan
> > > > > had similar logic, but 1% swap-out made lots bug reports.
> > > > if 1% is still big, how about below patch?
> > >
> > > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > > <1% seems no good reclaim rate.
> > >
> > > perhaps I'll take your patch for stable tree. but we need to attack the root
> > > cause. iow, I guess we need to fix scan ratio equation itself.
> >
> > I tend to regard this patch as a general improvement for both
> > .33-stable and .34.
> >
> > I do agree with you that it's desirable to do more test&analyze and
> > check further for possibly hidden problems.
>
> Yeah, I don't want ignore .33-stable too. if I can't find the root cause
> in 2-3 days, I'll revert guilty patch anyway.
>

It's a good idea to avoid fixing a bug one-way-in-stable,
other-way-in-mainline. Because then we have new code in both trees
which is different. And the -stable guys sensibly like to see code get
a bit of a shakedown in mainline before backporting it.

So it would be better to merge the "simple" patch into mainline, tagged
for -stable backporting. Then we can later implement the larger fix in
mainline, perhaps starting by reverting the "simple" fix.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Shaohua Li on
On Wed, Mar 31, 2010 at 01:53:27PM +0800, KOSAKI Motohiro wrote:
> > > On Tue, Mar 30, 2010 at 02:08:53PM +0800, KOSAKI Motohiro wrote:
> > > > Hi
> > > >
> > > > > Commit 84b18490d1f1bc7ed5095c929f78bc002eb70f26 introduces a regression.
> > > > > With it, our tmpfs test always oom. The test has a lot of rotated anon
> > > > > pages and cause percent[0] zero. Actually the percent[0] is a very small
> > > > > value, but our calculation round it to zero. The commit makes vmscan
> > > > > completely skip anon pages and cause oops.
> > > > > An option is if percent[x] is zero in get_scan_ratio(), forces it
> > > > > to 1. See below patch.
> > > > > But the offending commit still changes behavior. Without the commit, we scan
> > > > > all pages if priority is zero, below patch doesn't fix this. Don't know if
> > > > > It's required to fix this too.
> > > >
> > > > Can you please post your /proc/meminfo and reproduce program? I'll digg it.
> > > >
> > > > Very unfortunately, this patch isn't acceptable. In past time, vmscan
> > > > had similar logic, but 1% swap-out made lots bug reports.
> > > if 1% is still big, how about below patch?
> >
> > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > <1% seems no good reclaim rate.
>
> Oops, the above mention is wrong. sorry. only 1 page is still too big.
> because under streaming io workload, the number of scanning anon pages should
> be zero. this is very strong requirement. if not, backup operation will makes
> a lot of swapping out.
Sounds there is no big impact for the workload which you mentioned with the patch.
please see below descriptions.
I updated the description of the patch as fengguang suggested.



Commit 84b18490d introduces a regression. With it, our tmpfs test always oom.
The test uses a 6G tmpfs in a system with 3G memory. In the tmpfs, there are
6 copies of kernel source and the test does kbuild for each copy. My
investigation shows the test has a lot of rotated anon pages and quite few
file pages, so get_scan_ratio calculates percent[0] to be zero. Actually
the percent[0] shoule be a very small value, but our calculation round it
to zero. The commit makes vmscan completely skip anon pages and cause oops.

To avoid underflow, we don't use percentage, instead we directly calculate
how many pages should be scaned. In this way, we should get several scan pages
for < 1% percent. With this fix, my test doesn't oom any more.

Note, this patch doesn't really change logics, but just increase precise. For
system with a lot of memory, this might slightly changes behavior. For example,
in a sequential file read workload, without the patch, we don't swap any anon
pages. With it, if anon memory size is bigger than 16G, we will say one anon page
swapped. The 16G is calculated as PAGE_SIZE * priority(4096) * (fp/ap). fp/ap
is assumed to be 1024 which is common in this workload. So the impact sounds not
a big deal.

Signed-off-by: Shaohua Li <shaohua.li(a)intel.com>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..80a7ed5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1519,27 +1519,50 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
}

/*
+ * Smallish @nr_to_scan's are deposited in @nr_saved_scan,
+ * until we collected @swap_cluster_max pages to scan.
+ */
+static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
+ unsigned long *nr_saved_scan)
+{
+ unsigned long nr;
+
+ *nr_saved_scan += nr_to_scan;
+ nr = *nr_saved_scan;
+
+ if (nr >= SWAP_CLUSTER_MAX)
+ *nr_saved_scan = 0;
+ else
+ nr = 0;
+
+ return nr;
+}
+
+/*
* Determine how aggressively the anon and file LRU lists should be
* scanned. The relative value of each set of LRU lists is determined
* by looking at the fraction of the pages scanned we did rotate back
* onto the active list instead of evict.
*
- * percent[0] specifies how much pressure to put on ram/swap backed
- * memory, while percent[1] determines pressure on the file LRUs.
+ * nr[x] specifies how many pages should be scaned
*/
-static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
- unsigned long *percent)
+static void get_scan_count(struct zone *zone, struct scan_control *sc,
+ unsigned long *nr, int priority)
{
unsigned long anon, file, free;
unsigned long anon_prio, file_prio;
unsigned long ap, fp;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+ unsigned long fraction[2], denominator[2];
+ enum lru_list l;

/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (nr_swap_pages <= 0)) {
- percent[0] = 0;
- percent[1] = 100;
- return;
+ fraction[0] = 0;
+ denominator[0] = 1;
+ fraction[1] = 1;
+ denominator[1] = 1;
+ goto out;
}

anon = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
@@ -1552,9 +1575,11 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
/* If we have very few page cache pages,
force-scan anon pages. */
if (unlikely(file + free <= high_wmark_pages(zone))) {
- percent[0] = 100;
- percent[1] = 0;
- return;
+ fraction[0] = 1;
+ denominator[0] = 1;
+ fraction[1] = 0;
+ denominator[1] = 1;
+ goto out;
}
}

@@ -1601,29 +1626,29 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;

- /* Normalize to percentages */
- percent[0] = 100 * ap / (ap + fp + 1);
- percent[1] = 100 - percent[0];
-}
-
-/*
- * Smallish @nr_to_scan's are deposited in @nr_saved_scan,
- * until we collected @swap_cluster_max pages to scan.
- */
-static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
- unsigned long *nr_saved_scan)
-{
- unsigned long nr;
+ fraction[0] = ap;
+ denominator[0] = ap + fp + 1;
+ fraction[1] = fp;
+ denominator[1] = ap + fp + 1;

- *nr_saved_scan += nr_to_scan;
- nr = *nr_saved_scan;
+out:
+ for_each_evictable_lru(l) {
+ int file = is_file_lru(l);
+ unsigned long scan;

- if (nr >= SWAP_CLUSTER_MAX)
- *nr_saved_scan = 0;
- else
- nr = 0;
+ if (fraction[file] == 0) {
+ nr[l] = 0;
+ continue;
+ }

- return nr;
+ scan = zone_nr_lru_pages(zone, sc, l);
+ if (priority) {
+ scan >>= priority;
+ scan = (scan * fraction[file] / denominator[file]);
+ }
+ nr[l] = nr_scan_try_batch(scan,
+ &reclaim_stat->nr_saved_scan[l]);
+ }
}

/*
@@ -1634,31 +1659,11 @@ static void shrink_zone(int priority, struct zone *zone,
{
unsigned long nr[NR_LRU_LISTS];
unsigned long nr_to_scan;
- unsigned long percent[2]; /* anon @ 0; file @ 1 */
enum lru_list l;
unsigned long nr_reclaimed = sc->nr_reclaimed;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
- struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
-
- get_scan_ratio(zone, sc, percent);

- for_each_evictable_lru(l) {
- int file = is_file_lru(l);
- unsigned long scan;
-
- if (percent[file] == 0) {
- nr[l] = 0;
- continue;
- }
-
- scan = zone_nr_lru_pages(zone, sc, l);
- if (priority) {
- scan >>= priority;
- scan = (scan * percent[file]) / 100;
- }
- nr[l] = nr_scan_try_batch(scan,
- &reclaim_stat->nr_saved_scan[l]);
- }
+ get_scan_count(zone, sc, nr, priority);

while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: KOSAKI Motohiro on
> > Yeah, I don't want ignore .33-stable too. if I can't find the root cause
> > in 2-3 days, I'll revert guilty patch anyway.
> >
>
> It's a good idea to avoid fixing a bug one-way-in-stable,
> other-way-in-mainline. Because then we have new code in both trees
> which is different. And the -stable guys sensibly like to see code get
> a bit of a shakedown in mainline before backporting it.
>
> So it would be better to merge the "simple" patch into mainline, tagged
> for -stable backporting. Then we can later implement the larger fix in
> mainline, perhaps starting by reverting the "simple" fix.

......ok. I don't have to prevent your code maintainship. although I still
think we need to fix the issue completely.


===================================================================
From 52358cbccdfe94e0381974cd6e937bcc6b1c608b Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
Date: Fri, 2 Apr 2010 17:13:48 +0900
Subject: [PATCH] Revert "vmscan: get_scan_ratio() cleanup"

Shaohua Li reported his tmpfs streaming I/O test can lead to make oom.
The test uses a 6G tmpfs in a system with 3G memory. In the tmpfs,
there are 6 copies of kernel source and the test does kbuild for each
copy. His investigation shows the test has a lot of rotated anon
pages and quite few file pages, so get_scan_ratio calculates percent[0]
(i.e. scanning percent for anon) to be zero. Actually the percent[0]
shoule be a big value, but our calculation round it to zero.

Although before commit 84b18490, we have the same sick too. but the old
logic can rescue percent[0]==0 case only when priority==0. It had hided
the real issue. I didn't think merely streaming io can makes percent[0]==0
&& priority==0 situation. but I was wrong.

So, definitely we have to fix such tmpfs streaming io issue. but anyway
I revert the regression commit at first.

This reverts commit 84b18490d1f1bc7ed5095c929f78bc002eb70f26.

Reported-by: Shaohua Li <shaohua.li(a)intel.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
---
mm/vmscan.c | 23 +++++++++--------------
1 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..cb3947e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1535,13 +1535,6 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
unsigned long ap, fp;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);

- /* If we have no swap space, do not bother scanning anon pages. */
- if (!sc->may_swap || (nr_swap_pages <= 0)) {
- percent[0] = 0;
- percent[1] = 100;
- return;
- }
-
anon = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
file = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
@@ -1639,20 +1632,22 @@ static void shrink_zone(int priority, struct zone *zone,
unsigned long nr_reclaimed = sc->nr_reclaimed;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+ int noswap = 0;

- get_scan_ratio(zone, sc, percent);
+ /* If we have no swap space, do not bother scanning anon pages. */
+ if (!sc->may_swap || (nr_swap_pages <= 0)) {
+ noswap = 1;
+ percent[0] = 0;
+ percent[1] = 100;
+ } else
+ get_scan_ratio(zone, sc, percent);

for_each_evictable_lru(l) {
int file = is_file_lru(l);
unsigned long scan;

- if (percent[file] == 0) {
- nr[l] = 0;
- continue;
- }
-
scan = zone_nr_lru_pages(zone, sc, l);
- if (priority) {
+ if (priority || noswap) {
scan >>= priority;
scan = (scan * percent[file]) / 100;
}
--
1.6.5.2





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Shaohua Li on
On Fri, Apr 02, 2010 at 05:14:38PM +0800, KOSAKI Motohiro wrote:
> > > > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > > > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > > > <1% seems no good reclaim rate.
> > >
> > > Oops, the above mention is wrong. sorry. only 1 page is still too big.
> > > because under streaming io workload, the number of scanning anon pages should
> > > be zero. this is very strong requirement. if not, backup operation will makes
> > > a lot of swapping out.
> > Sounds there is no big impact for the workload which you mentioned with the patch.
> > please see below descriptions.
> > I updated the description of the patch as fengguang suggested.
>
> Umm.. sorry, no.
>
> "one fix but introduce another one bug" is not good deal. instead,
> I'll revert the guilty commit at first as akpm mentioned.
Even we revert the commit, the patch still has its benefit, as it increases
calculation precision, right?

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/