vmscan: Force kswapd to take notice faster when high-order watermarks are being hit [Kernel]

Prev: x86, mce: disable MCE if cpu has no MCE banks
Next: ark3116: (v2) driver rework after review

From: Mel Gorman on 28 Oct 2009 06:30

On Tue, Oct 27, 2009 at 01:19:05PM -0700, Andrew Morton wrote:
> On Tue, 27 Oct 2009 13:40:33 +0000
> Mel Gorman <mel(a)csn.ul.ie> wrote:
>
> > When a high-order allocation fails, kswapd is kicked so that it reclaims
> > at a higher-order to avoid direct reclaimers stall and to help GFP_ATOMIC
> > allocations. Something has changed in recent kernels that affect the timing
> > where high-order GFP_ATOMIC allocations are now failing with more frequency,
> > particularly under pressure. This patch forces kswapd to notice sooner that
> > high-order allocations are occuring.
> >
>
> "something has changed"? Shouldn't we find out what that is?
>

We've been trying but the answer right now is "lots". There were some
changes in the allocator itself which were unintentional and fixed in
patches 1 and 2 of this series. The two other major changes are

iwlagn is now making high order GFP_ATOMIC allocations which didn't
help. This is being addressed separetly and I believe the relevant
patches are now in mainline.

The other major change appears to be in page writeback. Reverting
commits 373c0a7e + 8aa7e847 significantly helps one bug reporter but
it's still unknown as to why that is.

The latter is still being investigated but as the patches in this series
are known to help some bug reporters with their GFP_ATOMIC failures and
it is being reported against latest mainline and -stable, I felt it was
best to help some of the bug reporters now to reduce duplicate reports.

> > ---
> > mm/vmscan.c | 9 +++++++++
> > 1 files changed, 9 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 64e4388..7eceb02 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2016,6 +2016,15 @@ loop_again:
> > priority != DEF_PRIORITY)
> > continue;
> >
> > + /*
> > + * Exit the function now and have kswapd start over
> > + * if it is known that higher orders are required
> > + */
> > + if (pgdat->kswapd_max_order > order) {
> > + all_zones_ok = 1;
> > + goto out;
> > + }
> > +
> > if (!zone_watermark_ok(zone, order,
> > high_wmark_pages(zone), end_zone, 0))
> > all_zones_ok = 0;
>
> So this handles the case where some concurrent thread or interrupt
> increases pgdat->kswapd_max_order while kswapd was running
> balance_pgdat(), yes?
>

Right.

> Does that actually happen much? Enough for this patch to make any
> useful difference?
>

Apparently, yes. Wireless drivers in particularly seem to be very
high-order GFP_ATOMIC happy.

> If one where to whack a printk in that `if' block, how often would it
> trigger, and under what circumstances?

I don't know the frequency. The circumstances are "under load" when
there are drivers depending on high-order allocations but the
reproduction cases are unreliable.

Do you want me to slap together a patch that adds a vmstat counter for
this? I can then ask future bug reporters to examine that counter and see
if it really is a major factor for a lot of people or not.

> If the -stable maintainers were to ask me "why did you send this" then
> right now my answer would have to be "I have no idea". Help.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mel Gorman on 2 Nov 2009 11:10

On Wed, Oct 28, 2009 at 12:47:56PM -0700, Andrew Morton wrote:
> On Wed, 28 Oct 2009 10:29:36 +0000
> Mel Gorman <mel(a)csn.ul.ie> wrote:
>
> > On Tue, Oct 27, 2009 at 01:19:05PM -0700, Andrew Morton wrote:
> > > On Tue, 27 Oct 2009 13:40:33 +0000
> > > Mel Gorman <mel(a)csn.ul.ie> wrote:
> > >
> > > > When a high-order allocation fails, kswapd is kicked so that it reclaims
> > > > at a higher-order to avoid direct reclaimers stall and to help GFP_ATOMIC
> > > > allocations. Something has changed in recent kernels that affect the timing
> > > > where high-order GFP_ATOMIC allocations are now failing with more frequency,
> > > > particularly under pressure. This patch forces kswapd to notice sooner that
> > > > high-order allocations are occuring.
> > > >
> > >
> > > "something has changed"? Shouldn't we find out what that is?
> > >
> >
> > We've been trying but the answer right now is "lots". There were some
> > changes in the allocator itself which were unintentional and fixed in
> > patches 1 and 2 of this series. The two other major changes are
> >
> > iwlagn is now making high order GFP_ATOMIC allocations which didn't
> > help. This is being addressed separetly and I believe the relevant
> > patches are now in mainline.
> >
> > The other major change appears to be in page writeback. Reverting
> > commits 373c0a7e + 8aa7e847 significantly helps one bug reporter but
> > it's still unknown as to why that is.
>
> Peculiar. Those changes are fairly remote from large-order-GFP_ATOMIC
> allocations.
>

Indeed. The significance of the patch seems to be how long and how often
processes sleep in the page allocator and what kswapd is doing.

> > ...
> >
> > Wireless drivers in particularly seem to be very
> > high-order GFP_ATOMIC happy.
>
> It would be nice if we could find a way of preventing people from
> attempting high-order atomic allocations in the first place - it's a bit
> of a trap.
>

True.

> Maybe add a runtime warning which is suppressable by GFP_NOWARN (or a
> new flag), then either fix existing callers or, after review, add the
> flag.
>
> Of course, this might just end up with people adding these hopeless
> allocation attempts and just setting the nowarn flag :(
>

That's the difficulty but we should consider adding such warnings or
maintaining in-kernel the unique GFP_ATOMIC callers and their frequency.
It would require a lot of monitoring though and a fair amount of stick
beatings to get the callers corrected.

> > > If one where to whack a printk in that `if' block, how often would it
> > > trigger, and under what circumstances?
> >
> > I don't know the frequency. The circumstances are "under load" when
> > there are drivers depending on high-order allocations but the
> > reproduction cases are unreliable.
> >
> > Do you want me to slap together a patch that adds a vmstat counter for
> > this? I can then ask future bug reporters to examine that counter and see
> > if it really is a major factor for a lot of people or not.
>
> Something like that, if it will help us understand what's going on. I
> don't see a permanent need for that instrumentation but while this
> problem is still in the research stage, sure, lard it up with debug
> stuff?
>

I have a candidate patch below. One of the reasons it took so long to
get out is what I found on the way developing the patch. I had added a
debugging patch to printk what kswapd was doing. One massive difference I
noted was that in 2.6.30 kswapd often went to sleep for 25 jiffies (HZ/10)
in balance_pgdat(). In 2.6.31 and particularly in mainline, it sleeps less
and for shorter intervals. When the sleep interval is low, kswapd notices
the watermarks are ok and goes back to sleep far quicker than 2.6.30
did.

One consequence of this is that kswapd is going back to sleep just as the
high watermark is clear but if it had slept for longer it would have found
that the zone quickly went back below the high watermark due to parallel
allocators. i.e. in 2.6.30, kswapd worked for longer than current mainline.

To see if there is any merit to this, the patch below also counts the number
of times that kswapd prematurely went to sleep. If kswapd is routinely going
to sleep with watermarks not being met, one correction might be to make
balance_pgdat() unconditionally sleep for HZ/10 instead of sleeping based on
congestion as this would bring kswapd closer in line with 2.6.30. Of course,
the pain in the neck is that the premature-sleep-check itself is happening
too quickly.

> It's very important to understand _why_ the VM got worse. And, of
> course, to fix that up. But, separately, we should find a way of
> preventing developers from using these very unreliable allocations.
>

Agreed. I think the main thing that has changed is timing. congestion_wait()
is now doing the "right" thing and sleeping until congestion is
cleared. Unfortunately, it feels like some users of congestion_wait(),
such as kswapd, really wanted to sleep for a fixed interval and not based
on congestion. The comment in balance_pgdat() appears to indicate this was
the expected behaviour.

==== CUT HERE ====
vmscan: Help debug kswapd issues by counting number of rewakeups and premature sleeps

There is a growing amount of anedotal evidence that high-order atomic
allocation failures have been increasing since 2.6.31-rc1. The two
strongest possibilities are a marked increase in the number of
GFP_ATOMIC allocations and alterations in timing. Debugging printk
patches have shown for example that kswapd is sleeping for shorter
intervals and going to sleep when watermarks are still not being met.

This patch adds two kswapd counters to help identify if timing is an
issue. The first counter kswapd_highorder_rewakeup counts the number of
times that kswapd stops reclaiming at one order and restarts at a higher
order. The second counter kswapd_slept_prematurely counts the number of
times kswapd went to sleep when the high watermark was not met.

Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
---
include/linux/vmstat.h | 1 +
mm/vmscan.c | 17 ++++++++++++++++-
mm/vmstat.c | 2 ++
3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 2d0f222..2e0d18d 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -40,6 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGSCAN_ZONE_RECLAIM_FAILED,
#endif
PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
+ KSWAPD_HIGHORDER_REWAKEUP, KSWAPD_PREMATURE_SLEEP,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7eceb02..cf40136 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2021,6 +2021,7 @@ loop_again:
* if it is known that higher orders are required
*/
if (pgdat->kswapd_max_order > order) {
+ count_vm_event(KSWAPD_HIGHORDER_REWAKEUP);
all_zones_ok = 1;
goto out;
}
@@ -2124,6 +2125,17 @@ out:
return sc.nr_reclaimed;
}

+static int kswapd_sleeping_prematurely(int order)
+{
+ struct zone *zone;
+ for_each_populated_zone(zone)
+ if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
+ 0, 0))
+ return 1;
+
+ return 0;
+}
+
/*
* The background pageout daemon, started as a kernel thread
* from the init process.
@@ -2183,8 +2195,11 @@ static int kswapd(void *p)
*/
order = new_order;
} else {
- if (!freezing(current))
+ if (!freezing(current)) {
+ if (kswapd_sleeping_prematurely(order))
+ count_vm_event(KSWAPD_PREMATURE_SLEEP);
schedule();
+ }

order = pgdat->kswapd_max_order;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c81321f..fa881c5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -683,6 +683,8 @@ static const char * const vmstat_text[] = {
"slabs_scanned",
"kswapd_steal",
"kswapd_inodesteal",
+ "kswapd_highorder_rewakeup",
+ "kswapd_slept_prematurely",
"pageoutrun",
"allocstall",

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mel Gorman on 2 Nov 2009 12:40

On Mon, Nov 02, 2009 at 06:32:54PM +0100, Frans Pop wrote:
> On Monday 02 November 2009, Mel Gorman wrote:
> > vmscan: Help debug kswapd issues by counting number of rewakeups and
> > premature sleeps
> >
> > There is a growing amount of anedotal evidence that high-order atomic
> > allocation failures have been increasing since 2.6.31-rc1. The two
> > strongest possibilities are a marked increase in the number of
> > GFP_ATOMIC allocations and alterations in timing. Debugging printk
> > patches have shown for example that kswapd is sleeping for shorter
> > intervals and going to sleep when watermarks are still not being met.
> >
> > This patch adds two kswapd counters to help identify if timing is an
> > issue. The first counter kswapd_highorder_rewakeup counts the number of
> > times that kswapd stops reclaiming at one order and restarts at a higher
> > order. The second counter kswapd_slept_prematurely counts the number of
> > times kswapd went to sleep when the high watermark was not met.
>
> What testing would you like done with this patch?
>

Same reproduction as before except post what the contents of
/proc/vmstat were after the problem was triggered.

Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mel Gorman on 2 Nov 2009 15:40

On Mon, Nov 02, 2009 at 05:38:38PM +0000, Mel Gorman wrote:
> On Mon, Nov 02, 2009 at 06:32:54PM +0100, Frans Pop wrote:
> > On Monday 02 November 2009, Mel Gorman wrote:
> > > vmscan: Help debug kswapd issues by counting number of rewakeups and
> > > premature sleeps
> > >
> > > There is a growing amount of anedotal evidence that high-order atomic
> > > allocation failures have been increasing since 2.6.31-rc1. The two
> > > strongest possibilities are a marked increase in the number of
> > > GFP_ATOMIC allocations and alterations in timing. Debugging printk
> > > patches have shown for example that kswapd is sleeping for shorter
> > > intervals and going to sleep when watermarks are still not being met.
> > >
> > > This patch adds two kswapd counters to help identify if timing is an
> > > issue. The first counter kswapd_highorder_rewakeup counts the number of
> > > times that kswapd stops reclaiming at one order and restarts at a higher
> > > order. The second counter kswapd_slept_prematurely counts the number of
> > > times kswapd went to sleep when the high watermark was not met.
> >
> > What testing would you like done with this patch?
> >
>
> Same reproduction as before except post what the contents of
> /proc/vmstat were after the problem was triggered.
>

In the event there is a positive count for kswapd_slept_prematurely after
the error is produced, can you also check if the following patch makes a
difference and what the contents of vmstat are please? It alters how kswapd
behaves and when it goes to sleep.

Thanks

==== CUT HERE ====
vmscan: Have kswapd sleep for a short interval and double check it should be asleep

After kswapd balances all zones in a pgdat, it goes to sleep. In the event
of no IO congestion, kswapd can go to sleep very shortly after the high
watermark was reached. If there are a constant stream of allocations from
parallel processes, it can mean that kswapd went to sleep too quickly and
the high watermark is not being maintained for sufficient length time.

This patch makes kswapd go to sleep as a two-stage process. It first
tries to sleep for HZ/10. If it is woken up by another process or the
high watermark is no longer met, it's considered a premature sleep and
kswapd continues work. Otherwise it goes fully to sleep.

This adds more counters to distinguish between fast and slow breaches of
watermarks. A "fast" premature sleep is one where the low watermark was
hit in a very short time after kswapd going to sleep. A "slow" premature
sleep indicates that the high watermark was breached after a very short
interval.

Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
---
include/linux/vmstat.h | 3 ++-
mm/vmscan.c | 31 +++++++++++++++++++++++++++----
mm/vmstat.c | 3 ++-
3 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 2e0d18d..f344878 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -40,7 +40,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGSCAN_ZONE_RECLAIM_FAILED,
#endif
PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
- KSWAPD_HIGHORDER_REWAKEUP, KSWAPD_PREMATURE_SLEEP,
+ KSWAPD_HIGHORDER_REWAKEUP,
+ KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 11a69a8..70aeb05 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1905,10 +1905,14 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
#endif

/* is kswapd sleeping prematurely? */
-static int sleeping_prematurely(int order)
+static int sleeping_prematurely(int order, long remaining)
{
struct zone *zone;

+ /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
+ if (remaining)
+ return 1;
+
/* If after HZ/10, a zone is below the high mark, it's premature */
for_each_populated_zone(zone)
if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
@@ -2209,9 +2213,28 @@ static int kswapd(void *p)
order = new_order;
} else {
if (!freezing(current)) {
- if (sleeping_prematurely(order))
- count_vm_event(KSWAPD_PREMATURE_SLEEP);
- schedule();
+ long remaining = 0;
+
+ /* Try to sleep for a short interval */
+ if (!sleeping_prematurely(order, remaining)) {
+ remaining = schedule_timeout(HZ/10);
+ finish_wait(&pgdat->kswapd_wait, &wait);
+ prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+ }
+
+ /*
+ * After a short sleep, check if it was a
+ * premature sleep. If not, then go fully
+ * to sleep until explicitly woken up
+ */
+ if (!sleeping_prematurely(order, remaining))
+ schedule();
+ else {
+ if (remaining)
+ count_vm_event(KSWAPD_PREMATURE_FAST);
+ else
+ count_vm_event(KSWAPD_PREMATURE_SLOW);
+ }
}

order = pgdat->kswapd_max_order;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fa881c5..47a6914 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -684,7 +684,8 @@ static const char * const vmstat_text[] = {
"kswapd_steal",
"kswapd_inodesteal",
"kswapd_highorder_rewakeup",
- "kswapd_slept_prematurely",
+ "kswapd_slept_prematurely_fast",
+ "kswapd_slept_prematurely_slow",
"pageoutrun",
"allocstall",

--
1.6.3.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Mel Gorman on 3 Nov 2009 17:10

On Tue, Nov 03, 2009 at 11:01:50PM +0100, Frans Pop wrote:
> On Monday 02 November 2009, Mel Gorman wrote:
> > On Mon, Nov 02, 2009 at 06:32:54PM +0100, Frans Pop wrote:
> > > On Monday 02 November 2009, Mel Gorman wrote:
> > > > vmscan: Help debug kswapd issues by counting number of rewakeups and
> > > > premature sleeps
> > > >
> > > > There is a growing amount of anedotal evidence that high-order
> > > > atomic allocation failures have been increasing since 2.6.31-rc1.
> > > > The two strongest possibilities are a marked increase in the number
> > > > of GFP_ATOMIC allocations and alterations in timing. Debugging
> > > > printk patches have shown for example that kswapd is sleeping for
> > > > shorter intervals and going to sleep when watermarks are still not
> > > > being met.
> > > >
> > > > This patch adds two kswapd counters to help identify if timing is an
> > > > issue. The first counter kswapd_highorder_rewakeup counts the number
> > > > of times that kswapd stops reclaiming at one order and restarts at a
> > > > higher order. The second counter kswapd_slept_prematurely counts the
> > > > number of times kswapd went to sleep when the high watermark was not
> > > > met.
> > >
> > > What testing would you like done with this patch?
> >
> > Same reproduction as before except post what the contents of
> > /proc/vmstat were after the problem was triggered.
>
> With a representative test I get 0 for kswapd_slept_prematurely.
> Tested with .32-rc6 + patches 1-3 + this patch.
>

Assuming the problem actually reproduced, can you still retest with the
patch I posted as a follow-up and see if fast or slow premature sleeps
are happening and if the problem still occurs please? It's still
possible with the patch as-is could be timing related. After I posted
this patch, I continued testing and found I could get counts fairly
reliably if kswapd was calling printk() before making the premature
check so the window appears to be very small.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

| Next | Last
Pages: 1 2
Prev: x86, mce: disable MCE if cpu has no MCE banks
Next: ark3116: (v2) driver rework after review