From: Christoph Lameter on
On Tue, 23 Mar 2010, Mel Gorman wrote:

> The fragmentation index may indicate that a failure it due to external

s/it/is/

> fragmentation, a compaction run complete and an allocation failure still

???

> fail. There are two obvious reasons as to why
>
> o Page migration cannot move all pages so fragmentation remains
> o A suitable page may exist but watermarks are not met
>
> In the event of compaction and allocation failure, this patch prevents
> compaction happening for a short interval. It's only recorded on the

compaction is "recorded"? deferred?

> preferred zone but that should be enough coverage. This could have been
> implemented similar to the zonelist_cache but the increased size of the
> zonelist did not appear to be justified.

> @@ -1787,6 +1787,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> */
> count_vm_event(COMPACTFAIL);
>
> + /* On failure, avoid compaction for a short time. */
> + defer_compaction(preferred_zone, jiffies + HZ/50);
> +

20ms? How was that interval determined?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Christoph Lameter on
On Tue, 23 Mar 2010, Mel Gorman wrote:

> I was having some sort of fit when I wrote that obviously. Try this on
> for size
>
> The fragmentation index may indicate that a failure is due to external
> fragmentation but after a compaction run completes, it is still possible
> for an allocation to fail.

Ok.

> > > fail. There are two obvious reasons as to why
> > >
> > > o Page migration cannot move all pages so fragmentation remains
> > > o A suitable page may exist but watermarks are not met
> > >
> > > In the event of compaction and allocation failure, this patch prevents
> > > compaction happening for a short interval. It's only recorded on the
> >
> > compaction is "recorded"? deferred?
> >
>
> deferred makes more sense.
>
> What I was thinking at the time was that compact_resume was stored in struct
> zone - i.e. that is where it is recorded.

Ok adding a dozen or more words here may be useful.

> > > preferred zone but that should be enough coverage. This could have been
> > > implemented similar to the zonelist_cache but the increased size of the
> > > zonelist did not appear to be justified.
> >
> > > @@ -1787,6 +1787,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> > > */
> > > count_vm_event(COMPACTFAIL);
> > >
> > > + /* On failure, avoid compaction for a short time. */
> > > + defer_compaction(preferred_zone, jiffies + HZ/50);
> > > +
> >
> > 20ms? How was that interval determined?
> >
>
> Matches the time the page allocator would defer to an event like
> congestion. The choice is somewhat arbitrary. Ideally, there would be
> some sort of event that would re-enable compaction but there wasn't an
> obvious candidate so I used time.

There are frequent uses of HZ/10 as well, especially in vmscan.c. A longer
time may be better? HZ/50 looks like an interval for writeout. But this
is related to reclaim?


backing-dev.h <global> 283 long congestion_wait(int sync, long timeout);
1 backing-dev.c <global> 762 EXPORT_SYMBOL(congestion_wait);
2 usercopy_32.c __copy_to_user_ll 754 congestion_wait(BLK_RW_ASYNC, HZ/50);
3 pktcdvd.c pkt_make_request 2557 congestion_wait(BLK_RW_ASYNC, HZ);
4 dm-crypt.c kcryptd_crypt_write_convert 834 congestion_wait(BLK_RW_ASYNC, HZ/100);
5 file.c fat_file_release 137 congestion_wait(BLK_RW_ASYNC, HZ/10);
6 journal.c reiserfs_async_progress_wait 990 congestion_wait(BLK_RW_ASYNC, HZ / 10);
7 kmem.c kmem_alloc 61 congestion_wait(BLK_RW_ASYNC, HZ/50);
8 kmem.c kmem_zone_alloc 117 congestion_wait(BLK_RW_ASYNC, HZ/50);
9 xfs_buf.c _xfs_buf_lookup_pages 343 congestion_wait(BLK_RW_ASYNC, HZ/50);
a backing-dev.c congestion_wait 751 long congestion_wait(int sync, long timeout)
b memcontrol.c mem_cgroup_force_empty 2858 congestion_wait(BLK_RW_ASYNC, HZ/10);
c page-writeback.c throttle_vm_writeout 674 congestion_wait(BLK_RW_ASYNC, HZ/10);
d page_alloc.c __alloc_pages_high_priority 1753 congestion_wait(BLK_RW_ASYNC, HZ/50);
e page_alloc.c __alloc_pages_slowpath 1924 congestion_wait(BLK_RW_ASYNC, HZ/50);
f vmscan.c shrink_inactive_list 1136 congestion_wait(BLK_RW_ASYNC, HZ/10);
g vmscan.c shrink_inactive_list 1220 congestion_wait(BLK_RW_ASYNC, HZ/10);
h vmscan.c do_try_to_free_pages 1837 congestion_wait(BLK_RW_ASYNC, HZ/10);
i vmscan.c balance_pgdat 2161 congestion_wait(BLK_RW_ASYNC, HZ/10);

From: Christoph Lameter on
On Wed, 24 Mar 2010, Mel Gorman wrote:

> > > What I was thinking at the time was that compact_resume was stored in struct
> > > zone - i.e. that is where it is recorded.
> >
> > Ok adding a dozen or more words here may be useful.
> >
>
> In the event of compaction followed by an allocation failure, this patch
> defers further compaction in the zone for a period of time. The zone that
> is deferred is the first zone in the zonelist - i.e. the preferred zone.
> To defer compaction in the other zones, the information would need to
> be stored in the zonelist or implemented similar to the zonelist_cache.
> This would impact the fast-paths and is not justified at this time.
>
> ?

Ok.

> > There are frequent uses of HZ/10 as well, especially in vmscan.c. A longer
> > time may be better? HZ/50 looks like an interval for writeout. But this
> > is related to reclaim?
> >
>
> HZ/10 is somewhat of an arbitrary choice as well and there isn't data on
> which is better and which is worse. If the zone is full of dirty data, then
> HZ/10 makes sense for IO. If it happened to be mainly clean cache but under
> heavy memory pressure, then reclaim would be a relatively fast event and a
> shorter wait of HZ/50 makes sense.
>
> Thing is, if we start with a short timer and it's too short, COMPACTFAIL
> will be growing steadily. If we choose a long time and it's too long, there
> is no counter to indicate it was a bad choice. Hence, I'd prefer the short
> timer to start with and ideally resume compaction after some event in the
> future rather than depending on time.
>
> Does that make sense?

Yes.

From: Andrew Morton on
On Tue, 23 Mar 2010 12:25:46 +0000
Mel Gorman <mel(a)csn.ul.ie> wrote:

> The fragmentation index may indicate that a failure it due to external
> fragmentation, a compaction run complete and an allocation failure still
> fail. There are two obvious reasons as to why
>
> o Page migration cannot move all pages so fragmentation remains
> o A suitable page may exist but watermarks are not met
>
> In the event of compaction and allocation failure, this patch prevents
> compaction happening for a short interval. It's only recorded on the
> preferred zone but that should be enough coverage. This could have been
> implemented similar to the zonelist_cache but the increased size of the
> zonelist did not appear to be justified.
>
>
> ...
>
> +/* defer_compaction - Do not compact within a zone until a given time */
> +static inline void defer_compaction(struct zone *zone, unsigned long resume)
> +{
> + /*
> + * This function is called when compaction fails to result in a page
> + * allocation success. This is somewhat unsatisfactory as the failure
> + * to compact has nothing to do with time and everything to do with
> + * the requested order, the number of free pages and watermarks. How
> + * to wait on that is more unclear, but the answer would apply to
> + * other areas where the VM waits based on time.

um. "Two wrongs don't make a right". We should fix the other sites,
not use them as excuses ;)

What _is_ a good measure of "time" in this code? "number of pages
scanned" is a pretty good one in reclaim. We want something which will
adapt itself to amount-of-memory, number-of-cpus, speed-of-cpus,
nature-of-workload, etc, etc.

Is it possible to come up with some simple metric which approximately
reflects how busy this code is, then pace ourselves via that?

> + */
> + zone->compact_resume = resume;
> +}
> +
> +static inline int compaction_deferred(struct zone *zone)
> +{
> + /* init once if necessary */
> + if (unlikely(!zone->compact_resume)) {
> + zone->compact_resume = jiffies;
> + return 0;
> + }
> +
> + return time_before(jiffies, zone->compact_resume);
> +}

From: Andrew Morton on
On Fri, 2 Apr 2010 17:02:47 +0100
Mel Gorman <mel(a)csn.ul.ie> wrote:

> The fragmentation index may indicate that a failure is due to external
> fragmentation but after a compaction run completes, it is still possible
> for an allocation to fail. There are two obvious reasons as to why
>
> o Page migration cannot move all pages so fragmentation remains
> o A suitable page may exist but watermarks are not met
>
> In the event of compaction followed by an allocation failure, this patch
> defers further compaction in the zone for a period of time. The zone that
> is deferred is the first zone in the zonelist - i.e. the preferred zone.
> To defer compaction in the other zones, the information would need to be
> stored in the zonelist or implemented similar to the zonelist_cache.
> This would impact the fast-paths and is not justified at this time.
>

Your patch, it sucks!

> ---
> include/linux/compaction.h | 35 +++++++++++++++++++++++++++++++++++
> include/linux/mmzone.h | 7 +++++++
> mm/page_alloc.c | 5 ++++-
> 3 files changed, 46 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index ae98afc..2a02719 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -18,6 +18,32 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
> extern int fragmentation_index(struct zone *zone, unsigned int order);
> extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *mask);
> +
> +/* defer_compaction - Do not compact within a zone until a given time */
> +static inline void defer_compaction(struct zone *zone, unsigned long resume)
> +{
> + /*
> + * This function is called when compaction fails to result in a page
> + * allocation success. This is somewhat unsatisfactory as the failure
> + * to compact has nothing to do with time and everything to do with
> + * the requested order, the number of free pages and watermarks. How
> + * to wait on that is more unclear, but the answer would apply to
> + * other areas where the VM waits based on time.
> + */

c'mon, let's not make this rod for our backs.

The "A suitable page may exist but watermarks are not met" case can be
addressed by testing the watermarks up-front, surely?

I bet the "Page migration cannot move all pages so fragmentation
remains" case can be addressed by setting some metric in the zone, and
suitably modifying that as a result of ongoing activity. To tell the
zone "hey, compaction might be worth trying now". That sucks too, but not
so much.

Or something. Putting a wallclock-based throttle on it like this
really does reduce the usefulness of the whole feature.

Internet: "My application works OK on a hard disk but fails when I use an SSD!".

akpm: "Tell Mel!"
