From: KOSAKI Motohiro on
> Viewpoint 1. Unnecessary IO
>
> isolate_pages() for lumpy reclaim frequently grabs very young pages. They are
> often still dirty, so pageout() gets called a lot.
>
> Unfortunately, page-granular IO is _very_ inefficient. It can cause a lot of
> disk seeks and kill disk IO bandwidth.
>
>
> Viewpoint 2. Unevictable pages
>
> isolate_pages() for lumpy reclaim can pick up unevictable pages, which are
> obviously undroppable. So if the zone has plenty of mlocked pages (not a rare
> case for server workloads), lumpy reclaim can become quite useless.
>
>
> Viewpoint 3. GFP_ATOMIC allocation failure
>
> Obviously, lumpy reclaim can't help with the GFP_ATOMIC issue.
>
>
> Viewpoint 4. Reclaim latency
>
> Reclaim latency directly affects page allocation latency. So if lumpy reclaim
> with a lot of pageout IO is slow (as it often is), it hurts page allocation
> latency and can degrade the end-user experience.

Viewpoint 5. End-user surprise

Lumpy reclaim can cause swap-out even though the system has lots of free
memory. End users are very surprised by this and may think it is a bug.

Also, this swap activity easily confuses an administrator trying to decide
when to install more memory in the system.



From: Mel Gorman on
On Fri, Mar 19, 2010 at 03:21:31PM +0900, KOSAKI Motohiro wrote:
> > @@ -1765,6 +1766,31 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >
> > cond_resched();
> >
> > + /* Try memory compaction for high-order allocations before reclaim */
> > + if (order) {
> > + *did_some_progress = try_to_compact_pages(zonelist,
> > + order, gfp_mask, nodemask);
> > + if (*did_some_progress != COMPACT_INCOMPLETE) {
> > + page = get_page_from_freelist(gfp_mask, nodemask,
> > + order, zonelist, high_zoneidx,
> > + alloc_flags, preferred_zone,
> > + migratetype);
> > + if (page) {
> > + __count_vm_event(COMPACTSUCCESS);
> > + return page;
> > + }
> > +
> > + /*
> > + * It's bad if compaction run occurs and fails.
> > + * The most likely reason is that pages exist,
> > + * but not enough to satisfy watermarks.
> > + */
> > + count_vm_event(COMPACTFAIL);
> > +
> > + cond_resched();
> > + }
> > + }
> > +
>
> Hmm..Hmmm...........
>
> Today I've reviewed this patch and [11/11] carefully, twice, but it is hard to ack.
>
> This patch seems to assume page compaction is faster than direct
> reclaim, but it often isn't, because dropping useless page cache is a very
> lightweight operation,

Two points on that:

1. It's very hard to know in advance how often direct reclaim of clean page
cache would be enough to satisfy the allocation.

2. Even if it were faster to discard page cache, it's not necessarily
faster once the cost of reading that page cache back in is taken into
account.

Lumpy reclaim tries to avoid dumping useful page cache but it is perfectly
possible for hot data to be discarded because it happened to be located
near cold data. It's impossible to know in general how much unnecessary IO
takes place as a result of lumpy reclaim because it depends heavily on the
system-state when lumpy reclaim starts.

> but page compaction does a lot of memcpy (i.e. CPU cache
> pollution). IOW, this patch focuses very aggressively on hugepage allocation,
> but it doesn't seem to take enough care to limit the damage to typical workloads.
>

What typical workload is making aggressive use of high-order
allocations? Typically, when such a user is found, effort is spent on
finding alternatives to high orders as opposed to worrying about the cost
of allocating them. There was a focus on huge page allocation because it
was the most useful test case likely to be encountered in practice.

I can adjust the allocation levels to some other value, but it's not typical
for a system to make very aggressive use of other orders. I could have it
use random orders, but that is not very typical either.

> First, I would like to clarify the current reclaim corner cases and what
> vmscan should do about them in this mail.
>
> Now we have lumpy reclaim. It is an excellent solution for external
> fragmentation.

In some situations, it can grind a system into thrashing for a time. What is far
more likely is to be dealing with a machine with no swap - something that
is common in clusters. In this case, lumpy reclaim is a lot less likely to succeed
unless the machine is very quiet. It's just not going to find the contiguous
page cache it needs to discard, and anonymous pages get in the way.

> but unfortunately it has lots of corner cases.
>
> Viewpoint 1. Unnecessary IO
>
> isolate_pages() for lumpy reclaim frequently grabs very young pages. They are
> often still dirty, so pageout() gets called a lot.
>
> Unfortunately, page-granular IO is _very_ inefficient. It can cause a lot of
> disk seeks and kill disk IO bandwidth.
>

Page-based IO like this has also been reported as being a problem for some
filesystems. When this happens, lumpy reclaim potentially stalls for a long
time waiting for the dirty data to be flushed by a flusher thread. Compaction
does not suffer from the same problem.

> Viewpoint 2. Unevictable pages
>
> isolate_pages() for lumpy reclaim can pick up unevictable pages, which are
> obviously undroppable. So if the zone has plenty of mlocked pages (not a rare
> case for server workloads), lumpy reclaim can become quite useless.
>

Also true. Potentially, compaction could deal with unevictable pages, but that
is not done in this series, as the series is significant enough as it is and
useful in its current form.

> Viewpoint 3. GFP_ATOMIC allocation failure
>
> Obviously, lumpy reclaim can't help with the GFP_ATOMIC issue.
>

Also true, although right now it's not possible to compact for GFP_ATOMIC
either. I think it could be done in some cases, but I didn't try for it.
High-order GFP_ATOMIC allocations are still something we simply try to
avoid rather than deal with within the page allocator.

> Viewpoint 4. Reclaim latency
>
> Reclaim latency directly affects page allocation latency. So if lumpy reclaim
> with a lot of pageout IO is slow (as it often is), it hurts page allocation
> latency and can degrade the end-user experience.
>

Also true. When allocating huge pages on a normal desktop, for example,
it can stall the machine for a number of seconds while reclaim kicks
in.

With direct compaction, this does not happen to anywhere near the same
degree. There are still some stalls because, as huge pages get allocated,
free memory drops until pages have to be reclaimed anyway, but the effects
are a lot less pronounced and the operation finishes a lot faster.

> I really hope that automatic page migration helps to solve the above issues,
> but sadly this patch doesn't seem to.
>

How do you figure? I think it goes a long way to mitigating the worst of
the problems you laid out above.

> Honestly, I think this patch would have been very impressive and useful 2-3
> years ago, because 1) we didn't have lumpy reclaim and 2) we didn't have sane
> reclaim bail-out. The old vmscan was a very heavyweight and inefficient
> operation for high-order reclaim, so the downside of adding this page
> migration would have been relatively hidden. but...
>
> We have to make an effort to reduce reclaim latency, not add new latency sources.

I recognise that reclaim latency has been reduced, but there is a wall.
The cost of reading the data back in will always be there, and on
swapless systems it might simply be impossible for lumpy reclaim to do
what it needs to.

> Instead, I would recommend tightly integrating page compaction and lumpy reclaim.
> I mean 1) reusing lumpy reclaim's neighbor-pfn page-picking logic

There are a number of difficulties with this. I'm not saying it's impossible,
but the win is not very clear-cut and there are some disadvantages.

One, there would have to be exceptions for kswapd in the path because it
really should continue reclaiming. The reclaim path is already very dense
and this would add significant complexity to it.

The second difficulty is that the migration and free block selection
algorithm becomes a lot harder and more expensive, and identifying the exit
conditions presents a significant difficulty. Right now, the selection is
based on linear scans with straightforward selection, and the exit condition
is simply when the scanners meet. With the migration scanner based on the LRU,
significant care would have to be taken to ensure that appropriate free blocks
were chosen to migrate to, so that we didn't "migrate from" a block in one
pass and "migrate to" it in another (the reason why I went with linear scans
in the first place). Identifying when the zone has been compacted and the scan
should just stop is no longer as straightforward either. You'd have to track
which blocks had been operated on in the past, which is potentially a lot of
state. To maintain this state, an unknown number of structures would have to
be allocated, which might re-enter the allocator, presenting its own class of
problems.
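
As an illustration of how simple the current scheme is, here is a minimal
userspace analogy of the two linear scanners and the meet-in-the-middle exit
condition. It is only an analogy - the array stands in for a zone, and none
of the kernel's isolation or migration machinery is represented:

#include <stdbool.h>
#include <stdio.h>

#define NPAGES 16

int main(void)
{
	/* 1 = in use, 0 = free; a toy "zone" */
	bool used[NPAGES] = { 1, 0, 1, 1, 0, 0, 1, 0,
			      1, 1, 0, 0, 0, 1, 0, 0 };
	size_t migrate = 0;		/* scans upward for pages to move */
	size_t free_slot = NPAGES - 1;	/* scans downward for free targets */

	while (migrate < free_slot) {	/* exit: the scanners meet */
		while (migrate < free_slot && !used[migrate])
			migrate++;
		while (migrate < free_slot && used[free_slot])
			free_slot--;
		if (migrate < free_slot) {
			/* "migrate" the low page into the high free slot */
			used[free_slot] = true;
			used[migrate] = false;
		}
	}

	/* free space is now contiguous below the meeting point */
	for (size_t i = 0; i < NPAGES; i++)
		printf("%d", (int)used[i]);
	putchar('\n');
	return 0;
}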

Third, right now it's very easy to identify in advance when compaction is not
going to work - simply check the watermarks and make a calculation based on
fragmentation. With a combined reclaim/compaction step, these kinds of checks
would need to be made continually, potentially increasing the latency of
reclaim, albeit very slightly.
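
To make that concrete, here is a rough sketch of such an up-front gate, in the
spirit of the fragmentation_index() helper this series exports. Note that
compaction_worth_trying() and the 500 cut-off are illustrative assumptions,
not code from the patch:

/*
 * Sketch of the "check in advance" step. zone_watermark_ok(),
 * low_wmark_pages() and fragmentation_index() are real interfaces;
 * the helper itself is hypothetical.
 */
static bool compaction_worth_trying(struct zone *zone, int order)
{
	unsigned long watermark = low_wmark_pages(zone) + (1 << order);
	int fragindex;

	/* Too little free memory overall: compaction cannot create
	 * memory, so reclaim has to happen first */
	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
		return false;

	/*
	 * The index tends towards 0 when failure is due to lack of
	 * memory and towards 1000 when it is due to external
	 * fragmentation; negative means a suitable page is already free
	 */
	fragindex = fragmentation_index(zone, order);
	if (fragindex >= 0 && fragindex <= 500)	/* assumed threshold */
		return false;	/* reclaim is the better option */

	return true;
}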

Lastly, with this series, there is very little difference between direct
compaction and proc-triggered compaction. They share the same code paths
and all that differs is the exit conditions. If it was integrated into
reclaim, it becomes a lot less straight-forward to share the code.

> 2) doing page
> migration instead of pageout when the page meets some condition (for example,
> active or dirty or referenced or swapbacked).
>

Right now, it is identified in advance when pageout should happen instead of
page migration. It's known before compaction starts whether it's likely to be
successful or not.

> This patch seems to shoot me! /me dies. R.I.P. ;-)
>

That seems a bit dramatic. Your alternative proposal has some significant
difficulties and is likely to be very complicated. Also, there is nothing
to say that this mechanism could not be integrated with lumpy reclaim over
time once it was shown that useless migration was going on or latencies were
increased for some workload.

This patch seems like a far more rational starting point to me than adding
more complexity to reclaim at the outset.

> btw, please don't use 'hugeadm --set-recommended-min_free_kbytes' when testing.

It's somewhat important for the type of stress tests I do for huge page
allocation. Without it, fragmentation avoidance has trouble and the
results become a lot less repeatable.

> Evaluating the case of free memory starvation is very important for this patch
> series, I think. I slightly suspect this patch might invoke useless compaction
> in such a case.
>

I can drop the min_free_kbytes change, but the likely result will be that
allocation success rates are simply lower. The calculations on whether
compaction should be used or not are based on watermarks, which adjust
to the value of min_free_kbytes.
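
For context, the watermarks in question are derived from min_free_kbytes
roughly as follows. This is a simplification of setup_per_zone_wmarks() -
in reality each zone receives a share of pages_min proportional to its size:

/* Simplified sketch of how min_free_kbytes feeds the zone watermarks;
 * the per-zone proportional split is omitted for brevity */
static void sketch_zone_wmarks(struct zone *zone, unsigned long min_free_kbytes)
{
	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);

	zone->watermark[WMARK_MIN]  = pages_min;			/* GFP_ATOMIC floor */
	zone->watermark[WMARK_LOW]  = pages_min + (pages_min >> 2);	/* kswapd wakes below this */
	zone->watermark[WMARK_HIGH] = pages_min + (pages_min >> 1);	/* kswapd sleeps above this */
}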

> Bottom line, the explicit compaction via /proc can be merged soon, I think,
> but this auto-compaction logic seems to need more discussion.
>

My concern would be that the compaction paths would then be used very
rarely in practice and we'd get no data on how direct compaction should
be done.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
From: Mel Gorman on
On Fri, Mar 19, 2010 at 03:31:27PM +0900, KOSAKI Motohiro wrote:
> > Viewpoint 1. Unnecessary IO
> >
> > isolate_pages() for lumpy reclaim frequently grabs very young pages. They are
> > often still dirty, so pageout() gets called a lot.
> >
> > Unfortunately, page-granular IO is _very_ inefficient. It can cause a lot of
> > disk seeks and kill disk IO bandwidth.
> >
> >
> > Viewpoint 2. Unevictable pages
> >
> > isolate_pages() for lumpy reclaim can pick up unevictable pages, which are
> > obviously undroppable. So if the zone has plenty of mlocked pages (not a rare
> > case for server workloads), lumpy reclaim can become quite useless.
> >
> >
> > Viewpoint 3. GFP_ATOMIC allocation failure
> >
> > Obviously, lumpy reclaim can't help with the GFP_ATOMIC issue.
> >
> >
> > Viewpoint 4. Reclaim latency
> >
> > Reclaim latency directly affects page allocation latency. So if lumpy reclaim
> > with a lot of pageout IO is slow (as it often is), it hurts page allocation
> > latency and can degrade the end-user experience.
>
> Viewpoint 5. End-user surprise
>
> Lumpy reclaim can cause swap-out even though the system has lots of free
> memory. End users are very surprised by this and may think it is a bug.
>
> Also, this swap activity easily confuses an administrator trying to decide
> when to install more memory in the system.
>

Compaction in this case is a lot less surprising. If there is enough free
memory, compaction will trigger automatically without any reclaim.
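
For reference, this is the shape of the allocator slow path with the series
applied, condensed from the hunk quoted earlier in the thread (error handling
elided; not the verbatim patch):

	/* Compaction first: with enough free but fragmented memory, the
	 * allocation is satisfied with no reclaim and no IO at all */
	if (order) {
		*did_some_progress = try_to_compact_pages(zonelist,
					order, gfp_mask, nodemask);
		if (*did_some_progress != COMPACT_INCOMPLETE) {
			page = get_page_from_freelist(gfp_mask, nodemask,
					order, zonelist, high_zoneidx,
					alloc_flags, preferred_zone,
					migratetype);
			if (page)
				return page;	/* no swap-out, no surprise */
		}
	}

	/* Only when compaction cannot help does direct reclaim - and with
	 * it the possibility of pageout and swap - get involved */
	*did_some_progress = try_to_free_pages(zonelist, order,
						gfp_mask, nodemask);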

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
From: Minchan Kim on
Hi, Mel.

On Tue, Mar 23, 2010 at 9:25 PM, Mel Gorman <mel(a)csn.ul.ie> wrote:
> Ordinarily when a high-order allocation fails, direct reclaim is entered to
> free pages to satisfy the allocation.  With this patch, it is determined if
> an allocation failed due to external fragmentation instead of low memory
> and if so, the calling process will compact until a suitable page is
> freed. Compaction by moving pages in memory is considerably cheaper than
> paging out to disk and works where there are locked pages or no swap. If
> compaction fails to free a page of a suitable size, then reclaim will
> still occur.
>
> Direct compaction returns as soon as possible. As each block is compacted,
> it is checked if a suitable page has been freed and if so, it returns.
>
> Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
> Acked-by: Rik van Riel <riel(a)redhat.com>
> ---
>  include/linux/compaction.h |   16 +++++-
>  include/linux/vmstat.h     |    1 +
>  mm/compaction.c            |  118 ++++++++++++++++++++++++++++++++++++++++++++
>  mm/page_alloc.c            |   26 ++++++++++
>  mm/vmstat.c                |   15 +++++-
>  5 files changed, 172 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index c94890b..b851428 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -1,14 +1,26 @@
>  #ifndef _LINUX_COMPACTION_H
>  #define _LINUX_COMPACTION_H
>
> -/* Return values for compact_zone() */
> +/* Return values for compact_zone() and try_to_compact_pages() */
>  #define COMPACT_INCOMPLETE     0
> -#define COMPACT_COMPLETE       1
> +#define COMPACT_PARTIAL                1
> +#define COMPACT_COMPLETE       2
>
>  #ifdef CONFIG_COMPACTION
>  extern int sysctl_compact_memory;
>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>                        void __user *buffer, size_t *length, loff_t *ppos);
> +
> +extern int fragmentation_index(struct zone *zone, unsigned int order);
> +extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> +                       int order, gfp_t gfp_mask, nodemask_t *mask);
> +#else
> +static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> +                       int order, gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> +       return COMPACT_INCOMPLETE;
> +}
> +
>  #endif /* CONFIG_COMPACTION */
>
>  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 56e4b44..b4b4d34 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>                KSWAPD_SKIP_CONGESTION_WAIT,
>                PAGEOUTRUN, ALLOCSTALL, PGROTATED,
>                COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> +               COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
>  #ifdef CONFIG_HUGETLB_PAGE
>                HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
>  #endif
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 8df6e3d..6688700 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -34,6 +34,8 @@ struct compact_control {
>        unsigned long nr_anon;
>        unsigned long nr_file;
>
> +       unsigned int order;             /* order a direct compactor needs */
> +       int migratetype;                /* MOVABLE, RECLAIMABLE etc */
>        struct zone *zone;
>  };
>
> @@ -301,10 +303,31 @@ static void update_nr_listpages(struct compact_control *cc)
>  static inline int compact_finished(struct zone *zone,
>                                                struct compact_control *cc)
>  {
> +       unsigned int order;
> +       unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
> +
>        /* Compaction run completes if the migrate and free scanner meet */
>        if (cc->free_pfn <= cc->migrate_pfn)
>                return COMPACT_COMPLETE;
>
> +       /* Compaction run is not finished if the watermark is not met */
> +       if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
> +               return COMPACT_INCOMPLETE;
> +
> +       if (cc->order == -1)
> +               return COMPACT_INCOMPLETE;
> +
> +       /* Direct compactor: Is a suitable page free? */
> +       for (order = cc->order; order < MAX_ORDER; order++) {
> +               /* Job done if page is free of the right migratetype */
> +               if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
> +                       return COMPACT_PARTIAL;
> +
> +               /* Job done if allocation would set block type */
> +               if (order >= pageblock_order && zone->free_area[order].nr_free)
> +                       return COMPACT_PARTIAL;
> +       }
> +
>        return COMPACT_INCOMPLETE;
>  }
>
> @@ -348,6 +371,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>        return ret;
>  }
>
> +static inline unsigned long compact_zone_order(struct zone *zone,
> +                                               int order, gfp_t gfp_mask)
> +{
> +       struct compact_control cc = {
> +               .nr_freepages = 0,
> +               .nr_migratepages = 0,
> +               .order = order,
> +               .migratetype = allocflags_to_migratetype(gfp_mask),
> +               .zone = zone,
> +       };
> +       INIT_LIST_HEAD(&cc.freepages);
> +       INIT_LIST_HEAD(&cc.migratepages);
> +
> +       return compact_zone(zone, &cc);
> +}
> +
> +/**
> + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> + * @zonelist: The zonelist used for the current allocation
> + * @order: The order of the current allocation
> + * @gfp_mask: The GFP mask of the current allocation
> + * @nodemask: The allowed nodes to allocate from
> + *
> + * This is the main entry point for direct page compaction.
> + */
> +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> +                       int order, gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> +       enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> +       int may_enter_fs = gfp_mask & __GFP_FS;
> +       int may_perform_io = gfp_mask & __GFP_IO;
> +       unsigned long watermark;
> +       struct zoneref *z;
> +       struct zone *zone;
> +       int rc = COMPACT_INCOMPLETE;
> +
> +       /* Check whether it is worth even starting compaction */
> +       if (order == 0 || !may_enter_fs || !may_perform_io)
> +               return rc;
> +
> +       /*
> +        * We will not stall if the necessary conditions are not met for
> +        * migration but direct reclaim seems to account stalls similarly
> +        */

I can't understand this comment.
In the case of direct reclaim, a long time spent in shrink_zones() is simply a
stall from the point of view of the allocation customer.
So "allocation is stalled" makes sense to me.

But "compaction is stalled" doesn't make sense to me.
How about "COMPACTION_DIRECT", like "PGSCAN_DIRECT"?
I think it's straightforward.
Naming is important since it becomes ABI.

> +       count_vm_event(COMPACTSTALL);
> +

--
Kind regards,
Minchan Kim
From: Mel Gorman on
On Wed, Mar 24, 2010 at 08:10:40AM +0900, Minchan Kim wrote:
> Hi, Mel.
>
> On Tue, Mar 23, 2010 at 9:25 PM, Mel Gorman <mel(a)csn.ul.ie> wrote:
> > Ordinarily when a high-order allocation fails, direct reclaim is entered to
> > free pages to satisfy the allocation. With this patch, it is determined if
> > an allocation failed due to external fragmentation instead of low memory
> > and if so, the calling process will compact until a suitable page is
> > freed. Compaction by moving pages in memory is considerably cheaper than
> > paging out to disk and works where there are locked pages or no swap. If
> > compaction fails to free a page of a suitable size, then reclaim will
> > still occur.
> >
> > Direct compaction returns as soon as possible. As each block is compacted,
> > it is checked if a suitable page has been freed and if so, it returns.
> >
> > Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
> > Acked-by: Rik van Riel <riel(a)redhat.com>
> > <SNIP: patch quoted in full in the previous mail>
> > +       /* Check whether it is worth even starting compaction */
> > +       if (order == 0 || !may_enter_fs || !may_perform_io)
> > +               return rc;
> > +
> > +       /*
> > +        * We will not stall if the necessary conditions are not met for
> > +        * migration but direct reclaim seems to account stalls similarly
> > +        */
>
> I can't understand this comment.
> In the case of direct reclaim, a long time spent in shrink_zones() is simply a
> stall from the point of view of the allocation customer.
> So "allocation is stalled" makes sense to me.
>
> But "compaction is stalled" doesn't make sense to me.

I considered a "stall" to be when the allocator is doing work that is not
allocation-related, such as page reclaim or, in this case, memory compaction.

> How about "COMPACTION_DIRECT", like "PGSCAN_DIRECT"?

PGSCAN_DIRECT is a page-based counter of the number of pages scanned. The
similar naming but very different meaning could be confusing to someone not
familiar with the counters. The event being counted here is the number of
times compaction happened, just as ALLOCSTALL counts the number of times
direct reclaim happened.

How about COMPACTSTALL like ALLOCSTALL? :/

> I think it's straightforward.
> Naming is important since it becomes ABI.
>
> > +       count_vm_event(COMPACTSTALL);
> > +
>
> --
> Kind regards,
> Minchan Kim
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab