From: Nick Piggin on
On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
> Under heavy memory pressure, the page allocator may call congestion_wait()
> to wait for IO congestion to clear or a timeout. This is not as sensible
> a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
> even congested as the pressure could have been due to a large number of
> SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
>
> At the point of congestion_wait(), the allocator is struggling to get the
> pages it needs and it should back off. This patch puts the allocator to sleep
> on a zone->pressure_wq for either a timeout or until a direct reclaimer or
> kswapd brings the zone over the low watermark, whichever happens first.
>
> Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
> ---
> include/linux/mmzone.h | 3 ++
> mm/internal.h | 4 +++
> mm/mmzone.c | 47 +++++++++++++++++++++++++++++++++++++++++++++
> mm/page_alloc.c | 50 +++++++++++++++++++++++++++++++++++++++++++----
> mm/vmscan.c | 2 +
> 5 files changed, 101 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30fe668..72465c1 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -398,6 +398,9 @@ struct zone {
> unsigned long wait_table_hash_nr_entries;
> unsigned long wait_table_bits;
>
> + /* queue for processes waiting for pressure to relieve */
> + wait_queue_head_t *pressure_wq;

Hmm, processes may be eligible to allocate from > 1 zone, but you
have them only waiting for one. I wonder if we shouldn't wait for
more zones?

Congestion waiting uses a global waitqueue, which hasn't seemed to
cause a big scalability problem. Would it be better to have a global
waitqueue for this too?


> +void check_zone_pressure(struct zone *zone)

I don't really like the name pressure. We use that term for the reclaim
pressure whereas we're just checking watermarks here (actual pressure
could be anything).


> +{
> + /* If no process is waiting, nothing to do */
> + if (!waitqueue_active(zone->pressure_wq))
> + return;
> +
> + /* Check if the high watermark is ok for order 0 */
> + if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
> + wake_up_interruptible(zone->pressure_wq);
> +}

If you were to do this under the zone lock (in your subsequent patch),
then it could avoid races. I would suggest doing it all as a single
patch and not doing the pressure checks in reclaim at all.

If you are missing anything, then that needs to be explained and fixed
rather than just adding extra checks.

> +
> +/**
> + * zonepressure_wait - Wait for pressure on a zone to ease off
> + * @zone: The zone that is expected to be under pressure
> + * @order: The order the caller is waiting on pages for
> + * @timeout: Wait until pressure is relieved or this timeout is reached
> + *
> + * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
> + * It's considered to be relieved if any direct reclaimer or kswapd brings
> + * the zone above the high watermark
> + */
> +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
> +{
> + long ret;
> + DEFINE_WAIT(wait);
> +
> +wait_again:
> + prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);

I guess to do it without races you need to check watermark here.
And possibly some barriers if it is done without zone->lock.

> +
> + /*
> + * The use of io_schedule_timeout() here means that it gets
> + * accounted for as IO waiting. This may or may not be the case
> + * but at least this way it gets picked up by vmstat
> + */
> + ret = io_schedule_timeout(timeout);
> + finish_wait(zone->pressure_wq, &wait);
> +
> + /* If woken early, check watermarks before continuing */
> + if (ret && !zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
> + timeout = ret;
> + goto wait_again;
> + }

And then I don't know if we'd really need the extra check here. Might as
well just let the allocator try again and avoid the code?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Nick Piggin on
On Tue, Mar 09, 2010 at 02:17:13PM +0000, Mel Gorman wrote:
> On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
> > On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
> > > Under heavy memory pressure, the page allocator may call congestion_wait()
> > > to wait for IO congestion to clear or a timeout. This is not as sensible
> > > a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
> > > even congested as the pressure could have been due to a large number of
> > > SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
> > >
> > > At the point of congestion_wait(), the allocator is struggling to get the
> > > pages it needs and it should back off. This patch puts the allocator to sleep
> > > on a zone->pressure_wq for either a timeout or until a direct reclaimer or
> > > kswapd brings the zone over the low watermark, whichever happens first.
> > >
> > > Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
> > > ---
> > > include/linux/mmzone.h | 3 ++
> > > mm/internal.h | 4 +++
> > > mm/mmzone.c | 47 +++++++++++++++++++++++++++++++++++++++++++++
> > > mm/page_alloc.c | 50 +++++++++++++++++++++++++++++++++++++++++++----
> > > mm/vmscan.c | 2 +
> > > 5 files changed, 101 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 30fe668..72465c1 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -398,6 +398,9 @@ struct zone {
> > > unsigned long wait_table_hash_nr_entries;
> > > unsigned long wait_table_bits;
> > >
> > > + /* queue for processes waiting for pressure to relieve */
> > > + wait_queue_head_t *pressure_wq;
> >
> > Hmm, processes may be eligible to allocate from > 1 zone, but you
> > have them only waiting for one. I wonder if we shouldn't wait for
> > more zones?
> >
>
> It's waiting for the zone that is most desirable. If that zone's watermarks
> are met, why would it wait on any other zone?

I mean the other way around. If that zone's watermarks are not met, then
why shouldn't it be woken up by other zones reaching their watermarks.


> If you mean that it would
> wait for any of the eligible zones to meet their watermark, it might have
> an impact on NUMA locality but it could be managed. It might make sense to
> wait on a node-based queue rather than a zone if this behaviour was desirable.
>
> > Congestion waiting uses a global waitqueue, which hasn't seemed to
> > cause a big scalability problem. Would it be better to have a global
> > waitqueue for this too?
> >
>
> Considering that the congestion wait queue is for a relatively slow operation,
> it would be surprising if lock scalability was noticeable. Potentially the
> pressure_wq involves no IO so scalability may be noticeable there.
>
> What would the advantages of a global waitqueue be? Obviously, a smaller
> memory footprint. A second potential advantage is that on wakeup, it
> could check the watermarks on multiple zones which might reduce
> latencies in some cases. Can you think of more compelling reasons?

Your 2nd advantage is what I mean above.


> >
> > > +void check_zone_pressure(struct zone *zone)
> >
> > I don't really like the name pressure. We use that term for the reclaim
> > pressure whereas we're just checking watermarks here (actual pressure
> > could be anything).
> >
>
> pressure_wq => watermark_wq
> check_zone_pressure => check_watermark_wq
>
> ?

Thanks.

>
> >
> > > +{
> > > + /* If no process is waiting, nothing to do */
> > > + if (!waitqueue_active(zone->pressure_wq))
> > > + return;
> > > +
> > > + /* Check if the high watermark is ok for order 0 */
> > > + if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
> > > + wake_up_interruptible(zone->pressure_wq);
> > > +}
> >
> > If you were to do this under the zone lock (in your subsequent patch),
> > then it could avoid races. I would suggest doing it all as a single
> > patch and not doing the pressure checks in reclaim at all.
> >
>
> That is reasonable. I've already dropped the checks in reclaim because as you
> say, if the free path check is cheap enough, it's also sufficient. Checking
> in the reclaim paths as well is redundant.
>
> I'll move the call to check_zone_pressure() within the zone lock to avoid
> races.
>
> > If you are missing anything, then that needs to be explained and fixed
> > rather than just adding extra checks.
> >
> > > +
> > > +/**
> > > + * zonepressure_wait - Wait for pressure on a zone to ease off
> > > + * @zone: The zone that is expected to be under pressure
> > > + * @order: The order the caller is waiting on pages for
> > > + * @timeout: Wait until pressure is relieved or this timeout is reached
> > > + *
> > > + * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
> > > + * It's considered to be relieved if any direct reclaimer or kswapd brings
> > > + * the zone above the high watermark
> > > + */
> > > +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
> > > +{
> > > + long ret;
> > > + DEFINE_WAIT(wait);
> > > +
> > > +wait_again:
> > > + prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);
> >
> > I guess to do it without races you need to check watermark here.
> > And possibly some barriers if it is done without zone->lock.
> >
>
> As watermark checks are already done without the zone->lock and without
> barriers, why are they needed here? Yes, there are small races. For
> example, it's possible to hit a window where pages were freed between
> the time the watermarks were checked and the time we went to sleep, but
> that is similar to current behaviour.

Well, with the check in free_pages_bulk, doing another check here
before the wait should be able to close all lost-wakeup races. I agree
it is pretty fuzzy heuristics anyway, so these races don't *really*
matter a lot. But it seems easy to close the races, so I don't see
why not.


> > > +
> > > + /*
> > > + * The use of io_schedule_timeout() here means that it gets
> > > + * accounted for as IO waiting. This may or may not be the case
> > > + * but at least this way it gets picked up by vmstat
> > > + */
> > > + ret = io_schedule_timeout(timeout);
> > > + finish_wait(zone->pressure_wq, &wait);
> > > +
> > > + /* If woken early, check watermarks before continuing */
> > > + if (ret && !zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
> > > + timeout = ret;
> > > + goto wait_again;
> > > + }
> >
> > And then I don't know if we'd really need the extra check here. Might as
> > well just let the allocator try again and avoid the code?
> >
>
> I was considering multiple processes being woken up and racing with each
> other. I can drop this check though. The worst that happens is multiple
> processes wake and walk the full zonelist. Some will succeed and others
> will go back to sleep.

Yep. And it doesn't really solve that race either because the zone
might subsequently go below the watermark.

Thanks,
Nick

From: Christoph Lameter on
On Mon, 8 Mar 2010, Mel Gorman wrote:

> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30fe668..72465c1 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -398,6 +398,9 @@ struct zone {
> unsigned long wait_table_hash_nr_entries;
> unsigned long wait_table_bits;
>
> + /* queue for processes waiting for pressure to relieve */
> + wait_queue_head_t *pressure_wq;
> +
> /*

The waitqueue is in a zone? But allocation occurs by scanning a
list of possible zones.

> +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)

So zone specific.

>
> - if (!page && gfp_mask & __GFP_NOFAIL)
> - congestion_wait(BLK_RW_ASYNC, HZ/50);
> + if (!page && gfp_mask & __GFP_NOFAIL) {
> + /* If still failing, wait for pressure on zone to relieve */
> + zonepressure_wait(preferred_zone, order, HZ/50);

The first zone is special therefore...

What happens if memory becomes available in another zone? Lets say we are
waiting on HIGHMEM and memory in ZONE_NORMAL becomes available?
From: Christoph Lameter on
On Tue, 9 Mar 2010, Christian Ehrhardt wrote:

> > What happens if memory becomes available in another zone? Lets say we are
> > waiting on HIGHMEM and memory in ZONE_NORMAL becomes available?
>
> Do you mean the same as Nick asked or another aspect of it?
> citation:
> "I mean the other way around. If that zone's watermarks are not met, then why
> shouldn't it be woken up by other zones reaching their watermarks."

Just saw that exchange. Yes, it is similar. Mel only considered NUMA,
but the situation can also occur on !NUMA configurations because
multiple zones do not require NUMA.

If a process goes to sleep on an allocation that has a preferred zone of
HIGHMEM then other processors may free up memory in ZONE_DMA and
ZONE_NORMAL and therefore memory may become available but the process will
continue to sleep.

The wait structure needs to be placed in the pgdat structure to make it
node specific.

But then an overallocated node may stall processes. If that node is full
of unreclaimable memory then the process may never wake up?



From: Christoph Lameter on
On Tue, 9 Mar 2010, Mel Gorman wrote:

> Until it's timeout at least. It's still better than the current
> situation of sleeping on congestion.

Congestion may clear if memory becomes available in other zones.

> The ideal would be waiting on a per-node basis. I'm just not liking having
> to look up the node structure when freeing a batch of pages and making a
> cache line in there unnecessarily hot.

The node structure (pgdat) contains the zone structures. If you know the
type of zone then you can calculate the pgdat address.

> > But then an overallocated node may stall processes. If that node is full
> > of unreclaimable memory then the process may never wake up?
> Processes wake after a timeout.

Ok, that limits it, but we may still be waiting for no reason.
