From: Wu Fengguang on
KOSAKI,

> My review didn't find any bug. However, I think the original thread
> involves too much guessing, and we need a way to reproduce the issue
> and confirm it.
>
> At least, we need three confirmations:
> o is the original issue still there?

As long as the root cause is still there :)

> o is DEF_PRIORITY/3 the best value?

There is no best value. I suspect the whole PAGEOUT_IO_SYNC and
wait_on_page_writeback() approach is a terrible workaround and should
be avoided as much as possible. This is why I raised the bar from
DEF_PRIORITY/2 to DEF_PRIORITY/3.

wait_on_page_writeback() is bad because on a typical desktop, one
single call may block for 1-10 seconds (remember we are under memory
pressure, which is almost always accompanied by busy disk IO, so the
page will wait a noticeable time in the IO queue). To make it worse,
it is quite possible that there are 10 or more dirty/writeback pages
among the isolated pages (dirty pages are often clustered). That adds
up to a 10-100 second stall.

We do need some throttling under memory pressure. However, stall
times of more than 1s are not acceptable. A simple congestion_wait()
may be better, since it waits on _any_ IO completion (which will
likely release a whole set of PG_reclaim pages) rather than on one
specific IO completion. That makes for much smoother stall times.
wait_on_page_writeback() should really be the last resort.
DEF_PRIORITY/3 means 1/16 = 6.25% of the LRU, which is closer to a
last resort.
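
To illustrate what I mean (a rough sketch only, not a tested patch):
the PageWriteback() check in shrink_page_list() could throttle on
congestion instead of waiting for this one specific page:

	if (PageWriteback(page)) {
		/*
		 * Sketch: under PAGEOUT_IO_SYNC, wait for any queue
		 * to exit congestion (bounded by the timeout), by
		 * which time a whole batch of PG_reclaim pages has
		 * likely completed, then give up on this page rather
		 * than sleeping on its specific IO.
		 */
		if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
			congestion_wait(BLK_RW_ASYNC, HZ/10);
		goto keep_locked;
	}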

Since dirty/writeback pages are such a bad factor under memory
pressure, it may be worth adaptively shrinking the dirty limit as
well. When short on memory, why not reduce the dirty/writeback page
cache? Those pages not only consume memory; reducing them would also
considerably improve IO efficiency and responsiveness. When the LRU
lists are scanned fast (under memory pressure), it is likely that
lots of the dirty pages are caught by pageout(). Reducing the number
of dirty pages reduces the number of pageout() invocations.
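
A hypothetical sketch of the idea (the helper and the scaling are
made up, not existing kernel code):

	/*
	 * Hypothetical: scale the dirty threshold down linearly as
	 * reclaim priority drops, but never below 1/4 of the
	 * configured value.  At priority == DEF_PRIORITY (no memory
	 * pressure) the threshold is unchanged.
	 */
	static unsigned long scaled_dirty_thresh(unsigned long thresh,
						 int priority)
	{
		unsigned long scaled = thresh * priority / DEF_PRIORITY;

		return max(scaled, thresh / 4);
	}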

> o does the current approach perform better than Wu's original proposal? (below)

I guess it will give a better user experience :)

> Anyway, please feel free to use my reviewed-by tag.

Thanks,
Fengguang

> --- linux-next.orig/mm/vmscan.c	2010-06-24 14:32:03.000000000 +0800
> +++ linux-next/mm/vmscan.c	2010-07-22 16:12:34.000000000 +0800
> @@ -1650,7 +1650,7 @@ static void set_lumpy_reclaim_mode(int p
>  	 */
>  	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
>  		sc->lumpy_reclaim_mode = 1;
> -	else if (sc->order && priority < DEF_PRIORITY - 2)
> +	else if (sc->order && priority < DEF_PRIORITY / 2)
>  		sc->lumpy_reclaim_mode = 1;
>  	else
>  		sc->lumpy_reclaim_mode = 0;
From: Wu Fengguang on
On Mon, Jul 26, 2010 at 11:11:37AM +0800, Rik van Riel wrote:
> On 07/25/2010 11:08 PM, Wu Fengguang wrote:
>
> > We do need some throttling under memory pressure. However, stall
> > times of more than 1s are not acceptable. A simple congestion_wait()
> > may be better, since it waits on _any_ IO completion (which will
> > likely release a whole set of PG_reclaim pages) rather than on one
> > specific IO completion. That makes for much smoother stall times.
> > wait_on_page_writeback() should really be the last resort.
> > DEF_PRIORITY/3 means 1/16 = 6.25% of the LRU, which is closer to a
> > last resort.
>
> I agree with the max 1 second stall time, but 6.25% of
> memory could be an awful lot of pages to scan on a system
> with 1TB of memory :)

I deliberately left the 1TB systems out of this topic, because on
such systems, <PAGE_ALLOC_COSTLY_ORDER pages are easily available? :)

> Not sure what the best approach is, just pointing out
> that DEF_PRIORITY/3 may be too much for large systems...

What if DEF_PRIORITY/3 is only applied at or below PAGE_ALLOC_COSTLY_ORDER?
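
Something like this, say (a sketch against the set_lumpy_reclaim_mode()
code quoted earlier; order > PAGE_ALLOC_COSTLY_ORDER already goes lumpy
unconditionally, so the 6.25% window only ever matters for small orders):

	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
		sc->lumpy_reclaim_mode = 1;
	else if (sc->order && priority < DEF_PRIORITY / 3)
		/* small orders only: on a 1TB box these allocations
		 * rarely push priority this low in the first place */
		sc->lumpy_reclaim_mode = 1;
	else
		sc->lumpy_reclaim_mode = 0;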

Thanks,
Fengguang
From: Wu Fengguang on
On Sun, Jul 25, 2010 at 08:03:45PM +0800, Minchan Kim wrote:
> On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote:
> > Hi
> >
> > sorry for the delay.
> >
> > > Will you be picking it up or should I? The changelog should be more or less
> > > the same as yours and consider it
> > >
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > >
> > > It'd be nice if the original tester is still knocking around and willing
> > > to confirm the patch resolves his/her problem. I am running this patch on
> > > my desktop at the moment and it does feel a little smoother but it might be
> > > my imagination. I had trouble with odd stalls that I never pinned down and
> > > was attributing to the machine being commonly heavily loaded but I haven't
> > > noticed them today.
> > >
> > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
> > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
> > > should use PAGEOUT_IO_SYNC]
> >
> > My review didn't find any bug. However, I think the original thread
> > involves too much guessing, and we need a way to reproduce the issue
> > and confirm it.
> >
> > At least, we need three confirmations:
> > o is the original issue still there?
> > o is DEF_PRIORITY/3 the best value?
>
> I agree. Wu, how did you determine DEF_PRIORITY/3 of the LRU?
> I guess the system has 512M and 22M writeback pages.
> So you may have chosen it to skip at most 32M of writeback pages.
> Is that right?

For 512M of memory, DEF_PRIORITY/3 means 32M of dirty _or_ writeback
pages. shrink_inactive_list() first calls
shrink_page_list(PAGEOUT_IO_ASYNC) and then optionally
shrink_page_list(PAGEOUT_IO_SYNC), so dirty pages will first be
converted to writeback pages and then optionally be waited on.

The dirty/writeback pages may go up to 512M*20% = 100M, so 32M looks
like a reasonable value.
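
Roughly like this (a simplified sketch, omitting the kswapd check and
the active-page bookkeeping; not the exact code):

	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);

	if (nr_reclaimed < nr_taken && sc->lumpy_reclaim_mode) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* dirty pages queued by the async pass above are now
		 * under writeback; this sync pass waits on them via
		 * wait_on_page_writeback() */
		nr_reclaimed += shrink_page_list(&page_list, sc,
						 PAGEOUT_IO_SYNC);
	}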

> And I have a question about your comment below.
>
> "As the default dirty throttle ratio is 20%, sync write&wait
> will hardly be triggered by pure dirty pages"
>
> I am not sure exactly what you mean, but at least DEF_PRIORITY/3
> seems to be related to dirty_ratio, which can always be changed by
> the admin. Then shouldn't the magic value (DEF_PRIORITY/3) be
> determined proportionally to dirty_ratio?

Yes, DEF_PRIORITY/3 is already proportional to the _default_
dirty_ratio. We could do an explicit comparison with dirty_ratio
just in case dirty_ratio gets changed by the user. It's mainly a
question of whether that is worth the extra overhead and complexity.
I'd prefer to keep the current simple form :)
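
If we did want the explicit form, it could look something like this
(a hypothetical sketch; the 320 scaling constant is made up, chosen
only so that the default vm_dirty_ratio = 20 maps back to priority 4,
i.e. a 1/16 scan window):

	/*
	 * Hypothetical: derive the sync-reclaim priority threshold
	 * from the live dirty_ratio instead of hardcoding it.
	 * 2^-priority ~= vm_dirty_ratio / 320, so a 20% dirty ratio
	 * gives ilog2(320 / 20) = ilog2(16) = 4.
	 */
	static int sync_reclaim_priority(void)
	{
		return ilog2(320 / max(vm_dirty_ratio, 1));
	}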

Thanks,
Fengguang
From: Wu Fengguang on
On Mon, Jul 26, 2010 at 12:11:59PM +0800, Minchan Kim wrote:
> On Mon, Jul 26, 2010 at 12:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > On Sun, Jul 25, 2010 at 08:03:45PM +0800, Minchan Kim wrote:
> >> On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote:
> >> > Hi
> >> >
> >> > sorry for the delay.
> >> >
> >> > > Will you be picking it up or should I? The changelog should be more or less
> >> > > the same as yours and consider it
> >> > >
> >> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >> > >
> >> > > It'd be nice if the original tester is still knocking around and willing
> >> > > to confirm the patch resolves his/her problem. I am running this patch on
> >> > > my desktop at the moment and it does feel a little smoother but it might be
> >> > > my imagination. I had trouble with odd stalls that I never pinned down and
> >> > > was attributing to the machine being commonly heavily loaded but I haven't
> >> > > noticed them today.
> >> > >
> >> > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
> >> > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
> >> > > should use PAGEOUT_IO_SYNC]
> >> >
> >> > My review didn't find any bug. However, I think the original thread
> >> > involves too much guessing, and we need a way to reproduce the issue
> >> > and confirm it.
> >> >
> >> > At least, we need three confirmations:
> >> >  o is the original issue still there?
> >> >  o is DEF_PRIORITY/3 the best value?
> >>
> >> I agree. Wu, how did you determine DEF_PRIORITY/3 of the LRU?
> >> I guess the system has 512M and 22M writeback pages.
> >> So you may have chosen it to skip at most 32M of writeback pages.
> >> Is that right?
> >
> > For 512M of memory, DEF_PRIORITY/3 means 32M of dirty _or_ writeback
> > pages. shrink_inactive_list() first calls
> > shrink_page_list(PAGEOUT_IO_ASYNC) and then optionally
> > shrink_page_list(PAGEOUT_IO_SYNC), so dirty pages will first be
> > converted to writeback pages and then optionally be waited on.
> >
> > The dirty/writeback pages may go up to 512M*20% = 100M, so 32M looks
> > like a reasonable value.
>
> Why do you think it's a reasonable value?
> I mean, why isn't 12.5% or 3.125% just as good? Why did you select 6.25%?
> I am not against you; I'm just curious, and it needs more explanation.
> It might be something _only I_ don't know. :(

It's more or less randomly selected. I'm also OK with 3.125%. It's a
threshold to turn on a _last resort_ mechanism, so it doesn't need to
be optimal..

> >
> >> And I have a question about your comment below.
> >>
> >> "As the default dirty throttle ratio is 20%, sync write&wait
> >> will hardly be triggered by pure dirty pages"
> >>
> >> I am not sure exactly what you mean, but at least DEF_PRIORITY/3
> >> seems to be related to dirty_ratio, which can always be changed by
> >> the admin. Then shouldn't the magic value (DEF_PRIORITY/3) be
> >> determined proportionally to dirty_ratio?
> >
> > Yes, DEF_PRIORITY/3 is already proportional to the _default_
> > dirty_ratio. We could do an explicit comparison with dirty_ratio
> > just in case dirty_ratio gets changed by the user. It's mainly a
> > question of whether that is worth the extra overhead and complexity.
> > I'd prefer to keep the current simple form :)
>
> What I suggest is: couldn't we use a recent_writeback/recent_scanned ratio?
> I think a new scan_control field and the counting wouldn't add much
> overhead or complexity.
> I am not sure which ratio is best, but at least it would make the
> logic scalable, and it makes sense to me. :)

...and it doesn't need to be elaborate :)
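
For reference, I read your suggestion as something like the following
(a rough sketch; the scan_control fields are made up):

	/* hypothetical fields in struct scan_control */
	unsigned long recent_scanned;	/* pages seen by shrink_page_list() */
	unsigned long recent_writeback;	/* of those, found under writeback */

	/* the sync-reclaim decision could then track the observed
	 * ratio instead of a fixed priority level: */
	if (sc->recent_scanned &&
	    sc->recent_writeback * 8 > sc->recent_scanned)
		sc->lumpy_reclaim_mode = 1;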

Thanks,
Fengguang
From: Minchan Kim on
On Mon, Jul 26, 2010 at 12:37:09PM +0800, Wu Fengguang wrote:
> On Mon, Jul 26, 2010 at 12:11:59PM +0800, Minchan Kim wrote:
> > On Mon, Jul 26, 2010 at 12:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > On Sun, Jul 25, 2010 at 08:03:45PM +0800, Minchan Kim wrote:
> > >> On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote:
> > >> > Hi
> > >> >
> > >> > sorry for the delay.
> > >> >
> > >> > > Will you be picking it up or should I? The changelog should be more or less
> > >> > > the same as yours and consider it
> > >> > >
> > >> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > >> > >
> > >> > > It'd be nice if the original tester is still knocking around and willing
> > >> > > to confirm the patch resolves his/her problem. I am running this patch on
> > >> > > my desktop at the moment and it does feel a little smoother but it might be
> > >> > > my imagination. I had trouble with odd stalls that I never pinned down and
> > >> > > was attributing to the machine being commonly heavily loaded but I haven't
> > >> > > noticed them today.
> > >> > >
> > >> > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters
> > >> > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also
> > >> > > should use PAGEOUT_IO_SYNC]
> > >> >
> > >> > My review didn't find any bug. However, I think the original thread
> > >> > involves too much guessing, and we need a way to reproduce the issue
> > >> > and confirm it.
> > >> >
> > >> > At least, we need three confirmations:
> > >> >  o is the original issue still there?
> > >> >  o is DEF_PRIORITY/3 the best value?
> > >>
> > >> I agree. Wu, how did you determine DEF_PRIORITY/3 of the LRU?
> > >> I guess the system has 512M and 22M writeback pages.
> > >> So you may have chosen it to skip at most 32M of writeback pages.
> > >> Is that right?
> > >
> > > For 512M of memory, DEF_PRIORITY/3 means 32M of dirty _or_ writeback
> > > pages. shrink_inactive_list() first calls
> > > shrink_page_list(PAGEOUT_IO_ASYNC) and then optionally
> > > shrink_page_list(PAGEOUT_IO_SYNC), so dirty pages will first be
> > > converted to writeback pages and then optionally be waited on.
> > >
> > > The dirty/writeback pages may go up to 512M*20% = 100M, so 32M looks
> > > like a reasonable value.
> >
> > Why do you think it's a reasonable value?
> > I mean, why isn't 12.5% or 3.125% just as good? Why did you select 6.25%?
> > I am not against you; I'm just curious, and it needs more explanation.
> > It might be something _only I_ don't know. :(
>
> It's more or less randomly selected. I'm also OK with 3.125%. It's a
> threshold to turn on a _last resort_ mechanism, so it doesn't need to
> be optimal..

Okay. The reason I asked is that I don't want to add a new magic
value to the VM without a detailed comment.
While reviewing the source code, I always suffer from such values. :(
Now we have a great tool called 'git'.
When we add a new magic value, please write down in detail why that
number was selected. :)

Thanks, Wu.

--
Kind regards,
Minchan Kim