superblock: introduce per-sb cache shrinker infrastructure [Kernel]

Prev: ath: Failed to stop TX DMA in 100 msec after killing last frame
Next: [GIT PULL] LED updates for 2.6.35

From: Dave Chinner on 27 May 2010 18:50

On Thu, May 27, 2010 at 04:35:23PM +1000, Nick Piggin wrote:
> On Tue, May 25, 2010 at 06:53:06PM +1000, Dave Chinner wrote:
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -37,6 +37,50 @@
> > LIST_HEAD(super_blocks);
> > DEFINE_SPINLOCK(sb_lock);
> >
> > +static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
> > +{
> > + struct super_block *sb;
> > + int count;
> > +
> > + sb = container_of(shrink, struct super_block, s_shrink);
> > +
> > + /*
> > + * Deadlock avoidance. We may hold various FS locks, and we don't want
> > + * to recurse into the FS that called us in clear_inode() and friends..
> > + */
> > + if (!(gfp_mask & __GFP_FS))
> > + return -1;
> > +
> > + /*
> > + * if we can't get the umount lock, then there's no point having the
> > + * shrinker try again because the sb is being torn down.
> > + */
> > + if (!down_read_trylock(&sb->s_umount))
> > + return -1;
> > +
> > + if (!sb->s_root) {
> > + up_read(&sb->s_umount);
> > + return -1;
> > + }
> > +
> > + if (nr_to_scan) {
> > + /* proportion the scan between the two cacheѕ */
> > + int total;
> > +
> > + total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
> > + count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
> > +
> > + /* prune dcache first as icache is pinned by it */
> > + prune_dcache_sb(sb, count);
> > + prune_icache_sb(sb, nr_to_scan - count);
>
> Hmm, an interesting dynamic that you've changed is that previously
> we'd scan dcache LRU proportionately to pagecache, and then scan
> inode LRU in proportion to the current number of unused inodes.
>
> But we can think of inodes that are only in use by unused (and aged)
> dentries as effectively unused themselves. So this sequence under
> estimates how many inodes to scan. This could bias pressure against
> dcache I'd think, especially considering inodes are far larger than
> dentries. Maybe require 2 passes to get the inodes unused inthe
> first pass.

It's self-balancing - it trends towards an equal number of unused
dentries and inodes in the caches. Yes, it will tear down more
dentries at first, but we need to do that to be able to reclaim
inodes. Ås reclaim progresses the propotion of inodes increases, so
the amount of inodes reclaimed increases.

Basically this is a recognition that the important cache for
avoiding IO is the inode cache, not he dentry cache. Once the inode
cache is freed that we need to do IO to repopulate it, but
rebuilding dentries fromteh inode cache only costs CPU time. Hence
under light reclaim, inodes are mostly left in cache but we free up
memory that only costs CPU to rebuild. Under heavy, sustained
reclaim, we trend towards freeing equal amounts of objects from both
caches.

This is pretty much what the current code attempts to do - free a
lot of dentries, then free a smaller amount of the inodes that were
used by the freed dentries. Once again it is a direct encoding of
what is currently an implicit design feature - it makes it *obvious*
how we are trying to balance the caches.

Another reason for this is that the calculation changes again to
allow filesystem caches to modiy this proportioning in the next
patch....

FWIW, this also makes workloads that generate hundreds of thousands
of never-to-be-used again negative dentries free dcache memory really
quickly on memory pressure...

> Part of the problem is the funny shrinker API.
>
> The right way to do it is to change the shrinker API so that it passes
> down the lru_pages and scanned into the callback. From there, the
> shrinkers can calculate the appropriate ratio of objects to scan.
> No need for 2-call scheme, no need for shrinker->seeks, and the
> ability to calculate an appropriate ratio first for dcache, and *then*
> for icache.

My only concern about this is that exposes the inner workings of the
shrinker and mm subsystem to code that simply doesn't need to know
about it.

> A helper of course can do the calculation (considering that every
> driver and their dog will do the wrong thing if we let them :)).
>
> unsigned long shrinker_scan(unsigned long lru_pages,
> unsigned long lru_scanned,
> unsigned long nr_objects,
> unsigned long scan_ratio)
> {
> unsigned long long tmp = nr_objects;
>
> tmp *= lru_scanned * 100;
> do_div(tmp, (lru_pages * scan_ratio) + 1);
>
> return (unsigned long)tmp;
> }
>
> Then the shrinker callback will go:
> sb->s_nr_dentry_scan += shrinker_scan(lru_pages, lru_scanned,
> sb->s_nr_dentry_unused,
> vfs_cache_pressure * SEEKS_PER_DENTRY);
> if (sb->s_nr_dentry_scan > SHRINK_BATCH)
> prune_dcache()
>
> sb->s_nr_inode_scan += shrinker_scan(lru_pages, lru_scanned,
> sb->s_nr_inodes_unused,
> vfs_cache_pressure * SEEKS_PER_INODE);
> ...
>
> What do you think of that? Seeing as we're changing the shrinker API
> anyway, I'd think it is high time to do somthing like this.

Ignoring the dcache/icache reclaim ratio issues, I'd prefer a two
call API that matches the current behaviour, leaving the caclulation
of how much to reclaim in shrink_slab(). Encoding it this way makes
it more difficult to change the high level behaviour e.g. if we want
to modify the amount of slab reclaim based on reclaim priority, we'd
have to cahnge every shrinker instead of just shrink_slab().

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Dave Chinner on 27 May 2010 19:10

On Thu, May 27, 2010 at 01:32:34PM -0700, Andrew Morton wrote:
> On Tue, 25 May 2010 18:53:06 +1000
> Dave Chinner <david(a)fromorbit.com> wrote:
>
> > From: Dave Chinner <dchinner(a)redhat.com>
> >
> > With context based shrinkers, we can implement a per-superblock
> > shrinker that shrinks the caches attached to the superblock. We
> > currently have global shrinkers for the inode and dentry caches that
> > split up into per-superblock operations via a coarse proportioning
> > method that does not batch very well. The global shrinkers also
> > have a dependency - dentries pin inodes - so we have to be very
> > careful about how we register the global shrinkers so that the
> > implicit call order is always correct.
> >
> > With a per-sb shrinker callout, we can encode this dependency
> > directly into the per-sb shrinker, hence avoiding the need for
> > strictly ordering shrinker registrations. We also have no need for
> > any proportioning code for the shrinker subsystem already provides
> > this functionality across all shrinkers. Allowing the shrinker to
> > operate on a single superblock at a time means that we do less
> > superblock list traversals and locking and reclaim should batch more
> > effectively. This should result in less CPU overhead for reclaim and
> > potentially faster reclaim of items from each filesystem.
> >
>
> I go all tingly when a changelog contains the word "should".
>
> OK, it _should_ do X. But _does_ it actually do X?

As i said to Nick - the tests I ran showed an average improvement of
5% but the accuracy of the benchmark was +/-10%. Hence it's hard to
draw any conclusive results from that. It appears to be slightly
faster on an otherwise idle system, but...

As it is, the XFS shrinker that gets integrated into this structure
in a later patch peaks at a higher rate - 150k inodes/s vs 90k
inodes/s with the current shrinker - but still it's hard to quantify
qualitatively. I'm going to run more benchmarks to try to get better
numbers.

> > fs/super.c | 53 +++++++++++++++++++++
> > include/linux/fs.h | 7 +++
> > 4 files changed, 88 insertions(+), 214 deletions(-)
> >
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index dba6b6d..d7bd781 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -456,21 +456,16 @@ static void prune_one_dentry(struct dentry * dentry)
> > * which flags are set. This means we don't need to maintain multiple
> > * similar copies of this loop.
> > */
> > -static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
> > +static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)
>
> Forgot to update the kerneldoc description of `count'.

Will fix.

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Nick Piggin on 28 May 2010 01:20

On Fri, May 28, 2010 at 08:40:34AM +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 04:35:23PM +1000, Nick Piggin wrote:
> > But we can think of inodes that are only in use by unused (and aged)
> > dentries as effectively unused themselves. So this sequence under
> > estimates how many inodes to scan. This could bias pressure against
> > dcache I'd think, especially considering inodes are far larger than
> > dentries. Maybe require 2 passes to get the inodes unused inthe
> > first pass.
>
> It's self-balancing - it trends towards an equal number of unused
> dentries and inodes in the caches. Yes, it will tear down more
> dentries at first, but we need to do that to be able to reclaim
> inodes.

But then it doesn't scan enough inodes on the inode pass.

> Ås reclaim progresses the propotion of inodes increases, so
> the amount of inodes reclaimed increases.
>
> Basically this is a recognition that the important cache for
> avoiding IO is the inode cache, not he dentry cache. Once the inode

You can bias against the dcache using multipliers.

> cache is freed that we need to do IO to repopulate it, but
> rebuilding dentries fromteh inode cache only costs CPU time. Hence
> under light reclaim, inodes are mostly left in cache but we free up
> memory that only costs CPU to rebuild. Under heavy, sustained
> reclaim, we trend towards freeing equal amounts of objects from both
> caches.

I don't know if you've got numbers or patterns to justify that.
My point is that things should stay as close to the old code as
possible without good reason.

> This is pretty much what the current code attempts to do - free a
> lot of dentries, then free a smaller amount of the inodes that were
> used by the freed dentries. Once again it is a direct encoding of
> what is currently an implicit design feature - it makes it *obvious*
> how we are trying to balance the caches.

With your patches, if there are no inodes free you would need to take
2 passes at freeing the dentry cache. My suggestion is closer to the
current code.

> Another reason for this is that the calculation changes again to
> allow filesystem caches to modiy this proportioning in the next
> patch....
>
> FWIW, this also makes workloads that generate hundreds of thousands
> of never-to-be-used again negative dentries free dcache memory really
> quickly on memory pressure...

That would still be the case because used inodes aren't getting their
dentries freed so little inode scanning will occur.

>
> > Part of the problem is the funny shrinker API.
> >
> > The right way to do it is to change the shrinker API so that it passes
> > down the lru_pages and scanned into the callback. From there, the
> > shrinkers can calculate the appropriate ratio of objects to scan.
> > No need for 2-call scheme, no need for shrinker->seeks, and the
> > ability to calculate an appropriate ratio first for dcache, and *then*
> > for icache.
>
> My only concern about this is that exposes the inner workings of the
> shrinker and mm subsystem to code that simply doesn't need to know
> about it.

It's just providing a ratio. The shrinkers allready know they are
scanning based on a ratio of pagecache scanned.

> > A helper of course can do the calculation (considering that every
> > driver and their dog will do the wrong thing if we let them :)).
> >
> > unsigned long shrinker_scan(unsigned long lru_pages,
> > unsigned long lru_scanned,
> > unsigned long nr_objects,
> > unsigned long scan_ratio)
> > {
> > unsigned long long tmp = nr_objects;
> >
> > tmp *= lru_scanned * 100;
> > do_div(tmp, (lru_pages * scan_ratio) + 1);
> >
> > return (unsigned long)tmp;
> > }
> >
> > Then the shrinker callback will go:
> > sb->s_nr_dentry_scan += shrinker_scan(lru_pages, lru_scanned,
> > sb->s_nr_dentry_unused,
> > vfs_cache_pressure * SEEKS_PER_DENTRY);
> > if (sb->s_nr_dentry_scan > SHRINK_BATCH)
> > prune_dcache()
> >
> > sb->s_nr_inode_scan += shrinker_scan(lru_pages, lru_scanned,
> > sb->s_nr_inodes_unused,
> > vfs_cache_pressure * SEEKS_PER_INODE);
> > ...
> >
> > What do you think of that? Seeing as we're changing the shrinker API
> > anyway, I'd think it is high time to do somthing like this.
>
> Ignoring the dcache/icache reclaim ratio issues, I'd prefer a two

Well if it is an issue, it should be changed in a different patch
I think (with numbers).

> call API that matches the current behaviour, leaving the caclulation
> of how much to reclaim in shrink_slab(). Encoding it this way makes
> it more difficult to change the high level behaviour e.g. if we want
> to modify the amount of slab reclaim based on reclaim priority, we'd
> have to cahnge every shrinker instead of just shrink_slab().

We can modifiy the ratios before calling if needed, or have a default
ratio define to multiply with as well.

But shrinkers are very subsystem specific.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Dave Chinner on 31 May 2010 02:50

On Fri, May 28, 2010 at 03:19:24PM +1000, Nick Piggin wrote:
> On Fri, May 28, 2010 at 08:40:34AM +1000, Dave Chinner wrote:
> > On Thu, May 27, 2010 at 04:35:23PM +1000, Nick Piggin wrote:
> > > But we can think of inodes that are only in use by unused (and aged)
> > > dentries as effectively unused themselves. So this sequence under
> > > estimates how many inodes to scan. This could bias pressure against
> > > dcache I'd think, especially considering inodes are far larger than
> > > dentries. Maybe require 2 passes to get the inodes unused inthe
> > > first pass.
> >
> > It's self-balancing - it trends towards an equal number of unused
> > dentries and inodes in the caches. Yes, it will tear down more
> > dentries at first, but we need to do that to be able to reclaim
> > inodes.
>
> But then it doesn't scan enough inodes on the inode pass.

We don't get a single shrinker call - we get batches of them fo each
shrinker. The shrinker determines how many objects to scan in the
cache, and that only changes for each shrinker to call based on the
the shrinker->seeks and the number of objects in the cache. Once
the number is decided, then the shrinker gets batches of reclaim to
operate on.

Fundamentally, the current shrinker will ask for the same percentage
of each cache to be scanned - if it decides to scan 20% of the
dentry cache, it will also decide to scan 20% of the inode cache
Hence what the inode shrinker is doing is scanning 20% of the inodes
freed by the dcache shrinker.

In rough numbers, say we have 100k dentries, and the shrinker
calculates it needs to scan 20% of the caches to reclaim them, the
current code will end up with:

unused dentries unused inodes
before 100k 0
after dentry 80k 20k
after inode 80k 16k

So we get 20k dentries freed and 4k inodes freed on that shrink_slab
pass.

To contrast this against the code I proposed, I'll make a couple of
simplicfications to avoid hurting my brain. That is, I'll assume
SHRINK_BATCH=100 (rather than 128) and forgetting about rounding
errors. With this, the algorithm I encoded gives roughly the
following for a 20% object reclaim:

number of batches = 20k / 100 = 200

Unused
dentries+inodes dentries inodes
before 100k 100k 0
batch 1 100k 99900 100
batch 2 100k 99800 200
.....
batch 10 100k 99000 1000
batch 20 99990 98010 1990
batch 30 99980 97030 2950
batch 50 99910 95100 4810
batch 60 99860 94150 5710
......
batch 200 98100 81900 16200

And so (roughly) we see that the number of inodes being reclaim per
set of 10 batches roughly equals the (batch number - 10). Hence over
200 batches, we can expect to see roughly 190 + 180 + ... + 10
inodes reclaimed. That is 1900 inodes. Similarly for dentries, we
get roughly 1000 + 990 + 980 + ... 810 dentries reclaimed - 18,100
in total.

In other words, we have roughly 18k dentries and 1.9k inodes
reclaimed for the code I wrote new algorithm. That does mean it
initially attempts to reclaim dentries faster than the current code, but
as the number of unused inodes increases, this comes back to parity
with the current code and we end up with a 1:1 reclaim ratio.

This is good behaviour - dentries are cheap to reconstruct from the
inode cache, and we should hold onto the inode cache as much as
possible. i.e. we should reclaim them more aggressively only if
there is sustained pressure on the superblock and that is what the
above algorithm does.

> > Ås reclaim progresses the propotion of inodes increases, so
> > the amount of inodes reclaimed increases.
> >
> > Basically this is a recognition that the important cache for
> > avoiding IO is the inode cache, not he dentry cache. Once the inode
>
> You can bias against the dcache using multipliers.

Multipliers are not self-balancing, and generally just amplify any
imbalance an algorithm tends towards. The vfs_cache_pressure
multiplier is a shining example of this kind of utterly useless
knob...

> > > Part of the problem is the funny shrinker API.
> > >
> > > The right way to do it is to change the shrinker API so that it passes
> > > down the lru_pages and scanned into the callback. From there, the
> > > shrinkers can calculate the appropriate ratio of objects to scan.
> > > No need for 2-call scheme, no need for shrinker->seeks, and the
> > > ability to calculate an appropriate ratio first for dcache, and *then*
> > > for icache.
> >
> > My only concern about this is that exposes the inner workings of the
> > shrinker and mm subsystem to code that simply doesn't need to know
> > about it.
>
> It's just providing a ratio. The shrinkers allready know they are
> scanning based on a ratio of pagecache scanned.

Sure, but the shrinkers are just a simple mechanism for implementing
VM policy decisions. IMO reclaim policy decisions should not be
pushed down and replicated in every one of these reclaim mechanisms.

> But shrinkers are very subsystem specific.

And as such should concentrate on getting their subsystem reclaim
correct, not have to worry about implementing VM policy
calculations...

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Nick Piggin on 31 May 2010 03:30

On Mon, May 31, 2010 at 04:39:38PM +1000, Dave Chinner wrote:
> On Fri, May 28, 2010 at 03:19:24PM +1000, Nick Piggin wrote:
> > On Fri, May 28, 2010 at 08:40:34AM +1000, Dave Chinner wrote:
> > > On Thu, May 27, 2010 at 04:35:23PM +1000, Nick Piggin wrote:
> > > > But we can think of inodes that are only in use by unused (and aged)
> > > > dentries as effectively unused themselves. So this sequence under
> > > > estimates how many inodes to scan. This could bias pressure against
> > > > dcache I'd think, especially considering inodes are far larger than
> > > > dentries. Maybe require 2 passes to get the inodes unused inthe
> > > > first pass.
> > >
> > > It's self-balancing - it trends towards an equal number of unused
> > > dentries and inodes in the caches. Yes, it will tear down more
> > > dentries at first, but we need to do that to be able to reclaim
> > > inodes.
> >
> > But then it doesn't scan enough inodes on the inode pass.
>
> We don't get a single shrinker call - we get batches of them fo each
> shrinker.

OK fair point. However

[...]

> In other words, we have roughly 18k dentries and 1.9k inodes
> reclaimed for the code I wrote new algorithm. That does mean it
> initially attempts to reclaim dentries faster than the current code, but
> as the number of unused inodes increases, this comes back to parity
> with the current code and we end up with a 1:1 reclaim ratio.
>
> This is good behaviour - dentries are cheap to reconstruct from the
> inode cache, and we should hold onto the inode cache as much as
> possible. i.e. we should reclaim them more aggressively only if
> there is sustained pressure on the superblock and that is what the
> above algorithm does.

I prefer just to keep changes to a minimum and split into seperate
patches (each with at least basic test or two showing no regression).

As-is you're already changing global inode/dentry passes into per
sb inode and dentry passes. I think it can only be a good thing
for that changeset if other changes are minimised.

Then if it is so obviously good behaviour to reduce dcache pressure,
it should be easy to justify that too.

> > > Ås reclaim progresses the propotion of inodes increases, so
> > > the amount of inodes reclaimed increases.
> > >
> > > Basically this is a recognition that the important cache for
> > > avoiding IO is the inode cache, not he dentry cache. Once the inode
> >
> > You can bias against the dcache using multipliers.
>
> Multipliers are not self-balancing, and generally just amplify any
> imbalance an algorithm tends towards. The vfs_cache_pressure
> multiplier is a shining example of this kind of utterly useless
> knob...

Well you can also bias against the dcache with any other means,
including the change you've made here. My main point I guess is
that it should not be in the same as this patchset (or at least
an individual patch).

> > > > Part of the problem is the funny shrinker API.
> > > >
> > > > The right way to do it is to change the shrinker API so that it passes
> > > > down the lru_pages and scanned into the callback. From there, the
> > > > shrinkers can calculate the appropriate ratio of objects to scan.
> > > > No need for 2-call scheme, no need for shrinker->seeks, and the
> > > > ability to calculate an appropriate ratio first for dcache, and *then*
> > > > for icache.
> > >
> > > My only concern about this is that exposes the inner workings of the
> > > shrinker and mm subsystem to code that simply doesn't need to know
> > > about it.
> >
> > It's just providing a ratio. The shrinkers allready know they are
> > scanning based on a ratio of pagecache scanned.
>
> Sure, but the shrinkers are just a simple mechanism for implementing
> VM policy decisions. IMO reclaim policy decisions should not be
> pushed down and replicated in every one of these reclaim mechanisms.

Not really. The VM doesn't know about any of those. They are just
told to provide a ratio and some scanning based on some abstract cost.

The VM doesn't know anything about usage patterns, inuse vs unused
objects, exactly how their LRU algorithms are supposed to work, etc.

There is very little policy decision by the VM in the shrinkers.

> > But shrinkers are very subsystem specific.
>
> And as such should concentrate on getting their subsystem reclaim
> correct, not have to worry about implementing VM policy
> calculations...

Clearly they wouldn't with what I was proposing. And the result would
be much more flexible and also gives the shrinkers more information.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev |
Pages: 1 2 3
Prev: ath: Failed to stop TX DMA in 100 msec after killing last frame
Next: [GIT PULL] LED updates for 2.6.35