From: Nick Piggin on
On Thu, May 27, 2010 at 02:07:04PM +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 12:19:05PM +1000, Nick Piggin wrote:
> > On Thu, May 27, 2010 at 09:12:14AM +1000, Dave Chinner wrote:
> > > On Thu, May 27, 2010 at 02:41:16AM +1000, Nick Piggin wrote:
> > > > > +	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
> > > > > +				* sysctl_vfs_cache_pressure;
> > > >
> > > > Do you think truncating in the divisions is at all a problem? It
> > > > probably doesn't matter much I suppose.
> > >
> > > Same code as currently exists. IIRC, the reasoning is that if we've
> > > got less than 100 objects to reclaim, then we're unlikely to be able
> > > to free up any memory from the caches, anyway.
> >
> > Yeah, which is why I stop short of saying you should change it in
> > this patch.
> >
> > But I think we should ensure things can get reclaimed eventually.
> > 100 objects could be 100 slabs, which could be anything from
> > half a meg to half a dozen. Multiplied by each of the caches.
> > Could be significant in small systems.
>
> True, but usually there are busy objects in the dentry and inode
> slabs, so it shouldn't be a significant issue. We can probably
> address such problems if they can be demonstrated to be an issue in
> a separate patch set....

I didn't want to say it is a problem with your patchset, I just
thought of it when reviewing.
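
(For concreteness, the truncation in question -- a worked example assuming
the default vfs_cache_pressure of 100:

	count = (99  / 100) * 100;	/* 99 unused objects  -> count = 0,
					   nothing reported to the shrinker */
	count = (250 / 100) * 100;	/* 250 unused objects -> count = 200 */

so a sb with fewer than 100 unused objects is invisible to reclaim.)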
From: Dave Chinner on
On Thu, May 27, 2010 at 05:01:20AM +0100, Al Viro wrote:
> On Thu, May 27, 2010 at 11:53:35AM +1000, Dave Chinner wrote:
> > On Thu, May 27, 2010 at 09:12:14AM +1000, Dave Chinner wrote:
> > > On Thu, May 27, 2010 at 02:41:16AM +1000, Nick Piggin wrote:
> > ....
> > > > Nitpick, but I prefer just the restart label where it was previously. This
> > > > is moving setup for the next iteration into the "error" case.
> > >
> > > Ok, will fix.
> > ....
> > > > Would you just elaborate on the lock order problem somewhere? (the
> > > > comment makes it look like we *could* take the mutex if we wanted
> > > > to).
> > >
> > > The shrinker is unregistered in deactivate_locked_super() which is
> > > just before ->kill_sb is called. The sb->s_umount lock is held at
> > > this point. Hence, if the shrinker is operating, we will deadlock if
> > > we try to lock it like this:
> > >
> > > unmount:                        shrinker:
> > >                                 down_read(&shrinker_lock);
> > > down_write(&sb->s_umount)
> > > unregister_shrinker()
> > >   down_write(&shrinker_lock)
> > >                                 prune_super()
> > >                                   down_read(&sb->s_umount);
> > >                                 (deadlock)
> > >
> > > hence if we can't get the sb->s_umount lock in prune_super(), then
> > > the superblock must be being unmounted and the shrinker should abort
> > > as the ->kill_sb method will clean up everything after the shrinker
> > > is unregistered. Hence the down_read_trylock().
>
> Um... Maybe I'm dumb, but what's wrong with doing unregistration from
> deactivate_locked_super(), right after the call of ->kill_sb()? At that
> point ->s_umount is already dropped, so we won't deadlock at all.
> Shrinker rwsem will make sure that all shrinkers-in-progress will run
> to completion, so we won't get a superblock freed under prune_super().
> > I don't particularly mind the down_read_trylock() in prune_super(), but why not
> make life obviously safer?
>
> Am I missing something here?

I was worried about memory allocation in the ->kill_sb path
deadlocking on the s_umount lock if it enters reclaim. e.g. XFS
inodes can still be dirty even after the VFS has disposed of them,
and writing them back can require page cache allocation for the
backing buffers. If allocation recurses back into the shrinker, we
can deadlock on the s_umount lock. This doesn't seem like an XFS
specific problem, so I used a trylock to avoid that whole class of
problems (same way the current shrinkers do).
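
(Spelled out as a call chain -- the allocation site here is purely
illustrative:

	unmount task, s_umount held for write
	  ->kill_sb()
	    write back dirty inodes
	      page cache allocation with __GFP_FS set
	        direct reclaim
	          prune_super(sb)
	            down_read(&sb->s_umount)	<- same task already holds it
						   for write, so only the
						   trylock avoids deadlocking
						   against ourselves
)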

From there, we can unregister the shrinker before calling ->kill_sb
as per above. That, in turn, means that the unmount
invalidate_inodes() vs shrinker race goes away and the iprune_sem is
not needed in the new prune_icache_sb() function. I'm pretty sure
that I can now remove the iprune_sem, but I haven't written the
patch to do that yet.
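
Roughly, that ordering looks like this (sketch only -- the surrounding
deactivate_locked_super() body is approximate, the point is unregistering
with s_umount still held, which the trylock in prune_super() makes safe):

void deactivate_locked_super(struct super_block *s)
{
	struct file_system_type *fs = s->s_type;

	if (atomic_dec_and_test(&s->s_active)) {
		/* stop the shrinker before we start tearing the sb down */
		unregister_shrinker(&s->s_shrink);
		fs->kill_sb(s);
		put_filesystem(fs);
		put_super(s);
	} else {
		up_write(&s->s_umount);
	}
}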

Cheers,

Dave.

--
Dave Chinner
david(a)fromorbit.com
From: Nick Piggin on
On Tue, May 25, 2010 at 06:53:06PM +1000, Dave Chinner wrote:
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -37,6 +37,50 @@
> LIST_HEAD(super_blocks);
> DEFINE_SPINLOCK(sb_lock);
>
> +static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
> +{
> +	struct super_block *sb;
> +	int count;
> +
> +	sb = container_of(shrink, struct super_block, s_shrink);
> +
> +	/*
> +	 * Deadlock avoidance. We may hold various FS locks, and we don't want
> +	 * to recurse into the FS that called us in clear_inode() and friends..
> +	 */
> +	if (!(gfp_mask & __GFP_FS))
> +		return -1;
> +
> +	/*
> +	 * if we can't get the umount lock, then there's no point having the
> +	 * shrinker try again because the sb is being torn down.
> +	 */
> +	if (!down_read_trylock(&sb->s_umount))
> +		return -1;
> +
> +	if (!sb->s_root) {
> +		up_read(&sb->s_umount);
> +		return -1;
> +	}
> +
> +	if (nr_to_scan) {
> +		/* proportion the scan between the two caches */
> +		int total;
> +
> +		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
> +		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
> +
> +		/* prune dcache first as icache is pinned by it */
> +		prune_dcache_sb(sb, count);
> +		prune_icache_sb(sb, nr_to_scan - count);

Hmm, an interesting dynamic that you've changed is that previously
we'd scan dcache LRU proportionately to pagecache, and then scan
inode LRU in proportion to the current number of unused inodes.

But we can think of inodes that are only in use by unused (and aged)
dentries as effectively unused themselves. So this sequence underestimates
how many inodes to scan. This could bias pressure against the dcache, I'd
think, especially considering inodes are far larger than dentries. Maybe
require 2 passes so that inodes made unused by the first pass get scanned.
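
(To put numbers on the underestimate -- using the proportioning arithmetic
from the patch with made-up counts:

	s_nr_dentry_unused = 1000, s_nr_inodes_unused = 0, nr_to_scan = 128

	total = 1000 + 0 + 1 = 1001
	count = 128 * 1000 / 1001 = 127		-> dcache gets nearly all of it
	nr_to_scan - count = 1			-> icache gets 1, even though
						   pruning those dentries may
						   unpin hundreds of inodes
)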

Part of the problem is the funny shrinker API.

The right way to do it is to change the shrinker API so that it passes
lru_pages and lru_scanned down into the callback. From there, the
shrinkers can calculate the appropriate ratio of objects to scan.
No need for the 2-call scheme, no need for shrinker->seeks, and the
ability to calculate an appropriate ratio first for the dcache, and
*then* for the icache.

A helper of course can do the calculation (considering that every
driver and their dog will do the wrong thing if we let them :)).

unsigned long shrinker_scan(unsigned long lru_pages,
			    unsigned long lru_scanned,
			    unsigned long nr_objects,
			    unsigned long scan_ratio)
{
	unsigned long long tmp = nr_objects;

	tmp *= lru_scanned * 100;
	do_div(tmp, (lru_pages * scan_ratio) + 1);

	return (unsigned long)tmp;
}

Then the shrinker callback will go:
	sb->s_nr_dentry_scan += shrinker_scan(lru_pages, lru_scanned,
					sb->s_nr_dentry_unused,
					vfs_cache_pressure * SEEKS_PER_DENTRY);
	if (sb->s_nr_dentry_scan > SHRINK_BATCH)
		prune_dcache()

	sb->s_nr_inode_scan += shrinker_scan(lru_pages, lru_scanned,
					sb->s_nr_inodes_unused,
					vfs_cache_pressure * SEEKS_PER_INODE);
	...

What do you think of that? Seeing as we're changing the shrinker API
anyway, I'd think it is high time to do something like this.


From: Nick Piggin on
On Thu, May 27, 2010 at 04:17:51PM +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 05:01:20AM +0100, Al Viro wrote:
> > Um... Maybe I'm dumb, but what's wrong with doing unregistration from
> > deactivate_locked_super(), right after the call of ->kill_sb()? At that
> > point ->s_umount is already dropped, so we won't deadlock at all.
> > Shrinker rwsem will make sure that all shrinkers-in-progress will run
> > to completion, so we won't get a superblock freed under prune_super().
> > > I don't particularly mind the down_read_trylock() in prune_super(), but why not
> > make life obviously safer?
> >
> > Am I missing something here?
>
> I was worried about memory allocation in the ->kill_sb path
> deadlocking on the s_umount lock if it enters reclaim. e.g. XFS
> inodes can still be dirty even after the VFS has disposed of them,
> and writing them back can require page cache allocation for the
> backing buffers. If allocation recurses back into the shrinker, we
> can deadlock on the s_umount lock. This doesn't seem like an XFS
> specific problem, so I used a trylock to avoid that whole class of
> problems (same way the current shrinkers do).

If __GFP_FS is not set, we wouldn't touch the locks. It is a concern,
though, if __GFP_FS allocations were previously permitted under
the exclusive lock.
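
I.e. (allocation flags purely illustrative):

	down_write(&sb->s_umount);	/* unmount path */
	kmalloc(..., GFP_NOFS);		/* reclaim skips prune_super():
					   __GFP_FS clear, no locks taken */
	kmalloc(..., GFP_KERNEL);	/* reclaim enters prune_super(); only
					   the trylock stops it blocking on
					   s_umount against ourselves */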


> From there, we can unregister the shrinker before calling ->kill_sb
> as per above. That, in turn, means that the unmount
> invalidate_inodes() vs shrinker race goes away and the iprune_sem is
> not needed in the new prune_icache_sb() function. I'm pretty sure
> that I can now remove the iprune_sem, but I haven't written the
> patch to do that yet.

I do really like that aspect of your patch. It's nice to have the
shrinker always only operating against active supers. So I would
be in favour of your current scheme.


From: Andrew Morton on
On Tue, 25 May 2010 18:53:06 +1000
Dave Chinner <david(a)fromorbit.com> wrote:

> From: Dave Chinner <dchinner(a)redhat.com>
>
> With context based shrinkers, we can implement a per-superblock
> shrinker that shrinks the caches attached to the superblock. We
> currently have global shrinkers for the inode and dentry caches that
> split up into per-superblock operations via a coarse proportioning
> method that does not batch very well. The global shrinkers also
> have a dependency - dentries pin inodes - so we have to be very
> careful about how we register the global shrinkers so that the
> implicit call order is always correct.
>
> With a per-sb shrinker callout, we can encode this dependency
> directly into the per-sb shrinker, hence avoiding the need for
> strictly ordering shrinker registrations. We also have no need for
> any proportioning code because the shrinker subsystem already provides
> this functionality across all shrinkers. Allowing the shrinker to
> operate on a single superblock at a time means that we do less
> superblock list traversals and locking and reclaim should batch more
> effectively. This should result in less CPU overhead for reclaim and
> potentially faster reclaim of items from each filesystem.
>

I go all tingly when a changelog contains the word "should".

OK, it _should_ do X. But _does_ it actually do X?

> fs/super.c | 53 +++++++++++++++++++++
> include/linux/fs.h | 7 +++
> 4 files changed, 88 insertions(+), 214 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index dba6b6d..d7bd781 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -456,21 +456,16 @@ static void prune_one_dentry(struct dentry * dentry)
> * which flags are set. This means we don't need to maintain multiple
> * similar copies of this loop.
> */
> -static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
> +static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)

Forgot to update the kerneldoc description of `count'.
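
(i.e. something along these lines for the parameter line -- exact wording
obviously up to Dave:

 * @count: number of dentries on this superblock's LRU to try to free

)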

