2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098] [Kernel]


From: Dave Chinner on 6 Apr 2010 19:20

On Tue, Apr 06, 2010 at 04:52:57PM +0200, Hans-Peter Jansen wrote:
> Hi Dave,
>
> On Tuesday 06 April 2010, 01:06:00 Dave Chinner wrote:
> > On Mon, Apr 05, 2010 at 01:35:41PM +0200, Hans-Peter Jansen wrote:
> > > >
> > > > Oh, this is a highmem box. You ran out of low memory, I think, which
> > > > is where all the inodes are cached. Seems like a VM problem or a
> > > > highmem/lowmem split config problem to me, not anything to do with
> > > > XFS...

[snip]

> Dave, I really don't want to disappoint you, but a lengthy bisection session
> points to:
>
> 57817c68229984818fea9e614d6f95249c3fb098 is the first bad commit
> commit 57817c68229984818fea9e614d6f95249c3fb098
> Author: Dave Chinner <david(a)fromorbit.com>
> Date: Sun Jan 10 23:51:47 2010 +0000
>
> xfs: reclaim all inodes by background tree walks

Interesting. I did a fair bit of low memory testing when I made that
change (admittedly none on a highmem i386 box), and since then I've
done lots of "millions of files" tree creates, traversals and destroys on
limited memory machines without triggering problems when memory is
completely full of inodes.

Let me try to reproduce this on a small VM and I'll get back to you.

> diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
> index 52e06b4..a76fc01 100644
> --- a/fs/xfs/linux-2.6/xfs_super.c
> +++ b/fs/xfs/linux-2.6/xfs_super.c
> @@ -954,14 +954,16 @@ xfs_fs_destroy_inode(
> ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
>
> /*
> - * We always use background reclaim here because even if the
> - * inode is clean, it still may be under IO and hence we have
> - * to take the flush lock. The background reclaim path handles
> - * this more efficiently than we can here, so simply let background
> - * reclaim tear down all inodes.
> + * If we have nothing to flush with this inode then complete the
> + * teardown now, otherwise delay the flush operation.
> */
> + if (!xfs_inode_clean(ip)) {
> + xfs_inode_set_reclaim_tag(ip);
> + return;
> + }
> +
> out_reclaim:
> - xfs_inode_set_reclaim_tag(ip);
> + xfs_ireclaim(ip);
> }

I don't think that will work as expected in all situations - the
inode clean check there is not completely valid as the XFS inode
locks aren't held, so it can race with other operations that need
to complete before reclaim is done. This was one of the reasons for
pushing reclaim into the background....
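
The point about the unlocked clean check can be illustrated with a small
userspace sketch. This is only a toy model under stated assumptions: the
names toy_inode, toy_destroy_inode, toy_background_reclaim and
toy_flush_complete are hypothetical and are not XFS functions, and a single
pthread mutex stands in for the real inode locks. The destroy path only tags
the inode; the decision to free is made later, with the lock held, so it
cannot race with a flush that is still in progress.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Toy stand-in for an inode; not the real struct xfs_inode. */
struct toy_inode {
    pthread_mutex_t lock;   /* models the inode locks */
    bool dirty;             /* still has state to flush */
    bool reclaim_tagged;    /* queued for background reclaim */
    bool freed;             /* torn down */
};

void toy_inode_init(struct toy_inode *ip, bool dirty)
{
    pthread_mutex_init(&ip->lock, NULL);
    ip->dirty = dirty;
    ip->reclaim_tagged = false;
    ip->freed = false;
}

/* Destroy path: never inspects dirty state, it only tags the inode. */
void toy_destroy_inode(struct toy_inode *ip)
{
    pthread_mutex_lock(&ip->lock);
    ip->reclaim_tagged = true;
    pthread_mutex_unlock(&ip->lock);
}

/* Flush completion (hypothetical): the inode becomes clean. */
void toy_flush_complete(struct toy_inode *ip)
{
    pthread_mutex_lock(&ip->lock);
    ip->dirty = false;
    pthread_mutex_unlock(&ip->lock);
}

/*
 * Background pass: the clean check happens with the lock held.
 * Returns true if the inode was freed on this pass.
 */
bool toy_background_reclaim(struct toy_inode *ip)
{
    bool done = false;

    pthread_mutex_lock(&ip->lock);
    if (ip->reclaim_tagged && !ip->dirty) {
        ip->reclaim_tagged = false;
        ip->freed = true;
        done = true;
    }
    pthread_mutex_unlock(&ip->lock);
    return done;
}
```

In this model a dirty inode survives any number of background passes until
its flush completes, which is the property an unlocked check in the destroy
path cannot guarantee.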

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Dave Chinner on 6 Apr 2010 21:50

On Wed, Apr 07, 2010 at 09:11:44AM +1000, Dave Chinner wrote:
> On Tue, Apr 06, 2010 at 04:52:57PM +0200, Hans-Peter Jansen wrote:
> > Hi Dave,
> >
> > On Tuesday 06 April 2010, 01:06:00 Dave Chinner wrote:
> > > On Mon, Apr 05, 2010 at 01:35:41PM +0200, Hans-Peter Jansen wrote:
> > > > >
> > > > > Oh, this is a highmem box. You ran out of low memory, I think, which
> > > > > is where all the inodes are cached. Seems like a VM problem or a
> > > > > highmem/lowmem split config problem to me, not anything to do with
> > > > > XFS...
>
> [snip]
>
> > Dave, I really don't want to disappoint you, but a lengthy bisection session
> > points to:
> >
> > 57817c68229984818fea9e614d6f95249c3fb098 is the first bad commit
> > commit 57817c68229984818fea9e614d6f95249c3fb098
> > Author: Dave Chinner <david(a)fromorbit.com>
> > Date: Sun Jan 10 23:51:47 2010 +0000
> >
> > xfs: reclaim all inodes by background tree walks
>
> Interesting. I did a fair bit of low memory testing when i made that
> change (admittedly none on a highmem i386 box), and since then I've
> done lots of "millions of files" tree creates, traversals and destroys on
> limited memory machines without triggering problems when memory is
> completely full of inodes.
>
> Let me try to reproduce this on a small VM and I'll get back to you.

OK, if there is page cache pressure (e.g. creating small files or
grepping the resultant tree) or the machine has significant amounts
of memory (e.g. >= 4GB) then I can't reproduce this.

However, if the memory pressure is purely inode cache (creating zero
length files or read-only traversal), then the OOM killer kicks in a
while after the slab cache fills memory. This doesn't need highmem;
I used an x86_64 kernel on a VM w/ 1GB RAM to reliably reproduce
this. I'll add zero length file tests and traversals to my low
memory testing.
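
A workload of the "zero length files" kind can be generated with a few
lines of C. This is a hedged sketch, not the test actually used in this
thread: the helper name toy_create_empty_files and the f0, f1, ... naming
scheme are assumptions made for illustration. Each file is created and
closed without writing data, so the only memory it consumes is inode and
dentry cache.

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX declarations such as mkdtemp */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Create `count` zero-length files named f0, f1, ... under `dir`.
 * Stops at the first failure and returns the number of files created.
 */
long toy_create_empty_files(const char *dir, long count)
{
    char path[4096];
    long made = 0;

    for (long i = 0; i < count; i++) {
        int n = snprintf(path, sizeof(path), "%s/f%ld", dir, i);
        if (n < 0 || (size_t)n >= sizeof(path))
            break;                      /* path would be truncated */

        /* O_EXCL: fail rather than reuse an existing file. */
        int fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0644);
        if (fd < 0)
            break;
        close(fd);                      /* no data written: length stays 0 */
        made++;
    }
    return made;
}
```

Calling this with a count in the millions against a directory on the
filesystem under test, on a machine with little RAM, approximates the pure
inode cache pressure described above.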

The best way to fix this, I think, is to trigger a shrinker callback
when memory is low to run the background inode reclaim. The problem
is that these inode caches and the reclaim state are per-filesystem,
not global state, and the current shrinker interface only works with
global state.

Hence there are two patches to this fix - the first adds a context
to the shrinker callout, and the second adds the XFS infrastructure
to track the number of reclaimable inodes per filesystem and
register/unregister shrinkers for each filesystem.
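
The idea of a shrinker callout that carries a context can be sketched in
userspace C. Everything here is a simplified model under stated
assumptions: the toy_* names, the linked-list registry and the callback
signature are hypothetical and do not match the kernel's shrinker interface
or the actual patches. The point is only that each mount registers its own
callback together with a pointer to its own reclaimable-inode count, so one
callback function can serve any number of filesystems.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shrinker with a context pointer; not the kernel struct. */
struct toy_shrinker {
    /* nr_to_scan == 0 means "just report how much is reclaimable". */
    long (*shrink)(void *ctx, long nr_to_scan);
    void *ctx;                      /* per-filesystem state */
    struct toy_shrinker *next;
};

struct toy_shrinker *toy_shrinkers; /* global list of registered shrinkers */

void toy_register_shrinker(struct toy_shrinker *s)
{
    s->next = toy_shrinkers;
    toy_shrinkers = s;
}

void toy_unregister_shrinker(struct toy_shrinker *s)
{
    for (struct toy_shrinker **pp = &toy_shrinkers; *pp; pp = &(*pp)->next) {
        if (*pp == s) {
            *pp = s->next;
            s->next = NULL;
            return;
        }
    }
}

/* Per-filesystem state: a count of inodes tagged for reclaim. */
struct toy_mount {
    long reclaimable;
    struct toy_shrinker shrinker;
};

/* One callback for all mounts; the context selects the filesystem. */
long toy_reclaim_inode_shrink(void *ctx, long nr_to_scan)
{
    struct toy_mount *mp = ctx;

    if (nr_to_scan > 0) {
        long n = nr_to_scan < mp->reclaimable ? nr_to_scan : mp->reclaimable;
        mp->reclaimable -= n;       /* stands in for running inode reclaim */
    }
    return mp->reclaimable;         /* what is still reclaimable */
}

void toy_mount_init(struct toy_mount *mp, long reclaimable)
{
    mp->reclaimable = reclaimable;
    mp->shrinker.shrink = toy_reclaim_inode_shrink;
    mp->shrinker.ctx = mp;
    mp->shrinker.next = NULL;
    toy_register_shrinker(&mp->shrinker);
}

void toy_mount_exit(struct toy_mount *mp)
{
    toy_unregister_shrinker(&mp->shrinker);
}

/* Simulated memory pressure: ask every shrinker to free up to nr each. */
long toy_shrink_all(long nr)
{
    long remaining = 0;

    for (struct toy_shrinker *s = toy_shrinkers; s; s = s->next)
        remaining += s->shrink(s->ctx, nr);
    return remaining;
}
```

With two mounts registered, toy_shrink_all(0) reports the combined
reclaimable count, and a non-zero argument trims each mount independently.
A callback without the ctx argument could only reach global state, which is
the limitation described above.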

With these patches, my reproducible test case which locked the
machine up with an OOM panic in a couple of minutes has been running
for over half an hour. I have much more confidence in this change
with limited testing than the reverting of the background inode
reclaim as the revert introduces

The patches below apply to the xfs-dev tree, which is currently at
34-rc1. If they don't apply, let me know and I'll redo them against
a vanilla kernel tree. Can you test them to see if the problem goes
away? If the problem is fixed, I'll push them for a proper review
cycle...

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com

From: Dave Chinner on 12 Apr 2010 21:30

On Thu, Apr 08, 2010 at 12:02:20AM +0200, Hans-Peter Jansen wrote:
> On Wednesday 07 April 2010, 03:45:33 Dave Chinner wrote:
> >
> > However, if the memory pressure is purely inode cache (creating zero
> > length files or read-only traversal), then the OOM killer kicks in a
> > while after the slab cache fills memory. This doesn't need highmem;
> > I used an x86_64 kernel on a VM w/ 1GB RAM to reliably reproduce
> > this. I'll add zero length file tests and traversals to my low
> > memory testing.
>
> I'm glad, that you're able to reproduce it. My initial failure was during
> disk to disk backup (with a simple cp -al & rsync combination).
>
> > The best way to fix this, I think, is to trigger a shrinker callback
> > when memory is low to run the background inode reclaim. The problem
> > is that these inode caches and the reclaim state are per-filesystem,
> > not global state, and the current shrinker interface only works with
> > global state.
> >
> > Hence there are two patches to this fix - the first adds a context
> > to the shrinker callout, and the second adds the XFS infrastructure
> > to track the number of reclaimable inodes per filesystem and
> > register/unregister shrinkers for each filesystem.
>
> I see, the first one will be interesting to get into mainline, given the
> number of projects, that are involved.
>
> > With these patches, my reproducible test case which locked the
> > machine up with an OOM panic in a couple of minutes has been running
> > for over half an hour. I have much more confidence in this change
> > with limited testing than the reverting of the background inode
> > reclaim as the revert introduces
> >
> > The patches below apply to the xfs-dev tree, which is currently at
> > 34-rc1. If they don't apply, let me know and I'll redo them against
> > a vanilla kernel tree. Can you test them to see if the problem goes
> > away? If the problem is fixed, I'll push them for a proper review
> > cycle...
>
> Of course, you did the original patch for a reason... Therefor I would love
> to test your patches. I've tried to apply them to 2.6.33.2, but after
> fixing the same reject as noted below, I'm stuck here:
>
> /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:
> In function 'xfs_reclaim_inode_shrink':
> /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805:
> error: implicit declaration of function 'xfs_perag_get'
> /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805:
> warning: assignment makes pointer from integer without a cast
> /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:807:
> error: implicit declaration of function 'xfs_perag_put'
>
> Now I see, that there happened a rename of the offending functions, but also
> they've grown a radix_tree structure and locking. How do I handle that?

With difficulty. I'd need to backport it to match the .33 code,
which may or may not be trivial...

> BTW, your patches do not apply to Linus' current git tree either:
> patching file fs/xfs/quota/xfs_qm.c
> Hunk #1 succeeded at 72 (offset 3 lines).
> Hunk #2 FAILED at 2120.
> 1 out of 2 hunks FAILED -- saving rejects to file fs/xfs/quota/xfs_qm.c.rej
> I'm able to resolve this, but 2.6.34-current does give me some other
> trouble, that I need to get by (PS2 keyboard stops working eventually)..

Yeah, there's another patch in my xfs-dev tree that changes that.
I'll rebase it on a clean linux tree before I post it again.

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com

From: Dave Chinner on 13 Apr 2010 05:20

On Tue, Apr 13, 2010 at 10:50:35AM +0200, Hans-Peter Jansen wrote:
> On Tuesday 13 April 2010, 00:32:41 Dave Chinner wrote:
> > On Thu, Apr 08, 2010 at 12:02:20AM +0200, Hans-Peter Jansen wrote:
> > > On Wednesday 07 April 2010, 03:45:33 Dave Chinner wrote:
> > >
> > > > The patches below apply to the xfs-dev tree, which is currently at
> > > > 34-rc1. If they don't apply, let me know and I'll redo them against
> > > > a vanilla kernel tree. Can you test them to see if the problem goes
> > > > away? If the problem is fixed, I'll push them for a proper review
> > > > cycle...
> > >
> > > Of course, you did the original patch for a reason... Therefor I would
> > > love to test your patches. I've tried to apply them to 2.6.33.2, but
> > > after fixing the same reject as noted below, I'm stuck here:
> > >
> > > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:
> > > In function 'xfs_reclaim_inode_shrink':
> > > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805:
> > > error: implicit declaration of function 'xfs_perag_get'
> > > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805:
> > > warning: assignment makes pointer from integer without a cast
> > > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:807:
> > > error: implicit declaration of function 'xfs_perag_put'
> > >
> > > Now I see, that there happened a rename of the offending functions, but
> > > also they've grown a radix_tree structure and locking. How do I handle
> > > that?
> >
> > With difficulty. I'd need to backport it to match the .33 code,
> > which may or may not be trivial...
>
> Dave, may I ask you kindly for briefly elaborating on the worst consequences
> of just reverting this hunk, as I've done before?

Well, given that it is the new shrinker code generating the warnings,
reverting/removing that hunk will render the patch useless :0

I'll get you a working 2.6.33 patch tomorrow - it's dinner time
now....

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com

From: Emmanuel Florac on 24 Apr 2010 17:30

On Sat, 24 Apr 2010 18:44:22 +0200 you wrote:

> Greg, you might search for a server using xfs filesystems and a
> i586 kernel >= 2.6.33, (2.6.32.11 of SLE11-SP1 will serve as well),
> log in as an ordinary user, do a "du" on /usr, and wait for the other
> users screaming...

I did precisely that (du on a kernel source tree) and didn't notice
anything special: kernel 2.6.32.11, deadline scheduler, 7-drive RAID-6
array, 8GB RAM.

--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac(a)intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
