From: Al Viro on
On Wed, Jul 21, 2010 at 02:11:17PM +0200, Jan Kara wrote:
> Thanks for bisecting this. The patch series indeed seems to uncover
> some discrepancies.
> Ext3 has always dirtied inode in it's ->delete_inode method (via quota
> code). But previously clear_inode() just overwrote the state with I_CLEAR
> and thus we never saw the BUG_ON. After Al's patches, i_state is set in
> end_writeback() which happens earlier. In particular it happens before
> ext3_free_inode() which dirties the inode through quota code while freeing
> xattrs - they are accounted in i_blocks, so i_blocks are updated during
> freeing and inode is dirtied.
> Actually, ext3_mark_inode_dirty() called during each mark_inode_dirty()
> call writes the inode state to the journal so the dirty flag in the inode
> state is in fact stale and overwriting it with I_CLEAR never mattered. In
> this sense, the BUG_ON triggered is a false positive. But I believe this is
> a separate story.
> I'm not sure how to really fix this. It seems a bit premature to me to
> mark inode as I_CLEAR before the filesystem is actually done with it. So
> maybe the line
> inode->i_state = I_FREEING | I_CLEAR;
> should be moved to evict() fuction?

Nope. I_CLEAR is "no async calls from vfs anymore; it's under complete
fs control and is about to die now".

In any case, I'll post a dumb replacement for ext3 after I verify it on
the laptop I have with me. Should be in an hour or so (the damn thing is
_slow_).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Al Viro on
On Wed, Jul 21, 2010 at 02:11:17PM +0200, Jan Kara wrote:
> Thanks for bisecting this. The patch series indeed seems to uncover
> some discrepancies.
> Ext3 has always dirtied inode in it's ->delete_inode method (via quota
> code). But previously clear_inode() just overwrote the state with I_CLEAR
> and thus we never saw the BUG_ON. After Al's patches, i_state is set in
> end_writeback() which happens earlier. In particular it happens before
> ext3_free_inode() which dirties the inode through quota code while freeing
> xattrs - they are accounted in i_blocks, so i_blocks are updated during
> freeing and inode is dirtied.
> Actually, ext3_mark_inode_dirty() called during each mark_inode_dirty()
> call writes the inode state to the journal so the dirty flag in the inode
> state is in fact stale and overwriting it with I_CLEAR never mattered. In
> this sense, the BUG_ON triggered is a false positive. But I believe this is
> a separate story.

Could you please explain why the hell does ext2_xattr_delete_inode() call the
dirtying variant of dquot_free_block()? Note that the inode is well beyond
the point where writeback would consider touching it; this mark_inode_dirty()
will do nothing useful whatsoever at that place.

Anyway, I've dealt with ext3 and ext2 (both b0rken with quota) and AFAICS
the rest of quota-supporting filesystems had been OK. Changes:

ext3_evict_inode() hd end_writeback() shifted downstream (needed anyway)
ext2 switched to nodirty variant of dquot_free_block()
ext2_xattr_delete_inode() doesn't try to dirty inode anymore (always
had been pointless).

It's in for-next, should propagate to git.kernel.org in a few.

Diff against the buggy version follows, feel free to try.
diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
index e8766a3..c6c684b 100644
--- a/fs/ext2/balloc.c
+++ b/fs/ext2/balloc.c
@@ -571,7 +571,7 @@ do_more:
error_return:
brelse(bitmap_bh);
release_blocks(sb, freed);
- dquot_free_block(inode, freed);
+ dquot_free_block_nodirty(inode, freed);
}

/**
@@ -1418,7 +1418,8 @@ allocated:

*errp = 0;
brelse(bitmap_bh);
- dquot_free_block(inode, *count-num);
+ dquot_free_block_nodirty(inode, *count-num);
+ mark_inode_dirty(inode);
*count = num;
return ret_block;

@@ -1428,8 +1429,10 @@ out:
/*
* Undo the block allocation
*/
- if (!performed_allocation)
- dquot_free_block(inode, *count);
+ if (!performed_allocation) {
+ dquot_free_block_nodirty(inode, *count);
+ mark_inode_dirty(inode);
+ }
brelse(bitmap_bh);
return 0;
}
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 7e4a455..940c961 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -439,6 +439,8 @@ static int ext2_alloc_blocks(struct inode *inode,
failed_out:
for (i = 0; i <index; i++)
ext2_free_blocks(inode, new_blocks[i], 1);
+ if (index)
+ mark_inode_dirty(inode);
return ret;
}

@@ -1009,8 +1011,8 @@ static inline void ext2_free_data(struct inode *inode, __le32 *p, __le32 *q)
else if (block_to_free == nr - count)
count++;
else {
- mark_inode_dirty(inode);
ext2_free_blocks (inode, block_to_free, count);
+ mark_inode_dirty(inode);
free_this:
block_to_free = nr;
count = 1;
@@ -1018,8 +1020,8 @@ static inline void ext2_free_data(struct inode *inode, __le32 *p, __le32 *q)
}
}
if (count > 0) {
- mark_inode_dirty(inode);
ext2_free_blocks (inode, block_to_free, count);
+ mark_inode_dirty(inode);
}
}

diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index 7c39157..5ab87e6 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -674,6 +674,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
new_bh = sb_getblk(sb, block);
if (!new_bh) {
ext2_free_blocks(inode, block, 1);
+ mark_inode_dirty(inode);
error = -EIO;
goto cleanup;
}
@@ -703,8 +704,10 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
* written (only some dirty data were not) so we just proceed
* as if nothing happened and cleanup the unused block */
if (error && error != -ENOSPC) {
- if (new_bh && new_bh != old_bh)
- dquot_free_block(inode, 1);
+ if (new_bh && new_bh != old_bh) {
+ dquot_free_block_nodirty(inode, 1);
+ mark_inode_dirty(inode);
+ }
goto cleanup;
}
} else
@@ -727,6 +730,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
mb_cache_entry_free(ce);
ea_bdebug(old_bh, "freeing");
ext2_free_blocks(inode, old_bh->b_blocknr, 1);
+ mark_inode_dirty(inode);
/* We let our caller release old_bh, so we
* need to duplicate the buffer before. */
get_bh(old_bh);
@@ -736,7 +740,8 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
le32_add_cpu(&HDR(old_bh)->h_refcount, -1);
if (ce)
mb_cache_entry_release(ce);
- dquot_free_block(inode, 1);
+ dquot_free_block_nodirty(inode, 1);
+ mark_inode_dirty(inode);
mark_buffer_dirty(old_bh);
ea_bdebug(old_bh, "refcount now=%d",
le32_to_cpu(HDR(old_bh)->h_refcount));
@@ -799,7 +804,7 @@ ext2_xattr_delete_inode(struct inode *inode)
mark_buffer_dirty(bh);
if (IS_SYNC(inode))
sync_dirty_buffer(bh);
- dquot_free_block(inode, 1);
+ dquot_free_block_nodirty(inode, 1);
}
EXT2_I(inode)->i_file_acl = 0;

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 7edd529..cc55cec 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -230,7 +230,6 @@ void ext3_evict_inode (struct inode *inode)
inode->i_size = 0;
if (inode->i_blocks)
ext3_truncate(inode);
- end_writeback(inode);
/*
* Kill off the orphan record which ext3_truncate created.
* AKPM: I think this can be inside the above `if'.
@@ -252,10 +251,12 @@ void ext3_evict_inode (struct inode *inode)
if (ext3_mark_inode_dirty(handle, inode)) {
/* If that failed, just dquot_drop() and be done with that */
dquot_drop(inode);
+ end_writeback(inode);
} else {
ext3_xattr_delete_inode(handle, inode);
dquot_free_inode(inode);
dquot_drop(inode);
+ end_writeback(inode);
ext3_free_inode(handle, inode);
}
ext3_journal_stop(handle);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Dave Chinner on
On Wed, Jul 21, 2010 at 12:29:07AM -0700, Andrew Morton wrote:
> On Wed, 21 Jul 2010 15:20:07 +1000 Dave Chinner <david(a)fromorbit.com> wrote:
>
> > > and they were dirtied within dquot_free_space().
> >
> > AFAICT dquot_free_space() is called deep in the guts of
> > ext3_truncate() via dquot_free_block(), which is called directly
> > before end_writeback(). That should overwrite any state changes made
> > inside ext3_truncate. I wonder if iput_final() is racing with
> > something else here?
> >
>
> This isn't a race. I type `make' and the warnings spew out at hundreds
> per second - every unlink, I'd say.
>
> Did you try my .config?

Yes, I did - it was the second thing I tried after using my usual
..config (*).

I'm not an ext3 expert, so I might be missing something, but I
cannot see (from code inspection) where those flags are being set
after end_writeback is called. My experience with inode flag
corruptions on XFS that can only be reprodued by a small number of
machines is that it is usually the result of a race condition.
It looks and smells similar to me.

Anyway, I'm not sure I can help much more at this point - until I
can reproduce it I'm just making SWAGs...

Cheers,

Dave.

(*) I eventually worked out that the ext3 corruption I was chasing
on that one filesystem was a result of using a kernel built with
your config because it didn't have CONFIG_EXT3_DEFAULTS_TO_ORDERED=y
set and I did the equivalent of yanking the power cord to the VM
(I do that all the time) while that kernel was running...

--
Dave Chinner
david(a)fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jan Kara on
Hi Al,

On Wed 21-07-10 22:40:16, Al Viro wrote:
> On Wed, Jul 21, 2010 at 02:11:17PM +0200, Jan Kara wrote:
> > Thanks for bisecting this. The patch series indeed seems to uncover
> > some discrepancies.
> > Ext3 has always dirtied inode in it's ->delete_inode method (via quota
> > code). But previously clear_inode() just overwrote the state with I_CLEAR
> > and thus we never saw the BUG_ON. After Al's patches, i_state is set in
> > end_writeback() which happens earlier. In particular it happens before
> > ext3_free_inode() which dirties the inode through quota code while freeing
> > xattrs - they are accounted in i_blocks, so i_blocks are updated during
> > freeing and inode is dirtied.
> > Actually, ext3_mark_inode_dirty() called during each mark_inode_dirty()
> > call writes the inode state to the journal so the dirty flag in the inode
> > state is in fact stale and overwriting it with I_CLEAR never mattered. In
> > this sense, the BUG_ON triggered is a false positive. But I believe this is
> > a separate story.
>
> Could you please explain why the hell does ext2_xattr_delete_inode() call the
> dirtying variant of dquot_free_block()? Note that the inode is well beyond
> the point where writeback would consider touching it; this mark_inode_dirty()
> will do nothing useful whatsoever at that place.
Yes, I'm aware that dirtying inode does nothing useful at that point.
But OTOH it does no harm so why not call the standard variant of quota
function? But I have no strong opinion on this particular callsite so
using dquot_free_block_nodirty() at that place is OK with me.

> Anyway, I've dealt with ext3 and ext2 (both b0rken with quota) and AFAICS
> the rest of quota-supporting filesystems had been OK. Changes:
>
> ext3_evict_inode() hd end_writeback() shifted downstream (needed anyway)
I like this.

> ext2 switched to nodirty variant of dquot_free_block()
But I don't like this - see comments in the patch. I'd be much more happy
if ext2 did the same thing as ext3 - pulled ext2_xattr_delete_inode()
call out of ext2_free_inode called end_writeback() after it.

> ext2_xattr_delete_inode() doesn't try to dirty inode anymore (always
> had been pointless).
As I wrote above I don't care about this very much.

> It's in for-next, should propagate to git.kernel.org in a few.
>
> Diff against the buggy version follows, feel free to try.
> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
> index e8766a3..c6c684b 100644
> --- a/fs/ext2/balloc.c
> +++ b/fs/ext2/balloc.c
> @@ -571,7 +571,7 @@ do_more:
> error_return:
> brelse(bitmap_bh);
> release_blocks(sb, freed);
> - dquot_free_block(inode, freed);
> + dquot_free_block_nodirty(inode, freed);
> }
The above call in ext2_free_blocks() modifies the inode and
ext2_free_blocks is enough "black-box" function that it should do the right
thing and dirty the inode if it modified it.

> /**
> @@ -1418,7 +1418,8 @@ allocated:
>
> *errp = 0;
> brelse(bitmap_bh);
> - dquot_free_block(inode, *count-num);
> + dquot_free_block_nodirty(inode, *count-num);
> + mark_inode_dirty(inode);
> *count = num;
> return ret_block;
>
> @@ -1428,8 +1429,10 @@ out:
> /*
> * Undo the block allocation
> */
> - if (!performed_allocation)
> - dquot_free_block(inode, *count);
> + if (!performed_allocation) {
> + dquot_free_block_nodirty(inode, *count);
> + mark_inode_dirty(inode);
> + }
> brelse(bitmap_bh);
> return 0;
> }
Sorry, but the above two changes look stupid... Why call _nodirty variant
and dirty the inode immediately after that? It happens in two other places
in your patch as well...

Honza
--
Jan Kara <jack(a)suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Al Viro on
On Fri, Jul 23, 2010 at 12:04:08PM +0200, Jan Kara wrote:

> > @@ -1428,8 +1429,10 @@ out:
> > /*
> > * Undo the block allocation
> > */
> > - if (!performed_allocation)
> > - dquot_free_block(inode, *count);
> > + if (!performed_allocation) {
> > + dquot_free_block_nodirty(inode, *count);
> > + mark_inode_dirty(inode);
> > + }
> > brelse(bitmap_bh);
> > return 0;
> > }
> Sorry, but the above two changes look stupid... Why call _nodirty variant
> and dirty the inode immediately after that? It happens in two other places
> in your patch as well...

Frankly, I don't believe that dquot_free_block() is _ever_ the right interface;
please, leave dirtying the inode to the filesystem and let it be explicit.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/