From: Dave Jones on
On Mon, Oct 09, 2006 at 02:59:25PM -0700, Badari Pulavarty wrote:

> journal_dirty_data() would do submit_bh() ONLY if its part of the older
> transaction.
>
> I need to take a closer look to understand the race.
>
> BTW, is this 1k or 2k filesystem ?

(18:41:11:davej(a)gelk:~)$ sudo tune2fs -l /dev/md0 | grep size
Block size: 1024
Fragment size: 1024
Inode size: 128
(18:41:16:davej(a)gelk:~)$

> How easy is to reproduce the problem ?

I can reproduce it within a few hours of stressing, but only
on that one box. I've not figured out exactly what's so
special about it yet (though the 1k block thing may be it).
I had been thinking it was a raid0 only thing, as none of
my other boxes have that.

I'm not entirely sure how it got set up that way either.
The Fedora installer being too smart for its own good perhaps.

Dave

--
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jan Kara on
> On Mon, 2006-10-09 14:46:30 -0500, Eric Sandeen <sandeen(a)sandeen.net> wrote:
> > Andrew Morton wrote:
> > > On Tue, 03 Oct 2006 00:43:01 -0500
> > > Eric Sandeen <sandeen(a)sandeen.net> wrote:
> > > > Dave Jones wrote:
> > > > > So I managed to reproduce it with an 'fsx foo' and a
> > > > > 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
> > > > > a vanilla 2.6.18 with none of the Fedora patches..
> > > > >
> > > > > I'll give 2.6.18-git a try next.
> > > > >
> > > > > ----------- [cut here ] --------- [please bite here ] ---------
> > > > > Kernel BUG at fs/buffer.c:2791
> > > > I had thought/hoped that this was fixed by Jan's patch at
> > > > http://lkml.org/lkml/2006/9/7/236 from the thread started at
> > > > http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug
> > > > first by going through that new codepath....
> > >
> > > Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion. iirc,
> > > Badari was hitting that BUG and was able to confirm that Jan's patch
> > > (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
> > > it.
> >
> > Looking at some BH traces*, it appears that what Dave hit is a truncate
> > racing with a sync...
> >
> > truncate ...
> > ext3_invalidate_page
> > journal_invalidatepage
> > journal_unmap buffer
> >
> > going off at the same time as
> >
> > sync ...
> > journal_dirty_data
> > sync_dirty_buffer
> > submit_bh <-- finds unmapped buffer, boom.
>
> Is this possibly related to the issues that are discussed in another
> thread? We're seeing problems while unlinking large files (usually get
> it within some hours with 200MB files, but couldn't yet reproduce it
> with 20MB.)
I don't think this is related (BTW: I've run your test for 5 hours
without any luck ;( Maybe I'll try again for some longer time...).

Honza
--
Jan Kara <jack(a)suse.cz>
SuSE CR Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jan Kara on
> On Mon, Oct 09, 2006 at 02:59:25PM -0700, Badari Pulavarty wrote:
>
> > journal_dirty_data() would do submit_bh() ONLY if its part of the older
> > transaction.
> >
> > I need to take a closer look to understand the race.
> >
> > BTW, is this 1k or 2k filesystem ?
>
> (18:41:11:davej(a)gelk:~)$ sudo tune2fs -l /dev/md0 | grep size
> Block size: 1024
> Fragment size: 1024
> Inode size: 128
> (18:41:16:davej(a)gelk:~)$
>
> > How easy is to reproduce the problem ?
>
> I can reproduce it within a few hours of stressing, but only
> on that one box. I've not figured out exactly what's so
> special about it yet (though the 1k block thing may be it).
> I had been thinking it was a raid0 only thing, as none of
> my other boxes have that.
>
> I'm not entirely sure how it got set up that way either.
> The Fedora installer being too smart for its own good perhaps.
I think it's really the 1KB block size that makes it happen.
I've looked at journal_dirty_data() code and I think the following can
happen:
sync() eventually ends up in journal_dirty_data(bh) as Eric writes.
There is finds dirty buffer attached to the comitting transaction. So it drops
all locks and calls sync_dirty_buffer(bh).
Now in other process, file is truncated so that 'bh' gets just after EOF.
As we have 1kb buffers, it can happen that bh is in the partially
truncated page. Buffer is marked unmapped and clean. But in a moment the page
is marked dirty and msync() is called. That eventually calls
set_page_dirty() and all buffers in the page are marked dirty.
The first process now wakes up, locks the buffer, clears the dirty bit
and does submit_bh() - Oops.

This is essentially the same problem Badari found but in a different
place. There are two places that are arguably wrong...
1) We mark buffer dirty after EOF. But actually that may be needed -
or what is the expected behaviour when we write into mmapped file after
EOF, then extend the file and do msync()?
2) We submit a buffer after EOF for IO. This should be clearly avoided
but getting the needed info from bh is really ugly...

What we could do is: Instead od calling sync_dirty_buffer() we do
something like:

lock_buffer(bh);
jbd_lock_bh_state(bh);
if (!buffer_jbd(bh) || jh != bh2jh(bh) || jh->b_transaction !=
journal->j_committing_transaction) {
jbd_unlock_bh_state(bh);
unlock_buffer(bh);
}
jbd_unlock_bh_state(bh);
if (test_clear_buffer_dirty(bh)) {
get_bh(bh);
bh->b_end_io = end_buffer_write_sync;
submit_bh(WRITE, bh);
wait_on_buffer(bh);
}
else
unlock_buffer(bh);

That should deal with the problem... Much more adventurous change would
be to remove the syncing code altogether - the new commit code makes
sure to write out each buffer just once so the livelock should not
happen now. But then we'd have to put running transaction in
j_next_transaction and refile data buffers instead of unfiling them.
That should actually give quite some performance improvement when
intensively rewriting files. But I guess that is JBD2 matter ;).

Honza
--
Jan Kara <jack(a)suse.cz>
SuSE CR Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jan-Benedict Glaw on
On Tue, 2006-10-10 15:16:03 +0200, Jan Kara <jack(a)suse.cz> wrote:
> > Is this possibly related to the issues that are discussed in another
> > thread? We're seeing problems while unlinking large files (usually get
> > it within some hours with 200MB files, but couldn't yet reproduce it
> > with 20MB.)
> I don't think this is related (BTW: I've run your test for 5 hours
> without any luck ;( Maybe I'll try again for some longer time...).

Looking at it, I also think it's a different thing. Just to add, I
couldn't make it bug with a 10 MB file in a day, but it failed again
using a 100MB file.

So file size seems to matter somehow.

MfG, JBG

--
Jan-Benedict Glaw jbglaw(a)lug-owl.de +49-172-7608481
Signature of: Wenn ich wach bin, träume ich.
the second :
From: Andrew Morton on
On Tue, 10 Oct 2006 16:11:45 +0200
Jan Kara <jack(a)suse.cz> wrote:

> > On Mon, Oct 09, 2006 at 02:59:25PM -0700, Badari Pulavarty wrote:
> >
> > > journal_dirty_data() would do submit_bh() ONLY if its part of the older
> > > transaction.
> > >
> > > I need to take a closer look to understand the race.
> > >
> > > BTW, is this 1k or 2k filesystem ?
> >
> > (18:41:11:davej(a)gelk:~)$ sudo tune2fs -l /dev/md0 | grep size
> > Block size: 1024
> > Fragment size: 1024
> > Inode size: 128
> > (18:41:16:davej(a)gelk:~)$
> >
> > > How easy is to reproduce the problem ?
> >
> > I can reproduce it within a few hours of stressing, but only
> > on that one box. I've not figured out exactly what's so
> > special about it yet (though the 1k block thing may be it).
> > I had been thinking it was a raid0 only thing, as none of
> > my other boxes have that.
> >
> > I'm not entirely sure how it got set up that way either.
> > The Fedora installer being too smart for its own good perhaps.
> I think it's really the 1KB block size that makes it happen.
> I've looked at journal_dirty_data() code and I think the following can
> happen:
> sync() eventually ends up in journal_dirty_data(bh) as Eric writes.
> There is finds dirty buffer attached to the comitting transaction. So it drops
> all locks and calls sync_dirty_buffer(bh).
> Now in other process, file is truncated so that 'bh' gets just after EOF.
> As we have 1kb buffers, it can happen that bh is in the partially
> truncated page. Buffer is marked unmapped and clean. But in a moment the page
> is marked dirty and msync() is called. That eventually calls
> set_page_dirty() and all buffers in the page are marked dirty.
> The first process now wakes up, locks the buffer, clears the dirty bit
> and does submit_bh() - Oops.
>
> This is essentially the same problem Badari found but in a different
> place. There are two places that are arguably wrong...
> 1) We mark buffer dirty after EOF. But actually that may be needed -
> or what is the expected behaviour when we write into mmapped file after
> EOF, then extend the file and do msync()?

yup.

> 2) We submit a buffer after EOF for IO. This should be clearly avoided
> but getting the needed info from bh is really ugly...

Things like __block_write_full_page() avoid this by checking the block's
offset against i_size. (Not racy against truncate-down because the page is
locked, not racy against truncate-up because the bh is zero and
up-to-date).

But for jbd writeout we don't hold the page lock, so checking against
bh->b_page->host->i_size is a bit racy.

hm. But we do lock the buffers in journal_invalidatepage(), so checking
i_size after locking the buffer in the writeout path might get us there.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/