From: Dave Jones on
On Tue, Oct 03, 2006 at 02:40:30AM -0400, Dave Jones wrote:

> > > > ----------- [cut here ] --------- [please bite here ] ---------
> > > > Kernel BUG at fs/buffer.c:2791
> > >
> > > I had thought/hoped that this was fixed by Jan's patch at
> > > http://lkml.org/lkml/2006/9/7/236 from the thread started at
> > > http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug
> > > first by going through that new codepath....
> >
> > Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion. iirc,
> > Badari was hitting that BUG and was able to confirm that Jan's patch
> > (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
> > it.
>
> Ok, this afternoon I was definitely running a kernel with that patch in it,
> and managed to get a trace (It was the one from the top of this thread
> that unfortunately got truncated).
>
> Now, I can't reproduce it on a plain 2.6.18+that patch.
> I'll leave the stress test running overnight, and see if anything
> falls out in the morning.

Been chugging away for 10 hrs now without repeating that incident. Hmm.
That patch looks like good -stable material. I'll keep digging to
see if I can somehow reproduce the problem I saw with the patch applied,
but in absence of something better, I think we should go with it.

One thing that did happen in the 10hrs was fsx-over-NFS spewed some
nasty looking trace. I'll post that separately next.

Dave

--
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Eric Sandeen on
Andrew Morton wrote:
> On Tue, 03 Oct 2006 00:43:01 -0500
> Eric Sandeen <sandeen(a)sandeen.net> wrote:
>
>> Dave Jones wrote:
>>
>>> So I managed to reproduce it with an 'fsx foo' and a
>>> 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
>>> a vanilla 2.6.18 with none of the Fedora patches..
>>>
>>> I'll give 2.6.18-git a try next.
>>>
>>> Dave
>>>
>>> ----------- [cut here ] --------- [please bite here ] ---------
>>> Kernel BUG at fs/buffer.c:2791
>> I had thought/hoped that this was fixed by Jan's patch at
>> http://lkml.org/lkml/2006/9/7/236 from the thread started at
>> http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug
>> first by going through that new codepath....
>
> Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion. iirc,
> Badari was hitting that BUG and was able to confirm that Jan's patch
> (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
> it.

Looking at some BH traces*, it appears that what Dave hit is a truncate
racing with a sync...

truncate ...
ext3_invalidate_page
journal_invalidatepage
journal_unmap_buffer

going off at the same time as

sync ...
journal_dirty_data
sync_dirty_buffer
submit_bh <-- finds unmapped buffer, boom.

I'm not sure what should be coordinating this, and I'm not sure why
we've not yet seen it on a stock kernel, but only FC6... I haven't found
anything in FC6 that looks like it may affect this.
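
The interleaving described above can be replayed as a small single-threaded
model. This is only a sketch under stated assumptions, not kernel code: the
toy_bh struct and the toy_* functions are hypothetical stand-ins, and the only
things taken from the thread are the call names they imitate
(journal_unmap_buffer, submit_bh) and the !buffer_mapped() assertion. The
"race" is simulated with a flag that decides whether the unmap lands between
the dirty check and the submit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a buffer head; NOT the kernel's struct buffer_head. */
struct toy_bh {
    bool mapped;
    bool dirty;
};

/* Truncate side: imitates journal_unmap_buffer() zapping the buffer. */
static void toy_unmap(struct toy_bh *bh)
{
    bh->mapped = false;
    bh->dirty = false;
}

/*
 * Imitates submit_bh(): returns -1 at the point where the real code
 * would trip the !buffer_mapped() assertion, 0 otherwise.
 */
static int toy_submit(const struct toy_bh *bh)
{
    if (!bh->mapped)
        return -1;
    return 0;
}

/*
 * Sync side, replayed as an explicit interleaving. When
 * truncate_runs_in_window is true, the unmap happens after the dirty
 * check but before the submit, which is the window the traces suggest.
 */
static int toy_sync(struct toy_bh *bh, bool truncate_runs_in_window)
{
    assert(bh != NULL);
    if (!bh->dirty)
        return 0;               /* nothing to write out */
    if (truncate_runs_in_window)
        toy_unmap(bh);          /* the other path wins the race here */
    return toy_submit(bh);
}
```

Calling toy_sync() on a mapped, dirty buffer with the flag set returns -1,
which corresponds to the "boom" in the trace; with the flag clear it returns 0.
The model says nothing about which lock or state check should close the window
in the real code.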

-Eric

*http://people.redhat.com/esandeen/traces/davej_ext3_oops1.txt
From: Eric Sandeen on
Eric Sandeen wrote:

>>> I had thought/hoped that this was fixed by Jan's patch at
>>> http://lkml.org/lkml/2006/9/7/236 from the thread started at
>>> http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug
>>> first by going through that new codepath....
>> Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion. iirc,
>> Badari was hitting that BUG and was able to confirm that Jan's patch
>> (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
>> it.
>
> Looking at some BH traces*, it appears that what Dave hit is a truncate
> racing with a sync...

(oh btw this is -with the above patch from Jan in place...)

-Eric

> truncate ...
> ext3_invalidate_page
> journal_invalidatepage
> journal_unmap_buffer
>
> going off at the same time as
>
> sync ...
> journal_dirty_data
> sync_dirty_buffer
> submit_bh <-- finds unmapped buffer, boom.
>
> I'm not sure what should be coordinating this, and I'm not sure why
> we've not yet seen it on a stock kernel, but only FC6... I haven't found
> anything in FC6 that looks like it may affect this.
>
> -Eric
>
> *http://people.redhat.com/esandeen/traces/davej_ext3_oops1.txt

From: Badari Pulavarty on
On Mon, 2006-10-09 at 14:46 -0500, Eric Sandeen wrote:
> Andrew Morton wrote:
> > On Tue, 03 Oct 2006 00:43:01 -0500
> > Eric Sandeen <sandeen(a)sandeen.net> wrote:
> >
> >> Dave Jones wrote:
> >>
> >>> So I managed to reproduce it with an 'fsx foo' and a
> >>> 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
> >>> a vanilla 2.6.18 with none of the Fedora patches..
> >>>
> >>> I'll give 2.6.18-git a try next.
> >>>
> >>> Dave
> >>>
> >>> ----------- [cut here ] --------- [please bite here ] ---------
> >>> Kernel BUG at fs/buffer.c:2791
> >> I had thought/hoped that this was fixed by Jan's patch at
> >> http://lkml.org/lkml/2006/9/7/236 from the thread started at
> >> http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug
> >> first by going through that new codepath....
> >
> > Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion. iirc,
> > Badari was hitting that BUG and was able to confirm that Jan's patch
> > (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
> > it.
>
> Looking at some BH traces*, it appears that what Dave hit is a truncate
> racing with a sync...
>
> truncate ...
> ext3_invalidate_page
> journal_invalidatepage
> journal_unmap_buffer
>
> going off at the same time as
>
> sync ...
> journal_dirty_data
> sync_dirty_buffer
> submit_bh <-- finds unmapped buffer, boom.
>

I don't understand how this can happen ..

journal_unmap_buffer() is zapping the buffer since it's not attached to any
transaction.

journal_unmap_buffer():[fs/jbd/transaction.c:1789] not on any
transaction: zap
b_state:0x10402f b_jlist:BJ_None cpu:0 b_count:3 b_blocknr:52735707
b_jbd:1 b_frozen_data:0000000000000000
b_committed_data:0000000000000000
b_transaction:0 b_next_transaction:0 b_cp_transaction:0
b_trans_is_running:0
b_trans_is_comitting:0 b_jcount:2 pg_dirty:1


journal_dirty_data() would do submit_bh() ONLY if it's part of the older
transaction.

I need to take a closer look to understand the race.

BTW, is this a 1k or 2k filesystem? How easy is it to reproduce the
problem?

Thanks,
Badari



From: Jan-Benedict Glaw on
On Mon, 2006-10-09 14:46:30 -0500, Eric Sandeen <sandeen(a)sandeen.net> wrote:
> Andrew Morton wrote:
> > On Tue, 03 Oct 2006 00:43:01 -0500
> > Eric Sandeen <sandeen(a)sandeen.net> wrote:
> > > Dave Jones wrote:
> > > > So I managed to reproduce it with an 'fsx foo' and a
> > > > 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
> > > > a vanilla 2.6.18 with none of the Fedora patches..
> > > >
> > > > I'll give 2.6.18-git a try next.
> > > >
> > > > ----------- [cut here ] --------- [please bite here ] ---------
> > > > Kernel BUG at fs/buffer.c:2791
> > > I had thought/hoped that this was fixed by Jan's patch at
> > > http://lkml.org/lkml/2006/9/7/236 from the thread started at
> > > http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug
> > > first by going through that new codepath....
> >
> > Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion. iirc,
> > Badari was hitting that BUG and was able to confirm that Jan's patch
> > (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
> > it.
>
> Looking at some BH traces*, it appears that what Dave hit is a truncate
> racing with a sync...
>
> truncate ...
> ext3_invalidate_page
> journal_invalidatepage
> journal_unmap_buffer
>
> going off at the same time as
>
> sync ...
> journal_dirty_data
> sync_dirty_buffer
> submit_bh <-- finds unmapped buffer, boom.

Is this possibly related to the issues that are discussed in another
thread? We're seeing problems while unlinking large files (usually get
it within some hours with 200MB files, but couldn't yet reproduce it
with 20MB.)

MfG, JBG
--
Jan-Benedict Glaw jbglaw(a)lug-owl.de +49-172-7608481
Signature of: Alles wird gut! ...und heute wirds schon ein bißchen besser.