From: Chris Mason on
On Wed, Jun 02, 2010 at 04:46:49AM +1000, Nick Piggin wrote:
> On Tue, Jun 01, 2010 at 02:09:05PM -0400, Chris Mason wrote:
> > On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> >
> > > For self
> > > induced errors (as long as we can detect them) I think we can just
> > > forget about it ... if the changed page is important, the I/O request
> > > gets repeated (modulo the problem of too great a frequency of changes
> > > leading to us never successfully writing it) or it gets dropped because
> > > the file was truncated or the data deleted for some other reason.
> >
> > Sorry, how can we tell the errors that are self induced from the evil
> > bit flipping cable induced errors?
>
> Block layer should retry it with bounce pages. That would be a lot nicer
> than forcing all upper layers to avoid the problem.
>

So the idea is that we have sent down a buffer and it changed in flight.
The block layer is going to say: oh look, the CRCs don't match, I'll
bounce it, re-CRC it and send it again. But there are at least three
reasons the CRC can change:

1) filesystem changed it
2) corruption on the wire or in the raid controller
3) the page was corrupted while the IO layer was doing the IO.

Cases 1 and 2 are easy: we bounce, retry, and everyone continues on
with their lives. With #3, we'll re-CRC and send the IO down again
thinking the data is correct when really we're writing garbage.

How can we tell these three cases apart?

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: James Bottomley on
On Tue, 2010-06-01 at 14:09 -0400, Chris Mason wrote:
> On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> > On Tue, 2010-06-01 at 12:47 -0400, Chris Mason wrote:
> > > On Tue, Jun 01, 2010 at 10:29:30AM -0600, Matthew Wilcox wrote:
> > > > On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > > > > > I agree that a block based retry would close all the holes ... it just
> > > > > > doesn't look elegant to me that the fs will already be repeating the I/O
> > > > > > if it changed the page and so will block.
> > > > >
> > > > > We might not ever repeat the IO. We might change the page, write it,
> > > > > change it again, truncate the file and toss the page completely.
> > > >
> > > > Why does it matter that it was never written in that case?
> > >
> > > It matters if the storage layer is going to wait around for the block to
> > > be written again with a correct crc.
> >
> > Actually, I wasn't advocating that. I think block should return a guard
> > mismatch error. I think somewhere in filesystem writeout is the place
> > to decide whether the error was self induced or systematic.
>
> In that case the io error goes to the async page writeback bio-endio
> handlers. We don't have a reference on the inode and no ability to
> reliably restart the IO, but we can set a bit on the address space
> indicating that somewhere, sometime in the past we had an IO error.
>
> > For self
> > induced errors (as long as we can detect them) I think we can just
> > forget about it ... if the changed page is important, the I/O request
> > gets repeated (modulo the problem of too great a frequency of changes
> > leading to us never successfully writing it) or it gets dropped because
> > the file was truncated or the data deleted for some other reason.
>
> Sorry, how can we tell the errors that are self induced from the evil
> bit flipping cable induced errors?

We have all the information ... the fs will eventually mark the page
dirty when it finishes the alterations, we just have to find a way to
express that.

If you're thinking of the double fault scenario where the page
spontaneously corrupts *and* the filesystem alters it, then the only way
of detecting that is to freeze the page as it undergoes I/O ... which
involves quite a bit of filesystem surgery, doesn't it?

James


From: Chris Mason on
On Tue, Jun 01, 2010 at 04:07:43PM -0500, James Bottomley wrote:
> On Tue, 2010-06-01 at 14:09 -0400, Chris Mason wrote:
> > On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> > > On Tue, 2010-06-01 at 12:47 -0400, Chris Mason wrote:
> > > > On Tue, Jun 01, 2010 at 10:29:30AM -0600, Matthew Wilcox wrote:
> > > > > On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > > > > > > I agree that a block based retry would close all the holes ... it just
> > > > > > > doesn't look elegant to me that the fs will already be repeating the I/O
> > > > > > > if it changed the page and so will block.
> > > > > >
> > > > > > We might not ever repeat the IO. We might change the page, write it,
> > > > > > change it again, truncate the file and toss the page completely.
> > > > >
> > > > > Why does it matter that it was never written in that case?
> > > >
> > > > It matters if the storage layer is going to wait around for the block to
> > > > be written again with a correct crc.
> > >
> > > Actually, I wasn't advocating that. I think block should return a guard
> > > mismatch error. I think somewhere in filesystem writeout is the place
> > > to decide whether the error was self induced or systematic.
> >
> > In that case the io error goes to the async page writeback bio-endio
> > handlers. We don't have a reference on the inode and no ability to
> > reliably restart the IO, but we can set a bit on the address space
> > indicating that somewhere, sometime in the past we had an IO error.
> >
> > > For self
> > > induced errors (as long as we can detect them) I think we can just
> > > forget about it ... if the changed page is important, the I/O request
> > > gets repeated (modulo the problem of too great a frequency of changes
> > > leading to us never successfully writing it) or it gets dropped because
> > > the file was truncated or the data deleted for some other reason.
> >
> > Sorry, how can we tell the errors that are self induced from the evil
> > bit flipping cable induced errors?
>
> We have all the information ... the fs will eventually mark the page
> dirty when it finishes the alterations, we just have to find a way to
> express that.

Eventually?

>
> If you're thinking of the double fault scenario where the page
> spontaneously corrupts *and* the filesystem alters it, then the only way
> of detecting that is to freeze the page as it undergoes I/O ... which
> involves quite a bit of filesystem surgery, doesn't it?

No, I'm thinking of the case where the page corrupts and the FS doesn't
alter it. I still don't know how to determine whether the FS changed the
page and will write it again, whether the FS changed the page and won't
write it again, or whether the page corrupted all on its own. The page
dirty bit isn't sufficient for this.

-chris

From: Nick Piggin on
On Tue, Jun 01, 2010 at 03:35:28PM -0400, Chris Mason wrote:
> On Wed, Jun 02, 2010 at 04:46:49AM +1000, Nick Piggin wrote:
> > On Tue, Jun 01, 2010 at 02:09:05PM -0400, Chris Mason wrote:
> > > On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> > >
> > > > For self
> > > > induced errors (as long as we can detect them) I think we can just
> > > > forget about it ... if the changed page is important, the I/O request
> > > > gets repeated (modulo the problem of too great a frequency of changes
> > > > leading to us never successfully writing it) or it gets dropped because
> > > > the file was truncated or the data deleted for some other reason.
> > >
> > > Sorry, how can we tell the errors that are self induced from the evil
> > > bit flipping cable induced errors?
> >
> > Block layer should retry it with bounce pages. That would be a lot nicer
> > than forcing all upper layers to avoid the problem.
> >
>
> So the idea is that we have sent down a buffer and it changed in flight.
> The block layer is going to say: oh look, the CRCs don't match, I'll
> bounce it, re-CRC it and send it again. But there are at least three
> reasons the CRC can change:
>
> 1) filesystem changed it
> 2) corruption on the wire or in the raid controller
> 3) the page was corrupted while the IO layer was doing the IO.
>
> Cases 1 and 2 are easy: we bounce, retry, and everyone continues on
> with their lives. With #3, we'll re-CRC and send the IO down again
> thinking the data is correct when really we're writing garbage.
>
> How can we tell these three cases apart?

Do we really need to handle #3? It could have happened before the
checksum was calculated.

From: Martin K. Petersen on
>>>>> "Nick" == Nick Piggin <npiggin(a)suse.de> writes:

>> 1) filesystem changed it
>> 2) corruption on the wire or in the raid controller
>> 3) the page was corrupted while the IO layer was doing the IO.
>>
>> Cases 1 and 2 are easy: we bounce, retry, and everyone continues on
>> with their lives. With #3, we'll re-CRC and send the IO down again
>> thinking the data is correct when really we're writing garbage.
>>
>> How can we tell these three cases apart?

Nick> Do we really need to handle #3? It could have happened before the
Nick> checksum was calculated.

Reason #3 is one of the main reasons for having the checksum in the
first place. The whole premise of the data integrity extensions is that
the checksum is calculated in close temporal proximity to the data
creation. I.e. eventually in userland.

Filesystems will inevitably have to be integrity-aware for that to work.
And it will be their job to keep the data pages stable during DMA.

--
Martin K. Petersen Oracle Linux Engineering