From: Nick Piggin on
On Thu, Jun 03, 2010 at 11:46:34AM -0400, Chris Mason wrote:
> On Wed, Jun 02, 2010 at 11:41:21PM +1000, Nick Piggin wrote:
> > On Wed, Jun 02, 2010 at 09:17:56AM -0400, Martin K. Petersen wrote:
> > > >>>>> "Nick" == Nick Piggin <npiggin(a)suse.de> writes:
> > >
> > > >> 1) filesystem changed it
> > > >> 2) corruption on the wire or in the raid controller
> > > >> 3) the page was corrupted while the IO layer was doing the IO.
> > > >>
> > > >> 1 and 2 are easy, we bounce, retry and everyone continues on with
> > > >> their lives. With #3, we'll recrc and send the IO down again
> > > >> thinking the data is correct when really we're writing garbage.
> > > >>
> > > >> How can we tell these three cases apart?
> > >
> > > Nick> Do we really need to handle #3? It could have happened before the
> > > Nick> checksum was calculated.
> > >
> > > Reason #3 is one of the main reasons for having the checksum in the
> > > first place. The whole premise of the data integrity extensions is that
> > > the checksum is calculated in close temporal proximity to the data
> > > creation. I.e. eventually in userland.
> > >
> > > Filesystems will inevitably have to be integrity-aware for that to work.
> > > And it will be their job to keep the data pages stable during DMA.
> >
> > Let's just think hard about what windows can actually be closed versus
> > how much effort goes in to closing them. I also prefer not to accept
> > half-solutions in the kernel because they don't want to implement real
> > solutions in hardware (it's pretty hard to checksum and protect all
> > kernel data structures by hand).
> >
> > For "normal" writes into pagecache, the data can get corrupted anywhere
> > from after it is generated in userspace, during the copy, while it is
> > dirty in cache, and while it is being written out.
>
> This is why the DIF/DIX spec has the idea of a crc generated in userland
> when the data is generated. At any rate the basic idea is to crc early
> but not often...recalculating the crc after we hand our precious memory
> to the evil device driver does weaken the end-to-end integrity checks.
>
> What I don't want to do is weaken the basic DIF/DIX structure by letting
> the lower layers recrc stuff as they find faults. It would be fine if we
> had some definitive way to say: the FS raced, just recrc. But we really
> don't.

That's true, but a naive kernel crc cannot do better than checksumming
at the user/kernel boundary (and has very big problems even doing that
well with mmap, get_user_pages, and concurrent dirtying). So we are
already resigned to a best-effort approach there.

Since we fundamentally can't have end-to-end protection then, it's
much harder to argue for significant complexity just to close the
hole a little.

So if we do the block layer retries in response to concurrent writes,
that opens the window there a little. But remember that only a small
proportion of writes will require retries, and for that proportion the
window only opens a small amount.

As far as I know, we're not checksumming at the usercopy point but at
the writeback point, so we already have a vastly bigger window there.


> > Closing the while it is dirty, while it is being written back window
> > still leaves a pretty big window. Also, how do you handle mmap writes?
> > Write protect and checksum the destination page after every store? Or
> > leave some window between when the pagecache is dirtied and when it is
> > written back? So I don't know whether it's worth putting a lot of effort
> > into this case.
>
> So, changing gears to how do we protect filesystem page cache pages
> instead of the generic idea of dif/dix, btrfs crcs just before writing,
> which does leave a pretty big window for the page to get corrupted.
> The storage layer shouldn't care or know about that though, we hand it a
> crc and it makes sure data matching that crc goes to the media.

I'm not totally against freezing redirtying events during writeback,
but as long as there is a huge window while the page sits dirty in
pagecache, it's not worth much, if any, complexity.

Also I don't think we can deal with memory errors and scribbles just by
crcing dirty data. The calculations generating the data could get
corrupted. Data can be corrupted on its way back from the device to
userspace. Dirty memory writeback is usually only a small part of the
entire data transformation process.


> > If you had an interface for userspace to insert checksums to direct IO
> > requests or pagecache ranges, then not only can you close the entire gap
> > between userspace data generation, and writeback. But you also can
> > handle mmap writes and anything else just fine: userspace knows about
> > the concurrency details, so it can add the right checksum (and
> > potentially fsync etc) when it's ready.
>
> Yeah, I do agree here.

After protecting writeback from IO bus and wire errors, I think this
would be the most productive thing to work on. Obviously this feature is
being pushed by databases and such that really want to pass checksums
all the way from userspace. Block retrying is _not_ needed or wanted
here of course.

After that is done, it might be worth coming back to see if
regular pagecache can be protected any better. I just think that's
the highest hanging fruit at the moment.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Martin K. Petersen on
>>>>> "Nick" == Nick Piggin <npiggin(a)suse.de> writes:

Nick,

>> Filesystems will inevitably have to be integrity-aware for that to
>> work. And it will be their job to keep the data pages stable during
>> DMA.

Nick> Closing the while it is dirty, while it is being written back
Nick> window still leaves a pretty big window. Also, how do you handle
Nick> mmap writes? Write protect and checksum the destination page
Nick> after every store? Or leave some window between when the pagecache
Nick> is dirtied and when it is written back? So I don't know whether
Nick> it's worth putting a lot of effort into this case.

I'm mostly interested in the cases where the filesystem acts as a
conduit for integrity metadata from user space.

I agree the corruption windows inside the kernel are only of moderate
interest. So far no filesystems have added support for explicitly
protecting a bio, because the block layer's automatic protection is
just a few function calls away.

--
Martin K. Petersen Oracle Linux Engineering
From: Martin K. Petersen on
>>>>> "Nick" == Nick Piggin <npiggin(a)suse.de> writes:

Nick> Also I don't think we can deal with memory errors and scribbles
Nick> just by crcing dirty data. The calculations generating the data
Nick> could get corrupted.

Yep, the goal is to make the window as small as possible.


Nick> Data can be corrupted on its way back from the device to
Nick> userspace.

We also get a CRC back from the storage. So the (integrity-aware)
application is also able to check on read.


Nick> Obviously this feature is being pushed by databases and such that
Nick> really want to pass checksums all the way from userspace. Block
Nick> retrying is _not_ needed or wanted here of course.

Nope. The integrity error is bubbled all the way up to the database and
we can decide to retry, recreate or error out depending on what we find
when we do validation checks on the data buffer and the integrity
metadata.

--
Martin K. Petersen Oracle Linux Engineering
From: Dave Chinner on
On Thu, Jun 03, 2010 at 11:46:34AM -0400, Chris Mason wrote:
> On Wed, Jun 02, 2010 at 11:41:21PM +1000, Nick Piggin wrote:
> > Closing the while it is dirty, while it is being written back window
> > still leaves a pretty big window. Also, how do you handle mmap writes?
> > Write protect and checksum the destination page after every store? Or
> > leave some window between when the pagecache is dirtied and when it is
> > written back? So I don't know whether it's worth putting a lot of effort
> > into this case.
>
> So, changing gears to how do we protect filesystem page cache pages
> instead of the generic idea of dif/dix, btrfs crcs just before writing,
> which does leave a pretty big window for the page to get corrupted.
> The storage layer shouldn't care or know about that though, we hand it a
> crc and it makes sure data matching that crc goes to the media.

I think the only way to get accurate CRCs is to stop modifications
from occurring while the page is under writeback. i.e. when a page
transitions from dirty to writeback we need to unmap any writable
mappings on the page, and then any new modifications (either by the
write() path or through ->fault) need to block waiting for
page writeback to complete before they can proceed...

If we can lock out modification during writeback, we can calculate
CRCs safely at any point in time the page is not mapped. e.g. we
could do the CRC calculation at copy-in time and store it on new
pages. During writeback, if the page has not been mapped then the
stored CRC can be used. If it has been mapped (say writeable
mappings clear the stored CRC during ->fault) then we can
recalculate the CRC once we've transitioned the page to being under
writeback...

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
From: Dave Chinner on
On Thu, Jun 03, 2010 at 09:34:18PM -0400, Martin K. Petersen wrote:
> >>>>> "Dave" == Dave Chinner <david(a)fromorbit.com> writes:
>
> Dave> If you are running DIF hardware, then XFS is only OK for direct
> Dave> IO. XFS will still get torn writes if you are overwriting
> Dave> buffered data (either by write() or mmap()) because there are no
> Dave> interlocks to prevent cached pages under writeback from being
> Dave> modified while DMA is being performed.....
>
> Didn't you use to wait_on_page_writeback() in page_mkwrite()?

The generic implementation of ->page_mkwrite (block_page_mkwrite())
which XFS uses has never had a wait_on_page_writeback() call in it.
There's no call in the generic write paths, either, hence my comment
that only direct IO on XFS will work.

It should be noted that metadata updates in XFS are already
protected against torn writes - buffers are held locked during IO,
and can only be modified while holding the lock. Hence the only
problem that needs solving for XFS to make full use of DIF/DIX is
getting the page cache paths to not modify pages under IO...


Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com