From: Nick Piggin on
On Thu, Jun 03, 2010 at 09:46:02PM -0400, Martin K. Petersen wrote:
> >>>>> "Nick" == Nick Piggin <npiggin(a)suse.de> writes:
>
> Nick> Also I don't think we can deal with memory errors and scribbles
> Nick> just by crcing dirty data. The calculations generating the data
> Nick> could get corrupted.
>
> Yep, the goal is to make the window as small as possible.
>
>
> Nick> Data can be corrupted on its way back from the device to
> Nick> userspace.
>
> We also get a CRC back from the storage. So the (integrity-aware)
> application is also able to check on read.

Well that's nice :)


> Nick> Obviously this feature is being pushed by databases and such that
> Nick> really want to pass checksums all the way from userspace. Block
> Nick> retrying is _not_ needed or wanted here of course.
>
> Nope. The integrity error is bubbled all the way up to the database and
> we can decide to retry, recreate or error out depending on what we find
> when we do validation checks on the data buffer and the integrity
> metadata.

By block retrying, I just meant the bounce / re-checksum approach.
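
For readers following along, the bounce / re-checksum idea can be modelled
in userspace roughly as below. This is only a sketch: zlib.crc32 stands in
for the real guard checksum, and the function names are made up for
illustration rather than taken from any kernel interface.

```python
import zlib


def checksum_in_place(page: bytearray):
    """Checksum the live page; later stores can still change its contents."""
    return page, zlib.crc32(page)


def checksum_bounced(page: bytearray):
    """Copy the page into a private bounce buffer and checksum the copy."""
    bounce = bytes(page)  # immutable snapshot, decoupled from the live page
    return bounce, zlib.crc32(bounce)


def matches(buf, crc: int) -> bool:
    """What the lower layer does: recompute the checksum over what it was given."""
    return zlib.crc32(buf) == crc


page = bytearray(b"x" * 4096)
live, live_crc = checksum_in_place(page)
copy, copy_crc = checksum_bounced(page)

page[0] = ord("y")  # a store lands while the I/O is still in flight

in_place_ok = matches(live, live_crc)   # the live page no longer matches
bounced_ok = matches(copy, copy_crc)    # the bounce copy is unaffected
```

The cost of the bounce is an extra copy per page written, in exchange for
a checksum that always matches the data handed to the device.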

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jan Kara on
On Fri 04-06-10 12:02:43, Dave Chinner wrote:
> On Thu, Jun 03, 2010 at 11:46:34AM -0400, Chris Mason wrote:
> > On Wed, Jun 02, 2010 at 11:41:21PM +1000, Nick Piggin wrote:
> > > Closing the while it is dirty, while it is being written back window
> > > still leaves a pretty big window. Also, how do you handle mmap writes?
> > > Write protect and checksum the destination page after every store? Or
> > > leave some window between when the pagecache is dirtied and when it is
> > > written back? So I don't know whether it's worth putting a lot of effort
> > > into this case.
> >
> > So, changing gears to how do we protect filesystem page cache pages
> > instead of the generic idea of dif/dix, btrfs crcs just before writing,
> > which does leave a pretty big window for the page to get corrupted.
> > The storage layer shouldn't care or know about that though, we hand it a
> > crc and it makes sure data matching that crc goes to the media.
>
> I think the only way to get accurate CRCs is to stop modifications
> from occurring while the page is under writeback. i.e. when a page
> transitions from dirty to writeback we need to unmap any writable
> mappings on the page, and then any new modifications (either by the
> write() path or through ->fault) need to block waiting for
> page writeback to complete before they can proceed...

Actually, we already write-protect the page in clear_page_dirty_for_io()
so the first part already happens. Any filesystem can call
wait_on_page_writeback() in its ->page_mkwrite function, so even the second
part shouldn't be hard. I'm just a bit worried about the performance
implications / hidden deadlocks...
Also we'd have to call wait_on_page_writeback() in the ->write_begin
function to protect against ordinary writes, but that's the easy part...
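
A userspace model of that scheme, as a rough sketch only: the Page class
and its method names are hypothetical, a condition variable stands in for
the page writeback flag, and zlib.crc32 stands in for the guard checksum.
A writer that arrives during writeback blocks until writeback ends, so the
checksum taken at the start still matches when the data reaches the device.

```python
import threading
import zlib


class Page:
    """Toy page whose writers must wait while writeback is in progress."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self._writeback = False
        self._cond = threading.Condition()

    def start_writeback(self) -> int:
        """Mark the page under writeback and return its checksum."""
        with self._cond:
            self._writeback = True
            return zlib.crc32(self.data)

    def end_writeback(self) -> None:
        """Clear the writeback state and wake any blocked writers."""
        with self._cond:
            self._writeback = False
            self._cond.notify_all()

    def write(self, offset: int, payload: bytes) -> None:
        """Modify the page, but only once no writeback is in progress."""
        with self._cond:
            while self._writeback:
                self._cond.wait()
            self.data[offset:offset + len(payload)] = payload


def demo():
    page = Page(b"A" * 512)
    guard = page.start_writeback()

    # This writer cannot touch the page until end_writeback() is called.
    writer = threading.Thread(target=page.write, args=(0, b"B"))
    writer.start()

    crc_at_device = zlib.crc32(page.data)  # what the device would verify
    page.end_writeback()
    writer.join()
    return guard == crc_at_device, bytes(page.data[:1])
```

The performance worry is visible even in the model: every writer that hits
a page under writeback stalls for the full duration of the I/O.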

Honza
--
Jan Kara <jack(a)suse.cz>
SUSE Labs, CR
From: Martin K. Petersen on
>>>>> "Dave" == Dave Chinner <david(a)fromorbit.com> writes:

>> Didn't you use to wait_on_page_writeback() in page_mkwrite()?

Dave> The generic implementation of ->page_mkwrite
Dave> (block_page_mkwrite()) which XFS uses has never had a
Dave> wait_on_page_writeback() call in it. There's no call in the
Dave> generic write paths, either, hence my comment that only direct IO
Dave> on XFS will work.

I guess that wait_on_page_writeback() was something I added when I used
XFS for DIF testing.

--
Martin K. Petersen Oracle Linux Engineering
From: Boaz Harrosh on
On 06/07/2010 07:20 PM, Martin K. Petersen wrote:
>>>>>> "Dave" == Dave Chinner <david(a)fromorbit.com> writes:
>
>>> Didn't you use to wait_on_page_writeback() in page_mkwrite()?
>
> Dave> The generic implementation of ->page_mkwrite
> Dave> (block_page_mkwrite()) which XFS uses has never had a
> Dave> wait_on_page_writeback() call in it. There's no call in the
> Dave> generic write paths, either, hence my comment that only direct IO
> Dave> on XFS will work.
>
> I guess that wait_on_page_writeback() was something I added when I used
> XFS for DIF testing.
>

Do you remember some performance numbers that show degradation / sameness?

What type of work loads?

Thanks
Boaz
From: Martin K. Petersen on
>>>>> "Boaz" == Boaz Harrosh <bharrosh(a)panasas.com> writes:

Boaz> Do you remember some performance numbers that show degradation /
Boaz> sameness?

Boaz> What type of work loads?

I haven't been using XFS much for over a year. I'm using an internal
async I/O tool and btrfs for most of my DIX/DIF testing these days.

But my original changes were along the lines of what Jan mentioned
earlier (hooking into page_mkwrite and waiting for writeback). I could
have sworn that I only did it for ext[23] and that XFS waited out of the
box, but git proves me wrong. Anyway, I'll try to get some benchmarking
happening later this week.

This won't fix things completely, though. ext2fs, for instance,
frequently changes metadata buffers in flight so it trips the guard
check in no time.
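
To illustrate what tripping the guard check means, here is a small
userspace sketch. It assumes the 16-bit T10 DIF CRC (polynomial 0x8BB7) as
the guard and a 512-byte sector; the helper names are made up for the
example and the bitwise loop is written for clarity, not speed.

```python
def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def guard_check(sector: bytes, guard: int) -> bool:
    """Recompute the guard over the data as received and compare."""
    return crc16_t10dif(sector) == guard


# A metadata buffer is checksummed when the I/O is submitted...
buf = bytearray(512)
guard = crc16_t10dif(buf)
ok_before = guard_check(buf, guard)

# ...and then modified while the request is still in flight.
buf[100] ^= 0x01
ok_after = guard_check(buf, guard)  # mismatch: the request is rejected
```

Any change to the buffer between checksum generation and verification
shows up as a mismatch, which is why in-flight metadata updates fail so
quickly.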

--
Martin K. Petersen Oracle Linux Engineering