From: Nick Piggin on
On Fri, May 21, 2010 at 11:15:18AM -0400, Christoph Hellwig wrote:
> Nick, what exactly is the problem with the reserve + allocate design?
>
> In a delalloc filesystem (which is all those that will care about high
> performance large writes) the write path fundamentally consists of those
> two operations. Getting rid of the get_blocks mess and replacing it
> with a dedicated operations vector will simplify things a lot.

Nothing wrong with it, I think it's a fine idea (although you may still
need a per-bh call to connect the fs metadata to each page).

I just much prefer to have operations after the copy not able to fail,
otherwise you get into all those pagecache corner cases.

BTW. when you say reserve + allocate, this is in the page-dirty path,
right? I thought delalloc filesystems tend to do the actual allocation
in the page-cleaning path? Or am I confused?


> Punching holes is a rather problematic operation, and as mentioned not
> actually implemented for most filesystems - just decrementing counters
> on errors increases the chances that our error handling will actually
> work massively.

It's just harder for the pagecache. Invalidating and throwing out old
pagecache and splicing in new pages seems a bit of a hack.

From: Dave Chinner on
On Sat, May 22, 2010 at 12:31:02PM +1000, Nick Piggin wrote:
> On Fri, May 21, 2010 at 11:15:18AM -0400, Christoph Hellwig wrote:
> > Nick, what exactly is the problem with the reserve + allocate design?
> >
> > In a delalloc filesystem (which is all those that will care about high
> > performance large writes) the write path fundamentally consists of those
> > two operations. Getting rid of the get_blocks mess and replacing it
> > with a dedicated operations vector will simplify things a lot.
>
> Nothing wrong with it, I think it's a fine idea (although you may still
> need a per-bh call to connect the fs metadata to each page).
>
> I just much prefer to have operations after the copy not able to fail,
> otherwise you get into all those pagecache corner cases.
>
> BTW. when you say reserve + allocate, this is in the page-dirty path,
> right? I thought delalloc filesystems tend to do the actual allocation
> in the page-cleaning path? Or am I confused?

See my reply to Jan - delayed allocate has two parts to it - space
reservation (accounting for ENOSPC) and recording of the delalloc extents
(allocate). This is separate to the writeback path where we convert
delalloc extents to real extents....
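
Roughly, the split could be expressed as a per-filesystem operations
vector something like this (names made up purely for illustration, not
an existing interface):

	#include <linux/fs.h>	/* struct inode, loff_t */

	struct multipage_write_ops {
		/*
		 * Phase 1: space reservation - ENOSPC accounting only,
		 * so the failure path is just giving counters back.
		 */
		int (*reserve_space)(struct inode *inode, loff_t pos,
				     size_t len);

		/*
		 * Phase 2: after a successful copy, record the delalloc
		 * extents (extent tree and/or page/bufferhead state) for
		 * later conversion to real extents during writeback.
		 */
		int (*allocate_delalloc)(struct inode *inode, loff_t pos,
					 size_t len);

		/*
		 * Short copy or error: return the unused part of the
		 * reservation - counters only, no extents to punch out.
		 */
		void (*unreserve_space)(struct inode *inode, loff_t pos,
					size_t len);
	};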

> > Punching holes is a rather problematic operation, and as mentioned not
> > actually implemented for most filesystems - just decrementing counters
> > on errors increases the chances that our error handling will actually
> > work massively.
>
> It's just harder for the pagecache. Invalidating and throwing out old
> pagecache and splicing in new pages seems a bit of a hack.

Hardly a hack - it turns a buffered write into an operation that
does not expose transient page state and hence prevents torn writes.
That will allow us to use DIF enabled storage paths for buffered
filesystem IO(*), perhaps even allow us to generate checksums during
copy-in to do end-to-end checksum protection of data....

Cheers,

Dave.

(*) Yes, I know that mmap writes will still break DIF even if we do
writes this way.
--
Dave Chinner
david(a)fromorbit.com
From: Nick Piggin on
On Sat, May 22, 2010 at 06:37:03PM +1000, Dave Chinner wrote:
> On Sat, May 22, 2010 at 12:31:02PM +1000, Nick Piggin wrote:
> > On Fri, May 21, 2010 at 11:15:18AM -0400, Christoph Hellwig wrote:
> > > Nick, what exactly is the problem with the reserve + allocate design?
> > >
> > > In a delalloc filesystem (which is all those that will care about high
> > > performance large writes) the write path fundamentally consists of those
> > > two operations. Getting rid of the get_blocks mess and replacing it
> > > with a dedicated operations vector will simplify things a lot.
> >
> > Nothing wrong with it, I think it's a fine idea (although you may still
> > need a per-bh call to connect the fs metadata to each page).
> >
> > I just much prefer to have operations after the copy not able to fail,
> > otherwise you get into all those pagecache corner cases.
> >
> > BTW. when you say reserve + allocate, this is in the page-dirty path,
> > right? I thought delalloc filesystems tend to do the actual allocation
> > in the page-cleaning path? Or am I confused?
>
> See my reply to Jan - delayed allocate has two parts to it - space
> reservation (accounting for ENOSPC) and recording of the delalloc extents
> (allocate). This is separate to the writeback path where we convert
> delalloc extents to real extents....

Yes, I saw that. I'm sure we'll want clearer terminology in the core
code. But I don't quite see why you need to do it in two parts
(reserve, then "allocate"). Surely reservation failures are very rare,
and the error handling is obviously required anyway, so why not just
do a single call?


> > > Punching holes is a rather problematic operation, and as mentioned not
> > > actually implemented for most filesystems - just decrementing counters
> > > on errors increases the chances that our error handling will actually
> > > work massively.
> >
> > It's just harder for the pagecache. Invalidating and throwing out old
> > pagecache and splicing in new pages seems a bit of a hack.
>
> Hardly a hack - it turns a buffered write into an operation that
> does not expose transient page state and hence prevents torn writes.
> That will allow us to use DIF enabled storage paths for buffered
> filesystem IO(*), perhaps even allow us to generate checksums during
> copy-in to do end-to-end checksum protection of data....

It is a hack. Invalidating is inherently racy and isn't guaranteed
to succeed.

You do not need to invalidate the pagecache to do this (which as I said
is racy). You need to lock the page to prevent writes, and then unmap
user mappings. You'd also need to have done some magic so writable mmaps
don't leak into get_user_pages.
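
Roughly (untested sketch; it does nothing about mappings already pinned
via get_user_pages, which is the "magic" still missing):

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Stabilise a pagecache page without invalidating it: lock it to stop
 * buffered writers, then shoot down any user mappings of that range so
 * subsequent mmap stores fault and block until we unlock.
 */
static void stabilise_page(struct address_space *mapping, struct page *page)
{
	loff_t pos = (loff_t)page->index << PAGE_CACHE_SHIFT;

	lock_page(page);			/* blocks buffered writers */
	if (page_mapped(page))
		unmap_mapping_range(mapping, pos, PAGE_CACHE_SIZE, 1);
	/* page contents are now stable until unlock_page() */
}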

But this should be a different discussion anyway. Don't forget, your
approach is forced into the invalidation requirement because of
downsides in its error handling sequence. That cannot be construed as
a positive, because you are forced into it, whereas other approaches
*could* use it but do not have to.

From: Nick Piggin on
On Fri, May 21, 2010 at 11:19:22AM -0400, Josef Bacik wrote:
> On Sat, May 22, 2010 at 12:23:54AM +1000, Nick Piggin wrote:
> > On Fri, May 21, 2010 at 09:50:54AM -0400, Josef Bacik wrote:
> > > On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote:
> > > > Allocating multipage writes as unwritten extents turns off delayed
> > > > allocation and hence we'd lose all the benefits that this gives...
> > > >
> > >
> > > I just realized we have another problem, the mmap_sem/page_lock deadlock.
> > > Currently BTRFS is susceptible to this, since we don't prefault any of the pages
> > > in yet. If we're going to do multi-page writes we're going to need to have a
> > > way to fault in all of the iovec's at once, so when we do the
> > > pagefault_disable() copy pagefault_enable() we don't just end up copying the
> > > first iovec. Nick have you done something like this already? If not I assume
> > > I can just loop through all the iovec's and call fault_in_pages_readable on all
> > > of them and be good to go right? Thanks,
> >
> > Yes, well it's a different issue. With multi-page writes, even a single
> > iovec may not be faulted in properly. And with multiple iovecs, we are
> > already suboptimal with faulting.
> >
> > faulting in multiple iovecs may already be a good idea. I didn't add
> > that code, I had hoped for a test case first, but perhaps we can just
> > go and add it.
> >
> > With multipage writes, we would want to fault in multiple source pages
> > at once if the design was to lock multiple pages at once and do the
> > copy. I still think we might be able to just lock and copy one page at
> > a time, but I could be wrong.
> >
>
> I was thinking about this possibility, it seems this is what FUSE does already.
> It would probably be easier to deal with this fault problem, just do all the fs
> stuff to prepare for the total amount, do the copy one page at a time, and then
> if something goes wrong we can cleanup properly.

Yep. The fewer pages we have to copy to/from at once, the more robust
it should be.


> > Oh wow, btrfs is deadlocky there. Firstly, fault_in_pages_readable does
> > not guarantee success (race can always unmap the page in the meantime).
> > Secondly, calling it inside the page lock section just means it will
> > cause the deadlock rather than the copy_from_user.
> >
> > Quick workaround to reduce probability is to do fault_in_pages_readable
> > calls before locking the pages.
> >
> > But you really need to handle the short-copy case. From the error
> > handling there, it seems like you can just free_reserved_data_space and
> > retry the copy?
> >
>
> Well no, if there's a short copy we just exit. If we do the
> fault_in_pages_readable before we do the prepare_pages we could deal with a
> short-copy then. Thanks,

Take a look at the code in generic buffered write. It's not hard, but it
is easy to get the error case wrong.

You need to fault in the source pages before holding page lock. If this
fails, then you may exit. After locking the pages, you need to do a
pagefault_disable() and atomic kmaps/usercopies. If this copy comes up
short, you need to unlock and retry faulting in the source pages.

The reason for this is that the source page may be unmapped but there is
still a valid memory mapping at the point of the usercopy. So you must
not fail that.

And, if some bytes have been copied into pagecache, and if that pagecache
page is marked as uptodate, then you must treat it as a partial write.
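
For reference, this is roughly the shape of that loop (simplified from
generic_perform_write() in mm/filemap.c; file, mapping, a_ops, pos and
the source iov_iter i are assumed to come from the caller):

	struct page *page;
	void *fsdata;
	unsigned long offset, bytes;
	size_t copied;
	long status = 0;

	do {
		offset = pos & (PAGE_CACHE_SIZE - 1);
		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
			      iov_iter_count(i));
again:
		/*
		 * Fault in the source pages *before* write_begin takes
		 * the destination page lock. Best effort only - they can
		 * be unmapped again before the copy below.
		 */
		if (iov_iter_fault_in_readable(i, bytes)) {
			status = -EFAULT;
			break;
		}

		status = a_ops->write_begin(file, mapping, pos, bytes, 0,
					    &page, &fsdata);
		if (status)
			break;

		/* Atomic copy: must not fault, so it may come up short. */
		pagefault_disable();
		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
		pagefault_enable();
		flush_dcache_page(page);

		/*
		 * write_end unlocks the page. If the page was marked
		 * uptodate, a short copy still counts as a partial write.
		 */
		status = a_ops->write_end(file, mapping, pos, bytes, copied,
					  page, fsdata);
		if (status < 0)
			break;
		copied = status;

		iov_iter_advance(i, copied);
		if (copied == 0) {
			/*
			 * Nothing copied: retry the fault-in, dropping to a
			 * single-segment length to guarantee progress.
			 */
			bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
				      iov_iter_single_seg_count(i));
			goto again;
		}
		pos += copied;
	} while (iov_iter_count(i));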

From: Jan Kara on
On Sat 22-05-10 10:27:59, Dave Chinner wrote:
> On Fri, May 21, 2010 at 08:58:46PM +0200, Jan Kara wrote:
> > On Fri 21-05-10 09:05:24, Dave Chinner wrote:
> > > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote:
> > > > b) E.g. ext4 can do even without hole punching. It can allocate extent
> > > > as 'unwritten' and when something during the write fails, it just
> > > > leaves the extent allocated and the 'unwritten' flag makes sure that
> > > > any read will see zeros. I suppose that other filesystems that care
> > > > about multipage writes are able to do similar things (e.g. btrfs can
> > > > do the same as far as I remember, I'm not sure about gfs2).
> > >
> > > Allocating multipage writes as unwritten extents turns off delayed
> > > allocation and hence we'd lose all the benefits that this gives...
> > Ah, sorry. That was a short-circuit in my brain. But when we do delayed
> > allocation, I don't see why we should actually do any hole punching...
> > The write needs to:
> > a) reserve enough blocks for the write - I don't know about other
> > filesystems but for ext4 this means just incrementing a counter.
> > b) copy data page by page
> > c) release part of reservation (i.e. decrement counter) if we actually
> > copied less than we originally thought.
> >
> > Am I missing something?
>
> Possibly. Delayed allocation is made up of two parts - space
> reservation and recording the regions of delayed allocation in an
> extent tree, page/bufferhead state or both.
Yes. Ext4 records the info about delayed allocation only in buffer
heads.

> In XFS, these two steps happen in the same get_blocks call, but the
> result of that is we have to truncate/punch delayed allocate extents
> out just like normal extents if we are not going to use them. Hence
> a reserve/allocate interface allows us to split the operation -
> reserve ensures we have space for the delayed allocation, allocate
> inserts the delayed extents into the inode extent tree for later
> real allocation during writeback. Hence the unreserve call can
> simply be accounting - it has no requirement to punch out delayed
> extents that may have already been allocated, just do work on
> counters.
>
> btrfs already has this split design - it reserves space, does the
> copy, then marks the extent ranges as delalloc once the copy has
> succeeded, otherwise it simply unreserves the unused space.
>
> Once again, I don't know if ext4 does this internal delayed
> allocation extent tracking or whether it just uses page state to
> track those extents, but it would probably still have to use the
> allocate call to mark all the pages/bufferheads as delalloc so
> that unreserve didn't have to do any extra work.
  Yes, exactly. I just wanted to point out that AFAICS ext4 can implement
proper error recovery without needing a 'punch' operation. So Nick's
copy page-by-page approach should be plausible, at least for ext4, after all.
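
  In sketch form, steps a) and c) above are just counter manipulation
(this is not the actual ext4 code, only an illustration of why the
error path needs no extent punching):

	/* Illustration only - a counter-based reservation scheme. */
	struct blk_reservation {
		spinlock_t	lock;
		s64		free_blocks;	/* blocks not yet reserved */
	};

	static int reserve_blocks(struct blk_reservation *r, unsigned int nr)
	{
		int ret = 0;

		spin_lock(&r->lock);
		if (r->free_blocks < nr)
			ret = -ENOSPC;		/* a) reservation can fail... */
		else
			r->free_blocks -= nr;	/* ...but it's only a counter */
		spin_unlock(&r->lock);
		return ret;
	}

	static void unreserve_blocks(struct blk_reservation *r, unsigned int unused)
	{
		/* c) short copy: hand the unused reservation back */
		spin_lock(&r->lock);
		r->free_blocks += unused;
		spin_unlock(&r->lock);
	}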

Honza
--
Jan Kara <jack(a)suse.cz>
SUSE Labs, CR