From: Dave Chinner on
On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote:
> On Fri, May 14, 2010 at 11:00:42AM +1000, Dave Chinner wrote:
> > On Wed, May 12, 2010 at 09:39:27PM -0400, Josef Bacik wrote:
> > > On Wed, May 12, 2010 at 05:24:04PM -0400, Josef Bacik wrote:
> > > > Hello,
> > > >
> > > > I just started adding aio_write to Btrfs and I noticed we're duplicating _a lot_
> > > > of the generic stuff in mm/filemap.c, even though the only thing that's really
> > > > unique is the fact that we copy userspace pages in chunks rather than one page
> > > > at a time. What would be best is instead of doing write_begin/write_end with
> > > > Btrfs, it would be nice if we could just do our own perform_write instead of
> > > > generic_perform_write. This way we can drop all of these generic checks we have
> > > > that we copied from filemap.c and just get to the business of actually writing
> > > > the data. I hate to add another file operation, but it would _greatly_ reduce
> > > > the amount of duplicate code we have. If there is no violent objection to this
> > > > I can put something together quickly for review. Thanks,
> > > >
> > >
> > > I just got a suggestion from hpa about instead moving what BTRFS does into
> > > generic_perform_write. What btrfs does is allocate a chunk of pages to
> > > cover the entirety of the write, set everything up, do the copy from user
> > > into the pages, and tear everything down, so essentially what
> > > generic_perform_write does, just with more pages.
> >
> > Except that btrfs does things in a very different manner to most
> > other filesystems ;)
> >
> > > I could modify
> > > generic_perform_write and the write_begin/write_end aops to do this, where
> > > write_begin will return how many pages it allocated, copy all of the
> > > user pages into the pages we allocated at once, and then call write_end with the
> > > pages we allocated in write_begin. Then I could just make btrfs do
> > > write_begin/write_end. So which option seems more palatable? Thanks,
> >
> > I can see how this would work for btrfs, but the issue is how any
> > other filesystem would handle it.
> >
> > I've been trying to get my head around how any filesystem using
> > bufferheads and generic code can do multipage writes using
> > write_begin/write_end without modifying the interface, and I just
> > threw away my second attempt because the error handling just
> > couldn't be handled cleanly without duplicating the entire
> > block_write_begin path in each filesystem that wanted to do
> > multipage writes.
> >
> > The biggest problem is that block allocation is intermingled with
> > allocating and attaching bufferheads to pages, hence error handling
> > can get really nasty and is split across a call chain 3 or 4
> > functions deep. The error handling is where I'm finding all the
> > dangerous and hard-to-kill demons lurking in dark corners. I suspect
> > there's a grue in there somewhere, too. ;)
> >
> > Separating the page+bufferhead allocation and block allocation would
> > make this much simpler but I can't fit that easily into the existing
> > interfaces.
> >
> > Hence I think that write_begin/copy pages/write_end is not really
> > suited to multipage writes when allocation is done in write_begin
> > and the write can then fail in a later stage without a simple method
> > of undoing the allocation. We don't have any hole punch interfaces
> > to the filesystems (and I think only XFS supports that functionality
> > right now), so handling errors after allocation becomes rather
> > complex, especially when you have multiple blocks per page.
> >
> > Hence I've independently come to the conclusion that delaying the
> > allocation until *after* the copy as btrfs does is probably the best
> > approach to take here. This largely avoids the error handling
> > complexity because the write operation is an all-or-nothing
> > operation. btrfs has separate hooks for space reservation and
> > releasing the reservation and doesn't commit to actually allocating
> > anything until the copying is complete. Hence cleanup is simple no
> > matter where a failure occurs.
> >
> > Personally, I'm tending towards killing the get_blocks() callback as
> > the first step in this process - turn it into a real inode/address
> > space operation (say ->allocate) so we can untangle the write path
> > somewhat (lots of filesystems just provide their operations as wrappers that
> > pass a fs-specific get_blocks callback to generic code). If the
> > "create" flag is then converted to a "command" field, the interface
> > can pass "RESERVE", "ALLOC", "CREATE", etc to allow different
> > operations to be clearly handled.
> >
> > e.g.:
> >
> > ->allocate(mapping, NULL, off, len, RESERVE)
> >         reserves necessary space for write
> > ->write_begin
> >         grab pages into page cache
> >         attach bufferheads (if required)
> >         fail -> goto drop_pages
> > copy data into pages
> >         fail -> goto drop_pages
> > ->allocate(mapping, pages, off, len, ALLOC)
> >         allocates reserved space (if required)
> >         sets up/maps/updates bufferheads/extents
> >         fail -> goto drop_pages
> > ->write_end
> >         set pages dirty + uptodate
> > done
> >
> > drop_pages:
> >         ->allocate(mapping, NULL, off, len, UNRESERVE)
> >         if needed, zero partial pages
> >         release pages, clear uptodate.
> >
> > Basically this allows the copying of data before any allocation is
> > actually done, but also allows ENOSPC to be detected before starting
> > the copy. The filesystem can call whatever helpers it needs inside
> > ->allocate(ALLOC) to set up bufferhead/extent state to match
> > what has been reserved/allocated/mapped in the RESERVE call.
> >
> > This will work for btrfs, and it will work for XFS and I think it
> > will work for other filesystems that are using bufferheads as well.
> > For those filesystems that will only support a page at a time, then
> > they can continue to use the current code, but should be able to be
> > converted to the multipage code by making RESERVE and UNRESERVE
> > no-ops, and ALLOC does what write_begin+get_blocks currently does to
> > set up block mappings.
> >
> > Thoughts?
> >
> So this is what I had envisioned: we make write_begin take a nr_pages pointer
> and tell it how much data we have to write, then in the filesystem we allocate
> as many pages as we feel like, ideally something like
>
> min(number of pages we need for the write, some arbitrary limit for security)

Actually, I was thinking that the RESERVE call determines the size
of the chunk (on the order of 1-4MB maximum). IOWs, we pass in the
start offset of the write, the entire length remaining, and the
RESERVE call determines how much it will allow in one loop iteration.

written = 0;
while (bytes_remaining > 0) {
        chunklen = ->allocate(off, bytes_remaining, RESERVE);
        write_begin(&pages, off, chunklen);
        copied = copy_pages(&pages, iov_iter, chunklen);
        .....
        bytes_remaining -= copied;
        off += copied;
        written += copied;
}
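
To make that concrete, the callout might look something like this
(just a sketch for discussion - the command names and prototype are
made up, nothing like this exists in the tree yet):

        enum alloc_cmd { RESERVE, ALLOC, UNRESERVE };

        /*
         * Sketch of the proposed address space operation. For RESERVE it
         * returns how many bytes from @off the filesystem will take on in
         * this pass (or -ENOSPC); @pages is only needed for ALLOC, after
         * the data has been copied in.
         */
        ssize_t (*allocate)(struct address_space *mapping,
                            struct page **pages, loff_t off,
                            size_t len, enum alloc_cmd cmd);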


> and then we allocate all those pages. Then you can pass them into
> block_write_begin, which will walk the pages, allocating buffer heads and
> allocating the space as needed.

Close - except that I'd like to move the block allocation out of
->write_begin until after the copy into the pages has completed.
Hence write_begin only deals with populating the page cache and
anything filesystem specific like locking or allocating/populating
per-page structures like bufferheads. i.e. ->write_begin does not
do anything that cannot be undone easily if the copy fails.
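
i.e. a multipage ->write_begin need be little more than this (sketch
only - "foofs" is a made-up filesystem, but
grab_cache_page_write_begin() and create_empty_buffers() are the
existing helpers):

        /* Hypothetical multipage write_begin: no block allocation here. */
        static int foofs_write_begin_pages(struct address_space *mapping,
                                           loff_t pos, int nr_pages,
                                           struct page **pages)
        {
                pgoff_t index = pos >> PAGE_CACHE_SHIFT;
                unsigned blocksize = 1 << mapping->host->i_blkbits;
                int i;

                for (i = 0; i < nr_pages; i++) {
                        pages[i] = grab_cache_page_write_begin(mapping,
                                                               index + i, 0);
                        if (!pages[i])
                                goto fail;
                        if (!page_has_buffers(pages[i]))
                                create_empty_buffers(pages[i], blocksize, 0);
                }
                return 0;
        fail:
                /* nothing to undo but the page references themselves */
                while (--i >= 0) {
                        unlock_page(pages[i]);
                        page_cache_release(pages[i]);
                }
                return -ENOMEM;
        }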

> Now since we're coming into write_begin with "we want to write X bytes" we can
> go ahead and do the enospc checks for X bytes, and then if we are good to go,
> chances are we won't fail.

Well, the "RESERVE" callout I proposed should mean that the
allocation that follows the copy should never fail with ENOSPC. i.e.
if there isn't space, the RESERVE call will fail with ENOSPC, not
the actual ALLOCATE call that occurs after the data is copied.

Basically I was thinking that in the XFS case, RESERVE would scan the
range for holes, and reserve all blocks needed to fill all the holes
in the range. Then the ALLOCATE call would mark them all delalloc.
The space reservation and delalloc is done in a single call right
now, but splitting them is not hard...
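
i.e. on the filesystem side the two phases would look roughly like
this (hand-waving pseudocode, not real XFS code - the helpers are
placeholders for whatever the filesystem already does internally):

        case RESERVE:
                /* how much of the range is holes? */
                nblocks = count_unmapped_blocks(inode, off, len);
                /* take the space out of the free counters only */
                if (reserve_blocks(inode, nblocks) < 0)
                        return -ENOSPC;
                return len;
        case ALLOC:
                /* data is already copied; mark the holes delalloc
                 * against the reservation above - must not ENOSPC */
                return mark_range_delalloc(inode, off, len);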

> Except if we're overwriting a holey section of the file, we're going to be
> screwed in both your way and my way. My way probably would be the most likely
> to fail, since we could fail to do the copy_from_user, but hopefully the segment
> checks and doing the fault_in_readable before all of this would keep those
> problems to a minimum.

fault_in_readable() only faults in the first page of the first iov
passed in, so the copy can still fail on any other page the iov points
to. And even then, fault_in_readable() does not pin the page it faulted
in, so even that page can get reclaimed before we start the copy and
we'd still get EFAULT.
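
For reference, this is roughly the dance generic_perform_write() does
right now to cope with exactly that (paraphrased from mm/filemap.c,
not quoted verbatim):

        if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
                status = -EFAULT;
                break;
        }
        ....
        pagefault_disable();
        copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
        pagefault_enable();
        ....
        if (unlikely(copied == 0)) {
                /*
                 * The fault-in above only touched (and didn't pin) the
                 * first page of the first segment, so the atomic copy
                 * can still come up empty - shrink the copy and retry.
                 */
                bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
                              iov_iter_single_seg_count(i));
                goto again;
        }

A multipage variant still has to handle a short or failed copy in some
equivalent way.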

> In your case the only failure point is in the allocate step. If we fail further
> down the line after we've done some hole filling, we'll be hard pressed to go back
> and free up those blocks. Is that what you are talking about, having the
> allocate(UNRESERVE) thing be able to go back and figure out which ranges should
> have been holes and make them holes again?

Yeah, that's kind of what I was thinking of. After a bit more
thought, if the allocation only covers part of the range (for
whatever reason), we can still allow that allocated part of the
write to complete successfully, and the UNRESERVE call simply returns
the unused part of the space reservation. This would be a partial
write and be much simpler to deal with than punching out the
allocations that succeeded.

> If that's the case then I think your idea
> will work and is probably the best way to move forward. But from what I can
> remember about how all this works with buffer heads, there's not a way to say
> "we _just_ allocated this recently". Thanks,

buffer_new() would suffice, I think.

In __block_prepare_write(), when a new buffer is mapped that flag is
cleared. If the buffer is not mapped, we call get_blocks, and any
buffer it allocates will be marked buffer_new() again. This is used
to trigger special things in __block_prepare_write() and the flag is
not cleared until write_end (in __block_commit_write). Hence a
buffer with the new flag set after allocation is a buffer that had
space allocated...
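
i.e. after the allocation call, a walk like this is enough to tell
freshly allocated buffers from pre-existing ones (sketch):

        struct buffer_head *head = page_buffers(page);
        struct buffer_head *bh = head;

        do {
                if (buffer_new(bh)) {
                        /*
                         * Space was allocated behind this buffer during
                         * this write - it's the one that needs zeroing/
                         * cleanup if the write doesn't complete.
                         */
                }
                bh = bh->b_this_page;
        } while (bh != head);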

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
From: Dave Chinner on
On Fri, May 14, 2010 at 03:50:54PM +1000, Nick Piggin wrote:
> On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote:
> > On Fri, May 14, 2010 at 11:00:42AM +1000, Dave Chinner wrote:
> > > On Wed, May 12, 2010 at 09:39:27PM -0400, Josef Bacik wrote:
> > > > On Wed, May 12, 2010 at 05:24:04PM -0400, Josef Bacik wrote:
> > > > I could modify
> > > > generic_perform_write and the write_begin/write_end aops to do this, where
> > > > write_begin will return how many pages it allocated, copy in all of the
> > > > userpages into the pages we allocated at once, and then call write_end with the
> > > > pages we allocated in write begin. Then I could just make btrfs do
> > > > write_begin/write_end. So which option seems more palatable? Thanks,
> > >
> > > I can see how this would work for btrfs, but the issue is how any
> > > other filesystem would handle it.
> > >
> > > I've been trying to get my head around how any filesystem using
> > > bufferheads and generic code can do multipage writes using
> > > write_begin/write_end without modifying the interface, and I just
> > > threw away my second attempt because the error handling just
> > > couldn't be handled cleanly without duplicating the entire
> > > block_write_begin path in each filesystem that wanted to do
> > > multipage writes.
> > >
> > > The biggest problem is that block allocation is intermingled with
> > > allocating and attaching bufferheads to pages, hence error handling
> > > can get really nasty and is split across a call chain 3 or 4
> > > functions deep. The error handling is where I'm finding all the
> > > dangerous and hard-to-kill demons lurking in dark corners. I suspect
> > > there's a grue in there somewhere, too. ;)
> > >
> > > Separating the page+bufferhead allocation and block allocation would
> > > make this much simpler but I can't fit that easily into the existing
> > > interfaces.
> > >
> > > Hence I think that write_begin/copy pages/write_end is not really
> > > suited to multipage writes when allocation is done in write_begin
> > > and the write can then fail in a later stage without a simple method
> > > of undoing the allocation. We don't have any hole punch interfaces
> > > to the filesystems (and I think only XFS supports that functionality
> > > right now), so handling errors after allocation becomes rather
> > > complex, especially when you have multiple blocks per page.
> > >
> > > Hence I've independently come to the conclusion that delaying the
> > > allocation until *after* the copy as btrfs does is probably the best
> > > approach to take here. This largely avoids the error handling
> > > complexity because the write operation is an all-or-nothing
> > > operation. btrfs has separate hooks for space reservation and
> > > releasing the reservation and doesn't commit to actually allocating
> > > anything until the copying is complete. Hence cleanup is simple no
> > > matter where a failure occurs.
> > >
> > > Personally, I'm tending towards killing the get_blocks() callback as
> > > the first step in this process - turn it into a real inode/address
> > > space operation (say ->allocate) so we can untangle the write path
> > > somewhat (lots of filesystems just provide their operations as wrappers that
> > > pass a fs-specific get_blocks callback to generic code). If the
> > > "create" flag is then converted to a "command" field, the interface
> > > can pass "RESERVE", "ALLOC", "CREATE", etc to allow different
> > > operations to be clearly handled.
> > >
> > > e.g.:
> > >
> > > ->allocate(mapping, NULL, off, len, RESERVE)
> > >         reserves necessary space for write
> > > ->write_begin
> > >         grab pages into page cache
> > >         attach bufferheads (if required)
> > >         fail -> goto drop_pages
> > > copy data into pages
> > >         fail -> goto drop_pages
> > > ->allocate(mapping, pages, off, len, ALLOC)
> > >         allocates reserved space (if required)
> > >         sets up/maps/updates bufferheads/extents
> > >         fail -> goto drop_pages
> > > ->write_end
> > >         set pages dirty + uptodate
> > > done
> > >
> > > drop_pages:
> > >         ->allocate(mapping, NULL, off, len, UNRESERVE)
> > >         if needed, zero partial pages
> > >         release pages, clear uptodate.
> > >
> > > Basically this allows the copying of data before any allocation is
> > > actually done, but also allows ENOSPC to be detected before starting
> > > the copy. The filesystem can call whatever helpers it needs inside
> > > ->allocate(ALLOC) to set up bufferhead/extent state to match
> > > what has been reserved/allocated/mapped in the RESERVE call.
> > >
> > > This will work for btrfs, and it will work for XFS and I think it
> > > will work for other filesystems that are using bufferheads as well.
> > > For those filesystems that will only support a page at a time, then
> > > they can continue to use the current code, but should be able to be
> > > converted to the multipage code by making RESERVE and UNRESERVE
> > > no-ops, and ALLOC does what write_begin+get_blocks currently does to
> > > set up block mappings.
> > >
> > > Thoughts?
> > >
> > So this is what I had envisioned, we make write_begin take a nr_pages pointer
> > and tell it how much data we have to write, then in the filesystem we allocate
> > as many pages as we feel like, ideally something like
> >
> > min(number of pages we need for the write, some arbitrary limit for security)
> >
> > and then we allocate all those pages. Then you can pass them into
> > block_write_begin, which will walk the pages, allocating buffer heads and
> > allocating the space as needed.
> >
> > Now since we're coming into write_begin with "we want to write X bytes" we can
> > go ahead and do the enospc checks for X bytes, and then if we are good to go,
> > chances are we won't fail.
> >
> > Except if we're overwriting a holey section of the file, we're going to be
> > screwed in both your way and my way. My way probably would be the most likely
> > to fail, since we could fail to do the copy_from_user, but hopefully the segment
> > checks and doing the fault_in_readable before all of this would keep those
> > problems to a minimum.
> >
> > In your case the only failure point is in the allocate step. If we fail on down
> > the line after we've done some hole filling, we'll be hard pressed to go back
> > and free up those blocks. Is that what you are talking about, having the
> > allocate(UNRESERVE) thing being able to go back and figure out what should have
> > been holes needs to be holes again? If that's the case then I think your idea
> > will work and is probably the best way to move forward. But from what I can
> > remember about how all this works with buffer heads, there's not a way to say
> > "we _just_ allocated this recently". Thanks,
>
> Now is there really a good reason to go this way and add more to the
> write_begin/write_end paths? Rather than having filesystems just
> implement their own write file_operations in order to do multi-block
> operations?

Well, if we've got xfs, btrfs, gfs2, ext4, and others all wanting to
do multipage writes, shouldn't we be trying to do it in a generic way?


Fuse doesn't have to deal with allocation of blocks in
fuse_perform_write().

> From what I can see, the generic code is not going to be able to be
> much help with error handling etc. so I would prefer to keep it as
> simple as possible. I think it is still adequate for most cases.
>
> Take a look at how fuse does multi-page write operations. It is about
> the simplest case you can get, but it still requires all the generic
> checks etc.

fuse_perform_write() doesn't do allocation, and so can easily abort
at the first error and just complete the writes that did succeed.
Hence I don't think it's a model that a filesystem that has to
handle space allocation can use.

> and it is quite neat -- I don't see a big issue with
> duplicating generic code?

When a large number of filesystems end up duplicating the same code,
then we should be looking at how to implement that functionality
generically, right?

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
From: Nick Piggin on
On Fri, May 14, 2010 at 04:41:45PM +1000, Dave Chinner wrote:
> On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote:
> > So this is what I had envisioned, we make write_begin take a nr_pages pointer
> > and tell it how much data we have to write, then in the filesystem we allocate
> > as many pages as we feel like, ideally something like
> >
> > min(number of pages we need for the write, some arbitrary limit for security)
>
> Actually, I was thinking that the RESERVE call determines the size
> of the chunk (in the order of 1-4MB maximum). IOWs, we pass in the
> start offset of the write, the entire length remaining, and the
> RESERVE call determines how much it will allow in one loop.
>
> written = 0;
> while (bytes_remaining > 0) {
>         chunklen = ->allocate(off, bytes_remaining, RESERVE);
>         write_begin(&pages, off, chunklen);
>         copied = copy_pages(&pages, iov_iter, chunklen);
>         .....
>         bytes_remaining -= copied;
>         off += copied;
>         written += copied;
> }

How much benefit are you expecting to get? I have no problems with
non-pagecache block allocation/freeing APIs to get the most out of
amortising expensive locking or tree manipulations, but I don't
know if it is a good idea to pin a big chunk of pagecache like that.

I'd say just do the per-page manipulations like we currently do.


> > and then we allocate all those pages. Then you can pass them into
> > block_write_begin, which will walk the pages, allocating buffer heads and
> > allocating the space as needed.
>
> Close - except that I'd like to move the block allocation out of
> ->write_begin until after the copy into the pages has completed.
> Hence write_begin only deals with populating the page cache and
> anything filesystem specific like locking or allocating/populating
> per-page structures like bufferheads. i.e. ->write_begin does not
> do anything that cannot be undone easily if the copy fails.

Hmm, not so sure. You run into other problems.

Write to a hole? Then you have to clear page uptodate and unmap
any potential mmaps on it before the copy to pagecache, in case
the allocation fails (otherwise you have exposed inconsistent
state to userspace).

Write to a hole, then to a non-hole? Then your write to the non-hole
pagecache cannot be recovered if the block allocation for the hole
fails and we need to return a short write.


> > Now since we're coming into write_begin with "we want to write X bytes" we can
> > go ahead and do the enospc checks for X bytes, and then if we are good to go,
> > chances are we won't fail.
>
> Well, the "RESERVE" callout I proposed should mean that the
> allocation that follows the copy should never fail with ENOSPC. i.e.
> if there isn't space, the RESERVE call will fail with ENOSPC, not
> the actual ALLOCATE call that occurs after the data is copied.
>
> Basically I was thinking that in the XFS case, RESERVE would scan the
> range for holes, and reserve all blocks needed to fill all the holes
> in the range. Then the ALLOCATE call would mark them all delalloc.
> The space reservation and delalloc is done in a single call right
> now, but splitting them is not hard...
>
> > Except if we're overwriting a holey section of the file, we're going to be
> > screwed in both your way and my way. My way probably would be the most likely
> > to fail, since we could fail to do the copy_from_user, but hopefully the segment
> > checks and doing the fault_in_readable before all of this would keep those
> > problems to a minimum.
>
> fault_in_readable() only faults the first page of the first iov
> passed in. It can still fail on any other page the iov points to.
> And even then, fault_in_readable() does not pin the page it faulted
> in memory, so even that can get recycled before we start the copy
> and we'd still get EFAULT.
>
> > In your case the only failure point is in the allocate step. If we fail on down
> > the line after we've done some hole filling, we'll be hard pressed to go back
> > and free up those blocks. Is that what you are talking about, having the
> > allocate(UNRESERVE) thing being able to go back and figure out what should have
> > been holes needs to be holes again?
>
> Yeah, that's kind of what I was thinking of. After a bit more
> thought, if the allocation only covers part of the range (for
> whatever reason), we can still allow that allocated part of the
> write to complete successfully, and the UNRESERVE call simply returns
> the unused part of the space reservation. This would be a partial
> write and be much simpler to deal with than punching out the
> allocations that succeeded.

Do we really need to restore holes inside i_size in error cases?
Current code does not. It just leaves the blocks allocated and
zeroes them.


From: Nick Piggin on
On Fri, May 14, 2010 at 05:20:55PM +1000, Dave Chinner wrote:
> On Fri, May 14, 2010 at 03:50:54PM +1000, Nick Piggin wrote:
> > Now is there really a good reason to go this way and add more to the
> > write_begin/write_end paths? Rather than having filesystems just
> > implement their own write file_operations in order to do multi-block
> > operations?
>
> Well, if we've got xfs, btrfs, gfs2, ext4, and others all wanting to
> do multipage writes, shouldn't we be trying to do it in a generic way?

If it makes sense, definitely.


> Fuse doesn't have to deal with allocation of blocks in
> fuse_perform_write()

I just can't see how the generic code can really help out with that
problem of error handling in the various parts of the allocation
operation.


> > From what I can see, the generic code is not going to be able to be
> > much help with error handling etc. so I would prefer to keep it as
> > simple as possible. I think it is still adequate for most cases.
> >
> > Take a look at how fuse does multi-page write operations. It is about
> > the simplest case you can get, but it still requires all the generic
> > checks etc.
>
> fuse_perform_write() doesn't do allocation, and so can easily abort
> at the first error and just complete the writes that did succeed.
> Hence I don't think it's a model that a filesystem that has to
> handle space allocation can use.

No, but it does all the _generic_ vfs checks required, which sounded
like what the btrfs folk were concerned about duplicating. My point
was just that there isn't very much duplication really.


> > and it is quite neat -- I don't see a big issue with
> > duplicating generic code?
>
> When a large number of filesystems end up duplicating the same code,
> then we should be looking at how to implement that functionality
> generically, right?

Yes, if it captures a good chunk of common code without unduly
complicating things. I'll be interested to see if that turns out to
be the case.

From: Dave Chinner on
On Fri, May 14, 2010 at 05:22:19PM +1000, Nick Piggin wrote:
> On Fri, May 14, 2010 at 04:41:45PM +1000, Dave Chinner wrote:
> > On Thu, May 13, 2010 at 11:30:57PM -0400, Josef Bacik wrote:
> > > So this is what I had envisioned, we make write_begin take a nr_pages pointer
> > > and tell it how much data we have to write, then in the filesystem we allocate
> > > as many pages as we feel like, ideally something like
> > >
> > > min(number of pages we need for the write, some arbitrary limit for security)
> >
> > Actually, I was thinking that the RESERVE call determines the size
> > of the chunk (in the order of 1-4MB maximum). IOWs, we pass in the
> > start offset of the write, the entire length remaining, and the
> > RESERVE call determines how much it will allow in one loop.
> >
> > written = 0;
> > while (bytes_remaining > 0) {
> >         chunklen = ->allocate(off, bytes_remaining, RESERVE);
> >         write_begin(&pages, off, chunklen);
> >         copied = copy_pages(&pages, iov_iter, chunklen);
> >         .....
> >         bytes_remaining -= copied;
> >         off += copied;
> >         written += copied;
> > }
>
> How much benefit are you expecting to get?

If the max chunk size is 4MB, that's three orders of magnitude fewer
allocation calls on x86_64 (i.e. one call instead of 1024, given one
call per 4k page today). For filesystems with significant allocation
overhead (like gaining cluster locks in gfs2), this will be a
*massive* win.

> I have no problems with
> non-pagecache block allocation/freeing APIs to get the most out of
> amortising expensive locking or tree manipulations, but I don't
> know if it is a good idea to pin a big chunk of pagecache like that.
> I'd say just do the per-page manipulations like we currently do.

Can you expand on this?

> > > and then we allocate all those pages. Then you can pass them into
> > > block_write_begin, which will walk the pages, allocating buffer heads and
> > > allocating the space as needed.
> >
> > Close - except that I'd like to move the block allocation out of
> > ->write_begin until after the copy into the pages has completed.
> > Hence write_begin only deals with populating the page cache and
> > anything filesystem specific like locking or allocating/populating
> > per-page structures like bufferheads. i.e. ->write_begin does not
> > do anything that cannot be undone easily if the copy fails.
>
> Hmm, not so sure. You run into other problems.
>
> Write to a hole? Then you have to clear page uptodate and unmap
> any potential mmaps on it before the copy to pagecache, in case
> the allocation fails (otherwise you have exposed inconsistent
> state to userspace).

Good call. I think that checking page_mapped() on each page as it is
grabbed and locked, and calling unmap_mapping_range() on any that are
mapped, will work. New faults will serialise on the page locks and,
because we hold the page lock across the entire operation, I don't
think we need to clear the uptodate flag unless we fail the write.
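
Something like this as each page is grabbed, perhaps (sketch -
unmap_mapping_range() wants byte offsets relative to the mapping):

        page = grab_cache_page_write_begin(mapping, index, flags);
        if (!page)
                goto fail;
        if (page_mapped(page)) {
                /*
                 * Shoot down existing user mappings so a later
                 * allocation failure can't leave stale data visible
                 * through mmap. New faults will block on the page
                 * lock we now hold.
                 */
                unmap_mapping_range(mapping,
                                    (loff_t)index << PAGE_CACHE_SHIFT,
                                    PAGE_CACHE_SIZE, 1);
        }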

> Write to a hole, then to a non-hole? Then your write to the non
> hole pagecache cannot be recovered in case the block allocation
> for the hole fails and we need to return a short write.

The pages covering non-holes are locked, so nobody else can access
them. We could simply invalidate them so the next access would force
a re-read from disk. The problem I see with this, though, is
multiple partial overwrites of the same allocated block - by
invalidating it we could lose changes from an earlier overwrite.

I think the simplest way to deal with it is to ensure that
->allocate(RESERVE) is not allowed to return ranges that span
allocated and unallocated regions. i.e. it can return an allocated
region or a hole, but not a region that combines both. The worst
case is that we fall back to current behaviour (an allocation call
per block), but for the case we're trying to optimise (large writes
into large holes) we'd still see the benefit of fewer allocation
calls. FWIW, I think this simplifies error handling as well (see
below).
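
In pseudocode, RESERVE would simply clamp the chunk it hands back at
the first change of state (the helpers here are placeholders):

        case RESERVE:
                /* extent is entirely a hole or entirely allocated */
                extlen = first_extent_length(inode, off, len);
                if (extent_is_hole(inode, off))
                        if (reserve_blocks(inode, extlen >> blkbits) < 0)
                                return -ENOSPC;
                return min_t(size_t, extlen, MAX_WRITE_CHUNK);  /* 1-4MB cap */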

> > > Now since we're coming into write_begin with "we want to write X bytes" we can
> > > go ahead and do the enospc checks for X bytes, and then if we are good to go,
> > > chances are we won't fail.
> >
> > Well, the "RESERVE" callout I proposed should mean that the
> > allocation that follows the copy should never fail with ENOSPC. i.e.
> > if there isn't space, the RESERVE call will fail with ENOSPC, not
> > the actual ALLOCATE call that occurs after the data is copied.
> >
> > Basically I was thinking that in the XFS case, RESERVE would scan the
> > range for holes, and reserve all blocks needed to fill all the holes
> > in the range. Then the ALLOCATE call would mark them all delalloc.
> > The space reservation and delalloc is done in a single call right
> > now, but splitting them is not hard...
> >
> > > Except if we're overwriting a holey section of the file, we're going to be
> > > screwed in both your way and my way. My way probably would be the most likely
> > > to fail, since we could fail to do the copy_from_user, but hopefully the segment
> > > checks and doing the fault_in_readable before all of this would keep those
> > > problems to a minimum.
> >
> > fault_in_readable() only faults the first page of the first iov
> > passed in. It can still fail on any other page the iov points to.
> > And even then, fault_in_readable() does not pin the page it faulted
> > in memory, so even that can get recycled before we start the copy
> > and we'd still get EFAULT.
> >
> > > In your case the only failure point is in the allocate step. If we fail on down
> > > the line after we've done some hole filling, we'll be hard pressed to go back
> > > and free up those blocks. Is that what you are talking about, having the
> > > allocate(UNRESERVE) thing being able to go back and figure out what should have
> > > been holes needs to be holes again?
> >
> > Yeah, that's kind of what I was thinking of. After a bit more
> > thought, if the allocation only covers part of the range (for
> > whatever reason), we can still allow that allocated part of the
> > write to complete successfully, and the UNRESERVE call simply returns
> > the unused part of the space reservation. This would be a partial
> > write and be much simpler to deal with than punching out the
> > allocations that succeeded.
>
> Do we really need to restore holes inside i_size in error cases?

No, we don't. That's what I was trying to say - holes that were
allocated are full of good data, so we only need to deal with the
holes that didn't get allocated. The range of the write that failed
is only stale in memory now, so we can just invalidate pages in the
range that is considered failed.

If we limit RESERVE to return ranges of a single type of block, then
this whole cleanup problem will also go away - it's back to a simple
succeed or fail-and-rollback state per execution of the loop, and
partial writes are those that have made at least one successful pass
of the loop.
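
So the outer loop keeps the usual short write convention - roughly:

        error = ->allocate(mapping, pages, off, chunklen, ALLOC);
        if (error) {
                /* roll back just this chunk */
                ->allocate(mapping, NULL, off, chunklen, UNRESERVE);
                /* invalidate the pages we copied into for this chunk */
                break;
        }
        ....
        return written ? written : error;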

> Current code does not. It just leaves the blocks allocated and
> zeroes them.

Yup, error handling is easy when you're only dealing with one page
at a time. ;)

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com