From: Christoph Hellwig on
Nick, what exactly is the problem with the reserve + allocate design?

In a delalloc filesystem (which is to say all those that care about
high-performance large writes) the write path fundamentally consists of
those two operations. Getting rid of the get_blocks mess and replacing
it with a dedicated operations vector will simplify things a lot.

Punching holes is a rather problematic operation and, as mentioned, one
not actually implemented by most filesystems. Just decrementing counters
on error massively increases the chances that our error handling will
actually work.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Josef Bacik on
On Sat, May 22, 2010 at 12:23:54AM +1000, Nick Piggin wrote:
> On Fri, May 21, 2010 at 09:50:54AM -0400, Josef Bacik wrote:
> > On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote:
> > > Allocating multipage writes as unwritten extents turns off delayed
> > > allocation and hence we'd lose all the benefits that this gives...
> > >
> >
> > I just realized we have another problem: the mmap_sem/page_lock deadlock.
> > Currently BTRFS is susceptible to this, since we don't prefault any of the
> > pages in yet.  If we're going to do multi-page writes we're going to need
> > a way to fault in all of the iovecs at once, so that when we do the
> > pagefault_disable() copy pagefault_enable() sequence we don't just end up
> > copying the first iovec.  Nick, have you done something like this already?
> > If not, I assume I can just loop through all the iovecs and call
> > fault_in_pages_readable on all of them and be good to go, right?  Thanks,
>
> Yes, well it's a different issue. With multi-page writes, even a single
> iovec may not be faulted in properly. And with multiple iovecs, we are
> already suboptimal with faulting.
>
> Faulting in multiple iovecs may already be a good idea. I didn't add
> that code because I had hoped for a test case first, but perhaps we
> can just go ahead and add it.
>
> With multipage writes, we would want to fault in multiple source pages
> at once if the design was to lock multiple pages at once and do the
> copy. I still think we might be able to just lock and copy one page at
> a time, but I could be wrong.
>

I was thinking about this possibility; it seems this is what FUSE does
already.  It would probably make the fault problem easier to deal with:
do all the fs work to prepare for the total amount, copy the data one
page at a time, and then clean up properly if something goes wrong.
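As a rough user-space sketch of that scheme (all names here are
illustrative; copy_page stands in for the copy_from_user step and, like
copy_from_user, returns the number of bytes it failed to copy):

```c
#include <stddef.h>
#include <string.h>

#define PG_SIZE 4096            /* stand-in for PAGE_SIZE */

static long reserved;           /* toy reservation counter */

/* Reserve for the total up front, copy one page at a time, and give
 * back whatever part of the reservation a short copy left unused. */
static size_t multipage_write(char *dst, const char *src, size_t count,
                              size_t (*copy_page)(char *, const char *, size_t))
{
	size_t copied = 0;

	reserved += count;                  /* prepare for the total */
	while (copied < count) {
		size_t chunk = count - copied;
		size_t left;

		if (chunk > PG_SIZE)
			chunk = PG_SIZE;
		left = copy_page(dst + copied, src + copied, chunk);
		copied += chunk - left;
		if (left)                   /* short copy: stop cleanly */
			break;
	}
	reserved -= count - copied;         /* release the unused part */
	return copied;
}

/* Example copier that "faults" on the second page. */
static size_t flaky_copy(char *dst, const char *src, size_t n)
{
	static int calls;

	if (++calls == 2)
		return n;                   /* nothing copied this time */
	memcpy(dst, src, n);
	return 0;
}
```

The point is the one made above: because everything was prepared for the
total amount first, the failure path is a counter adjustment rather than
an undo of filesystem state.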

> Oh wow, btrfs is deadlocky there.  Firstly, fault_in_pages_readable does
> not guarantee success (a race can always unmap the page in the meantime).
> Secondly, calling it inside the page lock section just means it will
> cause the deadlock rather than the copy_from_user.
>
> Quick workaround to reduce probability is to do fault_in_pages_readable
> calls before locking the pages.
>
> But you really need to handle the short-copy case. From the error
> handling there, it seems like you can just free_reserved_data_space and
> retry the copy?
>

Well, no: if there's a short copy we just exit.  If we did the
fault_in_pages_readable before prepare_pages we could deal with a
short copy then.  Thanks,

Josef
From: Jan Kara on
On Fri 21-05-10 09:05:24, Dave Chinner wrote:
> On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote:
> > Hmm, I was thinking about it and I see two options how to get out
> > of problems:
> > a) Filesystems which are not able to handle hole punching will allow
> > multipage writes only after EOF (which can be easily undone by
> > truncate in case of failure). That should actually cover lots of
> > cases we are interested in (I don't expect multipage writes to holes
> > to be a common case).
>
> multipage writes to holes is a relatively common operation in the
> HPC space that XFS is designed for (e.g. calculations on huge sparse
> matrices), so I'm not really fond of this idea....
Well, XFS could still handle them because it is able to do hole punching,
but I get your point.

> > b) E.g. ext4 can do even without hole punching. It can allocate extent
> > as 'unwritten' and when something during the write fails, it just
> > leaves the extent allocated and the 'unwritten' flag makes sure that
> > any read will see zeros. I suppose that other filesystems that care
> > about multipage writes are able to do similar things (e.g. btrfs can
> > do the same as far as I remember, I'm not sure about gfs2).
>
> Allocating multipage writes as unwritten extents turns off delayed
> allocation and hence we'd lose all the benefits that this gives...
Ah, sorry, that was a short-circuit in my brain.  But when we do delayed
allocation I don't see why we should actually do any hole punching... The
write needs to:
a) reserve enough blocks for the write - I don't know about other
filesystems but for ext4 this means just incrementing a counter.
b) copy data page by page
c) release part of reservation (i.e. decrement counter) if we actually
copied less than we originally thought.
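Those three steps can be sketched in a few lines of toy code (the
counter and helper names are made up; ext4's real free-space counter is
of course more involved):

```c
#include <stddef.h>

#define BLK_SIZE 4096

static long fs_free_blocks = 1024;      /* toy superblock counter */

/* a) reserve: just move the counter */
static int reserve_blocks(long nblocks)
{
	if (fs_free_blocks < nblocks)
		return -1;              /* would be -ENOSPC */
	fs_free_blocks -= nblocks;
	return 0;
}

/* c) release the part of the reservation we did not use */
static void release_blocks(long nblocks)
{
	fs_free_blocks += nblocks;
}

/* blocks needed to cover 'count' bytes starting at byte offset 'pos' */
static long blocks_for(long pos, long count)
{
	long first = pos / BLK_SIZE;
	long last = (pos + count - 1) / BLK_SIZE;

	return last - first + 1;
}
```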

Am I missing something?

Honza
--
Jan Kara <jack(a)suse.cz>
SUSE Labs, CR
From: Dave Chinner on
On Fri, May 21, 2010 at 08:58:46PM +0200, Jan Kara wrote:
> On Fri 21-05-10 09:05:24, Dave Chinner wrote:
> > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote:
> > > b) E.g. ext4 can do even without hole punching. It can allocate extent
> > > as 'unwritten' and when something during the write fails, it just
> > > leaves the extent allocated and the 'unwritten' flag makes sure that
> > > any read will see zeros. I suppose that other filesystems that care
> > > about multipage writes are able to do similar things (e.g. btrfs can
> > > do the same as far as I remember, I'm not sure about gfs2).
> >
> > Allocating multipage writes as unwritten extents turns off delayed
> > allocation and hence we'd lose all the benefits that this gives...
> Ah, sorry, that was a short-circuit in my brain.  But when we do delayed
> allocation I don't see why we should actually do any hole punching... The
> write needs to:
> a) reserve enough blocks for the write - I don't know about other
> filesystems but for ext4 this means just incrementing a counter.
> b) copy data page by page
> c) release part of reservation (i.e. decrement counter) if we actually
> copied less than we originally thought.
>
> Am I missing something?

Possibly.  Delayed allocation is made up of two parts: space
reservation, and recording the regions of delayed allocation in an
extent tree, in page/bufferhead state, or both.

In XFS, these two steps happen in the same get_blocks call, but the
result of that is we have to truncate/punch delayed allocate extents
out just like normal extents if we are not going to use them.  A
reserve/allocate interface allows us to split the operation:
reserve ensures we have space for the delayed allocation, while
allocate inserts the delayed extents into the inode extent tree for
later real allocation during writeback.  Hence the unreserve call can
be simple accounting -- it has no need to punch out delayed extents
that may already have been allocated; it just updates counters.
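A toy model of that split, with purely illustrative names ("reserved"
being space set aside but not yet tied to the file, "delalloc" being
space recorded as delayed extents awaiting writeback):

```c
struct toy_inode {
	long reserved;  /* space set aside, not yet in the extent tree */
	long delalloc;  /* delayed extents awaiting real allocation */
};

/* reserve: make sure the space exists -- pure accounting */
static void toy_reserve(struct toy_inode *ip, long blocks)
{
	ip->reserved += blocks;
}

/* allocate: move the successfully copied range into delayed extent
 * tracking; physical allocation happens later, during writeback */
static void toy_allocate(struct toy_inode *ip, long blocks)
{
	ip->reserved -= blocks;
	ip->delalloc += blocks;
}

/* unreserve: the error/short-copy path -- only counter work, since
 * allocate was never called on the range being given back */
static void toy_unreserve(struct toy_inode *ip, long blocks)
{
	ip->reserved -= blocks;
}
```

Nothing in the failure path touches the extent tree, which is exactly
the property being argued for here.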

btrfs already has this split design - it reserves space, does the
copy, then marks the extent ranges as delalloc once the copy has
succeeded, otherwise it simply unreserves the unused space.

Once again, I don't know if ext4 does this internal delayed
allocation extent tracking or whether it just uses page state to
track those extents, but it would probably still have to use the
allocate call to mark all the pages/bufferheads as delalloc so
that unreserve didn't have to do any extra work.

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
From: Dave Chinner on
On Fri, May 21, 2010 at 09:50:54AM -0400, Josef Bacik wrote:
> On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote:
> > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote:
> > > On Thu 20-05-10 09:50:54, Dave Chinner wrote:
> > > b) E.g. ext4 can do even without hole punching. It can allocate extent
> > > as 'unwritten' and when something during the write fails, it just
> > > leaves the extent allocated and the 'unwritten' flag makes sure that
> > > any read will see zeros. I suppose that other filesystems that care
> > > about multipage writes are able to do similar things (e.g. btrfs can
> > > do the same as far as I remember, I'm not sure about gfs2).
> >
> > Allocating multipage writes as unwritten extents turns off delayed
> > allocation and hence we'd lose all the benefits that this gives...
> >
>
> I just realized we have another problem: the mmap_sem/page_lock deadlock.
> Currently BTRFS is susceptible to this, since we don't prefault any of the
> pages in yet.  If we're going to do multi-page writes we're going to need
> a way to fault in all of the iovecs at once, so that when we do the
> pagefault_disable() copy pagefault_enable() sequence we don't just end up
> copying the first iovec.  Nick, have you done something like this already?

I have patches that already do this, but the big issue is that it is
inherently racy.  The prefaulting does not guarantee that, by the time
we disable page faults, the prefaulted page has not already been
reclaimed.  Basically we have to design for EFAULT occurring, because
even pre-faulting does not prevent it.

> If not, I assume
> I can just loop through all the iovecs and call fault_in_pages_readable
> on all of them and be good to go, right?  Thanks,

That's effectively what I've done, but it's still no guarantee.
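For illustration, a user-space model of that loop (PG_SZ stands in for
PAGE_SIZE; touching one byte per page is roughly what a loop over
fault_in_pages_readable amounts to, and, as said, nothing stops the
pages from being reclaimed again afterwards):

```c
#include <stddef.h>
#include <sys/uio.h>

#define PG_SZ 4096              /* stand-in for PAGE_SIZE */

/* Touch one byte in every page of every iovec so the pages are
 * (probably) resident before we lock pages and disable page faults.
 * Returns the number of bytes covered.  No guarantee: reclaim can
 * still unmap any of these pages before the copy runs. */
static size_t prefault_iovecs(const struct iovec *iov, int nr)
{
	size_t total = 0;
	int i;

	for (i = 0; i < nr; i++) {
		const volatile char *p = iov[i].iov_base;
		size_t len = iov[i].iov_len;
		size_t off;

		for (off = 0; off < len; off += PG_SZ)
			(void)p[off];           /* touch each page */
		if (len)
			(void)p[len - 1];       /* last byte, last page */
		total += len;
	}
	return total;
}
```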

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com