new ->perform_write fop [Kernel]


From: Nick Piggin on 24 May 2010 05:40

On Mon, May 24, 2010 at 11:20:34AM +0200, Jan Kara wrote:
> On Sat 22-05-10 10:27:59, Dave Chinner wrote:
> > On Fri, May 21, 2010 at 08:58:46PM +0200, Jan Kara wrote:
> > > On Fri 21-05-10 09:05:24, Dave Chinner wrote:
> > > > On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote:
> > > > > b) E.g. ext4 can do even without hole punching. It can allocate extent
> > > > > as 'unwritten' and when something during the write fails, it just
> > > > > leaves the extent allocated and the 'unwritten' flag makes sure that
> > > > > any read will see zeros. I suppose that other filesystems that care
> > > > > about multipage writes are able to do similar things (e.g. btrfs can
> > > > > do the same as far as I remember, I'm not sure about gfs2).
> > > >
> > > > Allocating multipage writes as unwritten extents turns off delayed
> > > > allocation and hence we'd lose all the benefits that this gives...
> > > Ah, sorry. That was a short-circuit in my brain. But when we do delayed
> > > allocation, I don't see why we should actually do any hole punching...
> > > The write needs to:
> > > a) reserve enough blocks for the write - I don't know about other
> > > filesystems but for ext4 this means just incrementing a counter.
> > > b) copy data page by page
> > > c) release part of reservation (i.e. decrement counter) if we actually
> > > copied less than we originally thought.
> > >
> > > Am I missing something?
> >
> > Possibly. Delayed allocation is made up of two parts - space
> > reservation and recording the regions of delayed allocation in an
> > extent tree, page/bufferhead state or both.
> Yes. Ext4 records the info about delayed allocation only in buffer
> heads.
>
> > In XFS, these two steps happen in the same get_blocks call, but the
> > result of that is we have to truncate/punch delayed allocate extents
> > out just like normal extents if we are not going to use them. Hence
> > a reserve/allocate interface allows us to split the operation -
> > reserve ensures we have space for the delayed allocation, allocate
> > inserts the delayed extents into the inode extent tree for later
> > real allocation during writeback. Hence the unreserve call can
> > simply be accounting - it has no requirement to punch out delayed
> > extents that may have already been allocated, just do work on
> > counters.
> >
> > btrfs already has this split design - it reserves space, does the
> > copy, then marks the extent ranges as delalloc once the copy has
> > succeeded, otherwise it simply unreserves the unused space.
> >
> > Once again, I don't know if ext4 does this internal delayed
> > allocation extent tracking or whether it just uses page state to
> > track those extents, but it would probably still have to use the
> > allocate call to mark all the pages/bufferheads as delalloc so
> > that unreserve didn't have to do any extra work.
> Yes, exactly. I just wanted to point out that AFAICS ext4 can implement
> proper error recovery without a need for 'punch' operation. So after all
> Nick's copy page-by-page should be plausible at least for ext4.

Great. AFAIKS, any filesystem that does not leak uninitialized data
on IO error or crash when allocating writeback cache over holes
should already have enough information to recover properly from
short-copy type of error today.

Otherwise, an IO error or crash seems like quite a similar problem
from the point of view of the filesystem. Now perhaps it can be
recovered only in a fsck type operation which is far too expensive
to do in a normal error path, which sounds like XFS.

So possibly we could have 2 APIs, one for filesystems like XFS, but
I don't think we should penalise ones like ext4 which can handle
this situation.
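
The reserve / copy / unreserve sequence quoted above can be sketched in
userspace C roughly as follows. This is only an illustration of the
counter-only error path, not code from any filesystem: struct fs_counters,
reserve_blocks(), unreserve_blocks(), multiblock_write() and the
max_ok_blocks fault-injection parameter are all hypothetical names, and the
write is assumed to start at a block-aligned offset 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BLOCK_SIZE 4096u

/* Hypothetical accounting state; a real filesystem keeps far more. */
struct fs_counters {
	size_t free_blocks;     /* blocks available for new reservations */
	size_t reserved_blocks; /* blocks promised to delayed allocation */
};

/* a) Reserve: pure counter work, no extent tree or page state is touched. */
static int reserve_blocks(struct fs_counters *fs, size_t nblocks)
{
	if (nblocks > fs->free_blocks)
		return -1; /* the kernel would return -ENOSPC here */
	fs->free_blocks -= nblocks;
	fs->reserved_blocks += nblocks;
	return 0;
}

/* c) Unreserve: hand back what the copy did not use, again only counters. */
static void unreserve_blocks(struct fs_counters *fs, size_t nblocks)
{
	assert(nblocks <= fs->reserved_blocks);
	fs->reserved_blocks -= nblocks;
	fs->free_blocks += nblocks;
}

/*
 * b) Copy block by block, starting at block-aligned offset 0.
 * max_ok_blocks simulates a short copy: the source "faults" once that
 * many blocks have been copied. Returns bytes copied, or -1 if the
 * reservation failed.
 */
static long multiblock_write(struct fs_counters *fs, char *dst,
			     const char *src, size_t len,
			     size_t max_ok_blocks)
{
	size_t want = (len + SKETCH_BLOCK_SIZE - 1) / SKETCH_BLOCK_SIZE;
	size_t done = 0;   /* blocks copied so far */
	size_t copied = 0; /* bytes copied so far */

	if (reserve_blocks(fs, want) != 0)
		return -1;

	while (copied < len && done < max_ok_blocks) {
		size_t chunk = len - copied;

		if (chunk > SKETCH_BLOCK_SIZE)
			chunk = SKETCH_BLOCK_SIZE;
		memcpy(dst + copied, src + copied, chunk);
		copied += chunk;
		done++;
	}

	/* Short copy: release the unused tail of the reservation. */
	unreserve_blocks(fs, want - done);
	return (long)copied;
}
```

In this sketch a three-block write that faults after the first block keeps
one block reserved and returns the other two to the free pool; nothing has
to be punched out, because no delayed extents were recorded before the copy
succeeded.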

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: tytso on 5 Jun 2010 11:10

On Mon, May 24, 2010 at 11:20:34AM +0200, Jan Kara wrote:
> Yes, exactly. I just wanted to point out that AFAICS ext4 can implement
> proper error recovery without a need for 'punch' operation. So after all
> Nick's copy page-by-page should be plausible at least for ext4.

Sorry for my late response to this thread; I've been busy catching up
on a number of other fronts, so I didn't have a chance to go through
this thread until now.

First of all, I'm not against implementing a 'punch' operation for
ext4; I've actually toyed with this idea before.

Secondly, I'm not sure it's really necessary; we already have a code
path (which I was planning on making the default when I have a
chance to rewrite ext4_writepages) where the blocks are initially
allocated with the 'uninitialized' flag in the extent tree; this is
the same flag used for fallocate(2) support when we allocate blocks
without filling in the data blocks. Then, when the block I/O
completes, we use the block I/O callback to clear the uninit flag in
the extent tree. This is currently used to safely avoid locking
in the read path, which is needed to speed up access for extremely
fast (think Fusion I/O-like) flash devices.

I was already thinking about using this trick in my planned
ext4_writepages() rewrite, and if it turns out we have common code
that also assumes that file systems can do the equivalent fallocate(2)
and can clear the uninitialized bit on a callback, I think that makes
ext4 fairly similar to what XFS does, at least at the high level,
doesn't it?

Note that strictly speaking this isn't a 'punch' operation in this
case; it's rather an fallocate(2) and don't convert the extent to mark
the data blocks as valid on error, which is not quite the same as a
'punch' operation.
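
To make the idea concrete, here is a minimal userspace sketch of the
"allocate as uninitialized, convert on I/O completion" pattern described
above. It is an assumption-laden toy model, not ext4 code: struct extent,
extent_alloc_uninit(), extent_write(), extent_end_io() and extent_read()
are hypothetical names, and the data array merely stands in for on-disk
blocks that may hold stale contents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_EXTENT_BYTES 16u

/* Hypothetical extent: one flag plus a stand-in for its data blocks. */
struct extent {
	bool uninit;                    /* set at allocation time */
	char data[SKETCH_EXTENT_BYTES]; /* may contain stale bytes */
};

/* Allocate without writing data, as fallocate(2) would. */
static void extent_alloc_uninit(struct extent *ex)
{
	ex->uninit = true;
	memset(ex->data, 0xAA, sizeof ex->data); /* simulate stale contents */
}

/* I/O completion callback: only a successful write converts the extent. */
static void extent_end_io(struct extent *ex, int io_error)
{
	if (io_error)
		return; /* flag stays set: no punch, no extra metadata update */
	ex->uninit = false;
}

/*
 * Simulated device write: 'written' bytes reach the blocks (possibly a
 * partial write when io_error is nonzero), then completion runs.
 */
static void extent_write(struct extent *ex, const char *buf,
			 size_t written, int io_error)
{
	if (written > sizeof ex->data)
		written = sizeof ex->data;
	memcpy(ex->data, buf, written);
	extent_end_io(ex, io_error);
}

/* Reads of an uninitialized extent see zeros, never the block contents. */
static void extent_read(const struct extent *ex, char *out)
{
	if (ex->uninit)
		memset(out, 0, SKETCH_EXTENT_BYTES);
	else
		memcpy(out, ex->data, SKETCH_EXTENT_BYTES);
}
```

In this model a failed or partial write leaves the flag set, so a later
read still returns zeros; the error path itself performs no further
metadata changes, which is the property that distinguishes it from a punch.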

Am I missing something?

- Ted

From: Nick Piggin on 6 Jun 2010 04:10

On Sat, Jun 05, 2010 at 11:05:23AM -0400, tytso(a)mit.edu wrote:
> On Mon, May 24, 2010 at 11:20:34AM +0200, Jan Kara wrote:
> > Yes, exactly. I just wanted to point out that AFAICS ext4 can implement
> > proper error recovery without a need for 'punch' operation. So after all
> > Nick's copy page-by-page should be plausible at least for ext4.
>
> Sorry for my late response to this thread; I've been busy catching up
> on a number of other fronts, so I didn't have a chance to go through
> this thread until now.
>
> First of all, I'm not against implementing a 'punch' operation for
> ext4; I've actually toyed with this idea before.
>
> Secondly, I'm not sure it's really necessary; we already have a code
> path (which I was planning on making the default when I have a
> chance to rewrite ext4_writepages) where the blocks are initially
> allocated with the 'uninitialized' flag in the extent tree; this is
> the same flag used for fallocate(2) support when we allocate blocks
> without filling in the data blocks. Then, when the block I/O
> completes, we use the block I/O callback to clear the uninit flag in
> the extent tree. This is currently used to safely avoid locking
> in the read path, which is needed to speed up access for extremely
> fast (think Fusion I/O-like) flash devices.
>
> I was already thinking about using this trick in my planned
> ext4_writepages() rewrite, and if it turns out we have common code
> that also assumes that file systems can do the equivalent fallocate(2)
> and can clear the uninitialized bit on a callback, I think that makes
> ext4 fairly similar to what XFS does, at least at the high level,
> doesn't it?
>
> Note that strictly speaking this isn't a 'punch' operation in this
> case; it's rather an fallocate(2) and don't convert the extent to mark
> the data blocks as valid on error, which is not quite the same as a
> 'punch' operation.
>
> Am I missing something?

No, this is fine. It's actually better than a punch operation from an
error recovery point of view because it wouldn't require further
modifications to the filesystem in the error case.

AFAIKS this 'uninitialised blocks' approach seems to be the most
optimal way to do block allocations that are not tightly coupled
with the pagecache.

Do you mean ext4's file_write path?
