From: tytso
On Tue, May 04, 2010 at 11:45:53AM -0400, Christoph Hellwig wrote:
> On Tue, May 04, 2010 at 10:16:37AM -0400, Ric Wheeler wrote:
> > Checking per inode is actually incorrect - we do not want to short cut
> > the need to flush the target storage device's write cache just because a
> > specific file has no dirty pages. If a power hit occurs, having sent
> the pages to the storage device is not sufficient.
>
> As long as we're only using the information for fsync doing it per inode
> is the correct thing. We only want to flush the cache if the inode
> (data or metadata) is dirty in some way. Note that this includes writes
> via O_DIRECT which are quite difficult to track - I've not found the
> original patch in my mbox so I can't comment if this is done right.

I agree.

I wonder if it's worthwhile to think about a new system call which
allows users to provide an array of fd's which should collectively
be fsync'ed out at the same time. Otherwise, we end up issuing
multiple barrier operations in cases where the application needs to
do:

fsync(control_fd);
fsync(data_fd);
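
The pattern above can be sketched as a small C program. The file names and the
checkpoint() helper are hypothetical, invented only to illustrate the
double-fsync cost being discussed:

```c
#include <fcntl.h>
#include <unistd.h>

/* Sketch of the double-fsync pattern: with today's interface, each
 * fsync() may trigger its own cache-flush (barrier) on the device,
 * even when one flush at the end would have covered both files. */
int checkpoint(const char *control_path, const char *data_path,
               const char *record, size_t len)
{
    int data_fd = open(data_path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    int control_fd = open(control_path, O_WRONLY | O_CREAT, 0644);
    if (data_fd < 0 || control_fd < 0)
        return -1;

    if (write(data_fd, record, len) != (ssize_t)len)
        return -1;

    /* Two separate fsync() calls, hence potentially two barriers. */
    if (fsync(control_fd) != 0)
        return -1;
    if (fsync(data_fd) != 0)
        return -1;

    close(control_fd);
    close(data_fd);
    return 0;
}
```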

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Ric Wheeler
On 06/30/2010 08:48 AM, tytso(a)mit.edu wrote:
> On Tue, May 04, 2010 at 11:45:53AM -0400, Christoph Hellwig wrote:
>> On Tue, May 04, 2010 at 10:16:37AM -0400, Ric Wheeler wrote:
>>> Checking per inode is actually incorrect - we do not want to short cut
>>> the need to flush the target storage device's write cache just because a
>>> specific file has no dirty pages. If a power hit occurs, having sent
>>> the pages to the storage device is not sufficient.
>>
>> As long as we're only using the information for fsync doing it per inode
>> is the correct thing. We only want to flush the cache if the inode
>> (data or metadata) is dirty in some way. Note that this includes writes
>> via O_DIRECT which are quite difficult to track - I've not found the
>> original patch in my mbox so I can't comment if this is done right.
>
> I agree.
>
> I wonder if it's worthwhile to think about a new system call which
> allows users to provide an array of fd's which should collectively
> be fsync'ed out at the same time. Otherwise, we end up issuing
> multiple barrier operations in cases where the application needs to
> do:
>
> fsync(control_fd);
> fsync(data_fd);
>
> - Ted

The problem with not issuing a cache flush when you have no dirty metadata or
data is that the inode's dirty state has no tie to the state of the volatile
write cache of the target storage device.

We do need to have fsync() issue the cache flush command even when there is no
dirty state for the inode in our local page cache, in order to flush data that
was previously written back (cleaned) but never followed by a flush.

It would definitely be *very* useful to have an array of fd's that all need to
be fsync()'ed at the same time....

Ric

From: tytso
On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
>
> The problem with not issuing a cache flush when you have no dirty
> metadata or data is that the inode's dirty state has no tie to the
> state of the volatile write cache of the target storage device.

We already track whether there are any metadata updates associated with
the inode; if there are, we force a journal commit, and this implies a
barrier operation.

The case we're talking about here is one where either (a) there is no
journal, or (b) there have been no metadata updates (I'm simplifying a
little here; in fact we track whether there have been fdatasync()- vs
fsync()- worthy metadata updates), and so there hasn't been a journal
commit to do the cache flush.

In this case, we want to track when the last fsync() was issued, versus
when data blocks for a particular inode were last pushed out to disk.

To use an example I used as motivation for why we might want an
fsync2(int fd[], int flags[], int num) syscall, consider the situation
of:

fsync(control_fd);
fdatasync(data_fd);

The first fsync() will have executed a cache flush operation. So when
we do the fdatasync() (assuming that no metadata needs to be flushed
out to disk), there is no need for the cache flush operation.

If we had an enhanced fsync command, we would also be able to
eliminate a second journal commit in the case where data_fd also had
some metadata that needed to be flushed out to disk.
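
Since fsync2() is only a proposal here, the closest user-space stand-in is a
loop over the descriptors; the function name below is hypothetical. Note the
loop still pays one cache flush per fsync() - the whole point of the batched
syscall would be to write back every inode first and issue a single barrier:

```c
#include <fcntl.h>   /* only needed by the usage example */
#include <unistd.h>

/* Hypothetical user-space stand-in for the proposed
 * fsync2(int fd[], int flags[], int num). Unlike a real syscall,
 * it cannot merge the cache flushes: each fsync() below may issue
 * its own barrier to the device. */
int fsync_many(const int *fds, int num)
{
    for (int i = 0; i < num; i++)
        if (fsync(fds[i]) != 0)
            return -1;  /* caller can inspect errno */
    return 0;
}
```

Usage would mirror the control_fd/data_fd example: put both descriptors in the
array and make one call instead of two.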

> It would definitely be *very* useful to have an array of fd's that
> all need fsync()'ed at home time....

Yes, but it would require applications to change their code.

One thing that I would like about a new fsync2() system call is with a
flags field, we could add some new, more expressive flags:

#define FSYNC_DATA      0x0001  /* Only flush metadata if needed to access data */
#define FSYNC_NOWAIT    0x0002  /* Initiate the flush operations but don't wait
                                   for them to complete */
#define FSYNC_NOBARRIER 0x0004  /* FS may skip the barrier if not needed for fs
                                   consistency */

etc.
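
None of these flags exist in any shipped kernel. Purely as an illustration of
the intent, a wrapper (name hypothetical) could map FSYNC_DATA onto today's
fdatasync() and fall back to a plain fsync() otherwise; FSYNC_NOWAIT and
FSYNC_NOBARRIER have no user-space equivalent and are ignored here:

```c
#include <fcntl.h>   /* only needed by the usage example */
#include <unistd.h>

#define FSYNC_DATA      0x0001  /* only flush metadata needed to access data */
#define FSYNC_NOWAIT    0x0002  /* start writeback, don't wait (not emulable) */
#define FSYNC_NOBARRIER 0x0004  /* skip barrier if consistency allows (ditto) */

/* Illustrative mapping of the proposed flags onto existing syscalls. */
int fsync_flags(int fd, int flags)
{
    if (flags & FSYNC_DATA)
        return fdatasync(fd);   /* skips mtime-style metadata updates */
    return fsync(fd);
}
```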

- Ted
From: Ric Wheeler
On 06/30/2010 09:44 AM, tytso(a)mit.edu wrote:
> On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
>>
>> The problem with not issuing a cache flush when you have no dirty
>> metadata or data is that the inode's dirty state has no tie to the
>> state of the volatile write cache of the target storage device.
>
> We already track whether there are any metadata updates associated with
> the inode; if there are, we force a journal commit, and this implies a
> barrier operation.
>
> The case we're talking about here is one where either (a) there is no
> journal, or (b) there have been no metadata updates (I'm simplifying a
> little here; in fact we track whether there have been fdatasync()- vs
> fsync()- worthy metadata updates), and so there hasn't been a journal
> commit to do the cache flush.
>
> In this case, we want to track when the last fsync() was issued, versus
> when data blocks for a particular inode were last pushed out to disk.

I think that the state that we want to track is the last time the write cache on
the target device has been flushed. If the last fsync() did do a full barrier,
that would be equivalent :-)

ric

>
> To use an example I used as motivation for why we might want an
> fsync2(int fd[], int flags[], int num) syscall, consider the situation
> of:
>
> fsync(control_fd);
> fdatasync(data_fd);
>
> The first fsync() will have executed a cache flush operation. So when
> we do the fdatasync() (assuming that no metadata needs to be flushed
> out to disk), there is no need for the cache flush operation.
>
> If we had an enhanced fsync command, we would also be able to
> eliminate a second journal commit in the case where data_fd also had
> some metadata that needed to be flushed out to disk.
>
>> It would definitely be *very* useful to have an array of fd's that
>> all need to be fsync()'ed at the same time....
>
> Yes, but it would require applications to change their code.
>
> One thing that I would like about a new fsync2() system call is with a
> flags field, we could add some new, more expressive flags:
>
> #define FSYNC_DATA      0x0001  /* Only flush metadata if needed to access data */
> #define FSYNC_NOWAIT    0x0002  /* Initiate the flush operations but don't wait
>                                    for them to complete */
> #define FSYNC_NOBARRIER 0x0004  /* FS may skip the barrier if not needed for fs
>                                    consistency */
>
> etc.
>
> - Ted
