From: Peter Olcott on

"Golden California Girls" <gldncagrls(a)aol.com.mil> wrote in
message news:hpokas$tdk$1(a)speranza.aioe.org...
> Peter Olcott wrote:
>> "David Schwartz" <davids(a)webmaster.com> wrote in message
>> news:683c22d6-b73f-44cf-ae8d-8df79039bc4b(a)w42g2000yqm.googlegroups.com...
>> On Apr 8, 4:56 pm, "Peter Olcott" <NoS...(a)OCR4Screen.com>
>> wrote:
>>
>>> Oh yeah to solve the problem of disk drive onboard cache, simply
>>> turn off write caching.
>>
>> --If this really is a hard requirement, you will have to do one of
>> --two things:
>>
>> --1) Strictly control the hardware and software that is to be used.
>> --Unfortunately, in-between your call to 'fdatasync' and the physical
>> --platters, there are lots of places where the code can be lied to
>> --and told that the sync is completed when it's really not.
>>
>> --2) Use an architecture that inherently provides this guarantee by
>> --design. For example, if you commit a transaction to a separate
>> --storage system before you move on, you are guaranteed that the
>> --transaction will not be lost unless both this system and that
>> --separate system fail concurrently.
>>
>> It looks like one piece that is often missing (according to one
>> respondent) is that fsync() is often broken. From what I understand
>> this makes the whole system much less reliable as this relates to
>> committed transactions. The only way that I could think of to
>> account for this is to provide some sort of transaction-by-transaction
>> on-the-fly offsite backup.
>>
>> One simple way to do this (I don't know how reliable it would be)
>> would be to simply email the transactions to myself. Another way
>> would be to provide some sort of HTTP based web service that can
>> accept and archive transactions from another HTTP web service. The
>> main transactions that I want to never lose are those where a
>> customer adds money to their user account. All other transactions
>> are less crucial.
>>
>> --I think it's fair to say that you will never get this to 100%, so
>> --if you need the overall system reliability to be high, one factor
>> --will have to be high hardware reliability. Even if you can only
>> --get this to 99%, if a power loss only occurs once a year, the
>> --system will, on average, only fail once per hundred years. You can
>> --achieve this with redundant power supplies plugged into separate
>> --UPSes. RAID 6 with hot spares helps too. (Tip: Make sure your
>> --controller is set to periodically *test* your hot spares!)
>
> I think there may be another issue as well: despite everything
> working for the file and its data, when does the kernel issue its
> write request to update the directory? Even if every buffer for the
> file is written, if the directory isn't updated because the kernel
> hasn't asked for it then you are still hosed. Of course there is a
> work around for this: don't use the file system.
>

Yes, this is very simple: apply fsync() to the directory too.
The big problem with this is that I have heard that fsync()
is often broken.
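
Something like the following is what I have in mind for the directory
part. This is just a rough sketch in C; the function and path names are
placeholders and error handling is abbreviated:

    #include <fcntl.h>
    #include <unistd.h>

    /* Append a record, then flush both the file's data and the
       directory entry that names it. */
    int write_and_sync(const char *path, const char *dirpath,
                       const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) < 0) { close(fd); return -1; }   /* file data */
        close(fd);

        int dfd = open(dirpath, O_RDONLY);             /* the directory */
        if (dfd < 0) return -1;
        if (fsync(dfd) < 0) { close(dfd); return -1; } /* directory entry */
        close(dfd);
        return 0;
    }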


From: Golden California Girls on
Peter Olcott wrote:
> "Golden California Girls" <gldncagrls(a)aol.com.mil> wrote in
> message news:hpokas$tdk$1(a)speranza.aioe.org...
>> I think there may be another issue as well: despite everything
>> working for the file and its data, when does the kernel issue its
>> write request to update the directory? Even if every buffer for the
>> file is written, if the directory isn't updated because the kernel
>> hasn't asked for it then you are still hosed. Of course there is a
>> work around for this: don't use the file system.
>>
>
> Yes, this is very simple: apply fsync() to the directory too.
> The big problem with this is that I have heard that fsync()
> is often broken.


from my man page

Note that while fsync() will flush all data from the host to the drive
(i.e. the "permanent storage device"), the drive itself may not
physically write the data to the platters for quite some time and it
may be written in an out-of-order sequence.

Specifically, if the drive loses power or the OS crashes, the application
may find that only some or none of their data was written. The disk drive may
also re-order the data so that later writes may be present while earlier writes
are not.

This is not a theoretical edge case. This scenario is easily reproduced
with real world workloads and drive power failures.



from man fcntl on my system
F_FULLFSYNC    Does the same thing as fsync(2) then asks the drive to
               flush all buffered data to the permanent storage device
               (arg is ignored). This is currently only implemented on
               HFS filesystems and the operation may take quite a while
               to complete. Certain FireWire drives have also been
               known to ignore this request.
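
For what it's worth, the call itself is just an fcntl(2) on the open
descriptor. A minimal sketch, only meaningful on systems that actually
define F_FULLFSYNC (per the man page text above):

    #include <fcntl.h>
    #include <unistd.h>

    /* Ask the drive itself to flush its on-board cache; fall back to
       a plain fsync() where F_FULLFSYNC does not exist. */
    int full_sync(int fd)
    {
    #ifdef F_FULLFSYNC
        if (fcntl(fd, F_FULLFSYNC) != -1)
            return 0;
    #endif
        return fsync(fd);    /* best effort everywhere else */
    }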



To get what you want, you are going to be reading a lot of tech info
from disk drive manufacturers to be sure the drives on your system
will in fact write data to the disk when requested to do so. You also
are going to have to find out if the device drivers for your system
actually send the request on to the drives. Otherwise fsync = nop.
Best to consider fsync to be universally broken.


From: Peter Olcott on

"Golden California Girls" <gldncagrls(a)aol.com.mil> wrote in
message news:hpq4vi$4o4$1(a)speranza.aioe.org...
> Peter Olcott wrote:

>> Yes, this is very simple: apply fsync() to the directory too.
>> The big problem with this is that I have heard that fsync() is
>> often broken.
>
>
> from my man page
>
> Note that while fsync() will flush all data from the host to the
> drive (i.e. the "permanent storage device"), the drive itself may not
> physically write the data to the platters for quite some time and it
> may be written in an out-of-order sequence.
>
> Specifically, if the drive loses power or the OS crashes, the
> application may find that only some or none of their data was
> written. The disk drive may also re-order the data so that later
> writes may be present while earlier writes are not.
>
> This is not a theoretical edge case. This scenario is easily
> reproduced with real world workloads and drive power failures.
>
>
> from man fcntl on my system
> F_FULLFSYNC    Does the same thing as fsync(2) then asks the drive to
>                flush all buffered data to the permanent storage
>                device (arg is ignored). This is currently only
>                implemented on HFS filesystems and the operation may
>                take quite a while to complete. Certain FireWire
>                drives have also been known to ignore this request.
>
>
> To get what you want, you are going to be reading a lot of tech info
> from disk drive manufacturers to be sure the drives on your system
> will in fact write data to the disk when requested to do so. You also
> are going to have to find out if the device drivers for your system
> actually send the request on to the drives. Otherwise fsync = nop.
> Best to consider fsync to be universally broken.
>
>

Yes, that is why you either have to turn the drive's write caching off
or use a file system that is smart enough to do this on the fly, such
as ZEST. This still does not solve the problem that fsync() itself is
often broken. It must also be verified that fsync() works correctly.

I am not sure of the best way to do this, possibly a lot of
tests where the process is killed in the middle of a
transaction from a very high load of transactions. The
theory is that you only lose the last transaction.
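
As a rough sketch of what the writer half of such a test could look
like (the file name and record format are made up for illustration,
and note that killing the process mainly exercises the application
layer; an actual power cut is what exercises the drive cache described
in the man page): append numbered records, fsync() after every one,
kill the writer at a random point, then check that the surviving
records run 1..N with no gaps, i.e. that at most the final record was
lost.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Append "txn <n>" records forever, fsync()ing after every record.
       A checker run after the kill (or power cut) verifies that only
       the last record, at most, is missing. */
    int main(void)
    {
        int fd = open("txn.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        for (unsigned long n = 1; ; n++) {
            char rec[64];
            int len = snprintf(rec, sizeof rec, "txn %lu\n", n);
            if (write(fd, rec, len) != len) { perror("write"); return 1; }
            if (fsync(fd) < 0) { perror("fsync"); return 1; }
        }
    }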


From: Golden California Girls on
Peter Olcott wrote:
> "Golden California Girls" <gldncagrls(a)aol.com.mil> wrote in
> message news:hpq4vi$4o4$1(a)speranza.aioe.org...
>> from my man page
>>
>> Note that while fsync() will flush all data from the host to the
>> drive (i.e. the "permanent storage device"), the drive itself may
>> not physically write the data to the platters for quite some time
>> and it may be written in an out-of-order sequence.
>>
>> Specifically, if the drive loses power or the OS crashes, the
>> application may find that only some or none of their data was
>> written. The disk drive may also re-order the data so that later
>> writes may be present while earlier writes are not.
>>
>> This is not a theoretical edge case. This scenario is easily
>> reproduced with real world workloads and drive power failures.
>>
>>
>> from man fcntl on my system
>> F_FULLFSYNC    Does the same thing as fsync(2) then asks the drive
>>                to flush all buffered data to the permanent storage
>>                device (arg is ignored). This is currently only
>>                implemented on HFS filesystems and the operation may
>>                take quite a while to complete. Certain FireWire
>>                drives have also been known to ignore this request.
>>
>>
>> To get what you want, you are going to be reading a lot of tech info
>> from disk drive manufacturers to be sure the drives on your system
>> will in fact write data to the disk when requested to do so. You
>> also are going to have to find out if the device drivers for your
>> system actually send the request on to the drives. Otherwise
>> fsync = nop.
>> Best to consider fsync to be universally broken.
>>
>>
>
> Yes, that is why you either have to turn the drive's write
> caching off or use a file system that is smart enough to do
> this on the fly, such as ZEST. This still does not solve the
> problem that fsync() itself is often broken. It must also be
> verified that fsync() works correctly.
>
> I am not sure of the best way to do this, possibly a lot of
> tests where the process is killed in the middle of a
> transaction from a very high load of transactions. The
> theory is that you only lose the last transaction.

I don't think you quite got it. Some drives IGNORE requests to not use a cache.
There is no way to turn off the cache on some drives. Drive manufacturers
believe they know better than you do what you want.

It isn't that fsync is broken, but that there is no way to implement fsync
because the hardware does not support it!

>> Best to consider fsync to be universally broken.
From: Peter Olcott on

"Golden California Girls" <gldncagrls(a)aol.com.mil> wrote in
message news:hpqihu$oln$1(a)speranza.aioe.org...
> Peter Olcott wrote:
>> "Golden California Girls" <gldncagrls(a)aol.com.mil> wrote
>> in
>> message news:hpq4vi$4o4$1(a)speranza.aioe.org...
>>> from my man page
>>>
>>> Note that while fsync() will flush all data from the host to the
>>> drive (i.e. the "permanent storage device"), the drive itself may
>>> not physically write the data to the platters for quite some time
>>> and it may be written in an out-of-order sequence.
>>>
>>> Specifically, if the drive loses power or the OS crashes, the
>>> application may find that only some or none of their data was
>>> written. The disk drive may also re-order the data so that later
>>> writes may be present while earlier writes are not.
>>>
>>> This is not a theoretical edge case. This scenario is easily
>>> reproduced with real world workloads and drive power failures.
>>>
>>>
>>> from man fcntl on my system
>>> F_FULLFSYNC    Does the same thing as fsync(2) then asks the drive
>>>                to flush all buffered data to the permanent storage
>>>                device (arg is ignored). This is currently only
>>>                implemented on HFS filesystems and the operation may
>>>                take quite a while to complete. Certain FireWire
>>>                drives have also been known to ignore this request.
>>>
>>>
>>> To get what you want, you are going to be reading a lot of tech
>>> info from disk drive manufacturers to be sure the drives on your
>>> system will in fact write data to the disk when requested to do so.
>>> You also are going to have to find out if the device drivers for
>>> your system actually send the request on to the drives. Otherwise
>>> fsync = nop.
>>> Best to consider fsync to be universally broken.
>>>
>>>
>>
>> Yes, that is why you either have to turn the drive's write caching
>> off or use a file system that is smart enough to do this on the fly,
>> such as ZEST. This still does not solve the problem that fsync()
>> itself is often broken. It must also be verified that fsync() works
>> correctly.
>>
>> I am not sure of the best way to do this, possibly a lot of tests
>> where the process is killed in the middle of a transaction from a
>> very high load of transactions. The theory is that you only lose the
>> last transaction.
>
> I don't think you quite got it. Some drives IGNORE requests to not
> use a cache. There is no way to turn off the cache on some drives.
> Drive manufacturers believe they know better than you do what you
> want.
>
> It isn't that fsync is broken, but that there is no way to implement
> fsync because the hardware does not support it!

http://linux.die.net/man/2/fsync

That may be the case, but this is not how it was related to me on this
thread. It was related to me as two distinctly different and separate
issues: the one that you just mentioned, and in addition to this the
issue that fsync() itself is often broken. fsync() is ONLY supposed to
flush the OS kernel buffers. The application buffers and the drive
cache are both supposed to be separate issues.
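
In code terms, for anything written through stdio the layers have to
be flushed in order: the application buffer with fflush(), then the
kernel buffers with fsync(), and only after that does the drive cache
even come into play. A minimal sketch, assuming the log is written
through a stdio FILE pointer:

    #include <stdio.h>
    #include <unistd.h>

    /* Flush the stdio buffer into the kernel, then push the kernel
       buffers toward the drive. Whether the drive's own cache honors
       the request is the separate hardware issue discussed above. */
    int commit(FILE *log)
    {
        if (fflush(log) != 0)           /* application buffer -> kernel */
            return -1;
        if (fsync(fileno(log)) != 0)    /* kernel buffers -> drive */
            return -1;
        return 0;
    }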

>
>>> Best to consider fsync to be universally broken.