From: Jan Vorbrüggen on
> Perhaps I should have included the latter feature in my list: flexible
> standardized and/or ad hoc annotation capability is something that I
> consider important, but I was trying to keep the list items at a fairly
> high level.

Yes, that was the one major point I thought you had missed. I would call it
an extensible meta-data structure myself.

> The recent introduction of 'bundles' ('files' that
> are actually more like directories in terms of containing a hierarchical
> multitude of parts - considerably richer IIRC than IBM's old
> 'partitioned data sets') as a means of handling multi-'fork' and/or
> attribute-enriched files in a manner that simple file systems can at
> least store (though applications then need to understand that form of
> storage to handle it effectively) may be applicable here.

Extensible meta-data would do away with the need for this in almost all cases,
would it not?

Jan
From: Bill Todd on
Jan Vorbrüggen wrote:
>> Perhaps I should have included the latter feature in my list:
>> flexible standardized and/or ad hoc annotation capability is something
>> that I consider important, but I was trying to keep the list items at
>> a fairly high level.
>
> Yes, that was the one major point I thought you had missed. I would call it
> an extensible meta-data structure myself.
>
>> The recent introduction of 'bundles' ('files' that are actually more
>> like directories in terms of containing a hierarchical multitude of
>> parts - considerably richer IIRC than IBM's old 'partitioned data
>> sets') as a means of handling multi-'fork' and/or attribute-enriched
>> files in a manner that simple file systems can at least store (though
>> applications then need to understand that form of storage to handle it
>> effectively) may be applicable here.
>
> Extensible meta-data would do away with the need for this in almost all
> cases, would it not?

Rather, this is one way in which extensible metadata can be implemented
such that it can be carried over to file systems that don't support it
natively: true, applications still have to recognize it for what it is
in order to take advantage of it (though unaware applications can still
access the pieces they do understand), but at least it can be
represented.

- bill
From: Jonathan Thornburg -- remove -animal to reply on
Terje Mathisen <terje.mathisen(a)hda.hydro.com> asked:
>So what are the features that a good file system should have, besides
>never silently dropping updates, and never allowing an inconsistent state?

Anton Ertl <anton(a)mips.complang.tuwien.ac.at> wrote:
> Well, there are different kinds of consistency.
>
> Many file systems people only care for meta-data consistency; as long
> as the fsck passes, everything is fine. Who needs data, anyway?
>
> On the other extreme there is fully synchronous operation of the file
> system (so you don't even lose a second of work in case of a crash),
> but this usually results in too-slow implementations.
>
> I like the one that I call in-order semantics
> <http://www.complang.tuwien.ac.at/papers/czezatke&ertl00/#sect-in-order>:
>
> |The state of the file system after recovery represents all write()s
> |(or other changes) that occurred before a specific point in time, and
> |no write() (or other change) that occurred afterwards. I.e., at most
> |you lose a minute or so of work.
>
> Unfortunately, AFAIK all widely-used file systems provide this
> guarantee only in fully-synchronous mode, if at all.

Doesn't the BSD FFS with McKusick's Soft Updates
http://www.usenix.org/publications/library/proceedings/usenix99/full_papers/mckusick/mckusick.ps
http://www.usenix.org/publications/library/proceedings/usenix2000/general/full_papers/seltzer/
provide this?

ciao,

--
-- "Jonathan Thornburg -- remove -animal to reply" <jthorn(a)aei.mpg-zebra.de>
Max-Planck-Institut fuer Gravitationsphysik (Albert-Einstein-Institut),
Golm, Germany, "Old Europe" http://www.aei.mpg.de/~jthorn/home.html
"Washing one's hands of the conflict between the powerful and the
powerless means to side with the powerful, not to be neutral."
-- quote by Freire / poster by Oxfam
From: Bill Todd on
Jonathan Thornburg -- remove -animal to reply wrote:
> Anton Ertl <anton(a)mips.complang.tuwien.ac.at> wrote:

....

>> I like the one that I call in-order semantics
>> <http://www.complang.tuwien.ac.at/papers/czezatke&ertl00/#sect-in-order>:
>>
>> |The state of the file system after recovery represents all write()s
>> |(or other changes) that occurred before a specific point in time, and
>> |no write() (or other change) that occurred afterwards. I.e., at most
>> |you lose a minute or so of work.
>>
>> Unfortunately, AFAIK all widely-used file systems provide this
>> guarantee only in fully-synchronous mode, if at all.
>
> Doesn't the BSD FFS with McKusick's Soft Updates
> http://www.usenix.org/publications/library/proceedings/usenix99/full_papers/mckusick/mckusick.ps
> http://www.usenix.org/publications/library/proceedings/usenix2000/general/full_papers/seltzer/
> provide this?

I don't think so. For example, since two update-in-place writes to
existing file space affect no metadata (save mtimes and atimes), I don't
think that soft updates track them at all, hence they could easily go to
disk in reverse order.
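A toy illustration (hypothetical Python, not actual FFS code) of the hazard
described above: a write-back cache that flushes dirty blocks in arbitrary
order can leave the on-disk image showing the later write but not the
earlier one after a crash, which is exactly the reordering that in-order
semantics forbids.

```python
# Toy model: two in-place data writes buffered by a write-back cache.
# If the cache flushes in an arbitrary (here, reversed) order and the
# machine crashes between flushes, the surviving disk image can show
# write B without write A -- the reordering soft updates don't prevent,
# since no metadata changed.

disk = {0: b"old0", 1: b"old1"}    # two existing file blocks
cache = []                         # dirty blocks awaiting write-back

def write(block, data):
    cache.append((block, data))    # buffered, not yet on disk

write(0, b"newA")                  # write A happens first
write(1, b"newB")                  # write B happens second

# The cache is free to flush in any order; here it flushes B first.
for block, data in reversed(cache):
    disk[block] = data
    break                          # crash after the first flush

# After recovery: B is on disk, A is not -- in-order semantics broken.
print(disk)   # {0: b'old0', 1: b'newB'}
```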

By the way, my impression is that soft updates were Ganger's creation,
not McKusick's.

- bill
From: Terje Mathisen on
Anton Ertl wrote:
> Terje Mathisen <terje.mathisen(a)hda.hydro.com> writes:
>> So what are the features that a good file system should have
>
> You might be interested in the 2006 Linux File Systems Workshop.
> There's a summary of the workshop at
> <http://lwn.net/Articles/190222/>.

Thanks, I have read the first parts so far, and the argument about
effectively increasing error rates, measured in number of errors per
full read of the disk, is _very_ interesting.

It is _very_ obvious that even home users could use (and soon will need)
some form of sector-level data redundancy to store their digital home
videos, RAW photos, etc.

So, what can be done at the single-disk level?

If actual disk failures are becoming much less common, compared to
single-sector read errors, then it starts to make sense to employ some
form of in-disk RAID:

For every N MB of contiguous disk space, use an extra MB to store ECC
info for that block. The block size needs to be large enough that a
local soft spot straddling two sectors cannot overwhelm the ECC code.

This sort of setup would be _really_ horrible for a random-access,
frequent-update workload (file system metadata?), but very good for
stuff like Google's minimum disk access block of 64 MB.

The overhead of writing a few percent extra data really doesn't matter,
as long as it is all sequential IO.
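As a concrete (if drastically simplified) sketch of the idea, here is
single-parity XOR over a group of sectors in Python. A real scheme would
use a stronger ECC that can correct a multi-sector soft spot, but the
space-overhead argument is the same: one extra sector per group of N
costs only 1/N.

```python
from functools import reduce

SECTOR = 8  # bytes per sector (tiny, for illustration only)

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def parity(sectors):
    # One parity sector protects the whole group: ~1/N space overhead.
    return reduce(xor, sectors)

def rebuild(sectors, par, lost):
    # XOR the parity with every surviving sector to recover the lost one.
    survivors = [s for i, s in enumerate(sectors) if i != lost]
    return reduce(xor, survivors, par)

group = [bytes([i]) * SECTOR for i in range(16)]  # 16 data sectors
p = parity(group)
assert rebuild(group, p, 5) == group[5]   # recover an unreadable sector
```

Note that a single parity sector can only repair one lost sector per
group, which is why the group needs to be large relative to the expected
size of a bad spot.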

Back to multi-disk configurations:

After following a few more links, I found NetApp's description of their
diagonal parity setup, which uses XOR for all calculations, and achieves
performance very close to what's possible in a single-parity situation:

http://www.usenix.org/events/fast04/tech/corbett/corbett_html/index.html
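The row-diagonal construction can be sketched in a few lines of XOR-only
Python. The layout below is a toy model after the paper (prime p = 5,
one-byte "blocks", made-up data values), and the recovery loop uses
simple constraint propagation rather than the paper's reconstruction
chains; it still rebuilds any two lost disks with nothing but XOR.

```python
import copy

# Toy row-diagonal parity: prime p = 5 gives 4 data disks plus one
# row-parity disk (disks 0..4) plus a separate diagonal-parity store,
# with p - 1 = 4 rows per stripe.
p = 5
ROWS, DISKS = p - 1, p

data = [[(3 * r + i + 1) % 251 for i in range(DISKS - 1)]
        for r in range(ROWS)]

# Row parity on disk 4: every full row XORs to zero.
arr = [row + [0] for row in data]
for r in range(ROWS):
    for i in range(DISKS - 1):
        arr[r][DISKS - 1] ^= arr[r][i]

# Diagonal parity: block (r, i) lies on diagonal (r + i) mod p;
# diagonal p - 1 is deliberately left unstored.
diag = [0] * ROWS
for r in range(ROWS):
    for i in range(DISKS):
        d = (r + i) % p
        if d < ROWS:
            diag[d] ^= arr[r][i]

def recover(arr, diag, x, y):
    """Rebuild two lost disks x and y by constraint propagation:
    repeatedly solve any row or stored diagonal with one unknown."""
    unknown = {(r, i) for r in range(ROWS) for i in (x, y)}
    while unknown:
        for r in range(ROWS):          # row constraint: XOR of row == 0
            miss = [i for i in range(DISKS) if (r, i) in unknown]
            if len(miss) == 1:
                i = miss[0]
                arr[r][i] = 0
                for j in range(DISKS):
                    if j != i:
                        arr[r][i] ^= arr[r][j]
                unknown.discard((r, i))
        for d in range(ROWS):          # diagonal constraint
            cells = [(r, i) for r in range(ROWS) for i in range(DISKS)
                     if (r + i) % p == d]
            miss = [c for c in cells if c in unknown]
            if len(miss) == 1:
                r, i = miss[0]
                arr[r][i] = diag[d]
                for c in cells:
                    if c != (r, i):
                        arr[r][i] ^= arr[c[0]][c[1]]
                unknown.discard((r, i))

broken = copy.deepcopy(arr)
for r in range(ROWS):                  # lose disks 0 and 2 entirely
    broken[r][0] = broken[r][2] = 0
recover(broken, diag, 0, 2)
assert broken == arr                   # full double-failure recovery
```

The key trick is that one diagonal is never stored: that is what lets
the propagation (or the paper's chains) always find a starting point
with a single unknown.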

Terje

--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"