From: Andrew Reilly on
Hi there,

On Tue, 26 Sep 2006 04:04:41 +0000, Stephen Fuld wrote:
> I'll chime in here. Several people have taken a passing shot at metadata,
> but I would like to discuss this further. I think a file system needs a
> consistent, easily accessible, extensible mechanism for setting, retrieving
> and modifying metadata/attributes.

I'm sure you're right. Lots of recent file system work (in the Unix
space, at least) seems to be focussed on adding these. I'd like to play
devil's advocate here, though: I'm not convinced.

> Currently file systems use at least four different methods, frequently
> within the same file system!

That's not a priori a bad thing, if there are significantly different
access and usage characteristics of these different pieces of information.

> 1. Overloading part of the file name (the extension) to indicate what
> program is the default to process this file and perhaps implicitly
> something about the file format.

Aren't file extensions exclusively a file-type hint, rather than a
preferred application hint? What are some common file extensions that can
be used for different types of file? I can open .doc files with four or
five different applications, these days (with varying degrees of success,
admittedly).

> 2. Various bits of the directory entry (loosely defined) for such
> things as read only status, ownership, time of last update, etc.
>
> 3. Extra streams.
>
> 4. An entry in another file altogether, in who knows what format. This
> is used by, for example, some backup systems for telling where the
> backup copy is, etc.

If some specific backup system is the only thing that needs the
information, perhaps it's right that it's the only thing that knows it?

> There should be a single mechanism for creating and reading all such
> data. There must be a way for users to be able to define their own new
> attributes that are accessed in the same way as the other ones. The
> metadata should be backed up with the data so it can be restored in the
> event of an error. The mechanisms should be easy enough to use that no
> one will want to use any other one. There should be utilities for
> listing the attributes/metadata for a file as well as changing it (with
> appropriate permission).
>
> Once you have the mechanism, we can have a profitable discussion of what
> those attributes should be. Note that many of the "wish list" items
> mentioned already are perfect things to be stored in this manner.

How does file-anchored, user-definable meta-data fare in a network-shared
multi-user world? User-specific meta-data, like preferred application,
relative position within a folder GUI, or even icon, has different access
permission behaviour and visibility than the file itself. My preferred
application might not be your preferred application. On some systems that
can specify file icon location within a folder, one can specify the
location even of read-only files, in read-only folders.

Quite a lot of meta-data is stored within files, in application-specific
formats, now. ID3 title/artist tags or sample rates in MP3 files, "meta"
attributes in HTML files, author information in office documents.
Alternate language soundtracks in DVD movies, perhaps (not meta-data, but
"extra stream" information).

How could this reasonably be subsumed by a file system, when the
information must travel with the file, by the definition of the file
format? Perhaps it is reasonable for a "file system" to expose abstract
meta-data methods that operate on different file types through
type-specific plug-ins that access (and modify?) the information in
format-specific ways. Is that really a win? Is it what you are thinking
about, or would such meta-information be duplicated from the file into
file-system meta-data forks? How much effort would you go to to ensure
consistency in that case?
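
A minimal sketch of the plug-in idea in the paragraph above, assuming a
registry keyed by file extension. Every name here (register, get_metadata,
the reader function) is hypothetical and not taken from any real file
system API. Only an HTML reader is shown; an ID3 reader for MP3 files would
be registered the same way.

```python
from html.parser import HTMLParser

# Hypothetical registry: extension -> function that pulls metadata out of
# the file's own bytes, in whatever format that file type defines.
_READERS = {}


def register(extension):
    """Decorator that installs a reader plug-in for one file extension."""
    def decorator(func):
        _READERS[extension.lower()] = func
        return func
    return decorator


class _MetaCollector(HTMLParser):
    """Collects <meta name=... content=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]


@register(".html")
def read_html_meta(data: bytes) -> dict:
    parser = _MetaCollector()
    parser.feed(data.decode("utf-8", errors="replace"))
    return parser.meta


def get_metadata(filename: str, data: bytes) -> dict:
    """Uniform entry point: dispatch to the plug-in for this file type."""
    dot = filename.rfind(".")
    ext = filename[dot:].lower() if dot != -1 else ""
    reader = _READERS.get(ext)
    return reader(data) if reader else {}


page = b'<html><head><meta name="author" content="Andrew"></head></html>'
print(get_metadata("page.html", page))
```

In this sketch the metadata stays inside the file and is read on demand, so
there is no second copy to keep consistent. The price is that a file type
without a plug-in yields nothing at all.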

Cheers,

--
Andrew

From: Jan Vorbrüggen on
>>>Case-blind case-preserving is the only variant which is acceptable from the
>>>point of view of ergonomics, IMNSHO.
>>There I agree. This obeys the principle of least surprise, but as noted
>>above, it does still have drawbacks.
> In mathematics and physics quantities are *always* case-sensitive.
> That is, 'g' and 'G' are *always* distinct.

Quite. But would you extend this to making "thisisanimportantconstant"
and "thisIsAnImportantConstant" and "thisisanImportantConstant" distinct?
These are the cases that cause the problems.

> Are you saying that your idea of the POLS is that because the two
> pathnames are in the same directory, and have filenames which (let us
> say, in the current locale) compare as equal according to strcoll(3),
> then the 2nd file should overwrite the 1st? Ick.

Ayup.

> Things are going to get even ickier if different users (having different
> locales in effect) find different sets of files in the directory. Eg
> what happens if a backup from a system which allows creation of the
> distinct files /some/where/g.h5 and /some/where/G.h5 gets restored on
> a system which thinks those are two distinct names for the same file?

IMO, that's a fundamental incompatibility that prohibits interoperability.

> The fundamental problem is that different {users,applications} may
> have different ideas of how case should be handled...

But they shouldn't, because they are all human beings that are subject to
the same (within some reasonable definition of "same") cognitive abilities
and, more importantly, disabilities. I wouldn't go so far as to take all
results from ergonomics as gospel - but there are some basic ways you can
build things that, done one way, lead to a small rate of errors in using
them or, built differently, lead to a large error rate.

The A320 crash at Strasbourg occurred, in the final account, because in the
display for the descent rate it made a difference whether it showed "3" or
"3.". Nobody in the cockpit noticed this, and the crew likely didn't even
know what the presence or absence of the "." meant. That's just bad design
- as is allowing the case of letters in a filename to distinguish files.

Jan
From: Jan Vorbrüggen on
>>> Which language do you want to be case-insensitive in? What if two
>>> users of the same file system disagree on the choice?
>> That is not a matter of language. Or is there a character encoding that
>> says for language A, "X" and "x" are a pair while for language B, "X" and
>> "y" are a pair?
> Yes, afaik:
> The German 'double-s' is two letters in uppercase and a single letter in
> lowercase.

No, that's not what I meant. I asked whether there are languages that use the
same letters, but for which the mapping between upper- and lower-case is
incompatible.

Whether you would want to distinguish the files "DASS", "dass" and "daß" is
another matter...
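
As an illustration of the three spellings, assuming the third name is the
sharp-s form "daß": Python's built-in string methods show that simple
lowercasing does not unify them, while Unicode case folding does. The helper
same_name is hypothetical; whether a file system should compare names this
way is exactly the open question.

```python
names = ["DASS", "dass", "daß"]

# lower() leaves the sharp s alone, so two distinct names remain.
print([n.lower() for n in names])      # ['dass', 'dass', 'daß']

# upper() expands the sharp s to "SS", changing the length of the name.
print([n.upper() for n in names])      # ['DASS', 'DASS', 'DASS']

# casefold() applies Unicode full case folding, which maps ß to "ss".
print([n.casefold() for n in names])   # ['dass', 'dass', 'dass']


def same_name(a: str, b: str) -> bool:
    """Hypothetical case-blind comparison based on full case folding."""
    return a.casefold() == b.casefold()


print(same_name("DASS", "daß"))        # True
```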

Jan
From: Jan Vorbrüggen on
> ECC seems to be in the same redundancy-space as RS codes to me.

It seems to me that would depend on what you intend to convey with "ECC".
If you expand it as "error-correcting code", then RS codes are one example
of such. If you mean the ECC codes generally used for semiconductor memory,
then they would be two examples of the same general class of algorithms.
The difference is in the statistics of errors - disks (and comms) generally
have burst errors, for which RS codes are optimal; semiconductor memory has
random errors or small, defined units of failure with defined characteristics
("chipkill" stuck at one or zero), for which the usual "ECC" or "EDC" codes
are optimal. Generally, when you know your error statistics (including
correlation among errors), you can design an appropriate optimal ECC for it.
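
To make the point about error statistics concrete, here is a toy
Hamming(7,4) code, used purely as a small stand-in for the single-error
correcting class. Real memory ECC typically uses wider codes, and nothing
here is taken from any particular hardware. The function names are
illustrative. It corrects any single flipped bit, but a two-bit burst is
silently miscorrected, which is why burst-prone media call for a different
code.

```python
def hamming74_encode(data):
    """Encode 4 data bits into 7 bits laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data bits, syndrome); a nonzero syndrome is the 1-based
    position that the decoder assumed was flipped and corrected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
sent = hamming74_encode(word)

# Random single-bit error: corrected.
single = list(sent)
single[4] ^= 1
print(hamming74_decode(single)[0] == word)   # True

# Two adjacent bits flipped (a short burst): decoded to the wrong data.
burst = list(sent)
burst[0] ^= 1
burst[1] ^= 1
print(hamming74_decode(burst)[0] == word)    # False
```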

Jan
From: Jan Vorbrüggen on
>>None of any worth IMO. But case smashing to provide a case blind name
>>space takes code, and would not fit into a PDP7/11 address space.
> Nonsense. Keeping the case the user specified was a choice.
> Case-squashing would be a very few instructions.

I'm all for keeping the user's choice of case, but making it irrelevant
on compare. Would that still be "a very few instructions", in your opinion?
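
A minimal sketch of "keep the user's case, ignore it on compare", assuming
an in-memory directory modelled as a dictionary. The class and method names
are hypothetical. Overwriting on a case-blind match while keeping the first
spelling is one possible policy, chosen here only for illustration. A
high-level sketch like this says nothing about instruction counts on a
PDP-7 or PDP-11.

```python
class CaseBlindDirectory:
    """Case-preserving, case-blind name space (illustrative only)."""

    def __init__(self):
        # folded name -> (name exactly as first typed, contents)
        self._entries = {}

    @staticmethod
    def _key(name):
        # The comparison key; the stored spelling is never altered.
        return name.casefold()

    def write(self, name, contents):
        key = self._key(name)
        stored_name = self._entries.get(key, (name, None))[0]
        self._entries[key] = (stored_name, contents)

    def read(self, name):
        return self._entries[self._key(name)][1]

    def listing(self):
        return [stored for stored, _ in self._entries.values()]


d = CaseBlindDirectory()
d.write("G.h5", "first")
d.write("g.h5", "second")    # same file under case-blind compare
print(d.listing())           # ['G.h5']
print(d.read("g.H5"))        # second
```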

Jan