Trying to design low level hard disk manipulation program [Computer Architecture]

Prev: "Livermore Loops" on x86 Linux
Next: How Many Processor Cores Are Enough?

From: Jonathan Thornburg -- remove -animal to reply on 26 Sep 2006 08:24

I wrote:
>> In mathematics and physics quantities are *always* case-sensitive.
>> That is, 'g' and 'G' are *always* distinct.

Jan Vorbrüggen <jvorbrueggen(a)not-mediasec.de> replied:
> Quite. But would you extend this to making "thisisanimportantconstant"
> and "thisIsAnImportantConstant" and "thisisanImportantConstant" distinct?
> These are the cases that cause the problems.

Yes, IMHO these should be distinct files at the OS level.

Of course, if some _application_ (or suite of applications) wants
to canonicalize all of these strings before forming a filename,
that's fine. I just don't think the {filesystem,OS} should be in
the business of imposing such semantics on *all* applications.
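
A minimal sketch of what such application-level canonicalization might
look like, in Python. The function name canonical_name and the choice of
case-folding as the policy are assumptions made for illustration; the
filesystem underneath still sees whatever string the application passes.

```python
def canonical_name(name: str) -> str:
    """Hypothetical application policy: fold case before forming a filename."""
    return name.casefold()

names = [
    "thisisanimportantconstant",
    "thisIsAnImportantConstant",
    "thisisanImportantConstant",
]

# Under this policy the three spellings collapse to a single filename.
canonical = {canonical_name(n) for n in names}
print(canonical)  # {'thisisanimportantconstant'}
```

An application that wants the three names kept distinct simply skips
this step; nothing in the lower layers has to change.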

I asked:
>> Eg
>> what happens if a backup from a system which allows creation of the
>> distinct files /some/where/g.h5 and /some/where/G.h5 gets restored on
>> a system which thinks those are two distinct names for the same file?
>
> IMO, that's a fundamental incompatibility that prohibits interoperability.
>
>> The fundamental problem is that different {users,applications} may
>> have different ideas of how case should be handled...
>
> But they shouldn't, because they are all human beings that are subject to
> the same (within some reasonable definition of "same") cognitive abilities
> and, more importantly, disabilities.

Strange: in 30 years of professional work in math, computing science,
and physics, I've yet to meet anyone who had trouble distinguishing
between 'r' and 'R' in an equation. And in my current work (numerical
simulations in general relativity), I've yet to meet anyone who has
trouble distinguishing (say) lower-case Greek gamma ($\gamma$ in TeX)
and upper-case Greek gamma ($\Gamma$ in TeX), even though our most
common equation system contains both of these. ($\gamma$ is the spatial
3-metric; $\Gamma$ denotes the spatial Christoffel symbols.)

Admittedly, equations usually use reasonably short identifiers.
Things aren't quite so pretty for your example of
"thisisanimportantconstant"
"thisIsAnImportantConstant"
"thisisanImportantConstant"
So where do you want to draw the threshold? By identifier length?
By Hamming distance? Weighted by ordinal-position-in-identifier?
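
To make the Hamming-distance option concrete, here is a small Python
sketch that counts the positions at which two equal-length identifiers
differ. The helper name hamming is an assumption for illustration, and
any cutoff value would be exactly the arbitrary policy choice in question.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

a = "thisisanimportantconstant"
b = "thisIsAnImportantConstant"
c = "thisisanImportantConstant"

print(hamming(a, b))  # 4
print(hamming(a, c))  # 2
print(hamming(b, c))  # 2
print(hamming("r", "R"))  # 1
```

Note that 'r' versus 'R' is at distance 1 while differing in 100% of
its characters, so an absolute cutoff and a relative cutoff would give
different answers for the same pair.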

I really, _really_ don't think any one-size-fits-all policy is going
to be suitable for everyone here. This should be left to applications;
a general-purpose filesystem should provide a clean primitive (filenames
are uninterpreted byte [or some larger alphabet if that seems appropriate;
I don't want to get into the i18n tarpit here] strings) and leave the
rest to higher-level software.

> But they shouldn't, because they are all human beings that are subject to
> the same (within some reasonable definition of "same") cognitive abilities
> and, more importantly, disabilities.

Since when are file names generated *only* by humans? Lots and lots
of software generates file names... and software doesn't have any problems
distinguishing 'r' from 'R', or even
"thisisanimportantconstant"
"thisIsAnImportantConstant"
"thisisanImportantConstant"

> The A320 crash at Strasbourg occurred, in the final account, because in the
> display for the descent rate it made a difference whether it showed "3" or
> "3.". Nobody in the cockpit noticed this, and the crew likely didn't even
> know what the presence or absence of the "." meant. That's just bad design
> - as is allowing the case of letters in a filename to distinguish files.

Of course, lousy GUI design is lousy GUI design. (And having the decimal
point be small and not backlit in the LCD display didn't help, either!)

But if we follow your line of reasoning, then we should design our
{file systems, OSs, programming languages, etc} to always treat the
strings "3" and "3." as being the same critter. Ick. We have several
decades of experience with programming languages in which 'int' and
'floating point' are the same data type (APL and Perl come to mind),
and also several decades of experience with programming languages in
which these are distinct data types (eg the entire Algol-derived family,
the entire Fortran family), and each has advantages and disadvantages.
I will observe that almost everyone doing serious floating-point
arithmetic has "voted with their feet" for software environments
where "3" and "3." do indeed have different meanings.
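
As one illustration of an environment where the two spellings differ,
Python parses "3" and "3." as literals of different types. The helper
name literal_type is an assumption for this sketch; ast.literal_eval is
the standard-library call that evaluates a literal from a string.

```python
import ast

def literal_type(text: str) -> str:
    """Return the type name Python assigns to a literal given as text."""
    return type(ast.literal_eval(text)).__name__

print(literal_type("3"), literal_type("3."))  # int float

# The distinction changes arithmetic, not just the label:
print(7 // 2)   # 3    (integer floor division)
print(7. / 2)   # 3.5  (floating-point division)
```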

The nice thing about lower-level software *not* making this sort of
decision is that it leaves the field free for higher-level software
to experiment, and do what makes sense in a particular situation.
As a rule of thumb, one size does *not* fit all!

ciao,

--
-- "Jonathan Thornburg -- remove -animal to reply" <jthorn(a)aei.mpg-zebra.de>
Max-Planck-Institut fuer Gravitationsphysik (Albert-Einstein-Institut),
Golm, Germany, "Old Europe" http://www.aei.mpg.de/~jthorn/home.html
"Washing one's hands of the conflict between the powerful and the
powerless means to side with the powerful, not to be neutral."
-- quote by Freire / poster by Oxfam

From: Andrew Reilly on 26 Sep 2006 08:45

On Tue, 26 Sep 2006 04:43:13 -0500, Rob Warnock wrote:

> Andrew Reilly <andrew-newspost(a)areilly.bpc-users.org> wrote:
> +---------------
> | Aren't file extensions exclusively a file-type hint, rather than a
> | preferred application hint? What are some common file extensions that can
> | be used for different types of file? I can open .doc files with four or
> | five different applications, these days (with varying degrees of success,
> | admittedly).
> +---------------
>
> But they're *only* a "hint", since they're often ambiguous.
>
> For example, long before MS Windows existed, ".DOC" was used on
> the PDP-10 (and elsewhere) to indicate that a file was "documentation",
> that is, human-readable plaintext. Even today on Unix/Linux there are
> several software packages that use the same convention -- that ".doc"
> is human-readable plaintext, *not* MS Word format. Case in point: On my
> FreeBSD laptop, of 226 files named "*.doc", 147 are plain ASCII text,
> 30 are directories(!), and only 46 are Microsoft format files.
>
> And even when ".doc" *does* mean MS Word or Office format, which
> *version*?!? There have been several compatibility breaks over
> the years.

Maybe common practice hints that "hint" is in fact the best answer. Who
wants to have to maintain the dictionary of arbitrary distinctions
introduced by version histories and platform variations that a strict
mechanical (mathematical) "file type" indicator would require?

Do you really want to fire up Word97-pc-release.1 when that's what created
a specific .doc file? [Mind you, there's a better chance of that
happening with something like the Unix "magic" system than some sort of
manually ascribed file type system, IMO, maintenance nightmare though that
obviously is.]
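
A rough sketch of the content-sniffing idea behind the Unix "magic"
approach, in Python. It is not the real magic database: the function
name sniff_doc and the three result labels are assumptions for
illustration. The 8-byte signature is the one used by the OLE2 compound
file container that legacy MS Office documents are stored in.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file signature

def sniff_doc(data: bytes) -> str:
    """Guess what a '*.doc' file holds from its leading bytes."""
    head = data[:512]
    if head.startswith(OLE2_MAGIC):
        return "OLE2 compound file"
    try:
        head.decode("ascii")
    except UnicodeDecodeError:
        return "unknown binary"
    return "plain ASCII text"

print(sniff_doc(b"README for the frobnicator\n"))  # plain ASCII text
print(sniff_doc(OLE2_MAGIC + b"\x00" * 504))       # OLE2 compound file
```

Even a match only identifies the container, not which Word version
wrote it, which is why the result is still best treated as a hint.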

Cheers,

--
Andrew

From: Jan Vorbrüggen on 26 Sep 2006 09:21

> Strange, 30 years of professional work in math, computing science,
> and physics, I've yet to meet anyone who had trouble distinguishing
> between 'r' and 'R' in an equation.

Strange - with similar experience, I definitely have. Oh, not after you've
pointed it out - but it is a potential source for confusion. Yes, even in
the one-letter case.

> Admittedly, equations usually use reasonably short identifiers.
> Things aren't quite so pretty for your example of
> "thisisanimportantconstant"
> "thisIsAnImportantConstant"
> "thisisanImportantConstant"
> So where do you want to draw the threshold? By identifier length?
> By Hamming distance? Weighted by ordinal-position-in-identifier?

As you can't, in a canonical way, you need to do away with the distinction
for all lengths.

> Since when are file names generated *only* by humans? Lots and lots
> of software generates file names... and software doesn't have any problems
> distinguishing 'r' from 'R', or even

That's a strawman, and you know it.

> But if we follow your line of reasoning, then we should design our
> {file systems, OSs, programming languages, etc} to always treat the
> strings "3" and "3." as being the same critter.

Nope, that doesn't follow at all. What follows is that if "3" and "3."
are different things, they should be displayed in such a way that the
distinction is immediately visually apparent. This is similar to the European
preference for writing "0.1" instead of the American ".1".

> The nice thing about lower-level software *not* making this sort of
> decision is that it leaves the field free for higher-level software
> to experiment, and do what makes sense in a particular situation.
> As a rule of thumb, one size does *not* fit all!

Unfortunately, experience has shown that if blade guards are not enforced,
they are not used. And at least in Europe, it is illegal to sell dangerous
equipment without blade guards.

Jan

From: "Peter "Firefly" Lund" on 26 Sep 2006 12:43

On Mon, 25 Sep 2006, Jan Vorbrüggen wrote:

>> Yes. There's also annoying things like ligatures and diacritics. And
>> perhaps many different codepoints that (more or less) share a glyph.
>
> How are those in any way relevant?

Change the H to an A, then.

-Peter

From: Bill Todd on 26 Sep 2006 12:46

Terje Mathisen wrote:
> Bill Todd wrote:

....

>> As for how metadata is presented to the outside world, bundles
>> (which sound similar to what you may mean by 'meta-data forks') seem
>> like one good option.
>
> If we must have this, then I would strongly prefer to have them visible
> and accessible as virtual directory structures:
>
> I.e. attribute "creator" of file "foo" could be read by
>
> 'cat foo/creator'

My very limited acquaintance with 'bundles' suggests that this is what
they do.

>
> or possibly
>
> 'cat foo.meta/creator'

And that's more like the ReiserFS V4 approach (IIRC they reserve the
single subdirectory name 'metas' for this purpose, introducing another -
syntactical - path element which does not in fact perform another actual
disk look-up and in so doing eliminating all other potential naming
collisions with *real* subdirectory names).

Appending .meta to the file name would introduce path look-up ambiguity
unless that ending was otherwise reserved.

>
> Allowing regular file/directory operations to create/read/write/modify
> these attribute streams seems like the obviously Right Thing (tm) to do.
>
> It also has the great advantage of being almost transparently portable
> to any hierarchical filesystem, modulo performance.

Exactly: as long as the normal data 'stream' appears as its own
lower-level entry rather than being syntactically associated with the
'container' parent directory, the entire structure can be represented -
and accessed - in a conventional implementation, albeit without the
performance optimizations available in an implementation which better
understands the grouping.
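
A minimal sketch of that conventional representation, in Python: the
bundle is an ordinary directory, the normal data stream is its own
entry, and each attribute is a small file beside it. The entry name
"data", the helper names, and the sample attribute value are all
assumptions for illustration, not any real filesystem's layout.

```python
import tempfile
from pathlib import Path

DATA_ENTRY = "data"  # hypothetical reserved name for the normal data stream

def write_bundle(root: Path, name: str, data: bytes, attrs: dict) -> Path:
    """Store a file as a directory: one entry for data, one per attribute."""
    if DATA_ENTRY in attrs:
        raise ValueError("attribute name collides with the data entry")
    bundle = root / name
    bundle.mkdir()
    (bundle / DATA_ENTRY).write_bytes(data)
    for key, value in attrs.items():
        (bundle / key).write_text(value)
    return bundle

def read_attr(bundle: Path, key: str) -> str:
    """Equivalent of 'cat foo/creator' using ordinary file operations."""
    return (bundle / key).read_text()

with tempfile.TemporaryDirectory() as tmp:
    foo = write_bundle(Path(tmp), "foo", b"payload", {"creator": "alice"})
    print(read_attr(foo, "creator"))          # alice
    print((foo / DATA_ENTRY).read_bytes())    # b'payload'
```

The explicit collision check stands in for the reserved-name problem
discussed above: some name has to be set aside, whether it is "data"
here or a suffix like ".meta".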

There are, however, remaining issues with the system-managed attributes,
which in the conventional file system instance need to continue to be
associated with the specific objects which they control. So while in
bundle-aware implementations they can be accessed syntactically just as
the other metadata elements are, in bundle-ignorant implementations they
likely don't appear as separate objects at all but are managed according
to the local idiom - and thus while they *appear* to be handled the same
as the application-level metadata in the bundle-aware implementation,
they may in fact be implemented quite differently in a way that's more
easily transported to conventional environments and back again.

Or, one could explore the approach of retaining traditional behavior
where porting that metadata back and forth between new and traditional
systems is awkward, and only treat the extended metadata that
traditional systems don't already support specially - normalizing the
application interface across both environments at the expense of
normalizing the access mechanisms for *all* metadata in the new
environments only.

Either way, the resulting application interface (when considered across
both old and new environments) is somewhat kludgier than it would have
been had all this been designed in from the start, but may be 'good
enough' to be useful.

- bill
