From: Bill Todd on
Andrew Reilly wrote:

....

> My concern is that there's no obvious reason that one shouldn't be
> required to specify a file by the name that it was given.

The obvious reason has hardly been obscured in this discussion: it's
because requiring this can be painful for human users (average human
users being one very prominent, and arguably the least educable,
consumer of the feature).

My own experience is that most operating systems' file systems
understand this reason just fine, and that Unix is the primary exception
rather than anything resembling the rule. But my own experience is
fairly limited in this respect, so by all means enlighten me if I've got
the situation backward.

Now, the situation today may have evolved to the point where many human
users never really interact directly with file names at all, and if so
that could alter the balance of the argument somewhat. The complexities
introduced by non-English (and non-ASCII) character sets may also be
worth considering in this regard. But where the argument against
case-sensitivity came from should be crystal clear.

- bill
From: Jason Ozolins on
Jan Vorbrüggen wrote:
>> Er, how does it know that you are putting in file names? Yes, there is
>> a kludge that works moderately well for both case-sensitive and case-
>> insensitive for sort, but what about uniq?
>
> Miscommunication - I thought you were talking of the file browser which
> knows that it can do a case-blind compare. Obviously, for other cases tools
> such as sort and uniq will need to be told of such constraints. What
> default?
> Dunno.

On modern GNU/Linux and Solaris OSs, sort(1) certainly uses the locale
setting to determine collation order. That means that quite a lot of
users are already running by default with case-insensitive sort.

Setting LANG or more specifically LC_COLLATE to C will make sort use a
case-sensitive ASCII collating order, but for instance LC_COLLATE=en_AU
will give a case-insensitive collation order. "uniq -i" respects this too.

eg:
[jao900(a)armstretcher ~]$ cat > zbleb
Apple
aardvark
Aardvark
apple
aerial
Aerial
[jao900(a)armstretcher ~]$ echo $LANG
en_AU.UTF-8
[jao900(a)armstretcher ~]$ sort zbleb
aardvark
Aardvark
aerial
Aerial
apple
Apple
[jao900(a)armstretcher ~]$ sort zbleb | uniq -i
aardvark
aerial
apple
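For contrast, forcing the C locale on the same sort of data gives a
plain ASCII code-point ordering, with all the capitals grouped before
the lowercase letters (a quick sketch; I'm using a throwaway file name
here):

```shell
# Same words sorted under the C locale: ordering is by raw byte value,
# so 'A' (0x41) through 'Z' sort before 'a' (0x61) through 'z'.
printf 'Apple\naardvark\nAardvark\napple\n' > zbleb2
LC_ALL=C sort zbleb2
# Aardvark
# Apple
# aardvark
# apple
```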


The overhead for doing this case-insensitive sorting is:
[jao900(a)armstretcher ~]$ ls -al /usr/lib/locale/en_AU.utf8/LC_COLLATE
-rw-r--r-- 85 root root 882134 Aug 12 20:59
/usr/lib/locale/en_AU.utf8/LC_COLLATE

Sure, let's blow out the kernel by 900K to stick this in the filesystem
layer... ahem. I don't really think that 900K translation tables belong
in the kernel. The WinNT developers did, but that's hardly surprising.
In fact, IIRC there is a UTF-16 case translation table embedded in the
structure of an NTFS filesystem.

If I may be so bold, the real ergonomic nightmare of traditional UNIX
file names is that they can contain arbitrary ASCII control characters.
These control characters could be checked for and rejected very very
simply for both ASCII and UTF-8 encodings*, and the only people who
would be unhappy would be the l33t skr1p7 k1dd13s whose root kits would
no longer be able to hide themselves. You think it's difficult
supporting users who can't tell the difference between "Makefile" and
"makefile", well I found it pretty annoying trying to support a sysadmin
who hadn't spotted the bogus directory hiding in
/var/iforgetnow/itwasalongtimeago. :-)
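The check itself really is that simple. A sketch of what I mean, done
in shell for illustration (hypothetical, of course; a real version
would sit in the kernel's name-lookup path, not in userspace):

```shell
# Hypothetical filename check: reject any candidate name containing
# ASCII control characters (0x01-0x1f, 0x7f). This covers UTF-8 too,
# since UTF-8 never uses bytes in that range for any other character.
check_name() {
    if printf '%s' "$1" | LC_ALL=C grep -q '[[:cntrl:]]'; then
        echo "rejected: control character in filename" >&2
        return 1
    fi
    return 0
}

check_name 'Makefile'                  # accepted
check_name "$(printf 'evil\tname')"    # rejected
```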

Hmm, I just checked, and it looks like a modern "ls" actually shows
the control characters as question marks. (My experience was with Red
Hat 5 or 6, circa 2000-2001.) But still, who would ever want a control
character in a filename? And try telling someone who accidentally
creates such a beast how to remove it...
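For what it's worth, the recipe I'd give is to delete the thing by
inode number rather than trying to type the name (this assumes GNU
find's -inum and -delete):

```shell
# Make a victim file with an embedded control character, then remove
# it by inode number instead of fighting with the name.
cd "$(mktemp -d)"
touch "$(printf 'evil\001name')"
ls -i                                  # note the inode next to the garbled name
ino=$(ls -i | awk '/evil/ {print $1}')
find . -maxdepth 1 -inum "$ino" -delete
ls                                     # gone
```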

-Jason

* ISO-8859-n defines a bunch of weirdo control characters from 128 to
159. I doubt that many programs implement them.
From: Terje Mathisen on
Tarjei T. Jensen wrote:
>
> "Eric P." wrote:
>> I have never built a file system, but it seems to me that the problem
>> with file compression is that a write in the middle of the file
>> will be recompressed and can cause changes to the files' physical
>> block mappings and meta data structures. This in turn updates file
>> system block allocation tables and meta transaction logs.
>
> In NetWare the file was compressed if it was not used (or was it
> modified) for a certain time. Once you write, it stays uncompressed
> until it becomes eligible for compression again.

NetWare compressed file systems were tightly integrated in the OS, along
with their hierarchical storage manager (HSM):

I.e. the directory structure could maintain two copies of any given file
(regular and compressed), as well as a third version which had been
migrated to a lower-level/bulk storage system.

Viewed this way, whole-file compression becomes just the first
(intermediate) tier of your HSM setup.

The OS allowed you to configure HSM/compression handling on
volume/directory/file level, with the obvious inheritance rules.

Per volume you could configure high/low watermarks where it would force
immediate compression and/or migration, and you could also configure
if/how long it would maintain both a compressed/migrated version and the
original.

When a user access caused decompression/demigration, you had rules for
how/if this should be treated, with the default something like 'keep
both copies for a week, if no further access happens in that time,
delete the uncompressed/original version'.

NetWare also had something close to what NetApp's WAFL gives us now: The
ability to turn back the clock.

Under NetWare each volume would by default keep all versions of all
files forever, or until you ran out of total disk space. At that time it
would start to actually delete the oldest stuff first, but this could be
overridden, also on dir/file level to force longer/shorter version
backup times.

I.e. for a scratch/temp dir you'd set the 'Flush Immediate' flag, while
you could keep the financial director's spreadsheets around more or less
forever.

BTW, since this was so completely integrated in the OS, all access
controls would follow files into both migrated and
deleted-but-not-yet-flushed status.

Terje

--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: Rob Warnock on
Jan Vorbrüggen <jvorbrueggen(a)not-mediasec.de> wrote:
+---------------
| > But if we follow your line of reasoning, then we should design our
| > {file systems, OSs, programming languages, etc} to always treat the
| > strings "3" and "3." as being the same critter.
|
| Nope, that doesn't follow at all. What follows is that if "3" and "3."
| are different things, they should be displayed in such a way that the
| distinction is immediately visually apparent. This is similar to European
| preference for writing "0.1" instead of the American ".1".
+---------------

Note that in ANSI-standard Common Lisp, 3 and 3. *are* the same,
but 3.0 is different:

> (type-of 3)

(INTEGER 3 3)
> (type-of 3.)

(INTEGER 3 3)
> (type-of 3.0)

SINGLE-FLOAT
> (eql 3 3.)

T
> (eql 3. 3.0)

NIL
>


-Rob

-----
Rob Warnock <rpw3(a)rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607


From: Andrew Reilly on
On Wed, 27 Sep 2006 01:22:05 -0400, Bill Todd wrote:

> Andrew Reilly wrote:
>
> ...
>
>> My concern is that there's no obvious reason that one shouldn't be
>> required to specify a file by the name that it was given.
>
> The obvious reason has hardly been obscured in this discussion: it's
> because requiring this can be painful for human users (average human
> users being one very prominent, and arguably the least educable,
> consumer of the feature).

Sure, that's the incentive, but is putting national language rules into
the file system the right answer, or might those users be better served
with predictive input or file name completion in shells, or full-text
search, or choices in file selection dialogs and GUIs?[*]

> Now, the situation today may have evolved to the point where many human
> users never really interact directly with file names at all, and if so
> that could alter the balance of the argument somewhat. The complexities
> introduced by non-English (and non-ASCII) character sets may also be
> worth considering in this regard. But where the argument against
> case-sensitivity came from should be crystal clear.

I think that the *practice* of case insensitivity came from historical
constraints on file name composition. The justifications have been
post-facto, and as you suggest, don't work as well in today's non-English,
non-command-line world.

[*] In case it's not obvious, this is the *only* way to go at this stage,
IMO. I have files on my system (downloads) with names that I have no
idea how to type: non-English diacriticals, or Japanese characters.
File-name completion or GUI selection is the only way I can use them, but
of course both of those are the primary interfaces, so it's really no
issue at all.

Cheers,

--
Andrew