From: David Hopwood on
Jason Ozolins wrote:
> The overhead for doing this case insensitive sorting is:
> [jao900(a)armstretcher ~]$ ls -al /usr/lib/locale/en_AU.utf8/LC_COLLATE
> -rw-r--r-- 85 root root 882134 Aug 12 20:59
> /usr/lib/locale/en_AU.utf8/LC_COLLATE
>
> Sure, let's blow out the kernel by 900K to stick this in the filesystem
> layer... ahem. I don't really think that 900K translation tables belong
> in the kernel.

You're confusing collation with case folding. A full Unicode case folding
implementation is nowhere near 900K, or even 50K. Only a handful of
scripts are cased (chiefly Latin, Greek and Cyrillic), and their
encodings are sufficiently regular to allow an algorithmic
implementation -- the resulting code is somewhat inelegant, but fairly
small.
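
As a rough illustration only (a sketch with a made-up function name,
ignoring the handful of special cases such as final sigma), the common
contiguous blocks fold with a few range checks:

    /* Sketch: algorithmic case folding for the common contiguous
     * blocks.  A real implementation adds small exception tables
     * (Latin Extended pairs, final sigma, etc.) -- still a few KB,
     * nowhere near 900K. */
    static unsigned int fold_char(unsigned int c)
    {
        if (c >= 'A' && c <= 'Z')                       /* Basic Latin       */
            return c + 0x20;
        if (c >= 0x00C0 && c <= 0x00DE && c != 0x00D7)  /* Latin-1 capitals  */
            return c + 0x20;
        if (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2)  /* Greek capitals    */
            return c + 0x20;
        if (c >= 0x0410 && c <= 0x042F)                 /* Cyrillic A..Ya    */
            return c + 0x20;
        if (c >= 0x0400 && c <= 0x040F)                 /* Cyrillic Ie..Dzhe */
            return c + 0x50;
        return c;                /* uncased or unhandled: left alone */
    }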

--
David Hopwood <david.nospam.hopwood(a)blueyonder.co.uk>
From: Eric P. on
Bill Todd wrote:
>
> Eric P. wrote:
>
> ...
>
> > I have never built a file system, but it seems to me that the problem
> > with file compression is that a write in the middle of the file
> > will be recompressed and can cause changes to the files' physical
> > block mappings and meta data structures. This in turn updates file
> > system block allocation tables and meta transaction logs.
>
> But since most large files aren't updated in the middle (they're
> typically only appended to if modified at all after creation), that's
> not necessarily a significant problem regardless of the implementation
> (in fact, one could simply incrementally uncompress any such awkward
> files as they were updated).

From a file system design and implementation point of view,
it doesn't matter whether it does happen, just that it might.
Unless, as I suggested, you define compressed files as append only.
Adding a duplicate uncompressed file copy avoids the fragmentation
but at the cost of storing two copies (which kinda defeats the
purpose of compression in the first place), and with all the
extra code to manage two copies.

The point is that just the presence of this feature adds a lot of
complexity to the file system. That, as you suggest, all that
complex code may go largely unused just makes it a bad feature.

And I doubt that it can "simply" incrementally decompress
for the reasons I address below.

> > With normal non-compressed files this only happens when the file is
> > extended. With compressed files every write operation can do this
> > and could bog down the whole system by hammering these critical
> > common data structures.
>
> That's part of what caching and journaling are for: if any such
> structures are truly being hammered they'll just remain in cache while
> multiple compact (logical) updates accumulate in the log which will
> eventually be written back to disk in a single bulk structure update.
> True, that still requires a synchronous log write if the update itself
> is synchronous (though if the relevant data had to be written to the log
> anyway, any metadata changes just piggyback on the same log write), but
> if the system even starts to get bogged down by this then additional
> operations accumulate while waiting for the previous log write to
> complete and get batched together in the next log write. I.e., the
> eventual limit is the sequential bandwidth of the log disks, which at,
> say, 40 MB/sec works out to tens of thousands of updates per second
> before any serious 'bogging down' occurs (and if that's not enough, you
> can stripe the log across multiple disks to increase the bandwidth even
> more).

If by journaling you mean a full blown journalled file system,
the ones that do the circular writing of meta and user data,
then you and I are on completely different wavelengths.

If you mean transaction logged updates for critical file system
structures to prevent corruption on 'unexpected volume dismounts'
then we are at least talking about the same kind of system.

In the latter case, I was under the impression that it was far more
than just a matter of writing the log and caching the rest.
The exact order in which updates of the different data structures
are performed is critical to the ability to roll back or forward.

It seems to me it must control the exact order of:
- tlog writes
- tlog flush and apply
- mft file header write
- block allocation bitmap state changes and writes
- writes to different levels of each file's physical b+tree index
as the buckets split or are updated
- directory entries (and directory file fragmentation)

Some of these updates are circularly dependent,
which introduces the potential for deadlocks.
And of course there are multiple overlapping operations from
different threads taking place at the same time.

So I doubt you can just toss this into the file cache
and hope things will take care of themselves.
(not that I have ever built a file system, though).
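
To be concrete about the ordering I mean (a sketch only, with made-up
names, not any particular file system):

    /* The basic write-ahead rule that ties those orderings together:
     * the log record describing a metadata change must be on disk
     * before the in-place structures it covers (file header, bitmap,
     * mapping tree node, directory block) are modified, so recovery
     * can roll forward or back.  Names below are hypothetical. */
    struct change { int which_structure; /* plus old/new images ... */ };

    void log_append(const struct change *c);   /* buffer a redo/undo record  */
    void log_flush(void);                      /* force buffered records out */
    void update_cached_metadata(const struct change *c);

    void apply_metadata_change(const struct change *c)
    {
        log_append(c);                /* 1. describe the change            */
        log_flush();                  /* 2. make the log record durable    */
        update_cached_metadata(c);    /* 3. only now dirty the real        */
                                      /*    structures; their write-back   */
                                      /*    to disk can be lazy/batched    */
    }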

> > It also serializes all file operations while the meta data is being
> > diddled.
>
> Not at all.

That was my understanding, but I am happy to learn otherwise.
On a normal, non compressed file, file extend is the only operation
that updates the physical map tree. While it is in a state of flux
it is unavailable for other accesses.

Normally this is a simple test of whether the write buffer is past EOF.
If so, hold the operations queue while the file extends because
the entire index structure can be affected. Otherwise it need
only serialize on read-write and write-write buffer overlaps.

However with compressed files, this can happen on any write.
Also it looks to me like this would occur even if you had two
data streams, one compressed and one decompressed, because you
always have to check the decompressed stream for more recent
updates before you access the compressed one.
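
In other words (a sketch with invented names, just to show where the
extra check lands):

    #include <sys/types.h>   /* off_t, ssize_t */
    #include <stddef.h>      /* size_t */

    struct file;   /* opaque; hypothetical */
    int     overlay_covers(struct file *f, off_t off, size_t len);
    ssize_t overlay_read(struct file *f, void *buf, size_t len, off_t off);
    ssize_t compressed_read(struct file *f, void *buf, size_t len, off_t off);

    /* Every read of a "two stream" compressed file pays for the
     * overlay check, and every overwrite may still force a remap
     * when the overlay is eventually folded back into the compressed
     * extents.  (Partial overlaps ignored for brevity.) */
    ssize_t file_read(struct file *f, void *buf, size_t len, off_t off)
    {
        if (overlay_covers(f, off, len))      /* newer uncompressed data? */
            return overlay_read(f, buf, len, off);
        return compressed_read(f, buf, len, off);
    }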

> > However until you read the current data, decompress, update
> > new data and recompress, you cannot tell whether the compressed
> > buffer will expand or contract, and what mapping changes are needed.
> > If the file meta structure is affected it forces all operations
> > to serialize unless you want to go for a concurrent b+tree update
> > mechanism which is probably an order of magnitude more complicated.
>
> Hey, if you're going to support really large files you'll almost
> *always* need *something* like a b+ tree to map them, and you'll need to
> allow concurrent operations on it. So there's nothing to be saved here:
> just bite the bullet and go for it.

Ok, well, suit yourself. _I_ would not do that because I think there
is plenty to be saved by dropping all the questionable features.

I would take the approach that unless compressed files can be
implemented in a manner that does not significantly expand the cost,
they just are not worth the trouble.

Eric

From: Bill Todd on
Eric P. wrote:
> Bill Todd wrote:
>> Eric P. wrote:
>>
>> ...
>>
>>> I have never built a file system, but it seems to me that the problem
>>> with file compression is that a write in the middle of the file
>>> will be recompressed and can cause changes to the files' physical
>>> block mappings and meta data structures. This in turn updates file
>>> system block allocation tables and meta transaction logs.
>> But since most large files aren't updated in the middle (they're
>> typically only appended to if modified at all after creation), that's
>> not necessarily a significant problem regardless of the implementation
>> (in fact, one could simply incrementally uncompress any such awkward
>> files as they were updated).
>
> From a file system design and implementation point of view,
> it doesn't matter whether it does happen, just that it might.
> Unless, as I suggested, you define compressed files as append only.
> Adding a duplicate uncompressed file copy avoids the fragmentation
> but at the cost of storing two copies (which kinda defeats the
> purpose of compression in the first place), and with all the
> extra code to manage two copies.
>
> The point is that just the presence of this feature adds a lot of
> complexity to the file system.

Not really. Given support for extent-based mapping, all that the
approach I sketched above really requires (in the rare case when a large,
compressed file *is* updated in the middle) is breaking out the updated
compressed portion into a new, uncompressed extent: anything beyond
that is background defragmentation/performance optimization (which is
also desirable for other reasons, so can't be chalked up to this
particular one).
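
Roughly (a sketch with invented names, not code from any file system
mentioned here):

    #include <sys/types.h>   /* off_t */
    #include <stddef.h>      /* size_t */

    typedef unsigned long blkno_t;     /* hypothetical disk block number */

    /* Each extent is independently marked compressed or plain. */
    struct extent {
        off_t   file_off;    /* logical offset within the file */
        off_t   length;      /* logical length covered         */
        blkno_t disk_start;  /* first allocated block on disk  */
        int     compressed;  /* 0 = stored raw, 1 = compressed */
    };

    struct file;
    struct extent *extent_lookup(struct file *f, off_t off);
    void extent_split_out(struct file *f, struct extent *e,
                          off_t off, size_t len, int compressed);
    void write_raw(struct file *f, off_t off, size_t len, const void *data);

    /* An overwrite inside a compressed extent is broken out into a
     * new raw extent; recompressing/merging it later is a background
     * defragmentation job, not part of the foreground write. */
    void overwrite(struct file *f, off_t off, size_t len, const void *data)
    {
        struct extent *e = extent_lookup(f, off);
        if (e->compressed)
            extent_split_out(f, e, off, len, /*compressed=*/0);
        write_raw(f, off, len, data);
    }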

> That, as you suggest, all that
> complex code may go largely unused just makes it a bad feature.

It might, if your premise were correct.

>
> And I doubt that it can "simply" incrementally decompress
> for the reasons I address below.

It is certainly possible that we have different ideas about what is
'simple'.

>
>>> With normal non-compressed files this only happens when the file is
>>> extended. With compressed files every write operation can do this
>>> and could bog down the whole system by hammering these critical
>>> common data structures.
>> That's part of what caching and journaling are for: if any such
>> structures are truly being hammered they'll just remain in cache while
>> multiple compact (logical) updates accumulate in the log which will
>> eventually be written back to disk in a single bulk structure update.
>> True, that still requires a synchronous log write if the update itself
>> is synchronous (though if the relevant data had to be written to the log
>> anyway, any metadata changes just piggyback on the same log write), but
>> if the system even starts to get bogged down by this then additional
>> operations accumulate while waiting for the previous log write to
>> complete and get batched together in the next log write. I.e., the
>> eventual limit is the sequential bandwidth of the log disks, which at,
>> say, 40 MB/sec works out to tens of thousands of updates per second
>> before any serious 'bogging down' occurs (and if that's not enough, you
>> can stripe the log across multiple disks to increase the bandwidth even
>> more).
>
> If by journaling you mean a full blown journalled file system,
> the ones that do the circular writing of meta and user data,
> then you and I are on completely different wavelengths.

That's hardly a surprise to me - though you may be confusing 'a
full-blown journaled file system' with what is properly termed a
log-structured file system (it's difficult to know from your wording).

>
> If you mean transaction logged updates for critical file system
> structures to prevent corruption on 'unexpected volume dismounts'
> then we are at least talking about the same kind of system.

I'm afraid that's always struck me as a kind of half-assed solution that
saves little complexity (in some cases, it can actually *add* some)
while sacrificing considerable robustness. It does have the saving
grace of being easier to retrofit to an existing unjournaled
implementation, but that's about it.

>
> In the latter case, I was under the impression that it was far more
> than just a matter of writing the log and caching the rest.

It's *always* far more than just a matter of writing the log and caching
the rest.

> The exact order in which updates of the different data structures
> are performed is critical to the ability to roll back or forward.
>
> It seems to me it must control the exact order of:
> - tlog writes
> - tlog flush and apply
> - mft file header write
> - block allocation bitmap state changes and writes
> - writes to different levels of each file's physical b+tree index
> as the buckets split or are updated
> - directory entries (and directory file fragmentation)
>
> Some of these updates are circularly dependent,

Not in any competent design.

> which introduces the potential for deadlocks.

Er, no (though there are certainly plenty of *other* potential deadlocks
to worry about).

> And of course there are multiple overlapping operations from
> different threads taking place at the same time.

Well, yuh...

>
> So I doubt you can just toss this into the file cache
> and hope things will take care of themselves.

I've already agreed with that.

> (not that I have ever built a file system, though).

I have - three, in fact, plus a couple more that progressed
significantly through design but were never implemented.

>
>>> It also serializes all file operations while the meta data is being
>>> diddled.
>> Not at all.
>
> That was my understanding, but I am happy to learn otherwise.
> On a normal, non compressed file, file extend is the only operation
> that updates the physical map tree. While it is in a state of flux
> it is unavailable for other accesses.

Funny - you seem to be describing a specific implementation with
specific limitations as if it were some kind of natural law, rather than
what it is possible to implement.

Apparently having some VMS background, surely you understand that when
an RMS data bucket in an indexed file splits it does not make anything
like the entire file unavailable for other modifications (let alone
other accesses): it ties up only that bucket, and only for the amount
of time the split takes.
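
Schematically (hypothetical names, nothing RMS-specific):

    #include <pthread.h>

    struct bucket {
        pthread_rwlock_t lock;   /* one reader/writer lock per bucket */
        /* ... keys, records, sibling/child pointers ... */
    };

    int  bucket_full(const struct bucket *b);
    void split_bucket(struct bucket *parent, struct bucket *leaf);
    void add_record(struct bucket *leaf /* , key, record */);

    /* A split write-locks only the leaf and its parent; readers and
     * writers of every other bucket proceed untouched. */
    void bucket_insert(struct bucket *parent, struct bucket *leaf)
    {
        pthread_rwlock_wrlock(&parent->lock);
        pthread_rwlock_wrlock(&leaf->lock);
        if (bucket_full(leaf))
            split_bucket(parent, leaf);
        else
            add_record(leaf);
        pthread_rwlock_unlock(&leaf->lock);
        pthread_rwlock_unlock(&parent->lock);
    }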
From: prep on
"Dennis Ritchie" <dmr(a)bell-labs.com> writes:

> <prep(a)prep.synonet.com> wrote in message news:87mz8ncwlj.fsf(a)k9.prep.synonet.com...

>> None of any worth IMO. But case smashing to provide a case blind
>> name space takes code, and would not fit into a PDP7/11 address
>> space.

> Nonsense. Keeping the case the user specified was a choice.
> Case-squashing would be a very few instructions.

OK, so why? And why allow almost ANY char in file names? What was the
thinking behind the choice?

--
Paul Repacholi 1 Crescent Rd.,
+61 (08) 9257-1001 Kalamunda.
West Australia 6076
comp.os.vms,- The Older, Grumpier Slashdot
Raw, Cooked or Well-done, it's all half baked.
EPIC, The Architecture of the future, always has been, always will be.
From: Bill Todd on
prep(a)prep.synonet.com wrote:
> "Dennis Ritchie" <dmr(a)bell-labs.com> writes:
>
>> <prep(a)prep.synonet.com> wrote in message news:87mz8ncwlj.fsf(a)k9.prep.synonet.com...
>
>>> None of any worth IMO. But case smashing to provide a case blind
>>> name space takes code, and would not fit into a PDP7/11 address
>>> space.
>
>> Nonsense. Keeping the case the user specified was a choice.
>> Case-squashing would be a very few instructions.
>
> OK, so why? And why allow almost ANY char in file names? What was the
> thinking behind the choice?

Perhaps the same thinking that goes with supporting almost any character
in a b-tree key value: generality. I'm usually sympathetic to that
idea - it's only the real pain that it has caused over the years in the
specific case of file names (which tend to be far more
naive/sloppy-user-visible than values used in program-driven b-tree
lookups) that makes me inclined to treat them specially.

So the argument that the advent of largely GUI-based user file handling
changes things resonates with me a lot: it demotes raw file name
manipulation to the same program-only access status that b-tree keys
enjoy, and allows one to prettify them at a higher level.

It also significantly simplifies the file system code these days, now
that 7-(or at worst 8-)bit ASCII has been replaced by Unicode: don't
worry about things like collating sequences at all (i.e., just treat the
mess as a simple byte string) - let the GUI handle appropriate sorting
if it wants to for presentation purposes.
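
In code terms (a sketch only, not any particular system), the division
of labour looks like:

    #include <string.h>
    #include <locale.h>

    /* File-system layer: a name is an opaque byte string; directory
     * lookup and indexing use plain byte comparison -- no case
     * folding or collation tables anywhere near the kernel. */
    int fs_name_cmp(const unsigned char *a, size_t alen,
                    const unsigned char *b, size_t blen)
    {
        size_t n = alen < blen ? alen : blen;
        int r = memcmp(a, b, n);
        if (r)
            return r;
        return (alen > blen) - (alen < blen);
    }

    /* Presentation layer (GUI, ls, ...): sort for display using the
     * user's locale; the big collation table lives out here in user
     * space, if anyone wants it at all. */
    int display_name_cmp(const char *a, const char *b)
    {
        return strcoll(a, b);   /* honours setlocale(LC_COLLATE, ...) */
    }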

I've almost managed to convince myself that case-sensitive file names
make sense today even though they emphatically did not a couple of
decades ago: please, please disabuse me of that seductive notion if I'm
not being objective.

- bill