Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs) [Kernel]

Prev: Query dw_dmac driver: All descriptors done, but channel not idle!
Next: [RESEND][REGRESSION] um: CONFIG_STATIC_LINK=y broken

From: Edward Shishkin on 18 Jun 2010 15:40

Jamie Lokier wrote:
> Edward Shishkin wrote:
>
>> If you decide to base your file system on some algorithms then please
>> use the original ones from proper academic papers. DO NOT modify the
>> algorithms in solitude: this is very fragile thing! All such
>> modifications must be reviewed by specialists in the theory of
>> algorithms. Such review can be done in various scientific magazines of
>> proper level.
>>
>> Personally I don't see any way to improve the situation with Btrfs
>> except full redesigning the last one. If you want to base your file
>> system on the paper of Ohad Rodeh, then please, use *exactly* the
>> Bayer's B-trees that he refers to. That said, make sure that all
>> records you put to the tree has equal length and all non-root nodes of
>> your tree are at least half filled.
>>
>
> First, thanks Edward for identifying a specific problem with the
> current btrfs implementation.
>

Hello Jamie.

> I've studied modified B-trees quite a lot and know enough to be sure
> that they are quite robust when you modify them in all sorts of ways.
>

Which property is robust?

> Moreover, you are incorrect to say there's an intrinsic algorithmic
> problem with variable-length records. It is not true; if Knuth said
> so, Knuth was mistaken.
>

I didn't say about intrinsic algorithmic problems :)
I just repeat (after Knuth et al) that B-trees with variable-length
records don't
have any sane boundary for internal fragmentation. The common idea is
that if we
don't want Btrfs to be in infinite development stage, then we should
choose some
*sane* strategy (for example the paper of Ohad Rodeh) and strictly
adhere this in
future.

> This is easily shown by modelling variable-length records (key ->
> string) as a range of fixed length records ([key,index] -> byte) and
> apply the standard B-tree algorithms to that, which guarantees
> algorithm properties such as space utilisation and time; then you can
> substitute a "compressed" representation of contiguous index runs,
> which amounts to nothing more than just storing the strings (split
> where the algorithm says to do so) and endpoint indexes , and because
> this compression does not expand (in any way that matters), classic
> algorithmic properties are still guaranteed.
>
> Variable-length keys are a different business. Those are trickier,
> but as far as I know, btrfs doesn't use them.
>
>
>> As to current Btrfs code: *NOT ACK*!!! I don't think we need such
>> "file systems".
>>
>
> Btrfs provides many useful features that other filesystems don't. We
> definitely need it, or something like it. You have identified a bug.
> It's not a corruption bug, but it's definitely a bug, and probably
> affects performance as well as space utilisation.
>
> It is not deep design bug; it is just a result of the packing and
> balancing heuristics.
>

Frankly, I would like to believe to such end, taking into account amount
of my
contributions to the Btrfs project. At least to make sure I didn't do
useless
work..

> If you are still interested, please apply your knowledge of B-tree
> algorithms to understanding why btrfs fails to balance the tree
> sufficiently well,

Because of top-down balancing. It doesn't like "clever" things like
"splitting"
and "merging". Currently top-down works properly only with stupid classic
Bayer's B-trees.

> and then propose a fix.
>

I'll try to help, but I am rather pessimistic here: working out
algorithms is
something, which doesn't like timelines..

> Note that it's not necessarily a problem to have a few nodes with low
> utilisation. Deliberate violation of the classic balancing heuristic
> is often useful for performance.[*]

Ok, let's violate this, but not in the detriment of utilisation:
Internal fragmentation is a horror for file systems, the enemy #1
(which is completely defeated in the last century BTW).

> The problem you've found is only a
> real problem when there are _too many_ nodes with low utilisation.
>

IMHO this is a problem, as we can not estimate number of such nodes.
Do you have any sane upper boundary for this number? I don't know such one.
Maybe I have missed something?

> [*] For example when filling a tree by inserting contiguously
> ascending keys, the classic "split into two when full" heuristic gives
> the worst possible results (50% lost space), and deliberately
> underfilling the most actively updated nodes, which is not permitted
> at all by the classic algorithm, gives denser packing in the end
> (almost zero lost space).

At the end of what? I hope you don't speak about on-line defragmentation?
Can you point me to the paper (if any)?

Thanks!

> It's also faster. The trick is to make
> sure there's just the right number of underfilled nodes...
>
> -- Jamie
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo(a)vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Ric Wheeler on 25 Jun 2010 15:10

On 06/24/2010 06:06 PM, Daniel Taylor wrote:
>
>
>
>> -----Original Message-----
>> From: mikefedyk(a)gmail.com [mailto:mikefedyk(a)gmail.com] On
>> Behalf Of Mike Fedyk
>> Sent: Wednesday, June 23, 2010 9:51 PM
>> To: Daniel Taylor
>> Cc: Daniel J Blueman; Mat; LKML;
>> linux-fsdevel(a)vger.kernel.org; Chris Mason; Ric Wheeler;
>> Andrew Morton; Linus Torvalds; The development of BTRFS
>> Subject: Re: Btrfs: broken file system design (was Unbound(?)
>> internal fragmentation in Btrfs)
>>
>> On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
>> <Daniel.Taylor(a)wdc.com> wrote:
>>
>>> Just an FYI reminder. The original test (2K files) is utterly
>>> pathological for disk drives with 4K physical sectors, such as
>>> those now shipping from WD, Seagate, and others. Some of the
>>> SSDs have larger (16K0 or smaller blocks (2K). There is also
>>> the issue of btrfs over RAID (which I know is not entirely
>>> sensible, but which will happen).
>>>
>>> The absolute minimum allocation size for data should be the same
>>> as, and aligned with, the underlying disk block size. If that
>>> results in underutilization, I think that's a good thing for
>>> performance, compared to read-modify-write cycles to update
>>> partial disk blocks.
>>>
>> Block size = 4k
>>
>> Btrfs packs smaller objects into the blocks in certain cases.
>>
>>
> As long as no object smaller than the disk block size is ever
> flushed to media, and all flushed objects are aligned to the disk
> blocks, there should be no real performance hit from that.
>
> Otherwise we end up with the damage for the ext[234] family, where
> the file blocks can be aligned, but the 1K inode updates cause
> the read-modify-write (RMW) cycles and and cost>10% performance
> hit for creation/update of large numbers of files.
>
> An RMW cycle costs at least a full rotation (11 msec on a 5400 RPM
> drive), which is painful.
>

Also interesting is to note that you can get a significant overheard
even with 0 byte length files. Path names, metadata overhead, etc can
consume (depending on the pathname length) quite a bit of space per file.

Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Michael Tokarev on 26 Jun 2010 01:20

25.06.2010 22:58, Ric Wheeler wrote:
> On 06/24/2010 06:06 PM, Daniel Taylor wrote:
[]
>>> On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
>>> <Daniel.Taylor(a)wdc.com> wrote:
>>>
>>>> Just an FYI reminder. The original test (2K files) is utterly
>>>> pathological for disk drives with 4K physical sectors, such as
>>>> those now shipping from WD, Seagate, and others. Some of the
>>>> SSDs have larger (16K0 or smaller blocks (2K). There is also
>>>> the issue of btrfs over RAID (which I know is not entirely
>>>> sensible, but which will happen).

Why it is not sensible to use btrfs on raid devices?
Nowadays raid is just everywhere, from 'fakeraid' on AHCI to
large external arrays on iSCSI-attached storage. Sometimes
it is nearly imposisble to _not_ use RAID, -- many servers
comes with a built-in RAID card which can't be turned off or
disabled. And hardware raid is faster (at least in theory)
at least because it puts less load on various system busses.

To many "enterprise folks" a statement "we don't need hw raid,
we have better solution" sounds like "we're just a toy, don't
use".

Hmm? ;)

/mjt, who always used and preferred _software_ raid due to
multiple reasons, and never used btrfs so far.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Ric Wheeler on 26 Jun 2010 08:00

On 06/26/2010 01:18 AM, Michael Tokarev wrote:
> 25.06.2010 22:58, Ric Wheeler wrote:
>
>> On 06/24/2010 06:06 PM, Daniel Taylor wrote:
>>
> []
>
>>>> On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
>>>> <Daniel.Taylor(a)wdc.com> wrote:
>>>>
>>>>
>>>>> Just an FYI reminder. The original test (2K files) is utterly
>>>>> pathological for disk drives with 4K physical sectors, such as
>>>>> those now shipping from WD, Seagate, and others. Some of the
>>>>> SSDs have larger (16K0 or smaller blocks (2K). There is also
>>>>> the issue of btrfs over RAID (which I know is not entirely
>>>>> sensible, but which will happen).
>>>>>
> Why it is not sensible to use btrfs on raid devices?
> Nowadays raid is just everywhere, from 'fakeraid' on AHCI to
> large external arrays on iSCSI-attached storage. Sometimes
> it is nearly imposisble to _not_ use RAID, -- many servers
> comes with a built-in RAID card which can't be turned off or
> disabled. And hardware raid is faster (at least in theory)
> at least because it puts less load on various system busses.
>
> To many "enterprise folks" a statement "we don't need hw raid,
> we have better solution" sounds like "we're just a toy, don't
> use".
>
> Hmm? ;)
>
> /mjt, who always used and preferred _software_ raid due to
> multiple reasons, and never used btrfs so far.
>

Absolutely no reason that you would not use btrfs on hardware raid
volumes (or software RAID for that matter).

Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev |
Pages: 1 2
Prev: Query dw_dmac driver: All descriptors done, but channel not idle!
Next: [RESEND][REGRESSION] um: CONFIG_STATIC_LINK=y broken