Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs) [Kernel]

Prev: [RESEND][REGRESSION] um: CONFIG_STATIC_LINK=y broken
Next: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

From: Chris Mason on 18 Jun 2010 09:50

On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
> Mat wrote:
> >On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin <edward(a)redhat.com> wrote:
> >>Hello everyone.
> >>
> >>I was asked to review/evaluate Btrfs for using in enterprise
> >>systems and the below are my first impressions (linux-2.6.33).
> >>
> >>The first test I have made was filling an empty 659M (/dev/sdb2)
> >>btrfs partition (mounted to /mnt) with 2K files:
> >>
> >># for i in $(seq 1000000); \
> >>do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
> >>(terminated after getting "No space left on device" reports).
> >>
> >># ls /mnt | wc -l
> >>59480
> >>
> >>So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
> >>and the first obvious question is "hey, where are other 83% of my
> >>disk space???" I looked at the btrfs storage tree (fs_tree) and was
> >>shocked with the situation on the leaf level. The Appendix B shows
> >>5 adjacent btrfs leafs, which have the same parent.
> >>
> >>For example, look at the leaf 29425664: "items 1 free space 3892"
> >>(of 4096!!). Note, that this "free" space (3892) is _dead_: any
> >>attempts to write to the file system will result in "No space left
> >>on device".

There are two easy ways to fix this problem. Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents. I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).

>
> It must be a highly unexpected and difficult question for file system
> developers: "how efficiently does your file system manage disk space"?
>
> In the meanwhile I confirm that Btrfs design is completely broken:
> records stored in the B-tree differ greatly from each other (it is
> unacceptable!), and the balancing algorithms have been modified in
> insane manner. All these factors has led to loss of *all* boundaries
> holding internal fragmentation and to exhaustive waste of disk space
> (and memory!) in spite of the property "scaling in their ability to
> address large storage".
>
> This is not a large storage, this is a "scalable sieve": you can not
> rely on finding there some given amount of water even after infinite
> increasing the size of the sieve (read escalating the pool of Btrfs
> devices).
>
> It seems that nobody have reviewed Btrfs before its inclusion to the
> mainline. I have only found a pair of recommendations with a common
> idea that Btrfs maintainer is "not a crazy man". Plus a number of
> papers which admire with the "Btrfs phenomena". Sigh.
>
> Well, let's decide what can we do in current situation..
> The first obvious point here is that we *can not* put such file system
> to production. Just because it doesn't provide any guarantees for our
> users regarding disk space utilization.

Are you basing all of this on inline extents? The other extents of
variable size are more flexible (taking up the room in the leaf), but
they can also easy be split during balancing.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Chris Mason on 18 Jun 2010 11:20

On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote:
> Chris Mason wrote:
> >On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
> >>Mat wrote:
> >>>On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin <edward(a)redhat.com> wrote:
> >>>>Hello everyone.
> >>>>
> >>>>I was asked to review/evaluate Btrfs for using in enterprise
> >>>>systems and the below are my first impressions (linux-2.6.33).
> >>>>
> >>>>The first test I have made was filling an empty 659M (/dev/sdb2)
> >>>>btrfs partition (mounted to /mnt) with 2K files:
> >>>>
> >>>># for i in $(seq 1000000); \
> >>>>do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
> >>>>(terminated after getting "No space left on device" reports).
> >>>>
> >>>># ls /mnt | wc -l
> >>>>59480
> >>>>
> >>>>So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
> >>>>and the first obvious question is "hey, where are other 83% of my
> >>>>disk space???" I looked at the btrfs storage tree (fs_tree) and was
> >>>>shocked with the situation on the leaf level. The Appendix B shows
> >>>>5 adjacent btrfs leafs, which have the same parent.
> >>>>
> >>>>For example, look at the leaf 29425664: "items 1 free space 3892"
> >>>>(of 4096!!). Note, that this "free" space (3892) is _dead_: any
> >>>>attempts to write to the file system will result in "No space left
> >>>>on device".
> >
> >There are two easy ways to fix this problem. Turn off the inline
> >extents (max_inline=0) or allow splitting of the inline extents. I
> >didn't put in the splitting simply because the complexity was high while
> >the benefits were low (in comparison with just turning off the inline
> >extents).
>
> Hello, Chris. Thanks for response!
> I afraid that both ways won't fix the problem. Look at this leaf:
>
> [...]
> leaf 29425664 items 1 free space 3892 generation 8 owner 5
> fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
> chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
> item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
> location key (0 UNKNOWN 0) type 8
> namelen 16 datalen 32 name: security.selinux
> [...]
>
> There is no inline extents, and what are you going to split here?
> All leafs must be at least a half filled, otherwise we loose all
> boundaries, which provides non-zero utilization..

Right, there is no inline extent because we require them to fit entirely
in the leaf. So we end up with mostly empty leaves because the inline
item is large enough to make it difficult to push around but not large
enough to fill the leaf.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Chris Mason on 18 Jun 2010 11:30

On Fri, Jun 18, 2010 at 05:21:21PM +0200, Christian Stroetmann wrote:
> Chris Mason wrote:
> >On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
> >>Mat wrote:
> >>>On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin<edward(a)redhat.com> wrote:
> >>>>Hello everyone.
> >>>>
> >>>>I was asked to review/evaluate Btrfs for using in enterprise
> >>>>systems and the below are my first impressions (linux-2.6.33).
> >>>>
> >>>>The first test I have made was filling an empty 659M (/dev/sdb2)
> >>>>btrfs partition (mounted to /mnt) with 2K files:
> >>>>
> >>>># for i in $(seq 1000000); \
> >>>>do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
> >>>>(terminated after getting "No space left on device" reports).
> >>>>
> >>>># ls /mnt | wc -l
> >>>>59480
> >>>>
> >>>>So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
> >>>>and the first obvious question is "hey, where are other 83% of my
> >>>>disk space???" I looked at the btrfs storage tree (fs_tree) and was
> >>>>shocked with the situation on the leaf level. The Appendix B shows
> >>>>5 adjacent btrfs leafs, which have the same parent.
> >>>>
> >>>>For example, look at the leaf 29425664: "items 1 free space 3892"
> >>>>(of 4096!!). Note, that this "free" space (3892) is _dead_: any
> >>>>attempts to write to the file system will result in "No space left
> >>>>on device".
> >There are two easy ways to fix this problem. Turn off the inline
> >extents (max_inline=0) or allow splitting of the inline extents. I
> >didn't put in the splitting simply because the complexity was high while
> >the benefits were low (in comparison with just turning off the inline
> >extents).
> But then the benefits of splitting must be high, because it solves
> this problem if inline extents are turned on.

It depends, we might also argue that for larger inline extents like this
we are better off with much larger leaves or with using max_inline=0
instead.

> >>It must be a highly unexpected and difficult question for file system
> >>developers: "how efficiently does your file system manage disk space"?
> >>
> >>In the meanwhile I confirm that Btrfs design is completely broken:
> >>records stored in the B-tree differ greatly from each other (it is
> >>unacceptable!), and the balancing algorithms have been modified in
> >>insane manner. All these factors has led to loss of *all* boundaries
> >>holding internal fragmentation and to exhaustive waste of disk space
> >>(and memory!) in spite of the property "scaling in their ability to
> >>address large storage".
> >>
> >>This is not a large storage, this is a "scalable sieve": you can not
> >>rely on finding there some given amount of water even after infinite
> >>increasing the size of the sieve (read escalating the pool of Btrfs
> >>devices).
> >>
> >>It seems that nobody have reviewed Btrfs before its inclusion to the
> >>mainline. I have only found a pair of recommendations with a common
> >>idea that Btrfs maintainer is "not a crazy man". Plus a number of
> >>papers which admire with the "Btrfs phenomena". Sigh.
> >>
> >>Well, let's decide what can we do in current situation..
> >>The first obvious point here is that we *can not* put such file system
> >>to production. Just because it doesn't provide any guarantees for our
> >>users regarding disk space utilization.
> >Are you basing all of this on inline extents? The other extents of
> >variable size are more flexible (taking up the room in the leaf), but
> >they can also easy be split during balancing.
> If we have to split everywhere, hasn't it then some (dramatic)
> impact on the performance of the Btrfs filesystem?
> As it was said above: splitting has a high complexity.

Yes. Both Edward and I have worked on the reiserfs sources where inline
extents were split during balancing. It made for a few surprises in the
file read/write code. They aren't impossible by any means, but I wanted
to avoid them.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Chris Mason on 18 Jun 2010 14:20

On Fri, Jun 18, 2010 at 06:22:39PM +0200, Edward Shishkin wrote:
> Chris Mason wrote:
> >On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote:
> >>Chris Mason wrote:
> >>>On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
> >>>>Mat wrote:
> >>>>>On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin <edward(a)redhat.com> wrote:
> >>>>>>Hello everyone.
> >>>>>>
> >>>>>>I was asked to review/evaluate Btrfs for using in enterprise
> >>>>>>systems and the below are my first impressions (linux-2.6.33).
> >>>>>>
> >>>>>>The first test I have made was filling an empty 659M (/dev/sdb2)
> >>>>>>btrfs partition (mounted to /mnt) with 2K files:
> >>>>>>
> >>>>>># for i in $(seq 1000000); \
> >>>>>>do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
> >>>>>>(terminated after getting "No space left on device" reports).
> >>>>>>
> >>>>>># ls /mnt | wc -l
> >>>>>>59480
> >>>>>>
> >>>>>>So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
> >>>>>>and the first obvious question is "hey, where are other 83% of my
> >>>>>>disk space???" I looked at the btrfs storage tree (fs_tree) and was
> >>>>>>shocked with the situation on the leaf level. The Appendix B shows
> >>>>>>5 adjacent btrfs leafs, which have the same parent.
> >>>>>>
> >>>>>>For example, look at the leaf 29425664: "items 1 free space 3892"
> >>>>>>(of 4096!!). Note, that this "free" space (3892) is _dead_: any
> >>>>>>attempts to write to the file system will result in "No space left
> >>>>>>on device".
> >>>There are two easy ways to fix this problem. Turn off the inline
> >>>extents (max_inline=0) or allow splitting of the inline extents. I
> >>>didn't put in the splitting simply because the complexity was high while
> >>>the benefits were low (in comparison with just turning off the inline
> >>>extents).
> >>Hello, Chris. Thanks for response!
> >>I afraid that both ways won't fix the problem. Look at this leaf:
> >>
> >>[...]
> >>leaf 29425664 items 1 free space 3892 generation 8 owner 5
> >>fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
> >>chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
> >> item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
> >> location key (0 UNKNOWN 0) type 8
> >> namelen 16 datalen 32 name: security.selinux
> >>[...]
> >>
> >>There is no inline extents, and what are you going to split here?
> >>All leafs must be at least a half filled, otherwise we loose all
> >>boundaries, which provides non-zero utilization..
> >
> >Right, there is no inline extent because we require them to fit entirely
> >in the leaf. So we end up with mostly empty leaves because the inline
> >item is large enough to make it difficult to push around but not large
> >enough to fill the leaf.
>
> How about left and right neighbors? They contain a lot of
> free space (1572 and 1901 respectively).
> I am not happy with the very fact of such shallow leafs which
> contain only one small (xattr) item..

Sure, the balancing can also be made more aggressive. This should be
very easy to fix.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Chris Mason on 18 Jun 2010 15:40

On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:
> Jamie Lokier wrote:
> >Edward Shishkin wrote:
> >>If you decide to base your file system on some algorithms then please
> >>use the original ones from proper academic papers. DO NOT modify the
> >>algorithms in solitude: this is very fragile thing! All such
> >>modifications must be reviewed by specialists in the theory of
> >>algorithms. Such review can be done in various scientific magazines of
> >>proper level.
> >>
> >>Personally I don't see any way to improve the situation with Btrfs
> >>except full redesigning the last one. If you want to base your file
> >>system on the paper of Ohad Rodeh, then please, use *exactly* the
> >>Bayer's B-trees that he refers to. That said, make sure that all
> >>records you put to the tree has equal length and all non-root nodes of
> >>your tree are at least half filled.
> >
> >First, thanks Edward for identifying a specific problem with the
> >current btrfs implementation.
>
> Hello Jamie.
>
> >I've studied modified B-trees quite a lot and know enough to be sure
> >that they are quite robust when you modify them in all sorts of ways.
>
> Which property is robust?
>
> >Moreover, you are incorrect to say there's an intrinsic algorithmic
> >problem with variable-length records. It is not true; if Knuth said
> >so, Knuth was mistaken.
>
> I didn't say about intrinsic algorithmic problems :)
> I just repeat (after Knuth et al) that B-trees with variable-length
> records don't
> have any sane boundary for internal fragmentation. The common idea
> is that if we
> don't want Btrfs to be in infinite development stage, then we should
> choose some
> *sane* strategy (for example the paper of Ohad Rodeh) and strictly
> adhere this in
> future.

Again, other than the inline file data, what exactly do you believe
needs to change? Top down balancing vs balancing on insertion doesn't
impact our ability to maintain full leaves. The current code is clearly
choosing not to merge two leaves that it should have merged, which is
just a plain old bug.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

|
Pages: 1
Prev: [RESEND][REGRESSION] um: CONFIG_STATIC_LINK=y broken
Next: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)