From: Edward Shishkin
Chris Mason wrote:
> On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:
>
>> I'll reproduce from your test case and provide a fix. mount -o
>> max_inline=1500 would give us 50% usage in the worst case

This is a very strange statement: how did you calculate this lower bound?

>> (minus the
>> balancing bug you hit).
>>
>
> Ok, the balancing bug was interesting. What happens is we create all
> the inodes and directory items nicely packed into the leaves.
>
> Then delayed allocation kicks in and starts inserting the big fat inline
> extents. This often forces the balancing code to split a leaf twice in
> order to make room for the new inline extent. The double split code
> wasn't balancing the bits that were left over on either side of our
> desired slot.
>
> The patch below fixes the problem for me, reducing the number of leaves
> that have more than 2K of free space down from 6% of the total to about
> 74 leaves. It could be zero, but the balancing code doesn't push
> items around if our leaf is in the first or last slot of the parent
> node (complexity vs benefit tradeoff).
>

Nevertheless, I see leaves, which are not in the first or last slot,
but mergeable with their neighbors (test case is the same):

leaf 269557760 items 22 free space 2323 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231

leaf 280027136 items 18 free space 2627 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231

node 269549568 level 1 items 60 free 61 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231
key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15
key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15
key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15
key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17
key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17
key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17
key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25
key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25
key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25
key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25
key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25
key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25
key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25
key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16
key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16
key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16
[...]

> With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
> That doesn't sound like a huge improvement, but the default from
> mkfs.btrfs is to duplicate metadata. After duplication, that's about
> 417MB or about 40% of the overall space.
>
> When you factor in the space that we reserve to avoid exploding on
> enospc and the space that we have allocated for data extents, that's not
> a very surprising number.
>
> I'm still putting this patch through more testing, the double split code
> is used in some difficult corners and I need to make sure I've tried
> all of them.
>

Chris, for the further code review we need documents, which reflect
the original ideas of the balancing, etc. Could you please provide them?
Obviously, I can not do it instead of you, as I don't know those ideas.

Thanks,
Edward.

> -chris
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 0d1d966..1f393b0 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -2304,12 +2304,17 @@ noinline int btrfs_leaf_free_space(struct btrfs_root *root,
> return ret;
> }
>
> +/*
> + * min slot controls the lowest index we're willing to push to the
> + * right. We'll push up to and including min_slot, but no lower
> + */
> static noinline int __push_leaf_right(struct btrfs_trans_handle *trans,
> struct btrfs_root *root,
> struct btrfs_path *path,
> int data_size, int empty,
> struct extent_buffer *right,
> - int free_space, u32 left_nritems)
> + int free_space, u32 left_nritems,
> + u32 min_slot)
> {
> struct extent_buffer *left = path->nodes[0];
> struct extent_buffer *upper = path->nodes[1];
> @@ -2327,7 +2332,7 @@ static noinline int __push_leaf_right(struct btrfs_trans_handle *trans,
> if (empty)
> nr = 0;
> else
> - nr = 1;
> + nr = max_t(u32, 1, min_slot);
>
> if (path->slots[0] >= left_nritems)
> push_space += data_size;
> @@ -2469,10 +2474,13 @@ out_unlock:
> *
> * returns 1 if the push failed because the other node didn't have enough
> * room, 0 if everything worked out and < 0 if there were major errors.
> + *
> + * this will push starting from min_slot to the end of the leaf. It won't
> + * push any slot lower than min_slot
> */
> static int push_leaf_right(struct btrfs_trans_handle *trans, struct btrfs_root
> *root, struct btrfs_path *path, int data_size,
> - int empty)
> + int empty, u32 min_slot)
> {
> struct extent_buffer *left = path->nodes[0];
> struct extent_buffer *right;
> @@ -2515,7 +2523,7 @@ static int push_leaf_right(struct btrfs_trans_handle *trans, struct btrfs_root
> goto out_unlock;
>
> return __push_leaf_right(trans, root, path, data_size, empty,
> - right, free_space, left_nritems);
> + right, free_space, left_nritems, min_slot);
> out_unlock:
> btrfs_tree_unlock(right);
> free_extent_buffer(right);
> @@ -2525,12 +2533,17 @@ out_unlock:
> /*
> * push some data in the path leaf to the left, trying to free up at
> * least data_size bytes. returns zero if the push worked, nonzero otherwise
> + *
> + * max_slot can put a limit on how far into the leaf we'll push items. The
> + * item at 'max_slot' won't be touched. Use (u32)-1 to make us do all the
> + * items
> */
> static noinline int __push_leaf_left(struct btrfs_trans_handle *trans,
> struct btrfs_root *root,
> struct btrfs_path *path, int data_size,
> int empty, struct extent_buffer *left,
> - int free_space, int right_nritems)
> + int free_space, u32 right_nritems,
> + u32 max_slot)
> {
> struct btrfs_disk_key disk_key;
> struct extent_buffer *right = path->nodes[0];
> @@ -2549,9 +2562,9 @@ static noinline int __push_leaf_left(struct btrfs_trans_handle *trans,
> slot = path->slots[1];
>
> if (empty)
> - nr = right_nritems;
> + nr = min(right_nritems, max_slot);
> else
> - nr = right_nritems - 1;
> + nr = min(right_nritems - 1, max_slot);
>
> for (i = 0; i < nr; i++) {
> item = btrfs_item_nr(right, i);
> @@ -2712,10 +2725,14 @@ out:
> /*
> * push some data in the path leaf to the left, trying to free up at
> * least data_size bytes. returns zero if the push worked, nonzero otherwise
> + *
> + * max_slot can put a limit on how far into the leaf we'll push items. The
> + * item at 'max_slot' won't be touched. Use (u32)-1 to make us push all the
> + * items
> */
> static int push_leaf_left(struct btrfs_trans_handle *trans, struct btrfs_root
> *root, struct btrfs_path *path, int data_size,
> - int empty)
> + int empty, u32 max_slot)
> {
> struct extent_buffer *right = path->nodes[0];
> struct extent_buffer *left;
> @@ -2762,7 +2779,8 @@ static int push_leaf_left(struct btrfs_trans_handle *trans, struct btrfs_root
> }
>
> return __push_leaf_left(trans, root, path, data_size,
> - empty, left, free_space, right_nritems);
> + empty, left, free_space, right_nritems,
> + max_slot);
> out:
> btrfs_tree_unlock(left);
> free_extent_buffer(left);
> @@ -2855,6 +2873,59 @@ static noinline int copy_for_split(struct btrfs_trans_handle *trans,
> }
>
> /*
> + * double splits happen when we need to insert a big item in the middle
> + * of a leaf. A double split can leave us with 3 mostly empty leaves:
> + * leaf: [ slots 0 - N] [ our target ] [ N + 1 - total in leaf ]
> + * A B C
> + *
> + * We avoid this by trying to push the items on either side of our target
> + * into the adjacent leaves. If all goes well we can avoid the double split
> + * completely.
> + */
> +static noinline int push_for_double_split(struct btrfs_trans_handle *trans,
> + struct btrfs_root *root,
> + struct btrfs_path *path)
> +{
> + int ret;
> + int progress = 0;
> + int slot;
> +
> + slot = path->slots[0];
> +
> + /*
> + * try to push all the items after our slot into the
> + * right leaf
> + */
> + ret = push_leaf_right(trans, root, path, 1, 0, slot);
> + if (ret < 0)
> + return ret;
> +
> + if (ret == 0)
> + progress++;
> +
> + /*
> + * our goal is to get our slot at the start or end of a leaf. If
> + * we've done so we're done
> + */
> + if (path->slots[0] == 0 ||
> + path->slots[0] == btrfs_header_nritems(path->nodes[0]))
> + return 0;
> +
> + /* try to push all the items before our slot into the next leaf */
> + slot = path->slots[0];
> + ret = push_leaf_left(trans, root, path, 1, 0, slot);
> + if (ret < 0)
> + return ret;
> +
> + if (ret == 0)
> + progress++;
> +
> + if (progress)
> + return 0;
> + return 1;
> +}
> +
> +/*
> * split the path's leaf in two, making sure there is at least data_size
> * available for the resulting leaf level of the path.
> *
> @@ -2876,6 +2947,7 @@ static noinline int split_leaf(struct btrfs_trans_handle *trans,
> int wret;
> int split;
> int num_doubles = 0;
> + int tried_avoid_double = 0;
>
> l = path->nodes[0];
> slot = path->slots[0];
> @@ -2884,12 +2956,13 @@ static noinline int split_leaf(struct btrfs_trans_handle *trans,
> return -EOVERFLOW;
>
> /* first try to make some room by pushing left and right */
> - if (data_size && ins_key->type != BTRFS_DIR_ITEM_KEY) {
> - wret = push_leaf_right(trans, root, path, data_size, 0);
> + if (data_size) {
> + wret = push_leaf_right(trans, root, path, data_size, 0, 0);
> if (wret < 0)
> return wret;
> if (wret) {
> - wret = push_leaf_left(trans, root, path, data_size, 0);
> + wret = push_leaf_left(trans, root, path,
> + data_size, 0, (u32)-1);
> if (wret < 0)
> return wret;
> }
> @@ -2923,6 +2996,12 @@ again:
> if (mid != nritems &&
> leaf_space_used(l, mid, nritems - mid) +
> data_size > BTRFS_LEAF_DATA_SIZE(root)) {
> + if (!tried_avoid_double) {
> + push_for_double_split(trans,
> + root, path);
> + tried_avoid_double = 1;
> + goto again;
> + }
> split = 2;
> }
> }
> @@ -2939,6 +3018,12 @@ again:
> if (mid != nritems &&
> leaf_space_used(l, mid, nritems - mid) +
> data_size > BTRFS_LEAF_DATA_SIZE(root)) {
> + if (!tried_avoid_double) {
> + push_for_double_split(trans,
> + root, path);
> + tried_avoid_double = 1;
> + goto again;
> + }
> split = 2 ;
> }
> }
> @@ -3915,13 +4000,14 @@ int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> extent_buffer_get(leaf);
>
> btrfs_set_path_blocking(path);
> - wret = push_leaf_left(trans, root, path, 1, 1);
> + wret = push_leaf_left(trans, root, path, 1, 1, (u32)-1);
> if (wret < 0 && wret != -ENOSPC)
> ret = wret;
>
> if (path->nodes[0] == leaf &&
> btrfs_header_nritems(leaf)) {
> - wret = push_leaf_right(trans, root, path, 1, 1);
> + wret = push_leaf_right(trans, root, path,
> + 1, 1, 0);
> if (wret < 0 && wret != -ENOSPC)
> ret = wret;
> }
>


--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech

From: Chris Mason
On Tue, Jun 22, 2010 at 04:12:57PM +0200, Edward Shishkin wrote:
> Chris Mason wrote:
> >On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:
> >>I'll reproduce from your test case and provide a fix. mount -o
> >>max_inline=1500 would give us 50% usage in the worst case
>
> This is a very strange statement: how did you calculate this lower bound?

We want room for the extent and the inode item and the inode backref.
It's a rough number that leaves some extra room. But even with a 2K
item we're getting very close to 50% usage of the metadata area.
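
As a rough back-of-the-envelope sketch (the per-item sizes below are
approximations, not exact on-disk numbers):

    4K leaf minus ~100 byte header              ~4000 bytes of item space
    per file with a 1500 byte inline extent:
        inode item                              ~160 bytes
        inode ref                               ~20 bytes
        file extent item header + inline data   ~21 + 1500 bytes
        3 item headers (~25 bytes each)         ~75 bytes
                                                ----------
                                                ~1780 bytes

Two such files fit in one leaf, and a leaf left holding only one of
them after a split is still close to half full.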

>
> >> (minus the
> >>balancing bug you hit).
> >
> >Ok, the balancing bug was interesting. What happens is we create all
> >the inodes and directory items nicely packed into the leaves.
> >
> >Then delayed allocation kicks in and starts inserting the big fat inline
> >extents. This often forces the balancing code to split a leaf twice in
> >order to make room for the new inline extent. The double split code
> >wasn't balancing the bits that were left over on either side of our
> >desired slot.
> >
> >The patch below fixes the problem for me, reducing the number of leaves
> >that have more than 2K of free space down from 6% of the total to about
> >74 leaves. It could be zero, but the balancing code doesn't push
> >items around if our leaf is in the first or last slot of the parent
> >node (complexity vs benefit tradeoff).
>
> Nevertheless, I see leaves, which are not in the first or last slot,
> but mergeable with their neighbors (test case is the same):

Correct, but it was in the first or last slot when it was balanced (I
confirmed this with printk).

Then the higher layers were balanced and our leaves were no longer in
the first/last slot. We don't rebalance leaves when we balance level 1.

>
> leaf 269557760 items 22 free space 2323 generation 25 owner 2
> fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> chunk uuid b1674882-a445-45f9-b250-0985e483d231
>
> leaf 280027136 items 18 free space 2627 generation 25 owner 2
> fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> chunk uuid b1674882-a445-45f9-b250-0985e483d231
>
> node 269549568 level 1 items 60 free 61 generation 25 owner 2
> fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> chunk uuid b1674882-a445-45f9-b250-0985e483d231
> key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15
> key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15
> key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15
> key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17
> key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17
> key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17
> key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25
> key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25
> key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25
> key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25
> key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25
> key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25
> key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25
> key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16
> key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16
> key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16
> [...]
>
> >With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
> >That doesn't sound like a huge improvement, but the default from
> >mkfs.btrfs is to duplicate metadata. After duplication, that's about
> >417MB or about 40% of the overall space.
> >
> >When you factor in the space that we reserve to avoid exploding on
> >enospc and the space that we have allocated for data extents, that's not
> >a very surprising number.
> >
> >I'm still putting this patch through more testing, the double split code
> >is used in some difficult corners and I need to make sure I've tried
> >all of them.
>
> Chris, for the further code review we need documents, which reflect
> the original ideas of the balancing, etc. Could you please provide them?
> Obviously, I can not do it instead of you, as I don't know those ideas.
>

Which part are you most interested in?

-chris
From: Edward Shishkin
Chris Mason wrote:
> On Tue, Jun 22, 2010 at 04:12:57PM +0200, Edward Shishkin wrote:
>
>> Chris Mason wrote:
>>
>>> On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:
>>>
>>>> I'll reproduce from your test case and provide a fix. mount -o
>>>> max_inline=1500 would give us 50% usage in the worst case
>>>>
>> This is a very strange statement: how did you calculate this lower bound?
>>
>
> We want room for the extent and the inode item and the inode backref.
> It's a rough number that leaves some extra room. But even with a 2K
> item we're getting very close to 50% usage of the metadata area.
>
>
>>>> (minus the
>>>> balancing bug you hit).
>>>>
>>> Ok, the balancing bug was interesting. What happens is we create all
>>> the inodes and directory items nicely packed into the leaves.
>>>
>>> Then delayed allocation kicks in and starts inserting the big fat inline
>>> extents. This often forces the balancing code to split a leaf twice in
>>> order to make room for the new inline extent. The double split code
>>> wasn't balancing the bits that were left over on either side of our
>>> desired slot.
>>>
>>> The patch below fixes the problem for me, reducing the number of leaves
>>> that have more than 2K of free space down from 6% of the total to about
>>> 74 leaves. It could be zero, but the balancing code doesn't push
>>> items around if our leaf is in the first or last slot of the parent
>>> node (complexity vs benefit tradeoff).
>>>
>> Nevertheless, I see leaves, which are not in the first or last slot,
>> but mergeable with their neighbors (test case is the same):
>>
>
> Correct, but it was in the first or last slot when it was balanced (I
> confirmed this with printk).
>
> Then the higher layers were balanced and our leaves were no longer in
> the first/last slot. We don't rebalance leaves when we balance level 1.
>
>
>> leaf 269557760 items 22 free space 2323 generation 25 owner 2
>> fs uuid 614fb921-cfa9-403d-9feb-940021e72382
>> chunk uuid b1674882-a445-45f9-b250-0985e483d231
>>
>> leaf 280027136 items 18 free space 2627 generation 25 owner 2
>> fs uuid 614fb921-cfa9-403d-9feb-940021e72382
>> chunk uuid b1674882-a445-45f9-b250-0985e483d231
>>
>> node 269549568 level 1 items 60 free 61 generation 25 owner 2
>> fs uuid 614fb921-cfa9-403d-9feb-940021e72382
>> chunk uuid b1674882-a445-45f9-b250-0985e483d231
>> key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15
>> key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15
>> key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15
>> key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17
>> key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17
>> key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17
>> key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25
>> key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25
>> key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25
>> key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25
>> key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25
>> key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25
>> key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25
>> key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16
>> key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16
>> key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16
>> [...]
>>
>>
>>> With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
>>> That doesn't sound like a huge improvement, but the default from
>>> mkfs.btrfs is to duplicate metadata. After duplication, that's about
>>> 417MB or about 40% of the overall space.
>>>
>>> When you factor in the space that we reserve to avoid exploding on
>>> enospc and the space that we have allocated for data extents, that's not
>>> a very surprising number.
>>>
>>> I'm still putting this patch through more testing, the double split code
>>> is used in some difficult corners and I need to make sure I've tried
>>> all of them.
>>>
>> Chris, for the further code review we need documents, which reflect
>> the original ideas of the balancing, etc. Could you please provide them?
>> Obviously, I can not do it instead of you, as I don't know those ideas.
>>
>>
>
> Which part are you most interested in?
>

Balancing.

What conditions do we want to provide?
Why won't we get utilization slump in future?
How do we need to fix things when something goes wrong?
Whereas, judging by the single existing paper, the Btrfs
design looks really broken, because:
1. the balancing condition n <= number_of_keys <= 2n+1
is not satisfied (indeed, when number_of_keys is 1
the lower bound n <= 1 is false for any n > 1);
2. the trees mentioned in this paper don't allow keys of
variable length.

Thanks,
Edward.
From: Jamie Lokier
Edward Shishkin wrote:
> Chris Mason wrote:
> >On Tue, Jun 22, 2010 at 04:12:57PM +0200, Edward Shishkin wrote:
> >
> >>Chris Mason wrote:
> >>
> >>>On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:
> >>>
> >>>>I'll reproduce from your test case and provide a fix. mount -o
> >>>>max_inline=1500 would give us 50% usage in the worst case
> >>>>
> >>This is a very strange statement: how did you calculate this lower bound?
> >>
> >
> >We want room for the extent and the inode item and the inode backref.
> >It's a rough number that leaves some extra room. But even with a 2K
> >item we're getting very close to 50% usage of the metadata area.
> >
> >
> >>>>(minus the
> >>>>balancing bug you hit).
> >>>>
> >>>Ok, the balancing bug was interesting. What happens is we create all
> >>>the inodes and directory items nicely packed into the leaves.
> >>>
> >>>Then delayed allocation kicks in and starts inserting the big fat inline
> >>>extents. This often forces the balancing code to split a leaf twice in
> >>>order to make room for the new inline extent. The double split code
> >>>wasn't balancing the bits that were left over on either side of our
> >>>desired slot.
> >>>
> >>>The patch below fixes the problem for me, reducing the number of leaves
> >>>that have more than 2K of free space down from 6% of the total to about
> >>>74 leaves. It could be zero, but the balancing code doesn't push
> >>>items around if our leaf is in the first or last slot of the parent
> >>>node (complexity vs benefit tradeoff).
> >>>
> >>Nevertheless, I see leaves, which are not in the first or last slot,
> >>but mergeable with their neighbors (test case is the same):
> >>
> >
> >Correct, but it was in the first or last slot when it was balanced (I
> >confirmed this with printk).
> >
> >Then the higher layers were balanced and our leaves were no longer in
> >the first/last slot. We don't rebalance leaves when we balance level 1.
> >
> >
> >>leaf 269557760 items 22 free space 2323 generation 25 owner 2
> >>fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> >>chunk uuid b1674882-a445-45f9-b250-0985e483d231
> >>
> >>leaf 280027136 items 18 free space 2627 generation 25 owner 2
> >>fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> >>chunk uuid b1674882-a445-45f9-b250-0985e483d231
> >>
> >>node 269549568 level 1 items 60 free 61 generation 25 owner 2
> >>fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> >>chunk uuid b1674882-a445-45f9-b250-0985e483d231
> >> key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15
> >> key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15
> >> key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15
> >> key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17
> >> key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17
> >> key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17
> >> key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25
> >> key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25
> >> key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25
> >> key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25
> >> key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25
> >> key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25
> >> key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25
> >> key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16
> >> key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16
> >> key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16
> >>[...]
> >>
> >>
> >>>With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
> >>>That doesn't sound like a huge improvement, but the default from
> >>>mkfs.btrfs is to duplicate metadata. After duplication, that's about
> >>>417MB or about 40% of the overall space.
> >>>
> >>>When you factor in the space that we reserve to avoid exploding on
> >>>enospc and the space that we have allocated for data extents, that's not
> >>>a very surprising number.
> >>>
> >>>I'm still putting this patch through more testing, the double split code
> >>>is used in some difficult corners and I need to make sure I've tried
> >>>all of them.
> >>>
> >>Chris, for the further code review we need documents, which reflect
> >>the original ideas of the balancing, etc. Could you please provide them?
> >>Obviously, I can not do it instead of you, as I don't know those ideas.
> >>
> >>
> >
> >Which part are you most interested in?
> >
>
> Balancing.
>
> What conditions do we want to provide?
> Why won't we get utilization slump in future?
> How do we need to fix things when something goes wrong?
> Whereas, judging by the single existing paper, the Btrfs
> design looks really broken, because:

> 1. the balancing condition n <= number_of_keys <= 2n+1
> is not satisfied (indeed, when number_of_keys is 1
> the lower bound n <= 1 is false for any n > 1);

That doesn't matter. It is not necessary (or desirable) to follow the
classical B-tree algorithms to the letter to get sufficiently good
properties. B-trees are quite robust to changing the details, as long
as you follow the generalised principles which make them work.

(It is entirely possible that btrfs does not follow those principles
though :-)

All that matters to guarantee utilisation and algorithmic performance
is to do enough merging on deletions, and avoid doing too many splits,
to ensure that an amount X% of all used blocks (except the root) are
at least Y% full.

Y is 50 and X is 100 in your description, but other numbers work too.
Reducing X slightly improves utilisation in common filesystem
scenarios where many blocks are filled in ascending key sequences.
(But delayed allocation/writing, like btrfs does, is another way to
achieve a similar result.) Increasing Y improves minimum utilisation,
which seems like a good thing to demand from a filesystem - for the
actual data, anyway.
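
To put rough numbers on that (illustrative only): if X% of the used
blocks are at least Y% full, the overall fill is bounded below by
roughly (X/100) * (Y/100). So X = 100, Y = 50 guarantees about 50%
utilisation of the occupied blocks, while X = 90, Y = 60 still
guarantees about 54% and tolerates a few nearly-empty blocks left over
from ascending-key fills.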

From what's been said in this thread, I'm getting the impression that
btrfs does not guarantee a particular value of X, but rather depends
heuristically on typical behaviour to give an equivalent outcome, and
that that does not always work in some pathological cases. And also
that Y is/was not guaranteed, but that may be quite possibly an
ordinary bug and might have been fixed by Chris during this thread.

I must admit that only getting 200MB of 2kB files into a formatted 1GB
partition before it says ENOSPC - and that's after Chris's recent
tweak - is a bit disappointing. A little over 50% loss for 2kB files
is quite reasonable - most filesystems would do that just due to
padding to the next 4kB (the block size), and about 50% wastage in a
B-tree is fair enough (at least if there is no sane pattern to the
file's creation), as guaranteeing higher utilisation has a performance
penalty. But with the delayed allocation, I'd hope for better packing
than the worst case, and I'd expect even the worst case to be better
than 20% available for data.

Losing 80% of the space to utilisation loss and metadata duplication
is disappointing (to me), and a bit surprising, given that one of the
purposes of inline extents is to pack small files more efficiently
than if they were padded to the block size.

> 2. the trees mentioned in this paper don't allow keys of
> variable length.

Inline extents aren't the same as variable-length keys, because they
aren't sorted or searched by content. I showed in an earlier reply to
you how to show that variable length *values*[1] don't cause
unbalancing provided the splits are put in acceptable places. But to
make sense of that, you do have to grok the generalising of B-trees
beyond the original algorithms.

[1] Think of the B-tree variants which contain (key,value) pairings
where "value" is not ordered, rather than the type which contain just
"keys" which are fully ordered.

Are there any true (fully ordered) variable-length keys in btrfs?

-- Jamie
From: Chris Mason
On Thu, Jun 24, 2010 at 12:37:59AM +0100, Jamie Lokier wrote:
> Edward Shishkin wrote:
> > Chris Mason wrote:
> > >On Tue, Jun 22, 2010 at 04:12:57PM +0200, Edward Shishkin wrote:
> > >
> > >>Chris Mason wrote:
> > >>
> > >>>On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:
> > >>>
> > >>>>I'll reproduce from your test case and provide a fix. mount -o
> > >>>>max_inline=1500 would give us 50% usage in the worst case
> > >>>>
> > >>This is a very strange statement: how did you calculate this lower bound?
> > >>
> > >
> > >We want room for the extent and the inode item and the inode backref.
> > >It's a rough number that leaves some extra room. But even with a 2K
> > >item we're getting very close to 50% usage of the metadata area.
> > >
> > >
> > >>>>(minus the
> > >>>>balancing bug you hit).
> > >>>>
> > >>>Ok, the balancing bug was interesting. What happens is we create all
> > >>>the inodes and directory items nicely packed into the leaves.
> > >>>
> > >>>Then delayed allocation kicks in and starts inserting the big fat inline
> > >>>extents. This often forces the balancing code to split a leaf twice in
> > >>>order to make room for the new inline extent. The double split code
> > >>>wasn't balancing the bits that were left over on either side of our
> > >>>desired slot.
> > >>>
> > >>>The patch below fixes the problem for me, reducing the number of leaves
> > >>>that have more than 2K of free space down from 6% of the total to about
> > >>>74 leaves. It could be zero, but the balancing code doesn't push
> > >>>items around if our leaf is in the first or last slot of the parent
> > >>>node (complexity vs benefit tradeoff).
> > >>>
> > >>Nevertheless, I see leaves, which are not in the first or last slot,
> > >>but mergeable with their neighbors (test case is the same):
> > >>
> > >
> > >Correct, but it was in the first or last slot when it was balanced (I
> > >confirmed this with printk).
> > >
> > >Then the higher layers were balanced and our leaves were no longer in
> > >the first/last slot. We don't rebalance leaves when we balance level 1.
> > >
> > >
> > >>leaf 269557760 items 22 free space 2323 generation 25 owner 2
> > >>fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> > >>chunk uuid b1674882-a445-45f9-b250-0985e483d231
> > >>
> > >>leaf 280027136 items 18 free space 2627 generation 25 owner 2
> > >>fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> > >>chunk uuid b1674882-a445-45f9-b250-0985e483d231
> > >>
> > >>node 269549568 level 1 items 60 free 61 generation 25 owner 2
> > >>fs uuid 614fb921-cfa9-403d-9feb-940021e72382
> > >>chunk uuid b1674882-a445-45f9-b250-0985e483d231
> > >> key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15
> > >> key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15
> > >> key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15
> > >> key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17
> > >> key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17
> > >> key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17
> > >> key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25
> > >> key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25
> > >> key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25
> > >> key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25
> > >> key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25
> > >> key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25
> > >> key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25
> > >> key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16
> > >> key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16
> > >> key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16
> > >>[...]
> > >>
> > >>
> > >>>With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
> > >>>That doesn't sound like a huge improvement, but the default from
> > >>>mkfs.btrfs is to duplicate metadata. After duplication, that's about
> > >>>417MB or about 40% of the overall space.
> > >>>
> > >>>When you factor in the space that we reserve to avoid exploding on
> > >>>enospc and the space that we have allocated for data extents, that's not
> > >>>a very surprising number.
> > >>>
> > >>>I'm still putting this patch through more testing, the double split code
> > >>>is used in some difficult corners and I need to make sure I've tried
> > >>>all of them.
> > >>>
> > >>Chris, for the further code review we need documents, which reflect
> > >>the original ideas of the balancing, etc. Could you please provide them?
> > >>Obviously, I can not do it instead of you, as I don't know those ideas.
> > >>
> > >>
> > >
> > >Which part are you most interested in?
> > >
> >
> > Balancing.
> >
> > What conditions do we want to provide?
> > Why won't we get utilization slump in future?
> > How do we need to fix things when something goes wrong?
> > Whereas, judging by the single existing paper, the Btrfs
> > design looks really broken, because:
>
> > 1. the balancing condition n <= number_of_keys <= 2n+1
> > is not satisfied (indeed, when number_of_keys is 1
> > the lower bound n <= 1 is false for any n > 1);
>
> That doesn't matter. It is not necessary (or desirable) to follow the
> classical B-tree algorithms to the letter to get sufficiently good
> properties. B-trees are quite robust to changing the details, as long
> as you follow the generalised principles which make them work.

There are lots of different perspectives here, but it is important to
step back and look at what we're using the btree for. We're making an
index to filesystem metadata. The index is there so that we can find
that metadata with a reasonable number of IOs and in a reasonable amount
of time.

Because btrfs is COW, and because of the reference counted COW used to
maintain snapshots, the cost of updating the btree is different from
traditional btrees. We do try to avoid balancing more than necessary
and intentionally allow the btree leaves to be less compact, simply
because that lets us avoid the higher cost of those operations.

The btree nodes are fairly strict. They contain fixed records and are
balanced in a simple fashion. When we are inserting a new key into a
node, if the node is completely full we split it. When we are deleting
a key, if the node is less than half full we balance it. Fullness of
the btree nodes is measured by the number of keys in them.

The leaves have fixed length keys that describe items of variable size.
On deletion, merging is done when we're less than 1/3rd full, and on
insertion we split when the item we're trying to shove in doesn't fit.
Fullness of the btree leaves is measured by the total size of the
items stored.
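
A minimal sketch of that policy (illustrative only -- these helpers
and thresholds restate the rules above, they are not the actual
ctree.c code):

typedef unsigned int u32;	/* stand-in for the kernel type */

/* nodes hold fixed-size key/block-pointer records, measured in keys */
static int node_should_split(u32 nritems, u32 max_items)
{
	return nritems >= max_items;		/* split only when completely full */
}

static int node_should_balance(u32 nritems, u32 max_items)
{
	return nritems < max_items / 2;		/* balance when < 1/2 full on delete */
}

/* leaves hold fixed-size keys plus variable-size items, measured in bytes */
static int leaf_should_merge(u32 used_bytes, u32 leaf_data_size)
{
	return used_bytes < leaf_data_size / 3;	/* merge when < 1/3 full on delete */
}

static int leaf_should_split(u32 item_size, u32 free_bytes)
{
	return item_size > free_bytes;		/* split only when the new item won't fit */
}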

So, with all of that said, the nodes point to leaves and the leaves
contain inodes, directory entries and sometimes file data.

The most important part of the index is the higher levels of the tree.
The index of most filesystems points to but does not contain
inodes and file data, and so the higher levels of the btrfs btree are
more like a traditional FS index.

There's a knob here for the max inline item size and if someone wants to
fill their whole FS with 2K files, the default settings are not at all
suitable. Filling with 3K files actually works better because we end
up with inode and file data in the same block much of the time.
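
Roughly, with approximate item sizes: a 3K file inlines to about
160 (inode item) + 20 (inode ref) + 21 + 3072 (inline file extent)
plus item headers, around 3.3K, which nearly fills a 4K leaf. A 2K
file needs about 2.3K, so one fits comfortably but the ~1.7K left
over in the leaf is not enough for a second, and much of that space
can end up as slack.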

>
> (It is entirely possible that btrfs does not follow those principles
> though :-)

We try ;) Clearly there was a bug around splitting for large items.

>
> All that matters to guarantee utilisation and algorithmic performance
> is to do enough merging on deletions, and avoid doing too many splits,
> to ensure that an amount X% of all used blocks (except the root) are
> at least Y% full.
>
> Y is 50 and X is 100 in your description, but other numbers work too.
> Reducing X slightly improves utilisation in common filesystem
> scenarios where many blocks are filled in ascending key sequences.
> (But delayed allocation/writing, like btrfs does, is another way to
> achieve a similar result.) Increasing Y improves minimum utilisation,
> which seems like a good thing to demand from a filesystem - for the
> actual data, anyway.
>
> From what's been said in this thread, I'm getting the impression that
> btrfs does not guarantee a particular value of X, but rather depends
> heuristically on typical behaviour to give an equivalent outcome, and
> that that does not always work in some pathological cases. And also
> that Y is/was not guaranteed, but that may be quite possibly an
> ordinary bug and might have been fixed by Chris during this thread.
>
> I must admit that only getting 200MB of 2kB files into a formatted 1GB
> partition before it says ENOSPC - and that's after Chris's recent
> tweak - is a bit disappointing. A little over 50% loss for 2kB files
> is quite reasonable - most filesystems would do that just due to
> padding to the next 4kB (the block size), and about 50% wastage in a
> B-tree is fair enough (at least if there is no sane pattern to the
> file's creation), as guaranteeing higher utilisation has a performance
> penalty. But with the delayed allocation, I'd hope for better packing
> than the worst case, and I'd expect even the worst case to be better
> than 20% available for data.
>
> Losing 80% of the space to utilisation loss and metadata duplication
> is disappointing (to me), and a bit surprising, given that one of the
> purposes of inline extents is to pack small files more efficiently
> than if they were padded to the block size.

I agree, in this case metadata duplication is not helping. A new
default limit on the size of the inline data might be in order.

>
> > 2. the trees mentioned in this paper don't allow keys of
> > variable length.
>
> Inline extents aren't the same as variable-length keys, because they
> aren't sorted or searched by content. I showed in an earlier reply to
> you how to show that variable length *values*[1] don't cause
> unbalancing provided the splits are put in acceptable places. But to
> make sense of that, you do have to grok the generalising of B-trees
> beyond the original algorithms.
>
> [1] Think of the B-tree variants which contain (key,value) pairings
> where "value" is not ordered, rather than the type which contain just
> "keys" which are fully ordered.
>
> Are there any true (fully ordered) variable-length keys in btrfs?

The keys are a fixed length; only the size of the items stored can vary.
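
For reference, the key is a small fixed-size structure along these
lines (paraphrased from ctree.h, so treat the exact spelling as
approximate):

struct btrfs_disk_key {
	__le64 objectid;	/* e.g. inode number or tree object id */
	u8 type;		/* item type: inode item, dir item, extent data, ... */
	__le64 offset;		/* meaning depends on the item type */
} __attribute__ ((__packed__));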

-chris