Balancing leaves when walking from top to down (was Btrfs:...) [Kernel]

Prev: msm: 8x50: add gpio device configuration.
Next: [tip:x86/build] x86, vdso: Error out if the vdso contains external references

From: Edward Shishkin on 18 Jun 2010 18:10

Chris Mason wrote:
> On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:
>
>> Jamie Lokier wrote:
>>
>>> Edward Shishkin wrote:
>>>
>>>> If you decide to base your file system on some algorithms then please
>>>> use the original ones from proper academic papers. DO NOT modify the
>>>> algorithms in solitude: this is very fragile thing! All such
>>>> modifications must be reviewed by specialists in the theory of
>>>> algorithms. Such review can be done in various scientific magazines of
>>>> proper level.
>>>>
>>>> Personally I don't see any way to improve the situation with Btrfs
>>>> except full redesigning the last one. If you want to base your file
>>>> system on the paper of Ohad Rodeh, then please, use *exactly* the
>>>> Bayer's B-trees that he refers to. That said, make sure that all
>>>> records you put to the tree has equal length and all non-root nodes of
>>>> your tree are at least half filled.
>>>>
>>> First, thanks Edward for identifying a specific problem with the
>>> current btrfs implementation.
>>>
>> Hello Jamie.
>>
>>
>>> I've studied modified B-trees quite a lot and know enough to be sure
>>> that they are quite robust when you modify them in all sorts of ways.
>>>
>> Which property is robust?
>>
>>
>>> Moreover, you are incorrect to say there's an intrinsic algorithmic
>>> problem with variable-length records. It is not true; if Knuth said
>>> so, Knuth was mistaken.
>>>
>> I didn't say about intrinsic algorithmic problems :)
>> I just repeat (after Knuth et al) that B-trees with variable-length
>> records don't
>> have any sane boundary for internal fragmentation. The common idea
>> is that if we
>> don't want Btrfs to be in infinite development stage, then we should
>> choose some
>> *sane* strategy (for example the paper of Ohad Rodeh) and strictly
>> adhere this in
>> future.
>>
>
> Again, other than the inline file data, what exactly do you believe
> needs to change?

1. getting rid of inline extents;
2. new formats for directory and xattr items to not look like a train,
which is able to occupy the whole leaf;
3. make sure we do pro-active balancing like it is described in the paper.

Sorry, I don't see other ways for now..

> Top down balancing vs balancing on insertion doesn't
> impact our ability to maintain full leaves. The current code is clearly
> choosing not to merge two leaves that it should have merged, which is
> just a plain old bug.
>

How are you going to balance leaves when walking from top to down?
Suppose 1) and 2) above are not satisfied and having arrived to the leaf
level we see a number of items of variable length. What will we do to
keep leaves full?

Could you please provide a sketch of the algorithm?

Thanks!

--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Ric Wheeler on 18 Jun 2010 18:40

On 06/18/2010 06:04 PM, Edward Shishkin wrote:
> Chris Mason wrote:
>> On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:
>>> Jamie Lokier wrote:
>>>> Edward Shishkin wrote:
>>>>> If you decide to base your file system on some algorithms then please
>>>>> use the original ones from proper academic papers. DO NOT modify the
>>>>> algorithms in solitude: this is very fragile thing! All such
>>>>> modifications must be reviewed by specialists in the theory of
>>>>> algorithms. Such review can be done in various scientific magazines of
>>>>> proper level.
>>>>>
>>>>> Personally I don't see any way to improve the situation with Btrfs
>>>>> except full redesigning the last one. If you want to base your file
>>>>> system on the paper of Ohad Rodeh, then please, use *exactly* the
>>>>> Bayer's B-trees that he refers to. That said, make sure that all
>>>>> records you put to the tree has equal length and all non-root nodes of
>>>>> your tree are at least half filled.
>>>> First, thanks Edward for identifying a specific problem with the
>>>> current btrfs implementation.
>>> Hello Jamie.
>>>
>>>> I've studied modified B-trees quite a lot and know enough to be sure
>>>> that they are quite robust when you modify them in all sorts of ways.
>>> Which property is robust?
>>>
>>>> Moreover, you are incorrect to say there's an intrinsic algorithmic
>>>> problem with variable-length records. It is not true; if Knuth said
>>>> so, Knuth was mistaken.
>>> I didn't say about intrinsic algorithmic problems :)
>>> I just repeat (after Knuth et al) that B-trees with variable-length
>>> records don't
>>> have any sane boundary for internal fragmentation. The common idea
>>> is that if we
>>> don't want Btrfs to be in infinite development stage, then we should
>>> choose some
>>> *sane* strategy (for example the paper of Ohad Rodeh) and strictly
>>> adhere this in
>>> future.
>>
>> Again, other than the inline file data, what exactly do you believe
>> needs to change?
>
> 1. getting rid of inline extents;
> 2. new formats for directory and xattr items to not look like a train,
> which is able to occupy the whole leaf;
> 3. make sure we do pro-active balancing like it is described in the paper.
>
> Sorry, I don't see other ways for now..
>
>> Top down balancing vs balancing on insertion doesn't
>> impact our ability to maintain full leaves. The current code is clearly
>> choosing not to merge two leaves that it should have merged, which is
>> just a plain old bug.
>
> How are you going to balance leaves when walking from top to down?
> Suppose 1) and 2) above are not satisfied and having arrived to the leaf
> level we see a number of items of variable length. What will we do to
> keep leaves full?
>
> Could you please provide a sketch of the algorithm?
>
> Thanks!
>

Hi Edward,

Is it really a requirement to have 100% full leaves? Most DB's (assuming I
remember correctly) have deliberate strategies around this kind of thing. You
might want to leave room in leaf nodes so that future insertions can be
contiguous on the media with older data.

Regards,

Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Edward Shishkin on 18 Jun 2010 20:10

Ric Wheeler wrote:
> On 06/18/2010 06:04 PM, Edward Shishkin wrote:
>> Chris Mason wrote:
>>> On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:
>>>> Jamie Lokier wrote:
>>>>> Edward Shishkin wrote:
>>>>>> If you decide to base your file system on some algorithms then
>>>>>> please
>>>>>> use the original ones from proper academic papers. DO NOT modify the
>>>>>> algorithms in solitude: this is very fragile thing! All such
>>>>>> modifications must be reviewed by specialists in the theory of
>>>>>> algorithms. Such review can be done in various scientific
>>>>>> magazines of
>>>>>> proper level.
>>>>>>
>>>>>> Personally I don't see any way to improve the situation with Btrfs
>>>>>> except full redesigning the last one. If you want to base your file
>>>>>> system on the paper of Ohad Rodeh, then please, use *exactly* the
>>>>>> Bayer's B-trees that he refers to. That said, make sure that all
>>>>>> records you put to the tree has equal length and all non-root
>>>>>> nodes of
>>>>>> your tree are at least half filled.
>>>>> First, thanks Edward for identifying a specific problem with the
>>>>> current btrfs implementation.
>>>> Hello Jamie.
>>>>
>>>>> I've studied modified B-trees quite a lot and know enough to be sure
>>>>> that they are quite robust when you modify them in all sorts of ways.
>>>> Which property is robust?
>>>>
>>>>> Moreover, you are incorrect to say there's an intrinsic algorithmic
>>>>> problem with variable-length records. It is not true; if Knuth said
>>>>> so, Knuth was mistaken.
>>>> I didn't say about intrinsic algorithmic problems :)
>>>> I just repeat (after Knuth et al) that B-trees with variable-length
>>>> records don't
>>>> have any sane boundary for internal fragmentation. The common idea
>>>> is that if we
>>>> don't want Btrfs to be in infinite development stage, then we should
>>>> choose some
>>>> *sane* strategy (for example the paper of Ohad Rodeh) and strictly
>>>> adhere this in
>>>> future.
>>>
>>> Again, other than the inline file data, what exactly do you believe
>>> needs to change?
>>
>> 1. getting rid of inline extents;
>> 2. new formats for directory and xattr items to not look like a train,
>> which is able to occupy the whole leaf;
>> 3. make sure we do pro-active balancing like it is described in the
>> paper.
>>
>> Sorry, I don't see other ways for now..
>>
>>> Top down balancing vs balancing on insertion doesn't
>>> impact our ability to maintain full leaves. The current code is clearly
>>> choosing not to merge two leaves that it should have merged, which is
>>> just a plain old bug.
>>
>> How are you going to balance leaves when walking from top to down?
>> Suppose 1) and 2) above are not satisfied and having arrived to the leaf
>> level we see a number of items of variable length. What will we do to
>> keep leaves full?
>>
>> Could you please provide a sketch of the algorithm?
>>
>> Thanks!
>>
>
> Hi Edward,
>
> Is it really a requirement to have 100% full leaves? Most DB's
> (assuming I remember correctly) have deliberate strategies around this
> kind of thing. You might want to leave room in leaf nodes so that
> future insertions can be contiguous on the media with older data.
>
> Regards,
>
> Ric

Hello Ric.

No, every leaf shouldn't be necessarily 100% full.
We may require every L-vicinity(*) of every leaf to be full in
some sense. And this local condition sometimes brings the (global)
boundary for utilization of the whole tree.

In the classic Bayer's B-trees we have so-called "S(0)" balancing
condition implicitly satisfied, which requires every leaf to be at
least half full. It provides the low boundary 0.5 for utilization
of the whole tree.

Next example is ReiserFS file system, which uses so-called "S(1)"
balancing condition on the leaf level, which requires every 1-vicinity
of every leaf to be "incompressible" (i.e. two leaves can not be
squeezed to a single one). This local incompressibility brings the
sweet low utilization boundary 0.5-E for the whole tree (E is a small
number, which is back proportional to the tree branching).

(*) any set of L+1 neighboring nodes, which contains the leaf.

--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Chris Mason on 21 Jun 2010 09:20

On Sat, Jun 19, 2010 at 12:04:06AM +0200, Edward Shishkin wrote:
> Chris Mason wrote:
> >On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:
> >>Jamie Lokier wrote:
> >>>Edward Shishkin wrote:
> >>>>If you decide to base your file system on some algorithms then please
> >>>>use the original ones from proper academic papers. DO NOT modify the
> >>>>algorithms in solitude: this is very fragile thing! All such
> >>>>modifications must be reviewed by specialists in the theory of
> >>>>algorithms. Such review can be done in various scientific magazines of
> >>>>proper level.
> >>>>
> >>>>Personally I don't see any way to improve the situation with Btrfs
> >>>>except full redesigning the last one. If you want to base your file
> >>>>system on the paper of Ohad Rodeh, then please, use *exactly* the
> >>>>Bayer's B-trees that he refers to. That said, make sure that all
> >>>>records you put to the tree has equal length and all non-root nodes of
> >>>>your tree are at least half filled.
> >>>First, thanks Edward for identifying a specific problem with the
> >>>current btrfs implementation.
> >>Hello Jamie.
> >>
> >>>I've studied modified B-trees quite a lot and know enough to be sure
> >>>that they are quite robust when you modify them in all sorts of ways.
> >>Which property is robust?
> >>
> >>>Moreover, you are incorrect to say there's an intrinsic algorithmic
> >>>problem with variable-length records. It is not true; if Knuth said
> >>>so, Knuth was mistaken.
> >>I didn't say about intrinsic algorithmic problems :)
> >>I just repeat (after Knuth et al) that B-trees with variable-length
> >>records don't
> >>have any sane boundary for internal fragmentation. The common idea
> >>is that if we
> >>don't want Btrfs to be in infinite development stage, then we should
> >>choose some
> >>*sane* strategy (for example the paper of Ohad Rodeh) and strictly
> >>adhere this in
> >>future.
> >
> >Again, other than the inline file data, what exactly do you believe
> >needs to change?
>
> 1. getting rid of inline extents;
> 2. new formats for directory and xattr items to not look like a train,
> which is able to occupy the whole leaf;
> 3. make sure we do pro-active balancing like it is described in the paper.
>
> Sorry, I don't see other ways for now..
>
> > Top down balancing vs balancing on insertion doesn't
> >impact our ability to maintain full leaves. The current code is clearly
> >choosing not to merge two leaves that it should have merged, which is
> >just a plain old bug.
>
> How are you going to balance leaves when walking from top to down?
> Suppose 1) and 2) above are not satisfied and having arrived to the leaf
> level we see a number of items of variable length. What will we do to
> keep leaves full?

I think I'm mixing up complaints ;) top-down balancing is unrelated to
the variable length items. You could balance from the bottom up and if
you have a 1K item that can't be split next to a 3K item that can't be
split, we're going to have those two items in two leaves. This is true
regardless of how we balance. So the prime complaint here is that you
don't like variable length items, correct?

Ideally leaves are full, and ideally data blocks are full. In practice
we can end up with either one utilized less than we would like. At the
end of the day, every FS has a worst case setup where it is mostly empty
in terms of bytes stored but full in terms of blocks. On xfs:

for x in `seq 1 bazallion` ; echo 'f' > $x

This would work on ext4 too, until you ran out of inodes.

>
> Could you please provide a sketch of the algorithm?
>

I'll reproduce from your test case and provide a fix. mount -o
max_inline=1500 would give us 50% usage in the worst case (minus the
balancing bug you hit).

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Chris Mason on 21 Jun 2010 14:10

On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:
>
> I'll reproduce from your test case and provide a fix. mount -o
> max_inline=1500 would give us 50% usage in the worst case (minus the
> balancing bug you hit).

Ok, the balancing bug was interesting. What happens is we create all
the inodes and directory items nicely packed into the leaves.

Then delayed allocation kicks in and starts inserting the big fat inline
extents. This often forces the balancing code to split a leaf twice in
order to make room for the new inline extent. The double split code
wasn't balancing the bits that were left over on either side of our
desired slot.

The patch below fixes the problem for me, reducing the number of leaves
that have more than 2K of free space down from 6% of the total to about
74 leaves. It could be zero, but the balancing code doesn't push
items around if our leaf is in the first or last slot of the parent
node (complexity vs benefit tradeoff).

With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
That doesn't sound like a huge improvement, but the default from
mkfs.btrfs is to duplicate metadata. After duplication, that's about
417MB or about 40% of the overall space.

When you factor in the space that we reserve to avoid exploding on
enospc and the space that we have allocated for data extents, that's not
a very surprising number.

I'm still putting this patch through more testing, the double split code
is used in some difficult corners and I need to make sure I've tried
all of them.

-chris

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 0d1d966..1f393b0 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2304,12 +2304,17 @@ noinline int btrfs_leaf_free_space(struct btrfs_root *root,
return ret;
}

+/*
+ * min slot controls the lowest index we're willing to push to the
+ * right. We'll push up to and including min_slot, but no lower
+ */
static noinline int __push_leaf_right(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct btrfs_path *path,
int data_size, int empty,
struct extent_buffer *right,
- int free_space, u32 left_nritems)
+ int free_space, u32 left_nritems,
+ u32 min_slot)
{
struct extent_buffer *left = path->nodes[0];
struct extent_buffer *upper = path->nodes[1];
@@ -2327,7 +2332,7 @@ static noinline int __push_leaf_right(struct btrfs_trans_handle *trans,
if (empty)
nr = 0;
else
- nr = 1;
+ nr = max_t(u32, 1, min_slot);

if (path->slots[0] >= left_nritems)
push_space += data_size;
@@ -2469,10 +2474,13 @@ out_unlock:
*
* returns 1 if the push failed because the other node didn't have enough
* room, 0 if everything worked out and < 0 if there were major errors.
+ *
+ * this will push starting from min_slot to the end of the leaf. It won't
+ * push any slot lower than min_slot
*/
static int push_leaf_right(struct btrfs_trans_handle *trans, struct btrfs_root
*root, struct btrfs_path *path, int data_size,
- int empty)
+ int empty, u32 min_slot)
{
struct extent_buffer *left = path->nodes[0];
struct extent_buffer *right;
@@ -2515,7 +2523,7 @@ static int push_leaf_right(struct btrfs_trans_handle *trans, struct btrfs_root
goto out_unlock;

return __push_leaf_right(trans, root, path, data_size, empty,
- right, free_space, left_nritems);
+ right, free_space, left_nritems, min_slot);
out_unlock:
btrfs_tree_unlock(right);
free_extent_buffer(right);
@@ -2525,12 +2533,17 @@ out_unlock:
/*
* push some data in the path leaf to the left, trying to free up at
* least data_size bytes. returns zero if the push worked, nonzero otherwise
+ *
+ * max_slot can put a limit on how far into the leaf we'll push items. The
+ * item at 'max_slot' won't be touched. Use (u32)-1 to make us do all the
+ * items
*/
static noinline int __push_leaf_left(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct btrfs_path *path, int data_size,
int empty, struct extent_buffer *left,
- int free_space, int right_nritems)
+ int free_space, u32 right_nritems,
+ u32 max_slot)
{
struct btrfs_disk_key disk_key;
struct extent_buffer *right = path->nodes[0];
@@ -2549,9 +2562,9 @@ static noinline int __push_leaf_left(struct btrfs_trans_handle *trans,
slot = path->slots[1];

if (empty)
- nr = right_nritems;
+ nr = min(right_nritems, max_slot);
else
- nr = right_nritems - 1;
+ nr = min(right_nritems - 1, max_slot);

for (i = 0; i < nr; i++) {
item = btrfs_item_nr(right, i);
@@ -2712,10 +2725,14 @@ out:
/*
* push some data in the path leaf to the left, trying to free up at
* least data_size bytes. returns zero if the push worked, nonzero otherwise
+ *
+ * max_slot can put a limit on how far into the leaf we'll push items. The
+ * item at 'max_slot' won't be touched. Use (u32)-1 to make us push all the
+ * items
*/
static int push_leaf_left(struct btrfs_trans_handle *trans, struct btrfs_root
*root, struct btrfs_path *path, int data_size,
- int empty)
+ int empty, u32 max_slot)
{
struct extent_buffer *right = path->nodes[0];
struct extent_buffer *left;
@@ -2762,7 +2779,8 @@ static int push_leaf_left(struct btrfs_trans_handle *trans, struct btrfs_root
}

return __push_leaf_left(trans, root, path, data_size,
- empty, left, free_space, right_nritems);
+ empty, left, free_space, right_nritems,
+ max_slot);
out:
btrfs_tree_unlock(left);
free_extent_buffer(left);
@@ -2855,6 +2873,59 @@ static noinline int copy_for_split(struct btrfs_trans_handle *trans,
}

/*
+ * double splits happen when we need to insert a big item in the middle
+ * of a leaf. A double split can leave us with 3 mostly empty leaves:
+ * leaf: [ slots 0 - N] [ our target ] [ N + 1 - total in leaf ]
+ * A B C
+ *
+ * We avoid this by trying to push the items on either side of our target
+ * into the adjacent leaves. If all goes well we can avoid the double split
+ * completely.
+ */
+static noinline int push_for_double_split(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct btrfs_path *path)
+{
+ int ret;
+ int progress = 0;
+ int slot;
+
+ slot = path->slots[0];
+
+ /*
+ * try to push all the items after our slot into the
+ * right leaf
+ */
+ ret = push_leaf_right(trans, root, path, 1, 0, slot);
+ if (ret < 0)
+ return ret;
+
+ if (ret == 0)
+ progress++;
+
+ /*
+ * our goal is to get our slot at the start or end of a leaf. If
+ * we've done so we're done
+ */
+ if (path->slots[0] == 0 ||
+ path->slots[0] == btrfs_header_nritems(path->nodes[0]))
+ return 0;
+
+ /* try to push all the items before our slot into the next leaf */
+ slot = path->slots[0];
+ ret = push_leaf_left(trans, root, path, 1, 0, slot);
+ if (ret < 0)
+ return ret;
+
+ if (ret == 0)
+ progress++;
+
+ if (progress)
+ return 0;
+ return 1;
+}
+
+/*
* split the path's leaf in two, making sure there is at least data_size
* available for the resulting leaf level of the path.
*
@@ -2876,6 +2947,7 @@ static noinline int split_leaf(struct btrfs_trans_handle *trans,
int wret;
int split;
int num_doubles = 0;
+ int tried_avoid_double = 0;

l = path->nodes[0];
slot = path->slots[0];
@@ -2884,12 +2956,13 @@ static noinline int split_leaf(struct btrfs_trans_handle *trans,
return -EOVERFLOW;

/* first try to make some room by pushing left and right */
- if (data_size && ins_key->type != BTRFS_DIR_ITEM_KEY) {
- wret = push_leaf_right(trans, root, path, data_size, 0);
+ if (data_size) {
+ wret = push_leaf_right(trans, root, path, data_size, 0, 0);
if (wret < 0)
return wret;
if (wret) {
- wret = push_leaf_left(trans, root, path, data_size, 0);
+ wret = push_leaf_left(trans, root, path,
+ data_size, 0, (u32)-1);
if (wret < 0)
return wret;
}
@@ -2923,6 +2996,12 @@ again:
if (mid != nritems &&
leaf_space_used(l, mid, nritems - mid) +
data_size > BTRFS_LEAF_DATA_SIZE(root)) {
+ if (!tried_avoid_double) {
+ push_for_double_split(trans,
+ root, path);
+ tried_avoid_double = 1;
+ goto again;
+ }
split = 2;
}
}
@@ -2939,6 +3018,12 @@ again:
if (mid != nritems &&
leaf_space_used(l, mid, nritems - mid) +
data_size > BTRFS_LEAF_DATA_SIZE(root)) {
+ if (!tried_avoid_double) {
+ push_for_double_split(trans,
+ root, path);
+ tried_avoid_double = 1;
+ goto again;
+ }
split = 2 ;
}
}
@@ -3915,13 +4000,14 @@ int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
extent_buffer_get(leaf);

btrfs_set_path_blocking(path);
- wret = push_leaf_left(trans, root, path, 1, 1);
+ wret = push_leaf_left(trans, root, path, 1, 1, (u32)-1);
if (wret < 0 && wret != -ENOSPC)
ret = wret;

if (path->nodes[0] == leaf &&
btrfs_header_nritems(leaf)) {
- wret = push_leaf_right(trans, root, path, 1, 1);
+ wret = push_leaf_right(trans, root, path,
+ 1, 1, 0);
if (wret < 0 && wret != -ENOSPC)
ret = wret;
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

| Next | Last
Pages: 1 2 3
Prev: msm: 8x50: add gpio device configuration.
Next: [tip:x86/build] x86, vdso: Error out if the vdso contains external references