From: Giangiacomo Mariotti on
On Tue, Jul 13, 2010 at 6:29 AM, Avi Kivity <avi(a)redhat.com> wrote:
> Btrfs is very slow on sync writes:
>
> 45KB/s, while 4-5MB/s traffic was actually going to the disk.  For every 4KB
> that the application writes, 400KB+ of metadata is written.
>
> (It's actually worse, since it starts faster than the average and ends up
> slower than the average).
>
> For kvm, you can try cache=writeback or cache=unsafe and get better
> performance (though still slower than ext*).
>
Yeah, well, I had already moved the virtual hd file to an ext3
partition, so the problem was actually already "solved" for me before
I posted. I posted the first message just to report the particularly
bad performance of Btrfs for this test case, so that, if not already
known, it could be investigated and hopefully fixed.

By the way, thanks to everyone who answered!

--
Giangiacomo
From: Christoph Hellwig on
There are a lot of variables when using qemu.

The most important ones are:

 - the cache mode on the device. The default is cache=writethrough,
   which is not quite optimal. You generally do want to use cache=none,
   which uses O_DIRECT in qemu.
 - whether the backing image is sparse or not.
 - whether you use barriers - both in the host and in the guest.
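
To make this concrete, the settings I end up recommending below
translate to something like the following on the command line (image
path and size are made up for illustration, not what I actually ran):

  # preallocate the backing file instead of leaving it sparse
  fallocate -l 10G /images/guest.img

  # host filesystem mounted with barriers enabled; ext3 needs an
  # explicit barrier=1, since it defaults to barriers off
  mount -o barrier=1 /dev/sdb1 /images

  # cache=none makes qemu open the image with O_DIRECT
  # (other qemu options omitted)
  qemu-kvm -drive file=/images/guest.img,format=raw,if=virtio,cache=none

Inside the guest the ext3 filesystems should then also be mounted
with barrier=1.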

Below I have a table comparing raw block devices, xfs, btrfs, ext4 and
ext3. For ext3 we also compare the default, unsafe barrier=0 version
and the barrier=1 version you should use if you actually care about
your data.

The comparison is a simple untar of a Linux 2.6.34 tarball, including a
sync after it. We run this with ext3 in the guest, either using the
default barrier=0, or for the later tests also using barrier=1. It
is done on an OCZ Vertex SSD, which gets reformatted and fully TRIMed
before each test.
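
The exact scripts don't matter much; each run boils down to roughly
the following (device names are placeholders, and the qemu -drive
cache settings are varied as in the example above):

  # host side: fresh filesystem on the SSD for every run
  mkfs.xfs -f /dev/sdb1        # or mkfs.btrfs / mkfs.ext4 / mkfs.ext3
  mount /dev/sdb1 /images

  # guest side: this is what actually gets timed
  time sh -c 'tar xf linux-2.6.34.tar.bz2 && sync'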

As you can see, you generally do want to use cache=none, and every
filesystem is about the same speed for that - except that on XFS you
also really need preallocation. What's interesting is how bad btrfs
is for the default compared to the others, and that for many filesystems
things actually get marginally faster when enabling barriers in the
guest. Things will look very different for barrier-heavy guests; I'll
do another benchmark for those.

                                         bdev        xfs         btrfs       ext4        ext3        ext3 (barrier)

cache=writethrough  nobarrier  sparse    0m27.183s   0m42.552s   2m28.929s   0m33.749s   0m24.975s   0m37.105s
cache=writethrough  nobarrier  prealloc  -           0m32.840s   2m28.378s   0m34.233s   -           -

cache=none          nobarrier  sparse    0m21.988s   0m49.758s   0m24.819s   0m23.977s   0m22.569s   0m24.938s
cache=none          nobarrier  prealloc  -           0m24.464s   0m24.646s   0m24.346s   -           -

cache=none          barrier    sparse    0m21.526s   0m41.158s   0m24.403s   0m23.924s   0m23.040s   0m23.272s
cache=none          barrier    prealloc  -           0m23.944s   0m24.284s   0m23.981s   -           -
From: Giangiacomo Mariotti on
On Wed, Jul 14, 2010 at 9:49 PM, Christoph Hellwig <hch(a)infradead.org> wrote:
> There are a lot of variables when using qemu.
>
> The most important ones are:
>
>  - the cache mode on the device.  The default is cache=writethrough,
>    which is not quite optimal.  You generally do want to use cache=none,
>    which uses O_DIRECT in qemu.
>
> [rest of the explanation and the full benchmark table snipped]
>
Very interesting. I haven't had the time to try it again, but now I'm
going to try some of the cache options and see which gives me the best
results.

--
Giangiacomo
From: Ted Ts'o on
On Wed, Jul 14, 2010 at 03:49:05PM -0400, Christoph Hellwig wrote:
> Below I have a table comparing raw block devices, xfs, btrfs, ext4 and
> ext3. For ext3 we also compare the default, unsafe barrier=0 version
> and the barrier=1 version you should use if you actually care about
> your data.
>
> The comparison is a simple untar of a Linux 2.6.34 tarball, including a
> sync after it. We run this with ext3 in the guest, either using the
> default barrier=0, or for the later tests also using barrier=1. It
> is done on an OCZ Vertex SSD, which gets reformatted and fully TRIMed
> before each test.
>
> As you can see, you generally do want to use cache=none, and every
> filesystem is about the same speed for that - except that on XFS you
> also really need preallocation. What's interesting is how bad btrfs
> is for the default compared to the others, and that for many filesystems
> things actually get marginally faster when enabling barriers in the
> guest.

Christoph,

Thanks so much for running these benchmarks. It's been on my todo
list ever since the original complaint came across on the linux-ext4
list, but I just haven't had time to do the investigation. I wonder
exactly what qemu is doing that impacts btrfs in particular so
badly. I assume that, using the qcow2 format with cache=writethrough,
it's effectively doing lots of file appends which require allocation
(or conversion of uninitialized preallocated blocks to initialized
blocks in the fs metadata), with lots of fsync()s afterwards.

But when I benchmarked the fs_mark benchmark writing 10k files, each
followed by an fsync, I didn't see results for btrfs that were way out
of line compared to xfs, ext3, ext4, et al. So merely doing a block
allocation, a small write, followed by an fsync, was something that
all file systems did fairly well at. So there must be something
interesting/pathological about what qemu is doing with
cache=writethrough. It might be interesting to understand what is
going on there, either to fix qemu/kvm, or so file systems know that
there's a particular workload that requires some special attention...
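
(To be clear about the pattern I mean: as a crude stand-in for what
fs_mark exercises, think of a loop roughly like

  for i in $(seq 1 10000); do
      dd if=/dev/zero of=file$i bs=4k count=1 conv=fsync 2>/dev/null
  done

i.e. create a file, write a few KB, force it to stable storage,
repeat - and none of the filesystems had trouble with that.)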

- Ted

P.S. I assume, since you listed "sparse", that you were using a raw
disk image and not a qcow2 block device image?

From: Christoph Hellwig on
On Sat, Jul 17, 2010 at 06:28:06AM -0400, Ted Ts'o wrote:
> Thanks so much for running these benchmarks. It's been on my todo
> list ever since the original complaint came across on the linux-ext4
> list, but I just haven't had time to do the investigation. I wonder
> exactly what qemu is doing that impacts btrfs in particular so
> badly. I assume that, using the qcow2 format with cache=writethrough,
> it's effectively doing lots of file appends which require allocation
> (or conversion of uninitialized preallocated blocks to initialized
> blocks in the fs metadata), with lots of fsync()s afterwards.

This is using raw images, so what we're doing there is hole filling.
No explicit fsyncs are done for cache=writethrough. cache=writethrough
translates to using O_DSYNC, which makes every write synchronous, which
these days translates to an implicit ->fsync call on every write.
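
If you want to see roughly the same I/O pattern from the host side
without qemu involved, something like this gets close (file name and
sizes made up for illustration):

  # 4k synchronous writes into already-allocated space, similar to
  # what cache=writethrough generates for the preallocated case
  fallocate -l 100M testfile
  dd if=/dev/zero of=testfile bs=4k count=1000 oflag=dsync conv=notrunc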

> P.S. I assume, since you listed "sparse", that you were using a raw
> disk image and not a qcow2 block device image?

All of these are using raw images. Sparse means just doing a truncate
to the image size; preallocated means using fallocate to pre-allocate
the space.
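
In command-line terms (size made up) that's just the difference
between:

  # sparse: correct size, but no blocks allocated until written
  truncate -s 10G guest.img

  # preallocated: blocks reserved up front
  fallocate -l 10G guest.img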
