From: Kay Diederichs on
Dear all,

we reproducibly find significantly worse ext4 performance when our
fileservers run 2.6.32 or later kernels, when compared to the
2.6.27-stable series.

The hardware is a RAID5 array of five 1TB WD10EACS disks (giving almost
4TB) in an external eSATA enclosure (STARDOM ST6600); the disks are not
partitioned but rather used whole:
md5 : active raid5 sde[0] sdg[5] sdd[3] sdc[2] sdf[1]
      3907045376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]

The enclosure is connected via a Silicon Image PCIe x1 adapter (supported
by sata_sil24) to one of our fileservers (either the backup fileserver,
32-bit desktop hardware with an Intel(R) Pentium(R) D CPU at 3.40GHz, or a
production fileserver, a 64-bit Precision WorkStation 670 with 2 Xeon
3.2GHz CPUs).

The ext4 filesystem was created using
mke2fs -j -T largefile -E stride=128,stripe_width=512 -O extent,uninit_bg
It is mounted with noatime,data=writeback
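
For reference, those values follow from the md geometry (a quick sanity
check, with a 4 KiB ext4 block size):

  # stride       = chunk size / block size = 512 KiB / 4 KiB = 128
  # stripe_width = data disks  * stride    = 4 * 128         = 512   (5-disk RAID5 -> 4 data disks)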

As operating system we usually use RHEL 5.5, but to exclude problems with
self-compiled kernels, we also booted USB sticks with the latest Fedora 12
and Fedora 13.

Our benchmarks consist of copying 100 6MB files from and to the RAID5,
over NFS (NFSv3, Gigabit Ethernet, TCP, async export), and tar-ing and
rsync-ing kernel trees back and forth. Before and after each individual
benchmark part, we "sync" and "echo 3 > /proc/sys/vm/drop_caches" on
both the client and the server.
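
Schematically, each benchmark step in the script is wrapped roughly like
this (a simplified sketch; the hostname nfs-client and the paths are
placeholders):

  flush_caches() { sync; echo 3 > /proc/sys/vm/drop_caches; }
  flush_caches; ssh nfs-client 'sync; echo 3 > /proc/sys/vm/drop_caches'
  t0=$(date +%s)
  ssh nfs-client 'rsync -a /nfs/frames/ /scratch/frames/'     # one step, e.g. 100 frames
  t1=$(date +%s)
  echo "$((t1-t0)) seconds to rsync 100 frames from nfs directory"
  flush_caches; ssh nfs-client 'sync; echo 3 > /proc/sys/vm/drop_caches'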

The problem:
with 2.6.27.48 we typically get:
44 seconds for preparations
23 seconds to rsync 100 frames with 597M from nfs directory
33 seconds to rsync 100 frames with 595M to nfs directory
50 seconds to untar 24353 kernel files with 323M to nfs directory
56 seconds to rsync 24353 kernel files with 323M from nfs directory
67 seconds to run xds_par in nfs directory (reads and writes 600M)
301 seconds to run the script

with 2.6.32.16 we find:
49 seconds for preparations
23 seconds to rsync 100 frames with 597M from nfs directory
261 seconds to rsync 100 frames with 595M to nfs directory
74 seconds to untar 24353 kernel files with 323M to nfs directory
67 seconds to rsync 24353 kernel files with 323M from nfs directory
290 seconds to run xds_par in nfs directory (reads and writes 600M)
797 seconds to run the script

This is quite reproducible (times vary by about 1-2%). All times
include reading and writing on the client side (stock CentOS5.5 Nehalem
machines with fast single SATA disks). The 2.6.32.16 times are the same
with FC12 and FC13 (booted from USB stick).

The 2.6.27-versus-2.6.32+ regression cannot be due to barriers because
md RAID5 does not support barriers ("JBD: barrier-based sync failed on
md5 - disabling barriers").

What we tried: the noop and deadline schedulers instead of cfq;
modifications of /sys/block/sd[c-g]/queue/max_sectors_kb; switching
NCQ on/off; blockdev --setra 8192 /dev/md5; and increasing
/sys/block/md5/md/stripe_cache_size.
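
In shell terms, the knobs we touched were along these lines (the exact
values varied between attempts):

  echo deadline > /sys/block/sdc/queue/scheduler     # also tried noop; same for sd[d-g]
  echo 512 > /sys/block/sdc/queue/max_sectors_kb     # several values tried
  echo 1 > /sys/block/sdc/device/queue_depth         # NCQ off (larger value = on)
  blockdev --setra 8192 /dev/md5
  echo 8192 > /sys/block/md5/md/stripe_cache_size    # up from the default of 256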

When looking at the I/O statistics while the benchmark is running, we
see very choppy patterns for 2.6.32, but quite smooth stats for
2.6.27-stable.

It is not an NFS problem; we see the same effect when transferring the
data using an rsync daemon. We believe, but are not sure, that the
problem does not exist with ext3 - it's not so quick to re-format a 4 TB
volume.

Any ideas? We cannot believe that a general ext4 regression would have
gone unnoticed, so is it due to the interaction of ext4 with md-RAID5?

thanks,

Kay
--
Kay Diederichs http://strucbio.biologie.uni-konstanz.de
email: Kay.Diederichs(a)uni-konstanz.de Tel +49 7531 88 4049 Fax 3183
Fachbereich Biologie, Universität Konstanz, Box M647, D-78457 Konstanz.
From: Greg Freemyer on
On Wed, Jul 28, 2010 at 3:51 PM, Kay Diederichs
<Kay.Diederichs(a)uni-konstanz.de> wrote:
> Dear all,
>
> we reproducibly find significantly worse ext4 performance when our
> fileservers run 2.6.32 or later kernels, when compared to the
> 2.6.27-stable series.
>
> The hardware is RAID5 of 5 1TB WD10EACS disks (giving almost 4TB) in an
> external eSATA enclosure (STARDOM ST6600); disks are not partitioned but
> rather the complete disks are used:
> md5 : active raid5 sde[0] sdg[5] sdd[3] sdc[2] sdf[1]
>      3907045376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
>
> The enclosure is connected using a Silicon Image (supported by
> sata_sil24) PCIe-X1 adapter to one of our fileservers (either the backup
> fileserver, 32bit desktop hardware with Intel(R) Pentium(R) D CPU
> 3.40GHz, or a production-fileserver 64bit Precision WorkStation 670 w/ 2
> Xeon 3.2GHz).
>
> The ext4 filesystem was created using
> mke2fs -j -T largefile -E stride=128,stripe_width=512 -O extent,uninit_bg
> It is mounted with noatime,data=writeback
>
> As operating system we usually use RHEL5.5, but to exclude problems with
> self-compiled kernels, we also booted USB sticks with latest Fedora12
> and FC13 .
>
> Our benchmarks consist of copying 100 6MB files from and to the RAID5,
> over NFS (NFSv3, GB ethernet, TCP, async export), and tar-ing and
> rsync-ing kernel trees back and forth. Before and after each individual
> benchmark part, we "sync" and "echo 3 > /proc/sys/vm/drop_caches" on
> both the client and the server.
>
> The problem:
> with 2.6.27.48 we typically get:
>  44 seconds for preparations
>  23 seconds to rsync 100 frames with 597M from nfs directory
>  33 seconds to rsync 100 frames with 595M to nfs directory
>  50 seconds to untar 24353 kernel files with 323M to nfs directory
>  56 seconds to rsync 24353 kernel files with 323M from nfs directory
>  67 seconds to run xds_par in nfs directory (reads and writes 600M)
> 301 seconds to run the script
>
> with 2.6.32.16 we find:
>  49 seconds for preparations
>  23 seconds to rsync 100 frames with 597M from nfs directory
> 261 seconds to rsync 100 frames with 595M to nfs directory
>  74 seconds to untar 24353 kernel files with 323M to nfs directory
>  67 seconds to rsync 24353 kernel files with 323M from nfs directory
> 290 seconds to run xds_par in nfs directory (reads and writes 600M)
> 797 seconds to run the script
>
> This is quite reproducible (times varying about 1-2% or so). All times
> include reading and writing on the client side (stock CentOS5.5 Nehalem
> machines with fast single SATA disks). The 2.6.32.16 times are the same
> with FC12 and FC13 (booted from USB stick).
>
> The 2.6.27-versus-2.6.32+ regression cannot be due to barriers because
> md RAID5 does not support barriers ("JBD: barrier-based sync failed on
> md5 - disabling barriers").
>
> What we tried: noop and deadline schedulers instead of cfq;
> modifications of /sys/block/sd[c-g]/queue/max_sectors_kb; switching
> on/off NCQ; blockdev --setra 8192 /dev/md5; increasing
> /sys/block/md5/md/stripe_cache_size
>
> When looking at the I/O statistics while the benchmark is running, we
> see very choppy patterns for 2.6.32, but quite smooth stats for
> 2.6.27-stable.
>
> It is not an NFS problem; we see the same effect when transferring the
> data using an rsync daemon. We believe, but are not sure, that the
> problem does not exist with ext3 - it's not so quick to re-format a 4 TB
> volume.
>
> Any ideas? We cannot believe that a general ext4 regression should have
> gone unnoticed. So is it due to the interaction of ext4 with md-RAID5 ?
>
> thanks,
>
> Kay

Kay,

I didn't read your whole e-mail, but 2.6.27 has known issues with
barriers not working in many raid configs. Thus it is more likely to
experience data loss in the event of a power failure.

With newer kernels, if you prefer performance over robustness, you can
mount with the "nobarrier" option.
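
For example, something like this (a sketch; substitute your actual ext4
mount point):

  mount -o remount,nobarrier /mnt/your-ext4-volume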

So now you have the choice, whereas with 2.6.27 on raid5 you effectively
had nobarriers as your only option.

Greg
From: Kay Diederichs on
On 30.07.2010 04:20, Ted Ts'o wrote:
> On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote:
>>
>> When looking at the I/O statistics while the benchmark is running, we
>> see very choppy patterns for 2.6.32, but quite smooth stats for
>> 2.6.27-stable.
>
> Could you try to do two things for me? First, run filefrag -v
> (preferably from a recent e2fsprogs, such as 1.41.11 or 1.41.12) on the
> files created from your 2.6.27 run and your 2.6.32 run.
>
> Secondly, can you capture blktrace results from 2.6.27 and 2.6.32? That
> would be very helpful for understanding what might be going on.
>
> Either would be helpful; both would be greatly appreciated.
>
> Thanks,
>
> - Ted

Ted,

a typical example of filefrag -v output for 2.6.27.48 is

Filesystem type is: ef53
File size of /mnt/md5/scratch/nfs-test/tmp/xds/frames/h2g28_1_00000.cbf
is 6229688 (1521 blocks, blocksize 4096)
 ext  logical  physical   expected   length  flags
   0        0  796160000               1024
   1     1024  826381312  796161023    497   eof

(99 out of 100 files have 2 extents)

whereas for 2.6.32.16 the result is typically

Filesystem type is: ef53
File size of /mnt/md5/scratch/nfs-test/tmp/xds/frames/h2g28_1_00000.cbf
is 6229688 (1521 blocks, blocksize 4096)
 ext  logical  physical   expected   length  flags
   0        0  826376200               1521  eof
/mnt/md5/scratch/nfs-test/tmp/xds/frames/h2g28_1_00000.cbf: 1 extent found

(99 out of 100 files have 1 extent)

We'll try the blktrace ASAP and report back.
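
For the capture we plan to do roughly the following on the server while a
single benchmark step is running (a sketch; the output location is a
placeholder):

  mount -t debugfs none /sys/kernel/debug        # blktrace needs debugfs
  cd /some/other/disk                            # keep the trace files off md5
  blktrace -d /dev/md5 -o md5-2.6.32 &           # start tracing the md device
  # ... run one benchmark step, e.g. the rsync of 100 frames ...
  kill %1                                        # stop blktrace
  blkparse -i md5-2.6.32 > md5-2.6.32.txt        # merge per-CPU traces for analysis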

thanks,

Kay

From: Kay Diederichs on
Greg Freemyer wrote:
> On Wed, Jul 28, 2010 at 3:51 PM, Kay Diederichs
> <Kay.Diederichs(a)uni-konstanz.de> wrote:
>> Dear all,
>>
>> we reproducibly find significantly worse ext4 performance when our
>> fileservers run 2.6.32 or later kernels, when compared to the
>> 2.6.27-stable series.
>>
>> The hardware is RAID5 of 5 1TB WD10EACS disks (giving almost 4TB) in an
>> external eSATA enclosure (STARDOM ST6600); disks are not partitioned but
>> rather the complete disks are used:
>> md5 : active raid5 sde[0] sdg[5] sdd[3] sdc[2] sdf[1]
>> 3907045376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5]
>> [UUUUU]
>>
>> The enclosure is connected using a Silicon Image (supported by
>> sata_sil24) PCIe-X1 adapter to one of our fileservers (either the backup
>> fileserver, 32bit desktop hardware with Intel(R) Pentium(R) D CPU
>> 3.40GHz, or a production-fileserver 64bit Precision WorkStation 670 w/ 2
>> Xeon 3.2GHz).
>>
>> The ext4 filesystem was created using
>> mke2fs -j -T largefile -E stride=128,stripe_width=512 -O extent,uninit_bg
>> It is mounted with noatime,data=writeback
>>
>> As operating system we usually use RHEL5.5, but to exclude problems with
>> self-compiled kernels, we also booted USB sticks with latest Fedora12
>> and FC13 .
>>
>> Our benchmarks consist of copying 100 6MB files from and to the RAID5,
>> over NFS (NFSv3, GB ethernet, TCP, async export), and tar-ing and
>> rsync-ing kernel trees back and forth. Before and after each individual
>> benchmark part, we "sync" and "echo 3 > /proc/sys/vm/drop_caches" on
>> both the client and the server.
>>
>> The problem:
>> with 2.6.27.48 we typically get:
>> 44 seconds for preparations
>> 23 seconds to rsync 100 frames with 597M from nfs directory
>> 33 seconds to rsync 100 frames with 595M to nfs directory
>> 50 seconds to untar 24353 kernel files with 323M to nfs directory
>> 56 seconds to rsync 24353 kernel files with 323M from nfs directory
>> 67 seconds to run xds_par in nfs directory (reads and writes 600M)
>> 301 seconds to run the script
>>
>> with 2.6.32.16 we find:
>> 49 seconds for preparations
>> 23 seconds to rsync 100 frames with 597M from nfs directory
>> 261 seconds to rsync 100 frames with 595M to nfs directory
>> 74 seconds to untar 24353 kernel files with 323M to nfs directory
>> 67 seconds to rsync 24353 kernel files with 323M from nfs directory
>> 290 seconds to run xds_par in nfs directory (reads and writes 600M)
>> 797 seconds to run the script
>>
>> This is quite reproducible (times varying about 1-2% or so). All times
>> include reading and writing on the client side (stock CentOS5.5 Nehalem
>> machines with fast single SATA disks). The 2.6.32.16 times are the same
>> with FC12 and FC13 (booted from USB stick).
>>
>> The 2.6.27-versus-2.6.32+ regression cannot be due to barriers because
>> md RAID5 does not support barriers ("JBD: barrier-based sync failed on
>> md5 - disabling barriers").
>>
>> What we tried: noop and deadline schedulers instead of cfq;
>> modifications of /sys/block/sd[c-g]/queue/max_sectors_kb; switching
>> on/off NCQ; blockdev --setra 8192 /dev/md5; increasing
>> /sys/block/md5/md/stripe_cache_size
>>
>> When looking at the I/O statistics while the benchmark is running, we
>> see very choppy patterns for 2.6.32, but quite smooth stats for
>> 2.6.27-stable.
>>
>> It is not an NFS problem; we see the same effect when transferring the
>> data using an rsync daemon. We believe, but are not sure, that the
>> problem does not exist with ext3 - it's not so quick to re-format a 4 TB
>> volume.
>>
>> Any ideas? We cannot believe that a general ext4 regression should have
>> gone unnoticed. So is it due to the interaction of ext4 with md-RAID5 ?
>>
>> thanks,
>>
>> Kay
>
> Kay,
>
> I didn't read your whole e-mail, but 2.6.27 has known issues with
> barriers not working in many raid configs. Thus it is more likely to
> experience data loss in the event of a power failure.
>
> With newer kernels, if you prefer to have performance over robustness,
> you can mount with the "nobarrier" option.
>
> So now you have your choice whereas with 2.6.27, with raid5 you
> effectively had nobarriers as your only choice.
>
> Greg

Greg,

2.6.33 and later support write barriers on md RAID5, whereas 2.6.27-stable
does not. I looked through the 2.6.32.* changelogs at
http://kernel.org/pub/linux/kernel/v2.6/ but could not find anything
indicating that md RAID5 write barriers were backported to 2.6.32-stable.

Anyway, with 2.6.32.16 we do not get the message "JBD: barrier-based sync
failed on md5 - disabling barriers", which might indicate that write
barriers are indeed active when no barrier options are specified.

Performance-wise, we tried mounting with barrier versus nobarrier (or
barrier=1 versus barrier=0) and re-did the 2.6.32+ benchmarks. It turned
out that the benchmark difference with and without barrier is less than
the variation between runs (which is much higher with 2.6.32+ than with
2.6.27-stable), so the influence seems to be minor.

best,

Kay
From: Kay Diederichs on
Dave Chinner wrote:
> On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote:
>> Dear all,
>>
>> we reproducibly find significantly worse ext4 performance when our
>> fileservers run 2.6.32 or later kernels, when compared to the
>> 2.6.27-stable series.
>>
>> The hardware is RAID5 of 5 1TB WD10EACS disks (giving almost 4TB) in an
>> external eSATA enclosure (STARDOM ST6600); disks are not partitioned but
>> rather the complete disks are used:
>> md5 : active raid5 sde[0] sdg[5] sdd[3] sdc[2] sdf[1]
>> 3907045376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5]
>> [UUUUU]
>>
>> The enclosure is connected using a Silicon Image (supported by
>> sata_sil24) PCIe-X1 adapter to one of our fileservers (either the backup
>> fileserver, 32bit desktop hardware with Intel(R) Pentium(R) D CPU
>> 3.40GHz, or a production-fileserver 64bit Precision WorkStation 670 w/ 2
>> Xeon 3.2GHz).
>>
>> The ext4 filesystem was created using
>> mke2fs -j -T largefile -E stride=128,stripe_width=512 -O extent,uninit_bg
>> It is mounted with noatime,data=writeback
>>
>> As operating system we usually use RHEL5.5, but to exclude problems with
>> self-compiled kernels, we also booted USB sticks with latest Fedora12
>> and FC13 .
>>
>> Our benchmarks consist of copying 100 6MB files from and to the RAID5,
>> over NFS (NFSv3, GB ethernet, TCP, async export), and tar-ing and
>> rsync-ing kernel trees back and forth. Before and after each individual
>> benchmark part, we "sync" and "echo 3 > /proc/sys/vm/drop_caches" on
>> both the client and the server.
>>
>> The problem:
>> with 2.6.27.48 we typically get:
>> 44 seconds for preparations
>> 23 seconds to rsync 100 frames with 597M from nfs directory
>> 33 seconds to rsync 100 frames with 595M to nfs directory
>> 50 seconds to untar 24353 kernel files with 323M to nfs directory
>> 56 seconds to rsync 24353 kernel files with 323M from nfs directory
>> 67 seconds to run xds_par in nfs directory (reads and writes 600M)
>> 301 seconds to run the script
>>
>> with 2.6.32.16 we find:
>> 49 seconds for preparations
>> 23 seconds to rsync 100 frames with 597M from nfs directory
>> 261 seconds to rsync 100 frames with 595M to nfs directory
>> 74 seconds to untar 24353 kernel files with 323M to nfs directory
>> 67 seconds to rsync 24353 kernel files with 323M from nfs directory
>> 290 seconds to run xds_par in nfs directory (reads and writes 600M)
>> 797 seconds to run the script
>>
>> This is quite reproducible (times varying about 1-2% or so). All times
>> include reading and writing on the client side (stock CentOS5.5 Nehalem
>> machines with fast single SATA disks). The 2.6.32.16 times are the same
>> with FC12 and FC13 (booted from USB stick).
>>
>> The 2.6.27-versus-2.6.32+ regression cannot be due to barriers because
>> md RAID5 does not support barriers ("JBD: barrier-based sync failed on
>> md5 - disabling barriers").
>>
>> What we tried: noop and deadline schedulers instead of cfq;
>> modifications of /sys/block/sd[c-g]/queue/max_sectors_kb; switching
>> on/off NCQ; blockdev --setra 8192 /dev/md5; increasing
>> /sys/block/md5/md/stripe_cache_size
>>
>> When looking at the I/O statistics while the benchmark is running, we
>> see very choppy patterns for 2.6.32, but quite smooth stats for
>> 2.6.27-stable.
>>
>> It is not an NFS problem; we see the same effect when transferring the
>> data using an rsync daemon. We believe, but are not sure, that the
>> problem does not exist with ext3 - it's not so quick to re-format a 4 TB
>> volume.
>>
>> Any ideas? We cannot believe that a general ext4 regression should have
>> gone unnoticed. So is it due to the interaction of ext4 with md-RAID5 ?
>
> Try reverting 50797481a7bdee548589506d7d7b48b08bc14dcd (ext4: Avoid
> group preallocation for closed files). IIRC it caused the same sort
> of severe performance regressions for postmark....
>
> Cheers,
>
> Dave.

Dave,

as you suggested, we reverted "ext4: Avoid group preallocation for
closed files" and this indeed fixes a big part of the problem: after
booting the NFS server we get:

NFS-Server: turn5 2.6.32.16p i686
NFS-Client: turn10 2.6.18-194.8.1.el5 x86_64

exported directory on the nfs-server:
/dev/md5 /mnt/md5 ext4 rw,seclabel,noatime,barrier=1,stripe=512,data=writeback 0 0

48 seconds for preparations
28 seconds to rsync 100 frames with 597M from nfs directory
57 seconds to rsync 100 frames with 595M to nfs directory
70 seconds to untar 24353 kernel files with 323M to nfs directory
57 seconds to rsync 24353 kernel files with 323M from nfs directory
133 seconds to run xds_par in nfs directory
425 seconds to run the script
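
For reference, the revert was applied roughly like this before rebuilding
2.6.32.16 (a sketch; the commit id is the one Dave gave, and reverting the
corresponding patch with patch -R works just as well):

  git revert 50797481a7bdee548589506d7d7b48b08bc14dcd   # ext4: Avoid group preallocation for closed files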


For blktrace details, see my next email which is a response to Ted's.

best,

Kay