From: Dave Chinner on
On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote:
> Dear all,
>
> we reproducibly find significantly worse ext4 performance when our
> fileservers run 2.6.32 or later kernels, when compared to the
> 2.6.27-stable series.
>
> The hardware is a RAID5 of five 1TB WD10EACS disks (giving almost 4TB) in an
> external eSATA enclosure (STARDOM ST6600); disks are not partitioned but
> rather the complete disks are used:
> md5 : active raid5 sde[0] sdg[5] sdd[3] sdc[2] sdf[1]
> 3907045376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
>
> The enclosure is connected via a Silicon Image PCIe x1 adapter
> (supported by sata_sil24) to one of our fileservers: either the backup
> fileserver, 32-bit desktop hardware with an Intel(R) Pentium(R) D CPU at
> 3.40GHz, or a production fileserver, a 64-bit Precision WorkStation 670
> with two 3.2GHz Xeons.
>
> The ext4 filesystem was created using
> mke2fs -j -T largefile -E stride=128,stripe_width=512 -O extent,uninit_bg
> It is mounted with noatime,data=writeback
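>
> That is, roughly (the mount point is ours, /mnt/md5; the device is the
> md array above):
>
>   mount -t ext4 -o noatime,data=writeback /dev/md5 /mnt/md5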
>
> As the operating system we usually use RHEL 5.5, but to exclude problems
> with self-compiled kernels, we also booted USB sticks with the latest
> Fedora 12 and FC13.
>
> Our benchmarks consist of copying 100 6MB files from and to the RAID5,
> over NFS (NFSv3, GB ethernet, TCP, async export), and tar-ing and
> rsync-ing kernel trees back and forth. Before and after each individual
> benchmark part, we "sync" and "echo 3 > /proc/sys/vm/drop_caches" on
> both the client and the server.
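>
> Each step is wrapped roughly like this (the rsync invocation is only
> illustrative; the real script times the rsync, tar and xds_par steps,
> with /nfs/md5 standing for the client-side NFS mount):
>
>   # on both the client and the server, before and after each step
>   sync
>   echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries, inodes
>
>   # example of one timed step
>   /usr/bin/time -f "%e seconds" rsync -a /nfs/md5/frames/ /scratch/frames/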
>
> The problem:
> with 2.6.27.48 we typically get:
> 44 seconds for preparations
> 23 seconds to rsync 100 frames with 597M from nfs directory
> 33 seconds to rsync 100 frames with 595M to nfs directory
> 50 seconds to untar 24353 kernel files with 323M to nfs directory
> 56 seconds to rsync 24353 kernel files with 323M from nfs directory
> 67 seconds to run xds_par in nfs directory (reads and writes 600M)
> 301 seconds to run the script
>
> with 2.6.32.16 we find:
> 49 seconds for preparations
> 23 seconds to rsync 100 frames with 597M from nfs directory
> 261 seconds to rsync 100 frames with 595M to nfs directory
> 74 seconds to untar 24353 kernel files with 323M to nfs directory
> 67 seconds to rsync 24353 kernel files with 323M from nfs directory
> 290 seconds to run xds_par in nfs directory (reads and writes 600M)
> 797 seconds to run the script
>
> This is quite reproducible (times vary by about 1-2%). All times
> include reading and writing on the client side (stock CentOS5.5 Nehalem
> machines with fast single SATA disks). The 2.6.32.16 times are the same
> with FC12 and FC13 (booted from USB stick).
>
> The 2.6.27-versus-2.6.32+ regression cannot be due to barriers because
> md RAID5 does not support barriers ("JBD: barrier-based sync failed on
> md5 - disabling barriers").
>
> What we tried: noop and deadline schedulers instead of cfq;
> modifications of /sys/block/sd[c-g]/queue/max_sectors_kb; switching
> on/off NCQ; blockdev --setra 8192 /dev/md5; increasing
> /sys/block/md5/md/stripe_cache_size
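>
> Concretely, that was tuning of this kind (sdc stands for each of
> sd[c-g]; the values shown are just examples of what we tried):
>
>   echo deadline > /sys/block/sdc/queue/scheduler      # likewise noop, cfq
>   echo 512 > /sys/block/sdc/queue/max_sectors_kb      # example value
>   echo 1 > /sys/block/sdc/device/queue_depth          # queue_depth 1 disables NCQ
>   blockdev --setra 8192 /dev/md5
>   echo 4096 > /sys/block/md5/md/stripe_cache_size     # example value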
>
> When looking at the I/O statistics while the benchmark is running, we
> see very choppy patterns for 2.6.32, but quite smooth stats for
> 2.6.27-stable.
>
> It is not an NFS problem; we see the same effect when transferring the
> data using an rsync daemon. We believe, but are not sure, that the
> problem does not exist with ext3 - it's not so quick to re-format a 4 TB
> volume.
>
> Any ideas? We cannot believe that a general ext4 regression would have
> gone unnoticed. So is it due to the interaction of ext4 with md-RAID5?

Try reverting 50797481a7bdee548589506d7d7b48b08bc14dcd (ext4: Avoid
group preallocation for closed files). IIRC it caused the same sort
of severe performance regressions for postmark....
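
Something along these lines on top of 2.6.32.16 should be enough to test
it (assuming the revert still applies cleanly there):

  git revert 50797481a7bdee548589506d7d7b48b08bc14dcd
  # then rebuild, boot the patched kernel and rerun the benchmark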

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
From: Ted Ts'o on
On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote:
>
> When looking at the I/O statistics while the benchmark is running, we
> see very choppy patterns for 2.6.32, but quite smooth stats for
> 2.6.27-stable.

Could you try to do two things for me? First, could you run filefrag -v
(preferably from a recent e2fsprogs, such as 1.41.11 or 1.41.12) on the
files created by your 2.6.27 run and your 2.6.32 run?

Secondly, can you capture blktrace results from 2.6.27 and 2.6.32? That
would be very helpful for understanding what might be going on.
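
Roughly something like this (the paths and run length are just examples):

  filefrag -v /path/to/frames/*.cbf

  # on the server, while the benchmark is running
  # (blktrace needs debugfs mounted on /sys/kernel/debug)
  blktrace -d /dev/md5 -w 300 -o md5-trace
  blkparse -i md5-trace > md5-trace.txt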

Either would be helpful; both would be greatly appreciated.

Thanks,

- Ted
From: Ted Ts'o on
On Fri, Jul 30, 2010 at 11:01:36PM +0200, Kay Diederichs wrote:
> whereas for 2.6.32.16 the result is typically
> Filesystem type is: ef53
> File size of /mnt/md5/scratch/nfs-test/tmp/xds/frames/h2g28_1_00000.cbf
> is 6229688 (1521 blocks, blocksize 4096)
>  ext logical physical  expected length flags
>    0       0 826376200            1521 eof
> /mnt/md5/scratch/nfs-test/tmp/xds/frames/h2g28_1_00000.cbf: 1 extent found

OK, so 2.6.32 is actually doing a better job laying out the files....

The blktrace will be interesting, but at this point I'm wondering if
this is a generic kernel-wide writeback regression. At $WORK we've
noticed some performance regressions between 2.6.26-based kernels and
2.6.33- and 2.6.34-based kernels with both ext2 and ext4 (in no
journal mode) that we've been trying to track down. We have a pretty
large number of patches applied to both 2.6.26 and 2.6.33/34, which is
why I haven't mentioned it up until now, but at this point it seems
pretty clear there are some writeback issues in the mainline kernel.

There are half a dozen or so patch series on LKML that are addressing
writeback in one way or another, and writeback is a major topic at the
upcoming Linux Storage and Filesystem workshop. So if this is the
cause, hopefully there will be some improvements in this area in the
near future.

- Ted
From: Henrique de Moraes Holschuh on
On Mon, 02 Aug 2010, Kay Diederichs wrote:
> Performance-wise, we tried mounting with barrier versus nobarrier (or
> barrier=1 versus barrier=0) and re-did the 2.6.32+ benchmarks. It turned
> out that the benchmark difference with and without barrier is less than
> the variation between runs (which is much higher with 2.6.32+ than with
> 2.6.27-stable), so the influence seems to be minor.

Did you check interactions with the IO scheduler?

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
From: Henrique de Moraes Holschuh on
On Mon, 02 Aug 2010, Henrique de Moraes Holschuh wrote:
> On Mon, 02 Aug 2010, Kay Diederichs wrote:
> > Performance-wise, we tried mounting with barrier versus nobarrier (or
> > barrier=1 versus barrier=0) and re-did the 2.6.32+ benchmarks. It turned
> > out that the benchmark difference with and without barrier is less than
> > the variation between runs (which is much higher with 2.6.32+ than with
> > 2.6.27-stable), so the influence seems to be minor.
>
> Did you check interactions with the IO scheduler?

Never mind, I reread your first message, and you did. I apologise for the
noise.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh