From: Kay Diederichs
Ted Ts'o wrote:
> On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote:
>> When looking at the I/O statistics while the benchmark is running, we
>> see very choppy patterns for 2.6.32, but quite smooth stats for
>> 2.6.27-stable.
>
> Could you try to do two things for me? Using (preferably from a
> recent e2fsprogs, such as 1.41.11 or 12) run filefrag -v on the files
> created from your 2.6.27 run and your 2.6.32 run?
>
> Secondly can capture blktrace results from 2.6.27 and 2.6.32? That
> would be very helpful to understand what might be going on.
>
> Either would be helpful; both would be greatly appreciated.
>
> Thanks,
>
> - Ted

Ted,

we pared down the benchmark to its last step (called "run xds_par in nfs
directory (reads 600M, and writes 50M)") because this step captures most of
the problem. Below we report the kernel messages with stack traces, and the
blktrace output that you requested.

Kernel messages: with 2.6.32.16 we observe

[ 6961.838032] INFO: task jbd2/md5-8:2010 blocked for more than 120 seconds.
[ 6961.838111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 6961.838191] jbd2/md5-8 D 00000634 0 2010 2 0x00000000
[ 6961.838200] f5171e78 00000046 231a9999 00000634 ddf91f2c f652cc4c
00000001 f5171e1c
[ 6961.838307] c0a6f140 c0a6f140 c0a6f140 c0a6a6ac f6354c20 f6354ecc
c2008140 00000000
[ 6961.838412] 00637f84 00000003 f652cc58 00000000 00000292 00000048
c20036ac f6354ecc
[ 6961.838518] Call Trace:
[ 6961.838556] [<c056c39e>] jbd2_journal_commit_transaction+0x1d9/0x1187
[ 6961.838627] [<c040220a>] ? __switch_to+0xd5/0x147
[ 6961.838681] [<c07a390a>] ? schedule+0x837/0x885
[ 6961.838734] [<c0455e5f>] ? autoremove_wake_function+0x0/0x38
[ 6961.838799] [<c0448c84>] ? try_to_del_timer_sync+0x58/0x60
[ 6961.838859] [<c0572426>] kjournald2+0xa2/0x1be
[ 6961.838909] [<c0455e5f>] ? autoremove_wake_function+0x0/0x38
[ 6961.838971] [<c0572384>] ? kjournald2+0x0/0x1be
[ 6961.839035] [<c0455c11>] kthread+0x66/0x6b
[ 6961.839089] [<c0455bab>] ? kthread+0x0/0x6b
[ 6961.839139] [<c0404167>] kernel_thread_helper+0x7/0x10
[ 6961.839215] INFO: task sync:11600 blocked for more than 120 seconds.
[ 6961.839286] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 6961.839367] sync D 00000632 0 11600 11595 0x00000000
[ 6961.839375] ddf91ea4 00000086 dca59d8b 00000632 76f570ee 00000048
00001dec 773fd796
[ 6961.839486] c0a6f140 c0a6f140 c0a6f140 c200819c f4ce0000 f4ce02ac
c2008140 00000000
[ 6961.839600] ddf91ea8 dca5c36b 00000632 dca5bb77 c2008180 773fd796
00000282 f4ce02ac
[ 6961.839727] Call Trace:
[ 6961.839762] [<c04f4de6>] bdi_sched_wait+0xd/0x11
[ 6961.841438] [<c07a3ede>] __wait_on_bit+0x3b/0x62
[ 6961.843109] [<c04f4dd9>] ? bdi_sched_wait+0x0/0x11
[ 6961.844782] [<c07a3fb5>] out_of_line_wait_on_bit+0xb0/0xb8
[ 6961.846479] [<c04f4dd9>] ? bdi_sched_wait+0x0/0x11
[ 6961.848181] [<c0455e97>] ? wake_bit_function+0x0/0x48
[ 6961.849906] [<c04f4c75>] wait_on_bit+0x25/0x31
[ 6961.851601] [<c04f4e5d>] sync_inodes_sb+0x73/0x121
[ 6961.853287] [<c04f8acc>] __sync_filesystem+0x48/0x69
[ 6961.854983] [<c04f8b72>] sync_filesystems+0x85/0xc7
[ 6961.856670] [<c04f8c04>] sys_sync+0x20/0x32
[ 6961.858363] [<c040351b>] sysenter_do_call+0x12/0x28

Blktrace: blktrace was run for 2.6.27.48, 2.6.32.16 and a patched
2.6.32.16 (called 2.6.32.16p below and in the .tar file), where the
patch just reverts "ext4: Avoid group preallocation for closed files".
This revert removes a substantial part of the regression.
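For reference, such a revert against a stable tree can be produced roughly
like this (a sketch only; the commit hash is not quoted here and has to be
looked up by its subject line):

# find the commit by its subject in the stable tree
git log --oneline v2.6.32.16 --grep='Avoid group preallocation for closed files'
# revert it on top of the checked-out tree; <commit> is the hash found above
git revert <commit>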

For 2.6.32.16p and 2.6.27.48 there are two runs each: run1 is done directly
after booting, then the directory is unexported, unmounted, mounted and
exported again, and then run2 is done. For 2.6.32.16 there is just run1;
all subsequent runs yield approximately the same results, i.e. they are as
slow as run1.

Some numbers (wallclock time, and number of lines containing flush, nfsd or
sync in the blkparse output; for 2.6.32.16 only run1 was traced):

              2.6.27.48        2.6.32.16        2.6.32.16p
              run1    run2     run1    run2     run1    run2
wallclock     113s    61s      280s    ~280s    137s    61s
flush         25362   9285     71861   -        32656   12066
nfsd          7595    8580     8685    -        8359    8444
sync          2860    3925     303     -        212     169

The total time seems to be dominated by the number of flushes.
It should be noted that all these runs used barrier=0; barrier=1 does not
have a significant effect, though.
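
For reference, counts like those above can be reproduced from the traces
roughly as follows (a sketch; the trace base names other than
md5.xds_par.2.6.32.16p_run1 are assumed):

for t in md5.xds_par.2.6.27.48_run1 md5.xds_par.2.6.32.16_run1 \
         md5.xds_par.2.6.32.16p_run1 ; do
  blkparse -i $t > blkparse.out
  # count the lines attributed to the flush, nfsd and sync threads
  echo "$t flush=$(grep -c flush blkparse.out)" \
       "nfsd=$(grep -c nfsd blkparse.out)" \
       "sync=$(grep -c sync blkparse.out)"
done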

So we find:
a) in 2.6.32.16 there is a problem which manifests itself in the kernel
messages for the jbd2/md5-8 and sync tasks shown above, and in a vastly
increased number of flush operations
b) reverting the patch "ext4: Avoid group preallocation for closed files"
cures part of the problem
c) even after reverting that patch, the first run takes much longer than
subsequent runs, despite "sync", "echo 3 > /proc/sys/vm/drop_caches", and
unmounting/re-mounting the filesystem (a sketch of the sequence follows).
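
The sequence between run1 and run2 is roughly the following (a sketch; the
export spec and mount options are assumed to match /etc/exports and
/etc/fstab):

sync
echo 3 > /proc/sys/vm/drop_caches
exportfs -u '*:/mnt/md5'   # unexport the directory
umount /mnt/md5
mount /mnt/md5             # remount with the options from /etc/fstab
exportfs -r                # re-export everything in /etc/exports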

The blktrace files are at
http://strucbio.biologie.uni-konstanz.de/~dikay/blktraces.tar.bz2 .

Should we test any other patches?

thanks,

Kay
From: Eric Sandeen
On 08/02/2010 09:52 AM, Kay Diederichs wrote:
> Dave,
>
> as you suggested, we reverted "ext4: Avoid group preallocation for
> closed files" and this indeed fixes a big part of the problem: after
> booting the NFS server we get
>
> NFS-Server: turn5 2.6.32.16p i686
> NFS-Client: turn10 2.6.18-194.8.1.el5 x86_64
>
> exported directory on the nfs-server:
> /dev/md5 /mnt/md5 ext4
> rw,seclabel,noatime,barrier=1,stripe=512,data=writeback 0 0
>
> 48 seconds for preparations
> 28 seconds to rsync 100 frames with 597M from nfs directory
> 57 seconds to rsync 100 frames with 595M to nfs directory
> 70 seconds to untar 24353 kernel files with 323M to nfs directory
> 57 seconds to rsync 24353 kernel files with 323M from nfs directory
> 133 seconds to run xds_par in nfs directory
> 425 seconds to run the script

Interesting, I had found this commit to be a problem for small files
which are constantly created & deleted; the commit had the effect of
packing the newly created files in the first free space that could be
found, rather than walking down the disk leaving potentially fragmented
freespace behind (see seekwatcher graph attached). Reverting the patch
sped things up for this test, but left the filesystem freespace in bad
shape.

But you seem to see one of the largest effects in here:

261 seconds to rsync 100 frames with 595M to nfs directory
vs
57 seconds to rsync 100 frames with 595M to nfs directory

with the patch reverted making things go faster. So you are doing 100
6MB writes to the server, correct? Is the filesystem mkfs'd fresh
before each test, or is it aged? If not mkfs'd, is it at least
completely empty prior to the test, or does data remain on it? I'm just
wondering if fragmented freespace is contributing to this behavior as
well. If there is fragmented freespace, then with the patch I think the
allocator is more likely to hunt around for small discontiguous chunks
of free space, rather than going further out in the disk looking for a
large area to allocate from.

It might be interesting to use seekwatcher on the server to visualize
the allocation/IO patterns for the test running just this far?

-Eric
From: Kay Diederichs
On 02.08.2010 18:12, Eric Sandeen wrote:
> On 08/02/2010 09:52 AM, Kay Diederichs wrote:
>> Dave,
>>
>> as you suggested, we reverted "ext4: Avoid group preallocation for
>> closed files" and this indeed fixes a big part of the problem: after
>> booting the NFS server we get
>>
>> NFS-Server: turn5 2.6.32.16p i686
>> NFS-Client: turn10 2.6.18-194.8.1.el5 x86_64
>>
>> exported directory on the nfs-server:
>> /dev/md5 /mnt/md5 ext4
>> rw,seclabel,noatime,barrier=1,stripe=512,data=writeback 0 0
>>
>> 48 seconds for preparations
>> 28 seconds to rsync 100 frames with 597M from nfs directory
>> 57 seconds to rsync 100 frames with 595M to nfs directory
>> 70 seconds to untar 24353 kernel files with 323M to nfs directory
>> 57 seconds to rsync 24353 kernel files with 323M from nfs directory
>> 133 seconds to run xds_par in nfs directory
>> 425 seconds to run the script
>
> Interesting, I had found this commit to be a problem for small files
> which are constantly created & deleted; the commit had the effect of
> packing the newly created files in the first free space that could be
> found, rather than walking down the disk leaving potentially fragmented
> freespace behind (see seekwatcher graph attached). Reverting the patch
> sped things up for this test, but left the filesystem freespace in bad
> shape.
>
> But you seem to see one of the largest effects in here:
>
> 261 seconds to rsync 100 frames with 595M to nfs directory
> vs
> 57 seconds to rsync 100 frames with 595M to nfs directory
>
> with the patch reverted making things go faster. So you are doing 100
> 6MB writes to the server, correct?

correct.

>
> Is the filesystem mkfs'd fresh
> before each test, or is it aged?

it is too big to "just create it freshly". It was actually created a
week ago and filled by a single ~10-hour rsync job run on the server, so
that the filesystem should be laid out in the most linear way possible.
Since then, the benchmarking has created and deleted lots of files.

> If not mkfs'd, is it at least
> completely empty prior to the test, or does data remain on it? I'm just

it's not empty: df -h reports
Filesystem            Size  Used Avail Use% Mounted on
/dev/md5              3.7T  2.8T  712G  80% /mnt/md5

e2freefrag-1.41.12 reports:
Device: /dev/md5
Blocksize: 4096 bytes
Total blocks: 976761344
Free blocks: 235345984 (24.1%)

Min. free extent: 4 KB
Max. free extent: 99348 KB
Avg. free extent: 1628 KB

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :          1858          1858    0.00%
    8K...   16K-  :          3415          8534    0.00%
   16K...   32K-  :          9952         54324    0.02%
   32K...   64K-  :         23884        288848    0.12%
   64K...  128K-  :         27901        658130    0.28%
  128K...  256K-  :         25761       1211519    0.51%
  256K...  512K-  :         35863       3376274    1.43%
  512K... 1024K-  :         48643       9416851    4.00%
    1M...    2M-  :        150311      60704033   25.79%
    2M...    4M-  :        244895     148283666   63.01%
    4M...    8M-  :          3970       5508499    2.34%
    8M...   16M-  :           187        551835    0.23%
   16M...   32M-  :           302       1765912    0.75%
   32M...   64M-  :           282       2727162    1.16%
   64M...  128M-  :            42        788539    0.34%


> wondering if fragmented freespace is contributing to this behavior as
> well. If there is fragmented freespace, then with the patch I think the
> allocator is more likely to hunt around for small discontiguous chunks
> of free sapce, rather than going further out in the disk looking for a
> large area to allocate from.

the last step of the benchmark, "xds_par", reads 600MB and writes 50MB.
It has 16 threads, which might put some additional pressure on the
freespace hunting. That step is also fast on 2.6.27.48 but slow on 2.6.32+.

>
> It might be interesting to use seekwatcher on the server to visualize
> the allocation/IO patterns for the test running just this far?
>
> -Eric

will try to install seekwatcher.

thanks,

Kay

From: Kay Diederichs
Eric Sandeen wrote:
> On 08/02/2010 09:52 AM, Kay Diederichs wrote:
>> Dave,
>>
>> as you suggested, we reverted "ext4: Avoid group preallocation for
>> closed files" and this indeed fixes a big part of the problem: after
>> booting the NFS server we get
>>
>> NFS-Server: turn5 2.6.32.16p i686
>> NFS-Client: turn10 2.6.18-194.8.1.el5 x86_64
>>
>> exported directory on the nfs-server:
>> /dev/md5 /mnt/md5 ext4
>> rw,seclabel,noatime,barrier=1,stripe=512,data=writeback 0 0
>>
>> 48 seconds for preparations
>> 28 seconds to rsync 100 frames with 597M from nfs directory
>> 57 seconds to rsync 100 frames with 595M to nfs directory
>> 70 seconds to untar 24353 kernel files with 323M to nfs directory
>> 57 seconds to rsync 24353 kernel files with 323M from nfs directory
>> 133 seconds to run xds_par in nfs directory
>> 425 seconds to run the script
>
> Interesting, I had found this commit to be a problem for small files
> which are constantly created & deleted; the commit had the effect of
> packing the newly created files in the first free space that could be
> found, rather than walking down the disk leaving potentially fragmented
> freespace behind (see seekwatcher graph attached). Reverting the patch
> sped things up for this test, but left the filesystem freespace in bad
> shape.
>
> But you seem to see one of the largest effects in here:
>
> 261 seconds to rsync 100 frames with 595M to nfs directory
> vs
> 57 seconds to rsync 100 frames with 595M to nfs directory
>
> with the patch reverted making things go faster. So you are doing 100
> 6MB writes to the server, correct? Is the filesystem mkfs'd fresh
> before each test, or is it aged? If not mkfs'd, is it at least
> completely empty prior to the test, or does data remain on it? I'm just
> wondering if fragmented freespace is contributing to this behavior as
> well. If there is fragmented freespace, then with the patch I think the
> allocator is more likely to hunt around for small discontiguous chunks
> of free space, rather than going further out in the disk looking for a
> large area to allocate from.
>
> It might be interesting to use seekwatcher on the server to visualize
> the allocation/IO patterns for the test running just this far?
>
> -Eric
>
>

Eric,

seekwatcher does not seem to understand the blktrace output of old
kernels, so I rolled my own primitive plotting, e.g.

blkparse -i md5.xds_par.2.6.32.16p_run1 > blkparse.out

# crude split by process name and read/write flag
# (blkparse default output: column 4 = time in s, column 8 = start sector)
grep flush blkparse.out | grep W > flush_W
grep flush blkparse.out | grep R > flush_R

grep nfsd blkparse.out | grep R > nfsd_R
grep nfsd blkparse.out | grep W > nfsd_W

grep sync blkparse.out | grep R > sync_R
grep sync blkparse.out | grep W > sync_W

# plot start sector (column 8) over time (column 4) for each class
gnuplot <<EOF
set term png
set out '2.6.32.16p_run1.png'
set key outside
set title "2.6.32.16p_run1"
plot 'nfsd_W' us 4:8, 'flush_W' us 4:8, 'sync_W' us 4:8, 'nfsd_R' us 4:8, 'flush_R' us 4:8
EOF

I attach the resulting plots for 2.6.27.48_run1 (directly after booting) and
2.6.27.48_run2 (after run1; sync; and drop_caches). They show seconds on
the x axis (horizontal) and block numbers (512-byte blocks, I suppose; the
ext4 filesystem has 976761344 4096-byte blocks, so that would be about
8e+09 512-byte blocks) on the y axis (vertical).

You'll have to do the real interpretation of the plots yourself, but
even someone who does not know exactly what the pdflush (in 2.6.27.48)
or flush (in 2.6.32+) kernel threads are supposed to do can tell that
the kernels behave _very_ differently.

In particular, with stock 2.6.32.16 the flush thread visits all of the
filesystem every time (only run1 is shown, but run2 looks the same), in
steps of 263168 blocks. I have no idea why it does this.
Roughly the first 1/3 of the filesystem is also visited by 2.6.27.48 and by
the patched 2.6.32.16 that Dave Chinner suggested, but only in the first
run after booting. Subsequent runs are fast and do not employ the flush
thread much.
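
For what it is worth, the step size of the flush writes can be read off the
blkparse output directly, e.g. from the flush_W file produced above (a
rough sketch; column 8 is the start sector):

# distribution of sector jumps between consecutive flush writes;
# the recurring large jump is the stride seen in the plot
awk '{print $8}' flush_W | awk 'NR>1 {print $1-prev} {prev=$1}' | \
  sort -n | uniq -c | sort -rn | head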

Hope this helps to pin down the regression.

thanks,

Kay
From: Kay Diederichs
On 02.08.2010 22:21, Ted Ts'o wrote:
> On Mon, Aug 02, 2010 at 05:30:03PM +0200, Kay Diederichs wrote:
>>
>> we pared down the benchmark to the last step (called "run xds_par in nfs
>> directory (reads 600M, and writes 50M)") because this captures most of
>> the problem. Here we report kernel messages with stacktrace, and the
>> blktrace output that you requested.
>
> Thanks, I'll take a look at it.
>
> Is NFS required to reproduce the problem? If you simply copy the 100
> files using rsync, or cp -r while logged onto the server, do you
> notice the performance regression?
>
> Thanks, regards,
>
> - Ted

Ted,

we've run the benchmarks internally on the file server; it turns out
that NFS is not required to reproduce the problem.

We also took the opportunity to try 2.6.32.17, which just came out.
2.6.32.17 behaves similarly to the patched 2.6.32.16 (i.e. with "ext4:
Avoid group preallocation for closed files" reverted); 2.6.32.17 carries
quite a few ext4 patches, so one or more of them seem to have an effect
similar to that revert.
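
For reference, the ext4/jbd2 changes between the two stable releases can be
listed from the stable tree along these lines:

git log --oneline v2.6.32.16..v2.6.32.17 -- fs/ext4 fs/jbd2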

These are the times for the second (and later) benchmark runs; the first
run is always slower. The last step ("run xds_par") is slower here than in
the NFS case because it is CPU-heavy (total CPU time is more than 200
seconds); the NFS client is an 8-core (+HT) Nehalem-type machine, whereas
the NFS server is just a 2-core Pentium D @ 3.40GHz.

Local machine: turn5 2.6.27.48 i686
Raid5: /dev/md5 /mnt/md5 ext4dev
rw,noatime,barrier=1,stripe=512,data=writeback 0 0
32 seconds for preparations
19 seconds to rsync 100 frames with 597M from raid5,ext4 directory
17 seconds to rsync 100 frames with 595M to raid5,ext4 directory
36 seconds to untar 24353 kernel files with 323M to raid5,ext4 directory
31 seconds to rsync 24353 kernel files with 323M from raid5,ext4 directory
267 seconds to run xds_par in raid5,ext4 directory
427 seconds to run the script

Local machine: turn5 2.6.32.16 i686 (vanilla, i.e. not patched)
Raid5: /dev/md5 /mnt/md5 ext4
rw,seclabel,noatime,barrier=0,stripe=512,data=writeback 0 0
36 seconds for preparations
18 seconds to rsync 100 frames with 597M from raid5,ext4 directory
33 seconds to rsync 100 frames with 595M to raid5,ext4 directory
68 seconds to untar 24353 kernel files with 323M to raid5,ext4 directory
40 seconds to rsync 24353 kernel files with 323M from raid5,ext4 directory
489 seconds to run xds_par in raid5,ext4 directory
714 seconds to run the script

Local machine: turn5 2.6.32.17 i686
Raid5: /dev/md5 /mnt/md5 ext4
rw,seclabel,noatime,barrier=0,stripe=512,data=writeback 0 0
38 seconds for preparations
18 seconds to rsync 100 frames with 597M from raid5,ext4 directory
33 seconds to rsync 100 frames with 595M to raid5,ext4 directory
67 seconds to untar 24353 kernel files with 323M to raid5,ext4 directory
41 seconds to rsync 24353 kernel files with 323M from raid5,ext4 directory
266 seconds to run xds_par in raid5,ext4 directory
492 seconds to run the script

So even though the patches that went into 2.6.32.17 seem to fix the worst
stalls, untarring and rsyncing kernel files is still significantly slower
on 2.6.32.17 than on 2.6.27.48.

HTH,

Kay