From: Wu Fengguang on
Dave,

Here is one more test on a big ext4 disk file:

           16k  39.7 MB/s
           32k  54.3 MB/s
           64k  63.6 MB/s
          128k  72.6 MB/s
          256k  71.7 MB/s
rsize ==> 512k  71.7 MB/s
         1024k  72.2 MB/s
         2048k  71.0 MB/s
         4096k  73.0 MB/s
         8192k  74.3 MB/s
        16384k  74.5 MB/s

It shows that >=128k client-side readahead is enough for the single-disk
case :) As for RAID configurations, I guess a large server-side readahead
should be enough.

#!/bin/sh

file=/mnt/ext4_test/zero
BDI=0:24	# bdi of the NFS mount holding $file

for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
do
	# set the client side readahead size for this bdi
	echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
	echo readahead_size=${rasize}k
	# drop the file from the page cache on both client and server
	fadvise $file 0 0 dontneed
	ssh p9 "fadvise $file 0 0 dontneed"
	dd if=$file of=/dev/null bs=4k count=402400
done
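
For reference, `fadvise` here is a small helper around posix_fadvise();
a minimal sketch of such a helper (not necessarily the exact tool used)
is:

/* fadvise.c -- hypothetical minimal version of the fadvise helper used
 * above: drop a file's cached pages via posix_fadvise(POSIX_FADV_DONTNEED).
 * Usage: fadvise FILE OFFSET LEN dontneed   (offset=0 len=0 = whole file)
 */
#define _POSIX_C_SOURCE 200112L	/* for posix_fadvise() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd, err;

	if (argc != 5 || strcmp(argv[4], "dontneed") != 0) {
		fprintf(stderr, "usage: %s FILE OFFSET LEN dontneed\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* ask the kernel to drop cached pages in [OFFSET, OFFSET+LEN) */
	err = posix_fadvise(fd, atoll(argv[2]), atoll(argv[3]),
			    POSIX_FADV_DONTNEED);
	if (err) {
		fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
		return 1;
	}
	close(fd);
	return 0;
}

Whatever the real tool does, the point of those two script lines is just
to evict the file from the page cache on both the client and the server,
so every iteration measures cold-cache reads.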

Thanks,
Fengguang

On Fri, Feb 26, 2010 at 03:49:16PM +0800, Wu Fengguang wrote:
> On Wed, Feb 24, 2010 at 03:39:40PM +0800, Dave Chinner wrote:
> > On Wed, Feb 24, 2010 at 02:12:47PM +0800, Wu Fengguang wrote:
> > > On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote:
> > > > What I'm trying to say is that while I agree with your premise that
> > > > a 7.8MB readahead window is probably far larger than was ever
> > > > intended, I disagree with your methodology and environment for
> > > > selecting a better default value. The default readahead value needs
> > > > to work well in as many situations as possible, not just in perfect
> > > > 1:1 client/server environment.
> > >
> > > Good points. It's imprudent to change a default value based on one
> > > single benchmark. Need to collect more data, which may take time..
> >
> > Agreed - better to spend time now to get it right...
>
> I collected more data with large network latency as well as rsize=32k,
> and updated the readahead size to 4*rsize accordingly.
>
> ===
> nfs: use 4*rsize readahead size
>
> With the default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
> readahead size 512k*15=7680k is far larger than necessary for typical
> clients.
>
> On an e1000e--e1000e connection, I got the following numbers
> (this reads a sparse file from the server and involves no disk IO)
>
> readahead size  normal      1ms+1ms     5ms+5ms     10ms+10ms(*)
>            16k  35.5 MB/s    4.8 MB/s    2.1 MB/s    1.2 MB/s
>            32k  54.3 MB/s    6.7 MB/s    3.6 MB/s    2.3 MB/s
>            64k  64.1 MB/s   12.6 MB/s    6.5 MB/s    4.7 MB/s
>           128k  70.5 MB/s   20.1 MB/s   11.9 MB/s    8.7 MB/s
>           256k  74.6 MB/s   38.6 MB/s   21.3 MB/s   15.0 MB/s
> rsize ==> 512k  77.4 MB/s   59.4 MB/s   39.8 MB/s   25.5 MB/s
>          1024k  85.5 MB/s   77.9 MB/s   65.7 MB/s   43.0 MB/s
>          2048k  86.8 MB/s   81.5 MB/s   84.1 MB/s   59.7 MB/s
>          4096k  87.9 MB/s   77.4 MB/s   56.2 MB/s   59.2 MB/s
>          8192k  89.0 MB/s   81.2 MB/s   78.0 MB/s   41.2 MB/s
>         16384k  87.7 MB/s   85.8 MB/s   62.0 MB/s   56.5 MB/s
>
> readahead size  normal      1ms+1ms     5ms+5ms     10ms+10ms(*)
>            16k  37.2 MB/s    6.4 MB/s    2.1 MB/s    1.2 MB/s
> rsize ==>  32k  56.6 MB/s    6.8 MB/s    3.6 MB/s    2.3 MB/s
>            64k  66.1 MB/s   12.7 MB/s    6.6 MB/s    4.7 MB/s
>           128k  69.3 MB/s   22.0 MB/s   12.2 MB/s    8.9 MB/s
>           256k  69.6 MB/s   41.8 MB/s   20.7 MB/s   14.7 MB/s
>           512k  71.3 MB/s   54.1 MB/s   25.0 MB/s   16.9 MB/s
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>          1024k  71.5 MB/s   48.4 MB/s   26.0 MB/s   16.7 MB/s
>          2048k  71.7 MB/s   53.2 MB/s   25.3 MB/s   17.6 MB/s
>          4096k  71.5 MB/s   50.4 MB/s   25.7 MB/s   17.1 MB/s
>          8192k  71.1 MB/s   52.3 MB/s   26.3 MB/s   16.9 MB/s
>         16384k  70.2 MB/s   56.6 MB/s   27.0 MB/s   16.8 MB/s
>
> (*) 10ms+10ms means adding delay on both the client & server sides with
>     # /sbin/tc qdisc change dev eth0 root netem delay 10ms
>     The total >=20ms delay is so large for NFS that a simple `vi some.sh`
>     command takes a dozen seconds. Note that the actual delay reported
>     by ping is larger, e.g. for the 1ms+1ms case:
>         rtt min/avg/max/mdev = 7.361/8.325/9.710/0.837 ms
>
>
> So it seems that readahead_size=4*rsize (i.e. keeping 4 RPC requests in
> flight) is able to get near full NFS bandwidth. Reducing the multiple
> from 15 to 4 not only makes the client side readahead size more sane
> (2MB by default), but also reduces the disorder of the server side
> RPC read requests, which yields better server side readahead behavior.
>
> To avoid small readahead when the client mounts with "-o rsize=32k" or
> the server only supports rsize <= 32k, we take the max of 4*rsize and
> default_backing_dev_info.ra_pages. The latter defaults to 512K, and can
> be explicitly changed by the user with the kernel parameter "readahead="
> or the runtime tunable "/sys/devices/virtual/bdi/default/read_ahead_kb"
> (which takes effect for future NFS mounts).
>
> The test script is:
>
> #!/bin/sh
>
> file=/mnt/sparse
> BDI=0:15
>
> for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
> do
> 	echo 3 > /proc/sys/vm/drop_caches
> 	echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
> 	echo readahead_size=${rasize}k
> 	dd if=$file of=/dev/null bs=4k count=1024000
> done
>
> CC: Dave Chinner <david(a)fromorbit.com>
> CC: Trond Myklebust <Trond.Myklebust(a)netapp.com>
> Signed-off-by: Wu Fengguang <fengguang.wu(a)intel.com>
> ---
> fs/nfs/client.c | 4 +++-
> fs/nfs/internal.h | 8 --------
> 2 files changed, 3 insertions(+), 9 deletions(-)
>
> --- linux.orig/fs/nfs/client.c 2010-02-26 10:10:46.000000000 +0800
> +++ linux/fs/nfs/client.c 2010-02-26 11:07:22.000000000 +0800
> @@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct
> server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>
> server->backing_dev_info.name = "nfs";
> - server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
> + server->backing_dev_info.ra_pages = max_t(unsigned long,
> + default_backing_dev_info.ra_pages,
> + 4 * server->rpages);
> server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE;
>
> if (server->wsize > max_rpc_payload)
> --- linux.orig/fs/nfs/internal.h 2010-02-26 10:10:46.000000000 +0800
> +++ linux/fs/nfs/internal.h 2010-02-26 11:07:07.000000000 +0800
> @@ -10,14 +10,6 @@
>
> struct nfs_string;
>
> -/* Maximum number of readahead requests
> - * FIXME: this should really be a sysctl so that users may tune it to suit
> - * their needs. People that do NFS over a slow network, might for
> - * instance want to reduce it to something closer to 1 for improved
> - * interactive response.
> - */
> -#define NFS_MAX_READAHEAD (RPC_DEF_SLOT_TABLE - 1)
> -
> /*
> * Determine if sessions are in use.
> */
From: Trond Myklebust on
On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote:
> Dave,
>
> Here is one more test on a big ext4 disk file:
>
> 16k 39.7 MB/s
> 32k 54.3 MB/s
> 64k 63.6 MB/s
> 128k 72.6 MB/s
> 256k 71.7 MB/s
> rsize ==> 512k 71.7 MB/s
> 1024k 72.2 MB/s
> 2048k 71.0 MB/s
> 4096k 73.0 MB/s
> 8192k 74.3 MB/s
> 16384k 74.5 MB/s
>
> It shows that >=128k client side readahead is enough for single disk
> case :) As for RAID configurations, I guess big server side readahead
> should be enough.

There are lots of people who would like to use NFS on their company WAN,
where you typically have high bandwidth (up to 10GigE), but often high
latency too (due to geographical dispersion).
My ping latency from here to a typical server in NetApp's Bangalore
office is ~ 312ms. I read your test results with 10ms delays, but have
you tested with higher than that?
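
(For scale: the bandwidth-delay product at those numbers, i.e. the amount
of data that must be in flight just to keep the pipe full, is easy to work
out; the sketch below simply plugs in the 10GigE and 312ms figures above
as illustrative inputs.)

/* bdp.c -- back-of-the-envelope bandwidth-delay product.  The 10 Gbit/s
 * and 312 ms numbers are only the figures quoted in this thread, used
 * here for illustration.
 */
#include <stdio.h>

int main(void)
{
	double bw_bps = 10e9;	/* link bandwidth, bits per second */
	double rtt_s  = 0.312;	/* round-trip time, seconds */

	/* bytes that must be in flight to keep the link busy */
	double bdp = bw_bps / 8.0 * rtt_s;

	printf("bandwidth-delay product: %.0f MB\n", bdp / 1e6);	/* ~390 MB */
	return 0;
}

That is far beyond both the old 7.8MB readahead window and the proposed
4*rsize default, which suggests the TCP window (and not just readahead)
becomes the limiting factor on such links.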

Cheers
Trond
From: John Stoffel on
>>>>> "Trond" == Trond Myklebust <Trond.Myklebust(a)netapp.com> writes:

Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote:
>> Dave,
>>
>> Here is one more test on a big ext4 disk file:
>>
>> 16k 39.7 MB/s
>> 32k 54.3 MB/s
>> 64k 63.6 MB/s
>> 128k 72.6 MB/s
>> 256k 71.7 MB/s
>> rsize ==> 512k 71.7 MB/s
>> 1024k 72.2 MB/s
>> 2048k 71.0 MB/s
>> 4096k 73.0 MB/s
>> 8192k 74.3 MB/s
>> 16384k 74.5 MB/s
>>
>> It shows that >=128k client side readahead is enough for single disk
>> case :) As for RAID configurations, I guess big server side readahead
>> should be enough.

Trond> There are lots of people who would like to use NFS on their
Trond> company WAN, where you typically have high bandwidths (up to
Trond> 10GigE), but often a high latency too (due to geographical
Trond> dispersion). My ping latency from here to a typical server in
Trond> NetApp's Bangalore office is ~ 312ms. I read your test results
Trond> with 10ms delays, but have you tested with higher than that?

If you have that high a latency, the low level TCP protocol is going
to kill your performance before you get to the NFS level. You really
need to open up the TCP window size at that point. And it only gets
worse as the bandwidth goes up too.

There's no good solution, because while you can get good throughput at
points, latency is going to suffer no matter what.

John
From: Trond Myklebust on
On Tue, 2010-03-02 at 12:33 -0500, John Stoffel wrote:
> >>>>> "Trond" == Trond Myklebust <Trond.Myklebust(a)netapp.com> writes:
>
> Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote:
> >> Dave,
> >>
> >> Here is one more test on a big ext4 disk file:
> >>
> >> 16k 39.7 MB/s
> >> 32k 54.3 MB/s
> >> 64k 63.6 MB/s
> >> 128k 72.6 MB/s
> >> 256k 71.7 MB/s
> >> rsize ==> 512k 71.7 MB/s
> >> 1024k 72.2 MB/s
> >> 2048k 71.0 MB/s
> >> 4096k 73.0 MB/s
> >> 8192k 74.3 MB/s
> >> 16384k 74.5 MB/s
> >>
> >> It shows that >=128k client side readahead is enough for single disk
> >> case :) As for RAID configurations, I guess big server side readahead
> >> should be enough.
>
> Trond> There are lots of people who would like to use NFS on their
> Trond> company WAN, where you typically have high bandwidths (up to
> Trond> 10GigE), but often a high latency too (due to geographical
> Trond> dispersion). My ping latency from here to a typical server in
> Trond> NetApp's Bangalore office is ~ 312ms. I read your test results
> Trond> with 10ms delays, but have you tested with higher than that?
>
> If you have that high a latency, the low level TCP protocol is going
> to kill your performance before you get to the NFS level. You really
> need to open up the TCP window size at that point. And it only gets
> worse as the bandwidth goes up too.

Yes. You need to open the TCP window in addition to reading ahead
aggressively.

> There's no good solution, because while you can get good throughput at
> points, latency is going to suffer no matter what.

It depends upon your workload. Sequential read and write should still be
doable if you have aggressive readahead and allow lots of parallel
write RPCs.

Cheers
Trond
From: Bret Towe on
On Mon, Mar 1, 2010 at 7:10 PM, Wu Fengguang <fengguang.wu(a)intel.com> wrote:
> Dave,
>
> Here is one more test on a big ext4 disk file:
>
>            16k  39.7 MB/s
>            32k  54.3 MB/s
>            64k  63.6 MB/s
>           128k  72.6 MB/s
>           256k  71.7 MB/s
> rsize ==> 512k  71.7 MB/s
>          1024k  72.2 MB/s
>          2048k  71.0 MB/s
>          4096k  73.0 MB/s
>          8192k  74.3 MB/s
>         16384k  74.5 MB/s
>
> It shows that >=128k client side readahead is enough for single disk
> case :) As for RAID configurations, I guess big server side readahead
> should be enough.
>
> #!/bin/sh
>
> file=/mnt/ext4_test/zero
> BDI=0:24
>
> for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
> do
> 	echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
> 	echo readahead_size=${rasize}k
> 	fadvise $file 0 0 dontneed
> 	ssh p9 "fadvise $file 0 0 dontneed"
> 	dd if=$file of=/dev/null bs=4k count=402400
> done

How do you determine which bdi to use? I skimmed through the filesystem
in /sys and didn't see anything that says which is which.
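
(One way to look it up, assuming the NFS bdi is registered under the
filesystem's anonymous device number, which is what the "0:24"-style
names above suggest: stat the mount point and print the major:minor of
st_dev.  A minimal sketch:)

/* whichbdi.c -- print the "major:minor" device number of the filesystem
 * a path lives on; this is assumed to match the directory name under
 * /sys/devices/virtual/bdi/ (e.g. the "0:24" used in the script above).
 */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(int argc, char **argv)
{
	struct stat st;

	if (argc != 2) {
		fprintf(stderr, "usage: %s MOUNTPOINT\n", argv[0]);
		return 1;
	}
	if (stat(argv[1], &st)) {
		perror("stat");
		return 1;
	}
	printf("%u:%u\n", major(st.st_dev), minor(st.st_dev));
	return 0;
}

Running it on the NFS mount point should print the id to plug into
/sys/devices/virtual/bdi/<id>/read_ahead_kb.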
