From: Miklos Szeredi on
Hi Jens,

I'm chasing a performance bottleneck identified by tiobench that seems
to be caused by CFQ. On a SLES10-SP3 kernel (2.6.16, with some patches
moving cfq closer to 2.6.17) tiobench with 8 threads gets about 260MB/s
sequential read throughput. On recent kernels (including vanilla
2.6.34-rc) it manages about 145MB/s, a regression of about 45%. The queue and
readahead parameters are the same.

This goes back some time; 2.6.27 already seems to perform badly.

Changing the scheduler to noop brings the throughput back into the
260MB/s range, so this is not a driver issue.

Increasing both quantum *and* readahead also improves the throughput,
though not by as much. However, both noop and these tweaks somewhat
decrease the write throughput...
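
For reference, the scheduler change and the tweaks above are just the
usual sysfs knobs; the values below are only illustrative, not the
exact ones used:

  echo noop > /sys/block/sdX/queue/scheduler        # switch to noop
  echo cfq  > /sys/block/sdX/queue/scheduler        # back to cfq
  echo 16   > /sys/block/sdX/queue/iosched/quantum  # larger dispatch quantum
  echo 1024 > /sys/block/sdX/queue/read_ahead_kb    # larger readahead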

Apparently on recent kernels the number of dispatched requests stays
mostly at or below 4 and the dispatched sector count at or below 2000,
which is not enough to fill the bandwidth on this setup.

On 2.6.16 the number of dispatched requests hovers around 22 and the
sector count around 16000.
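
For anyone wanting to watch these numbers while the test runs, iostat
gives a rough view (sdX stands in for the test device; on kernels that
have it, /sys/block/sdX/inflight shows the raw in-flight count):

  # avgqu-sz is the average number of outstanding requests,
  # avgrq-sz the average request size in sectors
  iostat -x sdX 1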

I uploaded blktraces for the read part of the tiobench runs for both
2.6.16 and 2.6.32:

http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/
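
For reproduction, the usual way to capture and post-process such traces
is roughly the following (output names are just examples):

  blktrace -d /dev/sdX -o tiobench-read          # capture during the read phase
  blkparse -i tiobench-read -d tiobench-read.bin # convert to binary for btt
  btt -i tiobench-read.bin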

Do you have any idea about the cause of this regression?

Thanks,
Miklos

From: Chris on
On Fri, Apr 16, 2010 at 02:27:58PM +0200, Miklos Szeredi wrote:
> Hi Jens,
>
> I'm chasing a performance bottleneck identified by tiobench that seems
> to be caused by CFQ. On a SLES10-SP3 kernel (2.6.16, with some patches
> moving cfq closer to 2.6.17) tiobench with 8 threads gets about 260MB/s
> sequential read throughput. On recent kernels (including vanilla
> 2.6.34-rc) it manages about 145MB/s, a regression of about 45%. The queue and
> readahead parameters are the same.

I've also just noticed this with the most recent Red Hat kernels. Writes
don't seem to be affected at all. If the latest Red Hat kernels mean
anything in this context, I might as well show what I've got, in case
there is something in common.

../disktest -B 4k -h 1 -I BD -K 32 -p l -P T -T 300 -r /dev/sdf

With cfq we get this:
STAT | 17260 | v1.4.2 | /dev/sdf | Heartbeat read throughput: 15032320.0B/s (14.34MB/s), IOPS 3670.0/s
And with noop we get this:
STAT | 17260 | v1.4.2 | /dev/sdf | Heartbeat read throughput: 111759360.0B/s (106.58MB/s), IOPS 27285.0/s.
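
The invocation was the same in both cases, with only the scheduler
flipped between runs, i.e. something like:

  echo cfq > /sys/block/sdf/queue/scheduler
  ../disktest -B 4k -h 1 -I BD -K 32 -p l -P T -T 300 -r /dev/sdf
  echo noop > /sys/block/sdf/queue/scheduler
  ../disktest -B 4k -h 1 -I BD -K 32 -p l -P T -T 300 -r /dev/sdf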

Out of curiosity I also switched some very large and busy web servers
to noop, which reduced the average I/O time and dropped the load.

Chris
From: Corrado Zoccolo on
Hi Miklos,
I don't think this is related to CFQ. I've made a graph of the
accessed (read) sectors (see attached).
You can see that the green cloud (2.6.16) is much more concentrated,
while the red one (2.6.32) is split in two, and you can better
recognize the different lines.
This means that the FS put more distance between the blocks of the
files written by the tio threads, and the read time is therefore
impacted, since the disk head has to perform longer seeks. On the
other hand, if you read those files sequentially with a single thread,
the performance may be better with the new layout, so YMMV. When
testing 2.6.32 and up, you should consider testing also with
low_latency setting disabled, since tuning for latency can negatively
affect throughput.
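
That is just a quick sysfs toggle, for example:

  echo 0 > /sys/block/sdX/queue/iosched/low_latency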

Thanks,
Corrado

On Fri, Apr 16, 2010 at 2:27 PM, Miklos Szeredi <mszeredi(a)suse.cz> wrote:
> [original report quoted in full; snipped]



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:czoccolo(a)gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda
From: Miklos Szeredi on
Hi Corrado,

On Sat, 2010-04-17 at 14:46 +0200, Corrado Zoccolo wrote:
> I don't think this is related to CFQ. I've made a graph of the
> accessed (read) sectors (see attached).
> You can see that the green cloud (2.6.16) is much more concentrated,
> while the red one (2.6.32) is split in two, and you can better
> recognize the different lines.
> This means that the FS put more distance between the blocks of the
> files written by the tio threads, and the read time is therefore
> impacted, since the disk head has to perform longer seeks. On the
> other hand, if you read those files sequentially with a single thread,
> the performance may be better with the new layout, so YMMV. When
> testing 2.6.32 and up, you should consider testing also with
> low_latency setting disabled, since tuning for latency can negatively
> affect throughput.

low_latency is set to zero in all tests.

The layout difference doesn't explain why setting the scheduler to
"noop" consistently almost doubles the read throughput in 8-thread
tiobench. This fact alone pretty clearly indicates that the I/O
scheduler is the culprit.

There are other indications; see the attached btt output for both
traces. From there it appears that 2.6.16 does more and longer seeks,
yet gets better overall performance.

I've also tested with plain "dd" instead of tiobench where the
filesystem layout stayed exactly the same between tests. Still the
speed difference is there.
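
The dd runs were roughly along these lines (several parallel sequential
readers over the previously written files; file names, reader count and
block size here are only illustrative):

  for i in 0 1 2 3 4 5 6 7; do
      dd if=/mnt/test/file$i of=/dev/null bs=1M &
  done
  wait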

Thanks,
Miklos

************************************************************************
btt output for 2.6.16:
==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000047   0.000854417   1.003550405       67465
Q2G               0.000000458   0.000001211   0.000123527       46062
G2I               0.000000123   0.000001815   0.000494517       46074
Q2M               0.000000186   0.000001798   0.000010296       21404
I2D               0.000000162   0.000158001   0.040794333       46062
M2D               0.000000878   0.000133130   0.040585566       21404
D2C               0.000053870   0.023778266   0.234154543       67466
Q2C               0.000056746   0.023931014   0.234176000       67466

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 64) |   0.0035%   0.0052%   0.0024%   0.4508%  99.3617%
---------- | --------- --------- --------- --------- ---------
   Overall |   0.0035%   0.0052%   0.0024%   0.4508%  99.3617%

==================== Device Merge Information ====================

       DEV |       #Q       #D   Ratio |   BLKmin   BLKavg   BLKmax    Total
---------- | -------- -------- ------- | -------- -------- -------- --------
 (  8, 64) |    67466    46062     1.5 |        8      597     1024 27543688

==================== Device Q2Q Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE
---------- | --------------- --------------- --------------- | ---------------
 (  8, 64) |           67466        866834.0               0 | 0(34558)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE
   Average |           67466        866834.0               0 | 0(34558)

==================== Device D2D Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE
---------- | --------------- --------------- --------------- | ---------------
 (  8, 64) |           46062       1265503.9               0 | 0(13242)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE
   Average |           46062       1265503.9               0 | 0(13242)

==================== Plug Information ====================

       DEV |    # Plugs # Timer Us  | % Time Q Plugged
---------- | ---------- ----------  | ----------------
 (  8, 64) |      29271(       533) |   3.878105328%

       DEV |    IOs/Unp   IOs/Unp(to)
---------- | ----------   ----------
 (  8, 64) |       19.2         19.7
---------- | ----------   ----------
   Overall |    IOs/Unp   IOs/Unp(to)
   Average |       19.2         19.7

==================== Active Requests At Q Information ====================

       DEV |  Avg Reqs @ Q
---------- | -------------
 (  8, 64) |           0.8

==================== I/O Active Period Information ====================

       DEV |     # Live      Avg. Act     Avg. !Act % Live
---------- | ---------- ------------- ------------- ------
 (  8, 64) |        545   0.100012237   0.005766640  94.56
---------- | ---------- ------------- ------------- ------
 Total Sys |        545   0.100012237   0.005766640  94.56

************************************************************************
btt output for 2.6.32:

==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000279   0.001710581   1.803934429       69429
Q2G               0.000000908   0.001798735  23.144764798       54940
S2G               0.022460311   6.581680621  23.144763751          15
G2I               0.000000628   0.000001576   0.000120409       54942
Q2M               0.000000628   0.000001611   0.000013201       14490
I2D               0.000000768   0.289812035  86.820205789       54940
M2D               0.000005518   0.098208187   0.794441158       14490
D2C               0.000173141   0.008056256   0.219516385       69430
Q2C               0.000179077   0.259305605  86.820559403       69430

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 64) |   0.5489%   0.0005%   0.0001%  88.4394%   3.1069%
---------- | --------- --------- --------- --------- ---------
   Overall |   0.5489%   0.0005%   0.0001%  88.4394%   3.1069%

==================== Device Merge Information ====================

       DEV |       #Q       #D   Ratio |   BLKmin   BLKavg   BLKmax    Total
---------- | -------- -------- ------- | -------- -------- -------- --------
 (  8, 64) |    69430    54955     1.3 |        8      520     2048 28614984

==================== Device Q2Q Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE
---------- | --------------- --------------- --------------- | ---------------
 (  8, 64) |           69430        546377.3               0 | 0(50235)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE
   Average |           69430        546377.3               0 | 0(50235)

==================== Device D2D Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE
---------- | --------------- --------------- --------------- | ---------------
 (  8, 64) |           54955        565286.3               0 | 0(37535)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE
   Average |           54955        565286.3               0 | 0(37535)

==================== Plug Information ====================

       DEV |    # Plugs # Timer Us  | % Time Q Plugged
---------- | ---------- ----------  | ----------------
 (  8, 64) |       2310(         0) |   0.049353257%

       DEV |    IOs/Unp   IOs/Unp(to)
---------- | ----------   ----------
 (  8, 64) |        7.3          0.0
---------- | ----------   ----------
   Overall |    IOs/Unp   IOs/Unp(to)
   Average |        7.3          0.0

==================== Active Requests At Q Information ====================

       DEV |  Avg Reqs @ Q
---------- | -------------
 (  8, 64) |         132.8

==================== I/O Active Period Information ====================

       DEV |     # Live      Avg. Act     Avg. !Act % Live
---------- | ---------- ------------- ------------- ------
 (  8, 64) |       4835   0.023848998   0.000714665  97.09
---------- | ---------- ------------- ------------- ------
 Total Sys |       4835   0.023848998   0.000714665  97.09



From: Corrado Zoccolo on
On Mon, Apr 19, 2010 at 1:46 PM, Miklos Szeredi <mszeredi(a)suse.cz> wrote:
> Hi Corrado,
>
> On Sat, 2010-04-17 at 14:46 +0200, Corrado Zoccolo wrote:
>> I don't think this is related to CFQ. I've made a graph of the
>> accessed (read) sectors (see attached).
>> You can see that the green cloud (2.6.16) is much more concentrated,
>> while the red one (2.6.32) is split in two, and you can better
>> recognize the different lines.
>> This means that the FS put more distance between the blocks of the
>> files written by the tio threads, and the read time is therefore
>> impacted, since the disk head has to perform longer seeks. On the
>> other hand, if you read those files sequentially with a single thread,
>> the performance may be better with the new layout, so YMMV. When
>> testing 2.6.32 and up, you should consider testing also with
>> low_latency setting disabled, since tuning for latency can negatively
>> affect throughput.
Hi Miklos,
can you give more information about the setup?
How much memory do you have, what is the disk configuration (is this a
hw raid?) and so on.
>
> low_latency is set to zero in all tests.
>
> The layout difference doesn't explain why setting the scheduler to
> "noop" consistently almost doubles the read throughput in 8-thread
> tiobench.  This fact alone pretty clearly indicates that the I/O
> scheduler is the culprit.
From the attached btt output, I see that a lot of time is spent
waiting to allocate new request structures.
> S2G 0.022460311 6.581680621 23.144763751 15
Since noop doesn't attach fancy data to each request, it can save
those allocations, thus resulting in no sleeps.
The allocation delays, though, may not be entirely attributable to the
I/O scheduler, and running under memory pressure will make them worse.

> There are other indications; see the attached btt output for both
> traces.  From there it appears that 2.6.16 does more and longer seeks,
> yet gets better overall performance.
I see fewer seeks for 2.6.16, but longer ones on average.
It seems that 2.6.16 allows more requests from the same process to be
streamed to disk before switching to another process.
Since the timeslice is the same, it might be that we are limiting the
number of requests per queue due to memory congestion.
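
A quick sanity check for that would be to look at (and possibly raise)
the request pool and the per-queue dispatch limit; the value below is
only an example:

  cat /sys/block/sdX/queue/nr_requests          # size of the request pool
  echo 512 > /sys/block/sdX/queue/nr_requests   # try a larger pool
  cat /sys/block/sdX/queue/iosched/quantum      # cfq per-queue dispatch limit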

> I've also tested with plain "dd" instead of tiobench where the
> filesystem layout stayed exactly the same between tests.  Still the
> speed difference is there.
Does dropping caches before the read test change the situation?
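
That is, something like this between the write and the read phase:

  sync
  echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes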

Thanks,
Corrado
> [rest of quoted message, including the btt output, snipped]