From: Mel Gorman
On Mon, Dec 07, 2009 at 03:39:49PM +0100, Christian Ehrhardt wrote:
> Hi,
> I tracked a huge performance regression for a while and got it bisected
> down to commit "e084b2d95e48b31aa45f9c49ffc6cdae8bdb21d4 -
> page-allocator: preserve PFN ordering when __GFP_COLD is set".
>

Darn. That is related to IO controllers being able to automatically merge
requests. The problem it was fixing was that pages were arriving in reverse
PFN order, so the controller was unable to merge requests and performance
was impaired. Any controller that can merge should be faster as a result of
the patch.
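
For reference, the change itself is small. Paraphrasing from memory (a
sketch, not a verbatim copy of the commit), rmqueue_bulk() now places cold
pages at the tail of the per-cpu list so they are handed out in ascending
PFN order:

	/* mm/page_alloc.c, rmqueue_bulk() - paraphrased from commit e084b */
	if (likely(cold == 0))
		list_add(&page->lru, list);		/* hot: LIFO, reuse the cache-hot page first */
	else
		list_add_tail(&page->lru, list);	/* cold: FIFO, preserves ascending PFN order */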

> The scenario I'm running is a low memory system (256M total), that does
> sequential I/O with parallel iozone processes.
> One process per disk, each process reading a 2Gb file. The disks I use
> are fcp scsi disks attached to a s390 host. File system is ext2.
>

I don't know what controller is in use there but does it
opportunistically merge requests if they are on physically contiguous
pages? If so, can it be disabled?

> The regression appears as up to 76% loss in throughput at 16 processes
> (processes are scaled from 1 to 64, performance is bad everywhere - 16
> is just the peak - avg loss is about 40% throughput).
> I already know that giving the system just a bit (~64M+) more memory
> solves the issue almost completely, probably because there is almost no
> "memory pressure" left in that case.
> I also know that using direct-I/O instead of I/O through page cache
> doesn't have the problem at all.

This makes sense because it's a sequential read load, so readahead is a
factor and that is why __GFP_COLD is used - the data is not for immediate
use, so it doesn't need to be cache hot.
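
For context, the readahead path asks for cache-cold pages explicitly; a
sketch from memory of the 2.6.31 sources (not a verbatim copy):

	/* include/linux/pagemap.h - used by __do_page_cache_readahead() */
	static inline struct page *page_cache_alloc_cold(struct address_space *x)
	{
		return __page_cache_alloc(mapping_gfp_mask(x) | __GFP_COLD);
	}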

> Comparing sysstat data taken while running the kernels with & without
> the bisected patch shows nothing obvious except that I/O seems to take
> much longer (lower interrupt ratio etc.).
>

Maybe the controller is spending an age trying to merge requests - it can
see that they are mergeable but takes a long time figuring out how?

> The patch alone looks very reasonable, so I'd prefer understanding and
> fixing the real issue instead of getting it eventually reverted due to
> this regression being larger than the one it was intended to fix.
> In the patch it is clear that hot pages (cold==0) freed in rmqueue_bulk
> should behave like before. So maybe the question is "are our pages cold
> while they shouldn't be"?
> Well I don't really have a clue yet to explain how patch e084b exactly
> causes that big regression, ideas welcome :-)
>

The only theory I have at the moment is that the controller notices it can
merge requests and either spends a long time figuring out how to do the
merging or performs worse with merged IO requests.

If the problem is in the driver, oprofile might show where the problem lies.

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
From: Mel Gorman
On Thu, Dec 10, 2009 at 03:36:04PM +0100, Christian Ehrhardt wrote:
> Keeping the old discussion in the mail tail, adding the new information
> up here where everyone finds it :-)
>
> Things I was able to confirm so far summarized:
> - The controller doesn't care about pfn ordering in any way (proved by
> HW statistics)
> - regression appears in sequential AND random workloads -> also without
> readahead
> - oprofile & co are no option atm.
> The effective consumed cpu cycles per transferred kb are almost the
> same, so I would not expect sampling to give us huge insights.
> Therefore I expect that it is more a matter of lost time (latency) than
> of more expensive tasks (cpu consumption)

But earlier, you said that latency was lower - "latency statistics clearly
state that your patch is working as intended - the latency from entering
the controller until the interrupt to linux device driver is ~30% lower!"

Also, if the controller is doing no merging of IO requests, why is the
interrupt rate lower?

> I don't want to preclude it completely, but sampling has to wait as
> long as we have better tracks to follow
>
> So the question is where time is lost in Linux. I used blktrace to
> create latency summaries.
> I only list the random case for discussion, as the effects are clearer
> in that data.
> Abbreviations are (like the blkparse man page explains) - sorted in
> order it would appear per request:
> A -- remap For stacked devices, incoming i/o is remapped to device
> below it in the i/o stack. The remap action details what exactly is being
> remapped to what.
> G -- get request To send any type of request to a block device, a
> struct request container must be allocated first.
> I -- inserted A request is being sent to the i/o scheduler for
> addition to the internal queue and later service by the driver. The
> request is fully formed at this time.
> D -- issued A request that previously resided on the block layer
> queue or in the i/o scheduler has been sent to the driver.
> C -- complete A previously issued request has been completed. The
> output will detail the sector and size of that request, as well as the
> success or failure of it.
>
> The following table shows the average latencies from A to G, G to I and
> so on.
> C2A is special and tries to summarize how long it takes after completing
> an I/O until the next one arrives in the block device layer.
>
>                      avg-A2G  avg-G2I  avg-I2D  avg-D2C  avg-C2A-in-avg+-stddev  %C2A-in-avg+-stddev
> deviation good->bad   -3.48%   -0.56%   -1.57%   -1.31%                  128.69%               97.26%
>
> It clearly shows that all latencies, once the block device layer and
> device driver are involved, are almost equal. Remember that the
> throughput of the good vs. bad case differs by more than 3x.
> But we can also see that the value of C2A increases by a huge amount.
> That huge C2A increase leads me to assume that the time is actually lost
> "above" the block device layer.
>

To be clear: as C is "completion" and A is the remapping of new incoming
IO, this implies that time is being lost between when one IO completes and
the next one starts, right?
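
(For anyone wanting to reproduce the numbers: a capture roughly like the
following yields the per-request A/G/I/D/C events summarised above. The
device name and duration are examples, not necessarily what was used here:

	blktrace -d /dev/sdX -w 60 -o iotrace	# record block-layer events for 60 seconds
	blkparse -i iotrace			# decode the A/G/I/D/C events per request
)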

> I don't expect the execution speed of iozone as user process itself is
> affected by commit e084b,

Not by this much anyway. Let's say cache hotness is a problem; I would
expect some performance loss, but not this much.

> so the question is where the time is lost
> between the "read" issued by iozone and entering the block device layer.
> Actually I expect it somewhere in the area of getting a page cache page
> for the I/O. On the one hand, page handling is what commit e084b changes,
> and on the other hand, pages are under pressure (sysstat vm effectiveness
> ~100%, >40% scanned directly in both cases).
>
> I'll continue hunting down the lost time - maybe with ftrace, if it does
> not conceal the effect by its invasiveness - any further ideas/comments
> are welcome.
>

One way of finding out if cache hotness was the problem would be to profile
for cache misses and see if there are massive differences with and without
the patch. Is that an option?
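
Something along these lines would do, assuming the hardware and kernel
expose cache events to perf (they may not on s390, in which case another
sampling facility is needed); the iozone invocation is a placeholder:

	perf stat -e cache-references -e cache-misses <iozone command>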

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
From: Mel Gorman
On Fri, Dec 18, 2009 at 02:38:15PM +0100, Christian Ehrhardt wrote:
> Christian Ehrhardt wrote:
>> Mel Gorman wrote:
>> [...]
>>>>
>>>
>>> One way of finding out if cache hotness was the problem would be to
>>> profile for cache misses and see if there are massive differences
>>> with and without the patch. Is that an option?
>>>
>> It is an option to verify things, but as mentioned above I would
>> expect an increased amount of consumed cycles per kb, which I don't
>> see. I'll track caches anyway to be sure.
>>
>> My personal current assumption is that either some time is lost from
>> the read syscall until the A event blktrace tracks, or I'm wrong with
>> my assumption about user processes and iozone just runs "slower".
>>
>
> After I was able to verify that the time is lost above the block device
> layer I ran further tests to check out where the additional latency due
> to commit e084b occurs.
> With "strace -frT -e trace=open,close,read,write" I could isolate the
> continuous stream of read calls done by the iozone child process.
> That data revealed two things:
> a) The time is lost between the read call and entry into the block
> device layer (so not e.g. due to slow userspace execution or scheduling
> effects).
> Time spent in userspace until the next read call is almost the same,
> but the average time spent "inside" the read call increases from 788µs
> to 3278µs.
> b) The time is lost in a few big peaks.
> The majority of read calls take ~500-750µs in both cases, but roughly
> 8% of the calls have shifted and now take between 15000 and 60000µs.
>
> As mentioned before there is no significant increase in consumed cpu
> cycles - therefore I looked at how the read call could end up sleeping
> (latency without consuming cycles).
> I found that there is a chance of running into "congestion_wait" in case
> __alloc_pages_direct_reclaim() was able to free some pages with
> try_to_free_pages() but still doesn't get a page from the subsequent
> get_page_from_freelist().

This is true although it was also the case before the patch.
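
For reference, the path being described looks roughly like this
(paraphrased from memory of the 2.6.31 sources, not verbatim):

	/* mm/page_alloc.c, __alloc_pages_slowpath() - paraphrased flow */
	page = __alloc_pages_direct_reclaim(gfp_mask, order, zonelist,
			high_zoneidx, nodemask, alloc_flags,
			preferred_zone, migratetype, &did_some_progress);
	if (page)
		goto got_pg;

	/* reclaim made progress, but the freelist refill still failed */
	pages_reclaimed += did_some_progress;
	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
		congestion_wait(BLK_RW_ASYNC, HZ/50);	/* sleep up to 20ms */
		goto rebalance;
	}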

> To get details about that I patched in some simple counters and
> ktime_get based timings. The results are impressive and point exactly
> towards the congestion_wait theory.
>
> The following data was again taken with iozone (1 thread, random read)
> in a very low memory environment; the counters were zeroed before
> starting iozone.
> Bad = the kernel with just commit e084b, good = the kernel one commit
> before e084b. The throughput in the good case while taking this data
> was approximately 168% of that of the bad case.
>
>                                           GOOD     BAD
> perf_count_congestion_wait                 245    1388
> perf_count_pages_direct_reclaim          10993    5365
> perf_count_failed_pages_direct_reclaim       0    1090
> Note - the data also shows that the average time spent in the
> f_op->aio_read call issued by do_sync_read increases by 68.5%, which is
> the exact inverse of the throughput ratio (1/1.685 ~= 0.59, i.e. the bad
> case delivers ~59% of the good case's throughput).
>
> This lets me conclude that patch e084b causes the system to run into
> congestion_wait after the semi-failed call to
> __alloc_pages_direct_reclaim.

You mention that the "GOOD" kernel is one commit before e084b but can
you test with just that patch reverted please? As the patch is only
about list manipulation, I'm failing to see how it can affect
congestion_wait().
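
Something like the following on top of the current tree would do, assuming
the revert applies cleanly:

	git revert e084b2d95e48b31aa45f9c49ffc6cdae8bdb21d4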

> Unfortunately I don't understand memory management well enough to find
> the relation between e084b changing the order and position of freed
> pages in the per-cpu free list and __alloc_pages_direct_reclaim not
> getting pages as effectively as before.

Off-hand, neither can I. I regret that I'm going off-line for Christmas
very soon and I'll be in the middle of nowhere with no internet connection
or test machines in the interim. It's going to be the new year before I
get to look at this properly.

Even if this patch is not the real problem, your instrumentation patches
will help pin down the real problem. Thanks, and sorry that I'll be
delayed in getting back to you properly.

> Ideas and comments welcome!
>
> Kind regards,
> Christian
>
> P.S.: attached are the three patches I used to get those new numbers
>
>
> --
>
> Grüsse / regards, Christian Ehrhardt
> IBM Linux Technology Center, Linux Performance on System z
>

> Index: linux-2.6-git-2.6.31-bisect2/fs/read_write.c
> ===================================================================
> --- linux-2.6-git-2.6.31-bisect2.orig/fs/read_write.c
> +++ linux-2.6-git-2.6.31-bisect2/fs/read_write.c
> @@ -249,25 +249,37 @@ static void wait_on_retry_sync_kiocb(str
>  	__set_current_state(TASK_RUNNING);
>  }
>
> +unsigned long perf_count_do_sync_read_calls_aio_read = 0;
> +unsigned long perf_tsum_do_sync_read_calls_aio_read = 0;
> +unsigned long perf_count_do_sync_read_calls_wait_on_sync_kiocb = 0;
> +unsigned long perf_tsum_do_sync_read_calls_wait_on_sync_kiocb = 0;
>  ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
>  {
>  	struct iovec iov = { .iov_base = buf, .iov_len = len };
>  	struct kiocb kiocb;
>  	ssize_t ret;
> +	ktime_t tv64_pre;
>
>  	init_sync_kiocb(&kiocb, filp);
>  	kiocb.ki_pos = *ppos;
>  	kiocb.ki_left = len;
>
>  	for (;;) {
> +		perf_count_do_sync_read_calls_aio_read++;
> +		tv64_pre = ktime_get();
>  		ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
> +		perf_tsum_do_sync_read_calls_aio_read += ktime_sub(ktime_get(), tv64_pre).tv64;
>  		if (ret != -EIOCBRETRY)
>  			break;
>  		wait_on_retry_sync_kiocb(&kiocb);
>  	}
>
> -	if (-EIOCBQUEUED == ret)
> +	if (-EIOCBQUEUED == ret) {
> +		perf_count_do_sync_read_calls_wait_on_sync_kiocb++;
> +		tv64_pre = ktime_get();
>  		ret = wait_on_sync_kiocb(&kiocb);
> +		perf_tsum_do_sync_read_calls_wait_on_sync_kiocb += ktime_sub(ktime_get(), tv64_pre).tv64;
> +	}
>  	*ppos = kiocb.ki_pos;
>  	return ret;
>  }
> Index: linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
> ===================================================================
> --- linux-2.6-git-2.6.31-bisect2.orig/kernel/sysctl.c
> +++ linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
> @@ -257,6 +257,10 @@ extern unsigned long perf_count_pages_di
>  extern unsigned long perf_count_failed_pages_direct_reclaim;
>  extern unsigned long perf_count_do_generic_file_read_calls_page_cache_alloc_cold;
>  extern unsigned long perf_tsum_do_generic_file_read_calls_page_cache_alloc_cold;
> +extern unsigned long perf_count_do_sync_read_calls_aio_read;
> +extern unsigned long perf_tsum_do_sync_read_calls_aio_read;
> +extern unsigned long perf_count_do_sync_read_calls_wait_on_sync_kiocb;
> +extern unsigned long perf_tsum_do_sync_read_calls_wait_on_sync_kiocb;
>  static struct ctl_table perf_table[] = {
>  	{
>  		.ctl_name = CTL_UNNUMBERED,
> @@ -297,6 +301,38 @@ static struct ctl_table perf_table[] = {
>  		.maxlen = sizeof(unsigned long),
>  		.mode = 0666,
>  		.proc_handler = &proc_doulongvec_minmax,
> +	},
> +	{
> +		.ctl_name = CTL_UNNUMBERED,
> +		.procname = "perf_count_do_sync_read_calls_aio_read",
> +		.data = &perf_count_do_sync_read_calls_aio_read,
> +		.maxlen = sizeof(unsigned long),
> +		.mode = 0666,
> +		.proc_handler = &proc_doulongvec_minmax,
> +	},
> +	{
> +		.ctl_name = CTL_UNNUMBERED,
> +		.procname = "perf_tsum_do_sync_read_calls_aio_read",
> +		.data = &perf_tsum_do_sync_read_calls_aio_read,
> +		.maxlen = sizeof(unsigned long),
> +		.mode = 0666,
> +		.proc_handler = &proc_doulongvec_minmax,
> +	},
> +	{
> +		.ctl_name = CTL_UNNUMBERED,
> +		.procname = "perf_count_do_sync_read_calls_wait_on_sync_kiocb",
> +		.data = &perf_count_do_sync_read_calls_wait_on_sync_kiocb,
> +		.maxlen = sizeof(unsigned long),
> +		.mode = 0666,
> +		.proc_handler = &proc_doulongvec_minmax,
> +	},
> +	{
> +		.ctl_name = CTL_UNNUMBERED,
> +		.procname = "perf_tsum_do_sync_read_calls_wait_on_sync_kiocb",
> +		.data = &perf_tsum_do_sync_read_calls_wait_on_sync_kiocb,
> +		.maxlen = sizeof(unsigned long),
> +		.mode = 0666,
> +		.proc_handler = &proc_doulongvec_minmax,
>  	},
>  	{ .ctl_name = 0 }
>  };

> Index: linux-2.6-git-2.6.31-bisect2/include/linux/sysctl.h
> ===================================================================
> --- linux-2.6-git-2.6.31-bisect2.orig/include/linux/sysctl.h
> +++ linux-2.6-git-2.6.31-bisect2/include/linux/sysctl.h
> @@ -69,6 +69,7 @@ enum
>  	CTL_BUS=8,		/* Busses */
>  	CTL_ABI=9,		/* Binary emulation */
>  	CTL_CPU=10,		/* CPU stuff (speed scaling, etc) */
> +	CTL_PERF=11,		/* Performance counters and timer sums for debugging */
>  	CTL_ARLAN=254,		/* arlan wireless driver */
>  	CTL_S390DBF=5677,	/* s390 debug */
>  	CTL_SUNRPC=7249,	/* sunrpc debug */
> Index: linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
> ===================================================================
> --- linux-2.6-git-2.6.31-bisect2.orig/kernel/sysctl.c
> +++ linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
> @@ -177,6 +177,7 @@ static struct ctl_table_root sysctl_tabl
>  	.default_set.list = LIST_HEAD_INIT(root_table_header.ctl_entry),
>  };
>
> +static struct ctl_table perf_table[];
>  static struct ctl_table kern_table[];
>  static struct ctl_table vm_table[];
>  static struct ctl_table fs_table[];
> @@ -230,6 +231,13 @@ static struct ctl_table root_table[] = {
>  		.mode		= 0555,
>  		.child		= dev_table,
>  	},
> +	{
> +		.ctl_name = CTL_PERF,
> +		.procname = "perf",
> +		.mode = 0555,
> +		.child = perf_table,
> +	},
> +
>  	/*
>  	 * NOTE: do not add new entries to this table unless you have read
>  	 * Documentation/sysctl/ctl_unnumbered.txt
> @@ -244,6 +252,19 @@ static int min_wakeup_granularity_ns;
>  static int max_wakeup_granularity_ns = NSEC_PER_SEC;	/* 1 second */
>  #endif
>
> +extern unsigned long perf_count_congestion_wait;
> +static struct ctl_table perf_table[] = {
> +	{
> +		.ctl_name = CTL_UNNUMBERED,
> +		.procname = "perf_count_congestion_wait",
> +		.data = &perf_count_congestion_wait,
> +		.mode = 0666,
> +		.maxlen = sizeof(unsigned long),
> +		.proc_handler = &proc_doulongvec_minmax,
> +	},
> +	{ .ctl_name = 0 }
> +};
> +
>  static struct ctl_table kern_table[] = {
>  #ifdef CONFIG_SCHED_DEBUG
>  	{
> Index: linux-2.6-git-2.6.31-bisect2/mm/backing-dev.c
> ===================================================================
> --- linux-2.6-git-2.6.31-bisect2.orig/mm/backing-dev.c
> +++ linux-2.6-git-2.6.31-bisect2/mm/backing-dev.c
> @@ -305,6 +305,7 @@ void set_bdi_congested(struct backing_de
>  }
>  EXPORT_SYMBOL(set_bdi_congested);
>
> +unsigned long perf_count_congestion_wait = 0;
>  /**
>   * congestion_wait - wait for a backing_dev to become uncongested
>   * @sync: SYNC or ASYNC IO
> @@ -320,6 +321,7 @@ long congestion_wait(int sync, long time
>  	DEFINE_WAIT(wait);
>  	wait_queue_head_t *wqh = &congestion_wqh[sync];
>
> +	perf_count_congestion_wait++;
>  	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
>  	ret = io_schedule_timeout(timeout);
>  	finish_wait(wqh, &wait);

> Index: linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
> ===================================================================
> --- linux-2.6-git-2.6.31-bisect2.orig/kernel/sysctl.c
> +++ linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
> @@ -253,6 +253,8 @@ static int max_wakeup_granularity_ns = N
>  #endif
>
>  extern unsigned long perf_count_congestion_wait;
> +extern unsigned long perf_count_pages_direct_reclaim;
> +extern unsigned long perf_count_failed_pages_direct_reclaim;
>  static struct ctl_table perf_table[] = {
>  	{
>  		.ctl_name = CTL_UNNUMBERED,
> @@ -262,6 +264,22 @@ static struct ctl_table perf_table[] = {
>  		.maxlen = sizeof(unsigned long),
>  		.proc_handler = &proc_doulongvec_minmax,
>  	},
> +	{
> +		.ctl_name = CTL_UNNUMBERED,
> +		.procname = "perf_count_pages_direct_reclaim",
> +		.data = &perf_count_pages_direct_reclaim,
> +		.maxlen = sizeof(unsigned long),
> +		.mode = 0666,
> +		.proc_handler = &proc_doulongvec_minmax,
> +	},
> +	{
> +		.ctl_name = CTL_UNNUMBERED,
> +		.procname = "perf_count_failed_pages_direct_reclaim",
> +		.data = &perf_count_failed_pages_direct_reclaim,
> +		.maxlen = sizeof(unsigned long),
> +		.mode = 0666,
> +		.proc_handler = &proc_doulongvec_minmax,
> +	},
>  	{ .ctl_name = 0 }
>  };
>
> Index: linux-2.6-git-2.6.31-bisect2/mm/page_alloc.c
> ===================================================================
> --- linux-2.6-git-2.6.31-bisect2.orig/mm/page_alloc.c
> +++ linux-2.6-git-2.6.31-bisect2/mm/page_alloc.c
> @@ -1605,6 +1605,9 @@ out:
>  	return page;
>  }
>
> +unsigned long perf_count_pages_direct_reclaim = 0;
> +unsigned long perf_count_failed_pages_direct_reclaim = 0;
> +
>  /* The really slow allocator path where we enter direct reclaim */
>  static inline struct page *
>  __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> @@ -1645,6 +1648,11 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>  					zonelist, high_zoneidx,
>  					alloc_flags, preferred_zone,
>  					migratetype);
> +
> +	perf_count_pages_direct_reclaim++;
> +	if (!page)
> +		perf_count_failed_pages_direct_reclaim++;
> +
>  	return page;
>  }
>
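
For reference, with these patches applied the counters appear under
/proc/sys/perf/ (the CTL_PERF table above; mode 0666 makes them writable),
so a run can be bracketed from a shell like this:

	echo 0 > /proc/sys/perf/perf_count_congestion_wait	# zero before the run
	cat /proc/sys/perf/perf_count_congestion_wait		# read afterwards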


--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab