From: Andrew Morton on
On Mon, 28 Jun 2010 10:44:59 -0700
Michael Rubin <mrubin(a)google.com> wrote:

> Adding four read-only files to /proc/sys/vm
>
> To help developers and applications gain visibility into writeback
> behaviour adding four read only sysctl files into /proc/sys/vm.
> These files allow user apps to understand writeback behaviour over time
> and learn how it is impacting their performance.
>
> # cat /proc/sys/vm/pages_dirtied
> 3747
> # cat /proc/sys/vm/pages_entered_writeback
> 3618
> # cat /proc/sys/vm/dirty_threshold
> 816673
> # cat /proc/sys/vm/dirty_background_threshold
> 408336
>
> Documentation/vm.txt has been updated.
>
> In order to track the "cleaned" and "dirtied" counts we added two
> vm_stat_items. Per memory node stats have been added also. So we can
> see per node granularity:
>
> # cat /sys/devices/system/node/node20/writebackstat
> Node 20 pages_writeback: 0 times
> Node 20 pages_dirtied: 0 times
>
> A helper function, account_page_writeback, was added to encapsulate
> incrementing vm stats from nilfs. ceph code was also changed to use a
> mm helper routine.
>

Well... why are these useful? In what operational scenario would
someone use these and get goodness from the experience? Where is the
value? Sell it to us!



I'm generally reluctant to add /proc knobs which expose internals or
which tie us into particular implementations.

It's hard to see how any future implementation could have a problem
implementing pages_dirtied and pages_entered_writeback, however
dirty_threshold and dirty_background_threshold are, I think, somewhat
specific to the current implementation and may be hard to maintain next
time we rip up and rewrite everything.

>
> ...
>
> +dirty_background_threshold
> +
> +Contains the exact amount of dirty memory memory the kernel uses to trigger the
> +background writeout daemon will start writing out dirty data. This value
> +depends on memory state, dirty_background_ratio and/or
> +dirty_background_bytes. This value is read-only.

Documentation doesn't describe the units. Pages? kbytes? bytes?

I think it's best to encode the units in the procfs filename
(eg: dirty_expire_centisecs, min_free_kbytes).

> +==============================================================
> +
> dirty_bytes
>
> Contains the amount of dirty memory at which a process generating disk writes
> @@ -123,6 +136,15 @@ data.
>
> ==============================================================
>
> +dirty_threshold
> +
> +Contains the exact amount of dirty memory the kernel uses to decide when
> +a process which is generating disk writes will itself start writing
> +out data. This value depends on memory state, dirty_ratio and/or
> +dirty_bytes. This value is read-only.

units?

> +=============================================================
> +
> +pages_dirtied
> +
> +Number of pages that have ever been dirtied since boot.
> +This value is read-only.
> +
> =============================================================
>
> +pages_entered_writeback
> +
> +Number of pages that have been moved from dirty to writeback since boot.
> +This is only a count of file pages. This value is read-only.
> +

Am interested in hearing (in the changelog!) why these are considered
useful.

We're very very interested in knowing how many pages entered writeback
via mm/vmscan.c however this procfs file lumps those together with the
pages which entered writeback via the regular writeback paths, I assume.

>
> ...
>
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -105,13 +105,7 @@ static int ceph_set_page_dirty(struct page *page)
> spin_lock_irq(&mapping->tree_lock);
> if (page->mapping) { /* Race with truncate? */
> WARN_ON_ONCE(!PageUptodate(page));
> -
> - if (mapping_cap_account_dirty(mapping)) {
> - __inc_zone_page_state(page, NR_FILE_DIRTY);
> - __inc_bdi_stat(mapping->backing_dev_info,
> - BDI_RECLAIMABLE);
> - task_io_account_write(PAGE_CACHE_SIZE);
> - }
> + account_page_dirtied(page, mapping);
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);

Nice cleanup. And a bugfix, perhaps? The missing
task_dirty_inc(current)?

But we need EXPORT_SYMBOL(account_page_dirtied), methinks.

This should be a separate patch IMO.

>
> ...
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Michael Rubin on
On Wed, Jun 30, 2010 at 2:24 PM, Andrew Morton
<akpm(a)linux-foundation.org> wrote:
> Well... why are these useful? In what operational scenario would
> someone use these and get goodness from the experience? Where is the
> value? Sell it to us!

OK here it is in email before I add it to the commit description.

Until now, users trying to track their IO activity have always faced
a gap in the flow from user app to disk for buffered IO. With
pages_dirtied and pages_entered_writeback the user can now track IO
from buffered writes as it is passed on to the block layer.

pages_dirtied helps storage workloads generating buffered writes
that need to see over time how much memory the app is able to dirty.
It can help trace app issues where iostat won't. In mixed workloads
where an appserver is writing via DIRECT_IO it can help root cause
issues where other apps are giving bursts of io behavior.

pages_entered_writeback is useful to help grant visibility into the
writeback subsystem. By tracking pages_entered_writeback with
pages_dirtied app developers can learn about the performance and/or
stability of the writeback subsystem. Comparing the rates of change
between the two allows developers to see when writeback is not able to
keep up with incoming traffic and the rate of dirty memory being sent
to the IO back end.
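
As a rough illustration of the comparison described above, the following
Python sketch takes two samples of the counters and reports the rate of each
plus the gap between them. The file paths are the ones proposed in this patch
and may not exist on a given kernel; read_counter, sample, writeback_rates and
the "backlog_rate" key are hypothetical names used only for illustration.

```python
import time

# Paths as proposed in this patch. Assumption: they may be absent or named
# differently on any particular kernel.
COUNTER_PATHS = {
    "pages_dirtied": "/proc/sys/vm/pages_dirtied",
    "pages_entered_writeback": "/proc/sys/vm/pages_entered_writeback",
}


def read_counter(path):
    """Return the integer stored in a single-value sysctl file."""
    with open(path) as f:
        return int(f.read().strip())


def sample(paths=COUNTER_PATHS):
    """Read all counters once and tag the sample with a monotonic timestamp."""
    values = {name: read_counter(p) for name, p in paths.items()}
    values["time"] = time.monotonic()
    return values


def writeback_rates(before, after):
    """Compute pages/second for each counter between two samples.

    A persistently positive 'backlog_rate' suggests pages are being dirtied
    faster than they enter writeback.
    """
    elapsed = after["time"] - before["time"]
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    dirtied = (after["pages_dirtied"] - before["pages_dirtied"]) / elapsed
    written = (after["pages_entered_writeback"]
               - before["pages_entered_writeback"]) / elapsed
    return {
        "dirtied_rate": dirtied,
        "writeback_rate": written,
        "backlog_rate": dirtied - written,
    }
```

A monitoring loop would call sample() twice with a sleep in between and pass
both results to writeback_rates(); the function itself only needs the two
dictionaries, so it can also be fed values collected by other means.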

> It's hard to see how any future implementation could have a problem
> implementing pages_dirtied and pages_entered_writeback, however
> dirty_threshold and dirty_background_threshold are, I think, somewhat
> specific to the current implementation and may be hard to maintain next
> time we rip up and rewrite everything.

We already expose these thresholds in /proc/sys/vm with
dirty_background_ratio and dirty_ratio. What's frustrating about the
ratio variables, and the reason these files are needed, is that the
ratios are not honored by the kernel. Instead the kernel may alter the
number requested without giving the user any indication that it has
done so. An app developer can set the ratio to 2% but end up with 5%,
as get_dirty_limits makes sure it is never lower than 5% when set from
the ratio. Arguably that can be fixed too, but the limits which decide
whether writeback is invoked to aggressively clean dirty pages also
depend on changing page state retrieved in determine_dirtyable_memory.
That makes the point at which the kernel decides to write back data a
moving target that no app can ever determine. With these thresholds
visible and collected over time, apps get a chance to know why
writeback happened, or why it did not. As systems get larger and
larger RAM, developers use the ratios to predict when their workloads
will see writeback invoked. Today there is no way to predict this
accurately.
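
To make the 2% versus 5% point concrete, here is a deliberately simplified
Python model of the behaviour described in this mail. It is not the kernel's
get_dirty_limits code: it only applies the 5% floor mentioned above to a
requested ratio and scales it by an amount of dirtyable memory. The function
name, the constant, and the choice of pages as the unit are assumptions made
for illustration.

```python
# Floor on the ratio as described in this mail (assumption: applies only
# when the limit is configured via a ratio rather than a byte count).
RATIO_FLOOR_PERCENT = 5


def effective_dirty_threshold(dirtyable_pages, requested_ratio_percent,
                              floor_percent=RATIO_FLOOR_PERCENT):
    """Return (effective_ratio_percent, threshold_in_pages).

    dirtyable_pages stands in for whatever determine_dirtyable_memory
    reports at the moment of the check, so the threshold moves whenever
    that value moves, even if the requested ratio never changes.
    """
    effective = max(requested_ratio_percent, floor_percent)
    threshold = dirtyable_pages * effective // 100
    return effective, threshold


# Same requested ratio, two different memory states: the threshold differs,
# and in both cases the requested 2% has silently become 5%.
print(effective_dirty_threshold(1_000_000, 2))
print(effective_dirty_threshold(800_000, 2))
```

The caller asked for 2% in both calls, yet nothing in the existing ratio
files would reveal either the substituted 5% or the shift caused by the
change in dirtyable memory, which is the gap the read-only threshold files
are meant to close.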

> Documentation doesn't describe the units. Pages? kbytes? bytes?

Ouch. Thanks. That will be fixed.

> I think it's best to encode the units in the procfs filename
> (eg: dirty_expire_centisecs, min_free_kbytes).

I agree. That will be fixed.

> units?
They will all get units.

> We're very very interested in knowing how many pages entered writeback
> via mm/vmscan.c however this procfs file lumps those together with the
> pages which entered writeback via the regular writeback paths, I assume.

Yes, and I think that's OK. It describes how the whole system is moving
dirty memory to writeback state and sending it to the I/O path.
To me, trying to distinguish between fs/fs-writeback.c code doing this
and vmscan.c code doing this exposes implementation details that we may
change in the future.


> But we need EXPORT_SYMBOL(account_page_dirtied), methinks.

Ouch, thanks. Will be fixed.

> This should be a separate patch IMO.

I will split these into two patches. One with the fix and then the
other with the counters.

mrubin