memcg: per cgroup dirty limit (v7) [Kernel]

From: KAMEZAWA Hiroyuki on 14 Mar 2010 22:50

On Mon, 15 Mar 2010 00:26:37 +0100
Andrea Righi <arighi(a)develer.com> wrote:

> Control the maximum amount of dirty pages a cgroup can have at any given time.
>
> Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
> page cache used by any cgroup. So, in case of multiple cgroup writers, they
> will not be able to consume more than their designated share of dirty pages and
> will be forced to perform write-out if they cross that limit.
>
> The overall design is the following:
>
> - account dirty pages per cgroup
> - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
> and memory.dirty_background_ratio / memory.dirty_background_bytes in
> cgroupfs
> - start to write-out (background or actively) when the cgroup limits are
> exceeded
>
> This feature is supposed to be strictly connected to any underlying IO
> controller implementation, so we can stop increasing dirty pages in VM layer
> and enforce a write-out before any cgroup will consume the global amount of
> dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
> /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
>
> Changelog (v6 -> v7)
> ~~~~~~~~~~~~~~~~~~~~~~
> * introduce trylock_page_cgroup() to guarantee that lock_page_cgroup()
> is never called under tree_lock (no strict accounting, but better overall
> performance)
> * do not account file cache statistics for the root cgroup (zero
> overhead for the root cgroup)
> * fix: evaluate cgroup free pages as the minimum free pages of all
> its parents
>
> Results
> ~~~~~~~
> The testcase is a kernel build (2.6.33 x86_64_defconfig) on an Intel Core 2 @
> 1.2GHz:
>
> <before>
> - root cgroup: 11m51.983s
> - child cgroup: 11m56.596s
>
> <after>
> - root cgroup: 11m51.742s
> - child cgroup: 12m5.016s
>
> In the previous version of this patchset, using the "complex" locking scheme
> with the _locked and _unlocked version of mem_cgroup_update_page_stat(), the
> child cgroup required 11m57.896s and 12m9.920s with lock_page_cgroup()+irq_disabled.
>
> With this version there's no overhead for the root cgroup (the small difference
> is in error range). I expected to see less overhead for the child cgroup; I'll
> do more testing and try to better understand what's happening.
>
Okay, thanks. This seems like a good result. Optimization for child cgroups can be done
in the -mm tree, I think. (If there is no NACK, this seems ready for testing in -mm.)
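
For reference, the relative overhead implied by the quoted kernel-build timings can be worked out as below. This is only a quick sketch: the helper names are made up for illustration, and it treats each quoted figure as a single measurement, which the thread does not state either way.

```python
def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '11m51.983s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def overhead_percent(before: str, after: str) -> float:
    """Relative change from `before` to `after`, in percent."""
    base = to_seconds(before)
    return (to_seconds(after) - base) / base * 100.0


# (before, after) timings quoted in the cover letter
timings = {
    "root": ("11m51.983s", "11m51.742s"),
    "child": ("11m56.596s", "12m5.016s"),
}

for name, (before, after) in timings.items():
    print(f"{name} cgroup: {overhead_percent(before, after):+.2f}%")
```

The root cgroup difference is a few hundredths of a percent, while the child cgroup is a bit over 1% slower, which is the part left for later optimization.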

> In the meantime, it would be great if someone could perform some tests on a larger
> system... unfortunately at the moment I don't have a big system available for
> this kind of test...
>
I hope so, too.

Thanks,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Righi on 15 Mar 2010 06:10

On Mon, Mar 15, 2010 at 11:36:12AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 15 Mar 2010 00:26:37 +0100
> Andrea Righi <arighi(a)develer.com> wrote:
>
> > Control the maximum amount of dirty pages a cgroup can have at any given time.
> >
> > Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
> > page cache used by any cgroup. So, in case of multiple cgroup writers, they
> > will not be able to consume more than their designated share of dirty pages and
> > will be forced to perform write-out if they cross that limit.
> >
> > The overall design is the following:
> >
> > - account dirty pages per cgroup
> > - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
> > and memory.dirty_background_ratio / memory.dirty_background_bytes in
> > cgroupfs
> > - start to write-out (background or actively) when the cgroup limits are
> > exceeded
> >
> > This feature is supposed to be strictly connected to any underlying IO
> > controller implementation, so we can stop increasing dirty pages in VM layer
> > and enforce a write-out before any cgroup will consume the global amount of
> > dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
> > /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
> >
> > Changelog (v6 -> v7)
> > ~~~~~~~~~~~~~~~~~~~~~~
> > * introduce trylock_page_cgroup() to guarantee that lock_page_cgroup()
> > is never called under tree_lock (no strict accounting, but better overall
> > performance)
> > * do not account file cache statistics for the root cgroup (zero
> > overhead for the root cgroup)
> > * fix: evaluate cgroup free pages as the minimum free pages of all
> > its parents
> >
> > Results
> > ~~~~~~~
> > The testcase is a kernel build (2.6.33 x86_64_defconfig) on an Intel Core 2 @
> > 1.2GHz:
> >
> > <before>
> > - root cgroup: 11m51.983s
> > - child cgroup: 11m56.596s
> >
> > <after>
> > - root cgroup: 11m51.742s
> > - child cgroup: 12m5.016s
> >
> > In the previous version of this patchset, using the "complex" locking scheme
> > with the _locked and _unlocked version of mem_cgroup_update_page_stat(), the
> > child cgroup required 11m57.896s and 12m9.920s with lock_page_cgroup()+irq_disabled.
> >
> > With this version there's no overhead for the root cgroup (the small difference
> > is in error range). I expected to see less overhead for the child cgroup; I'll
> > do more testing and try to better understand what's happening.
> >
> Okay, thanks. This seems good result. Optimization for children can be done under
> -mm tree, I think. (If no nack, this seems ready for test in -mm.)

OK, I'll wait a bit to see if someone has other fixes or issues and post
a new version soon including these small changes.

Thanks,
-Andrea

From: Vivek Goyal on 15 Mar 2010 13:20

On Mon, Mar 15, 2010 at 12:26:37AM +0100, Andrea Righi wrote:
> Control the maximum amount of dirty pages a cgroup can have at any given time.
>
> Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
> page cache used by any cgroup. So, in case of multiple cgroup writers, they
> will not be able to consume more than their designated share of dirty pages and
> will be forced to perform write-out if they cross that limit.
>

For me, even with this version, the group with the 100M limit is getting
much more bandwidth (BW).

root cgroup
===========
# time dd if=/dev/zero of=/root/zerofile bs=4K count=1M
4294967296 bytes (4.3 GB) copied, 55.7979 s, 77.0 MB/s

real 0m56.209s

test1 cgroup with memory limit of 100M
======================================
# time dd if=/dev/zero of=/root/zerofile1 bs=4K count=1M
4294967296 bytes (4.3 GB) copied, 20.9252 s, 205 MB/s

real 0m21.096s

Note, these two jobs are not running in parallel. These are running one
after the other.
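
As a sanity check on the numbers above, the throughput figures can be recomputed from the byte count and elapsed time. This is a minimal sketch with a made-up helper name; it assumes dd's usual conventions, namely that bs=4K and count=1M are binary multiples and that the reported MB/s uses decimal megabytes.

```python
def dd_rate_mb_s(nbytes: int, seconds: float) -> float:
    """Throughput in MB/s, with 1 MB = 10**6 bytes."""
    return nbytes / seconds / 1e6


# bs=4K (4096 bytes) times count=1M (1048576 blocks)
TOTAL_BYTES = 4096 * 1024 * 1024

root = dd_rate_mb_s(TOTAL_BYTES, 55.7979)    # root cgroup run
test1 = dd_rate_mb_s(TOTAL_BYTES, 20.9252)   # test1 cgroup run (100M limit)

print(f"root  : {root:.1f} MB/s")
print(f"test1 : {test1:.1f} MB/s")
print(f"ratio : {test1 / root:.2f}x")
```

The recomputed rates agree with what dd printed, and the limited group comes out at well over twice the root cgroup's rate.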

Vivek

> The overall design is the following:
>
> - account dirty pages per cgroup
> - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
> and memory.dirty_background_ratio / memory.dirty_background_bytes in
> cgroupfs
> - start to write-out (background or actively) when the cgroup limits are
> exceeded
>
> This feature is supposed to be strictly connected to any underlying IO
> controller implementation, so we can stop increasing dirty pages in VM layer
> and enforce a write-out before any cgroup will consume the global amount of
> dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
> /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
>
> Changelog (v6 -> v7)
> ~~~~~~~~~~~~~~~~~~~~~~
> * introduce trylock_page_cgroup() to guarantee that lock_page_cgroup()
> is never called under tree_lock (no strict accounting, but better overall
> performance)
> * do not account file cache statistics for the root cgroup (zero
> overhead for the root cgroup)
> * fix: evaluate cgroup free pages as the minimum free pages of all
> its parents
>
> Results
> ~~~~~~~
> The testcase is a kernel build (2.6.33 x86_64_defconfig) on an Intel Core 2 @
> 1.2GHz:
>
> <before>
> - root cgroup: 11m51.983s
> - child cgroup: 11m56.596s
>
> <after>
> - root cgroup: 11m51.742s
> - child cgroup: 12m5.016s
>
> In the previous version of this patchset, using the "complex" locking scheme
> with the _locked and _unlocked version of mem_cgroup_update_page_stat(), the
> child cgroup required 11m57.896s and 12m9.920s with lock_page_cgroup()+irq_disabled.
>
> With this version there's no overhead for the root cgroup (the small difference
> is in error range). I expected to see less overhead for the child cgroup; I'll
> do more testing and try to better understand what's happening.
>
> In the meantime, it would be great if someone could perform some tests on a larger
> system... unfortunately at the moment I don't have a big system available for
> this kind of test...
>
> Thanks,
> -Andrea
>
> Documentation/cgroups/memory.txt | 36 +++
> fs/nfs/write.c | 4 +
> include/linux/memcontrol.h | 87 ++++++-
> include/linux/page_cgroup.h | 35 +++
> include/linux/writeback.h | 2 -
> mm/filemap.c | 1 +
> mm/memcontrol.c | 542 +++++++++++++++++++++++++++++++++++---
> mm/page-writeback.c | 215 ++++++++++------
> mm/rmap.c | 4 +-
> mm/truncate.c | 1 +
> 10 files changed, 806 insertions(+), 121 deletions(-)

From: Vivek Goyal on 15 Mar 2010 13:30

On Mon, Mar 15, 2010 at 01:12:09PM -0400, Vivek Goyal wrote:
> On Mon, Mar 15, 2010 at 12:26:37AM +0100, Andrea Righi wrote:
> > Control the maximum amount of dirty pages a cgroup can have at any given time.
> >
> > Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
> > page cache used by any cgroup. So, in case of multiple cgroup writers, they
> > will not be able to consume more than their designated share of dirty pages and
> > will be forced to perform write-out if they cross that limit.
> >
>
> For me, even with this version, the group with the 100M limit is getting
> much more bandwidth (BW).
>
> root cgroup
> ===========
> # time dd if=/dev/zero of=/root/zerofile bs=4K count=1M
> 4294967296 bytes (4.3 GB) copied, 55.7979 s, 77.0 MB/s
>
> real 0m56.209s
>
> test1 cgroup with memory limit of 100M
> ======================================
> # time dd if=/dev/zero of=/root/zerofile1 bs=4K count=1M
> 4294967296 bytes (4.3 GB) copied, 20.9252 s, 205 MB/s
>
> real 0m21.096s
>
> Note, these two jobs are not running in parallel. These are running one
> after the other.
>

Ok, here is the strange part. I am seeing similar behavior even without
your patches applied.

root cgroup
===========
# time dd if=/dev/zero of=/root/zerofile bs=4K count=1M
4294967296 bytes (4.3 GB) copied, 56.098 s, 76.6 MB/s

real 0m56.614s

test1 cgroup with memory limit 100M
===================================
# time dd if=/dev/zero of=/root/zerofile1 bs=4K count=1M
4294967296 bytes (4.3 GB) copied, 19.8097 s, 217 MB/s

real 0m19.992s
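
To put the two sets of runs side by side, the elapsed times with and without the patches can be compared directly. This is a rough sketch over single runs, with made-up names, so small differences should not be over-interpreted.

```python
def relative_change(patched: float, unpatched: float) -> float:
    """Change in elapsed time when going from unpatched to patched, in percent."""
    return (patched - unpatched) / unpatched * 100.0


# dd elapsed seconds: (with patches, without patches)
runs = {
    "root": (55.7979, 56.098),
    "test1": (20.9252, 19.8097),
}

for name, (patched, unpatched) in runs.items():
    print(f"{name}: {relative_change(patched, unpatched):+.2f}% elapsed time with patches")

# Gap between the two groups on the unpatched kernel
print(f"unpatched root/test1 time ratio: {runs['root'][1] / runs['test1'][1]:.2f}")
```

The patched and unpatched times differ by a few percent at most, while the gap between the root and test1 groups is large in both cases.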

Vivek


From: Balbir Singh on 17 Mar 2010 02:50

* Andrea Righi <arighi(a)develer.com> [2010-03-15 00:26:37]:

> Control the maximum amount of dirty pages a cgroup can have at any given time.
>
> Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
> page cache used by any cgroup. So, in case of multiple cgroup writers, they
> will not be able to consume more than their designated share of dirty pages and
> will be forced to perform write-out if they cross that limit.
>
> The overall design is the following:
>
> - account dirty pages per cgroup
> - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
> and memory.dirty_background_ratio / memory.dirty_background_bytes in
> cgroupfs
> - start to write-out (background or actively) when the cgroup limits are
> exceeded
>
> This feature is supposed to be strictly connected to any underlying IO
> controller implementation, so we can stop increasing dirty pages in VM layer
> and enforce a write-out before any cgroup will consume the global amount of
> dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
> /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
>
> Changelog (v6 -> v7)
> ~~~~~~~~~~~~~~~~~~~~~~
> * introduce trylock_page_cgroup() to guarantee that lock_page_cgroup()
> is never called under tree_lock (no strict accounting, but better overall
> performance)
> * do not account file cache statistics for the root cgroup (zero
> overhead for the root cgroup)
> * fix: evaluate cgroup free pages as the minimum free pages of all
> its parents
>
> Results
> ~~~~~~~
> The testcase is a kernel build (2.6.33 x86_64_defconfig) on an Intel Core 2 @
> 1.2GHz:
>
> <before>
> - root cgroup: 11m51.983s
> - child cgroup: 11m56.596s
>
> <after>
> - root cgroup: 11m51.742s
> - child cgroup: 12m5.016s
>
> In the previous version of this patchset, using the "complex" locking scheme
> with the _locked and _unlocked version of mem_cgroup_update_page_stat(), the
> child cgroup required 11m57.896s and 12m9.920s with lock_page_cgroup()+irq_disabled.
>
> With this version there's no overhead for the root cgroup (the small difference
> is in error range). I expected to see less overhead for the child cgroup; I'll
> do more testing and try to better understand what's happening.

I like that the root overhead is going away.
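
The two v7 changelog items behind this (no file cache statistics for the root cgroup, and a trylock instead of a blocking lock) can be modeled in userspace roughly as below. This is a toy sketch, not the kernel code: PageCgroup, DirtyStats and update() are hypothetical names, and the thread does not say in which order the patch performs the root check and the trylock.

```python
import threading


class PageCgroup:
    """Toy stand-in for per-page cgroup bookkeeping (hypothetical, not the kernel struct)."""

    def __init__(self, cgroup_name: str):
        self.cgroup_name = cgroup_name
        self.lock = threading.Lock()


class DirtyStats:
    """Best-effort per-cgroup dirty page counters."""

    def __init__(self):
        self.dirty = {}    # cgroup name -> dirty page count
        self.skipped = 0   # updates dropped because the lock was busy

    def update(self, pc: PageCgroup, delta: int) -> bool:
        # Root cgroup: no per-cgroup accounting, so no locking cost either.
        if pc.cgroup_name == "root":
            return False
        # Never block: if the lock is busy, drop this update instead of waiting.
        if not pc.lock.acquire(blocking=False):
            self.skipped += 1
            return False
        try:
            self.dirty[pc.cgroup_name] = self.dirty.get(pc.cgroup_name, 0) + delta
            return True
        finally:
            pc.lock.release()


stats = DirtyStats()
root_page = PageCgroup("root")
child_page = PageCgroup("child")

stats.update(root_page, 1)    # ignored: root is not accounted
stats.update(child_page, 1)   # accounted
child_page.lock.acquire()     # simulate a contended lock
stats.update(child_page, 1)   # dropped, counted in stats.skipped
child_page.lock.release()
```

Dropping an update under contention is what "no strict accounting" means in this model: the counters may lag, but the caller never waits on the lock.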

>
> In the meantime, it would be great if someone could perform some tests on a larger
> system... unfortunately at the moment I don't have a big system available for
> this kind of test...
>

I'll test this. I only have a small machine to test on at the moment, but I'll
get back with data.

--
Three Cheers,
Balbir
