From: David Rientjes on
On Tue, 23 Feb 2010, Vivek Goyal wrote:

> > > Because you have modified dirtyable_memory() and made it per cgroup, I
> > > think it automatically takes care of the per-cgroup dirty ratio cases I
> > > mentioned in my previous mail. So we will use the system-wide dirty
> > > ratio to calculate the allowed dirty pages in this cgroup (dirty_ratio *
> > > available_memory()), and if this cgroup wrote too many pages, start
> > > writeout?
> >
> > OK, if I've understood well, you're proposing to use per-cgroup
> > dirty_ratio interface and do something like:
>
> I think we can use the system-wide dirty_ratio for each cgroup (instead of
> providing a configurable dirty_ratio for each cgroup, where each memory
> cgroup can have a different dirty ratio; I can't think of a use case
> immediately).

I think each memcg should have both dirty_bytes and dirty_ratio,
dirty_bytes defaults to 0 (disabled) while dirty_ratio is inherited from
the global vm_dirty_ratio. Changing vm_dirty_ratio would not change
memcgs already using their own dirty_ratio, but new memcgs would get the
new value by default. The ratio would act over the amount of available
memory to the cgroup as though it were its own "virtual system" operating
with a subset of the system's RAM and the same global ratio.
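
Concretely, I'd expect the effective per-memcg limit to be picked the same way
the global code picks between vm_dirty_bytes and vm_dirty_ratio. A minimal
sketch (the memcg_* helpers below are hypothetical names, used only to
illustrate the intended semantics, not functions from the posted series):

/*
 * Hypothetical helpers: memcg_dirty_bytes(), memcg_dirty_ratio() and
 * memcg_dirtyable_memory() mirror vm_dirty_bytes, vm_dirty_ratio and
 * determine_dirtyable_memory() at cgroup scope.
 */
static unsigned long memcg_dirty_limit_pages(struct mem_cgroup *memcg)
{
	unsigned long bytes = memcg_dirty_bytes(memcg);

	if (bytes)		/* dirty_bytes takes precedence */
		return bytes / PAGE_SIZE;

	/* otherwise apply the (possibly inherited) ratio to the memory
	   available to this cgroup */
	return memcg_dirty_ratio(memcg) * memcg_dirtyable_memory(memcg) / 100;
}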
From: Andrea Righi on
On Tue, Feb 23, 2010 at 02:22:12PM -0800, David Rientjes wrote:
> On Tue, 23 Feb 2010, Vivek Goyal wrote:
>
> > > > Because you have modified dirtyable_memory() and made it per cgroup, I
> > > > think it automatically takes care of the per-cgroup dirty ratio cases I
> > > > mentioned in my previous mail. So we will use the system-wide dirty
> > > > ratio to calculate the allowed dirty pages in this cgroup (dirty_ratio *
> > > > available_memory()), and if this cgroup wrote too many pages, start
> > > > writeout?
> > >
> > > OK, if I've understood well, you're proposing to use per-cgroup
> > > dirty_ratio interface and do something like:
> >
> > I think we can use the system-wide dirty_ratio for each cgroup (instead of
> > providing a configurable dirty_ratio for each cgroup, where each memory
> > cgroup can have a different dirty ratio; I can't think of a use case
> > immediately).
>
> I think each memcg should have both dirty_bytes and dirty_ratio,
> dirty_bytes defaults to 0 (disabled) while dirty_ratio is inherited from
> the global vm_dirty_ratio. Changing vm_dirty_ratio would not change
> memcgs already using their own dirty_ratio, but new memcgs would get the
> new value by default. The ratio would act over the amount of available
> memory to the cgroup as though it were its own "virtual system" operating
> with a subset of the system's RAM and the same global ratio.

Agreed.

-Andrea
From: Andrea Righi on
On Tue, Feb 23, 2010 at 04:29:43PM -0500, Vivek Goyal wrote:
> On Sun, Feb 21, 2010 at 04:18:45PM +0100, Andrea Righi wrote:
>
> [..]
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 0b19943..c9ff1cd 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -137,10 +137,11 @@ static struct prop_descriptor vm_dirties;
> >   */
> >  static int calc_period_shift(void)
> >  {
> > -	unsigned long dirty_total;
> > +	unsigned long dirty_total, dirty_bytes;
> >
> > -	if (vm_dirty_bytes)
> > -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> > +	dirty_bytes = mem_cgroup_dirty_bytes();
> > +	if (dirty_bytes)
> > +		dirty_total = dirty_bytes / PAGE_SIZE;
> >  	else
> >  		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> >  				100;
>
> Ok, I don't understand this so I'd better ask. Can you explain a bit how the
> memory cgroup dirty ratio is going to play with the per-BDI dirty proportion?
>
> Currently we seem to calculate each BDI's proportion (based on recently
> completed events) of the system-wide dirty ratio and decide whether a process
> should be throttled or not.
>
> Because the throttling decision is also based on the BDI and its proportion,
> how are we going to fit it in with the memory cgroup? Is it going to be the
> BDI's proportion of the dirty memory within the memory cgroup (and not
> system wide)?

IMHO we need to calculate the BDI dirty threshold as a function of the
cgroup's dirty memory, and keep BDI statistics system wide.

So, if a task is generating writes, the threshold at which it starts doing
writeback itself will be calculated as a function of the cgroup's dirty
memory. If the BDI's dirty memory is greater than this threshold, the task
must start writing back dirty pages until it gets back under the expected
dirty limit.

OK, in this way a cgroup with a small dirty limit may be forced to
writeback a lot of pages dirtied by other cgroups on the same device.
But this is always related to the fact that tasks are forced to
writeback dirty inodes randomly, and not the inodes they've actually
dirtied.
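
In balance_dirty_pages() terms I mean something along these lines (only a
sketch: memcg_dirty_limit_pages() is a hypothetical per-cgroup helper, while
bdi_writeout_fraction() and the bdi_stat() counters are the existing
system-wide machinery):

	unsigned long dirty_thresh, bdi_nr_reclaimable, bdi_nr_writeback;
	u64 bdi_thresh;
	long numerator, denominator;

	/* the task's threshold comes from its cgroup, not from the system */
	dirty_thresh = memcg_dirty_limit_pages(memcg);

	/* the BDI's share of that threshold, still based on system-wide
	   completion events */
	bdi_writeout_fraction(bdi, &numerator, &denominator);
	bdi_thresh = (u64)dirty_thresh * numerator;
	do_div(bdi_thresh, denominator);

	/* BDI statistics stay system wide */
	bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
	bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);

	if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
		break;	/* under the cgroup-derived limit, no throttling */

	/* otherwise write back dirty pages until we get back under
	   bdi_thresh, as balance_dirty_pages() already does */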

-Andrea
From: KAMEZAWA Hiroyuki on
On Thu, 25 Feb 2010 15:34:44 +0100
Andrea Righi <arighi(a)develer.com> wrote:

> On Tue, Feb 23, 2010 at 02:22:12PM -0800, David Rientjes wrote:
> > On Tue, 23 Feb 2010, Vivek Goyal wrote:
> >
> > > > > Because you have modified dirtyable_memory() and made it per cgroup, I
> > > > > think it automatically takes care of the per-cgroup dirty ratio cases I
> > > > > mentioned in my previous mail. So we will use the system-wide dirty
> > > > > ratio to calculate the allowed dirty pages in this cgroup (dirty_ratio *
> > > > > available_memory()), and if this cgroup wrote too many pages, start
> > > > > writeout?
> > > >
> > > > OK, if I've understood well, you're proposing to use per-cgroup
> > > > dirty_ratio interface and do something like:
> > >
> > > I think we can use the system-wide dirty_ratio for each cgroup (instead of
> > > providing a configurable dirty_ratio for each cgroup, where each memory
> > > cgroup can have a different dirty ratio; I can't think of a use case
> > > immediately).
> >
> > I think each memcg should have both dirty_bytes and dirty_ratio,
> > dirty_bytes defaults to 0 (disabled) while dirty_ratio is inherited from
> > the global vm_dirty_ratio. Changing vm_dirty_ratio would not change
> > memcgs already using their own dirty_ratio, but new memcgs would get the
> > new value by default. The ratio would act over the amount of available
> > memory to the cgroup as though it were its own "virtual system" operating
> > with a subset of the system's RAM and the same global ratio.
>
> Agreed.
>
BTW, please add background_dirty_ratio in the same series of patches
(or some other knob to kick background writeback in a proper manner).

If not, we can't kick background writeback until we're caught by dirty_ratio.
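
Roughly something like this, where all the memcg_* names are hypothetical and
only illustrate the idea (the background threshold being background_ratio
applied to the cgroup's dirtyable memory):

/* per-cgroup analogue of dirty_background_ratio (hypothetical helper) */
static unsigned long memcg_background_thresh(struct mem_cgroup *memcg)
{
	return memcg_background_ratio(memcg) *
			memcg_dirtyable_memory(memcg) / 100;
}

and then, in the balance_dirty_pages() path:

	if (memcg_dirty_pages(memcg) > memcg_background_thresh(memcg)) {
		/* kick per-BDI background writeback, via the same wakeup
		   the global background_thresh check already uses */
	}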

Thanks,
-Kame
From: Vivek Goyal on
On Thu, Feb 25, 2010 at 04:12:11PM +0100, Andrea Righi wrote:
> On Tue, Feb 23, 2010 at 04:29:43PM -0500, Vivek Goyal wrote:
> > On Sun, Feb 21, 2010 at 04:18:45PM +0100, Andrea Righi wrote:
> >
> > [..]
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index 0b19943..c9ff1cd 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -137,10 +137,11 @@ static struct prop_descriptor vm_dirties;
> > >   */
> > >  static int calc_period_shift(void)
> > >  {
> > > -	unsigned long dirty_total;
> > > +	unsigned long dirty_total, dirty_bytes;
> > >
> > > -	if (vm_dirty_bytes)
> > > -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> > > +	dirty_bytes = mem_cgroup_dirty_bytes();
> > > +	if (dirty_bytes)
> > > +		dirty_total = dirty_bytes / PAGE_SIZE;
> > >  	else
> > >  		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> > >  				100;
> >
> > Ok, I don't understand this so I'd better ask. Can you explain a bit how the
> > memory cgroup dirty ratio is going to play with the per-BDI dirty proportion?
> >
> > Currently we seem to calculate each BDI's proportion (based on recently
> > completed events) of the system-wide dirty ratio and decide whether a process
> > should be throttled or not.
> >
> > Because the throttling decision is also based on the BDI and its proportion,
> > how are we going to fit it in with the memory cgroup? Is it going to be the
> > BDI's proportion of the dirty memory within the memory cgroup (and not
> > system wide)?
>
> IMHO we need to calculate the BDI dirty threshold as a function of the
> cgroup's dirty memory, and keep BDI statistics system wide.
>
> So, if a task is generating writes, the threshold at which it starts doing
> writeback itself will be calculated as a function of the cgroup's dirty
> memory. If the BDI's dirty memory is greater than this threshold, the task
> must start writing back dirty pages until it gets back under the expected
> dirty limit.
>

Ok, so calculate dirty per cgroup and calculate the BDI's proportion from the
cgroup's dirty memory? So will you be keeping track of vm_completion events
per cgroup, or will you rely on the existing system-wide and per-BDI
completion events to calculate the BDI proportion?

The BDI proportion is more an indication of device speed, and a faster device
gets a higher share of the dirty limit, so maybe we don't have to keep track
of completion events per cgroup and can rely on system-wide completion events
for calculating a BDI's proportion.

> OK, in this way a cgroup with a small dirty limit may be forced to
> writeback a lot of pages dirtied by other cgroups on the same device.
> But this is always related to the fact that tasks are forced to
> writeback dirty inodes randomly, and not the inodes they've actually
> dirtied.

So we are left with the following two issues.

- Should we rely on the global BDI stats for BDI_RECLAIMABLE and BDI_WRITEBACK,
or do we need to make these per cgroup to determine how many pages a cgroup
has actually dirtied and force writeouts accordingly?

- Once we decide to throttle a cgroup, it should write its own inodes and
should not be serialized behind other cgroups' inodes.

If we don't tackle the above two issues, I am not sure what problem the patch
set will solve. The only thing I can see is that we will be doing writeouts
much more aggressively once some memory cgroups have been created. (A smaller
per-cgroup dirty limit leads to a smaller per-BDI threshold, and when that is
compared with the overall BDI stats, it should lead to more writeouts.)

	if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
		break;

Because the bdi_thresh calculation will be based on per-cgroup dirty memory
while bdi_nr_reclaimable and bdi_nr_writeback will be system wide, we will be
doing much more aggressive writeouts.
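
To put rough, purely illustrative numbers on it: with dirty_ratio = 20 and a
cgroup that has 1GB of dirtyable memory, bdi_thresh is derived from only
~200MB (times the BDI's share), whereas bdi_nr_reclaimable + bdi_nr_writeback
are system-wide counts that can easily sit at many hundreds of MB for the
same BDI. So the check above rarely lets the task break out, and it keeps
doing writeback even when its own cgroup is well below its limit.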

But we will not achieve parallel writeback paths, so it probably will not help
the IO controller a lot.

Kame-san, is this a problem with current memory cgroups, where writeback is
not happening actively enough and you run into situations where there are too
many dirty pages in a cgroup and reclaim can take a long time?

Thanks
Vivek