From: Andrea Righi
On Thu, Jul 08, 2010 at 10:57:13PM -0400, Munehiro Ikeda wrote:
> This might be a piece of the puzzle for complete async write support in
> the blkio controller. Another piece in my head is page dirtying ratio
> control. I believe Andrea Righi was working on it... what is the status?

Greg Thelen (cc'ed) has made some progress on my original work. AFAIK
there are still some locking issues to resolve, principally because
cgroup dirty memory accounting requires lock_page_cgroup() to be
irq-safe. I did some tests using the irq-safe locking vs. the trylock
approach, and Greg also tested the RCU way.

The RCU approach seems promising IMHO, because a page's cgroup owner
is supposed to change rarely (except for files shared and frequently
written by many cgroups).
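
For reference, the RCU idea looks roughly like this (just a sketch, not
a patch; lookup_page_cgroup() is the real helper, but the accounting
call at the end is only a placeholder name):

	struct page_cgroup *pc;
	struct mem_cgroup *memcg;

	rcu_read_lock();
	pc = lookup_page_cgroup(page);
	/*
	 * No lock_page_cgroup() and no irq-safe locking here: we
	 * accept reading a stale owner in the rare case the page is
	 * being moved between cgroups while we account it.
	 */
	memcg = pc ? rcu_dereference(pc->mem_cgroup) : NULL;
	if (memcg)
		memcg_account_file_dirty(memcg, 1);	/* placeholder */
	rcu_read_unlock();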

Greg, do you have a patch rebased to a recent kernel?

Thanks,
-Andrea
From: Vivek Goyal
On Thu, Jul 08, 2010 at 10:57:13PM -0400, Munehiro Ikeda wrote:
> These RFC patches are a trial to add async (cached) write support to the
> blkio controller.
>
> The only testing done so far is to compile, boot, and confirm that write
> bandwidth seems prioritized when pages dirtied by two processes in
> different cgroups are written back to a device simultaneously. I know this
> is the minimum (or less) of testing, but I am posting it as an RFC because
> I would like to hear your opinions about the design direction at this
> early stage.
>
> Patches are for 2.6.35-rc4.
>
> This patch series consists of two chunks.
>
> (1) iotrack (patch 01/11 -- 06/11)
>
> This is functionality to track who dirtied a page, i.e. exactly which
> cgroup the process that dirtied the page belongs to. The blkio controller
> reads this info later and prioritizes accordingly when the page is
> actually written to a block device. This work originates from Ryo Tsuruta
> and Hirokazu Takahashi and includes Andrea Righi's idea. It was posted as
> a part of dm-ioband, which was one of the proposals for an IO controller.
>
>
> (2) blkio controller modification (07/11 -- 11/11)
>
> This is the main part of blkio controller async write support.
> Currently async queues are device-wide, and async write IOs are always
> treated as belonging to the root group.
> These patches make async queues per cfq_group per device so that they can
> be controlled. Async writes are handled by the flush kernel thread.
> Because queue pointers are stored in cfq_io_context, the thread's
> io_context has to hold multiple cfq_io_contexts per device. So these
> patches make cfq_io_context per io_context per cfq_group, which means per
> io_context per cgroup per device.
>
>
> This might be a piece of the puzzle for complete async write support in
> the blkio controller. Another piece in my head is page dirtying ratio
> control. I believe Andrea Righi was working on it... what is the status?

Thanks Muuh. I will look into the patches in detail.

In my initial patches I had implemented support for ASYNC control (they
also included Ryo's IO tracking patches), but it did not work well and
was unpredictable. I realized that until and unless we implement some
kind of per-group dirty ratio/page cache share at the VM level and
create parallel paths for ASYNC IO, writes often get serialized.

So writes belonging to a high priority group get stuck behind a low
priority group and you don't get any service differentiation.

So IMHO, this piece should go into the kernel after we have first fixed
the problem at the VM level (read: memory controller) with some kind of
per-cgroup dirty ratio.
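
Just to be concrete about what I mean: the memory controller would have
to export something like the check below, mirroring the global dirty
threshold check in balance_dirty_pages(). All the mem_cgroup_* helpers
here are hypothetical; none of them exist today.

	/* hypothetical per-cgroup analogue of the global dirty ratio */
	static bool mem_cgroup_over_dirty_limit(struct mem_cgroup *memcg)
	{
		unsigned long cache = mem_cgroup_page_cache_pages(memcg);
		unsigned long dirty = mem_cgroup_dirty_pages(memcg);

		return dirty * 100 > cache * mem_cgroup_dirty_ratio(memcg);
	}

balance_dirty_pages() would then throttle a task against its own
cgroup's limit in addition to the global one, which is what creates the
parallel paths for ASYNC IO.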

>
> And also, I'm thinking that async write support is required by the
> bandwidth capping policy of the blkio controller. Bandwidth capping can
> be done in a layer higher than the elevator.

I think the capping facility should be implemented in higher layers,
otherwise it is not useful for higher-level logical devices (dm/md).

It was OK to implement proportional bandwidth division at the CFQ level
because one can do proportional BW division at each leaf node and still
get overall service differentiation at a higher-level logical node. But
the same cannot be done for max BW control.

> However, I think it should also be done in the elevator layer, in my
> opinion. The elevator buffers and sorts requests. If there is another
> buffering functionality in an upper layer, that is double buffering and
> it can be harmful to the elevator's prediction.

I don't mind doing it at the elevator layer also, because in that case,
if somebody is not using dm/md, one does not have to load a max BW
control module and can simply enable max BW control in CFQ.

Thinking more about it, though, we are now suggesting implementing max
BW control in two places. I think that will mean duplicated code and
increased complexity in CFQ. We should probably implement max BW control
with the help of a dm module and use the same code for CFQ as well.
There is pain associated with configuring a dm device, but I guess it is
easier than maintaining two max BW control schemes in the kernel.
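
Whichever layer it ends up in, the core of max BW control is the same
token-bucket bookkeeping, so the sharable piece would look roughly like
this (sketch only, not written against any real interface):

	struct bw_limit {
		u64 rate_bps;		/* configured max bandwidth */
		s64 tokens;		/* bytes we may still issue */
		unsigned long last;	/* jiffies of last refill */
	};

	/* returns false if the caller must queue/delay the IO */
	static bool bw_limit_charge(struct bw_limit *l, unsigned int bytes)
	{
		unsigned long now = jiffies;

		l->tokens += div_u64(l->rate_bps *
				jiffies_to_msecs(now - l->last), MSEC_PER_SEC);
		l->tokens = min_t(s64, l->tokens, l->rate_bps); /* 1s burst */
		l->last = now;

		if (l->tokens < bytes)
			return false;
		l->tokens -= bytes;
		return true;
	}

dm would charge per bio and CFQ per request, but the accounting would
live in one place.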

Thanks
Vivek
From: Munehiro Ikeda
Vivek Goyal wrote, on 07/09/2010 09:45 AM:
> On Thu, Jul 08, 2010 at 10:57:13PM -0400, Munehiro Ikeda wrote:
>> These RFC patches are a trial to add async (cached) write support to the
>> blkio controller.
>>
>> The only testing done so far is to compile, boot, and confirm that write
>> bandwidth seems prioritized when pages dirtied by two processes in
>> different cgroups are written back to a device simultaneously. I know
>> this is the minimum (or less) of testing, but I am posting it as an RFC
>> because I would like to hear your opinions about the design direction at
>> this early stage.
>>
>> Patches are for 2.6.35-rc4.
>>
>> This patch series consists of two chunks.
>>
>> (1) iotrack (patch 01/11 -- 06/11)
>>
>> This is functionality to track who dirtied a page, i.e. exactly which
>> cgroup the process that dirtied the page belongs to. The blkio
>> controller reads this info later and prioritizes accordingly when the
>> page is actually written to a block device. This work originates from
>> Ryo Tsuruta and Hirokazu Takahashi and includes Andrea Righi's idea. It
>> was posted as a part of dm-ioband, which was one of the proposals for an
>> IO controller.
>>
>>
>> (2) blkio controller modification (07/11 -- 11/11)
>>
>> This is the main part of blkio controller async write support.
>> Currently async queues are device-wide, and async write IOs are always
>> treated as belonging to the root group.
>> These patches make async queues per cfq_group per device so that they
>> can be controlled. Async writes are handled by the flush kernel thread.
>> Because queue pointers are stored in cfq_io_context, the thread's
>> io_context has to hold multiple cfq_io_contexts per device. So these
>> patches make cfq_io_context per io_context per cfq_group, which means
>> per io_context per cgroup per device.
>>
>>
>> This might be a piece of the puzzle for complete async write support in
>> the blkio controller. Another piece in my head is page dirtying ratio
>> control. I believe Andrea Righi was working on it... what is the status?
>
> Thanks Muuh. I will look into the patches in detail.
>
> In my initial patches I had implemented support for ASYNC control (they
> also included Ryo's IO tracking patches), but it did not work well and
> was unpredictable. I realized that until and unless we implement some
> kind of per-group dirty ratio/page cache share at the VM level and
> create parallel paths for ASYNC IO, writes often get serialized.
>
> So writes belonging to a high priority group get stuck behind a low
> priority group and you don't get any service differentiation.

I have also faced situations where high priority writes fall behind
lower priority writes. This patch set does seem to prioritize IOs when
they actually contend, but yes, such contention is a bit rare because
the writes are often serialized.


> So IMHO, this piece should go into the kernel after we have first fixed
> the problem at the VM level (read: memory controller) with some kind of
> per-cgroup dirty ratio.

Well, right. I agree.
But I think we can work in parallel. I will try to struggle with both.

By the way, I guess the write serialization is caused by the page
selection of the flush kernel thread. If so, simple dirty ratio/page
cache share control does not seem able to solve that by itself. Instead
of it, or in addition to it, the page selection order should be
modified. Am I correct?
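
What I have in mind by modifying the page selection order is, very
roughly, something like this in the flusher (pure pseudo-code; none of
these helpers exist, and the per-inode owner would come from the
iotrack info):

	/*
	 * Give each cgroup its own slice of a writeback round instead
	 * of walking b_io purely in dirtied_when order.
	 */
	for_each_blkio_cgroup(blkcg) {
		wbc.nr_to_write = round_quota(blkcg);	/* weight-based */

		list_for_each_entry(inode, &wb->b_io, i_list) {
			if (iotrack_owner(inode) != blkcg)
				continue;	/* not this cgroup's data */
			writeback_single_inode(inode, &wbc);
			if (wbc.nr_to_write <= 0)
				break;
		}
	}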


>> And also, I'm thinking that async write support is required by the
>> bandwidth capping policy of the blkio controller. Bandwidth capping can
>> be done in a layer higher than the elevator.
>
> I think the capping facility should be implemented in higher layers,
> otherwise it is not useful for higher-level logical devices (dm/md).
>
> It was OK to implement proportional bandwidth division at the CFQ level
> because one can do proportional BW division at each leaf node and still
> get overall service differentiation at a higher-level logical node. But
> the same cannot be done for max BW control.

A reason why I prefer to have BW control in the elevator is based on my
evaluation results from comparing the three IO controllers proposed
before the blkio controller was merged. The three proposals were
dm-ioband, io-throttle, and an elevator implementation which is the
closest to the current blkio controller. The former two handled BIOs
and only the last one handled REQUESTs. The results showed that only
handling REQUESTs produced the expected service differentiation.
Though I've not dived into the cause analysis, I guess the causes are
that a BIO is not associated one-to-one with an actual IO request, and
possibly the elevator's behavior.
But on the other hand, as you say, a BW controller in the elevator
cannot control logical devices (or it is quite hard to adapt it to
them). It's a painful situation.

I will analyse the cause of the non-differentiation in the BIO handling
case much more deeply.


>> However, I think it should also be done in the elevator layer, in my
>> opinion. The elevator buffers and sorts requests. If there is another
>> buffering functionality in an upper layer, that is double buffering and
>> it can be harmful to the elevator's prediction.
>
> I don't mind doing it at the elevator layer also, because in that case,
> if somebody is not using dm/md, one does not have to load a max BW
> control module and can simply enable max BW control in CFQ.
>
> Thinking more about it, though, we are now suggesting implementing max
> BW control in two places. I think that will mean duplicated code and
> increased complexity in CFQ. We should probably implement max BW
> control with the help of a dm module and use the same code for CFQ as
> well. There is pain associated with configuring a dm device, but I
> guess it is easier than maintaining two max BW control schemes in the
> kernel.

Do you mean that sharing the code for max BW control between dm and CFQ
is a possible solution? That's interesting. I will think about it.


> Thanks
> Vivek


Many thanks for your suggestions, as always.
Muuhh


--
IKEDA, Munehiro
NEC Corporation of America
m-ikeda(a)ds.jp.nec.com

From: Nauman Rafique
On Fri, Jul 9, 2010 at 5:17 PM, Munehiro Ikeda <m-ikeda(a)ds.jp.nec.com> wrote:
> Vivek Goyal wrote, on 07/09/2010 09:45 AM:
>>
>> On Thu, Jul 08, 2010 at 10:57:13PM -0400, Munehiro Ikeda wrote:
>>>
>>> These RFC patches are a trial to add async (cached) write support to
>>> the blkio controller.
>>>
>>> The only testing done so far is to compile, boot, and confirm that
>>> write bandwidth seems prioritized when pages dirtied by two processes
>>> in different cgroups are written back to a device simultaneously. I
>>> know this is the minimum (or less) of testing, but I am posting it as
>>> an RFC because I would like to hear your opinions about the design
>>> direction at this early stage.
>>>
>>> Patches are for 2.6.35-rc4.
>>>
>>> This patch series consists of two chunks.
>>>
>>> (1) iotrack (patch 01/11 -- 06/11)
>>>
>>> This is functionality to track who dirtied a page, i.e. exactly which
>>> cgroup the process that dirtied the page belongs to. The blkio
>>> controller reads this info later and prioritizes accordingly when the
>>> page is actually written to a block device. This work originates from
>>> Ryo Tsuruta and Hirokazu Takahashi and includes Andrea Righi's idea.
>>> It was posted as a part of dm-ioband, which was one of the proposals
>>> for an IO controller.
>>>
>>>
>>> (2) blkio controller modification (07/11 -- 11/11)
>>>
>>> This is the main part of blkio controller async write support.
>>> Currently async queues are device-wide, and async write IOs are always
>>> treated as belonging to the root group.
>>> These patches make async queues per cfq_group per device so that they
>>> can be controlled. Async writes are handled by the flush kernel
>>> thread. Because queue pointers are stored in cfq_io_context, the
>>> thread's io_context has to hold multiple cfq_io_contexts per device.
>>> So these patches make cfq_io_context per io_context per cfq_group,
>>> which means per io_context per cgroup per device.
>>>
>>>
>>> This might be a piece of the puzzle for complete async write support
>>> in the blkio controller. Another piece in my head is page dirtying
>>> ratio control. I believe Andrea Righi was working on it... what is
>>> the status?
>>
>> Thanks Muuh. I will look into the patches in detail.
>>
>> In my initial patches I had implemented support for ASYNC control (they
>> also included Ryo's IO tracking patches), but it did not work well and
>> was unpredictable. I realized that until and unless we implement some
>> kind of per-group dirty ratio/page cache share at the VM level and
>> create parallel paths for ASYNC IO, writes often get serialized.
>>
>> So writes belonging to a high priority group get stuck behind a low
>> priority group and you don't get any service differentiation.
>
> I have also faced situations where high priority writes fall behind
> lower priority writes. This patch set does seem to prioritize IOs when
> they actually contend, but yes, such contention is a bit rare because
> the writes are often serialized.
>
>
>> So IMHO, this piece should go into the kernel after we have first fixed
>> the problem at the VM level (read: memory controller) with some kind of
>> per-cgroup dirty ratio.
>
> Well, right. I agree.
> But I think we can work in parallel. I will try to struggle with both.

IMHO, we have a classic chicken-and-egg problem here. We should try to
merge pieces as they become available. If we can agree on patches that
do async IO tracking for the IO controller, we should go ahead with them
instead of waiting for per-cgroup dirty ratios.

In terms of getting numbers, we have been using patches that add
per-cpuset dirty ratios on top of NUMA_EMU, and we get good
differentiation between buffered writes, as well as between buffered
writes and reads.

It is really obvious that as long as the flusher threads etc. are not
cgroup-aware, differentiation for buffered writes will not be perfect in
all cases, but this is a step in the right direction and we should go
for it.

>
> By the way, I guess the write serialization is caused by the page
> selection of the flush kernel thread. If so, simple dirty ratio/page
> cache share control does not seem able to solve that by itself. Instead
> of it, or in addition to it, the page selection order should be
> modified. Am I correct?
>
>
>>> And also, I'm thinking that async write support is required by the
>>> bandwidth capping policy of the blkio controller. Bandwidth capping
>>> can be done in a layer higher than the elevator.
>>
>> I think the capping facility should be implemented in higher layers,
>> otherwise it is not useful for higher-level logical devices (dm/md).
>>
>> It was OK to implement proportional bandwidth division at the CFQ level
>> because one can do proportional BW division at each leaf node and still
>> get overall service differentiation at a higher-level logical node. But
>> the same cannot be done for max BW control.
>
> A reason why I prefer to have BW control in the elevator is based on my
> evaluation results from comparing the three IO controllers proposed
> before the blkio controller was merged. The three proposals were
> dm-ioband, io-throttle, and an elevator implementation which is the
> closest to the current blkio controller. The former two handled BIOs
> and only the last one handled REQUESTs. The results showed that only
> handling REQUESTs produced the expected service differentiation.
> Though I've not dived into the cause analysis, I guess the causes are
> that a BIO is not associated one-to-one with an actual IO request, and
> possibly the elevator's behavior.
> But on the other hand, as you say, a BW controller in the elevator
> cannot control logical devices (or it is quite hard to adapt it to
> them). It's a painful situation.
>
> I will analyse the cause of the non-differentiation in the BIO handling
> case much more deeply.
>
>
>>> However, I think it should also be done in the elevator layer, in my
>>> opinion. The elevator buffers and sorts requests. If there is
>>> another buffering functionality in an upper layer, that is double
>>> buffering and it can be harmful to the elevator's prediction.
>>
>> I don't mind doing it at the elevator layer also, because in that case,
>> if somebody is not using dm/md, one does not have to load a max BW
>> control module and can simply enable max BW control in CFQ.
>>
>> Thinking more about it, though, we are now suggesting implementing max
>> BW control in two places. I think that will mean duplicated code and
>> increased complexity in CFQ. We should probably implement max BW
>> control with the help of a dm module and use the same code for CFQ as
>> well. There is pain associated with configuring a dm device, but I
>> guess it is easier than maintaining two max BW control schemes in the
>> kernel.
>
> Do you mean that sharing the code for max BW control between dm and CFQ
> is a possible solution? That's interesting. I will think about it.
>
>
>> Thanks
>> Vivek
>
>
> Many thanks for your suggestions, as always.
> Muuhh
>
>
> --
> IKEDA, Munehiro
>  NEC Corporation of America
>    m-ikeda(a)ds.jp.nec.com
>
From: Vivek Goyal
On Fri, Jul 09, 2010 at 05:55:23PM -0700, Nauman Rafique wrote:

[..]
> > Well, right. I agree.
> > But I think we can work in parallel. I will try to struggle with both.
>
> IMHO, we have a classic chicken-and-egg problem here. We should try to
> merge pieces as they become available. If we can agree on patches that
> do async IO tracking for the IO controller, we should go ahead with
> them instead of waiting for per-cgroup dirty ratios.
>
> In terms of getting numbers, we have been using patches that add
> per-cpuset dirty ratios on top of NUMA_EMU, and we get good
> differentiation between buffered writes, as well as between buffered
> writes and reads.
>
> It is really obvious that as long as the flusher threads etc. are not
> cgroup-aware, differentiation for buffered writes will not be perfect
> in all cases, but this is a step in the right direction and we should
> go for it.

Working in parallel on two separate pieces is fine. But pushing the
second piece in first does not make much sense to me, because the second
piece does not work if the first piece is not in; there is no way to
test it. What's the point of pushing code into the kernel which only
compiles but does not achieve its intended purpose because other pieces
are missing?

Per-cgroup dirty ratio is a fairly hard problem, and a few attempts have
already been made at it. IMHO, we need to first work on that piece and
get it into the kernel, and then work on the IO tracking patches. Let's
fix the hard problem first, since it is necessary to make the second set
of patches work.

Thanks
Vivek