From: Christoph Hellwig on
Btw, I'm very happy with all the writeback-related progress we've made
for the 2.6.36 cycle. The only major thing that's really missing, and
which should help dramatically with the I/O patterns, is stopping direct
writeback from balance_dirty_pages(). I've seen patches from Wu and
Jan for this and lots of discussion. If we get either variant in,
this should be one of the best VM releases from the filesystem point of
view.

From: Wu Fengguang on
On Thu, Jul 29, 2010 at 04:45:23PM +0800, Christoph Hellwig wrote:
> Btw, I'm very happy with all the writeback-related progress we've made
> for the 2.6.36 cycle. The only major thing that's really missing, and
> which should help dramatically with the I/O patterns, is stopping direct
> writeback from balance_dirty_pages(). I've seen patches from Wu and
> Jan for this and lots of discussion. If we get either variant in,
> this should be one of the best VM releases from the filesystem point of
> view.

Sorry for the delay. But I'm not feeling good about the current
patches, both mine and Jan's.

Accounting overhead and accuracy are the obvious problems. Neither
patch performs well on large NUMA machines and fast storage, and
previous discussions found this hard to improve.

We might do dirty throttling based on throughput, ignoring writeback
completions entirely. The basic idea is that for the current process
we already have a per-bdi-and-task threshold B as the local throttle
target. When dirty pages go beyond, say, B*80%, we start throttling
the task's writeback throughput. The closer to B, the lower the
throughput; when it reaches B or the global threshold, we stop the
task completely. The hope is that throughput will settle at some
balance point. This will need careful calculation to behave stably
and robustly.
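
Very roughly, I'm thinking of something like the sketch below (the 80%
knee, the batch size and the function name are all made up for
illustration, not existing kernel code):

        /*
         * Hypothetical throughput taper: full speed below 80% of the
         * per-bdi-and-task limit B, shrinking linearly to zero as the
         * dirty count approaches B, complete stop at or above B (or
         * the global threshold).
         */
        static unsigned long allowed_pages_per_pause(unsigned long dirty,
                                                     unsigned long B)
        {
                unsigned long knee = B * 8 / 10;        /* start throttling at B*80% */
                unsigned long max_batch = 1024;         /* made-up full-speed batch */

                if (dirty <= knee)
                        return max_batch;
                if (dirty >= B)
                        return 0;                       /* complete stop */

                /* linear taper between the knee and B */
                return max_batch * (B - dirty) / (B - knee);
        }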

In this way, the throttling can be made very smooth. My old experiments
show that the current writeback-completion-based throttling fluctuates
a lot in stall time. In particular it makes writeback bumpy for
NFS, so that sometimes the network pipe is not active at all and
performance suffers noticeably.

By the way, we'll harvest a writeback IO controller :)

Thanks,
Fengguang
From: Jan Kara on
On Tue 03-08-10 15:34:49, Wu Fengguang wrote:
> On Thu, Jul 29, 2010 at 04:45:23PM +0800, Christoph Hellwig wrote:
> > Btw, I'm very happy with all the writeback-related progress we've made
> > for the 2.6.36 cycle. The only major thing that's really missing, and
> > which should help dramatically with the I/O patterns, is stopping direct
> > writeback from balance_dirty_pages(). I've seen patches from Wu and
> > Jan for this and lots of discussion. If we get either variant in,
> > this should be one of the best VM releases from the filesystem point of
> > view.
>
> Sorry for the delay. But I'm not feeling good about the current
> patches, both mine and Jan's.
>
> Accounting overhead and accuracy are the obvious problems. Neither
> patch performs well on large NUMA machines and fast storage, and
> previous discussions found this hard to improve.
Yes, my patch for balance_dirty_pages() has a problem with percpu counter
(im)precision, and resorting to a pure atomic type could result in the
cache line bouncing among the CPUs completing the IO (at least that is the
reason why all the other BDI stats are per-cpu, I believe).
We could solve the problem by doing the accounting at page IO submission
time (there, using the atomic type should be fine since we mostly submit IO
from the flusher thread anyway). It's just that doing the accounting at
completion time has the nice property that we really hold the throttled
thread up to the moment when the VM can actually reuse the pages.
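
Roughly, the submission-time accounting might look like the sketch
below (only an illustration, not my actual patch; the pages_submitted
counter and the throttle wait queue are hypothetical bdi fields, the
atomic and waitqueue helpers are the usual kernel primitives):

        /*
         * Account pages when the flusher thread submits them instead of
         * when the IO completes.  Submission happens almost entirely in
         * one context, so a plain atomic counter does not bounce between
         * the CPUs that handle completions, and balance_dirty_pages()
         * waiters can be woken with an exact count.
         */
        static void bdi_account_pages_submitted(struct backing_dev_info *bdi,
                                                long nr_pages)
        {
                atomic_long_add(nr_pages, &bdi->pages_submitted);
                if (waitqueue_active(&bdi->throttle_waitq))
                        wake_up(&bdi->throttle_waitq);
        }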

> We might do dirty throttling based on throughput, ignoring writeback
> completions entirely. The basic idea is that for the current process
> we already have a per-bdi-and-task threshold B as the local throttle
Do we? The limit is currently just per-bdi, isn't it? Or do you mean
the ratelimiting - i.e. how often do we call balance_dirty_pages()?
That is per-cpu if I'm right.

> target. When dirty pages go beyond, say, B*80%, we start throttling
> the task's writeback throughput. The closer to B, the lower the
> throughput; when it reaches B or the global threshold, we stop the
> task completely. The hope is that throughput will settle at some
> balance point. This will need careful calculation to behave stably
> and robustly.
But what exactly do you mean by throttling the task in your scenario?
What would it wait on?

> In this way, the throttling can be made very smooth. My old experiments
> show that the current writeback-completion-based throttling fluctuates
> a lot in stall time. In particular it makes writeback bumpy for
> NFS, so that sometimes the network pipe is not active at all and
> performance suffers noticeably.
>
> By the way, we'll harvest a writeback IO controller :)

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
From: Wu Fengguang on
On Tue, Aug 03, 2010 at 08:52:49PM +0800, Jan Kara wrote:
> On Tue 03-08-10 15:34:49, Wu Fengguang wrote:
> > On Thu, Jul 29, 2010 at 04:45:23PM +0800, Christoph Hellwig wrote:
> > > Btw, I'm very happy with all the writeback-related progress we've made
> > > for the 2.6.36 cycle. The only major thing that's really missing, and
> > > which should help dramatically with the I/O patterns, is stopping direct
> > > writeback from balance_dirty_pages(). I've seen patches from Wu and
> > > Jan for this and lots of discussion. If we get either variant in,
> > > this should be one of the best VM releases from the filesystem point of
> > > view.
> >
> > Sorry for the delay. But I'm not feeling good about the current
> > patches, both mine and Jan's.
> >
> > Accounting overhead and accuracy are the obvious problems. Neither
> > patch performs well on large NUMA machines and fast storage, and
> > previous discussions found this hard to improve.
> Yes, my patch for balance_dirty_pages() has a problem with percpu counter
> (im)precision, and resorting to a pure atomic type could result in the
> cache line bouncing among the CPUs completing the IO (at least that is the
> reason why all the other BDI stats are per-cpu, I believe).
> We could solve the problem by doing the accounting at page IO submission
> time (there, using the atomic type should be fine since we mostly submit IO
> from the flusher thread anyway). It's just that doing the accounting at
> completion time has the nice property that we really hold the throttled
> thread up to the moment when the VM can actually reuse the pages.

We could try this and check how it works with NFS. The attached patch
will also be necessary for the test; it implements a writeback wait
queue for NFS, and without it all dirty pages may be put under
writeback.
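
To give an idea of what such a wait queue does (this is only a rough
illustration of the concept, not the attached patch itself; the
nfs_server fields below are hypothetical, and the cap is approximate):

        /*
         * Cap the number of NFS pages under writeback per server: make
         * submitters sleep on a wait queue when the cap is hit, and wake
         * them as completions bring the count back down.
         */
        static void nfs_wait_for_writeback_slot(struct nfs_server *server)
        {
                wait_event(server->writeback_wait,
                           atomic_read(&server->writeback_pages) <
                           server->writeback_limit);
                atomic_inc(&server->writeback_pages);
        }

        static void nfs_release_writeback_slot(struct nfs_server *server)
        {
                if (atomic_dec_return(&server->writeback_pages) <
                    server->writeback_limit)
                        wake_up(&server->writeback_wait);
        }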

I suspect the resulting fluctuations will be the same, because
balance_dirty_pages() will wait on some background writeback (as you
proposed), which will block on the NFS writeback queue, which in turn
waits for the completion of COMMIT RPCs (the current patches wait
directly here). When one COMMIT completes, lots of pages may be freed
in a burst, which makes progress through the whole stack very bumpy.

> > We might do dirty throttling based on throughput, ignoring writeback
> > completions entirely. The basic idea is that for the current process
> > we already have a per-bdi-and-task threshold B as the local throttle
> Do we? The limit is currently just per-bdi, isn't it? Or do you mean

bdi_dirty_limit() calls task_dirty_limit(), so the limit is also tied to
the current task. For convenience we just called it the per-bdi limit :)

> the ratelimiting - i.e. how often do we call balance_dirty_pages()?
> That is per-cpu if I'm right.
> > target. When dirty pages go beyond, say, B*80%, we start throttling
> > the task's writeback throughput. The closer to B, the lower the
> > throughput; when it reaches B or the global threshold, we stop the
> > task completely. The hope is that throughput will settle at some
> > balance point. This will need careful calculation to behave stably
> > and robustly.
> But what exactly do you mean by throttling the task in your scenario?
> What would it wait on?

It will simply wait, e.g. 10ms, for every N pages written. The closer
to B, the smaller N will be.
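
Something like the sketch below (again only an illustration;
allowed_pages_per_pause() would be the hypothetical taper sketched
earlier in the thread, while __set_current_state(),
io_schedule_timeout() and msecs_to_jiffies() are existing kernel
primitives):

        /*
         * Pause 10ms for every N pages the task has written since its
         * last pause.  N comes from the taper, so it shrinks towards
         * zero as the dirty count approaches B and the task ends up
         * sleeping almost continuously near the limit.
         */
        static void task_writeback_pause(unsigned long *written_since_pause,
                                         unsigned long dirty, unsigned long B)
        {
                unsigned long n = allowed_pages_per_pause(dirty, B);

                if (n && *written_since_pause < n)
                        return;                 /* still within budget */

                *written_since_pause = 0;
                __set_current_state(TASK_UNINTERRUPTIBLE);
                io_schedule_timeout(msecs_to_jiffies(10));      /* the 10ms pause */
        }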

Thanks,
Fengguang

> > In this way, the throttling can be made very smooth. My old experiments
> > show that the current writeback-completion-based throttling fluctuates
> > a lot in stall time. In particular it makes writeback bumpy for
> > NFS, so that sometimes the network pipe is not active at all and
> > performance suffers noticeably.
> >
> > By the way, we'll harvest a writeback IO controller :)
>
> Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
From: Wu Fengguang on
Sorry, forgot the attachment :)

Thanks,
Fengguang

On Tue, Aug 03, 2010 at 11:04:46PM +0800, Wu Fengguang wrote:
> On Tue, Aug 03, 2010 at 08:52:49PM +0800, Jan Kara wrote:
> > On Tue 03-08-10 15:34:49, Wu Fengguang wrote:
> > > On Thu, Jul 29, 2010 at 04:45:23PM +0800, Christoph Hellwig wrote:
> > > > Btw, I'm very happy with all the writeback-related progress we've made
> > > > for the 2.6.36 cycle. The only major thing that's really missing, and
> > > > which should help dramatically with the I/O patterns, is stopping direct
> > > > writeback from balance_dirty_pages(). I've seen patches from Wu and
> > > > Jan for this and lots of discussion. If we get either variant in,
> > > > this should be one of the best VM releases from the filesystem point of
> > > > view.
> > >
> > > Sorry for the delay. But I'm not feeling good about the current
> > > patches, both mine and Jan's.
> > >
> > > Accounting overhead and accuracy are the obvious problems. Neither
> > > patch performs well on large NUMA machines and fast storage, and
> > > previous discussions found this hard to improve.
> > Yes, my patch for balance_dirty_pages() has a problem with percpu counter
> > (im)precision, and resorting to a pure atomic type could result in the
> > cache line bouncing among the CPUs completing the IO (at least that is the
> > reason why all the other BDI stats are per-cpu, I believe).
> > We could solve the problem by doing the accounting at page IO submission
> > time (there, using the atomic type should be fine since we mostly submit IO
> > from the flusher thread anyway). It's just that doing the accounting at
> > completion time has the nice property that we really hold the throttled
> > thread up to the moment when the VM can actually reuse the pages.
>
> We could try this and check how it works with NFS. The attached patch
> will also be necessary for the test; it implements a writeback wait
> queue for NFS, and without it all dirty pages may be put under
> writeback.
>
> I suspect the resulting fluctuations will be the same, because
> balance_dirty_pages() will wait on some background writeback (as you
> proposed), which will block on the NFS writeback queue, which in turn
> waits for the completion of COMMIT RPCs (the current patches wait
> directly here). When one COMMIT completes, lots of pages may be freed
> in a burst, which makes progress through the whole stack very bumpy.
>
> > > We might do dirty throttling based on throughput, ignoring writeback
> > > completions entirely. The basic idea is that for the current process
> > > we already have a per-bdi-and-task threshold B as the local throttle
> > Do we? The limit is currently just per-bdi, isn't it? Or do you mean
>
> bdi_dirty_limit() calls task_dirty_limit(), so the limit is also tied to
> the current task. For convenience we just called it the per-bdi limit :)
>
> > the ratelimiting - i.e. how often do we call balance_dirty_pages()?
> > That is per-cpu if I'm right.
> > > target. When dirty pages go beyond, say, B*80%, we start throttling
> > > the task's writeback throughput. The closer to B, the lower the
> > > throughput; when it reaches B or the global threshold, we stop the
> > > task completely. The hope is that throughput will settle at some
> > > balance point. This will need careful calculation to behave stably
> > > and robustly.
> > But what exactly do you mean by throttling the task in your scenario?
> > What would it wait on?
>
> It will simply wait, e.g. 10ms, for every N pages written. The closer
> to B, the smaller N will be.
>
> Thanks,
> Fengguang
>
> > > In this way, the throttling can be made very smooth. My old experiments
> > > show that the current writeback-completion-based throttling fluctuates
> > > a lot in stall time. In particular it makes writeback bumpy for
> > > NFS, so that sometimes the network pipe is not active at all and
> > > performance suffers noticeably.
> > >
> > > By the way, we'll harvest a writeback IO controller :)
> >
> > Honza
> > --
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR