From: Nauman Rafique on
Hi Vivek,
Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with
Jens about IO controller during Linux Plumbers Conference '09. Jens
expressed his concerns about the size and complexity of the patches. I
believe that is a reasonable concern. We talked about things that
could be done to reduce the size of the patches. The requirement that
the "solution has to work with all IO schedulers" seems like a
secondary concern at this point; and it came out as one thing that can
help to reduce the size of the patch set. Another possibility is to
use a simpler scheduling algorithm e.g. weighted round robin, instead
of BFQ scheduler. BFQ indeed has great properties, but we cannot deny
the fact that it is complex to understand, and might be cumbersome to
maintain. Also, hierarchical scheduling is something that could be
unnecessary in the first set of patches, even though cgroups are
hierarchical in nature.

We are starting from a point where there is no cgroup based IO
scheduling in the kernel. And it is probably not reasonable to satisfy
all IO scheduling related requirements in one patch set. We can start
with something simple, and build on top of that. So a very simple
patch set that enables cgroup based proportional scheduling for CFQ
seems like the way to go at this point.

It would be great if we discuss our plans on the mailing list, so we
can get early feedback from everyone.
From: Vivek Goyal on
On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> Hi Vivek,
> Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with
> Jens about IO controller during Linux Plumbers Conference '09. Jens
> expressed his concerns about the size and complexity of the patches. I
> believe that is a reasonable concern. We talked about things that
> could be done to reduce the size of the patches. The requirement that
> the "solution has to work with all IO schedulers" seems like a
> secondary concern at this point; and it came out as one thing that can
> help to reduce the size of the patch set.

Initially doing cgroup based IO control only for CFQ should help a lot in
reducing the patchset size.

> Another possibility is to
> use a simpler scheduling algorithm e.g. weighted round robin, instead
> of BFQ scheduler. BFQ indeed has great properties, but we cannot deny
> the fact that it is complex to understand, and might be cumbersome to
> maintain.

I have already gotten rid of the core of BFQ. The remaining parts are the idle
tree and the data structures. I will see how I can simplify it further.

> Also, hierarchical scheduling is something that could be
> unnecessary in the first set of patches, even though cgroups are
> hierarchical in nature.

Sure. Though I don't think a lot of the code is there because of the
hierarchical nature. If we solve the issue at the CFQ layer, we have to
maintain at least two levels: one for queues and one for groups. So even
the simplest solution becomes almost hierarchical in nature. But I will
still see how to get rid of some code here too...
>
> We are starting from a point where there is no cgroup based IO
> scheduling in the kernel. And it is probably not reasonable to satisfy
> all IO scheduling related requirements in one patch set. We can start
> with something simple, and build on top of that. So a very simple
> patch set that enables cgroup based proportional scheduling for CFQ
> seems like the way to go at this point.

Sure, we can start with CFQ only. But a bigger question we need to answer
is whether CFQ is the right place to solve the issue. Jens, do you think
that CFQ is the right place to solve the problem?

Andrew seems to favor a high-level approach, so that IO schedulers stay less
complex and we can provide fairness at higher-level logical devices as well.

I will again try to summarize my understanding so far about the pros/cons
of each approach and then we can take the discussion forward.

Fairness in terms of size of IO or disk time used
=================================================
On seeky media, fairness in terms of disk time gets us better results than
fairness in terms of the size or number of IOs.

If we implement some kind of time-based solution at a higher layer, then
that higher layer needs to know how much time each group used. We can
probably do some kind of timestamping in the bio to get a sense of when it
got to the disk and when it finished. But on multi-queue hardware there can
be multiple requests in the disk at the same time, from the same queue or
from different queues, and with a pure timestamping-based approach I so far
could not figure out how a higher layer would know how much time each group
actually used.

So this is the first point of contention: how do we want to provide
fairness, in terms of disk time used or in terms of the size/number of
IOs?
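
To illustrate the accounting problem, here is a toy user-space sketch
(nothing below is kernel code; io_group, io_request and charge_io are
made-up names). With two requests from different groups in the drive at the
same time, naive per-request wall-clock charging over-charges both groups:

#include <stdio.h>
#include <stdint.h>

struct io_group {
    const char *name;
    uint64_t time_used_us;    /* naive sum of per-request wall-clock time */
};

struct io_request {
    struct io_group *group;
    uint64_t dispatch_us;     /* timestamp taken when sent to the device */
};

static void charge_io(struct io_request *rq, uint64_t complete_us)
{
    /* Charge the whole dispatch->completion interval to the group.
     * This is correct only if the device services one request at a time. */
    rq->group->time_used_us += complete_us - rq->dispatch_us;
}

int main(void)
{
    struct io_group a = { "group1", 0 }, b = { "group2", 0 };
    /* Two requests overlap inside the drive from 100us to 800us. */
    struct io_request r1 = { &a, 0 }, r2 = { &b, 100 };

    charge_io(&r1, 800);
    charge_io(&r2, 900);

    /* Prints 800us for each group, i.e. 1600us "used" in ~900us of wall
     * clock, so the naive sums cannot be trusted on queuing hardware. */
    printf("%s: %llu us, %s: %llu us\n",
           a.name, (unsigned long long)a.time_used_us,
           b.name, (unsigned long long)b.time_used_us);
    return 0;
}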

Max bandwidth Controller or Proportional bandwidth controller
=============================================================
What is our primary requirement here? A weight-based proportional bandwidth
controller where we can use the resources optimally and any kind of
throttling kicks in only if there is contention for the disk?

Or do we want max bandwidth control, where a group is not allowed to use the
disk beyond its limit even if the disk is free?

Or do we need both? I would think that at some point we will need both, but
we can start with proportional bandwidth control first.
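
As a toy illustration of the difference (user-space sketch only; struct
group, prop_share, must_throttle and the numbers are all made up): a
proportional controller divides the disk by weight and throttles only under
contention, while a max bw controller caps the group even when the disk is
otherwise idle.

#include <stdbool.h>
#include <stdio.h>

struct group {
    const char *name;
    unsigned int weight;        /* proportional share */
    unsigned int max_mbps;      /* absolute cap in MB/s, 0 = unlimited */
    unsigned int current_mbps;  /* measured throughput */
};

/* Proportional: a group's share of disk bandwidth, reduced only under
 * contention. */
static unsigned int prop_share(const struct group *g, unsigned int total_weight,
                               unsigned int disk_mbps, bool contended)
{
    if (!contended)
        return disk_mbps;    /* disk is free: the group may use all of it */
    return disk_mbps * g->weight / total_weight;
}

/* Max bandwidth: throttle whenever the cap is exceeded, contended or not. */
static bool must_throttle(const struct group *g)
{
    return g->max_mbps && g->current_mbps > g->max_mbps;
}

int main(void)
{
    struct group g = { "test1", 200, 30, 45 };

    printf("proportional share under contention: %u MB/s\n",
           prop_share(&g, 500, 100, true));
    printf("max bw controller would throttle:    %s\n",
           must_throttle(&g) ? "yes" : "no");
    return 0;
}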

Fairness for higher level logical devices
=========================================
Do we want good fairness numbers for higher-level logical devices also, or is
it sufficient to provide fairness at the leaf nodes? Providing fairness at
the leaf nodes helps us use the resources optimally, and in the process we
get fairness at the higher level as well in many of the cases.

But do we want strict fairness numbers on higher-level logical devices
even if it means sub-optimal usage of the underlying physical devices?

I think that for proportional bandwidth control it should be ok to provide
fairness at the leaf nodes, but for max bandwidth control it might make more
sense to enforce the limit at the higher-level logical device. Consider a
case where a customer wants to limit a group to 30MB/s on a striped device:
with leaf-node control, if every leaf node allows 30MB/s, it can accumulate
to much more than the specified rate at the logical device (four stripes
could add up to 120MB/s).

Latency Control and strong isolation between groups
===================================================
Do we want better latencies and strong isolation between groups?

I think if the problem is solved at the IO scheduler level, we can achieve
better latency control and hence stronger isolation between groups.

Higher-level solutions will find it hard to provide the same kind of latency
control and isolation between groups as an IO-scheduler-based solution.

Fairness for buffered writes
============================
Doing IO control anywhere below the page cache has the disadvantage that the
page cache might not dispatch more writes from the higher-weight group, and
hence the higher-weight group might not see more IO done. Andrew says that
we don't have a solution to this problem in the kernel and he would like to
see it handled properly.

The only way to solve this seems to be to slow down the writers before they
write into the page cache. The IO throttling patch handled it by slowing
down a writer once it crossed the specified max rate. Other suggestions have
come in the form of a per-memory-cgroup dirty_ratio, or a separate cgroup
controller altogether where some kind of per-group write limit can be
specified.

So whether the solution is implemented at the IO scheduler layer or at the
device mapper layer, both will have to rely on another co-mounted controller
to handle buffered writes properly.
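
Roughly, "slowing down the writers before they write into the page cache"
means something like the toy sketch below (user-space illustration only;
wr_group, dirty_limit and throttle_buffered_write are made-up names, not an
existing or proposed kernel interface):

#include <stdio.h>
#include <unistd.h>

struct wr_group {
    const char *name;
    unsigned long dirty_pages;    /* pages this group has dirtied */
    unsigned long dirty_limit;    /* per-group budget */
};

/* Called before a task in @g is allowed to dirty @nr_pages more pages. */
static void throttle_buffered_write(struct wr_group *g, unsigned long nr_pages)
{
    while (g->dirty_pages + nr_pages > g->dirty_limit) {
        /* A real controller would wait for writeback to clean some of this
         * group's pages; here we just sleep and simulate 32 pages cleaned. */
        usleep(10 * 1000);
        g->dirty_pages = g->dirty_pages > 32 ? g->dirty_pages - 32 : 0;
    }
    g->dirty_pages += nr_pages;    /* the write may now dirty its pages */
}

int main(void)
{
    struct wr_group w = { "writer-group", 950, 1000 };

    for (int i = 0; i < 10; i++)
        throttle_buffered_write(&w, 16);    /* each "write" dirties 16 pages */

    printf("%s holds %lu dirty pages (limit %lu)\n",
           w.name, w.dirty_pages, w.dirty_limit);
    return 0;
}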

Fairness with-in group
======================
One of the issues with a higher-level controller is how to do fair
throttling so that fairness within the group is not impacted, especially
making sure that we don't break the notion of ioprio for the processes
within the group.

The io-throttling patch in particular was very bad in terms of prio within a
group: throttling treated everyone equally and the difference between
process priorities disappeared.

Reads Vs Writes
===============
A higher-level controller will most likely change the ratio in which reads
and writes are dispatched to the disk within a group. So far this has been
decided by the IO scheduler, but with higher-level groups doing the
throttling, and possibly buffering the bios and releasing them later, they
will have to come up with their own policy on the proportion in which reads
and writes are dispatched. With IO-scheduler-based control, all the queuing
takes place at the IO scheduler and it retains control of the ratio in which
reads and writes are dispatched.
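
In other words, a controller that buffers bios ends up carrying its own
read/write release policy, something like the toy sketch below (purely
illustrative; buffered_group, pick_batch and the 3:1 ratio are made up):

#include <stdio.h>

struct buffered_group {
    unsigned int queued_reads;
    unsigned int queued_writes;
    unsigned int read_ratio;     /* release this many reads ... */
    unsigned int write_ratio;    /* ... for every this many writes */
};

/* Decide how many queued reads and writes to release in one dispatch batch;
 * with IO-scheduler-based control this decision stays inside CFQ. */
static void pick_batch(const struct buffered_group *g, unsigned int *reads,
                       unsigned int *writes)
{
    *reads = g->queued_reads < g->read_ratio ? g->queued_reads : g->read_ratio;
    *writes = g->queued_writes < g->write_ratio ? g->queued_writes : g->write_ratio;
}

int main(void)
{
    struct buffered_group g = { 8, 100, 3, 1 };    /* favour reads 3:1 */
    unsigned int r, w;

    pick_batch(&g, &r, &w);
    printf("release %u reads and %u writes this round\n", r, w);
    return 0;
}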


Summary
=======

- An IO-scheduler-based IO controller can provide better latencies, stronger
isolation between groups and time-based fairness, and will not interfere
with IO scheduler policies like class, ioprio and the read vs write
balance.

But it cannot guarantee fairness at higher-level logical devices. Especially
in the case of max bw control, leaf-node control does not sound like the
most appropriate thing.

- IO throttling provides max bw control in terms of an absolute rate. It has
the advantage that it can provide control at higher-level logical devices
and can also control buffered writes without needing an additional
co-mounted controller.

But it does only max bw control and not proportional control, so one might
not be using the resources optimally. It loses the sense of task prio and
class within a group, as any task within the group can be throttled. And
because throttling does not kick in until you hit the max bw limit, it will
find it hard to provide the same latencies as IO-scheduler-based control.

- dm-ioband also has the advantage that it can provide fairness at
higher-level logical devices.

But fairness is provided only in terms of the size or number of IOs; there
is no time-based fairness. It is very throughput oriented and does not
throttle a high-speed group if another group is running a slow random
reader. This results in bad latencies for the random reader group and weaker
isolation between groups.

Also, it does not provide fairness if a group is not continuously
backlogged. So if one is running 1-2 dd/sequential readers in a group, one
does not get fairness until the workload is increased to the point where the
group becomes continuously backlogged. This also results in poor latencies
and limited fairness.

At this point it does not look like a single IO controller can satisfy all
the scenarios/requirements. This means a few things to me.

- Drop some of the requirements and go with one implementation which meets
the reduced set of requirements.

- Have more than one IO controller implementation in the kernel: one doing
lower-level control for better latencies, stronger isolation and optimal
resource usage, and another providing fairness at higher-level logical
devices and max bandwidth control.

And let the user decide which one to use based on his/her needs.

- Come up with a more intelligent way of doing IO control where a single
controller covers all the cases.

At this point, I am more inclined towards option 2 of having more than one
implementation in the kernel. :-) (Unless and until we can brainstorm and
come up with ideas to make option 3 happen.)

>
> It would be great if we discuss our plans on the mailing list, so we
> can get early feedback from everyone.

This is what comes to my mind so far. Please add to the list if I have missed
some points. Also correct me if I am wrong about the pros/cons of the
approaches.

Thoughts/ideas/opinions are welcome...

Thanks
Vivek
From: Mike Galbraith on
On Mon, 2009-09-28 at 19:51 +0200, Mike Galbraith wrote:

> I'll give your patch a spin as well.

I applied it to tip and fixed up the rejects. I haven't done a line-for-line
verification against the original patch yet (brave or..), so add a giant
economy-sized pinch of salt.

In the form it ended up in, it didn't help here. I tried twiddling
knobs, but it didn't help either. Reducing latency target from 300 to
30 did nada, but dropping to 3 did... I got to poke BRB.

Plugging Vivek's fairness tweakable on top, and enabling it, my timings
return to decent numbers, so that one liner absatively posilutely is
where my write vs read woes are coming from.

FWIW, below is patch wedged into tip v2.6.31-10215-ga3c9602

---
block/cfq-iosched.c | 281 ++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 227 insertions(+), 54 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -27,6 +27,12 @@ static const int cfq_slice_sync = HZ / 1
static int cfq_slice_async = HZ / 25;
static const int cfq_slice_async_rq = 2;
static int cfq_slice_idle = HZ / 125;
+static int cfq_target_latency = HZ * 3/10; /* 300 ms */
+static int cfq_hist_divisor = 4;
+/*
+ * Number of times that other workloads can be scheduled before async
+ */
+static const unsigned int cfq_async_penalty = 4;

/*
* offset from end of service tree
@@ -36,7 +42,7 @@ static int cfq_slice_idle = HZ / 125;
/*
* below this threshold, we consider thinktime immediate
*/
-#define CFQ_MIN_TT (2)
+#define CFQ_MIN_TT (1)

#define CFQ_SLICE_SCALE (5)
#define CFQ_HW_QUEUE_MIN (5)
@@ -67,8 +73,9 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
struct cfq_rb_root {
struct rb_root rb;
struct rb_node *left;
+ unsigned count;
};
-#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, }
+#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, 0, }

/*
* Per process-grouping structure
@@ -113,6 +120,21 @@ struct cfq_queue {
unsigned short ioprio_class, org_ioprio_class;

pid_t pid;
+
+ struct cfq_rb_root *service_tree;
+ struct cfq_io_context *cic;
+};
+
+enum wl_prio_t {
+ IDLE_WL = -1,
+ BE_WL = 0,
+ RT_WL = 1
+};
+
+enum wl_type_t {
+ ASYNC_WL = 0,
+ SYNC_NOIDLE_WL = 1,
+ SYNC_WL = 2
};

/*
@@ -124,7 +146,13 @@ struct cfq_data {
/*
* rr list of queues with requests and the count of them
*/
- struct cfq_rb_root service_tree;
+ struct cfq_rb_root service_trees[2][3];
+ struct cfq_rb_root service_tree_idle;
+
+ enum wl_prio_t serving_prio;
+ enum wl_type_t serving_type;
+ unsigned long workload_expires;
+ unsigned int async_starved;

/*
* Each priority tree is sorted by next_request position. These
@@ -134,9 +162,11 @@ struct cfq_data {
struct rb_root prio_trees[CFQ_PRIO_LISTS];

unsigned int busy_queues;
+ unsigned int busy_queues_avg[2];

int rq_in_driver[2];
int sync_flight;
+ int reads_delayed;

/*
* queue-depth detection
@@ -173,6 +203,9 @@ struct cfq_data {
unsigned int cfq_slice[2];
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;
+ unsigned int cfq_target_latency;
+ unsigned int cfq_hist_divisor;
+ unsigned int cfq_async_penalty;

struct list_head cic_list;

@@ -182,6 +215,11 @@ struct cfq_data {
struct cfq_queue oom_cfqq;
};

+static struct cfq_rb_root * service_tree_for(enum wl_prio_t prio, enum wl_type_t type,
+ struct cfq_data *cfqd) {
+ return prio == IDLE_WL ? &cfqd->service_tree_idle : &cfqd->service_trees[prio][type];
+}
+
enum cfqq_state_flags {
CFQ_CFQQ_FLAG_on_rr = 0, /* on round-robin busy list */
CFQ_CFQQ_FLAG_wait_request, /* waiting for a request */
@@ -226,6 +264,17 @@ CFQ_CFQQ_FNS(coop);
#define cfq_log(cfqd, fmt, args...) \
blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)

+#define CIC_SEEK_THR 1024
+#define CIC_SEEKY(cic) ((cic)->seek_mean > CIC_SEEK_THR)
+#define CFQQ_SEEKY(cfqq) (!cfqq->cic || CIC_SEEKY(cfqq->cic))
+
+static inline int cfq_busy_queues_wl(enum wl_prio_t wl, struct cfq_data *cfqd) {
+ return wl==IDLE_WL? cfqd->service_tree_idle.count :
+ cfqd->service_trees[wl][ASYNC_WL].count
+ + cfqd->service_trees[wl][SYNC_NOIDLE_WL].count
+ + cfqd->service_trees[wl][SYNC_WL].count;
+}
+
static void cfq_dispatch_insert(struct request_queue *, struct request *);
static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
struct io_context *, gfp_t);
@@ -247,6 +296,7 @@ static inline void cic_set_cfqq(struct c
struct cfq_queue *cfqq, int is_sync)
{
cic->cfqq[!!is_sync] = cfqq;
+ cfqq->cic = cic;
}

/*
@@ -301,10 +351,33 @@ cfq_prio_to_slice(struct cfq_data *cfqd,
return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
}

+static inline unsigned
+cfq_get_interested_queues(struct cfq_data *cfqd, bool rt) {
+ unsigned min_q, max_q;
+ unsigned mult = cfqd->cfq_hist_divisor - 1;
+ unsigned round = cfqd->cfq_hist_divisor / 2;
+ unsigned busy = cfq_busy_queues_wl(rt, cfqd);
+ min_q = min(cfqd->busy_queues_avg[rt], busy);
+ max_q = max(cfqd->busy_queues_avg[rt], busy);
+ cfqd->busy_queues_avg[rt] = (mult * max_q + min_q + round) /
+ cfqd->cfq_hist_divisor;
+ return cfqd->busy_queues_avg[rt];
+}
+
static inline void
cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
+ unsigned process_thr = cfqd->cfq_target_latency / cfqd->cfq_slice[1];
+ unsigned iq = cfq_get_interested_queues(cfqd, cfq_class_rt(cfqq));
+ unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
+
+ if (iq > process_thr) {
+ unsigned low_slice = 2 * slice * cfqd->cfq_slice_idle
+ / cfqd->cfq_slice[1];
+ slice = max(slice * process_thr / iq, min(slice, low_slice));
+ }
+
+ cfqq->slice_end = jiffies + slice;
cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
}

@@ -443,6 +516,7 @@ static void cfq_rb_erase(struct rb_node
if (root->left == n)
root->left = NULL;
rb_erase_init(n, &root->rb);
+ --root->count;
}

/*
@@ -483,46 +557,56 @@ static unsigned long cfq_slice_offset(st
}

/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
+ * The cfqd->service_trees holds all pending cfq_queue's that have
* requests waiting to be processed. It is sorted in the order that
* we will service the queues.
*/
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- int add_front)
+static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
struct rb_node **p, *parent;
struct cfq_queue *__cfqq;
unsigned long rb_key;
+ struct cfq_rb_root *service_tree;
int left;

if (cfq_class_idle(cfqq)) {
rb_key = CFQ_IDLE_DELAY;
- parent = rb_last(&cfqd->service_tree.rb);
+ service_tree = &cfqd->service_tree_idle;
+ parent = rb_last(&service_tree->rb);
if (parent && parent != &cfqq->rb_node) {
__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
rb_key += __cfqq->rb_key;
} else
rb_key += jiffies;
- } else if (!add_front) {
+ } else {
+ enum wl_prio_t prio = cfq_class_rt(cfqq) ? RT_WL : BE_WL;
+ enum wl_type_t type = cfq_cfqq_sync(cfqq) ? SYNC_WL : ASYNC_WL;
+
rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
rb_key += cfqq->slice_resid;
cfqq->slice_resid = 0;
- } else
- rb_key = 0;
+
+ if (type == SYNC_WL && (CFQQ_SEEKY(cfqq) || !cfq_cfqq_idle_window(cfqq)))
+ type = SYNC_NOIDLE_WL;
+
+ service_tree = service_tree_for(prio, type, cfqd);
+ }

if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
/*
* same position, nothing more to do
*/
- if (rb_key == cfqq->rb_key)
+ if (rb_key == cfqq->rb_key && cfqq->service_tree == service_tree)
return;

- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+ cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
+ cfqq->service_tree = NULL;
}

left = 1;
parent = NULL;
- p = &cfqd->service_tree.rb.rb_node;
+ cfqq->service_tree = service_tree;
+ p = &service_tree->rb.rb_node;
while (*p) {
struct rb_node **n;

@@ -554,11 +638,12 @@ static void cfq_service_tree_add(struct
}

if (left)
- cfqd->service_tree.left = &cfqq->rb_node;
+ service_tree->left = &cfqq->rb_node;

cfqq->rb_key = rb_key;
rb_link_node(&cfqq->rb_node, parent, p);
- rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
+ rb_insert_color(&cfqq->rb_node, &service_tree->rb);
+ service_tree->count++;
}

static struct cfq_queue *
@@ -631,7 +716,7 @@ static void cfq_resort_rr_list(struct cf
* Resorting requires the cfqq to be on the RR list already.
*/
if (cfq_cfqq_on_rr(cfqq)) {
- cfq_service_tree_add(cfqd, cfqq, 0);
+ cfq_service_tree_add(cfqd, cfqq);
cfq_prio_tree_add(cfqd, cfqq);
}
}
@@ -660,8 +745,10 @@ static void cfq_del_cfqq_rr(struct cfq_d
BUG_ON(!cfq_cfqq_on_rr(cfqq));
cfq_clear_cfqq_on_rr(cfqq);

- if (!RB_EMPTY_NODE(&cfqq->rb_node))
- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+ if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
+ cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
+ cfqq->service_tree = NULL;
+ }
if (cfqq->p_root) {
rb_erase(&cfqq->p_node, cfqq->p_root);
cfqq->p_root = NULL;
@@ -923,10 +1010,11 @@ static inline void cfq_slice_expired(str
*/
static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
{
- if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
- return NULL;
+ struct cfq_rb_root *service_tree = service_tree_for(cfqd->serving_prio, cfqd->serving_type, cfqd);

- return cfq_rb_first(&cfqd->service_tree);
+ if (RB_EMPTY_ROOT(&service_tree->rb))
+ return NULL;
+ return cfq_rb_first(service_tree);
}

/*
@@ -954,9 +1042,6 @@ static inline sector_t cfq_dist_from_las
return cfqd->last_position - blk_rq_pos(rq);
}

-#define CIC_SEEK_THR 8 * 1024
-#define CIC_SEEKY(cic) ((cic)->seek_mean > CIC_SEEK_THR)
-
static inline int cfq_rq_close(struct cfq_data *cfqd, struct request *rq)
{
struct cfq_io_context *cic = cfqd->active_cic;
@@ -1044,6 +1129,10 @@ static struct cfq_queue *cfq_close_coope
if (cfq_cfqq_coop(cfqq))
return NULL;

+ /* we don't want to mix processes with different characteristics */
+ if (cfqq->service_tree != cur_cfqq->service_tree)
+ return NULL;
+
if (!probe)
cfq_mark_cfqq_coop(cfqq);
return cfqq;
@@ -1087,14 +1176,15 @@ static void cfq_arm_slice_timer(struct c

cfq_mark_cfqq_wait_request(cfqq);

- /*
- * we don't want to idle for seeks, but we do want to allow
- * fair distribution of slice time for a process doing back-to-back
- * seeks. so allow a little bit of time for him to submit a new rq
- */
- sl = cfqd->cfq_slice_idle;
- if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
+ sl = min_t(unsigned, cfqd->cfq_slice_idle, cfqq->slice_end - jiffies);
+
+ /* very small idle if we are serving noidle trees, and there are more trees */
+ if (cfqd->serving_type == SYNC_NOIDLE_WL &&
+ service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count > 0) {
+ if (blk_queue_nonrot(cfqd->queue))
+ return;
sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
+ }

mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
@@ -1110,6 +1200,11 @@ static void cfq_dispatch_insert(struct r

cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");

+ if (!time_before(jiffies, rq->start_time + cfqd->cfq_target_latency / 2) && rq_data_dir(rq)==READ) {
+ cfqd->reads_delayed = max_t(int, cfqd->reads_delayed,
+ (jiffies - rq->start_time) / (cfqd->cfq_target_latency / 2));
+ }
+
cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
cfq_remove_request(rq);
cfqq->dispatched++;
@@ -1156,6 +1251,16 @@ cfq_prio_to_maxrq(struct cfq_data *cfqd,
return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
}

+enum wl_type_t cfq_choose_sync_async(struct cfq_data *cfqd, enum wl_prio_t prio) {
+ struct cfq_queue *id, *ni;
+ ni = cfq_rb_first(service_tree_for(prio, SYNC_NOIDLE_WL, cfqd));
+ id = cfq_rb_first(service_tree_for(prio, SYNC_WL, cfqd));
+ if (id && ni && id->rb_key < ni->rb_key)
+ return SYNC_WL;
+ if (!ni) return SYNC_WL;
+ return SYNC_NOIDLE_WL;
+}
+
/*
* Select a queue for service. If we have a current active queue,
* check whether to continue servicing it, or retrieve and set a new one.
@@ -1196,15 +1301,68 @@ static struct cfq_queue *cfq_select_queu
* flight or is idling for a new request, allow either of these
* conditions to happen (or time out) before selecting a new queue.
*/
- if (timer_pending(&cfqd->idle_slice_timer) ||
+ if (timer_pending(&cfqd->idle_slice_timer) ||
(cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
cfqq = NULL;
goto keep_queue;
}
-
expire:
cfq_slice_expired(cfqd, 0);
new_queue:
+ if (!new_cfqq) {
+ enum wl_prio_t previous_prio = cfqd->serving_prio;
+
+ if (cfq_busy_queues_wl(RT_WL, cfqd))
+ cfqd->serving_prio = RT_WL;
+ else if (cfq_busy_queues_wl(BE_WL, cfqd))
+ cfqd->serving_prio = BE_WL;
+ else {
+ cfqd->serving_prio = IDLE_WL;
+ cfqd->workload_expires = jiffies + 1;
+ cfqd->reads_delayed = 0;
+ }
+
+ if (cfqd->serving_prio != IDLE_WL) {
+ int counts[]={
+ service_tree_for(cfqd->serving_prio, ASYNC_WL, cfqd)->count,
+ service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count,
+ service_tree_for(cfqd->serving_prio, SYNC_WL, cfqd)->count
+ };
+ int nonzero_counts= !!counts[0] + !!counts[1] + !!counts[2];
+
+ if (previous_prio != cfqd->serving_prio || (nonzero_counts == 1)) {
+ cfqd->serving_type = counts[1] ? SYNC_NOIDLE_WL : counts[2] ? SYNC_WL : ASYNC_WL;
+ cfqd->async_starved = 0;
+ cfqd->reads_delayed = 0;
+ } else {
+ if (!counts[cfqd->serving_type] || time_after(jiffies, cfqd->workload_expires)) {
+ if (cfqd->serving_type != ASYNC_WL && counts[ASYNC_WL] &&
+ cfqd->async_starved++ > cfqd->cfq_async_penalty * (1 + cfqd->reads_delayed))
+ cfqd->serving_type = ASYNC_WL;
+ else
+ cfqd->serving_type = cfq_choose_sync_async(cfqd, cfqd->serving_prio);
+ } else
+ goto same_wl;
+ }
+
+ {
+ unsigned slice = cfqd->cfq_target_latency;
+ slice = slice * counts[cfqd->serving_type] /
+ max_t(unsigned, cfqd->busy_queues_avg[cfqd->serving_prio],
+ counts[SYNC_WL] + counts[SYNC_NOIDLE_WL] + counts[ASYNC_WL]);
+
+ if (cfqd->serving_type == ASYNC_WL)
+ slice = max(1U, (slice / (1 + cfqd->reads_delayed))
+ * cfqd->cfq_slice[0] / cfqd->cfq_slice[1]);
+ else
+ slice = max(slice, 2U * max(1U, cfqd->cfq_slice_idle));
+
+ cfqd->workload_expires = jiffies + slice;
+ cfqd->async_starved *= (cfqd->serving_type != ASYNC_WL);
+ }
+ }
+ }
+ same_wl:
cfqq = cfq_set_active_queue(cfqd, new_cfqq);
keep_queue:
return cfqq;
@@ -1231,8 +1389,13 @@ static int cfq_forced_dispatch(struct cf
{
struct cfq_queue *cfqq;
int dispatched = 0;
+ int i,j;
+ for (i = 0; i < 2; ++i)
+ for (j = 0; j < 3; ++j)
+ while ((cfqq = cfq_rb_first(&cfqd->service_trees[i][j])) != NULL)
+ dispatched += __cfq_forced_dispatch_cfqq(cfqq);

- while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+ while ((cfqq = cfq_rb_first(&cfqd->service_tree_idle)) != NULL)
dispatched += __cfq_forced_dispatch_cfqq(cfqq);

cfq_slice_expired(cfqd, 0);
@@ -1300,6 +1463,12 @@ static int cfq_dispatch_requests(struct
return 0;

/*
+ * Drain async requests before we start sync IO
+ */
+ if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+ return 0;
+
+ /*
* If this is an async queue and we have sync IO in flight, let it wait
*/
if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
@@ -1993,18 +2162,8 @@ cfq_should_preempt(struct cfq_data *cfqd
if (cfq_class_idle(cfqq))
return 1;

- /*
- * if the new request is sync, but the currently running queue is
- * not, let the sync request have priority.
- */
- if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
- return 1;
-
- /*
- * So both queues are sync. Let the new request get disk time if
- * it's a metadata request and the current queue is doing regular IO.
- */
- if (rq_is_meta(rq) && !cfqq->meta_pending)
+ if (cfqd->serving_type == SYNC_NOIDLE_WL
+ && new_cfqq->service_tree == cfqq->service_tree)
return 1;

/*
@@ -2035,13 +2194,9 @@ static void cfq_preempt_queue(struct cfq
cfq_log_cfqq(cfqd, cfqq, "preempt");
cfq_slice_expired(cfqd, 1);

- /*
- * Put the new queue at the front of the of the current list,
- * so we know that it will be selected next.
- */
BUG_ON(!cfq_cfqq_on_rr(cfqq));

- cfq_service_tree_add(cfqd, cfqq, 1);
+ cfq_service_tree_add(cfqd, cfqq);

cfqq->slice_end = 0;
cfq_mark_cfqq_slice_new(cfqq);
@@ -2438,13 +2593,16 @@ static void cfq_exit_queue(struct elevat
static void *cfq_init_queue(struct request_queue *q)
{
struct cfq_data *cfqd;
- int i;
+ int i,j;

cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
if (!cfqd)
return NULL;

- cfqd->service_tree = CFQ_RB_ROOT;
+ for (i = 0; i < 2; ++i)
+ for (j = 0; j < 3; ++j)
+ cfqd->service_trees[i][j] = CFQ_RB_ROOT;
+ cfqd->service_tree_idle = CFQ_RB_ROOT;

/*
* Not strictly needed (since RB_ROOT just clears the node and we
@@ -2481,6 +2639,9 @@ static void *cfq_init_queue(struct reque
cfqd->cfq_slice[1] = cfq_slice_sync;
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
+ cfqd->cfq_target_latency = cfq_target_latency;
+ cfqd->cfq_hist_divisor = cfq_hist_divisor;
+ cfqd->cfq_async_penalty = cfq_async_penalty;
cfqd->hw_tag = 1;

return cfqd;
@@ -2517,6 +2678,7 @@ fail:
/*
* sysfs parts below -->
*/
+
static ssize_t
cfq_var_show(unsigned int var, char *page)
{
@@ -2550,6 +2712,9 @@ SHOW_FUNCTION(cfq_slice_idle_show, cfqd-
SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
+SHOW_FUNCTION(cfq_target_latency_show, cfqd->cfq_target_latency, 1);
+SHOW_FUNCTION(cfq_hist_divisor_show, cfqd->cfq_hist_divisor, 0);
+SHOW_FUNCTION(cfq_async_penalty_show, cfqd->cfq_async_penalty, 0);
#undef SHOW_FUNCTION

#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
@@ -2581,6 +2746,11 @@ STORE_FUNCTION(cfq_slice_sync_store, &cf
STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
UINT_MAX, 0);
+
+STORE_FUNCTION(cfq_target_latency_store, &cfqd->cfq_target_latency, 1, 1000, 1);
+STORE_FUNCTION(cfq_hist_divisor_store, &cfqd->cfq_hist_divisor, 1, 100, 0);
+STORE_FUNCTION(cfq_async_penalty_store, &cfqd->cfq_async_penalty, 1, UINT_MAX, 0);
+
#undef STORE_FUNCTION

#define CFQ_ATTR(name) \
@@ -2596,6 +2766,9 @@ static struct elv_fs_entry cfq_attrs[] =
CFQ_ATTR(slice_async),
CFQ_ATTR(slice_async_rq),
CFQ_ATTR(slice_idle),
+ CFQ_ATTR(target_latency),
+ CFQ_ATTR(hist_divisor),
+ CFQ_ATTR(async_penalty),
__ATTR_NULL
};



From: Corrado Zoccolo on
Hi Vivek,
On Mon, Sep 28, 2009 at 7:14 PM, Vivek Goyal <vgoyal(a)redhat.com> wrote:
> On Mon, Sep 28, 2009 at 05:35:02PM +0200, Corrado Zoccolo wrote:
>> On Mon, Sep 28, 2009 at 4:56 PM, Vivek Goyal <vgoyal(a)redhat.com> wrote:
>> > On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote:
>> >> Hi Vivek,
>> >> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal(a)redhat.com> wrote:
>> >> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
>> >> >> Vivek Goyal wrote:
>> >> >> > Notes:
>> >> >> > - With vanilla CFQ, random writers can overwhelm a random reader.
>> >> >> >   Bring down its throughput and bump up latencies significantly.
>> >> >>
>> >> >>
>> >> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
>> >> >> too.
>> >> >>
>> >> >> I'm basing this assumption on the observations I made on both OpenSuse
>> >> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
>> >> >> titled: "Poor desktop responsiveness with background I/O-operations" of
>> >> >> 2009-09-20.
>> >> >> (Message ID: 4AB59CBB.8090907(a)datenparkplatz.de)
>> >> >>
>> >> >>
>> >> >> Thus, I'm posting this to show that your work is greatly appreciated,
>> >> >> given the rather disappointig status quo of Linux's fairness when it
>> >> >> comes to disk IO time.
>> >> >>
>> >> >> I hope that your efforts lead to a change in performance of current
>> >> >> userland applications, the sooner, the better.
>> >> >>
>> >> > [Please don't remove people from original CC list. I am putting them back.]
>> >> >
>> >> > Hi Ulrich,
>> >> >
>> >> > I quicky went through that mail thread and I tried following on my
>> >> > desktop.
>> >> >
>> >> > ##########################################
>> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> >> > sleep 5
>> >> > time firefox
>> >> > # close firefox once gui pops up.
>> >> > ##########################################
>> >> >
>> >> > It was taking close to 1 minute 30 seconds to launch firefox and dd got
>> >> > following.
>> >> >
>> >> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
>> >> >
>> >> > (Results do vary across runs, especially if system is booted fresh. Don't
>> >> >  know why...).
>> >> >
>> >> >
>> >> > Then I tried putting both the applications in separate groups and assign
>> >> > them weights 200 each.
>> >> >
>> >> > ##########################################
>> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> >> > echo $! > /cgroup/io/test1/tasks
>> >> > sleep 5
>> >> > echo $$ > /cgroup/io/test2/tasks
>> >> > time firefox
>> >> > # close firefox once gui pops up.
>> >> > ##########################################
>> >> >
>> >> > Now I firefox pops up in 27 seconds. So it cut down the time by 2/3.
>> >> >
>> >> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
>> >> >
>> >> > Notice that throughput of dd also improved.
>> >> >
>> >> > I ran the block trace and noticed in many a cases firefox threads
>> >> > immediately preempted the "dd". Probably because it was a file system
>> >> > request. So in this case latency will arise from seek time.
>> >> >
>> >> > In some other cases, threads had to wait for up to 100ms because dd was
>> >> > not preempted. In this case latency will arise both from waiting on queue
>> >> > as well as seek time.
>> >>
>> >> I think cfq should already be doing something similar, i.e. giving
>> >> 100ms slices to firefox, that alternate with dd, unless:
>> >> * firefox is too seeky (in this case, the idle window will be too small)
>> >> * firefox has too much think time.
>> >>
>> >
>> Hi Vivek,
>> > Hi Corrado,
>> >
>> > "firefox" is the shell script to setup the environment and launch the
>> > broser. It seems to be a group of threads. Some of them run in parallel
>> > and some of these seems to be running one after the other (once previous
>> > process or threads finished).
>>
>> Ok.
>>
>> >
>> >> To rule out the first case, what happens if you run the test with your
>> >> "fairness for seeky processes" patch?
>> >
>> > I applied that patch and it helps a lot.
>> >
>> > http://lwn.net/Articles/341032/
>> >
>> > With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds.
>>
>> Great.
>> Can you try the attached patch (on top of 2.6.31)?
>> It implements the alternative approach we discussed privately in july,
>> and it addresses the possible latency increase that could happen with
>> your patch.
>>
>> To summarize for everyone, we separate sync sequential queues, sync
>> seeky queues and async queues in three separate RR strucutres, and
>> alternate servicing requests between them.
>>
>> When servicing seeky queues (the ones that are usually penalized by
>> cfq, for which no fairness is usually provided), we do not idle
>> between them, but we do idle for the last queue (the idle can be
>> exited when any seeky queue has requests). This allows us to allocate
>> disk time globally for all seeky processes, and to reduce seeky
>> processes latencies.
>>
>
> Ok, I seem to be doing same thing at group level (In group scheduling
> patches). I do not idle on individual sync seeky queues but if this is
> last queue in the group, then I do idle to make sure group does not loose
> its fair share and exit from idle the moment there is any busy queue in
> the group.
>
> So you seem to be grouping all the sync seeky queues system wide in a
> single group. So all the sync seeky queues collectively get 100ms in a
> single round of dispatch?

A round of dispatch (defined by tunable target_latency, default 300ms)
is subdivided between the three groups, proportionally to how many
queues are waiting in each, so if we have 1 sequential and 2 seeky
(and 0 async), we get 100ms for seq and 200ms for seeky.
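
In pseudo-C, the split is roughly the following (simplified sketch; the
workload names match the patch, but workload_slice is illustrative, and the
real computation also factors in busy_queues_avg and scales the async slice
down):

#include <stdio.h>

enum { ASYNC_WL, SYNC_NOIDLE_WL, SYNC_WL, WL_COUNT };

static unsigned int workload_slice(unsigned int target_latency,
                                   const unsigned int counts[WL_COUNT],
                                   int serving_type)
{
    unsigned int total = counts[ASYNC_WL] + counts[SYNC_NOIDLE_WL]
                         + counts[SYNC_WL];

    if (!total)
        return 0;
    /* Each workload type gets a share of the round proportional to the
     * number of queues waiting in its service tree. */
    return target_latency * counts[serving_type] / total;
}

int main(void)
{
    /* 1 sequential (sync), 2 seeky (sync-noidle), 0 async, 300ms round. */
    unsigned int counts[WL_COUNT] = { 0, 2, 1 };

    printf("seq slice:   %u ms\n", workload_slice(300, counts, SYNC_WL));
    printf("seeky slice: %u ms\n", workload_slice(300, counts, SYNC_NOIDLE_WL));
    return 0;
}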

> I am wondering what happens if there are lot
> of such sync seeky queues this 100ms time slice is consumed before all the
> sync seeky queues got a chance to dispatch. Does that mean that some of
> the queues can completely skip the one dispatch round?
It can happen: if each seek costs 10ms and you have more than 30 seeky
processes, then they are guaranteed not to all get to issue in the same
round (30 x 10ms already uses up the 300ms target latency).
When this happens, the ones that did not issue in this round will be the
first ones to be served in the next round.

Thanks,
Corrado

>
> Thanks
> Vivek
>
>> I tested with 'konsole -e exit', while doing a sequential write with
>> dd, and the start up time reduced from 37s to 7s, on an old laptop
>> disk.
>>
>> Thanks,
>> Corrado
>>
>> >
>> >> To rule out the first case, what happens if you run the test with your
>> >> "fairness for seeky processes" patch?
>> >
>> > I applied that patch and it helps a lot.
>> >
>> > http://lwn.net/Articles/341032/
>> >
>> > With above patchset applied, and fairness=1, firefox pops up in 27-28
>> > seconds.
>> >
>> > So it looks like if we don't disable idle window for seeky processes on
>> > hardware supporting command queuing, it helps in this particular case.
>> >
>> > Thanks
>> > Vivek
>> >
>
>
>



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:czoccolo(a)gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda
From: Corrado Zoccolo on
Hi Mike,
On Mon, Sep 28, 2009 at 8:53 PM, Mike Galbraith <efault(a)gmx.de> wrote:
> On Mon, 2009-09-28 at 14:18 -0400, Vivek Goyal wrote:
>> On Mon, Sep 28, 2009 at 07:51:14PM +0200, Mike Galbraith wrote:
>
>> I guess changing class to IDLE should have helped a bit as now this is
>> equivalent to setting the quantum to 1 and after dispatching one request
>> to disk, CFQ will always expire the writer once. So it might happen that
>> by the the reader preempted writer, we have less number of requests in
>> disk and lesser latency for this reader.
>
> I expected SCHED_IDLE to be better than setting quantum to 1, because
> max is quantum*4 if you aren't IDLE.  But that's not what happened.  I
> just retested with all knobs set back to stock, fairness off, and
> quantum set to 1 with everything running nice 0.  2.8 seconds avg :-/

Idle doesn't work very well for async writes, since the writer process just
sends its writes to the page cache. The real writeback then happens in the
context of a kernel thread, which runs with the best-effort scheduling
class.

>
>> > I saw
>> > the reference to Vivek's patch, and gave it a shot.  Makes a large
>> > difference.
>> >                                                            Avg
>> > perf stat     12.82     7.19     8.49     5.76     9.32    8.7     anticipatory
>> >               16.24   175.82   154.38   228.97   147.16  144.5     noop
>> >               43.23    57.39    96.13   148.25   180.09  105.0     deadline
>> >                9.15    14.51     9.39    15.06     9.90   11.6     cfq fairness=0 dd=nice 0
>> >               12.22     9.85    12.55     9.88    15.06   11.9     cfq fairness=0 dd=nice 19
>> >                9.77    13.19    11.78    17.40     9.51   11.9     cfq fairness=0 dd=SCHED_IDLE
>> >                4.59     2.74     4.70     3.45     4.69    4.0     cfq fairness=1 dd=nice 0
>> >                3.79     4.66     2.66     5.15     3.03    3.8     cfq fairness=1 dd=nice 19
>> >                2.79     4.73     2.79     4.02     2.50    3.3     cfq fairness=1 dd=SCHED_IDLE
>> >
>>
>> Hmm.., looks like average latency went down only in  case of fairness=1
>> and not in case of fairness=0. (Looking at previous mail, average vanilla
>> cfq latencies were around 12 seconds).
>
> Yup.
>
>> Are you running all this in root group or have you put writers and readers
>> into separate cgroups?
>
> No cgroups here.
>
>> If everything is running in root group, then I am curious why latency went
>> down in case of fairness=1. The only thing fairness=1 parameter does is
>> that it lets complete all the requests from previous queue before start
>> dispatching from next queue. On top of this is valid only if no preemption
>> took place. In your test case, konsole should preempt the writer so
>> practically fairness=1 might not make much difference.
>
> fairness=1 very definitely makes a very large difference.  All of those
> cfq numbers were logged in back to back runs.
>
>> In fact now Jens has committed a patch which achieves the similar effect as
>> fairness=1 for async queues.
>
> Yeah, I was there yesterday.  I speculated that that would hurt my
> reader, but rearranging things didn't help one bit.  Playing with merge,
> I managed to give dd ~7% more throughput, and injured poor reader even
> more.  (problem analysis via hammer/axe not always most effective;)
>
>> commit 5ad531db6e0f3c3c985666e83d3c1c4d53acccf9
>> Author: Jens Axboe <jens.axboe(a)oracle.com>
>> Date:   Fri Jul 3 12:57:48 2009 +0200
>>
>>     cfq-iosched: drain device queue before switching to a sync queue
>>
>>     To lessen the impact of async IO on sync IO, let the device drain of
>>     any async IO in progress when switching to a sync cfqq that has idling
>>     enabled.
>>
>>
>> If everything is in separate cgroups, then we should have seen latency
>> improvements in case of fairness=0 case also. I am little perplexed here..
>>
>> Thanks
>> Vivek
>
>

Thanks,
Corrado

