From: Vivek Goyal on
On Fri, Sep 25, 2009 at 06:07:24PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> Vivek Goyal <vgoyal(a)redhat.com> wrote:
> > Higher level solutions are not keeping track of time slices. Time slices will
> > be allocated by CFQ which does not have any idea about grouping. Higher
> > level controller just keeps track of size of IO done at group level and
> > then run either a leaky bucket or token bucket algorithm.
> >
> > IO throttling is a max BW controller, so it will not even care about what is
> > happening in other group. It will just be concerned with rate of IO in one
> > particular group and if we exceed specified limit, throttle it. So until and
> > unless sequential reader group hits it max bw limit, it will keep sending
> > reads down to CFQ, and CFQ will happily assign 100ms slices to readers.
> >
> > dm-ioband will not try to choke the high throughput sequential reader group
> > for the slow random reader group because that would just kill the throughput
> > of rotational media. Every sequential reader will run for few ms and then
> > be throttled and this goes on. Disk will soon be seek bound.
>
> Because dm-ioband provides fairness in terms of how many IO requests
> are issued or how many bytes are transferred, this behaviour is to be
> expected. Do you think fairness in terms of IO requests and size is
> not fair?
>

Hi Ryo,

Fairness in terms of size of IO or number of requests is probably not the
best thing to do on rotational media, where seek latencies are significant.

It should probably work just fine on media with very low seek latencies,
like SSDs.

So on rotational media, either you will not provide fairness to random
readers because they are too slow, or you will choke the sequential readers
in the other group and bring down overall disk throughput.

If you decide not to choke/throttle the sequential reader group for the sake
of the random reader in the other group, then you will not have good control
over random reader latencies, because the IO scheduler now sees the IO from
both the sequential readers and the random reader, and the sequential readers
have not been throttled. So the dispatch pattern/time slices will again look
like:

SR1 SR2 SR3 SR4 SR5 RR.....

instead of

SR1 RR SR2 RR SR3 RR SR4 RR ....

SR --> sequential reader, RR --> random reader
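
To put rough numbers on it: with 100ms slices and five sequential readers
ahead of it, the random reader can wait on the order of 5 * 100ms = 500ms
between its requests in the first pattern, on top of the seek time, whereas
in the second pattern it waits at most one slice before being served.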

> > > > Buffering at higher layer can delay read requests for more than slice idle
> > > > period of CFQ (default 8 ms). That means, it is possible that we are waiting
> > > > for a request from the queue but it is buffered at higher layer and then idle
> > > > timer will fire. It means that queue will losse its share at the same time
> > > > overall throughput will be impacted as we lost those 8 ms.
> > >
> > > That sounds like a bug.
> > >
> >
> > Actually this probably is a limitation of higher level controller. It most
> > likely is sitting so high in IO stack that it has no idea what underlying
> > IO scheduler is and what are IO scheduler's policies. So it can't keep up
> > with IO scheduler's policies. Secondly, it might be a low weight group and
> > tokens might not be available fast enough to release the request.
> >
> > > > Read Vs Write
> > > > -------------
> > > > Writes can overwhelm readers hence second level controller FIFO release
> > > > will run into issue here. If there is a single queue maintained then reads
> > > > will suffer large latencies. If there separate queues for reads and writes
> > > > then it will be hard to decide in what ratio to dispatch reads and writes as
> > > > it is IO scheduler's decision to decide when and how much read/write to
> > > > dispatch. This is another place where higher level controller will not be in
> > > > sync with lower level io scheduler and can change the effective policies of
> > > > underlying io scheduler.
> > >
> > > The IO schedulers already take care of read-vs-write and already take
> > > care of preventing large writes-starve-reads latencies (or at least,
> > > they're supposed to).
> >
> > True. Actually this is a limitation of the higher level controller. A higher
> > level controller will most likely implement some kind of queuing/buffering
> > mechanism where it will buffer requests when it decides to throttle the
> > group. Now once a fair number of read and write requests are buffered, and
> > the controller is ready to dispatch some requests from the group, which
> > requests/bios should it dispatch? Reads first, writes first, or reads and
> > writes in a certain ratio?
>
> The write-starve-reads on dm-ioband, that you pointed out before, was
> not caused by FIFO release, it was caused by IO flow control in
> dm-ioband. When I turned off the flow control, then the read
> throughput was quite improved.

What was flow control doing?

>
> Now I'm considering separating dm-ioband's internal queue into sync
> and async and giving a certain priority of dispatch to async IOs.

Even if you maintain separate queues for sync and async, in what ratio will
you dispatch reads and writes to the underlying layer once fresh tokens become
available to the group and you decide to unthrottle it?

Whatever policy you adopt for read and write dispatch, it might not match
the policy of the underlying IO scheduler, because every IO scheduler seems to
have its own way of deciding how reads and writes should be dispatched.

Now somebody might start complaining that their job inside the group is not
getting the same reader/writer ratio it was getting outside the group.
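
For example, deadline expresses this policy through its writes_starved tunable
(by default it lets roughly two batches of reads go out before it forces writes
through), while CFQ does it with per-queue time slices and idling on sync
queues. A fixed read/write dispatch ratio sitting above the scheduler can't
match both at the same time.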

Thanks
Vivek
From: Rik van Riel on
Ryo Tsuruta wrote:

> Because dm-ioband provides fairness in terms of how many IO requests
> are issued or how many bytes are transferred, this behaviour is to be
> expected. Do you think fairness in terms of IO requests and size is
> not fair?

When there are two workloads competing for the same
resources, I would expect each of the workloads to
run at about 50% of the speed at which it would run
on an uncontended system.

Having one of the workloads run at 95% of the
uncontended speed and the other workload at 5%
is "not fair" (to put it diplomatically).

--
All rights reversed.
From: Vivek Goyal on
On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> Vivek Goyal wrote:
> > Notes:
> > - With vanilla CFQ, random writers can overwhelm a random reader.
> > Bring down its throughput and bump up latencies significantly.
>
>
> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> too.
>
> I'm basing this assumption on the observations I made on both OpenSuse
> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> titled: "Poor desktop responsiveness with background I/O-operations" of
> 2009-09-20.
> (Message ID: 4AB59CBB.8090907(a)datenparkplatz.de)
>
>
> Thus, I'm posting this to show that your work is greatly appreciated,
> given the rather disappointing status quo of Linux's fairness when it
> comes to disk IO time.
>
> I hope that your efforts lead to a change in performance of current
> userland applications, the sooner, the better.
>
[Please don't remove people from the original CC list. I am putting them back.]

Hi Ulrich,

I quickly went through that mail thread and tried the following on my
desktop.

##########################################
dd if=/home/vgoyal/4G-file of=/dev/null &
sleep 5
time firefox
# close firefox once gui pops up.
##########################################

It was taking close to 1 minute 30 seconds to launch firefox, and dd reported
the following:

4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s

(Results do vary across runs, especially if the system is freshly booted.
Don't know why...)


Then I tried putting the two applications in separate groups and assigning
each a weight of 200.
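
(For reference, the groups were created along these lines. The controller
mount point and the name of the weight file depend on which io controller
patches are in use, so treat this as a sketch rather than an exact recipe.)

# mount the io controller (assuming it registers as the "io" cgroup subsystem)
mount -t cgroup -o io none /cgroup/io
mkdir /cgroup/io/test1 /cgroup/io/test2

# give both groups the same weight (the file name "io.weight" is an assumption)
echo 200 > /cgroup/io/test1/io.weight
echo 200 > /cgroup/io/test2/io.weight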

##########################################
dd if=/home/vgoyal/4G-file of=/dev/null &
echo $! > /cgroup/io/test1/tasks
sleep 5
echo $$ > /cgroup/io/test2/tasks
time firefox
# close firefox once gui pops up.
##########################################

Now firefox pops up in 27 seconds, so grouping cut the launch time by about 2/3.

4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s

Notice that the throughput of dd also improved.

I ran the block trace and noticed that in many cases the firefox threads
immediately preempted the "dd", probably because they were file system
requests. So in those cases the latency comes from seek time alone.

In some other cases, the threads had to wait for up to 100ms because dd was
not preempted. In those cases the latency comes both from waiting in the queue
and from seek time.

With the cgroup setup, we run a 100ms slice for the group in which firefox
is being launched and then give a 100ms uninterrupted time slice to dd. That
should cut down on the number of seeks, which is probably why we see this
improvement.
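As a rough illustration: at ~50 MB/s, an uninterrupted 100ms slice lets dd
stream about 5 MB contiguously before the head is moved away, instead of
being seeked away from after every few requests.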

So grouping can help in such cases. Maybe you can move your X session into
one group and launch the big IO in another group. Most likely you will get a
better desktop experience without compromising the dd throughput.

Thanks
Vivek
From: Mike Galbraith on
On Fri, 2009-09-25 at 16:26 -0400, Vivek Goyal wrote:
> On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> > Vivek Goyal wrote:
> > > Notes:
> > > - With vanilla CFQ, random writers can overwhelm a random reader.
> > > Bring down its throughput and bump up latencies significantly.
> >
> >
> > IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> > too.
> >
> > I'm basing this assumption on the observations I made on both OpenSuse
> > 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> > titled: "Poor desktop responsiveness with background I/O-operations" of
> > 2009-09-20.
> > (Message ID: 4AB59CBB.8090907(a)datenparkplatz.de)
> >
> >
> > Thus, I'm posting this to show that your work is greatly appreciated,
> > given the rather disappointig status quo of Linux's fairness when it
> > comes to disk IO time.
> >
> > I hope that your efforts lead to a change in performance of current
> > userland applications, the sooner, the better.
> >
> [Please don't remove people from original CC list. I am putting them back.]
>
> Hi Ulrich,
>
> I quicky went through that mail thread and I tried following on my
> desktop.
>
> ##########################################
> dd if=/home/vgoyal/4G-file of=/dev/null &
> sleep 5
> time firefox
> # close firefox once gui pops up.
> ##########################################
>
> It was taking close to 1 minute 30 seconds to launch firefox and dd got
> following.
>
> 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
>
> (Results do vary across runs, especially if system is booted fresh. Don't
> know why...).
>
>
> Then I tried putting both the applications in separate groups and assign
> them weights 200 each.
>
> ##########################################
> dd if=/home/vgoyal/4G-file of=/dev/null &
> echo $! > /cgroup/io/test1/tasks
> sleep 5
> echo $$ > /cgroup/io/test2/tasks
> time firefox
> # close firefox once gui pops up.
> ##########################################
>
> Now I firefox pops up in 27 seconds. So it cut down the time by 2/3.
>
> 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
>
> Notice that throughput of dd also improved.
>
> I ran the block trace and noticed in many a cases firefox threads
> immediately preempted the "dd". Probably because it was a file system
> request. So in this case latency will arise from seek time.
>
> In some other cases, threads had to wait for up to 100ms because dd was
> not preempted. In this case latency will arise both from waiting on queue
> as well as seek time.

Hm, with tip, I see ~10ms max wakeup latency running the scriptlet below.

> With cgroup thing, We will run 100ms slice for the group in which firefox
> is being launched and then give 100ms uninterrupted time slice to dd. So
> it should cut down on number of seeks happening and that's why we probably
> see this improvement.

I'm not testing with group IO/CPU, but my numbers kinda agree that it's
seek latency that's THE killer. What the numbers below, compiled from the
cheezy script at the end, _seem_ to be telling me is that the default
setting of the CFQ quantum is allowing too many write requests through,
inflicting too much read latency... for the disk where my binaries live.
The longer the seeky burst, the more it hurts both reader and writer, so
cutting down the number of requests that can be queued helps the reader
(which I think can't queue anywhere near as much per unit time as the
writer can) finish and get out of the writer's way sooner.

'nuff possibly useless words, onward to possibly useless numbers :)

dd pre    == number dd emits upon receiving USR1, before execing perf.
perf stat == time to load/execute 'perf stat -- konsole -e exit'.
dd post   == same dd number, taken after perf finishes.

quantum = 1 Avg
dd pre 58.4 52.5 56.1 61.6 52.3 56.1 MB/s
perf stat 2.87 0.91 1.64 1.41 0.90 1.5 Sec
dd post 56.6 61.0 66.3 64.7 60.9 61.9

quantum = 2
dd pre 59.7 62.4 58.9 65.3 60.3 61.3
perf stat 5.81 6.09 6.24 10.13 6.21 6.8
dd post 64.0 62.6 64.2 60.4 61.1 62.4

quantum = 3
dd pre 65.5 57.7 54.5 51.1 56.3 57.0
perf stat 14.01 13.71 8.35 5.35 8.57 9.9
dd post 59.2 49.1 58.8 62.3 62.1 58.3

quantum = 4
dd pre 57.2 52.1 56.8 55.2 61.6 56.5
perf stat 11.98 1.61 9.63 16.21 11.13 10.1
dd post 57.2 52.6 62.2 49.3 50.2 54.3

Nothing pinned btw, 4 cores available, but only 1 drive.

#!/bin/sh

DISK=sdb
QUANTUM=/sys/block/$DISK/queue/iosched/quantum
END=$(cat $QUANTUM)

for q in `seq 1 $END`; do
    echo $q > $QUANTUM
    LOGFILE=quantum_log_$q
    rm -f $LOGFILE
    for i in `seq 1 5`; do
        # drop dentry and inode caches
        echo 2 > /proc/sys/vm/drop_caches
        # background sequential writer
        sh -c "dd if=/dev/zero of=./deleteme.dd 2>&1|tee -a $LOGFILE" &
        sleep 30
        sh -c "echo quantum $(cat $QUANTUM) loop $i" 2>&1|tee -a $LOGFILE
        perf stat -- killall -q get_stuf_into_ram >/dev/null 2>&1
        sleep 1
        # first dd rate report (the "dd pre" number)
        killall -q -USR1 dd &
        sleep 1
        # timed binary load
        sh -c "perf stat -- konsole -e exit" 2>&1|tee -a $LOGFILE
        sleep 1
        # second dd rate report (the "dd post" number)
        killall -q -USR1 dd &
        sleep 5
        killall -qw dd
        rm -f ./deleteme.dd
        sync
        sh -c "echo" 2>&1|tee -a $LOGFILE
    done;
done;


From: Mike Galbraith on
My dd vs load non-cached binary woes seem to be coming from backmerge.
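(The hunk below is the hash lookup / backmerge path in elv_merge(), block/elevator.c.)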

#if 0 /*MIKEDIDIT sand in gearbox?*/
	/*
	 * See if our hash lookup can find a potential backmerge.
	 */
	__rq = elv_rqhash_find(q, bio->bi_sector);
	if (__rq && elv_rq_merge_ok(__rq, bio)) {
		*req = __rq;
		return ELEVATOR_BACK_MERGE;
	}
#endif

- = stock (nomerges = 0)
+ = /sys/block/sdb/queue/nomerges = 1
x = backmerge disabled (hunk above)

quantum = 1 Avg
dd pre 58.4 52.5 56.1 61.6 52.3 56.1- MB/s virgin/foo
59.6 54.4 53.0 56.1 58.6 56.3+ 1.003
53.8 56.6 54.7 50.7 59.3 55.0x .980
perf stat 2.87 0.91 1.64 1.41 0.90 1.5- Sec
2.61 1.14 1.45 1.43 1.47 1.6+ 1.066
1.07 1.19 1.20 1.24 1.37 1.2x .800
dd post 56.6 61.0 66.3 64.7 60.9 61.9-
54.0 59.3 61.1 58.3 58.9 58.3+ .941
54.3 60.2 59.6 60.6 60.3 59.0x .953

quantum = 2
dd pre 59.7 62.4 58.9 65.3 60.3 61.3-
49.4 51.9 58.7 49.3 52.4 52.3+ .853
58.3 52.8 53.1 50.4 59.9 54.9x .895
perf stat 5.81 6.09 6.24 10.13 6.21 6.8-
2.48 2.10 3.23 2.29 2.31 2.4+ .352
2.09 2.73 1.72 1.96 1.83 2.0x .294
dd post 64.0 62.6 64.2 60.4 61.1 62.4-
52.9 56.2 49.6 51.3 51.2 52.2+ .836
54.7 60.9 56.0 54.0 55.4 56.2x .900

quantum = 3
dd pre 65.5 57.7 54.5 51.1 56.3 57.0-
58.1 53.9 52.2 58.2 51.8 54.8+ .961
60.5 56.5 56.7 55.3 54.6 56.7x .994
perf stat 14.01 13.71 8.35 5.35 8.57 9.9-
1.84 2.30 2.14 2.10 2.45 2.1+ .212
2.12 1.63 2.54 2.23 2.29 2.1x .212
dd post 59.2 49.1 58.8 62.3 62.1 58.3-
59.8 53.2 55.2 50.9 53.7 54.5+ .934
56.1 61.9 51.9 54.3 53.1 55.4x .950

quantum = 4
dd pre 57.2 52.1 56.8 55.2 61.6 56.5-
48.7 55.4 51.3 49.7 54.5 51.9+ .918
55.8 54.5 50.3 56.4 49.3 53.2x .941
perf stat 11.98 1.61 9.63 16.21 11.13 10.1-
2.29 1.94 2.68 2.46 2.45 2.3+ .227
3.01 1.84 2.11 2.27 2.30 2.3x .227
dd post 57.2 52.6 62.2 49.3 50.2 54.3-
50.1 54.5 58.4 54.1 49.0 53.2+ .979
52.9 53.2 50.6 53.2 50.5 52.0x .957

