From: Jens Axboe on
On Fri, Oct 02 2009, Ingo Molnar wrote:
>
> * Jens Axboe <jens.axboe(a)oracle.com> wrote:
>
> > It's not hard to make the latency good, the hard bit is making sure we
> > also perform well for all other scenarios.
>
> Looking at the numbers from Mike:
>
> | dd competing against perf stat -- konsole -e exec timings, 5 back to
> | back runs
> |                         run1    run2    run3    run4    run5     Avg
> | before                  9.15   14.51    9.39   15.06    9.90    11.6
> | after [+patch]          1.76    1.54    1.93    1.88    1.56     1.7
>
> _PLEASE_ make read latencies this good - the numbers are _vastly_
> better. We'll worry about the 'other' things _after_ we've reached good
> latencies.
>
> I thought this principle was a well established basic rule of Linux IO
> scheduling. Why do we have to have a 'latency vs. bandwidth' discussion
> again and again? I thought latency won hands down.

It's really not that simple. If we go and do the easy latency bits, then
throughput drops 30% or more. You can't say it's a black and white latency
vs throughput issue, that's just not how the real world works. The
server folks would be most unpleased.

--
Jens Axboe

From: Mike Galbraith on
On Fri, 2009-10-02 at 11:24 +0200, Ingo Molnar wrote:
> * Jens Axboe <jens.axboe(a)oracle.com> wrote:
>
> > It's not hard to make the latency good, the hard bit is making sure we
> > also perform well for all other scenarios.
>
> Looking at the numbers from Mike:
>
> | dd competing against perf stat -- konsole -e exec timings, 5 back to
> | back runs
> |                         run1    run2    run3    run4    run5     Avg
> | before                  9.15   14.51    9.39   15.06    9.90    11.6
> | after [+patch]          1.76    1.54    1.93    1.88    1.56     1.7
>
> _PLEASE_ make read latencies this good - the numbers are _vastly_
> better. We'll worry about the 'other' things _after_ we've reached good
> latencies.
>
> I thought this principle was a well established basic rule of Linux IO
> scheduling. Why do we have to have a 'latency vs. bandwidth' discussion
> again and again? I thought latency won hands down.

Just a note: In the testing I've done so far, we're better off today
than ever, and I can't recall beating on root ever being anything less
than agony for interactivity. IO seekers look a lot like CPU sleepers
to me. Looks like both can be as annoying as hell ;-)

-Mike

From: Jens Axboe on
On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 10:04 +0200, Jens Axboe wrote:
> > On Fri, Oct 02 2009, Mike Galbraith wrote:
>
> > > If we're in the idle window and doing the async drain thing, we're at
> > > the spot where Vivek's patch helps a ton. Seemed like a great time to
> > > limit the size of any io that may land in front of my sync reader to
> > > plain "you are not alone" quantity.
> >
> > You can't be in the idle window and doing async drain at the same time,
> > the idle window doesn't start until the sync queue has completed a
> > request. Hence my above rant on device interference.
>
> I'll take your word for it.
>
>	/*
>	 * Drain async requests before we start sync IO
>	 */
>	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
>
> Looked about the same to me as..
>
> enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>
> ..where Vivek prevented turning 1 into 0, so I stamped it ;-)

cfq_cfqq_idle_window(cfqq) just tells you whether this queue may enter
idling, not that it is currently idling. The actual idling happens from
cfq_completed_request(), here:

	else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
		 sync && !rq_noidle(rq))
		cfq_arm_slice_timer(cfqd);

and after that the queue will be marked as waiting, so
cfq_cfqq_wait_request(cfqq) is a better indication of whether we are
currently waiting for a request (idling) or not.
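
As a rough sketch only (untested, against the same CFQ code as above), the
check from that debugging patch would then look at the wait flag instead:

	/*
	 * Only drain async requests while the sync queue is actually
	 * idling, i.e. the idle timer has been armed and we are waiting
	 * for its next request.
	 */
	if (cfq_cfqq_wait_request(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])

with whatever follows the condition left as in the original experiment.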

> > > Dunno, I was just tossing rocks and sticks at it.
> > >
> > > I don't really understand the reasoning behind overloading: I can see
> > > that it allows cutting thicker slabs for the disk, but with the streaming
> > > writer vs reader case, it seems only the writers can do that. The reader
> > > is unlikely to be alone, isn't it? Seems to me that either dd, a flusher
> > > thread or kjournald is going to be there with it, which gives dd a huge
> > > advantage.. it has two proxies to help it squabble over disk, konsole
> > > has none.
> >
> > That is true, async queues have a huge advantage over sync ones. But
> > sync vs async is only part of it; any combination of queued sync, queued
> > sync random, etc. has different ramifications on the behaviour of the
> > individual queue.
> >
> > It's not hard to make the latency good, the hard bit is making sure we
> > also perform well for all other scenarios.
>
> Yeah, that's why I'm trying to be careful about what I say, I know full
> well this ain't easy to get right. I'm not even thinking of submitting
> anything, it's just diagnostic testing.

It's much appreciated btw. If we can make this better without killing
throughput, then I'm surely interested in picking up your interesting
bits and getting them massaged into something we can include. So don't
be discouraged, I'm just being realistic :-)


--
Jens Axboe

From: Corrado Zoccolo on
Hi Jens,
On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <jens.axboe(a)oracle.com> wrote:
> On Fri, Oct 02 2009, Ingo Molnar wrote:
>>
>> * Jens Axboe <jens.axboe(a)oracle.com> wrote:
>>
>
> It's really not that simple. If we go and do the easy latency bits, then
> throughput drops 30% or more. You can't say it's a black and white latency
> vs throughput issue, that's just not how the real world works. The
> server folks would be most unpleased.
Could we be more selective when the latency optimization is introduced?

The code that is currently touched by Vivek's patch is:

	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
	    (cfqd->hw_tag && CIC_SEEKY(cic)))
		enable_idle = 0;

basically, when fairness=1, it becomes just:

	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle)
		enable_idle = 0;

Note that, even if we enable idling here, the cfq_arm_slice_timer will use
a different idle window for seeky (2ms) than for normal I/O.

I think that the 2ms idle window is good for a single rotational SATA disk scenario,
even if it supports NCQ. Realistic access times for those disks are still around 8ms
(though proportional to seek length), and waiting 2ms to see if we get a nearby
request may pay off, not only in latency and fairness, but also in throughput.
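
For reference, the window selection in cfq_arm_slice_timer currently looks
roughly like this (quoted from memory, so treat it as a sketch):

	/* idle for a shorter time if the queue is seeky */
	sl = cfqd->cfq_slice_idle;
	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));

	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);

so a seeky queue never waits for more than CFQ_MIN_TT (2ms) even when
idling is enabled for it.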

What we don't want to do is to enable idling for NCQ enabled SSDs
(and this is already taken care in cfq_arm_slice_timer) or for hardware RAIDs.
If we agree that hardware RAIDs should be marked as non-rotational, then that
code could become:

	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
	    (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic)))
		enable_idle = 0;
	else if (sample_valid(cic->ttime_samples)) {
		unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle;
		if (cic->ttime_mean > idle_time)
			enable_idle = 0;
		else
			enable_idle = 1;
	}

Thanks,
Corrado

>
> --
> Jens Axboe
>

--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:czoccolo(a)gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

From: Jens Axboe on
On Fri, Oct 02 2009, Corrado Zoccolo wrote:
> Hi Jens,
> On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <jens.axboe(a)oracle.com> wrote:
> > On Fri, Oct 02 2009, Ingo Molnar wrote:
> >>
> >> * Jens Axboe <jens.axboe(a)oracle.com> wrote:
> >>
> >
> > It's really not that simple. If we go and do the easy latency bits, then
> > throughput drops 30% or more. You can't say it's a black and white latency
> > vs throughput issue, that's just not how the real world works. The
> > server folks would be most unpleased.
> Could we be more selective when the latency optimization is introduced?
>
> The code that is currently touched by Vivek's patch is:
>	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>	    (cfqd->hw_tag && CIC_SEEKY(cic)))
>		enable_idle = 0;
> basically, when fairness=1, it becomes just:
>	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle)
>		enable_idle = 0;
>
> Note that, even if we enable idling here, the cfq_arm_slice_timer will use
> a different idle window for seeky (2ms) than for normal I/O.
>
> I think that the 2ms idle window is good for a single rotational SATA
> disk scenario, even if it supports NCQ. Realistic access times for
> those disks are still around 8ms (though proportional to seek
> length), and waiting 2ms to see if we get a nearby request may pay
> off, not only in latency and fairness, but also in throughput.

I agree, that change looks good.

> What we don't want to do is to enable idling for NCQ enabled SSDs
> (and this is already taken care in cfq_arm_slice_timer) or for hardware RAIDs.

Right, it was part of the bigger SSD optimization stuff I did a few
revisions back.

> If we agree that hardware RAIDs should be marked as non-rotational, then that
> code could become:
>
>	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>	    (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic)))
>		enable_idle = 0;
>	else if (sample_valid(cic->ttime_samples)) {
>		unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle;
>		if (cic->ttime_mean > idle_time)
>			enable_idle = 0;
>		else
>			enable_idle = 1;
>	}

Yes, agree on that too. We probably should make a different flag for
hardware raids, telling the io scheduler that this device is really
composed of several others. If it's composed only of SSDs (or has a
frontend similar to that), then non-rotational applies.
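
Purely as an illustration of the kind of flag I mean (hypothetical, no
such flag exists today, and the value is made up):

	/* hypothetical: queue sits in front of several backend devices */
	#define QUEUE_FLAG_COMPOSITE	16
	#define blk_queue_composite(q)	\
		test_bit(QUEUE_FLAG_COMPOSITE, &(q)->queue_flags)

CFQ could then check blk_queue_composite() next to blk_queue_nonrot()
when deciding whether idling makes sense for a queue.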

But yes, we should pass that information down.

--
Jens Axboe
