From: Tejun Heo on
On 07/28/2010 03:42 PM, Tejun Heo wrote:
> Hello, Thomas.
>
> With Jeff's acks added, patches to make libata use irq-expect are
committed. Please pull from the following branch to receive patches[1]
> to improve lost/spurious irq handling.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git lost-spurious-irq

I'll ask Stephen to pull the above branch into linux-next until it
shows up via tip so that we don't lose any more linux-next testing
time.

Thank you.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Thomas Gleixner on
On Wed, 28 Jul 2010, Tejun Heo wrote:

> On 07/28/2010 03:42 PM, Tejun Heo wrote:
> > Hello, Thomas.
> >
> > With Jeff's acks added, patches to make libata use irq-expect are
> > committed. Please pull from the following branch to receive patches[1]
> > to improve lost/spurious irq handling.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git lost-spurious-irq
>
> I'll ask Stephen to pull the above branch into linux-next until it
> shows up via tip so that we don't lose any more linux-next testing
> time.

I'm working on it already and I'm currently twisting my brain around
the problems these patches will impose on the RT stuff, where we cannot
access timer_list timers from atomic regions :(

Thanks,

tglx
From: Tejun Heo on
Hello,

On 07/29/2010 10:44 AM, Thomas Gleixner wrote:
>> I'll ask Stephen to pull the above branch into linux-next until it
>> shows up via tip so that we don't lose any more linux-next testing
>> time.
>
> I'm working on it already and I'm currently twisting my brain around
> the problems these patches will impose on the RT stuff, where we cannot
> access timer_list timers from atomic regions :(

Oh, I see. A dull last-ditch solution would be to just disable it on
CONFIG_PREEMPT_RT, but yeah, that would be dull. :-(

Thanks.

--
tejun
From: Thomas Gleixner on
Tejun,

On Fri, 30 Jul 2010, Tejun Heo wrote:

> Hello,
>
> On 07/29/2010 10:44 AM, Thomas Gleixner wrote:
> >> I'll ask Stephen to pull the above branch into linux-next until it
> >> shows up via tip so that we don't lose any more linux-next testing
> >> time.
> >
> > I'm working on it already and I'm currently twisting my brain around
> > the problems these patches will impose on the RT stuff, where we cannot
> > access timer_list timers from atomic regions :(
>
> Oh, I see. A dull last-ditch solution would be to just disable it on
> CONFIG_PREEMPT_RT, but yeah, that would be dull. :-(

Yup, but I have some other issues with this series as well. That thing
is massively overengineered. You mangle all the various functions into
irq_poll which makes it really hard to understand what the code does
under which circumstances.

I understand most of the problems you want to solve, but I don't like
the outcome too much.

Let's look at the various parts:

1) irq polling

- Get rid of the for_each_irq() loop in the poll timer.

You can solve this by adding the interrupt to a linked list and
let the poll timer run through the list. Also, why is the simple
counter-based decision not good enough? Why do we need an extra
time period for this?

- Reenable the interrupt from time to time to check whether the irq
storm subsided.

Generally a good idea, but we really do not want to wait for
another 10k unhandled interrupts flooding the machine. Ideally we
should read out the irq pending register from the irq chip to see
whether the interrupt is still asserted.
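
As a rough, self-contained sketch of the scheme above (one global
poll timer walking a list of misbehaving lines, a simple unhandled
counter, and an occasional re-enable probe) — all names, thresholds
and intervals here are invented for illustration, not actual kernel
code:

```c
#include <assert.h>
#include <stddef.h>

#define STORM_THRESHOLD 1000   /* unhandled irqs before we mask the line */
#define PROBE_INTERVAL  10     /* poll ticks between re-enable probes */

struct polled_irq {
    int irq;
    int unhandled;            /* consecutive unhandled deliveries */
    int disabled;             /* line currently masked, serviced by polling */
    int ticks_disabled;       /* poll ticks since the line was masked */
    struct polled_irq *next;  /* single global list, walked by the timer */
};

static struct polled_irq *poll_list;

/* Called when an interrupt fires but no handler claims it. */
void note_unhandled(struct polled_irq *d)
{
    if (d->disabled)
        return;
    if (++d->unhandled >= STORM_THRESHOLD) {
        d->disabled = 1;          /* mask the line, fall back to polling */
        d->ticks_disabled = 0;
        d->next = poll_list;      /* a real version would avoid re-adding
                                   * an entry already on the list */
        poll_list = d;
    }
}

/*
 * One global timer callback: walk the list and occasionally unmask a
 * line to see whether the storm has subsided.  Returns the number of
 * lines probed (re-enabled) on this tick.
 */
int poll_timer_tick(void)
{
    int probed = 0;
    for (struct polled_irq *d = poll_list; d; d = d->next) {
        if (!d->disabled)
            continue;
        if (++d->ticks_disabled >= PROBE_INTERVAL) {
            d->disabled = 0;      /* re-enable and watch what happens */
            d->unhandled = 0;
            probed++;
        }
    }
    return probed;
}
```

The point of the probe is that a re-enabled line that storms again
simply crosses the threshold once more and goes back to polling.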

2) irq watch

I have to admit, that I have no clue what this thing exactly
does. After staring into the code for quite a while I came to the
conclusion that it's a dynamic polling machinery, which adjusts
its polling frequency depending on the number of good and bad
samples. The heuristics for this are completely non-obvious.

Aside from that, the watch mechanism adds unnecessary code to the hard
interrupt context. There is no reason why you need to update that
watch magic in the hard interrupt context. Analysing the stats can
be done in the watch timer as well. All it takes is in
handle_irq_event()

case IRQ_HANDLED:
action->handled++;

and in the watch timer function

if (ret == IRQ_HANDLED)
action->polled++;

irq_watch should use a separate global timer and a linked list. The
timer is started when an interrupt is added to the watch mechanism
and stopped when the list becomes empty.

That's a clear separation of watch and poll and simplifies the
whole business a lot.
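
For illustration, the split proposed here — hard interrupt context
does nothing but bump a counter, and all analysis happens in the
watch timer — could be modeled like this (names and the one-third
threshold are made up, not the actual code):

```c
#include <assert.h>

struct watched_action {
    unsigned int handled;   /* completions seen via real interrupts */
    unsigned int polled;    /* completions discovered only by polling */
    int polling_enabled;
};

/* Hard interrupt context: nothing but a counter increment. */
void on_hard_irq_handled(struct watched_action *a)
{
    a->handled++;
}

/* Watch timer: the poll found pending work, so an irq was lost. */
void on_watch_poll_handled(struct watched_action *a)
{
    a->polled++;
}

/*
 * Watch timer analysis: once enough samples are in, enable full
 * polling when too large a share of completions came from the poller.
 */
void watch_timer_analyze(struct watched_action *a)
{
    unsigned int total = a->handled + a->polled;

    if (total < 16)                    /* not enough samples yet */
        return;
    if (a->polled * 3 > total)         /* more than 1/3 were bad: give up */
        a->polling_enabled = 1;
    a->handled = a->polled = 0;        /* start the next sampling window */
}
```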

3) irq expect

So you basically poll the interrupt until either the interrupt
happened or the device driver timeout occurred.

That's really weird. What does it solve? Not running into the
device driver timeout routine if the hardware interrupt has been
lost?

That's overkill, isn't it? When the hardware loses an interrupt
occasionally then why isn't the device driver timeout routine
sufficient? If it does not check whether the device has an
interrupt pending despite the fact that it has not been delivered,
then this code needs to be fixed and not worked around with weird
polling in the generic irq code.

Btw, this polling is pretty bad in terms of power savings.

- The timers are not aligned, so if there are several expects armed
you get unbatched wakeups.

- If the poll interval is smaller than the average hardware
interrupt delivery time, you add more wakeups than necessary.

- If the hardware interrupt has happened, you leave the poll timer
pending, which gives us an extra wakeup for each interrupt in the
worst case.

I assume your concern is to detect and deal with flaky interrupts
while avoiding the permanent poll when the device is not used, right?

So that can be done simpler as well. One timer and a linked list
again.

expect_irq() adds the irq to the linked list and arms the timer, if
it is not armed already. unexpect_irq() removes the irq from the
linked list and deletes the timer when the list becomes empty.

That can reuse the list_head in irq_desc which you need for irq
poll and the timer interval should be fixed without dynamic
adjustment.
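
A minimal model of this expect_irq()/unexpect_irq() shape — one
global timer armed when the first entry goes on the list and deleted
when the last one leaves.  The flag `timer_armed` merely stands in
for mod_timer()/del_timer(); nothing here is the real API:

```c
#include <assert.h>
#include <stddef.h>

struct expect_desc {
    int irq;
    struct expect_desc *next, *prev;   /* reusable list head in the desc */
};

static struct expect_desc *expect_list;
static int timer_armed;                /* models the single global timer */

void expect_irq(struct expect_desc *d)
{
    d->prev = NULL;
    d->next = expect_list;
    if (expect_list)
        expect_list->prev = d;
    expect_list = d;
    if (!timer_armed)
        timer_armed = 1;               /* arm only for the first entry */
}

void unexpect_irq(struct expect_desc *d)
{
    if (d->prev)
        d->prev->next = d->next;
    else
        expect_list = d->next;
    if (d->next)
        d->next->prev = d->prev;
    if (!expect_list)
        timer_armed = 0;               /* delete timer when list is empty */
}
```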

Keep it simple!

Thanks

tglx
From: Tejun Heo on
Hello, Thomas.

On 08/02/2010 04:07 PM, Thomas Gleixner wrote:
> Yup, but I have some other issues with this series as well. That thing
> is massively overengineered. You mangle all the various functions into
> irq_poll which makes it really hard to understand what the code does
> under which circumstances.

Oh, I'll explain below why I multiplexed timers this way.

> I understand most of the problems you want to solve, but I don't like
> the outcome too much.
>
> Let's look at the various parts:

Heh, fair enough. Let's talk about it.

> 1) irq polling

Do you mean spurious irq polling here?

> - Get rid of the for_each_irq() loop in the poll timer.
>
> You can solve this by adding the interrupt to a linked list and
> let the poll timer run through the list.

I suppose you're talking about using a per-desc timer. I first thought
about using a single common timer too, but decided against it because
timer events can be merged by the timer code (depending on the interval
and slack) and I wanted to use different polling frequencies for
different cases. We can of course implement logic to put polled irqs
on a list in chronological order, but that's basically redoing the work
the timer code already does.

> Also, why is the simple counter-based decision not good enough?
> Why do we need an extra time period for this?

Because in many cases IRQ storms are transient and spurious IRQ
polling worsens performance a lot, it's worthwhile to be able to
recover from such conditions, so the extra time period is there to
trigger reenabling of the offending IRQ to see whether the storm is
over now. Please note that many IRQ storms of this type are
extremely obscure (for example, happening during resume from STR once
in a blue moon due to bad interaction between the legacy ATA state
machine in the controller and certain behavior of the drive) and some
are very difficult to avoid from the driver in any reasonable way.

> - Reenable the interrupt from time to time to check whether the irq
> storm subsided.

Or maybe you were talking about something else?

> Generally a good idea, but we really do not want to wait for
> another 10k unhandled interrupts flooding the machine. Ideally we
> should read out the irq pending register from the irq chip to see
> whether the interrupt is still asserted.

In simulated tests with hosed IRQ routing, the behavior was already
quite acceptable, but yeah, if peeking at the interrupt state can be
done easily, that would definitely be an improvement.

> 2) irq watch
>
> I have to admit, that I have no clue what this thing exactly
> does. After staring into the code for quite a while I came to the
> conclusion that it's a dynamic polling machinery, which adjusts
> its polling frequency depending on the number of good and bad
> samples. The heuristics for this are completely non-obvious.

Eh, I thought I did a pretty good job of documenting each mechanism.
Obviously, not nearly enough. :-)

It basically allows drivers to tell the IRQ code that IRQs are likely
to happen. In response, the IRQ code polls the desc for a while and
sees whether the poll detects IRQ handling w/o actual IRQ deliveries.
If that seems to happen more than usual, the IRQ code can determine
that IRQs are not being delivered and enable full-blown IRQ polling.

The WARY state is an in-between state between good and bad. Dropping
this will simplify the logic a bit. It's basically there to avoid false
positives, as it would suck to enable polling for a working IRQ line.

Anyway, the basic heuristic is that it watches a certain number of IRQ
events and counts how many are good (via IRQ) and bad (via poll). If
bad goes over a certain threshold, polling is enabled.
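
A toy version of that heuristic, with the WARY state stepped through
before condemning a line (the states, window size and threshold are
illustrative only, not the patchset's actual values):

```c
#include <assert.h>

enum watch_state { WATCH_GOOD, WATCH_WARY, WATCH_BAD };

struct irq_watch {
    enum watch_state state;
    unsigned int good, bad;     /* events seen via irq vs via poll */
};

#define WATCH_WINDOW    16
#define WATCH_BAD_MAX    4      /* more than this per window is "bad" */

/* Feed one event; via_poll=1 means the poller, not the irq, caught it. */
void watch_event(struct irq_watch *w, int via_poll)
{
    if (via_poll)
        w->bad++;
    else
        w->good++;

    if (w->good + w->bad < WATCH_WINDOW)
        return;

    if (w->bad > WATCH_BAD_MAX) {
        /* bad window: demote one step, GOOD -> WARY -> BAD */
        w->state = (w->state == WATCH_GOOD) ? WATCH_WARY : WATCH_BAD;
    } else if (w->state == WATCH_WARY) {
        /* clean window while wary: forgive */
        w->state = WATCH_GOOD;
    }
    w->good = w->bad = 0;       /* start the next sampling window */
}
```

One noisy window only makes the line wary; it takes a second bad
window in a row to actually enable polling.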

> Aside from that, the watch mechanism adds unnecessary code to the hard
> interrupt context. There is no reason why you need to update that
> watch magic in the hard interrupt context. Analysing the stats can
> be done in the watch timer as well. All it takes is in
> handle_irq_event()
>
> case IRQ_HANDLED:
> action->handled++;
>
> and in the watch timer function
>
> if (ret == IRQ_HANDLED)
> action->polled++;

The whole thing is executed iff unlikely(desc->status &
IRQ_CHECK_WATCHES), which the processor will predict correctly
most of the time. Do you think it would make any noticeable
difference?

> irq_watch should use a separate global timer and a linked list. The
> timer is started when an interrupt is added to the watch mechanism
> and stopped when the list becomes empty.
>
> That's a clear separation of watch and poll and simplifies the
> whole business a lot.

The reason poll_irq() looks complicated is that the different types
of timed polling on the same desc share a single timer. The reason
it's shared this way, instead of across different descs of the same
type, is to avoid locking overhead in maintaining the timer and the
linked lists of its targets. By sharing the timer within the same
desc, everything can be protected by desc->lock, but if we use timers
across different descs, we need another layer of locking.

> 3) irq expect
>
> So you basically poll the interrupt until either the interrupt
> happened or the device driver timeout occurred.
>
> That's really weird. What does it solve? Not running into the
> device driver timeout routine if the hardware interrupt has been
> lost?
>
> That's overkill, isn't it? When the hardware loses an interrupt
> occasionally then why isn't the device driver timeout routine
> sufficient? If it does not check whether the device has an
> interrupt pending despite the fact that it has not been delivered,
> then this code needs to be fixed and not worked around with weird
> polling in the generic irq code.

Delays caused by a lost interrupt and by other failures can differ a
lot in their magnitude and, for example, libata does check for IRQ
delivery failure in its timeout handler, but at that point it doesn't
really matter why the timeout occurred for recovery of that specific
command. Too much time has passed anyway.

We definitely can implement such quick polling timeouts, which check
for IRQ misdeliveries, in drivers so that they can detect such events,
and maybe add an interface in the IRQ polling code which the driver
can call to enable polling on the IRQ. But then again, the pattern of
such failures and the handling of them would be very similar across
different drivers, and we already have most of the polling machinery
implemented in the IRQ code, so I think it makes sense to have a
generic implementation which
drivers can use. Moreover, given the difference in test/review
coverage and general complexity of doing it right, I think it is far
better to do it in the core code and make it easy to use for drivers.

> Btw, this polling is pretty bad in terms of power savings.
>
> - The timers are not aligned, so if there are several expects armed
> you get unbatched wakeups.

In usual operating conditions, the timer interval will quickly become
3 seconds w/ 1 sec slack. I doubt it can cause much of a problem,
right?

> - If the poll interval is smaller than the average hardware
> interrupt delivery time, you add more wakeups than necessary.

The same thing as above. If IRQ delivery is working, it will be
polling every 3 seconds w/ 1 sec slack. It wouldn't really matter.

> - If the hardware interrupt has happened, you leave the poll timer
> pending, which gives us an extra wakeup for each interrupt in the
> worst case.

That's to avoid the overhead of constantly manipulating the timer for
high-frequency commands. The expect/unexpect fast paths don't do
much: no locking, no timer manipulation. No matter how busy the IRQ
is, the timer will only be activated and rearmed occasionally.
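
A minimal model of that fast path — expect/unexpect only flip a flag,
the already-pending timer is left alone, and expiry decides whether to
rearm.  tick() stands in for the timer callback; all names are
illustrative, not the patchset's:

```c
#include <assert.h>

struct lazy_expect {
    int expecting;      /* driver currently waits for an irq */
    int timer_pending;
    int timer_arms;     /* how often we actually touched the timer */
};

void lazy_expect_irq(struct lazy_expect *e)
{
    e->expecting = 1;                  /* fast path: no timer work ... */
    if (!e->timer_pending) {           /* ... unless no timer is pending */
        e->timer_pending = 1;
        e->timer_arms++;
    }
}

void lazy_unexpect_irq(struct lazy_expect *e)
{
    e->expecting = 0;                  /* leave the timer pending */
}

/* Timer expiry: rearm only if someone is still waiting. */
void lazy_timer_tick(struct lazy_expect *e)
{
    e->timer_pending = 0;
    if (e->expecting) {
        e->timer_pending = 1;
        e->timer_arms++;
    }
}
```

However busy the device gets, the timer is touched at most once per
poll interval rather than once per command.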

> I assume your concern is to detect and deal with flaky interrupts
> while avoiding the permanent poll when the device is not used, right?
>
> So that can be done simpler as well. One timer and a linked list
> again.
>
> expect_irq() adds the irq to the linked list and arms the timer, if
> it is not armed already. unexpect_irq() removes the irq from the
> linked list and deletes the timer when the list becomes empty.
>
> That can reuse the list_head in irq_desc which you need for irq
> poll and the timer interval should be fixed without dynamic
> adjustment.

The main problem here is that expect/unexpect_irq() need to be low
overhead so that drivers can easily call them w/o worrying too much
about the performance impact, and I would definitely want to avoid
locking in the fast paths.

> Keep it simple!

To me, the more interesting question is where to put the complexity.
In many cases, the most we can do is push complexity around, and if
we can make things easier for drivers by making the core IRQ code
somewhat more complex, I'll definitely go for that every time.

That's also the main theme of this IRQ spurious/missing patchset. The
implementation might be a bit complex, but using it from the driver
side is almost a no-brainer. The drivers just need to give hints to
the IRQ polling code and the IRQ polling code will take care of
everything else, and the low overhead of irq_expect/unexpect() is
important in that aspect because drivers will only use it without
worrying iff its overhead is sufficiently low.

That said, I definitely think poll_irq() can be better factored. Its
current form is mainly due to its evolution from the spurious poller
into the current multiplexed form. Factoring out the different parts
and possibly killing the watch handling should simplify it quite a bit.

Thank you.

--
tejun