From: Tejun Heo on
On 08/02/2010 05:28 PM, Tejun Heo wrote:
> The reason poll_irq() looks complicated is that the different poll
> types on the same desc share a single timer. The reason it's shared
> this way, instead of across different descs of the same type, is to
> avoid the locking overhead of maintaining the timer and the linked
> lists of its targets. By sharing the timer within the desc,
> everything can be protected by desc->lock; if we shared timers
> across different descs, we would need another layer of locking.

Ooh, another reason is timer locality. If timers are shared per desc,
they have a much higher chance of being on the same processor. Global
timers would be pretty bad in that respect.

--
tejun
From: Thomas Gleixner on
Tejun,

On Mon, 2 Aug 2010, Tejun Heo wrote:

> Hello, Thomas.
>
> On 08/02/2010 04:07 PM, Thomas Gleixner wrote:
> > 1) irq polling
>
> Do you mean spurious irq polling here?

Yep

> > - Get rid of the for_each_irq() loop in the poll timer.
> >
> > You can solve this by adding the interrupt to a linked list and
> > letting the poll timer run through the list.
>
> I suppose you're talking about using a per-desc timer. I first
> thought about using a single common timer too, but decided against it
> because timer events can be merged by the timer code (depending on
> the interval and slack) and I wanted to use different polling
> frequencies for different cases. We could of course implement logic
> to keep polled irqs on a list in chronological order, but that's
> basically redoing work the timer code already does.
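>
> Roughly, arming the shared per-desc timer would look like this (just
> a sketch; poll_timer and the helper name are illustrative, not the
> actual patch):
>
> static void poll_timer_arm(struct irq_desc *desc, unsigned long intv,
> 			   int slack)
> {
> 	/* caller holds desc->lock; one timer serves all poll types */
> 	set_timer_slack(&desc->poll_timer, slack);
> 	mod_timer(&desc->poll_timer, jiffies + intv);
> }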

They might be merged depending on interval and slack, but why do we
need different frequencies at all?

> > Also, why is the simple counter-based decision not good enough?
> > Why do we need an extra time period for this?
>
> Because many IRQ storms are transient and spurious IRQ polling
> worsens performance a lot, it's worthwhile to be able to recover from
> such conditions. The extra time period is there to trigger reenabling
> of the offending IRQ to see whether the storm is over. Please note
> that many IRQ storms of this type are extremely obscure (for example,
> one happens during resume from STR once in a blue moon due to a bad
> interaction between the legacy ATA state machine in the controller
> and certain behavior of the drive) and some are very difficult to
> avoid from the driver in any reasonable way.

If the IRQ has been marked as spurious, then switch to polling for a
defined time period (simply count the number of poll timer runs for
that irq). After that, switch back to non-polling and keep a flag
which kicks the stupid thing back to poll mode after 10 spurious ones
in a row.
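
Roughly like this (a sketch only; the counters and helpers are
made-up names, not existing kernel code):

/* called when note_interrupt() disables a spurious irq (again) */
static void irq_mark_spurious(struct irq_desc *desc)
{
	desc->spurious_trips++;		/* reset when a retry succeeds */
	start_poll_timer(desc);
}

static void poll_spurious_irq(struct irq_desc *desc)
{
	try_one_irq(desc->irq, desc);	/* poke this irq's handlers */

	/* polled for a full period and not hopeless? retry real irqs */
	if (++desc->poll_runs >= SPURIOUS_POLL_RUNS &&
	    desc->spurious_trips < 10) {
		desc->poll_runs = 0;
		irq_reenable(desc);
		return;
	}
	mod_timer(&desc->poll_timer, jiffies + POLL_INTERVAL);
}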

> > - Reenable the interrupt from time to time to check whether the irq
> > storm subsided.
>
> Or maybe you were talking about something else?

No, that stuff is so convoluted that I did not understand it.

> > Generally a good idea, but we really do not want to wait for
> > another 10k unhandled interrupts flooding the machine. Ideally we
> > should read out the irq pending register from the irq chip to see
> > whether the interrupt is still asserted.
>
> In simulated tests with hosed IRQ routing the behavior was already
> quite acceptable, but yeah, if peeking at the interrupt state can be
> done easily, that would definitely be an improvement.

It could be done on various irq chips, but not on all. So we need that
real retry anyway; no need to add another function to irq_chip.

> > 2) irq watch
>
> The whole thing is executed iff unlikely(desc->status &
> IRQ_CHECK_WATCHES), which the processor will predict correctly most
> of the time. Do you think it would make any noticeable difference?
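>
> I.e. the hot path only carries something like this (a sketch; the
> hook name is made up):
>
> 	if (unlikely(desc->status & IRQ_CHECK_WATCHES))
> 		irq_watch_note(desc, action, ret);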

It's not about the !WATCH case. I don't like the extra work for those
watched ones.

> > irq_watch should use a separate global timer and a linked list. The
> > timer is started when an interrupt is added to the watch mechanism
> > and stopped when the list becomes empty.
> >
> > That's a clear separation of watch and poll and simplifies the
> > whole business a lot.
>
> The reason poll_irq() looks complicated is that the different poll
> types on the same desc share a single timer. The reason it's shared
> this way, instead of across different descs of the same type, is to
> avoid the locking overhead of maintaining the timer and the linked
> lists of its targets. By sharing the timer within the desc,
> everything can be protected by desc->lock; if we shared timers
> across different descs, we would need another layer of locking.

The locking overhead for a global timer + linked list is minimal.

There is only one additional lock, the one which protects the
list_head. So you take it when watch_irq() adds the watch, when the
watch is removed, and in the timer routine. That's zero overhead as
lock contention is totally unlikely. There is no need to fiddle with
the timer interval. Watch the stupid thing for a while; if it behaves,
remove it from the list; if not, you need to keep polling anyway.
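
Something along these lines would do (a sketch; all names are
invented, and it assumes the list_head lives in irq_desc):

static LIST_HEAD(irq_watch_list);
static DEFINE_SPINLOCK(irq_watch_lock);
static void irq_watch_fn(unsigned long data);
static DEFINE_TIMER(irq_watch_timer, irq_watch_fn, 0, 0);

void watch_irq(struct irq_desc *desc)
{
	unsigned long flags;

	spin_lock_irqsave(&irq_watch_lock, flags);
	if (list_empty(&irq_watch_list))
		mod_timer(&irq_watch_timer, jiffies + IRQ_WATCH_INTERVAL);
	list_add(&desc->watch_entry, &irq_watch_list);
	spin_unlock_irqrestore(&irq_watch_lock, flags);
}

static void irq_watch_fn(unsigned long data)
{
	struct irq_desc *desc, *tmp;
	unsigned long flags;

	spin_lock_irqsave(&irq_watch_lock, flags);
	list_for_each_entry_safe(desc, tmp, &irq_watch_list, watch_entry)
		if (irq_behaves_again(desc))	/* invented predicate */
			list_del_init(&desc->watch_entry);
	if (!list_empty(&irq_watch_list))
		mod_timer(&irq_watch_timer, jiffies + IRQ_WATCH_INTERVAL);
	spin_unlock_irqrestore(&irq_watch_lock, flags);
}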

> > 3) irq expect
> >
> We definitely can implement such quick polling timeouts which check
> for IRQ misdeliveries in drivers, so that they can detect such events,
> and maybe add an interface in the IRQ polling code which a driver can
> call to enable polling on the IRQ. But then again, the pattern of
> such failures and the handling of them would be very similar across
> different drivers, and we already have most of the polling machinery
> implemented in the IRQ code, so I think it makes sense to have a
> generic implementation which drivers can use. Moreover, given the
> difference in test/review coverage and the general complexity of
> doing it right, I think it is far better to do it in the core code
> and make it easy to use for drivers.

I'm not opposed to having a generic solution for this, but I want a
simple and understandable one. The irq_poll() works-for-everything
magic is definitely not in this category.

> > Btw, this polling is pretty bad in terms of power savings.
> >
> > - The timers are not aligned, so if there are several expects armed
> > you get unbatched wakeups.
>
> Under usual operating conditions, the timer interval will quickly
> become 3 seconds w/ 1 sec slack. I doubt it can cause much of a
> problem. Right?

And those 4 seconds are probably larger than the device timeouts in
most cases. So what's the point of running that timer at all?

> > I assume your concern is to detect and deal with flaky interrupts
> > while avoiding the permanent poll when the device is not used, right?
> >
> > So that can be done simpler as well. One timer and a linked list
> > again.
> >
> > expect_irq() adds the irq to the linked list and arms the timer, if
> > it is not armed already. unexpect_irq() removes the irq from the
> > linked list and deletes the timer when the list becomes empty.
> >
> > That can reuse the list_head in irq_desc which you need for irq
> > poll anyway, and the timer interval should be fixed, without
> > dynamic adjustment.
>
> The main problem here is that expect/unexpect_irq() needs to be low
> overhead so that drivers can call it easily w/o worrying too much
> about the performance overhead, and I would definitely want to avoid
> locking in fast paths.

In general you only want to use this expect thing when you have
detected that the device has a problem. So why not use the watch
mechanism for this?

When you detect the first timeout in the device driver, set the watch
on that action and let it figure out whether things go back to
normal. If they go back to normal, the watch disappears. If it
happens again, repeat.

expect_irq() would basically be a NOP, but I'd like to keep it, as it
can be used for power management decisions.

unexpect_irq() would either add or remove the watch depending on the
timeout argument. So there is only overhead when things go bad and
when we return to normal; in all other cases it's zero.

We do need a reference to the irqaction.watch of that device, but
that's not a big deal. With that we can call

unexpect_irq(watch, false/true);

which would be an inline function (assuming an irq_watch type behind
irqaction.watch):

static inline void unexpect_irq(struct irq_watch *watch, bool timeout)
{
	/* slow path only when the watch state has to change: arm the
	 * watch on a timeout, disarm it once things work again */
	if (unlikely(watch->active ^ timeout))
		__unexpect_irq(watch, timeout);
}

Simple, right?

> To me, the more interesting question is where to put the complexity.
> In many cases, the most we can do is push complexity around, and if
> we can make things easier for drivers by making the core IRQ code
> somewhat more complex, I'll definitely go for that every time.

No objections, but I doubt that we need all the heuristics and
complex code to deal with it.

Thanks,

tglx
From: Thomas Gleixner on
On Mon, 2 Aug 2010, Tejun Heo wrote:

> On 08/02/2010 05:28 PM, Tejun Heo wrote:
> > The reason poll_irq() looks complicated is that the different poll
> > types on the same desc share a single timer. The reason it's shared
> > this way, instead of across different descs of the same type, is to
> > avoid the locking overhead of maintaining the timer and the linked
> > lists of its targets. By sharing the timer within the desc,
> > everything can be protected by desc->lock; if we shared timers
> > across different descs, we would need another layer of locking.
>
> Ooh, another reason is timer locality. If timers are shared per desc,
> they have a much higher chance of being on the same processor. Global
> timers would be pretty bad in that respect.

That's irrelevant. If you need to poll an interrupt, then it does not
matter at all whether you bounce some cache lines or not.

In fact we have two cases:

1) An interrupt needs to be polled all the time. That sucks whether
the poll timer bounces a few cache lines or not.

2) An irq is polled for some time. Either it works again after a
while, so your suckage is restricted to the poll period, or it
doesn't; see #1.

And it's even less of an issue as the main users of this misfeature
are laptops and desktop machines, where locality is not really that
important. If an enterprise admin decides to ignore the fact that the
hardware is flaky, then he does not care about the cache line bounces
either.

Thanks,

tglx
From: Tejun Heo on
Hello, Thomas.

On 08/02/2010 08:52 PM, Thomas Gleixner wrote:
>> Ooh, another reason is timer locality. If timers are shared per desc,
>> they have a much higher chance of being on the same processor. Global
>> timers would be pretty bad in that respect.
>
> That's irrelevant. If you need to poll an interrupt, then it does not
> matter at all whether you bounce some cache lines or not.
>
> In fact we have two cases:
>
> 1) An interrupt needs to be polled all the time. That sucks whether
> the poll timer bounces a few cache lines or not.
>
> 2) An irq is polled for some time. Either it works again after a
> while, so your suckage is restricted to the poll period, or it
> doesn't; see #1.

Hmm... for spurious and watch, the above is true, and if those were
the only two cases it would definitely make more sense to use
per-purpose global timers. The problem is w/ expect tho. It's
supposed to be used in normal hot paths, so the expect/unexpect
operations had better be low overhead and local. I'll talk more about
it in the other reply.

> And it's even less of an issue as the main users of this misfeature
> are laptops and desktop machines, where locality is not really that
> important. If an enterprise admin decides to ignore the fact that the
> hardware is flaky, then he does not care about the cache line bounces
> either.

These problems do happen on Intel chipset machines and are something
which can be worked around with some effort. Eh, let's talk in the
other reply.

Thanks.

--
tejun