From: Maciej W. Rozycki on
On Thu, 5 Nov 2009, Suresh Siddha wrote:

> > I've checked some docs too and you may want to check whether to qualify
> > the use of the EOI register further with the DT (Delivery Type) bit in the
> > Boot Configuration register. It affects ICH2 at least
>
> For relatively newer ICH's like ICH5, boot configuration register is
> marked as a reserved register (perhaps with the serial bus going away,
> so did this register but again with out the io-apic version change).
>
> Anyways, our understanding is that EOI register in ICH2 should work
> irrespective of DT bit in the boot config register.

What I mean is if the serial delivery type is used, then an interrupt
will be acked twice -- once via an EOI message sent from the local APIC
over the serial bus and then again via the write to the EOI register.
There is a race condition here, so if the IRQ line is still/again
asserted, then most likely two consecutive IRQ messages will be sent by
the I/O APIC and they may be accepted by two different local units and
eventually delivered to the OS. Linux will cope, but still this is sloppy
programming, so it should be best avoided -- if possible that is.

> > Also you may want to see whether the complication in ack_apic_level()
> > that is meant to deal with an APIC erratum really matters for FSB delivery
> > -- I guess not, because if you explicitly ACK an interrupt in the I/O
> > unit, then even if it was incorrectly recorded as edge-triggered in the
> > local unit, the IRR bit will be correctly reset and the next message
> > delivered properly. Given you introduce a conditional statement anyway,
> > you can place more code within it and there will be no performance hit for
> > the other path and certainly a gain for this one.
>
> I am not sure if I follow. With the recent changes (tip commit
> 5231a68614b94f60e8f6c56bc6e3d75955b9e75e), we use ipi on a new cpu to
> handle a pending level-triggered interrupt on the cpu that is going
> offline. It's just not only in the case of io-apic erratum for 0x11, we
> see level triggered interrupt as edge interrupt at the cpu.

I don't get it, sorry -- an interrupt has its trigger mode implied by the
IRQ_LEVEL status flag being set or clear in the IRQ's descriptor. What's
set in the local APIC's TMR does not (or should not) matter IMO.

Has the trigger mode mismatch erratum been seen for FSB delivered
interrupts anyway? My proposed patch is below. Includes the code
rearrangement I proposed as well. Untested, not even built.

Maciej
---
Complete the EOI dance for the trigger mode mismatch APIC erratum before
proceeding to IRQ migration. Omit the erratum workaround for I/O APICs
using FSB delivery as they get their EOI message delivered by hand.

Signed-off-by: Maciej W. Rozycki <macro(a)linux-mips.org>
---
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 31e9db3..6e4edad 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2555,33 +2555,50 @@ static void ack_apic_level(unsigned int irq)
#endif

/*
- * It appears there is an erratum which affects at least version 0x11
- * of I/O APIC (that's the 82093AA and cores integrated into various
- * chipsets). Under certain conditions a level-triggered interrupt is
- * erroneously delivered as edge-triggered one but the respective IRR
- * bit gets set nevertheless. As a result the I/O unit expects an EOI
- * message but it will never arrive and further interrupts are blocked
- * from the source. The exact reason is so far unknown, but the
- * phenomenon was observed when two consecutive interrupt requests
- * from a given source get delivered to the same CPU and the source is
- * temporarily disabled in between.
- *
- * A workaround is to simulate an EOI message manually. We achieve it
- * by setting the trigger mode to edge and then to level when the edge
- * trigger mode gets detected in the TMR of a local APIC for a
- * level-triggered interrupt. We mask the source for the time of the
- * operation to prevent an edge-triggered interrupt escaping meanwhile.
- * The idea is from Manfred Spraul. --macro
+ * We must acknowledge the irq before we move it or the acknowledge
+ * will not propagate properly.
*/
- cfg = desc->chip_data;
- i = cfg->vector;
- v = apic_read(APIC_TMR + ((i & ~0x1f) >> 1));
+ if (use_eoi_reg) {
+ ack_APIC_irq();
+ eoi_ioapic_irq(desc);
+ } else {
+ /*
+ * It appears there is an erratum which affects at least
+ * version 0x11 of I/O APIC (that's the 82093AA and cores
+ * integrated into various chipsets). Under certain
+ * conditions a level-triggered interrupt is erroneously
+ * delivered as edge-triggered one but the respective IRR
+ * bit gets set nevertheless. As a result the I/O unit
+ * expects an EOI message but it will never arrive and
+ * further interrupts are blocked from the source. The
+ * exact reason is so far unknown, but the phenomenon was
+ * observed when two consecutive interrupt requests from
+ * a given source get delivered to the same CPU and the
+ * source is temporarily disabled in between.
+ *
+ * A workaround is to simulate an EOI message manually.
+ * We achieve it by setting the trigger mode to edge and
+ * then to level when the edge trigger mode gets detected
+ * in the TMR of a local APIC for a level-triggered
+ * interrupt. We mask the source for the time of the
+ * operation to prevent an edge-triggered interrupt
+ * escaping meanwhile.
+ *
+ * The idea is from Manfred Spraul. --macro
+ */
+ cfg = desc->chip_data;
+ i = cfg->vector;
+ v = apic_read(APIC_TMR + ((i & ~0x1f) >> 1));

- /*
- * We must acknowledge the irq before we move it or the acknowledge will
- * not propagate properly.
- */
- ack_APIC_irq();
+ ack_APIC_irq();
+
+ /* Tail end of version 0x11 I/O APIC bug workaround. */
+ atomic_inc(&irq_mis_count);
+ spin_lock(&ioapic_lock);
+ __mask_and_edge_IO_APIC_irq(cfg);
+ __unmask_and_level_IO_APIC_irq(cfg);
+ spin_unlock(&ioapic_lock);
+ }

/* Now we can move and renable the irq */
if (unlikely(do_unmask_irq)) {
@@ -2616,20 +2633,6 @@ static void ack_apic_level(unsigned int irq)
move_masked_irq(irq);
unmask_IO_APIC_irq_desc(desc);
}
-
- /* Tail end of version 0x11 I/O APIC bug workaround */
- if (!(v & (1 << (i & 0x1f)))) {
- atomic_inc(&irq_mis_count);
-
- if (use_eoi_reg)
- eoi_ioapic_irq(desc);
- else {
- spin_lock(&ioapic_lock);
- __mask_and_edge_IO_APIC_irq(cfg);
- __unmask_and_level_IO_APIC_irq(cfg);
- spin_unlock(&ioapic_lock);
- }
- }
}

#ifdef CONFIG_INTR_REMAP
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Suresh Siddha on
On Thu, 2009-11-05 at 22:53 -0800, Maciej W. Rozycki wrote:
> What I mean is if the serial delivery type is used, then an interrupt
> will be acked twice -- once via an EOI message sent from the local APIC
> over the serial bus and then again via the write to the EOI register.

Maciej, Case where we are doing an explicit EOI to the io-apic (using
EOI register) is when the level triggered interrupt gets registered at
the cpu as an edge interrupt (in the local apic's trigger mode
register).

It will arrive as an edge interrupt for two cases.
a) for corner conditions which hit the 82093AA (io-apic version 0x11)
erratum
b) with my recent patch in -tip, during a cpu offline, when we send an
ipi (IPI is always registered as an edge triggered at the cpu) to
service the interrupt at the new cpu destination, instead of servicing
at it's old destination cpu (as it has already disabled interrupts and
going down -- like not being on the cpu_online_map etc).

So we are not acking the io-apic twice in this case, as the eoi to the
local apic won't brodcast the eoi to the io-apic (because of the edge
mode in trigger mode register of the local apic).

> There is a race condition here, so if the IRQ line is still/again
> asserted, then most likely two consecutive IRQ messages will be sent by
> the I/O APIC and they may be accepted by two different local units and
> eventually delivered to the OS. Linux will cope, but still this is sloppy
> programming, so it should be best avoided -- if possible that is.

I hope the above clarified.

> >
> > I am not sure if I follow. With the recent changes (tip commit
> > 5231a68614b94f60e8f6c56bc6e3d75955b9e75e), we use ipi on a new cpu to
> > handle a pending level-triggered interrupt on the cpu that is going
> > offline. It's just not only in the case of io-apic erratum for 0x11, we
> > see level triggered interrupt as edge interrupt at the cpu.
>
> I don't get it, sorry -- an interrupt has its trigger mode implied by the
> IRQ_LEVEL status flag being set or clear in the IRQ's descriptor. What's
> set in the local APIC's TMR does not (or should not) matter IMO.

Same, did the above clarify?

>
> Has the trigger mode mismatch erratum been seen for FSB delivered
> interrupts anyway?

No.

> Complete the EOI dance for the trigger mode mismatch APIC erratum before
> proceeding to IRQ migration.

I am ok with what you are tying to fix, but not your patch. please see
below.

> + if (use_eoi_reg) {
> + ack_APIC_irq();
> + eoi_ioapic_irq(desc);

We shouldn't do this unconditionally. i.e., we should do this double ack
only when the level-triggered interrupt is seen as an edge triggered
interrupt at the cpu (specified by the trigger mode register in local
apic).

Otherwise as you mentioned above, we will see two EOI msg's at the
io-apic (one by the cpu's eoi broadcast and another by explicit eoi to
the io-apic).

Best way to fix the issue you noticed is by simply moving the tail end
of that erratum fix before we try to migrate the irq to a new place.

Do you agree?

---
From: "Maciej W. Rozycki" <macro(a)linux-mips.org>
Subject: x86, io-apic: move the effort of clearing remoteIRR explicitly
before migrating the irq

When the level-triggered interrupt is seen as an edge interrupt, we try
to clear the remoteIRR explicitly (using either an io-apic eoi register
when present or through the idea of changing trigger mode of the io-apic
RTE to edge and then back to level). But this explicit try also needs to
happen before we try to migrate the irq. Otherwise irq migration attempt
will fail anyhow, as it postpones the irq migration to a later attempt
when it sees the remoteIRR in the io-apic RTE still set.

Signed-off-by: "Maciej W. Rozycki" <macro(a)linux-mips.org>
Reviewed-by: Suresh Siddha <suresh.b.siddha(a)intel.com>
---
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 47d95c3..9a26ea1 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2583,6 +2583,20 @@ static void ack_apic_level(unsigned int irq)
*/
ack_APIC_irq();

+ /* Tail end of version 0x11 I/O APIC bug workaround */
+ if (!(v & (1 << (i & 0x1f)))) {
+ atomic_inc(&irq_mis_count);
+
+ if (use_eoi_reg)
+ eoi_ioapic_irq(desc);
+ else {
+ spin_lock(&ioapic_lock);
+ __mask_and_edge_IO_APIC_irq(cfg);
+ __unmask_and_level_IO_APIC_irq(cfg);
+ spin_unlock(&ioapic_lock);
+ }
+ }
+
/* Now we can move and renable the irq */
if (unlikely(do_unmask_irq)) {
/* Only migrate the irq if the ack has been received.
@@ -2616,20 +2630,6 @@ static void ack_apic_level(unsigned int irq)
move_masked_irq(irq);
unmask_IO_APIC_irq_desc(desc);
}
-
- /* Tail end of version 0x11 I/O APIC bug workaround */
- if (!(v & (1 << (i & 0x1f)))) {
- atomic_inc(&irq_mis_count);
-
- if (use_eoi_reg)
- eoi_ioapic_irq(desc);
- else {
- spin_lock(&ioapic_lock);
- __mask_and_edge_IO_APIC_irq(cfg);
- __unmask_and_level_IO_APIC_irq(cfg);
- spin_unlock(&ioapic_lock);
- }
- }
}

#ifdef CONFIG_INTR_REMAP


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Maciej W. Rozycki on
On Fri, 6 Nov 2009, Suresh Siddha wrote:

> > What I mean is if the serial delivery type is used, then an interrupt
> > will be acked twice -- once via an EOI message sent from the local APIC
> > over the serial bus and then again via the write to the EOI register.
>
> Maciej, Case where we are doing an explicit EOI to the io-apic (using
> EOI register) is when the level triggered interrupt gets registered at
> the cpu as an edge interrupt (in the local apic's trigger mode
> register).
>
> It will arrive as an edge interrupt for two cases.
> a) for corner conditions which hit the 82093AA (io-apic version 0x11)
> erratum
> b) with my recent patch in -tip, during a cpu offline, when we send an
> ipi (IPI is always registered as an edge triggered at the cpu) to
> service the interrupt at the new cpu destination, instead of servicing
> at it's old destination cpu (as it has already disabled interrupts and
> going down -- like not being on the cpu_online_map etc).
>
> So we are not acking the io-apic twice in this case, as the eoi to the
> local apic won't brodcast the eoi to the io-apic (because of the edge
> mode in trigger mode register of the local apic).

OK, I see what you mean, but that makes me wonder why you are going
through such contortions. In the case of a CPU going offline I would
expect it to be done more or less in such a way:

1. Write all-zeroes to its local APIC's LDR register and set its TPR to
0xef. This will take this APIC out from LoPri arbitration and thus
from accepting any I/O APIC interrupts. Fixed delivery mode IPIs will
still be accepted (if that's not needed then the TPR can be set to
0xff; any received IPIs will be lost).

2. Service any outstanding interrupts that have already been accepted by
the local APIC (you may have to poll on the local IRR register with
interrupts enabled for a short while).

3. Disable the local APIC via the SVR register, mask local interrupts in
the processor's EFLAGS register and start the offline procedure. This
is the point of no-return, further IPIs won't be accepted and the CPU
has to be put through an INIT-IPI+StartUp-IPI cycle to get in control
again.

If IPI reception was not needed through stage #2 above, then the local
APIC could have been disabled at #1 instead -- interrupts pending in
the local APIC as recorded in the IRR or marked as in-progress in the
ISR are guaranteed to be delivered to the CPU and EOIed (as
appropriate) normally even in the disabled state of the local APIC.

> Do you agree?

If the scenario I have outlined above cannot be made to work for some
reason, then please do me and the others a favour and since with this
change you are tying new functionality to code originally meant as a
workaround for an obscure erratum only, do write a proper explanation and
place it next to the original comment describing previous use of the code.
With your change as it is, it is all but obvious what this piece of code
is meant to do.

Your change is OK with me once accompanied with said comment, but please
investigate my scenario first -- your approach looks like a horrible hack
to me, sorry.

Maciej
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Suresh Siddha on
Maciej,

Sorry for delayed response. I was busy with other stuff and didn't get a
chance till now to get back on this. I just posted few patches which
came up as a result of our past discussions here in this thread.

Please see my responses inline for your earlier comments:

On Sun, 2009-11-08 at 11:06 -0800, Maciej W. Rozycki wrote:
> OK, I see what you mean, but that makes me wonder why you are going
> through such contortions. In the case of a CPU going offline I would
> expect it to be done more or less in such a way:
>
> 1. Write all-zeroes to its local APIC's LDR register and set its TPR to
> 0xef. This will take this APIC out from LoPri arbitration and thus
> from accepting any I/O APIC interrupts. Fixed delivery mode IPIs will
> still be accepted (if that's not needed then the TPR can be set to
> 0xff; any received IPIs will be lost).
>
> 2. Service any outstanding interrupts that have already been accepted by
> the local APIC (you may have to poll on the local IRR register with
> interrupts enabled for a short while).
>
> 3. Disable the local APIC via the SVR register, mask local interrupts in
> the processor's EFLAGS register and start the offline procedure. This
> is the point of no-return, further IPIs won't be accepted and the CPU
> has to be put through an INIT-IPI+StartUp-IPI cycle to get in control
> again.
>
> If IPI reception was not needed through stage #2 above, then the local
> APIC could have been disabled at #1 instead -- interrupts pending in
> the local APIC as recorded in the IRR or marked as in-progress in the
> ISR are guaranteed to be delivered to the CPU and EOIed (as
> appropriate) normally even in the disabled state of the local APIC.

But before these 3 steps you listed here, we need to migrate the irq to
the new destination. And that step will modify the IO-APIC RTE with the
new vector and new destination. And during this process, remoteIRR of
the IO-APIC RTE might be set and this inflight interrupt will get
registered at the original destination that is going offline.

So when we come to step 2 you listed above and service any outstanding
interrupts, EOI broadcast as part of that handling won't clear the
remoteIRR of the IO-APIC RTE, as the vector information in the io-apic
RTE got modified because of irq migration and is different from the
vector information in EOI broadcast message sent by the cpu. This will
result in stuck level interrupt.

This is one of the challenges Eric Biederman had in the past and he
tried things like polling from the process context and modifying the
IO-APIC RTE (with new destination and vector information) only when the
remoteIRR is not set etc. But Eric still saw some hangs and stuck
interrupt conditions with his experimental patches.

We took a route which needed minor changes to the existing code and fix
the local_irq_enable()/local_irq_disable() issue and stuck interrupt
issue in the cpu offline path by using the IO-APIC EOI register. Our
tests on Intel platforms having an EOI register for io-apic's and IBM's
(Gary) tests on io-apic's which don't have EOI register using AMD
platforms worked fine with our approach.

> > Do you agree?
>
> If the scenario I have outlined above cannot be made to work for some
> reason,

Perhaps we can make it work but it needs more changes and validation.
And atleast Eric's similar experiments in the past didn't yield good
results.

> then please do me and the others a favour and since with this
> change you are tying new functionality to code originally meant as a
> workaround for an obscure erratum only, do write a proper explanation and
> place it next to the original comment describing previous use of the code.
> With your change as it is, it is all but obvious what this piece of code
> is meant to do.

This patch was also in the series that I just sent.

> Your change is OK with me once accompanied with said comment, but please
> investigate my scenario first -- your approach looks like a horrible hack
> to me, sorry.

This being a fragile area and considering our experiences in the past, I
leveraged the existing code (of clearing remoteIRR manually on ioapic's
which don't have an EOI register) and luckily we had good success so
far.

thanks,
suresh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/