From: Vivek Goyal on
On Thu, Apr 01, 2010 at 08:53:04AM -0400, Neil Horman wrote:
> On Wed, Mar 31, 2010 at 10:24:18PM -0400, Vivek Goyal wrote:
> > On Wed, Mar 31, 2010 at 09:13:11PM -0400, Neil Horman wrote:
> > > On Wed, Mar 31, 2010 at 02:25:35PM -0700, Chris Wright wrote:
> > > > * Neil Horman (nhorman(a)tuxdriver.com) wrote:
> > > > > Flush iommu during shutdown
> > > > >
> > > > > When using an iommu, its possible, if a kdump kernel boot follows a primary
> > > > > kernel crash, that dma operations might still be in flight from the previous
> > > > > kernel during the kdump kernel boot. This can lead to memory corruption,
> > > > > crashes, and other erroneous behavior, specifically I've seen it manifest during
> > > > > a kdump boot as endless iommu error log entries of the form:
> > > > > AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.1 domain=0x000d
> > > > > address=0x000000000245a0c0 flags=0x0070]
> > > >
> > > > We've already fixed this problem once before, so some code shift must
> > > > have brought it back. Personally, I prefer to do this on the bringup
> > > > path than the teardown path. Besides keeping the teardown path as
> > > > simple as possible (goal is to get to kdump kernel asap), there's also
> > > > reason to competely flush on startup in genernal in case BIOS has done
> > > > anything unsavory.
> > > >
> > > Chris,
> > > Can you elaborate on what you did with the iommu to make this safe? It
> > > will save me time digging through the history on this code, and help me
> > > understand better whats going on here.
> > >
> > > I was starting to think that we should just leave the iommu on through a kdump,
> > > and re-construct a new page table based on the old table (filtered by the error
> > > log) on kdump boot, but it sounds like a better solution might be in place.
> > >
> >
> > Hi Neil,
> >
> > Is following sequence possible.
> >
> > - In crashed kernel, take away the write permission from all the devices.
> > Mark bit 62 zero for all devices in device table.
> >
> > - Leave the iommu on and let the device entries be valid in kdump kernel
> > so that any in-flight dma does not become pass through (which can cause
> > more damage and corrupt kdump kernel).
> >
> > - During kdump kernel initialization, load a new device table where again
> > all the devices don't have write permission. looks like by default
> > we create a device table with all bits zero except DEV_ENTRY_VALID
> > and DEV_ENTRY_TRANSLATION bit.
> >
> > - Reset the device where we want to setup any dma or operate on.
> >
> > - Allow device to do DMA/write.
> >
> > So by default all the devices will not be able to do write to memory
> > and selective devices are given access only after a reset.
> >
> > I am not sure what are the dependencies for loading a new device table
> > in second kernel. If it requires disabling the IOMMU, then we leave a
> > window where in-flight dma will become passthrough and has the potential
> > to corrupt kdump kernel.
> >
> I think this is possible, but I'm a bit concerned with how some devices will
> handle a reset. For instance, what will happen to an HBA or a disk, if we reset
> it as the module is loading? Is that safe?

I think we need to reset devices in driver if "reset_devices" is set. So
we will not reset these during normal boot.

Regarding being safe, I don't know. I am assuming that driver knows (or
need to know), how to reset device safely while driver is initializing.
That's the whole assumption kdump is built on, that once driver is
initializing, it will first reset the device (if reset_devices is set), so
that chances of device working properly in second kernel increase.

Vivek

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Neil Horman on
On Wed, Mar 31, 2010 at 10:24:18PM -0400, Vivek Goyal wrote:
> On Wed, Mar 31, 2010 at 09:13:11PM -0400, Neil Horman wrote:
> > On Wed, Mar 31, 2010 at 02:25:35PM -0700, Chris Wright wrote:
> > > * Neil Horman (nhorman(a)tuxdriver.com) wrote:
> > > > Flush iommu during shutdown
> > > >
> > > > When using an iommu, its possible, if a kdump kernel boot follows a primary
> > > > kernel crash, that dma operations might still be in flight from the previous
> > > > kernel during the kdump kernel boot. This can lead to memory corruption,
> > > > crashes, and other erroneous behavior, specifically I've seen it manifest during
> > > > a kdump boot as endless iommu error log entries of the form:
> > > > AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.1 domain=0x000d
> > > > address=0x000000000245a0c0 flags=0x0070]
> > >
> > > We've already fixed this problem once before, so some code shift must
> > > have brought it back. Personally, I prefer to do this on the bringup
> > > path than the teardown path. Besides keeping the teardown path as
> > > simple as possible (goal is to get to kdump kernel asap), there's also
> > > reason to competely flush on startup in genernal in case BIOS has done
> > > anything unsavory.
> > >
> > Chris,
> > Can you elaborate on what you did with the iommu to make this safe? It
> > will save me time digging through the history on this code, and help me
> > understand better whats going on here.
> >
> > I was starting to think that we should just leave the iommu on through a kdump,
> > and re-construct a new page table based on the old table (filtered by the error
> > log) on kdump boot, but it sounds like a better solution might be in place.
> >
>
> Hi Neil,
>
> Is following sequence possible.
>
> - In crashed kernel, take away the write permission from all the devices.
> Mark bit 62 zero for all devices in device table.
>
> - Leave the iommu on and let the device entries be valid in kdump kernel
> so that any in-flight dma does not become pass through (which can cause
> more damage and corrupt kdump kernel).
>
> - During kdump kernel initialization, load a new device table where again
> all the devices don't have write permission. looks like by default
> we create a device table with all bits zero except DEV_ENTRY_VALID
> and DEV_ENTRY_TRANSLATION bit.
>
> - Reset the device where we want to setup any dma or operate on.
>
> - Allow device to do DMA/write.
>
> So by default all the devices will not be able to do write to memory
> and selective devices are given access only after a reset.
>
> I am not sure what are the dependencies for loading a new device table
> in second kernel. If it requires disabling the IOMMU, then we leave a
> window where in-flight dma will become passthrough and has the potential
> to corrupt kdump kernel.
>
I think this is possible, but I'm a bit concerned with how some devices will
handle a reset. For instance, what will happen to an HBA or a disk, if we reset
it as the module is loading? Is that safe?

Neil

> Thanks
> Vivek
>
> _______________________________________________
> kexec mailing list
> kexec(a)lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Neil Horman on
On Thu, Apr 01, 2010 at 05:56:43PM +0200, Joerg Roedel wrote:
> On Thu, Apr 01, 2010 at 10:47:36AM -0400, Neil Horman wrote:
> > On Thu, Apr 01, 2010 at 04:29:02PM +0200, Joerg Roedel wrote:
> > > I am back in office next tuesday and will look into this problem too.
> > >
> > Thank you.
>
> Just took a look and I think the problem is that the devices are
> attached to domains before the IOMMU hardware is enabled. This happens
> in the function prealloc_protection_domains(). The attach code issues
> the dte-invalidate commands but they are not executed because the
> hardware is off. I will verify this when I have access to hardware
> again.
> The possible fix will be to enable the hardware earlier in the
> initialization path.
>
That sounds like a reasonable theory, I'll try hack something together shortly.

> > > Right. The default for all devices is to forbid DMA.
> > >
> > Thanks, glad to know I read that right, took me a bit to understand it :)
>
> I should probably add a comment :-)
>
> > > Thats indeed true. I have seen that with ixgbe cards for example. They
> > > seem to be really confused after an target abort.
> > >
> > Yeah, this part worries me, target aborts lead to various brain dead hardware
> > pieces. What are you thoughts on leaving the iommu on through a reboot to avoid
> > this issue (possibly resetting any pci device that encounters a target abort, as
> > noted in the error log on the iommu?
>
> This would only prevent possible data corruption. When the IOMMU is off
> the devices will not get a target abort but will only write to different
> physical memory locations. The window where a target abort can happen
> starts when the kdump kernel re-enables the IOMMU and ends when the new
> driver for that device attaches. This is a small window but there is not
> a lot we can do to avoid this small time window.
>
Can you explain this a bit further please? From what I read, when the iommu is
disabled, AIUI it does no translations. That means that any dma addresses which
the driver mapped via the iommu prior to a crash that are stored in devices will
just get strobed on the bus without any translation. If those dma address do
not lay on top of any physical ram, won't that lead to bus errors, and
transaction aborts? Worse, if those dma addresses do lie on top of real
physical addresses, won't we get corruption in various places? Or am I missing
part of how that works?

Regards
Neil

> Joerg
>
>
> _______________________________________________
> kexec mailing list
> kexec(a)lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Joerg Roedel on
On Thu, Apr 01, 2010 at 01:11:49PM -0400, Neil Horman wrote:
> On Thu, Apr 01, 2010 at 05:56:43PM +0200, Joerg Roedel wrote:

> > The possible fix will be to enable the hardware earlier in the
> > initialization path.
> >
> That sounds like a reasonable theory, I'll try hack something together
> shortly.

Great. So the problem might be already fixed when I am back in the
office ;-)

> > This would only prevent possible data corruption. When the IOMMU is off
> > the devices will not get a target abort but will only write to different
> > physical memory locations. The window where a target abort can happen
> > starts when the kdump kernel re-enables the IOMMU and ends when the new
> > driver for that device attaches. This is a small window but there is not
> > a lot we can do to avoid this small time window.
> >
> Can you explain this a bit further please? From what I read, when the iommu is
> disabled, AIUI it does no translations. That means that any dma addresses which
> the driver mapped via the iommu prior to a crash that are stored in devices will
> just get strobed on the bus without any translation. If those dma address do
> not lay on top of any physical ram, won't that lead to bus errors, and
> transaction aborts? Worse, if those dma addresses do lie on top of real
> physical addresses, won't we get corruption in various places? Or am I missing
> part of how that works?

Hm, the device address may not be a valid host physical address, thats
true. But the problem with the small time-window when the IOMMU hardware
is re-programmed from the kdump kernel still exists.
I need to think about other possible side-effects of leaving the IOMMU
enabled on shutdown^Wboot into a kdump kernel.

Joerg

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Neil Horman on
On Thu, Apr 01, 2010 at 10:14:34PM +0200, Joerg Roedel wrote:
> On Thu, Apr 01, 2010 at 01:11:49PM -0400, Neil Horman wrote:
> > On Thu, Apr 01, 2010 at 05:56:43PM +0200, Joerg Roedel wrote:
>
> > > The possible fix will be to enable the hardware earlier in the
> > > initialization path.
> > >
> > That sounds like a reasonable theory, I'll try hack something together
> > shortly.
>
> Great. So the problem might be already fixed when I am back in the
> office ;-)
>
Don't hold your breath, but I'll try my best :)

> > > This would only prevent possible data corruption. When the IOMMU is off
> > > the devices will not get a target abort but will only write to different
> > > physical memory locations. The window where a target abort can happen
> > > starts when the kdump kernel re-enables the IOMMU and ends when the new
> > > driver for that device attaches. This is a small window but there is not
> > > a lot we can do to avoid this small time window.
> > >
> > Can you explain this a bit further please? From what I read, when the iommu is
> > disabled, AIUI it does no translations. That means that any dma addresses which
> > the driver mapped via the iommu prior to a crash that are stored in devices will
> > just get strobed on the bus without any translation. If those dma address do
> > not lay on top of any physical ram, won't that lead to bus errors, and
> > transaction aborts? Worse, if those dma addresses do lie on top of real
> > physical addresses, won't we get corruption in various places? Or am I missing
> > part of how that works?
>
> Hm, the device address may not be a valid host physical address, thats
> true. But the problem with the small time-window when the IOMMU hardware
> is re-programmed from the kdump kernel still exists.
> I need to think about other possible side-effects of leaving the IOMMU
> enabled on shutdown^Wboot into a kdump kernel.
>
I think its an interesting angle to consider. Thats why I was talking about
cloning the old tables in the new kdump kernel and using the error log to filter
out entries that we could safely assume were complete until enough of the iommu
page tables were free, so that we could continue to hobble along in the kdump
kernel until we got to a proper reboot. All just thought experiment of course.
I'll try tinkering with your idea above first.
Neil

> Joerg
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/