amd iommu: force flush of iommu prior during shutdown [Kernel]

From: Neil Horman on 1 Apr 2010 09:40

On Thu, Apr 01, 2010 at 12:10:40AM -0700, Chris Wright wrote:
> * Vivek Goyal (vgoyal(a)redhat.com) wrote:
> > On Wed, Mar 31, 2010 at 02:25:35PM -0700, Chris Wright wrote:
> > > * Neil Horman (nhorman(a)tuxdriver.com) wrote:
> > > > Flush iommu during shutdown
> > > >
> > > > When using an iommu, it's possible, if a kdump kernel boot follows a primary
> > > > kernel crash, that dma operations might still be in flight from the previous
> > > > kernel during the kdump kernel boot. This can lead to memory corruption,
> > > > crashes, and other erroneous behavior. Specifically, I've seen it manifest during
> > > > a kdump boot as endless iommu error log entries of the form:
> > > > AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.1 domain=0x000d
> > > > address=0x000000000245a0c0 flags=0x0070]
> > >
> > > We've already fixed this problem once before, so some code shift must
> > > have brought it back. Personally, I prefer to do this on the bringup
> > > path than the teardown path. Besides keeping the teardown path as
> > > simple as possible (goal is to get to kdump kernel asap), there's also
> > > reason to completely flush on startup in general in case BIOS has done
> > > anything unsavory.
> >
> > Can we flush domains (all the I/O TLBs associated with each domain) during
> > initialization? I think all the domain data built by the previous kernel will
> > be lost and the new kernel will have no idea about it.
>
> We first invalidate the device table entry, so new translation requests
> will see the new domainid for a given BDF. Then we invalidate the
> whole set of page tables associated w/ the new domainid. Now all dma
> transactions will need page table walk (page tables will be empty except
> for any 1:1 mappings). Any old domainid's from previous kernel that
> aren't found in new device table entries are effectively moot. Just so
> happens that in kexec/kdump case, they'll be the same domainid's, but
> that doesn't matter.
>
> thanks,
> -chris
>
Additionally, Chris (this is just for my own education here), what happens when
we disable the iommu while dma's are in flight? I ask because from what I read,
my assumption is that the iommu effectively enters a passive mode where bus
accesses from devices holding dma addresses that were previously provided by an
iommu translation will just get strobed onto the bus without being translated
back to physical addresses. Won't that result in bus errors causing master
aborts? If so, it would seem that it would be further cause to leave the iommu
on during a crash/kdump boot.

Neil

> _______________________________________________
> kexec mailing list
> kexec(a)lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Joerg Roedel on 1 Apr 2010 10:30

Hi Neil,

first some general words about the problem you discovered: The problem
is not caused by in-flight DMA. The problem is that the IOMMU hardware
has cached the old DTE entry for the device (including the old
page-table root pointer) and is using it still when the kdump kernel has
booted. We had this problem once and fixed it by flushing a DTE in the
IOMMU before it is used for the first time. This seems to be broken
now. Which kernel have you seen this on?

I am back in the office next Tuesday and will look into this problem too.

On Wed, Mar 31, 2010 at 04:27:45PM -0400, Neil Horman wrote:
> So I'm officially rescinding this patch.

Yeah, the right solution to this problem is to find out why every DTE is
no longer flushed before first use.

> It apparently just covered up the problem, rather than solved it
> outright. This is going to take some more thought on my part. I read
> the code a bit closer, and the amd iommu on boot up currently marks
> all its entries as valid and having a valid translation (because if
> they're marked as invalid they're passed through untranslated which
> strikes me as dangerous, since it means a dma address treated as a bus
> address could lead to memory corruption). The saving grace is that
> they are marked as non-readable and non-writeable, so any device doing
> a dma after the reinit should get logged (which it does), and then
> target aborted (which should effectively squash the translation).

Right. The default for all devices is to forbid DMA.
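
To make the default state concrete, here is a minimal sketch in C of a device table entry that is valid and translated but grants neither read nor write permission. Only bit 62 (write permission) and the names DEV_ENTRY_VALID and DEV_ENTRY_TRANSLATION come from this thread; the other bit positions, DEV_ENTRY_IR, and the helpers dte_bit(), default_dte() and classify_dma() are assumptions made up for illustration, not the driver's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in the low 64 bits of a device table entry (DTE).
 * Bit 62 as the write-permission bit is mentioned in this thread;
 * the remaining positions are assumptions for this sketch. */
#define DEV_ENTRY_VALID        0   /* assumed position */
#define DEV_ENTRY_TRANSLATION  1   /* assumed position */
#define DEV_ENTRY_IR          61   /* assumed: read permission */
#define DEV_ENTRY_IW          62   /* write permission, per the thread */

/* Outcome of a DMA request in this simplified model. */
enum dma_result { DMA_UNTRANSLATED, DMA_ALLOWED, DMA_BLOCKED };

/* Return a 64-bit mask with only the given bit set. */
static uint64_t dte_bit(unsigned int bit)
{
    assert(bit < 64);
    return UINT64_C(1) << bit;
}

/* Default entry: valid, translation enabled, no read/write permission. */
static uint64_t default_dte(void)
{
    return dte_bit(DEV_ENTRY_VALID) | dte_bit(DEV_ENTRY_TRANSLATION);
}

/* Classify a DMA request against a DTE.
 * Simplification: if either the valid or the translation bit is clear,
 * the request is treated as passed through untranslated. */
static enum dma_result classify_dma(uint64_t dte, bool is_write)
{
    uint64_t need = dte_bit(is_write ? DEV_ENTRY_IW : DEV_ENTRY_IR);

    if (!(dte & dte_bit(DEV_ENTRY_VALID)) ||
        !(dte & dte_bit(DEV_ENTRY_TRANSLATION)))
        return DMA_UNTRANSLATED;

    return (dte & need) ? DMA_ALLOWED : DMA_BLOCKED;
}
```

In this model, classify_dma(default_dte(), true) yields DMA_BLOCKED (the logged and target-aborted case described above), while an all-zero entry yields DMA_UNTRANSLATED, which is the dangerous pass-through case.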

> I'm starting to wonder if:
>
> 1) some dmas are so long lived they start aliasing new dmas that get mapped in
> the kdump kernel leading to various erroneous behavior

At least not in this case. Even when this is true the DMA would target
memory of the crashed kernel and not the kdump area. This is not even
memory corruption because the device will write to memory the driver has
allocated for it.

> 2) a slew of target aborts to some hardware results in them being in an
> inconsistent state

That's indeed true. I have seen that with ixgbe cards for example. They
seem to be really confused after a target abort.

> I'm going to try marking the dev table on shutdown such that all devices have no
> read/write permissions to see if that changes the situation. It should I think
> give me a pointer as to whether (1) or (2) is the more likely problem.

Probably not. You still need to flush the old entries out of the IOMMU.

Thanks,

Joerg

From: Neil Horman on 1 Apr 2010 10:50

On Thu, Apr 01, 2010 at 04:29:02PM +0200, Joerg Roedel wrote:
> Hi Neil,
>
> first some general words about the problem you discovered: The problem
> is not caused by in-flight DMA. The problem is that the IOMMU hardware
> has cached the old DTE entry for the device (including the old
> page-table root pointer) and is using it still when the kdump kernel has
> booted. We had this problem once and fixed it by flushing a DTE in the
> IOMMU before it is used for the first time. This seems to be broken
> now. Which kernel have you seen this on?
>
First noted on 2.6.32 (the RHEL6 beta kernel), but I've reproduced with the
latest linus tree as well.

> I am back in the office next Tuesday and will look into this problem too.
>
Thank you.

> On Wed, Mar 31, 2010 at 04:27:45PM -0400, Neil Horman wrote:
> > So I'm officially rescinding this patch.
>
> Yeah, the right solution to this problem is to find out why every DTE is
> no longer flushed before first use.
>
Right, I've checked the commits that Chris noted in his previous email and
they're in place, so I'm not sure how we're getting stale DTEs.

> > It apparently just covered up the problem, rather than solved it
> > outright. This is going to take some more thought on my part. I read
> > the code a bit closer, and the amd iommu on boot up currently marks
> > all its entries as valid and having a valid translation (because if
> > they're marked as invalid they're passed through untranslated which
> > strikes me as dangerous, since it means a dma address treated as a bus
> > address could lead to memory corruption). The saving grace is that
> > they are marked as non-readable and non-writeable, so any device doing
> > a dma after the reinit should get logged (which it does), and then
> > target aborted (which should effectively squash the translation).
>
> Right. The default for all devices is to forbid DMA.
>
Thanks, glad to know I read that right, took me a bit to understand it :)

> > I'm starting to wonder if:
> >
> > 1) some dmas are so long lived they start aliasing new dmas that get mapped in
> > the kdump kernel leading to various erroneous behavior
>
> At least not in this case. Even when this is true the DMA would target
> memory of the crashed kernel and not the kdump area. This is not even
> memory corruption because the device will write to memory the driver has
> allocated for it.
>
Yeah, I figured that old dma's going to old locations were ok. I was more
concerned that an 'old' dma might live through our resetting of the iommu page
table, leading to us pointing an old dma address at a new physical address
within the new kernel memory space. Although, given the reset state of the
tables, for that to happen a dma would have to not attempt a memory transaction
until sometime later in the boot, which seems...unlikely to say the least, so I
agree this is almost certainly not happening.

> > 2) a slew of target aborts to some hardware results in them being in an
> > inconsistent state
>
> That's indeed true. I have seen that with ixgbe cards for example. They
> seem to be really confused after a target abort.
>
Yeah, this part worries me; target aborts lead to various brain dead hardware
pieces. What are your thoughts on leaving the iommu on through a reboot to avoid
this issue (possibly resetting any pci device that encounters a target abort, as
noted in the error log on the iommu)?

> > I'm going to try marking the dev table on shutdown such that all devices have no
> > read/write permissions to see if that changes the situation. It should I think
> > give me a pointer as to whether (1) or (2) is the more likely problem.
>
> Probably not. You still need to flush the old entries out of the IOMMU.
>
Yeah, after reading your explanation above, I agree.
Neil

From: Joerg Roedel on 1 Apr 2010 12:00

On Thu, Apr 01, 2010 at 10:47:36AM -0400, Neil Horman wrote:
> On Thu, Apr 01, 2010 at 04:29:02PM +0200, Joerg Roedel wrote:
> > I am back in the office next Tuesday and will look into this problem too.
> >
> Thank you.

Just took a look and I think the problem is that the devices are
attached to domains before the IOMMU hardware is enabled. This happens
in the function prealloc_protection_domains(). The attach code issues
the dte-invalidate commands but they are not executed because the
hardware is off. I will verify this when I have access to hardware
again.
The possible fix will be to enable the hardware earlier in the
initialization path.
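
The ordering problem described above can be illustrated with a toy model in C. This is only a sketch of the hypothesis in this mail, not the driver's code: struct toy_iommu and every toy_* function are hypothetical names, and the model simply assumes that an invalidate command issued while the hardware is off has no effect and that the hardware's cached entry survives from the crashed kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_DEVICES 4

/* Hypothetical toy model of an IOMMU with a per-device DTE cache. */
struct toy_iommu {
    bool enabled;                        /* is the hardware switched on? */
    int  table_root[TOY_NUM_DEVICES];    /* page-table root in the in-memory device table */
    int  cached_root[TOY_NUM_DEVICES];   /* root the hardware has cached */
    bool cache_valid[TOY_NUM_DEVICES];   /* does the hardware hold a cached entry? */
};

/* State left behind by the crashed kernel: every cached entry still
 * points at the old page-table root, and the hardware is off. */
static struct toy_iommu toy_after_crash(int old_root)
{
    struct toy_iommu iommu = { .enabled = false };

    for (size_t dev = 0; dev < TOY_NUM_DEVICES; dev++) {
        iommu.table_root[dev]  = old_root;
        iommu.cached_root[dev] = old_root;
        iommu.cache_valid[dev] = true;
    }
    return iommu;
}

/* Invalidate command: assumed to be executed only if the hardware is on. */
static bool toy_invalidate_dte(struct toy_iommu *iommu, size_t dev)
{
    assert(dev < TOY_NUM_DEVICES);
    if (!iommu->enabled)
        return false;                    /* command is never executed */
    iommu->cache_valid[dev] = false;
    return true;
}

/* Attach a device to a new domain: update the table, then invalidate. */
static void toy_attach_device(struct toy_iommu *iommu, size_t dev, int new_root)
{
    assert(dev < TOY_NUM_DEVICES);
    iommu->table_root[dev] = new_root;
    (void)toy_invalidate_dte(iommu, dev);
}

static void toy_enable(struct toy_iommu *iommu)
{
    iommu->enabled = true;
}

/* Root the hardware would use for the next translation of this device. */
static int toy_lookup_root(struct toy_iommu *iommu, size_t dev)
{
    assert(dev < TOY_NUM_DEVICES);
    if (!iommu->cache_valid[dev]) {      /* cache miss: refetch from the table */
        iommu->cached_root[dev] = iommu->table_root[dev];
        iommu->cache_valid[dev] = true;
    }
    return iommu->cached_root[dev];
}
```

In this model, calling toy_attach_device() before toy_enable() leaves toy_lookup_root() returning the old root, while enabling first and attaching afterwards returns the new one, which is the motivation for enabling the hardware earlier in the initialization path.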

> > Right. The default for all devices is to forbid DMA.
> >
> Thanks, glad to know I read that right, took me a bit to understand it :)

I should probably add a comment :-)

> > That's indeed true. I have seen that with ixgbe cards for example. They
> > seem to be really confused after a target abort.
> >
> Yeah, this part worries me; target aborts lead to various brain dead hardware
> pieces. What are your thoughts on leaving the iommu on through a reboot to avoid
> this issue (possibly resetting any pci device that encounters a target abort, as
> noted in the error log on the iommu)?

This would only prevent possible data corruption. When the IOMMU is off
the devices will not get a target abort but will only write to different
physical memory locations. The window where a target abort can happen
starts when the kdump kernel re-enables the IOMMU and ends when the new
driver for that device attaches. This is a small window but there is not
a lot we can do to avoid this small time window.

Joerg

From: Neil Horman on 1 Apr 2010 12:30

On Thu, Apr 01, 2010 at 11:02:03AM -0400, Vivek Goyal wrote:
> On Thu, Apr 01, 2010 at 08:53:04AM -0400, Neil Horman wrote:
> > On Wed, Mar 31, 2010 at 10:24:18PM -0400, Vivek Goyal wrote:
> > > On Wed, Mar 31, 2010 at 09:13:11PM -0400, Neil Horman wrote:
> > > > On Wed, Mar 31, 2010 at 02:25:35PM -0700, Chris Wright wrote:
> > > > > * Neil Horman (nhorman(a)tuxdriver.com) wrote:
> > > > > > Flush iommu during shutdown
> > > > > >
> > > > > > > When using an iommu, it's possible, if a kdump kernel boot follows a primary
> > > > > > > kernel crash, that dma operations might still be in flight from the previous
> > > > > > > kernel during the kdump kernel boot. This can lead to memory corruption,
> > > > > > > crashes, and other erroneous behavior. Specifically, I've seen it manifest during
> > > > > > > a kdump boot as endless iommu error log entries of the form:
> > > > > > AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.1 domain=0x000d
> > > > > > address=0x000000000245a0c0 flags=0x0070]
> > > > >
> > > > > We've already fixed this problem once before, so some code shift must
> > > > > have brought it back. Personally, I prefer to do this on the bringup
> > > > > path than the teardown path. Besides keeping the teardown path as
> > > > > simple as possible (goal is to get to kdump kernel asap), there's also
> > > > > > reason to completely flush on startup in general in case BIOS has done
> > > > > anything unsavory.
> > > > >
> > > > Chris,
> > > > Can you elaborate on what you did with the iommu to make this safe? It
> > > > will save me time digging through the history on this code, and help me
> > > > understand better what's going on here.
> > > >
> > > > I was starting to think that we should just leave the iommu on through a kdump,
> > > > and re-construct a new page table based on the old table (filtered by the error
> > > > log) on kdump boot, but it sounds like a better solution might be in place.
> > > >
> > >
> > > Hi Neil,
> > >
> > > Is the following sequence possible?
> > >
> > > - In crashed kernel, take away the write permission from all the devices.
> > > Mark bit 62 zero for all devices in device table.
> > >
> > > - Leave the iommu on and let the device entries be valid in kdump kernel
> > > so that any in-flight dma does not become pass through (which can cause
> > > more damage and corrupt kdump kernel).
> > >
> > > - During kdump kernel initialization, load a new device table where again
> > > all the devices don't have write permission. It looks like by default
> > > we create a device table with all bits zero except DEV_ENTRY_VALID
> > > and DEV_ENTRY_TRANSLATION bit.
> > >
> > > - Reset the device on which we want to set up any dma or operate.
> > >
> > > - Allow device to do DMA/write.
> > >
> > > So by default all the devices will not be able to do write to memory
> > > and selective devices are given access only after a reset.
> > >
> > > I am not sure what are the dependencies for loading a new device table
> > > in second kernel. If it requires disabling the IOMMU, then we leave a
> > > window where in-flight dma will become passthrough and has the potential
> > > to corrupt kdump kernel.
> > >
> > I think this is possible, but I'm a bit concerned with how some devices will
> > handle a reset. For instance, what will happen to an HBA or a disk, if we reset
> > it as the module is loading? Is that safe?
>
> I think we need to reset devices in driver if "reset_devices" is set. So
> we will not reset these during normal boot.
>
> Regarding being safe, I don't know. I am assuming that driver knows (or
> need to know), how to reset device safely while driver is initializing.
> That's the whole assumption kdump is built on, that once driver is
> initializing, it will first reset the device (if reset_devices is set), so
> that chances of device working properly in second kernel increase.
>
Yes, I agree. I was just asking whether it is safe to unilaterally reset devices
during boot. I suppose it is, but I'm not entirely sure.
Neil

> Vivek
>