From: Alok Kataria on
Hi,

On Wed, 2010-07-21 at 17:03 -0700, FUJITA Tomonori wrote:
> On Thu, 22 Jul 2010 08:44:42 +0900
> FUJITA Tomonori <fujita.tomonori(a)lab.ntt.co.jp> wrote:
>
> > On Wed, 21 Jul 2010 10:13:34 -0700
> > Alok Kataria <akataria(a)vmware.com> wrote:
> >
> > > > Basically, you want to add hot-plug memory and enable swiotlb, right?
> > >
> > > Not really, I am planning to do something like this,
> > >
> > > @@ -52,7 +52,7 @@ int __init pci_swiotlb_detect(void)
> > >
> > > /* don't initialize swiotlb if iommu=off (no_iommu=1) */
> > > #ifdef CONFIG_X86_64
> > > - if (!no_iommu && max_pfn > MAX_DMA32_PFN)
> > > + if (!no_iommu && (max_pfn > MAX_DMA32_PFN || hotplug_possible()))
> > > swiotlb = 1;
> >
> > Always enable swiotlb with memory hotplug enabled?

Yep, though only on systems that have hotpluggable memory support.

> Wasting 64MB on a
> > x86_64 system with 128MB doesn't look to be a good idea. I don't think
> > that there is an easy solution for this issue though.

Good, now that you agree that that's the only feasible solution, do you
have any suggestions for interfaces available from SRAT for
implementing hotplug_possible()?
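
For reference, such a check could key off the hot-pluggable flag in the
SRAT memory affinity entries. A minimal user-space sketch of that logic
follows; the struct layout and the hotplug_possible() signature here are
illustrative stand-ins, not the real ACPI parser types the kernel would
actually walk:

```c
#include <stdint.h>
#include <stddef.h>

/* Flag bits from the ACPI SRAT Memory Affinity structure. */
#define SRAT_MEM_ENABLED        (1u << 0)
#define SRAT_MEM_HOT_PLUGGABLE  (1u << 1)

/* Simplified stand-in for one parsed SRAT memory affinity entry. */
struct srat_mem_entry {
	uint64_t base;    /* start of the range */
	uint64_t length;  /* size of the range in bytes */
	uint32_t flags;   /* SRAT_MEM_* bits above */
};

/*
 * Hypothetical hotplug_possible(): returns 1 if any enabled SRAT
 * memory range is marked hot-pluggable, 0 otherwise.
 */
int hotplug_possible(const struct srat_mem_entry *entries, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		uint32_t f = entries[i].flags;

		if ((f & SRAT_MEM_ENABLED) && (f & SRAT_MEM_HOT_PLUGGABLE))
			return 1;
	}
	return 0;
}
```

In the kernel the equivalent decision would have to be made from the
already-parsed SRAT data at pci_swiotlb_detect() time.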

>
> btw, you need more work to enable switch on the fly.
>
> You need to change the dma_ops pointer (see get_dma_ops()). It means
> that you need to track outstanding dma operations per device, locking,
> etc.

Yeah, though if we are doing this at swiotlb_init time, i.e. at boot,
as suggested in the pseudo patch, we don't need to worry about all this,
right?

Thanks,
Alok

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Konrad Rzeszutek Wilk on
On Thu, Jul 22, 2010 at 11:34:40AM -0700, Alok Kataria wrote:
> Hi,
>
> On Wed, 2010-07-21 at 17:03 -0700, FUJITA Tomonori wrote:
> > On Thu, 22 Jul 2010 08:44:42 +0900
> > FUJITA Tomonori <fujita.tomonori(a)lab.ntt.co.jp> wrote:
> >
> > > On Wed, 21 Jul 2010 10:13:34 -0700
> > > Alok Kataria <akataria(a)vmware.com> wrote:
> > >
> > > > > Basically, you want to add hot-plug memory and enable swiotlb, right?
> > > >
> > > > Not really, I am planning to do something like this,
> > > >
> > > > @@ -52,7 +52,7 @@ int __init pci_swiotlb_detect(void)
> > > >
> > > > /* don't initialize swiotlb if iommu=off (no_iommu=1) */
> > > > #ifdef CONFIG_X86_64
> > > > - if (!no_iommu && max_pfn > MAX_DMA32_PFN)
> > > > + if (!no_iommu && (max_pfn > MAX_DMA32_PFN || hotplug_possible()))
> > > > swiotlb = 1;
> > >
> > > Always enable swiotlb with memory hotplug enabled?
>
> yep though only on systems which have hotpluggable memory support.

What machines are there that have hotplug support and no hardware IOMMU?
I know of the IBM ones - but they use the Calgary IOMMU.
>
> > Wasting 64MB on a
> > > x86_64 system with 128MB doesn't look to be a good idea. I don't think
> > > that there is an easy solution for this issue though.
>
> Good now that you agree that, that's the only feasible solution, do you
> have any suggestions for any interfaces that are available from SRAT for
> implementing hotplug_possible ?

I thought SRAT has NUMA affinity information - so for example my AMD
desktop box has that, but it does not support hotplug capability.

I think your 'hotplug_possible' code first needs to be more specific -
not just check whether SRAT exists, but also whether there are swaths of
memory that are non-populated. It would also help if there were some
indication of whether the box truly supports hardware hotplug - is there
a way to determine this?

>
> >
> > btw, you need more work to enable switch on the fly.
> >
> > You need to change the dma_ops pointer (see get_dma_ops()). It means
> > that you need to track outstanding dma operations per device, locking,
> > etc.
>
> Yeah though if we are doing this during swiotlb_init time i.e. at bootup
> as suggested in the pseudo patch, we don't need to worry about all this,
> right ?

Right..
>
> Thanks,
> Alok
From: Andi Kleen on

> I thought SRAT has NUMA affinity information - so for example my AMD
> desktop box has that, but it does not support hotplug capability.
>
> I think first your 'hotplug_possible' code needs to be more specific -
> not just check if SRAT exists, but also if there are swaths of memory
> that are non-populated. It would also help if there was some indication
> of whether the box truly does a hardware hotplug - is there a way to do
> this?

The SRAT declares hotplug memory ranges in advance. And Linux already
uses this information in the SRAT parser (just the code for doing this
is a bit dumb, I have a rewrite somewhere).

The only drawback is that some older systems claimed to have large
hotplug memory ranges when they didn't actually support it. So it's
better to not do anything with a lot of overhead.

So yes, it would be reasonable to let swiotlb (and possibly other code
sizing itself based on memory) call into the SRAT parser and check the
hotplug ranges too.

BTW, longer term swiotlb should really be more dynamic anyway, and grow
and shrink on demand. I attempted this some time ago with my DMA
allocator patchkit; unfortunately that didn't go forward.

-Andi

From: Konrad Rzeszutek Wilk on
On Fri, Jul 23, 2010 at 04:33:32PM +0200, Andi Kleen wrote:
>
> >I thought SRAT has NUMA affinity information - so for example my AMD
> >desktop box has that, but it does not support hotplug capability.
> >
> >I think first your 'hotplug_possible' code needs to be more specific -
> >not just check if SRAT exists, but also if there are swaths of memory
> >that are non-populated. It would also help if there was some indication
> >of whether the box truly does a hardware hotplug - is there a way to do
> >this?
>
> The SRAT declares hotplug memory ranges in advance. And Linux
> already uses this
> information in the SRAT parser (just the code for doing this is a
> bit dumb, I have a rewrite
> somewhere)
>
> The only drawback is that some older systems claimed to have large
> hotplug memory ranges
> when they didn't actually support it. So it's better to not do
> anything with a lot
> of overhead.
>
> So yes it would be reasonable to let swiotlb (and possibly other
> code sizing itself
> based on memory) call into the SRAT parser and check the hotplug ranges too.
>
> BTW longer term swiotlb should be really more dynamic anyways and grow
> and shrink on demand. I attempted this some time ago with my DMA

I was thinking about this at some point. I think the first step is to
make SWIOTLB use debugfs to actually report how much of its buffer is
used - and see whether the 64MB is a good fit.
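
As a rough illustration of the kind of statistic such a debugfs file
could export, here is a tiny user-space model of slot accounting with a
high-water mark. All names are made up for the sketch; this is not the
real swiotlb bookkeeping:

```c
#include <stddef.h>

/* Toy model of bounce-buffer slot accounting; not kernel code. */
struct slot_stats {
	size_t total;      /* total bounce-buffer slots */
	size_t in_use;     /* currently allocated slots */
	size_t high_water; /* peak usage observed so far */
};

/* Record an allocation of 'n' slots; 0 on success, -1 if it would overflow. */
int slots_alloc(struct slot_stats *s, size_t n)
{
	if (s->in_use + n > s->total)
		return -1;	/* the buffer is exhausted */
	s->in_use += n;
	if (s->in_use > s->high_water)
		s->high_water = s->in_use;
	return 0;
}

/* Record freeing of 'n' slots. */
void slots_free(struct slot_stats *s, size_t n)
{
	s->in_use = (n > s->in_use) ? 0 : s->in_use - n;
}
```

A debugfs file would then just print total, in_use, and high_water, which
is exactly what you need to judge whether 64MB is over- or under-sized.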

The shrinking part scares me - I think it might be more prudent to first
explore how to grow it. The big problem looks to be allocating a
physically contiguous set of pages. And I guess SWIOTLB would need to
change from using one big region to something of a pool system?
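
A pool-based layout might look roughly like this - purely a sketch with
invented names, showing growth by chaining fixed-size pools rather than
resizing one contiguous region:

```c
#include <stdlib.h>
#include <stddef.h>

/* One fixed-size bounce pool in a chain; a sketch, not kernel code. */
struct bounce_pool {
	struct bounce_pool *next; /* next pool in the chain */
	size_t slots;             /* slots in this pool */
	size_t used;              /* slots handed out */
};

/*
 * Allocate one slot, growing the chain with a fresh pool when all
 * existing pools are full. Returns the pool the slot came from,
 * or NULL on allocation failure.
 */
struct bounce_pool *pool_alloc_slot(struct bounce_pool **head,
				    size_t pool_slots)
{
	struct bounce_pool *p;

	for (p = *head; p; p = p->next) {
		if (p->used < p->slots) {
			p->used++;
			return p;
		}
	}
	/* All pools full: grow by prepending a new pool. */
	p = calloc(1, sizeof(*p));
	if (!p)
		return NULL;
	p->slots = pool_slots;
	p->used = 1;
	p->next = *head;
	*head = p;
	return p;
}
```

Each pool only needs to be physically contiguous within itself, which
sidesteps the problem of growing one huge contiguous region at runtime.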

> allocator patchkit,
> unfortunately that didn't go forward.

I wasn't around at that time, so I don't know what the issues were -
would you have a link to the LKML discussion?
From: Andi Kleen on

> I was thinking about this at some point. I think the first step is to
> make SWIOTLB use the debugfs to actually print out how much of its
> buffers are used - and see if the 64MB is a good fit.

swiotlb is nearly always wrongly sized. For most systems it's far too
much, but for some not enough. I have some systemtap scripts around to
instrument it.

Also it depends on the IO load, so if you size it reasonably you risk
overflow on large IO (however, these days this very rarely happens,
because all "serious" IO devices don't need swiotlb anymore).

The other problem is that using only two bits for the needed address
space is also extremely inefficient (4GB and 16MB on x86). Really we
want masks everywhere, and to optimize for the actual requirements.
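
The mask-based approach boils down to checking a buffer against each
device's actual DMA mask instead of bucketing everything into the two
fixed zones. A rough sketch, loosely modeled on the spirit of the
kernel's dma_capable() (the helper name here is invented):

```c
#include <stdint.h>

/*
 * Does the range [addr, addr + size) fit within what a device with
 * the given DMA mask can address? Illustrative only; the kernel's
 * dma_capable() is the real-world analogue.
 */
int fits_dma_mask(uint64_t addr, uint64_t size, uint64_t mask)
{
	if (size == 0 || addr + size < addr)	/* empty or wrapping range */
		return 0;
	return addr + size - 1 <= mask;
}
```

With per-device masks, a bounce buffer is only needed when this check
fails, rather than whenever memory happens to fall outside a fixed zone.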

> The shrinking part scares me - I think it might be more prudent to first
> explore on how to grow it. The big problem looks to allocate a physical
> contiguity set of pages. And I guess SWIOTLB would need to change from
> using one big region to something of a pool system?
>

Shrinking: you define a movable zone, so with some delay it can always
be freed.

The problem with swiotlb, however, is that it still cannot block, though
it can adapt to load.

The real fix would be a blockable swiotlb, but the way drivers are set
up this is difficult (at least in kernels using spinlocks).

>> allocator patchkit,
>> unfortunately that didn't go forward.
> I wasn't present at that time so I don't know what the issues were - you
> wouldn't have a link to LKML for this?


There wasn't all that much opposition, but I ran out of time, because
fixing the infrastructure (e.g. getting rid of all of GFP_DMA) is a lot
of work. In a sense it's a big tree-sweep project, like getting rid of
the BKL.

The old patch kit is at ftp://firstfloor.org/pub/ak/dma/
"intro" has the rationale.

I have a slightly newer version of the SCSI & misc drivers patchkit
somewhere.

-Andi
