From: Catalin Marinas on
On Sun, 2010-03-07 at 08:23 +0000, Pavel Machek wrote:
> > > Seems like ARM has a requirement other architectures do not, one that is
> > > a) not documented anywhere
> > > b) causes problems
> >
> > Well, ARM is pretty similar to other architectures in this respect, and
> > I'm sure other architectures have similar problems; they only become
> > visible in circumstances they may not have encountered yet (e.g. PIO
> > drivers + a filesystem that doesn't call flush_dcache_page, like ext*).
> > Some other architectures may simply do heavier flushing.
> >
> > Of course, a Documentation/arm/cachetlb.txt file would make sense.
>
> Actually, short/simple documentation for driver authors would be even
> better. Then you can claim it is a bug in the driver :-).

That would help, but only once we agree on whether it's a driver bug or
the arch code that needs changing.

--
Catalin

From: Catalin Marinas on
On Sat, 2010-03-06 at 19:36 +0000, Russell King - ARM Linux wrote:
> On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> > On a fault-in of exec data, we first try to get the page out of the page
> > cache. If it's not present, we put the faulting process to sleep and
> > fetch it in from storage. When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty. Some time later, we place the
> > page into user space (updating the PTE entry that caused the fault). At
> > this point, we'll call both flush_icache_page() and update_mmu_cache()
> > ... this is where the I/D resolution should be done.
>
> No - this is where things get extremely icky.
>
> The problem at this point occurs on SMP architectures. As soon as you
> update the PTE entry, it is visible to other threads of the application.
> If you do I-cache handling after updating the PTE, then there is a window
> where another CPU can execute the page:
>
>   CPU0                                CPU1
>                                       speculatively prefetches from page N
>                                       via kernel mapping, loads garbage
>                                       into I-cache
>   attempts to execute P
>   page fault
>   page N allocated
>   set_pte_at
>                                       executes P
>                                       *splat*
>   flush I-cache

You have two choices: either invalidate the I-cache before the user PTE
becomes visible, or set the page as non-executable in set_pte_at() and
later mark it as executable in update_mmu_cache() (via set_pte_ext).

We currently invalidate the whole I-cache for historical reasons, but we
could actually invalidate only the single page. Since even on the latest
ARM CPUs the I-cache is genuinely VIPT (i.e. it can have aliases), we
would need to invalidate via the user mapping (or create a temporary
one). The latter approach of clearing the X bit in set_pte_at() may
actually help with this scenario (I haven't done any tests though).
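
Something along these lines, just to illustrate the second option
(completely untested; pte_exec_wanted, pte_mknexec_hw, pte_mkexec_hw and
icache_inval_page are made-up names rather than anything we actually
have):

void set_pte_at(struct mm_struct *mm, unsigned long addr,
		pte_t *ptep, pte_t pte)
{
	/*
	 * Install the mapping with the hardware exec bit clear so that
	 * no CPU can fetch instructions from the page yet; the fact
	 * that exec was requested is kept in a software bit.
	 */
	if (pte_present(pte) && pte_exec_wanted(pte))
		pte = pte_mknexec_hw(pte);
	set_pte_ext(ptep, pte, 0);
}

void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr,
		      pte_t *ptep)
{
	pte_t pte = *ptep;

	if (!pte_exec_wanted(pte))
		return;

	/*
	 * The user mapping exists now (though non-executable), so only
	 * this page needs invalidating, via the user address, instead
	 * of the whole I-cache.
	 */
	icache_inval_page(addr & PAGE_MASK);
	set_pte_ext(ptep, pte_mkexec_hw(pte), 0);
}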

--
Catalin

From: Benjamin Herrenschmidt on
On Sun, 2010-03-07 at 09:07 +0530, James Bottomley wrote:
> So, assuming full congruence of user space, can't you use the VMA as an
> indicator? i.e. if we have no user space mappings, we have to flush the
> icache ... if we have one or more, the icache has been flushed and
> placing the same page congruently in a different address space benefits
> from that prior flush, so consequently there's no need to flush again?

The VMA? Or do you mean struct page->mapping? That would work, I suppose,
in the case where we want to flush the I-cache for all pages mapped into
user space. But on processors that support per-page execute permission,
we really only want to (lazily) flush pages that are actually executed
from. In that case, we do need a dedicated bit to keep track of whether
a given page has been flushed already.
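
Roughly what I have in mind, as a totally untested sketch (using
PG_arch_1 to mean "I-cache already flushed for this page", and doing it
all in update_mmu_cache, are assumptions for illustration, not what any
particular port does today):

void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr,
		      pte_t *ptep)
{
	struct page *page = pte_page(*ptep);

	/* Not an executable mapping: defer any I-cache work. */
	if (!(vma->vm_flags & VM_EXEC))
		return;

	/*
	 * Flush only the first time the page becomes executable; a
	 * real implementation would also need to write back the
	 * kernel D-cache alias before invalidating the I-cache.
	 */
	if (!test_and_set_bit(PG_arch_1, &page->flags))
		flush_icache_page(vma, page);
}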

> I also think we've established the relevant facts for the I/O side of
> this thread (that on PIO reads we only need to either flush the kernel
> D-cache or mark it to be flushed later). We're now into deep
> technicalities of how the mm system operates at the architecture level,
> so perhaps we should move this to linux-arch?

No objection, though moving threads after the fact is a recipe for
trouble :-)

Cheers,
Ben.


From: Paul Mundt on
On Fri, Mar 05, 2010 at 03:44:55PM +1100, Benjamin Herrenschmidt wrote:
> > For these parts the PG_dcache_dirty approach
> > saves us from a lot of flushing, and the corner cases were isolated
> > enough that we could tolerate fixups at the driver level, even on a
> > write-allocate D-cache.
>
> But how wide a range of devices do you have to support with those? Is
> this a few SoCs, or people putting any random PCI device in there, for
> example?
>
> If I were to do it that way on ppc32, I'd worry that it would be more
> than a few drivers that I would have to fix :-) All the 32-bit PowerMacs
> and PowerBooks for example, all of the Freescale 74xx based parts, etc...
> those guys have PCI, and all sorts of random HW plugged into them.
>
Many of those parts do support PCI, but are rarely used with arbitrary
devices. The PCI controller on those parts also permits one to establish
coherency for any transactions between PCI and memory through a rudimentary
snoop controller that requires the CPU to avoid entering any sleep
states. This works OK in practice since that series of host controllers
doesn't really support power management anyway (nor do any of the cores
of that generation implement any of the more complex sleep states).

> > For second generation SH-4A (SH-X2) and up parts, read and exec are split
> > out and we could reasonably adopt the PG_dcache_clean approach there
> > while adopting the same sort of flushing semantics as PPC to avoid
> > flushing constantly. The current generation of parts far outnumber their
> > legacy counterparts, so it's certainly something I plan to experiment
> > with.
>
> I'd be curious to see whether you get a perf improvement with that.
>
> Note that we still have this additional thing floating around in this
> thread which I think is definitely worthwhile to do, which is to mark
> clean pages that have been written to with DMA in dma_unmap and
> friends... if we can fix the icache problem. So far, I haven't found
> James's replies on this satisfactory :-) But maybe I just missed
> something.
>
I'll start in on profiling some of this once I start on 2.6.35 stuff. I
think I still have my old numbers from when we did the PG_mapped to
PG_dcache_dirty transition, so it will be interesting to see how
PG_dcache_clean stacks up against both of those.
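
For the dma_unmap side of it that Ben mentions above, I imagine the hook
would look something like this (names are made up, it assumes
PG_dcache_clean is the usual PG_arch_1 alias and that the buffer is
physically contiguous):

static void dma_mark_pages_clean(struct page *page, size_t size)
{
	unsigned long i, nr = PAGE_ALIGN(size) >> PAGE_SHIFT;

	/*
	 * After a DMA_FROM_DEVICE transfer the data is already in RAM
	 * and the streaming DMA API has dealt with the CPU D-cache, so
	 * these pages no longer need flush_dcache_page() treatment;
	 * the I-cache still has to be handled when a page gets mapped
	 * executable.
	 */
	for (i = 0; i < nr; i++)
		set_bit(PG_dcache_clean, &page[i].flags);
}

Presumably it would be called from dma_unmap_page()/dma_unmap_sg() for
the DMA_FROM_DEVICE and DMA_BIDIRECTIONAL cases.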

> > We have an additional level of complexity on some of the SMP parts with a
> > non-coherent I-cache,
>
> I have that on some embedded ppc's too, where the icache flush
> instructions aren't broadcast, like ARM11MP in fact. Pretty horrible.
> Fortunately, so far nobody sane (apart from BlueGene) has done an SMP
> part with those, and so we have well-localized internal hacks for them.
> But I've heard that some vendors might soon be pumping out SoCs with
> that stuff too, which worries me.
>
I-cache invalidations are broadcast on all mass produced SH-4A SMP parts,
but we do have some early proto chips that screwed that up. For the case
of mainline, we ought to be able to assume hardware broadcast though.

> > some of the early CPUs have broken broadcasting of
> > the cacheops in hardware and so need to rely on IPIs, while the later
> > parts broadcast properly. We also need to deal with D-cache IPIs when
> > using mixed coherency protocols on different CPUs.
>
> Right, that sucks. Do those have no-exec permission support? If they
> do, then you can do what I did for BG, which is to ping-pong user pages
> so they are either writable or executable (since userspace code itself
> will break as it will assume the cache ops -are- broadcast, since that's
> what the architecture says).
>
Yes, these all support no-exec. I'll give the ping-ponging thing a try,
thanks for the tip.
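
From your description I take it the fault-path adjustment would be
something like the following (all the helpers here beyond the generic
pte_wrprotect/pte_mkwrite are invented, just to check I have the idea
right):

static pte_t pingpong_adjust(struct vm_area_struct *vma, pte_t pte,
			     bool exec_fault)
{
	if (exec_fault) {
		/*
		 * About to execute: flush this CPU's caches for the
		 * page, grant exec and revoke write so that a later
		 * write faults again.
		 */
		local_flush_icache_page(vma, pte_page(pte));
		pte = pte_mkexec(pte_wrprotect(pte));
	} else {
		/*
		 * About to write: revoke exec so the next execution
		 * faults and re-flushes on whichever CPU runs it.
		 */
		pte = pte_mkwrite(pte_mknexec(pte));
	}
	return pte;
}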

> Do you also, like ARM11MP, have the case of non-cache-coherent DMA and
> non-broadcast cache ops in SMP? That's somewhat of a killer; I still
> don't see how it can be dealt with properly other than by using
> load/store tricks to bring the data into the local cache and flushing
> it from there. DMA ops are called way too deep into spinlock hell to
> rely on IPIs

The only thing we really lack is I-cache coherency, which isn't such a
big deal with invalidations being broadcast. All DMA accesses are
snooped, and the D-cache is fully coherent.

> (unless your HW also provides some kind of NMI IPIs).
>
While we don't have anything like FIQs to work with, we do have IRQ
priority levels to play with. I've toyed in the past with the idea of
simply having a reserved priority level that never gets masked,
particularly for things like broadcast backtraces.

> > Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
> > and certainly worth experimenting with. I don't know how we would do the
> > I-cache optimization without a PG_arch_2, though.
>
Right. That's the one thing I've been trying to figure out without
success. But then, is it a big deal to add PG_arch_2? It doesn't sound
like it to me...
>
Well, it does start to get a bit painful with sparsemem section or NUMA
node IDs also digging into the page flags on 32-bit... the benefits
would have to be pretty compelling to offset the pain.
From: Benjamin Herrenschmidt on
On Wed, 2010-03-10 at 12:52 +0900, Paul Mundt wrote:
> Well, it does start to get a bit painful with sparsemem section or NUMA
> node IDs also digging into the page flags on 32-bit... the benefits
> would have to be pretty compelling to offset the pain.

Unless we play a dangerous trick and re-use another flag that isn't
meaningful for allocated pages... maybe PG_buddy? Or am I missing
something about that guy's semantics?

Cheers,
Ben.
