From: Benjamin Herrenschmidt on

> Basically we have two different MMUs on VIPT parts, the older one on all
> SH-4 parts were all read-implies-exec with no ability to differentiate
> between read or exec access.

Ok, this is the same as the older ppc32 processors.

> For these parts the PG_dcache_dirty approach
> saves us from a lot of flushing, and the corner cases were isolated
> enough that we could tolerate fixups at the driver level, even on a
> write-allocate D-cache.

But how wide a range of devices do you have to support with those ? Is
this a few SoCs or people putting any random PCI device in there for
example ?

If I were to do it that way on ppc32, I worried that it would be more
than a few drivers that I would have to fix :-) All the 32-bit PowerMac
and PowerBooks for example, all of freescale 74xx based parts, etc...
those guys have PCI, and all sort of random HW plugged into them.

I would -love- to avoid that horrible amount of flushing we do on these,
it's quite high on any profile run, but I haven't found a good way to do
so. There's also a nasty issue of icache content leaking between
processes which I doubt is exploitable but I had people having a go at
me about it when I tried to avoid icache cleaning anonymous pages by
default.

> For second generation SH-4A (SH-X2) and up parts, read and exec are split
> out and we could reasonably adopt the PG_dcache_clean approach there
> while adopting the same sort of flushing semantics as PPC to avoid
> flushing constantly. The current generation of parts far outnumber their
> legacy counterparts, so it's certainly something I plan to experiment
> with.

I'd be curious to see whether you get a perf imporovement with that.

Note that we still have this additional thing that is floating around in
this thread which I thing is definitely worthwhile to do, which is to
mark clean pages that have been written to with DMA in dma_unmap and
friends.... if we can fix the icache problem. So far, I haven't found
James replies on this satisfactory :-) But maybe I just missed
something.

> We have an additional level of complexity on some of the SMP parts with a
> non-coherent I-cache,

I've that on some embedded ppc's too, where the icache flush instrutions
aren't broadcast, like ARM11MP in fact. Pretty horrible. Fortunately
today nobody sane (appart from Bluegene) did an SMP part with those and
so we have well localized internal hacks for them. But I've heared that
some vendors might be pumping out SoCs with that stuff too soon which
worries me.

> some of the early CPUs have broken broadcasting of
> the cacheops in hardware and so need to rely on IPIs, while the later
> parts broadcast properly. We also need to deal with D-cache IPIs when
> using mixed coherency protocols on different CPUs.

Right, that sucks. Do those have no-exec permission support ? If they
do, then you can do what I did for BG, which is to ping pong user pages
so they are either writable or executable (since userspace code itself
will break as it will assume the cache ops -are- broadcast, since that's
what the architecture says).

> For older PIPT parts we've never used the deferred flush, since the only
> time we ever had to bother with cache maintenance was in the DMA ops, as
> anything closer to the CPU than the PCI DMAC had no opportunity to be
> snooped.

Do you also, like ARM11MP, have a case of non-cache coherent DMA and
non-broadcast cache ops in SMP ? That's somewhat of a killer, I still
don't see how it can be dealt properly other than using load/store
tricks to bring the data into the local cache and flushing it from
there. DMA ops are called way to deep into spinlock hell to rely on IPIs
(unless your HW also provides some kind of NMI IPIs).

> > > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > > more aggressive. For the DMA devices, Russell suggested that we mark
> > > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > > default flushing.
> >
> > I really like that idea, as I said earlier, but I'm worried about the I$
> > side of things. IE. What I'm trying to say is that I can't see how to do
> > that optimisation without ending up with missing I$ invalidations or
> > doing way too many of them, unless we have a separate bit to track I$
> > state.
> >
> Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
> and certainly worth experimenting with. I don't know how we would do the
> I-cache optimization without a PG_arch_2, though.

Right. That's the one thing I've been trying to figure out without
success. But then, is it a big deal to add PG_arch_2 ? doesn't sound
like it to me...

> In any event, if there's going to be a mass exodus to PG_dcache_clean,
> Documentation/cachetlb.txt could use a considerable amount of expanding.
> The read/exec and I-cache optimizations are something that would be
> valuable to document, as opposed to simply being pointed at the sparc64
> approach with the regular PG_dcache_dirty caveats.

Cheers,
Ben.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo(a)vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Catalin Marinas on
On Fri, 2010-03-05 at 04:34 +0000, Benjamin Herrenschmidt wrote:
> On Thu, 2010-03-04 at 22:11 +0000, Catalin Marinas wrote:
> > On Thu, 2010-03-04 at 21:37 +0000, Benjamin Herrenschmidt wrote:
> > > On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> > > > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > > > more aggressive. For the DMA devices, Russell suggested that we mark
> > > > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > > > default flushing.
> > >
> > > I really like that idea, as I said earlier, but I'm worried about the I$
> > > side of things. IE. What I'm trying to say is that I can't see how to do
> > > that optimisation without ending up with missing I$ invalidations or
> > > doing way too many of them, unless we have a separate bit to track I$
> > > state.
> >
> > But does this optimisation really matter? I think with careful checking
> > in set_pte_at(), you are not going to invalidate the I-cache more than
> > necessary. If the original page wasn't pte_present() you would need to
> > do the I-cache invalidation. The other cases where set_pte_at() is
> > called for LRU (pte_young) or COW (pte_write) we can avoid the extra
> > invalidation.
>
> No. Not on PIPT (or non aliasing VIPT).
>
> Take your typical glibc text page. This is a struct page that will be
> mapped in almost every process in your system. You do not want to do the
> icache inval every time. Once it's been cleaned once, it's clean for
> subsequent mappings. Only VIVT needs such multiple invalidates I suppose
> though in this case you probably do everything differently anyways.

Yes, you are right, shared libraries don't need the extra flushing with
PIPT caches.

Thanks.

--
Catalin

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: James Bottomley on
On Thu, 2010-03-04 at 14:27 +0000, Russell King - ARM Linux wrote:
> On Thu, Mar 04, 2010 at 07:51:52PM +0530, James Bottomley wrote:
> > On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > > Seems like ARM has requirement other architectures do not, that is
> > > a) not documented anywhere
> > > b) causes problems
> > >
> > > You could argue that performance improvement (how big is it, anyway?)
> > > is worth it, but this should be agreed to by wider community...
> >
> > Performance is always worth it provided we don't sacrifice correctness.
> > The thing which was discovered in this thread is basically that ARM is
> > handling deferred flushing (for D/I coherency) in a slightly different
> > way from everyone else ... once that's fixed, ARM will likely not have
> > the D/I problem, but we'll still have the libata (and other PIO systems)
> > D flushing issue.
>
> I think you've got that backwards.
>
> Reversing the meaning of PG_arch_1 will probably fix the D aliasing issue -
> since we'll interpret '0' to mean "page is dirty, it needs flushing before
> hitting userspace", whereas '1' means "page has been cleaned; there are no
> aliases."

Yes, that looks about right ... I'll think about doing this for parisc
as well.

> This doesn not address the I/D coherency issue, where the Icache needs
> attention to get rid of speculatively loaded cache lines while old data
> was present in the cache.

No, I understand that. However, I/D coherency is handled way after the
writes to the page in the page cache.

On a fault in of exec data, we first try to get the page out of the page
cache. If it's not present, we put the faulting process to sleep and
fetch it in from storage. When we do the read, on the PIO path, the
kernel alias for the page becomes dirty. Some time later, we place the
page into the user space (updating the pte entry that caused a fault).
At this point, we'll call both flush_icache_page() and
update_mmu_cache() ... this is where the I/D resolution should be done.
Since it's after any I/O has occurred, it doesn't matter whether the CPU
speculatively moved anything in or not. As long as you flush the kernel
alias and invalidate the user I and D aliases, we're good to go. Using
the page arch flags is really only to optimise this process (defer
kernel D alias flushing).

James


James


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Russell King - ARM Linux on
On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> On a fault in of exec data, we first try to get the page out of the page
> cache. If it's not present, we put the faulting process to sleep and
> fetch it in from storage. When we do the read, on the PIO path, the
> kernel alias for the page becomes dirty. Some time later, we place the
> page into the user space (updating the pte entry that caused a fault).
> At this point, we'll call both flush_icache_page() and
> update_mmu_cache() ... this is where the I/D resolution should be done.

No - this is where things get extremely icky.

The problem at this point occurs on SMP architectures. As soon as you
update the PTE entry, it is visible to other threads of the application.
If you do I-cache handling after updating the PTE, then there is a window
where another CPU can execute the page:

CPU0 CPU1
speculatively prefetches from page N via kernel
mapping, loads garbage into I-cache
attempts to execute P
page fault
page N allocated
set_pte_at
executes P
*splat*
flush I-cache
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Benjamin Herrenschmidt on
On Sat, 2010-03-06 at 16:17 +0530, James Bottomley wrote:
> On a fault in of exec data, we first try to get the page out of the page
> cache. If it's not present, we put the faulting process to sleep and
> fetch it in from storage. When we do the read, on the PIO path, the
> kernel alias for the page becomes dirty. Some time later, we place the
> page into the user space (updating the pte entry that caused a fault).
> At this point, we'll call both flush_icache_page() and
> update_mmu_cache() ... this is where the I/D resolution should be done.
> Since it's after any I/O has occurred, it doesn't matter whether the CPU
> speculatively moved anything in or not. As long as you flush the kernel
> alias and invalidate the user I and D aliases, we're good to go. Using
> the page arch flags is really only to optimise this process (defer
> kernel D alias flushing).

Ok, so while flush_icache_page() looks like something we could use
instead of set_pte_at() for the icache flushing, it doesn't answer all
the questions. Off the top of my mind:

- I see the calls to flush_icache_page() in mm/memory.c but I don't see
them next to all set_pte_at() that insert a valid PTE. For example, we
don't flush the icache for anonymous pages. While that might seem like a
good idea, we have been under pressure to "fix" that on powerpc to make
sure there is no stale icache content from another process leaking into
userspace.

- It needs to be done -before- set_pte_at() but I think the code does it
right, only your explanation above makes it unclear :-)

- It doesn't take the PTE pointer as an argument, so here goes our trick
on powerpc of filtering out exec permission rather than flushing when a
page is accessed by a read fault

- We -still- have the problem of tracking whether the icache has been
flushed or not yet for a given physical page on archs with PIPT (or non
aliasing VIPT) like powerpc. Without that tracking, we flush a lot more
than necessary since we'll end up flushing things like glibc text pages
for every process they are mapped into which is totally wasteful. Thus
the idea of using a new PG bit to separate D$ from I$ tracking still
makes sense.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/