From: Benjamin Herrenschmidt on
On Sat, 2010-03-06 at 19:36 +0000, Russell King - ARM Linux wrote:
> On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> > On a fault in of exec data, we first try to get the page out of the page
> > cache. If it's not present, we put the faulting process to sleep and
> > fetch it in from storage. When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty. Some time later, we place the
> > page into the user space (updating the pte entry that caused a fault).
> > At this point, we'll call both flush_icache_page() and
> > update_mmu_cache() ... this is where the I/D resolution should be done.
>
> No - this is where things get extremely icky.
>
> The problem at this point occurs on SMP architectures. As soon as you
> update the PTE entry, it is visible to other threads of the application.
> If you do I-cache handling after updating the PTE, then there is a window
> where another CPU can execute the page:

Right, we actually hit that bug on powerpc, however, James explanation
is misleading, ie, I think the -code- actually is right and
flush_icache_page() is called before set_pte_at(). However, see my other
email, I have other issues with it as it is, but nothing unfixable.

So for now, I keep my flush in set_pte_at() and ptep_set_access_flags(),
we'll see if I can move that to an improved flush_icache_page(). In
fact, even set_pte_at() isn't a panacea for me, as I want the fault type
as well.

Cheers,
Ben.

> CPU0 CPU1
> speculatively prefetches from page N via kernel
> mapping, loads garbage into I-cache
> attempts to execute P
> page fault
> page N allocated
> set_pte_at
> executes P
> *splat*
> flush I-cache


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: James Bottomley on
On Sun, 2010-03-07 at 08:03 +1100, Benjamin Herrenschmidt wrote:
> On Sat, 2010-03-06 at 16:17 +0530, James Bottomley wrote:
> > On a fault in of exec data, we first try to get the page out of the page
> > cache. If it's not present, we put the faulting process to sleep and
> > fetch it in from storage. When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty. Some time later, we place the
> > page into the user space (updating the pte entry that caused a fault).
> > At this point, we'll call both flush_icache_page() and
> > update_mmu_cache() ... this is where the I/D resolution should be done.
> > Since it's after any I/O has occurred, it doesn't matter whether the CPU
> > speculatively moved anything in or not. As long as you flush the kernel
> > alias and invalidate the user I and D aliases, we're good to go. Using
> > the page arch flags is really only to optimise this process (defer
> > kernel D alias flushing).
>
> Ok, so while flush_icache_page() looks like something we could use
> instead of set_pte_at() for the icache flushing, it doesn't answer all
> the questions. Off the top of my mind:

OK, so what I was actually trying to get across is the point that we
don't handle I cache problems in the I/O or page cache code ... we
handle them in the mm code, so the mm piece of the above was
deliberately a bit vague.

> - I see the calls to flush_icache_page() in mm/memory.c but I don't see
> them next to all set_pte_at() that insert a valid PTE. For example, we
> don't flush the icache for anonymous pages. While that might seem like a
> good idea, we have been under pressure to "fix" that on powerpc to make
> sure there is no stale icache content from another process leaking into
> userspace.

I'm not entirely sure what flush_icache_page() is supposed to do. On
parisc it flushes the *kernel* icache ... which has got to be wrong.
According to cachetlb.txt it's an obsolete interface.

> - It needs to be done -before- set_pte_at() but I think the code does it
> right, only your explanation above makes it unclear :-)

Sorry, like I said, I only sketched the mm piece. However, at least on
parisc, there's a technical problem with flushing before we have the
pte: On VIPT systems, we need a mapping before the flush will work. I
was experimenting with a mechanism whereby we set aside in the kernel an
aligned region of our congruence size and simply flushed in that region
with the correct mappings, but we haven't got around to implementing it
in the kernel yet.

> - It doesn't take the PTE pointer as an argument, so here goes our trick
> on powerpc of filtering out exec permission rather than flushing when a
> page is accessed by a read fault
>
> - We -still- have the problem of tracking whether the icache has been
> flushed or not yet for a given physical page on archs with PIPT (or non
> aliasing VIPT) like powerpc. Without that tracking, we flush a lot more
> than necessary since we'll end up flushing things like glibc text pages
> for every process they are mapped into which is totally wasteful. Thus
> the idea of using a new PG bit to separate D$ from I$ tracking still
> makes sense.

So, assuming full congruence of user space, can't you use the VMA as an
indicator? i.e. if we have no user space mappings, we have to flush the
icache ... if we have one or more, the icache has been flushed and
placing the same page congruently in a different address space benefits
from that prior flush, so consequently there's no need to flush again?

I also think we've established the relevant facts for the I/O thread
(that we only need to either flush the kernel D cache or mark it as to
be flushed later on PIO reads). We're now into deep technicalities of
how the mm system operates at the architecture level, so perhaps we
should move this to linux-arch?

James


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: James Bottomley on
On Sat, 2010-03-06 at 19:36 +0000, Russell King - ARM Linux wrote:
> On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> > On a fault in of exec data, we first try to get the page out of the page
> > cache. If it's not present, we put the faulting process to sleep and
> > fetch it in from storage. When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty. Some time later, we place the
> > page into the user space (updating the pte entry that caused a fault).
> > At this point, we'll call both flush_icache_page() and
> > update_mmu_cache() ... this is where the I/D resolution should be done.
>
> No - this is where things get extremely icky.

OK, but the point I'm trying to make is that the page cache code,
including the I/O layer, only manages kernel D alias state (either by
flushing or marking it dirty). The user space I/D handling is done in
the mm code (I'm not claiming it's done correctly there, just claiming
it's done there).

> The problem at this point occurs on SMP architectures. As soon as you
> update the PTE entry, it is visible to other threads of the application.
> If you do I-cache handling after updating the PTE, then there is a window
> where another CPU can execute the page:
>
> CPU0 CPU1
> speculatively prefetches from page N via kernel
> mapping, loads garbage into I-cache
> attempts to execute P
> page fault
> page N allocated
> set_pte_at
> executes P
> *splat*
> flush I-cache

OK, so I can believe this. We see extremely rare segfaults on parisc
which look to be the result of some I flush race like this. However, I
think for a discussion of problems with the arch and mm interfaces, we
should probably move off the usb list and onto linux-arch.

Our specific problem on parisc is that being VIPT we can't do an I (or
D) user flush without a mapping. We have two schemes for fixing this:
One is to use a PAGE_FLUSH flag for the mapping ... it allows the
flushes to work but refuses any type of RWX access (can do this because
we have a software TLB). The other is to use a flush area within the
kernel where we flush a page congruent to the userspace address ... I
haven't got this working yet, and it's a bit wasteful of kernel address
space because our congruence modulus is 4MB.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Pavel Machek on
Hi!

> > Seems like ARM has requirement other architectures do not, that is
> > a) not documented anywhere
> > b) causes problems
>
> Well, ARM is pretty similar to other architectures in this respect. And
> I'm sure other architectures have similar problems, only that they only
> become visible in some circumstances they may not have encountered (i.e.
> PIO drivers + filesystem that doesn't call flush_dcache_page like ext*).
> Some other architectures may do heavier flushing
>
> Of course, a Documentation/arm/cachetlb.txt file would make sense.

Actually, short/simple documentation for driver authors would be even
better. Then you can claim it is bug in driver :-).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: FUJITA Tomonori on
On Sun, 07 Mar 2010 09:07:17 +0530
James Bottomley <James.Bottomley(a)HansenPartnership.com> wrote:

> So, assuming full congruence of user space, can't you use the VMA as an
> indicator? i.e. if we have no user space mappings, we have to flush the
> icache ... if we have one or more, the icache has been flushed and
> placing the same page congruently in a different address space benefits
> from that prior flush, so consequently there's no need to flush again?

I'm not sure about this (sounds like the trick might work for some
though). As I said earlier, I think that IA64 could avoid flushing
I-cache even if the page has no user space mappings (if it did dma to
the page). ia64 needs to track pages for that.

As Ben said, I guess that we need two separate bits for D and I. I
think that it's a good idea to standardize how to use the bits for
optimization (some uses none, some uses only one, some needs both
though). And then we need to revisit I/O path (fs, the block layer,
drivers). Seems that we added flush_dcache_page() everywhere.


> I also think we've established the relevant facts for the I/O thread
> (that we only need to either flush the kernel D cache or mark it as to
> be flushed later on PIO reads).

We have the PIO issue about D-cache aliasing now? That's, don't mm/ or
fs/ already flush D-cache properly? I thought that Catalin has only
D/I cache consistency issue. If not, PIO doesn't also work powerpc
that handles properly D/I cache consistency.


> We're now into deep technicalities of
> how the mm system operates at the architecture level, so perhaps we
> should move this to linux-arch?

Yeah, probably we should.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/