From: James Harris on
Anyone know of architectures where the TLB caches not-present page
table entries (as well as present ones)? The issue arises when a page
fault is taken. AIUI some CPUs require even a not-present TLB entry to
be invalidated whereas others, sensibly, simply avoid caching not-
present entries.

The Berkeley course CS162 (Operating systems and systems programming)
mentions the need to invalidate TLB entries apparently on any page
fault including a missing (not-present) page. I can't find any
documentation that says what the Intel 386 and 486 do.

On the other hand I have found documentation that says that later
Intel CPUs don't load not-present PTEs into the TLB. This info seems
to apply from at least the Pentium 1. Again, older x86 CPUs are not
mentioned. I've tried looking for info on other architectures without
much success.

A post on alt.os.development has got replies indicating surprise that
a not-present entry would be cached hence this post on comp.arch
asking if there are architectures where not-present entries are cached
in the TLB.

James
From: Andy Glew on
From comp.arch.

TBD: formalize this discussion of [[Caching Invalid Entries in TLB]]

= Original Post =

On 8/1/2010 2:55 PM, James Harris wrote:
> Anyone know of architectures where the TLB caches not-present page
> table entries (as well as present ones)? The issue arises when a page
> fault is taken. AIUI some CPUs require even a not-present TLB entry to
> be invalidated whereas others, sensibly, simply avoid caching not-
> present entries.
>
> The Berkeley course CS162 (Operating systems and systems programming)
> mentions the need to invalidate TLB entries apparently on any page
> fault including a missing (not-present) page. I can't find any
> documentation that says what the Intel 386 and 486 do.
>
> On the other hand I have found documentation that says that later
> Intel CPUs don't load not-present PTEs into the TLB. This info seems
> to apply from at least the Pentium 1. Again, older x86 CPUs are not
> mentioned. I've tried looking for info on other architectures without
> much success.
>
> A post on alt.os.development has got replies indicating surprise that
> a not-present entry would be cached hence this post on comp.arch
> asking if there are architectures where not-present entries are cached
> in the TLB.

= Glew Reply =

The issue has arisen from time to time.

I personally like the idea of sometimes caching entries that are invalid.

== Using Stale TLB Entries for Prediction ==

a) Especially if they once were valid, but are no longer - i.e. if the
PTE is basically valid, but the present bit was turned off, e.g. when
using the resulting trap to emulate things like A & D bits.

You can use the stale but formerly valid TLB entry as a predictor for
what the physical address will be: e.g. fetch the memory location, and
perform the TLB miss handling off on the side. Restart if incorrect.
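The idea above can be sketched as a toy Python model (all names and the
dict-based structures are mine, purely illustrative, not any real
microarchitecture):

```python
# Sketch: treating a stale (present-bit-clear) TLB entry as a physical-
# address predictor. A TLB entry is modelled as {"present": bool, "pfn": int};
# the page table is a plain dict, and None models a not-present PTE.

def walk(page_tables, vpn):
    return page_tables.get(vpn)          # None models a not-present PTE

def translate(tlb, page_tables, vpn):
    entry = tlb.get(vpn)
    if entry is not None and entry["present"]:
        return entry["pfn"]              # ordinary TLB hit
    if entry is not None:                # stale, formerly valid entry
        predicted_pfn = entry["pfn"]     # speculate: start the access here
        actual = walk(page_tables, vpn)  # ...while walking off on the side
        if actual == predicted_pfn:
            return predicted_pfn         # prediction confirmed
        tlb.pop(vpn)                     # mispredict: squash and restart
        return actual
    return walk(page_tables, vpn)        # plain miss: just walk
```

When the stale entry still matches the tables, the speculative access
wins the walk latency; on a mismatch the sketch discards the entry and
restarts, as the text describes.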

ISSUE: what if the memory type associated with a page of physical memory
changed, from WB writeback allowing speculative accesses to UC uncached,
disallowing such? On x86, it depends on your handling of the MTRRs. If
the MTRRs are authoritative, defining the worst case memory types for
physical memory, you can do what I describe above. If not - if the
MTRRs might say WB, but the PTE says UC memory mapped I/O - you can't.
Unfortunately, I believe the latter case is currently in force: i.e. I
believe that x86 cannot safely use invalid TLB entries to predict
physical addresses. Not without some architectural changes.

== Preventing Unnecessary Page Table Walks ==

b) Some processors will do page table walks speculatively. (Like, all
Intel processors since the P6, IIRC.) If you are off on a wrong path
and start getting spurious TLB misses, each such miss may start a page
table walk going. Now, many if not most machines may only allow one
such TLB miss to be in flight at any time - although that is, IMHO,
stupid. It is easy enough to make TLB misses occur concurrently. But
even if you don't have concurrent TLB misses, you might get extra,
unnecessary, TLB misses to the same invalid PTE.

Storing invalid PTEs inside the TLB would prevent such unnecessary page
table walks.

Of course, you would not want to thrash many valid entries out of the
TLB. So you might want some sort of mini-TLB for such entries, or a
Bloom filter, or some other such mechanism.

By the way, beware: many folks have the implicit assumption that TLB
misses are done in-order at retirement. T'ain't so.

Onur Mutlu published a paper that used events such as page faults as
indications of invalid speculation. He was kind enough to refer to some
unpublished work of mine. (Like, nearly all of my work has not been
published, except occasionally in patents.)

== PDE Caching ==

Let's see, are there any other reasons for invalid TLB entries?

Does caching the PDE (Page Directory Entry, or as I prefer to call it,
L2 PTE) cause any related issues?

Valid PDE / invalid PTE? Invalid PDE? / ...

Ah, yes, how about: you have a valid PDE pointing to a page (PT) of
invalid PTEs. You reference one of those addresses. Maybe you don't
cache the PTE, but you do cache the PDE.

Now, you decide to map the PDE to point to a different address, i.e. a
different PT. But your OS thinks "there were no valid PTEs here, so I
don't need to invalidate the TLB." But the stale PDE entry is cached.

Ooops. Next time you access one of the physical addresses covered by the
PDE, you hit the PDE cache (or some interior page table node cache), and
then go to memory, and pick up ... what? Picking up an invalid PTE
might not be so bad, but you might already have reallocated the old PT
page pointed to by the PDE to something else.

Moral: whenever you change page table structures, invalidate the TLB.
All, or in part.
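The hazard can be demonstrated with a toy two-level Python model (my own
illustrative names; the PDE cache is a dict, and None models a page
fault):

```python
# Sketch of the stale-PDE hazard: a cached upper-level entry (PDE) goes
# stale when the OS repoints it without flushing, because "there were no
# valid PTEs under it" is not a safe reason to skip the invalidation.

page_tables = {"pd": {0: "pt_old"},        # page directory
               "pt_old": {},               # a PT full of invalid PTEs
               "pt_new": {3: 0x77}}        # the replacement PT
pde_cache = {}

def translate(vpn_hi, vpn_lo):
    pt = pde_cache.get(vpn_hi)
    if pt is None:
        pt = page_tables["pd"][vpn_hi]
        pde_cache[vpn_hi] = pt             # cache the PDE even though the
                                           # PTE it leads to is invalid
    return page_tables[pt].get(vpn_lo)     # None models a page fault

assert translate(0, 3) is None             # faults; but PDE is now cached
page_tables["pd"][0] = "pt_new"            # OS repoints the PDE, no flush
assert translate(0, 3) is None             # WRONG: stale cached PDE hits
pde_cache.clear()                          # the moral: invalidate the TLB
assert translate(0, 3) == 0x77             # now the new mapping is seen
```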

== Rewalking TLB misses at fault commit ==

Thinking about it:

Caching invalid TLB entries should not be a problem if the CPU always
walks the page tables at retirement before reporting a fault -
potentially re-walking them: the speculative TLB miss may have reported
a fault, but since the page tables may have changed in the meantime, you
might want to be ultra-safe, invalidate any such cached TLB entry, and
redo the walk at retirement.

That's the sort of thing we might have done in the first generation of
OOO CPUs, just to be safe.

However, such a re-walk should not be necessary. If somebody removed it...
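The re-walk-at-retirement policy can be sketched in toy Python (my own
illustrative model; dicts for the TLB and page tables, an exception for
the architectural fault):

```python
# Sketch: the speculative walk may cache an invalid (None) entry and
# note a fault, but the fault only becomes architectural if a fresh
# walk at retirement still says the page is not present.

class PageFault(Exception):
    pass

def speculative_miss(tlb, page_tables, vpn):
    pfn = page_tables.get(vpn)
    tlb[vpn] = pfn                  # may cache an invalid (None) entry
    return pfn

def retire(tlb, page_tables, vpn):
    if tlb.get(vpn) is None:        # speculative walk reported a fault
        tlb.pop(vpn, None)          # drop the cached invalid entry
        pfn = page_tables.get(vpn)  # redo the walk at retirement
        if pfn is None:
            raise PageFault(vpn)    # only now is the fault architectural
        tlb[vpn] = pfn
        return pfn
    return tlb[vpn]
```

If the OS mapped the page between the speculative miss and commit, the
retirement re-walk quietly succeeds and no spurious fault is reported.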

== Memory Ordering of Page Table Walks ==

Oh, hey, another possibility: not caching invalid TLB entries, but
memory ordering related to it.

Some OS code might assume that

P1:
initially PTE(VA) = invalid
flag = 0
...
PTE(VA) := valid, some address VA->PA
flag := 1

is ordered, such that

P2:
if( flag == 1 )
then PTE is valid,
i.e. accessing VA will not fault

This is, unfortunately, not true.

The PTE for VA, PTE(VA), may have been pulled into the processor P2
before it observed the write of flag:=1.

Even if there is no caching of invalid TLB entries, this acts the same way.

Problem is that TLB misses are not kept in the data structures that
enforce the semi-official x86 memory ordering model. Reason: expensive,
and supposedly the OS guys are smart enough to invalidate the TLB.

Note that this effect is ameliorated by rewalking the PTE before
reporting a page fault. But if you don't want to be that conservative...
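The hazard above amounts to one bad interleaving, which can be modelled
sequentially in toy Python (my own sketch, not real hardware; the early
page walk is simply executed before P1's writes):

```python
# Sketch of the ordering hazard: P2's page walk for VA is not ordered
# against its ordinary loads, so it can be performed (speculatively,
# early) before P2 observes flag == 1, even though P1 wrote in order.

pte = {"VA": None}            # None models present = 0
flag = 0
p2_tlb = {}

# P2, speculatively and early: the walk happens before the flag load
p2_tlb["VA"] = pte["VA"]      # pulls in the invalid translation

# P1's writes, in program order
pte["VA"] = "PA"
flag = 1

# P2, architecturally: sees flag == 1 and assumes VA will not fault...
assert flag == 1
assert p2_tlb["VA"] is None   # ...but the stale walk result says "fault"
```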


= See Also =

[[NAT pages]]

[[Uses for Stale Cache Data]]

http://semipublic.comp-arch.net/wiki/Caching_Invalid_Entries_in_TLB
From: Andy Glew on
Related issue: NAT pages.

Pages which cause, not a page fault, but an invalid value, e.g. a NaN,
to be loaded into the destination on access.

Q: should NAT TLB entries be cached? A: Yes, but...
From: MitchAlsup on
On Aug 1, 4:55 pm, James Harris <james.harri...(a)googlemail.com> wrote:
> Anyone know of architectures where the TLB caches not-present page
> table entries (as well as present ones)? The issue arises when a page
> fault is taken. AIUI some CPUs require even a not-present TLB entry to
> be invalidated whereas others, sensibly, simply avoid caching not-
> present entries.
>
> The Berkeley course CS162 (Operating systems and systems programming)
> mentions the need to invalidate TLB entries apparently on any page
> fault including a missing (not-present) page. I can't find any
> documentation that says what the Intel 386 and 486 do.

Andy covered most of the cases; I will cover another (not pertinent to
any x86s that I know of):

The CAM bits used to do the associative matching are significantly
larger than the RAM bits used to hold the PTEs. Thus, it makes sense to
store several PTEs for each CAM entry. How many might be reasonable? The
number that arrive from a single cache access (2s and 4s come to mind).

Some of these PTEs might be invalid, paired next to the valid one that
took the miss.

Given the aforementioned organization, I know of no way to avoid
inserting the invalid one in the PTE store while inserting the valid
one (as a single word in the TLB). It is perfectly reasonable to avoid
inserting the PTEs if ALL of them are invalid, but not if the one that
took the miss is valid.
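A toy sketch of this organization (Python, names mine; one CAM tag
covering a naturally aligned group of 4 PTEs fetched together):

```python
# Sketch: one CAM tag per naturally aligned group of 4 PTEs. Filling
# the valid PTE that missed drags its invalid neighbours (None) into
# the RAM array too; only an all-invalid group skips the fill.

GROUP = 4                                    # PTEs per CAM tag

def fill(tlb, page_tables, vpn):
    base = vpn - (vpn % GROUP)               # the group's CAM tag
    group = [page_tables.get(base + i) for i in range(GROUP)]
    if all(p is None for p in group):
        return                               # all invalid: skip the fill
    tlb[base] = group                        # valid + invalid neighbours
```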

Mitch
From: Paul A. Clayton on
On Aug 2, 11:27 am, MitchAlsup <MitchAl...(a)aol.com> wrote:
[snip]
> Given the aforementioned organization, I know of no way to avoid
> inserting the invalid one in the PTE store while inserting the valid
> one (as a single word in the TLB). It is perfectly reasonable to avoid
> inserting the PTEs if ALL of them are invalid, but not if the one that
> took the miss is valid.

Huh? Two methods are obvious to me. 1) Use the PTE valid bit and
have a policy that invalid PTEs are never cached in the TLB.
2) Add a full/empty bit for each PTE slot in the TLB.
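Option 2 can be sketched in toy Python (illustrative names of mine; the
shared CAM tag is kept, but each PTE slot carries its own full/empty
bit, so an invalid neighbour occupies RAM bits without ever being
treated as a cached translation):

```python
# Sketch: per-slot full/empty bits under a shared group tag. A lookup
# that lands on an empty slot is a miss and triggers a fresh walk.

GROUP = 4                                      # PTEs per CAM tag

def fill(tlb, page_tables, vpn):
    base = vpn - (vpn % GROUP)
    ptes = [page_tables.get(base + i) for i in range(GROUP)]
    full = [p is not None for p in ptes]       # per-slot full/empty bits
    tlb[base] = (ptes, full)

def lookup(tlb, vpn):
    base = vpn - (vpn % GROUP)
    hit = tlb.get(base)
    if hit is None:
        return ("miss", None)                  # no CAM match
    ptes, full = hit
    if not full[vpn % GROUP]:
        return ("miss", None)                  # slot empty: walk again
    return ("hit", ptes[vpn % GROUP])
```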


Paul A. Clayton
just a technophile who is somewhat appreciative of others sharing their
knowledge