From: Piotr Wyderski on
Terje Mathisen wrote:

> Writing to CR3 to invalidate the entire TLB subsystem is _very_
> expensive: Not because the operation itself takes so long, but because
> you have to reload the 90+% of data which is still needed.

Of course, but I wonder why the operation itself takes so long.
Conceptually it is very similar to mfence.

Best regards
Piotr Wyderski

From: James Harris on
On 3 Aug, 14:00, "Piotr Wyderski"
<piotr.wyder...(a)mothers.against.spam.gmail.com> wrote:
> Terje Mathisen wrote:
> > Writing to CR3 to invalidate the entire TLB subsystem is _very_
> > expensive: Not because the operation itself takes so long, but because
> > you have to reload the 90+% of data which is still needed.
>
> Of course, but I wonder why the operation itself takes so long.
> Conceptually it is very similar to mfence.

I'm not sure I see the similarity to mfence, but this branch of the
thread has become x86-based so I'll carry on in that vein. I haven't
measured either a reload of CR3 or an invlpg. Both need low-level
access that is not readily available. However, we can say that there
is an immediate cost and a longer-term refill cost, neither of which
is well defined. If you are *sure* they are cheap then it's fine to do
them in all cases, but if there's any doubt it's best to avoid them.
There have been reports - though I haven't measured them myself - that
some Intel operations take a surprisingly long time.
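
For what it's worth, given ring-0 access the immediate cost could be
probed along these lines. This is only a sketch, assuming x86-64 and
GCC-style inline assembly, with the kernel-module scaffolding left
out; it also says nothing about the longer-term refill cost.

#include <stdint.h>

/* Serialize with CPUID, then read the time-stamp counter, so that
   earlier instructions have retired before the sample is taken. */
static inline uint64_t rdtsc_serialized(void)
{
    uint32_t lo, hi;
    asm volatile("cpuid\n\t"
                 "rdtsc"
                 : "=a"(lo), "=d"(hi)
                 : "a"(0)
                 : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Invalidate the TLB entry (if any) for the page containing addr. */
static inline void invlpg(void *addr)
{
    asm volatile("invlpg (%0)" :: "r"(addr) : "memory");
}

/* Must run at CPL 0, e.g. inside a small kernel module. */
uint64_t time_invlpg(void *addr)
{
    uint64_t t0 = rdtsc_serialized();
    invlpg(addr);
    uint64_t t1 = rdtsc_serialized();
    return t1 - t0;   /* includes the CPUID overhead; a rough figure only */
}

Timing the CR3 path would look much the same, with the invlpg
replaced by a read of CR3 and a write of the same value back.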

That said, to put them in context, if a page has had to be swapped in
from disk then the cost of invalidating the whole TLB would be
negligible compared to the disk access time. In fact, since the
faulting task will have been descheduled while the I/O completed, by
the time it is restarted the TLB likely contains entries for another
task and needs to be flushed anyway.

On the other hand, swapping in a page is just one possible response.
The page fault may, instead, require a pre-zeroed page to be mapped
in. That is very quick, can be done immediately, and keeps the
faulting process running. In this case existing TLB entries would be
in use and should be kept. Invalidations here could be expensive to
carry out and/or to recover from when the lost entries are refilled.
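
A minimal sketch of that path, assuming the faulting PTE was simply
not present and using hypothetical helpers (pte_for,
alloc_zeroed_page) in place of whatever the OS really provides:
because the TLB does not cache not-present entries, installing the
new PTE is all that is needed and every existing entry is left alone.

#include <stdint.h>

#define PTE_PRESENT  0x001ULL
#define PTE_WRITABLE 0x002ULL
#define PTE_USER     0x004ULL

/* Hypothetical helpers assumed to be supplied by the OS:
   pte_for(addr)       - pointer to the PTE that maps addr
   alloc_zeroed_page() - physical address of a pre-zeroed frame */
extern volatile uint64_t *pte_for(uintptr_t addr);
extern uint64_t alloc_zeroed_page(void);

/* Demand-zero fault: the PTE was not present, so no TLB entry for
   this address can exist; installing the mapping is sufficient and
   every other TLB entry is left untouched. */
void handle_demand_zero_fault(uintptr_t fault_addr)
{
    uint64_t frame = alloc_zeroed_page();
    *pte_for(fault_addr) = frame | PTE_PRESENT | PTE_WRITABLE | PTE_USER;
    /* Return from the fault: the access is retried and the hardware
       walker loads the new entry into the TLB on demand. */
}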

James
From: Andy Glew on
On 8/3/2010 4:55 AM, Piotr Wyderski wrote:
> Andy Glew wrote:
>
>> Heck: if you yourself can rewalk the page tables, on all machines you
>> can avoid the "expensive TLB invalidation".
>
> On the other hand, why is the TLB invalidation expensive?
> There are two ways to do it: the first is via invlpg and the
> other is to write to cr3. But both of them should be relatively cheap,
> i.e. wait until the LSU pipe is empty and then pulse
> a global edge/level reset line of the TLB subsystem. Why
> isn't the reality as simple as that?


As Terje notes, invalidating the entire TLB, or only the local
(non-global) entries, via a write to CR3 imposes a major TLB reload
cost.

INVLPG should not need to be that expensive, although it should be
noted that "waiting until the LSU pipe is empty" can itself take
quite a few cycles, 30-100. More likely, however, the implementation
waits until the entire pipeline is drained, to take into account the
possibility of ITLB invalidation. I suppose you could see whether the
entry is in the ITLB, and drain only the data side if not.

I suspect, however, that the original poster is thinking about doing a
multiprocessor TLB shootdown.
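
In that case most of the cost is not the local invalidation at all
but the cross-processor coordination. A rough sketch of the usual
IPI-based scheme, with hypothetical primitives (send_shootdown_ipi,
num_other_cpus) standing in for whatever the kernel provides:

#include <stdatomic.h>
#include <stdint.h>

/* Shared descriptor seen by the other CPUs' IPI handlers. */
struct shootdown {
    uintptr_t  addr;       /* page to invalidate               */
    atomic_int pending;    /* CPUs that have not yet responded */
};

/* Hypothetical primitives the kernel would provide. */
extern int  num_other_cpus(void);
extern void send_shootdown_ipi(struct shootdown *sd);

static inline void invlpg(uintptr_t addr)
{
    asm volatile("invlpg (%0)" :: "r"(addr) : "memory");
}

/* Initiator: the PTE has already been changed; now make sure no CPU
   keeps using a stale translation. */
void tlb_shootdown(uintptr_t addr)
{
    static struct shootdown sd;
    sd.addr = addr;
    atomic_store(&sd.pending, num_other_cpus());

    send_shootdown_ipi(&sd);     /* interrupt the other CPUs           */
    invlpg(addr);                /* invalidate locally in the meantime */

    while (atomic_load(&sd.pending) > 0)
        ;                        /* spin until every CPU has answered  */
}

/* Runs on each target CPU, inside its IPI handler. */
void shootdown_ipi_handler(struct shootdown *sd)
{
    invlpg(sd->addr);
    atomic_fetch_sub(&sd->pending, 1);   /* acknowledge */
}

The initiator stalls until every target has taken the interrupt and
responded, which typically dwarfs the cost of the invlpg instructions
themselves.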
From: EricP on
James Harris wrote:
>
> I'm not sure I see the similarity to mfence but this branch of the
> thread has become x86-based so I'll carry on in that vein.

http://developer.intel.com/products/processor/manuals/index.htm

Intel manual 3A System Programming Guide, Part 1 (#253668)
Section 4.10 "CACHING TRANSLATION INFORMATION" covers TLB caching
(over 16 pages of info)
http://developer.intel.com/Assets/PDF/manual/253668.pdf

Eric



From: EricP on
Andy Glew wrote:
>
> If ever you see flakey results, on x86 or elsewhere I would strongly
> suggest that you have your invalid page exception handler rewalk the
> page tables to see if the page is, indeed, invalid.

In a multi-threaded SMP OS, I think you may, depending on the OS
design, always have to do that. The OS _should_ allow concurrent page
faults from different threads in the same process - no reason not to -
with access to the process page table coordinated by a mutex.
It is therefore possible that between when a fault occurs and when
the table mutex is granted, another thread could have patched up the PTE.
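
A sketch of that pattern, with a pthread mutex standing in for
whatever lock the kernel actually uses and hypothetical helpers
(pte_for, resolve_fault_locked): a fault that another thread has
already resolved is treated as spurious and simply retried.

#include <stdint.h>
#include <pthread.h>

#define PTE_PRESENT 0x001ULL

/* Hypothetical per-process state; a pthread mutex stands in for the
   kernel's own lock. */
extern pthread_mutex_t page_table_mutex;
extern volatile uint64_t *pte_for(uintptr_t addr);
extern void resolve_fault_locked(uintptr_t addr);  /* swap in, zero-fill, ... */

void page_fault(uintptr_t fault_addr)
{
    pthread_mutex_lock(&page_table_mutex);

    /* Rewalk the table: between the fault and the grant of the mutex,
       another thread faulting on the same page may already have
       fixed up the PTE. */
    if (!(*pte_for(fault_addr) & PTE_PRESENT))
        resolve_fault_locked(fault_addr);
    /* else: spurious fault, nothing to do; just return and retry. */

    pthread_mutex_unlock(&page_table_mutex);
}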

There are also various PTE atomic update issues to consider, and
other race conditions, because the CPU hardware does not use
spinlocks/mutexes to coordinate its table accesses and updates
(it does use atomic operations to update the PTE A and D bits),
whereas any OS changes made by the various threads are coordinated.
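
One example of the former: when the OS tears down a mapping it has to
allow for the walker setting the Accessed/Dirty bits at any instant,
so the PTE is typically cleared with a single atomic exchange rather
than a read followed by a write. A sketch, with pte_for again a
hypothetical helper:

#include <stdatomic.h>
#include <stdint.h>

#define PTE_DIRTY 0x040ULL

/* Hypothetical helper: the PTE that maps addr, viewed atomically. */
extern _Atomic uint64_t *pte_for(uintptr_t addr);

/* Tear down a mapping without losing a racing hardware update.  The
   walker may set the Accessed/Dirty bits at any instant, so the PTE
   is cleared with one atomic exchange; a separate read followed by a
   write could overwrite a D bit set in between. */
uint64_t pte_clear(uintptr_t addr)
{
    uint64_t old = atomic_exchange(pte_for(addr), 0);

    if (old & PTE_DIRTY) {
        /* The page was modified: schedule writeback before reuse. */
    }
    /* ...followed by the invlpg / shootdown for this address. */
    return old;
}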

Eric