From: EricP on
MitchAlsup wrote:
>
> So the microarchitect has basically 2 choices: a big fully associative
> TLB (where big is less than 70 entries) or a small number of SRAM
> arrays (where small is 1, 2, 3, or 4) used in a set associative
> manner. This is what I mean by basically ignoring the problem. You
> choose the FA TLB if you are selling to the general marketplace, and
> you choose the SRAM organization if you are selling to servers. Minor
> details of the architecture will determine how the bits lay out and
> how many sets you get with the SRAM approach.

Since an x86 can touch 4 pages, 2 instruction, 2 data, in a single
instruction, that is the lower limit for a unified set associative TLB.
Or you can split it into ITLB and DTLB and use 2 way assoc. in each.

Also the Intel manual talks about optional higher level TLB
caches for levels PML2 (aka PDE), PML3, and PML4 so the
whole tree doesn't need to be walked for every miss.
They don't specify any processors having these yet though.
Anyway, a miss on the PTE but a hit in a PDE TLB, followed by a hit
in L1 for the PTE itself, might fix a TLB miss in, say, 2 to 3 clocks
for L1 plus 2 to 3 clocks for housekeeping.

And since each PDE can reference 512 PTEs, a 64 entry PML2 TLB
could reach 32K PTEs within 4 to 6 clocks.
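As a quick back-of-envelope check of that reach (a sketch: the 64-entry PML2 TLB and 512 PTEs per PDE come from the post above; the 4 KiB page size is the usual x86 small page):

```python
# Reach of a hypothetical 64-entry PDE (PML2) TLB with 4 KiB pages.
PAGE_SIZE = 4 * 1024        # bytes in a 4 KiB page
PTES_PER_PDE = 512          # each PDE covers one page of 8-byte PTEs
PDE_TLB_ENTRIES = 64        # PML2 TLB size from the post above

ptes_reachable = PDE_TLB_ENTRIES * PTES_PER_PDE   # 32768 PTEs (32K)
bytes_reachable = ptes_reachable * PAGE_SIZE      # 128 MiB of mappings
print(ptes_reachable, bytes_reachable // (1024 * 1024))  # 32768 128
```

So even a modest PML2 TLB covers 128 MiB of address space at the cost of one extra L1 access per miss.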

The worst case remains pretty high, though, as it could necessitate a
store buffer flush of up to 32 entries, all of which could go to main
memory, followed by the page table walk with 4 main memory reads, each
needing an atomic operation to set the Accessed and/or Dirty bits.

Eric

From: Eric Northup on
On Aug 10, 3:07 pm, EricP <ThatWouldBeTell...(a)thevillage.com> wrote:
> Since an x86 can touch 4 pages, 2 instruction, 2 data, in a single
> instruction, that is the lower limit for a unified set associative TLB.
> Or you can split it into ITLB and DTLB and use 2 way assoc. in each.

The worst case x86 instruction I know of actually takes 6 or 9 pages
mapped simultaneously to make forward progress. A movsd instruction
(using prefixes to pad the instruction out to multiple bytes) can be
arranged with each of the instruction pointer, ESI (source), and EDI
(destination) misaligned such that all three span page boundaries.
That takes you to 6 pages; I think you have to abuse segmentation from
16-bit protected mode to get to 9 - use segments which have limit=64KB
and base addresses placed so that offset 0xFFFF falls exactly on a 4KB
page boundary, and arrange for CS:EIP, DS:ESI, and ES:EDI to all point
two bytes below their segment limits. This way, the $(segment):0xFFFE
and $(segment):0xFFFF bytes live on distinct pages, and you get a 3rd
page per pointer when you wrap around to $(segment):0x0000.
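The arithmetic behind that wrap can be sketched in a few lines of Python. The base value here is illustrative, chosen one byte past a page boundary so that offset 0xFFFF lands exactly on the start of the next page:

```python
# Sketch of the segment-wrap trick: a 64 KiB-limit segment whose base
# is placed so that offset 0xFFFF falls exactly on a 4 KiB page
# boundary. Offsets 0xFFFE and 0xFFFF then live on distinct pages, and
# the 16-bit wrap to 0x0000 lands on a third. Base is illustrative.
PAGE = 4096
base = 0x10001                 # one byte past a page boundary

def page_of(offset):
    # 16-bit segment offsets wrap before the base is added
    return (base + (offset & 0xFFFF)) // PAGE

pages = {page_of(o) for o in (0xFFFE, 0xFFFF, 0x0000)}
print(len(pages))              # 3 distinct pages per pointer
```

With all three of CS:EIP, DS:ESI, and ES:EDI set up this way, each pointer contributes 3 pages, for the 9-page total.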
From: EricP on
Eric Northup wrote:
> On Aug 10, 3:07 pm, EricP <ThatWouldBeTell...(a)thevillage.com> wrote:
>> Since an x86 can touch 4 pages, 2 instruction, 2 data, in a single
>> instruction, that is the lower limit for a unified set associative TLB.
>> Or you can split it into ITLB and DTLB and use 2 way assoc. in each.

I realized afterwards that these are pairs of consecutive pages
(as opposed to 4 independent virtual addresses), so that changes it.
Consecutive pages map to consecutive sets in a set associative TLB, so
the lower limits are 2 way assoc unified, or 1 way split across ITLB
and DTLB.
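A toy model of why consecutive pages don't collide: in a TLB indexed by the low bits of the virtual page number, an access that straddles a page boundary touches adjacent sets (the set count and addresses here are made up for illustration):

```python
# Toy model: a set associative TLB indexed by the low bits of the
# virtual page number. An access straddling a page boundary touches
# adjacent sets, so consecutive pages never fight over one set.
SETS = 64                           # hypothetical TLB set count

def tlb_set(vaddr, page_shift=12):
    vpn = vaddr >> page_shift       # virtual page number, 4 KiB pages
    return vpn % SETS               # set index from the low VPN bits

a = 0x7FFF0FFC                      # last word of one page
b = a + 8                           # spills into the next page
print(tlb_set(a), tlb_set(b))       # 48 49 -- adjacent sets
```

So within one instruction the two halves of a misaligned access can never evict each other, which is what drops the required associativity.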

> The worst case x86 instruction I know of actually takes 6 or 9 pages
> mapped simultaneously to make forward progress. A movsd instruction
> (using some prefixes, to bulk the instruction to be multi-byte) can be
> arranged with each of the instruction pointer, ESI (source) and EDI
> (destination) misaligned such that all three span page boundaries.
> That takes you to 6 pages; I think you have to abuse segmentation from
> 16 bit protected mode to go to 9 - use segments which have limit=64KB,
> and base addresses which are one byte below the beginning of a 4KB
> page, and arrange for CS:EIP, DS:ESI, and ES:EDI to all point two
> bytes below their segment limits. This way, the $(segment):0xFFFE and
> $(segment):0xFFFF bytes live on distinct pages, and you get a 3rd
> page / pointer when you wrap around to $(segment):0x0000.

Ok, I didn't know there were any "op [mem] [mem]" x86 instructions.

So that makes the minimum 3 way unified, or 1 way ITLB, 2 way DTLB.

Eric
From: EricP on
Eric Northup wrote:
>
> The worst case x86 instruction I know of actually takes 6 or 9 pages
> mapped simultaneously to make forward progress. A movsd instruction
> (using some prefixes, to bulk the instruction to be multi-byte) can be
> arranged with each of the instruction pointer, ESI (source) and EDI
> (destination) misaligned such that all three span page boundaries.
> That takes you to 6 pages; I think you have to abuse segmentation from
> 16 bit protected mode to go to 9 - use segments which have limit=64KB,
> and base addresses which are one byte below the beginning of a 4KB
> page, and arrange for CS:EIP, DS:ESI, and ES:EDI to all point two
> bytes below their segment limits. This way, the $(segment):0xFFFE and
> $(segment):0xFFFF bytes live on distinct pages, and you get a 3rd
> page / pointer when you wrap around to $(segment):0x0000.

And that would make the worst case TLB miss timing on the order of
32 store buffer entries that all miss cache and go to main memory,
plus 9 translations that all do full table walks and miss all caches,
plus atomic ops to set the Accessed bits (4 * (memory read + atomic
OR) per translation).

So that is on the order of 68 main memory reads plus 36 atomic ORs.
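Tallying that worst case explicitly (all counts are from the posts above: 32 store buffer entries, 9 translations, 4 walk levels each):

```python
# Tallying the worst case: 32 store buffer entries that all go to main
# memory, plus 9 translations that each walk 4 levels, with an atomic
# OR at each level to set the Accessed/Dirty bits.
STORE_BUFFER_ENTRIES = 32
TRANSLATIONS = 9           # worst-case pages touched by one instruction
WALK_LEVELS = 4            # PML4 -> PDP -> PDE -> PTE

walk_reads = TRANSLATIONS * WALK_LEVELS             # 36 table reads
memory_reads = STORE_BUFFER_ENTRIES + walk_reads    # 68 total reads
atomic_ors = walk_reads                             # one OR per level
print(memory_reads, atomic_ors)  # 68 36
```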

Eric


From: MitchAlsup on
On Aug 10, 2:07 pm, EricP <ThatWouldBeTell...(a)thevillage.com> wrote:
> MitchAlsup wrote:
>
> > So the microarchitect has basically 2 choices: a big fully associative
> > TLB (where big is less than 70 entries) or a small number of SRAM
> > arrays (where small is 1, 2, 3, or 4) used in a set associative
> > manner. This is what I mean by basically ignoring the problem. You
> > choose the FA TLB if you are selling to the general marketplace, and
> > you choose the SRAM organization if you are selling to servers. Minor
> > details of the architecture will determine how the bits lay out and
> > how many sets you get with the SRAM approach.
>
> Since an x86 can touch 4 pages, 2 instruction, 2 data, in a single
> instruction, that is the lower limit for a unified set associative TLB.
> Or you can split it into ITLB and DTLB and use 2 way assoc. in each.
>
> Also the Intel manual talks about optional higher level TLB
> caches for levels PML2 (aka PDE), PML3, and PML4 so the
> whole tree doesn't need to be walked for every miss.
> They don't specify any processors having these yet though.

AMD parts have what they call a PDC (page directory cache) that can
walk the page tables at 1 level every 2 cycles. Things that miss in
the PDC go to the L2. There is a clever (not yet implemented in HW)
way to walk the x86 page tables backwards, so if you know certain
things, you might get by with a single 2-cycle access to the PDC or a
single L2 access to walk the 4-level tables.

The AMD I/O MMU also includes specific bits that allow you to skip
levels in the page hierarchy so, for example, the root pointer can
point directly at a single page of PTEs. This still allows a device to
access 512 pages while still supporting a 64-bit virtual address space
with as big a physical address space as you want. That device, at that
instant, can only see 512 pages--which is more than fine for a
multi-request disc controller.
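The reach of that skip-level layout checks out with a little arithmetic, assuming 4 KiB pages and 8-byte PTEs (the PTE size is my assumption, not stated above):

```python
# Reach of the skip-level layout: the root pointer aims directly at a
# single page of PTEs. Assumes 4 KiB pages and 8-byte PTEs (the PTE
# size is an assumption, not stated in the post).
PAGE_SIZE = 4 * 1024
PTE_SIZE = 8

ptes_in_one_page = PAGE_SIZE // PTE_SIZE      # 512 PTEs in one page
device_window = ptes_in_one_page * PAGE_SIZE  # 2 MiB visible at once
print(ptes_in_one_page, device_window // (1024 * 1024))  # 512 2
```

A 2 MiB window is plenty for the in-flight buffers of a disc controller, and the walk is one level deep instead of four.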

V8 SPARC processors used what is called a table walk accelerator
(TWA), which skips as many levels in the page hierarchy as were
preloaded into it.

Nothing new here. Mere exercises in caching and hiding the caches
from the users...

Mitch