From: Rob Warnock on
MitchAlsup <MitchAlsup(a)aol.com> wrote:
+---------------
| n...(a)gosset.csi.cam.ac.uk (Nick Maclaren) wrote:
| > What do modern hardware reloaders do when they get multiple,
| > CLASHING, TLB uses before the first has completed? That can clearly
| > happen if more memory accesses are allowed than TLB associativity,
| > and possibly in other ways. Or do they ensure that can't happen?
|
| While I cannot speak for other designers.....
|
| This seems to be the kind of problem you simply ignore--to the extent
| possible. That is, you walk the page tables, install the PTE, and go
| on with life. If you happen to be walking a set of addresses that will
| cause thrashing in the PTE, you simply ignore the problem, install the
| PTE of the moment and go on with life.
+---------------

ISTR that the reason the 370 had (at least) 4-way associative TLBs
was that there were certain instructions that could not make forward
progress unless there were *eight* pages mapped simultaneously, of
which up to four could collide in the TLB. The famous example of such
was the Translate-And-Test instruction in the situation in which the
instruction itself, the source buffer, the destination buffer, and
the translation table *all* spanned a page boundary, which obviously
needs valid mappings for at least eight pages. [But only four could collide
per TLB line.]
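That worst case can be sketched numerically. A toy Python model, assuming
4 KB pages and made-up operand addresses (the two-set index at the end is
purely illustrative, not any real TLB's hash):

```python
# Toy model: count the pages that must be mapped simultaneously when the
# instruction, source buffer, destination buffer, and translation table
# each straddle a 4 KB page boundary. Addresses are made up for illustration.

PAGE = 4096

def pages_touched(start, length, page=PAGE):
    """Set of page numbers covered by the byte range [start, start+length)."""
    return set(range(start // page, (start + length - 1) // page + 1))

operands = {
    "instruction": (PAGE - 2, 6),         # 6-byte SS instruction across a boundary
    "source":      (3 * PAGE - 10, 20),
    "destination": (5 * PAGE - 10, 20),
    "table":       (7 * PAGE - 100, 256),
}

all_pages = set()
for start, length in operands.values():
    all_pages |= pages_touched(start, length)

# Eight pages need valid mappings at once. With a set-associative TLB
# indexed by low page-number bits, several can land in the same set;
# this toy 2-set index puts four of them in one set.
by_set = {}
for p in all_pages:
    by_set.setdefault(p % 2, []).append(p)
max_collide = max(len(v) for v in by_set.values())

print(len(all_pages), max_collide)   # 8 pages, 4 colliding in one set
```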


-Rob

p.s. The MIPS CPU line [which used software-interrupt TLB reload] addressed
that by making the TLBs fully-associative.

p.p.s. I worked on a BSD Unix port to the AMD Am29000, which also used
software TLB reload. ISTR that we found that when running typical "Unix"
apps (the assembler, "nroff", etc.) that TLB reload used only a few
percent of the CPU.

-----
Rob Warnock <rpw3(a)rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

From: Anne & Lynn Wheeler on

rpw3(a)rpw3.org (Rob Warnock) writes:
> ISTR that the reason the 370 had (at least) 4-way associative TLBs
> was that there were certain instructions that could not make forward
> progress unless there were *eight* pages mapped simultaneously, of
> which up to four could collide in the TLB. The famous example of such
> was the Translate-And-Test instruction in the situation in which the
> instruction itself, the source buffer, the destination buffer, and
> the translation table *all* spanned a page boundary, which obviously
> needs valid mappings for at least eight pages. [But only four could collide
> per TLB line.]

minor trivia ... translate, translate-and-test were against the source
buffer ... using the translation table/buffer (which could cross page
boundary). the two additional possible page references were that instead
of executing the instruction directly ... the instruction could be the
target of an "EXECUTE" instruction; where the 4-byte EXECUTE instruction
might also cross a page boundary

the feature of the execute instruction was that it would take a byte
from a register and use it to modify the 2nd byte of the target
instruction for execution ... which in SS/6-byte instructions was the
length field; eliminating some of the reasons for altering instructions
as they appeared in storage.

note that 360/67 had an 8-entry associative array as the translate
look-aside hardware ... in order to handle the worst case instruction
page requirement (eight).

more trivia ... in more recent processors ... translate & translate-and-test
have gotten a "bug" fix and become much more complex.

360 instructions always pretested both the origin of an operand and the
end of an operand for validity (in the case of variable length operand
specification ... using the instruction operand length field ... or in
the case of the execute instruction, the length supplied from the
register) .... before beginning execution ... in the above example there
might be multiple page faults before the instruction execution would
actually start.

370 introduced a couple instructions that would execute incrementally
(MVCL & CLCL) ... although there were some early machines that had
microcode implementation bugs ... that would pretest the end of
MVCL/CLCL operand before starting execution.

relatively recently a hardware fix was accepted for the translate &
translate-and-test instructions. the other variable length 6-byte SS
instructions have the source operand length and the destination operand
length identical. translate and translate-and-test instructions have a
table as operand, indexed by each byte from the source. The assumption
was that the table was automatically 256 bytes and therefore the
instruction pretest would check validity for start-of-table plus 255.

it turns out that besides a page fault when crossing a page boundary ...
there is the possibility of a storage protect fault on the boundary
crossing. that, coupled with some applications that would do translate
on only a subset of possible values ... and so built a table much
smaller than 256 bytes. If such a table was at the end of a page ...
abutting a storage protect page ... the end-of-table precheck could fail
... even though the translate data would never actually result in a
reference to protected storage.

so now, translate and translate-and-test instructions have a pretest for
whether the table is within 256 bytes of a page boundary ... if not ...
the instruction executes as it has since 360 days. if the target table
is within 256 bytes of the end of a page ... it may be necessary to
execute the instruction incrementally, byte-by-byte (more like
MVCL/CLCL).
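The decision can be sketched in a few lines. A hypothetical model,
assuming 4 KB pages, where the only question is whether the 256-byte
pretest is safe for the table's placement:

```python
# Hypothetical sketch of the fix described above: the 256-byte table
# pretest is only safe when all 256 bytes lie within the table's page;
# otherwise fall back to incremental, byte-at-a-time execution.

PAGE = 4096  # assumed page size

def trt_strategy(table_start):
    bytes_left_in_page = PAGE - (table_start % PAGE)
    if bytes_left_in_page >= 256:
        return "pretest-256"     # precheck table_start .. table_start+255
    return "byte-by-byte"        # incremental, more like MVCL/CLCL

print(trt_strategy(0x1000))      # start of a page -> pretest-256
print(trt_strategy(0x1FC0))      # 64 bytes before page end -> byte-by-byte
```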


--
virtualization experience starting Jan1968, online at home since Mar1970
From: Nick Maclaren on
In article <m3sk2n3g5m.fsf(a)garlic.com>,
Anne & Lynn Wheeler <lynn(a)garlic.com> wrote:
>
>370 introduced a couple instructions that would execute incrementally
>(MVCL & CLCL) ... although there were some early machines that had
>microcode implementation bugs ... that would pretest the end of
>MVCL/CLCL operand before starting execution.

Yes - I was a person who first found one of them :-)


Regards,
Nick Maclaren.
From: Rob Warnock on
Anne & Lynn Wheeler <lynn(a)garlic.com> wrote:
+---------------
| rpw3(a)rpw3.org (Rob Warnock) writes:
| > ISTR that the reason the 370 had (at least) 4-way associative TLBs
| > was that there were certain instructions that could not make forward
| > progress unless there were *eight* pages mapped simultaneously, of
| > which up to four could collide in the TLB. The famous example of such
| > was the Translate-And-Test instruction in the situation in which the
| > instruction itself, the source buffer, the destination buffer, and
| > the translation table *all* spanned a page boundary, which obviously
| > needs valid mappings for at least eight pages. [But only four could collide
| > per TLB line.]
|
| minor trivia ... translate, translate-and-test were against the source
| buffer ... using the translation table/buffer (which could cross page
| boundary). the two additional possible page references was that instead
| of executing the instruction directly ... the instruction could be the
| target of an "EXECUTE" instruction; where the 4-byte EXECUTE instruction
| might also cross page boundary
+---------------

Aha! Thanks for the correction/clarification. At least I did
remember correctly that you needed eight PTEs in a 4-way TLB... ;-}


-Rob

-----
Rob Warnock <rpw3(a)rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

From: MitchAlsup on
On Aug 6, 5:02 pm, n...(a)gosset.csi.cam.ac.uk (Nick Maclaren) wrote:
> In article <68e616b4-c45d-4b58-a1ed-9bb08f9ae...(a)t20g2000yqa.googlegroups..com>,
>
> MitchAlsup  <MitchAl...(a)aol.com> wrote:
>
> >> What do modern hardware reloaders do when they get multiple,
> >> CLASHING, TLB uses before the first has completed?  That can clearly
> >> happen if more memory accesses are allowed than TLB associativity,
> >> and possibly in other ways.  Or do they ensure that can't happen?
>
> >While I cannot speak for other designers.....
>
> >This seems to be the kind of problem you simply ignore--to the extent
> >possible. That is, you walk the page tables, install the PTE, and go
> >on with life. If you happen to be walking a set of addresses that will
> >cause thrashing in the PTE, you simply ignore the problem, install the
> >PTE of the moment and go on with life.
>
> I was afraid of that :-(

It is the nature of the complexity of processor design. See below.

> >In practice, if you have any of these (or have any notion of being hit
> >with one of these), you wil simulate a lot of configurations of the
> >TLB and <hopefully> not observe a severe "fall on the face" problem.
> >If you do observe a problem in TLB performance, you will reconfigure
> >the TLB so as to not have one of these. Hopefully with some analysis
> >of why it happened, and this, then, will drive the reconfiguration.
>
> I have heard of systems that deadlocked, but the only case I have
> actually seen is when I managed to deadlock an SGI Origin, and that
> had the extra amusement of cross-board TLB misses ....

I thought this thread was on hardware TLB reloaders.......SGI did not
have one.....

In any event, with the frequency of today's processors it is
unrealistic to make a fully associative store much bigger than 64
entries and cycle it in an appropriate amount of time. Thus, the
microarchitect is basically reduced to three choices: a) fully
associative and 64 entries, b) set associative and built with SRAM,
or c) a bit of both.

An SRAM block contains either 2KBytes or 4KBytes depending on leakage,
wire delay and timing requirements. The array will have between 128
and 300 bits of storage per word at the bit lines.

Using 48-bit virtual and 48-bit physical addresses uses 96-100 bits
per entry, and thus one can make a 3-way set associative TLB from a
single SRAM that looks and smells like the same SRAM array that cache
data is stored in, saving design time. More virtual or more physical
bits, or more kinds of control over the pages, will drop the
associativity to 2-way. So, with a single SRAM array, one can get 384
pages mapped with 3 ways of associativity, or 768 with 2 SRAMs in a
6-way set associative manner.
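The arithmetic behind those numbers, using the figures from the
paragraphs above (a 128-word array, ~300 bits per word, ~100 bits per
entry) -- a back-of-the-envelope check, not a real design:

```python
# Back-of-the-envelope TLB sizing from one cache-style SRAM array.
WORDS = 128          # assumed rows in the SRAM array
WORD_BITS = 300      # storage bits per word at the bit lines
ENTRY_BITS = 100     # ~48-bit virtual tag + 48-bit physical + control

ways = WORD_BITS // ENTRY_BITS              # 3 entries fit per row -> 3-way
entries_one_array = WORDS * ways            # pages mapped by one array
entries_two_arrays = 2 * entries_one_array  # two arrays, 6-way

print(entries_one_array, entries_two_arrays)   # 384 768
```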

Another organization would be to associate two successive virtual
pages with a single tag. In the fully associative case, this leads to
128 entries 64-way associative 2-PTEs/tag, and in the SRAM
organization 512 PTEs in 2-way set associative or 1024 in 4-way set
associative. THe utility of the second entry is "not so bad" for code
and "not so good" for data, and these organizations don't perform
usefully better than those organizations with fewer but more
associatively organized entries, and thusly have fallen from favor.
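The pairing scheme amounts to a lookup keyed by an aligned page pair. A
toy model (the dict, the tag split, and the PTE names are all made up
for illustration):

```python
# Toy model of "two successive virtual pages per tag": one tag covers an
# aligned pair of pages, with a separate translation stored for each half.

def lookup(tlb, vpn):
    """tlb maps page-pair tags (vpn >> 1) to a (pte_even, pte_odd) tuple."""
    tag, half = vpn >> 1, vpn & 1
    pair = tlb.get(tag)
    return pair[half] if pair is not None else None

tlb = {0x10 >> 1: ("pte_A", "pte_B")}   # hypothetical entry for pages 0x10/0x11

print(lookup(tlb, 0x10))   # pte_A
print(lookup(tlb, 0x11))   # pte_B -- the "free" second entry
print(lookup(tlb, 0x12))   # None  -- miss
```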

Athlon and (at least early) Opteron processors backed up their 48-entry
(growing to 64-entry later) first level TLBs with a 512-entry SRAM-based
4-way set associative second level TLB. The combined footprint was "not
so bad", the combined hit rate was "acceptable", and the area penalty
was "not hideous".

Due to the size of the SRAM organizations, they are appropriate for
the (circa 2007) higher end microprocessors, but not so much for
desktops. And it is easy to see that processor designers targeting
servers (exclusively) will use such an organization and deliver large
TLBs. The problem for x86 users is that the total volume of servers
represents an afternoon in the FAB per year of production. It is
therefore uneconomic (remember, circa 2007) to put the larger TLBs on an
x86 die. {And that is a shame...but I digress}

Servers and especially database servers would like to map the database
working store with as few PTEs as possible so the rest of the
applications making demands on the database can proceed smoothly
with little user TLB thrashing against DB activities. Databases can be
funny animals, at startup they might allocate 1/2 of main memory (say
a dozen gigabytes in 2010 terms) and then the DB itself manages the
paging to and from this area using algorithms tuned to the data base
needs (rarely page size) and very non-desirous of OS management of the
database pages (locked down pages as seen from the OS). So, here, it
is actually appropriate to utilize really, really large pages. We
proposed a 1 GB PTE for K9 (which is just the next natural x86
architectural boundary bigger than 2 MB). This kind of PTE would
basically make the database table walks vanish--especially if one
could make the TLB holding big pages independent of the TLB holding
the littler pages. One could use a separate CAM array or a separate
SRAM array as appropriate.
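The coverage argument is easy to check. Assuming a 12 GB locked-down
buffer pool (the "dozen gigabytes" figure above), the number of TLB
entries needed at each x86 page size:

```python
# TLB entries needed to map a 12 GB database buffer pool at each page size.
GB = 1 << 30
pool = 12 * GB

entries = {name: pool // size for name, size in
           [("4KB", 4 << 10), ("2MB", 2 << 20), ("1GB", GB)]}

for name, n in entries.items():
    print(name, n)    # 4KB: 3145728, 2MB: 6144, 1GB: 12
```

Twelve 1 GB entries fit trivially in even a tiny dedicated CAM, which is
why the table walks effectively vanish.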

But few other than databases could make use of such large pages (even
numerics don't like them so well), and it is doubtful that an OS would
like to manage three distinct sizes of pages.

There are more than a few people still wanting smaller pages.....

So the microarchitect has basically 2 choices: a big fully associative
TLB (where big is less than 70 entries) or a small number of SRAM
arrays (where small is 1, 2, 3, or 4) used in a set associative manner.
This is what I mean by basically ignoring the problem. You choose the
FA TLB if you are selling to the general marketplace, and you choose
the SRAM organization if you are selling to servers. Minor details of
the architecture will determine how the bits lay out and how many sets
you get with the SRAM approach.

Mitch