From: Rob Warnock on 8 Aug 2010 21:40

MitchAlsup <MitchAlsup(a)aol.com> wrote:
+---------------
| n...(a)gosset.csi.cam.ac.uk (Nick Maclaren) wrote:
| > What do modern hardware reloaders do when they get multiple,
| > CLASHING, TLB uses before the first has completed? That can clearly
| > happen if more memory accesses are allowed than TLB associativity,
| > and possibly in other ways. Or do they ensure that can't happen?
|
| While I cannot speak for other designers.....
|
| This seems to be the kind of problem you simply ignore--to the extent
| possible. That is, you walk the page tables, install the PTE, and go
| on with life. If you happen to be walking a set of addresses that will
| cause thrashing in the PTE, you simply ignore the problem, install the
| PTE of the moment and go on with life.
+---------------

ISTR that the reason the 370 had (at least) 4-way associative TLBs
was that there were certain instructions that could not make forward
progress unless there were *eight* pages mapped simultaneously, of
which up to four could collide in the TLB. The famous example of such
was the Translate-And-Test instruction in the situation in which the
instruction itself, the source buffer, the destination buffer, and
the translation table *all* spanned a page boundary, which obviously
needs valid mappings for at least eight pages. [But only four could
collide per TLB line.]

-Rob

p.s. The MIPS CPU line [which used software-interrupt TLB reload]
addressed that by making the TLBs fully associative.

p.p.s. I worked on a BSD Unix port to the AMD Am29000, which also used
software TLB reload. ISTR that we found that when running typical
"Unix" apps (the assembler, "nroff", etc.), TLB reload used only a few
percent of the CPU.

-----
Rob Warnock <rpw3(a)rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607
From: Anne & Lynn Wheeler on 9 Aug 2010 12:27

rpw3(a)rpw3.org (Rob Warnock) writes:
> ISTR that the reason the 370 had (at least) 4-way associative TLBs
> was that there were certain instructions that could not make forward
> progress unless there were *eight* pages mapped simultaneously, of
> which up to four could collide in the TLB. The famous example of such
> was the Translate-And-Test instruction in the situation in which the
> instruction itself, the source buffer, the destination buffer, and
> the translation table *all* spanned a page boundary, which obviously
> needs valid mappings for at least eight pages. [But only four could
> collide per TLB line.]

minor trivia ... translate and translate-and-test operated against the
source buffer ... using the translation table (which could cross a page
boundary). the two additional possible page references arose because,
instead of executing the instruction directly, the instruction could be
the target of an "EXECUTE" instruction; the 4-byte EXECUTE instruction
might also cross a page boundary. the feature of the execute
instruction was that it would take a byte from a register and use it to
modify the 2nd byte of the target instruction for execution ... which
in SS/6-byte instructions was the length field (eliminating some of the
reasons for altering instructions as they appeared in storage).

note that the 360/67 had an 8-entry associative array as its translate
look-aside hardware ... in order to handle the worst-case instruction
(eight-page) requirement.

more trivia ... in more recent processors, translate & translate-and-test
have gotten a "bug" fix and become much more complex. 360 instructions
always pretested both the origin of an operand and the end of an operand
for validity (in the case of variable-length operand specification,
using the instruction operand length field ... or, in the case of the
execute instruction, the length supplied from a register) ... before
beginning execution ...
in the above example there might be multiple page faults before the
instruction execution would actually start. 370 introduced a couple of
instructions that would execute incrementally (MVCL & CLCL) ...
although there were some early machines that had microcode
implementation bugs ... that would pretest the end of the MVCL/CLCL
operand before starting execution.

relatively recently a hardware fix was accepted for the translate &
translate-and-test instructions. the other variable-length 6-byte SS
instructions have identical source and destination operand lengths.
translate and translate-and-test instructions have a table as an
operand, indexed by each byte from the source. The assumption was that
the table was automatically 256 bytes, and therefore the instruction
pretest would check for validity from start-of-table through
start-of-table plus 255.

it turns out that, beyond a page fault when crossing a page boundary,
there is also the possibility of a storage-protect scenario on the page
boundary crossing. that was coupled with some applications that would
do translate on a subset of the possible values ... and only built a
table much smaller than 256 bytes. If the table was at the end of a
page, abutting a storage-protect page, the end-of-table precheck could
fail ... even though the translate data would never actually result in
a reference to protected storage.

so now, translate and translate-and-test instructions have a pretest
for whether the table is within 256 bytes of a page boundary ... if
not, the instruction executes as it has since 360 days. if the target
table is within 256 bytes of the end of a page, it may be necessary to
execute the instruction incrementally, byte-by-byte (more like
MVCL/CLCL).

--
virtualization experience starting Jan1968, online at home since Mar1970
From: Nick Maclaren on 9 Aug 2010 13:10

In article <m3sk2n3g5m.fsf(a)garlic.com>,
Anne & Lynn Wheeler <lynn(a)garlic.com> wrote:
>
>370 introduced a couple instructions that would execute incrementally
>(MVCL & CLCL) ... although there were some early machines that had
>microcode implementation bugs ... that would pretest the end of
>MVCL/CLCL operand before starting execution.

Yes - I was the first person to find one of them :-)

Regards,
Nick Maclaren.
From: Rob Warnock on 9 Aug 2010 23:01

Anne & Lynn Wheeler <lynn(a)garlic.com> wrote:
+---------------
| rpw3(a)rpw3.org (Rob Warnock) writes:
| > ISTR that the reason the 370 had (at least) 4-way associative TLBs
| > was that there were certain instructions that could not make forward
| > progress unless there were *eight* pages mapped simultaneously, of
| > which up to four could collide in the TLB. The famous example of such
| > was the Translate-And-Test instruction in the situation in which the
| > instruction itself, the source buffer, the destination buffer, and
| > the translation table *all* spanned a page boundary, which obviously
| > needs valid mappings for at least eight pages. [But only four could
| > collide per TLB line.]
|
| minor trivia ... translate and translate-and-test operated against the
| source buffer ... using the translation table (which could cross a
| page boundary). the two additional possible page references arose
| because, instead of executing the instruction directly, the
| instruction could be the target of an "EXECUTE" instruction; the
| 4-byte EXECUTE instruction might also cross a page boundary
+---------------

Aha! Thanks for the correction/clarification. At least I did remember
correctly that you needed eight PTEs in a 4-way TLB... ;-}

-Rob

-----
Rob Warnock <rpw3(a)rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607
From: MitchAlsup on 10 Aug 2010 13:12
On Aug 6, 5:02 pm, n...(a)gosset.csi.cam.ac.uk (Nick Maclaren) wrote:
> In article <68e616b4-c45d-4b58-a1ed-9bb08f9ae...(a)t20g2000yqa.googlegroups.com>,
> MitchAlsup <MitchAl...(a)aol.com> wrote:
>
> >> What do modern hardware reloaders do when they get multiple,
> >> CLASHING, TLB uses before the first has completed? That can clearly
> >> happen if more memory accesses are allowed than TLB associativity,
> >> and possibly in other ways. Or do they ensure that can't happen?
>
> >While I cannot speak for other designers.....
>
> >This seems to be the kind of problem you simply ignore--to the extent
> >possible. That is, you walk the page tables, install the PTE, and go
> >on with life. If you happen to be walking a set of addresses that will
> >cause thrashing in the PTE, you simply ignore the problem, install the
> >PTE of the moment and go on with life.
>
> I was afraid of that :-(

It is the nature of the complexity of processor design. See below.

> >In practice, if you have any of these (or have any notion of being hit
> >with one of these), you will simulate a lot of configurations of the
> >TLB and <hopefully> not observe a severe "fall on the face" problem.
> >If you do observe a problem in TLB performance, you will reconfigure
> >the TLB so as to not have one of these. Hopefully with some analysis
> >of why it happened, and this, then, will drive the reconfiguration.
>
> I have heard of systems that deadlocked, but the only case I have
> actually seen is when I managed to deadlock an SGI Origin, and that
> had the extra amusement of cross-board TLB misses ....

I thought this thread was on hardware TLB reloaders.......SGI did not
have one.....

In any event, with the frequency of today's processors it is
unrealistic to make a fully associative store much bigger than 64
entries and cycle it in an appropriate amount of time.
Thus, the microarchitect is basically reduced to three choices:
a) fully associative and 64 entries, b) set associative and built with
SRAM, or c) a bit of both.

An SRAM block contains either 2 KBytes or 4 KBytes depending on
leakage, wire delay and timing requirements. The array will have
between 128 and 300 bits of storage per word at the bit lines. Using
48-bit virtual and 48-bit physical addresses takes 96-100 bits per
entry, and thus one can make a 3-way set associative TLB from a single
SRAM that looks and smells like the same SRAM array that cache data is
stored in, saving design time. More virtual or more physical bits, or
more kinds of control over the pages, will drop the associativity to
2-way. So, with a single SRAM array, one can get 384 pages mapped with
3 ways of associativity, or 768 pages with 2 SRAMs in a 6-way set
associative manner.

Another organization would be to associate two successive virtual
pages with a single tag. In the fully associative case, this leads to
128 entries, 64-way associative, 2 PTEs/tag; in the SRAM organization,
512 PTEs in 2-way set associative form or 1024 in 4-way set
associative form. The utility of the second entry is "not so bad" for
code and "not so good" for data, and these organizations don't perform
usefully better than those with fewer but more associatively organized
entries, and thus have fallen from favor.

Athlon and (at least early) Opteron processors backed up their
48-entry (growing to 64-entry later) first-level TLBs with a 512-entry
SRAM-based 4-way set associative second-level TLB. The combined
footprint was "not so bad", the combined hit rate was "acceptable",
and the area penalty was "not hideous".

Due to the size of the SRAM organizations, they are appropriate for
the (circa 2007) higher-end microprocessors, but not so much for
desktops. And it is easy to see that processor designers targeting
servers (exclusively) will use such an organization and deliver large
TLBs.
The problem for x86 users is that the total volume of servers
represents an afternoon in the FAB per year of production. It is
thereby uneconomic (remember, circa 2007) to put the larger TLBs on an
x86 die. {And that is a shame...but I digress}

Servers, and especially database servers, would like to map the
database working store with as few PTEs as possible so the rest of the
applications making demands on the database can proceed smoothly with
little user TLB thrashing against DB activities. Databases can be
funny animals: at startup they might allocate 1/2 of main memory (say
a dozen gigabytes in 2010 terms), and then the DB itself manages the
paging to and from this area using algorithms tuned to the database's
needs (rarely page-sized) and very non-desirous of OS management of
the database pages (locked-down pages as seen from the OS).

So, here, it is actually appropriate to utilize really, really large
pages. We proposed a 1 GB PTE for K9 (which is just the next natural
x86 architectural boundary bigger than 2 MB). This kind of PTE would
basically make the database table walks vanish--especially if one
could make the TLB holding big pages independent of the TLB holding
the littler pages. One could use a separate CAM array or a separate
SRAM array as appropriate. But few other than databases could make use
of such large pages (even numerics don't like them so well), and it is
doubtful that an OS would like to manage three distinct sizes of
pages. There are more than a few people still wanting smaller
pages.....

So the microarchitect has basically 2 choices: a big fully associative
TLB (where big is less than 70 entries) or a small number of SRAM
arrays (where small is 1, 2, 3, 4) used in a set associative manner.
This is what I mean by basically ignoring the problem: you choose the
FA TLB if you are selling to the general marketplace, and you choose
the SRAM organization if you are selling to servers.
Minor details of the architecture will determine how the bits lay out
and how many sets you get with the SRAM approach.

Mitch