From: Paul A. Clayton on 16 Apr 2008 20:47

Would it be helpful for an ISA targeting code density for the sake of price/performance to use bits similar to IPF's template bits in a fixed-size instruction block to indicate the encoding format used for that block? Such might allow complex/variable instruction formats (for code density) while allowing parallel decode.

I very much doubt that template bits would be a win if only code density were considered (I suspect that providing an instruction that changed the encoding mode in a context-sensitive way would better fit the relative variability or large scale of mode changes); but it seems that it might be possible to have fast parallel decode (with power-sensitive logic) and significant variability in encoding formats (for code density) by using a small number of bits to express a template.

It seems that there would be obvious tension between the sizes of the template and the block while balancing code density and decoding complexity. Larger blocks would allow for larger templates (at the same fraction of the block size), which seems likely to provide closer fits to the ideally dense encoding (larger blocks would also allow individual instructions to be placed that would be split between blocks with smaller blocks); but larger blocks imply a coarser fetch granularity (which might not be attractive in terms of power efficiency when the issue width of the processor is modest).

While it would be trivial predecoding to replicate a fixed bitfield into subblocks in the instruction cache, it is also somewhat desirable (from a performance/power perspective) to maintain some degree of code density in the instruction cache. (It would also be desirable for the sake of simpler [more power efficient] decoding to have fewer templates when using simple replication predecoding, as each subblock decoder would have to interpret the template [rather than a smaller template representing the local variability, as such would require more sophisticated predecode].)

Obviously, the size of the block whose encoding is expressed by the template should be less than or equal to at least the size of an L2 cache block (if L2 contains a partially predecoded representation) and probably the size of the L1 instruction cache block; so a maximum size of the encoding block would be about 32 or 64 bytes. It would also seem that there would be greater predecode overhead with a single template per 64B block (information from the middle would need to be communicated across approximately 256 bit-widths). (Might it be reasonable for information from multiple templates within a cache block to be communicated to the tail of the instruction predecodes, so cache-block global information could be used to refine/extend the predecoding while avoiding delay in starting the predecode?)

(Note this thought problem assumes at least some predecoding is desirable between the main memory representation and the instruction cache representation for a price/performance processor architecture. It also assumes that such a code density mechanism would be desirable for ideal price/performance.)

I am somewhat inclined against more dynamic context-sensitive decode like MIPS16 (which provides a dense mode and a jump [and link, register and PC-related; for returns the least significant bit of the return address, which is otherwise unused given the 2-byte granularity of dense mode encoding, is used to indicate the mode of the caller] into a dense-mode-encoded function). Such complicates block-level predecode since a block might contain code from a dense function and a non-dense function. (One could make a prediction about the mode of the other code and predecode the whole block, with a misprediction being handled like a parity exception [fetch from lower in the memory hierarchy and use the mode from the earlier called function, which caused the ICache fill, and the now known mode of the later called function].)

(It might be desirable to leave open the possibility of relatively simple, highly parallel instruction translation by software; dynamic modes and block-crossing information make parallelism more difficult.)

(The questions presented here might not be answerable by a quick application of an expert's knowledge. It might even be necessary to have experience in _several_ implementations of at least a few encoding formats to know whether such a template-assisted decode would be appropriate for a price/performance-oriented ISA.)

Paul A. Clayton
just a technophile (with no HDL knowledge)
reachable as 'paaronclayton' at "embarqmail.com"
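A minimal sketch of the template idea in the post above, to make the parallel-decode argument concrete. Everything specific here is an assumption for illustration and not from the post: the 64-bit block, the 4-bit template field in the low bits, the four slot layouts, and the names `TEMPLATES`, `slot_layout`, `encode_block` and `split_block`. The point it tries to show is that every slot boundary follows from the template alone, so each slot could be handed to its own decoder without first decoding its neighbours.

```python
BLOCK_BITS = 64     # hypothetical block size (the post considers up to 32 or 64 bytes)
TEMPLATE_BITS = 4   # hypothetical template field, kept in the low bits of the block
PAYLOAD_BITS = BLOCK_BITS - TEMPLATE_BITS

# Hypothetical templates: template id -> slot widths in bits (each sums to 60).
TEMPLATES = {
    0: (30, 30),
    1: (15, 15, 30),
    2: (15, 15, 15, 15),
    3: (20, 20, 20),
}


def slot_layout(template):
    """Return (bit offset, width) for every slot, using only the template."""
    widths = TEMPLATES[template]
    if sum(widths) != PAYLOAD_BITS:
        raise ValueError("template does not fill the block")
    layout = []
    offset = TEMPLATE_BITS
    for width in widths:
        layout.append((offset, width))
        offset += width
    return layout


def encode_block(template, slots):
    """Pack slot values into one block according to the chosen template."""
    layout = slot_layout(template)
    if len(slots) != len(layout):
        raise ValueError("wrong number of slots for this template")
    block = template
    for value, (offset, width) in zip(slots, layout):
        if value < 0 or value >> width:
            raise ValueError("slot value does not fit its width")
        block |= value << offset
    return block


def split_block(block):
    """Extract the template and all slots; each slot is independent of the others."""
    template = block & ((1 << TEMPLATE_BITS) - 1)
    slots = [(block >> offset) & ((1 << width) - 1)
             for offset, width in slot_layout(template)]
    return template, slots


if __name__ == "__main__":
    block = encode_block(1, [0x1234, 0x0ABC, 0x3FFFFFFF])
    print(hex(block), split_block(block))
```

The tension described in the post shows up directly in the two constants: a larger `TEMPLATE_BITS` allows more layouts (a closer fit to the ideal dense encoding) but costs payload bits unless `BLOCK_BITS` grows with it, which in turn coarsens fetch granularity.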
From: MitchAlsup on 17 Apr 2008 13:10

It is well known that the Athlon/Opteron processor families invent these predecode bits and save them with the instruction bytes in the instruction cache. The first time through the decode process, instructions are decoded 4 bytes at a time, and the ending byte is marked. On subsequent decode cycles a find-first scanner is used to deliver independent pointers to additional decoders, which all operate in parallel. As long as one can arrange the decode pipeline to absorb the 5-gate overhead of the (2 additional instructions) scanner, one can decode many byte length instructions at a time.

Thus, don't waste your time marking this stuff at the instruction bit-pattern level, but DO mark this stuff at the cache level, where it's almost free.

Mitch
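A rough software sketch of the scheme described in the post above: a slow first pass marks the last byte of each instruction, and later passes recover instruction start pointers from the stored markers with a find-first-set scan instead of decoding lengths again. The instruction encoding is a toy assumption (the low 2 bits of the first byte give length minus one), and the names `toy_length`, `mark_end_bytes` and `find_starts` are hypothetical. It does not model the 4-bytes-at-a-time first pass or any real Athlon/Opteron detail.

```python
def toy_length(first_byte):
    """Hypothetical toy encoding: low 2 bits of the first byte give length - 1."""
    return (first_byte & 0b11) + 1


def mark_end_bytes(code):
    """Slow first pass: walk the instructions in order and return a bitmask
    with bit i set when byte i is the last byte of an instruction."""
    mask = 0
    pos = 0
    while pos < len(code):
        end = pos + toy_length(code[pos]) - 1
        if end >= len(code):
            break  # instruction spills past this line; left unmarked in this sketch
        mask |= 1 << end
        pos = end + 1
    return mask


def find_starts(end_mask, line_len, count):
    """Later passes: derive up to `count` instruction start offsets from the
    stored end markers alone, using repeated find-first-set."""
    # An instruction starts at byte 0 and right after every marked end byte.
    start_mask = ((end_mask << 1) | 1) & ((1 << line_len) - 1)
    starts = []
    while start_mask and len(starts) < count:
        lowest = start_mask & -start_mask      # isolate the lowest set bit
        starts.append(lowest.bit_length() - 1)
        start_mask ^= lowest                   # clear it and scan for the next
    return starts


if __name__ == "__main__":
    code = bytes([0x01, 0xAA, 0x00, 0x02, 0xBB, 0xCC])  # lengths 2, 1, 3
    ends = mark_end_bytes(code)
    print(bin(ends), find_starts(ends, len(code), 3))
```

In hardware the start pointers would come out of the scanner together; the loop here only stands in for that logic. The marker mask is per cache line, which is why it can be stored next to the instruction bytes as the post suggests.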
From: Paul A. Clayton on 18 Apr 2008 17:12

On Apr 17, 1:10 pm, MitchAlsup <MitchAl...(a)aol.com> wrote:
> It is well known that the Athlon/Opteron processor families invent
> these predecode bits and save them with the instruction bytes in the
> instruction cache. The first time through the decode process,
> instructions are decoded 4 bytes at a time, and the ending byte is
> marked. On subsequent decode cycles a find-first scanner is used to
> deliver independent pointers to additional decoders, which all operate
> in parallel. As long as one can arrange the decode pipeline to absorb
> the 5-gate overhead of the (2 additional instructions) scanner, one
> can decode many byte length instructions at a time.
>
> Thus, don't waste your time marking this stuff at the instruction
> bit-pattern level, but DO mark this stuff at the cache level, where
> it's almost free.

Thank you very much for the reply (from a professional who no doubt has other things available for the use of his time).

So you do not think it worthwhile to allow the initial (pre)decode to be done in parallel? (I admit, after I thought about this a little more, I realized that one is most likely to have a somewhat fast and narrow interface to L2; so predecode should probably not depend on things outside of, say, 16B sections. This might also justify relatively limited predecode bandwidth.)

It seemed to me that one could develop a relatively complex encoding yet have the advantage of cheap (power/area), fast (parallel) decode by bundling instructions into a fixed-size format (like the CDC 6600) and providing indicators of the specific format (e.g., where the register IDs are; other than determining register type, the opcode and immediate values could be handled less quickly).

Code density seems desirable, but pipeline-aware/friendly and parallelizable decode seems important.

Again thanks, Mitch, for the response.

Paul A. Clayton
just a technophile
reachable as 'paaronclayton' at "embarqmail.com"
From: MitchAlsup on 21 Apr 2008 12:24

On Apr 18, 3:12 pm, "Paul A. Clayton" <paaronclay...(a)earthlink.net> wrote:
> Thank you very much for the reply (from a professional who no
> doubt has other things available for the use of his time).
>
> So you do not think it worthwhile to allow the initial (pre)decode
> to be done in parallel? (I admit, after I thought about this a
> little more, I realized that one is most likely to have a
> somewhat fast and narrow interface to L2; so predecode
> should probably not depend on things outside of, say, 16B
> sections. This might also justify relatively limited predecode
> bandwidth.)

You should be aware that Opteron stores these predecode markers in the L2 (they take the place of the ECC bits, and the entire (sub)line is protected by parity). In addition, some branch prediction info is also kept in the L2. The multiprocessor versions, then, share these predecode bits across processors, saving even more time for shared code stuff.

> It seemed to me that one could develop a relatively complex
> encoding yet have the advantage of cheap (power/area) fast
> (parallel) decode by bundling instructions into a fixed sized
> format (like the CDC 6600) and providing indicators of the
> specific format (e.g., where the register IDs are; other than
> determining register type, the opcode and immediate values
> could be handled less quickly).

If you are building an out-of-order machine with all of the backing-up accoutrements and a fairly deep pipe, then the cost of adding a single stage to the decode pipeline (to deal with this stuff) is about 1%-ish (16-18 pipe stages from fetch through retire).

On the other hand, if you are making an in-order short pipeline (5-ish stages), then instruction decode is NOT the major pipe problem you face; figuring out where the register index(es) are and accessing the register file IS, because you have another full cycle to figure out what to do with the operands you just read-out/forwarded. Thus all the fixed format stuff is about fixing where the register indexes come from and not about simplifying the decoding of the operation that gets associated with these register indexes.

Thus, I think you are chasing performance that is unlikely to show up on the bottom line, but may look good on paper until you realize that you are simply moving bubbles around in the pipeline.

> Code density seems desirable, but pipeline-aware/friendly
> and parallelizable decode seems important.

Once again, I think that Athlon and Opteron have shown that byte level decoding is actually more valuable, since you can arbitrarily extend the instruction set as you desire without any particularly nasty ramifications.

When RISC was born in the early 1980s we all thought decoding byte level (ahem) x86 stuff was going to be really bad for a really long time. However, the amount of cubic dollars Intel and AMD put into making this problem "go away" actually made this problem "go away" in every practical sense of the words. The overhead is virtually nil in both performance and in storage domains.

I should also point out that Intel came up with a way of solving this set of issues that does not require marker bits for storage, but consumes a bit of extra power in the decode process. Thus, the problem of byte level decode really has been made to "go away".

In closing, I would, however, argue that pursuing an instruction boundary smaller than a byte is fraught with peril, and overheads would be greatly magnified.

> Again thanks, Mitch, for the response.

Mitch
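A small sketch of the point made in the post above about fixed formats being mostly about where the register indexes come from. Both instruction layouts are invented for illustration (a 6-bit opcode in the low bits, 5-bit register indexes), as are the names `regs_fixed`, `regs_by_opcode` and `FIELD_POS`. In the fixed layout the register file read can begin without looking at the opcode; in the other layout the opcode has to be examined first to learn where the fields sit.

```python
OPCODE_MASK = 0x3F  # hypothetical 6-bit opcode in bits 0-5
REG_MASK = 0x1F     # hypothetical 5-bit register index


def regs_fixed(insn):
    """Fixed format: register indexes always sit in bits 6-10 and 11-15,
    so they can be extracted without decoding the opcode."""
    return (insn >> 6) & REG_MASK, (insn >> 11) & REG_MASK


# Hypothetical variable-position format: the opcode selects the field offsets.
FIELD_POS = {
    0x01: (6, 11),
    0x02: (8, 16),
}


def regs_by_opcode(insn):
    """Variable format: the opcode lookup must finish before the register
    indexes can even be located."""
    first, second = FIELD_POS[insn & OPCODE_MASK]
    return (insn >> first) & REG_MASK, (insn >> second) & REG_MASK


if __name__ == "__main__":
    a = 0x01 | (3 << 6) | (7 << 11)
    b = 0x02 | (5 << 8) | (9 << 16)
    print(regs_fixed(a), regs_by_opcode(a), regs_by_opcode(b))
```

The dictionary lookup in `regs_by_opcode` stands in for the extra dependent step that a short in-order pipeline would have to fit before the register file access.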