From: Paul A. Clayton on
Would it be helpful for an ISA targeting code density for
the sake of price/performance to use bits similar to IPF's
template bits in a fixed-sized instruction block to
indicate the encoding format used for that block? Such
might allow complex/variable formats of instructions (for
code density) while allowing parallel decode.

I very much doubt that template bits would be a win if only
code density was considered (I suspect that providing an
instruction that changed the encoding mode in a context
sensitive way would better fit the relative variability or
large scale of mode changes); but it seems that it might
be possible to have fast parallel decode (with power-
sensitive logic) and significant variability in encoding
formats (for code density) by using a small number of
bits to express a template.
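
To make the idea concrete, here is a minimal sketch of a template-tagged
block. Everything in it is an assumption for illustration: the 64-bit block,
the 4-bit template in the low bits, and the table of slot widths are all
hypothetical, not taken from IPF or any real ISA. The point is only that
every slot boundary is known from the template alone, so the slots can be
extracted independently of one another.

```python
from itertools import accumulate

BLOCK_BITS = 64          # hypothetical block size
TEMPLATE_BITS = 4        # hypothetical template width
PAYLOAD_BITS = BLOCK_BITS - TEMPLATE_BITS

# Hypothetical templates: each maps to the bit widths of the slots in the
# block. Every entry must fill the 60 payload bits exactly.
TEMPLATES = {
    0b0000: (30, 30),
    0b0001: (20, 20, 20),
    0b0010: (15, 15, 30),
    0b0011: (30, 15, 15),
    0b0100: (15, 15, 15, 15),
    0b0101: (12, 12, 12, 12, 12),
}


def split_block(block):
    """Return a list of (width, bits) pairs, one per instruction slot."""
    template = block & ((1 << TEMPLATE_BITS) - 1)
    widths = TEMPLATES[template]
    payload = block >> TEMPLATE_BITS
    # Slot offsets depend only on the template, never on slot contents,
    # which is what would let hardware extract all slots in parallel.
    offsets = [0, *accumulate(widths)][:-1]
    return [(w, (payload >> off) & ((1 << w) - 1))
            for w, off in zip(widths, offsets)]


block = 0b0001 | (0x11111 << 4) | (0x22222 << 24) | (0x33333 << 44)
slots = split_block(block)
```

Here `slots` holds three 20-bit fields. A real design would also have to
decide what happens to instructions that do not fit any template, which
this sketch ignores.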

It seems that there would be obvious tension between
the sizes of the template and the block while balancing
code density and decoding complexity. Larger blocks
would allow for larger templates (at the same fraction
of the block size) which seems likely to provide closer
fits to the ideally dense encoding (larger blocks would
also allow individual instructions to be placed that would
be split between blocks with smaller blocks); but
larger blocks imply a coarser fetch granularity (which
might not be attractive in terms of power efficiency
when the issue width of the processor is modest).
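
The scaling argument above can be put in numbers. This is only a sketch:
the overhead of one template bit per 16 block bits is an assumed figure,
and `template_options` is a hypothetical helper, not part of any tool.

```python
def template_options(block_bytes, bits_per_template_bit=16):
    """Template width and number of expressible formats for a block,
    assuming one template bit per `bits_per_template_bit` block bits."""
    template_bits = (block_bytes * 8) // bits_per_template_bit
    return template_bits, 2 ** template_bits


# Same fractional overhead, very different numbers of formats.
for size in (16, 32, 64):
    bits, formats = template_options(size)
    print(f"{size:2d}B block: {bits:2d} template bits, {formats} formats")
```

At a constant fraction of the block, the number of expressible formats
grows exponentially with block size, which is why a larger block can fit
the ideal encoding more closely.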

While it would be trivial predecoding to replicate a
fixed bitfield into subblocks in the instruction cache,
it is also somewhat desirable (from a performance/
power perspective) to maintain some degree of
code density in the instruction cache. (It would
also be desirable for the sake of simpler [more
power efficient] decoding to have fewer templates
when using simple replication predecoding as each
subblock decoder would have to interpret the
template [rather than a smaller template representing
the local variability as such would require more
sophisticated predecode] .)

Obviously, the size of the block whose encoding is
expressed by the template should be less than or
equal to at least the size of an L2 cache block (if L2
contains a partially predecoded representation) and
probably the size of the L1 instruction cache block;
so a maximum size of the encoding block would be
about 32 or 64 bytes.

It would also seem that there would be greater
predecode overhead with a single template per
64B block (information from the middle would
need to be communicated across approximately
256 bit-widths). (Might it be reasonable for
information from multiple templates within a cache
block to be communicated to the tail of the
instruction predecodes, so cache-block global
information could be used to refine/extend the
predecoding while avoiding delay in starting the
predecode?)

(Note this thought problem assumes at least some
predecoding is desirable between the main
memory representation and the instruction cache
representation for a price/performance processor
architecture. It also assumes that such a code
density mechanism would be desirable for ideal
price/performance.)

I am somewhat inclined against more dynamic
context-sensitive decode like MIPS16 (which
provides a dense mode and a jump [and link,
register and PC-related; for returns the least
significant bit of the return address, which is
otherwise unused given 2 byte granularity of
dense mode encoding, is used to indicate the
mode of the caller] into dense mode encoded
function). Such complicates block-level
predecode since a block might contain code
from a dense function and a non-dense function.
(One could make a prediction about the mode
of the other code and predecode the whole
block with a misprediction being handled like a
parity exception [fetch from lower in the
memory hierarchy and use the mode from the
earlier called function, which caused the ICache
fill, and the now known mode of the later called
function].) (It might be desirable to leave open
the possibility of relatively simple highly parallel
instruction translation by software; dynamic
modes and block-crossing information makes
parallelism more difficult.)
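
A small sketch of the mode-in-the-low-bit convention described above. The
function names are hypothetical; the only fact assumed is the one stated
in the text, that dense-mode code has 2-byte granularity so bit 0 of an
address is free to carry the mode.

```python
def link_address(return_pc, caller_is_dense):
    """Build the value saved by a jump-and-link: the return address with
    the caller's mode folded into the otherwise unused low bit."""
    return return_pc | int(caller_is_dense)


def split_target(addr):
    """Split a register jump target into (pc, dense_mode)."""
    return addr & ~1, bool(addr & 1)


saved = link_address(0x1000, caller_is_dense=True)
pc, dense = split_target(saved)
```

The mode is therefore a property of how control arrived, not of the bytes
themselves, which is why a predecoder looking at a whole cache block
cannot know the mode of every instruction in it.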

(The questions presented here might not be
answerable by a quick application of an
expert's knowledge. It might even be
necessary to have experience in _several_
implementations of at least a few
encoding formats to know whether such
a template-assisted decode would be
appropriate for a price/performance-
oriented ISA.)



Paul A. Clayton
just a technophile (with no HDL knowledge)
reachable as 'paaronclayton'
at "embarqmail.com"

From: MitchAlsup on
It is well known that the Athlon/Opteron processor families invent
these predecode bits and save with the instruction bytes in the
instruction cache. The first time through the decode process,
instructions are decoded 4 bytes at a time, and the ending byte is
marked. On subsequent decode cycles a find-first scanner is used to
deliver independent pointers to additional decoders which all operate
in parallel. As long as one can arrange the decode pipeline to absorb
the 5 gate overhead of the (2 additional instructions) scanner, one
can decode many byte length instructions at a time.
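
A software sketch of the two passes described above, under loud
assumptions: `toy_length` is a made-up length rule (not x86), and the
loops are sequential where the hardware scanner would be combinational.
It only illustrates how end-byte marks turn boundary finding into a
find-first problem.

```python
def toy_length(code, pc):
    """Hypothetical length rule: low 2 bits of the first byte, plus 1."""
    return (code[pc] & 0x3) + 1


def mark_ends(code, length_of=toy_length):
    """First pass: decode sequentially and mark each instruction's last byte."""
    marks = [0] * len(code)
    pc = 0
    while pc < len(code):
        n = length_of(code, pc)
        if pc + n > len(code):
            break                      # instruction spills past this chunk
        marks[pc + n - 1] = 1
        pc += n
    return marks


def start_pointers(marks, first=0, count=3):
    """Later passes: a find-first scan over the stored end marks yields up
    to `count` instruction start offsets, one per decoder."""
    starts = [first]
    pos = first
    while len(starts) < count:
        try:
            end = marks.index(1, pos)  # find first end mark at or after pos
        except ValueError:
            break
        pos = end + 1
        if pos >= len(marks):
            break
        starts.append(pos)
    return starts


code = bytes([0x00, 0x02, 0xAA, 0xBB, 0x01, 0xCC, 0x00])
marks = mark_ends(code)
starts = start_pointers(marks)
```

With `count=3` the scan models one decoder plus the two additional ones
mentioned above; no decoder needs to know an earlier instruction's length
once the marks are stored.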

Thus, don't waste your time marking this stuff at the instruction bit-
pattern level, but DO mark this stuff at the cache level where it's
almost free.

Mitch
From: Paul A. Clayton on
On Apr 17, 1:10 pm, MitchAlsup <MitchAl...(a)aol.com> wrote:
> It is well known that the Athlon/Opteron processor families invent
> these predecode bits and save with the instruction bytes in the
> instruction cache. The first time through the decode process,
> instructions are decoded 4 bytes at a time, and the ending byte is
> marked. On subsequent decode cycles a find-first scanner is used to
> deliver independent pointers to additional decoders which all operate
> in parallel. As long as one can arrange the decode pipeline to absorb
> the 5 gate overhead of the (2 additional instructions) scanner, one
> can decode many byte length instructions at a time.
>
> Thus, don't waste your time marking this stuff at the instruction bit-
> pattern level, but DO mark this stuff at the cache level where it's
> almost free.

Thank you very much for the reply (from a professional who no
doubt has other things available for the use of his time).

So you do not think it worthwhile to allow the initial (pre)decode
to be done in parallel? (I admit, after I thought about this a
little more, I realized that one is most likely to have a
somewhat fast and narrow interface to L2; so predecode
should probably not depend on things outside of, say, 16B
sections. This might also justify relatively limited predecode
bandwidth.)

It seemed to me that one could develop a relatively complex
encoding yet have the advantage of cheap (power/area) fast
(parallel) decode by bundling instructions into a fixed sized
format (like the CDC 6600) and providing indicators of the
specific format (e.g., where the register IDs are; other than
determining register type, the opcode and immediate values
could be handled less quickly).

Code density seems desirable, but pipeline-aware/friendly
and parallelizable decode seems important.

Again thanks, Mitch, for the response.



Paul A. Clayton
just a technophile
reachable as 'paaronclayton'
at "embarqmail.com"
From: MitchAlsup on
On Apr 18, 3:12 pm, "Paul A. Clayton" <paaronclay...(a)earthlink.net>
wrote:
> Thank you very much for the reply (from a professional who no
> doubt has other things available for the use of his time).
>
> So you do not think it worthwhile to allow the initial (pre)decode
> to be done in parallel?  (I admit, after I thought about this a
> little more, I realized that one is most likely to have a
> somewhat fast and narrow interface to L2; so predecode
> should probably not depend on things outside of, say, 16B
> sections.  This might also justify relatively limited predecode
> bandwidth.)

You should be aware that Opteron stores these predecode markers in the
L2 (they take the place of the ECC bits, and the entire (sub)line is
protected by parity). In addition, some branch prediction info is also
kept in the L2. The multiprocessor versions, then, share these
predecode bits across processors, saving even more time for shared
code.

> It seemed to me that one could develop a relatively complex
> encoding yet have the advantage of cheap (power/area) fast
> (parallel) decode by bundling instructions into a fixed sized
> format (like the CDC 6600) and providing indicators of the
> specific format (e.g., where the register IDs are; other than
> determining register type, the opcode and immediate values
> could be handled less quickly).

If you are building an out-of-order machine with all of the backing-up
accoutrements and a fairly deep pipe, then the cost of adding a single
stage to the decode pipeline (to deal with this stuff) is about 1%-ish
(16-18 pipe stages from fetch through retire). On the other hand, if
you are making an in-order short pipeline (5-ish) then instruction
decode is NOT the major pipe problem you face; figuring out where the
register index(es) are and accessing the register file IS, because you
have another full cycle to figure out what to do with the operands you
just read-out/forwarded. Thus all the fixed format stuff is about
fixing where the register indexes come from and not about simplifying
the decoding of the operation that gets associated with these register
indexes.
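
A sketch of that distinction, with both layouts invented for
illustration (a 6-bit opcode at the top of a 32-bit word and 5-bit
register fields; neither is claimed to be any particular ISA). With fixed
fields the register indexes come straight from the instruction bits; with
opcode-dependent fields a lookup sits in front of the register file read.

```python
REG_MASK = 0x1F  # 5-bit register index, hypothetical


def fixed_register_indexes(word):
    """Fixed layout: rs1[25:21], rs2[20:16], rd[15:11]. No opcode needed,
    so the register file read can start as soon as the word arrives."""
    return ((word >> 21) & REG_MASK,
            (word >> 16) & REG_MASK,
            (word >> 11) & REG_MASK)


# Hypothetical variable layout: field positions depend on the opcode.
VARIABLE_LAYOUT = {
    0x01: (21, 16, 11),
    0x02: (11, 21, 16),
}


def variable_register_indexes(word):
    """Variable layout: the opcode lookup must finish before any register
    index is known, which lengthens the path to the register file."""
    opcode = (word >> 26) & 0x3F
    shifts = VARIABLE_LAYOUT[opcode]
    return tuple((word >> s) & REG_MASK for s in shifts)


fields = (5 << 21) | (6 << 16) | (7 << 11)
w1 = (0x01 << 26) | fields
w2 = (0x02 << 26) | fields
```

Decoding what the operation is can proceed alongside the register read in
either case; only the second function puts a dependency ahead of it.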

Thus, I think you are chasing performance that is unlikely to show up
on the bottom line, but may look good on paper until you realize that
you are simply moving bubbles around in the pipeline.

> Code density seems desirable, but pipeline-aware/friendly
> and parallelizable decode seems important.

Once again, I think that Athlon and Opteron have shown that byte level
decoding is actually more valuable, since you can arbitrarily extend
the instruction set as you desire without any particularly nasty
ramifications. When RISC was born in the early 1980's we all thought
decoding byte level (ahem) x86 stuff was going to be really bad for a
really long time. However, the amount of cubic dollars Intel and AMD
put into making this problem "go away" actually made this problem "go
away" in every practical sense of the words. The overhead is virtually
nil in both performance and in storage domains. I should also point
out that Intel came up with a way of solving this set of issues that
does not require marker bits for storage, but consumes a bit of extra
power in the decode process. Thus, the problem of byte level decode
really has been made to "go away".

In closing, I would, however, argue that pursuing the instruction
boundary smaller than a byte is fraught with peril, and overheads
would be greatly magnified.

> Again thanks, Mitch, for the response.

Mitch