From: nmm1 on
In article <ggtgp-1B5ECC.00221217112009(a)netnews.asp.att.net>,
Brett Davis <ggtgp(a)yahoo.com> wrote:
>In article <4B00EB3A.3060700(a)patten-glew.net>,
> "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> wrote:
>
>> I must admit that I am puzzled by using loop-iteration SpMT if you can
>> do the bypassing between the clusters. I guess that you are using that
>> "batching" to hopefully reduce inter-cluster bypassing. But then I am
>> not a big fan of inner loop SpMT. Loops are easy, and are already done
>> pretty well.
>
>"Loops are easy." ;) Pray tell where these plentiful easy to speedup
>areas are in CPU design. ;) Run strait into a brick wall you have. ;)

What I call vectorisable codes. A solved problem since the 1970s.
Unfortunately, outside HPC, they are rare - and, even in that, they
don't dominate.


Regards,
Nick Maclaren.
From: Mayan Moudgill on
Niels J�rgen Kruse wrote:
> Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>
>
>>The physical layout matters a lot, and hence has its own terminology.
>>
>>For example, most datapaths are bit interleaved - the actual wires might
>>look like
.....
>>Bit interleaving makes bypassing of same bit to same bit easier.
>>
>>Unfortunately, bit interleaving introduces O(N^2) factor to the area.
>>Doesn't matter for small degree of superscalarness, but matters as you
>>get bigger, as you become wire limited.
>>
.....
>>
>>For a while I was puzzled by IBM's Power family, e.g. Power 4, which
>>seemed to be on the low side. Until I was told that there were extra
>>wires not related to superscalarness, for DMA, etc.,
>
>
> Die Photos of PPC970, POWER4 and POWER5 clearly have 2 identical integer
> units replicated, so I doubt they are interleaved.
>

I'll preface this response by saying that I'm not very sure that a good
full custom design might not be able to work around some of the
problems, and that I am not practiced in this area.

Using bit interleaving for bypassing works something like this for two
ALUs, each with 2 inputs and 1 output.
+----------
v XXXX | YYYY
---> MX3==============> YYYY
^ XXXX | YYYY
<=====+=======================
v XXXX | YYYY ||
---> MX3 --> XXXX | YYYY ||
^ XXXX | YYYY ||
<-----+------------ YYYY ||
v XXXX | YYYY ||
---> MX3 =============> YYYY ||
^ XXXX | YYYY ||
+=======================
v XXXX | YYYY
---> MX3 --> XXXX | YYYY
^ XXXX | YYYY
+------------

Here X's and Y's are the logic for the two ALUs, the --- are the wires
for the X ALU input and output, the === are the wires for the Y ALU
input and output. Control wires are not shown. This cell will be (kind
of) replicated 32 times for a 32-bit ALU. Nothing is to scale; in
particular, in real life the muxes might stretch all the way across the
height of the cell, and the wires would be much closer together (or even
ontop of each other).

At the left we have the path to the register file. The register file and
the result values feed four 3-1 bypass muxes; two select the X inputs
and two select Y inputs. The X/Y inputs are then sent to the appropriate
ALU cell.

[The reason why the cells can't be replicated exactly are the
propagate/generate trees for adds and the compare result trees; hence
the empty space between the cells and muxes]

Now, consider where this is useful:
- X must be tall so that there is space for the wires to go over it
- X must be skinny so that the additional wire delay does not impact
cycle time
- X must not have many layers of metal so that it is possible to route
Y's wires over it
- Y must not have too large a cycle time, so that the additional wire
delay (of routing over X) does not kill performance
- X and Y must have roughly the same height.

It pretty much restricts this to simple ALUs; add/subtracts are probably
as complex operations as you can sustain. There is very little chance
that X can contain shifters or multipliers - that'll violate the skinny
and the layers of metal restrictions.

Y can, in general, be a little more complex than X. But, again, if Y is
too much more complex than X, then Y's implementation may have to be
contorted to fit the height of X. Also, Y cannot use a full cycle [it
still has the additional routing delay over X], so it really can't be
a very complex unit.

Bit lane interleaving seems, to me, to be a somewhat overly restrictive
and impractical optimization technique. It will let you be blindingly
fast on a small subset of the total op space.

I do not know anything about the exact thinking that went into the POWER
aarchitects choices for integer unit layout. However, given that I
probably got my attitudes from them, they may have shared my opinions :)
From: Stephen Fuld on
nmm1(a)cam.ac.uk wrote:
> In article <he1hiv$jmf$1(a)news.eternal-september.org>,
> Stephen Fuld <SFuld(a)Alumni.cmu.edu.invalid> wrote:
>> So the structure is a two dimensional table, one dimension indexed by n,
>> and the other implementation dependent, but treated as an LRU? What
>> does each entry in the table contain? It must have the hash, and
>> presumably a few (4?) bit field to indicate prediction direction and
>> confidence.. Anything else?
>
> No, and it needn't even be that. The 'n' need not be stored in the
> structure. The hash is looked up and the confidence value (which
> includes the direction) copied to the separate register file that
> is indexed by n (and, again, contains just a confidence value).

OK, then a one dimensional table of m entries each containing a hash
value and a confidence value. m is architecture independent; larger
values would keep more "history". And a separate structure, n deep
containing the confidence interval.

snip

> What I am describing is an approach that might lead to
> better prediction for O-O codes. But the details would be modified
> according to whatever techniques are already known to be effective
> for branch prediction, and as a result of simulation.
>
> My key idea is to get away from using just the instruction address
> and thread index for branch prediction, and to use the address of
> the object being worked on (or whatever the compiler regards as
> appropriate). Really no more than that. But doing even that needs
> some handshaking between the compiler and the hardware.

I understand the goal. While I am still not sure of the details of your
proposal, let me make some comments that, I think are applicable.

I have a limited understanding of hardware design, so someone else,
please correct me. Depending upon the number of registers used in the
hash, the instructions could get expensive. For example, if you want to
hash five registers, I think you would need at least three cycles (at
two register read ports per instruction cycle) just to read the values
from the registers. Then you have an "associative memory" lookup of the
m values in the LRU structure. With sufficient hardware, this could
occur in parallel, but for reasonable sized m this is a lot of hardware.
So it might add more cycles to the instruction execution. I believe
the LRU updating of the structure, which is expensive, could be done
after the instruction completes, thus not impacting instruction
execution time. It would limit the rate at which such instructions
could be issued, but that may not be a problem.

But if the instructions take, say 3 cycles, is the improvement in branch
prediction worth the cost?

On the other hand, what about a simplified version - something like the
following. The branch predictor would use a few predefined bits of a
predefined register appended to the branch address to use in indexing
the predictor table. Since branch instructions often don't use the
second register read port, it might not extend the prediction time at
all. The compiler would know which register was used, of course, and
would try to keep that register as the "object id", or object address,
or some unique value. If the value was not wanted for the prediction,
the compiler could add an instruction to zero it out prior to the
branch. This extra instruction might be almost free as it has no
dependencies and could be easily scheduled. This might get some of the
advantages of your proposal, but with a lot less hardware and complexity.


> Oh, the other main idea is 'advisory instructions' only, so that
> a CPU could ignore them without behavioural changes. But that isn't
> a new idea.

Agreed.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)
From: nmm1 on
In article <he46lk$8ph$1(a)news.eternal-september.org>,
Stephen Fuld <SFuld(a)Alumni.cmu.edu.invalid> wrote:
>
>I have a limited understanding of hardware design, so someone else,
>please correct me. Depending upon the number of registers used in the
>hash, the instructions could get expensive. For example, if you want to
>hash five registers, I think you would need at least three cycles (at
>two register read ports per instruction cycle) just to read the values
>from the registers. ...

I have none :-) If I were running that project, I would first run
simulations to discover if it had enough potential to be worth doing
some serious analysis. And one of the aspects of that would be just
how simple a hash gave most of the benefit. It might well be just
the bottom 8 bits of a single register!

My take on it is by looking at the code of O-O method functions, and
noting that a lot of the conditional branches are controlled by a
single factor. That isn't always explicit in the data, but that's
no obstacle.


Regards,
Nick Maclaren.
From: Terje Mathisen on
nmm1(a)cam.ac.uk wrote:
> SET_PRED n<registers>
> Sets predictor n to an undefined hash of the register values.
> USE_PRED n<confidence>
> Uses that predictor, with confidence between (say) -7 and 7; negative
> values imply the confidence of it not being taken.
> PUSH_PREDS<predictors>
> POP_PREDS<predictors>
> Used at function calls and elsewhere to indicate the change in use.
> Simply clearing a predictor is always a permitted action, and most
> hardware would keep only a very small stack of them.
>
> And, yes, my approach is specifically aimed at O-O languages, where
> the same code is used for different objects, and where the path is
> dependent more on the object than the instruction. Would it be a

Isn't it easier to "simply" agree with the compiler writers that the
self pointer will always be loaded into (say) ECX/RCX, and then let the
content of that register, along with the EIP/RIP code address, control
the BTB lookups?

If you're already going to have a multi-level branch predictor, with a
top-level chooser to determine which one to use, then you could have
this object type predictor be one of the lower-level ones.

This at least is both forward and backwards compatible with existing
code and does not require any instruction set changes.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"