From: Paul A. Clayton on
Another obvious (possibly half-way decent) idea: Use the duplicated
register file of a clustered processor design like the Alpha 21264 to
hold distinct contexts.

Such a static partitioning might not be advisable under two
simultaneous threads usually, but at four (reasonably active) threads,
static partitioning might be a net gain in many cases. To allow a
slight increase in support for burst ILP, the inter-cluster forwarding
could write to register caches rather than to the other register file
and these register values could be used for issuing instructions from
the other cluster. The extra write ports in each cluster could then
be used to support two-result operations if desired.

(A two-issue-per-cluster processor might share a multiplier/divider
[possibly replicating enough of a multiplier to support independent
16-bit by 64-bit multiplications??]. At three issues per cluster,
distinct multipliers might make sense.)
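The shared-multiplier aside can be made concrete: the partial-product array of a 64x64-bit multiplier can be viewed as four 16x64-bit slices, which is why replicating "enough of a multiplier" for an independent 16x64 product is plausible. A toy decomposition, with illustrative function names only:

```python
# Sketch: compose a 64x64 multiply (low 64 bits) from 16x64 slices.
# This is arithmetic illustration, not a model of any real datapath.

MASK64 = (1 << 64) - 1

def mul16x64(a16: int, b64: int) -> int:
    """One slice: 16-bit operand times 64-bit operand (up to 80 bits)."""
    assert 0 <= a16 < (1 << 16) and 0 <= b64 < (1 << 64)
    return a16 * b64

def mul64x64(a: int, b: int) -> int:
    """Full 64x64 multiply built from four shifted 16x64 slices."""
    acc = 0
    for i in range(4):
        chunk = (a >> (16 * i)) & 0xFFFF   # i-th 16-bit chunk of a
        acc += mul16x64(chunk, b) << (16 * i)
    return acc & MASK64
```

A second, independent 16x64 operation would need only one extra slice, not a second full array.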

(Static partitioning of two threads might make sense when ILP is
relatively low, with little benefit from using the full issue width
for a single thread, when extra registers could be used to support
deeper speculation, or under other circumstances.)

(Obviously, one could also use such register-duplicating clustering to
support SIMD-like operations.)


Paul A. Clayton
just a technophile
From: Andy 'Krazy' Glew on
On 5/27/2010 3:10 PM, Paul A. Clayton wrote:
> Another obvious (possibly half-way decent) idea: Use the duplicated
> register file of a clustered processor design like the Alpha 21264 to
> hold distinct contexts.

Looks like you have found another way of arriving at, another evolutionary path to,

a) AMD's MCMT (Multicluster Multithreading) as in Bulldozer

b) my MultiStar.

I arrived at it from a different path: (a) thinking that most multicluster uarch for single threads were not very
successful, (b) using multicluster for separate threads, and (c) then trying to go back and use the MCMT to speed up
single thread.

I.e. you

MCST (multicluster singlethread) -> MCMT

me

MCMT -> MCST ?


I wonder what things work out differently when you think this way?


I never liked the inter-cluster bypass of the 21264. Complete bypass networks are expensive; incomplete ones are a glass
jaw. But, heck, even un-clustered machines now have incomplete bypass networks.

>
> Such a static partitioning might not be advisable under two
> simultaneous threads usually, but at four (reasonably active) threads,
> static partitioning might be a net gain in many cases. To allow a
> slight increase in support for burst ILP, the inter-cluster forwarding
> could write to register caches rather than to the other register file
> and these register values could be used for issuing instructions from
> the other cluster. The extra write ports in each cluster could then
> be used to support two-result operations if desired.
>
> (A two-issue-per-cluster processor might share a multiplier/divider
> [possibly replicating enough of a multiplier to support independent
> 16-bit by 64-bit multiplications??]. At three issues per cluster,
> distinct multipliers might make sense.)
>
> (Static partitioning of two threads might make sense when ILP is
> relatively low, with little benefit from using the full issue width
> for a single thread, when extra registers could be used to support
> deeper speculation, or under other circumstances.)
>
> (Obviously, one could also use such register-duplicating clustering to
> support SIMD-like operations.)
>
>
> Paul A. Clayton
> just a technophile

From: Paul A. Clayton on
On May 28, 2:13 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
[snip]
> I.e. you
>
> MCST (multicluster singlethread) -> MCMT
>
> me
>
> MCMT -> MCST ?
>
> I wonder what things work out differently when you think this way?

Well, one of my habits of thinking seems to be to exploit existing
features for alternate uses (e.g., huge page TLB entries holding
PDEs). (This is probably part of the reason I find SMT appealing--
existing [or extreme] ILP core -> choice of single thread
performance or moderately great multithread throughput.)

> I never liked the inter-cluster bypass of the 21264.  Complete bypass networks are expensive; incomplete are a glass
> jaw.  But, heck, even un-clustered machines now have incomplete bypass networks.

I kind of dislike complete bypass because it seems wasteful. (I
would irrationally dislike it even if it were cheap.) Other than
squaring, when is a result used by both inputs of a functional
unit? (Intelligent forwarding would seem desirable, but such
could add excessive delay [aside from area/power costs].)

BTW, could a staggered ALU be used to ease the delay
problem of scheduling/forwarding? If one 'cluster' of
ALUs was staggered a half-cycle relative to the other with
the less significant bits forwarded as soon as available,
could one see some benefit? (I like the Pentium 4
staggered ALU concept. I do wonder if it might be useful
for a low-power design--i.e., addition takes two cycles
to fully complete [less logic activity] but has single cycle
forwarding. [I suspect the ideas in the Pentium 4 are
now tainted with the relative failure of the Pentium 4.])


Paul A. Clayton
just a technophile
From: Andy 'Krazy' Glew on
On 5/29/2010 7:19 PM, Paul A. Clayton wrote:

> BTW, could a staggered ALU be used to ease the delay
> problem of scheduling/forwarding? If one 'cluster' of
> ALUs was staggered a half-cycle relative to the other with
> the less significant bits forwarded as soon as available,
> could one see some benefit? (I like the Pentium 4
> staggered ALU concept. I do wonder if it might be useful
> for a low-power design--i.e., addition takes two cycles
> to fully complete [less logic activity] but has single cycle
> forwarding. [I suspect the ideas in the Pentium 4 are
> now tainted with the relative failure of the Pentium 4.])

I don't think that Pentium 4 had what you think of as a staggered ALU.

When I think of staggered ALU, I think of two ALUs, with the second ALU receiving inputs from the first, and possibly
from the generic register file. I.e. something that allows you to execute A+B->C; C+D->E in one clock cycle.
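That cascaded arrangement can be sketched as a toy model: the second ALU consumes the first ALU's result inside the same cycle, so a chain of n dependent adds needs ceil(n/2) cycles instead of n. Function names here are illustrative, not from any real design:

```python
# Toy model of a cascaded (two-deep) ALU: ALU1 is fed directly by
# ALU0, allowing A+B->C; C+D->E to complete in one clock cycle.
import math

def cascaded_cycle(a: int, b: int, d: int):
    """One cycle: ALU0 computes c = a + b; ALU1 consumes c
    in the same cycle to compute e = c + d."""
    c = a + b
    e = c + d
    return c, e

def cycles_for_dependent_adds(n: int, cascaded: bool) -> int:
    """Cycles to retire a chain of n dependent adds."""
    return math.ceil(n / 2) if cascaded else n
```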

Pentium 4 actually just ran the ALUs - and the associated support logic, like the scheduler - at 2X the published
frequency of the core. I.e. if the core was publicly 2.5GHz, the "fireball" was actually running at 5GHz.

The original Pentium 4 ALUs were staggered in that they computed the low 16 bits in one of these fast clock cycles, and
the high 16 bits in the next - allowing back-to-back adds. But that is not the widespread definition of "staggered" ALU.
From: Paul A. Clayton on
On May 30, 11:18 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net>
wrote:
[snip]
> The original Pentium 4 ALUs were staggered in that they computed the low 16 bits in one of these fast clock cycles, and
> the high 16 bits in the next - allowing back-to-back adds.  But that is not the widespread definition of "staggered" ALU.

I took the term from "Using Internal Redundant Representations and
Limited Bypass to Support Pipelined Adders and Register Files"
(Mary D. Brown, Yale N. Patt; 2001):
"An example of this concept, called staggered adds, was
implemented in the Intel Pentium 4 [10]. When staggering a 32-bit
add over two cycles, the carry-out of the 16th bit and the lower
half of the result are produced in the first cycle, and the upper
half of the result is produced in the second cycle.
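The two-cycle split described in that quote can be sketched in Python; this is a toy model of the arithmetic, not the actual Pentium 4 circuit:

```python
# Staggered 32-bit add over two cycles: cycle 1 produces the low
# 16 bits plus the carry out of bit 16 (forwardable immediately);
# cycle 2 produces the high 16 bits using that carry.

def staggered_add32(a: int, b: int):
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    lo = (a & 0xFFFF) + (b & 0xFFFF)
    carry = lo >> 16
    lo &= 0xFFFF                                    # "cycle 1" result
    hi = ((a >> 16) + (b >> 16) + carry) & 0xFFFF   # "cycle 2" result
    return lo, hi

def full_add32(a: int, b: int) -> int:
    """Reassemble the staggered halves into the full 32-bit sum."""
    lo, hi = staggered_add32(a, b)
    return (hi << 16) | lo
```

A dependent add that needs only the low half can start one "cycle" early, which is the single-cycle-forwarding property discussed above.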

So what is the proper term for this kind of pipelined addition?

(ISTR reading somewhere that the AMD K5 used the early
availability of the less significant bits of a sum to shorten
load latency, so early use of partial results is not an extremely
new idea.)
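The K5-style early use of partial sums can be illustrated with a toy address-generation sketch. The point: the low 12 bits of base+offset depend only on the low 12 bits of the operands, so a cache set index within a 4 KiB page can be computed before the full add completes. Parameters and function names here are illustrative assumptions, not K5 details:

```python
# Early cache indexing from partial sum bits. Assumes 4 KiB pages
# (12 untranslated bits) and, as an example, 64-byte lines with 64
# sets, so line_bits + index_bits = 12 exactly.

def early_cache_index(base: int, offset: int,
                      index_bits: int = 6, line_bits: int = 6) -> int:
    """Set index from only the low 12 bits of each operand."""
    assert line_bits + index_bits <= 12
    low = ((base & 0xFFF) + (offset & 0xFFF)) & 0xFFF
    return (low >> line_bits) & ((1 << index_bits) - 1)

def full_cache_index(base: int, offset: int,
                     index_bits: int = 6, line_bits: int = 6) -> int:
    """Same index computed from the complete 64-bit sum."""
    addr = (base + offset) & ((1 << 64) - 1)
    return (addr >> line_bits) & ((1 << index_bits) - 1)
```

The two functions always agree, because no carry from above bit 11 can affect bits 11:0 of the sum.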


Paul A. Clayton
just a technophile