From: "Andy "Krazy" Glew" on
Andy "Krazy" Glew wrote:
> Brett Davis wrote:
>> For SpMT I looked at Alpha style paired integer pipelines ...
>>
>> So where does my paired odd/even pipelines proposal fit in your taxonomy.
>
> You haven't said enough about the physical layout to talk about those
> clustering effects.

The physical layout matters a lot, and hence has its own terminology.

For example, most datapaths are bit interleaved - the actual wires might
look like

ALU 0 s1 bit 0
ALU 0 s2 bit 0
ALU 0 d bit 0

ALU 1 s1 bit 0
ALU 1 s2 bit 0
ALU 1 d bit 0

ALU 0 s1 bit 1
ALU 0 s2 bit 1
ALU 0 d bit 1

ALU 1 s1 bit 1
ALU 1 s2 bit 1
ALU 1 d bit 1

Bit interleaving makes bypassing of same bit to same bit easier.

Unfortunately, bit interleaving introduces O(N^2) factor to the area.
Doesn't matter for small degree of superscalarness, but matters as you
get bigger, as you become wire limited.

For a long time I observed that most processors tended to have only
around 4-6 superscalar. Which corresponds roughly to 12-24 wires
for each bit:
12 = (4 ops * (2 inputs/op + 1 output/op))
to
24 (6 ops * (3 inpurs/op + 1 output/op))

For a while I was puzzled by IBM's Power family, e.g. Power 4, which
seemed to be on the low side. Until I was told that there were extra
wires not related to superscalarness, for DMA, etc.


Anyway - my hypothesis seems to have been broken by recent Intel
machines that are up to 7 wide. But nevertheless...



8-wide on a bit interleaved datapath is pushing the envelope.


8-wide as 2 side-by-side 4-wide bit-interleaved datapaths is not so bad,
although you will pay a lot of area for the wire turns to get between
the side-by-side datapaths. So I might call this an 8-wide cluster
composed of two adjacent 4-wide bit-interleaved clusters, with whatever
bypassing you choose.

As I mentioned earlier, there is a sweet trick you can play with
datapaths that are paired and opposite - reflections. You bit
interleave the input and output wires within each, say, 4-wide cluster,
and then between the two clusters you bit interleave the outputs (but
not the inputs). The trouble with this trick is that it takes one of
the ends of the datapath away - where do you put the scheduler? The
data cache?
So, I call this "opposed" clustering.


From: Niels Jørgen Kruse on
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:

> The physical layout matters a lot, and hence has its own terminology.
>
> For example, most datapaths are bit interleaved - the actual wires might
> look like
>
> ALU 0 s1 bit 0
> ALU 0 s2 bit 0
> ALU 0 d bit 0
>
> ALU 1 s1 bit 0
> ALU 1 s2 bit 0
> ALU 1 d bit 0
>
> ALU 0 s1 bit 1
> ALU 0 s2 bit 1
> ALU 0 d bit 1
>
> ALU 1 s1 bit 1
> ALU 1 s2 bit 1
> ALU 1 d bit 1
>
> Bit interleaving makes bypassing of same bit to same bit easier.
>
> Unfortunately, bit interleaving introduces O(N^2) factor to the area.
> Doesn't matter for small degree of superscalarness, but matters as you
> get bigger, as you become wire limited.
>
> For a long time I observed that most processors tended to have only
> around 4-6 superscalar. Which corresponds roughly to 12-24 wires
> for each bit:
> 12 = (4 ops * (2 inputs/op + 1 output/op))
> to
> 24 = (6 ops * (3 inputs/op + 1 output/op))
>
> For a while I was puzzled by IBM's Power family, e.g. Power 4, which
> seemed to be on the low side. Until I was told that there were extra
> wires not related to superscalarness, for DMA, etc.,

Die photos of PPC970, POWER4 and POWER5 clearly show 2 identical integer
units replicated, so I doubt they are interleaved.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
From: Mayan Moudgill on
Brett Davis wrote:
>
> I am concerned about one quote from the second paper:
> "The typical IPC of SPEC INT type workload is below 2."
>
> With loop unrolling you should be able to get arbitrary IPC
> multiplication, the problem is the OoO engine being too stupid
> due to various real world constraints to make use of the IPC.

Consider the following loop (and assume all loads hit in cache).

while( p != NULL ) {
    n++;
    p = p->next;
}

Please unroll to arbitrarily multiply the IPC.

> Or is there something fundamental that I am missing?
>
> Brett
From: Brett Davis on
In article <4B00EB3A.3060700(a)patten-glew.net>,
"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> wrote:

> Brett Davis wrote:
> > For SpMT I looked at Alpha style paired integer pipelines with a 2 cycle
> > latency for any rare copies needed between the duplicate register sets.
> > In loops each pipeline handles its odd or even half of the loop count.
> > Outside of loops you have both CPUs running the same code, has power
> > and heat issues. But you win the all important benchmark queen position.
> > (Gamers will love it, server folk will not buy it.)
> > Each half would have its own instruction pointer, memory latencies in
> > the non-loop code would re-sync the IPs to near match.
> > Someone will do this one day.
> >
> > So where does my paired odd/even pipelines proposal fit in your taxonomy.
> >
> > Brett
>
> Are you bypassing between the clusters? If so, you have a bypass
> cluster. Or else how are you transferring registers across iterations?
> It sounds as if you have an incomplete 2 cycle inter-cluster bypass.

After looking at the bypass problem I have decided that there will be none.
Minimal signaling: on the loop branch you signal a loop detect, and
after a loop or a few loops signal an attempt to do odd iterations only;
the second CPU would ack and do even iterations.

So this is separate CPUs running the same code, no sharing of register state.
There are some other things you can do also, if one CPU is farther ahead
and fails a speculated branch, the second CPU has less state to rewind, if
any, and so the second CPU takes the lead. Faster than one CPU.

For the rare branches the speculator knows are a coin flip, you can have
each CPU take each choice. The winner makes progress, the one that loses
will catch up on the next cache miss. Faster than one CPU.

Any time the CPUs are in lock step it may pay to throw one down the
"95% wrong" path on the off chance it wins.

For the moment let's forget that you will be lucky to be 5% faster, and
that you could use that second CPU to get 100% on another task.
If all that matters is benchmark peak, and you can sell your Core11 CPU
for $999 while your competitor has to sell his Phantom5 for $200, well,
the profit will pay for the inefficiency.

You would share the FPU/Vector unit, so the area cost would be the same
as the Bulldozer, well within modern die size budgets. If the Bulldozer
integer pipes are next to each other maybe something like this or some
of your designs is planned for now or the future...

Few common folk can make any use of multi-cores; if I can turn my 8-core
Bulldozer into a 4-core that's 5% faster, I will, as will most.

The gamers are happy, the server buyers are happy, win win.

So where does my KISS simple paired CPUs fit in your taxonomy, and has
it been done? (Bet not.)

> I must admit that I am puzzled by using loop-iteration SpMT if you can
> do the bypassing between the clusters. I guess that you are using that
> "batching" to hopefully reduce inter-cluster bypassing. But then I am
> not a big fan of inner loop SpMT. Loops are easy, and are already done
> pretty well.

"Loops are easy." ;) Pray tell where these plentiful easy-to-speed-up
areas are in CPU design. ;) Run straight into a brick wall you have. ;)

Brett
From: Brett Davis on
In article <X--dnTCmHuSLk5_WnZ2dnUVZ_tFi4p2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:

> Brett Davis wrote:
> >
> > I am concerned about one quote from the second paper:
> > "The typical IPC of SPEC INT type workload is below 2."
> >
> > With loop unrolling you should be able to get arbitrary IPC
> > multiplication, the problem is the OoO engine being too stupid
> > due to various real world constraints to make use of the IPC.
>
> Consider the following loop (and assume all loads hit in cache).
>
> while( p != NULL ) {
>     n++;
>     p = p->next;
> }
>
> Please unroll to arbitrarily multiply the IPC.
>
> > Or is there something fundamental that I am missing?
> >
> > Brett

Please point out the Spec benchmark that does this. ;)

The problem has to be game-able to make it past the politics
and dollars that decide what makes it into a major benchmark.
That actual problems might be solved is incidental. ;)

Brett