From: Brett Davis on
In article <2009Nov15.124955(a)mips.complang.tuwien.ac.at>,
anton(a)mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> That reminds me of a paper [kim&smith02] that I have always found very
> impressive, and I wondered why I have never seen followup papers or an
> implementation (but then I don't follow ISCA and Micro proceedings
> closely these days); hmm, looking at the research summary at Smith's
> home page, he is still working on something like this. Of course that
> work was also done in Wisconsin (same University, right?), so it may
> have been inspired by your ideas. Do you have any comments on that?
>
> @InProceedings{kim&smith02,
> author = {Ho-Seop Kim and James E. Smith},
> title = {An Instruction Set and Microarchitecture for
> Instruction Level Distributed Processing},
> - anton

This paper is timely for me as I have an 8 way design spec'd out that
I am getting ready to write up, and I have found no one else looking
at 8 way. (Feel free to fix this shortcoming of mine.)

The second paper corrects the mistake of using accumulator instructions.
http://www.ece.wisc.edu/~hskim/

Most people will read these papers and declare this to be a failure,
as indeed this design is. But the flaws can be fixed.

The basic problem is using Alpha RISC ops as his source; the 2002
timeline means he was stuck with GCC. Today, due to Apple, we have LLVM
and Clang, which will give you access to more parallelism.
http://en.wikipedia.org/wiki/LLVM
http://en.wikipedia.org/wiki/Clang

I am concerned about one quote from the second paper:
"The typical IPC of SPEC INT type workload is below 2."

With loop unrolling you should be able to get arbitrary IPC
multiplication; the problem is the OoO engine being too stupid, due to
various real-world constraints, to make use of that IPC.

Or is there something fundamental that I am missing?

Brett
From: Brett Davis on
In article <4AFFA499.9070503(a)patten-glew.net>,
"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> wrote:

> I can't express how good it feels to see MCMT become a product. It's
> been public for years, but it gets no respect until it is in a product.
> It would have been better if I had stayed at Intel to see it through.
> I know that I won't get any credit for it. (Except from some of the guys
> who were at AMD at the time.) But it feels good nevertheless.

Claiming credit can get you pushed out of the company. The old 1%
inspiration and 99% perspiration, with the 99% being pissed at you
for claiming the credit.
I stopped tooting my horn, and now people like me. ;)
(My boss of course knows what I am up to, otherwise I would leave.)

> The only bad thing is that some guys I know at AMD say that Bulldozer is
> not really all that great a product, but is shipping just because AMD
> needs a model refresh. "Sometimes you just gotta ship what you got." If
> this is so, and if I deserve any credit for CMT, then I also deserve
> some of the blame. Although it might have been different, better, if I
> had stayed.

Bulldozer just has to be close to Core2 or better in performance, in
order to stop Intel from playing the pricing squeeze game it is running
against Barcelona.
Better to save the company now than wait two more years for a "better"
design that may be only 2% faster, and which will be rolled into the
next needed refresh anyway.

Barcelona was of course a no-brainer re-engineering to remove all the
known bottlenecks throttling the K7 pipeline, so that a new, faster
pipeline could be plugged in and not be crippled by instruction
fetch/decode, memory bandwidth, cache bandwidth, etc. Just about
everything in the K10 is faster than what the old K7/K8 pipeline in the
K10 needs. And the end result is quite a lot faster than the K8.

> SpMT has potential to speed up a single thread of execution, by
> splitting it up into separate threads and running the separate threads
> on different clusters. [...]
>
> If I received arrows in my back for MCMT, I received 10 times as many
> arrows for SpMT. And yet still I have hope for it. Unfortunately, I am
> not currently working on SpMT. Haitham Akkary, the father of DMT,
> continues the work.

I have started backing away from SpMT: too many difficult problems to
solve at once, and there are better solutions. If you can go 8 way, a
simple recompile with a loop unroll will get you the same performance
as splitting the loop into odd and even paired threads.
(Not going to add 16 read ports to the register file to do this... ;)

For SpMT I looked at Alpha style paired integer pipelines with a 2 cycle
latency for any rare copies needed between the duplicate register sets.
In loops each pipeline handles its odd or even half of the loop count.
Outside of loops you have both CPUs running the same code, which has
power and heat issues. But you win the all-important benchmark queen
position. (Gamers will love it, server folk will not buy it.)
Each half would have its own instruction pointer; memory latencies in
the non-loop code would re-sync the IPs to a near match.
Someone will do this one day.

Apple's solution to this problem is software only, Grand Central
Dispatch, and quite frankly this is the best solution of all. Use lots
of cheap small CPUs and let the OS and compiler and programmer decide
how to chop up the loops.

We are reaching the end of CPU design, little more can be done with
the opcodes and pipelines we have today. This fight between GM and Ford
may end with both suffering under the weight of dozens of little
companies selling "free" multi-clusters. (20 year prediction...)

> I have a whole taxonomy of different sorts of clustering:
> * fast vs slow bypass clusters
> * fully bypassed vs. partially bypassed
> * mechanisms to reduce bypassing
> * physical layout of clusters
> * bit interleaved datapaths
> * datapaths flowing in opposite directions,
> with bypassing where they touch
> * what's in the cluster
> * execute only
> * execute + data cache
> * schedule + execute + data cache
> * renamer + schedule + execute + datacache
> ...
> * what gets shared between clusters
> * front-end
> * renamer?
> * data-cache - L0? L1? L2?
> * TLBs...
> * MSHRs...
> * FP...
>
> Anyway... It's cool to see MCMT becoming real. It gives me hope that my
> follow-on to MCMT, M* may still, eventually, also become real.

So where does my paired odd/even pipelines proposal fit in your taxonomy?

Brett
From: "Andy "Krazy" Glew" on
Brett Davis wrote:
>> @InProceedings{kim&smith02
>> author = {Ho-Seop Kim and James E. Smith},
>> title = {An Instruction Set and Microarchitecture for
>> Instruction Level Distributed Processing},
>> - anton
>
> This paper is timely for me as I have an 8 way design spec'd out that
> I am getting ready to write up, and I have found no one else looking
> at 8 way. (Feel free to fix this shortcoming of mine.)

Or take an MCMT machine like Barcelona, which essentially has two
roughly 4-wide clusters - and figure out a way to spread computation
of a single thread across both clusters.

The problem with width is the bypasses.
From: "Andy "Krazy" Glew" on
Brett Davis wrote:
> For SpMT I looked at Alpha style paired integer pipelines with a 2 cycle
> latency for any rare copies needed between the duplicate register sets.
> In loops each pipeline handles its odd or even half of the loop count.
> Outside of loops you have both CPUs running the same code, which has
> power and heat issues. But you win the all-important benchmark queen
> position. (Gamers will love it, server folk will not buy it.)
> Each half would have its own instruction pointer; memory latencies in
> the non-loop code would re-sync the IPs to a near match.
> Someone will do this one day.
>
> So where does my paired odd/even pipelines proposal fit in your taxonomy?
>
> Brett

Are you bypassing between the clusters? If so, you have a bypass
cluster. Or else how are you transferring registers across iterations?
It sounds as if you have an incomplete 2 cycle inter-cluster bypass.

I must admit that I am puzzled by using loop-iteration SpMT if you can
do the bypassing between the clusters. I guess that you are using that
"batching" to hopefully reduce inter-cluster bypassing. But then I am
not a big fan of inner loop SpMT. Loops are easy, and are already done
pretty well.

Having both run the same code sounds like Jim Goodman's Datascalar.

So: you seem to be proposing clusters with partial intercluster
bypassing of 2 clocks, batched on alternate loop iterations, with
Datascalar style non-loop execution.

You haven't said enough about the physical layout to talk about those
clustering effects.
From: Noob on
Brett Davis wrote:

> The basic problem is using Alpha RISC ops as his source; the 2002
> timeline means he was stuck with GCC. Today, due to Apple, we have LLVM
> and Clang, which will give you access to more parallelism.

"We have LLVM due to Apple" ?!

Did Apple support the project from the start?

"The LLVM project started in 2000 at the University of Illinois at
Urbana-Champaign, under the direction of Vikram Adve and Chris Lattner.
In 2005, Apple Inc. hired Lattner and formed a team to work on the LLVM
system for various uses within Apple's development systems."