From: Jean on
I was trying to see the impact of computer-architecture improvements
on CPU performance compared to process-technology improvements. Has
anyone studied/analyzed the performance impact when a latest CPU
design is synthesized in an old design library (say, a 45nm design in
1µm)?

Can this be simulated by running a latest CPU at a lower clock
frequency (a few MHz)?

Jean
From: MitchAlsup on
I have a sneaking suspicion that if one were to take one of the
first-generation RISC machines and implement its circuits in a modern
process, one would be quite surprised at A: how much you got, and B:
how uncompetitive it would be.

Frequency wise, these older designs would scale quite well in
comparison to their modern counterparts. The 33 MHz MIPS design might
not run well up into the 3 GHz range (as a 32-bit implementation), but
it would run in the multi-GHz range.

Performance wise, the old 0.7 IPC design point of the first-generation
RISC machines would be degraded to about 0.35-to-0.5 IPC simply
because of the delay to main memory at the multi-GHz operating point,
even after some good caching was applied outside the tiny speck of
silicon that passed for a processor, way back when. In comparison, the
modern OoO superscalar designs get all the way up to 1.0 IPC on
benchmark stuff. So all the architecture of the past two decades
served to get the performance back as memory grew more latent, cycle
wise. The cost is area and power. The area part seems to have "gone
away" as the relentless march of smaller feature sizes moves forward.
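That latency arithmetic can be sketched with a simple average-CPI
model; all the miss rates and penalties below are illustrative
assumptions, not measurements:

```python
# Back-of-envelope effective-IPC model: a classic single-issue RISC
# core stalls completely on a cache miss.

def effective_ipc(base_ipc, miss_rate, miss_penalty_cycles):
    """Average-CPI model: CPI = CPI_base + misses/instruction * penalty."""
    base_cpi = 1.0 / base_ipc
    cpi = base_cpi + miss_rate * miss_penalty_cycles
    return 1.0 / cpi

# At 33 MHz, DRAM is only a couple of cycles away, so misses barely hurt.
print(effective_ipc(0.7, 0.02, 2))    # ≈ 0.68 IPC
# At 3 GHz, the same DRAM trip costs ~180 cycles; even a low miss rate
# drags the old 0.7 IPC design point down toward 0.4.
print(effective_ipc(0.7, 0.005, 180)) # ≈ 0.43 IPC
```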

Now, going in reverse: taking a modern OoO superscalar architecture
and implementing it in (say) 1µ, the thing would be so big you could
not fit it in a reticle. Thus, it would be completely impractical to
even consider doing so.

Mitch

From: Anton Ertl on
MitchAlsup <MitchAlsup(a)aol.com> writes:
>I have a sneaking suspicion that if one were to take one of the
>first-generation RISC machines and implement its circuits in a modern
>process, one would be quite surprised at A: how much you got, and B:
>how uncompetitive it would be.
>
>Frequency wise, these older designs would scale quite well in
>comparison to their modern counterparts. The 33 MHz MIPS design might
>not run well up into the 3 GHz range (as a 32-bit implementation),
>but it would run in the multi-GHz range.

Maybe I should trust you on this, because you are working on these
things, but all the clock rates I have seen from real chips make me
doubt this.

In particular, even single-issue in-order designs with these kinds of
clock rates (e.g., Cell's PPE), and even designs that did not reach
such clock rates (e.g., the VIA C3 etc.), have >10 pipeline stages.
The Niagara, which has a 6-stage pipeline, only reached 1.4GHz. ARM
implementations with around five stages never reached 1GHz AFAIK,
although the desire to save power may have had something to do with it.

Also, shrinking a CPU in a way that increases performance a lot seemed
to be a non-trivial task, even in those times when there was still
performance to be had from shrinking. I remember the disappointing
speedups from 21264 shrinks, although according to Wikipedia the
eventual top clock rates were not so bad; still, they came out late,
and the clock-rate increases by Intel and AMD were much better, so
Alpha was passed in clock rate by IA-32 CPUs.

>Performance wise, the old 0.7 IPC design point of the first-generation
>RISC machines would be degraded to about 0.35-to-0.5 IPC simply
>because of the delay to main memory at the multi-GHz operating point,
>even after some good caching was applied outside the tiny speck of
>silicon that passed for a processor, way back when.

Hmm, if you couple such a CPU with a fast 6MB L2 cache (like the one
in the Core 2), would the performance impact from memory accesses be
so bad? For main memory accesses OoO does not help much, either, and
the L2 accesses would be about as fast (relative to the clock) as main
memory was back in the 33MHz days; actually, thinking about how slow
main memory was on our DecStations (IIRC ~1000ns latency on our
DecStation 5000/150), it would be faster. Of course, it does not make
much sense to use a big L2 cache with a tiny CPU core.
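As a quick sanity check on those relative latencies (the nanosecond
figures below are rough assumptions, not measured values):

```python
def latency_cycles(latency_ns, clock_ghz):
    """Latency in ns times clock rate in GHz gives latency in cycles."""
    return latency_ns * clock_ghz

# ~1000 ns main memory seen from a 33 MHz DecStation-class core:
print(latency_cycles(1000, 0.033))  # ≈ 33 cycles
# An assumed ~5 ns Core 2-class L2 seen from a 3 GHz core: 15 cycles,
# i.e. faster, relative to the clock, than that old main memory was.
print(latency_cycles(5, 3.0))       # 15.0 cycles
```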

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: MitchAlsup on
On Mar 12, 6:12 am, an...(a)mips.complang.tuwien.ac.at (Anton Ertl)
wrote:
> MitchAlsup <MitchAl...(a)aol.com> writes:
> >I have a sneaking suspicion that if one were to take one of the
> >first-generation RISC machines and implement its circuits in a modern
> >process, one would be quite surprised at A: how much you got, and B:
> >how uncompetitive it would be.
>
> >Frequency wise, these older designs would scale quite well in
> >comparison to their modern counterparts. The 33 MHz MIPS design might
> >not run well up into the 3 GHz range (as a 32-bit implementation),
> >but it would run in the multi-GHz range.
>
> Maybe I should trust you on this, because you are working on these
> things, but all the clock rates I have seen from real chips make me
> doubt this.
>
> In particular, even single-issue in-order designs with these kinds of
> clock rates (e.g., Cell's PPE), and even designs that did not reach
> such clock rates (e.g., the VIA C3 etc.), have >10 pipeline stages.
> The Niagara, which has a 6-stage pipeline, only reached 1.4GHz.  ARM
> implementations with around five stages never reached 1GHz AFAIK,
> although the desire to save power may have to do with it.

I agree with your data; however, each of these designs added stuff to
the pipelines not present in the originals. In addition, I was also
assuming access to an Intel-class 'processor' semiconductor technology
(worth at least a factor of 2X in frequency compared to more typical
fabs).

> Also, shrinking a CPU in a way that increases performance a lot seemed
> to be a non-trivial task, even in those times when there was still
> performance to be had from shrinking.  I remember the disappointing
> speedups from 21264 shrinks, although according to Wikipedia the
> eventual top clockrates were not so bad; still, they came out late,
> and the clock rate increases by Intel and AMD were much better, so
> Alpha was passed in clock rate by IA-32 CPUs.

Intel and AMD had at least 3X as many people "beating on gates" as did
DEC. This is the old "cubic dollars" argument. The amount of money
available, when one is shipping more than 1 million CPUs per week, is
sufficient to buy enough engineers to beat the vast majority of
implementation problems into submission.

> >Performance wise, the old 0.7 IPC design point of the first-generation
> >RISC machines would be degraded to about 0.35-to-0.5 IPC simply
> >because of the delay to main memory at the multi-GHz operating point,
> >even after some good caching was applied outside the tiny speck of
> >silicon that passed for a processor, way back when.
>
> Hmm, if you couple such a CPU with a fast 6MB L2 cache (like the one
> in the Core 2), would the performance impact from memory accesses be
> so bad?  For main memory accesses OoO does not help much, either, and
> the L2 accesses would be about as fast (relative to the clock) as main
> memory was back in the 33MHz days; actually, thinking about how slow
> main memory was on our DecStations (IIRC ~1000ns latency on our
> DecStation 5000/150), it would be faster.  Of course, it does not make
> much sense to use a big L2 cache with a tiny CPU core.

Once you are actually waiting on memory, the small simple CPU waits a
lot more power-efficiently than the great big OoO CPU.

When comparing benchmark applications, one typically sees around 1.0
IPC. When looking at 'commercial' applications one might see 0.1 to
0.2 IPC. The small simple machine would suffer a 2X degradation
compared to the big OoO on benchmark applications, but suffer only
10%-15% on a commercial application, all due to memory latency.
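A sketch of why the gap shrinks, using a shared memory-stall term and
made-up but plausible numbers (the core IPCs, miss rate, and penalty
are all assumptions):

```python
def ipc(core_ipc, miss_rate, penalty_cycles):
    """CPI = compute CPI + memory-stall CPI; both cores see the same
    memory system, so only the compute term differs."""
    return 1.0 / (1.0 / core_ipc + miss_rate * penalty_cycles)

# Benchmark-like code (caches work, essentially no misses): the OoO
# core's assumed 2x compute advantage shows through fully.
print(ipc(2.0, 0.0, 200), ipc(1.0, 0.0, 200))    # 2.0 vs 1.0

# Commercial-like code (3 misses per 100 instructions, ~200-cycle
# penalty): both cores are dragged down near 0.15 IPC and the gap
# shrinks to roughly 10%.
print(ipc(2.0, 0.03, 200), ipc(1.0, 0.03, 200))  # ≈ 0.154 vs ≈ 0.143
```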

But back to your design of a couple of CPUs coupled to a shared 6MB
L2: the size of such a small CPU is about the size of 64KBytes of
cache, so your system would end up being 1-2% CPUs and 98-99% cache.
My guess is that there is a better balance point between CPUs and
cache that achieves better system throughput. The one I postulated 4
years ago consisted of 4 microCPUs coupled to a 4-banked 256KB L2,
then coupled to the big on-die cache of "whatever fit". The L2 was
capable of 5-cycle access with 4 accesses to different banks
simultaneously. My design in particular subdivided the banks into 4
columns and allowed one access per column per cycle. The center of
this arrangement contained the L2 queues, the miss buffer, the victim
buffers, and the logic that 'talked' to the outer hierarchy. During an
L2 access, the entire microCPU was 'declocked' to save power.
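A toy model of the banked-access idea (the line size and
address-to-bank mapping are assumptions, and the column subdivision
is collapsed into plain banks here):

```python
# Addresses map to one of 4 banks by low-order cache-line bits (an
# assumed mapping); at most one access per bank can start each cycle,
# so conflicting accesses serialize.

LINE_BYTES = 64
N_BANKS = 4

def bank_of(addr):
    """Which bank a byte address falls in, by line index mod banks."""
    return (addr // LINE_BYTES) % N_BANKS

def schedule(addrs):
    """Return the cycle each access starts, issuing at most one
    access per bank per cycle, in arrival order."""
    next_free = [0] * N_BANKS
    starts = []
    for a in addrs:
        b = bank_of(a)
        starts.append(next_free[b])
        next_free[b] += 1
    return starts

# Four accesses to four different banks all start in cycle 0;
# two accesses to the same bank serialize.
print(schedule([0, 64, 128, 192]))  # [0, 0, 0, 0]
print(schedule([0, 256]))           # [0, 1]
```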

During this effort, we found "no particular reason" that the microCPU
would not run at least as fast as Opteron (from which all the parts
had been stolen {ALUs, FPUs, register files, cache blocks, TLBs,
L3s,...}). Due to its power consumption, and the fact that it fits
within the linear distance a signal travels in a single clock cycle,
there was potential upside on the frequency. That is, whereas Opteron
was 4 clocks top to bottom and 3 clocks side to side, the PSP (a.k.a.
microCPU) was smaller than a clock in both dimensions. This helps big
time in skew management.

The important point from all of this was that if you want to make
small simple CPUs, A: you can, but B: you have to build at least as
good a cache hierarchy to support them as you do for the big OoO
CPUs. Latency matters, and if you are building latency-sensitive
machines, you have to pound latency out of the cache hierarchy until
you have stumbled upon enough MLP (across processors) to keep the
DRAMs busy.

Mitch
