From: Jean on 9 Mar 2010 17:12

I was trying to see the impact of computer-architecture improvements on CPU performance, as compared to process technology. Has anyone studied/analyzed the performance impact when a current CPU design is synthesized in an old design library (say, a 45nm design in a 1µm library)? Can this be simulated by running a current CPU at a lower clock frequency (a few MHz)?

Jean
From: MitchAlsup on 11 Mar 2010 12:34

I have a sneaking suspicion that if one were to take one of the first generation RISC machines and implement the circuits in a modern process, one would be quite surprised at A: how much you got, and B: how uncompetitive they would be.

Frequency wise, these older designs would scale quite well in comparison to their modern counterparts. The 33 MHz MIPS design might not run well in the 3 GHz range (as a 32-bit implementation), but it would run in the multi-GHz range.

Performance wise, the old 0.7 IPC design point of the first generation RISC machines would be degraded into about 0.35-to-0.5 IPC simply because of the delay to main memory at the multi-GHz operating point, even after some good caching was applied outside the tiny speck of silicon that passed for a processor, way back when. In comparison, the modern OoO superscalars get all the way up to 1.0 IPC on benchmark stuff. So all the architecture of the past two decades was to get the performance back as memory got more latent, cycle wise. The cost is area and power. The area part seems to have "gone away" as the relentless march of smaller feature sizes migrates forward.

Now, going in reverse: taking a modern OoO SS architecture and implementing it in (say) 1µm, the thing would be so big you could not fit it in a reticle. Thus, it would be completely impractical to even consider doing so.

Mitch
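The IPC-degradation argument can be sketched with a toy stall-cycle model (the miss rate and DRAM latency below are illustrative assumptions, not Mitch's numbers):

```python
# Toy stall-cycle model: a fixed DRAM latency in nanoseconds costs more
# core cycles as the clock rises, dragging down the IPC of a simple core.
# The miss rate and latency are illustrative assumptions.

def effective_ipc(base_ipc, misses_per_instr, mem_latency_ns, clock_ghz):
    """IPC once every miss stalls the core for the full memory latency."""
    penalty_cycles = mem_latency_ns * clock_ghz        # ns x cycles/ns
    cpi = 1.0 / base_ipc + misses_per_instr * penalty_cycles
    return 1.0 / cpi

# ~1% of instructions miss to DRAM, ~50 ns latency (assumed):
print(effective_ipc(0.7, 0.01, 50, 0.033))  # 33 MHz: penalty ~1.7 cycles, IPC ~0.69
print(effective_ipc(0.7, 0.01, 50, 3.0))    # 3 GHz: penalty 150 cycles, IPC ~0.34
```

With these assumed numbers the 0.7 IPC design lands near 0.34 at 3 GHz, which is the shape of the 0.35-to-0.5 range in the post.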
From: Anton Ertl on 12 Mar 2010 06:12

MitchAlsup <MitchAlsup(a)aol.com> writes:
>I have a sneaking suspicion that if one were to take one of the first
>generation RISC machines and implement the circuits in the modern
>processes that one would be quite surprised at A: how much you got,
>and B: how uncompetitive they would be.
>
>Frequency wise, these older designs would scale quite well in
>comparison to their modern counterparts. The 33 MHz MIPS design might
>not run well in the 3 GHz range (as a 32-bit implementation), but it
>would run in the multi-GHz range.

Maybe I should trust you on this, because you are working on these things, but all the clock rates I have seen from real chips make me doubt this.

In particular, even single-issue in-order designs with these kinds of clock rates (e.g., Cell's PPE), and even designs that did not reach such clock rates (e.g., the VIA C3 etc.), have >10 pipeline stages. The Niagara, which has a 6-stage pipeline, only reached 1.4 GHz. ARM implementations with around five stages never reached 1 GHz AFAIK, although the desire to save power may have something to do with it.

Also, shrinking a CPU in a way that increases performance a lot seemed to be a non-trivial task, even in those times when there was still performance to be had from shrinking. I remember the disappointing speedups from 21264 shrinks; although, according to Wikipedia, the eventual top clock rates were not so bad, they came out late, and the clock-rate increases by Intel and AMD were much better, so Alpha was passed in clock rate by IA-32 CPUs.

>Performance wise, the old 0.7 IPC design point of the first generation
>RISC machines would be degraded into about 0.35-to-0.5 IPC simply
>because of the delay to main memory at the multi-GHz operating point,
>even after some good caching was applied outside the tiny speck of
>silicon that passed for a processor, way back when.

Hmm, if you couple such a CPU with a fast 6MB L2 cache (like the one in the Core 2), would the performance impact from memory accesses be so bad? For main memory accesses OoO does not help much, either, and the L2 accesses would be about as fast (relative to the clock) as main memory was back in the 33 MHz days; actually, thinking about how slow main memory was on our DecStations (IIRC ~1000ns latency on our DecStation 5000/150), it would be faster. Of course, it does not make much sense to use a big L2 cache with a tiny CPU core.

- anton
--
M. Anton Ertl                     Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
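Anton's relative-latency argument is easy to check with back-of-the-envelope arithmetic (the 15-cycle L2 figure below is my assumption for a Core-2-class part, not a number from the post):

```python
# Back-of-the-envelope: latency measured in core clock cycles.
# 1000 ns DRAM on a 33 MHz DecStation vs. an assumed ~15-cycle L2 hit
# on a multi-GHz part.

def latency_in_cycles(latency_ns, clock_ghz):
    return latency_ns * clock_ghz

old_dram_cycles = latency_in_cycles(1000, 0.033)  # ~33 cycles at 33 MHz
assumed_l2_cycles = 15                            # modern L2 hit, in its own clocks
print(old_dram_cycles, assumed_l2_cycles)
```

Under these assumptions a modern L2 hit is indeed cheaper, relative to the clock, than a full DRAM access was on the old machine.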
From: MitchAlsup on 14 Mar 2010 14:52
On Mar 12, 6:12 am, an...(a)mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> MitchAlsup <MitchAl...(a)aol.com> writes:
> >I have a sneaking suspicion that if one were to take one of the first
> >generation RISC machines and implement the circuits in the modern
> >processes that one would be quite surprised at A: how much you got,
> >and B: how uncompetitive they would be.
>
> >Frequency wise, these older designs would scale quite well in
> >comparison to their modern counterparts. The 33 MHz MIPS design might
> >not run well in the 3 GHz range (as a 32-bit implementation), but it
> >would run in the multi-GHz range.
>
> Maybe I should trust you on this, because you are working on these
> things, but all the clock rates I have seen from real chips make me
> doubt this.
>
> In particular, even single-issue in-order designs with these kinds of
> clock rates (e.g., Cell's PPE), and even designs that did not reach
> such clock rates (e.g., the VIA C3 etc.), have >10 pipeline stages.
> The Niagara, which has a 6-stage pipeline, only reached 1.4 GHz. ARM
> implementations with around five stages never reached 1 GHz AFAIK,
> although the desire to save power may have to do with it.

I agree with your data; however, each of these designs added stuff to the pipelines not present in the originals. In addition, I was also assuming access to an Intel-class 'processor' semiconductor technology (worth at least a factor of 2X in frequency compared to more typical fabs).

> Also, shrinking a CPU in a way that increases performance a lot seemed
> to be a non-trivial task, even in those times when there was still
> performance to be had from shrinking. I remember the disappointing
> speedups from 21264 shrinks, although according to Wikipedia the
> eventual top clock rates were not so bad; still, they came out late,
> and the clock rate increases by Intel and AMD were much better, so
> Alpha was passed in clock rate by IA-32 CPUs.
Intel and AMD had at least 3X as many people "beating on gates" as did DEC. This is the old "cubic dollars" argument. The amount of money available, when one is shipping more than 1 million CPUs per week, is sufficient to buy enough engineers to beat the vast majority of implementation problems into submission.

> >Performance wise, the old 0.7 IPC design point of the first generation
> >RISC machines would be degraded into about 0.35-to-0.5 IPC simply
> >because of the delay to main memory at the multi-GHz operating point,
> >even after some good caching was applied outside the tiny speck of
> >silicon that passed for a processor, way back when.
>
> Hmm, if you couple such a CPU with a fast 6MB L2 cache (like the one
> in the Core 2), would the performance impact from memory accesses be
> so bad? For main memory accesses OoO does not help much, either, and
> the L2 accesses would be about as fast (relative to the clock) as main
> memory was back in the 33 MHz days; actually, thinking about how slow
> main memory was on our DecStations (IIRC ~1000ns latency on our
> DecStation 5000/150), it would be faster. Of course, it does not make
> much sense to use a big L2 cache with a tiny CPU core.

Once you are actually waiting on memory, the small simple CPU waits a lot more power-efficiently than the great big OoO CPU. When comparing benchmark applications, one typically sees around 1.0 IPC. When looking at 'commercial' applications one might see 0.1 to 0.2 IPC. The small simple machine would suffer a 2X degradation compared to the big OoO on benchmark applications, but would suffer only 10%-15% on a commercial application, all due to memory latency.

But back to your design of a couple of CPUs coupled to a shared 6MB L2: the size of such a small CPU is about the size of 64 KBytes of cache, so your system would end up being 1-2% CPUs and 98-99% cache. My guess is that there is a better balance point between CPUs and cache that achieves a better system throughput.
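The benchmark-vs-commercial comparison above can be sketched with a toy stall model (every CPI and stall count below is an illustrative assumption, not a measurement from the post):

```python
# Toy model: when stall cycles dominate, the gap between a big OoO core
# (lower compute CPI) and a small in-order core shrinks.
# All CPIs and stall counts are illustrative assumptions.

def runtime_cycles(instructions, compute_cpi, stall_cycles_per_instr):
    return instructions * (compute_cpi + stall_cycles_per_instr)

N = 1_000_000
# Benchmark-like: ~0.1 stall cycles/instr, so compute CPI dominates.
bench_ooo    = runtime_cycles(N, 1.0, 0.1)
bench_simple = runtime_cycles(N, 2.0, 0.1)
# Commercial-like: heavy stalls (total IPC ~0.1) swamp both designs alike.
comm_ooo    = runtime_cycles(N, 1.0, 8.0)
comm_simple = runtime_cycles(N, 2.0, 8.0)

print(bench_simple / bench_ooo)  # ~1.9x: roughly the 2X benchmark gap
print(comm_simple / comm_ooo)    # ~1.11x: ~11% when memory-bound
```

The exact ratios depend entirely on the assumed stall counts; the point is only that a fixed compute-CPI advantage washes out as memory stalls grow.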
The one I postulated 4 years ago consisted of 4 microCPUs coupled to a 4-banked 256KB L2, then coupled to the big on-die cache of "whatever fit". The L2 was capable of 5-cycle access with 4 accesses to different banks simultaneously. My design in particular subdivided the banks into 4 columns and allowed one access per column per cycle. The center of this arrangement contained the L2 queues, the miss buffers and victim buffers, and the logic that 'talked' to the outer hierarchy. During an L2 access the entire microCPU was 'declocked' to save power.

During this effort, we found "no particular reason" that the microCPU would not run at least as fast as Opteron (where all the parts had been stolen {ALUs, FPUs, Reg Files, Cache Blocks, TLBs, L3s, ...}). Due to its power consumption, and the fact that it fits within the linear distance of a single clock cycle, there was potential upside on the frequency. That is, whereas Opteron was 4 clocks top to bottom and 3 clocks side to side, PSP (a.k.a. microCPU) was smaller than a clock in both dimensions. This helps big time in skew management.

The important point from all of this was that if you want to make small simple CPUs, A: you can, but B: you have to build at least as good a cache hierarchy to support them as you do with the big OoO CPUs. Latency matters, and if you are building latency-sensitive machines, you have to pound latency out of the cache hierarchy until you have stumbled upon enough MLP (across processors) to keep the DRAMs busy.

Mitch
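The one-access-per-bank-per-cycle L2 described above can be sketched as a toy arbiter (the bank count comes from the post; the line size, function names, and greedy conflict policy are my assumptions):

```python
# Hypothetical sketch of a 4-banked L2 arbiter: one access per bank per
# cycle; same-cycle requests to the same bank must serialize.
# Line size and scheduling policy are assumptions for illustration.

NUM_BANKS = 4
LINE_BYTES = 64  # assumed cache-line size

def bank_of(addr):
    """Interleave cache lines across banks by low line-address bits."""
    return (addr // LINE_BYTES) % NUM_BANKS

def schedule(requests):
    """Greedy one-per-bank-per-cycle scheduler; returns per-cycle grants."""
    cycles = []
    pending = list(requests)
    while pending:
        granted_banks = set()
        this_cycle, deferred = [], []
        for addr in pending:
            b = bank_of(addr)
            if b in granted_banks:
                deferred.append(addr)   # bank conflict: retry next cycle
            else:
                granted_banks.add(b)
                this_cycle.append(addr)
        cycles.append(this_cycle)
        pending = deferred
    return cycles

# Four accesses to four different banks complete in one cycle;
# four accesses to the same bank take four cycles.
print(len(schedule([0, 64, 128, 192])))   # 1
print(len(schedule([0, 256, 512, 768])))  # 4
```

This is only the arbitration idea; the real design also has the queues, miss buffers, and victim buffers the post mentions.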