From: Robert Myers on
On Mar 2, 11:40 am, Stephen Fuld <SF...(a)alumni.cmu.edu.invalid> wrote:
> On 3/1/2010 7:15 PM, Robert Myers wrote:
>

[Stephen Fuld wrote]:
>
> I'm with Del here, though the way I heard it is that bandwidth is only
> money, but latency is forever.
>
> > Those miracles (aggressive prefetch, out of order, huge cache) have
> > been served up for years.
>
> Yes, but they are running out of steam.  Caches are a diminishing
> returns game, and there seem to be limits on the others.
>
> > We crashed through the so-called
> > memory wall long ago.
>
> No, we just moved it out some.
>
And we could move it out even more, if we were willing to spend the
watts and transistors. I'm sure Andy Glew could say much if he cared
to. That the emphasis is moving away from latency-hiding tricks (as
Andy has lamented here) is very clear.

> > It was such a relatively minor problem that
> > Intel could keep the memory controller off the die for years after
> > Alpha had proven the enormous latency advantage of putting it on die.
> > More than a decade later, Intel had to use up that "miracle."
>
> > There are no bandwidth-hiding tricks.  Once the pipe is full, that's
> > it.  That's as fast as things will go.  And, as one of the architects
> > here commented, once you have all the pins possible and you wiggle
> > them as fast as you can, there is no more bandwidth to be had.
>
> But there are lots of things we could do given enough money.  For
> example, we could integrate the memory on chip or on an MCM to eliminate
> the pin restrictions.  We are also not near the limit of pin wiggling speed.
>
> I cite as a counter-example that if we had wanted more bandwidth, and
> were willing to pay more and sacrifice some latency, we would all
> be using more banks of FB-DIMMs.
>
My original point was about *this* chip, which is aimed at a mass
market. For the foreseeable future, the hard limit for mass market
chips is going to be bandwidth, not latency. The bandwidth-latency
tradeoff is a matter of design choices, not miracles.

Latency that is exposed on the critical path is forever. It isn't
very often that you *have* to leave latency exposed on the critical
path. Once again, it is a matter of design choices.
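
To make the "design choices" point concrete, here is a minimal sketch
of one such choice: overlapping DRAM latency with useful work via
software prefetch. The prefetch distance and the use of GCC's
__builtin_prefetch are illustrative assumptions on my part, not
anything from the chip under discussion:

#include <stddef.h>

/* Sum an array while prefetching a fixed distance ahead, so memory
   latency overlaps the adds instead of sitting exposed on the
   critical path.  PF_DIST is a guess; a real value would be tuned
   to the latency and loop cost of the target machine. */
#define PF_DIST 16

double sum_prefetched(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0, 0); /* read, low temporal locality */
        s += a[i];
    }
    return s;
}

Spend a little state and bandwidth on the prefetches, and the latency
never lands on the critical path.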

Robert.
From: MitchAlsup on
On Mar 2, 1:14 am, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
> MitchAlsup wrote:
> > On Mar 1, 12:12 pm, Terje Mathisen<"terje.mathisen at tmsw.no">
> > wrote:
> >> Even with a very non-bleeding edge gpu, said gpu is far larger than any
> >> of those x86 cores which many people here claim to be too complicated.
>
> > A large number of pipelines all doing the same kinds of work are
> > actually simpler than a medium number of pipelines all doing different
> > kinds of work.
>
> Mitch, I do know that. :-)
>
> Taken to the logical endpoint, you get SRAMs before logic in the
> same process, new cpu models which are basically cache size increases on
> more or less the same core, as well as Larrabee-style heaps of identical
> (smallish) cores.

I remember how small my first CPU design was. That generation was so
small <in gate count relative to today's> that it would be about the
size of a single bonding pad today. Thus, one could put about 1000
first-generation RISC CPUs in the same die area as the processors
shown in the image that started this thread AND still have room for
the graphics GPU plus the 8MB cache and both interfaces!

The problem with this plan is that the L1 and L2 cache sizes are not
included in the die areas above.
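
A back-of-the-envelope version of that estimate (every number below
is my illustrative assumption; the post gives only the "about 1000"
figure):

#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions, not measurements:
       - a bond pad edge of ~140 um, so a first-generation RISC core
         shrunk to a modern process occupies roughly one pad's area;
       - ~20 mm^2 for the four x86 cores in the die photo. */
    double pad_side_mm  = 0.140;
    double core_mm2     = pad_side_mm * pad_side_mm;   /* ~0.02 mm^2 */
    double x86_area_mm2 = 20.0;

    printf("pad-sized RISC core: %.4f mm^2\n", core_mm2);
    printf("cores fitting in %.0f mm^2: ~%.0f\n",
           x86_area_mm2, x86_area_mm2 / core_mm2);     /* ~1000 */
    return 0;
}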

Mitch
From: Anton Ertl on
Stephen Fuld <SFuld(a)alumni.cmu.edu.invalid> writes:
>On 3/2/2010 6:45 AM, nik Simpson wrote:
>> On 3/2/2010 1:17 AM, Terje Mathisen wrote:
>>> nik Simpson wrote:
>>>> On 3/1/2010 12:12 PM, Terje Mathisen wrote:
[<http://www.semiaccurate.com/2010/02/27/look-intels-upcoming-sandy-bridge/>]
>>>>> Even with a very non-bleeding edge gpu, said gpu is far larger than any
>>>>> of those x86 cores which many people here claim to be too complicated.
>>>>>
>>>>> Terje
>>>>>
>>>> Isn't the GPU core still on a 45nm process, vs 32nm for the CPU and
>>>> cache?
>>>>
>>> That would _really_ amaze me, if they employed two different processes
>>> on the same die!
....
>> That's certainly the case for the Clarkdale/Westmere parts with
>> integrated graphics...
>>
>> http://www.hardocp.com/article/2010/01/03/intel_westmere_32nm_clarkdale_core_i5661_review/
>
>
>I think the confusion here is that Clarkdale is a multi-chip module (CPU
>chip plus graphics chip in one package) whereas Sandy Bridge is a single
>chip.

Another difference is that one can buy Clarkdale now, whereas Sandy
Bridge will supposedly be available next year. The die plan that
Terje commented upon is for Sandy Bridge, though.

As for the complexity: replication does not make things complex.
That's why Terje compared against a single CPU core rather than the
cores as a whole. Similarly, the GPUs have a lot of internal
replication; but adding more replication within a GPU makes more
sense than having multiple GPUs, which is why the GPU is bigger than
a single CPU core. Conversely, adding more ALUs, dispatchers, etc.
to a CPU does not help that much, so we get multi-cores instead.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: MitchAlsup on
On Mar 2, 5:28 am, n...(a)cam.ac.uk wrote:
>     2) Put the memory back-to-back with the CPU, factory integrated,
> thus releasing all existing memory pins for I/O use.  Note that this
> allows for VASTLY more memory pins/pads.

I have been thinking along these lines...

Consider a chip containing CPUs sitting in a package with a small-
to-medium number of DRAM chips, the CPU and DRAM chips orchestrated
by an interface that exploits the on-die wire density that cannot
escape the package boundary.

A: make this DRAM the only part of the coherent memory
B: use more conventional FBDIMM channels to an extended core storage
(ECS)
C: perform all <disk, network, high-speed> I/O to the ECS
D: page the ECS to the on-die DRAM as single page-sized bursts at
FBDIMM speeds
E: an efficient on-CPU-chip TLB shootdown mechanism <or coherent TLB>

A page copy to an FBDIMM-resident page would take about 150-200 ns
(a rough sketch of that arithmetic follows below); and this is about
the access time of a single line if the whole ECS were made coherent!

F: a larger ECS can be built <if desired> by implementing an FBDIMM
multiplexer
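
For what it's worth, the 150-200 ns figure is consistent with
striping a 4 KB page across a few FBDIMM channels; the channel count
and per-channel bandwidth below are my assumptions, chosen only to
show the arithmetic:

#include <stdio.h>

int main(void)
{
    /* Assumed numbers -- the post states only the 150-200 ns result. */
    double page_bytes = 4096.0;
    double chan_bps   = 6.4e9;  /* rough peak of one FBDIMM channel */
    int    channels   = 4;      /* assume the burst is striped */

    double ns = page_bytes / (chan_bps * channels) * 1e9;
    printf("4 KB page over %d channels: ~%.0f ns\n", channels, ns);
    return 0;
}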

Mitch
From: Anne & Lynn Wheeler on

MitchAlsup <MitchAlsup(a)aol.com> writes:
> I have been thinking along these lines...
>
> Consider a chip containing CPUs sitting in a package with a small-
> to-medium number of DRAM chips, the CPU and DRAM chips orchestrated
> by an interface that exploits the on-die wire density that cannot
> escape the package boundary.
>
> A: make this DRAM the only part of the coherent memory
> B: use more conventional FBDIMM channels to an extended core storage
> (ECS)
> C: perform all <disk, network, high-speed> I/O to the ECS
> D: page the ECS to the on-die DRAM as single page-sized bursts at
> FBDIMM speeds
> E: an efficient on-CPU-chip TLB shootdown mechanism <or coherent TLB>
>
> A page copy to an FBDIMM-resident page would take about 150-200 ns;
> and this is about the access time of a single line if the whole ECS
> were made coherent!
>
> F: a larger ECS can be built <if desired> by implementing an FBDIMM
> multiplexer

this was somewhat like the 3090 in the 80s (but a room full of boxes)
... modulo not quite doing i/o into/out-of extended store. the issue was
that physical packaging couldn't get all the necessary real storage
within the processor latency requirements.

there was a wide-bus, (relatively) very fast synchronous instruction
that moved 4k bytes between processor storage and extended store.

at the time, i complained about not being able to do i/o directly
into/out-of extended store.

there was something halfway in between ... when attempting to support
HIPPI, the standard 3090 i/o interface couldn't support the bandwidth
... so they hacked into the side of the extended store bus for HIPPI
I/O.

--
42yrs virtualization experience (since Jan68), online at home since Mar1970