From: Mayan Moudgill
Brett Davis wrote:
> The future of CPU based computing, mini clusters.

I've been figuring out how to answer this, given that a real answer
would require a lot of background, and would have to be heavily
qualified. So, some things I say may not seem well justified, or may
sound categorical where they should be conditional. And of course,
everything is approximate.

The problem that (IMO) they are trying to tackle is the disconnect
between compute capacity on one hand and memory bandwidth and latency on
the other.

To put things in perspective, let's see what it would take to keep the
floating point pipelines fully occupied. Assume a 2GHz clock.

* A floating point pipeline can sustain ~2 Gops/sec (one op per cycle
at 2GHz). Assuming 1 output and 2 inputs per op, and double-precision
(8-byte) operands, it needs 2G x 3 x 8 = 48 GB/sec.
* A floating point pipeline has a depth of ~5-8 cycles. This means that
there must be 5-8 independent floating point operations available for
full utilization.
* External (DRAM) memory costs ~200 cycles for the initial access,
plus some delta for subsequent accesses (I know I'm being optimistic).
To cover this latency fully, we need 200 independent FP ops in flight.
* Assuming Qimonda's GDDR5 part, each chip can supply 4GB/s, holds
64MB, and has ~60 pins. So, to get 48GB/sec, assuming 50% efficiency
(I'm being generous), one needs 24 chips: a minimum configuration of
1.5GB and 1440 pins. (The arithmetic is sketched below.)
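
To make that concrete, here is the arithmetic as a small Python sketch
(the 2GHz clock, 3 operands per op, 50% bus efficiency, and the Qimonda
per-chip figures are exactly the assumptions stated in the bullets
above):

    # Back-of-envelope check of the numbers above.
    CLOCK_HZ     = 2e9       # 2GHz clock, one FP op per cycle
    OPERANDS     = 3         # 1 output + 2 inputs per op
    BYTES_PER_DP = 8         # double-precision word
    demand = CLOCK_HZ * OPERANDS * BYTES_PER_DP
    print(demand / 1e9)      # 48.0 GB/s per pipe

    # GDDR5 chips needed: 4GB/s peak, 64MB and ~60 pins per chip,
    # and the 50% bus efficiency assumed above.
    chips = demand / (4e9 * 0.5)
    print(chips)             # 24.0 chips
    print(chips * 64 / 1024) # 1.5 GB minimum capacity
    print(chips * 60)        # 1440.0 pins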

These are all for *1* FP pipe. For 32 pipes, the bandwidth and the
resources required go up by 32.

One can use on-chip caches to work around some of the constraints. At
1MB or more per mm^2 (for 1T-SRAM/eDRAM), you can budget about 64MB
on-chip. To keep off-chip accesses reasonable, with 32 pipes, we
probably want a miss rate of less than 2%.
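
As a quick sanity check on that 2% figure (a sketch; it assumes cache
misses are the only off-chip traffic):

    # Off-chip traffic for 32 pipes behind a shared 64MB cache.
    on_chip_demand = 32 * 48e9        # ~1.5 TB/s of operand traffic
    for miss_rate in (0.02, 0.01):
        print(miss_rate, on_chip_demand * miss_rate / 1e9, "GB/s off-chip")
    # 2% -> ~31 GB/s, 1% -> ~15 GB/s: within the ~48 GB/s that the
    # 24-chip GDDR5 configuration above can actually deliver.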

Now, the miss rate is application dependent, of course. But it is
imperative that whatever app you run on this processor be tailored to
have a resident set size of less than 64MB.

Let's assume that we can fit it so that we expect a miss rate of 1%.
That means that ~1 out of 30 ops will take (at least) 200 cycles to
complete [remember each FP op has 3 accesses in it]. Add this to the
depth of the FP pipe and the delay to access the various levels of the
caches, and we end up with an average latency of about 20 cycles per
op. This means that for one pipe, we probably need 20+ threads, and
for 32 FP pipes we need 640+ threads!
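
In rough numbers (a Python sketch; the cache-level delay below is an
illustrative figure I'm plugging in, not a measured one):

    # Average latency per FP op with a 1% per-access miss rate.
    MISS_RATE   = 0.01
    ACCESSES    = 3      # 2 reads + 1 write per op
    MISS_CYCLES = 200    # DRAM latency from above
    PIPE_DEPTH  = 6      # mid-point of the 5-8 cycle range
    CACHE_HIT   = 8      # illustrative delay for cache-level hits
    avg = PIPE_DEPTH + CACHE_HIT + ACCESSES * MISS_RATE * MISS_CYCLES
    print(avg)           # 20.0 cycles per op
    # With ~1 dependent op in flight per thread, that is ~20 threads
    # to keep one pipe busy, and 640+ threads across 32 pipes.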

Each of those threads will need:
- registers (or some equivalent to hold state)
- I$ (not directly, but additional threads => more PCs => pressure on
the I$)
- higher-level D$ (this assumes there is more than 1 level of D$; if
we can meet the bandwidth using the 64MB D$ alone, it is arguable that
we should not have any other D$. Among other things, coherence then
becomes trivial).

>>How big and how capable is that CPU/GPU you have 400 copies of?
>
>
> Wimpy, only one quarter the speed of a "real" CPU at the same clock, or
> less, way less.

After balancing out these concerns, you are probably going to be left
with a processor that, for single threaded applications, should be
viewed as being closer to 100-250MHz.

> The tradeoff is you get ~25 times as many CPUs per die
> area.

The problem with these comparisons is that, for massively parallel
execution to work, you must still control the off-chip miss rate. After
that, multi-threading can kick in and hide memory hierarchy and pipeline
latencies.

However, if you can control the miss rates, an alternative design,
such as 4 conventional cores augmented by 8-way SIMD, might be within
2x of the performance of the multi-threaded structure, while having
single-core performance within 2x of the best single-core designs.
From: Brett Davis
In article <4AE982F3.7060704(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:

> Brett Davis wrote:
> > The future of CPU based computing, mini clusters.
>
> I've been figuring out how to answer this, given that a real answer
> would require a lot of background, and would have to be heavily
> qualified. So, some things I say may not seem well justified, or may
> sound categorical where they should be conditional. And of course,
> everything is approximate.
>
> * Assuming Qimonda's GDDR5 part, each chip can supply 4GB/s, holds
> 64MB, and has ~60 pins. So, to get 48GB/sec, assuming 50% efficiency
> (I'm being generous), one needs 24 chips: a minimum configuration of
> 1.5GB and 1440 pins.

GDDR5, how quaint. ;)
I am assuming that in two to four years we start switching to embedded
RRAM. 8 gigs on die with multiple 1024 bit busses, etc.

> These are all for *1* FP pipe. For 32 pipes, the bandwidth and the
> resources required go up by 32.

We already have 1600 vector pipes on ATI chips; 400 CPUs is quite a bit
less potential flops.

Game software expands to use up all resources: RAM, flops, bandwidth.

Rendering is going to change from polys to raytracing or more likely
Reyes. (Sub-pixel sampling.) Reyes is cache friendly, it just needs an
order of magnitude more flops than today, to move from Pixar movies to
realtime on your PC.

Brett
From: Mayan Moudgill
Brett Davis wrote:

> In article <4AE982F3.7060704(a)bestweb.net>,
> Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>
>>Brett Davis wrote:
>>
>>>The future of CPU based computing, mini clusters.
>>
>>I've been figuring out how to answer this, given that a real answer
>>would require a lot of background, and would have to be heavily
>>qualified. So, some things I say may not seem well justified, or may
>>sound categorical where they should be conditional. And of course,
>>everything is approximate.
>>
>>* Assuming Qimonda's GDDR5 part, each chip can supply 4GB/s, holds
>>64MB, and has ~60 pins. So, to get 48GB/sec, assuming 50% efficiency
>>(I'm being generous), one needs 24 chips: a minimum configuration of
>>1.5GB and 1440 pins.
>
>
> GDDR5, how quaint. ;)
> I am assuming that in two to four years we start switching to embedded
> RRAM. 8 gigs on die with multiple 1024 bit busses, etc.
>

Embedded in 2 to 4 years *may* allow you to get to 128MB (256MB, *maybe*
on a really huge chip). Getting it all on-chip is not happening.

However, I am sure that in 2-4 years we will see increases in
bandwidth/pin and data/chip.

>>These are all for *1* FP pipe. For 32 pipes, the bandwidth and the
>>resources required go up by 32.
>
>
> We already have 1600 vector pipes on ATI chips; 400 CPUs is quite a bit
> less potential flops.

Agreed, with the caveat that they appear to be SP pipes, not DP (it's
320 DP pipes). And the clocking is less than 1GHz.

Redoing the memory demand for SP, we come up with ~1.5 TB/s to keep the
vector pipes filled, while the *peak* off-chip bandwidth is 1/10th that.
So, extracting performance is going to be a function of locality and
hit-rates.
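
To put a number on that (a sketch using only the two figures above; it
says nothing about whether such a hit rate is achievable):

    # Share of operand traffic that must be served on-chip, given
    # ~1.5 TB/s of demand and peak off-chip bandwidth of ~1/10th that.
    demand   = 1.5e12
    off_chip = demand / 10
    print(1.0 - off_chip / demand)   # 0.9: 90%+ must hit on-chip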

> Game software expands to use up all resources: RAM, flops, bandwidth.
>
> Rendering is going to change from polys to raytracing or more likely
> Reyes. (Sub-pixel sampling.) Reyes is cache friendly, it just needs an
> order of magnitude more flops than today, to move from Pixar movies to
> realtime on your PC.

Raytracing, I understand. I'm not sure it's cache friendly, but at least
I know the basic algorithm. About Reyes, I'm woefully ignorant. I'll
have to do some reading before I can even ask some dumb questions.
From: nmm1
In article <4AEABE6C.9080408(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>Brett Davis wrote:
>
>> I am assuming that in two to four years we start switching to embedded
>> RRAM. 8 gigs on die with multiple 1024 bit busses, etc.
>
>Embedded in 2 to 4 years *may* allow you to get to 128MB (256MB, *maybe*
>on a really huge chip). Getting it all on-chip is not happening.

It depends slightly on what you mean by "embedded". Some people
classify the CPUs that go in printers and control aircraft as that.

Technically, I can't see a problem with 8 GB/package in 2-4 years,
if any suitable company wanted to make them, and whether a package
includes only one chip or several is a purely manufacturing detail.
But it WOULD mean facing up to some unpalatable decisions. It isn't
likely, but it's possible.


Regards,
Nick Maclaren.
From: Mayan Moudgill
nmm1(a)cam.ac.uk wrote:
> In article <4AEABE6C.9080408(a)bestweb.net>,
> Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>>Brett Davis wrote:
>>
>>
>>>I am assuming that in two to four years we start switching to embedded
>>>RRAM. 8 gigs on die with multiple 1024 bit busses, etc.
>>
>>Embedded in 2 to 4 years *may* allow you to get to 128MB (256MB, *maybe*
>>on a really huge chip). Getting it all on-chip is not happening.
>
>
> It depends slightly on what you mean by "embedded". Some people
> classify the CPUs that go in printers and control aircraft as that.

To clarify, in this context, embedded = eDRAM/1T-SRAM (embedded
memory).