The future of CPU based computing, mini clusters. [Computer Architecture]

From: Brett Davis on 25 Oct 2009 17:10

This was the NVidia Fermi thread, subject changed.

> >Do you need a main CPU if your GPU has 400 processors?
> >
> >The answer for Windows and Unix is yes you need a CPU, but for OS/X I am
> >not so sure. For the next generation of game consoles if you could throw
> >out the CPU you could cut your costs in almost half, with no loss in
> >performance if done correctly...
>
> How big and how capable is that CPU/GPU you have 400 copies of?

Wimpy, only one quarter the speed of a "real" CPU at the same clock, or
less, way less. The tradeoff is you get ~25 times as many CPUs per die
area.
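
To put rough numbers on that tradeoff, here is a minimal sketch in Python. It assumes the figures above (a quarter of the speed, ~25 times the cores per die area) and an Amdahl-style split between serial and parallel work. The function name and the serial/parallel model are illustrative assumptions, not measurements.

```python
# Back-of-envelope model of "many wimpy cores vs. one real CPU" for the
# same die area. Both constants are the rough figures quoted in the post.
RELATIVE_SPEED = 0.25   # one small core runs at 1/4 the speed of a "real" CPU
CORES_PER_AREA = 25     # small cores that fit in the area of one real CPU


def relative_throughput(parallel_fraction: float) -> float:
    """Speed of the small-core array relative to one real CPU.

    parallel_fraction is the share of the work that spreads perfectly
    across all cores; the rest runs on a single small core.
    (Hypothetical Amdahl-style model, assumed for illustration.)
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_time = (1.0 - parallel_fraction) / RELATIVE_SPEED
    parallel_time = parallel_fraction / (RELATIVE_SPEED * CORES_PER_AREA)
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.9, 1.0):
        print(f"parallel fraction {p:.2f}: {relative_throughput(p):.2f}x")
```

Under this model, perfectly parallel work comes out at 6.25 times the throughput of the big CPU, fully serial work at 0.25 times, and the break-even point sits at roughly 78% parallel work.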

> QNX runs pretty well on a 286, and can run on a few tens of slow
> processors pretty well. How much memory, what kind of mmu, how
> are common buses interfaced, and how you do interrupts?

Your MMU design, or ways to avoid using your MMU, are critical.
Also related is your L1 and L2 connectivity to your CPUs.

Ideally you want a code thread to say it wants a cluster of 16 CPUs with
a shared MMU/L1/L2, and then let that thread spawn 100 sub-threads in the
shared memory space of those 16 CPUs.

Otherwise you have 100 threads/CPUs fighting over MMU pages, and none of
those CPUs making any significant progress.

So you design your hardware around 16-CPU clusters, and your OS and
your apps around the same paradigm. If you do it right, the same code
will still run if the sweet spot later moves to 8 CPUs or 32 CPUs. You
gave the primary process a cluster; it does not need to know how many
CPUs it has, how much cache, or what the clock speed is.
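
A minimal sketch of that programming model in Python. The `Cluster` class and `run_subtasks` function are hypothetical names invented for illustration. A thread pool stands in for the hardware cluster, so this shows only the shape of the API, not real shared-MMU hardware, and CPython threads will not actually run this work in parallel.

```python
import os
from concurrent.futures import ThreadPoolExecutor


class Cluster:
    """Hypothetical stand-in for an OS-granted cluster of CPUs that share
    one memory space. The size is decided by the system, not the app."""

    def __init__(self, cpus=None):
        # In the design described above the OS would pick this number
        # (16 today, maybe 8 or 32 later). Here we fall back to the host.
        self._cpus = cpus or os.cpu_count() or 1
        self._pool = ThreadPoolExecutor(max_workers=self._cpus)

    def map(self, fn, items):
        # Sub-threads all run inside the cluster's shared memory space.
        return list(self._pool.map(fn, items))

    def close(self):
        self._pool.shutdown()


def run_subtasks(cluster):
    """Application code: spawns 100 sub-tasks without ever asking how
    many CPUs, how much cache, or what clock speed the cluster has."""
    return cluster.map(lambda i: i * i, range(100))


if __name__ == "__main__":
    for size in (8, 16, 32):          # the "sweet spot" moving over time
        cluster = Cluster(size)
        results = run_subtasks(cluster)
        cluster.close()
        print(size, "CPUs ->", len(results), "results")
```

The point of the sketch is that `run_subtasks` is identical for every cluster size; only the code that hands out the cluster knows the hardware details.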

This is the future of CPU based computing, mini clusters.

The huge benefit is that you only need one MMU/L1/L2 per cluster. The
MMU takes a huge piece of die real estate (and heat), as do the L1 and L2.

As for any idea of using no MMU and a completely shared memory space
like a graphics chip: that is insane. Having a thousand other processes
running broken code and scribbling all over my data and code leads to a
design that will never work in the real world. It's a house of cards in
a room full of angry two-year-olds.

> >Apple could pull this off (iConsole?), Sony might try and fail, anyone
> >else would get laughed at; it's too hard.
> >
> >FYI: Apple has its own CPU design team, does not need NVidia.
>
> I don't think cpu design is the issue here. The issues are
> systems design and os design.

Bingo, hardware companies do not understand system design or OS design.
Apple, as a software company that designs hardware to sell, does.

Few software or hardware companies can force their customer base and
developer base onto a new paradigm, one that is a difficult and costly
transition, even if that change has huge benefits. Apple can; maybe Sony.

Apple is heavily promoting Grand Central Dispatch, which has 90% of what
you need to run on these shared memory clusters I just described.

http://developer.apple.com/mac/articles/cocoa/introblocksgcd.html

Sony is betting on Larrabee, which may end up with a similar cluster
organization. MMU looks like it would be part of the ring controller
that controls memory access off the cluster. L2 is global.

My first-pass design is ATI-like, with a shared L1, though I am not sure
you can share an L1 16 ways... But with separate L1s you get a hideous
number of MMU checks to deal with between the L1 and L2. As a software
guy, this tradeoff is outside my knowledge base.

> >You would still have two types of processors, GPU work needs extra units
> >that a real CPU does not need. So you could end up with CPUs smaller
> >than on ATI's 1600 vector pipe chip. Lots smaller if you don't bother with
> >adding vector units to the CPUs. Scratch that, the real CPUs would not
> >be clustered as ten pipes running the same code. That change alone would
> >make the CPU units ~4 times bigger than the ATI units.
> >
> >I kinda like this idea, would be interesting to program for.
>
> Why would you bother having different cpus if 95% of the load
> is graphics work anyway?

For most games, less than 20% of CPU time goes to anything directly
related to graphics. And most of that 20% would be character skinning,
which is moving onto the GPU.
The landscape is chopped into pre-compiled blocks that are handed off to
the graphics chip. You spend maybe 3% on bounding box checks for those
blocks, and these checks also will move largely into the GPU over time.

Another 20% is spent on character bone animation and character physics;
this is also moving onto the GPU, or a CPU cluster...

10% goes to other physics and collisions, which are trying and failing to
move onto the GPU. These problems are actually too hard for GPUs today,
but they are perfect for a CPU cluster.

10% goes to particles; this is moving onto the GPU.

5% goes to AI; this stays on the CPU.

And a big list of other things that will stay on the CPU.

In answer to your question, a typical PC sold today has 2 CPUs and 400
GPU pipes, so yes 95% of the computation is actually on the GPU. But
without that CPU doing all the hard work, that GPU will sit idle.

In the game industry we are running out of things we can hand off to the
GPU, even if that GPU is relatively bright.

Brett

From: Mayan Moudgill on 25 Oct 2009 21:08

Brett Davis wrote:

> As for any idea of using no MMU and a completely shared memory space
> like a graphics chip: that is insane. Having a thousand other processes
> running broken code and scribbling all over my data and code leads to a
> design that will never work in the real world. It's a house of cards in
> a room full of angry two-year-olds.
>

Umm... MMU != memory protection. Various forms of base+bound protection
could be implemented that would give you protection without needing an MMU.
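
A minimal sketch of base+bound checking, written in Python for illustration. The class and exception names are hypothetical, and real hardware would do this comparison in the address path on every access rather than in software.

```python
class ProtectionFault(Exception):
    """Raised when an access falls outside the process's region."""


class BaseBoundRegion:
    """One contiguous region per process: a base address plus a size.

    There are no page tables and no translation cache. Every access is
    checked against the bound and then relocated by adding the base.
    (Hypothetical model, assumed for illustration.)
    """

    def __init__(self, base: int, bound: int):
        self.base = base      # first physical address the process owns
        self.bound = bound    # size of the region in bytes

    def translate(self, offset: int) -> int:
        # The whole protection check is a single range comparison.
        if not 0 <= offset < self.bound:
            raise ProtectionFault(f"offset {offset:#x} outside region")
        return self.base + offset


if __name__ == "__main__":
    region = BaseBoundRegion(base=0x4000, bound=0x1000)
    print(hex(region.translate(0x10)))   # inside the region
    try:
        region.translate(0x1000)         # one byte past the end
    except ProtectionFault as fault:
        print("fault:", fault)
```

A stray write from broken code in one process faults at the bound check instead of scribbling over another process's data, which is the protection property being argued about here.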

From: Joe Pfeiffer on 26 Oct 2009 00:21

Mayan Moudgill <mayan(a)bestweb.net> writes:

> Brett Davis wrote:
>
>> As for any idea of using no MMU and a completely shared memory space
>> like a graphics chip: that is insane. Having a thousand other
>> processes running broken code and scribbling all over my data and
>> code leads to a design that will never work in the real world. It's
>> a house of cards in a room full of angry two-year-olds.
>>
>
> Umm... MMU != memory protection. Various forms of base+bound
> protection could be implemented that would give you protection without
> needing an MMU.

This depends critically on your definition of MMU. I'd regard
base+bound as a really primitive MMU; you seem to require something
stronger in order to count as an MMU.
--
As we enjoy great advantages from the inventions of others, we should
be glad of an opportunity to serve others by any invention of ours;
and this we should do freely and generously. (Benjamin Franklin)

From: Brett Davis on 26 Oct 2009 00:43

In article <_-OdnSjff4QDa3nXnZ2dnUVZ_omdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:

> Brett Davis wrote:
>
> > As for any idea of using no MMU and a completely shared memory space
> > like a graphics chip: that is insane. Having a thousand other processes
> > running broken code and scribbling all over my data and code leads to a
> > design that will never work in the real world. It's a house of cards in
> > a room full of angry two-year-olds.
> >
>
> Umm... MMU != memory protection. Various forms of base+bound protection
> could be implemented that would give you protection without needing an MMU.

x86 started with base+bounds, even giving a plentiful set of offset
registers. Almost no one used it, and those registers were recycled for
other uses.

RAM is so plentiful now that even if you went with fine-grained memory
protection, you would just round up allocations and give out full pages.
That makes paging out to disk easy, which is needed when the user closes
the lid on a laptop and everything is saved.
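
The rounding described above might be sketched like this, assuming a 4 KiB page size (the thread does not specify one) and a hypothetical helper name:

```python
PAGE_SIZE = 4096  # assumed page size in bytes; not stated in the thread


def round_up_to_pages(nbytes: int) -> int:
    """Return the allocation size rounded up to a whole number of pages."""
    if nbytes < 0:
        raise ValueError("allocation size cannot be negative")
    pages = (nbytes + PAGE_SIZE - 1) // PAGE_SIZE   # ceiling division
    return pages * PAGE_SIZE


if __name__ == "__main__":
    for request in (1, 4096, 4097, 10_000):
        print(request, "->", round_up_to_pages(request))
```

The cost is some wasted memory per allocation; the payoff is that every protected region is a whole number of pages, so it can be written out to disk and restored page by page.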

Does anyone use base+bounds in any market?

Brett

From: ArarghMail910NOSPAM on 26 Oct 2009 02:29

On Mon, 26 Oct 2009 04:43:21 GMT, Brett Davis <ggtgp(a)yahoo.com> wrote:

<snip>
>
>x86 started with base+bounds, even giving a plentiful set of offset
>registers. Almost no one used it, and those registers were recycled for
>other uses.
It did? x86? When? Where? Any Docs?
<snip>
--
ArarghMail910 at [drop the 'http://www.' from ->] http://www.arargh.com
BCET Basic Compiler Page: http://www.arargh.com/basic/index.html

To reply by email, remove the extra stuff from the reply address.
