From: Morten Reistad on
In article <ggtgp-4FCC6C.16102725102009(a)netnews.asp.att.net>,
Brett Davis <ggtgp(a)yahoo.com> wrote:
>The future of CPU based computing, mini clusters.
>
>This was the NVidia Fermi thread, subject changed.
>
>> >Do you need a main CPU if your GPU has 400 processors?
>> >
>> >The answer for Windows and Unix is yes, you need a CPU, but for OS X I am
>> >not so sure. For the next generation of game consoles, if you could throw
>> >out the CPU you could cut your costs almost in half, with no loss in
>> >performance if done correctly...
>>
>> How big and how capable is that CPU/GPU you have 400 copies of?
>
>Wimpy, only one quarter the speed of a "real" CPU at the same clock, or
>less, way less. The tradeoff is you get ~25 times as many CPUs per die
>area.

That would be on the order of a VIA C3 or some such, 600-800 MHz,
still usably fast for code that farms out well into threads.

>> QNX runs pretty well on a 286, and can run pretty well on a few
>> tens of slow processors. How much memory, what kind of MMU, how
>> are common buses interfaced, and how do you do interrupts?
>
>Your MMU design, or ways to not use your MMU, are critical.
>Also related is your L1 and L2 connectivity to your CPUs.
>
>Ideally you want a code thread to say it wants a cluster of 16 cpus with
>shared MMU/L1/L2, and then let that thread spawn 100 sub-threads in that
>shared memory space of 16 CPUs.

Both the basic Unix design and e.g. QNX can live with segment descriptors
as the MMU objects. The demand-paging stuff is extra, so the
MMU can be a pretty simple one. The 286 model works pretty well,
actually.
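
To make that concrete: a minimal C sketch of the 286-style model, one
base+limit descriptor per segment, translation is an add, protection is
a bounds check. Field widths follow the 286 manual; the names are mine.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t base;   /* 24-bit physical base address */
    uint16_t limit;  /* last valid offset within the segment */
} seg_desc;

/* Fills *phys if the 16-bit offset lies inside the segment;
   a real 286 raises a protection fault on the failure path. */
static bool translate(const seg_desc *d, uint16_t off, uint32_t *phys)
{
    if (off > d->limit)
        return false;          /* out of bounds: isolation, no paging */
    *phys = d->base + off;     /* 24-bit sum, as in the 286 manual */
    return true;
}

No page tables, no TLB walk; that is the whole MMU.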

>Otherwise you have 100 threads/CPUs fighting over MMU pages, and none of
>those CPUs making any significant progress.
>
>So you design your hardware around 16 CPU clusters, and your OS, and
>your apps around the same paradigm. If you do it right, over time if the
>sweet spot moves to 8 CPUs or 32 CPUs, the same code will still run. You
>gave the primary process a cluster; it does not need to know how many
>CPUs, how much cache, or what the clock speed is.
>
>This is the future of CPU based computing, mini clusters.
>
>The huge benefit is that you only need one MMU/L1/L2 per cluster. The
>MMU is a huge piece of die real estate (and heat), as is the L1 and L2.

But you still get process isolation, right?

>As for any idea of using no MMU and a completely shared memory space
>like a graphics chip: that is insane. Having a thousand other processes
>running broken code and scribbling all over my data and code leads to a
>design that will never work in the real world. It's a house of cards in
>a room full of angry two-year-olds.

Yep. You need basic process isolation, but the paging stuff is a
historical legacy now.

>> >Apple could pull this off (iConsole?), Sony might try and fail, anyone
>> >else would get laughed at; it's too hard.
>> >
>> >FYI: Apple has its own CPU design team, does not need NVidia.
>>
>> I don't think CPU design is the issue here. The issues are
>> systems design and OS design.
>
>Bingo, hardware companies do not understand system design or OS design.
>Apple, as a software company that designs hardware to sell, does.
>
>Few software or hardware companies can force their customer base and
>developer base onto a new paradigm, one that is a difficult and costly
>transition. Even if that change has huge benefits. Apple can, maybe Sony.

Why not take an OS that runs _only_ in the GPU clusters, and
let whatever stuff is sold with the machine handle the "main" CPU?
This is a stellar chance to migrate "below the radar".

>Apple is heavily promoting Grand Central Dispatch, which has 90% of what
>you need to run on these shared memory clusters I just described.
>
>http://developer.apple.com/mac/articles/cocoa/introblocksgcd.html
>
>Sony is betting on Larrabee, which may end up with a similar cluster
>organization. MMU looks like it would be part of the ring controller
>that controls memory access off the cluster. L2 is global.
>
>My first-pass design is ATI-like, with shared L1; not sure you can share
>an L1 16 ways... But with separate L1s you get a hideous number of MMU
>checks to deal with between the L1 and L2. Being a software guy, this
>tradeoff is outside my knowledge base.

The "big" CPUs behave terribly here. They are somewhat saved by
HyperTransport, which is just a raw-speed pipe for peeking into other caches.

A "logarithmic" layout, 16 CPUs, 4 L1s, 1 L2, may be a way to
go.
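
Just to pin the fan-in down, a trivial C fragment (all names mine): each
CPU id maps to an L1 by dividing by four, and everyone shares the one L2.

enum { NCPU = 16, NL1 = 4, NL2 = 1 };

static int l1_of(int cpu) { return cpu / (NCPU / NL1); }  /* 4 CPUs per L1 */
static int l2_of(int cpu) { return cpu / (NCPU / NL2); }  /* all 16 on L2 0 */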

>
>> >You would still have two types of processors; GPU work needs extra units
>> >that a real CPU does not need. So you could end up with CPUs smaller
>> >than on ATI's 1600-vector-pipe chip. Lots smaller if you don't bother with
>> >adding vector units to the CPUs. Scratch that: the real CPUs would not
>> >be clustered as ten pipes running the same code. That change alone would
>> >make the CPU units ~4 times bigger than the ATI units.
>> >
>> >I kinda like this idea, would be interesting to program for.
>>
>> Why would you bother having different CPUs if 95% of the load
>> is graphics work anyway?
>
>For most games less than 20% of the CPU is doing anything directly
>related to graphics. And most of that 20% would be character skinning,
>which is moving onto the GPU.
>The landscape is chopped into pre-compiled blocks that are handed off to
>the graphics chip. You spend maybe 3% on bounding box checks for those
>blocks, and these checks also will move largely into the GPU over time.
>
>Another 20% is spent on character bone animation and character physics,
>this is also moving onto the GPU, or a CPU cluster...
>
>10% on other physics and collisions, which is trying and failing to move
>onto the GPU. These problems are actually too hard for GPUs today, but
>they are perfect for a CPU cluster.
>
>10% on particles, this is moving onto the GPU.
>
>5% in AI, this stays on the CPU.
>
>And a big list of other things that will stay on the CPU.
>
>In answer to your question, a typical PC sold today has 2 CPUs and 400
>GPU pipes, so yes 95% of the computation is actually on the GPU. But
>without that CPU doing all the hard work, that GPU will sit idle.
>
>In the game industry we are running out of things we can hand off to the
>GPU, even if that GPU is relatively bright.

Is this because of the need for serial speed ("big" CPUs), memory footprint,
or organisational issues where you simply do not have an OS and scheduler
for running the GPU units as if they were a large set of normal CPUs?

-- mrr

From: ArarghMail910NOSPAM on
On Mon, 26 Oct 2009 05:09:02 -0500,
ArarghMail910NOSPAM(a)NOT.AT.Arargh.com wrote:

>On 26 Oct 2009 08:30:25 GMT, Andrew Reilly
><andrew-newspost(a)areilly.bpc-users.org> wrote:
>
>>On Mon, 26 Oct 2009 01:29:17 -0500, ArarghMail910NOSPAM wrote:
>>
>>> On Mon, 26 Oct 2009 04:43:21 GMT, Brett Davis <ggtgp(a)yahoo.com> wrote:
>>>
>>> <snip>
>>>>
>>>>x86 started with base+bounds, even giving a plentiful set of offset
>>>>registers. Almost no one used it, and those registers were recycled for
>>>>other uses.
>>> It did? x86? When? Where? Any Docs? <snip>
>>
>>That's more or less how the 286 MMU worked, and (I think) versions of QNX
>That's not how I remember it, but I will dig out the manual and see.
Drat. Can't find the 286 manual.
<snip>
--
ArarghMail910 at [drop the 'http://www.' from ->] http://www.arargh.com
BCET Basic Compiler Page: http://www.arargh.com/basic/index.html

To reply by email, remove the extra stuff from the reply address.
From: Jeremy Linton on
ArarghMail910NOSPAM(a)NOT.AT.Arargh.com wrote:
> On Mon, 26 Oct 2009 05:09:02 -0500,
> ArarghMail910NOSPAM(a)NOT.AT.Arargh.com wrote:
>
>> On 26 Oct 2009 08:30:25 GMT, Andrew Reilly
>> <andrew-newspost(a)areilly.bpc-users.org> wrote:
>>
>>> On Mon, 26 Oct 2009 01:29:17 -0500, ArarghMail910NOSPAM wrote:
>>>
>>>> On Mon, 26 Oct 2009 04:43:21 GMT, Brett Davis <ggtgp(a)yahoo.com> wrote:
>>>>
>>>> <snip>
>>>>> x86 started with base+bounds, even giving a plentiful set of offset
>>>>> registers. Almost no one used it, and those registers were recycled for
>>>>> other uses.
>>>> It did? x86? When? Where? Any Docs? <snip>
>>> That's more or less how the 286 MMU worked, and (I think) versions of QNX
>> That's not how I remember it, but I will dig out the manual and see.
> Drat. Can't find the 286 manual.

There are 286 manuals here:
http://www.ragestorm.net/downloads/286intel.txt
http://datasheets.chipdb.org/Intel/x86/286/datashts/intel_M80C286.pdf
(page 12 has the descriptor formats)

From the manual section 6.4:
"Finally, the segment descriptor contains the physical base address of
the target segment, as well as size (limit) and access information. The
processor sums the 24-bit segment base and the specified 16-bit offset
to generate the resulting 24-bit physical address."
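
(As a made-up illustration of that sum: a segment base of 0x123400 plus
an offset of 0x0567 yields the physical address 0x123967, assuming the
offset passes the limit check.)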

One could argue that the 286 was the first x86 with an MMU, and
therefore the statement that x86 started with base+bounds would be correct.

I have fond memories of my 286 and its protected mode. I remember it
worked pretty well in Windows 3.0. 286 protected mode got removed shortly
after, leaving only real mode and 386 protected mode, and I was unhappy
about that, as 386 protected mode ran significantly slower on my machine.

From: Brett Davis on
In article <il1hr6-hck.ln1(a)laptop.reistad.name>,
Morten Reistad <first(a)last.name> wrote:

> In article <ggtgp-4FCC6C.16102725102009(a)netnews.asp.att.net>,
> Brett Davis <ggtgp(a)yahoo.com> wrote:
> >The future of CPU based computing, mini clusters.

> >> >Do you need a main CPU if your GPU has 400 processors?
> >
> >So you design your hardware around 16 CPU clusters, and your OS, and
> >your apps around the same paradigm. If you do it right, over time if the
> >sweet spot moves to 8 CPUs or 32 CPUs, the same code will still run. You
> >gave the primary process a cluster; it does not need to know how many
> >CPUs, how much cache, or what the clock speed is.
> >
> >The huge benefit is that you only need one MMU/L1/L2 per cluster. The
> >MMU is a huge piece of die real estate (and heat), as is the L1 and L2.
>
> But you still get process isolation, right?

I am fairly indifferent about process isolation inside a cluster.
I figure that generally you are running the same code on 1000 items.
So a programmer gets a cluster sandbox that is all his property.
The OS would wait for all threads to finish before resetting the sandbox
and giving the cluster to another process group.
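
For the "same code on 1000 items" case, GCD already has the right shape.
A minimal sketch using the plain-C function-pointer entry points of
libdispatch (the blocks syntax needs a compiler extension); the work
function and the data here are made up:

#include <dispatch/dispatch.h>
#include <stdio.h>

/* Hypothetical per-item work; GCD calls it once per index. */
static void transform_item(void *ctx, size_t i)
{
    float *items = ctx;
    items[i] *= 2.0f;             /* placeholder computation */
}

int main(void)
{
    enum { N = 1000 };
    static float items[N];
    dispatch_queue_t q =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    /* Blocks until all N iterations finish; the library picks the
       thread count, so the code never asks how many CPUs it got. */
    dispatch_apply_f(N, q, items, transform_item);
    printf("%g\n", items[0]);
    return 0;
}

The caller states the parallelism and waits for everything to complete,
which is the same handshake as the cluster sandbox above.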

One could argue that this wastes CPUs, but that's antiquated thinking:
you have THOUSANDS of these wimpy CPUs, in HUNDREDS of clusters. The
last thing you want is some irresponsible memory-thrashing process
trying to "share" your CPU cluster with you. That would FAIL.

To pull this off you need KISS at all levels: the OS would not care
about individual CPUs, only about clusters, and with hundreds of
clusters it has its hands full as it is.
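
What that cluster-granularity scheduler might look like, as a C sketch;
everything here is hypothetical, just the shape of the idea:

/* The OS's unit of allocation is a whole cluster, never a CPU. */
typedef struct {
    int busy;          /* owned by some process group? */
    int live_threads;  /* threads still running in the sandbox */
} cluster;

#define NCLUSTERS 256
static cluster clusters[NCLUSTERS];

/* Hand out a free cluster; the caller sets live_threads as it
   spawns.  With hundreds of clusters a linear scan is fine. */
static cluster *grab_cluster(void)
{
    for (int i = 0; i < NCLUSTERS; i++)
        if (!clusters[i].busy) {
            clusters[i].busy = 1;
            return &clusters[i];
        }
    return 0;                  /* every cluster is spoken for */
}

/* Called as each thread in the group exits; the last one out lets
   the OS reset the sandbox and reissue the whole cluster. */
static void thread_done(cluster *c)
{
    if (--c->live_threads == 0)
        c->busy = 0;           /* wipe MMU/L1/L2 state here first */
}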

> Why not take an OS that runs _only_ in the gpu clusters, and
> let whatever stuff is sold with the machine handle the "main" cpu?
> This is a stellar chance to migrate "below the radar".

GPUs do not run real code; they run code fragments on pixels/data.
Your CPU runs the ATI OS code to manage the ATI GPU.

> A "logarithmic" layout, 16 CPUs, 4 L1s, 1 L2, may be a way to
> go.

I like this.

> >In the game industry we are running out of things we can hand off to the
> >GPU, even if that GPU is relatively bright.
>
> Is this because of need for serial speed ("big" cpus), memory footprint,
> or organisational issues where you simply do not have an os and scheduler
> for running the gpu units as if they were a large set of normal cpus?

Legacy issues: old spaghetti code designed a decade ago that has grown
into a modern Godzilla nightmare. And I have it easy compared to the
poor losers at EA, who are using code two decades old that was never
actually "designed" to begin with...
From: ArarghMail910NOSPAM on
On Mon, 26 Oct 2009 18:32:22 -0500, Jeremy Linton
<reply-to-list(a)nospam.org> wrote:

>ArarghMail910NOSPAM(a)NOT.AT.Arargh.com wrote:
>> On Mon, 26 Oct 2009 05:09:02 -0500,
>> ArarghMail910NOSPAM(a)NOT.AT.Arargh.com wrote:
>>
>>> On 26 Oct 2009 08:30:25 GMT, Andrew Reilly
>>> <andrew-newspost(a)areilly.bpc-users.org> wrote:
>>>
>>>> On Mon, 26 Oct 2009 01:29:17 -0500, ArarghMail910NOSPAM wrote:
>>>>
>>>>> On Mon, 26 Oct 2009 04:43:21 GMT, Brett Davis <ggtgp(a)yahoo.com> wrote:
>>>>>
>>>>> <snip>
>>>>>> x86 started with base+bounds, even giving a plentiful set of offset
>>>>>> registers. Almost no one used it, and those registers were recycled for
>>>>>> other uses.
>>>>> It did? x86? When? Where? Any Docs? <snip>
>>>> That's more or less how the 286 MMU worked, and (I think) versions of QNX
>>> That's not how I remember it, but I will dig out the manual and see.
>> Drat. Can't find the 286 manual.
I thought I had one. I found the 386 manual, and the 86/88/186/188
manual.

>
>There are 286 manuals here:
>http://www.ragestorm.net/downloads/286intel.txt
>http://datasheets.chipdb.org/Intel/x86/286/datashts/intel_M80C286.pdf
>(page 12 has the descriptor formats)
Thanks. I spent some time looking for these with no luck.

> From the manual section 6.4:
>"Finally, the segment descriptor contains the physical base address of
>the target segment, as well as size (limit) and access information. The
>processor sums the 24-bit segment base and the specified 16-bit offset
>to generate the resulting 24-bit physical address."
>
>One could argue that the 286 was the first x86 with an MMU, and
>therefore the statement that x86 started with base+bounds would be correct.
Yes, I guess so.

What confused me was:

"even giving a plentiful set of offset registers. Almost no one used
it, and those registers were recycled for other uses."

I don't remember a "plentiful set of offset registers", or any registers
being recycled.


> I have fond memories of my 286 and its protected mode. I remember it
>worked pretty well in Windows 3.0. 286 protected mode got removed shortly
>after, leaving only real mode and 386 protected mode, and I was unhappy
>about that, as 386 protected mode ran significantly slower on my machine.

A 386SX-16 perhaps? :-)

I still have some 286 systems -- I wonder if any still work?

--
ArarghMail910 at [drop the 'http://www.' from ->] http://www.arargh.com
BCET Basic Compiler Page: http://www.arargh.com/basic/index.html

To reply by email, remove the extra stuff from the reply address.