From: Morten Reistad on
In article <ggtgp-3E9E33.21404826102009(a)netnews.asp.att.net>,
Brett Davis <ggtgp(a)yahoo.com> wrote:
>In article <il1hr6-hck.ln1(a)laptop.reistad.name>,
> Morten Reistad <first(a)last.name> wrote:

>> >The huge benefit is that you only need one MMU/L1/L2 per cluster. The
>> >MMU is a huge piece of die real estate, (and heat) as is the L1 and L2.
>>
>> But you still get process isolation, right?
>
>I am fairly indifferent about process isolation inside a cluster.
>I figure that generally you are running the same code on 1000 items.
>So a programmer gets a cluster sandbox that is all his property.
>The OS would wait for all threads to finish before resetting the sandbox
>and giving the cluster to another process group.

I am thinking about taking this one step further, and looking into
whether it is possible to run _all_ of the system inside the GPU.
If each cluster is on the order of a Via C3 in processing power,
this should be perfectly feasible.

The idea is to use a few handfuls of clusters as the "main cpu". To
have any hope of running "modern" code like, say, QNX or Version 7
Unix, we need a minimal MMU. Not fancy demand paging; simple process
isolation with base+limit registers does nicely; the 80286 ran both
of these well.

So, 286-style process isolation is a baseline requirement.
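
Roughly all I am asking of the hardware is something like the
following, a minimal C sketch (the 24-bit base and 16-bit limit match
the 286; the code itself is purely illustrative):

  #include <stdint.h>
  #include <stdbool.h>

  /* Minimal sketch of 286-style base+limit isolation (illustrative
   * only; a real descriptor also carries access rights): every
   * reference is a 16-bit offset into a segment, checked against the
   * limit and added to a 24-bit physical base. */
  struct segment {
      uint32_t base;    /* 24-bit physical base address */
      uint16_t limit;   /* highest valid offset in the segment */
  };

  /* Translate offset -> physical address; returns false on a
   * protection fault (offset beyond the segment limit). */
  bool translate(const struct segment *seg, uint16_t offset,
                 uint32_t *phys)
  {
      if (offset > seg->limit)
          return false;                         /* out of bounds */
      *phys = (seg->base + offset) & 0xFFFFFFu; /* 24-bit physical */
      return true;
  }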

>One could argue that this wastes CPUs, but that's antiquated thinking,
>you have THOUSANDS of these wimpy CPUs, in HUNDREDS of clusters, the
>last thing you want is some irresponsible memory thrashing process
>trying to "share" your CPU cluster with you. That would FAIL.
>
>To pull this off you need KISS at all levels, the OS would not care
>about individual CPUs, the OS only cares about clusters, and with
>hundreds of clusters it has its hands full as it is.

In my world, this OS would be a hypervisor for something resembling
NetBSD, where the BSD sees a few score processors.

>> Why not take an OS that runs _only_ in the gpu clusters, and
>> let whatever stuff is sold with the machine handle the "main" cpu?
>> This is a stellar chance to migrate "below the radar".
>
>GPUs do not run real code, they run code fragments on pixels/data.
>Your CPU runs the ATI OS code to manage the ATI GPU.

Seems memory is an issue.

Exactly how tight is memory in the GPU?

>
>> A "logarithmical" layout, 16 cpus, 4 L1, 1 L2 may be a way to
>> go.
>
>I like this.
>
>> >In the game industry we are running out of things we can hand off to the
>> >GPU, even if that GPU is relatively bright.
>>
>> Is this because of need for serial speed ("big" cpus), memory footprint,
>> or organisational issues where you simply do not have an os and scheduler
>> for running the gpu units as if they were a large set of normal cpus?
>
>Legacy issues of old spaghetti code designed a decade ago, and grown
>into a modern Godzilla nightmare. And I have it easy compared to the
>poor losers at EA who are using code two decades old, that was never
>actually "designed" to begin with...

Well, perhaps we can get KLH up and running.

-- mrr
From: Morten Reistad on
In article <hc5bhb$kr8$1(a)aioe.org>,
Jeremy Linton <reply-to-list(a)nospam.org> wrote:
>ArarghMail910NOSPAM(a)NOT.AT.Arargh.com wrote:
>> On Mon, 26 Oct 2009 05:09:02 -0500,
>> ArarghMail910NOSPAM(a)NOT.AT.Arargh.com wrote:

>>>> That's more or less how the 286 MMU worked, and (I think) versions of QNX
>>> That's not how I remember it, but I will dig out the manual and see.
>> Drat. Can't find the 286 manual.
>
>There are 286 manuals here:
>http://www.ragestorm.net/downloads/286intel.txt
>http://datasheets.chipdb.org/Intel/x86/286/datashts/intel_M80C286.pdf
>(page 12 has the descriptor formats)
>
> From the manual section 6.4:
>"Finally, the segment descriptor contains the physical base address of
>the target segment, as well as size (limit) and access information. The
>processor sums the 24-bit segment base and the specified 16-bit offset
>to generate the resulting 24-bit physical address."
>
>One could argue that the 286 was the first x86 with an MMU, and
>therefore the statement that the x86 started with base+bounds would
>be correct.
>
> I have fond memories of my 286 and its protected mode. I remember it
>worked pretty well in Windows 3.0 (286 protected mode got removed
>shortly after, with the only remaining choices being real mode or 386
>protected mode), and I was unhappy when they removed it, as 386
>protected mode ran significantly slower on my machine.

I ran QNX on 286es, and got very good results too. One BBS even ran
30+ logged-in users on a no-name china-clone in late 1986, with
plenty of resources left except for memory. 1.5 megabytes was a
little tight for 30 users, even then.

Process sizes have 64k limits on the instruction, data, stack and
mapped-data segments, but the message passing made it possible to
build servers that handled common tasks.
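
The pattern, as a toy single-process C sketch (the names and the
direct function call stand in for the real blocking kernel
send/receive/reply primitives; the point is only that a 64k-limited
client asks a server that owns the big data):

  #include <stdint.h>
  #include <stdio.h>

  /* Toy sketch of the send/receive/reply pattern (illustrative names;
   * a direct call replaces the real blocking kernel primitives): a
   * client whose segments are capped at 64k never maps the big tables
   * itself; it sends a small request and the server that owns the
   * data replies. */
  struct request { uint16_t op;     uint16_t key;   };
  struct reply   { uint16_t status; uint32_t value; };

  /* server-owned state: lives only in the server's address space */
  static uint32_t table[256];

  /* stand-in for the server's receive/work/reply loop body */
  static struct reply server_handle(const struct request *rq)
  {
      struct reply rp = { .status = 0, .value = table[rq->key & 0xFFu] };
      return rp;
  }

  /* stand-in for a blocking send: in the real thing the client
   * blocks here until the server replies */
  static struct reply msg_send(const struct request *rq)
  {
      return server_handle(rq);
  }

  int main(void)
  {
      table[42] = 1986;
      struct request rq = { .op = 1, .key = 42 };
      struct reply   rp = msg_send(&rq);
      printf("status=%u value=%lu\n",
             (unsigned)rp.status, (unsigned long)rp.value);
      return 0;
  }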

It had stellar scaling performance. So, this is where I would start
looking if we are to press farms of GPUs into service as
general-purpose computers.

-- mrr

From: Del Cecchi on
Chris Gray wrote:
> Mayan Moudgill <mayan(a)bestweb.net> writes:
>
>> There are several OSes (including, IIRC, HP-UX) which do not permit
>> multiple virtual addresses to point to the same real address. I'm
>> guessing that they've managed to work around the CoW trick somehow.
>
> The restriction may be in the MMU. My memories of this are pretty vague,
> but wasn't it (and the one in Power?) a "reverse lookup", which actually
> mapped physical pages to virtual pages, instead of the other way around?
> There could be only one such entry for a physical page. So, as Mayan
> carefully says, you can't have multiple virtual addresses associated
> with one physical address. However, that doesn't stop multiple address
> spaces from having that physical page in them - it just must be at the
> same virtual address in all of them (and perhaps the same modes).
>
> It's been too long since I worked on the Myrias PAMS stuff, but I think there
> was something extra that we had to do under HP-UX that we didn't have
> to do under AIX. It may have related to the ability to nuke entire
> virtual segments under AIX, however.
>
The S/38 and AS/400 had inverted page tables.
From: EricP on
Chris Gray wrote:
> Mayan Moudgill <mayan(a)bestweb.net> writes:
>
>> There are several OSes (including, IIRC, HP-UX) which do not permit
>> multiple virtual addresses to point to the same real address. I'm
>> guessing that they've managed to work around the CoW trick somehow.
>
> The restriction may be in the MMU. My memories of this are pretty vague,
> but wasn't it (and the one in Power?) a "reverse lookup", which actually
> mapped physical pages to virtual pages, instead of the other way around?
> There could be only one such entry for a physical page. So, as Mayan
> carefully says, you can't have multiple virtual addresses associated
> with one physical address. However, that doesn't stop multiple address
> spaces from having that physical page in them - it just must be at the
> same virtual address in all of them (and perhaps the same modes).
>
> It's been too long since I worked on the Myrias PAMS stuff, but I think there
> was something extra that we had to do under HP-UX that we didn't have
> to do under AIX. It may have related to the ability to nuke entire
> virtual segments under AIX, however.
>

The IBM PPC 603 and 604 used inverted (hashed) page tables;
however, Linux expects to be able to have shared virtual sections.

Linux expects its page tables to be a three-level tree, a la x86.
The Linux port (below) treats the hash table like a second-level TLB,
and retains the x86-style page tables as the "official" lookup.
A hardware TLB miss loads from the hash table.
A hash-table miss triggers a page fault, which looks in
the x86-style page tables, loads the hash table and restarts.
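
Roughly, as a toy C sketch (the structure layouts, names and hash
function here are made up, not the actual Linux/PPC ones):

  #include <stdint.h>
  #include <stdbool.h>

  #define HASH_SLOTS 1024

  struct hpte { bool valid; uint32_t vsid; uint32_t vpn; uint32_t pfn; };

  static struct hpte hash_table[HASH_SLOTS];

  /* toy stand-in for walking the "official" three-level tree;
   * pretend every page is mapped one-to-one */
  static bool walk_linux_page_tables(uint32_t vsid, uint32_t vpn,
                                     uint32_t *pfn)
  {
      (void)vsid;
      *pfn = vpn;
      return true;
  }

  /* hardware TLB miss: look in the hash table */
  bool hash_lookup(uint32_t vsid, uint32_t vpn, uint32_t *pfn)
  {
      struct hpte *e = &hash_table[(vsid ^ vpn) % HASH_SLOTS];
      if (e->valid && e->vsid == vsid && e->vpn == vpn) {
          *pfn = e->pfn;
          return true;      /* hit: the hardware reloads the TLB */
      }
      return false;         /* miss: raise a page fault */
  }

  /* page fault: consult the x86-style page tables, refill the hash
   * table, then let the faulting access restart */
  void page_fault(uint32_t vsid, uint32_t vpn)
  {
      uint32_t pfn;
      if (!walk_linux_page_tables(vsid, vpn, &pfn))
          return;           /* genuine fault: hand off to the VM system */
      struct hpte *e = &hash_table[(vsid ^ vpn) % HASH_SLOTS];
      *e = (struct hpte){ .valid = true, .vsid = vsid,
                          .vpn = vpn, .pfn = pfn };
  }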

On a process switch the hash table is cleared, just like a TLB.
However, rather than scanning the hash table to clear out
old entries, the 603/604 supported Address Space IDs on PTEs,
so the port just bumps a counter to get a new ASID on each switch,
and the old entries no longer match.
(Alternatively, had there not been ASIDs, they could have used a
small circular FIFO list to track the valid entries in the hash table.)
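
Something like this, as a sketch (the field width and the wrap
handling are assumptions, not the actual port's code):

  #include <stdint.h>

  #define MAX_ASID 0xFFFu        /* e.g. a 12-bit VSID-style field */

  static uint32_t current_asid = 1;

  static void flush_hash_table(void)
  {
      /* invalidate every hash-table entry; only needed when the
       * counter wraps and old tags could be reused */
  }

  /* on a process switch: bump the counter, tag all new hash-table
   * entries with it, and the stale entries simply stop matching */
  uint32_t switch_address_space(void)
  {
      if (++current_asid > MAX_ASID) {
          flush_hash_table();    /* wrap: flush once, start over */
          current_asid = 1;
      }
      return current_asid;
  }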

see
Optimizing the Idle Task and Other MMU Tricks
http://www.usenix.org/events/osdi99/full_papers/dougan/dougan.pdf

Eric

From: Anne & Lynn Wheeler on

re:
http://www.garlic.com/~lynn/2009p.html#19 The future of CPU based computing, mini clusters.

I've periodically claimed that John's 801/risc in the mid to late 70s
... some past posts
http://www.garlic.com/~lynn/subtopic.html#801

went to the opposite hardware extreme from the (failed/canceled)
future system effort ... some past posts
http://www.garlic.com/~lynn/submain.html#futuresys

801/iliad/romp/rios started out with 32-bit virtual addresses ... with 16
segment registers (the top four bits of the virtual address would select
one of the 16 segment registers). The selected segment register would
contain a "segment id" (12 bits in romp, 24 bits in rios) ... which would
be used to provide "associativity" (TLB).

in 370, TLB (and potentially virtual cache) would be "STO" associative
... basically the real address of the start of the address space's
"segment table". 370 hardware could implement a "STO stack" ... say
seven entries saving the most recently used STOs. TLB (STO-associative)
entries would have a 3-bit tag ... indicating an invalid entry ... or
association with one of the seven entries in the STO stack.
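
roughly, as an illustrative sketch ... the seven-entry stack and the
3-bit tag are from the above, the replacement policy is an assumption:

  #include <stdint.h>

  #define STO_SLOTS 7

  static uint32_t sto_stack[STO_SLOTS + 1];  /* slot 0 unused = invalid */
  static int next_victim = 1;

  /* map a full segment-table origin to a small TLB tag, recycling a
   * slot when the STO isn't already present; in hardware, TLB entries
   * carrying a recycled tag would have to be purged at that point */
  uint8_t sto_to_tag(uint32_t sto)
  {
      for (int i = 1; i <= STO_SLOTS; i++)
          if (sto_stack[i] == sto)
              return (uint8_t)i;     /* this STO already has a tag */

      uint8_t tag = (uint8_t)next_victim;
      sto_stack[tag] = sto;
      next_victim = (next_victim % STO_SLOTS) + 1;
      return tag;
  }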

801 with inverted tables ... didn't have a corresponding hardware table
for uniquely identifying a virtual address space ... so it explicitly
defined a virtual address space identifier ... or actually a virtual
address space segment identifier (a combination of 16 values used to
create a virtual address space definition). The ROMP 12-bit "identifier"
roughly corresponded to the 3-bit STO-stack identifier in (some) 370
hardware implementations. However, being a segment identifier ... it
corresponds more closely to the "PTO" identifier mentioned in the
previous post (allowed for in the original 370 architecture definition
... but I don't believe there was actually any such 370 implementation).

There were some issues with only 16 segment registers ... it limited
the number of different shared objects that could be concurrently
mapped for sharing. In the original 801, there was no protection domain
... and the claim was that inline code could change the value in one of
the virtual segment registers as easily as address pointers in general
registers could be changed. This ran into a little more difficulty in
the transition to using 801 for unix ... and the requirement to
implement a hardware protection domain.


--
40+yrs virtualization experience (since Jan68), online at home since Mar1970