From: Mayan Moudgill on
Brett Davis wrote:
> Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>>Brett Davis wrote:
>>
>>> Mayan Moudgill <mayan(a)bestweb.net> wrote:
>>>
>>>>>The future of CPU based computing, mini clusters.
>>>>
>>>GDDR5, how quaint. ;)
>>>I am assuming that in two to four years we start switching to embedded
>>>RRAM. 8 gigs on die with multiple 1024 bit busses, etc.
>>>
>>
>>Embedded in 2 to 4 years *may* allow you to get to 128MB (256MB, *maybe*
>>on a really huge chip). On-chip-only is not happening.
>
>
> I have made plentiful posts on RRAM/Memristor, here are the Wiki pages.
>
> http://en.wikipedia.org/wiki/RRAM
> http://en.wikipedia.org/wiki/Memristor
>
> Look at that die picture, I am not talking 6 transistor SRAM, this has
> no transistors. Makes a 6 transistor SRAM look like a vacuum tube, at
> ~25 times smaller. All the transistors are in the bottom layer, this can
> be put on any layer. Quoted 100 gigabits per chip, I am assuming 1
> square centimeter, and assuming 22nm, and then rounded down to get 8
> gigaBYTES.

If it were 2-4 years away from deployment, someone would be talking about
trial runs etc. Could be, but my bet is at least a decade - and that's if
it turns out to be viable. Anyone seen widespread MRAM deployment yet?

Also, the density figures are a little implausible. They have to be
assuming every cell is at the minimum dimensions. [Let's work the numbers:
assuming a 2 lambda x 2 lambda cell at 22nm - about 400Mb/mm^2, and a
16mm x 16mm chip - yeah, about 100Gb.]

Now for a sanity check: DRAM can be made, IIRC, with a 6 or 8
lambda^2 cell. Question - why is no one talking about 50Gb DRAM chips at
22nm?
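
Working both of those as a script (all assumptions: lambda = 22nm, a
16mm x 16mm die, and zero overhead for sense amps, decoders and
redundancy, which is wildly optimistic):

lam = 22e-9                      # assumed half-pitch, metres
die = (16e-3) ** 2               # 16mm x 16mm die, in m^2

def gbits(cell_area_lambda2):
    # Raw bits per die for a cell of the given area, zero array overhead.
    return die / (cell_area_lambda2 * lam * lam) / 1e9

print("4 lambda^2 crosspoint cell: ~%d Gbit" % gbits(4))   # ~132
print("6 lambda^2 DRAM cell:       ~%d Gbit" % gbits(6))   # ~88
print("8 lambda^2 DRAM cell:       ~%d Gbit" % gbits(8))   # ~66

Knock off the usual chunk for peripheral circuitry and spare rows and
the perfect crosspoint cell lands near the quoted 100Gb - while the same
arithmetic says a 22nm DRAM die "should" be 50-70Gb, which nobody is
promising either.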

Oh and one more thing - don't expect your favorite competitor to silicon
to necessarily scale as well as silicon. Even when it theoretically
should, it'll need a lot of engineering, which it isn't necessarily
going to get.


>>>We already have 1600 vector pipes on ATI chips, 400 CPUs is quite a bit
>>>less potential flops.
>>
>>Agreed, with the caveat that they appear to be SP pipes, not DP (it's
>>320 DP pipes). And the clocking is less than 1GHz.
>
>
> 640 DP pipes in the 1600 SP ATI chip. (Not 800, every fifth pipe is
> special.) And top bin 975 MHz parts are not much less than 1GHz. ;)
>
> http://www.anandtech.com/video/showdoc.aspx?i=3643&p=5
>
> You can do pairs of DP muls or adds per cycle, but only one DP MADD.
> So the DP throughput is about a third of the SP.

BTW: I goofed here. I said 1.5 TB/s; it's actually 15 TB/s of arguments +
results for the 2-input/1-output SP ops (20 TB/s if you're using MADD).
So the peak memory bandwidth is about 1/100th of the required data rate.
Your cache hit rates had better be *REAL* good.
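
For reference, the arithmetic (assuming an 850MHz core clock, 4-byte SP
operands, and roughly 150GB/s of GDDR5 bandwidth - ballpark figures, not
a spec sheet):

lanes = 1600                     # SP ALUs
clock = 850e6                    # assumed core clock, Hz
word  = 4                        # bytes per SP operand
dram  = 150e9                    # rough off-chip bandwidth, B/s

muladd = lanes * clock * word * 3          # mul or add: 2 in + 1 out per lane
madd   = lanes * clock * word * 4          # madd: 3 in + 1 out per lane

print("mul/add operand traffic: %.0f TB/s" % (muladd / 1e12))   # ~16
print("madd operand traffic:    %.0f TB/s" % (madd / 1e12))     # ~22
print("operands vs off-chip:    %.0f : 1" % (muladd / dram))    # ~109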

>
>>>Rendering is going to change from polys to raytracing or more likely
>>>Reyes. (Sub-pixel sampling.) Reyes is cache friendly, it just needs an
>>>order of magnitude more flops than today, to move from Pixar movies to
>>>realtime on your PC.
>>
>>Raytracing, I understand. I'm not sure it's cache friendly, but at least
>> I know the basic algorithm. About Reyes, I'm woefully ignorant. I'll
>>have to do some reading before I can even ask some dumb questions.
>
>
> http://en.wikipedia.org/wiki/Reyes_rendering
> http://www.steckles.com/reyes1.html
> http://www.kunzhou.net/2009/renderants-tr.pdf
>
> As for ray tracing with about 8 copies of 8 sony Cell vector processors
> expanded out to 16 SP wide, you could think about it. (Or Larrabee) That
> gives you 64 wide vector processors, with I assume a quarter+ meg of RAM
> each.

Assuming you are correct, you are talking about 8 copies x 8 processors
x 16-wide x 3.2GHz ≈ 3.3 Tops. You're assuming that you have about 16MB
on chip.
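
Spelled out (these are just the assumed numbers above, not any real part):

clusters = 8            # copies of the Cell-like block
spes     = 8            # vector processors per block
width    = 16           # SP lanes per vector processor
clock    = 3.2e9        # Hz
ls_kb    = 256          # assumed local store per processor, KB

lanes = clusters * spes * width                       # 1024
print("SP ops/s:    %.1f T" % (lanes * clock / 1e12)) # ~3.3
print("local store: %d MB" % (clusters * spes * ls_kb // 1024))  # 16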

> Chop your scene up and farm out the geometry to that RAM, batch up
> your ray trace vector bounces that bounce out, and feed the bounce
> batches to the vector units that have the geo.

How many shapes (polygons? triangles? points?) can be stored in 256KB?
Will you need to convert shapes down to sub-pixel-sized polygons and
store some associated surface properties (color, etc.)? If so, what's a
typical metric (so many bytes per object, say)?

Let me take a stab at it: a point probably has at least 10 values that I
can think of (coordinates, normal, color), and probably a few more that I
am ignorant of. Assume it's about 16 values, so 64B per point. That
means we can store a maximum of 4K points in 256KB, and a maximum of
256K points across the entire chip.
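
As a script (everything here is a guess, especially the 16 values per
point):

values_per_point = 16            # coordinates, normal, color, plus the unknowns
bytes_per_value  = 4             # SP floats
local_store      = 256 * 1024    # bytes per vector processor (assumed)
processors       = 64

bytes_per_point = values_per_point * bytes_per_value    # 64 B
points_per_unit = local_store // bytes_per_point        # 4096, i.e. 4K
points_on_chip  = points_per_unit * processors          # 262144, i.e. 256K
print(bytes_per_point, points_per_unit, points_on_chip)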

I'm assuming that one recursively splits the volume using something like
an oct-tree till you end up with a box containing a small enough number
of objects to fit in cache. Given a box and a set of rays entering the
box, you bounce the rays around the objects in the box, compute the set
of rays leaving the box, then pass that information to the adjacent boxes.
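
A sketch of the loop I'm picturing. first_box(), trace_in_box() and
next_box() are placeholders for whatever geometry queries the renderer
actually provides, so this is only the batching skeleton, not anyone's
implementation:

from collections import defaultdict, deque

def trace_by_box(first_box, trace_in_box, next_box, camera_rays):
    # first_box(ray)          -> id of the box the ray enters first, or None
    # trace_in_box(box, rays) -> the rays that leave that box unterminated
    # next_box(box, ray)      -> id of the adjacent box an exiting ray enters,
    #                            or None if it leaves the scene
    pending = defaultdict(list)          # box id -> rays waiting to enter it
    for ray in camera_rays:
        b = first_box(ray)
        if b is not None:
            pending[b].append(ray)

    work = deque(pending)                # boxes with rays waiting
    while work:
        b = work.popleft()
        batch, pending[b] = pending[b], []
        for ray in trace_in_box(b, batch):    # touches only this box's geometry
            nb = next_box(b, ray)
            if nb is not None:
                if not pending[nb]:
                    work.append(nb)
                pending[nb].append(ray)

(The catch, as I say below, is how many times the same box comes back
through that work queue.)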

Hmmm... I wonder how many boxes we'll need. I have no clue here, so let's
start with a real WAG. You probably want to represent any object using a
sub-pixel mesh - let's say 4 points per pixel. Assume that the set of
objects covers 1/2 the screen. Let's say that we want to represent both
sides (otherwise, what's the point of ray tracing). Let's assume a
1920x1280 screen. So that's about 10M points (is that high or low?). If
so, you end up with a few thousand boxes.
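
The same WAG as arithmetic:

screen         = 1920 * 1280     # pixels (assumed)
points_per_pix = 4               # sub-pixel dicing density (guess)
coverage       = 0.5             # fraction of the screen the geometry covers
sides          = 2               # keep front and back faces

points = screen * points_per_pix * coverage * sides      # ~9.8M
boxes  = points / 4096                                    # 4K points per box
print("points: %.1f M   boxes: ~%d" % (points / 1e6, boxes))   # 9.8M, ~2400

At 64B per point that is also roughly 600MB of point data against ~16MB
on chip, which is behind my worry below about swapping boxes in and out.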

[BTW: how much memory does a ray require? I'm guessing it's also at least
10 elements of information, and maybe more.]


> http://en.wikipedia.org/wiki/Cell_(microprocessor)
>
> Assuming quad wide Element Interconnect Bus, (EIB) you get 100 GB/s per
> Cell of bandwidth, times 64.
>
> End result, you never touch main RAM, all computation is on chip.
> Which is good because you only have ~25 GB/s to RAM.
>
> Again, even without RRAM, your concerns about main RAM bandwidth are
> quaint. ;)
>
> If you do things right you only touch each page used of RAM once per
> frame, for geometry loads and texture loads, and code loads. All the
> writes go to the video card, and you run at 60+ fps.
>

Hmmm... clearly I don't understand how. The way I had visualized it, I
would end up processing a box multiple times - I would have rays leaving
a box, bouncing off an object in another box, and re-entering the first
box. Also, given bi-directional ray tracing, the light source(s) would
introduce rays going in the opposite direction to the eye-based rays,
and so would cause boxes to be evaluated differently (and therefore more
than once). With a few thousand boxes, under my assumptions, one would
end up swapping boxes in and out of the chip repeatedly.

What am I missing?
From: nmm1 on
In article <4AEBACFC.8000207(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>ChrisQ wrote:
>>
>> That's correct, and embedded devices are already far more powerful than
>> 486-class x86 in terms of throughput. They have on-chip RAM, flash and
>> moderately high-speed comms as well as other peripherals.
>>
>> Assume an array of such devices, assigned a single thread each: there
>> would be no need for cache, and interrupt requirements would be minimal.
>> How useful would this be, assuming future developments could put enough
>> code space on chip and the comms problem between an allocation CPU and
>> the array could be solved?...
>
>The way to get enough code space for meaningful programs is... to have
>an I$. Nick will probably say otherwise, from his massive experience,
>but guys at the world's largest supercomputer customers confirm this for me.

You are getting increasingly ridiculous. That is completely irrelevant
to the point ChrisQ was making.


Regards,
Nick Maclaren.
From: nmm1 on
In article <ggtgp-30096B.00174431102009(a)netnews.asp.att.net>,
Brett Davis <ggtgp(a)yahoo.com> wrote:
>
>As for ray tracing with about 8 copies of 8 sony Cell vector processors
>expanded out to 16 SP wide, you could think about it. (Or Larrabee) That
>gives you 64 wide vector processors, with I assume a quarter+ meg of RAM
>each. Chop your scene up and farm out the geometry to that RAM, batch up
>your ray trace vector bounces that bounce out, and feed the bounce
>batches to the vector units that have the geo.
>
>http://en.wikipedia.org/wiki/Cell_(microprocessor)
>
>Assuming quad wide Element Interconnect Bus, (EIB) you get 100 GB/s per
>Cell of bandwidth, times 64.
>
>End result, you never touch main RAM, all computation is on chip.
>Which is good because you only have ~25 GB/s to RAM.
>
>Again, even without RRAM, your concerns about main RAM bandwidth are
>quaint. ;)
>
>If you do things right you only touch each page used of RAM once per
>frame, for geometry loads and texture loads, and code loads. All the
>writes go to the video card, and you run at 60+ fps.
>
>Today of course that is not true, cache sizes and local RAM sizes are
>too small, so you hit each used RAM page multiple times, and write a
>fair bit out to RAM as well.
>
>It could be argued that future CPUs need LESS bandwidth to RAM than
>today!

Er, games are not the ONLY important application! I frequently see
people whose primary limit is memory size, but whose code still takes
super-linear CPU time in the amount of memory.

There always have been applications that need a limited working
set. I once wrote a serious application that ran almost entirely
in a CPU's registers - all right, that was a Ferranti Atlas/Titan :-)


Regards,
Nick Maclaren.
From: ChrisQ on
nmm1(a)cam.ac.uk wrote:

>
> There always have been applications that need a limited working
> set. I once wrote a serious application that ran almost entirely
> in a CPU's registers - all right, that was a Ferranti Atlas/Titan :-)
>

You could do that on some of the early PDP-11s, which had all the
registers mapped into the address space. The example I remember was a
simple core-memory test program, and may have been in one of the 11/05
manuals.

Back on topic, it just seems to me that ever-increasing single-core
complexity is getting nowhere, and a return to much simpler
architectures may be a way forward. Embedded devices may not be
ideal, but they might be a very low-cost way to do some serious research
on parallel methods, which could be scaled up as the technology advances.
Some of the 50 MIPS devices cost a couple of UKP each or less in quantity,
and they need almost no external hardware to function. Compared to
current desktop hardware, they are low-power, simple devices. Code
normally executes from on-chip pre-programmed flash, sometimes with a
wait state or two, but can also run from on-chip RAM, where there are
often no wait states required. Putting numbers to this, flash memory
sizes range up to a megabyte or so, with RAM sizes up to 512 Kbytes.
However, many have reasonably large (Gbyte) address spaces and have
external bus capability to allow more memory or peripherals to be connected.

The elephant in the room is of course the fundamental gulf between
present software methodology at the programmer interface (basically
serial) and parallel architectures of any kind. I don't think that
there will be much progress until that gulf is bridged. The problem is
one of software and, initially at least, there may need to be an
abstraction layer between the machine and the programmer so that current
code can continue to work while new methods are developed to take
advantage of the underlying parallel architecture...

Regards,

Chris
From: nmm1 on
In article <vxXGm.23512$nI.4033(a)newsfe14.ams2>,
ChrisQ <meru(a)devnull.com> wrote:
>
>> There always have been applications that need a limited working
>> set. I once wrote a serious application that ran almost entirely
>> in a CPU's registers - all right, that was a Ferranti Atlas/Titan :-)
>
>You could do that on some of the early PDP-11s, which had all the
>registers mapped into the address space. The example I remember was a
>simple core-memory test program, and may have been in one of the 11/05
>manuals.

Yes, but I wasn't cheating in that way - the machine I used did NOT
memory map registers - it had 128 registers, and used a special
register to allow register indexing.


Regards,
Nick Maclaren.