From: Tim Bradshaw
mark.hoemmen(a)gmail.com wrote:

> Intel's proposed 80-core architecture will have DRAM attached to each
> core -- sort of how Cell has "local stores" attached to each SPE.
> That's how they plan to solve the BW problem -- amortize it over all
> the cores.

Don't we call that `cache' normally? (yes, I know, they'll be *big*
caches, but only big by today's standards, in the same sense that
today's machines have as much cache as yesterday's had main memory.)

From: Pascal Bourguignon
"Tim Bradshaw" <tfb+google(a)tfeb.org> writes:

> mark.hoemmen(a)gmail.com wrote:
>
>> Intel's proposed 80-core architecture will have DRAM attached to each
>> core -- sort of how Cell has "local stores" attached to each SPE.
>> That's how they plan to solve the BW problem -- amortize it over all
>> the cores.
>
> Don't we call that `cache' normally? (yes, I know, they'll be *big*
> caches, but only big by today's standards, in the same sense that
> today's machines have as much cache as yesterday's had main memory.)

Well, the fact that the L1 and L2 caches are totally transparent to
the programmer while the HD cache is somewhat less so is no reason to
distinguish them.

You've probably already seen the pyramid with the registers at the
top, above successive layers of memory: the L1, L2 and now L3 caches,
the RAM, the HD, the tapes, etc. We could also add layers for the
Internet and the physical world.

RAM is used as a cache for the HD. The HD is used as a cache for the
big storage repositories on tapes or CDs, or for the Internet. The
Internet is used as a cache for the real world. Our computers don't
need robotic extensions to access information in the real world,
because the real world is cached in the Internet. (Well, it may be
useful to have these robotic extensions to let the computer access
the real world itself, instead of having armies of humans filling
Wikipedia and the other pages indexed by Google.)

Hiding all these details is just a matter of the OS. Use mmap
instead of open/read/write/close. Add an imap(2) and call
imap(address,"http://en.wikipedia.org/wiki/Raven");
instead of sending your robotic extensions out to watch birds.
Of course, it helps to have a big address space.
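
Here is a minimal user-level sketch of that imap idea in Common Lisp,
with a hash table standing in for demand paging; the Drakma HTTP
client is assumed, and imap itself, *imap-cache* and the caching
scheme are made up for illustration:

(ql:quickload :drakma)   ; assumed HTTP client, loaded via Quicklisp

(defvar *imap-cache* (make-hash-table :test #'equal)
  "URL -> octet vector; a poor man's page table.")

(defun imap (url)
  "\"Map\" URL into memory: return a byte vector backed by the
Internet, fetched at most once, in the spirit of a demand-paged mmap."
  (or (gethash url *imap-cache*)
      (setf (gethash url *imap-cache*)
            (drakma:http-request url :force-binary t))))

;; (imap "http://en.wikipedia.org/wiki/Raven") => #(60 33 68 ...)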

Earth's surface is 510,065,600 km² (*), that's 510,065,600e12 mm²,
or 69 bits to identify each mm² of the Earth's surface. So we'll
have to wait for 128-bit processors to be able to mmap every bit of
the Earth's surface into the virtual memory space of our computers.
In the meantime, we can just implement our own 128-bit virtual
address space, and a mere emap(2) syscall is all that is needed to
address the (physical) desktops of your coworkers on another
continent, through remote-presence robots.
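
The bit count is easy to check at the REPL; a quick sanity sketch:

(let ((mm2 (* 510065600 (expt 10 12))))   ; 1 km² = 10^12 mm²
  (values mm2 (integer-length mm2)))
;; => 510065600000000000000, 69

So 69 bits indeed, which already overflows a 64-bit address space.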



(*) I'm too lazy to compute it tonight, so I just copied the number
cached in Wikipedia; beware! ;-)

--
__Pascal Bourguignon__ http://www.informatimago.com/

HEALTH WARNING: Care should be taken when lifting this product,
since its mass, and thus its weight, is dependent on its velocity
relative to the user.
From: Madhu
* Maciek Pasternacki <87r6u1plm6.fsf(a)lizard.king> :
| On Sweetmorn, Chaos 11, 3173 YOLD, Juan R. wrote:
|
|>> | If you want to analyse chess positions you can never have too
|>> | much speed and it has nothing to do with rendering. I'm sure
|>> | it's the same situation with go and many other games.
|>>
|>> But having more than one core will not be a benefit if your
|>> algorithms are graph based and have to search a tree. IIRC most
|>> graph algorithms (dfs bfs) are inherently unparallelizable.
|>
|> But couldn't a parallel tree search distribute subtree searches
|> between cores at each branching point?
[...]
| single thread would work like:
| (loop
|   (if *node-queue*
|       (let ((node (dequeue *node-queue*)))
|         (do-something-with node)
|         (dolist (subnode (children node))
|           (enqueue subnode *node-queue*)))
|       (return)))
|
| Search would start with enqueuing root node, and would end by any
| thread setting *node-queue* to NIL. This would be parallelizable
| over any number of cores (supposing one doesn't care about exact DFS
| search order -- but if one cared about order, one wouldn't have
| considered parallelizing).

Your stopping criterion will have to be different. Also, if your
input is not a tree, this algorithm will expand the same node multiple
times. This [inefficiency] can be done in parallel, of course :)

Which is why order tends to be important in DFS, and why it is
unsuitable for decomposition. Of course, as others have noted, once
the leaves are reached there are usually gains to be made. The point
I wanted to make was akin to that in chemistry, where the overall
rate of a reaction is limited by the rate of its slowest step (the
slowest step here being walking the graph).
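
A minimal single-threaded sketch of the usual fix for the
re-expansion problem, reusing the hypothetical *node-queue*, dequeue,
enqueue, children and do-something-with from the quoted code (in a
parallel version the test-and-set on the table would also have to be
atomic):

(defvar *visited* (make-hash-table :test #'eq))

(loop
  (if *node-queue*
      (let ((node (dequeue *node-queue*)))
        (unless (gethash node *visited*)   ; skip already-expanded nodes
          (setf (gethash node *visited*) t)
          (do-something-with node)
          (dolist (subnode (children node))
            (enqueue subnode *node-queue*))))
      (return)))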

--
Madhu
From: Rob Warnock
Tim Bradshaw <tfb+google(a)tfeb.org> wrote:
+---------------
| Chris Barts wrote:
| > How many people have forgotten that 'code' is a mass noun and, as such,
| > does not take plurals? Do you also say 'these muds' and 'these dusts'?
|
| How many people have forgotten that *language changes over time* and is
| not something handed down from the elder days, never to be changed?
| The sense of `codes' I gave is very common in the HPC community where
| "a code" typically refers to something approximating to a particular
| implementation of an algorithm. The plural use, which is more common,
| means something like "implementations of algorithms".
+---------------

Yup. Far too much of the HPC market consists of simply rerunning 1960s
"dusty deck" codes with different inputs and larger array dimensions.


-Rob

-----
Rob Warnock <rpw3(a)rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

From: George Neuner
On 11 Jan 2007 13:03:57 -0800, "Tim Bradshaw" <tfb+google(a)tfeb.org>
wrote:

>mark.hoemmen(a)gmail.com wrote:
>
>> Intel's proposed 80-core architecture will have DRAM attached to each
>> core -- sort of how Cell has "local stores" attached to each SPE.
>> That's how they plan to solve the BW problem -- amortize it over all
>> the cores.
>
>Don't we call that `cache' normally? (yes, I know, they'll be *big*
>caches, but only big by today's standards, in the same sense that
>today's machines have as much cache as yesterday's had main memory.)

Well, on the Cell the private memories are not caches but staging
memories ... the main processor has to move data into and out of them
on behalf of the coprocessors. It's very similar to the multi-level
memory system used on the old Crays, where the CPU had to fetch and
organize data to feed the array processors and store the results back
to the shared main memory.
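
The difference shows up in code: with a cache the program just
indexes main memory, while a staging memory imposes an explicit
copy-in/compute/copy-out rhythm. A toy sketch, nothing Cell-specific,
with the function and buffer made up for illustration:

(defun staged-sum (main-memory &key (chunk 256))
  "Sum MAIN-MEMORY by explicitly staging CHUNK-sized blocks through a
small local buffer, the way software-managed local stores are used."
  (let ((local-store (make-array chunk :element-type 'double-float))
        (acc 0d0))
    (loop for start from 0 below (length main-memory) by chunk
          for end = (min (length main-memory) (+ start chunk))
          do (replace local-store main-memory :start2 start :end2 end) ; copy in
             (loop for i from 0 below (- end start)                    ; compute locally
                   do (incf acc (aref local-store i))))
    acc))

;; (staged-sum (make-array 1000 :element-type 'double-float
;;                              :initial-element 1d0))
;; => 1000.0d0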

AFAIK, no one has tried to offer a hardware solution to staging
computations in a distributed-memory system since the KSR1 (circa
1990; it failed due to the company's creative bookkeeping rather than
the machine's technology). Everyone now relies on software approaches
like MPI and PVM.

George
--
for email reply remove "/" from address