From: Robert Myers on
Edward Wolfgram wrote:
>
> What don't you like about Blue Gene?
>

Blue Gene has a bisection bandwidth in the range of millibytes per
flop, depending on how it's configured (or you could have a nice
bisection bandwidth if you settled for an uninterestingly small
machine). As you continue to add nodes to the mesh, driving
ever-higher Linpack scores, the bisection bandwidth in millibytes per
flop just keeps falling, with the limit for this "scalable" machine
being zero.

That's a problem for doing FFTs. It's a problem that was identified
in the potential applications of Red Storm (25% projected efficiency
for pseudospectral simulations, for example), and a problem that has
appeared in IBM's own documents regarding FFTs on Blue Gene: the
flops per processor fall apart at some uninterestingly low number of
processors.

I gather that Blue Gene just won an award for doing FFTs. I haven't
had a chance yet to look at it to see what it means. Nothing could
have come to me as a greater surprise. It's been a while since I went
through the details, all of which were discouraging. Maybe something
has changed that I don't know about.

Robert.

From: Chris Thomasson on

"Joe Seigh" <jseigh_01(a)xemaps.com> wrote in message
news:CbadnSpeDeLqweHYnZ2dnUVZ_segnZ2d(a)comcast.com...
> The cell processor appears to be the first commodity multiprocessor that
> breaks with the cached shared memory multi-processing model if I
> got this right. So it's less applicable to shared memory multi-threading
> models and more applicable to models like MPI.
>
> In some respects, although there's no crossbar switch, it's like the
> old SP systems where an IBM mainframe served as a scheduling and
> control processor. The PPC appears to be the new mainframe. :)
>
> I wonder how the old shared memory strategy will work out. Will
> coherent cache shared memory scale up to 10's and 100's of processors
> and stay competitive?

YES!!!


Here is the ultimate PDR + hardware solution:


http://groups.google.com/group/comp.arch/msg/2a0f4163f8e13f1e


Watch... A PATENT for this technique will mysteriously appear one day.


;^)




From: Chris Thomasson on
"Joe Seigh" <jseigh_01(a)xemaps.com> wrote in message
news:CbadnSpeDeLqweHYnZ2dnUVZ_segnZ2d(a)comcast.com...
> The cell processor appears to be the first commodity multiprocessor that
> breaks with the cached shared memory multi-processing model if I
> got this right. So it's less applicable to shared memory multi-threading
> models and more applicable to models like MPI.
>
> In some respects, although there's no crossbar switch, it's like the
> old SP systems where an IBM mainframe served as a scheduling and
> control processor. The PPC appears to be the new mainframe. :)
>
> I wonder how the old shared memory strategy will work out. Will
> coherent cache shared memory scale up to 10's and 100's of processors
> and stay competitive?

You can use the PowerPC for a lot of the shared memory work. The Cell simply
forces you to stick to a strict distributed programming paradigm. Well,
luckily, I have experience with distributed programming. However, I do like
the fact that I can use the PPC on the Cell to do high-end shared memory
multi-processing.


From: Chris Thomasson on
"Joe Seigh" <jseigh_01(a)xemaps.com> wrote in message
news:I9WdneLUKrCi9R_YnZ2dnUVZ_vqpnZ2d(a)comcast.com...
> Chris Thomasson wrote:
>> "Joe Seigh" <jseigh_01(a)xemaps.com> wrote in message
>> news:CbadnSpeDeLqweHYnZ2dnUVZ_segnZ2d(a)comcast.com...
>>
>>>The cell processor appears to be the first commodity multiprocessor that
>>>breaks with the cached shared memory multi-processing model if I
>>>got this right. So it's less applicable to shared memory multi-threading
>>>models and more applicable to models like MPI.

[...]

Well, message passing is okay with me simply because I can personally
implement it with virtually zero overhead. It comforts me to know that a
message-passing algorithm can be implemented in software in a way that
renders virtually meaningless any questions about the overhead attributed
to its use. I posted the algorithm over on c.p.t. if you're interested;
look for conversations I had with David Hopwood.

In my "very humble" opinion, the posted algorithm proves that an efficient
message-passing algorithm can be accomplished and 100% implemented in
software using existing ISAs. The Cell seems to trust the programmer a
whole lot... Forcing us to come up with ultra-lean-and-mean message-passing
paradigms seems to be the trend... The trend that will make us some real
$$$, that is...

;^)

Any thoughts on this approach?

Joe, if we are forced to use distributed programming, then we are forced
to implement fast message-passing patterns in software... We have to beat
the hardware... I have a bad feeling that the hardware guys can render us
software guys moot? Nah... The Cell proves that software means something
after all?

:O




>>>I wonder how the old shared memory strategy will work out. Will
>>>coherent cache shared memory scale up to 10's and 100's of processors
>>>and stay competitive?

The cache coherency has to be weak. The software should always have the
ability to use the ISA to force a particular memory model for particular
algorithms.

If the cache coherency mechanism a future processor uses is sufficiently
weak, then it can allow software applications to tailor custom memory
models to their specific data-usage patterns and overall throughput
protocols.


>> You can use the PowerPC for a lot of the shared memory work. The Cell
>> simply forces you to stick to a strict distributed programming paradigm.

[...]

> AMD seems to be about to go that route
>
> http://techreport.com/onearticle.x/11438
>
> so distributed models might become more the norm.

Well, if you can't beat 'em... Join 'em?


Hmm... At least we can carry our overall synchronization-algorithm
design goal of "zero overhead or die" into a message-passing
implementation...



Brief Ultra-Fast Message-Passing Outline
----------------------------------------


---- Multiplexing Of Multiple Per-Thread:


-- communication data-structures:
"word-sized" lock-free anchors for linked data-structure


-- allocated with:
"**
http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855
**"
(patent pending)



---- Synchronized With:

-- "per-thread Peterson's Algorithm":
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c49c0658e2607317


-- "per-thread unbounded virtually zero-overhead fifo":
http://appcore.home.comcast.net/



---- Organized with

-- dual-per-thread/message version/time stamps




All may not be lost?


;^)





[...]

> Depending on how unique the processor is, the
> application might have to be written from scratch.

Yep.

:O



Well, more work for us? Consultants, anyone? Anyone need a consultant?


;^)