From: nmm1 on
In article <694ab6ff-e25a-4b9d-8a0b-6d142986128c(a)q23g2000yqd.googlegroups.com>,
Larry <lstewart2(a)gmail.com> wrote:
>
>BG/L is hard to program, and weird in many ways, but inability to do
>FFT is a bum rap.

If it weren't, IBM's Blue Gene design team should be reassigned to
something less challenging. One of its claims is that it has a
special interconnect to make exactly that efficient.

>Incidentally, only the SX-9 gets over 10% of HPL this way; the rest of
>the pack is in the 3-6% area. Global FFT is hard.

Yup. A good exercise for a budding parallel algorithmist is to
design three ways of doing a global FFT that aren't completely
hopeless. It's harder than it sounds.
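
One classic starting point is the transpose-based "four-step"
factorisation N = n1*n2. Below is a minimal serial sketch in
Python/NumPy -- purely illustrative, not anyone's production code --
showing where the global transpose (the expensive all-to-all on a
distributed machine) has to happen:

import numpy as np

def four_step_fft(x, n1, n2):
    # Transpose-based ("four-step") FFT of length n1*n2.  Gives the
    # same answer as np.fft.fft(x); the point is only to show where
    # the global transpose appears once the two axes live on
    # different nodes.
    assert x.size == n1 * n2
    # View the input index k as k = k1 + n1*k2, i.e. an n2-by-n1 matrix.
    a = x.reshape(n2, n1)
    # Step 1: n1 independent FFTs of length n2 (local to each column).
    a = np.fft.fft(a, axis=0)
    # Step 2: pointwise twiddle factors W_N^(j2*k1).
    j2 = np.arange(n2).reshape(n2, 1)
    k1 = np.arange(n1).reshape(1, n1)
    a = a * np.exp(-2j * np.pi * j2 * k1 / (n1 * n2))
    # Step 3: n2 independent FFTs of length n1.  In a distributed code
    # this is where the data must be transposed across the machine.
    a = np.fft.fft(a, axis=1)
    # Step 4: transpose so the output comes out in natural order.
    return a.T.reshape(-1)

x = np.random.rand(1 << 12) + 1j * np.random.rand(1 << 12)
print(np.allclose(four_step_fft(x, 64, 64), np.fft.fft(x)))   # True

Step 3 is what becomes an all-to-all over the whole partition when
you distribute it, which is exactly why bisection bandwidth matters.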


Regards,
Nick Maclaren.
From: Larry on
On Mar 8, 3:16 pm, Larry <lstewa...(a)gmail.com> wrote:

>  The data sets for
> these calculations get so large that there is no way to do
> substantial numbers of them in parallel, because the machines
> do not have enough memory to hold, say, a terabyte of data per run.
>
> -Larry

Evidently the ability to do arithmetic in my head is not my
strong point. The data for a 1kx1kx1k single-precision
complex FFT is 8 GB, not 1 TB... There would be plenty
of space in a cluster to do these shot-parallel.
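
For the record, the arithmetic (a trivial check, in Python):

n_points = 1024 ** 3          # 1k x 1k x 1k grid
bytes_per_point = 8           # single-precision complex = 2 * 4 bytes
print(n_points * bytes_per_point / 2.0 ** 30)   # 8.0 (GiB)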

-L
From: nedbrek on
Hello all,

"Del Cecchi" <delcecchinospamofthenorth(a)gmail.com> wrote in message
news:7viusjFjaqU1(a)mid.individual.net...
> Andy "Krazy" Glew wrote:
>> If you are a computer architect, it's Intel in Oregon, Silicon Valley,
>> Austin. Where else?
>
> Perhaps IBM in Rochester MN or maybe even Mayo Clinic, Rochester. The
> clinic does a lot of special stuff for medical equipment and Dr Barry
> Gilbert had a group that did high speed stuff for Darpa.

I sometimes fantasize about going to Intel Israel.

Ned


From: Robert Myers on
On Mar 8, 3:16 pm, Larry <lstewa...(a)gmail.com> wrote:
> On Mar 7, 11:15 pm, Andrew Reilly <areilly...(a)bigpond.net.au> wrote:
>
>
>
>
>
> > On Sun, 07 Mar 2010 17:51:55 -0800, Robert Myers wrote:
> > > I've explained my objections succinctly.  Of 64000 or more processors,
> > > it can use only 512 effectively in doing an FFT.  That the bisection
> > > bandwidth is naturally measured in *milli*bytes per flop is something
> > > that I have yet to see in an IBM publication.
>
> > Is that really a serious limitation?  (512 cores doing a parallel FFT)  I
> > know I'm not familiar with the problem space, but that already seems way
> > out at the limits of precision-limited usefulness.  How large (points) is
> > the FFT that saturates a 512-processor group on a BG?  Are there *any*
> > other super computers that allow you to compute larger FFTs at an
> > appreciable fraction of their peak floating point throughput?[*]
>
> > Clearly it is a limitation, otherwise you wouldn't be complaining about
> > it.  Still seems that it might be useful to be able to be doing 128 of
> > those mega-FFTs at the same time, if doing lots of them is what you cared
> > about.
>
> > I reckon I'd be more worried about the precision of the results than the
> > speed of computing them, though.
>
> > [*] Are any of the popular supercomputer benchmarks capacity-shaped in
> > this way, rather than rate-for-fixed-N-shaped?  How many problems are
> > capacity limited rather than rate limited?
>
> > Cheers,
>
> > --
> > Andrew
>
> Actually I think R. Myers' facts are wrong here.  I downloaded the
> HPCC Challenge Results data from the UTK website, and added
> a column computing the ratio of global FFT performance
> versus global HPL performance. Then I filtered away all systems not
> achieving at least 100 GFlops FFT performance.
>
> For systems achieving over a teraflop of FFT performance, BG/L with
> 32K cores is beaten <only> by the SX-9 in this figure of
> merit.  For systems achieving over 100 GF, it is beaten by the SX-9,
> by the best Nehalem/InfiniBand systems, by the Cray XT3, and
> by the largest SiCortex machine.
>
> BG/L is hard to program, and weird in many ways, but inability to do
> FFT is a bum rap.
>
> Incidentally, only the SX-9 gets over 10% of HPL this way; the rest of
> the pack is in the 3-6% area.  Global FFT is hard.
>
My facts are not wrong. IBM's own documents show that scaling falls
apart (doing volumetric FFTs) above 512 processors. Performance below
10% of theoretical peak flops on real-world problems has been a
complaint about DoE computers for years. Blue Gene was just another
in a long series of computers that can't do what a Cray-1 could do
easily and with very high efficiency.

The bandwidth problem is inescapable. With today's "scalable"
architectures, you can add more flops and more memory, but at a
certain point, if you need to do FFTs, you are just wasting money,
and that point comes fast. Building machines and declaring them to
be scalable when they are not is just bureaucratic marketdroidism.
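
To put a number on the "millibytes per flop" figure, here is the
back-of-the-envelope calculation for a 3D torus. The dimensions, link
bandwidth, and per-node flops below are illustrative placeholders in
roughly the right ballpark, not a claim about any particular machine's
exact specification:

def bisection_bytes_per_flop(x, y, z, link_gbytes_s, node_gflops):
    # Cut the torus across its longest dimension (z); the cut crosses
    # 2*x*y links because each ring in z wraps around.
    bisection_gbytes_s = 2 * x * y * link_gbytes_s
    total_gflops = x * y * z * node_gflops
    return bisection_gbytes_s / total_gflops   # bytes per flop

# Illustrative numbers: 32 x 32 x 64 torus, ~0.175 GB/s per link,
# ~5.6 Gflops per node.
print(bisection_bytes_per_flop(32, 32, 64, 0.175, 5.6))   # ~0.001

That works out to about a millibyte of bisection bandwidth per flop.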

This is a design (and budgetary) decision, not an immutable fact of
nature. It has become standard practice to starve big machines of
global bandwidth, because it's the Linpack flops that get you the
press. It would be *very* expensive to design a machine that scales
well to larger volumetric FFTs, but that is (as Eugene Miya has
correctly argued) a matter of budgetary choices.

> There is a very large market for FFT cycles, particularly in doing 2D
> and 3D FFTs for seismic processing.  The data sets for
> these calculations get so large that there is no way to do
> substantial numbers of them in parallel, because the machines
> do not have enough memory to hold, say, a terabyte of data per run.
> If you try to do it any other
> way, you've succeeded in turning a cluster communications problem into
> a (worse) I/O problem.
>
If you're going to criticize me, at least read what I write. I am not
talking about problems where I/O would be the limiting factor. I'd be
surprised if seismic processing were done on anything but rack
clusters. Completely different world.

Robert.
From: Robert Myers on
On Mar 8, 3:29 pm, n...(a)cam.ac.uk wrote:
> In article <694ab6ff-e25a-4b9d-8a0b-6d1429861...(a)q23g2000yqd.googlegroups.com>,
>
> Larry  <lstewa...(a)gmail.com> wrote:
>
> >BG/L is hard to program, and weird in many ways, but inability to do
> >FFT is a bum rap.
>
> If it weren't, IBM's Blue Gene design team should be reassigned to
> something less challenging.  One of its claims is that it has a
> special interconnect to make exactly that efficient.
>
> >Incidentally, only the SX-9 gets over 10% of HPL this way; the rest of
> >the pack is in the 3-6% area.  Global FFT is hard.
>
> Yup.  A good exercise for a budding parallel algorithmist is to
> design three ways of doing a global FFT that aren't completely
> hopeless.  It's harder than it sounds.
>