From: Robert Myers on
On Jan 3, 6:44 am, n...(a)cam.ac.uk wrote:
> In article <7qacstFk...(a)mid.individual.net>,
>
> Del Cecchi <delcec...(a)gmail.com> wrote:

Well, actually, Robert Myers wrote:

> >Since I've argued that super-sized computers seem to me to be of
> >questionable value, maybe that's all the vast bulk of science really
> >needs.  If I really need a computer with respectable bisection
> >bandwidth, I can skip waiting for a gigantic machine that runs at 5%
> >efficiency (or worse) and learn to live with whatever I can build
> >myself.
>
> Bisection bandwidth is the Linpack measurement of networking, but
> that doesn't affect your point.
>
> I favoured (and favour) the latency and bandwidth of all-to-all,
> both because it is more realistic and because it is invariant over
> the network topology and implementation.

Sure, and while we're at it, can I have a time-travel machine? Rather
than trying to fight every conceivable battle, I've chosen just one
where even the national labs and now even some of the endlessly-
reproduced bureaucrat charts have already conceded my point: limited
bisection bandwidth is a problem for FFTs.

If Vladimir Putin says he wants supercomputers, and people think
that supercomputers are lots of Linpack flops, then that's what
scientists in Russia will be getting. It's what scientists in the USA
are already getting. I've chosen a single number to focus on. Just
like Linpack flops, it simplifies a complicated story, but I'll be
happy to take the story to the level of complication I've been so
persistent about.

> The only real point of the specialist HPC systems nowadays is when
> the problems are limited by communication and not processing, but
> it's more a situation of how you spend your money than differences
> in architecture.  There isn't any major technical problem with using
> multiple InfiniBand ports on each node, and a suitable topology; the
> software can get interesting, though.

Yes, and if your goal is to claw your way to the top of the all-
important Top 500 list, you're not going to waste money on
communication that doesn't help with the Linpack benchmark.

> However, people who do that usually want SMALLER nodes, because that
> increases the network capacity relative to computation power.  It's
> expensive to scale the former pro-rata to the latter, and is often
> not feasible with off-the-shelf components.  I.e. they use single-
> socket nodes.

I'm not following your logic here. If I could have one big fat wire
running amongst huge nodes, rather than angel-hair spaghetti
connecting many more smaller nodes, why, exactly, would I prefer the
latter? One four- or eight-way Nehalem (or successor), with or without
GPGPU assist, with a huge wire running to a massive switch. What
better are you proposing?

Robert.


From: Mayan Moudgill on
nmm1(a)cam.ac.uk wrote:

>
> The CETEP/CASTEP/ONETEP series of ab initio quantum mechanics programs
> alternate between the solution of a large set of simultaneous equations
> and a very large 3-D FFT. The latter, like sorting, is limited by
> communication. It is (amongst other things) used for surface chemistry
> calculations, which are of immense industrial importance.

(I can't speak to 3D FFTs, so I'll restrict myself to Cooley-Tukey
radix-2 1D FFT which I do have experience with, and hopefully the
analysis will carry over)

A 1-D FFT of size 2**lgN = N points over 2**lgM = M machines will do lgN FFT
steps. Of these, all but the last lgM need no inter-node communication. The
last lgM steps can be done in different ways, but let's analyze the case
where every node exchanges information with its butterfly pair, so that the
entire array is copied across the network for each of the last lgM steps.

Assume that it takes P seconds per point for compute and U seconds per point
for communication. Then the time to compute a 1-D FFT with N points
serially will be:
Ts = lgN*N*P
while the time to do it in parallel will be:
Tp = lgN*N*P/M + lgM*U*N

What's U? Assume that the points are double-precision complex values
[i.e. 16 bytes], that the bandwidth per link is B bytes per
second, and that all nodes are connected to a central switch using 2 links
(one up, one down). Given the nature of 1-D FFT communication, all links
can be used simultaneously during the communication phase, so the per-point
communication time is:
U = 16/(B*M)
Setting B' = 16/B and substituting:
Tp = lgN*N*P/M + lgM*N*B'/M

Communication can be overlapped with computation - as soon as some
number of results has been computed (128 complex doubles for normal
Ethernet frames, 512 for jumbo frames; YMMV based on OS syscall overhead
and the actual network used), start sending them out. Ideally, this will result in
Tp = (lgN-lgM)*N*P/M + lgM*N*B'/M
= 1/M*(lgN*N*P + lgM*N*(B'-P))

The reciprocal of the speed-up is:
Tp/Ts = 1/M + lgM*(B'-P)/(M*lgN*P)
      = (1/M)*(1 + lgM*B'/(lgN*P) - lgM/lgN)

Assume a 16-way machine: lgM=4, P = 1e-9 s (1 ns per point), B = 1e9 B/s
(assumes dual 10GbE, fairly tuned stacks).

For lgN = 30 (N ~ 1G points), we would end up with Ts = 32.2 s and Tp
= 6.0 s, of which 4.3 s is communication: speedup = 5.33.
For a 64-way, assuming the same numbers, we end up with Tp = 2.0 s, of
which 1.6 s is communication: speedup = 16.
For 256-way, we would end up with Tp = 0.6 s, of which 0.5 s is
communication: speedup = 51.
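
For anyone who wants to play with the numbers, here's a rough Python
sketch of the model above (it just evaluates the formulas, it doesn't
run an FFT, and the function name is only for illustration):

def fft_model(lgN, lgM, P=1e-9, B=1e9):
    # Names follow the text: P = compute sec/point, B = bytes/sec per link,
    # Bp = B' = 16/B for 16-byte complex doubles.
    N, M = 2**lgN, 2**lgM
    Bp = 16.0 / B
    Ts = lgN * N * P                     # serial time
    comm = lgM * N * Bp / M              # last lgM steps, M links in parallel
    Tp = (lgN - lgM) * N * P / M + comm  # overlapped case from above
    return Ts, Tp, comm, Ts / Tp

for lgM in (4, 6, 8):                    # 16-, 64-, 256-way
    Ts, Tp, comm, speedup = fft_model(30, lgM)
    print(f"M={2**lgM:3d}: Ts={Ts:.1f}s Tp={Tp:.1f}s "
          f"comm={comm:.1f}s speedup={speedup:.1f}")

which reproduces the 5.3x, 16x and 51x figures above.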

Radix-2 1-D FFTs have the nice property that any machine only needs to
talk to lgM other machines, so the switch structure will have somewhat
better scaling properties than a general all-to-all switch.
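
Concretely: with a block distribution, the partner of node r in those
last lgM stages is always r with one bit flipped - a hypercube pattern.
A toy sketch, with a helper name of my own choosing:

def partners(r, lgM):
    # Nodes that node r exchanges data with during the lgM distributed
    # butterfly stages: r with bit k flipped, for k = 0 .. lgM-1.
    return [r ^ (1 << k) for k in range(lgM)]

print(partners(5, 4))   # node 5 of 16 only ever talks to nodes [4, 7, 1, 13]

so the switch only has to carry lgM fixed permutations, not arbitrary
all-to-all traffic.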

Anyway, Nick, you're right that things like FFTs are network bandwidth
limited; however, it is still possible to get fairly good speedups.

Of course, I haven't analyzed the other approaches; e.g. doing the last
lgM stages on one processor (so that there is only one communication
phase), or other points in the parallelism space.
From: nmm1 on
In article <u9mdnb0HtsTSaN3WnZ2dnUVZ_gKdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>(I can't speak to 3D FFTs, so I'll restrict myself to Cooley-Tukey
>radix-2 1D FFT which I do have experience with, and hopefully the
>analysis will carry over)

In general, it doesn't, but it more-or-less does for the one you are
doing - which is NOT the way to do multi-dimensional FFTs! It is
almost always much faster to transpose, so you are doing vector
operations in the FFT at each stage.
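
For the 2-D case, the idea is simply: do 1-D FFTs along the rows each
node holds (unit stride, vectorisable), transpose - which is where all
the communication happens - and repeat. A serial numpy sketch of the
data movement only, not of the parallel code:

import numpy as np

def fft2_via_transpose(a):
    a = np.fft.fft(a, axis=1)      # row FFTs, local to each node
    a = np.ascontiguousarray(a.T)  # global transpose = the communication step
    a = np.fft.fft(a, axis=1)      # row FFTs again, local
    return a.T

a = np.random.rand(8, 8) + 1j * np.random.rand(8, 8)
assert np.allclose(fft2_via_transpose(a), np.fft.fft2(a))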

>Communication can be overlapped with computation - ...

I am afraid not, when you are using 'commodity clusters'. Firstly,
even with the current offloading of the TCP/IP stack, there is still
a lot of CPU processing needed to manage the transfer, and obviously
a CPU can't be doing an FFT while doing that. Secondly, you have
omitted the cost of the scatter/gather, which again has to be done
by the CPU.

>Assume a 16-way machine: lgM=4, P = 1e-9 s (1 ns per point), B = 1e9 B/s
>(assumes dual 10GbE, fairly tuned stacks).
>For lgN = 30 (N ~ 1G points), we would end up with Ts = 32.2 s and Tp
>= 6.0 s, of which 4.3 s is communication: speedup = 5.33.
>For a 64-way, assuming the same numbers, we end up with Tp = 2.0 s, of
>which 1.6 s is communication: speedup = 16.
>For 256-way, we would end up with Tp = 0.6 s, of which 0.5 s is
>communication: speedup = 51.

Hmm. That's not the experience of the people I know who have tried
it. I haven't checked your calculations, so I am not saying whether
or not I agree with them.

>Anyway, Nick, you're right that things like FFTs are network bandwidth
>limited; however, it is still possible to get fairly good speedups.

For multi-dimensional FFTs, certainly. I remain doubtful that you
would get results anywhere near that good for single-dimensional
ones. I certainly know that they are notorious SMP-system killers,
and the speedups obtained by vendors' libraries are not very good.


Regards,
Nick Maclaren.
From: Mayan Moudgill on
Robert Myers wrote:

> I assume that most who buy installations like Blue Gene would have RAS
> requirements that would be hard or impossible to meet with a Beowulf
> cluster. In the end, it's probably RAS that rules.
>

What kind of recovery are you looking for? Against node failures?

I'd suggest that distributed checkpointing with restart on failures
could be an adequate model for small-to-mid-size clusters.
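
Something along these lines per node would do for a first cut (a rough
sketch only; the step and state functions are stand-ins for the
application, and a real cluster would also coordinate so that all nodes
checkpoint at the same step):

import os, pickle

CHECKPOINT = "state.ckpt"     # per-node checkpoint file; name is arbitrary
EVERY = 100                   # checkpoint interval, in steps

def init_state():             # stand-in for the application's setup
    return 0.0

def do_step(state):           # stand-in for one compute step
    return state + 1.0

def save(step, state):
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump((step, state), f)
    os.replace(tmp, CHECKPOINT)          # atomic rename: never a torn file

def load():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)        # resume after a failure
    return 0, init_state()               # fresh start

def run(total_steps):
    step, state = load()
    while step < total_steps:
        state = do_step(state)
        step += 1
        if step % EVERY == 0:
            save(step, state)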

BTW: does anyone have pointers to how IBM/Los Alamos/LLNL are planning
to handle failures on Roadrunner/BlueGene? An MTBF analysis and a
discussion of failure-detection and recovery techniques would be nice.
From: Mayan Moudgill on
nmm1(a)cam.ac.uk wrote:

> In article <u9mdnb0HtsTSaN3WnZ2dnUVZ_gKdnZ2d(a)bestweb.net>,
> Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>>(I can't speak to 3D FFTs, so I'll restrict myself to Cooley-Tukey
>>radix-2 1D FFT which I do have experience with, and hopefully the
>>analysis will carry over)
>
>
> In general, it doesn't, but it more-or-less does for the one you are
> doing - which is NOT the way to do multi-dimensional FFTs! It is
> almost always much faster to transpose, so you are doing vector
> operations in the FFT at each stage.
>


The initial bit-reversal is trivial to parallelize (at least for 1D
FFTs). It will (at most) involve a single copy of the array over the
entire network.
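
To see why, note that under a block distribution the destination of
every point is computable locally from its index: the owner of point i
after bit reversal is just the low lgM bits of i, reversed. A quick
sketch (the helper names are mine):

def bit_reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

lgN, lgM = 10, 3                        # 1024 points on 8 nodes
owner = lambda i: i >> (lgN - lgM)      # block distribution: top lgM bits

for i in range(2**lgN):
    # where point i ends up after the bit-reversal permutation
    dest = owner(bit_reverse(i, lgN))
    assert dest == bit_reverse(i & (2**lgM - 1), lgM)

So each node can scatter its block in one pass, and the whole
permutation costs a single copy of the array over the network.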