From: Robert Myers on 4 Jan 2010 13:33
On Jan 4, 11:08 am, Thomas Womack <twom...(a)chiark.greenend.org.uk>
> Can you give some idea of how large an FFT you want to do?
You want to do the biggest problem that you conceivably can, and, for
the problems I know most about, one eighth of the total memory space
would be a plausible goal. For the 64K processor Blue Gene
installation, you'd like to be able to make use of about 8K
processors, not 512. For the bigger machines, you want to use more.
I answered to direct questioning here why I think doing such large
transforms is an important goal for any machine that wants to advance
fundamental science. To make a long story short, many interesting
questions come down to: for a strongly-nonlinear system, how do the
longest and shortest scales interact? Nature gives us enormous ranges
of scales that we will never come close to computing. The best we can
hope to do is to try to understand, from problems within reach, how
that interaction goes. Fourier transform methods to represent
differential operators are particularly attractive for such a question
because they don't do artificial things to the smallest scales, as all
non-global differencing schemes do. Important computations can be
reduced to nothing but FFTs and some saxpy-type operations.
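To make that concrete, here is a minimal sketch (my own illustration, not anyone's production code) of a pseudo-spectral step for viscous Burgers, u_t = -u*u_x + nu*u_xx: the derivatives come from forward/inverse FFTs, and the time update is a saxpy-type operation. The grid size, viscosity, and time step are arbitrary assumed values.

```python
import numpy as np

# One explicit Euler step of viscous Burgers, pseudo-spectrally:
# derivatives via FFT (no artificial treatment of the small scales),
# time advance as a saxpy-type update y = y + a*x.
n = 256                                   # grid points (assumed)
nu, dt = 0.01, 1e-3                       # viscosity, time step (assumed)
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
u = np.sin(x)                             # initial condition

ik = 1j * np.fft.fftfreq(n, d=1.0 / n)    # spectral derivative operator i*k
uh = np.fft.fft(u)
ux = np.fft.ifft(ik * uh).real            # u_x: FFT, multiply, inverse FFT
uxx = np.fft.ifft(ik * ik * uh).real      # u_xx likewise
u = u + dt * (-u * ux + nu * uxx)         # saxpy-type update
```

The entire step is two inverse FFTs, one forward FFT, and elementwise multiply-adds; at scale, the FFTs are what stress the machine.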
> >The advice of the Einstein of Cambridge (and the thousands of unnamed
> >others whom he will call to his witness) notwithstanding, the reason
> >is not far to find and could have been and was identified even from
> >the sketchiest of design documents, and it could have been fixed.
> >Even someone as obtuse as I am can follow that logic.
> OK, you are being gnomic rather than obtuse; please tell us the
> reason, and how you fix it without making the machine vastly more
> expensive or vastly less modular.
The bomb labs have known about this problem for a long time. Machine
bisection bandwidth is a reasonable predictor of performance on an
FFT, and the massively powerful machines that are so breathlessly
reported to the press have an embarrassing bisection bandwidth. The
solution is straightforward, but not cheap: you need more network
bandwidth. For machines that use a fat tree architecture, the
challenge is to build a fast enough switch with sufficient bandwidth.
I assume that this issue has been discussed at length behind closed
doors and that the national labs have decided they don't want to pay
the price: they'd rather have more flops than bytes/second, even
though bandwidth is the limiting factor in a huge range of problems.
If you're scaling up the machine while letting bytes/flop drop to zero
(which is what the "scalable" machines we now have do), the
scalability is simply a fraud.
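A back-of-envelope version of the bisection argument: a distributed 1-D FFT requires a global transpose (an all-to-all), and roughly half of the array crosses the machine's bisection. All numbers below are illustrative assumptions, not the figures for any real machine.

```python
# Rough time per global transpose in a distributed FFT.
# Every number here is an assumed, illustrative value.
nodes = 4096
mem_per_node = 4 * 2**30                 # 4 GiB per node (assumed)
fft_bytes = nodes * mem_per_node / 8     # problem fills 1/8 of total memory
bisection = 2**40                        # 1 TiB/s aggregate bisection (assumed)

transpose_traffic = fft_bytes / 2        # ~half the array crosses the bisection
seconds_per_transpose = transpose_traffic / bisection
print(f"{seconds_per_transpose:.2f} s per global transpose")
```

The point of the exercise: the transpose time scales with total memory over bisection bandwidth, so holding bytes/flop fixed while growing the machine means the bisection must grow with the memory, which is exactly the expensive part.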
> >Maybe the world has changed so much that an ability to handle "naive"
> >but multidimensional data structures with great efficiency is no
> >longer so important. Nick *would* be in a much better position to
> >comment on that than I would. I know, just like you do, about a
> >narrow but important class of problems. Well, if physics is narrow.
> Naive, efficient and fast is a classic pick-two; insisting that
> physically-adjacent points live sixteen megabytes apart doesn't seem
> entirely naive to me.
> http://www.cse.scitech.ac.uk/disco/mew20/presentations/MFG.pdf is a
> huge bolus of indigestible information, but describes the performance
> profile on small-to-medium clusters of what seem to be a number of
> jobs that chemists (micro-scale physicists?) want to do; they have
> fairly big FFTs in them and don't seem to be doing too badly.
If you build your own cluster, you can do the trades for yourself. If
the nation is going to invest in a handful of ball-buster machines,
you have to take what the committee decides to give you. We're going
ahead based on claims about global warming without understanding the
basic science of an issue that pervades nearly every aspect of that
problem and we've been building machines that won't help.
From: Stephen Fuld on 4 Jan 2010 17:28
Andy "Krazy" Glew wrote:
>> You could use the provided hardware scatter-gather if you were astute
>> enough to use InfiniBand interconnect. :-)
>> you can lead a horse to water but you can't make him give up ethernet.
> What's the story on Infiniband?
Do you want to know the history of Infiniband or some details of what it
was designed to do (and mostly does)?
- Stephen Fuld
(e-mail address disguised to prevent spam)
From: Anne & Lynn Wheeler on 4 Jan 2010 17:40
Stephen Fuld <SFuld(a)alumni.cmu.edu.invalid> writes:
> Do you want to know the history of Infiniband or some details of what
> it was designed to do (and mostly does)?
minor reference to SCI (an implementable subset of FutureBus)
eventually morphing into the current InfiniBand
40+yrs virtualization experience (since Jan68), online at home since Mar1970
From: Thomas Womack on 4 Jan 2010 18:54
In article <8a091340-7961-4a0a-baae-1265d2cc00f8(a)r24g2000yqd.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>On Jan 4, 11:08 am, Thomas Womack <twom...(a)chiark.greenend.org.uk>
>> Can you give some idea of how large an FFT you want to do?
>You want to do the biggest problem that you conceivably can, and, for
>the problems I know most about, one eighth of the total memory space
>would be a plausible goal. For the 64K processor Blue Gene
>installation, you'd like to be able to make use of about 8K
>processors, not 512. For the bigger machines, you want to use more.
The largest 1D FFT that I can find evidence of is implied by
http://www.hpcs.is.tsukuba.ac.jp/~daisuke/pi.html - the size of the
FFT isn't stated, but it'll be either 25*2^35 storing three decimal
digits per double-precision entry or 15*2^35 storing five. The
algorithm doubles the number of correct digits with each iteration,
and each iteration involves about three full-length FFTs, so there are
about a hundred FFTs each on about half a trillion elements - each is
taking about twenty minutes.
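The two candidate packings are consistent with each other, which is a quick arithmetic check: both imply the same total digit count, about 2.58 trillion decimal digits.

```python
# Check: 25*2^35 doubles at 3 digits each and 15*2^35 doubles at
# 5 digits each encode the same number of decimal digits.
size_a = 25 * 2**35      # FFT length, 3 decimal digits per double
size_b = 15 * 2**35      # FFT length, 5 decimal digits per double
digits_a = size_a * 3
digits_b = size_b * 5
assert digits_a == digits_b == 75 * 2**35
print(digits_a)          # ~2.58 trillion digits
```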
Quoting Daisuke Takahashi (neither the footballer nor the figure-skater):
"Main program run:
Job start : 9th April 2009 07:37:32 (JST)
Job end : 10th April 2009 12:43:21 (JST)
Elapsed time : 29:05:49
Main memory : 13.5 TB
Algorithm : Gauss-Legendre algorithm
Programs were written by myself. The computer used was T2K Open
Supercomputer (Appro Xtreme-X3 Server) at the Center for Computational
Sciences, University of Tsukuba. 640 nodes of the total system (648
nodes, theoretical peak processing speed for the single node is 147.2
billion floating point operations per second. 95.4 trillion floating
point operations per second for all nodes), were definitely used as
single job and parallel processing for both of programs run."
The machine is (according to top500, and the 147.2 = 2.3 GHz * 4
flops/cycle * 16 cores/node) a Myrinet 10G cluster of four-socket
quad-core 2.3GHz Opterons; I'd guess twenty to thirty million dollars,
and a very fat switch in the middle, though the Myrinet documentation
I can find suggests that their biggest routine switch is 512-way. 648
is 2^3*3^4; the coincidence of node counts makes me wonder whether the
topology is something like the Kautz graphs that SiCortex used.
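The quoted peak figures check out from the clock rate and core count:

```python
# Sanity-check the quoted peaks: 2.3 GHz * 4 flops/cycle * 16 cores/node,
# times 648 nodes, should reproduce 147.2 GFLOP/s and 95.4 TFLOP/s.
ghz, flops_per_cycle, cores, nodes = 2.3, 4, 16, 648
node_peak = ghz * flops_per_cycle * cores    # GFLOP/s per node
system_peak = nodes * node_peak / 1000       # TFLOP/s for the full system
print(round(node_peak, 1), round(system_peak, 1))
```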
http://www.hpcs.is.tsukuba.ac.jp/~daisuke/pub.html indicates that
Daisuke has also worked on 3D FFT implementations on this kind of
machine.
From: Del Cecchi` on 4 Jan 2010 23:23
Robert Myers wrote:
> If you build your own cluster, you can do the trades for yourself. If
> the nation is going to invest in a handful of ball-buster machines,
> you have to take what the committee decides to give you. We're going
> ahead based on claims about global warming without understanding the
> basic science of an issue that pervades nearly every aspect of that
> problem and we've been building machines that won't help.
How is your bisection bandwidth calculation affected by the reasonable
amount of per node memory on Blue Gene? As I understand it, the current
BG/P node has 4 cores and 4GB of memory.
Just noticed bad reply-to address. Sorry, will fix in a minute.