From: Brett Davis on
In article
<15ea973c-fef5-4cc0-82fb-d71e68a6425d(a)d37g2000yqm.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
> On Jul 25, 1:27 am, Brett Davis <gg...(a)yahoo.com> wrote:
> > <57004bc1-201f-4cd1-8584-e59c01eb0...(a)e5g2000yqn.googlegroups.com>,
> > Robert Myers <rbmyers...(a)gmail.com> wrote:
> > > I agree. The plots are splendid. They always are.
> >
> > Stanford online has a truly awesome related iTunes talk by Bill Dally: http://deimos3.apple.com/WebObjects/Core.woa/Browse/itunes.stanford.e...
> > From: http://deimos3.apple.com/WebObjects/Core.woa/Browse/itunes.stanford.e...
> > iTunes store search term "The future of throughput".
> > Track 13 for the course "Programming Massively Parallel Processors with CUDA (CS193G)".
> >
>
> I got stopped on Bill Dally's first slide, where he says
> "efficiency=locality."
>
> He claims that his slide tells you everything you need to know about
> the future of computing, AND THAT BELIEF AND PROSELYTIZING FOR THAT
> BELIEF IS WHY I HAVE USED UP SO MUCH SPACE HERE.
>
> THE MOST INTERESTING PROBLEMS ARE NOT LOCAL, BECAUSE THE MOST
> INTERESTING PROBLEMS ARE NOT LINEAR.
>
> A problem is embarrassingly parallel IFF it is local IFF it is linear.
>
> If a problem is linear, then there is a representation in which it is
> both local and embarrassingly parallel. If a problem is not linear,
> then there is, in general, no such representation. Does one need a
> PhD from Cal Tech to understand this?

A simple programmer will tell you: fine, but your performance is
going to be horrible.
Several things have been tried:

The company that bought Cray (Tera) had a machine that was
1000-way interleaved to deal with 1000-cycle latencies from RAM.
That design is dead and buried?

Sun had its 8-way threaded multiprocessor for a more traditional
approach to the same problem. Dead.

IBM sells 4-way threaded processors; it is too early to say whether
this is a fad and a failure for IBM.

AMD would allow you to split the 128-bit memory bus into two
independent 64-bit buses, for better transaction throughput.
To the best of my knowledge no one turns this mode on, as one
bus is faster for everyone?

For Bulldozer with its 256-bit memory bus I could see one
128-bit bus for the OS and app image, and a pair of 64-bit
buses for the database to use. But that is a stretch when
AMD may have decided the whole idea is stupid.

Intel has two-way HypeThreading which, while no longer the net
negative that was turned off at boot by default in the NetBust
days, is still a wash at best. Bold hopes for 2x speed
increases on high-latency code have returned no speed boost
at all. You are hard pressed to find a 5% improvement in a
non-rigged benchmark.
Now that we have more CPUs than threads of work to do on most
desktops, the very idea of fine-grain threading is dead.
(Outside of Intel marketing, where it is just Blue Crystals.)
Fine-grain threading costs transistors for all those things
you have to double or otherwise increase in size, which costs
power, which costs clock, which costs performance.

AMD is quite vocal that they will never do HypeThreading,
saying that it is a net negative, and I believe them.

> I suppose that it is inevitable that EE's should believe that
> interesting things can happen in a linear universe, but, from a
> mathematical point of view, linear systems are as interesting as the
> aging of paintings hanging in the Louvre.
>
> Robert.

The one problem set I know of that really wants small random
access is dictionary-based AI, the last remaining approach
to emulating the human mind, as all the other approaches have
failed. (Think ELIZA with a terabyte database.)

But ultimately this is a kludge to get the same results that
the human mind does; the human mind is massively parallel
and soft plugboard-wired between neurons.

On the scale from ASIC to DSP to CPU, the human mind is way off
to the left of ASIC.

So what problem is it that you want this functionality for?

Fringe designs and apps are interesting mental exercises, fun.

> > A related PDF: http://cva.stanford.edu/people/dally/ISSCC2005.pdf
> >
> > Brett
From: Jason Riedy on
And Brett Davis writes:
> The company that bought Cray had a machine that was 1000 way
> interleaved to deal with 1000 cycle latencies from RAM.
> That design is dead, and buried?

Nope. The Cray XMT exists; I use the one at PNNL almost daily. The
XMT2 is on its way. And your numbers are off by a factor of 10. The
Tera machine had ~140 cycle memory latency (iirc) and carried 128
threads per cpu. The XMT's latency is far worse, and you have fewer
useful threads per processor (user code limited to 100). Some of that
is slated to change with the XMT2, but I don't know hard numbers
(and wouldn't be able to share them yet).

I'm not making any claims about commercial viability. But it's
far easier to obtain decent performance on massive-scale combinatorial
graph analysis on the XMT than elsewhere... at the moment. Amdahl
giggles a bit at practical potential...

Jason
From: Robert Myers on
On Jul 26, 1:33 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:
> On 7/25/2010 8:12 PM, Robert Myers wrote:
>
> > I got stopped on Bill Dally's first slide, where he says
> > "efficiency=locality."
>
> > He claims that his slide tells you everything you need to know about
> > the future of computing, AND THAT BELIEF AND PROSELYTIZING FOR THAT
> > BELIEF IS WHY I HAVE USED UP SO MUCH SPACE HERE.
>
> > THE MOST INTERESTING PROBLEMS ARE NOT LOCAL, BECAUSE THE MOST
> > INTERESTING PROBLEMS ARE NOT LINEAR.
>
> > A problem is embarrassingly parallel IFF it is local IFF it is linear.
>
> > If a problem is linear, then there is a representation in which it is
> > both local and embarrassingly parallel.  If a problem is not linear,
> > then there is, in general, no such representation.  Does one need a
> > PhD from Cal Tech to understand this?
>
> (1) Is there a proof for this?  Not that there are non-linear systems
> that are not embarrassingly parallel, but that there are no interesting
> non-linear systems that are not amenable to parallel solutions.
>
> E.g. the N-body problem, gravitational dynamics?
>

First, you have to understand that what I think of as a huge problem
is one the DoE (for example) does not even recognize as a problem at
all, never mind as a huge one.

Bill Dally later whisks through the logic, where he talks about
calculating fluxes that never make it off the graphics card. If you
think way back to the divergence theorem, you can see what they're
doing. You reduce the nonlinear terms in the Navier-Stokes (or other
equation involving transport) to fluxes through the surface of a
control volume. Voila! The non-linear term is local.
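
To make that concrete, here is a minimal sketch (mine, not Dally's;
a Lax-Friedrichs flux on 1D Burgers) of the nonlinear term turning
into purely local flux differences:

import numpy as np

# 1D inviscid Burgers, u_t + (u^2/2)_x = 0, in finite-volume form.
# Each cell update needs only the fluxes through its two faces, and
# each face flux needs only the two adjacent cells: purely local.
N, dt = 200, 0.002
x = np.linspace(0, 2*np.pi, N, endpoint=False)
dx = x[1] - x[0]
u = 1.0 + 0.5*np.sin(x)                    # periodic initial condition

def face_flux(a, b):
    # Lax-Friedrichs numerical flux for f(u) = u^2/2
    return 0.25*(a*a + b*b) - 0.5*(dx/dt)*(b - a)

for _ in range(500):
    F = face_flux(u, np.roll(u, -1))       # F[i] = flux through face i+1/2
    u = u - (dt/dx)*(F - np.roll(F, 1))    # conservative, local update

Nothing in that loop ever reaches past a nearest neighbor, which is
exactly why the fluxes never have to leave the graphics card.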

Since you've got all these flops and not much bandwidth, you only ever
transmit and receive fluxes across the boundary. What's more, if
you're the DoE, you can add all kinds of chemical reactions and
constitutive equations to keep the flops occupied, EVEN THOUGH YOU ARE
USING A TERRIBLE APPROXIMATION OF THE TRANSPORT EQUATIONS, even without
the complication of all those chemical reactions (or whatever).

Control volumes, of course, do not appear in the Navier-Stokes
equations, which contain differential operators. Any accurate
approximation to the differential operators is non-local, and, in
fact, the only one with quantifiable limitations in the face of non-
linearity is global (a spectral representation of the operator).
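
A toy numerical experiment (mine; numpy conventions, periodic domain)
makes the point: a local stencil butchers the short scales that the
global spectral derivative gets to machine precision:

import numpy as np

N = 32
x = np.linspace(0, 2*np.pi, N, endpoint=False)
u = np.sin(10*x)                                  # a short-scale mode
exact = 10*np.cos(10*x)
dx = x[1] - x[0]

local = (np.roll(u, -1) - np.roll(u, 1))/(2*dx)   # 3-point central difference
k = 1j*np.fft.fftfreq(N)*N                        # i times integer wavenumbers
spectral = np.fft.ifft(k*np.fft.fft(u)).real      # global: uses every point

print(np.max(np.abs(local - exact)))              # O(1) error: about 5.3 here
print(np.max(np.abs(spectral - exact)))           # ~1e-12

The local stencil is not slightly wrong at the short scales; it is
wrong by half the answer.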

This all sounds very arcane, I'm sure, and the DoE wants you to think
it is arcane, but the physics are driven by the details of how the
shortest scales interact with the longest scales, and when you
misrepresent the differential operators with a crude approximation,
you are changing the exact physics you want to look at.

If you can't build computers that can actually do global FFT's, then
you will never be able to examine what the numerics are actually
doing, and I do truly believe that that's the way the supercomputer
bandwagon wants things. That way, they can publish gorgeous pictures
of the Rayleigh-Taylor instability, even while using methods of
unknowable accuracy.

It's important enough that, in the middle of his sales pitch, Bill
Dally feels compelled to whisk through an admission of what's really
going on, even though I strongly suspect that he has no idea of the
mathematical consequences of the methods he is using. This whole
"supercomputer" shenanigan (and with it miraculous GPU's that will
unravel the secrets of the universe with practically no bandwidth) is
built on a crude fudge. The fudge is SO convenient, though, that the
empire builders and software writers just want to get on with it,
without worrying that they might have lost contact with reality before
they even started.

> (2) If what you and Dally say is true, Robert, then you may be tilting
> at windmills. There may be no computationally efficient way of doing
> what you want.
>
> I don't believe this, because I do not equate computationally efficient
> to embarrassingly parallel.
>
> Also: even though locality really matters, nevertheless we have used
> that as an excuse to pessimize non-local activities.   At the very least
> we can influence the constant factor by removing these pessimizations.
> As I have attempted to do with my proposal for a scatter/gather based
> memory subsystem.

I may well be tilting at windmills. I've understood that all along.

Even if you can build computers with sufficient global bandwidth (can
afford the hardware), the energy costs might kill you. I don't really
know.

The FFT isn't local, but it diagonalizes the derivative (DOES TRULY
MAKE THE CALCULATION LOCAL), and it can be done with very great
efficiency, IFF you have sufficient global bandwidth.
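
"Diagonalizes" is meant literally. A small demonstration (mine; any
periodic difference operator is circulant, and the DFT diagonalizes
every circulant matrix):

import numpy as np

N = 8
C = np.zeros((N, N))
for i in range(N):                   # periodic central difference, h = 1
    C[i, (i+1) % N] = 0.5
    C[i, (i-1) % N] = -0.5
F = np.fft.fft(np.eye(N))            # DFT matrix
Finv = np.fft.ifft(np.eye(N))        # inverse DFT matrix
D = F @ C @ Finv                     # the operator, seen in Fourier space
off_diag = D - np.diag(np.diag(D))
print(np.max(np.abs(off_diag)))      # ~1e-16: one wavenumber at a time

In Fourier space the derivative is a pointwise multiply, and the
price of that locality is the global data motion of the FFT itself.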

Personally, I think that if I'm tilting at windmills, it's not because
of any actual physical limitation or any real financial limitation
that can't be overcome, but because dealing with the foundational
reality I keep harping on will interfere with the march of petaflops
and the endless flow of pretty pictures.

Robert.

From: Robert Myers on
On Jul 26, 7:29 pm, Chris Gray <c...(a)GraySage.com> wrote:
> Robert Myers <rbmyers...(a)gmail.com> writes:
> > First, you have to understand that what I think of as a huge problem
> > the DoE (for example) does not even recognize as a problem at all,
> > never mind as a huge problem.
>
> ... etc ...
>
> I know virtually nothing about the math and coding of these things, and so
> should just stay out of this. But:
>
> 1) surely someone reading this newsgroup has a simulation/whatever that can
>    be fairly easily converted from the "bad" locality assumptions into one
>    that runs globally. Can't you just hack some of the constants so that it
>    only has one "cell" instead of many? Don't the environments support things
>    like large arrays spanning multiple system nodes?
>
> 2) surely someone reading this newsgroup has some kind of access to a system
>    that can try to run the resulting simulation. I figure that back in the
>    old days at Myrias we could have found a way to do it. We could likely
>    have done a weekend run on 1024 nodes, longer on fewer nodes.
>
>    It could even be in the interests of hardware vendors to do this. If the
>    run proves Robert right, then there could be lots of new money coming to
>    research systems able to run globally.
>
> Likely one run wouldn't be enough, but it would at least be a start.
>

It's not as if any of this just bubbled up out of the ground like
methane around the Deepwater Horizon well.

People who need to get the physics right know how to do it:

http://www.o3d.org/abracco/annual_rev_3dnumerical.pdf

It would be fascinating to see what would happen if you tried to
reproduce these kinds of fundamental calculations using the kind of
differencing that Dally apparently takes for granted. It would be a
significant undertaking, and I don't think anyone would be surprised
that such calculations would yield results that would be wrong for
reasons that one could easily predict ahead of time: the real physics
of isotropic turbulence would be swamped by numerical diffusion.
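
Anyone who wants to watch numerical diffusion at work can run a toy
like this (mine; first-order upwind on pure advection, where the
exact solution just slides along unchanged):

import numpy as np

# Advect a square pulse with first-order upwind differencing. The
# scheme's truncation error acts like a viscosity ~ c*dx/2 that the
# physics never had.
N, c = 400, 1.0
x = np.linspace(0, 2*np.pi, N, endpoint=False)
dx = x[1] - x[0]
dt = 0.5*dx/c                        # CFL = 0.5
u = np.where(np.abs(x - np.pi) < 0.3, 1.0, 0.0)
for _ in range(2000):                # about 2.5 trips around the domain
    u = u - (c*dt/dx)*(u - np.roll(u, 1))
print(u.max())                       # well below 1.0: the pulse has smeared

No physics smeared that pulse; the differencing did.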

In my admittedly cynical view of the world, when you need to get it
right, and you know it will be obvious that you have gotten it wrong,
no one uses the methods that the DoE apparently designs its computers
around. If all you need is a calculation that isn't obviously wrong
without doing some subtle checking, crude differencing schemes are the
rule and highly accurate differencing schemes are the exception. At
one time, fluid mechanical calculations were so expensive that you
were happy to get something that just looked right.

We shouldn't have to live that way any more. I can't guarantee that
the kind of comparison you'd think would be obvious to do hasn't been
done, but the incentive to do it is low and the risks very high. As
someone who really does know has said to me in a friendly way: I'm
likely to make a lot of enemies.

A careful comparison of local, low-order differencing results with
global, high-order differencing results is an obvious thing to do, but
if there is any such study with any computer even remotely state-of-
the-art (and, remember, state of the art "supercomputers" are
*terrible* for FFT's), I'm not familiar with it. It's an obvious
thing to do.

Robert.

From: Andy Glew on
On 7/26/2010 5:24 PM, Robert Myers wrote:
> A careful comparison of local, low-order differencing results with
> global, high-order differencing results is an obvious thing to do, but
> if there is any such study with any computer even remotely state-of-
> the-art (and, remember, state of the art "supercomputers" are
> *terrible* for FFT's), I'm not familiar with it. It's an obvious
> thing to do.

Can you remind us (okay, me, the lazy one) what the scaling of bandwidth
requirements with problem size is for FFTs?

I.e. for an NxNxN 3D FFT, DP coordinates, what is the bandwidth? Phrase
it solely in terms of sending around 64-bit DP numbers.

And phrase it in terms of a dancehall configuration - all memory on one
side, all processors on the other. I.e. do NOT neglect bandwidth to
processor local memory.
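
Back-of-envelope, so someone can check me (assumes complex DP points
and a pencil decomposition: one full pass over the data per axis):

import math

def fft3d_traffic(N):
    points = N**3
    flops = 5*points*math.log2(points)   # standard 5*n*log2(n) FFT count
    # three passes, each reading and writing every point across the
    # dancehall; two 64-bit DP numbers (re, im) per point
    dp_words = 3*2*points*2
    return flops, dp_words, flops/(8*dp_words)

for N in (256, 1024, 4096):
    f, w, ai = fft3d_traffic(N)
    print("N=%d: %.2e flops, %.2e DP words, %.2f flops/byte" % (N, f, w, ai))

Arithmetic intensity grows only as log2(N) and sits near 1-2
flops/byte, which is starvation for a machine balanced at tens of
flops per byte of global bandwidth.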

Once you get to a certain level of the FFT, we have lots of nice cache
line sized accesses. Unless they are being distributed to different
processors. Which is where my scatter/gather interconnect would help.

We might want to create 3D tiles.

I'm staring at this thinking compression. Adjacent points are probably
close in value. Now, compression of cache lines doesn't help that much,
unless you have sequential access patterns even bigger than a cache
line. But in my scatter/gather world, any compression might help, since
it can be packed with other odd sized packets.
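
A toy of what I mean (invented numbers, a fixed quantum, nothing
clever):

import numpy as np

# Pack 8 adjacent DP values as one base double plus small int16
# deltas, assuming neighbors really are close; the quantum bounds
# the reconstruction error.
vals = 300.0 + 0.001*np.cumsum(np.random.randn(8))
base = vals[0]
quantum = 2.0**-15                     # hypothetical step size
deltas = np.round((vals - base)/quantum).astype(np.int16)
decoded = base + deltas.astype(np.float64)*quantum
print(np.max(np.abs(decoded - vals)))  # bounded by quantum/2
# 8 doubles = 64 bytes; 1 double + 8 int16s = 24 bytes in this toy

Lossy, of course, and only worth it if the packets really are odd
sized; in a cache-line world the savings would get rounded away.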