From: Robert Myers on
I'm slowly doing my catchup homework on what the national wisdom is on
bisection bandwidth. Not too surprisingly, there are plenty of people
out there who know that it's already a big problem, and that it is
only going to get bigger, as there is no Moore's Law for bandwidth.

In the meantime, I'd like to offer a succinct explanation as to why
this issue is fundamental and won't go away, no matter how many flops
and/or press releases and/or PowerPoint presentations about flops
appear.

States do not interact in linear time-invariant systems. There is
almost always a transformation available that will put the problem
into a representation where the computation is actually embarrassingly
parallel. This is a fundamental characteristic of such systems and is
not driven by bureaucratic requirements. Actually applying such a
transformation to expose the embarrassingly parallel nature of the
problem may not be convenient, as it is generally a global matrix
operation on a state vector, but the possibility always exists.
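
To make that concrete, here is a toy sketch in Python (numpy assumed;
the matrix and initial state are made up purely for illustration, this
is not anyone's production code):

import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))     # made-up LTI system matrix, dx/dt = A x
x0 = rng.standard_normal(n)         # made-up initial state

# The global step: eigendecomposition A = V diag(lam) V^-1.
lam, V = np.linalg.eig(A)
z0 = np.linalg.solve(V, x0)         # transform the state once

# In the transformed representation every mode evolves on its own:
# z_i(t) = exp(lam_i * t) * z_i(0) -- embarrassingly parallel, with no
# communication between modes whatsoever.
t = 0.7
z_t = np.exp(lam * t) * z0

# Map back to the original coordinates when (and only when) needed.
x_t = (V @ z_t).real
print(x_t)

The eigendecomposition is exactly the global, possibly inconvenient
operation I mentioned; once it has been done, each mode advances with
no communication at all.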

No such transformation exists for general nonlinear systems. Various
approximations to such transformations may exist that amount to
linearization around some particular system state, but, once the
system has changed in interesting ways, the transformation will cease
even to be approximately valid. In general, nonlinear systems mix
states in unpredictable ways such that localization of the computation
is not possible.
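
As a toy illustration of why linearization stops helping (numpy again;
the right-hand side is a made-up pendulum-like system, purely
illustrative):

import numpy as np

def f(x):
    # made-up nonlinear right-hand side (pendulum-like toy system)
    return np.array([x[1], -np.sin(x[0])])

def jacobian(f, x, eps=1e-6):
    # simple forward-difference Jacobian of f at x
    fx = f(x)
    J = np.zeros((len(x), len(x)))
    for j in range(len(x)):
        dx = np.zeros(len(x))
        dx[j] = eps
        J[:, j] = (f(x + dx) - fx) / eps
    return J

x_ref = np.array([0.1, 0.0])        # state we linearize around
J = jacobian(f, x_ref)

x_far = np.array([2.0, 0.0])        # the system has since moved here
exact  = f(x_far)
linear = f(x_ref) + J @ (x_far - x_ref)
print(exact, linear)                # the two now disagree badly

Near x_ref the linear model is fine; far from it, it isn't, and an
interesting nonlinear simulation spends its time precisely in the
regime where it isn't.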

If you are using finite differences, which are generally a convolution
of finite support over a state, you can deceive yourself into thinking
that you have successfully localized the problem, but the process of
approximating a differential operator by such a localized convolution
will *itself* mix states in ways that are unphysical and unrelated to
the actual mathematics of the problem. I suspect that this key
deception is important in allowing the continued use of "supercomputers"
that, on the face of it, are unsuitable for nonlinear systems, because
general nonlinear systems globally mix states. The local nature of
the computation is an artifact of the discretization scheme that bears
no necessary relationship to the physics or mathematics of the actual
problem.
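
A sketch of what I mean by a localized convolution (numpy, purely
illustrative):

import numpy as np

n = 64
h = 2 * np.pi / n
x = np.arange(n) * h
u = np.sin(x)                                  # made-up smooth state

# The standard 3-point second-difference stencil, applied as a
# convolution of finite support (periodic wrap for simplicity).
stencil = np.array([1.0, -2.0, 1.0]) / h**2
u_pad = np.r_[u[-1], u, u[0]]
d2u = np.convolve(u_pad, stencil, mode='valid')

# The true second derivative of sin is -sin; the stencil gets close,
# but it only ever looks at three neighboring points, so whatever
# coupling it expresses is a property of the discretization, not of
# the differential operator it stands in for.
print(np.max(np.abs(d2u + np.sin(x))))

The three-point stencil is a perfectly good approximation for a smooth
function like this; the point is that its locality belongs to the
stencil, not to the operator.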

In some systems, the resulting errors may be acceptable. For general
strongly nonlinear systems, there is no a priori way that I know of
to establish that the errors are acceptable. One common property of
"stable" differencing schemes is that they artificially smooth
solutions, so that even testing by changing grid resolution may not
reveal a problem.
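
A toy example of that smoothing (numpy; first-order upwind differencing
of the 1-D advection equation, about the simplest "stable" scheme there
is, and again only a sketch):

import numpy as np

n, c = 200, 1.0
dx = 1.0 / n
dt = 0.5 * dx / c                              # CFL number 0.5: "stable"
u = np.where(np.arange(n) * dx < 0.5, 1.0, 0.0)   # sharp step profile

for _ in range(200):
    # first-order upwind update for u_t + c u_x = 0 (periodic boundary)
    u = u - c * dt / dx * (u - np.roll(u, 1))

# The exact solution is still a perfectly sharp step, just advected by
# c * 200 * dt = 0.5; the computed front is smeared over many cells.
# The smearing shrinks only slowly under grid refinement (first order),
# so a resolution test can look "converged" while still smoothing.
print(u)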

If there is a mathematician in the house and I have made an error, I'm
sure I will be informed of it. Adequate global bandwidth is not
merely desirable; it is a requirement for simulating nonlinear
systems.

Robert.

From: Terje Mathisen "terje.mathisen at tmsw.no" on
Robert Myers wrote:
> I'm slowly doing my catchup homework on what the national wisdom is on
> bisection bandwidth. Not too surprisingly, there are plenty of people
> out there who know that it's already a big problem, and that it is
> only going to get bigger, as there is no Moore's Law for bandwidth.

Huh?

Sure there is, it is driven by the same size shrinks as regular RAM and
CPU chips have enjoyed.

I guess the real problem is that you'd like the total bandwidth to scale
not just with the link frequencies but even faster so that it also keeps
up with the increasing total number of ports/nodes in the system, without
overloading the central mesh?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Robert Myers on
On Mar 14, 5:43 am, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:
> Robert Myers wrote:
> > I'm slowly doing my catchup homework on what the national wisdom is on
> > bisection bandwidth. Not too surprisingly, there are plenty of people
> > out there who know that it's already a big problem, and that it is
> > only going to get bigger, as there is no Moore's Law for bandwidth.
>
> Huh?
>
> Sure there is, it is driven by the same size shrinks as regular RAM and
> CPU chips have enjoyed.
>
> I guess the real problem is that you'd like the total bandwidth to scale
> not just with the link frequencies but even faster so that it also keeps
> > up with the increasing total number of ports/nodes in the system, without
> overloading the central mesh?
>
At the chip (or maybe chip carrier) level, there are interesting
things you can do because of decreased feature sizes, as we have
recently discussed.

It's conceivable that such trickery will allow a computer with better
global scaling of bandwidth, but that is not, so far as I know, an
automatic result, unlike Moore's Law, which has allowed cramming more
and more flops into a smaller and smaller space while leaving global
bandwidth as the unspoken, unsolved, and perhaps even unsolvable
problem.

Robert.

From: MitchAlsup on
On Mar 14, 12:15 pm, Robert Myers <rbmyers...(a)gmail.com> wrote:
> On Mar 14, 5:43 am, Terje Mathisen <"terje.mathisen at tmsw.no">
> > I guess the real problem is that you'd like the total bandwidth to scale
> > not just with the link frequencies but even faster so that it also keeps
> > up with the increasing total number of ports/nodes in the system, without
> > overloading the central mesh?
>
> At the chip (or maybe chip carrier) level, there are interesting
> things you can do because of decreased feature sizes, as we have
> recently discussed.

One achieves maximal "routable" bandwidth at the "frame" scale. With
today's board technologies, this "frame" scale occurs at around 1
cubic meter.

Consider a 1/2-square-meter motherboard with "several" CPU nodes and 16
bidirectional (about) byte-wide ports running at 6-10 GT/s. Now
consider a backplane that simply couples this 1/2-square-meter
motherboard to another 1/2-square-meter DRAM-carrying board, also with
16 bidirectional (almost) byte-wide ports running at the same
frequencies. Except, this time, the DRAM boards are perpendicular to
the CPU boards. With this arrangement, we have 16 CPU-containing
motherboards fully connected to 16 DRAM-containing motherboards by
256 (almost) byte-wide connections running at 6-10 GT/s: 1 cubic meter,
about the actual size of a refrigerator. {Incidentally, this kind of
system would have about 4 TB/s of bandwidth to about 4 TB of actual
memory.}
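
A back-of-the-envelope check of those numbers, under the stated
assumptions only (roughly 1 byte per transfer per link, 8 GT/s as the
midpoint of 6-10 GT/s, and counting both directions of each
bidirectional link):

cpu_boards  = 16
dram_boards = 16
links       = cpu_boards * dram_boards    # 256 point-to-point links
gts         = 8e9                         # midpoint of 6-10 GT/s
bytes_per_t = 1                           # "(almost) byte wide"

one_way   = links * gts * bytes_per_t     # ~2 TB/s in each direction
aggregate = 2 * one_way                   # ~4 TB/s total, bidirectional
print(one_way, aggregate)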

Once you get larger than this, all of the wires actually have to exist
as wires (between "frames"), not just as traces of copper on a board or
through a connector, and one becomes wire-bound connecting frames.

Mitch
From: Anton Ertl on
MitchAlsup <MitchAlsup(a)aol.com> writes:
>Consider a 1/2-square-meter motherboard with "several" CPU nodes and 16
>bidirectional (about) byte-wide ports running at 6-10 GT/s. Now
>consider a backplane that simply couples this 1/2-square-meter
>motherboard to another 1/2-square-meter DRAM-carrying board, also with
>16 bidirectional (almost) byte-wide ports running at the same
>frequencies. Except, this time, the DRAM boards are perpendicular to
>the CPU boards. With this arrangement, we have 16 CPU-containing
>motherboards fully connected to 16 DRAM-containing motherboards by
>256 (almost) byte-wide connections running at 6-10 GT/s: 1 cubic meter,
>about the actual size of a refrigerator.

I compute 1/2m x 1/2m x 1/2m = 1/8 m^3.

Where have I misunderstood you?

But that size is the size of a small freezer around here (typical
width 55-60cm, depth about the same, and the height of the small ones
is around the same, with the normal-sized ones at about 1m height).

Hmm, couldn't you have DRAM boards on both sides of the mainboard (if
you find a way to mount the mainboard in the middle and make it strong
enough)? Then you could have a computer like a normal-size fridge:-).

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html