From: Robert Myers on
On Jul 26, 10:34 pm, Andy Glew <"newsgroup at comp-arch.net"> wrote:
> On 7/26/2010 5:24 PM, Robert Myers wrote:
>
> > A careful comparison of local, low-order differencing results with
> > global, high-order differencing results is an obvious thing to do, but
> > if there is any such study with any computer even remotely state-of-
> > the-art (and, remember, state of the art "supercomputers" are
> > *terrible* for FFT's), I'm not familiar with it.  It's an obvious
> > thing to do.
>
> Can you remind us (okay, me, the lazy one) what the scaling of bandwidth
> requirements with problem size is for FFTs?
>
FFT pseudospectral code on the Cray-1 (~1 byte/flop--Mitch can correct
me, I'm sure) was memory-bound in vector mode. The Cray moved 64-bit
words.
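
To put rough numbers on the scaling question: a back-of-envelope sketch
below, where 5*M*log2(M) is the textbook radix-2 flop count and the two
traffic cases (no cache reuse at all vs. pencils that fit in local
store) are my own simplifications, not measurements from any machine.

/* Bytes/flop estimate for an NxNxN complex-double FFT done as three
 * sets of 1D transforms.  16 bytes per complex double.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    for (int n = 256; n <= 4096; n *= 2) {
        double ntot  = (double)n * n * n;       /* grid points        */
        double flops = 5.0 * ntot * log2(ntot); /* whole 3D transform */

        /* Dancehall, no reuse: every butterfly stage streams the full
         * array in and out of memory.                                 */
        double bytes_worst = 2.0 * 16.0 * ntot * log2(ntot);

        /* Each 1D pencil fits in local store: one read and one write
         * of the array per dimension, i.e. three full sweeps.         */
        double bytes_best = 3.0 * 2.0 * 16.0 * ntot;

        printf("N=%4d  worst %.2f B/flop  best %.2f B/flop\n",
               n, bytes_worst / flops, bytes_best / flops);
    }
    return 0;
}

The no-reuse case is a constant ~6.4 bytes/flop; the blocked case falls
off as 1/log2(N), so bigger problems ease the bandwidth pressure only
if you can block them.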

> I.e. for an NxNxN 3D FFT, DP coordinates, what is the bandwidth?  Phrase
> it solely in terms of sending around 64-bit DP numbers.
>
> And phrase it in terms of a dancehall configuration - all memory on one
> side, all processors on the other.  I.e. do NOT neglect bandwidth to
> processor local memory.
>
> Once you get to a certain level of the FFT, we have lots of nice cache
> line sized accesses. Unless they are being distributed to different
> processors.  Which is where my scatter/gather interconnect would help.
>
Lots of possibilities for cleverness, in particular by simply renaming
rather than moving--e.g. who cares if things are physically in
bit-reversed order under certain conditions. Also, you don't need a
general interconnect if what you really want to do is FFTs.
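
To make the renaming point concrete, here is the index permutation a
decimation-in-time radix-2 FFT would otherwise have to apply in a
separate reordering pass (a sketch of the index math only, function
name mine, not anyone's production FFT):

/* Element i of the output sits where element bitrev(i, log2n) "should"
 * be.  If the consumer -- say a pointwise multiply feeding an inverse
 * transform that accepts bit-reversed input -- doesn't care about the
 * ordering, this whole pass of memory traffic can be renamed away.
 */
static unsigned bitrev(unsigned i, unsigned log2n)
{
    unsigned r = 0;
    for (unsigned b = 0; b < log2n; b++) {
        r = (r << 1) | (i & 1);
        i >>= 1;
    }
    return r;
}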

> We might want to create 3D tiles.
>
Very likely. So much depends on the details of the interconnect.
I've done some speculation on possibilities, which I will pass along.

> I'm staring at this thinking compression.  Adjacent points are probably
> close in value.  Now, compression of cache lines doesn't help that much,
> unless you have sequential access patterns even bigger than a cache
> line.  But in my scatter/gather world, any compression might help, since
> it can be packed with other odd sized packets.

The really tormented sections of data are likely to be compact in
physical space, if that helps. In spectral space, things will be wild
because of phase. One might learn a lot about turbulence just by
looking at the possibilities for compression. Might be some serious
prizes or at least some noteworthy papers along that line of inquiry
if you found anything interesting.

I'm sorry if this seems a tad superficial. This discussion takes a
lot out of me.

Robert.

From: Brett Davis on
In article <87fwz6xluk.fsf(a)NaN.sparse.dyndns.org>,
Jason Riedy <jason(a)acm.org> wrote:

> And Brett Davis writes:
> > The company that bought Cray had a machine that was 1000 way
> > interleaved to deal with 1000 cycle latencies from RAM.
> > That design is dead, and buried?
>
> Nope. The Cray XMT exists; I use the one at PNNL almost daily. The
> XMT2 is on its way. And your numbers are off by a factor of 10. The
> Tera machine had ~140 cycle memory latency (iirc) and carried 128
> threads per cpu. The XMT's latency is far worse, and you have fewer
> useful threads per processor (user code limited to 100). Some of that
> is slated to change with the XMT2, but I don't know (and wouldn't be
> able to say yet) hard numbers.

I stand corrected.
I based that conclusion on this model, which sold one unit:
http://en.wikipedia.org/wiki/Cray_MTA-2

That info is old, however; a new model is out:
http://en.wikipedia.org/wiki/Cray_XMT

It is of course limited by the socket it uses, another reason I
thought it would stay dead. The 1000 threads figure was a rumor for a
proposed follow-up; it's what you would need, at the 10 GHz clock
speeds CPUs were predicted to be running at today, to completely hide
1000-cycle memory latency.

This should be about as close to what Robert Myers wants as he
can get.

> I'm not making any claims about commercial viability, however. But it's
> far easier to obtain decent performance on massive-scale combinatorial
> graph analysis on the XMT than elsewhere... at the moment. Amdahl
> giggles a bit at practical potential...
>
> Jason
From: Brett Davis on
In article <op.vgf7hoq9ss38k4(a)khagan.ttx>,
"Ken Hagan" <K.Hagan(a)thermoteknix.com> wrote:

> On Mon, 26 Jul 2010 07:49:24 +0100, Brett Davis <ggtgp(a)yahoo.com> wrote:
>
> > ... dictionary-based AI, the last remaining approach
> > to emulating the human mind, as all the other approaches have
> > failed. (Think ELIZA with a terabyte database.)
>
> That would be "last remaining that I've thought of", with a strong
> implication that it has survived this long simply because the other
> failures were tried first.

The PBS series NOVA did a show on AI. Back in the 60s/70s there were
a dozen major promising approaches and another dozen on the fringe.
As faster computers and more PhDs were thrown at the problems, those
approaches were proven wrong one by one.

The dictionary approach was not one of the promising ones; any
PhD would tell you that the computation required would not be available
for 50 to 100 years, and that someone would come up with a breakthrough
long before the dictionary approach became viable even to test properly.

Here we are 50 years later... ;)

> > But ultimately this is a kludge to get the same results that
> > the human mind does, but the human mind is massively parallel
> > and soft plugboard wired up between neurons.
>
> I think we can be pretty certain that the human mind is not a *soft*
> plugboard on the sort of timescales that it solves intellectual problems.
> On the question of its parallelism, I'll wait until someone comes up with
> a plausible model for how it works. (Come to that, it doesn't make much
> sense to take lessons in computer architecture from the brain either, for
> the same reason.)

I did not mean to imply real-time rewiring; I assume a few thousand
neurons are rewired each night as you sleep. Unused connections die,
and underutilized neurons seek out new connections, semi-randomly.
Hence parts of the brain that fall out of use, because a limb or other
sense organ is lost, are reassigned over time, which is what scientists
see. All a SWAG, of course.

We know how picture memories are stored in the brain, and retrieved.
It's an extreme form of pattern matching against similar images, scaled.
And the images have idea tags for searching.

This technique is quite similar to a compressed dictionary lookup
with cross correlation to related ideas.

Intelligence would seem to be nothing more than a critical mass of
data plus connections, allowed to self-explore and learn. Quite a shock.

No God or soul would seem to be needed; those are created afterwards,
to explain that which cannot be explained. (This is an old idea; it
comes up often in Bible studies.)

Brett
From: Benny Amorsen on
Andy Glew <"newsgroup at comp-arch.net"> writes:

> Network routing (big routers, lots of packets).

With IPv4 you can usually get away with routing on the top 24 bits, plus
a bit of special handling for the few local routes longer than 24 bits.
That's a mere 16 MB table if you can make do with at most 256
"gateways".

You can simply replicate the whole routing table to local memory on all
processors; updates are a challenge, but it's usually OK if packets take
the wrong route for a few ms.
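
For concreteness, a sketch of what that table looks like (hypothetical
code, names mine, just to make the arithmetic visible: 2^24 one-byte
entries = 16 MB, indexed by the top 24 bits of the destination):

#include <stdint.h>

/* One byte per /24 selects one of at most 256 next hops; routes longer
 * than /24 take the separate special-case path mentioned above.
 */
static uint8_t next_hop[1u << 24];          /* 16 MB, replicated per CPU */

static inline uint8_t route(uint32_t dst_ip)
{
    return next_hop[dst_ip >> 8];           /* top 24 bits as the index  */
}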


/Benny

From: Terje Mathisen "terje.mathisen at tmsw.no" on
Andy Glew wrote:
> And phrase it in terms of a dancehall configuration - all memory on one
> side, all processors on the other. I.e. do NOT neglect bandwidth to
> processor local memory.

Nice way of putting it: All accesses are just as expensive, i.e. very
expensive.
>
> Once you get to a certain level of the FFT, we have lots of nice cache
> line sized accesses. Unless they are being distributed to different
> processors. Which is where my scatter/gather interconnect would help.
>
> We might want to create 3D tiles.

This sounds a lot more reasonable: 1D FFT is very nice from a cache
viewpoint, while 2D and 3D are much worse.

Could we, instead of 3D tiles, use 3-way address interleaving, similar
to what some graphics formats use?

I.e. Larrabee, which has (at least for the time being) been repurposed
as an HPC platform, has a couple of opcodes for two-way and three-way
bit interleaving.

Using this, my naive expectation would be that you would get
enormously better cache behavior than from a regular (power-of-two-
stride) 2D or 3D FFT?
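
For reference, a plain-C sketch of the three-way interleave I mean (the
usual Morton-order bit tricks, function names mine; the Larrabee
opcodes would presumably collapse this to an instruction or two):

#include <stdint.h>

/* Spread the low 10 bits of x so that bit i lands at bit 3*i. */
static uint32_t spread3(uint32_t x)
{
    x &= 0x000003ff;
    x = (x ^ (x << 16)) & 0xff0000ff;
    x = (x ^ (x <<  8)) & 0x0300f00f;
    x = (x ^ (x <<  4)) & 0x030c30c3;
    x = (x ^ (x <<  2)) & 0x09249249;
    return x;
}

/* Morton (Z-order) address for a point in a grid of up to 1024^3. */
static uint32_t morton3(uint32_t x, uint32_t y, uint32_t z)
{
    return spread3(x) | (spread3(y) << 1) | (spread3(z) << 2);
}

Indexing the grid by morton3(x, y, z) keeps neighbours in all three
dimensions close together in the address space, which is exactly the
cache behavior I am hoping for.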
>
> I'm staring at this thinking compression. Adjacent points are probably
> close in value. Now, compression of cache lines doesn't help that much,
> unless you have sequential access patterns even bigger than a cache
> line. But in my scatter/gather world, any compression might help, since
> it can be packed with other odd sized packets.

I'm afraid that during the FFT run you would convert all the numbers,
in place, into values where your nice local compressibility might disappear.

I have more faith in uniform compression, i.e. making do with fewer
bits/node, simply because that is easy to emulate, and to check where
numerical instability would rear its ugly head.
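
By emulation I mean something along these lines (a quick sketch, name
mine; truncation only, no proper rounding, applied to every value
between passes):

#include <stdint.h>
#include <string.h>

/* Keep only `keep` of the 52 fraction bits of a double (0 <= keep <= 52),
 * zeroing the rest.  Run the solver at several settings of `keep` and
 * watch where the answers start to fall apart.
 */
static double truncate_mantissa(double v, int keep)
{
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);
    bits &= ~((UINT64_C(1) << (52 - keep)) - 1);
    memcpy(&v, &bits, sizeof bits);
    return v;
}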

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"