From: Chris Gray on
Robert Myers <rbmyersusa(a)gmail.com> writes:

> First, you have to understand that what I think of as a huge problem
> the DoE (for example) does not even recognize as a problem at all,
> never mind as a huge problem.

... etc ...

I know virtually nothing about the math and coding of these things, and so
should just stay out of this. But:

1) surely someone reading this newsgroup has a simulation/whatever that can
be fairly easily converted from the "bad" locality assumptions into one
that runs globally. Can't you just hack some of the constants so that it
only has one "cell" instead of many? Don't the environments support things
like large arrays spanning multiple system nodes?

2) surely someone reading this newsgroup has some kind of access to a system
that can try to run the resulting simulation. I figure that back in the
old days at Myrias we could have found a way to do it. We could likely
have done a weekend run on 1024 nodes, longer on fewer nodes.

It could even be in the interests of hardware vendors to do this. If the
run proves Robert right, then there could be lots of new money coming to
research systems able to run globally.

Likely one run wouldn't be enough, but it would at least be a start.

--
Chris Gray
From: Rick Jones on
Chris Gray <cg(a)graysage.com> wrote:
> Robert Myers <rbmyersusa(a)gmail.com> writes:

> > First, you have to understand that what I think of as a huge problem
> > the DoE (for example) does not even recognize as a problem at all,
> > never mind as a huge problem.

> ... etc ...

> I know virtually nothing about the math and coding of these things, and so
> should just stay out of this. But:

> 1) surely someone reading this newsgroup has a simulation/whatever that can
> be fairly easily converted from the "bad" locality assumptions into one
> that runs globally. Can't you just hack some of the constants so that it
> only has one "cell" instead of many? Don't the environments support things
> like large arrays spanning multiple system nodes?

> 2) surely someone reading this newsgroup has some kind of access to a system
> that can try to run the resulting simulation. I figure that back in the
> old days at Myrias we could have found a way to do it. We could likely
> have done a weekend run on 1024 nodes, longer on fewer nodes.

> It could even be in the interests of hardware vendors to do this. If the
> run proves Robert right, then there could be lots of new money coming to
> research systems able to run globally.

> Likely one run wouldn't be enough, but it would at least be a start.

Perhaps the folks at ScaleMP?

rick jones
--
firebug n, the idiot who tosses a lit cigarette out his car window
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
From: James Van Buskirk on
<nmm1(a)cam.ac.uk> wrote in message
news:i2m3eu$m39$1(a)smaug.linux.pwf.cam.ac.uk...

> 1-D FFTs are pretty dire from a caching viewpoint, because there is
> virtually no data reuse, except in some special cases and with code
> designed to use them. Or have you found a new approach?

> 2- and 3-D FFTs are quite good for caching, if the code is changed
> to make use of that. For a weak meaning of 'quite good', true. And,
> yes, you need a blocked transpose operation.

The difference between a 1-D FFT and the others is that for a 1-D FFT you
have to perform a preliminary transpose, because the data that will be
combined in the first half of the passes is scattered all over the
initial array; so you need 3 transposes, assuming cache is big enough
to hold a sqrt(N)-sized array. For the others, one dimension's worth
of data is already contiguous in memory, so you only need one transpose
per dimension. Either way, essentially all access to main memory (as
opposed to cache) takes place in the transpose operations.
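To make the three-transpose structure concrete, here is a minimal sketch of
the transpose-based ("six-step") 1-D FFT in Python. It is an illustration
only: the names `dft` and `six_step_fft` are mine, and a naive O(n^2) DFT
stands in for the cache-resident sub-FFTs that a real implementation would
use. The point is where the transposes fall, since those are the steps that
actually touch main memory.

```python
import cmath

def dft(row):
    """Naive O(n^2) DFT, standing in for a cache-resident sub-FFT."""
    n = len(row)
    w = -2j * cmath.pi / n
    return [sum(row[j] * cmath.exp(w * j * k) for j in range(n))
            for k in range(n)]

def six_step_fft(x, n1, n2):
    """Length-(n1*n2) DFT via the transpose algorithm: transpose,
    n1 row FFTs of length n2, twiddle, transpose, n2 row FFTs of
    length n1, transpose back to natural order."""
    N = n1 * n2
    assert len(x) == N
    # Transpose 1: gather the strided elements x[n1*j2 + j1] into rows,
    # so each length-n2 sub-transform works on contiguous data.
    m = [[x[n1 * j2 + j1] for j2 in range(n2)] for j1 in range(n1)]
    y = [dft(row) for row in m]              # n1 FFTs of length n2
    for j1 in range(n1):                     # twiddle factors w_N^(j1*k2)
        for k2 in range(n2):
            y[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / N)
    # Transpose 2: make the length-n1 sub-transforms contiguous too.
    t = [[y[j1][k2] for j1 in range(n1)] for k2 in range(n2)]
    z = [dft(row) for row in t]              # n2 FFTs of length n1
    # Transpose 3: scatter back to natural output order X[n2*k1 + k2].
    return [z[k2][k1] for k1 in range(n1) for k2 in range(n2)]
```

With sqrt(N)-sized rows each sub-FFT fits in cache, and the three gathers
and scatters above are exactly the three main-memory passes described.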

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end