From: nmm1 on
In article <04cb46947eo6mur14842fqj45pvrqp61l1(a)4ax.com>,
George Neuner <gneuner2(a)comcast.net> wrote:
>
>ISTM bandwidth was the whole point behind pipelined vector processors
>in the older supercomputers. Yes there was a lot of latency (and I
>know you [Mitch] and Robert Myers are dead set against latency too)
>but the staging data movement provided a lot of opportunity to overlap
>with real computation.

Yes.

>YMMV, but I think pipeline vector units need to make a comeback. I am
>not particularly happy at the thought of using them again, but I don't
>see a good way around it.

NO chance! It's completely infeasible - they were dropped because
the vendors couldn't make them for affordable amounts of money any
longer.


Regards,
Nick Maclaren.
From: jacko on
On 20 July, 15:41, n...(a)cam.ac.uk wrote:
> NO chance!  It's completely infeasible - they were dropped because
> the vendors couldn't make them for affordable amounts of money any
> longer.

Maybe he needs an FPGA card with many single-cycle Booth multipliers
on chip. A bit slow due to routing delays, but highly parallel. There
really should be a way to queue mul-acc pairs with a reset to zero
(or the nilpotent).
From: Andy Glew "newsgroup at comp-arch.net" on
On 7/19/2010 11:59 AM, Robert Myers wrote:
> David L. Craig wrote:
>
>> I am new to comp.arch and so am unclear of the pertinent history of this
>> discussion

This is a bit of a tired discussion. Not because the solution is known,
but because the solutions that we think we know aren't commercially
feasible. We need to break out of the box.

We welcome new blood, and new ideas.

>> Also, why single out floating point bandwidth? For instance, what about the
>> maximum number of parallel RAM acceses architectures can support, which has
>> major impacts on balancing cores' use with I/Os use?
>
> Computation is more or less a solved problem. Most of the challenges
> left have to do with moving data around, with latency and not bandwidth
> having gotten the lion's share of attention (for good reason). I believe
> that moving data around will ultimately be the limiting factor with
> regard to reducing power consumption.

I'm with you, David. Maximizing what I call the MLP, the memory level
parallelism, the number of DRAM accesses that can be concurrently in
flight, is one of the things that we can do.
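
To make the MLP point concrete, here is a minimal toy sketch (mine, not
from any product; the array size and chain counts are arbitrary
assumptions): a single dependent pointer chain exposes the full DRAM
latency on every access, while several independent chains give the
hardware misses it can overlap.

/* Toy MLP demo: chase K independent pointer chains through an array
 * much larger than cache.  With K=1 each load depends on the previous
 * one, so every access pays full memory latency; with larger K several
 * cache-missing loads can be in flight at once.  Sizes are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1u << 24)   /* 16M nodes (~128 MB), far larger than cache */
#define STEPS (1u << 22)

static size_t next[N];

int main(void)
{
    /* Build a pseudo-random permutation so almost every access misses. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    for (int k = 1; k <= 8; k *= 2) {      /* k independent chains */
        size_t cur[8];
        for (int c = 0; c < k; c++) cur[c] = (size_t)c * (N / 8);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t s = 0; s < STEPS; s++)
            for (int c = 0; c < k; c++)    /* chains don't depend on each other */
                cur[c] = next[cur[c]];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("chains=%d  ns/access=%.1f\n", k, ns / ((double)STEPS * k));
        if (cur[0] == (size_t)-1) puts("unreachable"); /* keep loops live */
    }
    return 0;
}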

But Robert's comment is symptomatic of the discussion. Robert says most
work has been on latency, by which I think that he means caches, and
maybe integrating the memory controller. I say MLP to Robert, but he
glides on by.

Robert is interested in brute force bandwidth. Mitch points out that
modern CPUs have 1-4 DRAM channels, which defines the bandwidth that you
get, assuming fairly standard JEDEC DRAM interfaces. GPUs may have more
channels (6 is a possibility) and wider ones, so higher bandwidth is
achievable.

Me, I'm just the MLP guy: give me a certain number of channels and
bandwidth, I try to make the best use of them. MLP is one of the ways
of making more efficient use of whatever limited bandwidth you have. I
guess that's my mindset - making the most of what you have. Not because
I don't want to increase the overall memory bandwidth. But because I
don't have any great ideas on how to do so, apart from
a) More memory channels
b) Wider memory channels
c) Memory channels/DRAMs that handle short bursts/high address
bandwidth efficiently
d) DRAMs with a high degree of internal banking
e) aggressive DRAM scheduling
Actually, c, d, and e are really ways of making more efficient use of
bandwidth, i.e. preventing pins from going idle because the burst length
is giving you a lot of data you don't want (a back-of-envelope sketch of
that waste follows this list).
f) stacking DRAMs
g) stacking DRAMs with an interface chip such as Tom Pawlowski of
Micron proposes, and a new abstract DRAM interface, enabling all of the
good stuff above but keeping DRAM a commodity
h) stacking DRAMs with an interface chip and a processor chip (with
however many processors you care to build).
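
To put a number on that burst-length waste, a trivial back-of-envelope
sketch (the 64-byte burst and 8-byte word are just typical assumptions,
not anybody's spec):

/* Fraction of the burst that is useful when a code touches only one
 * 8-byte word per 64-byte burst / cache line. */
#include <stdio.h>

int main(void)
{
    const double word_bytes  = 8.0;   /* one double per access      */
    const double burst_bytes = 64.0;  /* typical minimum DRAM burst */

    printf("pin efficiency = %.1f%%\n", 100.0 * word_bytes / burst_bytes);
    /* 12.5%: seven eighths of what crosses the pins is thrown away,
     * which is exactly the waste that c), d) and e) try to avoid. */
    return 0;
}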

Actually, I think that it is inaccurate to say that Robert Myers just
wants brute force memory bandwidth. I think that he would be unhappy
with a machine that achieved brute force memory bandwidth by having 4KiB
burst transfers - because while that machine might be good for DAXPY, it
would not be good for most of the codes Robert wants.
I think that Robert does not want brute force sequential bandwidth.
I think that he needs random access pattern bandwidth.

Q: is that so, Robert?


> Even leaving aside justifying why expensive bandwidth is not optional,
> there is little precedent here for in-depth explorations of blue-sky
> proposals. A fair fraction of the blue-sky propositions brought here
> can't be taken seriously, and my sense of this group is that it wants to
> keep the thinking mostly inside the box, not for want of imagination,
> but to avoid science fiction and rambling, uninformed discussion.

I'm game for blue-sky SCIENCE FICTION, i.e. imaginings based on
science, that have some possibility of being true.

I'm not so hot on science FANTASY, imaginings based on wishful thinking.
From: MitchAlsup on
On Jul 20, 10:31 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:
> Actually, I think that it is inaccurate to say that Robert Myers just
> wants brute force memory bandwidth.  

Undoubtedly correct.

As to why vector machines fell out of fashion: vectors were architected
to absorb memory latency. Early Crays had 64-entry vectors and 20-ish
cycle main memory. Later, as the CPUs got faster and the memories
larger and more interleaved, the latency, in cycles, to main memory
increased. And once the Vector machines got to where main memory
latency, in cycles, was greater than the vector length, their course
had been run.

Nor can OoO machines create vector performance rates unless the
latency to <whatever layer in the memory hierarchy supplies the data>
can be absorbed by the size of the execution window. Thus, the
execution window needs to be 2.5-3 times the number of flops being
crunched per loop iteration. We are at the point where, even when the
L2 cache supplies data, there are too many latency cycles for the
machine to be able to efficiently strip mine data. {And in most cases
the cache hierarchy is not designed to efficiently strip mine data,
either.}
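
Mitch's sizing argument is essentially Little's law; written out (my
framing, with purely illustrative numbers - the 2.5-3x factor is his):

% Operations that must be in flight to hide memory latency:
%   W  >=  L_mem (cycles)  x  T_FP (flops/cycle)
\[
  W \;\gtrsim\; L_{\mathrm{mem}} \times T_{\mathrm{FP}}
\]
% Example: ~14 cycles of L2 latency at 4 flops/cycle already needs some
% 56 operations in flight, and every flop drags along loads, stores and
% address arithmetic, which is roughly where the 2.5-3x multiple of the
% flops per iteration comes from.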

Neither
a) high latency with adequate bandwidth
nor
b) low latency with inadequate bandwidth
enables vector execution rates--that is, getting the most out of the FP
computation capabilities.

Mitch
From: Robert Myers on
Andy Glew wrote:

> I think that Robert does not want brute force sequential bandwidth.
> I think that he needs random access pattern bandwidth.
>
> Q: is that so, Robert?
>


The problems that I personally am most interested in are fairly
"crystalline": very regular access patterns across very regular data
structures.

So the data access patterns are neither random nor sequential. The fact
that processors and memory controllers want to deal with cache lines and
not with 8-byte words is a serious constraint. No matter how you
arrange a multi-dimensional array, some kinds of access are going to
waste huge amounts of bandwidth, even though the access pattern is far
from random.
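
To make that concrete, a toy sketch (mine; sizes are arbitrary): with a
row-major array, streaming along a row uses every byte of each cache
line, while walking a column fetches a whole 64-byte line for each
8-byte element, so roughly 7/8 of the transferred bandwidth is wasted
even though the access pattern is perfectly regular.

/* Row walk vs. column walk over a row-major 2D array.  Both touch the
 * same elements; the column walk wastes most of each cache line. */
#include <stdio.h>
#include <stdlib.h>

#define ROWS 4096
#define COLS 4096

int main(void)
{
    double *a = calloc((size_t)ROWS * COLS, sizeof *a);  /* ~128 MB */
    if (!a) return 1;

    double sum = 0.0;

    /* Row walk: consecutive addresses, full cache-line utilization.  */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[(size_t)i * COLS + j];

    /* Column walk: stride of COLS*8 bytes, one useful word per line. */
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[(size_t)i * COLS + j];

    printf("%f\n", sum);   /* keep the loops from being optimized away */
    free(a);
    return 0;
}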

In the ideal world, you don't want scientists and engineers worrying
about where things are, and more and more problems involve access
patterns that are hard to anticipate. If you can't make random access
free (as fast as sequential access), then at least you can aim at making
hardware naiveté less costly (a factor of, say, two penalty for having
majored in physics rather than computer science, rather than a factor
of, say, ten or more).

Problems that require truly random (or hard to anticipate) access are
(as I perceive things) far more frequent than they were in the early
Cray days, and the costs of dealing with them increasingly painful.

To attempt to be concise: I have no doubt that the needs of media
stream processors will be met without my worrying about it. Any kind of
more complicated access (I speculate) is now so far down the priority
list that, from the POV of COTS processor manufacturers, it is in the
noise. So I'm interested in talking about any kind of calculation that
can't feed a GPU without some degree of hard work or even magic.

If I seem a tad blasé about the parts of the problem you understand the
most about (or are most interested in), it's because my concerns extend
far beyond a standard rack mounted board and even beyond the rack to the
football-field sized installations that get the most press in HPC.
There are so many pieces to this problem, that even making a
comprehensive list is a challenge. At one time, you could enter a room
and see a Cray 2 (not including the supporting plumbing). Now you'd
have to take the roof off a building and rent a helicopter to get a
similar view of a state of the art "supercomputer." There's a lot to
think about.

I'm also interested in what you can build that doesn't occupy
significant real estate and require a rewiring of the nation's electric
grid, so I'm interested in what you can jam onto a single board or into
a single rack. No shortage of things to talk about.

A final word about latency and bandwidth. I really want to keep my mind
as open as possible. The more latency you can tolerate, perhaps with
some of the kinds of exotic approaches (e.g. huge instruction windows)
that interest you, the more options you have for approaching the problem
of bandwidth. I know that most everyone here understands that. I just
want to make it clear that I understand it, too.

Robert.