From: jacko on
Why do memory channels have to be wired with inverter chains and
relatively long track interconnect on the circuit board? Microwave
pipework from chip-top to chip-top is perhaps possible; keeping enough
bandwidth over the microwave channel means many GHz, but the link is
short, so only a low radiant power is needed!

Flops or not? Let's generalize and call them nops, said he with a touch
of sarcasm. Non-specific Operations, needing GB/s.

Cheers Jacko
From: Robert Myers on
On Jul 19, 3:44 pm, nik Simpson <ni...(a)knology.net> wrote:
> On 7/19/2010 10:36 AM, MitchAlsup wrote:
>
> > d) high end PC processors can afford 2 memory channels
>
> Not quite as screwed as that: the top-end Xeon & Opteron parts have 4
> DDR3 memory channels, but still screwed. For the 2-socket space, it's 3
> DDR3 memory channels for typical server processors. Of course, the move
> to on-chip memory controllers means that the scope for additional
> memory channels is pretty much "zero", but that's the price you pay for
> commodity parts: they are designed to meet the needs of the majority of
> customers, and it's hard to justify the costs of additional memory
> channels at the processor and board-layout levels just to satisfy
> bandwidth-crazy HPC apps ;-)
>

Maybe the capabilities of high-end x86 are and will continue to be so
compelling that, unless IBM is building the machine, that's what we're
looking at for the foreseeable future.

I don't understand the economics of less mass-market designs, but
maybe the perfect chip would be some iteration of an "open" core,
maybe less heat-intensive, less expensive, and soldered-down with more
attention to memory and I/O resources.

Or maybe you could dual-port or route memory, accepting whatever cost
in latency there is, and at least allow some pure DMA device to
perform I/O and gather/scatter chores so as to make the most of what
processor bandwidth there is.
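
To make that concrete, here is a purely hypothetical sketch of such a
DMA gather engine's job, in plain C; the descriptor layout and the
names are my own illustration, not any real hardware interface:

#include <stddef.h>
#include <stdint.h>

struct dma_gather_desc {
    const double   *base;   /* source array in main memory   */
    const uint32_t *index;  /* gather indices                */
    double         *dst;    /* packed destination buffer     */
    size_t          count;  /* number of elements to gather  */
};

/* What the engine would do, off-core, while the CPU computes. */
void dma_gather(const struct dma_gather_desc *d)
{
    for (size_t i = 0; i < d->count; i++)
        d->dst[i] = d->base[d->index[i]];
}

The processor would queue descriptors like this and consume the packed
buffers at full cache-line bandwidth, instead of eating the scattered
loads itself.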

I'd like some blue sky thinking.

Robert.


From: Andrew Reilly on
On Mon, 19 Jul 2010 08:36:18 -0700, MitchAlsup wrote:

> The memory system can supply only 1/3rd of what a single processor wants

If that's the case (and down-thread Nik Simpson suggests that the best
case might even be twice as "good", or 2/3 of a single processor's worst-
case demand), then that's amazingly better than has been available, at
least in the commodity processor space, for quite a long time. I
remember when I started moving DSP code onto PCs, and finding anything
with better than 10MB/s memory bandwidth was not easy. These days my
problem set typically doesn't get out of the cache, so that's not
something I personally worry about much any more. If your problem set is
driven by stream-style vector ops, then you might as well switch to low-
power critters like Atoms, and match the flops to the available
bandwidth, and save some power.

On the other hand, I have a lot of difficulty believing that even for
large-scale vector-style code, a bit of loop fusion, blocking or code
factoring can't bring value-reuse up to a level where even (0.3/nProcs)
available bandwidth is plenty.
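
To illustrate the sort of thing I mean, a minimal loop-fusion sketch
in plain C (the kernel itself is made up, purely for illustration):

#include <stddef.h>

/* Unfused: two passes; the intermediate t[] makes a full round trip
 * through memory. */
void unfused(const double *a, const double *b, double *t, double *y,
             size_t n)
{
    for (size_t i = 0; i < n; i++)
        t[i] = a[i] * b[i];
    for (size_t i = 0; i < n; i++)
        y[i] = t[i] + a[i];
}

/* Fused: one pass; the intermediate never leaves a register. */
void fused(const double *a, const double *b, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double t = a[i] * b[i];
        y[i] = t + a[i];
    }
}

Same flops, roughly half the memory traffic; blocking plays the same
game across loop nests instead of within one.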

That's single-threaded application-think. Where you *really* need that
bandwidth, I suspect, is for the inter-processor communication between
your hordes of cooperating (ha!) cores.

Cheers,

--
Andrew
From: jacko on
On 20 July, 05:43, Andrew Reilly <areilly...(a)bigpond.net.au> wrote:
> On Mon, 19 Jul 2010 08:36:18 -0700, MitchAlsup wrote:
> > The memory system can supply only 1/3rd of what a single processor wants
>
> If that's the case (and down-thread Nik Simpson suggests that the best
> case might even be twice as "good", or 2/3 of a single processor's worst-
> case demand), then that's amazingly better than has been available, at
> least in the commodity processor space, for quite a long time.  I
> remember when I started moving DSP code onto PCs, and finding anything
> with better than 10MB/s memory bandwidth was not easy.  These days my
> problem set typically doesn't get out of the cache, so that's not
> something I personally worry about much any more.  If your problem set is
> driven by stream-style vector ops, then you might as well switch to low-
> power critters like Atoms, and match the flops to the available
> bandwidth, and save some power.

Or run a bigger network off the same power.

> On the other hand, I have a lot of difficulty believing that even for
> large-scale vector-style code, a bit of loop fusion, blocking or code
> factoring can't bring value-reuse up to a level where even (0.3/nProcs)
> available bandwidth is plenty.

Prob(able)ly - a sick, perverse hanging on to a longer word in the
post-quantum age.

> That's single-threaded application-think.  Where you *really* need that
> bandwidth, I suspect, is for the inter-processor communication between
> your hordes of cooperating (ha!) cores.

Maybe. I think much of the problem is not vectors, as these usually
have a single index; it's matrix and tensor problems, which have 2 or
n indexes: T[a,b,c,d]

The trouble is that the many product sums run over different indexes,
even with transpose-elimination coding (automatic switching between
row and column order, based on linear sequencing of the write target
or on the best read/write/read/write etc. performance) in the prefetch
context, and with only limited gather/scatter.
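
As a plain-C illustration of the row/column-order point (standard
textbook material, not anyone's particular code): for C = A*B stored
row-major, the classic i-j-k loop strides through B a whole row at a
time, while the i-k-j order keeps the inner loop unit-stride.

#include <stddef.h>

/* i-j-k: inner loop walks B with stride n -> poor locality. */
void matmul_ijk(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}

/* i-k-j: inner loop walks B and C with stride 1, and A[i*n + k] is
 * reused n times from a register. */
void matmul_ikj(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++)
            C[i*n + j] = 0.0;
        for (size_t k = 0; k < n; k++) {
            double a = A[i*n + k];
            for (size_t j = 0; j < n; j++)
                C[i*n + j] += a * B[k*n + j];
        }
    }
}

Tensors with more indexes just multiply the number of orderings you
have to choose among.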

Maybe even some multi-store (slightly wasteful of memory cells) with
differing address-bit swappings? The high bits would act as an
address-map translation selector, with a bank write-and-read combo
'union' operation (* or +)?
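
A loose sketch of what I mean, in C: keep two copies under two address
mappings so either traversal direction reads unit-stride (everything
here is hypothetical illustration, not a real design):

#include <stddef.h>

struct dual_store {
    double *rowmaj;   /* copy 0: element (r,c) at r*n + c */
    double *colmaj;   /* copy 1: element (r,c) at c*n + r */
    size_t  n;
};

/* Writes hit both copies - the "slightly wasteful" part. */
void ds_write(struct dual_store *m, size_t r, size_t c, double v)
{
    m->rowmaj[r * m->n + c] = v;
    m->colmaj[c * m->n + r] = v;
}

/* Reads pick the copy whose layout matches the walk direction. */
double ds_read_row(const struct dual_store *m, size_t r, size_t c)
{
    return m->rowmaj[r * m->n + c];  /* unit-stride as c advances */
}

double ds_read_col(const struct dual_store *m, size_t r, size_t c)
{
    return m->colmaj[c * m->n + r];  /* unit-stride as r advances */
}

In hardware the selector would be an address bit rather than two
pointers, but the trade is the same: memory cells for bandwidth.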

Ummm.
From: George Neuner on
On Mon, 19 Jul 2010 08:36:18 -0700 (PDT), MitchAlsup
<MitchAlsup(a)aol.com> wrote:

>It seems to me that having less than 8 bytes of memory bandwidth per
>flop leads to an endless series of cache exercises.**
>
>It also seems to me that nobody is going to be able to put the
>required 100 GB/s/processor pin interface on the part.*
>
>Nor does it seem it would have the latency needed to strip-mine main
>memory continuously were the required BW made available.
>
>Thus, we are in essence screwed.
>
>* current bandwidths
>a) 3 GHz processors with 2 FP pipes running 128-bit double DP flops
>(a la SSE). This gives 12 GFlop/processor
>b) 12 GFlop/processor demands 100 GByte/s/processor
>c) DDR3 can achieve 17 GBytes/s/channel
>d) high end PC processors can afford 2 memory channels
>e) therefore we are screwed:
>e.1) The memory system can supply only 1/3rd of what a single
>processor wants
>e.2) There are 4 processors, and the number is growing
>e.3) therefore the memory system can support less than 1/12 as much BW
>as required.
>
>Mitch
>
>** The Ideal memBW/Flop is 3 memory operations per flop, and back in
>the Cray-1 to XMP transition much of the vectorization gain occurred
>from the added memBW and the better chaining.
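
(A quick sanity check of the arithmetic quoted above, with the numbers
taken straight from Mitch's post:)

#include <stdio.h>

int main(void)
{
    double gflops = 3.0 * 2 * 2;     /* 3 GHz x 2 pipes x 2 DP flops */
    double demand = gflops * 8.0;    /* 8 bytes/flop -> ~100 GB/s    */
    double supply = 2 * 17.0;        /* 2 DDR3 channels x 17 GB/s    */
    printf("demand %.0f GB/s, supply %.0f GB/s, ratio %.2f\n",
           demand, supply, supply / demand);             /* ~1/3  */
    printf("4 cores: ratio %.3f\n", supply / (4 * demand)); /* ~1/12 */
    return 0;
}

It does come out to roughly 1/3 per processor and roughly 1/12 with
four of them, as claimed.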

ISTM bandwidth was the whole point behind pipelined vector processors
in the older supercomputers. Yes, there was a lot of latency (and I
know you [Mitch] and Robert Myers are dead set against latency too),
but the staged data movement provided a lot of opportunity to overlap
with real computation.

YMMV, but I think pipeline vector units need to make a comeback. I am
not particularly happy at the thought of using them again, but I don't
see a good way around it.

George