From: Tim McCaffrey on
In article <906e8749-bc47-4d2d-9316-3a0d20a7cdfc(a)b7g2000yqd.googlegroups.com>,
rbmyersusa(a)gmail.com says...
>
>On Mar 2, 1:01 pm, n...(a)cam.ac.uk wrote:
>> In article <462899e1-6298-4e2a-918f-733cfa759...(a)g19g2000yqe.googlegroups.com>,
>> Robert Myers <rbmyers...(a)gmail.com> wrote:
>>
>> >On Mar 2, 12:19 pm, MitchAlsup <MitchAl...(a)aol.com> wrote:
>>
>> >> >   2) Put the memory back-to-back with the CPU, factory integrated,
>> >> > thus releasing all existing memory pins for I/O use. Note that this
>> >> > allows for VASTLY more memory pins/pads.
>>
>> >> I have been thinking along these lines...
>>
>> >> Consider a chip containing CPUs sitting in a package with a small-
>> >> medium number of DRAM chips. The CPU and DRAM chips orchestrated with
>> >> an interface that exploits the on die wire density that cannot escape
>> >> the package boundary.
>>
>> >> A: make this DRAM the only parts of the coherent memory
>> >> B: use more conventional FBDIMM channels to an extended core storage
>> >> C: perform all <disk, network, high speed> I/O to the ECS
>> >> D: page ECS to the on die DRAM as a single page sized burst at FBDIMM
>> >> speeds
>> >> E: an efficient on-CPU-chip TLB shootdown mechanism <or coherent TLB>
>>
>> >> A page copy to an FBDIMM resident page would take about 150-200 ns;
>> >> and this is about the access time of a single line if the whole ECS
>> >> was made coherent!
>>
>> >> F: a larger ECS can be built <if desired> by implementing a FBDIMM
>> >> multiplexer
>>
>> >How is any of this different from putting huge "Level 4" cache on the
>> >die, an idea I proposed long ago. Maybe it's only now that the scale
>> >sizes make it a realistic option.
>>
>> It's different, because it is changing the interface from the chip
>> (actually, package) to the outside world. Basically, MUCH larger
>> units and no attempt at low-level coherence. Those are the keys
>> to making it fly, and not the scale sizes.
>>
>Just to be clear.
>
>I referred to it here as "Level 4" cache. What I proposed was putting
>main memory on die. If you (as you probably will) want to use
>conventional memory the way we now use disk, it wouldn't function in
>the same way as memory now does, as you (very belatedly) point out.
>
>I know, you thought of it all in 1935.
>

Actually, 1972(ish), and it was CDC/Seymore Cray (well, I assume it was him).
It was even called the same thing: ECS.

ECS was actually available before that (CDC 6000 series), but only the main
CPU could talk to it, and I/O was done to main memory. With the 7600 the PPs
could also write to the ECS (slowly), and the CPU could read from it. There
were Fortran extensions to place *big* arrays in ECS and the compiler took
care of paging in and out the part you were working on. Michigan State used
it to swap out processes (it was much faster than disk).

The ECS used a 600 bit interface to the CPU, and had an extremely long access
time (10us IIRC), so bandwidth was about the same as main memory, but latency
sucked.
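
To put rough numbers on that (the streaming rate below is my assumption, not
something given above), a wide port behind a long access time only pays off
when you move big blocks:

# Back-of-envelope: why a wide store with a long access time favors
# big block transfers. Assumed figure (not from the post): after the
# ~10 us access, the 600-bit port streams one 60-bit word every 100 ns.
ACCESS_NS = 10_000        # ~10 us access time (the IIRC figure above)
WORD_BITS = 60            # CDC word size
WORD_NS = 100             # assumed per-word streaming interval

def effective_mbps(words):
    """Effective bandwidth (MB/s) for a transfer of `words` 60-bit words."""
    total_ns = ACCESS_NS + words * WORD_NS
    total_bytes = words * WORD_BITS / 8
    return total_bytes / total_ns * 1e3   # bytes/ns -> MB/s

for n in (1, 8, 512, 32768):
    print(f"{n:6d} words: {effective_mbps(n):7.1f} MB/s")
# 1 word crawls at ~0.7 MB/s; a 32K-word block approaches the ~75 MB/s
# streaming rate -- bandwidth comparable to core, latency nowhere close.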

ECS was also how multiple CPUs talked to each other: it had 4 ports for 4
different systems, so they could coordinate disk/file/peripheral sharing.

MSU used the ECS to connect a 6400 and a 6500 together, and later the 6500
with a Cyber 170/750. The slower machine was used primarily for systems
development work.

- Tim

From: nedbrek on
Hello all,

<nmm1(a)cam.ac.uk> wrote in message
news:hmisp2$u90$1(a)smaug.linux.pwf.cam.ac.uk...
> In article <hmiqg4$ads$1(a)news.eternal-september.org>,
> nedbrek <nedbrek(a)yahoo.com> wrote:
>>"Robert Myers" <rbmyersusa(a)gmail.com> wrote in message
>>news:66eeb001-7f72-4ad6-afbb-7bdcb5a0275b(a)y11g2000yqh.googlegroups.com...
>>> On Mar 1, 8:05 pm, Del Cecchi <delcecchinospamoftheno...(a)gmail.com>
>>> wrote:
>>
>>>> Isn't that backwards? bandwidth costs money, latency needs miracles.
>>>
>>> There are no bandwidth-hiding tricks. Once the pipe is full, that's
>>> it. That's as fast as things will go. And, as one of the architects
>>> here commented, once you have all the pins possible and you wiggle
>>> them as fast as you can, there is no more bandwidth to be had.
>>
>>Robert is right. Bandwidth costs money, and the budget is fixed (no more
>>$1000 CPUs, well, not counting the Extreme/Expensive Editions).
>
> Actually, Del is. Robert is right that, given a fixed design, bandwidth
> is pro rata to budget.
>
>>Pin count is die size limited, and die stopped growing at Pentium 4. If we
>>don't keep using all the transistors, it will start shrinking (cf. Atom).
>>
>>That's all assuming CPUs available to mortals (Intel/AMD). If you're IBM,
>>then you can have all the bandwidth you want.
>
> Sigh. Were I given absolute powers over Intel, I could arrange to
> have a design produced with vastly more bandwidth (almost certainly
> 10x, perhaps much more), for the same production cost.
>
> All that is needed is four things:
>
> 1) Produce low-power (watts) designs for the mainstream, to
> enable the second point.
>
> 2) Put the memory back-to-back with the CPU, factory integrated,
> thus releasing all existing memory pins for I/O use. Note that this
> allows for VASTLY more memory pins/pads.

I'm afraid I don't see what a low power core has to do with faster DRAM. A
big core likes fast DRAM too...

Also, what do you mean by factory integrated? On one die? And, a la
Mitch's suggestion, is this the only memory in the system?
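
Whatever the exact packaging, the attraction is easy enough to see from a
crude pads-times-rate count (every number below is made up for illustration):

# Sketch of why in-package pads change the bandwidth picture.
# All figures are assumptions for illustration, not vendor data.
def aggregate_gbs(pins, gbit_per_pin):
    """Raw aggregate bandwidth in GB/s for `pins` signals at `gbit_per_pin` Gbit/s each."""
    return pins * gbit_per_pin / 8

# Off-package: a few hundred memory pins at a few Gbit/s per pin.
print("package pins:", aggregate_gbs(pins=240, gbit_per_pin=1.6), "GB/s")    # 48.0

# In-package: thousands of pads are plausible, even at the same per-pad rate.
print("in-package  :", aggregate_gbs(pins=4096, gbit_per_pin=1.6), "GB/s")   # 819.2

That is where the "10x, perhaps much more" comes from: the pad count moves by
an order of magnitude even before you raise the per-pad rate.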

> 3) Lean on the memory manufacturers to deliver that, or buy up
> one and return to that business.

The DRAM manufacturers are very conservative. You must be very persuasive
with them. Also, their margins are razor thin (many are propped up by
government subsidies). Your suggestions must be cost neutral, and
applicable to 100% of the market.

> 4) Support a much simpler, fixed (large) block-size protocol to
> the first-level I/O interface chip. Think HiPPI, taken to extremes.

I'm not familiar with HiPPI. The Wikipedia page has very little content,
and the FAQ linked from there is not responding...

> The obstacles are mainly political and marketdroids. Intel is big
> enough to swing that, if it wanted to. Note that I am not saying
> it should, as it is unclear that point (4) is the right direction
> for a general-purpose chip. However, points (1) to (3) would work
> without point (4).

Ned


From: nedbrek on

"MitchAlsup" <MitchAlsup(a)aol.com> wrote in message
news:49270cd8-a1b5-4186-9b82-0564efee8c56(a)33g2000yqj.googlegroups.com...
On Mar 2, 5:28 am, n...(a)cam.ac.uk wrote:
>> 2) Put the memory back-to-back with the CPU, factory integrated,
>> thus releasing all existing memory pins for I/O use. Note that this
>> allows for VASTLY more memory pins/pads.
>
> I have been thinking along these lines...
>
> Consider a chip containing CPUs sitting in a package with a small-
> medium number of DRAM chips. The CPU and DRAM chips orchestrated with
> an interface that exploits the on die wire density that cannot escape
> the package boundary.

I'm not aware of any MCM that will support on-die wire density. MCMs in
general are frowned upon due to higher yield loss and cost of the package.
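
For a rough feel of the gap (the pitches below are order-of-magnitude guesses
on my part, not process data):

# Signals escaping one edge of an assumed 10 mm die at various pitches.
DIE_EDGE_UM = 10 * 1000   # 10 mm die edge, in microns

def wires_per_edge(pitch_um):
    """How many signals fit along one die edge at the given pitch."""
    return DIE_EDGE_UM // pitch_um

print("on-die metal, ~1 um pitch   :", wires_per_edge(1))     # ~10000
print("microbump MCM, ~50 um pitch :", wires_per_edge(50))    # ~200
print("flip-chip bump, ~180 um     :", wires_per_edge(180))   # ~55

Roughly two orders of magnitude, which is the density that "cannot escape the
package boundary".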

> A: make this DRAM the only parts of the coherent memory

This complicates system inventories (now you have the cross product of bins
for speed and memory capacity).

> B: use more conventional FBDIMM channels to an extended core storage

If all of coherent memory is on the next die, you can probably eat the disk
latency for most apps. Although the database guys would love it...
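
For reference, Mitch's 150-200 ns page-copy figure upthread is easy to
reproduce if you assume a couple of parameters he doesn't give:

# One way the 150-200 ns page-copy estimate can come out.
# Channel count and per-channel rate are my assumptions, not Mitch's.
PAGE_BYTES = 4096        # assumed 4 KB page
CHANNEL_GBS = 5.3        # roughly a DDR2-667 FBDIMM channel, GB/s
CHANNELS = 4             # assumed: page burst striped across 4 channels

burst_ns = PAGE_BYTES / (CHANNEL_GBS * CHANNELS)   # 1 GB/s == 1 byte/ns
print(f"page burst: {burst_ns:.0f} ns")            # ~193 ns, in the quoted range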

> E: an efficient on-CPU-chip TLB shootdown mechanism <or coherent TLB>

Coherent TLB is an interesting idea. Has anyone investigated the
complications for this?

Ned


From: Robert Myers on
On Mar 2, 7:18 pm, timcaff...(a)aol.com (Tim McCaffrey) wrote:
> In article <906e8749-bc47-4d2d-9316-3a0d20a7c...(a)b7g2000yqd.googlegroups.com>,
> rbmyers...(a)gmail.com says...
>

> >I know, you thought of it all in 1935.
>
> Actually, 1972(ish), and it was CDC/Seymore Cray (well, I assume it was him).
> It was even called the same thing: ECS.
>
> ECS was actually available before that (CDC 6000 series), but only the main
> CPU could talk to it, and I/O was done to main memory.  With the 7600 the PPs
> could also write to the ECS (slowly), and the CPU could read from it.  There
> were Fortran extensions to place *big* arrays in ECS and the compiler took
> care of paging in and out the part you were working on.  Michigan State used
> it to swap out processes (it was much faster than disk).
>
> The ECS used a 600 bit interface to the CPU, and had an extremely long access
> time (10us IIRC), so bandwidth was about the same as main memory, but latency
> sucked.
>
> ECS was also how multiple CPUs talked to each other: it had 4 ports for 4
> different systems, so they could coordinate disk/file/peripheral sharing.
>
> MSU used the ECS to connect a 6400 and a 6500 together, and later the 6500
> with a Cyber 170/750.  The slower machine was used primarily for systems
> development work.
>
I don't know if it's related, but CDC's "large core memory" (as
opposed to "small core memory") was my very unwelcome introduction to
computer hardware.

From that experience, I acquired several permanent prejudices:

1. For scientific/engineering applications, "programmers" should
either be limited to sorting and labeling output, or (preferably) they
should be shipped to the Antarctic, where they could be sent, one at a
time, to check the temperature gauge a quarter mile from the main
camp.

2. No sane computational physicist should imagine that even a thorough
knowledge of FORTRAN was adequate preparation for getting things done.

3. Computer architects are generally completely out of touch with
reality.

Do anything you like, but please never show respect for Seymour Cray,
including misspelling his name.

Robert.

From: "Andy "Krazy" Glew" on
nedbrek wrote:
> "MitchAlsup" <MitchAlsup(a)aol.com> wrote in message
> news:49270cd8-a1b5-4186-9b82-0564efee8c56(a)33g2000yqj.googlegroups.com...
> On Mar 2, 5:28 am, n...(a)cam.ac.uk wrote:
>>> 2) Put the memory back-to-back with the CPU, factory integrated,
>>> thus releasing all existing memory pins for I/O use. Note that this
>>> allows for VASTLY more memory pins/pads.
>> I have been thinking along these lines...
>>
>> Consider a chip containing CPUs sitting in a package with a small-
>> medium number of DRAM chips. The CPU and DRAM chips orchestrated with
>> an interface that exploits the on die wire density that cannot escape
>> the package boundary.

Aw, heck.

Yesterday I met with Ivan Sutherland at Portland State University.

Amongst other things, Ivan pitched his capacitively coupled interconnect between chips.

I don't remember the pad pitch.

---

Interestingly, Ivan was against chip stacking. Says that there is not much difference between 2D and 3D. Advocates
instead the sort of overlapped checkerboard he talked about in the capacitive coupling paper. (At least I think that
is where I saw it.)

I tend to disagree. I think there is a big difference between N**1/2 and N**2/3, the Pollack-Glew Law prediction for
performance, at least for latency-sensitive code. I also think that surface area is a limiter for some of the
configurations I am most interested in, such as cell phones. (Unless the chips are really small...) My head is still
into stacks with a logic chip at the bottom, and several DRAM chips (and maybe other memory chips).
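
Taking the two exponents at face value, the gap grows quickly with transistor budget N:

# The gap between N**1/2 and N**2/3 scaling as the transistor budget N grows.
for n in (4, 16, 64, 256):
    print(f"N={n:4d}: N**1/2 = {n**0.5:5.1f}x   N**2/3 = {n**(2/3):5.1f}x")
# At N=256 the exponents give 16x versus ~40x -- not a small difference.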

Which is interesting: such stacks probably cannot be capacitively coupled except on alternate layers. Unless
somebody is willing to build a loaf-of-bread configuration, chips butted edge-on to another chip (or carrier). Failing
that, one might have the interesting situation of higher bandwidth "outside the chip-stack module" than within the module.