From: Del Cecchi on
Tim McCaffrey wrote:
snip
>
>
> Some years ago I read an article in (Electronic Design?) about a stacked DRAM
> device. They contended that DRAM is basically an analog device, and if the
> digital/analog interface was put on a separate chip the performance could be
> improved considerably. Their product/proof-of-concept was a device with one
> "interface" chip and up to 32 stacked DRAM chips. They claimed 4ns access
> time (IIRC).
>
> Now, put that stack on a processor, actually put several so that you can have
> multiple memory channels. The CDC 6000 series used 32 banks (now called
> channels) of memory, the 7000 series used 16 banks, and the 750/760 Cybers had
> 8. The systems took a 15% performance hit going from 16 to 8.
>
> If you have 8 channels, with 4ns access, 64 bit wide, you don't need L2 or L3
> cache.
>
> - Tim
>
Proof of concept? System/370 and the System/3 Model 15 used "Riesling"
memory, with four 2K SRAM array chips in a stacked 1/2-inch module. The
sense-amp/bit-driver module was separate.

One of the limiting factors for DRAM is the small signal difference between
the read cell and the reference cell that is available to drive the sense
amp.

And at the cell level there are no separate write and read ports, just a bit
line.
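
To put rough numbers on that, here is a generic charge-sharing estimate. The
capacitances and voltages below are illustrative guesses, not any particular
process:

#include <stdio.h>

int main(void) {
    /* Back-of-envelope DRAM bitline signal from charge sharing:
     * dV = (Ccell / (Ccell + Cbitline)) * (Vcell - Vprecharge).
     * All values are assumed, for illustration only. */
    double c_cell    = 30e-15;   /* cell capacitance, ~30 fF (assumed)     */
    double c_bitline = 300e-15;  /* bitline capacitance, ~300 fF (assumed) */
    double v_cell    = 1.8;      /* stored "1" level in volts (assumed)    */
    double v_pre     = 0.9;      /* bitline precharged to Vdd/2            */

    double dv = c_cell / (c_cell + c_bitline) * (v_cell - v_pre);
    printf("sense amp sees about %.0f mV\n", dv * 1e3);  /* ~82 mV */
    return 0;
}

A few tens of millivolts is all the sense amp ever gets to work with, which
is one reason the analog side is the hard part of the access time.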

From: davewang202 on
On Feb 9, 12:20 pm, timcaff...(a)aol.com (Tim McCaffrey) wrote:
>
> Some years ago I read an article in (Electronic Design?) about a stacked DRAM
> device.  They contended that DRAM is basically an analog device, and if the
> digital/analog interface was put on a separate chip the performance could be
> improved considerably.  Their product/proof-of-concept was a device with one
> "interface" chip and up to 32 stacked DRAM chips.  They claimed 4ns access
> time (IIRC).

You may be thinking of Tezzaron.

http://www.tezzaron.com/

They stack wafers of DRAM array cells, and the I/O stuff on a separate
wafer on the bottom.

> Now, put that stack on a processor, actually put several so that you can have
> multiple memory channels. The CDC 6000 series used 32 banks (now called
> channels) of memory, the 7000 series used 16 banks, and the 750/760 Cybers had
> 8.  The systems took a 15% performance hit going from 16 to 8.
>
> If you have 8 channels, with 4ns access, 64 bit wide, you don't need L2 or L3
> cache.
>
>                 - Tim

From: "Andy "Krazy" Glew" on
MitchAlsup wrote:
> On Feb 8, 11:14 pm, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net>
> wrote:
>> Q: are we ever going to get to a similar point, where an abstract DRAM
>> interface makes sense? Certainly, we already have for flash/PCM.
>> But for DRAM?
>
> With the understanding that any RAM has three necessary* ports, a port
> to insert addresses, a port to insert write data, and a port to remove
> read data; I think that an abstract interface is undesirable, what is
> desired is more direct access to these three necessarily existing
> ports. See USPTO 5367494.

I think the abstraction lies in making these ports.

As opposed to requiring the user to know that the address really has a row and a column part, with separate timing for
RAS and CAS, and that the read and write data ports happen to share the same wires.
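
Something like this, purely as a toy sketch of the kind of interface I mean - the names and the trivial in-memory
model are mine, not from any real spec. Note the tag on reads: that is what lets the return port be decoupled from the
address port (more on that below).

#include <stdint.h>
#include <stdio.h>

enum { MEM_WORDS = 1024, QUEUE_DEPTH = 16 };

typedef struct {
    uint64_t mem[MEM_WORDS];                      /* stands in for the array */
    struct { uint32_t tag; uint64_t data; } retq[QUEUE_DEPTH];
    int head, tail;                               /* read-return queue */
} dram_channel_t;

/* address port: the user hands over a flat address; any row/column split,
 * RAS/CAS timing, etc. is the controller's private business */
void dram_read_req(dram_channel_t *ch, uint64_t addr, uint32_t tag) {
    ch->retq[ch->tail].tag  = tag;                /* toy model: serve at once */
    ch->retq[ch->tail].data = ch->mem[addr % MEM_WORDS];
    ch->tail = (ch->tail + 1) % QUEUE_DEPTH;
}

/* write data port */
void dram_write(dram_channel_t *ch, uint64_t addr, uint64_t data) {
    ch->mem[addr % MEM_WORDS] = data;
}

/* read data return port, decoupled from the address port: results come
 * back tagged, so a smarter controller could return them out of order */
int dram_read_ret(dram_channel_t *ch, uint32_t *tag, uint64_t *data) {
    if (ch->head == ch->tail) return 0;           /* nothing ready */
    *tag  = ch->retq[ch->head].tag;
    *data = ch->retq[ch->head].data;
    ch->head = (ch->head + 1) % QUEUE_DEPTH;
    return 1;
}

int main(void) {
    dram_channel_t ch = {0};
    uint32_t tag; uint64_t data;
    dram_write(&ch, 42, 0xdeadbeefULL);
    dram_read_req(&ch, 42, 7);
    while (dram_read_ret(&ch, &tag, &data))
        printf("tag %u -> %llx\n", tag, (unsigned long long)data);
    return 0;
}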

Now, I'm of two minds about things like abstracting RAS and CAS timing.

Like Mitch, I have in the past been able to improve performance significantly by being bank structure aware.

But when the JEDEC interface implies RAS and CAS, and hence a particular array structure inside the DRAM - while the
DRAM itself has a completely different structure - then the low-level interface is obfuscating.

I can understand why a DRAM memory subsystem designer might want to be just given the address - so that he can do his
own scheduling of the various arrays and subarrays.

Which necessarily implies that the read data return port be decoupled from the read address port.

--

My take, as usual, is:

a) have an abstract interface

b) but don't hide the low level details. Tell the user about the bank structure. But give the memory subsystem
designer the freedom to do more than just blindly do what the programmer has told him to do.

ab) If necessary, have interfaces that ab.1) tell the user what sorts of scheduling the DRAM memory subsystem can do
(e.g. "I'm a stupid DRAM, and I will do exactly what you tell me", through "I'm a really smart DRAM, and I will return
requests out of order"), and ab.2) allow the user to control the DRAM: "I don't care how smart you think you are, just
do things in the order I tell you."
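
Extending the toy interface from above, the two knobs might look like this - again, hypothetical names, just to make
the idea concrete:

typedef enum {
    DRAM_SCHED_IN_ORDER,     /* "stupid DRAM": does exactly what it is told */
    DRAM_SCHED_OUT_OF_ORDER  /* "smart DRAM": may return requests reordered */
} dram_sched_cap_t;

/* ab.1: the memory subsystem advertises what scheduling it can do */
dram_sched_cap_t dram_query_scheduling(const dram_channel_t *ch) {
    (void)ch;
    return DRAM_SCHED_IN_ORDER;  /* the toy model above is strictly in order */
}

/* ab.2: the user overrides it - force strict request order even on a
 * smart DRAM (a no-op here; a real controller would stop reordering) */
void dram_force_in_order(dram_channel_t *ch, int strict) {
    (void)ch;
    (void)strict;
}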

From: "Andy "Krazy" Glew" on
Tim McCaffrey wrote:
> Some years ago I read an article in (Electronic Design?) about a stacked DRAM
> device. They contended that DRAM is basically an analog device, and if the
> digital/analog interface was put on a separate chip the performance could be
> improved considerably. Their product/proof-of-concept was a device with one
> "interface" chip and up to 32 stacked DRAM chips. They claimed 4ns access
> time (IIRC).
>
> Now, put that stack on a processor, actually put several so that you can have
> multiple memory channels. The CDC 6000 series used 32 banks (now called
> channels) of memory, the 7000 series used 16 banks, and the 750/760 Cybers had
> 8. The systems took a 15% performance hit going from 16 to 8.
>
> If you have 8 channels, with 4ns access, 64 bit wide, you don't need L2 or L3
> cache.
>
> - Tim


I really wanted to talk briefly about point (0) below, and then the neat technical point (1). But since (0) grew, I'll
do them backwards.

(1) Now, if you don't have the L2 or L3 cache, what do you do with the extra silicon area on the CPU chip?
1a) More CPUs? But we may be past the point of diminishing returns for all except GPUs.
1b) Smaller chips? But for small systems, there is a minimum economic chip size.
1c) How about, take these suddenly smaller CPUs, and put them on the DRAM digital/analog controller chip? But then we
have to respin this chip every time the analog stack is tweaked.


(0) This saying "if memory is fast enough, you don't need L2 or L3 cache" has been the kiss of death for many advanced
memory projects. People were saying this for Rambus or Rambus-like memory wrt the '486.

Yes, that early, and with more detail: if we have a new faster memory, not only do we not need the L2$ (and maybe not
the L1$), but we may not even need fancy OOO stuff like P6.
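
And to be fair, the raw numbers are seductive. Straightforward arithmetic on Tim's figures quoted above:

#include <stdio.h>

int main(void) {
    /* 8 channels x 64 bits wide, one access per 4 ns per channel */
    double channels = 8.0, width_bytes = 8.0, access_ns = 4.0;
    double peak = channels * width_bytes / access_ns;  /* bytes per ns = GB/s */
    printf("peak bandwidth: %.0f GB/s at 4 ns latency\n", peak);  /* 16 GB/s */
    return 0;
}

16 GB/s at a 4 ns access is in L2-hit territory, which is exactly why the argument keeps getting made.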

Or perhaps not the kiss of death, since such projects seem to get funded. And last for years, even decades. They just
eventually die. Perhaps "kiss of eventual moribundity"? Zombie-dom?

Trouble is, the saying is true. Like motherhood and apple pie. The problem is the conclusion people draw from it: an
advanced memory project that only makes sense if it lets you eliminate whole levels of the cache hierarchy - which
doesn't make sense if you still have the cache. My takeaway is that you should try to create a new advanced memory
technology that makes sense even with the cache, but which might make even more sense if it enables getting rid of the
cache. Hedge your bets.

Trouble is, management at companies like Intel cannot understand hedging, which of necessity requires tracking the
potential payoffs for multiple scenarios. They don't want to see two scenarios, with-cache and without -- because they
will want three landing-zone scenarios for each, giving six, etc. Combinatorial explosion.