From: Robert Myers
On Mar 2, 12:19 pm, MitchAlsup <MitchAl...(a)aol.com> wrote:
> On Mar 2, 5:28 am, n...(a)cam.ac.uk wrote:
>
> >     2) Put the memory back-to-back with the CPU, factory integrated,
> > thus releasing all existing memory pins for I/O use.  Note that this
> > allows for VASTLY more memory pins/pads.
>
> I have been thinking along these lines...
>
> Consider a chip containing CPUs sitting in a package with a small-
> medium number of DRAM chips. The CPU and DRAM chips are orchestrated
> through an interface that exploits the on-die wire density that cannot
> escape the package boundary.
>
> A: make this DRAM the only part of the coherent memory
> B: use more conventional FBDIMM channels to an extended core storage
> C: perform all <disk, network, high speed> I/O to the ECS
> D: page ECS to the on die DRAM as a single page sized burst at FBDIMM
> speeds
> E: an efficient on-CPU-chip TLB shootdown mechanism <or coherent TLB>
>
> A page copy to an FBDIMM resident page would take about 150-200 ns;
> and this is about the access time of a single line if the whole ECS
> was made coherent!
>
> F: a larger ECS can be built <if desired> by implementing a FBDIMM
> multiplexer
>
How is any of this different from putting a huge "Level 4" cache on
the die, an idea I proposed long ago? Maybe it's only now that the
scale sizes make it a realistic option.

Robert.


From: Robert Myers
On Mar 2, 6:28 am, n...(a)cam.ac.uk wrote:
> In article <hmiqg4$ad...(a)news.eternal-september.org>,
>
> nedbrek <nedb...(a)yahoo.com> wrote:
> >"Robert Myers" <rbmyers...(a)gmail.com> wrote in message
> >news:66eeb001-7f72-4ad6-afbb-7bdcb5a0275b(a)y11g2000yqh.googlegroups.com....
> >> On Mar 1, 8:05 pm, Del Cecchi <delcecchinospamoftheno...(a)gmail.com>
> >> wrote:
>
> >>> Isn't that backwards? bandwidth costs money, latency needs miracles.
>
> >> There are no bandwidth-hiding tricks.  Once the pipe is full, that's
> >> it.  That's as fast as things will go.  And, as one of the architects
> >> here commented, once you have all the pins possible and you wiggle
> >> them as fast as you can, there is no more bandwidth to be had.
>
> >Robert is right.  Bandwidth costs money, and the budget is fixed (no more
> >$1000 CPUs, well, not counting the Extreme/Expensive Editions).
>
> Actually, Del is.  Robert is right that, given a fixed design, bandwidth
> is pro rata to budget.
>
> >Pin count is die size limited, and die stopped growing at Pentium 4.  If we
> >don't keep using all the transistors, it will start shrinking (cf. Atom).
>
> >That's all assuming CPUs available to mortals (Intel/AMD).  If you're IBM,
> >then you can have all the bandwidth you want.
>
> Sigh.  Were I given absolute powers over Intel, I could arrange to
> have a design produced with vastly more bandwidth (almost certainly
> 10x, perhaps much more), for the same production cost.
>
> All that is needed is four things:
>
>     1) Produce low-power (watts) designs for the mainstream, to
> enable the second point.
>
>     2) Put the memory back-to-back with the CPU, factory integrated,
> thus releasing all existing memory pins for I/O use.  Note that this
> allows for VASTLY more memory pins/pads.
>
>     3) Lean on the memory manufacturers to deliver that, or buy up
> one and return to that business.
>
>     4) Support a much simpler, fixed (large) block-size protocol to
> the first-level I/O interface chip.  Think HiPPI, taken to extremes.
>
> The obstacles are mainly political and marketdroids.  Intel is big
> enough to swing that, if it wanted to.  Note that I am not saying
> it should, as it is unclear that point (4) is the right direction
> for a general-purpose chip.  However, points (1) to (3) would work
> without point (4).
>
> Also, note that I said "arrange to have a design produced".  The
> hardware experts I have spoken to all agree that is technically
> feasible.  Intel is big enough to do this, IF it wanted to.

You are such an amazing chameleon. Intel isn't the only one who
sniffs the wind before speaking.

Robert.
From: nmm1
In article <462899e1-6298-4e2a-918f-733cfa759c0e(a)g19g2000yqe.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>On Mar 2, 12:19 pm, MitchAlsup <MitchAl...(a)aol.com> wrote:
>>
>> >     2) Put the memory back-to-back with the CPU, factory integrated,
>> > thus releasing all existing memory pins for I/O use.  Note that this
>> > allows for VASTLY more memory pins/pads.
>>
>> I have been thinking along these lines...
>>
>> Consider a chip containing CPUs sitting in a package with a small-
>> medium number of DRAM chips. The CPU and DRAM chips are orchestrated
>> through an interface that exploits the on-die wire density that cannot
>> escape the package boundary.
>>
>> A: make this DRAM the only part of the coherent memory
>> B: use more conventional FBDIMM channels to an extended core storage
>> C: perform all <disk, network, high speed> I/O to the ECS
>> D: page ECS to the on die DRAM as a single page sized burst at FBDIMM
>> speeds
>> E: an efficient on-CPU-chip TLB shootdown mechanism <or coherent TLB>
>>
>> A page copy to an FBDIMM resident page would take about 150-200 ns;
>> and this is about the access time of a single line if the whole ECS
>> was made coherent!
>>
>> F: a larger ECS can be built <if desired> by implementing a FBDIMM
>> multiplexer
>>
>How is any of this different from putting a huge "Level 4" cache on
>the die, an idea I proposed long ago? Maybe it's only now that the
>scale sizes make it a realistic option.

It's different, because it is changing the interface from the chip
(actually, package) to the outside world. Basically, MUCH larger
units and no attempt at low-level coherence. Those are the keys
to making it fly, and not the scale sizes.
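
The arithmetic behind that distinction can be sketched in a few lines of
Python. This is a back-of-envelope check only: the 150-200 ns figures are
Mitch's estimates from upthread (midpoint used here), and the page and
line sizes are conventional values, not anything the thread specifies.

```python
# Compare moving a 4 KiB page from the ECS as one burst vs. touching
# it line-by-line under full coherence. All numbers are estimates.

PAGE_BYTES = 4096
LINE_BYTES = 64
LINES_PER_PAGE = PAGE_BYTES // LINE_BYTES  # 64 lines per 4 KiB page

burst_ns = 175.0          # one page-sized FBDIMM burst into on-package DRAM
coherent_line_ns = 175.0  # per-line access time if the whole ECS were coherent

# Line-by-line access under coherence pays the full latency once per
# line; the page burst pays it roughly once per page.
line_by_line_ns = LINES_PER_PAGE * coherent_line_ns

print(f"page burst:   {burst_ns:.0f} ns")
print(f"line by line: {line_by_line_ns:.0f} ns "
      f"({line_by_line_ns / burst_ns:.0f}x slower)")
```

With these numbers the large-unit transfer wins by a factor of 64, which
is why the unit size, not the raw scale, is doing the work.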


Regards,
Nick Maclaren.
From: Robert Myers
On Mar 2, 1:01 pm, n...(a)cam.ac.uk wrote:
> In article <462899e1-6298-4e2a-918f-733cfa759...(a)g19g2000yqe.googlegroups.com>,
> Robert Myers  <rbmyers...(a)gmail.com> wrote:
>
> >On Mar 2, 12:19 pm, MitchAlsup <MitchAl...(a)aol.com> wrote:
>
> >> >     2) Put the memory back-to-back with the CPU, factory integrated,
> >> > thus releasing all existing memory pins for I/O use.  Note that this
> >> > allows for VASTLY more memory pins/pads.
>
> >> I have been thinking along these lines...
>
> >> Consider a chip containing CPUs sitting in a package with a small-
> >> medium number of DRAM chips. The CPU and DRAM chips are orchestrated
> >> through an interface that exploits the on-die wire density that cannot
> >> escape the package boundary.
>
> >> A: make this DRAM the only part of the coherent memory
> >> B: use more conventional FBDIMM channels to an extended core storage
> >> C: perform all <disk, network, high speed> I/O to the ECS
> >> D: page ECS to the on die DRAM as a single page sized burst at FBDIMM
> >> speeds
> >> E: an efficient on-CPU-chip TLB shootdown mechanism <or coherent TLB>
>
> >> A page copy to an FBDIMM resident page would take about 150-200 ns;
> >> and this is about the access time of a single line if the whole ECS
> >> was made coherent!
>
> >> F: a larger ECS can be built <if desired> by implementing a FBDIMM
> >> multiplexer
>
> >How is any of this different from putting a huge "Level 4" cache on
> >the die, an idea I proposed long ago? Maybe it's only now that the
> >scale sizes make it a realistic option.
>
> It's different, because it is changing the interface from the chip
> (actually, package) to the outside world.  Basically, MUCH larger
> units and no attempt at low-level coherence.  Those are the keys
> to making it fly, and not the scale sizes.
>
Just to be clear.

I referred to it here as "Level 4" cache. What I proposed was putting
main memory on die. If you (as you probably will) want to use
conventional memory the way we now use disk, it wouldn't function in
the same way as memory now does, as you (very belatedly) point out.

I know, you thought of it all in 1935.

Robert.
From: Stephen Fuld
On 3/2/2010 9:56 AM, Robert Myers wrote:
> On Mar 2, 12:19 pm, MitchAlsup<MitchAl...(a)aol.com> wrote:
>> On Mar 2, 5:28 am, n...(a)cam.ac.uk wrote:
>>
>>> 2) Put the memory back-to-back with the CPU, factory integrated,
>>> thus releasing all existing memory pins for I/O use. Note that this
>>> allows for VASTLY more memory pins/pads.
>>
>> I have been thinking along these lines...
>>
>> Consider a chip containing CPUs sitting in a package with a small-
>> medium number of DRAM chips. The CPU and DRAM chips are orchestrated
>> through an interface that exploits the on-die wire density that cannot
>> escape the package boundary.
>>
>> A: make this DRAM the only part of the coherent memory
>> B: use more conventional FBDIMM channels to an extended core storage
>> C: perform all <disk, network, high speed> I/O to the ECS
>> D: page ECS to the on die DRAM as a single page sized burst at FBDIMM
>> speeds
>> E: an efficient on-CPU-chip TLB shootdown mechanism <or coherent TLB>
>>
>> A page copy to an FBDIMM resident page would take about 150-200 ns;
>> and this is about the access time of a single line if the whole ECS
>> was made coherent!
>>
>> F: a larger ECS can be built <if desired> by implementing a FBDIMM
>> multiplexer
>>
> How is any of this different from putting a huge "Level 4" cache on
> the die, an idea I proposed long ago? Maybe it's only now that the
> scale sizes make it a realistic option.

Well, some of the details of Mitch's proposal aren't clearly specified.
I can't tell if he intends the off-chip DRAM to be part of the
processor's address space or not. If it is, then the on-chip DRAM is
essentially a level 4 cache. But it could be that it isn't, in which
case, it is more like a fast paging device with some extra features.
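
The two readings can be sketched as toy address-translation policies.
Everything below is illustrative (class and field names are mine, and the
thread specifies neither mechanism); it only makes the distinction concrete.

```python
# Two readings of the proposal, as toy memory-access models.

PAGE = 4096  # conventional page size, not from the thread

class L4CacheModel:
    """ECS DRAM is in the processor's address space; the on-package
    DRAM transparently caches it, i.e. acts as a level-4 cache."""
    def __init__(self, ecs):
        self.ecs = ecs      # the full (coherent) address space
        self.cache = {}     # on-package DRAM, filled on demand
    def load(self, addr):
        if addr not in self.cache:           # miss: hardware fills it
            self.cache[addr] = self.ecs.get(addr, 0)
        return self.cache[addr]

class PagingDeviceModel:
    """ECS is not directly addressable; software pages whole 4 KiB
    units into on-package DRAM, the only coherent memory."""
    def __init__(self, ecs_pages):
        self.ecs_pages = ecs_pages  # page_no -> bytearray held in the ECS
        self.resident = {}          # pages currently in on-package DRAM
    def load(self, addr):
        page_no, off = divmod(addr, PAGE)
        if page_no not in self.resident:     # "page fault": one burst copy
            self.resident[page_no] = self.ecs_pages[page_no]
        return self.resident[page_no][off]
```

In the first model a miss moves one line and coherence spans the whole
ECS; in the second a fault moves a whole page and coherence stops at the
package boundary.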


--
- Stephen Fuld
(e-mail address disguised to prevent spam)