From: David L. Craig on
On Jul 20, 11:31 am, Andy Glew <"newsgroup at comp-arch.net">
wrote:

> We welcome new blood, and new ideas.

These are new ideas? I hope not.

> I'm with you, David. Maximizing what I call the MLP, the
> memory level parallelism, the number of DRAM accesses that
> can be concurrently in flight, is one of the things that
> we can do.

> Me, I'm just the MLP guy: give me a certain number of
> channels and bandwidth, I try to make the best use of
> them. MLP is one of the ways of making more efficient
> use of whatever limited bandwidth you have. I guess that's
> my mindset - making the most of what you have. Not because
> I don't want to increase the overall memory bandwidth.
> But because I don't have any great ideas on how to do so,
> apart from
> a) More memory channels
> b) Wider memory channels
> c) Memory channels/DRAMs that handle short bursts/high
> address bandwidth efficiently
> d) DRAMs with a high degree of internal banking
> e) aggressive DRAM scheduling
> Actually, c,d,e are really ways of making more efficient
> use of bandwidth, i.e. preventing pins from going idle
> because the burst length is giving you a lot of data you
> don't want.
> f) stacking DRAMs
> g) stacking DRAMs with an interface chip such as Tom
> Pawlowski of Micron proposes, and a new abstract
> DRAM interface, enabling all of the good stuff
> above but keeping DRAM a commodity
> h) stacking DRAMs with an interface chip and a
> processor chip (with however many processors you
> care to build).
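
To put the MLP idea in software terms, here is a minimal C sketch (mine,
not from Andy's post; the names and sizes are made up). A pointer chase can
keep only one DRAM access in flight at a time, while a loop over independent
streams lets an out-of-order core or a prefetcher overlap many misses.

/* Illustrative sketch only: contrasts low and high memory-level
   parallelism (MLP).  Names and sizes are invented for the example. */
#include <stddef.h>

struct node { struct node *next; long pad[7]; };

/* Low MLP: each load depends on the previous one, so at most one
   DRAM access is outstanding at a time (a pointer chase). */
long chase(struct node *p, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += p->pad[0];
        p = p->next;        /* next address unknown until this load returns */
    }
    return sum;
}

/* High MLP: the four streams are independent, so several cache
   misses can be outstanding at once. */
long stream4(const long *a, const long *b, const long *c, const long *d,
             size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] + b[i] + c[i] + d[i];   /* independent misses each iteration */
    return sum;
}

Items d) and e) in the list above are largely about letting the DRAMs
actually service several such concurrent accesses in different banks.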

If we're talking about COTS design, FP bandwidth is
probably not the area in which to increase production
costs for better performance. As Mitch Alsup observed
a little after the post I've been quoting became
available:

> We are at the point where, even when the L2 cache
> supplies data, there are too many latency cycles for
> the machine to be able to efficiently strip mine
> data. {And in most cases the cache hierarchy is not
> designed to efficiently strip mine data, either.}

Have performance runs with various caches disabled
indicated that any gains could be realized there? If so,
I think that makes the case for adding circuits to
increase RAM parallelism as the cores fight it out for
timely data-in and data-out operations.
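
Short of actually disabling the caches, one crude software analogue on
x86 is to strip mine with non-temporal stores so the write traffic
bypasses the hierarchy; comparing such a loop against the ordinary
cached version gives a rough feel for what the cache is or is not
buying you. A minimal sketch, SSE2-specific; the function name and the
16-byte alignment assumption are mine, not from the thread:

/* Sketch only: copies data with non-temporal stores so the destination
   does not displace useful lines from the cache hierarchy.  Assumes
   16-byte-aligned pointers; x86/SSE2 specific. */
#include <emmintrin.h>
#include <stddef.h>

void copy_nontemporal(double *dst, const double *src, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {               /* 16 bytes per store */
        __m128d v = _mm_load_pd(&src[i]);      /* aligned load from source */
        _mm_stream_pd(&dst[i], v);             /* store that bypasses the caches */
    }
    for (; i < n; i++)                         /* scalar tail */
        dst[i] = src[i];
    _mm_sfence();                              /* make the streamed stores visible */
}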

If we're talking about custom, never-mind-the-cost
designs, then that's the stuff that should make this
a really fun group.
From: jacko on
reality rnter, thr eniene pooj descn to lan dern turdil/


Sorry, I must STOP giving the motostest.

I'd love for it to mux long data. I can't see how it frows to the
tend to stuff. Chad'ict? I do know that not writing is good.
From: Robert Myers on
On Jul 20, 1:49 pm, "David L. Craig" <dlc....(a)gmail.com> wrote:

> If we're talking about custom, never-mind-the-cost
> designs, then that's the stuff that should make this
> a really fun group.

If no one ever goes blue sky and asks: what is even physically
possible without worrying what may or may not be already in the works
at Intel, then we are forever limited, even in the imagination, to
what a marketdroid at Intel believes can be sold at Intel's customary
margins. There is always IBM, of course, and AMD seems willing to try
anything that isn't guaranteed to put it out of business, but, for the
most part, the dollars just aren't there, unless the government
supplies them.

As far as I'm concerned, the roots of the current climate for HPC can
be found in some DoD memos from the early nineties. I'm pretty sure I
have already offered links to some of those documents here.

In all fairness to those memos and to the semiconductor industry in
the US, the markets have delivered well beyond the limits I feared
when those memos first came out. I doubt if mass-market x86
hypervisors ever crossed the imagination at IBM, even as the
barbarians were at the gates.

Also, to be fair to markets, the cost-no-object exercises the
government undertook even after those early 90's memos delivered
almost nothing of any real use. Lots of money has been squandered on
some really dumb ideas. The national labs and others have tried the
same idea (glorified Beowulf) with practically every plausible
processor and interconnect on offer and pretty much the same result
(90%+ efficiency for Linpack, 10% for anything even slightly more
interesting).

Moving the discussion to some place slightly less visible than
comp.arch might not produce more productive flights of fancy, but I,
for one, am interested in what is physically possible and not just
what can be built with the consent of Sen. Mikulski--a lady I have
always admired, to be sure, from her earliest days in politics, just
not the person I'd cite as intellectual backup for technical
decisions.

Robert.

From: jacko on
On 20 July, 18:49, "David L. Craig" <dlc....(a)gmail.com> wrote:
> [snip]
>
> If we're talking about custom, never-mind-the-cost
> designs, then that's the stuff that should make this
> a really fun group.

Why want in an explicit eans be< (short) all functors line up to align.
From: MitchAlsup on
An example of the subtle microarchitectural optimization that is in
Robert's favor was tried in one of my previous designs.

The L1 cache was organized around the width of the bus returning
from the on-die L2 cache.

The L2 cache was organized at the width of your typical multi-beat
cache line returning from main memory. Thus, one L2 cache line would
occupy 4 L1 cache sub-lines when fully 'in' the L1. Some horseplay in
the cache coherence protocol prevented incoherence.
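
For concreteness, the 4:1 relationship works out as below; the 64-byte
and 256-byte figures are assumed for illustration, since the actual
line sizes are not given here.

/* Which of the 4 L1 sub-lines of its L2 line an address falls in.
   Line sizes are assumed, not taken from the post. */
#include <stdint.h>
#include <stdio.h>

#define L1_LINE  64u                 /* assumed L1 (bus-width) sub-line, bytes */
#define L2_LINE  (4u * L1_LINE)      /* one L2 line = 4 L1 sub-lines */

int main(void)
{
    uint64_t addr = 0x12345;
    unsigned subline = (unsigned)((addr % L2_LINE) / L1_LINE);
    printf("address 0x%llx is in L1 sub-line %u of its L2 line\n",
           (unsigned long long)addr, subline);
    return 0;
}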

With the L1-to-L2 interface suitably organized, one could strip mine
data from the L2, through the L1, through the computation units, and
back to the L1. L1 victims were transferred back to the L2 as L2 data
arrived and was forwarded into execution.

Here, the execution window had to absorb only the L2 transfer delay
plus the floating-point computation delay, and for this that execution
window worked just fine. DAXPY and DGEMM on suitably sized vectors
would strip mine data footprints as big as the L2 cache at vector
rates.

Mitch
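
For concreteness, here is a minimal sketch of the kind of strip-mined
DAXPY being described; it is an illustration rather than Mitch's code,
and the strip length is a made-up stand-in for "fits comfortably in
the L1".

/* Strip-mined DAXPY sketch: the vectors are walked in strips so each
   strip streams L2-resident data through the L1 and the FP units.
   STRIP is an assumed value, not from the post. */
#include <stddef.h>

#define STRIP 1024

void daxpy_stripmined(size_t n, double a, const double *x, double *y)
{
    for (size_t s = 0; s < n; s += STRIP) {
        size_t end = (s + STRIP < n) ? s + STRIP : n;
        /* Within a strip the accesses are sequential and independent,
           so L2-to-L1 transfers can overlap with the multiply-adds. */
        for (size_t i = s; i < end; i++)
            y[i] = a * x[i] + y[i];
    }
}

With sequential, independent accesses inside each strip, the execution
window only has to cover the L2 transfer latency plus the FP latency,
which is the effect described above.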