From: Terje Mathisen on
Jean wrote:
> What I was describing is a design in which the cache exists only for
> the data. In the fetch stage of the pipeline, instructions are fetched
> from stream buffers and never go to cache: Fetch --- Stream buffer ---
> Main memory. This should obviously reduce the delay of the fetch stage,
> because the heads of the stream buffers won't contribute too much delay
> compared to a cache.

The huge problem here is that most code contains loops!

This means that you'll have to reload the instructions from RAM for
every iteration, taking away major parts of the bandwidth needed for
data load/store.

In fact, pretty much all programs, except for a few vector codes and
things like REP MOVS block moves, will require a _lot_ more instruction
bytes than data bytes.
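A back-of-envelope model of that bandwidth argument; every figure below (encoding width, loop-body size, data-access counts) is an illustrative assumption, not a measurement of any real program:

```python
# Instruction vs. data traffic for one hot loop when instructions are
# never cached and must be re-streamed from RAM on every iteration.

INSN_BYTES = 4          # fixed 32-bit encoding (assumed)
body_insns = 20         # instructions in the loop body (assumed)
iterations = 1_000_000  # times the loop runs
loads_stores = 2        # data accesses per iteration (assumed)
DATA_BYTES = 8          # bytes moved per data access (assumed)

# With no I-cache, the whole body is re-fetched from RAM each iteration.
insn_traffic = body_insns * INSN_BYTES * iterations
data_traffic = loads_stores * DATA_BYTES * iterations

print(f"instruction bytes: {insn_traffic:,}")
print(f"data bytes:        {data_traffic:,}")
print(f"ratio:             {insn_traffic / data_traffic:.1f}x")
```

Even this modest 20-instruction body moves five times as many instruction bytes as data bytes, and any structure that retains the loop body (loop buffer or I-cache) makes almost all of that traffic disappear.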

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Joe Pfeiffer on
Jean <alertjean(a)rediffmail.com> writes:

> On Oct 15, 12:29 pm, Joe Pfeiffer <pfeif...(a)cs.nmsu.edu> wrote:
>> Jean <alertj...(a)rediffmail.com> writes:
>> > On Oct 15, 8:24 am, n...(a)cam.ac.uk wrote:
>> >> In article <89e448f1-612e-49f7-90c3-15c5c414c...(a)j19g2000yqk.googlegroups.com>,
>>
>> >> Jean <alertj...(a)rediffmail.com> wrote:
>>
>> >> >Can't we build a pipelined machine with no I-cache and only use
>> >> >multiple stream buffers (prefetch queues) and a D-cache instead?
>> >> >Won't the sequential characteristics of instructions make it
>> >> >work? Comments!
>>
>> >> Yes. It's been done. It worked. Look up 'unified cache'.
>>
>> >> Regards,
>> >> Nick Maclaren.
>>
>> > Isn't a system with a unified cache one which only has a single cache
>> > for holding both data and instructions? Instructions will still be
>> > fetched from cache... right?
>>
>> > What I was describing is a design in which the cache exists only for
>> > the data. In the fetch stage of the pipeline, instructions are fetched
>> > from stream buffers and never go to cache: Fetch --- Stream buffer ---
>> > Main memory. This should obviously reduce the delay of the fetch stage,
>> > because the heads of the stream buffers won't contribute too much delay
>> > compared to a cache.
>>
>> > Jean
>>
>> Consider the number of cycles required to fetch from main memory to your
>> stream buffer.
>
> Cycles will always be there even if it were a cache.

How big do you intend your stream buffers to be? How many do you intend
to have? How many instructions in advance do you intend to start
prefetching?
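Those sizing questions reduce to rough arithmetic; the latency, issue rate, and line size below are illustrative assumptions, not figures for any real machine:

```python
# Rough sizing for a stream buffer: to keep the fetch stage from
# starving, prefetch must run at least latency * issue-rate
# instructions ahead of execution.

mem_latency_cycles = 100   # cycles to get a line from RAM (assumed)
issue_per_cycle = 1        # scalar pipeline (assumed)
insns_per_line = 8         # 32-byte line / 4-byte instructions (assumed)

depth_insns = mem_latency_cycles * issue_per_cycle
depth_lines = -(-depth_insns // insns_per_line)   # ceiling division

print(f"minimum useful depth: {depth_insns} instructions "
      f"({depth_lines} lines)")
```

And that depth is needed per stream: every taken branch to a new target either needs its own buffer or throws the current one away.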
--
As we enjoy great advantages from the inventions of others, we should
be glad of an opportunity to serve others by any invention of ours;
and this we should do freely and generously. (Benjamin Franklin)
From: Del Cecchi on

"Terje Mathisen" <Terje.Mathisen(a)tmsw.no> wrote in message
news:GpCdnYwPbqS980rXnZ2dnUVZ8kudnZ2d(a)lyse.net...
> Jean wrote:
>> What I was describing is a design in which the cache exists only for
>> the data. In the fetch stage of the pipeline, instructions are fetched
>> from stream buffers and never go to cache: Fetch --- Stream buffer ---
>> Main memory. This should obviously reduce the delay of the fetch stage,
>> because the heads of the stream buffers won't contribute too much delay
>> compared to a cache.
>
> The huge problem here is that most code contains loops!
>
> This means that you'll have to reload the instructions from RAM for
> every iteration, taking away major parts of the bandwidth needed for
> data load/store.
>
> In fact, pretty much all programs, except for a few vector codes and
> things like REP MOVS block moves, will require a _lot_ more
> instruction bytes than data bytes.
>
> Terje
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Just leave the instructions in the stream buffer until they aren't
needed any more. Maybe put them back into the stream after they are
executed. If they get to the head and aren't needed, then flush them.
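A minimal sketch of that recycle-the-stream idea: executed instructions go back to the tail of the buffer, so a short loop keeps hitting at the head and never returns to RAM. The `RecyclingStreamBuffer` class, the 4-byte encoding, and the one-line-per-miss RAM interface are all made-up assumptions for illustration:

```python
from collections import deque

LINE = 4  # instructions streamed from RAM per miss (assumed)

class RecyclingStreamBuffer:
    def __init__(self, size, fetch_word):
        self.buf = deque(maxlen=size)
        self.fetch_word = fetch_word    # fetch one instruction from RAM
        self.ram_lines = 0              # count of RAM line fetches

    def next_insn(self, pc):
        if self.buf and self.buf[0][0] == pc:
            entry = self.buf.popleft()          # hit at the head
        else:
            self.buf.clear()                    # wrong path: flush
            self.ram_lines += 1
            for i in range(LINE):               # stream a line from RAM
                addr = pc + 4 * i
                self.buf.append((addr, self.fetch_word(addr)))
            entry = self.buf.popleft()
        self.buf.append(entry)                  # recycle after execution
        return entry[1]

sb = RecyclingStreamBuffer(8, lambda a: f"insn@{a}")
for _ in range(10):                 # a 4-instruction loop, 10 iterations
    for pc in (0, 4, 8, 12):
        sb.next_insn(pc)
print(sb.ram_lines)                 # loop body streamed from RAM once
```

Forty instructions execute but RAM is touched once, because the whole body fits in the buffer; a body larger than the buffer degenerates to refetching every iteration.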

del


From: Rob Warnock on
Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
+---------------
| Jean wrote:
| > What I was describing is a design in which the cache exists only for
| > the data. In the fetch stage of the pipeline, instructions are fetched
| > from stream buffers and never go to cache: Fetch --- Stream buffer ---
| > Main memory. This should obviously reduce the delay of the fetch stage,
| > because the heads of the stream buffers won't contribute too much delay
| > compared to a cache.
+---------------

We've been there, done that: See the AMD Am29000 [circa 1988].

+---------------
| The huge problem here is that most code contains loops!
|
| This means that you'll have to reload the instructions from RAM for
| every iteration, taking away major parts of the bandwidth needed for
| data load/store.
+---------------

That's why the Am29000 [the original 29000, not the later 29030 et seq]
had a 4-word Branch Target Cache. The BTC could (usually) cover the
latency of getting the I-bus stream restarted after a branch.
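A toy model of how a small BTC covers that restart gap; the 4-word entry size follows the description above, but the restart latency figure is an assumption for illustration:

```python
# On a taken branch, the BTC supplies the first few target instructions
# (one per cycle) while the sequential I-bus stream restarts from memory.

BTC_WORDS = 4           # target instructions held per BTC entry
RESTART_LATENCY = 4     # cycles to restart the I-bus stream (assumed)

def branch_penalty(target_in_btc):
    # Each BTC word issued hides one cycle of the restart latency.
    hidden = BTC_WORDS if target_in_btc else 0
    return max(0, RESTART_LATENCY - hidden)

print(branch_penalty(True))    # BTC hit: restart latency fully covered
print(branch_penalty(False))   # BTC miss: full restart latency exposed
```

With these numbers a BTC hit costs nothing and a miss exposes the full restart; a slower memory system than assumed would leave a residual penalty even on hits.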

I really liked the Am29000, with its unified A-bus but separate I- and
D-busses that could be doing streaming simultaneously. Using VDRAM
("Video DRAM") as its main memory, providing the I-bus stream out
the VDRAM's serial port and the D-bus on the normal parallel port,
was a nice local minimum in architecture space, however temporary
that sweet spot turned out to be in the market [roughly 1988-1992].
And using triple-port V2DRAMs, dedicating the other serial port
to a streaming DMA engine, made the whole kit *really* sweet for
high-performance [for the day] peripheral & network controllers!

Of course, the Am29030, with its single unified bus and an I-cache
but no D-cache, blew the Am29000 out of the water just a few years
later [not to mention that V2DRAMs got hard to buy!], but still...


-Rob

-----
Rob Warnock <rpw3(a)rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

From: Robert Myers on
On Oct 16, 12:51 am, r...(a)rpw3.org (Rob Warnock) wrote:

>
> That's why the Am29000 [the original 29000, not the later 29030 et seq]
> had a 4-word Branch Target Cache. The BTC could (usually) cover the
> latency of getting the I-bus stream restarted after a branch.
>
> I really liked the Am29000, with its unified A-bus but separate I- and
> D-busses that could be doing streaming simultaneously. Using VDRAM
> ("Video DRAM") as its main memory, providing the I-bus stream out
> the VDRAM's serial port and the D-bus on the normal parallel port,
> was a nice local minimum in architecture space, however temporary
> that sweet spot turned out to be in the market [roughly 1988-1992].
> And using triple-port V2DRAMs, dedicating the other serial port
> to a streaming DMA engine, made the whole kit *really* sweet for
> high-performance [for the day] peripheral & network controllers!
>
> Of course, the Am29030, with its single unified bus and an I-cache
> but no D-cache, blew the Am29000 out of the water just a few years
> later [not to mention that V2DRAMs got hard to buy!], but still...

And people wonder why I am so annoyed at AMD for turning into Intel-me-
too.

Robert.