From: Terje Mathisen on
Del Cecchi wrote:
> "Terje Mathisen"<Terje.Mathisen(a)tmsw.no> wrote in message
>> The huge problem here is that most code contains loops!
>>
>> This means that you'll have to reload the instructions from RAM for
>> every iteration, taking away major parts of the bandwidth needed for
>> data load/store.
>>
>> In fact, pretty much all programs, except for a few vector codes and
>> things like REP MOVS block moves, will require a _lot_ more
>> instruction bytes than data bytes.
>>
>
> Just leave the instructions in the stream buffer until they aren't
> needed any more. Maybe put them back into the stream after they are
> executed. If they get to the head and aren't needed then flush them.

Thanks for piping in with my next prompt. :-)

As soon as you allow (short?) loops to execute out of the FIFO
(instruction stream buffer), you have reinvented the very first, not
very efficient, instruction cache.

I.e. either you reload every instruction byte from memory for each and
every invocation, or you have just reinstated the I-cache.
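
To put rough numbers on the reload case, here is a toy calculation
(loop size and trip count invented purely for illustration):

/* Toy comparison of instruction-fetch traffic for a small loop,
 * with and without retaining the loop body on chip. All numbers
 * are invented for illustration. */
#include <stdio.h>

int main(void)
{
    const long loop_bytes = 64;      /* assumed loop body size */
    const long iterations = 1000000; /* assumed trip count */

    /* Pure FIFO stream buffer: instructions are discarded once
     * issued, so every iteration refetches the body from RAM. */
    long fifo_traffic = loop_bytes * iterations;

    /* Anything that retains the body (i.e. an I-cache) fetches
     * it from RAM once. */
    long cached_traffic = loop_bytes;

    printf("FIFO refetch: %ld bytes from RAM\n", fifo_traffic);
    printf("Retained:     %ld bytes from RAM\n", cached_traffic);
    return 0;
}

Sixty-four million bytes of instruction fetch against sixty-four, all
of it competing with data load/store for the same memory bus.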

The only possible exception to the need for an I-cache is when you have
a single huge/complicated path of execution which happens to be larger
than the total size of the instruction cache(s): every fetch misses
anyway, so the cache adds overhead without ever supplying a hit.

At least for a while, this often happened when running database engines,
resulting in a situation where it could be faster to run on a cache-less
architecture.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Terje Mathisen on
Rob Warnock wrote:
> Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
> +---------------
> | Jean wrote:
> |> What I was describing is a design in which the cache exists only for
> |> the data. In the fetch stage of the pipeline, instructions are fetched
> |> from stream buffers and never go to a cache: Fetch -- Stream buffer --
> |> Main memory. This obviously should reduce the delay of the fetch stage,
> |> because the heads of the stream buffers won't contribute too much delay
> |> compared to a cache.
> +---------------
>
> We've been there, done that: See the AMD Am29000 [circa 1988].
>
> +---------------
> | The huge problem here is that most code contains loops!
> |
> | This means that you'll have to reload the instructions from RAM for
> | every iteration, taking away major parts of the bandwidth needed for
> | data load/store.
> +---------------
>
> That's why the Am29000 [the original 29000, not the later 29030 et seq]
> had a 4-word Branch Target Cache. The BTC could (usually) cover the
> latency of getting the I-bus stream restarted after a branch.

Which dates the architecture to the time when RAM and CPU had comparable
speeds.

Besides, the BTC is/was just another form of I-cache, right? :-)
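
For anyone who hasn't met one, a tiny BTC amounts to something like
this (entry count, indexing and fill policy are all assumed for
illustration -- the real Am29000 arrangement may well have differed):

/* Sketch of a tiny branch target cache: on a taken branch, a hit
 * supplies the first few words at the target while the external
 * instruction stream is restarted. Sizes and organization are
 * assumptions for illustration, not the Am29000's actual design. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define BTC_ENTRIES     4
#define WORDS_PER_ENTRY 4

struct btc_entry {
    bool     valid;
    uint32_t tag;                    /* branch target address */
    uint32_t word[WORDS_PER_ENTRY];  /* first instructions at target */
};

static struct btc_entry btc[BTC_ENTRIES];

static unsigned btc_index(uint32_t target)
{
    return (target >> 4) % BTC_ENTRIES;  /* 16 bytes per entry */
}

/* Taken branch: a hit keeps the pipeline fed during the restart. */
static const uint32_t *btc_lookup(uint32_t target)
{
    struct btc_entry *e = &btc[btc_index(target)];
    return (e->valid && e->tag == target) ? e->word : NULL;
}

/* Miss: fill the entry as the words arrive from memory. */
static void btc_fill(uint32_t target, const uint32_t w[WORDS_PER_ENTRY])
{
    struct btc_entry *e = &btc[btc_index(target)];
    e->valid = true;
    e->tag   = target;
    for (int i = 0; i < WORDS_PER_ENTRY; i++)
        e->word[i] = w[i];
}

int main(void)
{
    uint32_t body[WORDS_PER_ENTRY] = { 1, 2, 3, 4 };

    printf("first pass:  %s\n", btc_lookup(0x1000) ? "hit" : "miss");
    btc_fill(0x1000, body);
    printf("second pass: %s\n", btc_lookup(0x1000) ? "hit" : "miss");
    return 0;
}

Which is, of course, exactly an I-cache with a four-word line and a
branch-target-only fill policy.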

> I really liked the Am29000, with its unified A-bus but separate I- and
> D-busses that could be doing streaming simultaneously. Using VDRAM
> ("Video DRAM") as its main memory, providing the I-bus stream out
> the VDRAM's serial port and the D-bus on the normal parallel port,
> was a nice local minimum in architecture space, however temporary
> that sweet spot turned out to be in the market [roughly 1988-1992].

Yeah, I do remember those days.

> And using triple-port V2DRAMs, dedicating the other serial port
> to a streaming DMA engine, made the whole kit *really* sweet for
> high-performance [for the day] peripheral & network controllers!

Didn't some people use the same CPU as a fancy graphics co-processor?
>
> Of course, the Am29030, with its single unified bus and an I-cache
> but no D-cache, blew the Am29000 out of the water just a few years
> later [not to mention that V2DRAMs got hard to buy!], but still...

:-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: nmm1 on
In article <7jq44oF36ju1vU1(a)mid.individual.net>,
Del Cecchi <delcecchiofthenorth(a)gmail.com> wrote:
>
>Just leave the instructions in the stream buffer until they aren't
>needed any more. Maybe put them back into the stream after they are
>executed. If they get to the head and aren't needed then flush them.

Ah. A circular, rather than linear, stream buffer. Clearly good
news for loops, but less clearly good for other codes.
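
Something along these lines, I assume (the ring size and the trivial
bookkeeping are invented for illustration, not taken from any real
hardware):

/* Sketch of the circular stream buffer: executed instructions stay
 * in the ring, so a loop that fits in BUF_WORDS replays from the
 * buffer instead of from RAM; a loop that doesn't fit overwrites
 * its own head and degenerates to refetching everything. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define BUF_WORDS 64  /* assumed ring capacity, in 4-byte words */

struct stream_buf {
    uint32_t base;  /* address of the oldest word still held */
    uint32_t next;  /* one past the newest word fetched */
};

/* Can this pc be issued from the ring, with no RAM access? */
static bool in_buffer(const struct stream_buf *b, uint32_t pc)
{
    return pc >= b->base && pc < b->next;
}

/* Fetch one word from RAM; old words fall off the tail. */
static void fetch_word(struct stream_buf *b, uint32_t pc)
{
    b->next = pc + 4;
    if (b->next - b->base > BUF_WORDS * 4)
        b->base = b->next - BUF_WORDS * 4;
}

int main(void)
{
    struct stream_buf b = { 0x1000, 0x1000 };
    long ram_fetches = 0;

    /* A 16-word loop body executed 10 times. */
    for (int iter = 0; iter < 10; iter++)
        for (uint32_t pc = 0x1000; pc < 0x1000 + 16 * 4; pc += 4) {
            if (!in_buffer(&b, pc)) {
                ram_fetches++;
                fetch_word(&b, pc);
            }
        }

    printf("RAM fetches: %ld (would be 160 with a pure FIFO)\n",
           ram_fetches);
    return 0;
}

For straight-line code it buys nothing over a plain FIFO, and a loop
one word too big for the ring buys nothing either -- hence the "less
clearly good for other codes".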


Regards,
Nick Maclaren.
From: Del Cecchi on

"Terje Mathisen" <Terje.Mathisen(a)tmsw.no> wrote in message
news:m6udnfDJ8sb4iUXXnZ2dnUVZ8m6dnZ2d(a)lyse.net...
> Del Cecchi wrote:
>> "Terje Mathisen"<Terje.Mathisen(a)tmsw.no> wrote in message
>>> The huge problem here is that most code contains loops!
>>>
>>> This means that you'll have to reload the instructions from RAM
>>> for
>>> every iteration, taking away major parts of the bandwidth needed
>>> for
>>> data load/store.
>>>
>>> In fact, pretty much all programs, except for a few vector codes
>>> and
>>> things like REP MOVS block moves will require a _lot_ more
>>> instruction bytes than data bytes.
>>>
>>
>> Just leave the instructions in the stream buffer until they aren't
>> needed any more. Maybe put them back into the stream after they
>> are
>> executed. If they get to the head and aren't needed then flush
>> them.
>
> Thanks for piping in with my next prompt. :-)
>
> As soon as you allow (short?) loops to execute out of the FIFO
> (instruction stream buffer), you have reinvented the very first, not
> very efficient, instruction cache.
>
> I.e. either you reload every instruction byte from memory for each
> and every invocation, or you have just reinstated the I-cache.

:-) So I have.