From: Jürgen Böhm on

Hi all,

I am about to design a CPU, specialized for execution of compiled Lisp code.
The basic design that proved useful in writing the compiler was that
of a machine with two stacks: data stack and code stack indexed by two
stack pointers *dsp* and *csp* respectively. These point into memory.
Additionally there is a so called "template pointer" *tp*, which points
to a template (an aggregate data structure in memory which holds code
and data local to the code).

Data and Address path are 32 bit wide, memory can arbitrarily hold
code and data, there is no virtual memory support and no cache (this is
my first attempt at designing a CPU and it is currently more a hobby
project). The instructions are 32 bit, the higher 16 bit containing the
operation code (maybe later some bits of these 16 are assigned to
additional tasks as 8 bit will suffice to hold all opcodes). The lower
16 bit contain a "data extension" d, that is used in some operations:

JMP d : jumps relative 4 * d bytes
LOD d : pushes data from (*dsp*+4*d) to top of data stack,
(LOD 0 is DUP in FORTH notation)
STO d : stores top of stack in (*dsp*+4*d) and adds 4 to *dsp* (stacks
grow downwards)

all operations (+ - * car cdr ..) are done with the top of data-stack
elements in the way an UPN calculator (or FORTH) works.

Using this design, I have already implemented a basic compiler and an
emulator for the CPU and as a next step want to build the CPU in Verilog
for an FPGA.

That was a long foreword to come to my question, which is this: As you
noticed, my machine has few *internal registers* it does all the stack
handling for function calling in memory (via data-stack). But as I have
learned, a lot of "stack processors" instead try to keep all the stack
manipulations related to function calling internally with register sets
and windows like the SPARC processors.
This seems to imply that in practice there will never be a large
depth of recursive calls attained during program execution. Namely, if
recursion depth led to the necessity of spilling registers to memory on
every function call, there should be no gain in execution speed about
working in memory from the beginning on, as in my design.
Is this correctly thought? Maybe you can give me some hints on
recommended reading for someone starting a processor design too,
especially in the realm of stack machine architecture (the only real
textbook I (partly) read is the internet text of Prof. Koopman on stack
machines).

Greetings

J�rgen


--
J�rgen B�hm www.aviduratas.de
"At a time when so many scholars in the world are calculating, is it not
desirable that some, who can, dream ?" R. Thom
From: robertwessel2 on
On Feb 1, 4:57 pm, Jürgen Böhm <jbo...(a)gmx.net> wrote:
> That was a long foreword to come to my question, which is this: As you
> noticed, my machine has few *internal registers* it does all the stack
> handling for function calling in memory (via data-stack). But as I have
> learned, a lot of "stack processors" instead try to keep all the stack
> manipulations related to function calling internally with register sets
> and windows like the SPARC processors.
> This seems to imply that in practice there will never be a large
> depth of recursive calls attained during program execution. Namely, if
> recursion depth led to the necessity of spilling registers to memory on
> every function call, there should be no gain in execution speed about
> working in memory from the beginning on, as in my design.
> Is this correctly thought? Maybe you can give me some hints on
> recommended reading for someone starting a processor design too,
> especially in the realm of stack machine architecture (the only real
> textbook I (partly) read is the internet text of Prof. Koopman on stack
> machines).


If I understand your question, that's not correct. Many past stack
processors (especially the Forth oriented ones), had finite sized
hardware stack which were very fast compared to main memory), and
spilled them as necessary. Even a few calls worth of hardware stack
will lead to a substantial reduction in the amount of memory traffic.
In general, the falloff in memory traffic for spills is basically
exponential with the size of the hardware stack. Pathological cases
excepted, of course.

From: Robert Mabee on
robertwessel2(a)yahoo.com wrote:
> On Feb 1, 4:57 pm, Jürgen Böhm <jbo...(a)gmx.net> wrote:
>>That was a long foreword to come to my question, which is this: As you
>>noticed, my machine has few *internal registers* it does all the stack
>>handling for function calling in memory (via data-stack). But as I have
>>learned, a lot of "stack processors" instead try to keep all the stack
>>manipulations related to function calling internally with register sets
>>and windows like the SPARC processors.
>> This seems to imply that in practice there will never be a large
>>depth of recursive calls attained during program execution. Namely, if
>>recursion depth led to the necessity of spilling registers to memory on
>>every function call, there should be no gain in execution speed about
>>working in memory from the beginning on, as in my design.
> If I understand your question, that's not correct. Many past stack
> processors (especially the Forth oriented ones), had finite sized
> hardware stack which were very fast compared to main memory), and
> spilled them as necessary. Even a few calls worth of hardware stack
> will lead to a substantial reduction in the amount of memory traffic.

Even a single lazy top-of-stack register should be a big help. You
could even omit hardware spill in this case by making the compiler track
the supposed validity of this register, even to the point of having two
entry points to any function differing in whether the register is empty.

Of course you could put off the stack optimization for a second design
to keep the initial effort reasonable.

Have you looked into garbage collection? A major issue is
distinguishing numbers from pointers. You could tag the data or
keep pointers in distinct places, such as a separate pointer stack.

From: Jürgen Böhm on
Robert Mabee wrote:
>
> Even a single lazy top-of-stack register should be a big help. You
> could even omit hardware spill in this case by making the compiler track
> the supposed validity of this register, even to the point of having two
> entry points to any function differing in whether the register is empty.
>
Please excuse my small knowledge, but what does it mean for a register
to be "lazy"? At the moment my compiler, generating code for an
expression, e.g.

(f 1 2)

it loads something that amounts to the address of the code of f, then 1
and then 2 on the stack. This makes three items. Then an instruction
CALLCLOS 3 calls the function f, which puts its result where the pointer
to f was originally. As far as I see, this would make three spillings of
the top of stack register necessary. An advantage would only show up,
when the stack cache (as I might call that) holds maybe 32, 64 or 128
words, when - hopefully - during a longer execution sequence there will
be no higher "peaks" in stack use due to nesting and recursion.


> Of course you could put off the stack optimization for a second design
> to keep the initial effort reasonable.
>
Probably I will have to follow this advice, although greater perfection
is always more satisfying. At the moment my main problem is to find the
right strategy for the design. The first step seems to make a block
diagram and then find out which data moves have to occur in sequence for
every instruction planned (is this called a RTL level description?).
Then a Verilog program for this must be written, which will probably
include state machines (is that much of a difference to
microprogramming, if one excludes the case of writable microcode store?).
But what I understand the least at the moment, is how to test the
design with a testbench and a Verilog simulator. I have done some
simpler Verilog designs already (a text only output "VGA card", a PS/2
port driver and such) which didn't need a testbench driver more
complicated than a clock-generator.
As far as I understand the simulation proces, one could write in
Verilog a simulator for the computer memory that is essentially an
array, filled with data given by initialization files. So in principle I
could compile some simple lisp program to my own emulator's array, read
that out and prepare the necessary initialization file for the Verilog
simulator software. Then the simulated processor will have to prove its
validity executing actual code on actual data.
Do you know a good source (on the net or book) where such a process
is well documented (example files, step by step proceeding)?


> Have you looked into garbage collection? A major issue is
> distinguishing numbers from pointers. You could tag the data or
> keep pointers in distinct places, such as a separate pointer stack.
>
>
Yes I have read a lot about garbage collecting algorithms. At first I
will implement a (stop-and-)copying garbage collector with semi-spaces
and "flip" using the Cheney algorithm. Maybe later I will improve it
following a paper by Baker (see: Uniprocessor Garbage Collection
Techniques, by Paul R. Wilson for references to the cited algorithms).
Of course I have therefore designed a tagging scheme (and incorporated
in my runtime-system) that is compatible with these algorithms (it is
pretty much standard).
Concerning stacks, it is important for GC that data stack and
(return-)address stack are separated in my design. On the data stack
there will be, at least at the moment where GC is called, only tagged
data of valid type, which is part of the Root-Set of GC. Additionally
the two stack approach makes some intermediate shuffling on
the data stack possible, as data can be temporarily moved to the
address-stack and back when there is no danger of GC occuring or the
moved data does not belong to the Root-Set (e.g. Fixnums, Characters).

Greetings

J�rgen


--
J�rgen B�hm www.aviduratas.de
"At a time when so many scholars in the world are calculating, is it not
desirable that some, who can, dream ?" R. Thom
From: robertwessel2 on
On Feb 2, 5:32 pm, Jürgen Böhm <jbo...(a)gmx.net> wrote:
> Robert Mabee wrote:
>
> > Even a single lazy top-of-stack register should be a big help. You
> > could even omit hardware spill in this case by making the compiler track
> > the supposed validity of this register, even to the point of having two
> > entry points to any function differing in whether the register is empty.
>
> Please excuse my small knowledge, but what does it mean for a register
> to be "lazy"? At the moment my compiler, generating code for an
> expression, e.g.


I'm pretty sure he means that spills are lazy. IOW, you only spill to
memory when you have to (when the hardware stack is full), and reload
from memory when the stack is empty.


> (f 1 2)
>
> it loads something that amounts to the address of the code of f, then 1
> and then 2 on the stack. This makes three items. Then an instruction
> CALLCLOS 3 calls the function f, which puts its result where the pointer
> to f was originally. As far as I see, this would make three spillings of
> the top of stack register necessary. An advantage would only show up,
> when the stack cache (as I might call that) holds maybe 32, 64 or 128
> words, when - hopefully - during a longer execution sequence there will
> be no higher "peaks" in stack use due to nesting and recursion.


Call depth resembles a one dimensional random walk, and tends to
bounce back and forth in a narrow band at high frequency, with a much
slower overall drift up and down. That's why a modest sized hardware
stack (backed by main memory spills) is so effective in eliminating
main memory accesses.

There's an old survey book on stack based processors, "Stack
Computers, The New Wave," by Koopman, that's available on line.

http://www.ece.cmu.edu/~koopman/stack_computers/

Section 6.4 shows some benchmarks (of toy applications), showing how
memory traffic is affected by the hardware stack sizes.

If you're designing a stack machine, you should at least skim it.