|
Prev: India
Next: what architectural features in DSP prevent DSP from executingcontrol code efficiently?
From: Jürgen Böhm on 1 Feb 2007 17:57 Hi all, I am about to design a CPU, specialized for execution of compiled Lisp code. The basic design that proved useful in writing the compiler was that of a machine with two stacks: data stack and code stack indexed by two stack pointers *dsp* and *csp* respectively. These point into memory. Additionally there is a so called "template pointer" *tp*, which points to a template (an aggregate data structure in memory which holds code and data local to the code). Data and Address path are 32 bit wide, memory can arbitrarily hold code and data, there is no virtual memory support and no cache (this is my first attempt at designing a CPU and it is currently more a hobby project). The instructions are 32 bit, the higher 16 bit containing the operation code (maybe later some bits of these 16 are assigned to additional tasks as 8 bit will suffice to hold all opcodes). The lower 16 bit contain a "data extension" d, that is used in some operations: JMP d : jumps relative 4 * d bytes LOD d : pushes data from (*dsp*+4*d) to top of data stack, (LOD 0 is DUP in FORTH notation) STO d : stores top of stack in (*dsp*+4*d) and adds 4 to *dsp* (stacks grow downwards) all operations (+ - * car cdr ..) are done with the top of data-stack elements in the way an UPN calculator (or FORTH) works. Using this design, I have already implemented a basic compiler and an emulator for the CPU and as a next step want to build the CPU in Verilog for an FPGA. That was a long foreword to come to my question, which is this: As you noticed, my machine has few *internal registers* it does all the stack handling for function calling in memory (via data-stack). But as I have learned, a lot of "stack processors" instead try to keep all the stack manipulations related to function calling internally with register sets and windows like the SPARC processors. This seems to imply that in practice there will never be a large depth of recursive calls attained during program execution. Namely, if recursion depth led to the necessity of spilling registers to memory on every function call, there should be no gain in execution speed about working in memory from the beginning on, as in my design. Is this correctly thought? Maybe you can give me some hints on recommended reading for someone starting a processor design too, especially in the realm of stack machine architecture (the only real textbook I (partly) read is the internet text of Prof. Koopman on stack machines). Greetings J�rgen -- J�rgen B�hm www.aviduratas.de "At a time when so many scholars in the world are calculating, is it not desirable that some, who can, dream ?" R. Thom
From: robertwessel2 on 1 Feb 2007 23:45 On Feb 1, 4:57 pm, Jürgen Böhm <jbo...(a)gmx.net> wrote: > That was a long foreword to come to my question, which is this: As you > noticed, my machine has few *internal registers* it does all the stack > handling for function calling in memory (via data-stack). But as I have > learned, a lot of "stack processors" instead try to keep all the stack > manipulations related to function calling internally with register sets > and windows like the SPARC processors. > This seems to imply that in practice there will never be a large > depth of recursive calls attained during program execution. Namely, if > recursion depth led to the necessity of spilling registers to memory on > every function call, there should be no gain in execution speed about > working in memory from the beginning on, as in my design. > Is this correctly thought? Maybe you can give me some hints on > recommended reading for someone starting a processor design too, > especially in the realm of stack machine architecture (the only real > textbook I (partly) read is the internet text of Prof. Koopman on stack > machines). If I understand your question, that's not correct. Many past stack processors (especially the Forth oriented ones), had finite sized hardware stack which were very fast compared to main memory), and spilled them as necessary. Even a few calls worth of hardware stack will lead to a substantial reduction in the amount of memory traffic. In general, the falloff in memory traffic for spills is basically exponential with the size of the hardware stack. Pathological cases excepted, of course.
From: Robert Mabee on 2 Feb 2007 15:30 robertwessel2(a)yahoo.com wrote: > On Feb 1, 4:57 pm, Jürgen Böhm <jbo...(a)gmx.net> wrote: >>That was a long foreword to come to my question, which is this: As you >>noticed, my machine has few *internal registers* it does all the stack >>handling for function calling in memory (via data-stack). But as I have >>learned, a lot of "stack processors" instead try to keep all the stack >>manipulations related to function calling internally with register sets >>and windows like the SPARC processors. >> This seems to imply that in practice there will never be a large >>depth of recursive calls attained during program execution. Namely, if >>recursion depth led to the necessity of spilling registers to memory on >>every function call, there should be no gain in execution speed about >>working in memory from the beginning on, as in my design. > If I understand your question, that's not correct. Many past stack > processors (especially the Forth oriented ones), had finite sized > hardware stack which were very fast compared to main memory), and > spilled them as necessary. Even a few calls worth of hardware stack > will lead to a substantial reduction in the amount of memory traffic. Even a single lazy top-of-stack register should be a big help. You could even omit hardware spill in this case by making the compiler track the supposed validity of this register, even to the point of having two entry points to any function differing in whether the register is empty. Of course you could put off the stack optimization for a second design to keep the initial effort reasonable. Have you looked into garbage collection? A major issue is distinguishing numbers from pointers. You could tag the data or keep pointers in distinct places, such as a separate pointer stack.
From: Jürgen Böhm on 2 Feb 2007 18:32 Robert Mabee wrote: > > Even a single lazy top-of-stack register should be a big help. You > could even omit hardware spill in this case by making the compiler track > the supposed validity of this register, even to the point of having two > entry points to any function differing in whether the register is empty. > Please excuse my small knowledge, but what does it mean for a register to be "lazy"? At the moment my compiler, generating code for an expression, e.g. (f 1 2) it loads something that amounts to the address of the code of f, then 1 and then 2 on the stack. This makes three items. Then an instruction CALLCLOS 3 calls the function f, which puts its result where the pointer to f was originally. As far as I see, this would make three spillings of the top of stack register necessary. An advantage would only show up, when the stack cache (as I might call that) holds maybe 32, 64 or 128 words, when - hopefully - during a longer execution sequence there will be no higher "peaks" in stack use due to nesting and recursion. > Of course you could put off the stack optimization for a second design > to keep the initial effort reasonable. > Probably I will have to follow this advice, although greater perfection is always more satisfying. At the moment my main problem is to find the right strategy for the design. The first step seems to make a block diagram and then find out which data moves have to occur in sequence for every instruction planned (is this called a RTL level description?). Then a Verilog program for this must be written, which will probably include state machines (is that much of a difference to microprogramming, if one excludes the case of writable microcode store?). But what I understand the least at the moment, is how to test the design with a testbench and a Verilog simulator. I have done some simpler Verilog designs already (a text only output "VGA card", a PS/2 port driver and such) which didn't need a testbench driver more complicated than a clock-generator. As far as I understand the simulation proces, one could write in Verilog a simulator for the computer memory that is essentially an array, filled with data given by initialization files. So in principle I could compile some simple lisp program to my own emulator's array, read that out and prepare the necessary initialization file for the Verilog simulator software. Then the simulated processor will have to prove its validity executing actual code on actual data. Do you know a good source (on the net or book) where such a process is well documented (example files, step by step proceeding)? > Have you looked into garbage collection? A major issue is > distinguishing numbers from pointers. You could tag the data or > keep pointers in distinct places, such as a separate pointer stack. > > Yes I have read a lot about garbage collecting algorithms. At first I will implement a (stop-and-)copying garbage collector with semi-spaces and "flip" using the Cheney algorithm. Maybe later I will improve it following a paper by Baker (see: Uniprocessor Garbage Collection Techniques, by Paul R. Wilson for references to the cited algorithms). Of course I have therefore designed a tagging scheme (and incorporated in my runtime-system) that is compatible with these algorithms (it is pretty much standard). Concerning stacks, it is important for GC that data stack and (return-)address stack are separated in my design. On the data stack there will be, at least at the moment where GC is called, only tagged data of valid type, which is part of the Root-Set of GC. Additionally the two stack approach makes some intermediate shuffling on the data stack possible, as data can be temporarily moved to the address-stack and back when there is no danger of GC occuring or the moved data does not belong to the Root-Set (e.g. Fixnums, Characters). Greetings J�rgen -- J�rgen B�hm www.aviduratas.de "At a time when so many scholars in the world are calculating, is it not desirable that some, who can, dream ?" R. Thom
From: robertwessel2 on 2 Feb 2007 20:45 On Feb 2, 5:32 pm, Jürgen Böhm <jbo...(a)gmx.net> wrote: > Robert Mabee wrote: > > > Even a single lazy top-of-stack register should be a big help. You > > could even omit hardware spill in this case by making the compiler track > > the supposed validity of this register, even to the point of having two > > entry points to any function differing in whether the register is empty. > > Please excuse my small knowledge, but what does it mean for a register > to be "lazy"? At the moment my compiler, generating code for an > expression, e.g. I'm pretty sure he means that spills are lazy. IOW, you only spill to memory when you have to (when the hardware stack is full), and reload from memory when the stack is empty. > (f 1 2) > > it loads something that amounts to the address of the code of f, then 1 > and then 2 on the stack. This makes three items. Then an instruction > CALLCLOS 3 calls the function f, which puts its result where the pointer > to f was originally. As far as I see, this would make three spillings of > the top of stack register necessary. An advantage would only show up, > when the stack cache (as I might call that) holds maybe 32, 64 or 128 > words, when - hopefully - during a longer execution sequence there will > be no higher "peaks" in stack use due to nesting and recursion. Call depth resembles a one dimensional random walk, and tends to bounce back and forth in a narrow band at high frequency, with a much slower overall drift up and down. That's why a modest sized hardware stack (backed by main memory spills) is so effective in eliminating main memory accesses. There's an old survey book on stack based processors, "Stack Computers, The New Wave," by Koopman, that's available on line. http://www.ece.cmu.edu/~koopman/stack_computers/ Section 6.4 shows some benchmarks (of toy applications), showing how memory traffic is affected by the hardware stack sizes. If you're designing a stack machine, you should at least skim it.
|
Next
|
Last
Pages: 1 2 Prev: India Next: what architectural features in DSP prevent DSP from executingcontrol code efficiently? |