CPU design [FPGA]

Prev: JOP as SOPC component
Next: uclinux on spartan-3e starter kit

From: Frank Buss on 22 Aug 2006 18:06

jacko wrote:

> search for MSL16 as a compact example of stack machine, i would use
> slightly different ops, and things if i did it.

The paper at

http://www.cse.cuhk.edu.hk/~phwl/mt/public/archives/old/msl16/fccm98_fcpu.pdf

says it needs 175 CLBs on a Xilinx FPGA. And

http://www.xilinx.com/publications/xcellonline/xcell_48/xc_picoblaze48.htm

says that the PicoBlaze needs 76 slices (311 slices, if you add serial
ports and timers). I'm not sure if this is valid for every FPGA, but
somewhere I've read that 4 slices = 1 CLB, so the MSL16 needs about 700
slices, more than 9 times the logic of PicoBlaze. This is not my idea of a
small core.
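
As a quick check of that comparison, here is a minimal sketch. It assumes the 4 slices = 1 CLB rule of thumb holds (it differs between FPGA families), and the function and constant names are only illustrative:

```python
# Rough size comparison, assuming 1 CLB = 4 slices as stated in the post.
SLICES_PER_CLB = 4        # assumption; varies by FPGA family
MSL16_CLBS = 175          # figure quoted from the MSL16 paper
PICOBLAZE_SLICES = 76     # figure quoted from the Xcell article


def size_ratio(clbs: int, slices: int, slices_per_clb: int = SLICES_PER_CLB) -> float:
    """Return how many times larger `clbs` CLBs are than `slices` slices."""
    return clbs * slices_per_clb / slices


if __name__ == "__main__":
    ratio = size_ratio(MSL16_CLBS, PICOBLAZE_SLICES)
    print(f"MSL16 is roughly {ratio:.1f}x the size of PicoBlaze")
```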

--
Frank Buss, fb(a)frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

From: Frank Buss on 22 Aug 2006 21:00

Jim Granville wrote:

> One stack machine, that is still small, but could help greatly with
> software flows (being an already defined std)
> is the Instruction List language of IEC 61131-3
>
> http://www.3s-software.com/index.shtml?CoDeSys_IL
>
> and
>
> http://en.wikipedia.org/wiki/Instruction_list

This looks very interesting, because every command has only one operand,
which makes developing the core really easy and leaves much space for
addressing modes etc. in an opcode, even with 8 bit opcodes.

I'll try to mix this with my last addressing modes. I don't need "call",
because this is only a jump, where the return address is stored somewhere
(I don't need recursion).

4 bits: instruction
lda: load accu
sta: store accu
or: accu = accu or argument
xor: "
and: "
add: "
sub: "
cmp: "
bcc: branch if carry clear
bcs: branch if carry set
bne: branch if zero clear
beq: branch if zero set
jmp: jump
inp: read from port or special register (pc, flags, i/o ports, timer etc.)
outp: write to port or special register

I don't need it, but the last possible instruction could be rti, return
from interrupt, which restores pc, accu and the flags, which are saved on
interrupt entry. With inp/outp the interrupt address and frequency could be
set up.

4 bits: address mode (pc relative with a 16 bit argument doesn't make much
sense, so all useful combinations fit in 4 bits)

immediate, 8 bit argument
immediate, 16 bit argument
immediate, no arguments, #0
immediate, no arguments, #1

8 bit transfer width:
address, 8 bit argument
address, 16 bit argument
address, pc relative, 8 bit argument
address indirect, 8 bit argument
address indirect, 16 bit argument
address indirect, pc relative, 8 bit argument

16 bit transfer width:
address, 8 bit argument
address, 16 bit argument
address, pc relative, 8 bit argument
address indirect, 8 bit argument
address indirect, 16 bit argument
address indirect, pc relative, 8 bit argument

The "pc relative" address modes add the argument to the pc to get the
address. This can be used for short branches and jumps, but also for some
kind of local variables. Let's try the swap algorithm:

; swap 6 byte source and destination MACs
p1:   .dw 0
p2:   .dw 0
tmp:  .db 0
      lda #5
      sta p1      ; pc relative
      lda #11
      sta p2      ; pc relative
loop: lda (p1)    ; address indirect, pc relative
      sta tmp     ; address, pc relative
      lda (p2)    ; address indirect, pc relative
      sta (p1)    ; address indirect, pc relative
      lda tmp     ; address, pc relative
      sta (p2)    ; address indirect, pc relative
      lda p2      ; pc relative
      sub #1      ; one byte, because #1 needs no operand
      sta p2      ; pc relative
      lda p1      ; pc relative
      sub #1      ; one byte, because #1 needs no operand
      sta p1      ; pc relative
      bcc loop    ; pc relative

37 bytes
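
A minimal Python model of the loop above, to check the control flow and the byte count. It assumes that sub sets the carry flag on borrow, that sta leaves the flags untouched, and that every instruction is one opcode byte plus an 8 bit argument except "sub #1"; none of this is fixed by the post, and the names are illustrative.

```python
def swap_macs(frame: bytearray) -> bytearray:
    """Swap bytes 0..5 with bytes 6..11, mirroring the assembler loop."""
    p1, p2 = 5, 11                  # lda #5 / sta p1, lda #11 / sta p2
    while True:
        tmp = frame[p1]             # lda (p1) / sta tmp
        frame[p1] = frame[p2]       # lda (p2) / sta (p1)
        frame[p2] = tmp             # lda tmp  / sta (p2)
        p2 -= 1                     # lda p2 / sub #1 / sta p2
        borrow = p1 == 0            # sub #1 borrows only when p1 is 0
        p1 = (p1 - 1) & 0xFFFF      # lda p1 / sub #1 / sta p1 (16 bit wrap)
        if borrow:                  # bcc loop: loop only while carry is clear
            break
    return frame


# Size tally: 3 data items, then 17 instructions in program order.
DATA_BYTES = 2 + 2 + 1                                  # p1, p2, tmp
CODE_BYTES = [2] * 4 + [2] * 6 + [2, 1, 2, 2, 1, 2] + [2]
TOTAL_BYTES = DATA_BYTES + sum(CODE_BYTES)

if __name__ == "__main__":
    print(swap_macs(bytearray(range(12))).hex(), TOTAL_BYTES)
```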

This is not as good as my RISC idea (20 bytes), but the code is much easier
to understand: you need not think about it when reading and writing it.
But maybe this is only because some ages ago I wrote some demos and
intros on the C64 (6502), which uses a similar instruction set :-)

Do you think the core for this design would be smaller than PicoBlaze or my
RISC idea?

BTW: There are some nice constructs possible for smaller code, like using
some kind of zero-page, as implemented in the 6502, because the lda/sta
instructions could be used with 8 bit address arguments. But code size
and speed are not so important for me; a small core is more important, and
maybe easy-to-write assembler code, to avoid implementing a GCC backend for
my CPU.

--
Frank Buss, fb(a)frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

From: Jim Granville on 22 Aug 2006 22:19

Frank Buss wrote:
> Do you think the core for this design would be smaller than PicoBlaze or my
> RISC idea?

The core can certainly be made very small; it depends on the datatypes
you decide to support. I've been looking at putting the very similar, but
venerable, MC14500 ICU into CPLDs (effectively IL with only a Boolean type).

Note that the IL syntax allows brackets, and I think has an implicit
stack; a bit like reverse-polish calculators
- see this example I got from the web :

Example IL code, from the net ( derived from a ladder diagram ) :
Read as O:001/00 = I:000/00 AND ( I:000/01 OR ( I:000/02 AND NOT I:000/03) )

Label   Opcode  Operand     Comment
START:  LD      %I:000/00   (* Load input bit 00 *)
        AND(    %I:000/01   (* Start a branch and load input bit 01 *)
        OR(     %I:000/02   (* Load input bit 02 *)
        ANDN    %I:000/03   (* Load input bit 03 and invert *)
        )
        )
        ST      %O:001/00   (* SET the output bit 00 *)
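
One way to read the brackets is as deferred operators kept on a small stack. The sketch below is a toy interpreter under that assumption, covering only the opcodes used in the example; the function name and the tuple-based program format are hypothetical and not taken from any IL tool.

```python
def run_il(program, inputs):
    """Evaluate a tiny Boolean IL subset; return the dict of stored outputs."""
    acc = False          # current result (the accumulator)
    deferred = []        # (operator, saved accumulator) for each open bracket
    outputs = {}
    for opcode, operand in program:
        if opcode == "LD":
            acc = inputs[operand]
        elif opcode == "AND":
            acc = acc and inputs[operand]
        elif opcode == "ANDN":
            acc = acc and not inputs[operand]
        elif opcode == "OR":
            acc = acc or inputs[operand]
        elif opcode in ("AND(", "OR("):
            deferred.append((opcode[:-1], acc))   # remember operator and result
            acc = inputs[operand]                 # start the bracketed branch
        elif opcode == ")":
            operator, saved = deferred.pop()      # apply the deferred operator
            acc = (saved and acc) if operator == "AND" else (saved or acc)
        elif opcode == "ST":
            outputs[operand] = acc
        else:
            raise ValueError(f"unsupported opcode: {opcode}")
    return outputs


EXAMPLE = [
    ("LD", "I:000/00"), ("AND(", "I:000/01"), ("OR(", "I:000/02"),
    ("ANDN", "I:000/03"), (")", None), (")", None), ("ST", "O:001/00"),
]
```

Running EXAMPLE against all 16 input combinations should match the "Read as" expression given with the listing.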

With the implicit stack, your swap becomes
LD VarNameA
LD VarNameB
ST VarNameA
ST VarNameB

This also makes the assembler a little more complex, as it needs to
re-order, and be bracket aware, before final-code-generate :)
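
The four-instruction swap can be modelled as follows. This is a sketch that assumes LD pushes onto the implicit stack and ST pops from it, which is my reading rather than something the standard text confirms; the names are placeholders.

```python
def stack_swap(variables: dict, a: str, b: str) -> dict:
    """Swap two variables using only load (push) and store (pop)."""
    stack = []
    stack.append(variables[a])    # LD VarNameA
    stack.append(variables[b])    # LD VarNameB
    variables[a] = stack.pop()    # ST VarNameA  (receives the old B)
    variables[b] = stack.pop()    # ST VarNameB  (receives the old A)
    return variables


if __name__ == "__main__":
    print(stack_swap({"VarNameA": 1, "VarNameB": 2}, "VarNameA", "VarNameB"))
```

No temporary variable is needed, because the stack holds both values until they are stored back in the opposite order.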

> BTW: There are some nice constructs possible for smaller code, like using
> some kind of zero-page, as implemented in the 6502, because the lda/sta
> instructions could be used with 8 bit address arguments. But code size
> and speed are not so important for me; a small core is more important, and
> maybe easy-to-write assembler code, to avoid implementing a GCC backend for
> my CPU.

Another good reference site I've found is this:
http://www.tracemode.com/products/dev/

They offer a free ( 117MB ) version, which I have not had time to look at
yet.

Something like this, should (hopefully) allow simulation and
development of IL code, as the software aspects of this will be the key
elements.

If you can keep to a defined type/operator subset of IL, then this
should also be somewhat portable.

I did see that some of their IL examples suggest two operands, but the
standards docs I have here do not mention that.
It could be that two operands simply do an implicit load of the first
one, and that this is done to make the code slightly more readable.

-jg

From: Frank Buss on 23 Aug 2006 06:04

Martin Schoeberl wrote:

> What do you mean with 'very close to the hardware'? I try to
> avoid vendor specific library elements as much as possible and
> stay with plain VHDL. If you mean that the VHDL coding style
> is more hardware oriented, then I agree.

Yes, this is what I meant, e.g. figures 5.6 to 5.9 of your thesis, where
you describe the processor pipeline with gates, and which is implemented
like this in VHDL. But maybe this is the normal case and I'm just too new to
VHDL to write and interconnect components in this way.

http://www.jopdesign.com/thesis/thesis.pdf

> I started directly
> in an FPGA implementation and did almost no simulation.

Why not? When I was implementing the CRC32 check for my network core, I
tested the algorithm with a VHDL testbench (ethernet packet send and
receive works at 10 Mbit and 100 Mbit on my Spartan 3E starter kit now).
The turnaround times are faster with simulation and it is much easier to
debug there than to debug a synthesized core in hardware. The same was
true for my DS2432 ROM id reader, where I wrote the testbench first
and then implemented the reader.
http://www.frank-buss.de/vhdl/spartan3e.html

--
Frank Buss, fb(a)frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

From: Walter Banks on 23 Aug 2006 08:43

Jim Granville wrote:

> The tiniest CPUs do not need a stack, and interrupts do not need to be
> re-entrant, so a faster context switch is to re-map the Registers, Flags
> (and even PC ? ) onto a different area in BRAM.
> You can share this resource by INTs re-map top-down, and calls re-map
> bottom up - with a hardware trap when they collide :)

Once you see clearly the relationship between features and cost, a lot
can be removed.

Interrupts can be removed at extremely low cost to applications. Neither
the Microchip PIC12 nor the Freescale RS08 has interrupts. In the
RS08 C compiler we developed some software IP to go into a power-down
mode where possible and launch execution threads that are compiled to
run to completion.

The threads are typically short and, as a side effect, run to completion
makes local re-use easy.

C compilers implemented for small processors work well without either
a data or subroutine return stack. Two of the processors we have written
compilers for in the last couple of years both used an accessible return
register. Flow control analysis in the compiler makes nested subroutines
user transparent.

The instruction set reduction in the RS08 from the S08 parent had a
4-6% impact on application performance.

Walter..
