CPU design [FPGA]

From: Martin Schoeberl on 22 Aug 2006 06:28

>> and Java also: http://www.jopdesign.com/
>
> You have tested both: a "normal" instruction set and a stack machine. For
> the stack machine you wrote that it is two times faster. What about code
> size and the size of the core?

Mmh, this statement is from a very early version of JOP
(about 2001). It compared two implementations of the
Java virtual machine (JVM), each written in a different
type of microcode.

About code size: It's the code (bytecode) that the Java
compiler generates plus some class information. Bytecode
is compact, but the class information adds to the memory
footprint. The total size depends on how much of the Java
library is supported.

Core size is configurable, starting from about 1000 LCs.
A well balanced version of JOP is about 2000 LCs.

>
> I've downloaded your code and looks like it is implemented very close to
> the hardware instead of using arbitrary VHDL and let the synthesizer decide
> how to implement it. A good idea for my implementation :-)

What do you mean by 'very close to the hardware'? I try to
avoid vendor-specific library elements as much as possible and
stay with plain VHDL. If you mean that the VHDL coding style
is more hardware oriented, then I agree. I started directly
with an FPGA implementation and did almost no simulation.

Martin

From: Göran Bilski on 22 Aug 2006 06:30

Frank Buss wrote:
> Göran Bilski wrote:
>
>
>>You seems to have a c,z bits somewhere but you will need two versions of
>>each instruction, one which uses the carry and one which doesn't
>
>
> Yes, I have carry and zero flag. To make the implementation of the core
> easier, I think I'll use one bit of the instruction set to determine if the
> flags are updated or not.
>
>
>>Running more than just simple programs in real-time applications
>>requires interrupt support which messes things up considerable in the
>>control part of the processor.
>
>
> Why? I think I can implement a "call" instruction like in 68000:
>
> r2=pc
> pc=r1
>
> In the sub routine I can save r2, if I need more call stack.
>
> Interrupts could be implemented by saving the PC register in a special
> register and restoring it by calling a special return instruction.
>
>
So will you have instructions that save the C,Z values?
Imagine doing a cmp instruction and then taking an interrupt right
after it. The interrupt handler will also use these flags, so when it
returns, the interrupted program will see the wrong values.

>>Do you consider using only absolute branching or also doing relative
>>branching?
>
>
> 64 instructions are possible, so relative branching is a good idea and I'll
> use the same concept with one bit for deciding, if it is absolute or
> relative.
>
>
>>If you really are wanting to have a processor which is code efficient,
>>you might want to look at a stack machine.
>>If I was to create a tiny tiny processor with little area and code
>>efficient I would do a stack machine.
>>But they are much nastier to program but they can be implemented very
>>efficiently.
>
>
> I've implemented a simple Forth implementation for Java and it's just
> different, not more difficult to program in Forth:
>
> http://www.frank-buss.de/forth/
>
> The MARC4 from Atmel uses qForth:
>
> http://www.atmel.com/journal/documents/issue5/pg46_48_Atmel_5_CodePatch_A.pdf
>
> Maybe you are right and the core and programs are smaller with Forth, I'll
> think about it. Really useful is that it is simple to write an interactive
> read-eval-print loop in Forth (like in Lisp), so that you can program and
> debug a system over RS232.
>

From: Frank Buss on 22 Aug 2006 08:11

Göran Bilski wrote:

> So will you have instructions that save the C,Z values?
> Imagine doing a cmp instruction and then taking an interrupt right
> after it. The interrupt handler will also use these flags, so when it
> returns, the interrupted program will see the wrong values.

Yes, I think two instructions, one copying r1 to the flags register and
one copying the flags register to r1, will be sufficient, a little bit
like 6502 txs and tsx.
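
A minimal Python sketch of why those two transfer instructions are enough, assuming a hypothetical flag layout (bit 0 = carry, bit 1 = zero) and assuming the handler's copies of r1 and r2 are preserved by some other means. None of the names below come from an actual core; they only model the cmp / interrupt / branch sequence discussed above.

```python
class Core:
    """Toy model: two internal registers plus carry and zero flags."""

    def __init__(self):
        self.r1 = 0
        self.r2 = 0
        self.c = 0
        self.z = 0

    def cmp(self):
        # Compare r1 with r2 and update both flags (c acts as a borrow flag).
        diff = self.r1 - self.r2
        self.c = 1 if diff < 0 else 0
        self.z = 1 if diff == 0 else 0

    def flags_to_r1(self):
        # Hypothetical "flags register to r1" instruction.
        self.r1 = (self.z << 1) | self.c

    def r1_to_flags(self):
        # Hypothetical "r1 to flags register" instruction.
        self.c = self.r1 & 1
        self.z = (self.r1 >> 1) & 1


def isr(core, save_flags):
    # Assumption: r1/r2 of the interrupted program are saved elsewhere.
    saved_r1, saved_r2 = core.r1, core.r2
    saved_flags = None
    if save_flags:
        core.flags_to_r1()
        saved_flags = core.r1
    # The handler does its own compare, which clobbers c and z.
    core.r1, core.r2 = 3, 7
    core.cmp()
    if save_flags:
        core.r1 = saved_flags
        core.r1_to_flags()
    core.r1, core.r2 = saved_r1, saved_r2


def zero_flag_after_interrupt(save_flags):
    core = Core()
    core.r1, core.r2 = 5, 5
    core.cmp()               # sets z = 1
    isr(core, save_flags)    # interrupt lands between cmp and the branch
    return core.z            # the value a following "if z=1" branch would test
```

With save_flags set, the branch after the interrupt still sees the zero flag from the original cmp; without it, the branch sees the handler's flags.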

--
Frank Buss, fb(a)frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

From: radarman on 22 Aug 2006 08:57

Göran Bilski wrote:
> Frank Buss wrote:
> > Göran Bilski wrote:
> >
> >
> >>If the interesting part is to create this solution without any time
> >>limits than you should create most from scratch.
> >
> >
> > Yes, this is what I'm planning.
> >
> > I have another idea for a CPU, very RISC like. The bits of an instructions
> > are something like micro-instructions:
> >
> > There are two internal 16 bit registers, r1 and r2, on which the core can
> > perform operations and 6 "normal" 16 bit registers. The first 2 bits of an
> > instructions defines the meaning of the rest:
> >
> > 2 bits: operation:
> > 00 load internal register 1
> > 01 load internal register 2
> > 10 execute operation
> > 11 store internal register 1
> >
> > I think it is a good idea to use 8 bits for one instruction instead of
> > using non-byte-aligned instructions, so we have 6 bits for the operation.
> > Some useful operations:
> >
> > 6 bits: execute operation:
> > r1 = r1 and r2
> > r1 = r1 or r2
> > r1 = r1 xor r2
> > cmp(r1, r2)
> > r1 = r1 + r2
> > r1 = r1 - r2
> > pc = r1
> > pc = r1, if c=0
> > pc = r1, if c=1
> > pc = r1, if z=0
> > pc = r1, if z=1
> >
> > For the load and store micro instructions, we have 6 bits for encoding the
> > place on which the load and store acts:
> >
> > 6 bits place:
> > 1 bit: transfer width (0=8, 1=16 bits)
> > 2 bits source/destination:
> > 00: register:
> > 3 bits: register index
> > 01: immediate:
> > 1 bit: width of immediate value (0=8, 1=16 bits)
> > next 1 or 2 bytes: immediate number (8/16 bits)
> > 10: memory address in register
> > 3 bits: register index
> > 11: address
> > 1 bit: width of address (0=8, 1=16 bits)
> > next 1 or 2 bytes: address (8/16 bits)
> >
> > The transfer width and the value need not to be the same. E.g. 1010xx
> > means, that the next byte is loaded into the internal register and the
> > upper 8 bits are set to 0.
> >
> > But for this reduced instruction set a compiler would be a good idea. Or
> > different layers of assembler. I'll try to translate my first CPU design,
> > which needed 40 bytes:
> >
> > ; swap 6 byte source and destination MACs
> > .base = 0x1000
> > p1: .dw 0
> > p2: .dw 0
> > tmp: .db 0
> > move #5, p1
> > move #11, p2
> > loop: move.b (p1), tmp
> > move.b (p2), (p1)
> > move.b tmp, (p2)
> > sub.b p2, #1
> > sub.b p1, #1
> > bcc.b loop
> >
> > With my new instruction set it could be written like this (the normal
> > registers 0 and 1 are constant 0 and 1) :
> >
> > load r1 immediate with 5
> > store r1 to register 2
> > load r1 immediate with 11
> > store r1 to register 3
> > loop: load r1 from memory address in register 2
> > load r2 from memory address in register 3
> > store r1 to memory address in register 3
> > store r2 to memory address in register 2
> > load r1 from register 3
> > load r2 from register 1
> > operation r1 = r1 - r2
> > store r1 in register 3
> > load r1 from register 2
> > operation r1 = r1 - r2
> > store r1 in register 2
> > operation pc = loop if c=0
> >
> > This is 20 bytes long. As you can see, there are micro optimizations
> > possible, like for the last two register decrements, where the subtrahend
> > needs to be loaded only once.
> >
> > I think this instruction set could be implemented with very few gates,
> > compared to other instruction sets, and the memory usage is low, too.
> > Another advantage: 64 different instructions are possible and orthogonal
> > higher levels are easy to implement with it, because the load and store
> > operations work on all possible places. Speed would be not the fastest, but
> > this is no problem for my application.
> >
> > The only problem is that you need a C compiler or something like this,
> > because writing assembler with this reduced instruction set looks like it
> > will be no fun.
> >
> > Instead of 16 bits, 32 bits and more is easy to implement with generic
> > parameters for this core.
> >
>
> Things to keep in mind is to handle larger arithmetic than 16 bits.
> That will usually introduce some kind of carry bits (stored where?).
> You seems to have a c,z bits somewhere but you will need two versions of
> each instruction, one which uses the carry and one which doesn't
>
> Running more than just simple programs in real-time applications
> requires interrupt support which messes things up considerable in the
> control part of the processor.
>
> Do you consider using only absolute branching or also doing relative
> branching?
>
> If you really are wanting to have a processor which is code efficient,
> you might want to look at a stack machine.
> If I was to create a tiny tiny processor with little area and code
> efficient I would do a stack machine.
> But they are much nastier to program but they can be implemented very
> efficiently.
>
>
> Göran
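
The 8-bit encoding quoted above can be sketched as a small Python decoder. This is only an illustration under assumptions: the fields are taken to be packed most significant bit first in the order they are listed, the unused low bits are ignored, and the function name, dictionary keys and place labels are made up for the example.

```python
# Top two bits select the micro-instruction class.
OPS = {0b00: "load r1", 0b01: "load r2", 0b10: "execute", 0b11: "store r1"}

# Two-bit source/destination selector inside the 6-bit "place" field.
PLACES = {
    0b00: "register",
    0b01: "immediate",
    0b10: "memory via register",
    0b11: "address",
}


def decode(byte):
    """Split one instruction byte into its fields (assumed MSB-first layout)."""
    op = (byte >> 6) & 0b11
    low6 = byte & 0x3F
    if op == 0b10:
        # All six remaining bits select one of up to 64 operations.
        return {"op": "execute", "operation": low6}

    place = (low6 >> 3) & 0b11
    fields = {
        "op": OPS[op],
        "transfer_bits": 16 if low6 & 0x20 else 8,
        "place": PLACES[place],
    }
    if place in (0b00, 0b10):
        # Register and register-indirect forms carry a 3-bit register index.
        fields["register"] = low6 & 0b111
    else:
        # Immediate and address forms carry a 1-bit width; the value follows
        # in the next one or two bytes of the instruction stream.
        fields["extra_bytes"] = 2 if (low6 >> 2) & 1 else 1
    return fields
```

For example, decode(0b00101000) corresponds to the "1010xx" case in the quoted post: a 16-bit load of r1 from an 8-bit immediate, with one extra byte to fetch.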

Since you have control of the microcode, you can implement 16-bit math
in an 8-bitter by chaining other states. The v8 uRISC/Arclite has a
16-bit increment, which is implemented as {Rn+1,Rn}++. It takes two
clock cycles to execute because it issues two commands to the ALU. Yes,
you do have to keep a carry flag, but you would keep one anyway.
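
As a rough Python model of that chaining (not the actual v8 uRISC/Arclite microcode; the function names are made up), the 16-bit increment is two passes through an 8-bit adder, with the carry from the low byte fed into the high byte:

```python
def alu_add8(a, b, carry_in=0):
    """8-bit adder: returns (8-bit result, carry out)."""
    total = (a & 0xFF) + (b & 0xFF) + carry_in
    return total & 0xFF, total >> 8


def inc16(regs, n):
    """Increment the 16-bit pair {regs[n+1], regs[n]} in two ALU steps."""
    # Step 1: low byte + 1, remember the carry.
    regs[n], carry = alu_add8(regs[n], 1)
    # Step 2: high byte + 0 + carry from step 1.
    regs[n + 1], carry = alu_add8(regs[n + 1], 0, carry)
    return carry  # carry out of the full 16-bit increment
```

Each call to alu_add8 stands for one ALU command, which is where the two clock cycles come from.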

BTW - I just finished the interrupt controller for my processor core,
and it wasn't that difficult (once I got past the priority part). In
my case, I wait for the next instruction decode, and then enter the
interrupt states. Once it starts an interrupt, it's a simple matter of
storing off the flag register and current PC + 1, and then doing a JSR
to the location indicated in the service vector. I use a req/ack scheme
to let the microcode FSM indicate that it has entered an ISR.

Of course, my CPU doesn't have any cache and has only a simple two-stage
pipeline - so that might have something to do with the simplicity of it.
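
The entry sequence described above, sketched in Python under assumptions: a single software stack, a hypothetical vector table mapping interrupt numbers to handler addresses, and pc holding the address of the instruction currently executing (so pc + 1 is the resume address). Priority handling and the pipeline are left out.

```python
class Cpu:
    def __init__(self, vectors):
        self.pc = 0
        self.flags = 0
        self.stack = []
        self.vectors = vectors   # hypothetical: interrupt number -> handler address
        self.irq = None          # pending request, if any
        self.ack = False         # raised for one step when an ISR is entered

    def request(self, number):
        self.irq = number

    def step(self):
        """Finish the current instruction, then either advance or enter an ISR."""
        self.ack = False
        next_pc = self.pc + 1    # the instruction at pc completes here (not modelled)
        if self.irq is not None:
            # Interrupt states: store flags and return address, then jump.
            self.stack.append(self.flags)
            self.stack.append(next_pc)
            self.pc = self.vectors[self.irq]
            self.irq = None
            self.ack = True      # req/ack: tell the requester the ISR has started
        else:
            self.pc = next_pc

    def rti(self):
        """Return from interrupt: restore PC and flags in reverse order."""
        self.pc = self.stack.pop()
        self.flags = self.stack.pop()
```

Because the request is only examined at an instruction boundary, the handler never sees a half-finished instruction, which keeps the state to save down to the flags and the return address.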

From: radarman on 22 Aug 2006 09:01

Göran Bilski wrote:
> Frank Buss wrote:
> > Göran Bilski wrote:
> >
> >
> >>You seems to have a c,z bits somewhere but you will need two versions of
> >>each instruction, one which uses the carry and one which doesn't
> >
> >
> > Yes, I have carry and zero flag. To make the implementation of the core
> > easier, I think I'll use one bit of the instruction set to determine if the
> > flags are updated or not.
> >
> >
> >>Running more than just simple programs in real-time applications
> >>requires interrupt support which messes things up considerable in the
> >>control part of the processor.
> >
> >
> > Why? I think I can implement a "call" instruction like in 68000:
> >
> > r2=pc
> > pc=r1
> >
> > In the sub routine I can save r2, if I need more call stack.
> >
> > Interrupts could be implemented by saving the PC register in a special
> > register and restoring it by calling a special return instruction.
> >
> >
> So will you have instructions that save the C,Z values?
> Imagine doing a cmp instruction and then taking an interrupt right
> after it. The interrupt handler will also use these flags, so when it
> returns, the interrupted program will see the wrong values.
>
>
> >>Do you consider using only absolute branching or also doing relative
> >>branching?
> >
> >
> > 64 instructions are possible, so relative branching is a good idea and I'll
> > use the same concept with one bit for deciding, if it is absolute or
> > relative.
> >
> >
> >>If you really are wanting to have a processor which is code efficient,
> >>you might want to look at a stack machine.
> >>If I was to create a tiny tiny processor with little area and code
> >>efficient I would do a stack machine.
> >>But they are much nastier to program but they can be implemented very
> >>efficiently.
> >
> >
> > I've implemented a simple Forth implementation for Java and it's just
> > different, not more difficult to program in Forth:
> >
> > http://www.frank-buss.de/forth/
> >
> > The MARC4 from Atmel uses qForth:
> >
> > http://www.atmel.com/journal/documents/issue5/pg46_48_Atmel_5_CodePatch_A.pdf
> >
> > Maybe you are right and the core and programs are smaller with Forth, I'll
> > think about it. Really useful is that it is simple to write an interactive
> > read-eval-print loop in Forth (like in Lisp), so that you can program and
> > debug a system over RS232.
> >

Simpler solution - have the microcode FSM push the flags to the stack.
It's a simple alteration, and saves a lot of heartache. I have
contemplated even pushing the entire context to the stack, since I can
burst write from the FSM a lot faster than I can with individual
PSH/POP instructions, but I figure that would be overkill.
