PC and SP for a small CPU

Posted by Victor Yurkovsky on Jul 23 2013   

Ok, let's make a small stack-based CPU.

I will start where the rubber meets the road - the PC/stack subsystem that I like referring to as the 'legs'. As usual, I will present a design with a twist.

Not having a large design team, deadlines and million-dollar fab runs when designing CPUs creates a truly different environment. I can actually sit at the kitchen table and doodle around with CPU designs to my heart's content. I can try really ridiculous approaches, and work without a plan, just to see what happens. When something interesting happens, I can adjust the rest of my design to fit. I am an artist, man!

The Legs

When normal people (that is, not artists :) build CPUs, they will generally designate a register as a Program Counter (PC) and use it to address memory. The PC needs to be incremented normally; in addition it must support jumps and calls, so it is generally constructed as a loadable counter.

For calls and returns, we use the Stack Pointer (SP) that addresses the memory, either the same one as the PC or a different one. SP can be either incremented or decremented.

The stack semantics dictate that the SP must be pre-decremented on push and post-incremented on pop (or the reverse). In spite of its apparent simplicity, this pre/post distinction can be tricky to implement. Some minimal implementations (the J1 stack processor, for example) give up and leave the post-increment for the next instruction (for the datastack anyway), leaving it up to the assembler to deal with the complexity.

The interaction of the PC incrementor and the return address that winds up pushed onto the stack is yet another source of complexity that is hard to describe until you try to implement it. Suffice it to say that you have to either push an incremented address or increment the popped address to avoid running the same instruction twice. It is amazing how many real processors implement the PC/stack pointer subsystem in a clumsy way.
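For contrast, here is a minimal sketch of the conventional arrangement described above: a loadable-counter PC plus a separate stack pointer that pre-decrements on push and post-increments on pop. It takes the "push an incremented address" route. The module and signal names (CONV_PC, rst, target, jump, call, ret, rstack) are made up for illustration, and this is not part of the design that follows.

```verilog
// Hypothetical conventional PC/return-stack pair, for contrast only.
// All names here are illustrative assumptions, not from the design below.
module CONV_PC(
  input             clk,
  input             rst,
  input      [15:0] target,   //jump/call destination
  input             jump,
  input             call,
  input             ret,
  output reg [15:0] pc        //address of the current instruction
  );
  reg [15:0] rstack[0:15];    //separate return stack memory
  reg  [3:0] sp;              //points at the current top of stack

  wire [15:0] pc_next = pc + 16'd1;

  always @(posedge clk) begin
    if (rst) begin
      pc <= 16'd0;
      sp <= 4'd0;
    end else if (call) begin
      rstack[sp - 4'd1] <= pc_next;  //pre-decrement; push the incremented PC
      sp <= sp - 4'd1;
      pc <= target;
    end else if (ret) begin
      pc <= rstack[sp];              //read the top, then post-increment
      sp <= sp + 4'd1;
    end else if (jump) begin
      pc <= target;
    end else begin
      pc <= pc_next;
    end
  end
endmodule
```

Even in this small form, the PC register, its incrementor, the stack memory and the SP adder are all separate pieces of logic, and the return address has to be routed from one to the other.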

The traditional PC/SP implementation impacts the rest of the processor in a very significant way. Both the PC and the SP need to address memory, often simultaneously. Given that requirement, we are faced with a hard choice to make - either dual-port the RAM or require multiple cycles for instructions. Traditionally, the first choice is not an option, but with FPGAs we could do it easily (although I am loath to do so for other reasons). The second choice is not attractive either, as it incurs a significant speed penalty and increases the complexity of the design.

Decoupled Stack

Luckily, there is a third alternative: decouple the stack memory. For minimal processors there is little reason to keep the stack in the same memory space as the code or data, especially if you are not planning on running C on them (and I have little interest in that).

A distributed RAM can be implemented very compactly on Xilinx chips: a single slice can house two 16x1-bit RAMs, one per LUT. This leads to a very compact stack memory - a 16-level 16-bit stack needs 16 of these RAMs and takes up only 8 slices!

But wait, it gets better. Each half-slice also has free incrementor logic. With that, we can eliminate the PC register altogether, and use the memory addressed by SP as PC.

This arrangement makes subroutine calling really easy. We don't have to push anything - the PC is on the stack to start with!

There are consequences to this decoupled approach. Since the stack memory is outside the normal memory space, it is inaccessible to regular memory reads. For instance, you cannot take an address of data on the return stack. Running out of the stack without a separate PC also makes it entirely impossible to store data on the return stack - there simply is no pathway to move data there. This is a little traumatic, as even Forth uses the return stack sometimes to store data. However, there are workarounds.

Let's implement the legs. I will break up the functionality into small modules - the map report will show 'utilization by hierarchy' to let us identify how big each module is.

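First, the stack memory:
  // A 16-bit 16-level stack memory.
  // Infer distributed RAM (16x1 LUT RAMs such as RAM16X1S).  We write it every
  // cycle with DIN and output DOUT, which may be incremented.
module STACKRAM(
  input         C,      //clock
  input   [3:0] A,      //address (from the stack pointer)
  input  [15:0] DIN,    //data written on every rising edge
  output [15:0] DOUT,   //asynchronous read, plus inc
  input inc             //when set, add 1 to the value read
  );
  reg [15:0] ram[0:15];
  assign DOUT = ram[A] + inc;
  always @(posedge C)
    ram[A] <= DIN;
endmodule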

The Stack Pointer:
  // A 5 bit stack pointer
  // There is no penalty for using it as a 4-bit pointer
module SP(
  input C,
  input push,
  input pop,
  output [4:0]dout
  );
  reg [4:0] SP;           //Stack Pointer
  reg [4:0] newsp;        //next SP value (combinational)
  always @*               //must also react to SP, not just push/pop
    case ({push,pop})
      2'b01:  newsp = SP+1;   //pop
      2'b10:  newsp = SP-1;   //push
      default: newsp = SP;
    endcase
  always @(posedge C)
    SP <= newsp;
  assign dout = newsp;    //address the RAM with the new SP in the same cycle
endmodule

And finally, the entire PC/SP module:
  // The complete PC/SP subsystem
module PC(
  input clk,
  input [15:0] in,       //input vector data
  input inc,             //when set, increment PC
  input vec,             //when set, accept vector
  input push,            //push new value onto stack
  input pop,             //return value (increment SP for next cycle)
  output [15:0] out,
  output[3:0] addr
  );
  //stack pointer - 2 slices...
  SP mysp(clk,push,pop,addr);   //5-bit dout is truncated to the 4-bit addr

  wire [15:0] min;
  wire [15:0] mout;
  STACKRAM mem(clk,addr,min,mout,inc);
  //mux between direct input or old PC/inc
  assign min = vec? in : mout;
  assign out = min;   //output new address or inced old.
endmodule

Pretty simple. To test it I implemented the design on a Digilent Spartan S3 board. I connected the 4-digit display to the address bus, 8 sliding switches to the vector input, and 3 buttons to signify jump, call and return. Running with a slow clock, I can watch my CPU incrementing the address, jumping or calling to a specified address, and returning to the original PC + 1! The button instructions are decoded into control wires as follows:
  reg pc_inc, pc_vec, pc_push, pc_pop;
  always @(posedge cpuclk) begin
    case (btn[3:1])
      3'b100: begin //jump
        pc_inc<=0;  pc_vec<=1; pc_push<=0; pc_pop<=0;
      end
      3'b010: begin //call
        pc_inc<=0;  pc_vec<=1; pc_push<=1; pc_pop<=0;
      end
      3'b001: begin //return
        pc_inc<=1;  pc_vec<=0; pc_push<=0; pc_pop<=1;
      end
      default: begin //increment PC
        pc_inc<=1;  pc_vec<=0; pc_push<=0; pc_pop<=0;
      end
    endcase
  end
  wire [3:0] sp;
  //switches for low 8 bits of vector
  PC mypc(cpuclk,{8'h00,sw[7:0]},pc_inc,pc_vec,pc_push,pc_pop,ab,sp);
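The same jump/call/return sequence can also be checked in simulation. The testbench below is only a sketch under a few assumptions: it is compiled together with complete versions of the STACKRAM, SP and PC modules from this post, and the names PC_tb, step and errors are made up here. Since SP has no reset, the testbench pokes it to a known value through a hierarchical reference, which is a simulation-only trick.

```verilog
`timescale 1ns/1ps
// Hypothetical self-checking testbench for the PC module in this post.
module PC_tb;
  reg         clk;
  reg  [15:0] in;
  reg         inc, vec, push, pop;
  wire [15:0] out;
  wire  [3:0] addr;
  integer     errors;

  PC dut(clk, in, inc, vec, push, pop, out, addr);

  always #5 clk = ~clk;

  //apply one set of controls after a falling edge, check the combinational
  //outputs, then let the next rising edge commit the result
  task step(input [15:0] i, input c_inc, input c_vec, input c_push,
            input c_pop, input [15:0] exp_out, input [3:0] exp_addr);
    begin
      @(negedge clk);
      in = i; inc = c_inc; vec = c_vec; push = c_push; pop = c_pop;
      #1;
      if (out !== exp_out || addr !== exp_addr) begin
        $display("FAIL: out=%h addr=%h, expected out=%h addr=%h",
                 out, addr, exp_out, exp_addr);
        errors = errors + 1;
      end
      @(posedge clk);
    end
  endtask

  initial begin
    clk = 0; in = 0; inc = 0; vec = 0; push = 0; pop = 0; errors = 0;
    #1 dut.mysp.SP = 5'd0;   //SP has no reset; give it a known start

    step(16'h0010, 0, 1, 0, 0, 16'h0010, 4'd0);   //jump to 0x10
    step(16'h0000, 1, 0, 0, 0, 16'h0011, 4'd0);   //increment
    step(16'h0040, 0, 1, 1, 0, 16'h0040, 4'd15);  //call 0x40, new stack slot
    step(16'h0000, 1, 0, 0, 0, 16'h0041, 4'd15);  //increment inside the callee
    step(16'h0000, 1, 0, 0, 1, 16'h0012, 4'd0);   //return to caller PC + 1

    if (errors == 0) $display("PASS");
    $finish;
  end
endmodule
```

The call writes the target into a fresh stack slot and leaves the caller's slot untouched, so the return only has to step SP back and let the incrementor move past the call instruction.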
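The tools report the size as 24 slices -- pretty close to optimal. SP should really fit into 2 slices... In the excerpt below, each entry reads as slices in the module itself / slices including its submodules (10 + 9 + 5 = 24).
| Module             | Partition | Slices        |
| +mypc              |           | 10/24         |
| ++mem              |           | 9/9           |
| ++mysp             |           | 5/5           |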
So there you have it. All that's left to do is to add the datastack, the ALUs and the instruction decoder....
