Joshua Hirschkorn | CID: 02378306 | GitHub: vortexisalpha
- Pipeline Registers
- Fetch-Decode Pipeline Register (pip_reg_d.sv)
- Decode-Execute Pipeline Register (pip_reg_e.sv)
- Execute-Memory Pipeline Register (pip_reg_m.sv)
- Memory-Writeback Pipeline Register (pip_reg_w.sv)
- Flush and stall logic implementation
- Hazard Unit
- Data forwarding logic
- Load word stalling
- Control hazard flushing
- Top level Integration
- Refactoring and naming conventions for pipeline stages
- Module interconnection and signal routing
- Cache Design
- 2-way set associative cache architecture design (Co-authored with Yichan)
- State machine for cache miss handling (IDLE, WRITEBACK, FETCH, UPDATE)
- Cache Stalling Integration
- Modified hazard unit for cache miss stalling
- Updated all pipeline registers to support cache stall signals
- Funct3 Byte Offset Logic
- Implemented byte/word operation handling in cache
- Byte offset extraction for LBU/SB instructions
- Component Test Benches
- Created testbenches for pipeline and cache components
- Integration Testing
- Full system verification with provided test programs
- Version Control
- Merge conflict resolution
- Release tagging for milestone versions
- Team Coordination
- Task allocation and versioning principles
To start, we made this diagram, taking inspiration from the textbook Digital Design and Computer Architecture (RISC-V Edition) by Sarah Harris and David Harris.
In blue below is each component that we added to the single cycle CPU for pipelining purposes.
The transition from the single-cycle CPU (mostly completed in Lab 4) to a pipelined architecture required pipeline registers between the stages, separating instructions so that several can be processed simultaneously. I implemented two of the pipeline registers with clear naming conventions, following the standard pipeline stage suffixes (F, D, E, M, W) and the _i/_o suffixes for labelling logic as input/output. Agreeing on these naming conventions and the diagram up front was a key part of the success of our development: when it came to linking everything up in top.sv, the codebase was much easier to navigate.
Commits:
The first pipeline register shown here separates the execute and memory stages, passing through ALU, control unit, PC+4 and extend signals, some of which were taken directly from the execute-stage logic:
module pip_reg_m #( // Execute to Memory stage
    parameter PC_WIDTH = 32,
    parameter INSTRUCTION_WIDTH = 32,
    parameter REGISTER_ADDRESS_WIDTH = 5
)(
    input  logic                              clk_i,
    input  logic                              en_i,
    input  logic                              RegWriteE_i,  // Execute
    output logic                              RegWriteM_o,  // Memory
    input  logic [1:0]                        ResultSrcE_i, // Execute
    output logic [1:0]                        ResultSrcM_o, // Memory
    input  logic                              MemWriteE_i,  // Execute
    output logic                              MemWriteM_o,  // Memory
    input  logic [2:0]                        funct3E_i,    // Execute
    output logic [2:0]                        funct3M_o,    // Memory
    input  logic [INSTRUCTION_WIDTH-1:0]      ALUResultE_i, // Execute
    output logic [INSTRUCTION_WIDTH-1:0]      ALUResultM_o, // Memory
    input  logic [INSTRUCTION_WIDTH-1:0]      WriteDataE_i, // Execute
    output logic [INSTRUCTION_WIDTH-1:0]      WriteDataM_o, // Memory
    input  logic [REGISTER_ADDRESS_WIDTH-1:0] RdE_i,        // Execute
    output logic [REGISTER_ADDRESS_WIDTH-1:0] RdM_o,        // Memory
    input  logic [PC_WIDTH-1:0]               PCPlus4E_i,   // Execute
    output logic [PC_WIDTH-1:0]               PCPlus4M_o    // Memory
);

    always_ff @(posedge clk_i) begin
        if (en_i) begin
            RegWriteM_o  <= RegWriteE_i;
            ResultSrcM_o <= ResultSrcE_i;
            MemWriteM_o  <= MemWriteE_i;
            funct3M_o    <= funct3E_i;
            ALUResultM_o <= ALUResultE_i;
            WriteDataM_o <= WriteDataE_i;
            RdM_o        <= RdE_i;
            PCPlus4M_o   <= PCPlus4E_i;
        end
    end

endmodule
Key design choices in the pipeline registers:
- Clear (clr_i): Implements flush functionality for control hazards. When a branch is taken, the incorrect instruction in the decode stage must be flushed. (The E-to-M register above needs only the enable; clr_i appears on the registers whose stage can be flushed.)
- Enable (en_i): Implements stall functionality. When disabled, the register "freezes" and maintains its current values.
- Positive edge triggering: All pipeline registers operate on posedge clk_i for synchronous operation.
Commits:
The largest pipeline register, carrying all control signals and data from decode to execute:
//declared all the logic with input and output suffixes as you see below up here...
always_ff @(posedge clk_i) begin
    if (clr_i) begin // flush logic
        RegWriteE_o   <= 'b0;
        ResultSrcE_o  <= 'b0;
        MemWriteE_o   <= 'b0;
        JumpE_o       <= 'b0;
        BranchE_o     <= 'b0;
        ALUControlE_o <= 'b0;
        ALUSrcE_o     <= 'b0;
        funct3E_o     <= 'b0;
        RD1E_o        <= 'b0;
        RD2E_o        <= 'b0;
        PCE_o         <= 'b0;
        Rs1E_o        <= 'b0;
        Rs2E_o        <= 'b0;
        RdE_o         <= 'b0;
        ImmExtE_o     <= 'b0;
        PCPlus4E_o    <= 'b0;
    end
    else if (en_i) begin // Normal operation: pass data through
        RegWriteE_o  <= RegWriteD_i;
        ResultSrcE_o <= ResultSrcD_i;
        // ...remaining signals passed through
    end
end
A critical design decision was passing funct3 through the pipeline register. This is necessary because:
- Branch resolution: The branch unit in the execute stage needs funct3 to determine the branch type (BEQ, BNE, BLT, BGE).
- Memory operations: The memory stage needs funct3 to distinguish between byte and word load/store operations.
This was a factor we initially overlooked, and it came back to bite us when implementing the cache: we had to retrofit this input/output mid-implementation because of the byte operations. More on this later...
Commits:
- Inputs and outputs
- Implemented hazard unit logic over video call (should say co-authored but doesn't)
- Updated to take in Resultsrc0 and bug fixes
- Fixing outputs from control unit
- Fixing outputs from ALU
The hazard unit was one of my contributions, implementing data forwarding, stalling, and flushing logic. I co-authored this with Yichan over video call, carefully designing the logic to handle all hazard scenarios.
Planning on video call diagram:
We used the Digital Design and Computer Architecture textbook as a guide for the top-level units and base naming conventions for this stage. We also discussed how we were going to deal with all the stall and flush logic.
The core forwarding logic detects Read After Write (RAW) hazards and forwards data from later pipeline stages.
An example of a data hazard is:
Cycle n:
instr1: ADD x5, x1, x2 // produces x5
Cycle n+1:
instr2: SUB x6, x5, x3 // needs x5
How we forward is dependent on the signals ForwardAE and ForwardBE output by the hazard unit. Muxes on the ALU inputs let us substitute values from later stages (ALUResultM or ResultW):
The following logic detects whether a register read in the execute stage needs a value already computed in a later stage; we then "forward" that value so the instruction uses the up-to-date result:
always_comb begin
    // Forward A (Rs1)
    if (((Rs1E_i == RdM_i) && RegWriteM_i) && (Rs1E_i != '0)) begin
        ForwardAE_o = 2'b10; // Forward from Memory stage
    end
    else if (((Rs1E_i == RdW_i) && RegWriteW_i) && (Rs1E_i != '0)) begin
        ForwardAE_o = 2'b01; // Forward from Writeback stage
    end
    else begin
        ForwardAE_o = 2'b00; // No forwarding
    end

    // Forward B (Rs2): same structure
    if (((Rs2E_i == RdM_i) && RegWriteM_i) && (Rs2E_i != '0)) begin
        ForwardBE_o = 2'b10;
    end
    else if (((Rs2E_i == RdW_i) && RegWriteW_i) && (Rs2E_i != '0)) begin
        ForwardBE_o = 2'b01;
    end
    else begin
        ForwardBE_o = 2'b00;
    end
end
Key considerations:
- Register x0 check: The condition Rs1E_i != '0 ensures we don't forward when reading from x0 (which is always zero).
- Priority: Memory stage forwarding takes priority over writeback stage (checked first) because it has more recent data.
- RegWrite check: Only forward if the source stage will actually write to the register.
Load instructions create a special hazard because the data isn't available until after the memory stage.
An example of this behaviour would be:
I1: lw x5, 0(x1)
I2: add x6, x5, x2 // we need x5 here
This requires stalling: the hazard unit detects whether a stall is needed based on ResultSrcE0 and whether the decode-stage instruction uses the same register the load is writing.
Note that ResultSrcE0 is the least significant bit of ResultSrc, which selects the value written back to the register file; when set, the result comes from data memory, so it identifies load instructions. We detect the load-use case with the following logic:
lwStall = ResultSrcE0_i && ((Rs1D_i == RdE_i) || (Rs2D_i == RdE_i));
When ResultSrcE0 is high, it indicates a load instruction in the execute stage. If either source register in the decode stage matches the load's destination, we must stall.
A significant modification I made was integrating cache miss handling into the hazard unit with Yichan.
In our design, cache miss handling is integrated directly into the hazard unit rather than treated as a separate control path. This is because a cache miss behaves like a global hazard: it affects not just one instruction, but the correctness of the entire pipeline.
When a cache miss occurs, the cache asserts CacheStall_i, which freezes all pipeline stages:
- The Fetch and Decode stages are stalled to prevent the PC from advancing and new instructions from entering the pipeline.
- The Execute, Memory, and Writeback stages are also frozen so that instructions trying to use memory do not partially execute or commit results while waiting for memory.
// Stall logic:
// stall on regular stall or cache stall
StallF_o = lwStall || CacheStall_i;
StallD_o = lwStall || CacheStall_i;
// freeze on cache stall only
StallE_o = CacheStall_i;
StallM_o = CacheStall_i;
StallW_o = CacheStall_i;

// Flush logic
if (!CacheStall_i) begin // don't flush on cache stall
    FlushD_o = PCSrcE_i;
    FlushE_o = (lwStall || PCSrcE_i);
end
else begin
    FlushD_o = 'b0;
    FlushE_o = 'b0;
end
CacheStall_i is driven by the cache memory directly into the hazard unit: the cache asserts its CacheMiss_o output when it misses, and this signal enters the hazard unit as CacheStall_i.
Key design ideas:
- Cache stall freezes everything: When the cache misses, all stages must freeze to keep the pipeline functional.
- No flush during cache stall: We must not flush valid instructions while waiting for the cache.
- Independent stall signals: Each stage gets its own stall signal, allowing more control over which stages need to be stalled in which scenarios.
Each person was responsible for integrating their own modules into top.sv and following the agreed naming conventions. Signal names follow the pattern SignalName + stage suffix (e.g. RegWriteD, RegWriteE, RegWriteM, RegWriteW), making it easy to trace a signal through the pipeline. The parts designed with Yichan we integrated together; the parts I built individually I integrated alone.
Commits:
- Name changing from Lab 4 for good programming practice
- Editing sign extend for good variable names
- Editing control unit for good variable names
- Adding RD outputs for pipelining
- Finalised pipeline level top.sv with comments on needed components
- +lots more bug fixes and testing logic
// Control signals carried across pipeline stages
logic RegWriteD;
logic RegWriteE;
logic RegWriteM;
logic RegWriteW;
logic [1:0] ResultSrcD;
logic [1:0] ResultSrcE;
logic [1:0] ResultSrcM;
logic [1:0] ResultSrcW;
The forwarding multiplexers use the hazard unit outputs to select the correct data source:
// Execute stage
mux3 ForwardMuxA (
.in0_i(RD1E),
.in1_i(ResultW),
.in2_i(ALUResultM),
.sel_i(ForwardAE),
.out_o(SrcAE)
);
mux3 ForwardMuxB (
.in0_i(RD2E),
.in1_i(ResultW),
.in2_i(ALUResultM),
.sel_i(ForwardBE),
.out_o(WriteDataE)
);
Commits:
Working with Yichan, we designed and implemented a 2-way set-associative write-back cache with 2048 bytes of data capacity and a least-recently-used (LRU) replacement policy. We first created a diagram during our December 2nd meeting, then implemented it in live video calls.
Here are a few sketches from our video calls:
We iterated our initial design quite a few times and came up with this finalised diagram for our cache memory:
The next step was to figure out how cache would be implemented with our current pipelined design and we came up with this addition to our draw.io document as a structure for how to integrate the cache:
We decided to alter the design of the example provided in the lectures because we wanted more capacity, which meant allocating more set-index bits to increase the number of sets in the cache.
Set Format:
| LRU Bit (1) | Way0 (56 bits) | Way1 (56 bits) |
Way Format:
| Valid (1) | Dirty (1) | Tag (22 bits) | Data (32 bits) |
Parameters:
- Total capacity: 2048 bytes of actual data
- Number of sets: 256 (2^8)
- Ways per set: 2
- Tag bits: 22
- Set index bits: 8
- Byte offset bits: 2
The cache uses a finite state machine to handle cache misses. This is the structure I decided on for miss handling:
The state transitions are:
- IDLE -> WRITEBACK: On a miss, if the LRU way is dirty, write it back to memory first.
- IDLE -> FETCH: On a miss, if the LRU way is clean, skip directly to fetching.
- WRITEBACK -> FETCH: After the writeback, fetch the new data.
- FETCH -> UPDATE: After fetching, update the cache with the new data.
- UPDATE -> IDLE: Return to idle, ready for the next access.
This structure makes sense because each state corresponds to a distinct stall condition: the cache does not stall on hits in IDLE; it stalls during WRITEBACK to stop the processor accessing a line being evicted; it stalls during FETCH while waiting for memory to return the new line; and it only releases the stall once the cache line has been fully updated and marked valid, ensuring the CPU never observes a partially updated cache state.
typedef enum {IDLE, WRITEBACK, FETCH, UPDATE} my_state;

always_comb begin
    next_state = current_state;
    case (current_state)
        IDLE: begin
            if (cache_miss) begin
                if (target_dirty)
                    next_state = WRITEBACK;
                else
                    next_state = FETCH;
            end
        end
        WRITEBACK: next_state = FETCH;
        FETCH:     next_state = UPDATE;
        UPDATE:    next_state = IDLE;
    endcase
end
Commits:
- Changing all pipeline registers to pass in funct3
- Changed hazard unit to take in cache stall and removed unnecessary pipeline logic
- Edited top.sv to allow these changes
- Fixing state machine logic for funct3 enabling on writeback state
- Add Update state logic
- Bug fix, funct3_o needs to be set on update state too
One of my key contributions was implementing the funct3 logic to handle byte operations correctly. This was particularly challenging because the cache operates on word-aligned addresses, but byte operations (LBU, SB) need to access specific bytes within a word.
Initially, our cache would fail on tests like 3_lbu_sb.s because:
- The cache always reads/writes full 32-bit words
- Byte operations need to extract or modify specific bytes based on the address
- The byte offset (bits [1:0] of the address) determines which byte to access
I implemented byte offset extraction in the cache module:
// Byte offset from address for byte operations
logic [1:0] byte_offset;
assign byte_offset = addr_i[1:0];
// Check if it's a word operation with funct3
logic is_word_op;
assign is_word_op = (funct3_i == 3'b010);
For read operations, the byte offset selects the correct byte from the cached word:
if (is_word_op) begin
    data_o = raw_cache_data;
end
else begin
    if (byte_offset == 2'b00)
        data_o = {24'b0, raw_cache_data[7:0]};
    else if (byte_offset == 2'b01)
        data_o = {24'b0, raw_cache_data[15:8]};
    else if (byte_offset == 2'b10)
        data_o = {24'b0, raw_cache_data[23:16]};
    else
        data_o = {24'b0, raw_cache_data[31:24]};
end
For write operations, only the specific byte is modified while preserving the rest:
// On cache hit with byte write
if (byte_offset == 2'b00)
    cache_array[set_addr][7:0] <= data_i[7:0];
else if (byte_offset == 2'b01)
    cache_array[set_addr][15:8] <= data_i[7:0];
else if (byte_offset == 2'b10)
    cache_array[set_addr][23:16] <= data_i[7:0];
else
    cache_array[set_addr][31:24] <= data_i[7:0];
An important detail was ensuring funct3 is passed through the cache to memory for operations that miss:
The following logic drives funct3_o to memory depending on the current state. In IDLE we pass the CPU's funct3 straight through; in all the miss-handling states we force word accesses (3'b010):
// Pass funct3 to memory and force word access on fill/writeback/update
assign funct3_o = (current_state == FETCH ||
current_state == WRITEBACK ||
current_state == UPDATE) ? 3'b010 : funct3_i;
During cache fill operations, we always access memory as words (32 bits), but for direct cache access, we pass through the original funct3 to enable byte operations.
Integrating the cache with the pipeline required modifications to the hazard unit and all pipeline registers:
// In top.sv
cache cache(
.clk_i(clk),
.rst_i(rst),
.MemWriteM_i(MemWriteM),
.ResultSrcM_i(ResultSrcM),
.funct3_i(funct3M),
.addr_i(ALUResultM),
.data_i(WriteDataM),
.mem_rd_data_i(MemRdData),
.mem_addr_o(CacheMemAddr),
.mem_wr_en_o(CacheMemWrEn),
.mem_wr_data_o(CacheMemWrData),
.funct3_o(CacheFunct3), // Note this line
.data_o(ReadDataM),
.cache_miss_o(CacheMiss),
.stall_o(CacheStall)
);
The stall signal propagates through the hazard unit to freeze all pipeline stages during a cache miss. We discussed this above.
As mentioned in my project log, I created testbenches for half of the components in our project, splitting the workload with Yichan. This section details the comprehensive component testing framework I developed.
I created a modular testing framework in tb/tests/component_tests/ that allows isolated testing of individual RTL modules using Google Test and Verilator. The underlying Google Test/Verilator setup was provided and designed for us by Peter (thanks, Peter).
I wrote a shell script to automate running all the component tests. It is similar to the one Peter provided for the main testing framework (which runs verify.cpp), but instead it runs the component testbenches in the tb/tests/component_tests/ folder:
#!/bin/bash
# Run all component testbenches under component_tests/
SCRIPT_DIR=$(dirname "$(realpath "$0")")
RTL_FOLDER=$(realpath "$SCRIPT_DIR/../../../rtl")
TEST_FOLDER="$SCRIPT_DIR" # testbenches live alongside this script
OUT_FOLDER="$SCRIPT_DIR/../test_out/component_tests"
GREEN=$'\033[0;32m'
RED=$'\033[0;31m'
RESET=$'\033[0m'
passes=0
fails=0
mkdir -p "$OUT_FOLDER"

for file in "${TEST_FOLDER}"/*_tb.cpp; do
    name=$(basename "$file" _tb.cpp)
    verilator -Wall -trace \
        -cc "${RTL_FOLDER}/${name}.sv" \
        -exe "${file}" \
        -y "${RTL_FOLDER}" \
        -prefix "Vdut" \
        -o Vdut \
        -LDFLAGS "-lgtest -lgtest_main -lpthread"
    make -j -C obj_dir/ -f Vdut.mk
    ./obj_dir/Vdut
    if [ $? -eq 0 ]; then
        ((passes++))
        echo "${GREEN}PASS${RESET} ${name}"
    else
        ((fails++))
        echo "${RED}FAIL${RESET} ${name}"
    fi
    # Stash build output per test
    mv obj_dir "${OUT_FOLDER}/${name}_obj_dir"
done
The script:
- Automatically discovers all *_tb.cpp files
- Compiles each component with its corresponding RTL module
- Reports pass/fail status with coloured output
- Preserves build artefacts for debugging
Commits:
- ALU test bench
- Branch unit test bench
- Control unit test bench
- Bugfix control unit test bench
- Data Memory test bench
- Hazard unit test bench
- Instruction memory test bench
- Mux reg test bench
- Cache test bench
- Changed pipeline register test benches for new cache
The hazard unit is critical for correct pipeline operation, so I created comprehensive tests covering all hazard scenarios:
class HazardUnitTestbench : public BaseTestbench
{
protected:
void initializeInputs() override
{
top->Rs1D_i = 0;
top->Rs2D_i = 0;
top->Rs1E_i = 0;
top->Rs2E_i = 0;
top->RdE_i = 0;
top->ResultSrcE0_i = 0;
top->RdM_i = 0;
top->RegWriteM_i = 0;
top->RdW_i = 0;
top->RegWriteW_i = 0;
top->PCSrcE_i = 0;
}
};
// Test forwarding from Memory stage
TEST_F(HazardUnitTestbench, ForwardFromMemory)
{
top->Rs1E_i = 5; // Source register 1 in Execute = x5
top->RdM_i = 5; // Destination in Memory = x5
top->RegWriteM_i = 1; // Memory stage will write
top->eval();
EXPECT_EQ(top->ForwardAE_o, 2); // Should forward from Memory (2'b10)
}
// Test forwarding from Writeback stage
TEST_F(HazardUnitTestbench, ForwardFromWriteback)
{
top->Rs2E_i = 3; // Source register 2 in Execute = x3
top->RdW_i = 3; // Destination in Writeback = x3
top->RegWriteW_i = 1; // Writeback stage will write
top->eval();
EXPECT_EQ(top->ForwardBE_o, 1); // Should forward from Writeback (2'b01)
}
// Test load-use hazard detection
TEST_F(HazardUnitTestbench, LoadUseStall)
{
top->ResultSrcE0_i = 1; // Load instruction in Execute
top->RdE_i = 8; // Loading into x8
top->Rs1D_i = 8; // Decode needs x8
top->eval();
EXPECT_EQ(top->StallD_o, 1); // Should stall Decode
EXPECT_EQ(top->StallF_o, 1); // Should stall Fetch
}
// Test that load-use stall also flushes Execute
TEST_F(HazardUnitTestbench, LoadUseStallFlushesExecute)
{
top->ResultSrcE0_i = 1;
top->RdE_i = 4;
top->Rs1D_i = 4;
top->PCSrcE_i = 0;
top->eval();
EXPECT_EQ(top->StallD_o, 1);
EXPECT_EQ(top->StallF_o, 1);
EXPECT_EQ(top->FlushE_o, 1); // Must flush to insert bubble
}
These tests verify:
- ForwardFromMemory: Detects a RAW hazard and forwards from the Memory stage
- ForwardFromWriteback: Detects a RAW hazard and forwards from the Writeback stage
- LoadUseStall: Detects a load-use hazard requiring a stall
- LoadUseStallFlushesExecute: Ensures the stall inserts a bubble by flushing Execute
Tests all branch condition types:
// BEQ: Branch if equal (Zero flag set)
TEST_F(BranchUnitTestbench, BeqTaken)
{
top->funct3_i = 0b000; // BEQ encoding
top->Zero_i = 1; // Operands are equal
top->eval();
EXPECT_EQ(top->BranchTaken_o, 1);
}
// BNE: Branch if not equal (Zero flag clear)
TEST_F(BranchUnitTestbench, BneTaken)
{
top->funct3_i = 0b001; // BNE encoding
top->Zero_i = 0; // Operands are not equal
top->eval();
EXPECT_EQ(top->BranchTaken_o, 1);
}
// BLT: Branch if less than (negative result)
TEST_F(BranchUnitTestbench, BltTaken)
{
top->funct3_i = 0b100; // BLT encoding
top->ALUResult_i = 0x80000000; // MSB set = negative
top->eval();
EXPECT_EQ(top->BranchTaken_o, 1);
}
// BGE: Branch if greater or equal (non-negative result)
TEST_F(BranchUnitTestbench, BgeTaken)
{
top->funct3_i = 0b101; // BGE encoding
top->ALUResult_i = 0x00000001; // Positive
top->eval();
EXPECT_EQ(top->BranchTaken_o, 1);
}
// Default case: unknown funct3 should not branch
TEST_F(BranchUnitTestbench, DefaultNotTaken)
{
top->funct3_i = 0b111; // Invalid/unused encoding
top->Zero_i = 0;
top->ALUResult_i = 0;
top->eval();
EXPECT_EQ(top->BranchTaken_o, 0);
}
Verifies correct decoding for each instruction type:
// ADDI: I-type immediate arithmetic
TEST_F(ControlUnitTestbench, AddiDecodeTest)
{
top->op_i = 0b0010011; // I-type ALU opcode
top->funct3_i = 0b000; // ADD function
top->eval();
EXPECT_EQ(top->RegWrite_o, 1); // Will write to register
EXPECT_EQ(top->ALUControl_o, 0b000); // ADD operation
EXPECT_EQ(top->ALUSrc_o, 1); // Use immediate
EXPECT_EQ(top->ImmSrc_o, 0b000); // I-type immediate
EXPECT_EQ(top->ResultSrc_o, 0b00); // Result from ALU
EXPECT_EQ(top->Branch_o, 0); // Not a branch
}
// LW: Load word
TEST_F(ControlUnitTestbench, LoadDecodeTest)
{
top->op_i = 0b0000011; // Load opcode
top->funct3_i = 0b010; // Word access
top->eval();
EXPECT_EQ(top->RegWrite_o, 1); // Will write to register
EXPECT_EQ(top->ResultSrc_o, 0b01); // Result from memory
EXPECT_EQ(top->ALUSrc_o, 1); // Use immediate for address
EXPECT_EQ(top->MemWrite_o, 0); // Not writing to memory
}
// SW: Store word
TEST_F(ControlUnitTestbench, StoreDecodeTest)
{
top->op_i = 0b0100011; // Store opcode
top->funct3_i = 0b010; // Word access
top->eval();
EXPECT_EQ(top->RegWrite_o, 0); // No register write
EXPECT_EQ(top->MemWrite_o, 1); // Writing to memory
EXPECT_EQ(top->ALUSrc_o, 1); // Use immediate for address
EXPECT_EQ(top->ImmSrc_o, 0b010); // S-type immediate
}
// BNE: Branch if not equal
TEST_F(ControlUnitTestbench, BneDecodeTest)
{
top->op_i = 0b1100011; // Branch opcode
top->funct3_i = 0b001; // BNE function
top->eval();
EXPECT_EQ(top->Branch_o, 1); // Is a branch
EXPECT_EQ(top->ALUControl_o, 0b001); // SUB for comparison
EXPECT_EQ(top->ALUSrc_o, 0); // Compare registers
EXPECT_EQ(top->ImmSrc_o, 0b001); // B-type immediate
}
Tests both word and byte memory operations:
// Store and load a full word
TEST_F(DataMemoryTestbench, StoreLoadWord)
{
top->wr_en_i = 1;
top->funct3_i = 0b010; // Word operation
top->addr_i = 0x00000010;
top->data_i = 0xDEADBEEF;
top->clk_i = 0; top->eval();
top->clk_i = 1; top->eval(); // Rising edge writes
top->wr_en_i = 0;
top->funct3_i = 0b010;
top->eval();
EXPECT_EQ(top->data_o, 0xDEADBEEF);
}
// Store and load a single byte
TEST_F(DataMemoryTestbench, StoreLoadByte)
{
top->wr_en_i = 1;
top->funct3_i = 0b000; // Byte operation
top->addr_i = 0x00000020;
top->data_i = 0x000000AA;
top->clk_i = 0; top->eval();
top->clk_i = 1; top->eval();
top->wr_en_i = 0;
top->funct3_i = 0b000;
top->eval();
EXPECT_EQ(top->data_o, 0xAAu); // Zero-extended byte
}
These tests were particularly important for debugging the funct3 byte offset logic in the cache.
Tests the PC selection logic for different instruction types:
// Default: sequential execution (PC + 4)
TEST_F(MuxRegTestbench, DefaultTakesPcPlus4)
{
top->PCPlus4F_i = 0x10;
top->eval();
EXPECT_EQ(top->PCNext_o, 0x10u);
}
// Branch taken: use PCTarget
TEST_F(MuxRegTestbench, BranchTakesTarget)
{
top->PCTargetE_i = 0x200;
top->PCSrcE_i = 1;
top->JalrE_i = 0;
top->eval();
EXPECT_EQ(top->PCNext_o, 0x200u);
}
// JALR: use ALU result (rs1 + imm)
TEST_F(MuxRegTestbench, JalrUsesAluResult)
{
top->ALUResultE_i = 0xDEADBEEF;
top->PCTargetE_i = 0x12340000; // Should be ignored
top->PCSrcE_i = 1;
top->JalrE_i = 1; // JALR flag
top->eval();
EXPECT_EQ(top->PCNext_o, 0xDEADBEEF);
}
Comprehensive tests for all ALU operations:
// ADD operation
TEST_F(ALUTestbench, AddWorksTest)
{
top->ALUControl_i = 0b000;
top->SrcA_i = 10;
top->SrcB_i = 20;
top->eval();
EXPECT_EQ(top->ALUResult_o, 30);
EXPECT_EQ(top->Zero_o, 0);
}
// SUB operation
TEST_F(ALUTestbench, SubWorksTest)
{
top->ALUControl_i = 0b001;
top->SrcA_i = 20;
top->SrcB_i = 5;
top->eval();
EXPECT_EQ(top->ALUResult_o, 15);
}
// AND operation
TEST_F(ALUTestbench, AndWorksTest)
{
top->ALUControl_i = 0b010;
top->SrcA_i = 0b1100;
top->SrcB_i = 0b1010;
top->eval();
EXPECT_EQ(top->ALUResult_o, 0b1000);
}
// OR operation
TEST_F(ALUTestbench, OrWorksTest)
{
top->ALUControl_i = 0b011;
top->SrcA_i = 0b1100;
top->SrcB_i = 0b0110;
top->eval();
EXPECT_EQ(top->ALUResult_o, 0b1110);
}
// SLT (Set Less Than)
TEST_F(ALUTestbench, SltWorksTest)
{
top->ALUControl_i = 0b101;
top->SrcA_i = 5;
top->SrcB_i = 9;
top->eval();
EXPECT_EQ(top->ALUResult_o, 1); // 5 < 9, so the result is 1 (and Zero is clear)
}
Tests instruction fetch functionality:
TEST_F(InstrMemTestbench, ReadsFirstWordIfPresent)
{
std::ifstream fin("program.hex");
if (!fin.is_open())
{
GTEST_SKIP() << "program.hex not present, skipping ROM content check";
}
// Pull first 4 bytes and reconstruct word (little-endian)
std::string line;
uint32_t bytes[4] = {0};
for (int i = 0; i < 4 && std::getline(fin, line); ++i)
{
std::stringstream ss;
ss << std::hex << line;
ss >> bytes[i];
}
uint32_t expected = (bytes[3] << 24) | (bytes[2] << 16) |
(bytes[1] << 8) | bytes[0];
top->A_i = 0;
top->eval();
EXPECT_EQ(top->RD_o, expected);
}
This test gracefully handles the case where program.hex isn't present, using gtest's skip functionality.
The full system was verified using the provided test programs in tb/tests/verify.cpp:
TEST_F(CpuTestbench, TestAddiBne) {
setupTest("1_addi_bne");
initSimulation();
runSimulation(CYCLES);
EXPECT_EQ(top_->a0, 254);
}
TEST_F(CpuTestbench, TestLbuSb) {
setupTest("3_lbu_sb");
initSimulation();
runSimulation(CYCLES);
EXPECT_EQ(top_->a0, 300);
}
TEST_F(CpuTestbench, TestPdf) {
setupTest("5_pdf");
setData("reference/gaussian.mem");
initSimulation();
runSimulation(CYCLES * 100);
EXPECT_EQ(top_->a0, 15363);
}
The 3_lbu_sb test was particularly important for validating the funct3 byte offset logic: it only passed after implementing the byte addressing correctly in the cache.
Below is a summary of all component test benches I created:
- ALU (ALU_tb.cpp): ADD, SUB, AND, OR, SLT, default case
- Branch Unit (branch_unit_tb.cpp): BEQ, BNE, BLT, BGE, default
- Control Unit (control_unit_tb.cpp): ADDI, LOAD, STORE, BNE, default
- Data Memory (data_memory_tb.cpp): word store/load, byte store/load
- Hazard Unit (hazard_unit_tb.cpp): memory forward, WB forward, LW stall, flush
- Instruction Memory (instr_mem_tb.cpp): first-word read verification
- PC Mux (mux_reg_tb.cpp): PC+4, branch target, JALR
I established the versioning principles for our team during the November 28th meeting:
- Feature branches: Each team member works on their own branch, or two people share a branch when working on a feature together
- Main branch protection: All merges to main require peer review and testing
- Release tags: Major milestones are tagged for easy reference, apart from the single-cycle CPU, which we left on one branch
I resolved multiple merge conflicts throughout the project, particularly when integrating:
- Control unit updates with the pipeline
- The cache module with the pipelined processor
- The full instruction set branch with the cache implementation
The largest was this commit, introducing stretch goal 3 and the cache together due to parallel development of both. See here. This required a live video call between Anthony, Yichan and me.
Task allocation was structured around the pipeline stages:
- Stage 1 (Nov 25-26): Pipeline registers and initial integration
- Stage 2 (Nov 28): Hazard unit implementation
- Parallel work: Control unit and ALU updates by Carys and Anthony
- Stage 3: Cache implementation
- Stage 4: Lots of cache debugging
- Parallel work: Carys and Anthony getting the full instruction set working in the ALU and control unit
- Stage 5: Integration and testing on the assembly testbenches
- Stage 6: Individual component testing
This structure allowed parallel development while minimizing merge conflicts.
This project significantly deepened my understanding of:
- Pipelined processor architecture: Understanding how instructions flow through stages and how hazards arise gave me practical insight into CPU design principles.
- Hardware description vs programming: The biggest mental shift was thinking in terms of concurrent hardware rather than sequential software. Every always_comb block runs simultaneously, not sequentially.
- Cache design trade-offs: Implementing a write-back cache taught me about the complexity of maintaining data coherency and the performance implications of different cache policies.
- GTKWave debugging: Extensive debugging sessions made me proficient at tracing signals through waveforms and identifying timing issues.
- Initial byte addressing oversight: Our first cache implementation assumed all addresses were word-aligned, failing on byte operations. This required significant rework to add the byte offset logic.
- Flush during cache stall: Initially, the hazard unit would flush instructions during cache stalls, corrupting the pipeline state. The fix was simple, but finding the bug took hours of waveform analysis.
- Underestimating integration complexity: I assumed connecting modules would be straightforward, but cache state changes and naming mismatches caused numerous issues; thankfully these were minimised by our diagram and thoughtful planning, though not fully eliminated.
- More upfront design: While we created a cache diagram before implementation, I would spend more time designing the hazard unit logic on paper before coding.
- Earlier testing: Writing component testbenches before integration would have caught issues like the byte addressing problem sooner.
- Better documentation: Inline comments explaining the "why" behind design decisions would have helped when debugging weeks later.
I'm grateful to have worked with such a dedicated team:
- Yichan: Close collaboration on the pipeline registers, hazard unit, and cache implementation through numerous video calls
- Carys and Anthony: Delivering the control unit and ALU updates that enabled our pipelined processor
- The entire team, for their commitment to achieving all stretch goals
This project demonstrated that effective communication and clear task allocation are just as important as technical skills in successful hardware design. Yichan and I worked closely together throughout this project. We were both able to contribute immensely, and many of our tasks would not have been possible without each other's support. Our collaboration was essential to completing the work.