The Borg (Bring yer Own GRaphics) project—supported by NLnet—is establishing a fully transparent, end-to-end silicon implementation flow for open-source GPU hardware using a 100% libre EDA toolchain. Recognizing that full GPU development is highly complex, the initiative capitalizes on recent advances in low-cost chip manufacturing to make individual tape-outs feasible for small teams.
📖 Read the Borg GPU Book for detailed documentation.
The design is a TinyQV RISC-V SoC with the Borg FP16 shader processor as a memory-mapped peripheral, targeting both iCE40 FPGAs (pico-ice) and ASIC (IHP SG13G2 via Tiny Tapeout).
The Borg shader processor is a minimal programmable shading unit with:
- FP16 Fused Multiply-Add (FMA) — IEEE-754 compliant HardFloat unit supporting ADD, MUL, FMA, FNEG, FSTEP, and FRCP operations
- 32 general-purpose FP16 registers (r0–r31, expanding to 64), MMIO-accessible from the CPU
- 32-word instruction memory for shader programs
- Hardware FP16 reciprocal (RCP) — LUT + linear interpolation for perspective division
- 4-cycle pipeline with automatic halt-on-zero-instruction
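The "LUT + linear interpolation" reciprocal mentioned in the list above can be illustrated with a small software model. This is only a sketch of the general technique, not the Borg RTL: the table size (`LUT_BITS`), the helper names, and the use of double-precision arithmetic before the final FP16 rounding are all assumptions made for illustration.

```python
import math
import struct

LUT_BITS = 5  # assumed table size: 32 segments across the mantissa range [1, 2)
# One extra entry so that segment i can interpolate between LUT[i] and LUT[i + 1].
LUT = [1.0 / (1.0 + i / (1 << LUT_BITS)) for i in range((1 << LUT_BITS) + 1)]

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE-754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def rcp_lut(x: float) -> float:
    """Approximate 1/x for a nonzero input whose reciprocal fits in FP16."""
    m, e = math.frexp(abs(x))   # abs(x) == m * 2**e with m in [0.5, 1)
    m, e = m * 2.0, e - 1       # renormalize so that m is in [1, 2)
    pos = (m - 1.0) * (1 << LUT_BITS)
    idx = int(pos)              # which table segment the mantissa falls in
    frac = pos - idx            # position inside that segment, in [0, 1)
    approx = LUT[idx] + (LUT[idx + 1] - LUT[idx]) * frac  # linear interpolation
    # 1 / (m * 2**e) == (1 / m) * 2**(-e)
    return to_fp16(math.copysign(math.ldexp(approx, -e), x))
```

With 32 segments the interpolation error stays below the FP16 rounding step for these inputs, which is why a small table is plausible for perspective division. A real implementation would work on the raw mantissa bits and handle zero, infinity, NaN, and subnormals explicitly.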
The firmware implements a full triangle rendering pipeline:
- Vertex Shader — 4×4 MVP matrix multiply with hardware perspective division, executed as a single shader pass on the Borg FPU
- Screen-Space Translation — NDC to pixel coordinates with configurable framebuffer resolution (up to 64×64)
- Rasterization — Hardware-iterator driven edge evaluation with native FP16 coordinate expansion and FSM auto-chaining
- Fragment Shader — Unified pass (compiled via linear scan allocator) performing barycentric interpolation for RGB, Z, and UV simultaneously
- Z-Buffer — Per-pixel depth testing with texture mapping from PSRAM
- Framebuffer Output — Results written to PSRAM, read by host (RP2040) for display
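As a rough software reference for the stages listed above, the following sketch runs one triangle through an MVP transform with perspective division, an NDC-to-pixel mapping, edge-function rasterization, barycentric color interpolation, and a depth test. It is not the Borg firmware: the function names, the flipped y axis, the "closer z wins" depth test, sampling at pixel centers, and the plain screen-space (not perspective-correct) interpolation are assumptions chosen to keep the example short.

```python
def mat_vec4(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def project(mvp, pos):
    """Vertex stage: MVP transform followed by perspective division."""
    x, y, z, w = mat_vec4(mvp, [pos[0], pos[1], pos[2], 1.0])
    inv_w = 1.0 / w  # stands in for a hardware reciprocal
    return (x * inv_w, y * inv_w, z * inv_w)

def ndc_to_pixel(ndc, width, height):
    """Map NDC x, y in [-1, 1] to pixel space (assumed: y axis flipped)."""
    return ((ndc[0] + 1.0) * 0.5 * width, (1.0 - ndc[1]) * 0.5 * height, ndc[2])

def edge(a, b, p):
    """Signed edge function: twice the signed area of triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def draw_triangle(mvp, positions, colors, width, height, zbuf, fb):
    """Rasterize one triangle into fb, depth-testing against zbuf."""
    v0, v1, v2 = (ndc_to_pixel(project(mvp, p), width, height) for p in positions)
    area = edge(v0, v1, v2)
    if area == 0:
        return  # degenerate triangle, nothing to draw
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            # Normalized barycentric weights; all >= 0 means p is inside.
            w0 = edge(v1, v2, p) / area
            w1 = edge(v2, v0, p) / area
            w2 = edge(v0, v1, p) / area
            if w0 < 0 or w1 < 0 or w2 < 0:
                continue
            z = w0 * v0[2] + w1 * v1[2] + w2 * v2[2]
            if z >= zbuf[y][x]:
                continue  # an earlier fragment is closer
            zbuf[y][x] = z
            fb[y][x] = tuple(w0 * a + w1 * b + w2 * c for a, b, c in zip(*colors))

W = H = 8
IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
zbuf = [[float("inf")] * W for _ in range(H)]
fb = [[None] * W for _ in range(H)]
draw_triangle(IDENTITY, [(-1, -1, 0.5), (1, -1, 0.5), (0, 1, 0.5)],
              [(1, 0, 0), (0, 1, 0), (0, 0, 1)], W, H, zbuf, fb)
```

Dividing each edge value by the full triangle area makes the weights sum to one and keeps them positive inside the triangle for either winding order, so this sketch does no back-face culling.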
Shaders are compiled from GLSL-like source to a compact binary format (SPIR-B) and loaded at runtime from PSRAM — no firmware reflash needed to change shaders.
The MMIO architecture is generated automatically from a SystemRDL description (an Accellera standard) using PeakRDL-chisel, which emits both the Chisel BorgGpuRegs layout and the C headers directly.
The GPU also features a 2-entry asynchronous command FIFO, so the CPU can pack and queue drawing packets while the GPU handles geometry and rasterization in the background.
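The decoupling that a small command FIFO provides can be sketched with a behavioral model. Everything here is hypothetical and only illustrates the queueing idea: the class and method names, the packet contents, and the back-pressure policy (a push is refused when the queue is full) are assumptions, not the Borg register interface.

```python
from collections import deque

class CommandFifo:
    """Behavioral model of a small bounded command queue."""

    def __init__(self, depth=2):
        self.depth = depth
        self.entries = deque()

    def full(self):
        return len(self.entries) >= self.depth

    def try_push(self, packet):
        """CPU side: enqueue a packet, or report back-pressure when full."""
        if self.full():
            return False
        self.entries.append(packet)
        return True

    def try_pop(self):
        """GPU side: take the oldest packet, or None when idle."""
        return self.entries.popleft() if self.entries else None

fifo = CommandFifo(depth=2)
accepted = [fifo.try_push(name) for name in ("draw_a", "draw_b", "draw_c")]
first = fifo.try_pop()               # the GPU starts on the oldest packet
retry = fifo.try_push("draw_c")      # the slot it freed accepts the third packet
```

In this model the CPU can run up to two packets ahead of the GPU and only has to wait, or do other work, when both slots are occupied.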
The CPU is based on Michael Bell's TinyQV, an RV32I RISC-V core with nibble-serial processing designed for Tiny Tapeout. The original Verilog was rewritten in Chisel and heavily modified, including an expanded register file (RV32E → RV32I), an integrated Borg peripheral bus, and a pipeline adapted for QSPI flash/PSRAM and UART.
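To give a feel for what nibble-serial processing means, here is a simplified model of a 32-bit addition carried out four bits at a time, with the carry passed from one step to the next. It illustrates the general idea only and is not a description of the actual TinyQV datapath; the function name is made up for this example.

```python
def nibble_serial_add(a: int, b: int) -> int:
    """Add two 32-bit values one 4-bit nibble per step, lowest nibble first."""
    result = 0
    carry = 0
    for step in range(8):                    # 8 nibbles make up 32 bits
        shift = 4 * step
        na = (a >> shift) & 0xF
        nb = (b >> shift) & 0xF
        total = na + nb + carry              # at most 15 + 15 + 1 = 31
        result |= (total & 0xF) << shift     # keep the low 4 bits of this step
        carry = total >> 4                   # forward the carry to the next step
    return result                            # the final carry is dropped (wraps at 32 bits)
```

A datapath built this way needs only a 4-bit adder instead of a 32-bit one, trading cycles for area, which suits the small tiles available on Tiny Tapeout.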
Run the test suites with:

    make test-all
    make test-chisel-borg             # Borg FPU unit tests (Chisel)
    make test-chisel-core             # TinyQV CPU tests (Chisel)
    make test-cocotb-soc-core-rtl     # CPU SoC integration tests (cocotb)
    make test-cocotb-soc-borg-rtl     # Borg peripheral tests (cocotb)

Fast C++ simulators are available for RTL validation. They can render frames locally without an FPGA and include a real-time, cycle-accurate interactive view.

    python simulation/verilator/viewer.py    # Bind the Pygame UI to cycle-accurate rendering

Prerequisites: pico-ice FPGA and a Raspberry Pi debug probe.

    cd fpga
    make burn        # Build bitstream and upload to FPGA
    make triangle    # Run triangle rendering (vertex shader on FPGA, display on RP2040)

The ASIC layout is built with:

    make gds         # Full RTL-to-GDS flow via LibreLane/OpenROAD

| Task | Status |
|---|---|
| FPU on software simulator (Chisel + cocotb) | ✅ Done |
| FPU integrated into TinyQV SoC | ✅ Done |
| Vertex shader on FPGA | ✅ Done |
| Triangle rasterization + fragment shading | ✅ Done |
| SPIR-B runtime shader loading | ✅ Done |
| Per-vertex color interpolation | ✅ Done |
| Dynamic framebuffer resolution | ✅ Done |
| Tiny Tapeout TTIHP26a submission | ✅ Submitted |
| 32-bit RISC-V instructions & 32-entry register file | ✅ Done |
| Hardware perspective projection (4×4 MVP shader) | ✅ Done |
| Hardware FP16 reciprocal (FRCP) | ✅ Done |
| Back-face culling & depth-correct vkcube | ✅ Done |
| Hardware fragment interpolation | ✅ Done |
| SystemRDL Automated Memory Mapping | ✅ Done |
| Hardware Command FIFO (2-entry asynchronous submission) | ✅ Done |
| Cycle-accurate C++ simulation (Arcilator & Verilator) | ✅ Done |
| Interactive UI Viewer (zero-copy Pygame) | ✅ Done |
| Test manufactured chip | ⏳ Pending |
| Vulkan driver | 📋 Planned |

| Component | Description | License |
|---|---|---|
| Chisel | Hardware construction language (Scala → Verilog) | Apache-2.0 |
| TinyQV | RV32I RISC-V CPU core (rewritten in Chisel) | Apache-2.0 |
| Berkeley HardFloat | IEEE-754 floating-point units (FMA) | BSD-3-Clause |
| LibreLane | RTL-to-GDS ASIC flow orchestrator | Apache-2.0 |
| Yosys | RTL synthesis | ISC |
| OpenROAD | Place and route | BSD-3-Clause |
| Magic | Layout tool, DRC, GDS export | MIT |
| KLayout | GDS viewer and DRC | GPL-2.0 |
| IHP SG13G2 PDK | IHP 130nm process design kit | Apache-2.0 |
| cocotb | Python-based RTL simulation and testing | BSD-3-Clause |
| Icarus Verilog | Verilog simulation (cocotb backend) | GPL-2.0 |
| Verilator | Verilog linting and simulation | LGPL-3.0 |
| nextpnr | FPGA place and route (iCE40) | ISC |
| IceStorm | iCE40 FPGA bitstream tools | ISC |
| Netgen | LVS (Layout vs. Schematic) | MIT |
| GCC | RISC-V cross-compiler (riscv32-embedded) | GPL-3.0 |
| Mill | Scala build tool | MIT |
| Tiny Tapeout Tools | Build and submission orchestrator | Apache-2.0 |
| Nix | Reproducible development environment | LGPL-2.1 |
| CIRCT/firtool | Chisel → Verilog compiler (FIRRTL) | Apache-2.0 (LLVM) |
| Arcilator | Cycle-accurate FIRRTL C++ simulator | Apache-2.0 (LLVM) |
| OpenJDK | Java runtime for Chisel/Mill | GPL-2.0 + CE |
| SystemRDL | Register logic definition standard | Accellera |
| PeakRDL | Toolchain for parsing and exporting SystemRDL | GPL-3.0 |
| nanobind | Zero-overhead C++ to Python bindings | BSD-3-Clause |
| Pygame (SDL2) | Hardware-accelerated UI windowing subsystem | LGPL-2.1 |
