This repository contains the multi-level simulator system for PLENA (Programmable Long-context Efficient Neural Accelerator).
The PLENA Simulator provides three main components:
- Transaction-level Simulator: Models PLENA's architectural behavior at a high level, enabling rapid exploration of design choices, memory hierarchies, and long-context LLM inference workflows without the overhead of cycle-accurate RTL simulation.
- Analytical Latency Model: Provides fast estimation of PLENA's performance characteristics (TTFT, TPS) based on architectural parameters and instruction latencies for specified workloads.
- Utilization Model: Analyzes the utilization of the systolic array based on architectural parameters and instruction latencies, computing attainable vs theoretical FLOPS.
If you use this simulator in your research, please cite the following paper:
Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference
arXiv:2509.09505
@misc{wu2025combatingmemorywallsoptimization,
title = {Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference},
author = {Haoran Wu and Can Xiao and Jiayi Nie and Xuan Guo and Binglei Lou and Jeffrey T. H. Wong and Zhiwen Mo and Cheng Zhang and Przemyslaw Forys and Wayne Luk and Hongxiang Fan and Jianyi Cheng and Timothy M. Jones and Rika Antonova and Robert Mullins and Aaron Zhao},
year = {2025},
eprint = {2509.09505},
archivePrefix= {arXiv},
primaryClass = {cs.AR},
url = {https://arxiv.org/abs/2509.09505}
}There are two ways to get a working environment. Option A (Nix) runs directly on your machine and is best for day-to-day development. Option B (Docker) wraps the same Nix environment in a container for portable, reproducible builds when you don't want to install Nix on the host.
Prerequisites:
nixpackage manager (with flakes enabled)direnvfor environment management
# Install direnv hook in your shell
echo 'eval "$(direnv hook bash)"' >> ~/.bashrc
source ~/.bashrcInstallation:
# Allow direnv to load the environment
direnv allow
# Enter the development environment
nix develop
# Update git submodules
git submodule update --remote --mergeYou are now in a shell with the full toolchain (Rust, Python 3.12, clang, cmake,
etc.) and can run any of the just commands below directly.
The Docker image encapsulates the same Nix environment, so you only need Docker
installed (no Nix or direnv on the host). All commands are run from the repository
root. Your working tree is bind-mounted into the container at /workspace, so edits
on the host are picked up live.
Prerequisites:
- Docker Engine with the Compose plugin (
docker compose) - (Optional) NVIDIA Container Toolkit for CUDA support
Build the image and open a shell:
just docker-dev
# Equivalent to:
# docker compose -f docker/docker-compose.yml build dev
# docker compose -f docker/docker-compose.yml up -d dev
# docker compose -f docker/docker-compose.yml exec dev bashInside the container, initialize submodules once and build the Rust emulator binary (persisted on the host via the bind mount, so this is a one-time step):
git submodule update --init --recursive
(cd transactional_emulator && cargo build --release)Run commands without an interactive shell:
# Run any command in the dev environment
just docker-run bash -c "just latency llama-3.1-8b"
# Run a just target directly
just docker-test test-linearCommon Docker commands (see docker/README.md for the full list):
| Command | Description |
|---|---|
just docker-build-dev |
Build the development image |
just docker-dev |
Build, start, and enter the dev container |
just docker-run <cmd> |
Run a command in the dev environment |
just docker-test <target> |
Run a just target in Docker |
just docker-down |
Stop containers |
just docker-clean |
Remove containers and cached volumes |
CUDA support:
docker compose -f docker/docker-compose.yml --profile cuda up -d dev-cuda
docker compose -f docker/docker-compose.yml exec dev-cuda bashNote: The repository is bind-mounted from the host (owned by your host user) while the container runs as
root. The image marks/workspaceas a gitsafe.directoryso Nix's flake evaluation doesn't fail with a dubious-ownership error. If you build a custom image, preserve that setting.
The simulator and emulator both use plena_settings.toml as the main configuration file for hardware parameters. This file contains:
- Hardware dimensions (MLEN, BLEN, VLEN, HLEN)
- Memory configuration (HBM, SRAM sizes)
- Instruction latencies
- Prefetch/writeback amounts
The configuration file supports two modes:
analytic: Used by analytical models (latency and utilization)transactional: Used by the transaction-level emulator
Set the active mode in the [MODE] section of plena_settings.toml.
The transaction-level emulator executes machine code instructions sequentially, modeling PLENA's behavior at a high abstraction level. It includes:
- HBM/DRAM off-chip memory simulation
- Handwritten assembly templates for every operator in PLENA ISA for LLaMA
- Test scripts to verify correctness of assembly templates
The emulator reads hardware configuration from plena_settings.toml (using the behavior mode).
Standard mode:
just build-emulator [task]
# Example: just build-behave-sim linearDebug mode:
just build-emulator-debug [task]
# Example: just build-behave-sim-debug linearRun pre-generated assembly:
just run-generated-asmQuiet mode (latency and error metrics only):
just run-generated-asm-quietThe latency model provides fast performance estimation for PLENA workloads. It computes:
- TTFT (Time To First Token): Latency for the prefill phase
- TPS (Tokens Per Second): Throughput for the decode phase
List available models:
just latency-list-modelsRun with default settings (llama-3.1-8b, batch=4, input=2048, output=1024):
just latency llama-3.1-8bRun with custom batch size:
just latency-batch llama-3.1-8b 8Run with full custom parameters:
just latency-full llama-3.1-8b 4 2048 1024
# Format: just latency-full {model} {batch} {input_seq} {output_seq}Get JSON output:
just latency-json llama-3.1-8bPLENA_Simulator/
├── transactional_emulator/ # Transaction-level simulator (Rust)
├── analytic_models/ # Analytical models (Python)
│ ├── latency/ # Latency estimation model
│ └── utilisation/ # Utilization analysis model
├── compiler/ # Compiler and model definitions
├── PLENA_Tools/ # Supporting tools and utilities (submodule)
├── doc/ # Documentation and diagrams
├── plena_settings.toml # Main configuration file
└── justfile # Command shortcuts

