PLENA Simulation System

This repository contains the multi-level simulator system for PLENA (Programmable Long-context Efficient Neural Accelerator).

Overview

The PLENA Simulator provides three main components:

  • Transaction-level Simulator: Models PLENA's architectural behavior at a high level, enabling rapid exploration of design choices, memory hierarchies, and long-context LLM inference workflows without the overhead of cycle-accurate RTL simulation.
  • Analytical Latency Model: Provides fast estimation of PLENA's performance characteristics (TTFT, TPS) based on architectural parameters and instruction latencies for specified workloads.
  • Utilization Model: Analyzes the utilization of the systolic array based on architectural parameters and instruction latencies, computing attainable vs theoretical FLOPS.
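As a rough illustration of "attainable vs theoretical FLOPS", the sketch below computes a utilization ratio for a systolic array. It is not the repository's model: the function names, the array size, the clock frequency, and the assumption that each processing element performs one multiply-accumulate (2 FLOPs) per cycle are all made up for this example.

```python
def theoretical_flops(rows: int, cols: int, clock_hz: float) -> float:
    """Peak FLOP/s, assuming each PE does one multiply-accumulate (2 FLOPs) per cycle."""
    return 2.0 * rows * cols * clock_hz


def attainable_flops(total_flops: float, total_cycles: int, clock_hz: float) -> float:
    """FLOP/s actually achieved by a workload that took `total_cycles` cycles."""
    return total_flops * clock_hz / total_cycles


def utilization(total_flops: float, total_cycles: int,
                rows: int, cols: int, clock_hz: float) -> float:
    """Ratio of attainable to theoretical FLOP/s (1.0 means a fully busy array)."""
    return (attainable_flops(total_flops, total_cycles, clock_hz)
            / theoretical_flops(rows, cols, clock_hz))


# Hypothetical workload: a 1024 x 1024 x 1024 matmul on a 64 x 64 array at 1 GHz
# that takes twice the ideal number of cycles.
matmul_flops = 2 * 1024 ** 3
ideal_cycles = matmul_flops // (2 * 64 * 64)
util = utilization(matmul_flops, 2 * ideal_cycles, rows=64, cols=64, clock_hz=1e9)
print(f"utilization = {util:.2f}")
```

In the real model, the cycle count would come from the configured instruction latencies rather than being supplied by hand.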

Figure 1: Diagram of PLENA


PLENA Publication

If you use this simulator in your research, please cite the following paper:

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference
arXiv:2509.09505

@misc{wu2025combatingmemorywallsoptimization,
  title        = {Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference},
  author       = {Haoran Wu and Can Xiao and Jiayi Nie and Xuan Guo and Binglei Lou and Jeffrey T. H. Wong and Zhiwen Mo and Cheng Zhang and Przemyslaw Forys and Wayne Luk and Hongxiang Fan and Jianyi Cheng and Timothy M. Jones and Rika Antonova and Robert Mullins and Aaron Zhao},
  year         = {2025},
  eprint       = {2509.09505},
  archivePrefix= {arXiv},
  primaryClass = {cs.AR},
  url          = {https://arxiv.org/abs/2509.09505}
}

Setup

There are two ways to get a working environment. Option A (Nix) runs directly on your machine and is best for day-to-day development. Option B (Docker) wraps the same Nix environment in a container for portable, reproducible builds when you don't want to install Nix on the host.

Option A — Nix (native)

Prerequisites:

  • nix package manager (with flakes enabled)
  • direnv for environment management

# Install direnv hook in your shell
echo 'eval "$(direnv hook bash)"' >> ~/.bashrc
source ~/.bashrc

Installation:

# Allow direnv to load the environment
direnv allow

# Enter the development environment
nix develop

# Update git submodules
git submodule update --remote --merge

You are now in a shell with the full toolchain (Rust, Python 3.12, clang, cmake, etc.) and can run any of the just commands below directly.

Option B — Docker

The Docker image encapsulates the same Nix environment, so you only need Docker installed (no Nix or direnv on the host). All commands are run from the repository root. Your working tree is bind-mounted into the container at /workspace, so edits on the host are picked up live.

Prerequisites:

  • Docker Engine with the Compose plugin (docker compose)
  • (Optional) NVIDIA Container Toolkit for CUDA support

Build the image and open a shell:

just docker-dev
# Equivalent to:
#   docker compose -f docker/docker-compose.yml build dev
#   docker compose -f docker/docker-compose.yml up -d dev
#   docker compose -f docker/docker-compose.yml exec dev bash

Inside the container, initialize submodules once and build the Rust emulator binary (persisted on the host via the bind mount, so this is a one-time step):

git submodule update --init --recursive
(cd transactional_emulator && cargo build --release)

Run commands without an interactive shell:

# Run any command in the dev environment
just docker-run bash -c "just latency llama-3.1-8b"

# Run a just target directly
just docker-test test-linear

Common Docker commands (see docker/README.md for the full list):

Command                      Description
just docker-build-dev        Build the development image
just docker-dev              Build, start, and enter the dev container
just docker-run <cmd>        Run a command in the dev environment
just docker-test <target>    Run a just target in Docker
just docker-down             Stop containers
just docker-clean            Remove containers and cached volumes

CUDA support:

docker compose -f docker/docker-compose.yml --profile cuda up -d dev-cuda
docker compose -f docker/docker-compose.yml exec dev-cuda bash

Note: The repository is bind-mounted from the host (owned by your host user) while the container runs as root. The image marks /workspace as a git safe.directory so Nix's flake evaluation doesn't fail with a dubious-ownership error. If you build a custom image, preserve that setting.


Configuration

The simulator and emulator both use plena_settings.toml as the main configuration file for hardware parameters. This file contains:

  • Hardware dimensions (MLEN, BLEN, VLEN, HLEN)
  • Memory configuration (HBM, SRAM sizes)
  • Instruction latencies
  • Prefetch/writeback amounts

The configuration file supports two modes:

  • analytic: Used by analytical models (latency and utilization)
  • transactional: Used by the transaction-level emulator

Set the active mode in the [MODE] section of plena_settings.toml.


Transaction-level Emulation

The transaction-level emulator executes machine code instructions sequentially, modeling PLENA's behavior at a high abstraction level. It includes:

  • HBM/DRAM off-chip memory simulation
  • Handwritten assembly templates, written in the PLENA ISA, for every operator used by LLaMA
  • Test scripts to verify correctness of assembly templates

The emulator reads its hardware configuration from plena_settings.toml (using the transactional mode).
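The idea of executing instructions sequentially while accounting for latency can be sketched in a few lines. This is a toy model, not the Rust emulator: the opcodes, their cycle counts, and the `Instr` type are hypothetical and do not correspond to the PLENA ISA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    opcode: str
    operands: tuple = ()


# Hypothetical per-opcode latencies in cycles (not PLENA's real values).
LATENCY = {"LOAD": 100, "MATMUL": 64, "STORE": 100}


def run(program: list[Instr], latency: dict[str, int]) -> tuple[int, list[tuple[int, str, int]]]:
    """Execute instructions in order and return total cycles plus a per-step trace."""
    cycles = 0
    trace = []
    for pc, instr in enumerate(program):
        if instr.opcode not in latency:
            raise ValueError(f"unknown opcode {instr.opcode!r} at pc={pc}")
        cycles += latency[instr.opcode]
        trace.append((pc, instr.opcode, cycles))
    return cycles, trace


program = [Instr("LOAD", ("hbm", 0)), Instr("MATMUL"), Instr("STORE", ("hbm", 64))]
total_cycles, trace = run(program, LATENCY)
print(total_cycles)
```

A transaction-level model like this trades timing fidelity for speed: it charges a fixed cost per instruction instead of simulating every pipeline stage.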

Running Simulations

Standard mode:

just build-emulator [task]
# Example: just build-emulator linear

Debug mode:

just build-emulator-debug [task]
# Example: just build-emulator-debug linear

Run pre-generated assembly:

just run-generated-asm

Quiet mode (latency and error metrics only):

just run-generated-asm-quiet

Analytical Models

Latency Model

The latency model provides fast performance estimation for PLENA workloads. It computes:

  • TTFT (Time To First Token): Latency for the prefill phase
  • TPS (Tokens Per Second): Throughput for the decode phase
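The two metrics relate to cycle counts roughly as follows. This sketch is an assumption about the definitions rather than the repository's code: it treats TTFT as prefill time and TPS as tokens generated across the whole batch per second of decode time, and the cycle counts and clock frequency are invented.

```python
def ttft_seconds(prefill_cycles: int, clock_hz: float) -> float:
    """Time to first token: duration of the prefill phase."""
    return prefill_cycles / clock_hz


def tokens_per_second(batch: int, output_tokens: int,
                      decode_cycles: int, clock_hz: float) -> float:
    """Decode throughput, counting tokens generated across the whole batch."""
    decode_seconds = decode_cycles / clock_hz
    return batch * output_tokens / decode_seconds


# Hypothetical numbers: 1 GHz clock, batch=4, 1024 output tokens per sequence.
ttft = ttft_seconds(prefill_cycles=500_000_000, clock_hz=1e9)
tps = tokens_per_second(batch=4, output_tokens=1024,
                        decode_cycles=2_048_000_000, clock_hz=1e9)
print(f"TTFT = {ttft:.3f} s, TPS = {tps:.0f}")
```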

Available Commands

List available models:

just latency-list-models

Run with default settings (llama-3.1-8b, batch=4, input=2048, output=1024):

just latency llama-3.1-8b

Run with custom batch size:

just latency-batch llama-3.1-8b 8

Run with full custom parameters:

just latency-full llama-3.1-8b 4 2048 1024
# Format: just latency-full {model} {batch} {input_seq} {output_seq}
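For parameter sweeps, the latency-full format above can be driven from a short script. The helper names below are hypothetical, and the sketch only assumes that just is on the PATH and that the script is run from the repository root inside the development environment.

```python
import subprocess


def latency_full_cmd(model: str, batch: int, input_seq: int, output_seq: int) -> list[str]:
    """Build the argv for: just latency-full {model} {batch} {input_seq} {output_seq}."""
    return ["just", "latency-full", model, str(batch), str(input_seq), str(output_seq)]


def run_latency(model: str, batch: int, input_seq: int, output_seq: int) -> str:
    """Run one configuration and return whatever the recipe prints to stdout."""
    cmd = latency_full_cmd(model, batch, input_seq, output_seq)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Commands for a sweep over batch sizes (built here, not executed).
sweep = [latency_full_cmd("llama-3.1-8b", b, 2048, 1024) for b in (1, 4, 8)]
for cmd in sweep:
    print(" ".join(cmd))
```

Calling `run_latency("llama-3.1-8b", 4, 2048, 1024)` would execute a single configuration and return its output as text.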

Get JSON output:

just latency-json llama-3.1-8b

Project Structure

PLENA_Simulator/
├── transactional_emulator/    # Transaction-level simulator (Rust)
├── analytic_models/          # Analytical models (Python)
│   ├── latency/             # Latency estimation model
│   └── utilisation/         # Utilization analysis model
├── compiler/                # Compiler and model definitions
├── PLENA_Tools/             # Supporting tools and utilities (submodule)
├── doc/                     # Documentation and diagrams
├── plena_settings.toml      # Main configuration file
└── justfile                 # Command shortcuts
