A high-performance LLM inference engine written in Zig.
Zero external ML libraries β all kernels, quantization, and model logic from scratch.
Quick Start β’ Features β’ Contributing β’ Docs
- 8 Model Architectures: Gemma 3, Gemma 4, Qwen 3.5, GPT-OSS, Nemotron-H, Nemotron Nano, GLM-4, Llama 4
- 6 Backends: CPU (SIMD-optimized, Accelerate.framework on macOS), Metal GPU (Apple Silicon), Vulkan, CUDA, ROCm, WebGPU β individually toggleable at build time
- Compile-Time Model Selection: Disable unused model architectures to reduce binary size
- 2 Formats: GGUF, SafeTensors (multi-shard, MLX quantized, NVFP4)
- 20+ Quantization Types: F32, F16, BF16, Q2_K, Q3_K, Q4_0, Q4_1, Q4_K, Q5_0, Q5_K, Q6_K, Q8_0, TQ1_0, IQ4_XS, IQ4_NL, FP8 E4M3, FP8 E5M2, NVFP4, MXFP4, MLX 4/6/8-bit, GPTQ
- 18 KV Cache Quantization Types: F32, F16, Q8_0, INT8, FP8, NVFP4, TurboQuant 2/3/4-bit, PlanarQuant 2/3/4-bit, IsoQuant 2/3/4-bit, RotorQuant 2/3/4-bit β with asymmetric K/V support and paged SDPA
- Tiered KV Cache: VRAM + RAM + SSD offloading with async prefetch (
--kv-tiers vram+ram+ssd) - Chat Templates: Data-driven per-architecture prompt formatting (ChatML, Gemma, Gemma 4, Qwen 3.5, GLM-4, GPT-OSS, Llama 4)
- Recipes: Optional proven-default configs per model/hardware/quant combo
- Model Download:
agave pull <org/repo>β download GGUF models from HuggingFace Hub with auto quant selection - Interactive REPL: Multi-turn chat with
/help,/clear,/stats,/model,/quit - HTTP Server: OpenAI + Anthropic API compatible, built-in chat UI, Prometheus metrics, rate limiting
- Multimodal Vision: Image understanding via Gemma 4 SigLIP-2, Gemma 3 SigLIP, and Qwen VL vision encoders β image upload via CLI (
--image) and HTTP API - Structured Output: GBNF grammar (
--grammar-string,--grammar), JSON schema (--json-schema), JSON mode (--json-output), serverresponse_format: json_object/json_schema - Full Sampling: temperature, top-k, top-p, min-p, repeat/frequency/presence penalties, seed, stop sequences
- Batched Prefill: Chunked GEMM + fused FlashAttention-2 for fast prompt processing
- Distributed Inference: Tensor parallelism (TP), pipeline parallelism (PP), disaggregated prefill/decode. Same-node multi-GPU via POSIX shm (zero-copy IPC), cross-node via TCP. Heterogeneous: mix CUDA + Vulkan + CPU across x86_64 + aarch64
- Speculative Decoding: Draft model, self-speculative (layer skip), DDTree with configurable tree budget, n-gram history-based prediction, multi-token prediction (MTP) heads
- Fused Megakernels: Composable GPU megakernels β gate+up+SiLU fused into single dispatch (3β1)
- Sparse GEMV: Skip near-zero FFN activation blocks (~40% sparsity from SiLU). CPU +21%, Metal +12%, all GPU backends. Inspired by PowerInfer/TurboSparse
- ~125 tok/s on Qwen3.5 0.8B Q8_0 (Metal, Apple Silicon M4 Pro), 24.9 tok/s on Qwen3.5 9B MLX-4bit (82% of native MLX-lm)
# Build (produces both ReleaseFast and Debug binaries)
zig build
# Download a model from HuggingFace
./zig-out/bin/agave pull Qwen/Qwen3.5-0.8B-GGUF
# Interactive REPL
./zig-out/bin/agave model.gguf
# Single prompt
./zig-out/bin/agave model.gguf "What is the capital of France?"
# HTTP server
./zig-out/bin/agave model.gguf --serve
# Quiet mode (pipe-friendly, no banner/stats)
./zig-out/bin/agave model.gguf -q "Hello" > output.txt
# Force CPU backend
./zig-out/bin/agave model.gguf --backend cpu
# SafeTensors directory (MLX models)
./zig-out/bin/agave models/mlx-community/gemma-3-4b-it-qat-4bit
# TurboQuant KV cache (2/3/4-bit quantization for longer contexts)
./zig-out/bin/agave model.gguf --kv-type turbo4
# KV cache eviction (extend context past --ctx-size limit)
./zig-out/bin/agave model.gguf --kv-eviction norm --kv-budget 2048
./zig-out/bin/agave model.gguf --kv-eviction tri # requires .cal file
# Generate TriAttention calibration data
./zig-out/bin/agave calibrate model.gguf
# Vision: describe an image (requires mmproj or built-in vision encoder)
./zig-out/bin/agave model.gguf --image photo.png "Describe this image"
# Override recipe defaults (user flags always win)
./zig-out/bin/agave model.gguf -t 0.9 --top-p 0.95 "Tell me a story"
# Structured output: force JSON
./zig-out/bin/agave model.gguf --json-output "Generate a user profile with name and age"
# Grammar-constrained decoding (GBNF format)
./zig-out/bin/agave model.gguf --grammar-string 'root ::= "yes" | "no"' "Is the sky blue?"
# JSON schema β structured output
./zig-out/bin/agave model.gguf --json-schema '{"type":"object","properties":{"name":{"type":"string"}}}' "User info"
# Sampling parameters
./zig-out/bin/agave model.gguf -t 0.7 --top-p 0.9 --min-p 0.05 "Tell me a story"
# GPU device selection
./zig-out/bin/agave model.gguf --list-devices # Show available GPUs
./zig-out/bin/agave model.gguf --backend vulkan --device 1 # Use second GPU
# Speculative decoding
./zig-out/bin/agave target.gguf --draft-model draft.gguf "prompt" # Separate draft model
./zig-out/bin/agave model.gguf --spec-mode self --draft-layers 9 # Self-speculative
./zig-out/bin/agave model.gguf --spec-mode ddtree "prompt" # DDTree self-draft
# Fused megakernel (3β1 GPU dispatch for FFN)
./zig-out/bin/agave model.gguf --megakernel "prompt"Split models across multiple GPUs or machines via tensor parallelism (TP) and pipeline parallelism (PP).
# Same-node multi-GPU (shared memory IPC, zero-copy)
# Terminal 1: rank 0 on GPU 0
./zig-out/bin/agave model.gguf --backend vulkan --device 0 --pp 2 --rank 0 --peers localhost "prompt"
# Terminal 2: rank 1 on GPU 1
./zig-out/bin/agave model.gguf --backend vulkan --device 1 --pp 2 --rank 1 --peers localhost "prompt"
# Cross-node pipeline parallelism (TCP transport)
# Machine A (first half of layers):
./zig-out/bin/agave model.gguf --backend cuda --pp 2 --rank 0 --peers 192.168.0.2 "prompt"
# Machine B (second half + logits):
./zig-out/bin/agave model.gguf --backend cpu --pp 2 --rank 1 --peers 192.168.0.1 "prompt"
# Distributed tensor parallelism (weight sharding + all-reduce)
# Machine A:
./zig-out/bin/agave model.gguf --tp 2 --rank 0 --peers 192.168.0.2 "prompt"
# Machine B:
./zig-out/bin/agave model.gguf --tp 2 --rank 1 --peers 192.168.0.1 "prompt"Supports heterogeneous setups: different backends (CUDA + Vulkan + CPU), architectures (aarch64 + x86_64), and GPU vendors (NVIDIA + AMD) in the same cluster. When --peers is localhost or 127.0.0.1, POSIX shared memory is used instead of TCP for zero-copy IPC.
| Model | Sizes | Status | Quant Types | Notes |
|---|---|---|---|---|
| Gemma 3 | 1B, 4B, 12B, 27B | Working | BF16, Q8_0, Q4_0, Q4_K, Q5_K, Q6_K, MLX 4-bit | SPM tokenizer, GELU activation, batched prefill |
| Gemma 4 | E2B, E4B, 26B-A4B | Working | Q8_0, Q4_K, MLX 4-bit | MoE (top-8), channel-based chat template, multimodal vision (SigLIP-2) |
| Qwen 3.5 | 0.8B, 9B, 27B, 35B | Working | Q4_0, Q4_K_M, Q8_0, BF16, MLX 4-bit | Hybrid DeltaNet SSM + attention |
| GPT-OSS | 20B | Partial | Q4_0 | MoE, sliding window, attention sinks (poor output quality) |
| Nemotron-H | β | Partial | Q5_0 | Mamba-2 + attention hybrid, GGUF (poor output quality) |
| Nemotron Nano | 30B | Partial | MLX 4-bit, NVFP4 | SSM + MoE + attention hybrid, SafeTensors (poor output quality) |
| GLM-4 MoE Lite | 4.7B | Partial | MLX 4/6/8-bit | MLA + MoE (GGUF compatibility issue, poor output quality) |
| Llama 4 | Scout | Working | Q4_K, Q8_0 | iRoPE, chunked attention, MoE top-1 + shared expert, batched prefill |
Download GGUF models from HuggingFace Hub with automatic quantization selection:
# Download best available quantization (prefers Q4_K_M)
./zig-out/bin/agave pull Qwen/Qwen3.5-0.8B-GGUF
# Request specific quantization
./zig-out/bin/agave pull Qwen/Qwen3.5-0.8B-GGUF --quant Q8_0
# List available GGUF files without downloading
./zig-out/bin/agave pull Qwen/Qwen3.5-0.8B-GGUF --list
# Private repos
HF_TOKEN=hf_xxxxx ./zig-out/bin/agave pull org/private-modelDownloads are stored in the standard HuggingFace cache layout with an agave convenience symlink. Supports resume on interrupted downloads.
Generate TriAttention calibration data for frequency-domain KV eviction:
# Run calibration (produces model.cal alongside model.gguf)
./zig-out/bin/agave calibrate model.ggufThe calibration pass records per-head Q/K frequency statistics used by the --kv-eviction tri policy. See docs/ARCHITECTURE.md for details.
Start with --serve. Supports both synchronous JSON and SSE streaming.
./zig-out/bin/agave model.gguf --serve --api-key sk-mykeyAPI Endpoints:
| Endpoint | Method | Description |
|---|---|---|
/v1/chat/completions |
POST | OpenAI chat completion API |
/v1/completions |
POST | OpenAI text completion API |
/v1/messages |
POST | Anthropic Messages API |
/v1/responses |
POST | OpenAI Responses API |
/v1/models |
GET | List loaded models |
/v1/embeddings |
POST | Embedding generation (stub β returns 501) |
/v1/chat |
POST | Built-in web chat UI |
/v1/chat/regenerate |
POST | Regenerate last assistant response |
/v1/conversations |
GET, POST | Conversation management |
/v1/tokenize |
POST | Count tokens in text |
/v1/detokenize |
POST | Convert token IDs to text |
/health |
GET | Health check |
/ready |
GET | Readiness check |
/metrics |
GET | Prometheus metrics |
Server features: up to 64 concurrent connections, request scheduler (batch up to 8, 120s timeout), 30s connection read timeout, rate limiting, Bearer token auth, CORS support.
Launch without a prompt argument for multi-turn chat:
./zig-out/bin/agave model.ggufCommands:
| Command | Description |
|---|---|
/clear, /reset |
Clear conversation history and KV cache |
/context, /ctx |
Show context window usage (tokens used / max) |
/system <text> |
Set system prompt (clears conversation) |
/system |
Show current system prompt |
/stats |
Toggle generation statistics display |
/verbose |
Toggle technical details (params, EOG tokens) |
/debug |
Toggle debug logging (token IDs, layer timing) |
/model |
Show model information |
/help |
Show REPL help |
/quit, /exit, /q |
Exit |
Keyboard shortcuts: Ctrl+C cancel, Ctrl+D quit, Ctrl+L clear screen, Ctrl+R reverse search.
Measured on Apple M4 Pro (48 GB unified memory). See docs/BENCHMARKS.md for full methodology.
| Model | Quant | Backend | Decode (tok/s) | vs llama.cpp |
|---|---|---|---|---|
| Qwen3.5 0.8B | Q8_0 | Metal | 183 | 1.31x |
| Qwen3.5 9B | Q8_0 | Metal | 41.7 | 1.67x |
| Gemma 3 4B | MLX-Q4 | Metal | 78.1 | β |
| Gemma 3 12B | Q8_0 | Metal | 22.3 | 1.19x |
| Gemma 4 E2B | Q4_K_M | Metal | 21.8 | β |
| Gemma 4 E4B | Q4_K_M | Metal | 14.4 | β |
| Gemma 4 26B-A4B | Q4_K_M | Metal | 4.2 | β |
| Gemma 3 27B | QAT 4-bit | Metal | 6.3 | β |
| Qwen3.5 9B | MLX-4bit | Metal | 24.9 | β |
| Backend | Hardware | Decode (tok/s) |
|---|---|---|
| Metal | Apple M4 Pro | 129 |
| ROCm | AMD RX 7900 XTX | 50.8 |
| CPU | Ryzen 9 9950X (32T) | 44 |
| CUDA | NVIDIA GB10 (aarch64) | 35 |
| Vulkan | AMD RX 7900 XTX | 2.7 |
| Model | Config | Transport | Decode (tok/s) |
|---|---|---|---|
| 9B Q8_0 | Single GPU | β | 9.1 |
| 9B Q8_0 | PP=2 | NCCL RoCE | 8.5 |
| 9B Q8_0 | TP=2 | NCCL RoCE | 5.1 |
| 9B Q8_0 | TP=2 | TCP RoCE | 4.9 |
| 27B Q4_K_M | Single GPU | β | 2.2 |
| 27B Q4_K_M | PP=2 | NCCL RoCE | 2.2 |
| 27B Q4_K_M | TP=2 | NCCL RoCE | 1.7 |
All quant formats supported on all backends: Q8_0 (GPU), Q4_0/Q4_K/Q5_K/Q6_K (GPU or CPU fallback on UMA). See docs/KERNELS.md for details.
- Zig 0.16.0
- macOS (Metal backend) / Linux (Vulkan, CUDA, ROCm) / any platform (CPU, WebGPU backends)
- GPU backends load drivers at runtime via dlopen β no SDK needed at build time
agave [OPTIONS] <model> [prompt]
-h, --help Show help
-v, --version Print version
-q, --quiet Suppress banner and stats
-s, --serve Start HTTP server
-p, --port <PORT> Server port [default: 49453]
-n, --max-tokens <N> Max tokens to generate [default: 512]
-t, --temperature <T> Sampling temperature, 0 = greedy [default: 0]
--top-p <P> Nucleus sampling threshold [default: 1.0]
--top-k <K> Top-k sampling, 0 = disabled [default: 0]
--min-p <P> Min-p sampling threshold [default: 0]
--repeat-penalty <R> Repetition penalty [default: 1.0]
--dry-multiplier <M> DRY n-gram repetition penalty [default: 0]
--dry-length <N> DRY minimum n-gram length [default: 2]
--xtc-probability <P> XTC diversity sampling [default: 0]
--xtc-threshold <T> XTC probability threshold [default: 0.1]
--mirostat-mode <N> Mirostat target-entropy sampling: 0=off, 2=on [default: 0]
--mirostat-tau <T> Mirostat target entropy [default: 5.0]
--mirostat-eta <E> Mirostat learning rate [default: 0.1]
--system <TEXT> System prompt for chat formatting
--backend <BE> auto, cpu, metal, vulkan, cuda, rocm, webgpu [default: auto]
--ctx-size <N|auto> Context window size [default: min(model, 4096), 0 = model max, auto = fit to memory]
--seed <N> Random seed for sampling [default: random]
--grammar <FILE> GBNF grammar file for constrained decoding
--grammar-string <G> Inline GBNF grammar string
--json-schema <S> JSON schema for structured output
--json-output Force valid JSON object output
--kv-type <TYPE> KV cache quantization: f32, f16, q8_0/q8, int8/i8, fp8/fp8_e4m3, nvfp4/fp4, turbo2/tq2, turbo3/tq3, turbo4/tq4, planar2-4/pq2-4, iso2-4/iq2-4, rotor2-4/rq2-4 [default: f16]
--kv-tiers <TIERS> Enable tiered KV cache: vram+ram, vram+ram+ssd [default: off]
--kv-ram-budget <GB> RAM tier budget in GB, requires --kv-tiers [default: 50% of free RAM]
--kv-ssd-path <PATH> SSD tier file path, requires --kv-tiers with ssd
--kv-ssd-budget <GB> SSD tier budget in GB, requires --kv-tiers with ssd [default: 10]
--host <ADDR> Server bind address [default: 127.0.0.1]
--api-key <KEY> API key for server authentication (Bearer token)
--prefill-batch-size <N> Prefill chunk size in tokens [default: 512]
--no-color Disable colored output (same as --color=never)
--color <MODE> Color mode: auto, always, never [default: auto]
--kv-type-k <TYPE> KV key quantization (overrides --kv-type)
--kv-type-v <TYPE> KV value quantization (overrides --kv-type)
-V, --verbose Show technical details (params, load times, EOG)
--allow-cpu-fallback Allow GPU backends to fall back to CPU
-d, --debug Enable debug logging (token IDs, layer timing)
--json Output results as JSON (implies --quiet)
--model-info Print model metadata and exit (combine with --json)
--profile Profile per-op timing (halves throughput)
--benchmark Run decode benchmark with built-in prompt
--mmproj <PATH> Path to vision projector GGUF (mmproj file)
--image <PATH> Path to image file for multimodal inference (PNG or PPM)
--kv-eviction <MODE> KV cache eviction policy: none, norm, tri [default: none]
--kv-budget <N> Max KV entries to retain after eviction [default: 80% of ctx-size]
--mmap Use lazy mmap instead of preloading weights into RAM
--megakernel Enable fused FFN megakernels (3β1 dispatch per layer)
--draft-model <PATH> Draft model GGUF for speculative decoding
--spec-mode <MODE> Speculative mode: standard, ddtree, self, ngram, mtp [default: ddtree with --draft-model]
ngram uses output history (no draft model needed)
-K, --spec-tokens <N> Draft tokens per speculation round [default: 5]
--tree-budget <N> DDTree node budget [default: 64]
--draft-layers <N> Layers for self-speculative draft [default: auto]
--list-devices List available compute devices and exit
--device <N> GPU device index for CUDA/ROCm/Vulkan [default: 0]
--tp <N> Tensor parallelism degree [default: 1]
--pp <N> Pipeline parallelism stages [default: 1]
--peers <ADDR> Peer address for distributed inference
--rank <N> This node's rank [default: 0]
--transport <TYPE> IPC transport: auto, tcp, shm, nccl [default: auto]
--disagg Disaggregated prefill/decode
All backends and models are enabled by default. Disable individually to reduce binary size or avoid unwanted dependencies.
# Disable specific backends
zig build -Denable-vulkan=false
zig build -Denable-cuda=false -Denable-rocm=false
# CPU-only build (no GPU backends)
zig build -Denable-metal=false -Denable-vulkan=false -Denable-cuda=false -Denable-rocm=false -Denable-webgpu=false
# GPU-only (disable CPU fallback β compile error if GPU init fails)
zig build -Denable-cpu=false
# Disable specific model architectures
zig build -Denable-glm4=false
# Minimal build: single model (Gemma 3) + single backend (Metal)
zig build -Denable-gemma4=false -Denable-qwen35=false -Denable-gpt-oss=false \
-Denable-nemotron-h=false -Denable-nemotron-nano=false -Denable-glm4=false \
-Denable-llama4=false \
-Denable-vulkan=false -Denable-cuda=false -Denable-rocm=false -Denable-webgpu=false
# Override GPU architecture targets
zig build -Dcuda-sm=sm_120 # Blackwell
zig build -Drocm-arch=gfx942 # MI300X
# Cross-compile
zig build -Dtarget=aarch64-linux-gnu -Denable-metal=falseBackend Options:
| Option | Type | Default | Purpose |
|---|---|---|---|
enable-cpu |
bool | true | CPU backend |
enable-metal |
bool | true | Metal backend (macOS only) |
enable-vulkan |
bool | true | Vulkan backend (runtime dlopen) |
enable-cuda |
bool | true | CUDA backend (runtime dlopen) |
enable-rocm |
bool | true | ROCm backend (runtime dlopen) |
enable-webgpu |
bool | false | WebGPU backend (WGSL shaders) |
cuda-sm |
enum | sm_90 | CUDA SM target (sm_50..sm_120) |
rocm-arch |
enum | gfx1100 | ROCm GFX target (gfx90a..gfx1151) |
Model Options:
| Option | Type | Default | Purpose |
|---|---|---|---|
enable-gemma3 |
bool | true | Gemma 3 model support |
enable-gemma4 |
bool | true | Gemma 4 model support |
enable-qwen35 |
bool | true | Qwen 3.5 model support |
enable-gpt-oss |
bool | true | GPT-OSS model support |
enable-nemotron-h |
bool | true | Nemotron-H model support |
enable-nemotron-nano |
bool | true | Nemotron Nano model support |
enable-glm4 |
bool | true | GLM-4 model support |
enable-llama4 |
bool | true | Llama 4 model support |
Recipes are optional preset configurations matched by architecture + backend + quantization. They provide proven defaults (temperature, top-p, context size, etc.) while allowing full user override via CLI flags.
# Recipe auto-applied, shown in banner:
π΅ agave Qwen3.5-0.8B Q4_0 Metal 32L/4096E/16H (45ms)
recipe: Qwen3.5 Q4 Metal
# User flags always take priority over recipe defaults:
./zig-out/bin/agave model.gguf -t 0 # overrides recipe temperature
Current presets: Qwen3.5 Q4 Metal, Gemma Q4 Metal, GPT-OSS Metal, GLM-4 generic, CPU generic. Add new recipes in src/recipe.zig.
src/
βββ main.zig # CLI, format detection, model init, REPL, recipe application
βββ arch.zig # Architecture enum, detection, chat template mapping
βββ pull.zig # Model download from HuggingFace Hub (agave pull)
βββ server/ # HTTP server
β βββ server.zig # HTTP server (OpenAI + Anthropic API + chat UI)
β βββ json.zig # JSON field extraction, encoding, and form-parsing
β βββ scheduler.zig # Continuous batching request scheduler
β βββ metrics.zig # Prometheus metrics collector
β βββ rate_limiter.zig # Token bucket rate limiter
βββ display.zig # Rich CLI output (banner, stats, progress)
βββ chat_template.zig # Data-driven chat prompt templates (ChatML, Gemma, Gemma4, Qwen3.5, GLM-4, GPT-OSS, Llama 4)
βββ recipe.zig # Optional preset configs per model/hardware/quant combo
βββ thread_pool.zig # Futex-based work-stealing thread pool
βββ image.zig # PNG/PPM image decoder and resize for multimodal inference
βββ perf.zig # Performance timer utilities
βββ readline.zig # Line editor for interactive REPL
βββ micro_bench.zig # Standalone micro-benchmark binary
βββ format/ # Weight file loaders
β βββ format.zig # Format interface (getTensor, getMetaStr, ...)
β βββ gguf.zig # GGUF v2/v3 with mmap
β βββ safetensors.zig# Multi-shard SafeTensors + config.json
βββ models/ # Model architectures
β βββ model.zig # Model interface + shared helpers (expertWeightStride, etc.)
β βββ gemma3.zig # Gemma 3 (GQA, GELU, post-norms)
β βββ gemma4.zig # Gemma 4 (MoE, dual attention, PLE)
β βββ qwen35.zig # Qwen 3.5 (DeltaNet SSM hybrid)
β βββ gpt_oss.zig # GPT-OSS (MoE, sliding window)
β βββ nemotron_h.zig # Nemotron-H (Mamba-2 hybrid, GGUF)
β βββ nemotron_nano.zig # Nemotron Nano (SSM+MoE+attn, SafeTensors NVFP4)
β βββ glm4.zig # GLM-4 MoE Lite (MLA, MoE)
β βββ llama4.zig # Llama 4 (iRoPE, chunked attention, MoE)
β βββ vision.zig # Vision encoder (SigLIP-2, SigLIP, Qwen VL) for multimodal models
βββ ops/ # Shared compute kernels
β βββ attention.zig # SDPA with SIMD + sliding window + backend dispatch
β βββ math.zig # argmax, softplus, sigmoid, GELU, sampleToken
β βββ ssm.zig # SSM ops: causal conv1d, Mamba-2 recurrence, group norm+gate
β βββ quant.zig # Quantization helpers (bf16, mxfp4, fp8, iq4nl, nvfp4_st)
β βββ kv_quant.zig # KV cache quantization (f32/f16/q8_0/int8/fp8/nvfp4/turbo/planar/iso/rotor)
β βββ mlx.zig # MLX 4/6/8-bit affine dequant
β βββ gptq.zig # GPTQ INT4 GEMV kernel (packed u32 weights, per-group scales/zeros)
β βββ kv_evict.zig # KV eviction: norm-based scoring, cache compaction
β βββ split_attention.zig # Split-attention: async CPU-GPU KV offloading
βββ backend/ # Hardware backends (all individually toggleable)
β βββ backend.zig # Tagged union dispatcher + NullBackend stub
β βββ cpu.zig # CPU with SIMD (AVX2, NEON, SVE)
β βββ metal.zig # Metal GPU (Apple Silicon)
β βββ vulkan.zig # Vulkan GPU (runtime dlopen)
β βββ cuda.zig # CUDA GPU (runtime dlopen, Zig PTX kernels)
β βββ rocm.zig # ROCm GPU (runtime dlopen)
β βββ webgpu.zig # WebGPU (WGSL shaders, browser + native)
β βββ accelerate.zig # Apple Accelerate.framework BLAS bindings (AMX-accelerated SGEMM)
β βββ objc.zig # Objective-C runtime bridge for Metal
β βββ kernels/ # GPU shader/kernel sources
β βββ metal/ # MSL compute shaders
β βββ vulkan/ # SPIR-V compute shaders
β βββ cuda/ # Zig CUDA kernels (compiled to PTX)
β βββ rocm/ # AMDGCN kernels (compiled to HSACO)
β βββ webgpu/ # WGSL compute shaders
βββ spec/ # Speculative decoding
β βββ spec_decode.zig # Orchestrator: draft, verify, accept
β βββ ddtree.zig # DDTree tree construction
β βββ ngram.zig # N-gram history-based draft (no draft model)
βββ parallel/ # Distributed inference
β βββ transport.zig # TCP, POSIX shm, NCCL transport
β βββ tp.zig # Tensor parallelism utilities
β βββ discovery.zig # UDP peer discovery
βββ devices/
β βββ discovery.zig # GPU device enumeration (--list-devices)
βββ kvcache/
β βββ manager.zig # KV cache alloc/free, PagedKvCache, RadixTree
β βββ block_allocator.zig # Block allocation for paged KV cache
β βββ tiered.zig # Tiered KV cache (VRAM + RAM + SSD)
β βββ prefetch.zig # Async block prefetching for tiered cache
βββ tokenizer/
βββ tokenizer.zig # Tokenizer interface
βββ bpe.zig # BPE + SPM tokenizer
research/kernels/ # Kernel research (not part of main build)
βββ reference.py # PyTorch reference implementations
βββ generate_golden.py # Golden test data generator
βββ autotune.py # Benchmarking and optimization orchestrator
βββ registry.py # Kernel registry and search spaces
βββ golden/ # Generated .bin files for Zig @embedFile
Build multi-platform images (x86_64 + aarch64) using docker buildx:
# Build for both platforms (all GPU backends enabled, glibc)
docker buildx build --platform linux/amd64,linux/arm64 -t agave .
# Build and load for current platform only
docker buildx build --load -t agave .
# CPU-only build (static musl binary, smaller image)
docker buildx build --load -t agave \
--build-arg ENABLE_VULKAN=false \
--build-arg ENABLE_CUDA=false \
--build-arg ENABLE_ROCM=false .
# Minimal build: single model + CPU only
docker buildx build --load -t agave \
--build-arg ENABLE_VULKAN=false \
--build-arg ENABLE_CUDA=false \
--build-arg ENABLE_ROCM=false \
--build-arg ENABLE_QWEN35=false \
--build-arg ENABLE_GPT_OSS=false \
--build-arg ENABLE_NEMOTRON_H=false \
--build-arg ENABLE_NEMOTRON_NANO=false \
--build-arg ENABLE_GLM4=false .
# Run inference
docker run --rm -v /path/to/models:/models agave /models/model.gguf "Hello"
# Run HTTP server
docker run --rm -p 49453:49453 -v /path/to/models:/models agave /models/model.gguf --serve
# Override Zig version at build time
docker buildx build --build-arg ZIG_VERSION=0.16.0 -t agave .GPU backends (CUDA, Vulkan, ROCm) load their drivers at runtime via dlopen, which requires glibc. When all three GPU backends are disabled, the build automatically switches to musl for a fully static binary. Zig cross-compiles natively β no QEMU emulation needed during build.
For environments where a fully static, dependency-free binary is needed (Alpine containers, embedded systems, minimal distros), disable all dlopen backends:
# Static musl binary β CPU backend only
zig build -Dtarget=x86_64-linux-musl \
-Denable-metal=false -Denable-vulkan=false \
-Denable-cuda=false -Denable-rocm=false
# Cross-compile static ARM64 binary
zig build -Dtarget=aarch64-linux-musl \
-Denable-metal=false -Denable-vulkan=false \
-Denable-cuda=false -Denable-rocm=falseNote: Static musl builds only work with the CPU backend. GPU backends (CUDA, Vulkan, ROCm) depend on dlopen to load vendor drivers at runtime, which requires glibc. Attempting to dlopen a glibc-linked .so from a musl binary will segfault.
- Tutorial: LLM Inference From Scratch β 18-chapter progressive tutorial + 4 appendixes
- Architecture β Project structure, module reference, inference pipeline
- Models β Supported models, parameters, per-model details
- Benchmarks β Performance comparisons vs llama.cpp
- Kernel Status β Per-backend kernel implementation status
- Distributed Inference β TP, PP, disaggregated prefill/decode
- Contributing β How to add backends, models, quantization
- API Reference β HTTP API endpoints, request/response formats
- Megakernel System β Composable fused GPU dispatch
- CLAUDE.md β Engineering standards for contributors
- research/kernels/ β Kernel research tools (benchmarks, golden tests)
GNU General Public License v3.0