GitHub - maci0/agave: A high-performance LLM inference engine written in Zig.

A high-performance LLM inference engine written in Zig.
Zero external ML libraries — all kernels, quantization, and model logic from scratch.

Quick Start • Features • Contributing • Docs

Features

8 Model Architectures: Gemma 3, Gemma 4, Qwen 3.5, GPT-OSS, Nemotron-H, Nemotron Nano, GLM-4, Llama 4
6 Backends: CPU (SIMD-optimized, Accelerate.framework on macOS), Metal GPU (Apple Silicon), Vulkan, CUDA, ROCm, WebGPU — individually toggleable at build time
Compile-Time Model Selection: Disable unused model architectures to reduce binary size
2 Formats: GGUF, SafeTensors (multi-shard, MLX quantized, NVFP4)
20+ Quantization Types: F32, F16, BF16, Q2_K, Q3_K, Q4_0, Q4_1, Q4_K, Q5_0, Q5_K, Q6_K, Q8_0, TQ1_0, IQ4_XS, IQ4_NL, FP8 E4M3, FP8 E5M2, NVFP4, MXFP4, MLX 4/6/8-bit, GPTQ
18 KV Cache Quantization Types: F32, F16, Q8_0, INT8, FP8, NVFP4, TurboQuant 2/3/4-bit, PlanarQuant 2/3/4-bit, IsoQuant 2/3/4-bit, RotorQuant 2/3/4-bit — with asymmetric K/V support and paged SDPA
Tiered KV Cache: VRAM + RAM + SSD offloading with async prefetch (--kv-tiers vram+ram+ssd)
Chat Templates: Data-driven per-architecture prompt formatting (ChatML, Gemma, Gemma 4, Qwen 3.5, GLM-4, GPT-OSS, Llama 4)
Recipes: Optional proven-default configs per model/hardware/quant combo
Model Download: agave pull <org/repo> — download GGUF models from HuggingFace Hub with auto quant selection
Interactive REPL: Multi-turn chat with /help, /clear, /stats, /model, /quit
HTTP Server: OpenAI + Anthropic API compatible, built-in chat UI, Prometheus metrics, rate limiting
Multimodal Vision: Image understanding via Gemma 4 SigLIP-2, Gemma 3 SigLIP, and Qwen VL vision encoders — image upload via CLI (--image) and HTTP API
Structured Output: GBNF grammar (--grammar-string, --grammar), JSON schema (--json-schema), JSON mode (--json-output), server response_format: json_object/json_schema
Full Sampling: temperature, top-k, top-p, min-p, repeat/frequency/presence penalties, seed, stop sequences
Batched Prefill: Chunked GEMM + fused FlashAttention-2 for fast prompt processing
Distributed Inference: Tensor parallelism (TP), pipeline parallelism (PP), disaggregated prefill/decode. Same-node multi-GPU via POSIX shm (zero-copy IPC), cross-node via TCP. Heterogeneous: mix CUDA + Vulkan + CPU across x86_64 + aarch64
Speculative Decoding: Draft model, self-speculative (layer skip), DDTree with configurable tree budget, n-gram history-based prediction, multi-token prediction (MTP) heads
Fused Megakernels: Composable GPU megakernels — gate+up+SiLU fused into single dispatch (3→1)
Sparse GEMV: Skip near-zero FFN activation blocks (~40% sparsity from SiLU). CPU +21%, Metal +12%, all GPU backends. Inspired by PowerInfer/TurboSparse
~125 tok/s on Qwen3.5 0.8B Q8_0 (Metal, Apple Silicon M4 Pro), 24.9 tok/s on Qwen3.5 9B MLX-4bit (82% of native MLX-lm)

Quick Start

# Build (produces both ReleaseFast and Debug binaries)
zig build

# Download a model from HuggingFace
./zig-out/bin/agave pull Qwen/Qwen3.5-0.8B-GGUF

# Interactive REPL
./zig-out/bin/agave model.gguf

# Single prompt
./zig-out/bin/agave model.gguf "What is the capital of France?"

# HTTP server
./zig-out/bin/agave model.gguf --serve

# Quiet mode (pipe-friendly, no banner/stats)
./zig-out/bin/agave model.gguf -q "Hello" > output.txt

# Force CPU backend
./zig-out/bin/agave model.gguf --backend cpu

# SafeTensors directory (MLX models)
./zig-out/bin/agave models/mlx-community/gemma-3-4b-it-qat-4bit

# TurboQuant KV cache (2/3/4-bit quantization for longer contexts)
./zig-out/bin/agave model.gguf --kv-type turbo4

# KV cache eviction (extend context past --ctx-size limit)
./zig-out/bin/agave model.gguf --kv-eviction norm --kv-budget 2048
./zig-out/bin/agave model.gguf --kv-eviction tri   # requires .cal file

# Generate TriAttention calibration data
./zig-out/bin/agave calibrate model.gguf

# Vision: describe an image (requires mmproj or built-in vision encoder)
./zig-out/bin/agave model.gguf --image photo.png "Describe this image"

# Override recipe defaults (user flags always win)
./zig-out/bin/agave model.gguf -t 0.9 --top-p 0.95 "Tell me a story"

# Structured output: force JSON
./zig-out/bin/agave model.gguf --json-output "Generate a user profile with name and age"

# Grammar-constrained decoding (GBNF format)
./zig-out/bin/agave model.gguf --grammar-string 'root ::= "yes" | "no"' "Is the sky blue?"

# JSON schema → structured output
./zig-out/bin/agave model.gguf --json-schema '{"type":"object","properties":{"name":{"type":"string"}}}' "User info"

# Sampling parameters
./zig-out/bin/agave model.gguf -t 0.7 --top-p 0.9 --min-p 0.05 "Tell me a story"

# GPU device selection
./zig-out/bin/agave model.gguf --list-devices                      # Show available GPUs
./zig-out/bin/agave model.gguf --backend vulkan --device 1          # Use second GPU

# Speculative decoding
./zig-out/bin/agave target.gguf --draft-model draft.gguf "prompt"   # Separate draft model
./zig-out/bin/agave model.gguf --spec-mode self --draft-layers 9    # Self-speculative
./zig-out/bin/agave model.gguf --spec-mode ddtree "prompt"          # DDTree self-draft

# Fused megakernel (3→1 GPU dispatch for FFN)
./zig-out/bin/agave model.gguf --megakernel "prompt"

Distributed Inference

Split models across multiple GPUs or machines via tensor parallelism (TP) and pipeline parallelism (PP).

# Same-node multi-GPU (shared memory IPC, zero-copy)
# Terminal 1: rank 0 on GPU 0
./zig-out/bin/agave model.gguf --backend vulkan --device 0 --pp 2 --rank 0 --peers localhost "prompt"
# Terminal 2: rank 1 on GPU 1
./zig-out/bin/agave model.gguf --backend vulkan --device 1 --pp 2 --rank 1 --peers localhost "prompt"

# Cross-node pipeline parallelism (TCP transport)
# Machine A (first half of layers):
./zig-out/bin/agave model.gguf --backend cuda --pp 2 --rank 0 --peers 192.168.0.2 "prompt"
# Machine B (second half + logits):
./zig-out/bin/agave model.gguf --backend cpu --pp 2 --rank 1 --peers 192.168.0.1 "prompt"

# Distributed tensor parallelism (weight sharding + all-reduce)
# Machine A:
./zig-out/bin/agave model.gguf --tp 2 --rank 0 --peers 192.168.0.2 "prompt"
# Machine B:
./zig-out/bin/agave model.gguf --tp 2 --rank 1 --peers 192.168.0.1 "prompt"

Supports heterogeneous setups: different backends (CUDA + Vulkan + CPU), architectures (aarch64 + x86_64), and GPU vendors (NVIDIA + AMD) in the same cluster. When --peers is localhost or 127.0.0.1, POSIX shared memory is used instead of TCP for zero-copy IPC.

Supported Models

Model	Sizes	Status	Quant Types	Notes
Gemma 3	1B, 4B, 12B, 27B	Working	BF16, Q8_0, Q4_0, Q4_K, Q5_K, Q6_K, MLX 4-bit	SPM tokenizer, GELU activation, batched prefill
Gemma 4	E2B, E4B, 26B-A4B	Working	Q8_0, Q4_K, MLX 4-bit	MoE (top-8), channel-based chat template, multimodal vision (SigLIP-2)
Qwen 3.5	0.8B, 9B, 27B, 35B	Working	Q4_0, Q4_K_M, Q8_0, BF16, MLX 4-bit	Hybrid DeltaNet SSM + attention
GPT-OSS	20B	Partial	Q4_0	MoE, sliding window, attention sinks (poor output quality)
Nemotron-H	—	Partial	Q5_0	Mamba-2 + attention hybrid, GGUF (poor output quality)
Nemotron Nano	30B	Partial	MLX 4-bit, NVFP4	SSM + MoE + attention hybrid, SafeTensors (poor output quality)
GLM-4 MoE Lite	4.7B	Partial	MLX 4/6/8-bit	MLA + MoE (GGUF compatibility issue, poor output quality)
Llama 4	Scout	Working	Q4_K, Q8_0	iRoPE, chunked attention, MoE top-1 + shared expert, batched prefill

Model Download

Download GGUF models from HuggingFace Hub with automatic quantization selection:

# Download best available quantization (prefers Q4_K_M)
./zig-out/bin/agave pull Qwen/Qwen3.5-0.8B-GGUF

# Request specific quantization
./zig-out/bin/agave pull Qwen/Qwen3.5-0.8B-GGUF --quant Q8_0

# List available GGUF files without downloading
./zig-out/bin/agave pull Qwen/Qwen3.5-0.8B-GGUF --list

# Private repos
HF_TOKEN=hf_xxxxx ./zig-out/bin/agave pull org/private-model

Downloads are stored in the standard HuggingFace cache layout with an agave convenience symlink. Supports resume on interrupted downloads.

Calibration

Generate TriAttention calibration data for frequency-domain KV eviction:

# Run calibration (produces model.cal alongside model.gguf)
./zig-out/bin/agave calibrate model.gguf

The calibration pass records per-head Q/K frequency statistics used by the --kv-eviction tri policy. See docs/ARCHITECTURE.md for details.

HTTP Server

Start with --serve. Supports both synchronous JSON and SSE streaming.

./zig-out/bin/agave model.gguf --serve --api-key sk-mykey

API Endpoints:

Endpoint	Method	Description
`/v1/chat/completions`	POST	OpenAI chat completion API
`/v1/completions`	POST	OpenAI text completion API
`/v1/messages`	POST	Anthropic Messages API
`/v1/responses`	POST	OpenAI Responses API
`/v1/models`	GET	List loaded models
`/v1/embeddings`	POST	Embedding generation (stub — returns 501)
`/v1/chat`	POST	Built-in web chat UI
`/v1/chat/regenerate`	POST	Regenerate last assistant response
`/v1/conversations`	GET, POST	Conversation management
`/v1/tokenize`	POST	Count tokens in text
`/v1/detokenize`	POST	Convert token IDs to text
`/health`	GET	Health check
`/ready`	GET	Readiness check
`/metrics`	GET	Prometheus metrics

Server features: up to 64 concurrent connections, request scheduler (batch up to 8, 120s timeout), 30s connection read timeout, rate limiting, Bearer token auth, CORS support.

Interactive REPL

Launch without a prompt argument for multi-turn chat:

./zig-out/bin/agave model.gguf

Commands:

Command	Description
`/clear`, `/reset`	Clear conversation history and KV cache
`/context`, `/ctx`	Show context window usage (tokens used / max)
`/system <text>`	Set system prompt (clears conversation)
`/system`	Show current system prompt
`/stats`	Toggle generation statistics display
`/verbose`	Toggle technical details (params, EOG tokens)
`/debug`	Toggle debug logging (token IDs, layer timing)
`/model`	Show model information
`/help`	Show REPL help
`/quit`, `/exit`, `/q`	Exit

Keyboard shortcuts: Ctrl+C cancel, Ctrl+D quit, Ctrl+L clear screen, Ctrl+R reverse search.

Benchmarks

Measured on Apple M4 Pro (48 GB unified memory). See docs/BENCHMARKS.md for full methodology.

Model	Quant	Backend	Decode (tok/s)	vs llama.cpp
Qwen3.5 0.8B	Q8_0	Metal	183	1.31x
Qwen3.5 9B	Q8_0	Metal	41.7	1.67x
Gemma 3 4B	MLX-Q4	Metal	78.1	—
Gemma 3 12B	Q8_0	Metal	22.3	1.19x
Gemma 4 E2B	Q4_K_M	Metal	21.8	—
Gemma 4 E4B	Q4_K_M	Metal	14.4	—
Gemma 4 26B-A4B	Q4_K_M	Metal	4.2	—
Gemma 3 27B	QAT 4-bit	Metal	6.3	—
Qwen3.5 9B	MLX-4bit	Metal	24.9	—

Multi-Backend (Qwen3.5 0.8B Q8_0)

Backend	Hardware	Decode (tok/s)
Metal	Apple M4 Pro	129
ROCm	AMD RX 7900 XTX	50.8
CPU	Ryzen 9 9950X (32T)	44
CUDA	NVIDIA GB10 (aarch64)	35
Vulkan	AMD RX 7900 XTX	2.7

Distributed Inference (dual NVIDIA GB10 over RoCE RDMA)

Model	Config	Transport	Decode (tok/s)
9B Q8_0	Single GPU	—	9.1
9B Q8_0	PP=2	NCCL RoCE	8.5
9B Q8_0	TP=2	NCCL RoCE	5.1
9B Q8_0	TP=2	TCP RoCE	4.9
27B Q4_K_M	Single GPU	—	2.2
27B Q4_K_M	PP=2	NCCL RoCE	2.2
27B Q4_K_M	TP=2	NCCL RoCE	1.7

All quant formats supported on all backends: Q8_0 (GPU), Q4_0/Q4_K/Q5_K/Q6_K (GPU or CPU fallback on UMA). See docs/KERNELS.md for details.

Prerequisites

Zig 0.16.0
macOS (Metal backend) / Linux (Vulkan, CUDA, ROCm) / any platform (CPU, WebGPU backends)
GPU backends load drivers at runtime via dlopen — no SDK needed at build time

CLI Options

agave [OPTIONS] <model> [prompt]

  -h, --help               Show help
  -v, --version            Print version
  -q, --quiet              Suppress banner and stats
  -s, --serve              Start HTTP server
  -p, --port <PORT>        Server port [default: 49453]
  -n, --max-tokens <N>     Max tokens to generate [default: 512]
  -t, --temperature <T>    Sampling temperature, 0 = greedy [default: 0]
      --top-p <P>          Nucleus sampling threshold [default: 1.0]
      --top-k <K>          Top-k sampling, 0 = disabled [default: 0]
      --min-p <P>          Min-p sampling threshold [default: 0]
      --repeat-penalty <R> Repetition penalty [default: 1.0]
      --dry-multiplier <M> DRY n-gram repetition penalty [default: 0]
      --dry-length <N>     DRY minimum n-gram length [default: 2]
      --xtc-probability <P> XTC diversity sampling [default: 0]
      --xtc-threshold <T>  XTC probability threshold [default: 0.1]
      --mirostat-mode <N>  Mirostat target-entropy sampling: 0=off, 2=on [default: 0]
      --mirostat-tau <T>   Mirostat target entropy [default: 5.0]
      --mirostat-eta <E>   Mirostat learning rate [default: 0.1]
      --system <TEXT>      System prompt for chat formatting
      --backend <BE>       auto, cpu, metal, vulkan, cuda, rocm, webgpu [default: auto]
      --ctx-size <N|auto>  Context window size [default: min(model, 4096), 0 = model max, auto = fit to memory]
      --seed <N>           Random seed for sampling [default: random]
      --grammar <FILE>     GBNF grammar file for constrained decoding
      --grammar-string <G> Inline GBNF grammar string
      --json-schema <S>    JSON schema for structured output
      --json-output        Force valid JSON object output
      --kv-type <TYPE>     KV cache quantization: f32, f16, q8_0/q8, int8/i8, fp8/fp8_e4m3, nvfp4/fp4, turbo2/tq2, turbo3/tq3, turbo4/tq4, planar2-4/pq2-4, iso2-4/iq2-4, rotor2-4/rq2-4 [default: f16]
      --kv-tiers <TIERS>   Enable tiered KV cache: vram+ram, vram+ram+ssd [default: off]
      --kv-ram-budget <GB> RAM tier budget in GB, requires --kv-tiers [default: 50% of free RAM]
      --kv-ssd-path <PATH> SSD tier file path, requires --kv-tiers with ssd
      --kv-ssd-budget <GB> SSD tier budget in GB, requires --kv-tiers with ssd [default: 10]
      --host <ADDR>        Server bind address [default: 127.0.0.1]
      --api-key <KEY>      API key for server authentication (Bearer token)
      --prefill-batch-size <N> Prefill chunk size in tokens [default: 512]
      --no-color           Disable colored output (same as --color=never)
      --color <MODE>       Color mode: auto, always, never [default: auto]
      --kv-type-k <TYPE>   KV key quantization (overrides --kv-type)
      --kv-type-v <TYPE>   KV value quantization (overrides --kv-type)
  -V, --verbose            Show technical details (params, load times, EOG)
      --allow-cpu-fallback Allow GPU backends to fall back to CPU
  -d, --debug              Enable debug logging (token IDs, layer timing)
      --json               Output results as JSON (implies --quiet)
      --model-info         Print model metadata and exit (combine with --json)
      --profile            Profile per-op timing (halves throughput)
      --benchmark          Run decode benchmark with built-in prompt
      --mmproj <PATH>      Path to vision projector GGUF (mmproj file)
      --image <PATH>       Path to image file for multimodal inference (PNG or PPM)
      --kv-eviction <MODE> KV cache eviction policy: none, norm, tri [default: none]
      --kv-budget <N>      Max KV entries to retain after eviction [default: 80% of ctx-size]
      --mmap               Use lazy mmap instead of preloading weights into RAM
      --megakernel         Enable fused FFN megakernels (3→1 dispatch per layer)
      --draft-model <PATH> Draft model GGUF for speculative decoding
      --spec-mode <MODE>   Speculative mode: standard, ddtree, self, ngram, mtp [default: ddtree with --draft-model]
                           ngram uses output history (no draft model needed)
  -K, --spec-tokens <N>    Draft tokens per speculation round [default: 5]
      --tree-budget <N>    DDTree node budget [default: 64]
      --draft-layers <N>   Layers for self-speculative draft [default: auto]
      --list-devices       List available compute devices and exit
      --device <N>         GPU device index for CUDA/ROCm/Vulkan [default: 0]
      --tp <N>             Tensor parallelism degree [default: 1]
      --pp <N>             Pipeline parallelism stages [default: 1]
      --peers <ADDR>       Peer address for distributed inference
      --rank <N>           This node's rank [default: 0]
      --transport <TYPE>   IPC transport: auto, tcp, shm, nccl [default: auto]
      --disagg             Disaggregated prefill/decode

Build Options

All backends and models are enabled by default. Disable individually to reduce binary size or avoid unwanted dependencies.

# Disable specific backends
zig build -Denable-vulkan=false
zig build -Denable-cuda=false -Denable-rocm=false

# CPU-only build (no GPU backends)
zig build -Denable-metal=false -Denable-vulkan=false -Denable-cuda=false -Denable-rocm=false -Denable-webgpu=false

# GPU-only (disable CPU fallback — compile error if GPU init fails)
zig build -Denable-cpu=false

# Disable specific model architectures
zig build -Denable-glm4=false

# Minimal build: single model (Gemma 3) + single backend (Metal)
zig build -Denable-gemma4=false -Denable-qwen35=false -Denable-gpt-oss=false \
  -Denable-nemotron-h=false -Denable-nemotron-nano=false -Denable-glm4=false \
  -Denable-llama4=false \
  -Denable-vulkan=false -Denable-cuda=false -Denable-rocm=false -Denable-webgpu=false

# Override GPU architecture targets
zig build -Dcuda-sm=sm_120        # Blackwell
zig build -Drocm-arch=gfx942      # MI300X

# Cross-compile
zig build -Dtarget=aarch64-linux-gnu -Denable-metal=false

Backend Options:

Option	Type	Default	Purpose
`enable-cpu`	bool	true	CPU backend
`enable-metal`	bool	true	Metal backend (macOS only)
`enable-vulkan`	bool	true	Vulkan backend (runtime dlopen)
`enable-cuda`	bool	true	CUDA backend (runtime dlopen)
`enable-rocm`	bool	true	ROCm backend (runtime dlopen)
`enable-webgpu`	bool	false	WebGPU backend (WGSL shaders)
`cuda-sm`	enum	sm_90	CUDA SM target (sm_50..sm_120)
`rocm-arch`	enum	gfx1100	ROCm GFX target (gfx90a..gfx1151)

Model Options:

Option	Type	Default	Purpose
`enable-gemma3`	bool	true	Gemma 3 model support
`enable-gemma4`	bool	true	Gemma 4 model support
`enable-qwen35`	bool	true	Qwen 3.5 model support
`enable-gpt-oss`	bool	true	GPT-OSS model support
`enable-nemotron-h`	bool	true	Nemotron-H model support
`enable-nemotron-nano`	bool	true	Nemotron Nano model support
`enable-glm4`	bool	true	GLM-4 model support
`enable-llama4`	bool	true	Llama 4 model support

Recipes

Recipes are optional preset configurations matched by architecture + backend + quantization. They provide proven defaults (temperature, top-p, context size, etc.) while allowing full user override via CLI flags.

# Recipe auto-applied, shown in banner:
🌵 agave Qwen3.5-0.8B Q4_0 Metal 32L/4096E/16H (45ms)
recipe: Qwen3.5 Q4 Metal

# User flags always take priority over recipe defaults:
./zig-out/bin/agave model.gguf -t 0  # overrides recipe temperature

Current presets: Qwen3.5 Q4 Metal, Gemma Q4 Metal, GPT-OSS Metal, GLM-4 generic, CPU generic. Add new recipes in src/recipe.zig.

Project Structure

src/
├── main.zig           # CLI, format detection, model init, REPL, recipe application
├── arch.zig           # Architecture enum, detection, chat template mapping
├── pull.zig           # Model download from HuggingFace Hub (agave pull)
├── server/            # HTTP server
│   ├── server.zig     #   HTTP server (OpenAI + Anthropic API + chat UI)
│   ├── json.zig       #   JSON field extraction, encoding, and form-parsing
│   ├── scheduler.zig  #   Continuous batching request scheduler
│   ├── metrics.zig    #   Prometheus metrics collector
│   └── rate_limiter.zig #  Token bucket rate limiter
├── display.zig        # Rich CLI output (banner, stats, progress)
├── chat_template.zig  # Data-driven chat prompt templates (ChatML, Gemma, Gemma4, Qwen3.5, GLM-4, GPT-OSS, Llama 4)
├── recipe.zig         # Optional preset configs per model/hardware/quant combo
├── thread_pool.zig    # Futex-based work-stealing thread pool
├── image.zig          # PNG/PPM image decoder and resize for multimodal inference
├── perf.zig           # Performance timer utilities
├── readline.zig       # Line editor for interactive REPL
├── micro_bench.zig    # Standalone micro-benchmark binary
├── format/            # Weight file loaders
│   ├── format.zig     #   Format interface (getTensor, getMetaStr, ...)
│   ├── gguf.zig       #   GGUF v2/v3 with mmap
│   └── safetensors.zig#   Multi-shard SafeTensors + config.json
├── models/            # Model architectures
│   ├── model.zig      #   Model interface + shared helpers (expertWeightStride, etc.)
│   ├── gemma3.zig     #   Gemma 3 (GQA, GELU, post-norms)
│   ├── gemma4.zig     #   Gemma 4 (MoE, dual attention, PLE)
│   ├── qwen35.zig     #   Qwen 3.5 (DeltaNet SSM hybrid)
│   ├── gpt_oss.zig    #   GPT-OSS (MoE, sliding window)
│   ├── nemotron_h.zig #   Nemotron-H (Mamba-2 hybrid, GGUF)
│   ├── nemotron_nano.zig # Nemotron Nano (SSM+MoE+attn, SafeTensors NVFP4)
│   ├── glm4.zig       #   GLM-4 MoE Lite (MLA, MoE)
│   ├── llama4.zig     #   Llama 4 (iRoPE, chunked attention, MoE)
│   └── vision.zig     #   Vision encoder (SigLIP-2, SigLIP, Qwen VL) for multimodal models
├── ops/               # Shared compute kernels
│   ├── attention.zig  #   SDPA with SIMD + sliding window + backend dispatch
│   ├── math.zig       #   argmax, softplus, sigmoid, GELU, sampleToken
│   ├── ssm.zig        #   SSM ops: causal conv1d, Mamba-2 recurrence, group norm+gate
│   ├── quant.zig      #   Quantization helpers (bf16, mxfp4, fp8, iq4nl, nvfp4_st)
│   ├── kv_quant.zig   #   KV cache quantization (f32/f16/q8_0/int8/fp8/nvfp4/turbo/planar/iso/rotor)
│   ├── mlx.zig        #   MLX 4/6/8-bit affine dequant
│   ├── gptq.zig       #   GPTQ INT4 GEMV kernel (packed u32 weights, per-group scales/zeros)
│   ├── kv_evict.zig   #   KV eviction: norm-based scoring, cache compaction
│   └── split_attention.zig # Split-attention: async CPU-GPU KV offloading
├── backend/           # Hardware backends (all individually toggleable)
│   ├── backend.zig    #   Tagged union dispatcher + NullBackend stub
│   ├── cpu.zig        #   CPU with SIMD (AVX2, NEON, SVE)
│   ├── metal.zig      #   Metal GPU (Apple Silicon)
│   ├── vulkan.zig     #   Vulkan GPU (runtime dlopen)
│   ├── cuda.zig       #   CUDA GPU (runtime dlopen, Zig PTX kernels)
│   ├── rocm.zig       #   ROCm GPU (runtime dlopen)
│   ├── webgpu.zig     #   WebGPU (WGSL shaders, browser + native)
│   ├── accelerate.zig #   Apple Accelerate.framework BLAS bindings (AMX-accelerated SGEMM)
│   ├── objc.zig       #   Objective-C runtime bridge for Metal
│   └── kernels/       #   GPU shader/kernel sources
│       ├── metal/     #     MSL compute shaders
│       ├── vulkan/    #     SPIR-V compute shaders
│       ├── cuda/      #     Zig CUDA kernels (compiled to PTX)
│       ├── rocm/      #     AMDGCN kernels (compiled to HSACO)
│       └── webgpu/    #     WGSL compute shaders
├── spec/              # Speculative decoding
│   ├── spec_decode.zig #  Orchestrator: draft, verify, accept
│   ├── ddtree.zig     #   DDTree tree construction
│   └── ngram.zig      #   N-gram history-based draft (no draft model)
├── parallel/          # Distributed inference
│   ├── transport.zig  #   TCP, POSIX shm, NCCL transport
│   ├── tp.zig         #   Tensor parallelism utilities
│   └── discovery.zig  #   UDP peer discovery
├── devices/
│   └── discovery.zig  #   GPU device enumeration (--list-devices)
├── kvcache/
│   ├── manager.zig    #   KV cache alloc/free, PagedKvCache, RadixTree
│   ├── block_allocator.zig # Block allocation for paged KV cache
│   ├── tiered.zig     #   Tiered KV cache (VRAM + RAM + SSD)
│   └── prefetch.zig   #   Async block prefetching for tiered cache
└── tokenizer/
    ├── tokenizer.zig  #   Tokenizer interface
    └── bpe.zig        #   BPE + SPM tokenizer

research/kernels/          # Kernel research (not part of main build)
├── reference.py           #   PyTorch reference implementations
├── generate_golden.py     #   Golden test data generator
├── autotune.py            #   Benchmarking and optimization orchestrator
├── registry.py            #   Kernel registry and search spaces
└── golden/                #   Generated .bin files for Zig @embedFile

Docker

Build multi-platform images (x86_64 + aarch64) using docker buildx:

# Build for both platforms (all GPU backends enabled, glibc)
docker buildx build --platform linux/amd64,linux/arm64 -t agave .

# Build and load for current platform only
docker buildx build --load -t agave .

# CPU-only build (static musl binary, smaller image)
docker buildx build --load -t agave \
  --build-arg ENABLE_VULKAN=false \
  --build-arg ENABLE_CUDA=false \
  --build-arg ENABLE_ROCM=false .

# Minimal build: single model + CPU only
docker buildx build --load -t agave \
  --build-arg ENABLE_VULKAN=false \
  --build-arg ENABLE_CUDA=false \
  --build-arg ENABLE_ROCM=false \
  --build-arg ENABLE_QWEN35=false \
  --build-arg ENABLE_GPT_OSS=false \
  --build-arg ENABLE_NEMOTRON_H=false \
  --build-arg ENABLE_NEMOTRON_NANO=false \
  --build-arg ENABLE_GLM4=false .

# Run inference
docker run --rm -v /path/to/models:/models agave /models/model.gguf "Hello"

# Run HTTP server
docker run --rm -p 49453:49453 -v /path/to/models:/models agave /models/model.gguf --serve

# Override Zig version at build time
docker buildx build --build-arg ZIG_VERSION=0.16.0 -t agave .

GPU backends (CUDA, Vulkan, ROCm) load their drivers at runtime via dlopen, which requires glibc. When all three GPU backends are disabled, the build automatically switches to musl for a fully static binary. Zig cross-compiles natively — no QEMU emulation needed during build.

Static musl builds

For environments where a fully static, dependency-free binary is needed (Alpine containers, embedded systems, minimal distros), disable all dlopen backends:

# Static musl binary — CPU backend only
zig build -Dtarget=x86_64-linux-musl \
  -Denable-metal=false -Denable-vulkan=false \
  -Denable-cuda=false -Denable-rocm=false

# Cross-compile static ARM64 binary
zig build -Dtarget=aarch64-linux-musl \
  -Denable-metal=false -Denable-vulkan=false \
  -Denable-cuda=false -Denable-rocm=false

Note: Static musl builds only work with the CPU backend. GPU backends (CUDA, Vulkan, ROCm) depend on dlopen to load vendor drivers at runtime, which requires glibc. Attempting to dlopen a glibc-linked .so from a musl binary will segfault.

Documentation

Tutorial: LLM Inference From Scratch — 18-chapter progressive tutorial + 4 appendixes
Architecture — Project structure, module reference, inference pipeline
Models — Supported models, parameters, per-model details
Benchmarks — Performance comparisons vs llama.cpp
Kernel Status — Per-backend kernel implementation status
Distributed Inference — TP, PP, disaggregated prefill/decode
Contributing — How to add backends, models, quantization
API Reference — HTTP API endpoints, request/response formats
Megakernel System — Composable fused GPU dispatch
CLAUDE.md — Engineering standards for contributors
research/kernels/ — Kernel research tools (benchmarks, golden tests)

License

GNU General Public License v3.0

Name		Name	Last commit message	Last commit date
Latest commit History 591 Commits
.github		.github
docs		docs
notes		notes
research		research
scripts		scripts
src		src
tests		tests
web		web
.dockerignore		.dockerignore
.gitignore		.gitignore
.review-loop.lock		.review-loop.lock
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
build.zig		build.zig
build.zig.zon		build.zig.zon
models		models
root.wasm		root.wasm

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Features

Quick Start

Distributed Inference

Supported Models

Model Download

Calibration

HTTP Server

Interactive REPL

Benchmarks

Multi-Backend (Qwen3.5 0.8B Q8_0)

Distributed Inference (dual NVIDIA GB10 over RoCE RDMA)

Prerequisites

CLI Options

Build Options

Recipes

Project Structure

Docker

Static musl builds

Documentation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Features

Quick Start

Distributed Inference

Supported Models

Model Download

Calibration

HTTP Server

Interactive REPL

Benchmarks

Multi-Backend (Qwen3.5 0.8B Q8_0)

Distributed Inference (dual NVIDIA GB10 over RoCE RDMA)

Prerequisites

CLI Options

Build Options

Recipes

Project Structure

Docker

Static musl builds

Documentation

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages