kv-hierarchy-lab

Trace-driven policy evaluation for KV-cache hierarchy decisions in long-context LLM inference

kv-hierarchy-lab is a benchmark-first research harness for studying KV-cache residency, eviction, quantization, offload, and prefetch tradeoffs under long-context decoding. It is designed for reproducible simulator studies today and for integration with real runtimes later.

Modern long-context inference is often limited not just by compute but by KV-cache residency and movement. Many ideas exist in isolation, yet comparing them cleanly is awkward. This repo provides a hardware-informed but hardware-agnostic v0.1 environment for asking those questions without pretending to be a production serving stack.

Why This Repo Exists

Long-context serving work often blends together:

  • which KV pages stay resident in the fastest tier
  • which pages should be demoted or evicted
  • when footprint reduction changes the right residency choice
  • whether prefetch reduces misses or only adds traffic
  • how much a result depends on trace structure rather than policy quality

This repo exists to make those tradeoffs inspectable. The goal is not to "solve KV cache inference" in one repo. The goal is to provide a trace-driven, reproducible policy evaluation framework that systems and inference engineers can extend, question, and calibrate.

What It Is / Is Not

This repo is:

  • a research harness
  • a policy evaluation framework
  • benchmark-first and claim-light
  • reproducible and trace-driven
  • designed to support future runtime integration

This repo is not:

  • production-ready serving infrastructure
  • a replacement for vLLM
  • a claim of custom kernel or Hopper-specific runtime work
  • a magical CXL-backed 1M-token system
  • a substitute for real GPU profiling

v0.1 focuses on simulation, traces, policies, and evaluation. Future revisions may ingest real traces, calibrate against serving runtimes, and model slower tiers such as host RAM or CXL-like backends with more fidelity.

Architecture

```mermaid
flowchart LR
    W["Synthetic or Imported Workloads"] --> T["Page Access Traces"]
    T --> E["Simulation Engine"]
    Q["Quantization / Footprint Model"] --> E
    P["Policy Interface"] --> E
    E --> M["Metrics Collector"]
    M --> B["Benchmark Runner"]
    B --> A["Artifacts: JSON, CSV, Plots"]
    B --> N["Notebook / Analysis"]
```

The engine models page movement across configurable tiers, applies policy hooks on access and eviction, and records simulated latency and traffic rather than claiming runtime throughput.

Core Concepts

KV Pages

The simulator works on KV pages, not full sequences. A page carries:

  • page_id
  • token span
  • layer index
  • head-group metadata
  • byte footprint
  • quantization scheme
  • current tier
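As a concrete sketch, the fields above could be carried by a small record like the following. The field names here are illustrative, not the harness's exact schema:

```python
from dataclasses import dataclass

@dataclass
class KVPage:
    """Illustrative per-page record matching the fields listed above."""
    page_id: int
    token_start: int       # token span: [token_start, token_end)
    token_end: int
    layer: int             # layer index
    head_group: int        # head-group metadata
    footprint_bytes: int   # byte footprint at the current quantization
    quant_scheme: str      # e.g. "fp16", "fp8", "int4", "int2"
    tier: int              # current residency tier (0 = fastest)

# Example: one fp16 page covering 128 tokens of layer 3, resident in Tier 0.
page = KVPage(page_id=0, token_start=0, token_end=128, layer=3,
              head_group=1, footprint_bytes=524288,
              quant_scheme="fp16", tier=0)
```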

Memory Tiers

The default hierarchy models configurable cost tiers such as:

  • Tier 0: GPU-fast
  • Tier 1: GPU-overflow
  • Tier 2: Host RAM
  • Tier 3: NVMe-like backing store

These tiers are latency and bandwidth models, not claims about exact hardware support.
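A minimal cost-model sketch of such a tier table, with made-up latency and bandwidth numbers (the real values are configurable in the simulator and carry no hardware claims):

```python
# Hypothetical tier parameters; the simulator's tier model makes these configurable.
TIERS = {
    0: {"name": "gpu-fast",     "latency_us": 1,   "gb_per_s": 2000, "capacity_gb": 8},
    1: {"name": "gpu-overflow", "latency_us": 5,   "gb_per_s": 800,  "capacity_gb": 16},
    2: {"name": "host-ram",     "latency_us": 50,  "gb_per_s": 50,   "capacity_gb": 128},
    3: {"name": "nvme-like",    "latency_us": 500, "gb_per_s": 5,    "capacity_gb": 1024},
}

def access_cost_us(tier: int, nbytes: int) -> float:
    """Simple latency-plus-transfer cost for reading nbytes from a tier."""
    t = TIERS[tier]
    return t["latency_us"] + nbytes / (t["gb_per_s"] * 1e9) * 1e6
```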

Access Traces

The harness is trace-driven. A trace is a sequence of logical page accesses with decode-step ordering and optional metadata. v0.1 ships synthetic workloads; future revisions can ingest runtime traces.
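A toy generator in this spirit, assuming a trace is simply an ordered list of step/page records (the actual workload generators live in kv_hierarchy_lab/workloads/ and are richer than this):

```python
import random

def periodic_reuse_trace(num_pages=16, period=8, steps=64, seed=0):
    """Hypothetical synthetic trace: each decode step touches a page drawn
    from a repeating window, mimicking periodic reuse."""
    rng = random.Random(seed)  # seeded for reproducible replay
    trace = []
    for step in range(steps):
        base = (step // period) * period % num_pages
        page = (base + rng.randrange(period)) % num_pages
        trace.append({"step": step, "page_id": page})
    return trace
```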

Eviction Policies

Policies decide which pages stay in fast tiers, which pages are demoted, and which choices are repeatedly costly.

Prefetch Policies

Prefetch policies speculate on future accesses. In v0.1, the predictive baseline is intentionally modest, and the harness reports usefulness rather than just raw prefetch count.

Quantization-Aware Residency

Footprint depends on quantization scheme. Smaller pages may improve residency, while optional decode penalties can offset some of that benefit. The simulator models footprint and configurable dequant overhead only; it does not model model quality or accuracy loss.
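The footprint side is simple arithmetic. A sketch, ignoring the small scale/zero-point overhead that low-bit schemes add in practice:

```python
BITS = {"fp16": 16, "fp8": 8, "int4": 4, "int2": 2}

def page_footprint_bytes(tokens: int, kv_heads: int, head_dim: int, scheme: str) -> int:
    """Bytes for one layer's K+V page (factor of 2 covers both K and V)."""
    return tokens * kv_heads * head_dim * 2 * BITS[scheme] // 8

# 128 tokens x 8 KV heads x head_dim 128: fp16 -> 512 KiB, int4 -> 128 KiB.
fp16_bytes = page_footprint_bytes(128, 8, 128, "fp16")
int4_bytes = page_footprint_bytes(128, 8, 128, "int4")
```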

What v0.1 Includes

  • a trace-driven KV hierarchy simulator
  • configurable multi-tier latency, capacity, and bandwidth models
  • pluggable policy interfaces
  • baseline policies: lru, windowed_recency, heavy_hitter, cost_aware, predictive
  • a signature regret_aware eviction policy
  • quantization-aware page footprints for fp16, fp8, int4, and int2
  • synthetic workloads including retrieval bursts, periodic reuse, mixed locality, and adversarial bursts
  • benchmark runner with committed JSON, CSV, and plot artifacts
  • tests for engine behavior, prefetch accounting, quantization, workloads, and policy correctness

Initial Findings

These are simulator findings on synthetic traces, not runtime performance claims.

  • On small_fp16_prefetch + rag_burst, regret_aware reduced misses from 212 to 152 versus lru and cut bytes moved by 26.3%.

  • On constrained_fp16_prefetch + adversarial_burst, regret_aware delivered the lowest latency in the sweep: 3.365 ms average simulated access latency versus 3.664 ms for lru, with 9.5% fewer bytes moved.
  • On small_fp16_prefetch + prefetch_friendly, the predictive policy cut misses by 59.6% versus lru, but still had 29.2% higher latency than cost_aware. Lower miss count did not automatically mean a better latency outcome once speculative traffic was included.
  • Under the same small no-prefetch tier budget, switching from fp16 to int4 on rag_burst raised mean overall hit rate from 0.459 to 0.771 and reduced bytes moved by 93.9%. Footprint alone can dominate policy differences under pressure.

Regret-Aware Ablation

A dedicated ablation study (scripts/run_regret_ablation.py) sweeps regret_horizon and regret_weight across 3 independent seeds to isolate the regret signal's contribution. All results below are simulator metrics on synthetic traces with mean ± std across seeds. Prefetch is disabled to isolate eviction behavior.

Key findings:

  • Adversarial bursts, constrained capacity: LRU averaged 6.15 ± 0.10 ms latency with a 0% fast-tier hit rate. The best regret-aware config (w=1.0, h=64) achieved 5.94 ± 0.14 ms latency — a modest 3.4% improvement — while w=12.0, h=64 cut bytes moved by 7.1%. Under medium capacity, regret-aware (w=12.0, h=24) reduced latency by 6.6% versus LRU while cutting bytes moved by 9.1%.
  • Horizon sensitivity: Under adversarial pressure, longer horizons (64 steps) consistently delivered lower latency than short horizons (8 steps) at the same weight, suggesting that durable regret memory helps when the trace actively revisits recently-evicted pages.
  • Weight sensitivity: Higher regret weights (6.0, 12.0) improved hit rates on chat_continuation under constrained capacity relative to low weights (w=12.0 reached a 0.61 hit rate versus 0.53 for w=1.0, though both trailed LRU's 0.66), but the relationship is non-monotonic and workload-dependent.
  • Failure / neutral cases: On periodic_reuse under both capacity levels, all regret-aware variants performed identically to LRU (0% fast-tier hit rate, 640 misses). The periodic structure completely overwhelms finite-horizon regret tracking. Only cost_aware (which penalises large pages) salvaged any hits there.

Figures: latency vs regret horizon; hit rate vs regret weight; latency vs bytes-moved tradeoff; failure case (chat continuation).

Ablation artifacts including raw results, aggregated summaries with variance, and run metadata are committed under artifacts/regret_ablation.

Benchmarks And Metrics

The default sweep reports:

  • overall hit rate by policy
  • hit rate by serving tier
  • miss count
  • average simulated access latency
  • bytes moved
  • demand versus prefetch traffic
  • useful prefetch ratio
  • eviction count
  • peak footprint per tier
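The useful prefetch ratio, for instance, is the fraction of prefetched pages that were later demanded; a minimal sketch of that accounting:

```python
def useful_prefetch_ratio(prefetched: set, demanded: set) -> float:
    """Fraction of prefetched pages that saw a demand access.
    Returns 0.0 when nothing was prefetched (avoids division by zero)."""
    if not prefetched:
        return 0.0
    return len(prefetched & demanded) / len(prefetched)
```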

Committed simulator artifacts live under artifacts/default_sweep.

Default Sweep Figures

Simulator outputs generated from the committed default sweep:

Figures: overall hit rate by policy; average latency by policy; bytes moved by policy; useful prefetch ratio; latency versus bytes moved.

Quickstart

git clone https://github.com/manishklach/kv-hierarchy-lab.git
cd kv-hierarchy-lab
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
python -m pytest

Minimal smoke run:

python examples/basic_simulation.py

Reproducing The Default Sweep

One-command path with make:

make test
make bench
make plots

Equivalent direct commands:

python -m pytest
python scripts/run_benchmarks.py --output-dir artifacts/default_sweep
python scripts/plot_results.py --results artifacts/default_sweep/results.json --out-dir artifacts/default_sweep/plots

The committed sweep artifacts were generated with exactly those two benchmark and plotting commands.

Repository Layout

kv-hierarchy-lab/
├─ kv_hierarchy_lab/
│  ├─ simulator/     # Engine, tier model, traces, metrics
│  ├─ policies/      # Eviction and prefetch policies
│  ├─ quant/         # Footprint and decode-penalty models
│  ├─ workloads/     # Synthetic workload generators
│  ├─ bench/         # Scenarios, runner, report helpers
│  └─ utils/         # Plotting and I/O helpers
├─ scripts/          # Benchmark, trace, and plotting entrypoints
├─ artifacts/        # Committed simulator outputs for inspection
├─ examples/         # Small runnable examples
├─ notebooks/        # Analysis notebook seed
└─ tests/            # Unit tests

Adding A New Policy

  1. Subclass PolicyBase or match the BasePolicy protocol.
  2. Implement select_eviction_candidate.
  3. Optionally implement maybe_prefetch and on_promote.
  4. Keep internal state in on_access, on_insert, and on_evict.
  5. Export the new policy from kv_hierarchy_lab/policies/__init__.py.
  6. Compare it with python examples/compare_policies.py or the default sweep.

The signature regret_aware policy is a good reference because it adds a distinct idea without introducing runtime-specific complexity.
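For orientation, here is a minimal recency (LRU-style) policy skeleton following the hook names listed above. The hook signatures are assumptions for illustration, not the harness's exact API; check PolicyBase for the real protocol:

```python
from collections import OrderedDict

class RecencyPolicy:  # would subclass PolicyBase in the real harness
    """Minimal LRU-style eviction policy using the hooks named above."""

    def __init__(self):
        self._order = OrderedDict()  # page_id -> None, oldest first

    def on_access(self, page_id):
        # Move the page to the most-recently-used position.
        self._order.pop(page_id, None)
        self._order[page_id] = None

    def on_insert(self, page_id):
        self._order[page_id] = None

    def on_evict(self, page_id):
        self._order.pop(page_id, None)

    def select_eviction_candidate(self, resident):
        # Evict the least-recently-used page among the resident set.
        for page_id in self._order:
            if page_id in resident:
                return page_id
        return next(iter(resident))  # fallback if tracking is empty
```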

Open Research Questions

  • When does a regret signal beat plain recency or frequency under realistic trace churn?
  • How should quantization and residency be co-optimized per layer or head-group?
  • What overlap model is needed before transfer cost becomes a decent proxy for wall time?
  • Which workloads most consistently punish speculative prefetch despite high apparent reuse?
  • How should offline oracle analyses be used to bound achievable policy gains?

Caveats / Honesty Notes

  • Synthetic traces are not substitutes for full-model correctness.
  • Simulated latency is not a substitute for real GPU profiling.
  • Tier names are abstractions over cost models, not direct hardware contracts.
  • The committed figures are simulator artifacts, not runtime benchmarks.
  • Future runtime integrations may connect this harness to systems such as vLLM, but that is future work, not a current claim.

Roadmap

  • v0.1 (trace-driven simulator, baselines, artifacts, reproducibility): landed in this repo
  • v0.2 (runtime trace ingestion and calibrated event model): import serving traces and replay them through the same benchmark harness
  • v0.3 (better overlap and concurrency modeling): add explicit transfer concurrency and compute-overlap assumptions
  • v0.4 (runtime integration experiments): attach the harness to a real serving runtime for trace capture and replay

License / Contribution Note

This repository is released under the MIT License. Contributions are welcome, especially around trace ingestion, better workload suites, calibrated tier models, and policy ablations. Please keep changes empirical, reproducible, and careful about what is and is not being claimed.