Trace-driven policy evaluation for KV-cache hierarchy decisions in long-context LLM inference
kv-hierarchy-lab is a benchmark-first research harness for studying KV-cache residency, eviction, quantization, offload, and prefetch tradeoffs under long-context decoding. It is designed for reproducible simulator studies today and for integration with real runtimes later.
Modern long-context inference is often limited not just by compute but by KV-cache residency and movement. Many ideas exist in isolation, and comparing them cleanly is awkward. This repo provides a hardware-aware but hardware-agnostic v0.1 environment for asking those questions without pretending to be a production serving stack.
Long-context serving work often blends together:
- which KV pages stay resident in the fastest tier
- which pages should be demoted or evicted
- when footprint reduction changes the right residency choice
- whether prefetch reduces misses or only adds traffic
- how much a result depends on trace structure rather than policy quality
This repo exists to make those tradeoffs inspectable. The goal is not to "solve KV cache inference" in one repo. The goal is to provide a trace-driven, reproducible policy evaluation framework that systems and inference engineers can extend, question, and calibrate.
This repo is:
- a research harness
- a policy evaluation framework
- benchmark-first and claim-light
- reproducible and trace-driven
- designed to support future runtime integration
This repo is not:
- production-ready serving infrastructure
- a replacement for `vLLM`
- a claim of custom kernel or Hopper-specific runtime work
- a magical CXL-backed 1M-token system
- a substitute for real GPU profiling
v0.1 focuses on simulation, traces, policies, and evaluation. Future revisions may ingest real traces, calibrate against serving runtimes, and model slower tiers such as host RAM or CXL-like backends with more fidelity.
```mermaid
flowchart LR
    W["Synthetic or Imported Workloads"] --> T["Page Access Traces"]
    T --> E["Simulation Engine"]
    Q["Quantization / Footprint Model"] --> E
    P["Policy Interface"] --> E
    E --> M["Metrics Collector"]
    M --> B["Benchmark Runner"]
    B --> A["Artifacts: JSON, CSV, Plots"]
    B --> N["Notebook / Analysis"]
```
The engine models page movement across configurable tiers, applies policy hooks on access and eviction, and records simulated latency and traffic rather than claiming runtime throughput.
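That loop can be sketched roughly as follows. This is an illustrative reconstruction, not the repo's actual API: the names `simulate`, `RecencyPolicy`, and `tier_latency_ms` are placeholders, and the real engine tracks bytes moved and multiple tiers rather than a single fast set.

```python
# Hypothetical sketch of the engine's inner loop. Names and signatures are
# illustrative; the real engine models more tiers, footprints, and traffic.
class RecencyPolicy:
    """Evict the least-recently accessed page (an LRU-style baseline)."""
    def __init__(self):
        self.last_seen = {}
        self.clock = 0

    def on_access(self, page_id):
        self.clock += 1
        self.last_seen[page_id] = self.clock

    def select_eviction_candidate(self, resident):
        return min(resident, key=lambda p: self.last_seen.get(p, -1))


def simulate(trace, fast_capacity, policy, tier_latency_ms):
    """Replay page accesses; on a fast-tier miss, evict via the policy."""
    fast = set()  # pages resident in tier 0
    stats = {"hits": 0, "misses": 0, "latency_ms": 0.0}
    for page_id in trace:
        if page_id in fast:
            stats["hits"] += 1
            stats["latency_ms"] += tier_latency_ms[0]
        else:
            stats["misses"] += 1
            stats["latency_ms"] += tier_latency_ms[1]  # fetch from slower tier
            if len(fast) >= fast_capacity:
                fast.discard(policy.select_eviction_candidate(fast))
            fast.add(page_id)
        policy.on_access(page_id)
    return stats
```

On a toy cyclic trace, `simulate([1, 2, 1, 3, 1, 2], 2, RecencyPolicy(), {0: 0.1, 1: 2.0})` yields 2 hits and 4 misses, which is the kind of per-access accounting the metrics collector aggregates.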
The simulator works on KV pages, not full sequences. A page carries:
- `page_id`
- token span
- layer index
- head-group metadata
- byte footprint
- quantization scheme
- current tier
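A minimal sketch of such a page record, assuming illustrative field names and per-element byte sizes; `KVPage` and `footprint_bytes` are hypothetical stand-ins for whatever the repo actually defines.

```python
from dataclasses import dataclass

# Illustrative page record; field names mirror the list above, but the
# repo's actual class and byte table may differ.
@dataclass
class KVPage:
    page_id: int
    token_span: tuple  # (start, end) token indices covered, end-exclusive
    layer: int
    head_group: int
    quant: str = "fp16"  # quantization scheme tag
    tier: int = 0        # current residency tier

    BYTES_PER_ELEM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5, "int2": 0.25}

    def footprint_bytes(self, heads_per_group: int, head_dim: int) -> int:
        """K and V for every token in the span, at the page's precision."""
        tokens = self.token_span[1] - self.token_span[0]
        elems = 2 * tokens * heads_per_group * head_dim  # 2 = K and V
        return int(elems * self.BYTES_PER_ELEM[self.quant])
```

For example, a 16-token page with 4 heads per group and head dimension 128 occupies 32,768 bytes at `fp16` and 8,192 bytes at `int4`.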
The default hierarchy models configurable cost tiers such as:
- Tier 0: GPU-fast
- Tier 1: GPU-overflow
- Tier 2: Host RAM
- Tier 3: NVMe-like backing store
These tiers are latency and bandwidth models, not claims about exact hardware support.
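A tier config in that spirit might look like the sketch below. The specific latency and bandwidth numbers are made up for illustration, not defaults from the repo and not hardware measurements.

```python
# Hypothetical tier cost table: placeholder numbers for the configurable
# latency/bandwidth models described above, not hardware claims.
TIERS = [
    {"name": "gpu_fast",  "latency_us": 1,   "gbps": 2000},
    {"name": "gpu_over",  "latency_us": 5,   "gbps": 800},
    {"name": "host_ram",  "latency_us": 50,  "gbps": 50},
    {"name": "nvme_like", "latency_us": 500, "gbps": 6},
]

def transfer_ms(src_tier: int, nbytes: int) -> float:
    """Simple latency + size/bandwidth model for pulling a page up a tier."""
    t = TIERS[src_tier]
    return t["latency_us"] / 1000 + nbytes / (t["gbps"] * 1e9) * 1000
```

Under these placeholder numbers, moving a 50 MB batch of pages out of host RAM costs about 1.05 ms, which is the kind of figure the simulator accumulates into "average simulated access latency".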
The harness is trace-driven. A trace is a sequence of logical page accesses with decode-step ordering and optional metadata. v0.1 ships synthetic workloads; future revisions can ingest runtime traces.
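A synthetic trace generator in that shape could look like the following sketch. The function name and record fields are assumptions; the repo's actual workload generators (e.g. the periodic-reuse one) likely differ in detail.

```python
import random

# Illustrative synthetic trace generator; a stand-in for the repo's
# workload generators, not their actual implementation.
def periodic_reuse_trace(num_pages, period, steps, seed=0):
    """Each decode step touches one periodically revisited page plus one
    background page drawn at random, in decode-step order."""
    rng = random.Random(seed)  # seeded for reproducibility
    trace = []
    for step in range(steps):
        trace.append({"step": step, "page_id": step % period})           # periodic revisit
        trace.append({"step": step, "page_id": rng.randrange(num_pages)})  # background access
    return trace
```

Fixing the seed makes the trace reproducible, which is what lets two policy runs be compared on identical access sequences.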
Eviction policies decide which pages stay in fast tiers, which pages are demoted, and, in the regret-aware case, which past eviction choices turned out to be repeatedly costly.
Prefetch policies speculate on future accesses. In v0.1, the predictive baseline is intentionally modest, and the harness reports usefulness rather than just raw prefetch count.
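The "usefulness" accounting can be sketched as a small ledger: a prefetch only counts as useful if the page is demand-accessed before it is evicted again. The class below is an illustrative reconstruction, not the repo's metrics code.

```python
# Sketch of "useful prefetch ratio" accounting (illustrative, not the
# repo's actual metrics collector).
class PrefetchLedger:
    def __init__(self):
        self.pending = set()  # prefetched pages not yet demand-accessed
        self.issued = 0
        self.useful = 0

    def on_prefetch(self, page_id):
        self.issued += 1
        self.pending.add(page_id)

    def on_demand_access(self, page_id):
        if page_id in self.pending:
            self.useful += 1  # the speculation paid off
            self.pending.discard(page_id)

    def on_evict(self, page_id):
        self.pending.discard(page_id)  # evicted unused: wasted traffic

    def useful_ratio(self):
        return self.useful / self.issued if self.issued else 0.0
```

If three pages are prefetched, one is demand-accessed, and one is evicted untouched, the useful ratio is 1/3 even though the raw prefetch count looks healthy.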
Footprint depends on quantization scheme. Smaller pages may improve residency, while optional decode penalties can offset some of that benefit. The simulator models footprint and configurable dequant overhead only; it does not model model quality or accuracy loss.
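As a toy illustration of why footprint can dominate residency: under an assumed per-element byte table and a fixed fast-tier budget, `int4` pages are a quarter the size of `fp16` pages, so four times as many fit. All parameters below are made up for the example.

```python
# Toy footprint arithmetic under assumed parameters; it illustrates why a
# smaller quant scheme can change the right residency choice, nothing more.
BYTES = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5, "int2": 0.25}

def resident_pages(budget_bytes, tokens_per_page, kv_elems_per_token, quant):
    """How many pages of a given precision fit in a fast-tier budget."""
    page_bytes = tokens_per_page * kv_elems_per_token * BYTES[quant]
    return int(budget_bytes // page_bytes)
```

With a 64 MiB budget, 16-token pages, and 8192 KV elements per token, `fp16` fits 256 pages while `int4` fits 1024, before any dequant penalty is charged back.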
- a trace-driven KV hierarchy simulator
- configurable multi-tier latency, capacity, and bandwidth models
- pluggable policy interfaces
- baseline policies: `lru`, `windowed_recency`, `heavy_hitter`, `cost_aware`, `predictive`
- a signature `regret_aware` eviction policy
- quantization-aware page footprints for `fp16`, `fp8`, `int4`, and `int2`
- synthetic workloads including retrieval bursts, periodic reuse, mixed locality, and adversarial bursts
- benchmark runner with committed JSON, CSV, and plot artifacts
- tests for engine behavior, prefetch accounting, quantization, workloads, and policy correctness
These are simulator findings on synthetic traces, not runtime performance claims.
- On `small_fp16_prefetch` + `rag_burst`, `regret_aware` reduced misses from 212 to 152 versus `lru` and cut bytes moved by 26.3%.
- On `constrained_fp16_prefetch` + `adversarial_burst`, `regret_aware` delivered the lowest latency in the sweep: 3.365 ms average simulated access latency versus 3.664 ms for `lru`, with 9.5% fewer bytes moved.
- On `small_fp16_prefetch` + `prefetch_friendly`, the `predictive` policy cut misses by 59.6% versus `lru` but still had 29.2% higher latency than `cost_aware`. Lower miss count did not automatically mean a better latency outcome once speculative traffic was included.
- Under the same small no-prefetch tier budget, switching from `fp16` to `int4` on `rag_burst` raised mean overall hit rate from 0.459 to 0.771 and reduced bytes moved by 93.9%. Footprint alone can dominate policy differences under pressure.
A dedicated ablation study (`scripts/run_regret_ablation.py`) sweeps `regret_horizon` and `regret_weight` across 3 independent seeds to isolate the regret signal's contribution. All results below are simulator metrics on synthetic traces with mean ± std across seeds. Prefetch is disabled to isolate eviction behavior.
Key findings:
- Adversarial bursts, constrained capacity: LRU averaged 6.15 ± 0.10 ms latency with a 0% fast-tier hit rate. The best regret-aware config (`w=1.0, h=64`) achieved 5.94 ± 0.14 ms latency, a modest 3.4% improvement, while `w=12.0, h=64` cut bytes moved by 7.1%. Under medium capacity, regret-aware (`w=12.0, h=24`) reduced latency by 6.6% versus LRU while cutting bytes moved by 9.1%.
- Horizon sensitivity: under adversarial pressure, longer horizons (64 steps) consistently delivered lower latency than short horizons (8 steps) at the same weight, suggesting that durable regret memory helps when the trace actively revisits recently evicted pages.
- Weight sensitivity: higher regret weights (6.0, 12.0) improved hit rates on `chat_continuation` under constrained capacity (`w=12.0` achieved a 0.61 hit rate versus LRU's 0.66 and `w=1.0`'s 0.53), but the relationship is non-monotonic and workload-dependent.
- Failure / neutral cases: on `periodic_reuse` under both capacity levels, all regret-aware variants performed identically to LRU (0% fast-tier hit rate, 640 misses). The periodic structure completely overwhelms finite-horizon regret tracking. Only `cost_aware` (which penalizes large pages) salvaged any hits there.
Ablation artifacts including raw results, aggregated summaries with variance, and run metadata are committed under `artifacts/regret_ablation`.
The default sweep reports:
- overall hit rate by policy
- hit rate by serving tier
- miss count
- average simulated access latency
- bytes moved
- demand versus prefetch traffic
- useful prefetch ratio
- eviction count
- peak footprint per tier
Committed simulator artifacts live under `artifacts/default_sweep`.
Simulator outputs generated from the committed default sweep:
```bash
git clone https://github.com/manishklach/kv-hierarchy-lab.git
cd kv-hierarchy-lab
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
python -m pytest
```

Minimal smoke run:

```bash
python examples/basic_simulation.py
```

One-command path with make:

```bash
make test
make bench
make plots
```

Equivalent direct commands:

```bash
python -m pytest
python scripts/run_benchmarks.py --output-dir artifacts/default_sweep
python scripts/plot_results.py --results artifacts/default_sweep/results.json --out-dir artifacts/default_sweep/plots
```

The committed sweep artifacts were generated with:

```bash
python scripts/run_benchmarks.py --output-dir artifacts/default_sweep
python scripts/plot_results.py --results artifacts/default_sweep/results.json --out-dir artifacts/default_sweep/plots
```

```
kv-hierarchy-lab/
├─ kv_hierarchy_lab/
│  ├─ simulator/    # Engine, tier model, traces, metrics
│  ├─ policies/     # Eviction and prefetch policies
│  ├─ quant/        # Footprint and decode-penalty models
│  ├─ workloads/    # Synthetic workload generators
│  ├─ bench/        # Scenarios, runner, report helpers
│  └─ utils/        # Plotting and I/O helpers
├─ scripts/         # Benchmark, trace, and plotting entrypoints
├─ artifacts/       # Committed simulator outputs for inspection
├─ examples/        # Small runnable examples
├─ notebooks/       # Analysis notebook seed
└─ tests/           # Unit tests
```
- Subclass `PolicyBase` or match the `BasePolicy` protocol.
- Implement `select_eviction_candidate`.
- Optionally implement `maybe_prefetch` and `on_promote`.
- Keep internal state in `on_access`, `on_insert`, and `on_evict`.
- Export the new policy from `kv_hierarchy_lab/policies/__init__.py`.
- Compare it with `python examples/compare_policies.py` or the default sweep.
The signature `regret_aware` policy is a good reference because it adds a distinct idea without introducing runtime-specific complexity.
- When does a regret signal beat plain recency or frequency under realistic trace churn?
- How should quantization and residency be co-optimized per layer or head-group?
- What overlap model is needed before transfer cost becomes a decent proxy for wall time?
- Which workloads most consistently punish speculative prefetch despite high apparent reuse?
- How should offline oracle analyses be used to bound achievable policy gains?
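On the last question, the classic starting point is Belady's MIN: with the whole trace known offline, evict the resident page whose next use is farthest in the future. The sketch below is a generic textbook implementation, not code from this repo, and it bounds hit rate only, ignoring page sizes and transfer costs.

```python
import bisect
from collections import defaultdict

# Offline Belady/MIN oracle sketch: an upper bound on eviction-policy hit
# counts for a given trace and capacity (hit rate only; no byte costs).
def belady_hits(trace, capacity):
    future = defaultdict(list)  # page -> sorted positions of its accesses
    for i, p in enumerate(trace):
        future[p].append(i)
    resident, hits = set(), 0
    for i, p in enumerate(trace):
        if p in resident:
            hits += 1
            continue
        if len(resident) >= capacity:
            def next_use(q):
                j = bisect.bisect_right(future[q], i)
                return future[q][j] if j < len(future[q]) else float("inf")
            resident.discard(max(resident, key=next_use))  # farthest next use
        resident.add(p)
    return hits
```

On the cyclic trace `[1, 2, 3, 1, 2, 3]` with capacity 2, the oracle achieves 2 hits while plain LRU achieves 0, which is exactly the kind of gap an oracle analysis would surface.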
- Synthetic traces are not substitutes for full-model correctness.
- Simulated latency is not a substitute for real GPU profiling.
- Tier names are abstractions over cost models, not direct hardware contracts.
- The committed figures are simulator artifacts, not runtime benchmarks.
- Future runtime integrations may connect this harness to systems such as `vLLM`, but that is future work, not a current claim.
| Milestone | Focus | Concrete next step |
|---|---|---|
| v0.1 | Trace-driven simulator, baselines, artifacts, reproducibility | Landed in this repo |
| v0.2 | Runtime trace ingestion and calibrated event model | Import serving traces and replay them through the same benchmark harness |
| v0.3 | Better overlap and concurrency modeling | Add explicit transfer concurrency and compute-overlap assumptions |
| v0.4 | Runtime integration experiments | Attach the harness to a real serving runtime for trace capture and replay |
This repository is released under the MIT License. Contributions are welcome, especially around trace ingestion, better workload suites, calibrated tier models, and policy ablations. Please keep changes empirical, reproducible, and careful about what is and is not being claimed.