kv-hierarchy-lab

Trace-driven policy evaluation for KV-cache hierarchy decisions in long-context LLM inference

kv-hierarchy-lab is a benchmark-first research harness for studying KV-cache residency, eviction, quantization, offload, and prefetch tradeoffs under long-context decoding. It is designed for reproducible simulator studies today and for integration with real runtimes later.

Modern long-context inference is often limited not just by compute but by KV-cache residency and movement. Many ideas exist in isolation, yet comparing them cleanly is awkward. This repo provides a hardware-informed but hardware-agnostic v0.1 environment for asking those questions without pretending to be a production serving stack.

Why This Repo Exists

Long-context serving work often blends together:

  • which KV pages stay resident in the fastest tier
  • which pages should be demoted or evicted
  • when footprint reduction changes the right residency choice
  • whether prefetch reduces misses or only adds traffic
  • how much a result depends on trace structure rather than policy quality

This repo exists to make those tradeoffs inspectable. The goal is not to "solve KV cache inference" in one repo. The goal is to provide a trace-driven, reproducible policy evaluation framework that systems and inference engineers can extend, question, and calibrate.

What It Is / Is Not

This repo is:

  • a research harness
  • a policy evaluation framework
  • benchmark-first and claim-light
  • reproducible and trace-driven
  • designed to support future runtime integration

This repo is not:

  • production-ready serving infrastructure
  • a replacement for vLLM
  • a claim of custom kernel or Hopper-specific runtime work
  • a magical CXL-backed 1M-token system
  • a substitute for real GPU profiling

v0.1 focuses on simulation, traces, policies, and evaluation. Future revisions may ingest real traces, calibrate against serving runtimes, and model slower tiers such as host RAM or CXL-like backends with more fidelity.

Architecture

```mermaid
flowchart LR
    W["Synthetic or Imported Workloads"] --> T["Page Access Traces"]
    T --> E["Simulation Engine"]
    Q["Quantization / Footprint Model"] --> E
    P["Policy Interface"] --> E
    E --> M["Metrics Collector"]
    M --> B["Benchmark Runner"]
    B --> A["Artifacts: JSON, CSV, Plots"]
    B --> N["Notebook / Analysis"]
```

The engine models page movement across configurable tiers, applies policy hooks on access and eviction, and records simulated latency and traffic rather than claiming runtime throughput.

Core Concepts

KV Pages

The simulator works on KV pages, not full sequences. A page carries:

  • page_id
  • token span
  • layer index
  • head-group metadata
  • byte footprint
  • quantization scheme
  • current tier
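As a concrete sketch, the fields above could be carried by a small record like the following. The field names here are illustrative, not the harness's exact schema:

```python
from dataclasses import dataclass

@dataclass
class KVPage:
    """Illustrative per-page record matching the fields listed above."""
    page_id: int
    token_start: int       # token span: [token_start, token_end)
    token_end: int
    layer: int             # layer index
    head_group: int        # head-group metadata
    footprint_bytes: int   # byte footprint at the current quantization
    quant_scheme: str      # e.g. "fp16", "fp8", "int4", "int2"
    tier: int              # current residency tier (0 = fastest)

# Example: one fp16 page covering 128 tokens of layer 3, resident in Tier 0.
page = KVPage(page_id=0, token_start=0, token_end=128, layer=3,
              head_group=1, footprint_bytes=524288,
              quant_scheme="fp16", tier=0)
```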

Memory Tiers

The default hierarchy models configurable cost tiers such as:

  • Tier 0: GPU-fast
  • Tier 1: GPU-overflow
  • Tier 2: Host RAM
  • Tier 3: NVMe-like backing store

These tiers are latency and bandwidth models, not claims about exact hardware support.
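A minimal cost-model sketch of such a tier table, with made-up latency and bandwidth numbers (the real values are configurable in the simulator and carry no hardware claims):

```python
# Hypothetical tier parameters; the simulator's tier model makes these configurable.
TIERS = {
    0: {"name": "gpu-fast",     "latency_us": 1,   "gb_per_s": 2000, "capacity_gb": 8},
    1: {"name": "gpu-overflow", "latency_us": 5,   "gb_per_s": 800,  "capacity_gb": 16},
    2: {"name": "host-ram",     "latency_us": 50,  "gb_per_s": 50,   "capacity_gb": 128},
    3: {"name": "nvme-like",    "latency_us": 500, "gb_per_s": 5,    "capacity_gb": 1024},
}

def access_cost_us(tier: int, nbytes: int) -> float:
    """Simple latency-plus-transfer cost for reading nbytes from a tier."""
    t = TIERS[tier]
    return t["latency_us"] + nbytes / (t["gb_per_s"] * 1e9) * 1e6
```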

Access Traces

The harness is trace-driven. A trace is a sequence of logical page accesses with decode-step ordering and optional metadata. v0.1 ships synthetic workloads; future revisions can ingest runtime traces.
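A toy generator in this spirit, assuming a trace is simply an ordered list of step/page records (the actual workload generators live in kv_hierarchy_lab/workloads/ and are richer than this):

```python
import random

def periodic_reuse_trace(num_pages=16, period=8, steps=64, seed=0):
    """Hypothetical synthetic trace: each decode step touches a page drawn
    from a repeating window, mimicking periodic reuse."""
    rng = random.Random(seed)  # seeded for reproducible replay
    trace = []
    for step in range(steps):
        base = (step // period) * period % num_pages
        page = (base + rng.randrange(period)) % num_pages
        trace.append({"step": step, "page_id": page})
    return trace
```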

Eviction Policies

Policies decide which pages stay in fast tiers, which pages are demoted, and which choices are repeatedly costly.

Prefetch Policies

Prefetch policies speculate on future accesses. In v0.1, the predictive baseline is intentionally modest, and the harness reports usefulness rather than just raw prefetch count.

Quantization-Aware Residency

Footprint depends on quantization scheme. Smaller pages may improve residency, while optional decode penalties can offset some of that benefit. The simulator models footprint and configurable dequant overhead only; it does not model model quality or accuracy loss.
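The footprint side is simple arithmetic. A sketch, ignoring the small scale/zero-point overhead that low-bit schemes add in practice:

```python
BITS = {"fp16": 16, "fp8": 8, "int4": 4, "int2": 2}

def page_footprint_bytes(tokens: int, kv_heads: int, head_dim: int, scheme: str) -> int:
    """Bytes for one layer's K+V page (factor of 2 covers both K and V)."""
    return tokens * kv_heads * head_dim * 2 * BITS[scheme] // 8

# 128 tokens x 8 KV heads x head_dim 128: fp16 -> 512 KiB, int4 -> 128 KiB.
fp16_bytes = page_footprint_bytes(128, 8, 128, "fp16")
int4_bytes = page_footprint_bytes(128, 8, 128, "int4")
```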

What v0.1 Includes

  • a trace-driven KV hierarchy simulator
  • configurable multi-tier latency, capacity, and bandwidth models
  • pluggable policy interfaces
  • baseline policies: lru, windowed_recency, heavy_hitter, cost_aware, predictive
  • a signature regret_aware eviction policy
  • quantization-aware page footprints for fp16, fp8, int4, and int2
  • synthetic workloads including retrieval bursts, periodic reuse, mixed locality, and adversarial bursts
  • benchmark runner with committed JSON, CSV, and plot artifacts
  • tests for engine behavior, prefetch accounting, quantization, workloads, and policy correctness

Initial Findings

These are simulator findings on synthetic traces, not runtime performance claims.

  • On small_fp16_prefetch + rag_burst, regret_aware reduced misses from 212 to 152 versus lru and cut bytes moved by 26.3%.

  • On constrained_fp16_prefetch + adversarial_burst, regret_aware delivered the lowest latency in the sweep: 3.365 ms average simulated access latency versus 3.664 ms for lru, with 9.5% fewer bytes moved.
  • On small_fp16_prefetch + prefetch_friendly, the predictive policy cut misses by 59.6% versus lru, but still had 29.2% higher latency than cost_aware. Lower miss count did not automatically mean a better latency outcome once speculative traffic was included.
  • Under the same small no-prefetch tier budget, switching from fp16 to int4 on rag_burst raised mean overall hit rate from 0.459 to 0.771 and reduced bytes moved by 93.9%. Footprint alone can dominate policy differences under pressure.

Regret-Aware Ablation

A dedicated ablation study (scripts/run_regret_ablation.py) sweeps regret_horizon and regret_weight across 3 independent seeds to isolate the regret signal's contribution. All results below are simulator metrics on synthetic traces with mean ± std across seeds. Prefetch is disabled to isolate eviction behavior.

Key findings:

  • Adversarial bursts, constrained capacity: LRU averaged 6.15 ± 0.10 ms latency with a 0% fast-tier hit rate. The best regret-aware config (w=1.0, h=64) achieved 5.94 ± 0.14 ms latency — a modest 3.4% improvement — while w=12.0, h=64 cut bytes moved by 7.1%. Under medium capacity, regret-aware (w=12.0, h=24) reduced latency by 6.6% versus LRU while cutting bytes moved by 9.1%.
  • Horizon sensitivity: Under adversarial pressure, longer horizons (64 steps) consistently delivered lower latency than short horizons (8 steps) at the same weight, suggesting that durable regret memory helps when the trace actively revisits recently-evicted pages.
  • Weight sensitivity: Higher regret weights (6.0, 12.0) improved hit rates on chat_continuation under constrained capacity relative to low weights (w=12.0 reached a 0.61 hit rate versus 0.53 for w=1.0, though both trailed LRU's 0.66), but the relationship is non-monotonic and workload-dependent.
  • Failure / neutral cases: On periodic_reuse under both capacity levels, all regret-aware variants performed identically to LRU (0% fast-tier hit rate, 640 misses). The periodic structure completely overwhelms finite-horizon regret tracking. Only cost_aware (which penalises large pages) salvaged any hits there.

Figures: latency vs regret horizon; hit rate vs regret weight; latency vs bytes-moved tradeoff; failure case (chat continuation).

Ablation artifacts including raw results, aggregated summaries with variance, and run metadata are committed under artifacts/regret_ablation.

Benchmarks And Metrics

The default sweep reports:

  • overall hit rate by policy
  • hit rate by serving tier
  • miss count
  • average simulated access latency
  • bytes moved
  • demand versus prefetch traffic
  • useful prefetch ratio
  • eviction count
  • peak footprint per tier
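The useful prefetch ratio, for instance, is the fraction of prefetched pages that were later demanded; a minimal sketch of that accounting:

```python
def useful_prefetch_ratio(prefetched: set, demanded: set) -> float:
    """Fraction of prefetched pages that saw a demand access.
    Returns 0.0 when nothing was prefetched (avoids division by zero)."""
    if not prefetched:
        return 0.0
    return len(prefetched & demanded) / len(prefetched)
```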

Committed simulator artifacts live under artifacts/default_sweep.

Default Sweep Figures

Simulator outputs generated from the committed default sweep:

Figures: overall hit rate by policy; average latency by policy; bytes moved by policy; useful prefetch ratio; latency versus bytes moved.

Quickstart

git clone https://github.com/manishklach/kv-hierarchy-lab.git
cd kv-hierarchy-lab
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
python -m pytest

Minimal smoke run:

python examples/basic_simulation.py

Reproducing The Default Sweep

One-command path with make:

make test
make bench
make plots

Equivalent direct commands:

python -m pytest
python scripts/run_benchmarks.py --output-dir artifacts/default_sweep
python scripts/plot_results.py --results artifacts/default_sweep/results.json --out-dir artifacts/default_sweep/plots

The committed sweep artifacts were generated with exactly those two benchmark and plotting commands.

Repository Layout

kv-hierarchy-lab/
├─ kv_hierarchy_lab/
│  ├─ simulator/     # Engine, tier model, traces, metrics
│  ├─ policies/      # Eviction and prefetch policies
│  ├─ quant/         # Footprint and decode-penalty models
│  ├─ workloads/     # Synthetic workload generators
│  ├─ bench/         # Scenarios, runner, report helpers
│  └─ utils/         # Plotting and I/O helpers
├─ scripts/          # Benchmark, trace, and plotting entrypoints
├─ artifacts/        # Committed simulator outputs for inspection
├─ examples/         # Small runnable examples
├─ notebooks/        # Analysis notebook seed
└─ tests/            # Unit tests

Adding A New Policy

  1. Subclass PolicyBase or match the BasePolicy protocol.
  2. Implement select_eviction_candidate.
  3. Optionally implement maybe_prefetch and on_promote.
  4. Keep internal state in on_access, on_insert, and on_evict.
  5. Export the new policy from kv_hierarchy_lab/policies/__init__.py.
  6. Compare it with python examples/compare_policies.py or the default sweep.

The signature regret_aware policy is a good reference because it adds a distinct idea without introducing runtime-specific complexity.
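For orientation, here is a minimal recency (LRU-style) policy skeleton following the hook names listed above. The hook signatures are assumptions for illustration, not the harness's exact API; check PolicyBase for the real protocol:

```python
from collections import OrderedDict

class RecencyPolicy:  # would subclass PolicyBase in the real harness
    """Minimal LRU-style eviction policy using the hooks named above."""

    def __init__(self):
        self._order = OrderedDict()  # page_id -> None, oldest first

    def on_access(self, page_id):
        # Move the page to the most-recently-used position.
        self._order.pop(page_id, None)
        self._order[page_id] = None

    def on_insert(self, page_id):
        self._order[page_id] = None

    def on_evict(self, page_id):
        self._order.pop(page_id, None)

    def select_eviction_candidate(self, resident):
        # Evict the least-recently-used page among the resident set.
        for page_id in self._order:
            if page_id in resident:
                return page_id
        return next(iter(resident))  # fallback if tracking is empty
```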

Open Research Questions

  • When does a regret signal beat plain recency or frequency under realistic trace churn?
  • How should quantization and residency be co-optimized per layer or head-group?
  • What overlap model is needed before transfer cost becomes a decent proxy for wall time?
  • Which workloads most consistently punish speculative prefetch despite high apparent reuse?
  • How should offline oracle analyses be used to bound achievable policy gains?

Caveats / Honesty Notes

  • Synthetic traces are not substitutes for full-model correctness.
  • Simulated latency is not a substitute for real GPU profiling.
  • Tier names are abstractions over cost models, not direct hardware contracts.
  • The committed figures are simulator artifacts, not runtime benchmarks.
  • Future runtime integrations may connect this harness to systems such as vLLM, but that is future work, not a current claim.

Roadmap

  • v0.1 (trace-driven simulator, baselines, artifacts, reproducibility): landed in this repo
  • v0.2 (runtime trace ingestion and calibrated event model): import serving traces and replay them through the same benchmark harness
  • v0.3 (better overlap and concurrency modeling): add explicit transfer concurrency and compute-overlap assumptions
  • v0.4 (runtime integration experiments): attach the harness to a real serving runtime for trace capture and replay

License / Contribution Note

This repository is released under the MIT License. Contributions are welcome, especially around trace ingestion, better workload suites, calibrated tier models, and policy ablations. Please keep changes empirical, reproducible, and careful about what is and is not being claimed.