Chunk-level KV cache reuse for HuggingFace inference.
Reuse KV tensors across requests that share long prefixes. Drop-in on any HF causal LM.
Quick Start • Benchmarks • How it works • When it helps • API • Docs
```
pip install kvboost
```

```python
from kvboost import KVBoost

engine = KVBoost.from_pretrained("Qwen/Qwen2.5-3B")

# Warm the shared prefix once
engine.warm("You are a helpful coding assistant. Always be concise...")

# Subsequent generates reuse cached chunks automatically
result = engine.generate(
    "You are a helpful coding assistant. Always be concise...\n\n"
    "User: How do I reverse a linked list?\nAssistant:",
    max_new_tokens=128,
)
print(result.output_text)
print(f"TTFT: {result.ttft_ms:.1f} ms | reuse: {result.kv_reuse_ratio:.0%}")
```

From source:

```
git clone https://github.com/pythongiant/kvboost.git
cd kvboost
pip install -e .
```

Requirements: Python ≥ 3.9, PyTorch ≥ 2.1, Transformers ≥ 4.38.
The core idea is one sentence: split the prompt into fixed-size chunks, hash them, and on the next request load the K/V tensors for chunks you have already computed instead of recomputing them. Everything else is making that produce correct outputs.
`chunk_registry.py` splits the token stream into fixed-size blocks (default 128). A 1000-token prompt becomes 7 full chunks plus a 104-token tail. With `--chunk-boundary-window=16` the cut point slides up to ±16 tokens to avoid splitting mid-sentence, which reduces seam error on natural-language prompts.
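A minimal sketch of the fixed-size split (illustrative only: `split_into_chunks` is not the actual `chunk_registry.py` API, and the real splitter additionally applies the boundary window):

```python
def split_into_chunks(tokens: list[int], chunk_size: int = 128) -> list[list[int]]:
    """Split a token stream into fixed-size blocks plus a tail remainder."""
    return [tokens[i : i + chunk_size] for i in range(0, len(tokens), chunk_size)]

chunks = split_into_chunks(list(range(1000)))
# 8 pieces: 7 full 128-token chunks plus a 104-token tail (7 * 128 + 104 = 1000)
```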
Each chunk gets two keys (see `models.py`):

```
prefix_hash  = SHA256(previous_chunk.prefix_hash || this_chunk.tokens)
content_hash = SHA256(this_chunk.tokens)
```
The prefix hash only matches when the tokens and every preceding chunk are identical — this is the case where stored K/V is directly usable. The content hash is a fallback: the tokens match but the history doesn't, so the stored K/V is approximately right but needs heavier correction.
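The chained construction can be sketched like this (a toy version: function names and the `repr`-based token serialization are illustrative, not the `models.py` implementation):

```python
import hashlib

def content_hash(tokens: list[int]) -> str:
    # Hash only this chunk's tokens -- history-independent
    return hashlib.sha256(repr(tokens).encode()).hexdigest()

def prefix_hash(prev_prefix_hash: str, tokens: list[int]) -> str:
    # Chain the previous chunk's prefix hash with this chunk's tokens,
    # so a match implies every preceding chunk is identical too
    h = hashlib.sha256()
    h.update(prev_prefix_hash.encode())
    h.update(repr(tokens).encode())
    return h.hexdigest()

tokens = [101, 7592, 2088]
# Same tokens under two different histories:
h1 = prefix_hash(prefix_hash("", [1, 2]), tokens)
h2 = prefix_hash(prefix_hash("", [3, 4]), tokens)
# content hashes agree, prefix hashes do not
```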
`KVCacheManager.find_matching_chunks()` tries the prefix hash first, then falls back to the content hash, and flags approximate matches. `PromptAssembler` then splits the prompt into a cached prefix (K/V loaded from memory) and a live suffix (tokens the model still has to process).
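The lookup order can be sketched as follows (a simplified stand-in, not the real `KVCacheManager` code; here `store` maps either hash kind to a cached entry id):

```python
def find_matching_chunks(chunk_keys, store):
    """chunk_keys: list of (prefix_hash, content_hash) per chunk, in prompt order.
    Returns (entry_id, is_approximate) per chunk until the first miss."""
    matches = []
    for prefix_h, content_h in chunk_keys:
        if prefix_h in store:
            matches.append((store[prefix_h], False))   # exact: K/V directly usable
        elif content_h in store:
            matches.append((store[content_h], True))   # approximate: needs correction
        else:
            break  # remaining chunks form the live suffix
    return matches

store = {"p1": "kv1", "c2": "kv2"}
hits = find_matching_chunks([("p1", "c1"), ("pX", "c2"), ("pY", "cY")], store)
# -> [("kv1", False), ("kv2", True)]; the third chunk misses and stays live
```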
Cache storage is an `OrderedDict` in CPU RAM with frequency-based eviction: frequently-reused chunks (your system prompt) stay resident, while one-off chunks are evicted first. Overflow spills to a pre-allocated binary file via `disk_tier.py`.
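The eviction policy amounts to something like this sketch (`FrequencyCache` is a hypothetical name; the real manager spills victims to the disk tier instead of dropping them):

```python
from collections import OrderedDict

class FrequencyCache:
    """Evict the least-frequently-reused entry when capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()   # key -> cached K/V blob
        self.hits = {}                 # key -> reuse count

    def put(self, key, value):
        if key not in self.entries and len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda k: self.hits.get(k, 0))
            del self.entries[victim]      # real manager: spill to disk tier
            self.hits.pop(victim, None)
        self.entries[key] = value
        self.hits.setdefault(key, 0)

    def get(self, key):
        if key in self.entries:
            self.hits[key] += 1
        return self.entries.get(key)

cache = FrequencyCache(capacity=2)
cache.put("system_prompt", "kv_a")
cache.get("system_prompt"); cache.get("system_prompt")   # hot chunk, 2 hits
cache.put("one_off", "kv_b")                             # never reused
cache.put("new_chunk", "kv_c")   # evicts "one_off" (0 hits), keeps the hot chunk
```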
This is the part that makes stitching correct. Each cached chunk was originally computed without seeing the chunks now preceding it in the new prompt, so its K/V values are slightly wrong at the boundaries.
KVBoost has two strategies (`recompute_strategy=`):

- `selective` (default) re-runs the model on the last R tokens at each seam with the preceding cached context visible, and overwrites the stale K/V. Cheap, but it only fixes the boundary. (`selective_recompute.py`)
- `cacheblend` does one forward pass, measures per-token cosine deviation vs. what the K/V would be with full context, and recomputes only the ~15% most-deviated tokens. This catches mid-chunk errors that `selective` misses. (`cacheblend.py`)
Approximate (content-hash) matches force CacheBlend regardless of the chosen strategy — position encodings are wrong in that case and boundary-only repair is not enough.
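The token-selection step of the CacheBlend-style strategy can be sketched with toy vectors (illustrative shapes and names, not the `cacheblend.py` code; real K/V rows are model-sized tensors):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def tokens_to_recompute(cached_kv, full_kv, fraction=0.15):
    """Rank tokens by deviation (1 - cosine) between their cached K/V row and
    the full-context K/V row; recompute the top `fraction` most-deviated."""
    deviations = [1.0 - cosine(c, f) for c, f in zip(cached_kv, full_kv)]
    n = max(1, round(len(deviations) * fraction))
    return sorted(range(len(deviations)), key=lambda i: deviations[i], reverse=True)[:n]

cached = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # K/V rows per token (toy dims)
full   = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]   # what full context would give
# token 1 deviates most, so only it gets recomputed
```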
Two optional continuity features stack on top of either strategy:
- `--overlap-k=16`: each chunk re-encodes the last K tokens of the previous chunk, so seam tokens always see K tokens of real preceding context at store time.
- `--sink-tokens=32`: always keep the first N tokens (the "attention sink") fully fresh, since many attention heads anchor on them.
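Together the two flags determine which positions are always recomputed rather than loaded from the cache; a sketch (the function name and signature are hypothetical):

```python
def fresh_positions(seam_positions, seq_len, overlap_k=16, sink_tokens=32):
    """Positions whose K/V stays fresh: the first `sink_tokens` positions
    (attention sink), plus `overlap_k` positions before each chunk seam."""
    fresh = set(range(min(sink_tokens, seq_len)))
    for seam in seam_positions:
        fresh.update(range(max(0, seam - overlap_k), seam))
    return sorted(fresh)

# One seam at position 128 in a 256-token prompt, small windows for readability:
# the sink (0-7) and the 4 tokens before the seam (124-127) are kept fresh
positions = fresh_positions([128], 256, overlap_k=4, sink_tokens=8)
```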
The corrected cached K/V and the live suffix go into a single `model.forward(past_key_values=...)` call in `engine.py`. Autoregressive decoding then proceeds normally. After generation, any newly-seen chunks are written back to the cache, so the next request with overlapping text hits without an explicit `warm()`.
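The write-back step reduces to something like this (illustrative: `write_back` is not the actual engine function, and real values are K/V tensors, not strings):

```python
def write_back(chunk_keys, store, computed_kv):
    """Store K/V for any chunk whose prefix hash is not cached yet, so the
    next request with the same prefix hits without an explicit warm()."""
    new = 0
    for (prefix_h, _content_h), kv in zip(chunk_keys, computed_kv):
        if prefix_h not in store:
            store[prefix_h] = kv
            new += 1
    return new

store = {"p1": "kv_old"}
added = write_back([("p1", "c1"), ("p2", "c2")], store, ["kv_1", "kv_2"])
# only the second chunk is new, so exactly one entry is written
```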
Under greedy decoding, the cached-and-corrected path is designed to
produce the argmax-equivalent token at every step — which matches what
the benchmark's cosine = 1.000 columns show on the KV-side logits.
Despite this, task accuracy still drifts by a few points at high reuse.
Why? Because "argmax matches at step 1" does not guarantee "full
generation matches" — small K/V perturbations can tilt later tokens onto
a different branch. The accuracy-by-reuse table is the ground truth;
treat the logit-cosine metric as a necessary but not sufficient check.
Under sampling (temperature > 0), outputs differ run-to-run by construction; the meaningful check is distributional (KL between logit distributions), not token-identity.
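That distributional check is just a KL divergence between the softmaxed logits of the baseline path and the cached path; a self-contained sketch (not the benchmark harness itself):

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(logits_p, logits_q):
    """KL(P || Q) between the softmax distributions of two logit vectors;
    values near 0 mean the cached path is distributionally faithful."""
    p, q = softmax(logits_p), softmax(logits_q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

baseline_logits = [2.0, 1.0, 0.1]
cached_logits   = [2.0, 1.0, 0.1]
# identical logits -> KL ~ 0; any K/V perturbation shows up as KL > 0
```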
`kv_cache_bits=8` quantizes cached tensors (per-channel for K, per-token for V — the KIVI-paper asymmetry) for ~2× RAM savings with minimal accuracy loss. `kv_cache_bits=4` is available for ~4×, but you should validate it with `verify_correctness()` on your workload before trusting it.
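A toy illustration of the asymmetric scheme on plain Python lists (the real code presumably quantizes torch tensors; function names here are hypothetical):

```python
def quantize_rows(mat, bits=8):
    """Symmetric per-row quantization: one scale per row. Applied to V this
    is 'per-token'; applied to K's transpose, it becomes 'per-channel'."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in mat:
        scale = (max(abs(x) for x in row) / qmax) or 1.0  # avoid zero scale
        out.append(([round(x / scale) for x in row], scale))
    return out

def quantize_per_channel(mat, bits=8):
    # Transpose so each column (channel) shares one scale, as for K
    cols = [list(c) for c in zip(*mat)]
    return quantize_rows(cols, bits)

v_rows = quantize_rows([[0.5, -1.0], [2.0, 0.25]])       # per-token scales
k_cols = quantize_per_channel([[0.5, -1.0], [2.0, 0.25]])  # per-channel scales
q0, s0 = v_rows[0]
# dequantizing q0[i] * s0 recovers the original row to within one int8 step
```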
Minimum surface:

```python
KVBoost.from_pretrained(
    model_name_or_path: str,
    recompute_strategy: Literal["selective", "cacheblend", "none"] = "selective",
    chunk_size: int = 128,
    kv_cache_bits: Optional[Literal[4, 8]] = None,
    device: Optional[str] = None,  # "cuda" | "mps" | "cpu"
    ...
) -> KVBoost

engine.warm(text: str) -> WarmResult
engine.generate(prompt: str, max_new_tokens: int = ..., **kwargs) -> GenerationResult
engine.verify_correctness(prompts: list[str], ...) -> CorrectnessReport
```

`GenerationResult` exposes `output_text`, `ttft_ms`, `total_ms`, `kv_reuse_ratio`, and the token-level traces used by the benchmarks.
Full docs: kvboost.readthedocs.io