Chunk-level KV cache reuse for HuggingFace inference.
Reuse KV tensors across requests that share long prefixes. Drop-in on any HF causal LM.
Quick Start • Benchmarks • How it works • When it helps • API • Docs
```
pip install kvboost
```

```python
from kvboost import KVBoost

engine = KVBoost.from_pretrained("Qwen/Qwen2.5-3B")

# Warm the shared prefix once
engine.warm("You are a helpful coding assistant. Always be concise...")

# Subsequent generates reuse cached chunks automatically
result = engine.generate(
    "You are a helpful coding assistant. Always be concise...\n\n"
    "User: How do I reverse a linked list?\nAssistant:",
    max_new_tokens=128,
)
print(result.output_text)
print(f"TTFT: {result.ttft_ms:.1f} ms | reuse: {result.kv_reuse_ratio:.0%}")
```

From source:

```
git clone https://github.com/pythongiant/kvboost.git
cd kvboost
pip install -e .
```

Requirements: Python ≥ 3.9, PyTorch ≥ 2.1, Transformers ≥ 4.38.
The core idea is one sentence: split the prompt into fixed-size chunks, hash them, and on the next request load the K/V tensors for chunks you have already computed instead of recomputing them. Everything else is making that produce correct outputs.
`chunk_registry.py` splits the token stream into fixed-size blocks (default 128). A 1000-token prompt becomes 7 full chunks plus a 104-token tail. With `--chunk-boundary-window=16` the cut point slides up to ±16 tokens to avoid splitting mid-sentence, which reduces seam error on natural-language prompts.
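A minimal sketch of the fixed-size split (illustrative only: `split_into_chunks` is not the actual `chunk_registry.py` API, and the real splitter additionally applies the boundary window):

```python
def split_into_chunks(tokens: list[int], chunk_size: int = 128) -> list[list[int]]:
    """Split a token stream into fixed-size blocks plus a tail remainder."""
    return [tokens[i : i + chunk_size] for i in range(0, len(tokens), chunk_size)]

chunks = split_into_chunks(list(range(1000)))
# 8 pieces: 7 full 128-token chunks plus a 104-token tail (7 * 128 + 104 = 1000)
```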
Each chunk gets two keys (see `models.py`):

```
prefix_hash  = SHA256(previous_chunk.prefix_hash || this_chunk.tokens)
content_hash = SHA256(this_chunk.tokens)
```
The prefix hash only matches when the tokens and every preceding chunk are identical — this is the case where stored K/V is directly usable. The content hash is a fallback: the tokens match but the history doesn't, so the stored K/V is approximately right but needs heavier correction.
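The chained construction can be sketched like this (a toy version: function names and the `repr`-based token serialization are illustrative, not the `models.py` implementation):

```python
import hashlib

def content_hash(tokens: list[int]) -> str:
    # Hash only this chunk's tokens -- history-independent
    return hashlib.sha256(repr(tokens).encode()).hexdigest()

def prefix_hash(prev_prefix_hash: str, tokens: list[int]) -> str:
    # Chain the previous chunk's prefix hash with this chunk's tokens,
    # so a match implies every preceding chunk is identical too
    h = hashlib.sha256()
    h.update(prev_prefix_hash.encode())
    h.update(repr(tokens).encode())
    return h.hexdigest()

tokens = [101, 7592, 2088]
# Same tokens under two different histories:
h1 = prefix_hash(prefix_hash("", [1, 2]), tokens)
h2 = prefix_hash(prefix_hash("", [3, 4]), tokens)
# content hashes agree, prefix hashes do not
```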
`KVCacheManager.find_matching_chunks()` tries the prefix hash first, then falls back to the content hash, and flags approximate matches. `PromptAssembler` then splits the prompt into a cached prefix (K/V loaded from memory) and a live suffix (tokens the model still has to process).
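The lookup order can be sketched as follows (a simplified stand-in, not the real `KVCacheManager` code; here `store` maps either hash kind to a cached entry id):

```python
def find_matching_chunks(chunk_keys, store):
    """chunk_keys: list of (prefix_hash, content_hash) per chunk, in prompt order.
    Returns (entry_id, is_approximate) per chunk until the first miss."""
    matches = []
    for prefix_h, content_h in chunk_keys:
        if prefix_h in store:
            matches.append((store[prefix_h], False))   # exact: K/V directly usable
        elif content_h in store:
            matches.append((store[content_h], True))   # approximate: needs correction
        else:
            break  # remaining chunks form the live suffix
    return matches

store = {"p1": "kv1", "c2": "kv2"}
hits = find_matching_chunks([("p1", "c1"), ("pX", "c2"), ("pY", "cY")], store)
# -> [("kv1", False), ("kv2", True)]; the third chunk misses and stays live
```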
Cache storage is an `OrderedDict` in CPU RAM with frequency-based eviction: frequently-reused chunks (your system prompt) stay resident, while one-off chunks are evicted first. Overflow spills to a pre-allocated binary file via `disk_tier.py`.
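The eviction policy amounts to something like this sketch (`FrequencyCache` is a hypothetical name; the real manager spills victims to the disk tier instead of dropping them):

```python
from collections import OrderedDict

class FrequencyCache:
    """Evict the least-frequently-reused entry when capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()   # key -> cached K/V blob
        self.hits = {}                 # key -> reuse count

    def put(self, key, value):
        if key not in self.entries and len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda k: self.hits.get(k, 0))
            del self.entries[victim]      # real manager: spill to disk tier
            self.hits.pop(victim, None)
        self.entries[key] = value
        self.hits.setdefault(key, 0)

    def get(self, key):
        if key in self.entries:
            self.hits[key] += 1
        return self.entries.get(key)

cache = FrequencyCache(capacity=2)
cache.put("system_prompt", "kv_a")
cache.get("system_prompt"); cache.get("system_prompt")   # hot chunk, 2 hits
cache.put("one_off", "kv_b")                             # never reused
cache.put("new_chunk", "kv_c")   # evicts "one_off" (0 hits), keeps the hot chunk
```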
This is the part that makes stitching correct. Each cached chunk was originally computed without seeing the chunks now preceding it in the new prompt, so its K/V values are slightly wrong at the boundaries.
KVBoost has two strategies (`recompute_strategy=`):

- `selective` (default) re-runs the model on the last R tokens at each seam with the preceding cached context visible, and overwrites the stale K/V. Cheap, but it only fixes the boundary. (`selective_recompute.py`)
- `cacheblend` does one forward pass, measures per-token cosine deviation vs. what the K/V would be with full context, and recomputes only the ~15% most-deviated tokens. This catches mid-chunk errors that `selective` misses. (`cacheblend.py`)
Approximate (content-hash) matches force CacheBlend regardless of the chosen strategy — position encodings are wrong in that case and boundary-only repair is not enough.
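The token-selection step of the CacheBlend-style strategy can be sketched with toy vectors (illustrative shapes and names, not the `cacheblend.py` code; real K/V rows are model-sized tensors):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def tokens_to_recompute(cached_kv, full_kv, fraction=0.15):
    """Rank tokens by deviation (1 - cosine) between their cached K/V row and
    the full-context K/V row; recompute the top `fraction` most-deviated."""
    deviations = [1.0 - cosine(c, f) for c, f in zip(cached_kv, full_kv)]
    n = max(1, round(len(deviations) * fraction))
    return sorted(range(len(deviations)), key=lambda i: deviations[i], reverse=True)[:n]

cached = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # K/V rows per token (toy dims)
full   = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]   # what full context would give
# token 1 deviates most, so only it gets recomputed
```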
Two optional continuity features stack on top of either strategy:
- `--overlap-k=16`: each chunk re-encodes the last K tokens of the previous chunk, so seam tokens always see K tokens of real preceding context at store time.
- `--sink-tokens=32`: always keep the first N tokens (the "attention sink") fully fresh, since many attention heads anchor on them.
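Together the two flags determine which positions are always recomputed rather than loaded from the cache; a sketch (the function name and signature are hypothetical):

```python
def fresh_positions(seam_positions, seq_len, overlap_k=16, sink_tokens=32):
    """Positions whose K/V stays fresh: the first `sink_tokens` positions
    (attention sink), plus `overlap_k` positions before each chunk seam."""
    fresh = set(range(min(sink_tokens, seq_len)))
    for seam in seam_positions:
        fresh.update(range(max(0, seam - overlap_k), seam))
    return sorted(fresh)

# One seam at position 128 in a 256-token prompt, small windows for readability:
# the sink (0-7) and the 4 tokens before the seam (124-127) are kept fresh
positions = fresh_positions([128], 256, overlap_k=4, sink_tokens=8)
```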
The corrected cached K/V and the live suffix go into a single `model.forward(past_key_values=...)` call in `engine.py`. Autoregressive decoding then proceeds normally. After generation, any newly-seen chunks are written back to the cache, so the next request with overlapping text hits without an explicit `warm()`.
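The write-back step reduces to something like this (illustrative: `write_back` is not the actual engine function, and real values are K/V tensors, not strings):

```python
def write_back(chunk_keys, store, computed_kv):
    """Store K/V for any chunk whose prefix hash is not cached yet, so the
    next request with the same prefix hits without an explicit warm()."""
    new = 0
    for (prefix_h, _content_h), kv in zip(chunk_keys, computed_kv):
        if prefix_h not in store:
            store[prefix_h] = kv
            new += 1
    return new

store = {"p1": "kv_old"}
added = write_back([("p1", "c1"), ("p2", "c2")], store, ["kv_1", "kv_2"])
# only the second chunk is new, so exactly one entry is written
```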
Under greedy decoding, the cached-and-corrected path is designed to
produce the argmax-equivalent token at every step — which matches what
the benchmark's cosine = 1.000 columns show on the KV-side logits.
Despite this, task accuracy still drifts by a few points at high reuse.
Why? Because "argmax matches at step 1" does not guarantee "full
generation matches" — small K/V perturbations can tilt later tokens onto
a different branch. The accuracy-by-reuse table is the ground truth;
treat the logit-cosine metric as a necessary but not sufficient check.
Under sampling (temperature > 0), outputs differ run-to-run by construction; the meaningful check is distributional (KL between logit distributions), not token-identity.
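That distributional check is just a KL divergence between the softmaxed logits of the baseline path and the cached path; a self-contained sketch (not the benchmark harness itself):

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(logits_p, logits_q):
    """KL(P || Q) between the softmax distributions of two logit vectors;
    values near 0 mean the cached path is distributionally faithful."""
    p, q = softmax(logits_p), softmax(logits_q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

baseline_logits = [2.0, 1.0, 0.1]
cached_logits   = [2.0, 1.0, 0.1]
# identical logits -> KL ~ 0; any K/V perturbation shows up as KL > 0
```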
`kv_cache_bits=8` quantizes cached tensors (per-channel for K, per-token for V — the KIVI-paper asymmetry) for ~2× RAM savings with minimal accuracy loss. `kv_cache_bits=4` is available for ~4×, but you should validate it with `verify_correctness()` on your workload before trusting it.
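A toy illustration of the asymmetric scheme on plain Python lists (the real code presumably quantizes torch tensors; function names here are hypothetical):

```python
def quantize_rows(mat, bits=8):
    """Symmetric per-row quantization: one scale per row. Applied to V this
    is 'per-token'; applied to K's transpose, it becomes 'per-channel'."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in mat:
        scale = (max(abs(x) for x in row) / qmax) or 1.0  # avoid zero scale
        out.append(([round(x / scale) for x in row], scale))
    return out

def quantize_per_channel(mat, bits=8):
    # Transpose so each column (channel) shares one scale, as for K
    cols = [list(c) for c in zip(*mat)]
    return quantize_rows(cols, bits)

v_rows = quantize_rows([[0.5, -1.0], [2.0, 0.25]])       # per-token scales
k_cols = quantize_per_channel([[0.5, -1.0], [2.0, 0.25]])  # per-channel scales
q0, s0 = v_rows[0]
# dequantizing q0[i] * s0 recovers the original row to within one int8 step
```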
Minimum surface:

```python
KVBoost.from_pretrained(
    model_name_or_path: str,
    recompute_strategy: Literal["selective", "cacheblend", "none"] = "selective",
    chunk_size: int = 128,
    kv_cache_bits: Optional[Literal[4, 8]] = None,
    device: Optional[str] = None,  # "cuda" | "mps" | "cpu"
    ...
) -> KVBoost

engine.warm(text: str) -> WarmResult
engine.generate(prompt: str, max_new_tokens: int = ..., **kwargs) -> GenerationResult
engine.verify_correctness(prompts: list[str], ...) -> CorrectnessReport
```

`GenerationResult` exposes `output_text`, `ttft_ms`, `total_ms`, `kv_reuse_ratio`, and the token-level traces used by the benchmarks.
Full docs: kvboost.readthedocs.io