Commit 95ccb19

SanjanaB123 and abetlen authored
fix(embedding): set kv_unified=True when embedding=True to enable batch processing (abetlen#2217)
* fix(embedding): set kv_unified=True when embedding=True to enable batch processing
* chore: update changelog for batch embedding fix

Co-authored-by: abetlen <abetlen@gmail.com>
1 parent 4a1a8ec commit 95ccb19

2 files changed: 3 additions & 1 deletion

File tree

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+- fix: Enable unified KV cache for embedding contexts to preserve full per-sequence context in batch embedding calls by @SanjanaB123 in #2217
+
 ## [0.3.23]
 
 - feat: Update llama.cpp to ggerganov/llama.cpp@7d442abf

llama_cpp/llama.py

Lines changed: 1 addition & 1 deletion
@@ -402,7 +402,7 @@ def __init__(
             self.n_batch,
             llama_cpp.llama_max_parallel_sequences(),
         )
-
+        self.context_params.kv_unified = True
         self._ctx = self._stack.enter_context(
             contextlib.closing(
                 internals.LlamaContext(
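The one-line change above can be illustrated with a small, self-contained sketch of the constructor logic: when embedding mode is requested, the context is switched to a unified KV cache so each sequence in a batch keeps its full per-sequence context. The class and attribute names below are hypothetical stand-ins for illustration, not the actual llama-cpp-python API.

```python
from dataclasses import dataclass


@dataclass
class ContextParams:
    """Hypothetical stand-in for llama.cpp context parameters."""
    n_ctx: int = 2048
    kv_unified: bool = False  # per-sequence (split) KV cache by default


class EmbeddingContext:
    """Hypothetical stand-in for the Llama constructor path touched here."""

    def __init__(self, embedding: bool = False, n_ctx: int = 2048):
        self.context_params = ContextParams(n_ctx=n_ctx)
        if embedding:
            # Mirrors the fix: batch embedding calls need the unified KV
            # cache so every sequence can use the full context window.
            self.context_params.kv_unified = True


# Embedding mode enables the unified cache; generation mode leaves it off.
embed_ctx = EmbeddingContext(embedding=True)
gen_ctx = EmbeddingContext(embedding=False)
print(embed_ctx.context_params.kv_unified, gen_ctx.context_params.kv_unified)
```

This keeps the behavior change scoped to embedding contexts, leaving the default split-cache behavior of generation contexts untouched.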
