AzureCosmosDB · aayush3011 · Jul 1, 2026
diff --git a/.env.template b/.env.template
@@ -18,15 +18,6 @@ COSMOS_DB_SUMMARIES_CONTAINER="memories_summaries"
 COSMOS_DB_TURNS_CONTAINER="memories_turns"
 COSMOS_DB_COUNTERS_CONTAINER=counter
 COSMOS_DB_LEASE_CONTAINER=leases
-# Throughput mode for all required Cosmos DB containers created by the toolkit
-# (memories, counter, and lease).
-# - serverless: default. The toolkit does not send container RU/s settings.
-#   Use this only with a Cosmos DB account configured for serverless.
-# - autoscale: the toolkit provisions all required containers with autoscale
-#   throughput using COSMOS_DB_AUTOSCALE_MAX_RU as the max RU/s cap.
-#   Default max RU/s is 1000.
-COSMOS_DB_THROUGHPUT_MODE=serverless
-COSMOS_DB_AUTOSCALE_MAX_RU=1000
 
 # ---- Processing thresholds (set to 0 to disable) ----
 THREAD_SUMMARY_EVERY_N=10
@@ -53,9 +44,6 @@ AI_FOUNDRY_ENDPOINT=https://<your-account>.openai.azure.com/
 AI_FOUNDRY_API_KEY=
 AI_FOUNDRY_EMBEDDING_DEPLOYMENT_NAME=text-embedding-3-large
 AI_FOUNDRY_EMBEDDING_DIMENSIONS=1536
-AI_FOUNDRY_EMBEDDING_DATA_TYPE=float32
-AI_FOUNDRY_EMBEDDING_DISTANCE_FUNCTION=cosine
-COSMOS_DB_FULL_TEXT_LANGUAGE=en-US
 
 AI_FOUNDRY_CHAT_DEPLOYMENT_NAME=<your-model-deployment>
 # Optional. Pin the Azure OpenAI REST API version used by chat and embeddings

diff --git a/Docs/concepts.md b/Docs/concepts.md
@@ -116,26 +116,52 @@ Prompts for summarization and fact extraction live in `azure_functions/prompts/`
 
 ## Memory Reconciliation
 
-The `reconcile_memories(user_id, n=50)` pipeline step reads up to N most-recent active facts for a user and asks the LLM to identify two orthogonal outcomes in one pass:
+Reconciliation runs in **two complementary tiers**: a cheap, LLM-free **vector-floor dedup ladder** applied to freshly-extracted memories before they persist, and a periodic **LLM reconcile** that runs in a **dual mode** (cheap candidate clusters most sweeps, a full-pool backstop occasionally).
 
-- **Duplicates** — two or more facts that restate the same claim in different words. Resolution: collapse into one merged fact; the originals are soft-deleted with `supersede_reason="duplicate"` and `superseded_by` set to the merged fact's id.
-- **Contradictions** — two facts that assert opposing claims about the same subject. Resolution: keep the winner (more recent first, higher confidence as tiebreaker), soft-delete the loser with `supersede_reason="contradict"` and `superseded_by` set to the winner.
+### Vector-floor dedup ladder (write path, LLM-free)
 
-### Why one pass
+Between extraction and persist, `dedup_extracted_memories` compares each new fact/episodic memory against the user's existing active memories of the same type using Cosmos `VectorDistance` (pure vector, no hybrid). Each new memory takes one rung of a similarity ladder:
 
-Detecting contradictions semantically requires the LLM to see the candidate pool as a whole — paraphrased ("user prefers aisle seats") and contradictory ("user is vegetarian" vs "user loves steak") facts often have very different embedding vectors and would never co-occur in any cosine cluster. Putting all N candidates into one prompt lets the LLM do the semantic reasoning across both axes simultaneously. The pipeline returns `{"kept": int, "merged": int, "contradicted": int}`.
+| band | condition (cosine) | action |
+|------|--------------------|--------|
+| exact | `content_hash` hit | skip (Stage 0, free) |
+| near-exact | `s ≥ DEDUP_SIM_HIGH` (0.97) | **auto-skip** the new memory (no LLM); logged for audit |
+| borderline | `DEDUP_SIM_LOW ≤ s < DEDUP_SIM_HIGH` (0.80–0.97) | persist, tag `sys:dup-candidate` + stash `dup_of`/`dup_score` for the LLM reconcile |
+| novel | `s < DEDUP_SIM_LOW` | persist clean |
+
+The thresholds are calibrated for **cosine/dotproduct** on normalized embeddings. On a container whose `distanceFunction` is **euclidean**, the destructive near-exact auto-skip is **disabled** (with a one-shot warning) and those memories fall through to borderline tagging so the LLM adjudicates. Euclidean distance is not a bounded [0,1] similarity, so applying the cosine-tuned auto-skip threshold to it would misfire.
+
+### Dual-mode LLM reconcile
+
+`reconcile_memories(user_id, n=50, *, memory_type="fact", full_rebuild=False)` identifies two orthogonal outcomes:
+
+- **Duplicates** — facts restating the same claim. Resolution: collapse into one merged fact; originals soft-deleted with `supersede_reason="duplicate"` and `superseded_by` set to the merged fact.
+- **Contradictions** — facts asserting opposing claims about the same subject. Resolution: keep the winner (more recent first, higher confidence as tiebreaker), soft-delete the loser with `supersede_reason="contradict"`.
+
+It runs in one of two modes:
+
+- **Candidate mode** (default auto sweeps) — builds connected-component clusters from the `sys:dup-candidate` seeds + their vector neighbors (edge threshold `DEDUP_CLUSTER_SIM`, 0.60) and sends **only those clusters** to the LLM. Cheap, but keyed on near-duplicate similarity. Tagged seeds that never join a cluster have their stale tag cleared so they aren't re-scanned forever.
+- **Full-pool backstop** — every `DEDUP_FULL_RECLUSTER_EVERY_N`-th sweep (default 12), and on any explicit `reconcile(full_rebuild=True)`, the **entire** active pool goes into one LLM pass. This is the only path that catches **dissimilar contradictions** — paraphrased ("prefers aisle seats") and contradictory ("vegetarian" vs "loves steak") facts have very different embedding vectors and would never co-occur in a cosine cluster, so candidate mode alone can't link them.
+
+Both modes return `{"kept": int, "merged": int, "contradicted": int}`. In-process and durable backends reconcile **both** facts and episodic memories so episodic duplicates don't accrue forever.
 
 ### Loser preservation
 
-Soft-deleted facts stay in the container with their `supersede_reason`, `superseded_at`, and `superseded_by` fields populated. Default reads (`get_memories`, `search_cosmos`) filter them out via `superseded_by IS NULL`. To inspect the audit trail (e.g. "show everything that ever applied to this user"), opt out of the filter at the query level.
+Soft-deleted facts stay in the container with their `supersede_reason`, `superseded_at`, and `superseded_by` fields populated. Default reads (`get_memories`, `search_cosmos`) filter them out via `superseded_by IS NULL`. To inspect the audit trail, opt out of the filter at the query level.
 
 ### Write-time exact dedup
 
 Each fact written by `extract_memories` carries a `content_hash` (SHA-256 of normalized content, truncated to 32 hex chars; lowercase, whitespace-collapsed). Before upserting a freshly-extracted fact, the pipeline checks the hash against existing active facts and short-circuits if a match exists, incrementing the `exact_dedup_skipped` metric. This catches identical re-extractions cheaply without an LLM call.
 
+### Extraction watermark (`recent_k`)
+
+The auto-trigger paths size `recent_k` (how many recent turns extraction reads) from a per-thread **watermark** (`last_extract_count` on the counter doc): `recent_k = current_count − last_extract_count` (with `last_extract_count` treated as `0` before the first successful extract). The newest `recent_k` turns are exactly the turns added since the last successful extract, and the watermark advances **only after a successful extract**. Under normal operation, therefore, no turns are skipped when extraction lags or transiently fails: a failed run leaves the watermark unchanged and the full backlog is retried on the next sweep. The window is deliberately **not** capped by `DEDUP_POOL_SIZE` (that knob governs the reconcile prompt, not the extraction window): capping would extract only the newest N turns and silently strand the oldest backlog turns.
+
+> **Caveat (rare):** the SDK's inline counter increment is best-effort — under sustained optimistic-concurrency contention it can drop an increment rather than block the user's write path (see `increment_counter_sync`). A dropped increment leaves `current_count` lagging the true turn count, which can in turn under-cover a later extraction window. This is the one case where the "no turns skipped" property does not hold; the Function App backend avoids it by raising to force change-feed redelivery.
+
 ### Tunable
 
-`DEDUP_EVERY_N` (default 5) controls how often `reconcile_memories` runs in the auto-trigger path. Set to `0` to disable. The candidate cap `n` (default 50) is tunable per call; larger values give the LLM a wider view at higher token cost.
+`DEDUP_EVERY_N` (default 5) controls how often reconcile runs in the auto-trigger path. Set to `0` to disable. The candidate cap `n` (default `DEDUP_POOL_SIZE`, 50) is tunable per call; larger values give the LLM a wider view at higher token cost. `DEDUP_FULL_RECLUSTER_EVERY_N` (default 12) sets how often the full-pool backstop fires.
 
 > **Indexing note.** The reconcile pool query orders by `created_at` (matching the prompt's "more recent first" tiebreaker). Cosmos's default indexing policy includes every property, so this works out of the box. If you customize the indexing policy to reduce write RU, ensure `/created_at/?` remains indexed or the query will fail with a 400 (`Order-by over a non-indexed path`).
 
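Reviewer sketch for the vector-floor ladder documented in the `Docs/concepts.md` hunk above. It is a minimal illustration of the band table, not the toolkit's `dedup_extracted_memories` code. The threshold values come from the table; the `Band` enum, the `classify` function, and its parameters are hypothetical names introduced here, and treating the euclidean case as "demote near-exact to borderline" is an assumption based on the paragraph that follows the table.

```python
from enum import Enum

# Thresholds as stated in the docs table (cosine similarity on normalized embeddings).
DEDUP_SIM_HIGH = 0.97
DEDUP_SIM_LOW = 0.80


class Band(Enum):
    EXACT = "exact"            # content_hash hit: skip, no vector query needed
    NEAR_EXACT = "near-exact"  # auto-skip the new memory, no LLM call
    BORDERLINE = "borderline"  # persist, tag sys:dup-candidate for LLM reconcile
    NOVEL = "novel"            # persist clean


def classify(similarity: float, *, hash_hit: bool = False,
             distance_function: str = "cosine") -> Band:
    """Pick one rung of the ladder for a freshly extracted memory (hypothetical helper)."""
    if hash_hit:
        return Band.EXACT
    if similarity >= DEDUP_SIM_HIGH:
        # Assumption: on euclidean containers the destructive auto-skip is off,
        # so a would-be near-exact match is only tagged for the LLM to adjudicate.
        if distance_function == "euclidean":
            return Band.BORDERLINE
        return Band.NEAR_EXACT
    if similarity >= DEDUP_SIM_LOW:
        return Band.BORDERLINE
    return Band.NOVEL
```

For example, `classify(0.90)` lands in the borderline band, while `classify(0.99, distance_function="euclidean")` is demoted from near-exact to borderline instead of being dropped.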

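Reviewer sketch for candidate mode as described in the `Docs/concepts.md` hunk above: connected components over similarity edges at `DEDUP_CLUSTER_SIM`, with unclustered seeds reported so their stale tag can be cleared. The union-find approach, the `build_clusters` name, and the `(id_a, id_b, similarity)` input shape are assumptions for illustration; the diff does not show how the toolkit builds its clusters.

```python
from collections import defaultdict

DEDUP_CLUSTER_SIM = 0.60  # edge threshold stated in the docs


def build_clusters(seed_ids, scored_pairs, threshold=DEDUP_CLUSTER_SIM):
    """Group memories into connected components (hypothetical helper).

    seed_ids: ids tagged sys:dup-candidate.
    scored_pairs: iterable of (id_a, id_b, similarity) from vector neighbor lookups.
    Returns (clusters, stale_seeds): clusters containing at least one seed,
    and seeds that joined no cluster.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Only pairs at or above the threshold become edges.
    for a, b, sim in scored_pairs:
        if sim >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)

    seeds = set(seed_ids)
    # Assumption: only components that contain a tagged seed are sent to the LLM.
    clusters = sorted(sorted(members) for members in groups.values() if members & seeds)
    clustered = set().union(*clusters)
    stale_seeds = sorted(seeds - clustered)
    return clusters, stale_seeds
```

With seeds `["m1", "m4"]` and pairs `[("m1", "m2", 0.85), ("m2", "m3", 0.65), ("m4", "m5", 0.40)]`, `m1`, `m2`, and `m3` form one cluster through the transitive edge, and `m4` is returned as a stale seed because its only neighbor falls below the threshold.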
diff --git a/Docs/design_patterns.md b/Docs/design_patterns.md
@@ -170,7 +170,6 @@ facts = await mem.search_cosmos(
 results = await mem.search_cosmos(
     search_terms="PostgreSQL to Cosmos DB",
     user_id="user-1",
-    hybrid_search=True,
     top_k=5,
 )
 ```

diff --git a/Docs/public_api.md b/Docs/public_api.md
@@ -37,7 +37,7 @@
 
 ### Retrieval
 
-- `search_cosmos(search_terms, memory_id=None, user_id=None, role=None, memory_types=None, thread_id=None, hybrid_search=False, top_k=5, tags_all=None, tags_any=None, exclude_tags=None, include_superseded=False, min_salience=None, min_confidence=None, created_after=None, created_before=None) -> list[dict]` — vector or hybrid search memories.
+- `search_cosmos(search_terms, memory_id=None, user_id=None, role=None, memory_types=None, thread_id=None, top_k=5, tags_all=None, tags_any=None, exclude_tags=None, include_superseded=False, min_salience=None, min_confidence=None, created_after=None, created_before=None) -> list[dict]` — hybrid vector/full-text search over memories, falling back to vector-only for all-stopword queries.
 - `get_procedural_prompt(user_id) -> Optional[str]` — read the active procedural prompt.
 - `get_procedural_history(user_id, limit=10) -> list[dict]` — read procedural prompt history.
 - `get_procedural_memories(user_id, priority=None, category=None, min_salience=None, include_superseded=False) -> list[dict]` — retrieve procedural memory documents.
@@ -90,7 +90,7 @@ Local-buffer methods remain synchronous in-memory operations; Cosmos, retrieval,
 
 ### Retrieval
 
-- `async search_cosmos(search_terms, memory_id=None, user_id=None, role=None, memory_types=None, thread_id=None, hybrid_search=False, top_k=5, tags_all=None, tags_any=None, exclude_tags=None, include_superseded=False, min_salience=None, min_confidence=None, created_after=None, created_before=None) -> list[dict]` — vector or hybrid search memories.
+- `async search_cosmos(search_terms, memory_id=None, user_id=None, role=None, memory_types=None, thread_id=None, top_k=5, tags_all=None, tags_any=None, exclude_tags=None, include_superseded=False, min_salience=None, min_confidence=None, created_after=None, created_before=None) -> list[dict]` — hybrid vector/full-text search over memories, falling back to vector-only for all-stopword queries.
 - `async get_procedural_prompt(user_id) -> Optional[str]` — read the active procedural prompt.
 - `async get_procedural_history(user_id, limit=10) -> list[dict]` — read procedural prompt history.
 - `async get_procedural_memories(user_id, priority=None, category=None, min_salience=None, include_superseded=False) -> list[dict]` — retrieve procedural memory documents.

diff --git a/Docs/troubleshooting.md b/Docs/troubleshooting.md
@@ -50,15 +50,11 @@ COSMOS_DB_DATABASE=ai_memory
 COSMOS_DB_MEMORIES_CONTAINER=memories
 COSMOS_DB_COUNTERS_CONTAINER=counter
 COSMOS_DB_LEASE_CONTAINER=leases
-COSMOS_DB_THROUGHPUT_MODE=serverless
-COSMOS_DB_AUTOSCALE_MAX_RU=1000
 
 AI_FOUNDRY_ENDPOINT=https://<account>.openai.azure.com/
 AI_FOUNDRY_API_KEY=
 AI_FOUNDRY_EMBEDDING_DEPLOYMENT_NAME=text-embedding-3-large
 AI_FOUNDRY_EMBEDDING_DIMENSIONS=1536
-AI_FOUNDRY_EMBEDDING_DATA_TYPE=float32
-AI_FOUNDRY_EMBEDDING_DISTANCE_FUNCTION=cosine
 AI_FOUNDRY_CHAT_DEPLOYMENT_NAME=<chat-deployment-name>
 ```
 
@@ -77,8 +73,6 @@ The notebooks and samples pass these values into the client like this:
 | `AI_FOUNDRY_EMBEDDING_DEPLOYMENT_NAME` | `embedding_deployment_name` |
 | `AI_FOUNDRY_CHAT_DEPLOYMENT_NAME` | `chat_deployment_name` |
 
-`AI_FOUNDRY_EMBEDDING_DIMENSIONS`, `AI_FOUNDRY_EMBEDDING_DATA_TYPE`, and `AI_FOUNDRY_EMBEDDING_DISTANCE_FUNCTION` are read by the toolkit when creating the Cosmos DB vector policy. The Function App also reads `COSMOS_DB__accountEndpoint` for its identity-based Cosmos DB trigger binding; set it to the same value as `COSMOS_DB_ENDPOINT`.
-
 Run `az login` before using `DefaultAzureCredential`.
 
 Required roles:
@@ -104,7 +98,7 @@ The memories container is created with:
 
 If vector or full-text search fails after changing dimensions or indexing settings, create a fresh container with the desired configuration. Cosmos container vector policies are creation-time infrastructure choices.
 
-Use `COSMOS_DB_THROUGHPUT_MODE=serverless` for the default setup. Use `autoscale` with `COSMOS_DB_AUTOSCALE_MAX_RU` when you need provisioned autoscale throughput.
+Pass `cosmos_throughput_mode="serverless"` (the default) when creating the client. Use `cosmos_throughput_mode="autoscale"` with `cosmos_autoscale_max_ru` when you need provisioned autoscale throughput.
 
 ---
 
@@ -117,7 +111,7 @@ Embedding failures usually mean one of these is wrong:
 - `AI_FOUNDRY_EMBEDDING_DIMENSIONS`
 - Azure OpenAI / AI Services RBAC
 
-For hybrid search, `search_terms` is required when `hybrid_search=True`.
+Search always uses hybrid vector/full-text ranking when keyword extraction finds terms; all-stopword queries fall back to vector-only ranking.
 
 If search returns documents but scores look poor, check that records have an `embedding` field and that the query uses similar language to the stored memory content.
 

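Reviewer sketch of the fallback rule stated in the `Docs/troubleshooting.md` hunk above: hybrid ranking when keyword extraction finds terms, vector-only when the query is all stopwords. The stopword list, the tokenizer, and the `extract_keywords` and `choose_ranking` names are hypothetical; the diff does not show the toolkit's actual keyword extraction.

```python
import re

# Illustrative stopword list only; the toolkit's real list is not shown in this PR.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "did", "do", "is", "it", "of",
    "the", "to", "was", "what",
})


def extract_keywords(query: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [token for token in tokens if token not in STOPWORDS]


def choose_ranking(query: str) -> str:
    """Return 'hybrid' when any keyword survives, else 'vector'."""
    return "hybrid" if extract_keywords(query) else "vector"
```

Under these assumptions, `choose_ranking("hiking trails Pacific Northwest")` selects hybrid ranking, while a query such as `"what is it"` has no surviving keywords and falls back to vector-only.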
diff --git a/Samples/Advanced/advanced_search_patterns.py b/Samples/Advanced/advanced_search_patterns.py
@@ -89,10 +89,11 @@ def seed_memories(mem: CosmosMemoryClient, user_id: str, thread_id: str) -> None
 # ---------------------------------------------------------------------------
 
 def vector_search(mem: CosmosMemoryClient, user_id: str) -> None:
-    """Pattern 1 — Pure vector (semantic similarity) search."""
-    print_header("1. Vector Search (semantic similarity)")
+    """Pattern 1 — Semantic-style query (natural language, low keyword overlap)."""
+    print_header("1. Semantic Search (natural-language query)")
     print("  Query: 'outdoor activities'")
-    print("  Finds semantically related memories even without exact keyword matches.\n")
+    print("  Hybrid ranking leans on embedding similarity when there are few exact")
+    print("  keyword matches, so semantically related memories still surface.\n")
 
     results = mem.search_cosmos(
         search_terms="outdoor activities",
@@ -103,15 +104,15 @@ def vector_search(mem: CosmosMemoryClient, user_id: str) -> None:
 
 
 def hybrid_search(mem: CosmosMemoryClient, user_id: str) -> None:
-    """Pattern 2 — Hybrid search (vector + full-text)."""
+    """Pattern 2 — Hybrid search (vector + full-text) is the default."""
     print_header("2. Hybrid Search (vector + full-text)")
     print("  Query: 'hiking trails Pacific Northwest'")
-    print("  Combines embedding similarity with BM25 keyword matching.\n")
+    print("  Every search_cosmos call fuses embedding similarity with BM25 keyword")
+    print("  matching automatically — no flag required.\n")
 
     results = mem.search_cosmos(
         search_terms="hiking trails Pacific Northwest",
         user_id=user_id,
-        hybrid_search=True,
         top_k=5,
     )
     print_results(results)

diff --git a/Samples/Notebooks/Demo_async.ipynb b/Samples/Notebooks/Demo_async.ipynb
@@ -872,7 +872,7 @@
     "results_search_async = await memory.search_cosmos(\n",
     "    search_terms=\"What did the user ask about the weather?\",\n",
     "    user_id=USER_ID,\n",
-    "    top_k=3, hybrid_search= True\n",
+    "    top_k=3\n",
     ")"
    ]
   },