Improve memory consolidation: unified ADD/UPDATE/DELETE/NOOP pipeline + semantic dedup

## Summary

Overhaul the memory consolidation/dedup pipeline. The current write path uses exact-`subject` + substring matching and a "longer wins" update heuristic, which misses semantic duplicates, never resolves contradictions, and lets stale long-term facts accumulate. This proposes moving to the now-standard **unified update pipeline** (retrieve similar → LLM decides ADD/UPDATE/DELETE/NOOP), staged so Tier 1 ships with no new dependencies.

## Current implementation

Two-tier SQLite memory (`schema/memory.sql`: `long_term`, `short_term`), two LLM passes:

- **Extraction** — `MemoryStore.extract_memories` (`core/memory.py:266`), runs after each turn with a 120s cooldown; classifies facts into LONG_TERM (written immediately) or SHORT_TERM (TTL-tagged).
- **Consolidation** — `MemoryStore.consolidate_and_cleanup` (`core/memory.py:436`), scheduled via `core/scheduler.py:131`; promotes surviving short-term → long-term and deletes expired rows.
- **Dedup/merge** — `_store_long_term` (`core/memory.py:355`): `SELECT ... WHERE subject = ?`, then Python substring containment; `UPDATE` only when the new content is **longer** than the existing.
- **Injection** — `format_for_prompt` (`core/memory.py:240`) dumps all long-term (top 50 by `updated_at DESC`) + active short-term into every system prompt.

## Problems

| # | Current behaviour | Problem |
|---|---|---|
| 1 | Dedup = exact `subject` + substring | Misses semantic duplicates ("Allergic to shellfish" vs "can't eat prawns"); subject-string variants ("matteo" / "the user") don't match |
| 2 | No contradiction handling | "uses a standing desk" then "switched back to sitting" → two conflicting long-term rows, both injected indefinitely |
| 3 | `UPDATE` only if new content is longer | Length ≠ correctness/recency; a corrected shorter fact never replaces a stale longer one |
| 4 | Extraction writes long-term directly; consolidation compacts separately | Two divergent prompts → inconsistent long-term quality (extraction skips the "strip temporal / compact aggressively" rules consolidation applies) |
| 5 | Injects the 50 most recently updated long-term rows every turn | Doesn't scale; with >50 rows, important older facts are silently truncated by `updated_at DESC`; wastes context + cache |
| 6 | Consolidation only promotes | Never merges existing long-term duplicates or prunes stale ones — long-term only grows |
| 7 | Hard 120s extraction cooldown | Two salient turns <120s apart → the second turn's facts are dropped entirely |
| 8 | Consolidation prompt has no timestamps | The model can't reason about staleness/recency when resolving conflicts |
| 9 | No importance / decay / reinforcement | Re-mentioned facts aren't strengthened; trivia and key facts are treated identically |
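
Problems #1 to #3 can be reproduced with a few lines. The snippet below is an approximation of the write path as described above (exact-`subject` lookup, substring containment, "longer wins"); it is not the code in `core/memory.py`, and the two-column table plus the name `store_long_term_current` are assumptions made for illustration.

```python
import sqlite3


def store_long_term_current(conn, subject, content):
    """Approximate the described heuristic: exact subject, substring, longer wins."""
    rows = conn.execute(
        "SELECT id, content FROM long_term WHERE subject = ?", (subject,)
    ).fetchall()
    for row_id, existing in rows:
        if content in existing:
            return "skipped"  # new text is already contained in an existing row
        if existing in content and len(content) > len(existing):
            conn.execute(
                "UPDATE long_term SET content = ? WHERE id = ?", (content, row_id)
            )
            return "updated"
    conn.execute(
        "INSERT INTO long_term (subject, content) VALUES (?, ?)", (subject, content)
    )
    return "added"


conn = sqlite3.connect(":memory:")
# Simplified, hypothetical schema; the real one lives in schema/memory.sql.
conn.execute(
    "CREATE TABLE long_term (id INTEGER PRIMARY KEY, subject TEXT, content TEXT)"
)

results = [
    store_long_term_current(conn, "matteo", "Allergic to shellfish"),
    store_long_term_current(conn, "matteo", "can't eat prawns"),         # semantic duplicate (#1)
    store_long_term_current(conn, "the user", "Allergic to shellfish"),  # subject variant (#1)
    store_long_term_current(conn, "matteo", "uses a standing desk"),
    store_long_term_current(conn, "matteo", "switched back to sitting"), # contradiction (#2)
]
# Every call above returns "added": five rows for what are really two facts.

store_long_term_current(conn, "matteo", "Works at Acme as a senior engineer")
# A shorter correction is contained in the stale row, so it is dropped (#3).
correction = store_long_term_current(conn, "matteo", "Works at Acme")
```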

## Research — state of the art

Modern agent-memory systems (Mem0, A-MEM, Letta/MemGPT, Zep/Graphiti) converge on a **single unified update pipeline**: for each candidate fact, retrieve the top-k semantically similar existing memories (vector store), then a single LLM call decides **ADD / UPDATE / DELETE / NOOP** — explicitly handling refinement and contradiction. Conflict resolution classifies memory pairs as compatible / contradictory / subsumes / subsumed. Read-time retrieval is **relevance-ranked** (Generative Agents: recency + importance + relevance), not "inject everything". Forgetting is a first-class operation (importance/decay; biological forgetting à la FadeMem). Memory is treated as a managed lifecycle rather than an append-only log.

Key references:
- Mem0 — two-phase extract→update with ADD/UPDATE/DELETE/NOOP over vector-retrieved candidates: https://arxiv.org/html/2504.19413v1
- A-MEM — agentic memory; new notes trigger updates to linked existing notes: https://arxiv.org/pdf/2502.12110
- Agent memory vendor landscape 2026 (Letta, Zep, Mem0, LangMem): https://agentmarketcap.ai/blog/2026/04/10/agent-memory-vendor-landscape-2026-letta-zep-mem0-langmem
- FadeMem — biologically-inspired forgetting/decay: https://arxiv.org/pdf/2601.18642
- Memory mechanisms in LLM agents (overview): https://www.emergentmind.com/topics/memory-mechanisms-in-llm-based-agents

## Implementation plan

### Tier 1 — Unified ADD/UPDATE/DELETE/NOOP update pipeline (no new dependencies)

Goal: fix the write-path correctness gaps (#1–#4, #8) without adding an embedding model.

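- Index `long_term` in an FTS5 virtual table (or use a content/subject `LIKE`/trigram pre-filter if FTS5 is undesirable) for cheap lexical candidate retrieval. FTS5 is part of the SQLite source tree and is enabled in most distributed builds, so this adds no new dependency; it is a compile-time option, though, so availability should be checked at startup.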
- New method `MemoryStore.update_memory(llm, model, candidate)`:
  1. Retrieve top-k existing long-term memories lexically similar to the candidate (subject + content), including `id`, `content`, `category`, `created_at`/`updated_at`.
  2. One LLM call returns one operation: `ADD` (new fact) / `UPDATE {id}` (refine or correct) / `DELETE {id}` (contradicted, no replacement) / `NOOP` (duplicate or trivial).
  3. Apply the operation to `long_term`.
- Route both extraction's LONG_TERM writes and consolidation's promotions through `update_memory`, deleting the divergent substring logic in `_store_long_term` and the bespoke promotion loop in `_run_consolidation_llm`. Single source of truth for long-term writes → consistent compaction quality (fixes #4).
- Include `created_at`/`updated_at` in the decision prompt so the LLM can prefer recent facts on conflict (fixes #8); drop the `len()`-based update heuristic (fixes #3).
- Tests: semantic-ish duplicate → NOOP/UPDATE not ADD; contradiction → DELETE/UPDATE; refinement → UPDATE; genuinely new → ADD; malformed LLM output → safe no-op.
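
The Tier 1 steps above could look roughly like the following. This is a minimal sketch under stated assumptions, not the proposed implementation: the simplified schema, the standalone `long_term_fts` table, the JSON decision format, and the `decide` callable (a stand-in for the LLM call, replacing the `llm, model` arguments) are all hypothetical. It also assumes the local SQLite build has FTS5 enabled.

```python
import json
import re
import sqlite3


def init_db(conn):
    # Simplified, hypothetical schema plus a standalone FTS5 index keyed by rowid.
    conn.executescript("""
        CREATE TABLE long_term (
            id INTEGER PRIMARY KEY, subject TEXT, content TEXT,
            updated_at TEXT DEFAULT CURRENT_TIMESTAMP);
        CREATE VIRTUAL TABLE long_term_fts USING fts5(subject, content);
    """)


def retrieve_similar(conn, subject, content, k=5):
    """Step 1: top-k lexically similar rows, ranked by FTS5's default bm25 order."""
    tokens = sorted(set(re.findall(r"\w+", f"{subject} {content}".lower())))
    if not tokens:
        return []
    query = " OR ".join(f'"{t}"' for t in tokens)  # quoted to avoid query syntax
    ids = [r[0] for r in conn.execute(
        "SELECT rowid FROM long_term_fts WHERE long_term_fts MATCH ? "
        "ORDER BY rank LIMIT ?", (query, k))]
    if not ids:
        return []
    marks = ",".join("?" * len(ids))
    return conn.execute(
        f"SELECT id, subject, content, updated_at FROM long_term "
        f"WHERE id IN ({marks})", ids).fetchall()


def parse_decision(raw, allowed_ids):
    """Step 2: validate the model output; anything malformed becomes NOOP."""
    noop = {"op": "NOOP"}
    try:
        d = json.loads(raw)
    except (TypeError, ValueError):
        return noop
    if not isinstance(d, dict):
        return noop
    op, row_id, text = d.get("op"), d.get("id"), d.get("content")
    has_text = isinstance(text, str) and text.strip() != ""
    known_id = isinstance(row_id, int) and row_id in allowed_ids
    if op == "ADD" and has_text:
        return {"op": "ADD", "content": text.strip()}
    if op == "UPDATE" and known_id and has_text:
        return {"op": "UPDATE", "id": row_id, "content": text.strip()}
    if op == "DELETE" and known_id:
        return {"op": "DELETE", "id": row_id}
    return noop


def update_memory(conn, decide, subject, content, k=5):
    """Steps 1-3: retrieve candidates, ask `decide`, apply one operation."""
    similar = retrieve_similar(conn, subject, content, k)
    d = parse_decision(decide(subject, content, similar), {r[0] for r in similar})
    if d["op"] == "ADD":
        cur = conn.execute(
            "INSERT INTO long_term (subject, content) VALUES (?, ?)",
            (subject, d["content"]))
        conn.execute(
            "INSERT INTO long_term_fts (rowid, subject, content) VALUES (?, ?, ?)",
            (cur.lastrowid, subject, d["content"]))
    elif d["op"] == "UPDATE":
        conn.execute(
            "UPDATE long_term SET content = ?, updated_at = CURRENT_TIMESTAMP "
            "WHERE id = ?", (d["content"], d["id"]))
        conn.execute(
            "UPDATE long_term_fts SET content = ? WHERE rowid = ?",
            (d["content"], d["id"]))
    elif d["op"] == "DELETE":
        conn.execute("DELETE FROM long_term WHERE id = ?", (d["id"],))
        conn.execute("DELETE FROM long_term_fts WHERE rowid = ?", (d["id"],))
    return d["op"]
```

Restricting `UPDATE`/`DELETE` to ids that were actually retrieved means a hallucinated id degrades to a no-op instead of touching an unrelated row, which covers the "malformed LLM output" test case.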

### Tier 2 — Embeddings (semantic similarity + relevance-ranked injection)

- Add an embedding backend: local model (`fastembed`/MiniLM — consistent with the self-hosted, container-bundled approach) or a provider embeddings API. Make it configurable like the other background models.
- Store vectors alongside `long_term` (sqlite-vec, or a blob column + brute-force cosine in Python — trivial at <1k rows).
- Use embeddings for candidate retrieval in the Tier-1 pipeline (replaces/augments FTS5).
- `format_for_prompt`: inject the top-k long-term memories **relevant to the current message** instead of all 50 (fixes #5). Requires passing the inbound message into prompt building for the memory section.
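
For the blob-column option, brute-force cosine needs only the standard library. The sketch below assumes a hypothetical `embedding BLOB` column holding packed float32 values and uses toy 3-dimensional vectors in place of real model output; `pack`, `unpack`, and `top_k_relevant` are illustrative names.

```python
import math
import sqlite3
from array import array


def pack(vec):
    """Serialise a vector as float32 bytes for a BLOB column."""
    return array("f", vec).tobytes()


def unpack(blob):
    vec = array("f")
    vec.frombytes(blob)
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_relevant(conn, query_vec, k=5):
    """Score every stored vector against the query; fine at <1k rows."""
    rows = conn.execute(
        "SELECT id, content, embedding FROM long_term WHERE embedding IS NOT NULL"
    ).fetchall()
    scored = [(cosine(query_vec, unpack(blob)), row_id, content)
              for row_id, content, blob in rows]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


conn = sqlite3.connect(":memory:")
# Hypothetical column layout; the vectors are toy values, not real embeddings.
conn.execute(
    "CREATE TABLE long_term (id INTEGER PRIMARY KEY, content TEXT, embedding BLOB)"
)
conn.executemany(
    "INSERT INTO long_term (content, embedding) VALUES (?, ?)",
    [
        ("Allergic to shellfish", pack([1.0, 0.0, 0.0])),
        ("uses a standing desk", pack([0.0, 1.0, 0.0])),
        ("can't eat prawns", pack([0.8, 0.2, 0.0])),
    ],
)

# A query vector close to the two food-related rows ranks them first.
hits = top_k_relevant(conn, [0.9, 0.1, 0.0], k=2)
```

The same `top_k_relevant` call could serve both candidate retrieval in the Tier 1 pipeline and the relevance-ranked selection in `format_for_prompt`, with the inbound message's embedding as `query_vec`.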

### Tier 3 — Forgetting / importance / reinforcement

- Schema: add `importance` (LLM- or rule-assigned), `last_accessed`, `access_count` to `long_term`.
- Reinforce on recall (bump `last_accessed`/`access_count`); compute a Generative-Agents-style retrieval score = recency + importance + relevance.
- Decay/archive cold, low-importance memories instead of unbounded growth (fixes #9). Consider an `archived` flag rather than hard delete.
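
One way to combine these fields is sketched below. The equal weighting, the 30-day half-life, the 0 to 1 importance scale, and the archive threshold are all assumptions chosen for illustration; none of them come from the existing code, and they would need tuning against real data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Memory:
    content: str
    importance: float          # assumed to be normalised to 0..1
    last_accessed: datetime
    access_count: int = 0


def recency(memory, now, half_life_days=30.0):
    """Exponential decay: 1.0 when just accessed, 0.5 after one half-life."""
    age_days = max((now - memory.last_accessed).total_seconds(), 0.0) / 86400
    return 0.5 ** (age_days / half_life_days)


def retrieval_score(memory, relevance, now):
    """Generative-Agents-style sum of recency, importance, and relevance."""
    return recency(memory, now) + memory.importance + relevance


def reinforce(memory, now):
    """Bump the access bookkeeping whenever a memory is recalled."""
    memory.last_accessed = now
    memory.access_count += 1


def should_archive(memory, now, threshold=0.7):
    """Flag cold, low-importance memories; relevance is query-specific, so omitted."""
    return recency(memory, now) + memory.importance < threshold


now = datetime(2026, 1, 31, tzinfo=timezone.utc)
key_fact = Memory("Allergic to shellfish", importance=0.9,
                  last_accessed=now - timedelta(days=90))
trivia = Memory("Mentioned liking a song", importance=0.1,
                last_accessed=now - timedelta(days=30))

archive_candidates = [m.content for m in (key_fact, trivia)
                      if should_archive(m, now)]
```

With these numbers only the trivia row is flagged: the key fact is older but its importance keeps it above the threshold, which is the distinction problem #9 asks for.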

### Tier 4 — Long-term hygiene pass

- Periodic pass (in `consolidate_and_cleanup`) that clusters near-duplicate long-term rows, merges them, and resolves contradictions by keeping the most recent row (fixes #6). Reuse the Tier 1 decision prompt (ADD/UPDATE/DELETE/NOOP) to classify each pair rather than writing a third prompt.
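
The clustering step might be sketched as follows. `difflib.SequenceMatcher` is used here purely as a lexical stand-in so the example runs without an embedding model; it will not catch semantic duplicates or contradictions, which would still go through the LLM decision. The function names and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive lexical similarity in 0..1 (stand-in for cosine)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_near_duplicates(rows, threshold=0.8):
    """Greedy clustering: a row joins the first cluster with a close member.

    rows: iterable of (id, content, updated_at) tuples.
    """
    clusters = []
    for row in rows:
        for cluster in clusters:
            if any(similarity(row[1], other[1]) >= threshold for other in cluster):
                cluster.append(row)
                break
        else:
            clusters.append([row])
    return clusters


def plan_hygiene(rows, threshold=0.8):
    """Return (keep_ids, delete_ids), keeping the newest row of each cluster."""
    keep, delete = [], []
    for cluster in cluster_near_duplicates(rows, threshold):
        # ISO-8601 timestamps sort correctly as plain strings.
        newest = max(cluster, key=lambda r: r[2])
        keep.append(newest[0])
        delete.extend(r[0] for r in cluster if r[0] != newest[0])
    return sorted(keep), sorted(delete)


rows = [
    (1, "Allergic to shellfish", "2025-01-01T00:00:00"),
    (2, "allergic to shellfish.", "2025-03-01T00:00:00"),
    (3, "Uses a standing desk", "2025-02-01T00:00:00"),
]
keep_ids, delete_ids = plan_hygiene(rows)
```

Returning a plan instead of deleting in place leaves room for the `archived` flag suggested in Tier 3, and makes the pass easy to dry-run.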

### Smaller fixes (fold into Tier 1)

- Replace the hard 120s extraction cooldown with a rolling-summary buffer so back-to-back salient turns aren't dropped (#7), or at minimum buffer skipped turns for the next extraction.
- Normalise `subject` (lowercase/canonicalise) in code, not only via prompt instruction.
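
Both small fixes fit in a few lines. In this sketch the alias map, `normalise_subject`, and `ExtractionBuffer` are hypothetical names; it implements the "at minimum" variant (buffer skipped turns for the next extraction) rather than a rolling summary.

```python
import time
import unicodedata


def normalise_subject(subject, aliases=None):
    """Casefold, collapse whitespace, then map known aliases to one canonical key."""
    key = " ".join(unicodedata.normalize("NFKC", subject).casefold().split())
    return (aliases or {}).get(key, key)


class ExtractionBuffer:
    """Keep turns that arrive during the cooldown instead of dropping them."""

    def __init__(self, cooldown_s=120.0, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._pending = []
        self._last_run = None

    def offer(self, turn):
        """Queue a turn; return the batch to extract, or None while cooling down."""
        self._pending.append(turn)
        now = self.clock()
        if self._last_run is not None and now - self._last_run < self.cooldown_s:
            return None
        self._last_run = now
        batch, self._pending = self._pending, []
        return batch

    def flush(self):
        """Drain whatever is still queued, e.g. from the scheduled consolidation."""
        batch, self._pending = self._pending, []
        return batch


# Hypothetical per-deployment alias map.
ALIASES = {"the user": "matteo"}
same_key = normalise_subject("  The  User ", ALIASES) == normalise_subject("Matteo", ALIASES)
```

A turn that arrives inside the cooldown is returned together with the next turn that triggers extraction; calling `flush()` from the consolidation job covers the case where no further turn arrives.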

## Suggested sequencing

Tier 1 first (highest correctness-per-effort, no new deps), then Tier 2 once memory volume justifies semantic retrieval, then Tiers 3–4 as separate PRs.

