Improve memory consolidation: unified ADD/UPDATE/DELETE/NOOP pipeline + semantic dedup #5

Description

@mattmezza

Summary

Overhaul the memory consolidation/dedup pipeline. The current write path uses exact-subject + substring matching and a "longer wins" update heuristic, which misses semantic duplicates, never resolves contradictions, and lets stale long-term facts accumulate. This proposes moving to the now-standard unified update pipeline (retrieve similar → LLM decides ADD/UPDATE/DELETE/NOOP), staged so Tier 1 ships with no new dependencies.

Current implementation

Two-tier SQLite memory (schema/memory.sql: long_term, short_term), two LLM passes:

  • Extraction: MemoryStore.extract_memories (core/memory.py:266) runs after each turn with a 120s cooldown; classifies facts into LONG_TERM (written immediately) or SHORT_TERM (TTL-tagged).
  • Consolidation: MemoryStore.consolidate_and_cleanup (core/memory.py:436), scheduled via core/scheduler.py:131; promotes surviving short-term → long-term and deletes expired rows.
  • Dedup/merge: _store_long_term (core/memory.py:355) runs SELECT ... WHERE subject = ?, then Python substring containment; it UPDATEs only when the new content is longer than the existing content.
  • Injection: format_for_prompt (core/memory.py:240) dumps all long-term (top 50 by updated_at DESC) + active short-term into every system prompt.
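
To make the failure modes concrete, here is a rough reconstruction of the dedup/merge heuristic as described in the bullets above. It is not the code in core/memory.py: the function name store_long_term, the in-memory rows list standing in for the long_term table, and the return strings are all assumptions made for illustration.

```python
def store_long_term(rows, subject, content):
    """Approximate the current write path: exact subject + substring + "longer wins".

    `rows` is a list of {"subject", "content"} dicts standing in for long_term.
    """
    for row in rows:
        if row["subject"] != subject:  # exact subject match only
            continue
        existing = row["content"]
        if content in existing or existing in content:  # substring containment
            if len(content) > len(existing):  # "longer wins"
                row["content"] = content
                return "UPDATE"
            return "NOOP"
    rows.append({"subject": subject, "content": content})
    return "ADD"


rows = []
store_long_term(rows, "user", "Allergic to shellfish")      # new row
store_long_term(rows, "user", "can't eat prawns")           # semantic duplicate, stored anyway
store_long_term(rows, "the user", "Allergic to shellfish")  # subject variant, stored anyway
```

Under these assumptions all three calls add a row, which is the behaviour the Problems section describes for semantic duplicates and subject-string variants.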

Problems

| # | Current behaviour | Problem |
|---|---|---|
| 1 | Dedup = exact subject + substring | Misses semantic duplicates ("Allergic to shellfish" vs "can't eat prawns"); subject-string variants ("matteo" / "the user") don't match |
| 2 | No contradiction handling | "uses a standing desk" then "switched back to sitting" → two conflicting long-term rows, both injected indefinitely |
| 3 | UPDATE only if new content is longer | Length ≠ correctness/recency; a corrected shorter fact never replaces a stale longer one |
| 4 | Extraction writes long-term directly; consolidation compacts separately | Two divergent prompts → inconsistent long-term quality (extraction skips the "strip temporal / compact aggressively" rules consolidation applies) |
| 5 | Injects all top-50 long-term every turn | Doesn't scale; with >50 rows, important older facts are silently truncated by updated_at DESC; wastes context + cache |
| 6 | Consolidation only promotes | Never merges existing long-term duplicates or prunes stale ones; long-term only grows |
| 7 | Hard 120s extraction cooldown | Two salient turns <120s apart → the second turn's facts are dropped entirely |
| 8 | Consolidation prompt has no timestamps | The model can't reason about staleness/recency when resolving conflicts |
| 9 | No importance / decay / reinforcement | Re-mentioned facts aren't strengthened; trivia and key facts are treated identically |

Research — state of the art

Modern agent-memory systems (Mem0, A-MEM, Letta/MemGPT, Zep/Graphiti) converge on a single unified update pipeline: for each candidate fact, retrieve the top-k semantically similar existing memories (vector store), then a single LLM call decides ADD / UPDATE / DELETE / NOOP — explicitly handling refinement and contradiction. Conflict resolution classifies memory pairs as compatible / contradictory / subsumes / subsumed. Read-time retrieval is relevance-ranked (Generative Agents: recency + importance + relevance), not "inject everything". Forgetting is a first-class operation (importance/decay; biological forgetting à la FadeMem). Memory is treated as a managed lifecycle rather than an append-only log.

Key references:

Implementation plan

Tier 1 — Unified ADD/UPDATE/DELETE/NOOP update pipeline (no new dependencies)

Goal: fix the write-path correctness gaps (problems #1–#4 and #8) without adding an embedding model.

  • Add long_term to FTS5 (or a content/subject LIKE/trigram pre-filter if FTS5 is undesirable) for cheap lexical candidate retrieval. FTS5 ships with SQLite — no new dependency.
  • New method MemoryStore.update_memory(llm, model, candidate):
    1. Retrieve top-k existing long-term memories lexically similar to the candidate (subject + content), including id, content, category, created_at/updated_at.
    2. One LLM call returns one operation: ADD (new fact) / UPDATE {id} (refine or correct) / DELETE {id} (contradicted, no replacement) / NOOP (duplicate or trivial).
    3. Apply the operation to long_term.
  • Route both extraction's LONG_TERM writes and consolidation's promotions through update_memory, deleting the divergent substring logic in _store_long_term and the bespoke promotion loop in _run_consolidation_llm. Single source of truth for long-term writes → consistent compaction quality (fixes problem #4).
  • Include created_at/updated_at in the decision prompt so the LLM can prefer recent facts on conflict (fixes problem #8); drop the len()-based update heuristic (fixes problem #3).
  • Tests: semantic-ish duplicate → NOOP/UPDATE not ADD; contradiction → DELETE/UPDATE; refinement → UPDATE; genuinely new → ADD; malformed LLM output → safe no-op.

Tier 2 — Embeddings (semantic similarity + relevance-ranked injection)

  • Add an embedding backend: local model (fastembed/MiniLM — consistent with the self-hosted, container-bundled approach) or a provider embeddings API. Make it configurable like the other background models.
  • Store vectors alongside long_term (sqlite-vec, or a blob column + brute-force cosine in Python — trivial at <1k rows).
  • Use embeddings for candidate retrieval in the Tier-1 pipeline (replaces/augments FTS5).
  • format_for_prompt: inject the top-k long-term memories relevant to the current message instead of all 50 (fixes problem #5). Requires passing the inbound message into prompt building for the memory section.
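
As a sketch of the "blob column + brute-force cosine in Python" option, assuming a hypothetical embedding BLOB column on long_term that stores float32 values. The embedding backend itself is out of scope here, so the functions take precomputed vectors; to_blob, from_blob and top_k are illustrative names, not existing code.

```python
import math
import sqlite3
from array import array


def to_blob(vec):
    """Pack a list of floats as float32 bytes for a BLOB column."""
    return array("f", vec).tobytes()


def from_blob(blob):
    """Unpack float32 bytes back into a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(conn, query_vec, k=5):
    """Score every embedded row against query_vec; return the k best as (score, id, content)."""
    scored = [
        (cosine(query_vec, from_blob(blob)), row_id, content)
        for row_id, content, blob in conn.execute(
            "SELECT id, content, embedding FROM long_term "
            "WHERE embedding IS NOT NULL")
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```

format_for_prompt could then embed the inbound message and pass the result as query_vec, injecting only the returned rows. A full scan like this is linear in the number of rows, which fits the "<1k rows" assumption above; sqlite-vec would be the alternative if the table grows.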

Tier 3 — Forgetting / importance / reinforcement

  • Schema: add importance (LLM- or rule-assigned), last_accessed, access_count to long_term.
  • Reinforce on recall (bump last_accessed/access_count); compute a Generative-Agents-style retrieval score = recency + importance + relevance.
  • Decay/archive cold, low-importance memories instead of unbounded growth (fixes problem #9). Consider an archived flag rather than hard delete.
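
One possible shape for the retrieval score and reinforce-on-recall, sketched under assumptions: importance and relevance are taken to be pre-normalised to [0, 1], the three terms are summed with equal weight, and recency is an exponential decay over hours since last access. The decay_per_hour value, the Memory dataclass, and the function names are illustrative choices, not part of the proposal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Memory:
    content: str
    importance: float        # assumed normalised to [0, 1]
    last_accessed: datetime
    access_count: int = 0


def retrieval_score(mem, relevance, now, decay_per_hour=0.995):
    """recency + importance + relevance, each roughly in [0, 1]."""
    hours = max((now - mem.last_accessed).total_seconds() / 3600, 0.0)
    recency = decay_per_hour ** hours
    return recency + mem.importance + relevance


def recall(memories, relevances, now, k=3):
    """Return the k best-scoring memories and reinforce them."""
    ranked = sorted(
        zip(memories, relevances),
        key=lambda pair: retrieval_score(pair[0], pair[1], now),
        reverse=True,
    )[:k]
    for mem, _ in ranked:
        mem.last_accessed = now  # reinforcement: recalled memories stay warm
        mem.access_count += 1
    return [mem for mem, _ in ranked]
```

Memories that are never recalled keep losing recency, so a periodic pass could archive rows whose recency plus importance falls below some threshold; that threshold would need tuning against real data.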

Tier 4 — Long-term hygiene pass

Smaller fixes (fold into Tier 1)

Suggested sequencing

Tier 1 first (highest correctness-per-effort, no new deps), then Tier 2 once memory volume justifies semantic retrieval, then Tiers 3–4 as separate PRs.
