Merged
25 changes: 25 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,31 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.1.70] - 2026-03-28

### Quality Improvements
- **+41% composite quality score** via Layer 1 AutoResearch optimization
- ChainOfThought for DSPy extraction pipeline (biggest single improvement)
- Explicit dedup classification thresholds (0.7/0.4) in sync prompt
- Improved MemoryCandidate schema field descriptions for better output consistency
- Tighter post-extraction body filter (30→50 chars minimum)

### Evaluation Infrastructure
- 4 new eval runners: dedup accuracy, maintain quality, search relevance (NDCG@5), tool selection
- LerimBench 7-dimension composite scoring with configurable weights
- Fuzzy title matching for dedup accuracy (substring + Jaccard similarity)
- Golden dataset support via `--golden-dir` flag
- Deterministic extraction and summarization assertion checkers

### Dashboard
- Local bundled dashboard removed — web UI moving to https://lerim.dev
- `lerim dashboard` shows transition message with CLI alternatives
- API server remains for Docker container health checks

### Cleanup
- Removed stale Codex tool references from ask prompt
- Cleaned up ResponsesProxy references in internal docs

## [0.1.69] - 2026-03-25

### Breaking
3 changes: 2 additions & 1 deletion README.md
@@ -206,7 +206,7 @@ For Docker deployments, set `ollama = "http://host.docker.internal:11434"` in `[

## Web UI (Lerim Cloud)

The browser UI (sessions, memories, pipeline, settings) lives in **[lerim-cloud](https://github.com/lerim-dev/lerim-cloud)** and is served from **[lerim.dev](https://lerim.dev)**. The `lerim` daemon still exposes a **JSON API** on `http://localhost:8765` for the CLI and for Cloud to talk to your local runtime when connected.
The local bundled dashboard was removed in v0.1.70; all web UI features (sessions, memories, pipeline, settings) now live in **[Lerim Cloud](https://lerim.dev)**, served from **[lerim.dev](https://lerim.dev)**. The `lerim` daemon still exposes a **JSON API** on `http://localhost:8765` for the CLI and for Cloud to talk to your local runtime when connected. Running `lerim dashboard` prints a transition message with CLI alternatives.

## CLI reference

@@ -231,6 +231,7 @@ lerim ask "Why did we choose this?" # query memories
lerim sync # one-shot: sync sessions + extract
lerim maintain # one-shot: merge, archive, decay
lerim status # runtime state
lerim queue # show pending session queue

# Local commands (run on host, no server needed)
lerim memory search "auth pattern" # keyword search
85 changes: 78 additions & 7 deletions evals/README.md
@@ -9,24 +9,91 @@ or `<repo>/.lerim/`.

## Pipelines

| Pipeline | Runner | What it tests |
|----------|--------|---------------|
| Extraction | `run_extraction.py` | DSPy memory candidate extraction from session traces |
| Summarization | `run_summarization.py` | DSPy structured session summary generation |
| Lifecycle | `run_lifecycle.py` | Full sync + maintain flow (accumulation, dedup, merge, archive) |
| Pipeline | Runner | What it tests | Scoring |
|----------|--------|---------------|---------|
| Extraction | `run_extraction.py` | DSPy memory candidate extraction from session traces | Judge (4-dim) |
| Summarization | `run_summarization.py` | DSPy structured session summary generation | Judge (4-dim) |
| Lifecycle | `run_lifecycle.py` | Full sync + maintain flow (accumulation, dedup, merge, archive) | Judge (4-dim) |
| Dedup | `run_dedup.py` | Dedup classification accuracy against golden assertions | Deterministic + Judge |
| Maintain | `run_maintain.py` | Isolated maintain quality (archive/merge precision) | Deterministic + Judge |
| Search | `run_search.py` | Search relevance via NDCG@5 | Deterministic (NDCG) |
| Tool Selection | `run_tool_selection.py` | Tool call sequence accuracy from agent traces | Deterministic |
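
Dedup accuracy matches predicted memory titles against golden titles with fuzzy matching (substring containment plus word-level Jaccard similarity). A minimal sketch of that check -- the helper names and the 0.5 threshold are illustrative, not the exact `scores.py` implementation:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two titles."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def titles_match(pred: str, gold: str, threshold: float = 0.5) -> bool:
    """A predicted title matches a golden title if either contains the
    other, or their word overlap clears the (assumed) threshold."""
    p, g = pred.lower().strip(), gold.lower().strip()
    if p in g or g in p:  # substring match
        return True
    return jaccard(p, g) >= threshold

print(titles_match("Use SQLite FTS5 for search",
                   "use sqlite fts5 for memory search"))  # → True (Jaccard 5/6)
```

The substring branch catches truncated titles; Jaccard catches reorderings and minor wording drift without an embedding model.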

## Quick start

```bash
# Run any pipeline eval (--config is required)
# Trace-based evals (--config is required)
PYTHONPATH=. python evals/run_extraction.py --config evals/configs/eval_minimax_m25.toml
PYTHONPATH=. python evals/run_summarization.py --config evals/configs/eval_minimax_m25.toml
PYTHONPATH=. python evals/run_lifecycle.py --config evals/configs/eval_minimax_m25.toml --limit 5 --maintain-every 3

# Golden dataset evals (--golden-dir is required)
PYTHONPATH=. python evals/run_dedup.py --config evals/configs/eval_minimax_m25.toml --golden-dir path/to/golden/dedup/
PYTHONPATH=. python evals/run_maintain.py --config evals/configs/eval_minimax_m25.toml --golden-dir path/to/golden/maintain/
PYTHONPATH=. python evals/run_search.py --golden-dir path/to/golden/search/
PYTHONPATH=. python evals/run_tool_selection.py --golden-dir path/to/golden/tool_selection/

# Compare results across runs
PYTHONPATH=. python evals/compare.py
```
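
`run_search.py` scores with NDCG@5. For reference, binary-relevance NDCG@k can be computed as below; this is the standard formulation, and the in-repo version in `scores.py` may differ in detail:

```python
import math

def ndcg_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Binary-relevance NDCG@k: DCG of the returned ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 2)  # gain 1 per relevant hit, discounted by rank
        for i, mid in enumerate(ranked_ids[:k])
        if mid in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

print(ndcg_at_k(["m1", "m2", "m3"], {"m1", "m2"}))  # → 1.0 (perfect ranking)
```

A relevant memory buried at rank 5 still contributes, just with a log-discounted gain, so NDCG rewards ranking quality and not only recall.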

## Golden Dataset

The `--golden-dir` flag points to a directory of golden test cases. Each case is a
subdirectory with an `input/` and `expected/` split:

```
golden/dedup/
case_001/
input/
trace.jsonl # session trace
memory_store/ # pre-populated memories
decisions/
learnings/
expected/
assertions.json # golden assertions
case_002/
...

golden/maintain/
case_001/
input/
memory_store/ # pre-populated memories
expected/
assertions.json # {"should_archive": [...], "should_merge": [[...]], "should_keep": [...]}

golden/search/
case_001/
input/
memory_store/ # memories to index
queries.json # [{"query": "...", "relevant_memory_ids": ["id1", "id2"]}]

golden/tool_selection/
case_001/
input/
agent_trace.json # OAI agent trace with tool calls
expected/
assertions.json # {"expected_sequence": [...], "must_not_call": [...]}
```

Golden datasets are not checked into git. Build them from real runs or create them
manually for regression testing.
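
Loading a case is straightforward with the stdlib; the helpers below are a hypothetical sketch of the `input/` + `expected/` split described above, not code from the eval runners:

```python
import json
from pathlib import Path

def load_golden_case(case_dir: Path) -> dict:
    """Load one golden case: expected/assertions.json plus the input dir.

    Hypothetical helper -- the eval runners may structure this differently.
    """
    input_dir = case_dir / "input"
    assertions_path = case_dir / "expected" / "assertions.json"
    if not input_dir.is_dir():
        raise FileNotFoundError(f"missing input/ in {case_dir}")
    expected = json.loads(assertions_path.read_text()) if assertions_path.exists() else {}
    return {"name": case_dir.name, "input_dir": input_dir, "expected": expected}

def iter_golden_cases(golden_dir: Path):
    """Yield case_* subdirectories in sorted order."""
    for case_dir in sorted(golden_dir.iterdir()):
        if case_dir.is_dir():
            yield load_golden_case(case_dir)
```

Sorting case directories keeps runs deterministic, which matters when comparing scores across models with `compare.py`.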

## LerimBench (7 dimensions)

The `scores.py` module provides `LerimBenchScore` and `compute_lerim_bench_composite()`
for computing a weighted composite across all 7 evaluation dimensions:

| Dimension | Weight | Source |
|-----------|--------|--------|
| extraction_precision | 0.20 | `run_extraction.py` judge |
| extraction_recall | 0.20 | `run_extraction.py` judge |
| dedup_accuracy | 0.15 | `run_dedup.py` deterministic |
| consolidation_quality | 0.15 | `run_maintain.py` judge |
| archive_precision | 0.10 | `run_maintain.py` deterministic |
| search_relevance | 0.15 | `run_search.py` NDCG@5 |
| scale_degradation | 0.05 | Lifecycle regression ratio |
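
The composite is a weighted sum over whichever dimensions a run produced. A minimal sketch using the weights above (the actual `compute_lerim_bench_composite()` signature may differ):

```python
# Weights from the table above; the real scores.py may expose these differently.
LERIM_BENCH_WEIGHTS = {
    "extraction_precision": 0.20,
    "extraction_recall": 0.20,
    "dedup_accuracy": 0.15,
    "consolidation_quality": 0.15,
    "archive_precision": 0.10,
    "search_relevance": 0.15,
    "scale_degradation": 0.05,
}

def composite(scores: dict[str, float],
              weights: dict[str, float] = LERIM_BENCH_WEIGHTS) -> float:
    """Weighted mean over the dimensions present, renormalized when some
    dimensions are missing so partial runs still score on a 0-1 scale."""
    present = {k: w for k, w in weights.items() if k in scores}
    total = sum(present.values())
    if not total:
        return 0.0
    return sum(scores[k] * w for k, w in present.items()) / total

print(composite({k: 1.0 for k in LERIM_BENCH_WEIGHTS}))  # → 1.0
```

Whether missing dimensions should renormalize (as here) or count as zero is a design choice; renormalizing avoids penalizing a run that simply skipped an eval.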

## Configuration

Each eval requires a TOML config (`--config`). Configs live in `evals/configs/`.
@@ -69,8 +136,12 @@ Configure in `evals/dataset/config.toml`.

| File | Purpose |
|------|---------|
| `scores.py` | Deterministic checks and composite score computation |
| `scores.py` | Deterministic checks, LerimBench composite, NDCG, dedup/archive accuracy |
| `judge.py` | Coding agent CLI judge wrapper |
| `compare.py` | Cross-run comparison table |
| `run_dedup.py` | Dedup accuracy eval against golden datasets |
| `run_maintain.py` | Isolated maintain eval against golden datasets |
| `run_search.py` | Search relevance eval (NDCG@5) against golden datasets |
| `run_tool_selection.py` | Tool selection accuracy from agent traces |
| `scripts/bench_models.sh` | Multi-model benchmark runner |
| `dataset/build.py` | Dataset pipeline entry point |
3 changes: 1 addition & 2 deletions evals/common.py
@@ -33,7 +33,7 @@ def configure_dspy_from_eval(config: dict, prefix: str = "lerim_eval_") -> tuple

Returns (Config, temp_dir_path).
"""
REQUIRED_SECTIONS = ("lead", "explorer", "extraction", "summarization")
REQUIRED_SECTIONS = ("lead", "extraction", "summarization")
missing = [s for s in REQUIRED_SECTIONS if s not in config]
if missing:
raise ValueError(
@@ -43,7 +43,6 @@ def configure_dspy_from_eval(config: dict, prefix: str = "lerim_eval_") -> tuple

section_to_role = {
"lead": "lead",
"explorer": "explorer",
"extraction": "extract",
"summarization": "summarize",
}
43 changes: 43 additions & 0 deletions evals/judge_prompts/dedup.md
@@ -0,0 +1,43 @@
# Dedup Quality Judge

You are evaluating the quality of dedup (deduplication) decisions made during a memory sync run.

## Context

- **Original trace**: `{trace_path}` -- the coding session being synced
- **Memory root**: `{memory_root}` -- existing memory files (decisions/, learnings/)
- **Predicted actions**: see below
- **Golden assertions**: see below

## Instructions

Use your Read and search tools to investigate the files above. Do NOT load entire files into context -- read strategically:

1. **Read a few memory files** in `{memory_root}/decisions/` and `{memory_root}/learnings/` to understand the existing memory state.
2. **Compare predicted actions** against golden assertions to see where classifications diverge.
3. **Read the original trace** at `{trace_path}` to verify whether add/update/no_op decisions make sense given the session content.

## Predicted Actions

```json
{predictions}
```

## Golden Assertions

```json
{golden}
```

## Scoring (each 0.0 to 1.0)

- **completeness** (weight 0.25): Did dedup find all duplicate/overlapping candidates? Were all candidates in the golden set properly classified? 1.0 = no missed duplicates.
- **faithfulness** (weight 0.25): Are dedup decisions grounded in actual memory content? Are update decisions justified by real overlap between candidate and existing memory? 1.0 = all decisions evidence-based.
- **coherence** (weight 0.20): Is the reasoning behind dedup decisions clear and consistent? Do add/update/no_op classifications follow a coherent strategy? 1.0 = excellent reasoning.
- **precision** (weight 0.30): No false-positive duplicates? Items classified as no_op or update should genuinely overlap with existing memories. Penalize marking distinct candidates as duplicates. 1.0 = no incorrect dedup matches.

## Response Format

Return ONLY valid JSON (no markdown fences, no extra text):

{"completeness": 0.0, "faithfulness": 0.0, "coherence": 0.0, "precision": 0.0, "reasoning": "Brief explanation of scores."}
40 changes: 40 additions & 0 deletions evals/judge_prompts/maintain_isolated.md
@@ -0,0 +1,40 @@
# Isolated Maintain Quality Judge

You are evaluating the quality of an isolated memory maintenance run against a golden dataset with known expected outcomes.

## Context

- **Memory root**: `{memory_root}` -- memory files (decisions/, learnings/, archived/)
- **Run folder**: `{run_folder}` -- maintain artifacts (maintain_actions.json)
- **Memories before maintain**: {before_count}
- **Memories after maintain**: {after_count}
- **Golden assertions**: see below

## Golden Assertions

```json
{assertions}
```

## Instructions

Use your Read and search tools to investigate the files above. Do NOT load entire files into context -- read strategically:

1. **List memory files** in `{memory_root}/decisions/` and `{memory_root}/learnings/` to see the post-maintain state.
2. **Read maintain_actions.json** in `{run_folder}` -- check what actions were taken (merge, archive, consolidate, unchanged).
3. **Check archived/** directory at `{memory_root}/archived/` for newly archived files. Cross-reference with should_archive list.
4. **Sample memory files** to verify merge decisions preserved important information from both sources.
5. **Compare against assertions**: Were should_archive items archived? Were should_merge items merged? Were should_keep items left untouched?

## Scoring (each 0.0 to 1.0)

- **completeness** (weight 0.25): Did maintenance find all merge and archive opportunities listed in the golden assertions? Were all memory files reviewed? Were should_merge groups actually merged? 1.0 = no missed opportunities.
- **faithfulness** (weight 0.25): Are maintenance actions reasonable? Do merges preserve important information from both originals? Are archive decisions justified (not discarding valuable content)? 1.0 = all actions correct.
- **coherence** (weight 0.20): Is the final memory store well-organized after maintenance? Do merged memories read naturally? Is the maintain report well-structured with clear reasoning? 1.0 = excellent coherence.
- **precision** (weight 0.30): Did maintenance correctly avoid archiving should_keep items? Were no valuable memories incorrectly archived or merged away? Reward archiving genuinely low-quality memories. 1.0 = no incorrect maintenance actions.

## Response Format

Return ONLY valid JSON (no markdown fences, no extra text):

{"completeness": 0.0, "faithfulness": 0.0, "coherence": 0.0, "precision": 0.0, "reasoning": "Brief explanation of scores."}
43 changes: 43 additions & 0 deletions evals/judge_prompts/search.md
@@ -0,0 +1,43 @@
# Search Relevance Judge

You are evaluating the quality of memory search results returned by the hybrid FTS5 + vector search index.

## Context

- **Memory root**: `{memory_root}` -- indexed memory files (decisions/, learnings/)
- **Query**: `{query}`
- **Returned results** (ranked): see below
- **Known relevant memories**: see below

## Returned Results

```json
{results}
```

## Known Relevant Memories

```json
{relevant}
```

## Instructions

Use your Read and search tools to investigate the files above. Do NOT load entire files into context -- read strategically:

1. **Read the returned memory files** to verify they actually match the query intent.
2. **Read the known relevant memories** (by their file paths) to understand what should have been returned.
3. **Check ranking order**: Are the most relevant results ranked highest?

## Scoring (each 0.0 to 1.0)

- **completeness** (weight 0.25): Did the search find all known relevant memories within the top results? 1.0 = all relevant memories appeared in results.
- **faithfulness** (weight 0.25): Do the returned results actually match the query semantically? Are they genuinely about the topic being searched? 1.0 = all results are on-topic.
- **coherence** (weight 0.20): Is the ranking order reasonable? Are the most relevant results ranked first? 1.0 = perfect ranking.
- **precision** (weight 0.30): Are there irrelevant results in the top-5? Penalize results that do not relate to the query at all. 1.0 = no irrelevant results in top-5.

## Response Format

Return ONLY valid JSON (no markdown fences, no extra text):

{"completeness": 0.0, "faithfulness": 0.0, "coherence": 0.0, "precision": 0.0, "reasoning": "Brief explanation of scores."}
50 changes: 50 additions & 0 deletions evals/judge_prompts/tool_selection.md
@@ -0,0 +1,50 @@
# Tool Selection Quality Judge

You are evaluating whether the lerim agent selected the correct tools in the correct order during a sync or maintain run.

## Context

- **Agent trace**: `{agent_trace_path}` -- OpenAI Agents SDK run history with tool calls and results
- **Expected tool sequence**: see below
- **Forbidden tools**: see below
- **Actual tool calls**: see below

## Expected Sequence

```json
{expected_sequence}
```

## Forbidden Tools (must_not_call)

```json
{must_not_call}
```

## Actual Tool Calls

```json
{actual_calls}
```

## Instructions

Use your Read tool to examine the agent trace at `{agent_trace_path}` if needed for deeper context.

1. **Compare tool ordering**: Did the agent call tools in the expected order? Extract/summarize first, then dedup, then classify, then write.
2. **Check forbidden calls**: Were any must_not_call tools invoked? This is a hard penalty.
3. **Verify tool arguments**: Were the arguments passed to each tool reasonable for the task?
4. **Check for unnecessary calls**: Did the agent make redundant or wasted tool calls?

## Scoring (each 0.0 to 1.0)

- **completeness** (weight 0.25): Were all necessary tools called? Did the agent complete the full pipeline without skipping steps? 1.0 = all expected tools were called.
- **faithfulness** (weight 0.25): Were tool arguments correct and matched to the task? Did the agent pass appropriate data between tools? 1.0 = all arguments well-formed and task-appropriate.
- **coherence** (weight 0.20): Was the tool ordering logical? Did the agent follow the expected pipeline sequence? 1.0 = perfect ordering.
- **precision** (weight 0.30): Were there unnecessary tool calls or forbidden tool invocations? Penalize redundant calls and must_not_call violations heavily. 1.0 = no wasted or forbidden calls.

## Response Format

Return ONLY valid JSON (no markdown fences, no extra text):

{"completeness": 0.0, "faithfulness": 0.0, "coherence": 0.0, "precision": 0.0, "reasoning": "Brief explanation of scores."}