Commit 398d08e

Merge pull request #19 from lerim-dev/feat/autoresearch-optimization

feat: v0.1.70 — AutoResearch optimization + eval infrastructure

2 parents a557ff6 + 98348a7, commit 398d08e

25 files changed: 2129 additions & 51 deletions

CHANGELOG.md

Lines changed: 25 additions & 0 deletions

@@ -5,6 +5,31 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.1.70] - 2026-03-28

### Quality Improvements
- **+41% composite quality score** via Layer 1 AutoResearch optimization
- ChainOfThought for DSPy extraction pipeline (biggest single improvement)
- Explicit dedup classification thresholds (0.7/0.4) in sync prompt
- Improved MemoryCandidate schema field descriptions for better output consistency
- Tighter post-extraction body filter (30→50 chars minimum)

### Evaluation Infrastructure
- 4 new eval runners: dedup accuracy, maintain quality, search relevance (NDCG@5), tool selection
- LerimBench 7-dimension composite scoring with configurable weights
- Fuzzy title matching for dedup accuracy (substring + Jaccard similarity)
- Golden dataset support via `--golden-dir` flag
- Deterministic extraction and summarization assertion checkers

### Dashboard
- Local bundled dashboard removed — web UI moving to https://lerim.dev
- `lerim dashboard` shows transition message with CLI alternatives
- API server remains for Docker container health checks

### Cleanup
- Removed stale Codex tool references from ask prompt
- Cleaned up ResponsesProxy references in internal docs

## [0.1.69] - 2026-03-25

### Breaking

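The "fuzzy title matching (substring + Jaccard similarity)" item in the changelog above can be sketched as follows. This is an illustrative reconstruction, not the repository's actual implementation: the function names and the 0.5 Jaccard threshold are assumptions.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def titles_match(pred: str, gold: str, threshold: float = 0.5) -> bool:
    """Fuzzy title match: exact, substring containment, or Jaccard >= threshold.

    Hypothetical helper mirroring the changelog's "substring + Jaccard" description.
    """
    p, g = pred.lower().strip(), gold.lower().strip()
    if p == g or p in g or g in p:
        return True
    return jaccard(pred, gold) >= threshold
```

Word-set Jaccard tolerates reordering and small wording drift, which suits comparing predicted memory titles against golden ones.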
README.md

Lines changed: 2 additions & 1 deletion

@@ -206,7 +206,7 @@ For Docker deployments, set `ollama = "http://host.docker.internal:11434"` in `[

## Web UI (Lerim Cloud)

-The browser UI (sessions, memories, pipeline, settings) lives in **[lerim-cloud](https://github.com/lerim-dev/lerim-cloud)** and is served from **[lerim.dev](https://lerim.dev)**. The `lerim` daemon still exposes a **JSON API** on `http://localhost:8765` for the CLI and for Cloud to talk to your local runtime when connected.
+The web dashboard has moved to **[lerim.dev](https://lerim.dev)**. The local bundled dashboard has been removed as of v0.1.70 -- all UI features (sessions, memories, pipeline, settings) are now part of **[Lerim Cloud](https://lerim.dev)**. The `lerim` daemon still exposes a **JSON API** on `http://localhost:8765` for the CLI and for Cloud to talk to your local runtime when connected. Running `lerim dashboard` shows a transition message with CLI alternatives.

## CLI reference

@@ -231,6 +231,7 @@ lerim ask "Why did we choose this?" # query memories

lerim sync # one-shot: sync sessions + extract
lerim maintain # one-shot: merge, archive, decay
lerim status # runtime state
+lerim queue # show pending session queue

# Local commands (run on host, no server needed)
lerim memory search "auth pattern" # keyword search
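With the bundled dashboard gone, a minimal way to check that the daemon's JSON API is up might look like the sketch below. The `/health` route is an assumption -- the changelog only says the API server remains for Docker container health checks, so substitute the daemon's real endpoint.

```python
import json
import urllib.error
import urllib.request


def daemon_status(base_url: str = "http://localhost:8765", timeout: float = 2.0):
    """Return parsed JSON from the daemon's (assumed) /health endpoint,
    or None if the daemon is unreachable or returns non-JSON."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return json.loads(resp.read().decode())
    except (urllib.error.URLError, OSError, ValueError):
        return None
```

Returning `None` instead of raising makes the probe safe to call from scripts that only want a liveness signal.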

evals/README.md

Lines changed: 78 additions & 7 deletions

@@ -9,24 +9,91 @@ or `<repo>/.lerim/`.

## Pipelines

| Pipeline | Runner | What it tests | Scoring |
|----------|--------|---------------|---------|
| Extraction | `run_extraction.py` | DSPy memory candidate extraction from session traces | Judge (4-dim) |
| Summarization | `run_summarization.py` | DSPy structured session summary generation | Judge (4-dim) |
| Lifecycle | `run_lifecycle.py` | Full sync + maintain flow (accumulation, dedup, merge, archive) | Judge (4-dim) |
| Dedup | `run_dedup.py` | Dedup classification accuracy against golden assertions | Deterministic + Judge |
| Maintain | `run_maintain.py` | Isolated maintain quality (archive/merge precision) | Deterministic + Judge |
| Search | `run_search.py` | Search relevance via NDCG@5 | Deterministic (NDCG) |
| Tool Selection | `run_tool_selection.py` | Tool call sequence accuracy from agent traces | Deterministic |

## Quick start

```bash
# Trace-based evals (--config is required)
PYTHONPATH=. python evals/run_extraction.py --config evals/configs/eval_minimax_m25.toml
PYTHONPATH=. python evals/run_summarization.py --config evals/configs/eval_minimax_m25.toml
PYTHONPATH=. python evals/run_lifecycle.py --config evals/configs/eval_minimax_m25.toml --limit 5 --maintain-every 3

# Golden dataset evals (--golden-dir is required)
PYTHONPATH=. python evals/run_dedup.py --config evals/configs/eval_minimax_m25.toml --golden-dir path/to/golden/dedup/
PYTHONPATH=. python evals/run_maintain.py --config evals/configs/eval_minimax_m25.toml --golden-dir path/to/golden/maintain/
PYTHONPATH=. python evals/run_search.py --golden-dir path/to/golden/search/
PYTHONPATH=. python evals/run_tool_selection.py --golden-dir path/to/golden/tool_selection/

# Compare results across runs
PYTHONPATH=. python evals/compare.py
```

## Golden Dataset

The `--golden-dir` flag points to a directory of golden test cases. Each case is a
subdirectory with an `input/` and `expected/` split:

```
golden/dedup/
  case_001/
    input/
      trace.jsonl        # session trace
      memory_store/      # pre-populated memories
        decisions/
        learnings/
    expected/
      assertions.json    # golden assertions
  case_002/
    ...

golden/maintain/
  case_001/
    input/
      memory_store/      # pre-populated memories
    expected/
      assertions.json    # {"should_archive": [...], "should_merge": [[...]], "should_keep": [...]}

golden/search/
  case_001/
    input/
      memory_store/      # memories to index
      queries.json       # [{"query": "...", "relevant_memory_ids": ["id1", "id2"]}]

golden/tool_selection/
  case_001/
    input/
      agent_trace.json   # OAI agent trace with tool calls
    expected/
      assertions.json    # {"expected_sequence": [...], "must_not_call": [...]}
```

Golden datasets are not checked into git. Build them from real runs or create them
manually for regression testing.
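A loader for this case layout can be sketched as below. `iter_golden_cases` is a hypothetical helper, not part of the repo; it tolerates cases (like the `golden/search/` layout) that keep everything under `input/` and have no `expected/` directory.

```python
import json
from pathlib import Path


def iter_golden_cases(golden_dir: str):
    """Yield (case_name, input_dir, assertions) for each case subdirectory.

    `assertions` is the parsed expected/assertions.json, or None when the
    case has no expected/ split (e.g. search cases driven by queries.json).
    """
    root = Path(golden_dir)
    for case in sorted(p for p in root.iterdir() if p.is_dir()):
        assertions_path = case / "expected" / "assertions.json"
        assertions = (
            json.loads(assertions_path.read_text())
            if assertions_path.exists()
            else None
        )
        yield case.name, case / "input", assertions
```

Sorting case directories keeps eval runs deterministic across filesystems.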
## LerimBench (7 dimensions)

The `scores.py` module provides `LerimBenchScore` and `compute_lerim_bench_composite()`
for computing a weighted composite across all 7 evaluation dimensions:

| Dimension | Weight | Source |
|-----------|--------|--------|
| extraction_precision | 0.20 | `run_extraction.py` judge |
| extraction_recall | 0.20 | `run_extraction.py` judge |
| dedup_accuracy | 0.15 | `run_dedup.py` deterministic |
| consolidation_quality | 0.15 | `run_maintain.py` judge |
| archive_precision | 0.10 | `run_maintain.py` deterministic |
| search_relevance | 0.15 | `run_search.py` NDCG@5 |
| scale_degradation | 0.05 | Lifecycle regression ratio |
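A sketch of the weighted composite using the weights from the table above. The real `compute_lerim_bench_composite()` lives in `scores.py`; this illustrative version also renormalizes over whichever dimensions are present, which is an assumption about how missing scores are handled.

```python
# Weights mirror the LerimBench table; this is a sketch, not the repo's code.
LERIM_BENCH_WEIGHTS = {
    "extraction_precision": 0.20,
    "extraction_recall": 0.20,
    "dedup_accuracy": 0.15,
    "consolidation_quality": 0.15,
    "archive_precision": 0.10,
    "search_relevance": 0.15,
    "scale_degradation": 0.05,
}


def composite(scores: dict) -> float:
    """Weighted average over the dimensions present in `scores`.

    Missing dimensions are skipped and the remaining weights renormalized
    (an assumption -- the real implementation may instead require all 7).
    """
    present = {k: w for k, w in LERIM_BENCH_WEIGHTS.items() if k in scores}
    total = sum(present.values())
    if total == 0:
        return 0.0
    return sum(scores[k] * w for k, w in present.items()) / total
```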
## Configuration

Each eval requires a TOML config (`--config`). Configs live in `evals/configs/`.
@@ -69,8 +136,12 @@ Configure in `evals/dataset/config.toml`.

| File | Purpose |
|------|---------|
| `scores.py` | Deterministic checks, LerimBench composite, NDCG, dedup/archive accuracy |
| `judge.py` | Coding agent CLI judge wrapper |
| `compare.py` | Cross-run comparison table |
| `run_dedup.py` | Dedup accuracy eval against golden datasets |
| `run_maintain.py` | Isolated maintain eval against golden datasets |
| `run_search.py` | Search relevance eval (NDCG@5) against golden datasets |
| `run_tool_selection.py` | Tool selection accuracy from agent traces |
| `scripts/bench_models.sh` | Multi-model benchmark runner |
| `dataset/build.py` | Dataset pipeline entry point |

evals/common.py

Lines changed: 1 addition & 2 deletions

@@ -33,7 +33,7 @@ def configure_dspy_from_eval(config: dict, prefix: str = "lerim_eval_") -> tuple

     Returns (Config, temp_dir_path).
     """
-    REQUIRED_SECTIONS = ("lead", "explorer", "extraction", "summarization")
+    REQUIRED_SECTIONS = ("lead", "extraction", "summarization")
     missing = [s for s in REQUIRED_SECTIONS if s not in config]
     if missing:
         raise ValueError(

@@ -43,7 +43,6 @@ def configure_dspy_from_eval(config: dict, prefix: str = "lerim_eval_") -> tuple

     section_to_role = {
         "lead": "lead",
-        "explorer": "explorer",
         "extraction": "extract",
         "summarization": "summarize",
     }

evals/judge_prompts/dedup.md

Lines changed: 43 additions & 0 deletions

@@ -0,0 +1,43 @@

# Dedup Quality Judge

You are evaluating the quality of dedup (deduplication) decisions made during a memory sync run.

## Context

- **Original trace**: `{trace_path}` -- the coding session being synced
- **Memory root**: `{memory_root}` -- existing memory files (decisions/, learnings/)
- **Predicted actions**: see below
- **Golden assertions**: see below

## Instructions

Use your Read and search tools to investigate the files above. Do NOT load entire files into context -- read strategically:

1. **Read a few memory files** in `{memory_root}/decisions/` and `{memory_root}/learnings/` to understand the existing memory state.
2. **Compare predicted actions** against golden assertions to see where classifications diverge.
3. **Read the original trace** at `{trace_path}` to verify whether add/update/no_op decisions make sense given the session content.

## Predicted Actions

```json
{predictions}
```

## Golden Assertions

```json
{golden}
```

## Scoring (each 0.0 to 1.0)

- **completeness** (weight 0.25): Did dedup find all duplicate/overlapping candidates? Were all candidates in the golden set properly classified? 1.0 = no missed duplicates.
- **faithfulness** (weight 0.25): Are dedup decisions grounded in actual memory content? Are update decisions justified by real overlap between candidate and existing memory? 1.0 = all decisions evidence-based.
- **coherence** (weight 0.20): Is the reasoning behind dedup decisions clear and consistent? Do add/update/no_op classifications follow a coherent strategy? 1.0 = excellent reasoning.
- **precision** (weight 0.30): No false-positive duplicates? Items classified as no_op or update should genuinely overlap with existing memories. Penalize marking distinct candidates as duplicates. 1.0 = no incorrect dedup matches.

## Response Format

Return ONLY valid JSON (no markdown fences, no extra text):

{"completeness": 0.0, "faithfulness": 0.0, "coherence": 0.0, "precision": 0.0, "reasoning": "Brief explanation of scores."}
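The changelog's explicit 0.7/0.4 dedup thresholds suggest a banded classifier along the lines below. The mapping of similarity bands to `no_op`/`update`/`add` is an assumption about the sync prompt's semantics, not the repository's documented behavior.

```python
def classify_candidate(similarity: float, hi: float = 0.7, lo: float = 0.4) -> str:
    """Map a candidate-vs-existing-memory similarity score to a dedup action.

    The 0.7/0.4 cut points come from the v0.1.70 changelog; which band maps
    to which action is an illustrative assumption.
    """
    if similarity >= hi:
        return "no_op"   # near-duplicate of an existing memory, skip it
    if similarity >= lo:
        return "update"  # overlapping -- fold into the existing memory
    return "add"         # genuinely new content
```

Making the thresholds explicit (rather than leaving "duplicate" to the model's judgment) is what allows the deterministic dedup-accuracy eval to score classifications against golden assertions.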
Lines changed: 40 additions & 0 deletions

@@ -0,0 +1,40 @@

# Isolated Maintain Quality Judge

You are evaluating the quality of an isolated memory maintenance run against a golden dataset with known expected outcomes.

## Context

- **Memory root**: `{memory_root}` -- memory files (decisions/, learnings/, archived/)
- **Run folder**: `{run_folder}` -- maintain artifacts (maintain_actions.json)
- **Memories before maintain**: {before_count}
- **Memories after maintain**: {after_count}
- **Golden assertions**: see below

## Golden Assertions

```json
{assertions}
```

## Instructions

Use your Read and search tools to investigate the files above. Do NOT load entire files into context -- read strategically:

1. **List memory files** in `{memory_root}/decisions/` and `{memory_root}/learnings/` to see the post-maintain state.
2. **Read maintain_actions.json** in `{run_folder}` -- check what actions were taken (merge, archive, consolidate, unchanged).
3. **Check archived/** directory at `{memory_root}/archived/` for newly archived files. Cross-reference with the should_archive list.
4. **Sample memory files** to verify merge decisions preserved important information from both sources.
5. **Compare against assertions**: Were should_archive items archived? Were should_merge items merged? Were should_keep items left untouched?

## Scoring (each 0.0 to 1.0)

- **completeness** (weight 0.25): Did maintenance find all merge and archive opportunities listed in the golden assertions? Were all memory files reviewed? Were should_merge groups actually merged? 1.0 = no missed opportunities.
- **faithfulness** (weight 0.25): Are maintenance actions reasonable? Do merges preserve important information from both originals? Are archive decisions justified (not discarding valuable content)? 1.0 = all actions correct.
- **coherence** (weight 0.20): Is the final memory store well-organized after maintenance? Do merged memories read naturally? Is the maintain report well-structured with clear reasoning? 1.0 = excellent coherence.
- **precision** (weight 0.30): Did maintenance correctly avoid archiving should_keep items? Were no valuable memories incorrectly archived or merged away? Reward archiving genuinely low-quality memories. 1.0 = no incorrect maintenance actions.

## Response Format

Return ONLY valid JSON (no markdown fences, no extra text):

{"completeness": 0.0, "faithfulness": 0.0, "coherence": 0.0, "precision": 0.0, "reasoning": "Brief explanation of scores."}
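The deterministic side of the maintain eval can be sketched as a direct comparison against the golden assertions. The `actions` shape assumed here (flat `archived`/`merged`/`kept` lists) is illustrative; the real `maintain_actions.json` schema may differ.

```python
def check_maintain(actions: dict, assertions: dict) -> dict:
    """Deterministic pass/fail checks against golden maintain assertions.

    `actions` is assumed shaped like:
        {"archived": [...], "merged": [[...], ...], "kept": [...]}
    """
    archived = set(actions.get("archived", []))
    kept = set(actions.get("kept", []))
    merged = [frozenset(group) for group in actions.get("merged", [])]
    return {
        # every should_archive item was actually archived
        "archive_recall": all(
            m in archived for m in assertions.get("should_archive", [])
        ),
        # should_keep items survived and were not archived
        "keep_precision": all(
            m in kept and m not in archived
            for m in assertions.get("should_keep", [])
        ),
        # each should_merge group is contained in some actual merge group
        "merge_recall": all(
            any(set(group) <= mg for mg in merged)
            for group in assertions.get("should_merge", [])
        ),
    }
```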

evals/judge_prompts/search.md

Lines changed: 43 additions & 0 deletions

@@ -0,0 +1,43 @@

# Search Relevance Judge

You are evaluating the quality of memory search results returned by the hybrid FTS5 + vector search index.

## Context

- **Memory root**: `{memory_root}` -- indexed memory files (decisions/, learnings/)
- **Query**: `{query}`
- **Returned results** (ranked): see below
- **Known relevant memories**: see below

## Returned Results

```json
{results}
```

## Known Relevant Memories

```json
{relevant}
```

## Instructions

Use your Read and search tools to investigate the files above. Do NOT load entire files into context -- read strategically:

1. **Read the returned memory files** to verify they actually match the query intent.
2. **Read the known relevant memories** (by their file paths) to understand what should have been returned.
3. **Check ranking order**: Are the most relevant results ranked highest?

## Scoring (each 0.0 to 1.0)

- **completeness** (weight 0.25): Did the search find all known relevant memories within the top results? 1.0 = all relevant memories appeared in results.
- **faithfulness** (weight 0.25): Do the returned results actually match the query semantically? Are they genuinely about the topic being searched? 1.0 = all results are on-topic.
- **coherence** (weight 0.20): Is the ranking order reasonable? Are the most relevant results ranked first? 1.0 = perfect ranking.
- **precision** (weight 0.30): Are there irrelevant results in the top-5? Penalize results that do not relate to the query at all. 1.0 = no irrelevant results in top-5.

## Response Format

Return ONLY valid JSON (no markdown fences, no extra text):

{"completeness": 0.0, "faithfulness": 0.0, "coherence": 0.0, "precision": 0.0, "reasoning": "Brief explanation of scores."}
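The NDCG@5 metric used by `run_search.py` reduces, with binary relevance, to the standard formula below. This standalone sketch is not the repository's implementation, but any binary-relevance NDCG@k should agree with it.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k: int = 5) -> float:
    """Binary-relevance NDCG@k: gain 1 if a returned id is in the relevant set.

    DCG discounts each hit at rank i (0-based) by 1/log2(i + 2); the ideal DCG
    assumes all relevant items (up to k) appear at the top.
    """
    rel = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, mid in enumerate(ranked_ids[:k])
        if mid in rel
    )
    ideal_hits = min(len(rel), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

A perfect ranking scores 1.0; relevant items pushed down the list are discounted logarithmically, which is why ranking order matters beyond simple recall.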
Lines changed: 50 additions & 0 deletions

@@ -0,0 +1,50 @@

# Tool Selection Quality Judge

You are evaluating whether the lerim agent selected the correct tools in the correct order during a sync or maintain run.

## Context

- **Agent trace**: `{agent_trace_path}` -- OpenAI Agents SDK run history with tool calls and results
- **Expected tool sequence**: see below
- **Forbidden tools**: see below
- **Actual tool calls**: see below

## Expected Sequence

```json
{expected_sequence}
```

## Forbidden Tools (must_not_call)

```json
{must_not_call}
```

## Actual Tool Calls

```json
{actual_calls}
```

## Instructions

Use your Read tool to examine the agent trace at `{agent_trace_path}` if needed for deeper context.

1. **Compare tool ordering**: Did the agent call tools in the expected order? Extract/summarize first, then dedup, then classify, then write.
2. **Check forbidden calls**: Were any must_not_call tools invoked? This is a hard penalty.
3. **Verify tool arguments**: Were the arguments passed to each tool reasonable for the task?
4. **Check for unnecessary calls**: Did the agent make redundant or wasted tool calls?

## Scoring (each 0.0 to 1.0)

- **completeness** (weight 0.25): Were all necessary tools called? Did the agent complete the full pipeline without skipping steps? 1.0 = all expected tools were called.
- **faithfulness** (weight 0.25): Were tool arguments correct and matched to the task? Did the agent pass appropriate data between tools? 1.0 = all arguments well-formed and task-appropriate.
- **coherence** (weight 0.20): Was the tool ordering logical? Did the agent follow the expected pipeline sequence? 1.0 = perfect ordering.
- **precision** (weight 0.30): Were there unnecessary tool calls or forbidden tool invocations? Penalize redundant calls and must_not_call violations heavily. 1.0 = no wasted or forbidden calls.

## Response Format

Return ONLY valid JSON (no markdown fences, no extra text):

{"completeness": 0.0, "faithfulness": 0.0, "coherence": 0.0, "precision": 0.0, "reasoning": "Brief explanation of scores."}
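The deterministic checks this judge complements (expected ordering plus forbidden calls) can be sketched as a subsequence test over tool names. Treating the trace as a flat list of tool-name strings is a simplifying assumption; the real agent trace carries arguments and results too.

```python
def score_tool_calls(actual, expected_sequence, must_not_call):
    """Deterministic tool-selection checks.

    in_order: the expected tools appear in order as a subsequence of the
    actual calls (extra calls in between are allowed).
    forbidden: any must_not_call tools that were actually invoked.
    """
    forbidden_set = set(must_not_call)
    forbidden_hits = [t for t in actual if t in forbidden_set]
    # Classic subsequence idiom: membership tests consume the iterator,
    # so each expected tool must appear after the previous one.
    it = iter(actual)
    in_order = all(tool in it for tool in expected_sequence)
    return {"in_order": in_order, "forbidden": forbidden_hits}
```

A subsequence test (rather than exact equality) tolerates legitimate extra calls while still failing when the pipeline order is violated.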
