Merged
25 changes: 25 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,31 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.1.70] - 2026-03-28

### Quality Improvements
- **+41% composite quality score** via Layer 1 AutoResearch optimization
- ChainOfThought for DSPy extraction pipeline (biggest single improvement)
- Explicit dedup classification thresholds (0.7/0.4) in sync prompt
- Improved MemoryCandidate schema field descriptions for better output consistency
- Tighter post-extraction body filter (30→50 chars minimum)

### Evaluation Infrastructure
- 4 new eval runners: dedup accuracy, maintain quality, search relevance (NDCG@5), tool selection
- LerimBench 7-dimension composite scoring with configurable weights
- Fuzzy title matching for dedup accuracy (substring + Jaccard similarity)
- Golden dataset support via `--golden-dir` flag
- Deterministic extraction and summarization assertion checkers

### Dashboard
- Local bundled dashboard removed — web UI moving to https://lerim.dev
- `lerim dashboard` shows transition message with CLI alternatives
- API server remains for Docker container health checks

### Cleanup
- Removed stale Codex tool references from ask prompt
- Cleaned up ResponsesProxy references in internal docs

## [0.1.69] - 2026-03-25

### Breaking
3 changes: 2 additions & 1 deletion README.md
@@ -206,7 +206,7 @@ For Docker deployments, set `ollama = "http://host.docker.internal:11434"` in `[

## Web UI (Lerim Cloud)

The browser UI (sessions, memories, pipeline, settings) lives in **[lerim-cloud](https://github.com/lerim-dev/lerim-cloud)** and is served from **[lerim.dev](https://lerim.dev)**. The `lerim` daemon still exposes a **JSON API** on `http://localhost:8765` for the CLI and for Cloud to talk to your local runtime when connected.
The local bundled dashboard was removed in v0.1.70; all web UI features (sessions, memories, pipeline, settings) now live in **[Lerim Cloud](https://lerim.dev)**, served from **[lerim.dev](https://lerim.dev)**. The `lerim` daemon still exposes a **JSON API** on `http://localhost:8765` for the CLI and for Cloud to talk to your local runtime when connected. Running `lerim dashboard` prints a transition message with CLI alternatives.

## CLI reference

@@ -231,6 +231,7 @@ lerim ask "Why did we choose this?" # query memories
lerim sync # one-shot: sync sessions + extract
lerim maintain # one-shot: merge, archive, decay
lerim status # runtime state
lerim queue # show pending session queue

# Local commands (run on host, no server needed)
lerim memory search "auth pattern" # keyword search
85 changes: 78 additions & 7 deletions evals/README.md
@@ -9,24 +9,91 @@ or `<repo>/.lerim/`.

## Pipelines

| Pipeline | Runner | What it tests |
|----------|--------|---------------|
| Extraction | `run_extraction.py` | DSPy memory candidate extraction from session traces |
| Summarization | `run_summarization.py` | DSPy structured session summary generation |
| Lifecycle | `run_lifecycle.py` | Full sync + maintain flow (accumulation, dedup, merge, archive) |
| Pipeline | Runner | What it tests | Scoring |
|----------|--------|---------------|---------|
| Extraction | `run_extraction.py` | DSPy memory candidate extraction from session traces | Judge (4-dim) |
| Summarization | `run_summarization.py` | DSPy structured session summary generation | Judge (4-dim) |
| Lifecycle | `run_lifecycle.py` | Full sync + maintain flow (accumulation, dedup, merge, archive) | Judge (4-dim) |
| Dedup | `run_dedup.py` | Dedup classification accuracy against golden assertions | Deterministic + Judge |
| Maintain | `run_maintain.py` | Isolated maintain quality (archive/merge precision) | Deterministic + Judge |
| Search | `run_search.py` | Search relevance via NDCG@5 | Deterministic (NDCG) |
| Tool Selection | `run_tool_selection.py` | Tool call sequence accuracy from agent traces | Deterministic |
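
Dedup accuracy matches predicted memory titles against golden titles with fuzzy matching (substring containment plus word-level Jaccard similarity). A minimal sketch of that check -- the helper names and the 0.5 threshold are illustrative, not the exact `scores.py` implementation:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two titles."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def titles_match(pred: str, gold: str, threshold: float = 0.5) -> bool:
    """A predicted title matches a golden title if either contains the
    other, or their word overlap clears the (assumed) threshold."""
    p, g = pred.lower().strip(), gold.lower().strip()
    if p in g or g in p:  # substring match
        return True
    return jaccard(p, g) >= threshold

print(titles_match("Use SQLite FTS5 for search",
                   "use sqlite fts5 for memory search"))  # → True (Jaccard 5/6)
```

The substring branch catches truncated titles; Jaccard catches reorderings and minor wording drift without an embedding model.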

## Quick start

```bash
# Run any pipeline eval (--config is required)
# Trace-based evals (--config is required)
PYTHONPATH=. python evals/run_extraction.py --config evals/configs/eval_minimax_m25.toml
PYTHONPATH=. python evals/run_summarization.py --config evals/configs/eval_minimax_m25.toml
PYTHONPATH=. python evals/run_lifecycle.py --config evals/configs/eval_minimax_m25.toml --limit 5 --maintain-every 3

# Golden dataset evals (--golden-dir is required)
PYTHONPATH=. python evals/run_dedup.py --config evals/configs/eval_minimax_m25.toml --golden-dir path/to/golden/dedup/
PYTHONPATH=. python evals/run_maintain.py --config evals/configs/eval_minimax_m25.toml --golden-dir path/to/golden/maintain/
PYTHONPATH=. python evals/run_search.py --golden-dir path/to/golden/search/
PYTHONPATH=. python evals/run_tool_selection.py --golden-dir path/to/golden/tool_selection/

# Compare results across runs
PYTHONPATH=. python evals/compare.py
```
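
`run_search.py` scores with NDCG@5. For reference, binary-relevance NDCG@k can be computed as below; this is the standard formulation, and the in-repo version in `scores.py` may differ in detail:

```python
import math

def ndcg_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Binary-relevance NDCG@k: DCG of the returned ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 2)  # gain 1 per relevant hit, discounted by rank
        for i, mid in enumerate(ranked_ids[:k])
        if mid in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

print(ndcg_at_k(["m1", "m2", "m3"], {"m1", "m2"}))  # → 1.0 (perfect ranking)
```

A relevant memory buried at rank 5 still contributes, just with a log-discounted gain, so NDCG rewards ranking quality and not only recall.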

## Golden Dataset

The `--golden-dir` flag points to a directory of golden test cases. Each case is a
subdirectory with an `input/` and `expected/` split:

```
golden/dedup/
case_001/
input/
trace.jsonl # session trace
memory_store/ # pre-populated memories
decisions/
learnings/
expected/
assertions.json # golden assertions
case_002/
...

golden/maintain/
case_001/
input/
memory_store/ # pre-populated memories
expected/
assertions.json # {"should_archive": [...], "should_merge": [[...]], "should_keep": [...]}

golden/search/
case_001/
input/
memory_store/ # memories to index
queries.json # [{"query": "...", "relevant_memory_ids": ["id1", "id2"]}]

golden/tool_selection/
case_001/
input/
agent_trace.json # OAI agent trace with tool calls
expected/
assertions.json # {"expected_sequence": [...], "must_not_call": [...]}
```

Golden datasets are not checked into git. Build them from real runs or create them
manually for regression testing.
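
Loading a case is straightforward with the stdlib; the helpers below are a hypothetical sketch of the `input/` + `expected/` split described above, not code from the eval runners:

```python
import json
from pathlib import Path

def load_golden_case(case_dir: Path) -> dict:
    """Load one golden case: expected/assertions.json plus the input dir.

    Hypothetical helper -- the eval runners may structure this differently.
    """
    input_dir = case_dir / "input"
    assertions_path = case_dir / "expected" / "assertions.json"
    if not input_dir.is_dir():
        raise FileNotFoundError(f"missing input/ in {case_dir}")
    expected = json.loads(assertions_path.read_text()) if assertions_path.exists() else {}
    return {"name": case_dir.name, "input_dir": input_dir, "expected": expected}

def iter_golden_cases(golden_dir: Path):
    """Yield case_* subdirectories in sorted order."""
    for case_dir in sorted(golden_dir.iterdir()):
        if case_dir.is_dir():
            yield load_golden_case(case_dir)
```

Sorting case directories keeps runs deterministic, which matters when comparing scores across models with `compare.py`.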

## LerimBench (7 dimensions)

The `scores.py` module provides `LerimBenchScore` and `compute_lerim_bench_composite()`
for computing a weighted composite across all 7 evaluation dimensions:

| Dimension | Weight | Source |
|-----------|--------|--------|
| extraction_precision | 0.20 | `run_extraction.py` judge |
| extraction_recall | 0.20 | `run_extraction.py` judge |
| dedup_accuracy | 0.15 | `run_dedup.py` deterministic |
| consolidation_quality | 0.15 | `run_maintain.py` judge |
| archive_precision | 0.10 | `run_maintain.py` deterministic |
| search_relevance | 0.15 | `run_search.py` NDCG@5 |
| scale_degradation | 0.05 | Lifecycle regression ratio |
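
The composite is a weighted sum over whichever dimensions a run produced. A minimal sketch using the weights above (the actual `compute_lerim_bench_composite()` signature may differ):

```python
# Weights from the table above; the real scores.py may expose these differently.
LERIM_BENCH_WEIGHTS = {
    "extraction_precision": 0.20,
    "extraction_recall": 0.20,
    "dedup_accuracy": 0.15,
    "consolidation_quality": 0.15,
    "archive_precision": 0.10,
    "search_relevance": 0.15,
    "scale_degradation": 0.05,
}

def composite(scores: dict[str, float],
              weights: dict[str, float] = LERIM_BENCH_WEIGHTS) -> float:
    """Weighted mean over the dimensions present, renormalized when some
    dimensions are missing so partial runs still score on a 0-1 scale."""
    present = {k: w for k, w in weights.items() if k in scores}
    total = sum(present.values())
    if not total:
        return 0.0
    return sum(scores[k] * w for k, w in present.items()) / total

print(composite({k: 1.0 for k in LERIM_BENCH_WEIGHTS}))  # → 1.0
```

Whether missing dimensions should renormalize (as here) or count as zero is a design choice; renormalizing avoids penalizing a run that simply skipped an eval.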

## Configuration

Each eval requires a TOML config (`--config`). Configs live in `evals/configs/`.
@@ -69,8 +136,12 @@ Configure in `evals/dataset/config.toml`.

| File | Purpose |
|------|---------|
| `scores.py` | Deterministic checks and composite score computation |
| `scores.py` | Deterministic checks, LerimBench composite, NDCG, dedup/archive accuracy |
| `judge.py` | Coding agent CLI judge wrapper |
| `compare.py` | Cross-run comparison table |
| `run_dedup.py` | Dedup accuracy eval against golden datasets |
| `run_maintain.py` | Isolated maintain eval against golden datasets |
| `run_search.py` | Search relevance eval (NDCG@5) against golden datasets |
| `run_tool_selection.py` | Tool selection accuracy from agent traces |
| `scripts/bench_models.sh` | Multi-model benchmark runner |
| `dataset/build.py` | Dataset pipeline entry point |
3 changes: 1 addition & 2 deletions evals/common.py
@@ -33,7 +33,7 @@ def configure_dspy_from_eval(config: dict, prefix: str = "lerim_eval_") -> tuple

Returns (Config, temp_dir_path).
"""
REQUIRED_SECTIONS = ("lead", "explorer", "extraction", "summarization")
REQUIRED_SECTIONS = ("lead", "extraction", "summarization")
missing = [s for s in REQUIRED_SECTIONS if s not in config]
if missing:
raise ValueError(
@@ -43,7 +43,6 @@ def configure_dspy_from_eval(config: dict, prefix: str = "lerim_eval_") -> tuple

section_to_role = {
"lead": "lead",
"explorer": "explorer",
"extraction": "extract",
"summarization": "summarize",
}
43 changes: 43 additions & 0 deletions evals/judge_prompts/dedup.md
@@ -0,0 +1,43 @@
# Dedup Quality Judge

You are evaluating the quality of dedup (deduplication) decisions made during a memory sync run.

## Context

- **Original trace**: `{trace_path}` -- the coding session being synced
- **Memory root**: `{memory_root}` -- existing memory files (decisions/, learnings/)
- **Predicted actions**: see below
- **Golden assertions**: see below

## Instructions

Use your Read and search tools to investigate the files above. Do NOT load entire files into context -- read strategically:

1. **Read a few memory files** in `{memory_root}/decisions/` and `{memory_root}/learnings/` to understand the existing memory state.
2. **Compare predicted actions** against golden assertions to see where classifications diverge.
3. **Read the original trace** at `{trace_path}` to verify whether add/update/no_op decisions make sense given the session content.

## Predicted Actions

```json
{predictions}
```

## Golden Assertions

```json
{golden}
```

## Scoring (each 0.0 to 1.0)

- **completeness** (weight 0.25): Did dedup find all duplicate/overlapping candidates? Were all candidates in the golden set properly classified? 1.0 = no missed duplicates.
- **faithfulness** (weight 0.25): Are dedup decisions grounded in actual memory content? Are update decisions justified by real overlap between candidate and existing memory? 1.0 = all decisions evidence-based.
- **coherence** (weight 0.20): Is the reasoning behind dedup decisions clear and consistent? Do add/update/no_op classifications follow a coherent strategy? 1.0 = excellent reasoning.
- **precision** (weight 0.30): No false-positive duplicates? Items classified as no_op or update should genuinely overlap with existing memories. Penalize marking distinct candidates as duplicates. 1.0 = no incorrect dedup matches.

## Response Format

Return ONLY valid JSON (no markdown fences, no extra text):

{"completeness": 0.0, "faithfulness": 0.0, "coherence": 0.0, "precision": 0.0, "reasoning": "Brief explanation of scores."}
40 changes: 40 additions & 0 deletions evals/judge_prompts/maintain_isolated.md
@@ -0,0 +1,40 @@
# Isolated Maintain Quality Judge

You are evaluating the quality of an isolated memory maintenance run against a golden dataset with known expected outcomes.

## Context

- **Memory root**: `{memory_root}` -- memory files (decisions/, learnings/, archived/)
- **Run folder**: `{run_folder}` -- maintain artifacts (maintain_actions.json)
- **Memories before maintain**: {before_count}
- **Memories after maintain**: {after_count}
- **Golden assertions**: see below

## Golden Assertions

```json
{assertions}
```

## Instructions

Use your Read and search tools to investigate the files above. Do NOT load entire files into context -- read strategically:

1. **List memory files** in `{memory_root}/decisions/` and `{memory_root}/learnings/` to see the post-maintain state.
2. **Read maintain_actions.json** in `{run_folder}` -- check what actions were taken (merge, archive, consolidate, unchanged).
3. **Check archived/** directory at `{memory_root}/archived/` for newly archived files. Cross-reference with should_archive list.
4. **Sample memory files** to verify merge decisions preserved important information from both sources.
5. **Compare against assertions**: Were should_archive items archived? Were should_merge items merged? Were should_keep items left untouched?

## Scoring (each 0.0 to 1.0)

- **completeness** (weight 0.25): Did maintenance find all merge and archive opportunities listed in the golden assertions? Were all memory files reviewed? Were should_merge groups actually merged? 1.0 = no missed opportunities.
- **faithfulness** (weight 0.25): Are maintenance actions reasonable? Do merges preserve important information from both originals? Are archive decisions justified (not discarding valuable content)? 1.0 = all actions correct.
- **coherence** (weight 0.20): Is the final memory store well-organized after maintenance? Do merged memories read naturally? Is the maintain report well-structured with clear reasoning? 1.0 = excellent coherence.
- **precision** (weight 0.30): Did maintenance correctly avoid archiving should_keep items? Were no valuable memories incorrectly archived or merged away? Reward archiving genuinely low-quality memories. 1.0 = no incorrect maintenance actions.

## Response Format

Return ONLY valid JSON (no markdown fences, no extra text):

{"completeness": 0.0, "faithfulness": 0.0, "coherence": 0.0, "precision": 0.0, "reasoning": "Brief explanation of scores."}
43 changes: 43 additions & 0 deletions evals/judge_prompts/search.md
@@ -0,0 +1,43 @@
# Search Relevance Judge

You are evaluating the quality of memory search results returned by the hybrid FTS5 + vector search index.

## Context

- **Memory root**: `{memory_root}` -- indexed memory files (decisions/, learnings/)
- **Query**: `{query}`
- **Returned results** (ranked): see below
- **Known relevant memories**: see below

## Returned Results

```json
{results}
```

## Known Relevant Memories

```json
{relevant}
```

## Instructions

Use your Read and search tools to investigate the files above. Do NOT load entire files into context -- read strategically:

1. **Read the returned memory files** to verify they actually match the query intent.
2. **Read the known relevant memories** (by their file paths) to understand what should have been returned.
3. **Check ranking order**: Are the most relevant results ranked highest?

## Scoring (each 0.0 to 1.0)

- **completeness** (weight 0.25): Did the search find all known relevant memories within the top results? 1.0 = all relevant memories appeared in results.
- **faithfulness** (weight 0.25): Do the returned results actually match the query semantically? Are they genuinely about the topic being searched? 1.0 = all results are on-topic.
- **coherence** (weight 0.20): Is the ranking order reasonable? Are the most relevant results ranked first? 1.0 = perfect ranking.
- **precision** (weight 0.30): Are there irrelevant results in the top-5? Penalize results that do not relate to the query at all. 1.0 = no irrelevant results in top-5.

## Response Format

Return ONLY valid JSON (no markdown fences, no extra text):

{"completeness": 0.0, "faithfulness": 0.0, "coherence": 0.0, "precision": 0.0, "reasoning": "Brief explanation of scores."}
50 changes: 50 additions & 0 deletions evals/judge_prompts/tool_selection.md
@@ -0,0 +1,50 @@
# Tool Selection Quality Judge

You are evaluating whether the lerim agent selected the correct tools in the correct order during a sync or maintain run.

## Context

- **Agent trace**: `{agent_trace_path}` -- OpenAI Agents SDK run history with tool calls and results
- **Expected tool sequence**: see below
- **Forbidden tools**: see below
- **Actual tool calls**: see below

## Expected Sequence

```json
{expected_sequence}
```

## Forbidden Tools (must_not_call)

```json
{must_not_call}
```

## Actual Tool Calls

```json
{actual_calls}
```

## Instructions

Use your Read tool to examine the agent trace at `{agent_trace_path}` if needed for deeper context.

1. **Compare tool ordering**: Did the agent call tools in the expected order? Extract/summarize first, then dedup, then classify, then write.
2. **Check forbidden calls**: Were any must_not_call tools invoked? This is a hard penalty.
3. **Verify tool arguments**: Were the arguments passed to each tool reasonable for the task?
4. **Check for unnecessary calls**: Did the agent make redundant or wasted tool calls?

## Scoring (each 0.0 to 1.0)

- **completeness** (weight 0.25): Were all necessary tools called? Did the agent complete the full pipeline without skipping steps? 1.0 = all expected tools were called.
- **faithfulness** (weight 0.25): Were tool arguments correct and matched to the task? Did the agent pass appropriate data between tools? 1.0 = all arguments well-formed and task-appropriate.
- **coherence** (weight 0.20): Was the tool ordering logical? Did the agent follow the expected pipeline sequence? 1.0 = perfect ordering.
- **precision** (weight 0.30): Were there unnecessary tool calls or forbidden tool invocations? Penalize redundant calls and must_not_call violations heavily. 1.0 = no wasted or forbidden calls.

## Response Format

Return ONLY valid JSON (no markdown fences, no extra text):

{"completeness": 0.0, "faithfulness": 0.0, "coherence": 0.0, "precision": 0.0, "reasoning": "Brief explanation of scores."}