feat: v0.1.70 — AutoResearch optimization + eval infrastructure by kargarisaac · Pull Request #19 · lerim-dev/lerim-cli

kargarisaac · 2026-03-28T07:21:01Z

Summary

+41% composite quality score via Layer 1 AutoResearch optimization (14 experiments, 7 kept)
4 new eval runners (dedup, maintain, search, tool_selection) with LerimBench 7-dimension scoring
Dashboard transition: lerim dashboard shows "coming soon" message, web UI moves to lerim.dev
Stale Codex tool references cleaned up from ask prompt

Key changes

Quality (AutoResearch Layer 1)

dspy.Predict → dspy.ChainOfThought for extraction (single biggest win)
Explicit dedup thresholds (0.7/0.4) in sync prompt + batch_dedup tool description
MemoryCandidate schema field descriptions improved (title/body format guidance)
Post-extraction body filter raised from 30 to 50 chars
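
As a rough illustration of the threshold and filter items above, the sketch below shows a minimum-length filter and a two-threshold banding in Python. The PR does not state what the 0.7 and 0.4 thresholds trigger or how the body filter counts characters, so the band names (`duplicate`, `related`, `new`), the inclusive comparisons, the whitespace stripping, and every function name here are assumptions, not lerim's actual code.

```python
# Hypothetical sketch: names and band semantics are assumptions, not lerim's API.
MIN_BODY_CHARS = 50        # the PR raises this filter from 30 to 50 chars
DUPLICATE_THRESHOLD = 0.7  # upper dedup threshold from the PR
RELATED_THRESHOLD = 0.4    # lower dedup threshold from the PR


def keep_candidate(body: str, min_chars: int = MIN_BODY_CHARS) -> bool:
    """Keep a candidate only if its stripped body has at least min_chars characters."""
    return len(body.strip()) >= min_chars


def dedup_band(similarity: float) -> str:
    """Map a similarity score in [0, 1] to one of three assumed bands."""
    if similarity >= DUPLICATE_THRESHOLD:
        return "duplicate"  # assumed: treat as an existing memory
    if similarity >= RELATED_THRESHOLD:
        return "related"    # assumed: candidate for an update or merge
    return "new"            # assumed: store as a new memory
```

Making the thresholds explicit constants mirrors the PR's approach of stating them outright in the prompt and tool description instead of leaving the cutoff to the model's judgment.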

Eval infrastructure

run_dedup.py, run_maintain.py, run_search.py, run_tool_selection.py
LerimBenchScore 7-dimension composite with configurable weights
Fuzzy title matching for dedup accuracy (_fuzzy_title_match)
Golden dataset support via --golden-dir flag
59 scoring tests (all passing)
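
A composite with configurable weights can be sketched as a weighted mean over per-dimension scores. This is a minimal illustration only: the PR does not list the seven dimensions or their weights, so the function name `composite_score`, the equal-weight default, and the four dimensions used in the usage lines (taken from the scores reported in the commit message) are assumptions.

```python
from typing import Mapping, Optional


def composite_score(
    scores: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Weighted mean of per-dimension scores (hypothetical helper, not LerimBenchScore)."""
    if not scores:
        raise ValueError("at least one dimension score is required")
    if weights is None:
        # Assumed default: every dimension counts equally.
        weights = {name: 1.0 for name in scores}
    missing = set(scores) - set(weights)
    if missing:
        raise ValueError(f"missing weights for: {sorted(missing)}")
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Four of the reported post-optimization scores, equally weighted.
reported = {"extraction": 0.877, "dedup": 0.722, "search": 0.905, "maintain": 1.000}
equal_weighted = composite_score(reported)
```

The equal-weighted mean of these four values is not expected to reproduce the reported composite of 0.855, since the remaining dimensions and the real weights are not given in the PR.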

Cleanup

lerim dashboard → transition message (no more local dashboard)
Ask prompt: codex tool → memory_search + read_file
ResponsesProxy reference removed from internal docs

Test plan

659 unit tests passing
Component-level eval: composite 0.608 → 0.855 (+41%)
E2E lifecycle eval: 0.845 → 0.909 (+7.6%)
lerim dashboard shows transition message
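
The percentage gains quoted in this PR are relative improvements, (after − before) / before. The short check below recomputes them from the before/after scores reported in the test plan and the commit message; `relative_gain_pct` is a helper name introduced here for illustration.

```python
def relative_gain_pct(before: float, after: float) -> float:
    """Relative improvement from before to after, in percent."""
    if before == 0:
        raise ValueError("relative gain is undefined when before is 0")
    return (after - before) / before * 100.0


composite_gain = relative_gain_pct(0.608, 0.855)   # reported as +41%
e2e_gain = relative_gain_pct(0.845, 0.909)         # reported as +7.6%
extraction_gain = relative_gain_pct(0.693, 0.877)  # reported as +27%
dedup_gain = relative_gain_pct(0.278, 0.722)       # reported as +160%
```

Rounded to the precision used in the PR, all four values agree with the reported figures.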

🤖 Generated with Claude Code

…ipts

- Removed the "explorer" section from the required configurations in `configure_dspy_from_eval` and related lifecycle functions.
- Introduced new evaluation scripts: `run_dedup.py`, `run_maintain.py`, `run_search.py`, and `run_tool_selection.py` for assessing deduplication accuracy, maintenance quality, search relevance, and tool selection accuracy against golden datasets.
- Enhanced the README documentation to include detailed descriptions of the new evaluation pipelines and their usage.
- Updated scoring functions in `scores.py` to support new evaluation metrics and composite scoring for the LerimBench.
- Added judge prompts for deduplication, maintenance, search, and tool selection evaluations to standardize quality assessments.

AutoResearch-style optimization loop (14 experiments, 7 kept):

- ChainOfThought for DSPy extraction (Predict → CoT, biggest single win)
- Explicit dedup thresholds (0.7/0.4) in sync prompt
- Behavioral guidance in batch_dedup_candidates tool description
- Improved MemoryCandidate schema field descriptions (title/body format)
- Title format guidance in extraction signature
- Raised post-extraction body filter from 30 to 50 chars

Eval infrastructure extensions:

- check_extraction_assertions() for deterministic golden-case scoring
- check_summarization_assertions() for summary validation
- _fuzzy_title_match() for dedup accuracy with Jaccard similarity
- LerimBenchScore 7-dimension composite scoring
- 59 scoring tests (all passing)

Composite: 0.608 → 0.855 (+41%)
Extraction: 0.693 → 0.877 (+27%)
Dedup: 0.278 → 0.722 (+160%)
Search: 0.905 (unchanged)
Maintain: 1.000 (unchanged)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
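
The commit message above mentions Jaccard similarity for fuzzy title matching. A minimal sketch of that idea follows, assuming word-level tokens, case-insensitive comparison, and a 0.5 match threshold. None of these details are stated in the PR, and `jaccard_similarity` and `fuzzy_title_match_sketch` are illustrative names, not the real `_fuzzy_title_match` implementation.

```python
import re
from typing import Set


def _tokens(title: str) -> Set[str]:
    """Lowercase a title and split it into a set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard_similarity(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over the word sets of two titles."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0  # assumed convention: two empty titles count as identical
    return len(ta & tb) / len(ta | tb)


def fuzzy_title_match_sketch(a: str, b: str, threshold: float = 0.5) -> bool:
    """True when the titles' Jaccard similarity reaches the (assumed) threshold."""
    return jaccard_similarity(a, b) >= threshold


# Four shared tokens out of five distinct tokens gives 4/5.
similarity = jaccard_similarity(
    "Use ChainOfThought for extraction",
    "Use chainofthought for DSPy extraction",
)
```

Set-based Jaccard ignores word order and repetition, which suits short titles where a predicted and a golden title may differ by one inserted or dropped word.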

- Dashboard: `lerim dashboard` shows transition message to lerim.dev
- Ask prompt: replace stale Codex tool refs with memory_search/read_file
- Clean up ResponsesProxy reference in src/lerim/README.md
- Update test assertion for ask prompt (codex → memory_search)
- Add v0.1.70 section to CHANGELOG.md
- Update README dashboard section
- Bump version 0.1.69 → 0.1.70

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

kargarisaac and others added 3 commits March 27, 2026 19:29

kargarisaac merged commit 398d08e into main Mar 28, 2026
1 check passed

kargarisaac deleted the feat/autoresearch-optimization branch March 28, 2026 07:23

kargarisaac merged 3 commits into main from feat/autoresearch-optimization
