Summary
Create a standalone benchmark suite (basic-memory-bench) for evaluating retrieval quality across BM deployments using academic datasets. Designed to be publicly shareable, runnable by anyone, and integrated into CI.
Why
- Internal quality tracking — benchmark before/after every release
- Cloud vs Local comparison — validate that Cloud's better embeddings produce better retrieval
- Public credibility — reproducible numbers on academic benchmarks
- Marketing — "we benchmark in the open" content
- Competitive positioning — compare against Mem0/Supermemory on the same datasets
Current Results (from prototype in openclaw-basic-memory plugin)
Full LoCoMo benchmark — 1,982 queries across 10 conversations:
| Metric | BM Local (v0.18.5) |
|---|---|
| Recall@5 | 76.4% |
| Recall@10 | 85.5% |
| MRR | 0.658 |
| Content Hit Rate | 25.4% |
| Mean Latency | 1,063ms |
By category:
| Category | N | R@5 |
|---|---|---|
| open_domain | 841 | 86.6% |
| multi_hop | 321 | 84.1% |
| adversarial | 446 | 67.0% |
| temporal | 92 | 59.1% |
| single_hop | 282 | 57.7% |
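The headline numbers above reduce to a few lines of standard metric code; a minimal sketch (function names are illustrative, not taken from the prototype):

```python
def hit_at_k(relevant: set[str], ranked: list[str], k: int) -> bool:
    """True if any relevant doc id appears in the top-k retrieved ids."""
    return any(doc in relevant for doc in ranked[:k])

def reciprocal_rank(relevant: set[str], ranked: list[str]) -> float:
    """1/rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def summarize(queries: list[tuple[set[str], list[str]]]) -> dict[str, float]:
    """Aggregate per-query (relevant_ids, ranked_ids) pairs into table metrics."""
    n = len(queries)
    return {
        "recall@5": sum(hit_at_k(rel, ranked, 5) for rel, ranked in queries) / n,
        "recall@10": sum(hit_at_k(rel, ranked, 10) for rel, ranked in queries) / n,
        "mrr": sum(reciprocal_rank(rel, ranked) for rel, ranked in queries) / n,
    }
```

Each of the 1,982 LoCoMo queries contributes one `(relevant, ranked)` pair; the table is just the mean over all of them.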
Architecture
- Python (not TS) — same ecosystem as BM, uses BM's importer framework
- Provider abstraction — BM Local (MCP stdio), BM Cloud (API), Mem0 (optional)
- Two eval modes — retrieval metrics (R@K, MRR) + LLM-as-Judge (for Mem0 comparison)
- Deterministic conversion — LoCoMo/LongMemEval → BM markdown via `EntityMarkdown`
- CI-ready — `just full-locomo` runs everything; fails if recall drops >2%
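The provider abstraction could be a structural protocol that each backend (BM Local over MCP stdio, BM Cloud over its API, Mem0) implements; a sketch with hypothetical method names, not the suite's actual interface:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable

@dataclass
class SearchResult:
    doc_id: str
    score: float
    content: str

@runtime_checkable
class RetrievalProvider(Protocol):
    """Minimal surface the harness needs from any backend (names are assumed)."""
    name: str

    def ingest(self, notes: list[str]) -> None:
        """Load converted markdown notes into the backend."""
        ...

    def search(self, query: str, limit: int = 10) -> list[SearchResult]:
        """Return ranked results for one benchmark query."""
        ...
```

The harness only ever sees this interface, so adding a competitor is one new provider class, not a new harness.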
Datasets
- LoCoMo (ACL 2024, Snap Research) — 10 conversations, 1,986 QA pairs. Mem0 publishes numbers on this.
- LongMemEval (ICLR 2025) — Supermemory uses this. More challenging.
- Synthetic (our hand-crafted 38-query suite for CI smoke tests)
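The deterministic conversion step amounts to flattening each conversation into a markdown note the provider can ingest; a sketch assuming a simplified LoCoMo-like schema — the real dataset's JSON fields and BM's `EntityMarkdown` output differ:

```python
def conversation_to_markdown(conv: dict) -> str:
    """Flatten one conversation (simplified, assumed schema) into a BM-style
    markdown note: one heading per session, one bullet per dialogue turn."""
    lines = [f"# Conversation {conv['id']}", ""]
    for session in conv["sessions"]:
        lines.append(f"## Session {session['date']}")
        for turn in session["turns"]:
            lines.append(f"- {turn['speaker']}: {turn['text']}")
        lines.append("")
    return "\n".join(lines)
```

Because the conversion is pure string transformation with no model calls, re-running it always yields byte-identical notes, which keeps runs comparable across releases.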
Known Improvement Opportunities
From analysis of the 375 failures:
- RRF scoring flattens results (#577) — hybrid search assigns nearly all hits scores of ~0.016, destroying the ranking; FTS alone finds observations that hybrid misses
- Single-hop recall at 57.7% — specific fact lookups need better chunk matching
- Temporal at 59.1% — date-aware scoring needed
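The flattening in #577 is consistent with the arithmetic of Reciprocal Rank Fusion: with the conventional constant k=60, a document at rank 1 in a single list scores 1/61 ≈ 0.0164, and deeper ranks only slightly less, so fused scores cluster tightly. A sketch assuming standard RRF with k=60 (BM's actual fusion code may differ):

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: score(d) = sum over input lists of 1/(k + rank(d)).

    With k=60, any doc appearing in only one list scores roughly 1/61 ~= 0.0164
    regardless of which list found it, which matches the observed ~0.016 cluster.
    """
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return scores

# Docs found by only one of FTS / vector search are nearly indistinguishable:
fused = rrf_scores([["fts_only_hit", "b"], ["vec_only_hit", "d"]])
```

Here `fused["fts_only_hit"]` and `fused["vec_only_hit"]` are both 1/61, so the original retrieval scores' ordering information is lost.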
Implementation Phases
- Repo setup + LoCoMo (Python port from current TS prototype)
- BM Cloud provider + LongMemEval dataset
- LLM-as-Judge + competitor comparison
- CI integration + public results dashboard
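The CI gate in the last phase can be a small comparison against a committed baseline; a sketch assuming the ">2%" threshold is an absolute drop per metric (names and file layout are illustrative):

```python
def regressions(baseline: dict[str, float], current: dict[str, float],
                max_drop: float = 0.02) -> list[str]:
    """Return one message per metric that dropped more than max_drop (absolute).

    A missing metric in the current run counts as a drop to 0.0.
    """
    failing = []
    for metric, base in baseline.items():
        cur = current.get(metric, 0.0)
        if base - cur > max_drop:
            failing.append(f"{metric}: {base:.3f} -> {cur:.3f}")
    return failing

# In CI: load a committed baseline JSON plus the fresh run's metrics,
# print the messages, and exit non-zero if the list is non-empty.
```

Storing the baseline in the repo makes every regression an explicit, reviewable diff when a release intentionally trades recall for something else.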
Reference
- Full spec: `drafts/spec-benchmark-suite.md` in openclaw workspace
- Current TS prototype: `benchmark/` in openclaw-basic-memory repo
- LoCoMo dataset: github.com/snap-research/locomo
- LongMemEval: github.com/xiaowu0162/LongMemEval
- Supermemory's harness: github.com/supermemoryai/memorybench