# Basic Memory Benchmark

Open, reproducible retrieval quality benchmarks for the Basic Memory OpenClaw plugin.

## Why

Memory systems for AI agents make big claims with no reproducible evidence. We're building benchmarks in the open to:

1. **Improve Basic Memory** — evals are a feedback loop, not a marketing tool
2. **Compare honestly** — show where we're strong AND where we're weak
3. **Publish methodology** — anyone can reproduce our results or challenge them

## What We Measure

### Retrieval Quality (primary)
- **Recall@K** — does the correct memory appear in the top K results?
- **Precision@K** — of the top K results, how many are actually relevant?
- **MRR** — Mean Reciprocal Rank: where does the first correct answer appear?
- **Content Hit Rate** — for exact facts, did the expected value appear in results?

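
These four metrics reduce to a few lines of code. The functions below are a minimal sketch, assuming each query returns a ranked list of file paths and its ground truth is a set of relevant paths; the names are illustrative, not the benchmark's actual implementation.

```python
# Illustrative metric definitions (not the benchmark's real code).
# results: ranked list of retrieved file paths; relevant: set of ground-truth paths.

def recall_at_k(results: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant files that appear in the top-k results."""
    return len(set(results[:k]) & relevant) / len(relevant) if relevant else 0.0

def precision_at_k(results: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are actually relevant."""
    return len(set(results[:k]) & relevant) / k if k else 0.0

def reciprocal_rank(results: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result (0.0 if none); averaging this over all queries gives MRR."""
    for rank, path in enumerate(results, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0

def content_hit(result_texts: list[str], expected: str) -> bool:
    """For exact-fact queries: did the expected value appear anywhere in the returned content?"""
    return any(expected in text for text in result_texts)
```
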
### Query Categories
| Category | What it tests |
|----------|---------------|
| `exact_fact` | Keyword precision — find specific values |
| `semantic` | Vector similarity — find conceptually related content |
| `temporal` | Date awareness — retrieve by when things happened |
| `relational` | Graph traversal — follow connections between entities |
| `cross_note` | Multi-document recall — stitch information across files |
| `task_recall` | Structured task queries — find active/assigned tasks |
| `needle_in_haystack` | Exact token retrieval — find specific IDs, URLs, numbers |
| `absence` | Knowing what ISN'T there — or is planned but not done |
| `evolving_fact` | Freshness — prefer newer data over stale entries |

### Providers Compared
1. **Basic Memory** (`bm search`) — semantic graph + observations + relations
2. **OpenClaw builtin** (`memory-core`) — SQLite + vector + BM25 hybrid
3. **QMD** (experimental) — BM25 + vectors + reranking sidecar

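
To score all three providers with one harness, each backend can sit behind the same minimal interface. The `Protocol` below is a hypothetical sketch of such an adapter, not the plugin's or any CLI's actual API; file paths are used as the unit of retrieval because the ground truth in `queries.json` is expressed as file paths.

```python
from typing import Protocol

class MemoryProvider(Protocol):
    """Hypothetical adapter: each provider (bm search, memory-core, QMD) is wrapped
    so it answers a query with a ranked list of file paths, which the scorer then
    compares against the ground-truth paths for that query."""

    name: str

    def search(self, query: str, k: int = 5) -> list[str]:
        """Return up to k file paths, best match first."""
        ...
```
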
## Quick Start

```bash
# Prerequisites: bm CLI installed
# https://github.com/basicmachines-co/basic-memory

# Run the benchmark (small corpus, default)
just benchmark

# Verbose output (per-query details)
just benchmark-verbose

# Run all corpus sizes to see scaling behavior
just benchmark-all

# Run a specific size
just benchmark-medium
just benchmark-large
```

## Corpus Tiers

Three nested corpus sizes test how retrieval scales with data growth. Each tier is a superset of the previous — medium contains all of small, large contains all of medium.

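
Because the tiers are nested, the superset property is easy to sanity-check after editing a corpus. A minimal sketch, assuming the three directories sit side by side and that nesting means every relative markdown path in a smaller tier also exists in the next tier up:

```python
from pathlib import Path

def relative_files(root: str) -> set[Path]:
    """All markdown files under a corpus directory, as paths relative to its root."""
    base = Path(root)
    return {p.relative_to(base) for p in base.rglob("*.md")}

small, medium, large = map(relative_files, ["corpus-small", "corpus-medium", "corpus-large"])

# Each tier should contain every file of the tier below it.
assert small <= medium, f"missing from medium: {sorted(str(p) for p in small - medium)}"
assert medium <= large, f"missing from large: {sorted(str(p) for p in medium - large)}"
```
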
### Small (~10 files, ~12KB) — `corpus-small/`
A single day's work. Baseline: "does search work at all?"
- 1 MEMORY.md, 4 daily notes, 2 tasks, 2 people, 2 topics

### Medium (~35-40 files, ~50KB) — `corpus-medium/`
A working week. Tests noise resistance and temporal ranking.
- Everything in small + 7 more daily notes, 3 more tasks (incl. done), 3 more people, 3 more topics
- Done tasks that should NOT appear in active task queries
- More entities competing for relevance on each query
- 2-hop relation chains

### Large (~100-120 files, ~150-200KB) — `corpus-large/`
A month of accumulated knowledge. The real stress test.
- Everything in medium + 25 more daily notes, 10 more tasks, 10 more people/orgs, 15 more topics
- Deep needle-in-haystack: specific IDs buried in old notes
- 3+ hop relation chains
- Heavy cross-document synthesis requirements
- Stale vs fresh fact resolution at scale

### What scaling reveals

| Metric | Small → Medium | Medium → Large |
|--------|---------------|----------------|
| Recall@5 | Should hold steady | May degrade — more noise |
| MRR | Should hold steady | Ranking quality under pressure |
| Latency | Baseline | Index size impact |
| Content hit | High | Needle-in-haystack stress |

If recall drops significantly from small → large, that's the signal to improve chunking, ranking, or indexing.

## Queries

`benchmark/queries.json` contains 38 annotated queries with:
- Ground truth file paths (which files contain the answer)
- Expected content strings (for exact fact verification)
- Category labels (for per-category scoring)
- Notes explaining edge cases

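
A quick way to see the category mix is to group the annotations by label. The snippet below is a sketch that assumes `queries.json` is a flat JSON list and uses illustrative field names; the real schema is whatever the file defines.

```python
import json
from collections import defaultdict

# Assumed shape of one entry, mirroring the four annotations listed above
# (field names are illustrative, not the actual schema):
#   {"query": ..., "category": ..., "ground_truth_files": [...],
#    "expected_content": [...], "notes": ...}
with open("benchmark/queries.json") as f:
    queries = json.load(f)

by_category = defaultdict(list)
for q in queries:
    by_category[q["category"]].append(q)

for category, items in sorted(by_category.items()):
    print(f"{category}: {len(items)} queries")
```
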
## Results

Results are written to `benchmark/results/` as JSON with full per-query breakdowns:
- Overall metrics (recall, precision, MRR, latency)
- Category breakdown
- Individual query scores
- Failure analysis

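
Each run can then be summarized from its results file. The key names below (`overall`, `recall_at_5`, `mrr`, `latency_ms`) are assumptions about the output schema, shown only to illustrate reading the files back:

```python
import json
from pathlib import Path

# Print headline numbers for every run found under benchmark/results/.
# Key names are illustrative assumptions, not the actual results schema.
for path in sorted(Path("benchmark/results").glob("*.json")):
    run = json.loads(path.read_text())
    overall = run.get("overall", {})
    print(f"{path.name}: recall@5={overall.get('recall_at_5')} "
          f"mrr={overall.get('mrr')} latency_ms={overall.get('latency_ms')}")
```
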
## Contributing

We welcome contributions:
- **Add queries** — especially edge cases you've encountered
- **Expand the corpus** — more realistic memory patterns
- **Add providers** — help us compare against other memory systems
- **Challenge methodology** — if our scoring is unfair, tell us

## License

MIT — same as the plugin.