Commit 3c1265f

feat: use warm bm mcp session for bm-local benchmarks

phernandez committed
Signed-off-by: phernandez <paul@basicmachines.co>
1 parent 8e003ef commit 3c1265f

6 files changed
Lines changed: 526 additions & 44 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ __pycache__/
 benchmarks/runs/
 benchmarks/results/public/
 benchmarks/generated/
+benchmarks/logs/

 # Downloaded datasets (source distribution may be restricted)
 benchmarks/datasets/locomo/locomo10.json

AGENTS.md

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
# AGENTS.md - basic-memory-benchmarks Guide

## Project Overview

`basic-memory-benchmarks` is a standalone benchmark harness for comparing Basic Memory against other memory systems.

Primary goals:

- Deterministic retrieval benchmarks
- Optional LLM-as-a-judge benchmarks
- Public, reproducible artifact publication (including provenance metadata)

This repo is intentionally isolated from `basic-memory` so benchmark dependencies do not pollute the product repo.

## Build / Test Commands

- Install: `uv sync --group dev`
- Install judge extras: `uv sync --group dev --extra judge`
- Run tests: `uv run pytest -q`
- Lint: `uv run ruff check .`
- Type check: `uv run pyright`

Recommended local gate before pushing:

1. `uv run pytest -q`
2. `uv run ruff check .`
3. `uv run pyright`
## Benchmark Commands

CLI entrypoint: `uv run bm-bench ...`

Dataset and conversion:

- `uv run bm-bench datasets fetch --dataset locomo`
- `uv run bm-bench convert locomo --dataset-path benchmarks/datasets/locomo/locomo10.json --output-dir benchmarks/generated/locomo`

Run retrieval:

- `uv run bm-bench run retrieval --providers bm-local,mem0-local --dataset-id locomo --dataset-path benchmarks/datasets/locomo/locomo10.json --corpus-dir benchmarks/generated/locomo/docs --queries-path benchmarks/generated/locomo/queries.json --output-root benchmarks/runs --allow-provider-skip`

Run judge (optional):

- `uv run bm-bench run judge --run-dir benchmarks/runs/<run-id>`

Validate and publish:

- `uv run bm-bench validate-artifacts --run-dir benchmarks/runs/<run-id>`
- `uv run bm-bench publish --run-dir benchmarks/runs/<run-id> --destination benchmarks/results/public`

`just` shortcuts:

- `just bench-smoke`
- `just bench-fetch-locomo`
- `just bench-convert-locomo`
- `just bench-run-bm-local`
- `just bench-run-mem0-local`
- `just bench-run-full`
- `just bench-judge RUN_DIR=benchmarks/runs/<run-id>`
- `just bench-publish RUN_DIR=benchmarks/runs/<run-id>`
## Repository Layout

- `src/basic_memory_benchmarks/cli.py` - CLI surface
- `src/basic_memory_benchmarks/runner.py` - run orchestration
- `src/basic_memory_benchmarks/providers/` - provider adapters (`bm-local`, `bm-cloud`, `mem0-local`, `zep-reference`)
- `src/basic_memory_benchmarks/scoring/` - retrieval + judge scoring
- `src/basic_memory_benchmarks/reporting/` - artifact writers / comparison helpers
- `src/basic_memory_benchmarks/converters/` - dataset conversion logic
- `src/basic_memory_benchmarks/datasets/` - dataset fetch/load helpers
- `benchmarks/datasets/` - source metadata + download helpers
- `benchmarks/generated/` - generated corpus/query outputs
- `benchmarks/runs/` - raw run artifacts
- `benchmarks/results/public/` - published bundles
- `tests/`, `test-int/` - unit and integration tests
## Benchmark Integrity Rules

These are non-negotiable for headline comparisons:

1. Use the same query set and the same `top_k` across providers.
2. Do not apply provider-specific query rewriting for headline runs.
3. Keep official categories and the adversarial breakout separate.
4. Record provider status as `SKIPPED(reason)` explicitly; do not silently drop providers.
5. Always capture provenance in `manifest.json`:
   - benchmark repo SHA
   - BM source + resolved BM SHA
   - provider versions
   - dataset source + checksum
   - runtime metadata
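Rule 5's provenance capture could be sketched as a small manifest writer. This is a minimal illustration, not the repo's actual code: the field names, the `write_manifest` helper, and its signature are all assumptions.

```python
import hashlib
import json
import platform
from pathlib import Path


def write_manifest(
    run_dir: Path,
    *,
    bench_sha: str,
    bm_source: str,
    bm_sha: str,
    provider_versions: dict[str, str],
    dataset_source: str,
    dataset_path: Path,
) -> Path:
    """Write run provenance to manifest.json (illustrative field names)."""
    manifest = {
        "benchmark_repo_sha": bench_sha,
        "bm_source": bm_source,
        "bm_resolved_sha": bm_sha,
        "provider_versions": provider_versions,
        "dataset_source": dataset_source,
        # Checksum of the exact dataset file this run consumed.
        "dataset_sha256": hashlib.sha256(dataset_path.read_bytes()).hexdigest(),
        "runtime": {
            "python": platform.python_version(),
            "platform": platform.platform(),
        },
    }
    out = run_dir / "manifest.json"
    out.write_text(json.dumps(manifest, indent=2))
    return out
```

Writing the checksum and runtime snapshot at run time, rather than reconstructing them later, is what makes published comparisons auditable.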
## Provider Notes

### Basic Memory (`bm-local`)

- Interact via the external `bm` CLI contract, not via internal imports from `basic-memory`.
- Repeated runs against the same corpus path may reuse an existing BM project name.
- The benchmark command is typically invoked through `uv run ...` so that `.venv/bin/bm` is used.
### Mem0 (`mem0-local`)

- Requires `OPENAI_API_KEY` (or equivalent configured model credentials).
- Ingest and search use a stable benchmark `user_id` namespace per run.
- Store source metadata (`source_doc_id`, `source_path`, `conversation_id`, `dataset_id`) for grounding.
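The per-run namespace and grounding metadata above could be assembled like this. The `bench-<run-id>` naming and the `mem0_record` helper are hypothetical; how the pair is actually passed into the `mem0` client is an assumption.

```python
def mem0_record(
    run_id: str,
    *,
    source_doc_id: str,
    source_path: str,
    conversation_id: str,
    dataset_id: str,
) -> tuple[str, dict]:
    """Build the (user_id, metadata) pair attached to one ingested record."""
    # Stable per-run namespace so searches never mix runs.
    user_id = f"bench-{run_id}"
    metadata = {
        "source_doc_id": source_doc_id,
        "source_path": source_path,
        "conversation_id": conversation_id,
        "dataset_id": dataset_id,
    }
    # Hypothetical usage: memory.add(text, user_id=user_id, metadata=metadata)
    return user_id, metadata
```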
## Environment / Secrets

- Keep secrets in `.env` (already gitignored).
- Avoid exporting unrelated `BASIC_MEMORY_*` environment variables into benchmark runs unless intended.
- Prefer setting only required credentials for run reproducibility.
## Dataset Policy

- If redistribution is allowed: publish snapshot + checksum.
- If restricted: publish canonical source link + downloader + checksum verification.
- Always publish conversion code and run artifacts.
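The checksum verification step for restricted datasets might look like this minimal sketch; `verify_checksum` is a hypothetical helper, not the repo's actual API.

```python
import hashlib
from pathlib import Path


def verify_checksum(path: Path, expected_sha256: str, chunk_size: int = 1 << 20) -> None:
    """Compare a downloaded file's sha256 against the published value.

    Raises ValueError on mismatch so a corrupted or wrong download
    fails fast instead of silently skewing a benchmark run.
    """
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Stream in chunks so large dataset files never load whole into memory.
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    actual = digest.hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch for {path}: {actual} != {expected_sha256}")
```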
## Coding Guidelines

- Python 3.12+ style with type hints.
- Keep diffs focused and minimal.
- Fail fast; do not silently swallow benchmark-critical failures.
- Use `apply_patch` for targeted edits when practical.
- Add tests for behavior changes in adapters, scoring, or artifact schemas.

## Git / Collaboration

- Use non-interactive git commands.
- Sign commits: `git commit -s`.
- Do not commit `.env`, generated runs, or local editor state.
- If changing benchmark behavior, include a brief note in the PR/commit describing the fairness or reproducibility impact.

README.md

Lines changed: 8 additions & 2 deletions
@@ -12,7 +12,7 @@ Standalone, reproducible benchmark suite for comparing Basic Memory against comp
 ## Current v1 Scope

 - Providers:
-  - `bm-local`
+  - `bm-local` (warm `bm mcp` stdio session)
   - `bm-cloud` (optional, credential-gated)
   - `mem0-local`
   - `zep-reference` (reference-only in v1)
@@ -95,6 +95,13 @@ export OPENAI_API_KEY=...

 If unavailable, provider status will be recorded as `SKIPPED(reason)`.

+## BM indexing readiness
+
+`bm-local` verifies index readiness before querying.
+
+- If the installed `bm` supports `bm status --json`, readiness is polled from that output.
+- If `--json` is not available in the installed `bm`, the benchmark proceeds after reindex.
+
 ## Run Artifacts

 Per run (`benchmarks/runs/<run-id>/`):
@@ -125,4 +132,3 @@ just bench-publish RUN_DIR=benchmarks/runs/<run-id>
 Dataset publication follows licensing constraints:
 - If redistribution is permitted: snapshot + checksum may be published.
 - If not: canonical source links + downloader + checksum verification are published.
-

docs/benchmarks.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@

 ## Providers

-- `bm-local`: Basic Memory local execution via external CLI/MCP contract
+- `bm-local`: Basic Memory local execution via warm `bm mcp` stdio session
 - `bm-cloud`: Optional cloud mode (credential gated)
 - `mem0-local`: Mem0 package execution in local environment
 - `zep-reference`: reference-only placeholder in v1
