# AGENTS.md - basic-memory-benchmarks Guide

## Project Overview

`basic-memory-benchmarks` is a standalone benchmark harness for comparing Basic Memory against other memory systems.

Primary goals:
- Deterministic retrieval benchmarks
- Optional LLM-as-a-judge benchmarks
- Public, reproducible artifact publication (including provenance metadata)

This repo is intentionally isolated from `basic-memory` so that benchmark dependencies do not pollute the product repo.

## Build / Test Commands

- Install: `uv sync --group dev`
- Install judge extras: `uv sync --group dev --extra judge`
- Run tests: `uv run pytest -q`
- Lint: `uv run ruff check .`
- Type check: `uv run pyright`

Recommended local gate before pushing:
1. `uv run pytest -q`
2. `uv run ruff check .`
3. `uv run pyright`

## Benchmark Commands

CLI entrypoint: `uv run bm-bench ...`

Dataset and conversion:
- `uv run bm-bench datasets fetch --dataset locomo`
- `uv run bm-bench convert locomo --dataset-path benchmarks/datasets/locomo/locomo10.json --output-dir benchmarks/generated/locomo`

Run retrieval:
- `uv run bm-bench run retrieval --providers bm-local,mem0-local --dataset-id locomo --dataset-path benchmarks/datasets/locomo/locomo10.json --corpus-dir benchmarks/generated/locomo/docs --queries-path benchmarks/generated/locomo/queries.json --output-root benchmarks/runs --allow-provider-skip`

Run judge (optional):
- `uv run bm-bench run judge --run-dir benchmarks/runs/<run-id>`

Validate and publish:
- `uv run bm-bench validate-artifacts --run-dir benchmarks/runs/<run-id>`
- `uv run bm-bench publish --run-dir benchmarks/runs/<run-id> --destination benchmarks/results/public`

`just` shortcuts:
- `just bench-smoke`
- `just bench-fetch-locomo`
- `just bench-convert-locomo`
- `just bench-run-bm-local`
- `just bench-run-mem0-local`
- `just bench-run-full`
- `just bench-judge RUN_DIR=benchmarks/runs/<run-id>`
- `just bench-publish RUN_DIR=benchmarks/runs/<run-id>`

## Repository Layout

- `src/basic_memory_benchmarks/cli.py` - CLI surface
- `src/basic_memory_benchmarks/runner.py` - run orchestration
- `src/basic_memory_benchmarks/providers/` - provider adapters (`bm-local`, `bm-cloud`, `mem0-local`, `zep-reference`)
- `src/basic_memory_benchmarks/scoring/` - retrieval and judge scoring
- `src/basic_memory_benchmarks/reporting/` - artifact writers and comparison helpers
- `src/basic_memory_benchmarks/converters/` - dataset conversion logic
- `src/basic_memory_benchmarks/datasets/` - dataset fetch/load helpers
- `benchmarks/datasets/` - source metadata and download helpers
- `benchmarks/generated/` - generated corpus/query outputs
- `benchmarks/runs/` - raw run artifacts
- `benchmarks/results/public/` - published bundles
- `tests/`, `test-int/` - unit and integration tests

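The adapters under `src/basic_memory_benchmarks/providers/` share a common contract. The actual interface lives in that package; the sketch below is a hypothetical minimal version (the `Provider` name and the `ingest`/`search` method signatures are assumptions, not the repo's real API) showing the shape such an adapter takes:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class SearchHit:
    """One retrieved document with its relevance score."""
    doc_id: str
    score: float


@runtime_checkable
class Provider(Protocol):
    """Hypothetical adapter contract; real method names may differ."""

    def ingest(self, corpus_dir: str) -> int:
        """Load documents from corpus_dir; return the count ingested."""
        ...

    def search(self, query: str, top_k: int) -> list[SearchHit]:
        """Return up to top_k hits, best first."""
        ...


class InMemoryProvider:
    """Trivial substring-match provider, useful only for smoke tests."""

    def __init__(self) -> None:
        self.docs: dict[str, str] = {}

    def ingest(self, corpus_dir: str) -> int:  # corpus_dir unused in this toy
        self.docs = {"doc-1": "basic memory benchmark", "doc-2": "unrelated text"}
        return len(self.docs)

    def search(self, query: str, top_k: int) -> list[SearchHit]:
        hits = [
            SearchHit(doc_id, 1.0)
            for doc_id, text in self.docs.items()
            if query.lower() in text.lower()
        ]
        return hits[:top_k]
```

Keeping the contract this narrow is what lets the runner treat `bm-local` and `mem0-local` interchangeably in a headline run.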
## Benchmark Integrity Rules

These rules are non-negotiable for headline comparisons:

1. Use the same query set and the same `top_k` across providers.
2. Do not apply provider-specific query rewriting for headline runs.
3. Keep official categories and the adversarial breakout separate.
4. Record provider `SKIPPED(reason)` explicitly; never silently drop a provider.
5. Always capture provenance in `manifest.json`:
   - benchmark repo SHA
   - BM source + resolved BM SHA
   - provider versions
   - dataset source + checksum
   - runtime metadata

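Writing that provenance can be sketched as follows. This is a minimal illustration only: the field names and the `write_manifest` helper are assumptions, not the repo's actual artifact schema.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a dataset file for the manifest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_manifest(run_dir: Path, dataset_path: Path, bench_sha: str, bm_sha: str) -> Path:
    """Write manifest.json with illustrative provenance fields.

    Field names here are hypothetical; match the real schema in the repo.
    """
    manifest = {
        "benchmark_repo_sha": bench_sha,
        "bm_source_sha": bm_sha,
        "dataset": {
            "path": str(dataset_path),
            "sha256": sha256_of(dataset_path),
        },
    }
    out = run_dir / "manifest.json"
    out.write_text(json.dumps(manifest, indent=2))
    return out
```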
## Provider Notes

### Basic Memory (`bm-local`)

- Interact through the external `bm` CLI contract; never import `basic-memory` internals directly.
- Repeated runs against the same corpus path may reuse an existing BM project name.
- The benchmark command is typically invoked through `uv run ...` so that `.venv/bin/bm` is resolved.

### Mem0 (`mem0-local`)

- Requires `OPENAI_API_KEY` (or equivalent configured model credentials).
- Ingest and search use a stable benchmark `user_id` namespace per run.
- Store source metadata (`source_doc_id`, `source_path`, `conversation_id`, `dataset_id`) for grounding.

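The grounding metadata listed above can be carried as a small record attached to each ingested item. A sketch, assuming a frozen dataclass wrapper (the `SourceMetadata` name and `as_mem0_metadata` helper are hypothetical; the actual mem0 call is not shown):

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceMetadata:
    """Grounding fields attached to each item ingested into mem0."""
    source_doc_id: str
    source_path: str
    conversation_id: str
    dataset_id: str

    def as_mem0_metadata(self) -> dict[str, str]:
        """Plain dict form, suitable for passing as a metadata mapping."""
        return asdict(self)
```

Freezing the dataclass guards against a provider adapter mutating grounding fields mid-run.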
## Environment / Secrets

- Keep secrets in `.env` (already gitignored).
- Do not export unrelated `BASIC_MEMORY_*` environment variables into benchmark runs unless that is intended.
- Prefer setting only the credentials a run requires; this keeps runs reproducible.

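One way to honor the last two points is to build the child-process environment explicitly instead of inheriting the shell's. A sketch, assuming an allow-list approach (the `benchmark_env` helper is hypothetical, not part of the repo):

```python
import os


def benchmark_env(required: list[str]) -> dict[str, str]:
    """Build a minimal environment for a benchmark subprocess.

    Keeps PATH plus only the named credentials; anything else,
    including stray BASIC_MEMORY_* variables, is dropped.
    """
    env: dict[str, str] = {}
    for name in ["PATH", *required]:
        value = os.environ.get(name)
        if value is not None:
            env[name] = value
    return env
```

The resulting dict can be passed as `env=` to `subprocess.run`, so each run records exactly which variables it saw.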
## Dataset Policy

- If redistribution is allowed: publish the snapshot plus its checksum.
- If restricted: publish the canonical source link, a downloader, and checksum verification.
- Always publish conversion code and run artifacts.

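The checksum verification in the restricted case can be sketched as below (the function name and the choice of `ValueError` are assumptions; the repo's downloader may differ):

```python
import hashlib
from pathlib import Path


def verify_checksum(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if a downloaded dataset does not match its pinned checksum."""
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    if actual != expected_sha256:
        raise ValueError(
            f"checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
```

Failing loudly here matches the "fail fast" coding guideline: a silently substituted dataset would invalidate every downstream comparison.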
## Coding Guidelines

- Python 3.12+ style with type hints.
- Keep diffs focused and minimal.
- Fail fast; do not silently swallow benchmark-critical failures.
- Use `apply_patch` for targeted edits when practical.
- Add tests for behavior changes in adapters, scoring, or artifact schemas.

## Git / Collaboration

- Use non-interactive git commands.
- Sign commits: `git commit -s`.
- Do not commit `.env`, generated runs, or local editor state.
- When changing benchmark behavior, include a brief note in the PR/commit describing the fairness or reproducibility impact.