# AGENTS.md - basic-memory-benchmarks Guide

## Project Overview

`basic-memory-benchmarks` is a standalone benchmark harness for comparing Basic Memory against other memory systems.

Primary goals:
- Deterministic retrieval benchmarks
- Optional LLM-as-a-judge benchmarks
- Public, reproducible artifact publication (including provenance metadata)

This repo is intentionally isolated from `basic-memory` so that benchmark dependencies do not pollute the product repo.

## Build / Test Commands

- Install: `uv sync --group dev`
- Install judge extras: `uv sync --group dev --extra judge`
- Run tests: `uv run pytest -q`
- Lint: `uv run ruff check .`
- Type check: `uv run pyright`

Recommended local gate before pushing:
1. `uv run pytest -q`
2. `uv run ruff check .`
3. `uv run pyright`

## Benchmark Commands

CLI entrypoint: `uv run bm-bench ...`

Dataset and conversion:
- `uv run bm-bench datasets fetch --dataset locomo`
- `uv run bm-bench convert locomo --dataset-path benchmarks/datasets/locomo/locomo10.json --output-dir benchmarks/generated/locomo`

Run retrieval:
- `uv run bm-bench run retrieval --providers bm-local,mem0-local --dataset-id locomo --dataset-path benchmarks/datasets/locomo/locomo10.json --corpus-dir benchmarks/generated/locomo/docs --queries-path benchmarks/generated/locomo/queries.json --output-root benchmarks/runs --allow-provider-skip`

Run judge (optional):
- `uv run bm-bench run judge --run-dir benchmarks/runs/<run-id>`

Validate and publish:
- `uv run bm-bench validate-artifacts --run-dir benchmarks/runs/<run-id>`
- `uv run bm-bench publish --run-dir benchmarks/runs/<run-id> --destination benchmarks/results/public`

`just` shortcuts:
- `just bench-smoke`
- `just bench-fetch-locomo`
- `just bench-convert-locomo`
- `just bench-run-bm-local`
- `just bench-run-mem0-local`
- `just bench-run-full`
- `just bench-judge RUN_DIR=benchmarks/runs/<run-id>`
- `just bench-publish RUN_DIR=benchmarks/runs/<run-id>`

## Repository Layout

- `src/basic_memory_benchmarks/cli.py` - CLI surface
- `src/basic_memory_benchmarks/runner.py` - run orchestration
- `src/basic_memory_benchmarks/providers/` - provider adapters (`bm-local`, `bm-cloud`, `mem0-local`, `zep-reference`)
- `src/basic_memory_benchmarks/scoring/` - retrieval and judge scoring
- `src/basic_memory_benchmarks/reporting/` - artifact writers and comparison helpers
- `src/basic_memory_benchmarks/converters/` - dataset conversion logic
- `src/basic_memory_benchmarks/datasets/` - dataset fetch/load helpers
- `benchmarks/datasets/` - source metadata and download helpers
- `benchmarks/generated/` - generated corpus/query outputs
- `benchmarks/runs/` - raw run artifacts
- `benchmarks/results/public/` - published bundles
- `tests/`, `test-int/` - unit and integration tests

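The adapters under `src/basic_memory_benchmarks/providers/` share a common contract. The actual interface lives in that package; the sketch below is a hypothetical minimal version (the `Provider` name and the `ingest`/`search` method signatures are assumptions, not the repo's real API) showing the shape such an adapter takes:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class SearchHit:
    """One retrieved document with its relevance score."""
    doc_id: str
    score: float


@runtime_checkable
class Provider(Protocol):
    """Hypothetical adapter contract; real method names may differ."""

    def ingest(self, corpus_dir: str) -> int:
        """Load documents from corpus_dir; return the count ingested."""
        ...

    def search(self, query: str, top_k: int) -> list[SearchHit]:
        """Return up to top_k hits, best first."""
        ...


class InMemoryProvider:
    """Trivial substring-match provider, useful only for smoke tests."""

    def __init__(self) -> None:
        self.docs: dict[str, str] = {}

    def ingest(self, corpus_dir: str) -> int:  # corpus_dir unused in this toy
        self.docs = {"doc-1": "basic memory benchmark", "doc-2": "unrelated text"}
        return len(self.docs)

    def search(self, query: str, top_k: int) -> list[SearchHit]:
        hits = [
            SearchHit(doc_id, 1.0)
            for doc_id, text in self.docs.items()
            if query.lower() in text.lower()
        ]
        return hits[:top_k]
```

Keeping the contract this narrow is what lets the runner treat `bm-local` and `mem0-local` interchangeably in a headline run.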
## Benchmark Integrity Rules

These rules are non-negotiable for headline comparisons:

1. Use the same query set and the same `top_k` across providers.
2. Do not apply provider-specific query rewriting for headline runs.
3. Keep official categories and the adversarial breakout separate.
4. Record provider `SKIPPED(reason)` explicitly; never silently drop a provider.
5. Always capture provenance in `manifest.json`:
   - benchmark repo SHA
   - BM source + resolved BM SHA
   - provider versions
   - dataset source + checksum
   - runtime metadata

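Writing that provenance can be sketched as follows. This is a minimal illustration only: the field names and the `write_manifest` helper are assumptions, not the repo's actual artifact schema.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a dataset file for the manifest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_manifest(run_dir: Path, dataset_path: Path, bench_sha: str, bm_sha: str) -> Path:
    """Write manifest.json with illustrative provenance fields.

    Field names here are hypothetical; match the real schema in the repo.
    """
    manifest = {
        "benchmark_repo_sha": bench_sha,
        "bm_source_sha": bm_sha,
        "dataset": {
            "path": str(dataset_path),
            "sha256": sha256_of(dataset_path),
        },
    }
    out = run_dir / "manifest.json"
    out.write_text(json.dumps(manifest, indent=2))
    return out
```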
## Provider Notes

### Basic Memory (`bm-local`)

- Interact through the external `bm` CLI contract; never import `basic-memory` internals directly.
- Repeated runs against the same corpus path may reuse an existing BM project name.
- The benchmark command is typically invoked through `uv run ...` so that `.venv/bin/bm` is resolved.

### Mem0 (`mem0-local`)

- Requires `OPENAI_API_KEY` (or equivalent configured model credentials).
- Ingest and search use a stable benchmark `user_id` namespace per run.
- Store source metadata (`source_doc_id`, `source_path`, `conversation_id`, `dataset_id`) for grounding.

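The grounding metadata listed above can be carried as a small record attached to each ingested item. A sketch, assuming a frozen dataclass wrapper (the `SourceMetadata` name and `as_mem0_metadata` helper are hypothetical; the actual mem0 call is not shown):

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceMetadata:
    """Grounding fields attached to each item ingested into mem0."""
    source_doc_id: str
    source_path: str
    conversation_id: str
    dataset_id: str

    def as_mem0_metadata(self) -> dict[str, str]:
        """Plain dict form, suitable for passing as a metadata mapping."""
        return asdict(self)
```

Freezing the dataclass guards against a provider adapter mutating grounding fields mid-run.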
## Environment / Secrets

- Keep secrets in `.env` (already gitignored).
- Do not export unrelated `BASIC_MEMORY_*` environment variables into benchmark runs unless that is intended.
- Prefer setting only the credentials a run requires; this keeps runs reproducible.

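One way to honor the last two points is to build the child-process environment explicitly instead of inheriting the shell's. A sketch, assuming an allow-list approach (the `benchmark_env` helper is hypothetical, not part of the repo):

```python
import os


def benchmark_env(required: list[str]) -> dict[str, str]:
    """Build a minimal environment for a benchmark subprocess.

    Keeps PATH plus only the named credentials; anything else,
    including stray BASIC_MEMORY_* variables, is dropped.
    """
    env: dict[str, str] = {}
    for name in ["PATH", *required]:
        value = os.environ.get(name)
        if value is not None:
            env[name] = value
    return env
```

The resulting dict can be passed as `env=` to `subprocess.run`, so each run records exactly which variables it saw.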
## Dataset Policy

- If redistribution is allowed: publish the snapshot plus its checksum.
- If restricted: publish the canonical source link, a downloader, and checksum verification.
- Always publish conversion code and run artifacts.

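The checksum verification in the restricted case can be sketched as below (the function name and the choice of `ValueError` are assumptions; the repo's downloader may differ):

```python
import hashlib
from pathlib import Path


def verify_checksum(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if a downloaded dataset does not match its pinned checksum."""
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    if actual != expected_sha256:
        raise ValueError(
            f"checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
```

Failing loudly here matches the "fail fast" coding guideline: a silently substituted dataset would invalidate every downstream comparison.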
## Coding Guidelines

- Python 3.12+ style with type hints.
- Keep diffs focused and minimal.
- Fail fast; do not silently swallow benchmark-critical failures.
- Use `apply_patch` for targeted edits when practical.
- Add tests for behavior changes in adapters, scoring, or artifact schemas.

## Git / Collaboration

- Use non-interactive git commands.
- Sign commits: `git commit -s`.
- Do not commit `.env`, generated runs, or local editor state.
- When changing benchmark behavior, include a brief note in the PR/commit describing the fairness or reproducibility impact.