Commit 5f9704a

feat: scaffold standalone basic-memory-benchmarks repo
Signed-off-by: phernandez <paul@basicmachines.co>
0 parents  commit 5f9704a

50 files changed

Lines changed: 5666 additions & 0 deletions

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+name: benchmark-nightly
+
+on:
+  schedule:
+    - cron: "0 8 * * *"
+  workflow_dispatch:
+
+jobs:
+  nightly:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: astral-sh/setup-uv@v4
+
+      - name: Install dependencies
+        run: uv sync --group dev --extra judge
+
+      - name: Fetch LoCoMo
+        run: just bench-fetch-locomo
+
+      - name: Convert LoCoMo
+        run: just bench-convert-locomo
+
+      - name: Run retrieval benchmark
+        run: |
+          uv run bm-bench run retrieval \
+            --providers bm-local,bm-cloud,mem0-local \
+            --dataset-id locomo \
+            --dataset-path benchmarks/datasets/locomo/locomo10.json \
+            --corpus-dir benchmarks/generated/locomo/docs \
+            --queries-path benchmarks/generated/locomo/queries.json \
+            --allow-provider-skip
+
+      - name: Run judge benchmark (best effort)
+        run: |
+          LATEST_RUN=$(ls -1t benchmarks/runs | head -n1)
+          uv run bm-bench run judge --run-dir "benchmarks/runs/${LATEST_RUN}" || true
+
+      - name: Upload benchmark artifacts
+        uses: actions/upload-artifact@v4
+        with:
+          name: benchmark-nightly-artifacts
+          path: benchmarks/runs
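
Note on the `LATEST_RUN` step in the workflow above: `ls -1t benchmarks/runs | head -n1` picks the most recently modified entry, and yields an empty string when the directory has no entries, so `--run-dir` would then point at `benchmarks/runs/` itself. The sketch below shows the same "newest run" selection in Python with that empty case made explicit. It is only an illustration: `latest_run_dir` is a hypothetical helper, not something this commit is known to contain, and it considers directories only, which is an assumption about how runs are laid out.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(runs_root: Path) -> Optional[Path]:
    """Return the most recently modified subdirectory of runs_root.

    Returns None when runs_root does not exist or holds no directories,
    so the caller can decide whether to skip the judge step.
    """
    if not runs_root.is_dir():
        return None
    # Only directories are considered; stray files are ignored (assumption).
    candidates = [entry for entry in runs_root.iterdir() if entry.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry.stat().st_mtime)
```

A caller could pass `Path("benchmarks/runs")` and skip the judge run when the result is `None`, rather than relying on `|| true` to absorb the failure.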
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+name: benchmark-smoke
+
+on:
+  pull_request:
+  workflow_dispatch:
+
+jobs:
+  smoke:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: astral-sh/setup-uv@v4
+
+      - name: Install dependencies
+        run: uv sync --group dev
+
+      - name: Run smoke benchmark
+        run: just bench-smoke
+
+      - name: Validate latest artifacts
+        run: |
+          LATEST_RUN=$(ls -1t benchmarks/runs | head -n1)
+          uv run bm-bench validate-artifacts --run-dir "benchmarks/runs/${LATEST_RUN}"
+
+      - name: Upload benchmark artifacts
+        uses: actions/upload-artifact@v4
+        with:
+          name: benchmark-smoke-artifacts
+          path: benchmarks/runs
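
Note on the "Validate latest artifacts" step above: the diff shown here does not include the implementation of `bm-bench validate-artifacts`, so what it checks is not visible. As a rough illustration of what such a check could look like, the sketch below verifies that a run directory contains the non-optional files named in the README's "Run Artifacts" list and that the `.json` ones parse. The function name `check_run_dir` and the choice of which files count as required are assumptions made for this example.

```python
import json
from pathlib import Path

# File names come from the README's "Run Artifacts" list. Treating every
# entry not marked "(optional)" as required is an assumption of this sketch.
REQUIRED_ARTIFACTS = (
    "manifest.json",
    "provider-status.json",
    "per-query-retrieval.jsonl",
    "retrieval-summary.json",
    "summary.md",
)


def check_run_dir(run_dir: Path) -> list[str]:
    """Return a list of human-readable problems; empty means the run looks complete."""
    problems: list[str] = []
    for name in REQUIRED_ARTIFACTS:
        path = run_dir / name
        if not path.is_file():
            problems.append(f"missing: {name}")
        elif name.endswith(".json"):
            # Only plain .json files are parsed here; .jsonl and .md are
            # checked for presence only.
            try:
                json.loads(path.read_text(encoding="utf-8"))
            except json.JSONDecodeError as exc:
                problems.append(f"invalid JSON in {name}: {exc.msg}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a CI log show everything that is wrong with a run in a single pass.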

.gitignore

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+__pycache__/
+*.pyc
+.venv/
+.pytest_cache/
+.ruff_cache/
+
+# Generated benchmark outputs
+benchmarks/runs/
+benchmarks/results/public/
+benchmarks/generated/
+
+# Downloaded datasets (source distribution may be restricted)
+benchmarks/datasets/locomo/locomo10.json
+benchmarks/datasets/locomo/locomo10.provenance.json

.python-version

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+3.13

README.md

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
+# basic-memory-benchmarks
+
+Standalone, reproducible benchmark suite for comparing Basic Memory against competitor memory systems.
+
+## Goals
+
+- Deterministic retrieval benchmarks (Recall@5/10, MRR, Precision@5, content-hit, latency)
+- Optional LLM-as-judge scoring (Pydantic Evals)
+- Public artifacts with provenance and reproducibility metadata
+- Clean dependency isolation from the core `basic-memory` repository
+
+## Current v1 Scope
+
+- Providers:
+  - `bm-local`
+  - `bm-cloud` (optional, credential-gated)
+  - `mem0-local`
+  - `zep-reference` (reference-only in v1)
+- Datasets:
+  - LoCoMo (primary)
+  - LongMemEval scaffold (placeholder)
+  - Built-in synthetic smoke corpus
+
+## Installation
+
+```bash
+uv sync --group dev
+```
+
+Optional judge dependencies:
+
+```bash
+uv sync --group dev --extra judge
+```
+
+## Quickstart
+
+### 1) Fetch LoCoMo dataset
+
+```bash
+uv run bm-bench datasets fetch --dataset locomo
+```
+
+### 2) Convert LoCoMo into benchmark corpus
+
+```bash
+uv run bm-bench convert locomo
+```
+
+### 3) Run retrieval benchmark
+
+```bash
+uv run bm-bench run retrieval \
+  --providers bm-local,mem0-local \
+  --corpus-dir benchmarks/generated/locomo/docs \
+  --queries-path benchmarks/generated/locomo/queries.json
+```
+
+### 4) Optional judge benchmark
+
+```bash
+uv run bm-bench run judge --run-dir benchmarks/runs/<run-id>
+```
+
+### 5) Publish run artifacts
+
+```bash
+uv run bm-bench publish --run-dir benchmarks/runs/<run-id>
+```
+
+## Basic Memory source policy
+
+By default this project tracks Basic Memory from `main`.
+
+Each run manifest stores:
+- BM source (`github main` or local path override)
+- resolved BM commit SHA
+
+Local override:
+
+```bash
+uv run bm-bench run retrieval \
+  --bm-local-path /Users/phernandez/dev/basicmachines/basic-memory
+```
+
+## Mem0 local requirements
+
+`mem0-local` requires model credentials available in environment.
+
+At minimum, set:
+
+```bash
+export OPENAI_API_KEY=...
+```
+
+If unavailable, provider status will be recorded as `SKIPPED(reason)`.
+
+## Run Artifacts
+
+Per run (`benchmarks/runs/<run-id>/`):
+
+- `manifest.json`
+- `provider-status.json`
+- `per-query-retrieval.jsonl`
+- `retrieval-summary.json`
+- `per-query-judge.jsonl` (optional)
+- `judge-summary.json` (optional)
+- `summary.md`
+
+## Just commands
+
+```bash
+just bench-smoke
+just bench-fetch-locomo
+just bench-convert-locomo
+just bench-run-bm-local
+just bench-run-mem0-local
+just bench-run-full
+just bench-judge
+just bench-publish RUN_DIR=benchmarks/runs/<run-id>
+```
+
+## Notes on dataset publication
+
+Dataset publication follows licensing constraints:
+- If redistribution is permitted: snapshot + checksum may be published.
+- If not: canonical source links + downloader + checksum verification are published.
+
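
Note on the README's "Goals" list: it names Recall@5/10, MRR, and Precision@5 but the diff shown here does not include the scoring code. For readers unfamiliar with these metrics, the sketch below gives their textbook definitions for a single query's ranked result list, plus MRR as the mean over queries. The function names are illustrative and the code is not taken from this repository. It assumes each ranked list contains unique document IDs and that precision uses `k` as the denominator even when fewer than `k` results are returned.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots occupied by relevant documents."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(reciprocal_ranks: Sequence[float]) -> float:
    """MRR: the average of per-query reciprocal ranks."""
    if not reciprocal_ranks:
        return 0.0
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```

For example, with results `["d3", "d1", "d7", "d2", "d9"]` and relevant set `{"d1", "d2", "d4"}`, two of the three relevant documents are retrieved in the top five, and the first relevant hit is at rank 2.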

benchmarks/datasets/README.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Dataset Policy
+
+This repository publishes benchmark provenance in the open.
+
+- If redistribution is allowed, snapshots may be stored in-repo with checksums.
+- If redistribution is restricted, we publish canonical source links, download scripts,
+  and checksums so anyone can reproduce runs.
+
+Always published:
+- conversion scripts
+- query manifests
+- run artifacts
+- provenance metadata
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+# LoCoMo Dataset
+
+Primary v1 dataset for benchmark runs.
+
+- Canonical source URL is configured in `src/basic_memory_benchmarks/datasets/locomo.py`.
+- Download via `bm-bench datasets fetch --dataset locomo`.
+- Provenance checksum is written alongside the downloaded file.
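
Note on "Provenance checksum is written alongside the downloaded file": the fetch implementation in `locomo.py` is not part of the diff shown here. The sketch below illustrates one way a checksum sidecar could be written and later verified. The sidecar name follows the `locomo10.provenance.json` entry in `.gitignore`; the record fields (`source_url`, `sha256`, `size_bytes`) and the function names are assumptions made for this example, not the repository's actual schema.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_path_for(dataset_path: Path) -> Path:
    """Map e.g. locomo10.json to locomo10.provenance.json in the same directory."""
    return dataset_path.with_name(dataset_path.stem + ".provenance.json")


def write_provenance(dataset_path: Path, source_url: str) -> Path:
    """Write a sidecar record next to the dataset (field names are hypothetical)."""
    record = {
        "source_url": source_url,
        "sha256": sha256_of(dataset_path),
        "size_bytes": dataset_path.stat().st_size,
    }
    sidecar = provenance_path_for(dataset_path)
    sidecar.write_text(json.dumps(record, indent=2) + "\n", encoding="utf-8")
    return sidecar


def verify_provenance(dataset_path: Path) -> bool:
    """Return True when the dataset still matches the recorded checksum."""
    sidecar = provenance_path_for(dataset_path)
    record = json.loads(sidecar.read_text(encoding="utf-8"))
    return record["sha256"] == sha256_of(dataset_path)
```

Keeping verification separate from download means a cached copy can be checked before a run without touching the network.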
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+uv run bm-bench datasets fetch --dataset locomo --output benchmarks/datasets/locomo/locomo10.json
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+{
+  "dataset_id": "locomo",
+  "source_url": "https://raw.githubusercontent.com/snap-research/locomo/main/data/locomo10.json",
+  "citation": "LoCoMo (ACL 2024, Snap Research)",
+  "license_note": "Check upstream dataset terms before redistributing snapshots."
+}
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# LongMemEval (Scaffold)
+
+LongMemEval integration is scaffolded in v1 and will be implemented in a follow-up.
