basicmachines-co
diff --git a/.github/workflows/benchmark-nightly.yml b/.github/workflows/benchmark-nightly.yml
Lines changed: 44 additions & 0 deletions
diff --git a/.github/workflows/benchmark-smoke.yml b/.github/workflows/benchmark-smoke.yml
Lines changed: 30 additions & 0 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 14 additions & 0 deletions
diff --git a/.python-version b/.python-version
Lines changed: 1 addition & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 128 additions & 0 deletions
diff --git a/benchmarks/datasets/README.md b/benchmarks/datasets/README.md
Lines changed: 13 additions & 0 deletions
diff --git a/benchmarks/datasets/locomo/README.md b/benchmarks/datasets/locomo/README.md
Lines changed: 7 additions & 0 deletions
diff --git a/benchmarks/datasets/locomo/download.sh b/benchmarks/datasets/locomo/download.sh
Lines changed: 4 additions & 0 deletions
diff --git a/benchmarks/datasets/locomo/source.json b/benchmarks/datasets/locomo/source.json
Lines changed: 6 additions & 0 deletions
diff --git a/benchmarks/datasets/longmemeval/README.md b/benchmarks/datasets/longmemeval/README.md
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,44 @@
+name: benchmark-nightly
+
+on:
+  schedule:
+    - cron: "0 8 * * *"
+  workflow_dispatch:
+
+jobs:
+  nightly:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: astral-sh/setup-uv@v4
+
+      - name: Install dependencies
+        run: uv sync --group dev --extra judge
+
+      - name: Fetch LoCoMo
+        run: just bench-fetch-locomo
+
+      - name: Convert LoCoMo
+        run: just bench-convert-locomo
+
+      - name: Run retrieval benchmark
+        run: |
+          uv run bm-bench run retrieval \
+            --providers bm-local,bm-cloud,mem0-local \
+            --dataset-id locomo \
+            --dataset-path benchmarks/datasets/locomo/locomo10.json \
+            --corpus-dir benchmarks/generated/locomo/docs \
+            --queries-path benchmarks/generated/locomo/queries.json \
+            --allow-provider-skip
+
+      - name: Run judge benchmark (best effort)
+        run: |
+          LATEST_RUN=$(ls -1t benchmarks/runs | head -n1)
+          uv run bm-bench run judge --run-dir "benchmarks/runs/${LATEST_RUN}" || true
+
+      - name: Upload benchmark artifacts
+        uses: actions/upload-artifact@v4
+        with:
+          name: benchmark-nightly-artifacts
+          path: benchmarks/runs
@@ -0,0 +1,30 @@
+name: benchmark-smoke
+
+on:
+  pull_request:
+  workflow_dispatch:
+
+jobs:
+  smoke:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: astral-sh/setup-uv@v4
+
+      - name: Install dependencies
+        run: uv sync --group dev
+
+      - name: Run smoke benchmark
+        run: just bench-smoke
+
+      - name: Validate latest artifacts
+        run: |
+          LATEST_RUN=$(ls -1t benchmarks/runs | head -n1)
+          uv run bm-bench validate-artifacts --run-dir "benchmarks/runs/${LATEST_RUN}"
+
+      - name: Upload benchmark artifacts
+        uses: actions/upload-artifact@v4
+        with:
+          name: benchmark-smoke-artifacts
+          path: benchmarks/runs
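Review note on the `LATEST_RUN=$(ls -1t benchmarks/runs | head -n1)` pattern used in both workflows: it selects the most recently modified entry, and when the directory has no entries it yields an empty string, so `--run-dir` would point at `benchmarks/runs/` itself. A minimal Python sketch of a stricter selection is below. `latest_run_dir` is a hypothetical helper, not part of `bm-bench`, and it assumes each run lives in its own subdirectory of the runs root.

```python
from pathlib import Path


def latest_run_dir(runs_root: Path) -> Path:
    """Return the most recently modified run directory under runs_root.

    Raises FileNotFoundError instead of returning an empty value, so a
    missing or empty runs directory fails loudly rather than silently.
    """
    if not runs_root.is_dir():
        raise FileNotFoundError(f"runs directory not found: {runs_root}")
    # Only directories count as runs; stray files are ignored.
    candidates = [p for p in runs_root.iterdir() if p.is_dir()]
    if not candidates:
        raise FileNotFoundError(f"no run directories under: {runs_root}")
    # Same ordering criterion as `ls -t`: modification time, newest first.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Calling `latest_run_dir(Path("benchmarks/runs"))` from a workflow step would give the same result as the shell pipeline when runs exist, and a clear error when they do not.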
@@ -0,0 +1,14 @@
+__pycache__/
+*.pyc
+.venv/
+.pytest_cache/
+.ruff_cache/
+
+# Generated benchmark outputs
+benchmarks/runs/
+benchmarks/results/public/
+benchmarks/generated/
+
+# Downloaded datasets (source distribution may be restricted)
+benchmarks/datasets/locomo/locomo10.json
+benchmarks/datasets/locomo/locomo10.provenance.json
@@ -0,0 +1 @@
+3.13
@@ -0,0 +1,128 @@
+# basic-memory-benchmarks
+
+Standalone, reproducible benchmark suite for comparing Basic Memory against competing memory systems.
+
+## Goals
+
+- Deterministic retrieval benchmarks (Recall@5/10, MRR, Precision@5, content-hit, latency)
+- Optional LLM-as-judge scoring (Pydantic Evals)
+- Public artifacts with provenance and reproducibility metadata
+- Clean dependency isolation from the core `basic-memory` repository
+
+## Current v1 Scope
+
+- Providers:
+  - `bm-local`
+  - `bm-cloud` (optional, credential-gated)
+  - `mem0-local`
+  - `zep-reference` (reference-only in v1)
+- Datasets:
+  - LoCoMo (primary)
+  - LongMemEval scaffold (placeholder)
+  - Built-in synthetic smoke corpus
+
+## Installation
+
+```bash
+uv sync --group dev
+```
+
+Optional judge dependencies:
+
+```bash
+uv sync --group dev --extra judge
+```
+
+## Quickstart
+
+### 1) Fetch LoCoMo dataset
+
+```bash
+uv run bm-bench datasets fetch --dataset locomo
+```
+
+### 2) Convert LoCoMo into benchmark corpus
+
+```bash
+uv run bm-bench convert locomo
+```
+
+### 3) Run retrieval benchmark
+
+```bash
+uv run bm-bench run retrieval \
+  --providers bm-local,mem0-local \
+  --corpus-dir benchmarks/generated/locomo/docs \
+  --queries-path benchmarks/generated/locomo/queries.json
+```
+
+### 4) Optional judge benchmark
+
+```bash
+uv run bm-bench run judge --run-dir benchmarks/runs/<run-id>
+```
+
+### 5) Publish run artifacts
+
+```bash
+uv run bm-bench publish --run-dir benchmarks/runs/<run-id>
+```
+
+## Basic Memory source policy
+
+By default this project tracks Basic Memory from `main`.
+
+Each run manifest stores:
+- BM source (`github main` or local path override)
+- resolved BM commit SHA
+
+Local override:
+
+```bash
+uv run bm-bench run retrieval \
+  --bm-local-path /path/to/basic-memory
+```
+
+## Mem0 local requirements
+
+`mem0-local` requires model credentials to be available in the environment.
+
+At minimum, set:
+
+```bash
+export OPENAI_API_KEY=...
+```
+
+If credentials are unavailable, the provider status is recorded as `SKIPPED(reason)`.
+
+## Run Artifacts
+
+Per run (`benchmarks/runs/<run-id>/`):
+
+- `manifest.json`
+- `provider-status.json`
+- `per-query-retrieval.jsonl`
+- `retrieval-summary.json`
+- `per-query-judge.jsonl` (optional)
+- `judge-summary.json` (optional)
+- `summary.md`
+
+## Just commands
+
+```bash
+just bench-smoke
+just bench-fetch-locomo
+just bench-convert-locomo
+just bench-run-bm-local
+just bench-run-mem0-local
+just bench-run-full
+just bench-judge
+just bench-publish RUN_DIR=benchmarks/runs/<run-id>
+```
+
+## Notes on dataset publication
+
+Dataset publication follows licensing constraints:
+- If redistribution is permitted: snapshot + checksum may be published.
+- If not: canonical source links + downloader + checksum verification are published.
+
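Review note on the README's Goals list: Recall@5/10, MRR, and Precision@5 are named but not defined. The sketch below shows the standard textbook definitions, assuming each query has a set of relevant document IDs and a ranked list of returned IDs. The function names are hypothetical and not taken from `bm-bench`. Whether the suite divides Precision@k by `k` or by the number of results actually returned is not stated in this diff; the sketch divides by `k`.

```python
from collections.abc import Iterable, Sequence


def recall_at_k(ranked: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant IDs (denominator is k)."""
    if k <= 0:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / k


def reciprocal_rank(ranked: Sequence[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values: Iterable[float]) -> float:
    """Arithmetic mean; MRR is mean(reciprocal_rank(...)) over all queries."""
    items = list(values)
    return sum(items) / len(items) if items else 0.0
```

Per-query values like these would map naturally onto the rows of `per-query-retrieval.jsonl`, with the means reported in `retrieval-summary.json`, though the actual artifact schema is not shown in this diff.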
@@ -0,0 +1,13 @@
+# Dataset Policy
+
+This repository publishes benchmark provenance in the open.
+
+- If redistribution is allowed, snapshots may be stored in-repo with checksums.
+- If redistribution is restricted, we publish canonical source links, download scripts,
+  and checksums so anyone can reproduce runs.
+
+Always published:
+- conversion scripts
+- query manifests
+- run artifacts
+- provenance metadata
@@ -0,0 +1,7 @@
+# LoCoMo Dataset
+
+Primary v1 dataset for benchmark runs.
+
+- Canonical source URL is configured in `src/basic_memory_benchmarks/datasets/locomo.py`.
+- Download via `bm-bench datasets fetch --dataset locomo`.
+- Provenance checksum is written alongside the downloaded file.
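Review note on the provenance checksum mentioned above: the layout of `locomo10.provenance.json` is not shown in this diff. The sketch below assumes a top-level `sha256` key holding a hex digest. That key name, and both function names, are hypothetical illustrations of how a downloaded file could be checked against its recorded checksum.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large datasets are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_provenance(data_path: Path, provenance_path: Path) -> bool:
    """Compare the file's digest with the one recorded in the provenance JSON.

    Assumption: the provenance file is a JSON object with a "sha256" key.
    """
    recorded = json.loads(provenance_path.read_text(encoding="utf-8"))
    return sha256_of(data_path) == recorded["sha256"]
```

A downloader could call `verify_against_provenance` after each fetch and refuse to convert the dataset when it returns `False`.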
@@ -0,0 +1,4 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+uv run bm-bench datasets fetch --dataset locomo --output benchmarks/datasets/locomo/locomo10.json
@@ -0,0 +1,6 @@
+{
+  "dataset_id": "locomo",
+  "source_url": "https://raw.githubusercontent.com/snap-research/locomo/main/data/locomo10.json",
+  "citation": "LoCoMo (ACL 2024, Snap Research)",
+  "license_note": "Check upstream dataset terms before redistributing snapshots."
+}
@@ -0,0 +1,3 @@
+# LongMemEval (Scaffold)
+
+LongMemEval integration is scaffolded in v1 and will be implemented in a follow-up.