# Basic Memory Benchmark

Open, reproducible retrieval quality benchmarks for the Basic Memory OpenClaw plugin.

## Why

Memory systems for AI agents make big claims with no reproducible evidence. We're building benchmarks in the open to:

1. **Improve Basic Memory** — evals are a feedback loop, not a marketing tool
2. **Compare honestly** — show where we're strong AND where we're weak
3. **Publish methodology** — anyone can reproduce our results or challenge them

## What We Measure

### Retrieval Quality (primary)
- **Recall@K** — does the correct memory appear in the top K results?
- **Precision@K** — of the top K results, how many are actually relevant?
- **MRR** — Mean Reciprocal Rank: where does the first correct answer appear?
- **Content Hit Rate** — for exact facts, did the expected value appear in results?

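
These four metrics reduce to a few lines of code. The functions below are a minimal sketch, assuming each query returns a ranked list of file paths and its ground truth is a set of relevant paths; the names are illustrative, not the benchmark's actual implementation.

```python
# Illustrative metric definitions (not the benchmark's real code).
# results: ranked list of retrieved file paths; relevant: set of ground-truth paths.

def recall_at_k(results: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant files that appear in the top-k results."""
    return len(set(results[:k]) & relevant) / len(relevant) if relevant else 0.0

def precision_at_k(results: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are actually relevant."""
    return len(set(results[:k]) & relevant) / k if k else 0.0

def reciprocal_rank(results: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result (0.0 if none); averaging this over all queries gives MRR."""
    for rank, path in enumerate(results, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0

def content_hit(result_texts: list[str], expected: str) -> bool:
    """For exact-fact queries: did the expected value appear anywhere in the returned content?"""
    return any(expected in text for text in result_texts)
```
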
### Query Categories
| Category | What it tests |
|----------|---------------|
| `exact_fact` | Keyword precision — find specific values |
| `semantic` | Vector similarity — find conceptually related content |
| `temporal` | Date awareness — retrieve by when things happened |
| `relational` | Graph traversal — follow connections between entities |
| `cross_note` | Multi-document recall — stitch information across files |
| `task_recall` | Structured task queries — find active/assigned tasks |
| `needle_in_haystack` | Exact token retrieval — find specific IDs, URLs, numbers |
| `absence` | Knowing what ISN'T there — or is planned but not done |
| `evolving_fact` | Freshness — prefer newer data over stale entries |

### Providers Compared
1. **Basic Memory** (`bm search`) — semantic graph + observations + relations
2. **OpenClaw builtin** (`memory-core`) — SQLite + vector + BM25 hybrid
3. **QMD** (experimental) — BM25 + vectors + reranking sidecar

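
To score all three providers with one harness, each backend can sit behind the same minimal interface. The `Protocol` below is a hypothetical sketch of such an adapter, not the plugin's or any CLI's actual API; file paths are used as the unit of retrieval because the ground truth in `queries.json` is expressed as file paths.

```python
from typing import Protocol

class MemoryProvider(Protocol):
    """Hypothetical adapter: each provider (bm search, memory-core, QMD) is wrapped
    so it answers a query with a ranked list of file paths, which the scorer then
    compares against the ground-truth paths for that query."""

    name: str

    def search(self, query: str, k: int = 5) -> list[str]:
        """Return up to k file paths, best match first."""
        ...
```
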
## Quick Start

```bash
# Prerequisites: bm CLI installed
# https://github.com/basicmachines-co/basic-memory

# Run the benchmark (small corpus, default)
just benchmark

# Verbose output (per-query details)
just benchmark-verbose

# Run all corpus sizes to see scaling behavior
just benchmark-all

# Run a specific size
just benchmark-medium
just benchmark-large
```

## Corpus Tiers

Three nested corpus sizes test how retrieval scales with data growth. Each tier is a superset of the previous — medium contains all of small, large contains all of medium.

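
Because the tiers are nested, the superset property is easy to sanity-check after editing a corpus. A minimal sketch, assuming the three directories sit side by side and that nesting means every relative markdown path in a smaller tier also exists in the next tier up:

```python
from pathlib import Path

def relative_files(root: str) -> set[Path]:
    """All markdown files under a corpus directory, as paths relative to its root."""
    base = Path(root)
    return {p.relative_to(base) for p in base.rglob("*.md")}

small, medium, large = map(relative_files, ["corpus-small", "corpus-medium", "corpus-large"])

# Each tier should contain every file of the tier below it.
assert small <= medium, f"missing from medium: {sorted(str(p) for p in small - medium)}"
assert medium <= large, f"missing from large: {sorted(str(p) for p in medium - large)}"
```
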
### Small (~10 files, ~12KB) — `corpus-small/`
A single day's work. Baseline: "does search work at all?"
- 1 MEMORY.md, 4 daily notes, 2 tasks, 2 people, 2 topics

### Medium (~35-40 files, ~50KB) — `corpus-medium/`
A working week. Tests noise resistance and temporal ranking.
- Everything in small + 7 more daily notes, 3 more tasks (incl. done), 3 more people, 3 more topics
- Done tasks that should NOT appear in active task queries
- More entities competing for relevance on each query
- 2-hop relation chains

### Large (~100-120 files, ~150-200KB) — `corpus-large/`
A month of accumulated knowledge. The real stress test.
- Everything in medium + 25 more daily notes, 10 more tasks, 10 more people/orgs, 15 more topics
- Deep needle-in-haystack: specific IDs buried in old notes
- 3+ hop relation chains
- Heavy cross-document synthesis requirements
- Stale vs fresh fact resolution at scale

### What scaling reveals

| Metric | Small → Medium | Medium → Large |
|--------|---------------|----------------|
| Recall@5 | Should hold steady | May degrade — more noise |
| MRR | Should hold steady | Ranking quality under pressure |
| Latency | Baseline | Index size impact |
| Content hit | High | Needle-in-haystack stress |

If recall drops significantly from small → large, that's the signal to improve chunking, ranking, or indexing.

## Queries

`benchmark/queries.json` contains 38 annotated queries with:
- Ground truth file paths (which files contain the answer)
- Expected content strings (for exact fact verification)
- Category labels (for per-category scoring)
- Notes explaining edge cases

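
A quick way to see the category mix is to group the annotations by label. The snippet below is a sketch that assumes `queries.json` is a flat JSON list and uses illustrative field names; the real schema is whatever the file defines.

```python
import json
from collections import defaultdict

# Assumed shape of one entry, mirroring the four annotations listed above
# (field names are illustrative, not the actual schema):
#   {"query": ..., "category": ..., "ground_truth_files": [...],
#    "expected_content": [...], "notes": ...}
with open("benchmark/queries.json") as f:
    queries = json.load(f)

by_category = defaultdict(list)
for q in queries:
    by_category[q["category"]].append(q)

for category, items in sorted(by_category.items()):
    print(f"{category}: {len(items)} queries")
```
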
## Results

Results are written to `benchmark/results/` as JSON with full per-query breakdowns:
- Overall metrics (recall, precision, MRR, latency)
- Category breakdown
- Individual query scores
- Failure analysis

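
Each run can then be summarized from its results file. The key names below (`overall`, `recall_at_5`, `mrr`, `latency_ms`) are assumptions about the output schema, shown only to illustrate reading the files back:

```python
import json
from pathlib import Path

# Print headline numbers for every run found under benchmark/results/.
# Key names are illustrative assumptions, not the actual results schema.
for path in sorted(Path("benchmark/results").glob("*.json")):
    run = json.loads(path.read_text())
    overall = run.get("overall", {})
    print(f"{path.name}: recall@5={overall.get('recall_at_5')} "
          f"mrr={overall.get('mrr')} latency_ms={overall.get('latency_ms')}")
```
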
## Contributing

We welcome contributions:
- **Add queries** — especially edge cases you've encountered
- **Expand the corpus** — more realistic memory patterns
- **Add providers** — help us compare against other memory systems
- **Challenge methodology** — if our scoring is unfair, tell us

## License

MIT — same as the plugin.