Summary
Add BEAM as a benchmark dataset alongside LoCoMo and LongMemEval.
BEAM is from the ICLR 2026 paper "Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs" (University of Alberta + UMass Amherst).
Why BEAM
- Scale: 100 conversations ranging from 128K to 10M tokens (vs LoCoMo's ~10K tokens / 81 QA pairs)
- Breadth: Tests 10 distinct memory abilities — abstention, contradiction resolution, event ordering, info extraction, instruction following, knowledge update, multi-hop reasoning, preference following, summarization, temporal reasoning
- Rigor: 2,000 human-validated probing questions with nugget-based evaluation
- Multi-domain: Coding, math, health, finance, personal — not just casual/personal conversations
- Key finding: even LLMs with 1M-token context windows degrade substantially on long conversations, and RAG alone doesn't fix it. Structured external memory (their LIGHT framework) improves performance by 3.5-12.7%.
This validates the core thesis behind Basic Memory — structured knowledge graphs beat raw context windows.
Resources
Implementation Notes
- Dataset has 4 size tiers: 128K (20 conversations), 500K (35), 1M (35), 10M (10) — 100 total
- Evaluation uses nugget scoring (atomic semantic units) + Kendall tau-b for event ordering
- Will need a converter similar to locomo_to_corpus.py
- Consider starting with 128K tier for fast iteration, then scaling up
- Their eval scripts are in the repo — could adapt or wrap
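A converter along the lines of locomo_to_corpus.py could look like the sketch below. This is a minimal sketch only: the nesting (conversations → sessions → turns) and the field names `id`, `sessions`, `turns`, `speaker`, and `text` are assumptions that need checking against the released BEAM files.

```python
import json
from pathlib import Path

def beam_to_corpus(beam_path: str, out_path: str) -> int:
    """Flatten a BEAM-style conversation file into a JSONL corpus,
    one record per turn, tagged with conversation and session ids.

    NOTE: the layout and key names below are assumptions about the
    dataset format, not confirmed against the actual release.
    """
    data = json.loads(Path(beam_path).read_text())
    n = 0
    with open(out_path, "w") as out:
        for conv in data:  # assumed: top level is a list of conversations
            for s_idx, session in enumerate(conv.get("sessions", [])):
                for turn in session.get("turns", []):  # assumed keys
                    out.write(json.dumps({
                        "conversation_id": conv.get("id"),
                        "session": s_idx,
                        "speaker": turn.get("speaker"),
                        "text": turn.get("text"),
                    }) + "\n")
                    n += 1
    return n
```

Keeping one JSONL record per turn mirrors what the LoCoMo converter produces, so the downstream indexing path shouldn't need changes.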
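If we wrap rather than adapt their eval scripts, the event-ordering metric is easy to reproduce: Kendall's tau-b is a standard statistic comparing a predicted event order against the gold order, with tie handling. A dependency-free sketch (the nugget-scoring half is omitted):

```python
from itertools import combinations
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b between two rankings of the same items.

    tau_b = (C - D) / sqrt((C + D + Tx) * (C + D + Ty)),
    where C/D count concordant/discordant pairs and Tx/Ty count
    pairs tied only in x or only in y. Returns 1.0 for identical
    orderings, -1.0 for fully reversed ones.
    """
    C = D = Tx = Ty = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue  # tied in both: excluded from both denominator terms
        if dx == 0:
            Tx += 1
        elif dy == 0:
            Ty += 1
        elif dx * dy > 0:
            C += 1
        else:
            D += 1
    denom = math.sqrt((C + D + Tx) * (C + D + Ty))
    return (C - D) / denom if denom else 0.0
```

`scipy.stats.kendalltau` (with `variant="b"`, the default) computes the same statistic if we'd rather not maintain this; the inline version just keeps the harness dependency-free.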
Context
Drew is working on a fork of supermemory's memorybench (which covers LoCoMo, LongMemEval, ConvoMem). BEAM would give us a much more comprehensive evaluation at scale in our own repo.
Related competitive intel: ByteRover and OpenViking are both positioning as "memory for agents" — having strong benchmark numbers across multiple datasets strengthens BM's story.
Labels suggestion
enhancement, benchmarks