## Context
MemMachine reports 80% fewer tokens than Mem0 for the same benchmark — a major cost story. We should track this too.
## Proposal
During benchmark evaluation, measure and report:
- Input tokens per query — how much context is injected from memory
- Output tokens per query — how much the LLM generates
- Total tokens across benchmark — overall cost comparison
- Tokens per correct answer — efficiency metric (quality per token)
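The four metrics above all fall out of a per-query log. A minimal sketch of the aggregation, using made-up sample records (the field names `tokens_in`, `tokens_out`, `correct` are illustrative assumptions, not a fixed schema):

```python
# Aggregate the proposed metrics from per-query records.
# Records and field names here are illustrative only.
records = [
    {"tokens_in": 1200, "tokens_out": 150, "correct": True},
    {"tokens_in": 900,  "tokens_out": 120, "correct": False},
    {"tokens_in": 1100, "tokens_out": 140, "correct": True},
]

n = len(records)
total_in = sum(r["tokens_in"] for r in records)    # context injected from memory
total_out = sum(r["tokens_out"] for r in records)  # LLM generation
total = total_in + total_out
n_correct = sum(r["correct"] for r in records)

print(f"input tokens/query:  {total_in / n:.1f}")
print(f"output tokens/query: {total_out / n:.1f}")
print(f"total tokens:        {total}")
print(f"tokens/correct:      {total / n_correct:.1f}")
```

Tokens-per-correct-answer divides *total* spend by correct answers, so a system that burns more tokens per query can still win if its accuracy is higher.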
## Why it matters
BM returns raw markdown chunks. We don't extract/compress memories like Mem0 does. This means:
- We might use MORE tokens per query (full context vs extracted facts)
- But our context might be richer and lead to better answers
- The tokens-per-correct-answer metric tells the real story
If we can show competitive accuracy with fewer tokens, that's a cost argument. If we use more tokens but get better answers, that's a quality argument. Either way, the data tells a story.
## Implementation
- Count tokens using tiktoken (cl100k_base) on retrieved context + LLM prompt
- Log per-query: `{query, category, tokens_in, tokens_out, correct, latency}`
- Aggregate: total tokens, tokens/query by category, tokens/correct answer
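The counting and logging steps above could look roughly like this. A sketch only: the `log_query` helper and its arguments are assumptions, and a crude whitespace fallback is included so the snippet runs even where `tiktoken` isn't installed (real runs should use `cl100k_base`):

```python
import json

try:
    import tiktoken  # pip install tiktoken
    _enc = tiktoken.get_encoding("cl100k_base")
    def count_tokens(text: str) -> int:
        return len(_enc.encode(text))
except ImportError:
    def count_tokens(text: str) -> int:
        # Rough stand-in when tiktoken is unavailable; for illustration only.
        return len(text.split())

def log_query(query: str, category: str, context: str, answer: str,
              correct: bool, latency: float) -> dict:
    """Build one per-query record: tokens_in covers the retrieved
    context plus the query itself; tokens_out covers the LLM answer."""
    record = {
        "query": query,
        "category": category,
        "tokens_in": count_tokens(context) + count_tokens(query),
        "tokens_out": count_tokens(answer),
        "correct": correct,
        "latency": latency,
    }
    print(json.dumps(record))  # one JSON line per query, easy to aggregate later
    return record

rec = log_query("When did we ship v0.18?", "temporal",
                "## Release notes\nv0.18 shipped 2025-01-10.",
                "v0.18 shipped on 2025-01-10.", True, 0.42)
```

Emitting one JSON line per query keeps the aggregation step (totals, per-category breakdowns, tokens per correct answer) a trivial post-processing pass.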
## Related

## Milestone
v0.19.0