## Context
MemMachine reports 80% fewer tokens than Mem0 for the same benchmark — a major cost story. We should track this too.
## Proposal
During benchmark evaluation, measure and report:
- Input tokens per query — how much context is injected from memory
- Output tokens per query — how much the LLM generates
- Total tokens across benchmark — overall cost comparison
- Tokens per correct answer — efficiency metric (quality per token)
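The four metrics above all fall out of a per-query log. A minimal sketch of the aggregation, using made-up sample records (the field names `tokens_in`, `tokens_out`, `correct` are illustrative assumptions, not a fixed schema):

```python
# Aggregate the proposed metrics from per-query records.
# Records and field names here are illustrative only.
records = [
    {"tokens_in": 1200, "tokens_out": 150, "correct": True},
    {"tokens_in": 900,  "tokens_out": 120, "correct": False},
    {"tokens_in": 1100, "tokens_out": 140, "correct": True},
]

n = len(records)
total_in = sum(r["tokens_in"] for r in records)    # context injected from memory
total_out = sum(r["tokens_out"] for r in records)  # LLM generation
total = total_in + total_out
n_correct = sum(r["correct"] for r in records)

print(f"input tokens/query:  {total_in / n:.1f}")
print(f"output tokens/query: {total_out / n:.1f}")
print(f"total tokens:        {total}")
print(f"tokens/correct:      {total / n_correct:.1f}")
```

Tokens-per-correct-answer divides *total* spend by correct answers, so a system that burns more tokens per query can still win if its accuracy is higher.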
## Why it matters
BM returns raw markdown chunks. We don't extract/compress memories like Mem0 does. This means:
- We might use MORE tokens per query (full context vs extracted facts)
- But our context might be richer and lead to better answers
- The tokens-per-correct-answer metric tells the real story
If we can show competitive accuracy with fewer tokens, that's a cost argument. If we use more tokens but get better answers, that's a quality argument. Either way, the data tells a story.
## Implementation
- Count tokens using tiktoken (cl100k_base) on retrieved context + LLM prompt
- Log per-query: `{query, category, tokens_in, tokens_out, correct, latency}`
- Aggregate: total tokens, tokens/query by category, tokens/correct answer
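The counting and logging steps above could look roughly like this. A sketch only: the `log_query` helper and its arguments are assumptions, and a crude whitespace fallback is included so the snippet runs even where `tiktoken` isn't installed (real runs should use `cl100k_base`):

```python
import json

try:
    import tiktoken  # pip install tiktoken
    _enc = tiktoken.get_encoding("cl100k_base")
    def count_tokens(text: str) -> int:
        return len(_enc.encode(text))
except ImportError:
    def count_tokens(text: str) -> int:
        # Rough stand-in when tiktoken is unavailable; for illustration only.
        return len(text.split())

def log_query(query: str, category: str, context: str, answer: str,
              correct: bool, latency: float) -> dict:
    """Build one per-query record: tokens_in covers the retrieved
    context plus the query itself; tokens_out covers the LLM answer."""
    record = {
        "query": query,
        "category": category,
        "tokens_in": count_tokens(context) + count_tokens(query),
        "tokens_out": count_tokens(answer),
        "correct": correct,
        "latency": latency,
    }
    print(json.dumps(record))  # one JSON line per query, easy to aggregate later
    return record

rec = log_query("When did we ship v0.18?", "temporal",
                "## Release notes\nv0.18 shipped 2025-01-10.",
                "v0.18 shipped on 2025-01-10.", True, 0.42)
```

Emitting one JSON line per query keeps the aggregation step (totals, per-category breakdowns, tokens per correct answer) a trivial post-processing pass.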
## Related

## Milestone
v0.19.0