Date: January 11, 2026
Status: ✅ Fixes implemented, ready for re-evaluation
Critical bugs were discovered in the evaluation methodology that made the monolithic vs ensemble comparison invalid. The primary issue was incorrect token counting where ensemble appeared to use "98% fewer tokens" when the actual difference was ~20%.
All previous evaluation results (in evaluation_analysis.md) are INVALID and should be discarded.
Evidence:
// Monolithic cache (data/cache/summaries/doc_1_summary.json)
{
"metadata": {
"tokens_used": 14639, // ✓ Correct
...
}
}
// Ensemble cache (data/cache/ensemble_summaries/doc_1_summary.json)
{
"metadata": {
"tokens_used": 0, // ✗ BUG - Should be ~14,600
...
}
}Impact:
- Monolithic correctly counted ~146K tokens from cached summaries (10 docs × 14.6K)
- Ensemble counted 0 tokens from cached summaries
- This created a false ~146K token difference
Root Cause:
The process_chunk function in ensemble.py failed to extract token usage from CrewAI results and silently defaulted to 0.
The cache loading logic in utils.py didn't warn when tokens_used was 0, making the bug invisible.
Both agents used the same cached document summaries (map phase), but only monolithic included those tokens in its metrics. This wasn't a "bug" per se, but made the comparison unfair.
Location: ensemble.py lines 136-163
Changes:
- Added multiple fallback methods to extract token usage from CrewAI
- Implemented token estimation from input+output text when API data unavailable
- Added detailed logging when extraction fails or falls back to estimation
- Ensures
tokens_usedis never 0 in cache files
Code:
# Try multiple ways to extract token usage
if hasattr(result, "usage_metrics"):
# ... extract from usage_metrics
elif hasattr(result, "token_usage"):
# ... extract from token_usage
# Fallback: estimate if extraction failed
if not tokens_found or metrics["total_tokens"] == 0:
logger.warning(f"No token usage found for doc {doc_idx} chunk {chunk_idx}, using estimation")
input_tokens = estimate_tokens(chunk)
output_tokens = estimate_tokens(summary)
metrics["total_tokens"] = input_tokens + output_tokensLocation: utils.py lines 205-211
Changes:
- Added validation warning when cache has
tokens_used=0 - Helps catch this bug in future evaluations
Code:
tokens_used = cached['metadata'].get('tokens_used', 0)
if tokens_used == 0:
logger.warning(
f"⚠️ Cache for document {doc_idx} has tokens_used=0! "
f"This may indicate a bug in cache generation. "
f"Cache file: {cache_file}"
)Action taken:
# Backed up old caches
data/cache/backup_20260111_111517/summaries/
data/cache/backup_20260111_111517/ensemble_summaries/
# Cleared for fresh evaluation
data/cache/summaries/ - EMPTY
data/cache/ensemble_summaries/ - EMPTYConfirmed:
- Monolithic: Uses
OLLAMA_MODEL(defaults toqwen2.5:7b) - Ensemble: Uses
CREWAI_MODEL(defaults toopenai/qwen2.5:7b) - Both use the same underlying model ✓
This provides the most honest comparison by measuring everything from scratch:
# Caches are already cleared, just run:
python evaluate.py
# This will:
# 1. Generate fresh document summaries for both agents (map phase)
# 2. Track tokens accurately for both
# 3. Perform synthesis (reduce/ensemble phase)
# 4. Generate fair comparison metricsExpected results:
- Both agents will show ~146K tokens for document summarization (map phase)
- Monolithic will show ~151K tokens for synthesis (reduce phase with all summaries)
- Ensemble will show ~169K tokens for ensemble work:
- Pre-reduction: ~148K (reducing summaries from 146K to ~6K)
- Archivist: ~8K (processing reduced summaries)
- Drafter: ~5K
- Critic: ~4K
- Orchestrator: ~4K
- Total: Monolithic ~297K vs Ensemble ~315K (ensemble uses +6% MORE, not less!)
For quick verification of fixes:
python evaluate.py --test
# Uses only 1 document instead of 10
# Faster but less comprehensivepython evaluate.py --task task1
# Evaluates only task1
# Good for debugging| Agent | Map Phase | Reduce/Ensemble Phase | Total | % Difference |
|---|---|---|---|---|
| Monolithic | ~146K | ~151K (1 call with all summaries) | ~297K | baseline |
| Ensemble | ~146K | ~169K (reduction + 4 agents) | ~315K | +6% MORE |
Key insight: Ensemble actually uses MORE tokens, not less!
- Monolithic: One synthesis call with all summaries
- Ensemble: Pre-reduction step + Archivist + Drafter + Critic + Orchestrator
- The "98% fewer tokens" was completely wrong (bug in cache metadata)
- The "20% fewer tokens" was also wrong (forgot about pre-reduction step)
- Reality: Ensemble uses ~6% MORE tokens due to additional orchestration
Both agents load summaries fresh (no cache), so timing will be fair:
- Map phase: Similar for both (~10-15 min for 10 docs)
- Reduce phase: Varies based on synthesis complexity
- Total: Expect monolithic to be slightly faster (fewer orchestration steps)
These should remain similar to before (assuming they weren't affected by the token bug):
- ROUGE-1 F1
- BERTScore F1
- Judge scores (instruction, groundedness, completeness)
After running new evaluation, verify:
- No warnings about
tokens_used=0in logs - Both agents show similar map-phase token counts (~146K)
- Cache files contain valid token metadata
- MLflow logs show all metrics
- New analysis markdown generated
ensemble.py- Improved token extraction with fallbacksutils.py- Added cache validation warningsevaluation_analysis.md- Added disclaimer about invalid data- Cache directories - Cleared and backed up
If you see unexpected results or have questions:
- Check logs for warnings about token extraction
- Inspect cache files:
cat data/cache/summaries/doc_1_summary.json - Verify token counts in MLflow UI
- Compare with backed-up cache in
data/cache/backup_*
Ready to run fair evaluation! 🚀
python evaluate.py