How academic findings and commercial patterns map to SPECTRA v4 design decisions. Each decision traces to specific evidence.
Separate thinking from doing. Make the thinking verifiable. Make the results persistent.
Every tool that outperforms a basic ReAct agent implements some version of this.
Evidence: Cursor enforces ≤2 questions. Claude Code uses AskUserQuestion. Copilot enters an "Alignment" phase. Commercial analysis: 40%+ of token waste stems from wrong assumptions.
Decision: ≤3 structured questions. Skip only when unambiguous.
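A minimal sketch of that gate, with assumed names (`Question`, `clarify`, the `impact` field) — cap at three structured questions, ranked by impact, and skip the phase entirely when nothing is ambiguous:

```python
# Sketch of the CLARIFY gate: at most three structured questions,
# none at all when the request is unambiguous. Names are illustrative.
from dataclasses import dataclass, field

MAX_QUESTIONS = 3

@dataclass
class Question:
    text: str
    options: list = field(default_factory=list)  # structured choices, not open-ended

def clarify(ambiguities: list) -> list:
    """ambiguities: [{"text": ..., "options": [...], "impact": float}].
    Returns at most MAX_QUESTIONS questions, highest-impact first."""
    if not ambiguities:
        return []  # unambiguous: skip straight to planning
    ranked = sorted(ambiguities, key=lambda a: a["impact"], reverse=True)
    return [Question(a["text"], a["options"]) for a in ranked[:MAX_QUESTIONS]]
```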
Evidence: Cursor: "YOU MUST NOT make any edits." Claude Code: "CRITICAL: READ-ONLY MODE." Copilot: Plan agent has no write tools. Decision: No code, no edits during SPECTRA phases. Forces staying in problem space.
Evidence: PlanSearch nearly doubled Claude Sonnet's performance. ToT: 74% vs 4% on Game of 24. No commercial tool does this. Decision: 3–5 distinct hypotheses with weighted rubric. Anti-strawman: if within 5%, re-observe.
Evidence: ToT uses "sure/maybe/impossible" — effective but crude. No commercial tool exposes scoring criteria. Decision: 7 dimensions scored 1–10 with weights summing to 100%. Balances rigor with practicality.
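The two decisions above combine into one selection step. A sketch under stated assumptions — the dimension names are illustrative (only the 25/20/15/15/10/10/5 weight split comes from the calibration table below), and the anti-strawman rule fires when the top two weighted scores are within 5%:

```python
# Illustrative rubric: seven dimensions scored 1-10, weights summing to 100%.
# Dimension names are assumptions; the weight split matches 25/20/15/15/10/10/5.
WEIGHTS = {
    "correctness": 0.25, "simplicity": 0.20, "robustness": 0.15,
    "performance": 0.15, "maintainability": 0.10, "testability": 0.10,
    "reversibility": 0.05,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9

def weighted_score(scores: dict) -> float:
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

def select(hypotheses: dict) -> tuple:
    """hypotheses: name -> {dimension: 1-10}. Returns (winner, needs_reobserve)."""
    ranked = sorted(hypotheses, key=lambda h: weighted_score(hypotheses[h]),
                    reverse=True)
    top, runner = ranked[0], ranked[1]
    gap = weighted_score(hypotheses[top]) - weighted_score(hypotheses[runner])
    # Anti-strawman rule: a margin under 5% of the top score means the
    # comparison is too close to trust, so re-observe instead of committing.
    needs_reobserve = gap < 0.05 * weighted_score(hypotheses[top])
    return top, needs_reobserve
```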
Evidence: No commercial tool verifies plans before execution. Agentless validation phase achieves 50.8% SWE-bench. Decision: Structural → Self-Consistency → Dependency → Constraint → Process Reward → Adversarial. Key insight: 10x cheaper to catch a flawed plan than to execute one.
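The six layers above form a short-circuiting pipeline — cheap checks first, stop at the first failure. A sketch where each layer is just a predicate supplied by the caller (the layer internals are placeholders, only the ordering is from the decision):

```python
# Sketch of the six-layer verification pipeline. Each layer is a
# caller-supplied predicate over the plan; the pipeline stops at the
# first failing layer rather than executing a flawed plan.
LAYERS = ["structural", "self_consistency", "dependency",
          "constraint", "process_reward", "adversarial"]

def verify(plan, checks: dict) -> tuple:
    """checks: layer name -> callable(plan) -> bool.
    Returns (ok, failed_layer); failed_layer is None on success."""
    for layer in LAYERS:
        if not checks[layer](plan):
            return False, layer  # catching the flaw here is ~10x cheaper
    return True, None
```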
Evidence: Reflexion: 91% Pass@1 on HumanEval. "Rich reflections >> 'try again'." Decision: Diagnose → Explain → Prescribe → Apply → Re-verify. Max 3 cycles.
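The repair loop can be sketched as follows; `diagnose`, `apply_fix`, and `reverify` are hypothetical callables standing in for the Diagnose/Explain/Prescribe/Apply/Re-verify steps, and only the three-cycle cap is from the decision:

```python
# Sketch of the bounded repair loop: up to three
# diagnose -> fix -> re-verify cycles, carrying a rich diagnosis
# forward rather than a bare "try again" retry.
MAX_CYCLES = 3

def repair(plan, diagnose, apply_fix, reverify):
    """Returns (plan, cycles_used); plan is None if still failing after the cap."""
    for cycle in range(1, MAX_CYCLES + 1):
        ok, failure = reverify(plan)
        if ok:
            return plan, cycle - 1  # already verified; report cycles consumed
        diagnosis = diagnose(plan, failure)  # rich reflection, not "try again"
        plan = apply_fix(plan, diagnosis)    # prescribe + apply collapsed here
    ok, _ = reverify(plan)
    return (plan, MAX_CYCLES) if ok else (None, MAX_CYCLES)
```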
Evidence: Cursor stores .cursor/plans/*.md. Claude Code stores .claude/plans/. Copilot uses Markdown + JSON dual format.
Decision: Triple format: MD (human) + YAML (agent) + JSON (state). Files, not chat messages.
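One plan, three renderings. A stdlib-only sketch (file names and the hand-rolled YAML serialization are illustrative; it returns the file contents rather than writing to disk):

```python
# Sketch of triple-format persistence: the same plan rendered as
# Markdown (human), YAML (agent), and JSON (machine state).
# Basename and field names are illustrative assumptions.
import json

def persist(plan: dict, basename: str) -> dict:
    """plan: {"title": str, "steps": [str]}. Returns {filename: contents}."""
    md = "# " + plan["title"] + "\n\n" + "\n".join("- " + s for s in plan["steps"])
    yaml = ("title: " + plan["title"] + "\nsteps:\n"
            + "\n".join("  - " + s for s in plan["steps"]))
    state = json.dumps({"title": plan["title"], "done": []}, indent=2)
    return {basename + ".md": md, basename + ".yaml": yaml,
            basename + ".json": state}
```

Returning a mapping keeps the sketch testable; a real implementation would write these under the plans directory so state survives across chat sessions.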
Evidence: ADaPT: attempt first, decompose on failure. LangGraph replanner node. No commercial tool does structured replanning. Decision: Patch (minor) / Partial (moderate) / Full (major) based on failure scope.
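A sketch of that scope-based classification; the 20% and 50% thresholds are illustrative assumptions, not part of the SPECTRA specification:

```python
# Sketch of scope-based replanning: classify the failure by how much
# of the plan it invalidates. Thresholds are illustrative.
def replan_mode(failed_steps: int, total_steps: int) -> str:
    ratio = failed_steps / total_steps
    if ratio <= 0.2:
        return "patch"    # minor: fix the failed step in place
    if ratio <= 0.5:
        return "partial"  # moderate: regenerate the affected branch
    return "full"         # major: replan from fresh observations
```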
Evidence: "Context Rot" documented in Gemini analysis. Claude Code uses interleaved reasoning. Decision: At 80% capacity, compress to Knowledge Artifact. Re-inject system prompt + artifact.
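The trigger itself is simple. A sketch with an assumed `summarize` callable standing in for Knowledge Artifact generation:

```python
# Sketch of the 80% compaction trigger: once the context window is 80%
# full, compress history into a Knowledge Artifact and rebuild the
# context as system prompt + artifact. `summarize` is hypothetical.
COMPACT_AT = 0.80

def maybe_compact(messages, token_count, window, system_prompt, summarize):
    if token_count < COMPACT_AT * window:
        return messages  # under threshold: leave context untouched
    artifact = summarize(messages)  # decisions, state, open items
    return [system_prompt, artifact]
```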
Evidence: Aider: o1 + DeepSeek = 85%. Cline: R1 + Sonnet = 97% cost reduction. Model leadership changes quarterly. Decision: Capability classes (reasoning/speed) not vendor names.
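Indirection through capability classes can be sketched as a two-level lookup; the model names and phase-to-class mapping below are purely illustrative:

```python
# Sketch of capability-class routing: phases request a class, and a
# swappable registry maps each class to whichever model currently
# leads it. All names here are illustrative placeholders.
REGISTRY = {"reasoning": "best-reasoning-model", "speed": "best-fast-model"}

PHASE_CLASS = {"plan": "reasoning", "verify": "reasoning",
               "execute": "speed", "summarize": "speed"}

def model_for(phase: str) -> str:
    # Quarterly leadership changes only touch REGISTRY, never the phases.
    return REGISTRY[PHASE_CLASS[phase]]
```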
Evidence: MSR '26 Cursor Rules study: 77.6% of repos encode conventions. Qodo 2025: 65% of developers report AI misses context during refactoring. RepoGraph (ICLR 2025): modular context enrichment improves agent frameworks by 32.8% on average without changing the framework itself.
Decision: Convention extraction is a one-time installation step (via spectra-init.sh + LLM), not a per-session cost. The output (spectra-conventions.md) is consumed as structural context through the existing CLARIFY and Pattern phases — no special retrofit logic in the core cycle. Convention maps improve Pattern Match quality within the existing scoring architecture (no arbitrary bonuses).
Evidence: ADaPT (Prasad et al.): attempt first, decompose on failure — don't over-analyze simple problems. EVPI framework (THEORY.md §3): gather information only when cost < expected rework savings. Decision: Three installation depths for brownfield projects — Standard (init script + LLM), Deep (exemplar analysis + cross-validation), Structural (AST parsing + dependency graphs). Depth choice is made once at installation, not per session. Simple repos use Standard; large/complex repos benefit from Deep or Structural.
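The one-time depth choice can be sketched as a simple classifier over repository shape; the size and language thresholds are illustrative assumptions, not calibrated values:

```python
# Sketch of the one-time installation-depth choice for brownfield
# projects. Thresholds are illustrative, not part of the SPECTRA spec.
def installation_depth(file_count: int, languages: int, monorepo: bool) -> str:
    if monorepo or languages > 3:
        return "structural"  # AST parsing + dependency graphs
    if file_count > 500:
        return "deep"        # exemplar analysis + cross-validation
    return "standard"        # init script + LLM pass
```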
| Gap | SPECTRA's Choice | Confidence | Needs |
|---|---|---|---|
| Optimal hypothesis count | 3–5 | Medium | Empirical testing |
| Rubric weight calibration | 25/20/15/15/10/10/5 | Medium | A/B testing |
| Self-consistency threshold | 70% overlap | Medium | Cross-project validation |
| Verification layer ordering | Current sequence | Low | Failure mode frequency data |
| Context compaction trigger | 80% capacity | Low | Model-dependent tuning |
| Convention file entropy reduction | Conventions lower plan entropy | Low | Empirical measurement across repos with/without conventions |
| Optimal convention depth | Progressive depths mapped to complexity | Low | Case studies comparing Standard-only vs Deep/Structural installation |
These gaps are where community case studies are most valuable.
For formal theoretical treatment of SPECTRA's design decisions — including EVPI analysis for confidence thresholds, Shannon entropy for adaptive verification, and cognitive load bounds for hypothesis generation — see THEORY.md.
Last updated: 2026-03-01