feat: implement act apply-model and report baseline tripwire (#607) #616

Open: ozymandiashh wants to merge 2 commits into main from feat/issue-607-model-defaults

@ozymandiashh (Collaborator) commented Jul 3, 2026

Summary

Implements the design-gated #607 path for model default recommendations from real compare data.

This PR adds the no-proxy version of model routing: CodeBurn can now notice when a cheaper same-provider model has matched the current dominant model's edit reliability for a project, show the evidence in compare and optimize, and let the user explicitly apply that recommendation with codeburn act apply-model <project>.

The implementation intentionally stays conservative:

  • no per-turn routing
  • no request rewriting
  • no automatic apply
  • no effortLevel recommendation in v1
  • no cross-provider recommendations
  • every write goes through the existing action journal and can be undone

Closes #607.

Why this exists

The acting-layer epic deliberately avoided live routing and provider interception. #607 is the safe middle ground: if local history already shows that a project can use a cheaper model without losing edit reliability, CodeBurn can recommend a project-level default model change.

That keeps the decision reviewable and reversible. The user still controls the session and can override with --model when a particular task needs a different model.

Design followed

This PR follows the thresholds and rollout described in the #607 design thread:

  1. Compare only models with at least 30 edit turns in the same project.
  2. Recommend the candidate only if its one-shot rate is within 3 percentage points of the current dominant model.
  3. Require candidate cost per edit turn to be at most 60 percent of the current model's cost per edit turn.
  4. For debugging-heavy projects, defined as more than 40 percent Debugging edit turns, remove the tolerance and require the candidate to meet or beat the current one-shot rate.
  5. Require both models to have been observed in the project within the last 14 days.
  6. Keep v1 same-provider only.
  7. Write only the Claude Code model setting. effortLevel is intentionally deferred because compare does not have per-effort evidence yet.
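
The thresholds above can be sketched as a single guard function. This is a minimal illustration, not the code in src/act/model-defaults.ts: the `ModelStats` shape, the constant names, and `qualifies` are hypothetical, and one-shot rates are expressed in percent so that the 3-point tolerance is plain subtraction.

```typescript
// Hypothetical per-model stats for one project; the real compare types may differ.
interface ModelStats {
  model: string;
  provider: string;
  editTurns: number;
  oneShotPct: number;      // one-shot rate in percent, 0..100
  costPerEditTurn: number; // USD per edit turn
  lastSeenDaysAgo: number;
}

// Threshold values taken from the design list; the names are assumptions.
const MIN_EDIT_TURNS = 30;
const ONE_SHOT_TOLERANCE_PCT = 3;
const MAX_COST_RATIO = 0.6;
const DEBUGGING_HEAVY_SHARE = 0.4;
const MAX_STALENESS_DAYS = 14;

// Returns true only when every guardrail passes; any doubt means no recommendation.
function qualifies(
  current: ModelStats,
  candidate: ModelStats,
  debuggingShare: number, // fraction of edit turns categorized as Debugging
): boolean {
  if (current.provider !== candidate.provider) return false;
  if (current.editTurns < MIN_EDIT_TURNS || candidate.editTurns < MIN_EDIT_TURNS) return false;
  if (
    current.lastSeenDaysAgo > MAX_STALENESS_DAYS ||
    candidate.lastSeenDaysAgo > MAX_STALENESS_DAYS
  ) {
    return false;
  }
  // Guard against a zero or invalid baseline cost before dividing.
  if (!(current.costPerEditTurn > 0)) return false;
  if (candidate.costPerEditTurn / current.costPerEditTurn > MAX_COST_RATIO) return false;

  // Debugging-heavy projects get no tolerance: the candidate must meet or beat.
  const tolerance = debuggingShare > DEBUGGING_HEAVY_SHARE ? 0 : ONE_SHOT_TOLERANCE_PCT;
  return candidate.oneShotPct >= current.oneShotPct - tolerance;
}
```

With a current model at 90% one-shot and a candidate at 88% at 40 percent of the cost, this sketch accepts the candidate in a normal project and rejects it once the Debugging share exceeds 40 percent.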

What changed

Recommendation engine

Adds src/act/model-defaults.ts.

The engine consumes the existing compare stats pipeline and produces one recommendation per qualifying project. Each recommendation includes the evidence reviewers need to judge it:

  • project path and display name
  • current dominant model
  • candidate model
  • provider
  • edit turn counts for both models
  • one-shot rates for both models
  • cost per edit turn for both models
  • cost ratio
  • debugging-heavy flag

The engine refuses to recommend when any guardrail fails: insufficient volume, stale observations, a cost that is not low enough, reliability below threshold, or provider mismatch.
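
For orientation, the evidence fields listed above could be carried in a shape like the following. This is a sketch only: the interface names, field names, and the `evidenceLine` format are illustrative assumptions, not the types or output actually defined in this PR.

```typescript
// Hypothetical evidence for one model within a project.
interface ModelEvidence {
  model: string;
  editTurns: number;
  oneShotPct: number;      // one-shot rate in percent
  costPerEditTurn: number; // USD per edit turn
}

// Hypothetical recommendation payload mirroring the bullet list above.
interface ModelDefaultRecommendation {
  projectPath: string;
  displayName: string;
  provider: string;
  current: ModelEvidence;
  candidate: ModelEvidence;
  costRatio: number;       // candidate cost divided by current cost
  debuggingHeavy: boolean;
}

// Renders a one-line summary; the wording is illustrative, not CodeBurn's real output.
function evidenceLine(r: ModelDefaultRecommendation): string {
  const fmt = (m: ModelEvidence) =>
    `${m.model} ${m.oneShotPct.toFixed(1)}% one-shot over ${m.editTurns} edit turns`;
  const flag = r.debuggingHeavy ? " (debugging-heavy)" : "";
  return `${r.displayName}${flag}: ${fmt(r.candidate)} vs ${fmt(r.current)}, cost ratio ${r.costRatio.toFixed(2)}`;
}
```

Keeping both models' counts and rates in the payload lets a reviewer recheck every guardrail from the recommendation alone.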

Explicit apply command

Adds:

codeburn act apply-model <project>

The command:

  • resolves the project from parsed local sessions
  • recomputes the recommendation at apply time
  • writes <project>/.claude/settings.json
  • sets only the model key
  • preserves the rest of the JSON object
  • stores an expectedHash when the settings file already exists
  • writes through runAction() as kind: model-default
  • prints the evidence line
  • prints the undo command
  • prints the per-session override hint

This command is intentionally explicit. It is not wired into optimize --apply --yes.
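
The "set only the model key, preserve the rest, record expectedHash" steps can be sketched roughly as below. `planModelWrite` and `ApplyPlan` are hypothetical names, and the use of SHA-256 over the raw file contents is an assumption: the PR does not state how expectedHash is computed.

```typescript
import { createHash } from "node:crypto";

// Hypothetical apply plan; the real plan is passed to runAction() and may look different.
interface ApplyPlan {
  nextContent: string;   // new contents for <project>/.claude/settings.json
  expectedHash?: string; // present only when a settings file already existed
}

// `existing` is the current file contents, or null when the file does not exist yet.
function planModelWrite(existing: string | null, model: string): ApplyPlan {
  const parsed: unknown = existing === null ? {} : JSON.parse(existing);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("settings.json must contain a JSON object");
  }
  // Spread first so every other key survives; only `model` is overwritten.
  const next = { ...(parsed as Record<string, unknown>), model };
  const plan: ApplyPlan = { nextContent: JSON.stringify(next, null, 2) + "\n" };
  if (existing !== null) {
    // Assumed: hash of the bytes that were read, so a concurrent edit can be detected.
    plan.expectedHash = createHash("sha256").update(existing).digest("hex");
  }
  return plan;
}
```

Hashing the contents that were read, rather than the contents about to be written, is what would let a later write or undo notice that the file changed in between.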

Compare and optimize surfaces

Adds low-key recommendation blocks in both places where users already inspect model and waste evidence:

  • codeburn compare
  • codeburn optimize

These blocks are informational. They show the evidence and point to codeburn act apply-model <project> instead of mutating anything.

Act report integration

Extends codeburn act report for model-default actions.

Model default rows are not token or dollar claims. They are correlation-only quality checks:

  • capture the candidate model's pre-apply one-shot baseline
  • remeasure the same model after apply
  • require at least 20 post-apply edit turns before reporting
  • flag "quality regression, consider undo" if the post-apply one-shot rate drops by more than 5 percentage points
  • otherwise report correlation, not attribution

This keeps the accounting honest. We do not claim savings from a model-default change in v1 because the causal story is weaker than for config-token removals.
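
The tripwire rules above reduce to a small classification. The following is a sketch under assumed names (`ReportStatus`, `classifyModelDefault`, and the status strings are hypothetical); rates are in percent, and the baseline is the candidate model's pre-apply one-shot rate.

```typescript
// Hypothetical status values for a model-default report row.
type ReportStatus = "insufficient-data" | "quality-regression" | "correlation-only";

// Values taken from the list above; the constant names are assumptions.
const MIN_POST_APPLY_EDIT_TURNS = 20;
const REGRESSION_DROP_PCT = 5;

function classifyModelDefault(
  baselineOneShotPct: number, // candidate model's one-shot rate before apply
  postOneShotPct: number,     // same model's one-shot rate after apply
  postEditTurns: number,      // edit turns observed since apply
): ReportStatus {
  // Too few post-apply turns: report nothing rather than a noisy number.
  if (postEditTurns < MIN_POST_APPLY_EDIT_TURNS) return "insufficient-data";
  // A drop of more than 5 percentage points trips the "consider undo" flag.
  if (baselineOneShotPct - postOneShotPct > REGRESSION_DROP_PCT) return "quality-regression";
  // Otherwise the row is a correlation, never a savings or attribution claim.
  return "correlation-only";
}
```

Note that no branch produces a dollar or token figure, which matches the intent that model-default rows carry no savings claim in v1.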

Reviewer guide

Suggested review order:

  1. tests/act-model-defaults.test.ts

    • This captures the intended behavior first: qualifying recommendation, insufficient volume, one-shot guard, debugging guard, stale models, provider mismatch, apply, and undo.
  2. src/act/model-defaults.ts

    • Main recommendation logic and apply-plan builder.
    • Check the threshold constants and the expectedHash behavior.
  3. src/act/cli.ts

    • New act apply-model <project> command.
    • Verify that apply recomputes the recommendation and stays explicit.
  4. src/compare.tsx and src/optimize.ts

    • Informational recommendation blocks only.
    • No automatic mutation path.
  5. src/act/report.ts

    • Correlation-only model-default reporting and quality tripwire.

Safety model

This PR keeps the acting-layer invariants intact:

  • passive analyzer behavior is unchanged unless the user runs the explicit apply command
  • no hook, proxy, or request-path component is added
  • model-default changes are journaled
  • undo restores through the existing action framework
  • stale settings files are protected by expectedHash
  • recommendations require local project evidence, not cross-project borrowing
  • v1 only writes model, never effortLevel
  • v1 is same-provider only
  • optimize --apply --yes does not apply model defaults

Files changed

  • src/act/model-defaults.ts

    • New recommendation engine and apply-plan builder.
  • src/act/cli.ts

    • Adds codeburn act apply-model <project>.
  • src/act/report.ts

    • Adds model-default report rows and the one-shot quality regression tripwire.
  • src/compare.tsx

    • Adds recommendation block to compare output.
  • src/optimize.ts

    • Adds recommendation block to optimize output.
  • tests/act-model-defaults.test.ts

    • Adds focused coverage for the recommendation engine, guardrails, apply, and undo.

Verification

Local verification on branch feat/issue-607-model-defaults:

./node_modules/.bin/tsc --noEmit

Result: clean.

./node_modules/.bin/vitest run tests/act-model-defaults.test.ts

Result: 8 tests passing.

npm run build

Result: successful build, including dashboard build.

./node_modules/.bin/vitest run

Result: 1500 of 1503 tests passing locally.

The 3 failing tests reproduce on main, so they are not introduced by this PR:

  • tests/cli-status-menubar.test.ts, 2 failures on main
  • tests/cli-proxy-path.test.ts, 1 failure on main

Also verified before opening the PR:

  • final diff contains only the 6 intended source/test files
  • no AppleDouble ._* files remain
  • generated data files are not included in the diff
  • branch pushed cleanly to origin/feat/issue-607-model-defaults

Notes for maintainers

The main thing to scrutinize is not whether the command works, but whether the recommendation should exist at all under borderline data. The implementation is deliberately biased toward silence. If volume, reliability, recency, cost, provider family, or debugging-heavy safety is unclear, it returns no recommendation.

That is intentional. A missing recommendation is much cheaper than a bad model default.

@ozymandiashh ozymandiashh requested a review from iamtoruk July 3, 2026 23:04

Linked issue that merging this pull request may close: Model/effort default recommendations from compare data (design-gated)