Bump OpenAI confer/debate token budget; add per-provider override layer by fxspeiser · Pull Request #26 · fxspeiser/crosscheck-agent

fxspeiser · 2026-05-29T23:59:34Z

Summary

Fixes the recurring OpenAI MAX_TOKENS truncation on `confer` and `debate`.

Root cause: gpt-5 charges reasoning tokens against `max_completion_tokens`. At the reasoning-class default of 2048 it spends 1500-3000 on internal thinking and the visible reply gets truncated mid-sentence.

Fix: add a per-provider override layer in `_budget_for_purpose`. New precedence (highest to lowest):

`CFG.token_budgets[purpose]` — caller override
`CFG.token_budgets_by_provider[provider][purpose]` — per-provider operator override
`_PROVIDER_TOKEN_BUDGETS[provider][purpose]` — shipped per-provider overrides (this PR seeds `openai.confer` / `openai.debate` = 6144)
`_NON_REASONING_TOKEN_BUDGETS[purpose]` — non-reasoning model
`_DEFAULT_TOKEN_BUDGETS[purpose]` — reasoning-safe default

6144 ≈ ~3k headroom for gpt-5 reasoning + ~3k room for the visible answer. Other providers (anthropic / xai / gemini / mistral / groq / deepseek) are unaffected.

Anti-foot-gun: per-provider override values <= 0 are ignored. The override is keyed on provider name (not model class) because the issue is specifically OpenAI's API accounting — every OpenAI model gets the higher ceiling.

Test plan

New `scripts/test_provider_token_budgets.py` covers all four precedence levels, shipped openai defaults, other providers unaffected, 0/negative fallthrough, and end-to-end wire payload assertion
Full suite (38 scripts) passes locally

🤖 Generated with Claude Code

Symptom: OpenAI gpt-5 routinely runs out of `max_completion_tokens` mid-answer during `confer` and `debate`. Root cause: gpt-5 charges reasoning tokens against `max_completion_tokens`. At the reasoning- class default of 2048 it spends 1500-3000 tokens on internal thinking and the visible reply gets MAX_TOKENS-truncated. Fix: add a per-provider override layer in `_budget_for_purpose`. The new precedence (highest to lowest) is: 1. `CFG.token_budgets[purpose]` — caller override 2. `CFG.token_budgets_by_provider[provider][purpose]` — per-provider operator override (no code changes needed) 3. `_PROVIDER_TOKEN_BUDGETS[provider][purpose]` — shipped per-provider overrides (this PR seeds openai.confer / openai.debate = 6144) 4. `_NON_REASONING_TOKEN_BUDGETS[purpose]` — non-reasoning model 5. `_DEFAULT_TOKEN_BUDGETS[purpose]` — reasoning default Shipped seed values: - openai.confer = 6144 - openai.debate = 6144 6144 = ~3k headroom for gpt-5 reasoning + ~3k room for the visible answer. Other providers are unaffected (anthropic / xai / gemini / mistral / groq / deepseek all fall through to the existing reasoning- or non-reasoning ceilings). Anti-foot-gun: per-provider override values <= 0 are ignored (fall through to the next tier). The override is keyed on the provider name (not model class) because the issue is specifically how OpenAI's API accounts for reasoning tokens — even gpt-test (a non-reasoning OpenAI stub) gets the higher ceiling, since the threat model is provider- behavior rather than model-class. Tests (scripts/test_provider_token_budgets.py): - Shipped openai confer/debate = 6144 (reasoning + non-reasoning models) - Non-overridden purposes still respect tier ceilings (audit/synth fall to non-reasoning table for gpt-test, reasoning default for gpt-5) - Other providers unaffected by the openai override - CFG.token_budgets[purpose] beats the per-provider override - CFG.token_budgets_by_provider beats the shipped defaults; can also add overrides for providers that don't have shipped ones - Operator-supplied 0 / negative fall through to the next tier - No provider/model -> reasoning-safe defaults unchanged - End-to-end: openai confer wire payload carries max_completion_tokens=6144 (not 2048); anthropic confer still 2048 Full suite (38 scripts) passes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Follow-up to PR #26. Same MAX_TOKENS truncation pattern is hitting triangulate on gpt-5 and both confer + triangulate on gemini-2.5-pro: reasoning tokens count against `max_completion_tokens`, the visible answer gets cut mid-emission. New _PROVIDER_TOKEN_BUDGETS entries (all 6144): - openai.triangulate (was 2048 reasoning-default) - gemini.confer (was 1500 non-reasoning ceiling) - gemini.triangulate (was 1500 non-reasoning ceiling) Notable side-finding (logged as a follow-up, NOT fixed here): gemini-2.5-pro is NOT currently tagged in `PROVIDER_CAPS["gemini"]["reasoning_prefixes"]`, so non-overridden purposes (debate, synth, audit) fall through to the SMALLER non-reasoning ceilings (1500 / 1024 / 768) instead of the reasoning default of 2048. The provider-specific override in this PR short-circuits that for confer + triangulate, but if the user starts hitting caps on gemini debate / synth / audit, the right fix is adding `"reasoning_prefixes": ("gemini-2.5-pro",)` to the gemini PROVIDER_CAPS entry. Out of scope for this PR. Tests: - openai.triangulate = 6144 - gemini.confer = 6144, gemini.triangulate = 6144 - gemini.debate falls through to 1500 (the non-reasoning ceiling, documenting the follow-up gap explicitly) Full suite (38 scripts) passes. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

fxspeiser merged commit 548e95e into main May 29, 2026
1 check passed

fxspeiser deleted the feature/openai-budget-bump branch May 29, 2026 23:59

fxspeiser mentioned this pull request May 31, 2026

Bump confer / triangulate token budgets for OpenAI + Gemini #27

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bump OpenAI confer/debate token budget; add per-provider override layer#26

Bump OpenAI confer/debate token budget; add per-provider override layer#26
fxspeiser merged 1 commit into
mainfrom
feature/openai-budget-bump

fxspeiser commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

fxspeiser commented May 29, 2026

Summary

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant