Skip to content

Bump OpenAI confer/debate token budget; add per-provider override layer#26

Merged
fxspeiser merged 1 commit into
mainfrom
feature/openai-budget-bump
May 29, 2026
Merged

Bump OpenAI confer/debate token budget; add per-provider override layer#26
fxspeiser merged 1 commit into
mainfrom
feature/openai-budget-bump

Conversation

@fxspeiser
Copy link
Copy Markdown
Owner

Summary

Fixes the recurring OpenAI MAX_TOKENS truncation on `confer` and `debate`.

Root cause: gpt-5 charges reasoning tokens against `max_completion_tokens`. At the reasoning-class default of 2048 it spends 1500-3000 on internal thinking and the visible reply gets truncated mid-sentence.

Fix: add a per-provider override layer in `_budget_for_purpose`. New precedence (highest to lowest):

  1. `CFG.token_budgets[purpose]` — caller override
  2. `CFG.token_budgets_by_provider[provider][purpose]` — per-provider operator override
  3. `_PROVIDER_TOKEN_BUDGETS[provider][purpose]` — shipped per-provider overrides (this PR seeds `openai.confer` / `openai.debate` = 6144)
  4. `_NON_REASONING_TOKEN_BUDGETS[purpose]` — non-reasoning model
  5. `_DEFAULT_TOKEN_BUDGETS[purpose]` — reasoning-safe default

6144 ≈ ~3k headroom for gpt-5 reasoning + ~3k room for the visible answer. Other providers (anthropic / xai / gemini / mistral / groq / deepseek) are unaffected.

Anti-foot-gun: per-provider override values <= 0 are ignored. The override is keyed on provider name (not model class) because the issue is specifically OpenAI's API accounting — every OpenAI model gets the higher ceiling.

Test plan

  • New `scripts/test_provider_token_budgets.py` covers all four precedence levels, shipped openai defaults, other providers unaffected, 0/negative fallthrough, and end-to-end wire payload assertion
  • Full suite (38 scripts) passes locally

🤖 Generated with Claude Code

Symptom: OpenAI gpt-5 routinely runs out of `max_completion_tokens`
mid-answer during `confer` and `debate`. Root cause: gpt-5 charges
reasoning tokens against `max_completion_tokens`. At the reasoning-
class default of 2048 it spends 1500-3000 tokens on internal thinking
and the visible reply gets MAX_TOKENS-truncated.

Fix: add a per-provider override layer in `_budget_for_purpose`. The
new precedence (highest to lowest) is:

  1. `CFG.token_budgets[purpose]`                — caller override
  2. `CFG.token_budgets_by_provider[provider][purpose]` — per-provider
     operator override (no code changes needed)
  3. `_PROVIDER_TOKEN_BUDGETS[provider][purpose]` — shipped per-provider
     overrides (this PR seeds openai.confer / openai.debate = 6144)
  4. `_NON_REASONING_TOKEN_BUDGETS[purpose]`      — non-reasoning model
  5. `_DEFAULT_TOKEN_BUDGETS[purpose]`            — reasoning default

Shipped seed values:
- openai.confer = 6144
- openai.debate = 6144

6144 = ~3k headroom for gpt-5 reasoning + ~3k room for the visible
answer. Other providers are unaffected (anthropic / xai / gemini /
mistral / groq / deepseek all fall through to the existing reasoning-
or non-reasoning ceilings).

Anti-foot-gun: per-provider override values <= 0 are ignored (fall
through to the next tier). The override is keyed on the provider name
(not model class) because the issue is specifically how OpenAI's API
accounts for reasoning tokens — even gpt-test (a non-reasoning OpenAI
stub) gets the higher ceiling, since the threat model is provider-
behavior rather than model-class.

Tests (scripts/test_provider_token_budgets.py):
- Shipped openai confer/debate = 6144 (reasoning + non-reasoning models)
- Non-overridden purposes still respect tier ceilings
  (audit/synth fall to non-reasoning table for gpt-test, reasoning
  default for gpt-5)
- Other providers unaffected by the openai override
- CFG.token_budgets[purpose] beats the per-provider override
- CFG.token_budgets_by_provider beats the shipped defaults; can also
  add overrides for providers that don't have shipped ones
- Operator-supplied 0 / negative fall through to the next tier
- No provider/model -> reasoning-safe defaults unchanged
- End-to-end: openai confer wire payload carries
  max_completion_tokens=6144 (not 2048); anthropic confer still 2048

Full suite (38 scripts) passes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@fxspeiser fxspeiser merged commit 548e95e into main May 29, 2026
1 check passed
@fxspeiser fxspeiser deleted the feature/openai-budget-bump branch May 29, 2026 23:59
fxspeiser added a commit that referenced this pull request May 31, 2026
Follow-up to PR #26. Same MAX_TOKENS truncation pattern is hitting
triangulate on gpt-5 and both confer + triangulate on gemini-2.5-pro:
reasoning tokens count against `max_completion_tokens`, the visible
answer gets cut mid-emission.

New _PROVIDER_TOKEN_BUDGETS entries (all 6144):
- openai.triangulate    (was 2048 reasoning-default)
- gemini.confer         (was 1500 non-reasoning ceiling)
- gemini.triangulate    (was 1500 non-reasoning ceiling)

Notable side-finding (logged as a follow-up, NOT fixed here):
gemini-2.5-pro is NOT currently tagged in
`PROVIDER_CAPS["gemini"]["reasoning_prefixes"]`, so non-overridden
purposes (debate, synth, audit) fall through to the SMALLER
non-reasoning ceilings (1500 / 1024 / 768) instead of the reasoning
default of 2048. The provider-specific override in this PR
short-circuits that for confer + triangulate, but if the user starts
hitting caps on gemini debate / synth / audit, the right fix is
adding `"reasoning_prefixes": ("gemini-2.5-pro",)` to the gemini
PROVIDER_CAPS entry. Out of scope for this PR.

Tests:
- openai.triangulate = 6144
- gemini.confer = 6144, gemini.triangulate = 6144
- gemini.debate falls through to 1500 (the non-reasoning ceiling,
  documenting the follow-up gap explicitly)

Full suite (38 scripts) passes.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant