Feat/Provider Fallback Chain — Design Document (#2574)#2581

Closed
idling11 wants to merge 2 commits into Hmbown:main from idling11:feat/provider-fallback-chain

@idling11
Contributor

@idling11 idling11 commented Jun 2, 2026

Summary

Add an automatic provider fallback chain so that when the active provider
keeps failing with a fallback-eligible error (429, selected 5xx, connection
timeout), CodeWhale switches to the next configured provider without
interrupting the user's workflow.

Motivation

Currently, users must manually run /provider to switch when their
primary provider fails. This is especially disruptive during long-running
agentic tasks. A fallback chain keeps the agent working without user
intervention.

Design

Configuration

[providers]
active = "nvidia-nim"
fallback = ["deepseek", "openrouter"]

[providers.nvidia-nim]
api_key = "nvapi-..."
base_url = "https://integrate.api.nvidia.com/v1"
model = "meta/llama-4"

[providers.deepseek]
api_key = "$DEEPSEEK_API_KEY"
model = "deepseek-v4-pro"

[providers.openrouter]
api_key = "$OPENROUTER_API_KEY"
model = "deepseek/deepseek-v4-0324"
  • fallback — ordered list of provider names to try
  • active — the primary provider (existing provider key, renamed for clarity)

Fallback triggers

| Error | Fallback? | Rationale |
| --- | --- | --- |
| 429 (rate limit) | Yes | Quota exhausted; swap key/provider |
| 502 / 503 / 504 | Yes | Provider infrastructure issue |
| Connection timeout / DNS failure | Yes | Network path broken |
| 401 / 403 | No | Auth issue; no other provider will help |
| 400 (bad request) | No | Client error; not provider-specific |
| Stream interrupted mid-content | No | Already consumed partial response |
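
The trigger table can be read as a single classifier function. The following is a minimal sketch, not the project's actual code: the `ProviderError` enum and its variants are hypothetical stand-ins for whatever error type the client exposes, and only the `is_fallback_eligible` name comes from the implementation plan.

```rust
/// Hypothetical error type; the real client error enum may look different.
#[derive(Debug, Clone, PartialEq)]
pub enum ProviderError {
    /// HTTP response with a non-success status code.
    Http(u16),
    /// Connection timed out before any response arrived.
    Timeout,
    /// Host name could not be resolved.
    Dns,
    /// Stream broke after part of the response was already delivered.
    StreamInterrupted,
}

/// Returns true when the trigger table marks the error as fallback-eligible.
pub fn is_fallback_eligible(error: &ProviderError) -> bool {
    match error {
        ProviderError::Http(429) => true,
        ProviderError::Http(502 | 503 | 504) => true,
        ProviderError::Timeout | ProviderError::Dns => true,
        // 401/403, 400, and every other status stay on the current provider.
        ProviderError::Http(_) => false,
        // Partial content was already consumed, so switching is not safe.
        ProviderError::StreamInterrupted => false,
    }
}
```

Statuses the table does not list (for example 500) are treated as not eligible in this sketch, which is an assumption about what "selected 5xx" means.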

Sequence

1. Try primary provider (nvidia-nim)
2. On fallback-eligible error → try fallback[0] (deepseek)
3. On fallback-eligible error → try fallback[1] (openrouter)
4. All exhausted → surface clear error to user
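
The sequence can be sketched as a loop over the primary provider followed by each fallback entry. This is an illustrative sketch only: the `Attempt` enum and the closure-based signature are assumptions made to keep the example self-contained, and the real `try_with_fallback()` in client.rs may be shaped differently.

```rust
/// Outcome of a single provider attempt (hypothetical shape).
pub enum Attempt<T> {
    Ok(T),
    /// Fallback-eligible error: the next provider should be tried.
    Eligible(String),
    /// Error that must be surfaced immediately (for example 401 or 400).
    Fatal(String),
}

/// Tries `active` first, then each entry of `fallback` in order.
/// On success, returns the name of the provider that answered and its result.
pub fn try_with_fallback<T>(
    active: &str,
    fallback: &[&str],
    mut call: impl FnMut(&str) -> Attempt<T>,
) -> Result<(String, T), String> {
    let mut failures = Vec::new();
    for name in std::iter::once(active).chain(fallback.iter().copied()) {
        match call(name) {
            Attempt::Ok(value) => return Ok((name.to_string(), value)),
            Attempt::Eligible(reason) => failures.push(format!("{name}: {reason}")),
            Attempt::Fatal(reason) => return Err(format!("{name}: {reason}")),
        }
    }
    // Step 4: every provider failed with an eligible error.
    Err(format!("all providers exhausted ({})", failures.join("; ")))
}
```

Collecting each provider's failure reason lets the final error say why every link in the chain failed, which is one way to make step 4's "clear error" concrete.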

Transcript / UI

  • Status toast: NVIDIA NIM unavailable — switched to DeepSeek
  • Transcript marker: [provider: nvidia-nim → deepseek]
  • /provider command shows current chain position: deepseek (fallback #1)
  • Original (active) provider is remembered so user can /provider reset to go back
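
One way to back these UI elements is a small tracker that holds the chain and the current position. The sketch below is hypothetical: only the `ActiveProviderTracker` name, the transcript marker format, and the `(fallback #1)` label come from this document, while the method names and the `(primary)` label are assumptions.

```rust
/// Hypothetical tracker: remembers the primary provider and the chain position.
pub struct ActiveProviderTracker {
    /// chain[0] is the primary (active) provider; the rest are fallbacks in order.
    chain: Vec<String>,
    position: usize,
}

impl ActiveProviderTracker {
    pub fn new(active: &str, fallback: &[&str]) -> Self {
        let mut chain = vec![active.to_string()];
        chain.extend(fallback.iter().map(|s| s.to_string()));
        Self { chain, position: 0 }
    }

    /// Moves to the next provider and returns the transcript marker,
    /// or None when the chain is exhausted.
    pub fn advance(&mut self) -> Option<String> {
        let next = self.position + 1;
        if next >= self.chain.len() {
            return None;
        }
        let marker = format!(
            "[provider: {} → {}]",
            self.chain[self.position], self.chain[next]
        );
        self.position = next;
        Some(marker)
    }

    /// Text for `/provider`, for example "deepseek (fallback #1)".
    pub fn describe(&self) -> String {
        let name = &self.chain[self.position];
        if self.position == 0 {
            format!("{name} (primary)")
        } else {
            format!("{name} (fallback #{})", self.position)
        }
    }

    /// `/provider reset`: return to the original provider.
    pub fn reset(&mut self) {
        self.position = 0;
    }
}
```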

Capability awareness

Before switching, the engine checks that the fallback provider supports
the current turn's needs:

| Capability | Check |
| --- | --- |
| Tools / function calling | Fallback provider must support tools |
| Reasoning effort | Must support the same reasoning levels |
| Context length | Model's context window must be ≥ the current turn's token count |
| Vision | Must support image inputs if the turn has images |

If no fallback provider meets capabilities, the error is surfaced directly.
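
The capability check could look roughly like the sketch below. All type and field names here are hypothetical, and the reasoning check is interpreted as "the fallback must offer the effort level the turn requested", which is an assumption about what "same reasoning levels" means.

```rust
/// Hypothetical capability profile for a provider/model pair.
#[derive(Debug, Clone)]
pub struct Capabilities {
    pub tools: bool,
    pub vision: bool,
    pub reasoning_levels: Vec<String>,
    pub context_window: usize,
}

/// What the current turn requires (hypothetical shape).
pub struct TurnNeeds {
    pub uses_tools: bool,
    pub has_images: bool,
    pub reasoning_effort: Option<String>,
    pub token_count: usize,
}

/// True when the provider satisfies every row of the capability table.
pub fn supports_turn(caps: &Capabilities, turn: &TurnNeeds) -> bool {
    let tools_ok = !turn.uses_tools || caps.tools;
    let vision_ok = !turn.has_images || caps.vision;
    let reasoning_ok = turn
        .reasoning_effort
        .as_ref()
        .map_or(true, |level| caps.reasoning_levels.contains(level));
    let context_ok = caps.context_window >= turn.token_count;
    tools_ok && vision_ok && reasoning_ok && context_ok
}

/// First fallback candidate, in chain order, that can serve the turn.
pub fn first_capable<'a>(
    candidates: &'a [(String, Capabilities)],
    turn: &TurnNeeds,
) -> Option<&'a str> {
    candidates
        .iter()
        .find(|(_, caps)| supports_turn(caps, turn))
        .map(|(name, _)| name.as_str())
}
```

A `None` result from `first_capable` corresponds to the case where the original error is surfaced directly.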

Retry integration

Existing [retry] settings apply per-provider before fallback triggers.
A provider gets max_retries attempts with retry_delay between them.
Only after retry exhaustion does fallback move to the next provider.
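
The per-provider retry step might be sketched as follows. `RetryConfig` is a hypothetical mirror of the `[retry]` settings, and the sketch assumes `max_retries` counts total attempts (with at least one attempt always made); the actual semantics in the codebase may differ.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical mirror of the `[retry]` settings.
pub struct RetryConfig {
    pub max_retries: u32,
    pub retry_delay: Duration,
}

/// Calls one provider up to `max_retries` times, sleeping `retry_delay`
/// between attempts. The closure receives the 1-based attempt number.
/// The last error is returned once attempts run out, and only then
/// would the caller move to the next provider in the chain.
pub fn call_with_retry<T, E>(
    retry: &RetryConfig,
    mut attempt: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let total = retry.max_retries.max(1);
    let mut n = 1;
    loop {
        match attempt(n) {
            Ok(value) => return Ok(value),
            Err(error) if n >= total => return Err(error),
            Err(_) => {
                sleep(retry.retry_delay);
                n += 1;
            }
        }
    }
}
```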

Config schema validation

On startup, validate:

  • Each fallback entry is a known provider
  • No duplicate providers in chain
  • Fallback entry is not the same as active provider
  • Warn if fallback model has different capability profile
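
The three hard validation rules can be sketched as one function. The `ChainError` enum and the `validate_chain` signature are hypothetical, and the capability-profile warning is left out because it is advisory rather than an error.

```rust
use std::collections::HashSet;

/// Hypothetical validation errors, one per startup rule.
#[derive(Debug, PartialEq)]
pub enum ChainError {
    UnknownProvider(String),
    Duplicate(String),
    SameAsActive(String),
}

/// Checks the fallback chain against the list of configured provider names.
pub fn validate_chain(
    active: &str,
    fallback: &[&str],
    known: &[&str],
) -> Result<(), ChainError> {
    let mut seen = HashSet::new();
    for &name in fallback {
        if !known.contains(&name) {
            return Err(ChainError::UnknownProvider(name.to_string()));
        }
        if name == active {
            return Err(ChainError::SameAsActive(name.to_string()));
        }
        // `insert` returns false when the name was already in the set.
        if !seen.insert(name) {
            return Err(ChainError::Duplicate(name.to_string()));
        }
    }
    Ok(())
}
```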

Implementation Plan (3 Draft PRs)

Phase 1: Config schema + validation

Branch: feat/provider-fallback-chain-phase1
Files: crates/tui/src/config.rs

  • Add fallback: Option<Vec<String>> field to ProvidersConfig
  • Add #[serde(default)] for backward compatibility
  • Add validation in Config::validate(): known provider, no duplicates, not same as active
  • Add fallback merge logic in merge_provider_config()
  • Unit tests: valid chain, invalid provider, duplicates

Phase 2: Engine fallback logic

Branch: feat/provider-fallback-chain-phase2
Files: crates/tui/src/client.rs, crates/tui/src/core/engine/turn_loop.rs

  • Add ActiveProviderTracker to remember original provider and current position
  • Error classifier: is_fallback_eligible(error) -> bool
  • try_with_fallback() in client.rs: iterate fallback chain on eligible errors
  • Save original provider before first fallback, restore via /provider reset
  • Event emission: ProviderFallback { from, to, reason }
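
As a rough illustration of the event in the last bullet, the payload could be a plain enum variant that the UI layer turns into text. The `EngineEvent` wrapper, the field types, and the `toast_text` helper are assumptions; only the `ProviderFallback { from, to, reason }` shape comes from this plan.

```rust
/// Hypothetical event enum carrying the `ProviderFallback` payload.
#[derive(Debug, Clone, PartialEq)]
pub enum EngineEvent {
    ProviderFallback {
        from: String,
        to: String,
        reason: String,
    },
}

/// Builds a status-toast string from the event (illustrative wording only).
pub fn toast_text(event: &EngineEvent) -> String {
    match event {
        EngineEvent::ProviderFallback { from, to, reason } => {
            format!("{from} unavailable ({reason}), switched to {to}")
        }
    }
}
```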

Phase 3: UI feedback

Branch: feat/provider-fallback-chain-phase3
Files: crates/tui/src/tui/ui.rs, crates/tui/src/commands/provider.rs

  • Status toast on fallback switch
  • Transcript marker for provider transitions
  • /provider shows fallback position and chain
  • /provider reset to return to primary provider

Rejected alternatives

  • Per-request model routing: Too fine-grained; turns have state (system prompt, tools) that shouldn't change mid-turn
  • Weighted random selection: Unpredictable billing; users need deterministic behavior
  • Sub-agent-level fallback: Complicates sub-agent lifecycle for marginal gain

Open questions

  1. Should fallback persist across sessions or reset each launch?
    Reset each launch (avoids silently staying on fallback forever)
  2. Should /compact reset to primary provider?
    No — compaction changes context, not provider
  3. Tool call mid-turn: if a tool call succeeds but the next API call fails, do we fall back?
    Yes, same turn can span providers as long as capabilities match

@greptile-apps
Contributor

greptile-apps Bot commented Jun 2, 2026

No reviewable files after applying ignore patterns.

@github-actions

github-actions Bot commented Jun 2, 2026

Thanks @idling11 for taking the time to contribute.

This repository is currently observing a maintainer-managed contribution gate in dry-run mode, so this pull request is staying open. When enforcement is enabled, pull requests from contributors who are not listed in .github/APPROVED_CONTRIBUTORS will be closed automatically.

Please read CONTRIBUTING.md for the expected contribution shape. A maintainer can grant PR access by commenting /lgtm on a pull request.

@gemini-code-assist
Contributor

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

@Hmbown
Owner

Hmbown commented Jun 3, 2026

Thanks @idling11 for writing up the provider fallback-chain design. This was not something to squeeze into the corrective v0.8.52 publish, but it is still a useful planning artifact for #2574 and belongs in the v0.8.53+ review sweep. Sorry it sat under mostly bot comments during the release cleanup.

@Hmbown
Owner

Hmbown commented Jun 5, 2026

Thanks @idling11 — the provider fallback chain design document was harvested into the public v0.9.0 integration branch (codex/v0.9.0-stewardship) as 5dc1a63cd under docs/rfcs/2574-provider-fallback-chain.md.

The original PR head had no remaining net file changes when reviewed, so this was taken as a manual RFC harvest rather than a direct merge. The commit includes "Harvested from PR #2581 by @idling11" and your GitHub-mappable co-author trailer.

Closing this PR as harvested. Issue #2574 remains open for implementation work beyond the design document.

@Hmbown Hmbown closed this Jun 5, 2026
Hmbown added a commit that referenced this pull request Jun 5, 2026
Update the execution map after closing harvested or superseded PRs #2476, #2498, #2502, #2513, #2530, #2576, #2581, #2636, #2639, #2640, #2708, and #2730, and refresh the live PR count.