Feat/Provider Fallback Chain — Design Document (#2574) by idling11 · Pull Request #2581 · Hmbown/CodeWhale

idling11 · 2026-06-02T07:27:15Z

Summary

Add an automatic provider fallback chain so that when the active provider
returns a fallback-eligible error (429, 502/503/504, connection timeout, or
DNS failure) and its retries are exhausted, CodeWhale switches to the next
configured provider without interrupting the user's workflow.

Motivation

Currently, users must manually run /provider to switch when their
primary provider fails. This is especially disruptive during long-running
agentic tasks. A fallback chain keeps the agent working without user
intervention.

Design

Configuration

[providers]
active = "nvidia-nim"
fallback = ["deepseek", "openrouter"]

[providers.nvidia-nim]
api_key = "nvapi-..."
base_url = "https://integrate.api.nvidia.com/v1"
model = "meta/llama-4"

[providers.deepseek]
api_key = "$DEEPSEEK_API_KEY"
model = "deepseek-v4-pro"

[providers.openrouter]
api_key = "$OPENROUTER_API_KEY"
model = "deepseek/deepseek-v4-0324"

fallback — ordered list of provider names to try
active — the primary provider (existing provider key, renamed for clarity)

Fallback triggers

Error	Fallback?	Rationale
429 (rate limit)	✅	Quota exhausted — swap key/provider
502 / 503 / 504	✅	Provider infrastructure issue
Connection timeout / DNS failure	✅	Network path broken
401 / 403	❌	Auth issue — no other provider will help
400 (bad request)	❌	Client error — not provider-specific
Stream interrupted mid-content	❌	Already consumed partial response
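
The trigger table above can be read as a small classifier. The following is a minimal sketch under stated assumptions, not the project's actual code: the ProviderError enum and its variants are hypothetical stand-ins for whatever error type client.rs exposes, and only the function name is_fallback_eligible comes from the implementation plan.

```rust
/// Hypothetical error shape for illustration; the real client error type may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum ProviderError {
    /// HTTP response with a non-success status code.
    Http(u16),
    /// Connection timed out before any response arrived.
    Timeout,
    /// Hostname could not be resolved.
    Dns,
    /// Stream broke after partial content was already delivered.
    StreamInterrupted,
}

/// Returns true when the error should move the request to the next provider,
/// following the rows of the trigger table.
pub fn is_fallback_eligible(error: &ProviderError) -> bool {
    match error {
        // Rate limit: quota exhausted on this provider.
        ProviderError::Http(429) => true,
        // Provider infrastructure issue.
        ProviderError::Http(502 | 503 | 504) => true,
        // Network path broken.
        ProviderError::Timeout | ProviderError::Dns => true,
        // 400, 401, 403 and any status not listed in the table do not trigger fallback.
        ProviderError::Http(_) => false,
        // Partial content was already consumed, so the request is not replayed.
        ProviderError::StreamInterrupted => false,
    }
}
```

Treating every unlisted status (including 500) as ineligible is one reading of "selected 5xx"; the table only names 502, 503, and 504.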

Sequence

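1. Try primary provider (nvidia-nim), retrying per the [retry] settings
2. On fallback-eligible error after retries are exhausted → try fallback[0] (deepseek)
3. On fallback-eligible error after retries are exhausted → try fallback[1] (openrouter)
4. All providers exhausted → surface a clear error to the user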
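
The sequence above amounts to walking an ordered list and stopping at the first success or the first error that is not fallback-eligible. Below is a rough sketch of that loop. The name try_with_fallback comes from the Phase 2 plan, but the signature, the closure parameters, and the return type are assumptions made for illustration; retries and capability checks are left out to keep the loop visible.

```rust
/// Tries `active` first, then each entry of `fallback` in order.
///
/// `send` performs one request against the named provider and `is_eligible`
/// decides whether an error may move on to the next provider. Both are
/// hypothetical hooks, not the project's real API.
///
/// On success, returns the name of the provider that answered plus its value.
/// On failure, returns every (provider, error) pair seen along the way.
pub fn try_with_fallback<T, E>(
    active: &str,
    fallback: &[&str],
    mut send: impl FnMut(&str) -> Result<T, E>,
    is_eligible: impl Fn(&E) -> bool,
) -> Result<(String, T), Vec<(String, E)>> {
    let mut failures = Vec::new();
    for provider in std::iter::once(active).chain(fallback.iter().copied()) {
        match send(provider) {
            Ok(value) => return Ok((provider.to_string(), value)),
            Err(err) => {
                let eligible = is_eligible(&err);
                failures.push((provider.to_string(), err));
                if !eligible {
                    // Non-eligible errors (for example 401) stop the chain immediately.
                    break;
                }
            }
        }
    }
    Err(failures)
}
```

Returning the full failure list rather than only the last error is a design choice of this sketch; it makes it easier to build the "clear error" of step 4 from every provider that was tried.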

Transcript / UI

Status toast: NVIDIA NIM unavailable — switched to DeepSeek
Transcript marker: [provider: nvidia-nim → deepseek]
/provider command shows current chain position: deepseek (fallback #1)
The original (active) provider is remembered so the user can run /provider reset to switch back
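
The marker and status strings above could be produced from the fallback event with two small formatting helpers. This is a sketch only: the struct mirrors the ProviderFallback { from, to, reason } event named in the plan, but its field types, both function names, and the "(primary)" label are assumptions.

```rust
/// Mirrors the `ProviderFallback { from, to, reason }` event from the plan.
/// Field types are assumed to be plain strings here.
pub struct ProviderFallback {
    pub from: String,
    pub to: String,
    pub reason: String,
}

/// Builds the transcript marker in the form shown above,
/// for example `[provider: nvidia-nim → deepseek]`.
pub fn transcript_marker(event: &ProviderFallback) -> String {
    format!("[provider: {} → {}]", event.from, event.to)
}

/// Builds the `/provider` status line: the provider name plus its chain position.
/// `fallback_index` is zero-based; `None` means the primary provider is active.
/// The "(primary)" wording is a placeholder, not taken from the design.
pub fn chain_position(provider: &str, fallback_index: Option<usize>) -> String {
    match fallback_index {
        Some(index) => format!("{} (fallback #{})", provider, index + 1),
        None => format!("{} (primary)", provider),
    }
}
```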

Capability awareness

Before switching, the engine checks that the fallback provider supports
the current turn's needs:

Capability	Check
Tools / function calling	Fallback provider must support tools
Reasoning effort	Must support the same reasoning levels
Context length	Model's context window must be ≥ the current turn's token count
Vision	Must support image inputs if turn has images

If no fallback provider meets these requirements, the original error is surfaced directly.
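
One way to picture the capability check is as a predicate over a provider's capabilities and the current turn's needs. The sketch below makes that concrete under assumptions: both structs, all field names, and the functions supports_turn and first_capable are hypothetical. "Same reasoning levels" is interpreted here as "the level requested by the turn is supported", which may be narrower than what the design intends.

```rust
/// What a provider/model can do. Field names are illustrative assumptions.
#[derive(Debug, Clone)]
pub struct Capabilities {
    pub tools: bool,
    pub vision: bool,
    pub reasoning_levels: Vec<String>,
    pub context_window: usize,
}

/// What the current turn requires. Also hypothetical.
#[derive(Debug, Clone)]
pub struct TurnNeeds {
    pub uses_tools: bool,
    pub has_images: bool,
    pub reasoning_level: Option<String>,
    pub token_count: usize,
}

/// Applies the four rows of the capability table to one candidate.
pub fn supports_turn(caps: &Capabilities, needs: &TurnNeeds) -> bool {
    let tools_ok = !needs.uses_tools || caps.tools;
    let vision_ok = !needs.has_images || caps.vision;
    let reasoning_ok = match &needs.reasoning_level {
        Some(level) => caps.reasoning_levels.iter().any(|l| l == level),
        None => true,
    };
    let context_ok = caps.context_window >= needs.token_count;
    tools_ok && vision_ok && reasoning_ok && context_ok
}

/// Returns the first candidate, in chain order, that can serve the turn.
/// `None` corresponds to surfacing the error directly.
pub fn first_capable<'a>(
    candidates: &'a [(String, Capabilities)],
    needs: &TurnNeeds,
) -> Option<&'a str> {
    candidates
        .iter()
        .find(|(_, caps)| supports_turn(caps, needs))
        .map(|(name, _)| name.as_str())
}
```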

Retry integration

Existing [retry] settings apply per-provider before fallback triggers.
A provider gets max_retries attempts with retry_delay between them.
Only after retry exhaustion does fallback move to the next provider.
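
The per-provider retry step could look roughly like the loop below. Only the setting names max_retries and retry_delay come from the text above; the RetryConfig struct, the attempt_with_retry function, and the choice to always make at least one attempt are assumptions of this sketch.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Mirrors the `max_retries` and `retry_delay` settings named above.
/// The struct itself is hypothetical.
pub struct RetryConfig {
    pub max_retries: u32,
    pub retry_delay: Duration,
}

/// Gives one provider up to `max_retries` attempts, sleeping `retry_delay`
/// between them. A value of 0 is treated as a single attempt (an assumption).
///
/// Returns the last error once attempts are exhausted; at that point the
/// caller may move on to the next provider in the chain.
pub fn attempt_with_retry<T, E>(
    retry: &RetryConfig,
    mut send: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let attempts = retry.max_retries.max(1);
    let mut last_err = None;
    for attempt in 1..=attempts {
        match send() {
            Ok(value) => return Ok(value),
            Err(err) => {
                last_err = Some(err);
                // No sleep after the final attempt; fallback should start right away.
                if attempt < attempts {
                    sleep(retry.retry_delay);
                }
            }
        }
    }
    Err(last_err.expect("at least one attempt is always made"))
}
```

A fuller version would probably skip the remaining retries when the error is not retryable at all (for example 401), but the design text does not specify that, so it is not modeled here.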

Config schema validation

On startup, validate:

Each fallback entry is a known provider
The chain contains no duplicate providers
No fallback entry is the same as the active provider
Warn if a fallback model has a different capability profile from the active model
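
The three hard checks in the list above can be sketched as a single pass over the chain. The error enum, the function name validate_fallback_chain, and its parameters are hypothetical; the plan places the real logic in Config::validate(). The capability-profile warning is not covered here because it is advisory rather than a validation failure.

```rust
use std::collections::HashSet;

/// Hypothetical validation errors, one per hard rule in the list above.
#[derive(Debug, PartialEq)]
pub enum ChainError {
    UnknownProvider(String),
    DuplicateProvider(String),
    SameAsActive(String),
}

/// Checks the fallback chain against the set of configured provider names.
/// Returns the first violation found, in chain order.
pub fn validate_fallback_chain(
    active: &str,
    fallback: &[String],
    known: &HashSet<String>,
) -> Result<(), ChainError> {
    let mut seen = HashSet::new();
    for name in fallback {
        if !known.contains(name) {
            return Err(ChainError::UnknownProvider(name.clone()));
        }
        if name.as_str() == active {
            return Err(ChainError::SameAsActive(name.clone()));
        }
        // `insert` returns false when the name was already present.
        if !seen.insert(name.as_str()) {
            return Err(ChainError::DuplicateProvider(name.clone()));
        }
    }
    Ok(())
}
```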

Implementation Plan (3 Draft PRs)

Phase 1: Config schema + validation

Branch: feat/provider-fallback-chain-phase1
Files: crates/tui/src/config.rs

Add fallback: Option<Vec<String>> field to ProvidersConfig
Add #[serde(default)] for backward compatibility
Add validation in Config::validate(): known provider, no duplicates, not same as active
Add fallback merge logic in merge_provider_config()
Unit tests: valid chain, invalid provider, duplicates

Phase 2: Engine fallback logic

Branch: feat/provider-fallback-chain-phase2
Files: crates/tui/src/client.rs, crates/tui/src/core/engine/turn_loop.rs

Add ActiveProviderTracker to remember original provider and current position
Error classifier: is_fallback_eligible(error) -> bool
try_with_fallback() in client.rs: iterate fallback chain on eligible errors
Save original provider before first fallback, restore via /provider reset
Event emission: ProviderFallback { from, to, reason }
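
As a rough illustration of the tracker described in this phase, the sketch below keeps the original provider, the chain, and the current position in memory. Only the type name ActiveProviderTracker comes from the plan; the fields and the methods current, advance, and reset are assumptions about how it might be shaped.

```rust
/// Sketch of the tracker named in Phase 2; fields and methods are assumptions.
#[derive(Debug)]
pub struct ActiveProviderTracker {
    original: String,
    chain: Vec<String>,
    /// `None` while on the original provider, `Some(i)` while on `chain[i]`.
    position: Option<usize>,
}

impl ActiveProviderTracker {
    pub fn new(original: &str, chain: Vec<String>) -> Self {
        Self {
            original: original.to_string(),
            chain,
            position: None,
        }
    }

    /// Name of the provider currently in use.
    pub fn current(&self) -> &str {
        match self.position {
            Some(index) => self.chain[index].as_str(),
            None => self.original.as_str(),
        }
    }

    /// Moves to the next provider in the chain.
    /// Returns `None`, leaving the position unchanged, when the chain is exhausted.
    pub fn advance(&mut self) -> Option<&str> {
        let next = match self.position {
            Some(index) => index + 1,
            None => 0,
        };
        if next < self.chain.len() {
            self.position = Some(next);
            Some(self.chain[next].as_str())
        } else {
            None
        }
    }

    /// Returns to the original provider, as `/provider reset` is meant to do.
    pub fn reset(&mut self) {
        self.position = None;
    }
}
```

Because the position lives only in memory, a tracker like this starts on the primary provider at every launch, which matches the answer given under "Open questions".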

Phase 3: UI feedback

Branch: feat/provider-fallback-chain-phase3
Files: crates/tui/src/tui/ui.rs, crates/tui/src/commands/provider.rs

Status toast on fallback switch
Transcript marker for provider transitions
/provider shows fallback position and chain
/provider reset to return to primary provider

Rejected alternatives

Per-request model routing: Too fine-grained; turns have state (system prompt, tools) that shouldn't change mid-turn
Weighted random selection: Unpredictable billing; users need deterministic behavior
Sub-agent-level fallback: Complicates sub-agent lifecycle for marginal gain

Open questions

Should fallback persist across sessions or reset each launch?
→ Reset each launch (avoids silently staying on fallback forever)
Should /compact reset to primary provider?
→ No — compaction changes context, not provider
Tool call mid-turn: if a tool call succeeds but the next API call fails, do we fall back?
→ Yes, same turn can span providers as long as capabilities match

greptile-apps · 2026-06-02T07:27:19Z

No reviewable files after applying ignore patterns.

github-actions · 2026-06-02T07:27:24Z

Thanks @idling11 for taking the time to contribute.

This repository is currently observing a maintainer-managed contribution gate in dry-run mode, so this pull request is staying open. When enforcement is enabled, pull requests from contributors who are not listed in .github/APPROVED_CONTRIBUTORS will be closed automatically.

Please read CONTRIBUTING.md for the expected contribution shape. A maintainer can grant PR access by commenting /lgtm on a pull request.

gemini-code-assist · 2026-06-02T07:27:26Z

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

Hmbown · 2026-06-03T15:50:49Z

Thanks @idling11 for writing up the provider fallback-chain design. This was not something to squeeze into the corrective v0.8.52 publish, but it is still a useful planning artifact for #2574 and belongs in the v0.8.53+ review sweep. Sorry it sat under mostly bot comments during the release cleanup.

Hmbown · 2026-06-05T02:06:32Z

Thanks @idling11 — the provider fallback chain design document was harvested into the public v0.9.0 integration branch (codex/v0.9.0-stewardship) as 5dc1a63cd under docs/rfcs/2574-provider-fallback-chain.md.

The original PR head had no remaining net file changes when reviewed, so this was taken as a manual RFC harvest rather than a direct merge. The commit includes Harvested from PR #2581 by @idling11 and your GitHub-mappable co-author trailer.

Closing this PR as harvested. Issue #2574 remains open for implementation work beyond the design document.

Update the execution map after closing harvested or superseded PRs #2476, #2498, #2502, #2513, #2530, #2576, #2581, #2636, #2639, #2640, #2708, and #2730, and refresh the live PR count.

idling11 added 2 commits June 2, 2026 14:58

docs: provider fallback chain design (Hmbown#2574)

c768e8d

docs: provider fallback chain design, 3-phase plan (Hmbown#2574)

912aa7f

idling11 mentioned this pull request Jun 2, 2026

Feature Request: Provider fallback chain — auto-switch on API failure #2574

Open

Hmbown closed this Jun 5, 2026

idling11 wants to merge 2 commits into Hmbown:main from idling11:feat/provider-fallback-chain