feat(eval): Capture silent provider API errors as unhealthy results#682
Merged
Conversation
When a rollout's LLM API calls fail (rate limit, auth rejection, quota, model-not-found, 5xx) but the agent swallows the failure and ends its turn normally, benchflow recorded reward=0.0 / error=None — a silent failure indistinguishable from a healthy model fail, poisoning score denominators. (Motivating fixture: an OpenCode-family agent rejecting a proxy model alias via ProviderModelNotFoundError, issuing zero requests, verifier scores 0.0.) Two-layer post-rollout detection in Rollout._build_result, batch never interrupted by isolated failures: - Layer 1 (proxy-proven, "api_error"): _provider_api_failure_summary_ from_runtime scans ALL >=400 statuses from the usage proxy's captured exchanges (previously only 401/403 were read) — status-code-only, same #546/#564 security posture. Fires when every captured request failed and the agent produced zero tokens; classifies auth / quota / model_not_found / rate_limit / provider_error / rejected_request with a transient flag and fingerprint. - Layer 2 (zero-signal, "suspected_api_error"): an executed-prompt rollout ending with zero tokens AND zero tool calls and no error. Setup/export-failure paths that never executed a prompt are exempt (#389 channel separation preserved). Both verdicts null the reward (slot excluded from score denominators, rerun-able) and emit structured diagnostics (api_error_info / suspected_api_error_info) through DIAGNOSTIC_REGISTRY, so result.json, the job summary, and check_results pick them up with no extra wiring. Retry policy (RetryConfig.retry_on_api_error, default on): transient api_errors (rate limit, 5xx) retry with the existing backoff; permanent ones (auth, quota, model-not-found, rejected request) and suspected_api_error never auto-retry. Circuit breaker (ApiErrorCircuitBreaker): N consecutive permanent failures with the SAME fingerprint (classic dead key / wrong model id) stop launching new tasks — in-flight tasks finish, skipped tasks record an explicit breaker error, any healthy completion resets the streak. BENCHFLOW_API_ERROR_BREAKER_THRESHOLD (default 5, 0 disables). Tests: tests/test_api_error_capture.py (27 cases: status mapping, failure summary, verdicts, diagnostics registry, retry policy, breaker). Full suite 3371 passed / 0 failed; ruff + ty clean.
This was referenced Jun 12, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Today, when a rollout's underlying LLM API calls fail (rate limit, rejected request, auth, quota, model-not-found, 5xx) but the agent harness swallows the failure and ends its turn normally, benchflow records
reward=0.0/error=None— a silent failure indistinguishable from a healthy model fail, silently dragging score denominators down. Motivating live fixture (#679 bring-up): an OpenCode-family agent rejects the proxy model alias withProviderModelNotFoundError, issues zero provider requests, ends politely, verifier scores 0.0.This PR catches and classifies these as unhealthy, rerun-able results — without ever interrupting the batch for isolated failures.
Design (4 decisions)
error+error_category(api_error/suspected_api_error), nulls the reward (slot excluded from score denominators), and emits structured diagnostics (api_error_info/suspected_api_error_info) throughDIAGNOSTIC_REGISTRY— result.json, job summary, and check_results pick them up with zero extra wiring.Rollout._build_result:api_error): new_provider_api_failure_summary_from_runtimescans all ≥400 statuses from the usage proxy's captured exchanges (previously only 401/403 were read), status-code-only per the Provider auth failures retry three Daytona sandboxes instead of failing fast #546/fix: classify provider auth failures as non-retryable #564 security posture. Fires when every captured request failed and the agent produced zero tokens. Subcategories:auth/quota/model_not_found/rate_limit/provider_error/rejected_request, each with a transient flag + fingerprint.suspected_api_error): an executed-prompt rollout ending with 0 tokens AND 0 tool calls and no error — catches non-proxy runs and agents that fail before issuing any request. Setup/export-failure paths that never executed a prompt are exempt (Self-gen skill export failures are swallowed, allowing empty skill updates to look successful #389 channel separation preserved — caught by the existing test, gated accordingly).RetryConfig.retry_on_api_error, default on): rate-limit/5xx retry under the existing backoff/max_retries; auth/quota/model-not-found/rejected and allsuspected_api_errorverdicts never auto-retry (permanent until a human fixes the key/model — retrying only burns sandbox-hours).ApiErrorCircuitBreaker): N consecutive permanent failures with the SAME fingerprint (dead key / wrong model id) stop launching new tasks; in-flight tasks finish; skipped tasks record an explicit breaker error; any healthy completion resets the streak.BENCHFLOW_API_ERROR_BREAKER_THRESHOLD(default 5,0disables). Isolated api_errors never interrupt the batch.Evidence
tests/test_api_error_capture.py: 27 cases across status mapping → failure summary → verdicts → diagnostics → retry policy → breakerruff check/ruff format --check(405 files) /ty check: cleanHow to test