feat(eval): Capture silent provider API errors as unhealthy results by Yiminnn · Pull Request #682 · benchflow-ai/benchflow

Yiminnn · 2026-06-11T23:21:50Z

Summary

Today, when a rollout's underlying LLM API calls fail (rate limit, rejected request, auth, quota, model-not-found, 5xx) but the agent harness swallows the failure and ends its turn normally, benchflow records reward=0.0 / error=None — a silent failure indistinguishable from a healthy model fail, silently dragging score denominators down. Motivating live fixture (#679 bring-up): an OpenCode-family agent rejects the proxy model alias with ProviderModelNotFoundError, issues zero provider requests, ends politely, verifier scores 0.0.

This PR catches and classifies these as unhealthy, rerun-able results — without ever interrupting the batch for isolated failures.

Design (4 decisions)

Result semantics — unhealthy: detection sets error + error_category (api_error / suspected_api_error), nulls the reward (slot excluded from score denominators), and emits structured diagnostics (api_error_info / suspected_api_error_info) through DIAGNOSTIC_REGISTRY — result.json, job summary, and check_results pick them up with zero extra wiring.
Two-layer detection in Rollout._build_result:
- Proxy-proven (api_error): new _provider_api_failure_summary_from_runtime scans all ≥400 statuses from the usage proxy's captured exchanges (previously only 401/403 were read), status-code-only per the Provider auth failures retry three Daytona sandboxes instead of failing fast #546/fix: classify provider auth failures as non-retryable #564 security posture. Fires when every captured request failed and the agent produced zero tokens. Subcategories: auth / quota / model_not_found / rate_limit / provider_error / rejected_request, each with a transient flag + fingerprint.
- Zero-signal (suspected_api_error): an executed-prompt rollout ending with 0 tokens AND 0 tool calls and no error — catches non-proxy runs and agents that fail before issuing any request. Setup/export-failure paths that never executed a prompt are exempt (Self-gen skill export failures are swallowed, allowing empty skill updates to look successful #389 channel separation preserved — caught by the existing test, gated accordingly).
Transient-only retry (RetryConfig.retry_on_api_error, default on): rate-limit/5xx retry under the existing backoff/max_retries; auth/quota/model-not-found/rejected and all suspected_api_error verdicts never auto-retry (permanent until a human fixes the key/model — retrying only burns sandbox-hours).
Same-fingerprint circuit breaker (ApiErrorCircuitBreaker): N consecutive permanent failures with the SAME fingerprint (dead key / wrong model id) stop launching new tasks; in-flight tasks finish; skipped tasks record an explicit breaker error; any healthy completion resets the streak. BENCHFLOW_API_ERROR_BREAKER_THRESHOLD (default 5, 0 disables). Isolated api_errors never interrupt the batch.

Evidence

New tests/test_api_error_capture.py: 27 cases across status mapping → failure summary → verdicts → diagnostics → retry policy → breaker
Full suite: 3371 passed, 0 failed (the one regression during development — Self-gen skill export failures are swallowed, allowing empty skill updates to look successful #389 export-failure channel separation — was caught by the existing suite and fixed by the executed-prompt gate)
ruff check / ruff format --check (405 files) / ty check: clean

How to test

uv run python -m pytest tests/test_api_error_capture.py -q
uv run python -m pytest tests/ -q
# Live repro: run any opencode-family agent against an unregistered proxy alias —
# the result now records error_category=suspected_api_error with reward=None
# instead of a fake healthy reward=0.0.

When a rollout's LLM API calls fail (rate limit, auth rejection, quota, model-not-found, 5xx) but the agent swallows the failure and ends its turn normally, benchflow recorded reward=0.0 / error=None — a silent failure indistinguishable from a healthy model fail, poisoning score denominators. (Motivating fixture: an OpenCode-family agent rejecting a proxy model alias via ProviderModelNotFoundError, issuing zero requests, verifier scores 0.0.) Two-layer post-rollout detection in Rollout._build_result, batch never interrupted by isolated failures: - Layer 1 (proxy-proven, "api_error"): _provider_api_failure_summary_ from_runtime scans ALL >=400 statuses from the usage proxy's captured exchanges (previously only 401/403 were read) — status-code-only, same #546/#564 security posture. Fires when every captured request failed and the agent produced zero tokens; classifies auth / quota / model_not_found / rate_limit / provider_error / rejected_request with a transient flag and fingerprint. - Layer 2 (zero-signal, "suspected_api_error"): an executed-prompt rollout ending with zero tokens AND zero tool calls and no error. Setup/export-failure paths that never executed a prompt are exempt (#389 channel separation preserved). Both verdicts null the reward (slot excluded from score denominators, rerun-able) and emit structured diagnostics (api_error_info / suspected_api_error_info) through DIAGNOSTIC_REGISTRY, so result.json, the job summary, and check_results pick them up with no extra wiring. Retry policy (RetryConfig.retry_on_api_error, default on): transient api_errors (rate limit, 5xx) retry with the existing backoff; permanent ones (auth, quota, model-not-found, rejected request) and suspected_api_error never auto-retry. Circuit breaker (ApiErrorCircuitBreaker): N consecutive permanent failures with the SAME fingerprint (classic dead key / wrong model id) stop launching new tasks — in-flight tasks finish, skipped tasks record an explicit breaker error, any healthy completion resets the streak. BENCHFLOW_API_ERROR_BREAKER_THRESHOLD (default 5, 0 disables). Tests: tests/test_api_error_capture.py (27 cases: status mapping, failure summary, verdicts, diagnostics registry, retry policy, breaker). Full suite 3371 passed / 0 failed; ruff + ty clean.

xdotli merged commit 5f6ca8b into release/v0.6.0 Jun 12, 2026
1 check passed

xdotli deleted the feat/api-error-capture branch June 12, 2026 03:38

This was referenced Jun 12, 2026

fix(provider): surface provider failures behind ACP errors #653

Merged

fix(provider): surface 429/503 provider failures behind ACP errors (reimpl of #653) #706

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eval): Capture silent provider API errors as unhealthy results#682

feat(eval): Capture silent provider API errors as unhealthy results#682
xdotli merged 1 commit into
release/v0.6.0from
feat/api-error-capture

Yiminnn commented Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Yiminnn commented Jun 11, 2026

Summary

Design (4 decisions)

Evidence

How to test

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants