Skip to content

feat(eval): Capture silent provider API errors as unhealthy results#682

Merged
xdotli merged 1 commit into
release/v0.6.0from
feat/api-error-capture
Jun 12, 2026
Merged

feat(eval): Capture silent provider API errors as unhealthy results#682
xdotli merged 1 commit into
release/v0.6.0from
feat/api-error-capture

Conversation

@Yiminnn

@Yiminnn Yiminnn commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Summary

Today, when a rollout's underlying LLM API calls fail (rate limit, rejected request, auth, quota, model-not-found, 5xx) but the agent harness swallows the failure and ends its turn normally, benchflow records reward=0.0 / error=None — a silent failure indistinguishable from a healthy model fail, silently dragging score denominators down. Motivating live fixture (#679 bring-up): an OpenCode-family agent rejects the proxy model alias with ProviderModelNotFoundError, issues zero provider requests, ends politely, verifier scores 0.0.

This PR catches and classifies these as unhealthy, rerun-able results — without ever interrupting the batch for isolated failures.

Design (4 decisions)

  1. Result semantics — unhealthy: detection sets error + error_category (api_error / suspected_api_error), nulls the reward (slot excluded from score denominators), and emits structured diagnostics (api_error_info / suspected_api_error_info) through DIAGNOSTIC_REGISTRY — result.json, job summary, and check_results pick them up with zero extra wiring.
  2. Two-layer detection in Rollout._build_result:
  3. Transient-only retry (RetryConfig.retry_on_api_error, default on): rate-limit/5xx retry under the existing backoff/max_retries; auth/quota/model-not-found/rejected and all suspected_api_error verdicts never auto-retry (permanent until a human fixes the key/model — retrying only burns sandbox-hours).
  4. Same-fingerprint circuit breaker (ApiErrorCircuitBreaker): N consecutive permanent failures with the SAME fingerprint (dead key / wrong model id) stop launching new tasks; in-flight tasks finish; skipped tasks record an explicit breaker error; any healthy completion resets the streak. BENCHFLOW_API_ERROR_BREAKER_THRESHOLD (default 5, 0 disables). Isolated api_errors never interrupt the batch.

Evidence

  • New tests/test_api_error_capture.py: 27 cases across status mapping → failure summary → verdicts → diagnostics → retry policy → breaker
  • Full suite: 3371 passed, 0 failed (the one regression during development — Self-gen skill export failures are swallowed, allowing empty skill updates to look successful #389 export-failure channel separation — was caught by the existing suite and fixed by the executed-prompt gate)
  • ruff check / ruff format --check (405 files) / ty check: clean

How to test

uv run python -m pytest tests/test_api_error_capture.py -q
uv run python -m pytest tests/ -q
# Live repro: run any opencode-family agent against an unregistered proxy alias —
# the result now records error_category=suspected_api_error with reward=None
# instead of a fake healthy reward=0.0.

When a rollout's LLM API calls fail (rate limit, auth rejection, quota,
model-not-found, 5xx) but the agent swallows the failure and ends its
turn normally, benchflow recorded reward=0.0 / error=None — a silent
failure indistinguishable from a healthy model fail, poisoning score
denominators. (Motivating fixture: an OpenCode-family agent rejecting a
proxy model alias via ProviderModelNotFoundError, issuing zero requests,
verifier scores 0.0.)

Two-layer post-rollout detection in Rollout._build_result, batch never
interrupted by isolated failures:

- Layer 1 (proxy-proven, "api_error"): _provider_api_failure_summary_
  from_runtime scans ALL >=400 statuses from the usage proxy's captured
  exchanges (previously only 401/403 were read) — status-code-only,
  same #546/#564 security posture. Fires when every captured request
  failed and the agent produced zero tokens; classifies auth / quota /
  model_not_found / rate_limit / provider_error / rejected_request with
  a transient flag and fingerprint.
- Layer 2 (zero-signal, "suspected_api_error"): an executed-prompt
  rollout ending with zero tokens AND zero tool calls and no error.
  Setup/export-failure paths that never executed a prompt are exempt
  (#389 channel separation preserved).

Both verdicts null the reward (slot excluded from score denominators,
rerun-able) and emit structured diagnostics (api_error_info /
suspected_api_error_info) through DIAGNOSTIC_REGISTRY, so result.json,
the job summary, and check_results pick them up with no extra wiring.

Retry policy (RetryConfig.retry_on_api_error, default on): transient
api_errors (rate limit, 5xx) retry with the existing backoff;
permanent ones (auth, quota, model-not-found, rejected request) and
suspected_api_error never auto-retry.

Circuit breaker (ApiErrorCircuitBreaker): N consecutive permanent
failures with the SAME fingerprint (classic dead key / wrong model id)
stop launching new tasks — in-flight tasks finish, skipped tasks record
an explicit breaker error, any healthy completion resets the streak.
BENCHFLOW_API_ERROR_BREAKER_THRESHOLD (default 5, 0 disables).

Tests: tests/test_api_error_capture.py (27 cases: status mapping,
failure summary, verdicts, diagnostics registry, retry policy, breaker).
Full suite 3371 passed / 0 failed; ruff + ty clean.
@xdotli xdotli merged commit 5f6ca8b into release/v0.6.0 Jun 12, 2026
1 check passed
@xdotli xdotli deleted the feat/api-error-capture branch June 12, 2026 03:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants