Skip to content

fix(provider): surface provider failures behind ACP errors#653

Merged
bingran-you merged 1 commit into
benchflow-ai:release/v0.6.0from
bingran-you:bry/acp-provider-failure-classification
Jun 12, 2026
Merged

fix(provider): surface provider failures behind ACP errors#653
bingran-you merged 1 commit into
benchflow-ai:release/v0.6.0from
bingran-you:bry/acp-provider-failure-classification

Conversation

@bingran-you

Copy link
Copy Markdown
Collaborator

Summary

  • preserve provider 429/503 status codes when importing LiteLLM failure callbacks
  • surface provider-owned failures hidden behind ACP -32603 as sanitized result.error markers
  • classify provider_rate_limit separately and exclude it from immediate retries by default, while keeping provider 503 retryable as infra

Why

High-concurrency OpenHands runs were reporting ACP error -32603: Internal error, but the captured LLM trajectory showed Bedrock 429 Too Many Requests / daily-token-cap failures. The generic ACP wrapper caused dashboards and retry loops to misread provider quota exhaustion as ACP instability.

Tests

  • uv run python -m pytest tests/test_provider_auth_detection.py tests/test_scoring.py tests/test_job.py tests/test_eval_worker_retry.py -q
  • uv run ruff check src/benchflow/providers/litellm_logging.py src/benchflow/rollout.py src/benchflow/_utils/scoring.py src/benchflow/evaluation.py tests/test_provider_auth_detection.py tests/test_scoring.py tests/test_job.py tests/test_eval_worker_retry.py
  • uv run ty check src/benchflow/providers/litellm_logging.py src/benchflow/rollout.py src/benchflow/_utils/scoring.py src/benchflow/evaluation.py

@bingran-you bingran-you force-pushed the bry/acp-provider-failure-classification branch 2 times, most recently from 980725b to f870624 Compare June 9, 2026 22:30
@xdotli xdotli changed the base branch from main to release/v0.6.0 June 12, 2026 16:46
@xdotli

xdotli commented Jun 12, 2026

Copy link
Copy Markdown
Member

Retargeted to release/v0.6.0 — NET-NEW (release has only the 401/403 auth path on the ACP boundary; #682 is orthogonal — it captures silent zero-token rollouts, never the AgentProtocolError -32603 boundary your 429/503 surfacing targets). Handing back to you because it's a non-mechanical re-anchor: #682 refactored the surrounding code (the rollout.pyrollout/ package split and the litellm_logging.py helper renames like _coerce_auth_status/_explicit_auth_status), and release kept the simpler auth-only helpers — it does NOT have the ProviderFailure dataclass your diff assumes. Net-new pieces to re-anchor onto release: (1) preserve 429/503 in litellm_logging.py (release still drops them to 500); (2) the ProviderFailure/_provider_failure_from_runtime mapping + _classify_acp_error 429/503 marker; (3) PROVIDER_RATE_LIMIT category + markers in _utils/scoring.py (429→rate_limit, 503→infra_failure); (4) add PROVIDER_RATE_LIMIT to RetryConfig.exclude_categories in evaluation.py.

@xdotli

xdotli commented Jun 12, 2026

Copy link
Copy Markdown
Member

Reimplemented on release/v0.6.0 as #706 (this PR targets the pre-split monolithic rollout.py; #706 maps the same fix onto the v0.6 rollout/ package and carries all your tests). Credited you as co-author. Recommend closing this in favor of #706 once that merges — leaving the close to you/@bingran-you.

Re-anchored onto release/v0.6.0 (rollout/ package split + the api_error
taxonomy that release added after this PR's original base).

When a provider rate-limit (429) or outage (503) is wrapped by an ACP agent
as a generic `AgentProtocolError -32603`, surface a sanitized marker (status
code only — never response bodies/headers, benchflow-ai#546/benchflow-ai#564) so retry classification
can act on it instead of burning retries on an opaque acp_error:

- litellm_logging: preserve 429/503 in the captured trajectory (release drops
  them to 500). This also sharpens release's silent-path
  `_api_error_subcategory`, which now sees the real 429/503.
- rollout/_usage: `ProviderFailure` scan over the trajectory;
  `_provider_auth_status_from_runtime` kept as an auth-only back-compat view.
- rollout: snapshot the failure during cleanup; `_classify_acp_error` appends
  the sanitized suffix; `_provider_failure()` falls back to the auth cache.
- scoring/evaluation: 429 -> provider_rate_limit (non-retryable, added to
  RetryConfig.exclude_categories); 503 -> infra_failure (retryable).

A 429 at the ACP boundary is non-retryable because that is how Bedrock daily
token caps manifest (the SDK raises -> -32603), and retrying a daily cap
in-batch is futile. A 429 that surfaces *silently* (zero tokens, no raise)
keeps release's `rate_limit/transient = retryable` verdict — a self-surfacing
throttle self-heals on backoff. The two paths track genuinely different
failure shapes, so the differing retry verdicts are intentional.
@bingran-you bingran-you force-pushed the bry/acp-provider-failure-classification branch from f870624 to 30fa159 Compare June 12, 2026 19:25
@bingran-you

Copy link
Copy Markdown
Collaborator Author

Re-anchored onto release/v0.6.0 — now MERGEABLE (was CONFLICTING), one commit, +315/−47. Rebuilt against release's rollout/ package split and its post-base api_error taxonomy. All four net-new pieces ported:

  1. litellm_logging preserves 429/503 in the captured trajectory (release drops them to 500). Bonus: this also sharpens release's own silent-failure path — _provider_api_failure_summary_from_runtime / _api_error_subcategory now see the real 429/503 instead of a coerced 500.
  2. rollout/_usage gains the ProviderFailure scan over the trajectory; _provider_auth_status_from_runtime stays as an auth-only back-compat view (the #564 callers + tests still import it). ProviderFailure / _provider_failure_from_runtime / _provider_failure_from_status are re-exported from benchflow.rollout.
  3. rollout/__init__ snapshots _provider_failure_cached during cleanup(), derives the auth-only cache from it, and _classify_acp_error appends provider_failure.error_suffix. _provider_failure() falls back to the auth cache for the partial-Rollout test doubles.
  4. scoring/evaluation: 429 → provider_rate_limit (non-retryable, added to RetryConfig.exclude_categories); 503 → infra_failure (retryable).

Design note on the 429 verdict (the one non-mechanical part). Release already classifies a silent 429 as rate_limit / transient → retryable. This PR makes an ACP-boundary 429 non-retryable. That's intentional, not a contradiction: the two paths handle different failure shapes — a Bedrock daily token cap is raised by the SDK and lands here as -32603 (futile to retry in-batch), whereas a 429 that surfaces silently (zero tokens, no raise) goes through _maybe_classify_api_error and keeps release's transient/retryable verdict (a self-surfacing throttle self-heals on backoff). The rationale is documented inline in scoring.py. I considered a subcategory-aware split (daily-cap vs throttle) but it needs a new sanitized field on the LLMResponse trajectory schema — the #546/#564 posture forbids the runtime scan from reading response bodies — so it's deferred as a follow-up rather than landed on the release branch.

Verification: ruff + ty clean; targeted suites green (test_provider_auth_detection, test_scoring, test_job, test_eval_worker_retry, test_api_error_capture, test_litellm_logging) + full sweep 3807 passed. Dropped one now-unreferenced back-compat wrapper (_provider_auth_status_from_failure_record) to avoid dead code. Thanks @xdotli for the re-anchor map.

@bingran-you bingran-you merged commit e1ac5bb into benchflow-ai:release/v0.6.0 Jun 12, 2026
@bingran-you bingran-you deleted the bry/acp-provider-failure-classification branch June 12, 2026 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants