Skip to content

fix(provider): surface 429/503 provider failures behind ACP errors (reimpl of #653)#706

Closed
xdotli wants to merge 1 commit into
release/v0.6.0from
bry/acp-provider-failure-classification-v06
Closed

fix(provider): surface 429/503 provider failures behind ACP errors (reimpl of #653)#706
xdotli wants to merge 1 commit into
release/v0.6.0from
bry/acp-provider-failure-classification-v06

Conversation

@xdotli

@xdotli xdotli commented Jun 12, 2026

Copy link
Copy Markdown
Member

Reimplements #653 (by @bingran-you) on release/v0.6.0. The original is a fork PR that patched the pre-split monolithic rollout.py; this lands the same fix mapped onto the v0.6 rollout/ package layout, with all of #653's tests.

What

Extends the existing 401/403 provider-auth detection (#546/#564) to also classify provider rate-limit (429) and unavailable (503) failures that ACP agents wrap as a generic -32603 Internal error. The real status lives only in the proxy-captured trajectory, so we surface a sanitized marker (status code only — never response bodies or headers) that retry classification can act on.

Status Marker Category Retry?
401/403 provider auth failed provider_auth no (unchanged)
429 provider rate limited provider_rate_limit (new) no — daily caps won't recover in-window
503 provider unavailable infra_failure yes — transient

Changes

  • _utils/scoring.py — add PROVIDER_RATE_LIMIT; 429-markers → provider_rate_limit, 503-markers → infra_failure.
  • evaluation.pyRetryConfig.exclude_categories default adds PROVIDER_RATE_LIMIT.
  • providers/litellm_logging.py — generalize _provider_auth_status extraction to 401/403/429/503 with rate-limit/unavailable hint regexes; keep a backward-compatible auth-only shim (_provider_auth_status_from_failure_record).
  • rollout/_usage.py + rollout/__init__.py — add ProviderFailure + _provider_failure_from_runtime/_from_status; _provider_auth_status_from_runtime becomes an auth-only view; snapshot _provider_failure_cached in cleanup(); _classify_acp_error appends provider_failure.error_suffix.

Why it doesn't overlap with #682

#682's _maybe_classify_api_error runs only when self._error is None (silent zero-token failures). When a 429 surfaces as an ACP error, self._error is already set, so #682's path returns early — the failure stays ACP_ERROR (retryable) today. This PR operates on that disjoint ACP-error path. One known seam: an ACP-wrapped 429 is treated as non-retryable here, while #682's silent-failure 429 is rate_limit/transient; that's intentional (an agent that gave up with a daily-cap 429 should not burn retries), but worth a maintainer eye.

Verification

  • pytest tests/test_scoring.py tests/test_job.py tests/test_eval_worker_retry.py tests/test_provider_auth_detection.py110 passed (includes fix(provider): surface provider failures behind ACP errors #653's new daily-cap/503 cases).
  • Broader -k 'provider or litellm or retry or scoring or api_error or diagnostic'569 passed, 2 skipped.
  • ruff clean.

Closes #653 (supersedes the fork PR). Credit: @bingran-you.


Note

Medium Risk
Changes rollout error classification, retry defaults, and eval worker behavior for provider failures; low security risk (still status-only) but affects batch retry spend and error bucketing.

Overview
Extends provider failure detection beyond 401/403 so 429 (rate limit) and 503 (unavailable) hidden behind generic ACP -32603 Internal error get sanitized markers from the proxy-captured trajectory (status code only, no response bodies).

Rollout path: ProviderFailure and _provider_failure_from_runtime replace auth-only snapshots; cleanup() caches _provider_failure_cached and _classify_acp_error appends provider_failure.error_suffix (e.g. provider rate limited (HTTP 429)).

Classification & retries: New provider_rate_limit category; 503 markers map to infra_failure (still retryable). RetryConfig default excludes provider_rate_limit alongside provider_auth so daily caps do not burn retries.

LiteLLM import: Failure extraction generalized to 401/403/429/503 with hint regexes; trajectories get correct status codes for Bedrock-style daily-cap errors.

Reviewed by Cursor Bugbot for commit 77d6688. Bugbot is set up for automated code reviews on this repo. Configure here.

Extend the existing 401/403 provider-auth detection (#546/#564) to also
classify provider rate-limit (429) and unavailable (503) failures that ACP
agents wrap as a generic '-32603 Internal error'. The real status lives only
in the proxy-captured trajectory; surface a sanitized marker (status code
only, never response bodies/headers) so retry classification can act on it.

- scoring.py: add PROVIDER_RATE_LIMIT; 429-markers -> provider_rate_limit
  (non-retryable, daily caps won't recover in-window), 503-markers ->
  infra_failure (transient, stays retryable).
- evaluation.py: RetryConfig.exclude_categories defaults add
  PROVIDER_RATE_LIMIT.
- litellm_logging.py: generalize _provider_auth_status extraction to
  401/403/429/503 with rate-limit/unavailable hint regexes; keep a
  backward-compatible auth-only shim.
- rollout: add ProviderFailure + _provider_failure_from_runtime/_from_status
  (split across rollout/_usage.py and rollout/__init__.py for the v0.6
  package layout); _classify_acp_error appends the sanitized marker.

Operates on a disjoint path from #682's silent-zero-token api-error classifier
(which only runs when self._error is None), so no overlap.

Reimplements #653 (a fork PR that targeted the pre-split monolithic
rollout.py) on release/v0.6.0. Original author: bingran-you.

Co-authored-by: bingran-you <bingran@benchflow.ai>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 77d6688935

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +71 to +73
failure = _provider_failure_from_runtime(runtime)
if failure and failure.marker == "provider auth failed":
return failure.status

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve auth scan past later provider failures

When a captured trajectory contains a 401/403 followed by a 429 or 503, this backward-compatible auth helper now returns None because _provider_failure_from_runtime() stops at the latest provider failure even if it is non-auth. The previous behavior scanned newest-first but ignored non-auth statuses, so callers/tests using _provider_auth_status_from_runtime() and cleanup's _provider_auth_status_cached can miss a real auth failure in mixed-failure trajectories; keep this helper scanning until it finds 401/403 rather than delegating to the generalized latest-failure helper.

Useful? React with 👍 / 👎.

@bingran-you

Copy link
Copy Markdown
Collaborator

Closing — this and #653 are independent re-implementations of the same fix on release/v0.6.0, and #653's re-anchor merged first (e1ac5bb). Symbol-for-symbol the provider-failure work here (PROVIDER_RATE_LIMIT, ProviderFailure, _provider_failure_from_runtime/_from_status, 429→provider_rate_limit / 503→infra_failure, the litellm 401/403/429/503 generalization) is already on release via #653, so there's no net-new delta — the merged version just additionally drops the now-unreferenced _provider_auth_status_from_failure_record shim. Thanks @xdotli for the parallel implementation and the clean #682-disjointness write-up. 🙏

@xdotli

xdotli commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

Superseded — closing. release/v0.6.0 already landed this exact feature in commit 30fa159c ("fix(provider): surface 429/503 provider failures behind ACP errors"), in a more polished form: same ProviderFailure/_provider_failure_from_runtime design, same PROVIDER_RATE_LIMIT (429 non-retryable) / 503→infra_failure classification, same _classify_acp_error marker suffix, and the identical test suite (test_should_not_retry_provider_rate_limit_sanitized_marker, test_litellm_daily_cap_failure_import_drives_sanitized_provider_marker, test_litellm_503_failure_import_drives_sanitized_provider_marker, test_provider_rate_limit_marker, test_provider_unavailable_marker_is_infra). It also already carries the deliberate 429-retry-divergence-vs-#682 comment and the #546/#564 provenance notes.

This PR was a parallel reimplementation of @bingran-you's fork #653 on the v0.6 rollout/ package; that work is now in release via 30fa159, so no net-new value remains here. Credit to @bingran-you for the original design (#653).

@xdotli xdotli deleted the bry/acp-provider-failure-classification-v06 branch June 14, 2026 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants