fix(provider): surface provider failures behind ACP errors#653
Conversation
980725b to
f870624
Compare
|
Retargeted to release/v0.6.0 — NET-NEW (release has only the 401/403 auth path on the ACP boundary; #682 is orthogonal — it captures silent zero-token rollouts, never the |
|
Reimplemented on |
Re-anchored onto release/v0.6.0 (rollout/ package split + the api_error taxonomy that release added after this PR's original base). When a provider rate-limit (429) or outage (503) is wrapped by an ACP agent as a generic `AgentProtocolError -32603`, surface a sanitized marker (status code only — never response bodies/headers, benchflow-ai#546/benchflow-ai#564) so retry classification can act on it instead of burning retries on an opaque acp_error: - litellm_logging: preserve 429/503 in the captured trajectory (release drops them to 500). This also sharpens release's silent-path `_api_error_subcategory`, which now sees the real 429/503. - rollout/_usage: `ProviderFailure` scan over the trajectory; `_provider_auth_status_from_runtime` kept as an auth-only back-compat view. - rollout: snapshot the failure during cleanup; `_classify_acp_error` appends the sanitized suffix; `_provider_failure()` falls back to the auth cache. - scoring/evaluation: 429 -> provider_rate_limit (non-retryable, added to RetryConfig.exclude_categories); 503 -> infra_failure (retryable). A 429 at the ACP boundary is non-retryable because that is how Bedrock daily token caps manifest (the SDK raises -> -32603), and retrying a daily cap in-batch is futile. A 429 that surfaces *silently* (zero tokens, no raise) keeps release's `rate_limit/transient = retryable` verdict — a self-surfacing throttle self-heals on backoff. The two paths track genuinely different failure shapes, so the differing retry verdicts are intentional.
f870624 to
30fa159
Compare
|
Re-anchored onto
Design note on the 429 verdict (the one non-mechanical part). Release already classifies a silent 429 as Verification: |
Summary
Why
High-concurrency OpenHands runs were reporting
ACP error -32603: Internal error, but the captured LLM trajectory showed Bedrock429 Too Many Requests/ daily-token-cap failures. The generic ACP wrapper caused dashboards and retry loops to misread provider quota exhaustion as ACP instability.Tests
uv run python -m pytest tests/test_provider_auth_detection.py tests/test_scoring.py tests/test_job.py tests/test_eval_worker_retry.py -quv run ruff check src/benchflow/providers/litellm_logging.py src/benchflow/rollout.py src/benchflow/_utils/scoring.py src/benchflow/evaluation.py tests/test_provider_auth_detection.py tests/test_scoring.py tests/test_job.py tests/test_eval_worker_retry.pyuv run ty check src/benchflow/providers/litellm_logging.py src/benchflow/rollout.py src/benchflow/_utils/scoring.py src/benchflow/evaluation.py