fix(proxy): limit provider circuit breaker failures by dofastted · Pull Request #1134 · ding113/claude-code-hub

dofastted · 2026-04-28T11:26:43Z

Summary

Refines provider circuit breaker logic to distinguish between infrastructure failures (connectivity issues, timeouts) and request-level errors (HTTP 4xx/5xx, fake-200, empty responses). Only infrastructure failures now affect provider health scores, preventing healthy providers from being incorrectly penalized for transient or client-side issues.

Problem

Previously, the circuit breaker counted the following as provider failures:

All HTTP 4xx/5xx responses from upstream providers
Fake-200 errors (HTML error pages returned with HTTP 200)
Empty response errors
Stream abort errors

This caused several operational issues:

Providers returning legitimate HTTP errors (like 429 rate limits, 401 auth failures) would be marked as unhealthy and excluded from rotation
Client disconnections (499 errors) were polluting the circuit breaker metrics
Transient model-specific errors could take down an entire provider endpoint
The ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS default was false, meaning network errors weren't counted even when operators wanted strict health tracking

Related Issues:

Partially addresses FAKE_200_JSON_ERROR_MESSAGE_NON_EMPTY: Our servers are currently overloaded. Please try again later. #1005 - Fake 200 "service overloaded" errors were incorrectly counting against provider health, causing premature circuit trips
Related to "Why are most requests 499?" (为啥大部分请求都是499) #1083 - 499 client aborts were being conflated with provider errors in health metrics
Related to "Status 499 issue" (状态499问题) #985 - Status 499 confusion between client and upstream errors
Related to "Every call to OpenRouter's GPT returns 499, and the session has no context" (接入openrouter的gpt，调用的时候每次都报499，且会话无上下文) #1054 - 499 errors with OpenRouter context loss
Follow-up to fix(proxy): classify errors by rule category to preserve failover for service errors #1103 - Continues error classification refinement work
Related to feat(circuit-breaker): endpoint CB default-off + 524 decision chain audit #773 - Part of ongoing circuit breaker behavior improvements

Solution

Core Behavior Changes

1. HTTP 4xx/5xx responses no longer count as circuit breaker failures
   - These indicate the upstream was reachable but returned an error (auth failure, rate limit, model error)
   - The proxy still retries/fails over, but doesn't penalize provider health
   - Exception: 524 timeout errors (Cloudflare timeout) still count as infrastructure failures
2. Fake-200 responses (HTML error pages) are excluded from the circuit breaker
   - Previously recorded via recordFailure(); now only logged for debugging
   - Session binding is still cleared to avoid sticky sessions to problematic endpoints
3. Empty response errors are excluded
   - Empty responses may be request-specific; provider health shouldn't suffer
4. Stream abort errors are excluded
   - Client-side disconnections after the response has started don't indicate an unhealthy provider
5. Network error config now defaults to true
   - ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS now defaults to true (previously false)
   - When enabled, only connectivity failures and 524 timeouts affect provider circuit state

Error Category Refinements

Updated ErrorCategory classification in errors.ts:

| Category | Circuit Breaker Impact | Examples |
| --- | --- | --- |
| `CLIENT_ABORT` | Never counted | User disconnect, request cancelled |
| `NON_RETRYABLE_CLIENT_ERROR` | Never counted | Prompt too large, content filter |
| `RESOURCE_NOT_FOUND` (404) | Never counted | Model doesn't exist on this provider |
| `PROVIDER_ERROR` (4xx/5xx) | No longer counted | Rate limit, auth failure, bad request |
| `SYSTEM_ERROR` | Conditionally counted | DNS failure, connection refused, 524 timeout |

Changes

Core Proxy Changes

src/app/v1/_lib/proxy/forwarder.ts - Only 524 status codes and network errors trigger recordFailure()
src/app/v1/_lib/proxy/response-handler.ts - Removed all recordFailure() calls for non-200/fake-200/stream errors
src/app/v1/_lib/proxy/errors.ts - Updated error category documentation
src/app/v1/_lib/proxy/session.ts - Updated chain reason comments

Configuration Changes

src/lib/config/env.schema.ts - ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS default: false → true
.env.example - Updated documentation for the new behavior
README.md / README.en.md - Clarified that only connectivity failures affect circuit state
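
A minimal sketch of what the default flip means when the variable is unset. The `parseBooleanEnv` and `readCircuitBreakerOnNetworkErrors` helpers are hypothetical; the project's `env.schema.ts` may well express this with a schema library instead, so treat this as an illustration of the behavior rather than the actual implementation.

```typescript
// Hypothetical helper: parse a boolean-like env value, falling back to a default.
function parseBooleanEnv(raw: string | undefined, defaultValue: boolean): boolean {
  if (raw === undefined || raw.trim() === "") return defaultValue;
  const normalized = raw.trim().toLowerCase();
  if (normalized === "true" || normalized === "1") return true;
  if (normalized === "false" || normalized === "0") return false;
  throw new Error(`Invalid boolean value: ${raw}`);
}

// After this PR the default is true; before it was false.
function readCircuitBreakerOnNetworkErrors(
  env: Record<string, string | undefined>,
): boolean {
  return parseBooleanEnv(env["ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS"], true);
}
```

Deployments that never set the variable pick up the new default on upgrade; operators who want the earlier network-error behavior would need to set it to `false` explicitly.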

Test Updates

tests/unit/proxy/client-abort-vs-upstream-499.test.ts - Updated expectations
tests/unit/proxy/proxy-forwarder-fake-200-html.test.ts - Removed recordFailure() assertions
tests/unit/proxy/response-handler-endpoint-circuit-isolation.test.ts - Verified no circuit recording
tests/unit/proxy/response-handler-non200.test.ts - Non-200 responses don't trigger failures

Breaking Changes

Behavior change for operators:

| Scenario | Before | After |
| --- | --- | --- |
| Provider returns HTTP 500 | Counts as failure, may trip circuit | Doesn't count; still retries/fails over |
| Provider returns HTTP 429 | Counts as failure, may trip circuit | Doesn't count; still retries/fails over |
| Fake-200 HTML error | Counts as failure | Doesn't count |
| Empty response | Counts as failure | Doesn't count |
| 524 Cloudflare timeout | Counts as failure | Still counts |
| Network/DNS error | Depends on `ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS` (default: false) | Counts if config true (default: true) |

Migration:

No code changes required
Monitor provider health metrics after deployment - providers that were incorrectly penalized may now show better health scores
If you prefer the old behavior (all 4xx/5xx count as failures), this PR does not provide a toggle - the old behavior was deemed too aggressive for production use

Testing

Automated Tests

bun run typecheck - pass
bun run test - All 4 updated test suites pass
- client-abort-vs-upstream-499.test.ts (16 tests)
- proxy-forwarder-fake-200-html.test.ts (7 tests)
- response-handler-endpoint-circuit-isolation.test.ts (12 tests)
- response-handler-non200.test.ts (14 tests)

Manual Verification

1. Configure a provider that returns HTTP 429
2. Send requests until the rate limit is triggered
3. Verify the provider is NOT marked unhealthy in the circuit breaker
4. Verify requests still fail over to an alternative provider
5. Verify provider_chain shows retry_failed but no circuit breaker penalties
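
The manual check above can also be approximated in isolation. The following is a toy simulation under stated assumptions: `ToyCircuitBreaker`, its threshold of 3, and `handleUpstreamStatus` are all hypothetical and do not reflect the project's real breaker or forwarder; the sketch only shows that repeated 429s cause failover without accumulating breaker failures, while 524s still do.

```typescript
// Toy in-memory breaker; not the project's implementation.
class ToyCircuitBreaker {
  private readonly failures = new Map<string, number>();
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  recordFailure(providerId: string): void {
    this.failures.set(providerId, (this.failures.get(providerId) ?? 0) + 1);
  }

  isOpen(providerId: string): boolean {
    return (this.failures.get(providerId) ?? 0) >= this.threshold;
  }
}

// Returns true when the proxy should fail over to another provider.
// Only a 524 is recorded against the provider, mirroring the PR's rule.
function handleUpstreamStatus(
  breaker: ToyCircuitBreaker,
  providerId: string,
  statusCode: number,
): boolean {
  if (statusCode < 400) return false;
  if (statusCode === 524) breaker.recordFailure(providerId);
  return true;
}

const breaker = new ToyCircuitBreaker(3);
let failovers = 0;
for (let i = 0; i < 10; i++) {
  if (handleUpstreamStatus(breaker, "provider-a", 429)) failovers++;
}
console.log(breaker.isOpen("provider-a"), failovers); // prints: false 10
```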

Checklist

Code follows project conventions (no emoji, named exports, i18n strings)
Self-review completed
Tests updated to reflect new behavior
Documentation updated (env var descriptions)
No breaking API changes
Target branch is dev

Description enhanced by Claude AI

Greptile Summary

This PR refines the provider circuit breaker to count only infrastructure failures (524 Cloudflare timeouts and network-layer errors when ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true) rather than all 4xx/5xx HTTP responses, and flips that config default from false to true. The core forwarder.ts and response-handler.ts changes are clean and internally consistent; the circuitFailureCount log values are corrected to match the new non-recording behavior.

Confidence Score: 5/5

Safe to merge; only P2 findings remain and they are pre-existing test quality issues.

All findings are P2 style/test concerns. The production logic change is straightforward and well-scoped. The meaningful test suites (forwarder fake-200 and endpoint circuit isolation) do exercise real code paths and correctly reflect the new behavior. The default-flip for ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is a documented intentional breaking change.

tests/unit/proxy/response-handler-non200.test.ts — the handleNonStream describe block tests are trivially passing and do not exercise the real handler.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/app/v1/_lib/proxy/forwarder.ts | Core change: sequential path now gates recordFailure on statusCode === 524 (previously all 4xx/5xx); empty-response path also stops calling recordFailure; circuitFailureCount log corrected accordingly. Concurrent path mirrors the 524-only guard but still lacks the shouldAccountCircuitBreaker check (pre-existing gap, already flagged). |
| src/app/v1/_lib/proxy/response-handler.ts | Removes all recordFailure calls for fake-200, non-200 passthrough, and stream-abort paths; replaces them with comment-only explanations. Logic and cleanup are consistent with the stated design intent. |
| src/lib/config/env.schema.ts | Default for ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS flipped from false to true; one-liner change, no schema-logic issues. |
| tests/unit/proxy/response-handler-non200.test.ts | handleNonStream describe block never invokes the real handler; tests trivially pass and provide no coverage of the actual response-handler.ts change. Other test files (endpoint-circuit-isolation) do exercise the real dispatch path and are meaningful. |
| tests/unit/proxy/response-handler-endpoint-circuit-isolation.test.ts | Test titles and expectations updated; all scenarios call ProxyResponseHandler.dispatch and meaningfully verify the new no-recordFailure behavior for fake-200 and non-200 paths. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Upstream error received] --> B{Error category?}
    B -->|CLIENT_ABORT| C[No recordFailure\nNo retry]
    B -->|NON_RETRYABLE_CLIENT_ERROR| D[No recordFailure\nNo retry]
    B -->|RESOURCE_NOT_FOUND 404| E[No recordFailure\nSwitch provider]
    B -->|PROVIDER_ERROR\n4xx/5xx + EmptyResponse| F{statusCode == 524?}
    F -->|Yes\nCloudflare timeout| G[recordFailure ✓\nSwitch provider]
    F -->|No\n429/500/502/etc.| H[No recordFailure\nRetry/switch provider]
    B -->|SYSTEM_ERROR\nNetwork/DNS| I{ENABLE_CIRCUIT_BREAKER\n_ON_NETWORK_ERRORS?}
    I -->|true\nnew default| J[recordFailure ✓\nRetry once]
    I -->|false| K[No recordFailure\nRetry once]
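
Read as code, the flowchart amounts to a small decision function. This is a sketch only: the `Decision` shape, the `NextStep` labels, and the `decide` name are hypothetical, and the category names are written as string literals rather than the project's `ErrorCategory` enum from `errors.ts`.

```typescript
type ErrorCategory =
  | "CLIENT_ABORT"
  | "NON_RETRYABLE_CLIENT_ERROR"
  | "RESOURCE_NOT_FOUND"
  | "PROVIDER_ERROR"
  | "SYSTEM_ERROR";

type NextStep = "no_retry" | "switch_provider" | "retry_or_switch" | "retry_once";

interface Decision {
  recordFailure: boolean; // whether the provider circuit breaker is charged
  next: NextStep; // what the proxy does with the request afterwards
}

function decide(
  category: ErrorCategory,
  statusCode: number | undefined,
  enableOnNetworkErrors: boolean, // stands in for ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS
): Decision {
  switch (category) {
    case "CLIENT_ABORT":
    case "NON_RETRYABLE_CLIENT_ERROR":
      return { recordFailure: false, next: "no_retry" };
    case "RESOURCE_NOT_FOUND":
      return { recordFailure: false, next: "switch_provider" };
    case "PROVIDER_ERROR":
      // Only the Cloudflare-style 524 timeout is treated as infrastructure failure.
      return statusCode === 524
        ? { recordFailure: true, next: "switch_provider" }
        : { recordFailure: false, next: "retry_or_switch" };
    case "SYSTEM_ERROR":
      return { recordFailure: enableOnNetworkErrors, next: "retry_once" };
  }
}
```

In this reading the 524 branch does not consult the network-error setting; the setting only governs the `SYSTEM_ERROR` branch.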

Prompt To Fix All With AI

This is a comment left during a code review.
Path: tests/unit/proxy/response-handler-non200.test.ts
Line: 270-368

Comment:
**Tests don't invoke the real handler — trivially pass**

The `handleNonStream with non-200 status code` describe block never calls `ProxyResponseHandler.dispatch` or any other production method. Each test manually runs `detectUpstreamErrorFromSseOrJsonText` and `session.addProviderToChain` inline, then asserts `expect(mockRecordFailure).not.toHaveBeenCalled()`. Because no production code is exercised, these assertions are trivially true regardless of whether `response-handler.ts` was changed or not. The companion tests in `response-handler-endpoint-circuit-isolation.test.ts` do call `ProxyResponseHandler.dispatch` and are meaningful; these four tests provide no additional coverage of the actual behavior change.


coderabbitai · 2026-04-28T11:26:57Z

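Walkthrough

Changes the default of ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS from false to true and restricts provider circuit breaker counting to timeout/connectivity failures (such as HTTP 524). Removes provider circuit breaker recording from the EmptyResponse, fake-200, and most non-2xx paths, so only timeout-style failures are counted.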

Changes

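| Cohort / File(s) | Summary |
| --- | --- |
| Configuration and examples `.env.example`, `src/lib/config/env.schema.ts` | Changes the default of `ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS` from `false` to `true`. |
| Documentation `README.en.md`, `README.md` | Updates the configuration description: the default is now `true`, and the description is narrowed so that only connection failures and timeouts affect the circuit breaker. |
| Error classification comments `src/app/v1/_lib/proxy/errors.ts` | Adjusts JSDoc/comments to clarify that EmptyResponse belongs to PROVIDER_ERROR and should not accumulate against the provider circuit breaker; whether SYSTEM_ERROR (transport/network) is counted is determined by configuration. |
| Forwarding logic `src/app/v1/_lib/proxy/forwarder.ts` | Removes circuit breaker increments for EmptyResponse and most PROVIDER_ERROR cases; increments `circuitFailureCount` and calls `recordFailure()` only when `statusCode === 524` (timeout-style) and configuration allows it. |
| Response handling and flow control `src/app/v1/_lib/proxy/response-handler.ts` | Removes dynamic calls to `recordFailure` from the SSE/fake-200, non-2xx stream, and several non-stream paths; failures are only recorded in the provider decision chain, and session binding updates are avoided. |
| Session comments `src/app/v1/_lib/proxy/session.ts` | Updates the documentation of `metadata.reason` in `ProviderSession.addProviderToChain` to explain the different circuit breaker counting semantics of `retry_failed` and `system_error` (the latter depends on configuration). |
| Tests (comment/assertion adjustments) `tests/unit/proxy/client-abort-vs-upstream-499.test.ts` | Updates the file comments to describe 499 as classified under PROVIDER_ERROR, triggering retry/fallback rather than the circuit breaker. |
| Tests (behavior assertion changes) `tests/unit/proxy/proxy-forwarder-fake-200-html.test.ts`, `tests/unit/proxy/response-handler-endpoint-circuit-isolation.test.ts`, `tests/unit/proxy/response-handler-non200.test.ts` | Adjusts unit test expectations: assertions that `recordFailure`/`mockRecordFailure` was called are removed or replaced with "not called" assertions; provider selection and provider-chain assertions are kept. |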

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

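fix(proxy): implement endpoint stickiness on retries #651: Modifies the SYSTEM_ERROR vs PROVIDER_ERROR retry/endpoint behavior in forwarder.ts; strongly related to this PR's changes to circuit breaker counting and retry semantics.
fix(proxy): detect SSE error block in HTTP 200 response and trigger retry #649: Has overlapping changes to error classification and SSE/empty-response handling (errors.ts, forwarder, response-handler); may conflict with or complement this PR's narrowing of circuit breaker counting.
fix: propagate upstream stream errors to prevent truncated responses marked as success #917: Adjusts detection and propagation of transport/stream errors (errors.ts, forwarder, response-handler) and touches whether transport failures are recorded against the circuit breaker, so it is highly related at the code level.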

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title clearly and accurately summarizes the main change: limiting what counts as provider circuit breaker failures, which is the core purpose of this PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description clearly explains the rationale, core behavior changes, affected files, test updates, and breaking changes related to provider circuit breaker logic refinement. |


gemini-code-assist

Code Review

This pull request refines the circuit breaker logic so that only connectivity failures and timeouts contribute to a provider's circuit breaker state, while application-level errors trigger fallbacks without penalizing overall health. It also updates the default configuration to enable circuit breakers on network errors. Feedback points out an inconsistency in the streaming hedge path regarding network error handling and endpoint policies, and recommends making failure recording non-blocking to improve performance.

gemini-code-assist · 2026-04-28T11:29:28Z

+      if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) {
        await recordFailure(attempt.provider.id, error);
      }


The circuit breaker logic in the streaming hedge path is inconsistent with the non-hedge path and the PR description. Connectivity failures (SYSTEM_ERROR) should also affect the provider circuit state when ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is enabled. Additionally, the allowCircuitBreakerAccounting endpoint policy should be respected. Note: In performance-sensitive paths like this, the recordFailure I/O operation should be fire-and-forget to avoid blocking the main logic.

const shouldAccountCircuitBreaker = session.getEndpointPolicy().allowCircuitBreakerAccounting;
const shouldRecordProviderCircuitFailure =
  (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) ||
  (errorCategory === ErrorCategory.SYSTEM_ERROR &&
    getEnvConfig().ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS);
if (shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker) {
  recordFailure(attempt.provider.id, error);
}

References

In performance-sensitive code paths like provider failover, non-critical I/O operations (e.g., releasing a session in Redis) should be executed as fire-and-forget tasks to avoid blocking the main logic.
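
One caveat with dropping `await` as suggested: if the un-awaited promise rejects, nothing handles the rejection. A possible fire-and-forget wrapper is sketched below, assuming `recordFailure` returns a Promise; the `RecordFailure` type, the `recordFailureInBackground` name, and the numeric provider id are hypothetical and only illustrate the pattern.

```typescript
// Hypothetical stand-in for the project's recordFailure(); assumed to be async.
type RecordFailure = (providerId: number, error: Error) => Promise<void>;

function recordFailureInBackground(
  recordFailure: RecordFailure,
  providerId: number,
  error: Error,
  onError: (err: unknown) => void = (err) => console.warn("recordFailure failed", err),
): void {
  // `void` marks the promise as intentionally not awaited; the catch keeps a
  // rejection from surfacing as an unhandled promise rejection.
  void recordFailure(providerId, error).catch(onError);
}
```

The caller continues immediately, so failover latency no longer depends on the breaker's storage round trip. The trade-off is that a failed write is only logged, not retried.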

greptile-apps · 2026-04-28T11:30:43Z

 # 使用场景：
 # - 默认关闭：适用于网络不稳定环境（如使用代理），避免因临时网络抖动触发熔断器
 # - 启用：适用于网络稳定环境，连续网络错误也应触发熔断保护，避免持续请求不可达的供应商
-ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=false
+ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true


Stale "默认关闭" comment after default change

The comment block still reads - 默认关闭：适用于网络不稳定环境… (disabled by default: for unstable networks), but the default was just flipped to true. Operators who set up a fresh deployment will read "disabled by default" and assume false is the initial state, leading to misconfigured circuit breakers.

Suggested change

 # 使用场景：
-# - 默认关闭：适用于网络不稳定环境（如使用代理），避免因临时网络抖动触发熔断器
-# - 启用：适用于网络稳定环境，连续网络错误也应触发熔断保护，避免持续请求不可达的供应商
+# - 关闭：适用于网络不稳定环境（如使用代理），避免因临时网络抖动触发熔断器
+# - 启用（默认）：适用于网络稳定环境，连续网络错误也应触发熔断保护，避免持续请求不可达的供应商
 ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true


greptile-apps · 2026-04-28T11:30:44Z

+      if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) {
        await recordFailure(attempt.provider.id, error);
      }


Concurrent mode skips shouldAccountCircuitBreaker guard

The sequential path gates its 524-failure recording on shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker (line ~2024), but the concurrent/hedging path at this line does not check shouldAccountCircuitBreaker (endpointPolicy.allowCircuitBreakerAccounting). This means a 524 error in a hedge attempt will always record a circuit-breaker failure, even for endpoints whose policy explicitly opts out of circuit-breaker accounting.

This asymmetry pre-dates this PR, but since this PR now narrows the trigger to only 524, the missing guard becomes more visible. Consider adding the same guard for consistency:

if (
  errorCategory === ErrorCategory.PROVIDER_ERROR &&
  statusCode === 524 &&
  shouldAccountCircuitBreaker
) {
  await recordFailure(attempt.provider.id, error);
}


greptile-apps · 2026-04-28T11:30:45Z

 | `SESSION_TTL`                              | `300`                    | Session cache window (seconds) that drives vendor reuse.                                             |
 | `ENABLE_SECURE_COOKIES`                    | `true`                   | Browsers require HTTPS for Secure cookies; set to `false` when serving plain HTTP outside localhost. |
-| `ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS` | `false`                  | When `true`, network errors also trip the circuit breaker for quicker isolation.                     |
+| `ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS` | `true`                   | When `true`, only connectivity failures and timeouts affect provider circuit breaker state.          |


Description ambiguous when variable is false

The new description "When true, only connectivity failures and timeouts affect provider circuit breaker state" implies that when false, other things also affect provider circuit state. In practice, HTTP 524 still triggers recordFailure regardless of this variable — it purely controls whether network-layer errors are additionally included.

Consider wording like: "When true, network-layer errors (DNS, connection refused, etc.) also trip the provider circuit breaker in addition to 524 timeouts."


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

.env.example (1)

90-97: ⚠️ Potential issue | 🟡 Minor

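The circuit breaker comment is inconsistent with the current behavior and could mislead operators.

Line 97 now defaults to true, but lines 92-96 still say "false is the default; true counts all errors". This conflicts with this PR's semantics that "only connectivity failures and timeout-style problems affect the provider circuit state".

Suggested update to the comment text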

 # 熔断器配置
-# 功能说明：控制网络错误是否计入熔断器失败计数
-# - false (默认)：网络错误（DNS 解析失败、连接超时、代理连接失败等）不计入熔断器，仅供应商错误（4xx/5xx HTTP 响应）计入
-# - true：所有错误（包括网络错误）都计入熔断器失败计数
+# 功能说明：控制“连通性失败/超时类错误”是否影响 provider 熔断状态
+# - true (默认)：连接失败、连接超时、超时风格错误（如 524）会计入 provider 熔断
+# - false：上述网络连通性错误不计入 provider 熔断（仍保留重试/切换逻辑）
 ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.env.example around lines 90 - 97, The comment for
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is inconsistent with the new default
and PR behavior; update the explanatory text so the default value shown (true)
matches the described behavior and clearly define what true vs false do given
the PR's change (specifically that, under the current implementation, only
connectivity failures/timeouts affect the provider circuit state or explain that
true/false now invert behavior as implemented). Edit the block mentioning "false
(默认)" / "true：所有错误计入熔断器失败计数" and rewrite it to explicitly state the current
default (true), what each setting does, and that only connectivity/time‑out
style network errors are considered for provider circuit status per this PR;
ensure ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is referenced so maintainers can
locate the setting.

tests/unit/proxy/response-handler-non200.test.ts (1)

269-367: ⚠️ Potential issue | 🟠 Major

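These non-200 cases do not go through the real handling path, so the current assertions cannot prove the behavior change.

In lines 270-367, the tests only run local branching and session.addProviderToChain(...); they never call the actual ProxyResponseHandler (or forwarder) logic, so the not.toHaveBeenCalled() assertions at lines 291/319/343/367 hold trivially and offer little regression protection.

Suggested change: call the real path (example)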

+import { ProxyResponseHandler } from "@/app/v1/_lib/proxy/response-handler";
...
   describe("handleNonStream with non-200 status code", () => {
     it("should not record provider circuit breaker failure for 500 status", async () => {
       const session = createSession({
         provider: mockProvider,
         messageContext: mockMessageContext,
       });
-
-      const statusCode = 500;
-      const responseText = '{"error":"internal error"}';
-
-      if (statusCode >= 400) {
-        const detected = detectUpstreamErrorFromSseOrJsonText(responseText);
-        const errorMessageForDb = detected.isError ? detected.code : `HTTP ${statusCode}`;
-
-        session.addProviderToChain(mockProvider, {
-          reason: "retry_failed",
-          attemptNumber: 1,
-          statusCode: statusCode,
-          errorMessage: errorMessageForDb,
-        });
-      }
+      const response = new Response('{"error":"internal error"}', {
+        status: 500,
+        headers: { "content-type": "application/json" },
+      });
+
+      await ProxyResponseHandler.dispatch(session, response);
+      await Promise.all(asyncTasks.splice(0, asyncTasks.length));
 
       expect(mockRecordFailure).not.toHaveBeenCalled();
+      expect(
+        session
+          .getProviderChain()
+          .some((item) => item.reason === "retry_failed" && item.statusCode === 500),
+      ).toBe(true);
     });
   });

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/unit/proxy/response-handler-non200.test.ts` around lines 269 - 367,
Tests are only exercising local branching and calling session.addProviderToChain
directly, so mockRecordFailure assertions are vacuous; instead invoke the real
handling path (e.g., ProxyResponseHandler.handleNonStream or the forwarder
method that processes an upstream Response) with a constructed Response-like
object containing status and body so the handler calls
detectUpstreamErrorFromSseOrJsonText and then
session.addProviderToChain/recordProviderFailure; update each test to call that
handler with the mocked provider and response (using the same statusCode and
responseText), assert mockRecordFailure called or not accordingly, and keep
references to detectUpstreamErrorFromSseOrJsonText, session.addProviderToChain,
ProxyResponseHandler.handleNonStream (or the forwarder method) and
mockRecordFailure to find the right code to change.

src/app/v1/_lib/proxy/response-handler.ts (2)

926-938: ⚠️ Potential issue | 🟡 Minor

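404 also needs to keep resource_not_found in these two non-stream branches.

With this PR removing provider breaker accounting here, the decision chain is now the only failure attribution on this path. But writing every >= 400 as retry_failed mixes 404 with ordinary upstream failures, which is inconsistent with this file's streaming branch and with the 404 special case in categorizeErrorAsync().

Suggested change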

             if (statusCode >= 400) {
               const detected = detectUpstreamErrorFromSseOrJsonText(responseText);
               errorMessageForFinalize = detected.isError ? detected.code : `HTTP ${statusCode}`;
+              const chainReason =
+                statusCode === 404 ? "resource_not_found" : "retry_failed";

               // 记录到决策链
               session.addProviderToChain(provider, {
-                reason: "retry_failed",
+                reason: chainReason,
                 attemptNumber: 1,
                 statusCode: statusCode,
                 errorMessage: errorMessageForFinalize,
               });
             }

         if (statusCode >= 400) {
           const detected = detectUpstreamErrorFromSseOrJsonText(responseText);
           const errorMessageForDb = detected.isError ? detected.code : `HTTP ${statusCode}`;
+          const chainReason =
+            statusCode === 404 ? "resource_not_found" : "retry_failed";

           // 记录到决策链
           session.addProviderToChain(provider, {
-            reason: "retry_failed",
+            reason: chainReason,
             attemptNumber: 1,
             statusCode: statusCode,
             errorMessage: errorMessageForDb,
           });
         }

Also applies to: 1305-1316
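
Since the same ternary appears in both suggested branches, it could be factored into one helper so the two call sites cannot drift apart. A minimal sketch follows; the `ChainReason` type and the `chainReasonForStatus` name are hypothetical, and only the two reason strings come from the suggestion above.

```typescript
type ChainReason = "resource_not_found" | "retry_failed";

// Map an upstream HTTP status (already known to be >= 400) to a chain reason.
function chainReasonForStatus(statusCode: number): ChainReason {
  return statusCode === 404 ? "resource_not_found" : "retry_failed";
}
```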

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/response-handler.ts` around lines 926 - 938, The
non-stream error handling currently logs every >=400 as reason "retry_failed",
which incorrectly merges 404s with other failures; update the
session.addProviderToChain call in response-handler.ts (the block using
detectUpstreamErrorFromSseOrJsonText, variables
provider/statusCode/detected/errorMessageForFinalize) so that if the upstream
represents a resource-not-found (statusCode === 404 or detected.isError &&
detected.code === 'resource_not_found') you set reason to "resource_not_found",
otherwise keep "retry_failed"; apply the same conditional fix to the other
non-stream branch that mirrors this logic (the block around lines 1305-1316).

594-608: ⚠️ Potential issue | 🟡 Minor

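A client abort should not be recorded as system_error here.

When clientAborted === true, this still always writes reason: "system_error". updateMessageRequestDetails() later persists the provider chain as-is, so a user-initiated cancellation gets misrecorded in audit/monitoring as an upstream failure.

Suggested change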

     session.addProviderToChain(providerForChain, {
       endpointId: meta.endpointId,
       endpointUrl: meta.endpointUrl,
-      reason: "system_error",
+      reason: clientAborted ? "client_abort" : "system_error",
       attemptNumber: meta.attemptNumber,
       statusCode: effectiveStatusCode,
       errorMessage: errorMessage ?? undefined,
     });

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/response-handler.ts` around lines 594 - 608, The code
currently records provider chain entries with reason: "system_error" even when
the client aborted; update the block that runs when !streamEndedNormally (around
streamEndedNormally, clearSessionBinding, and session.addProviderToChain) to
detect clientAborted (or equivalent flag) and set reason to "client_aborted" (or
a distinct non-system reason) instead of "system_error"; ensure
session.addProviderToChain receives the adjusted reason, leaving other fields
(endpointId, endpointUrl, attemptNumber, statusCode, errorMessage) unchanged so
updateMessageRequestDetails() persists the correct cause.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.en.md`:
- Line 347: The README's descriptions for
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS are inconsistent: one place says "only
connectivity failures and timeouts affect provider circuit breaker state" while
another still states "4xx/5xx or network errors will trigger"—pick the intended
behavior and make both descriptions match; update the other occurrence so the
text referencing ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS and any FAQ entry use
the same wording (either "only connectivity failures and timeouts" or "4xx/5xx
and network errors") and ensure the FAQ and the table entry convey the identical
policy.

In `@README.md`:
- Line 360: The README has inconsistent descriptions: the config table for
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS says only connection failures/timeouts
count toward provider circuit-breaking, but the FAQ still states "4xx/5xx or
network errors"—please reconcile by picking one authoritative rule and making
both places consistent; specifically update the FAQ text that mentions "4xx/5xx"
to match the table (or vice versa) so both state clearly whether 4xx/5xx
responses are included, and ensure the ENV name
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS and its explanatory Chinese wording
consistently describe the exact trigger set (e.g., "仅将连接失败、超时等网络不可达错误计入熔断" or "将
4xx/5xx 与网络错误一并计入熔断") across README.

---

Outside diff comments:
In @.env.example:
- Around line 90-97: The comment for ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is
inconsistent with the new default and PR behavior; update the explanatory text
so the default value shown (true) matches the described behavior and clearly
define what true vs false do given the PR's change (specifically that, under the
current implementation, only connectivity failures/timeouts affect the provider
circuit state or explain that true/false now invert behavior as implemented).
Edit the block mentioning "false (默认)" / "true：所有错误计入熔断器失败计数" and rewrite it to
explicitly state the current default (true), what each setting does, and that
only connectivity/time‑out style network errors are considered for provider
circuit status per this PR; ensure ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is
referenced so maintainers can locate the setting.

In `@src/app/v1/_lib/proxy/response-handler.ts`:
- Around line 926-938: The non-stream error handling currently logs every >=400
as reason "retry_failed", which incorrectly merges 404s with other failures;
update the session.addProviderToChain call in response-handler.ts (the block
using detectUpstreamErrorFromSseOrJsonText, variables
provider/statusCode/detected/errorMessageForFinalize) so that if the upstream
represents a resource-not-found (statusCode === 404 or detected.isError &&
detected.code === 'resource_not_found') you set reason to "resource_not_found",
otherwise keep "retry_failed"; apply the same conditional fix to the other
non-stream branch that mirrors this logic (the block around lines 1305-1316).
- Around line 594-608: The code currently records provider chain entries with
reason: "system_error" even when the client aborted; update the block that runs
when !streamEndedNormally (around streamEndedNormally, clearSessionBinding, and
session.addProviderToChain) to detect clientAborted (or equivalent flag) and set
reason to "client_aborted" (or a distinct non-system reason) instead of
"system_error"; ensure session.addProviderToChain receives the adjusted reason,
leaving other fields (endpointId, endpointUrl, attemptNumber, statusCode,
errorMessage) unchanged so updateMessageRequestDetails() persists the correct
cause.
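
The two chain-reason adjustments requested above for response-handler.ts can be sketched as small pure helpers. This is a minimal sketch, not the project's code: `DetectedError`, `ChainReason`, `nonStreamChainReason`, and `streamAbortChainReason` are hypothetical names, and the reason string "client_abort" is an assumption (the comment above also mentions "client_aborted").

```typescript
// Hypothetical shape of what detectUpstreamErrorFromSseOrJsonText returns.
interface DetectedError {
  isError: boolean;
  code?: string;
}

// Hypothetical union of the reason strings discussed in the review.
type ChainReason =
  | "retry_failed"
  | "resource_not_found"
  | "system_error"
  | "client_abort";

// Non-stream branch: keep 404-style failures distinct from other >= 400 failures.
function nonStreamChainReason(statusCode: number, detected: DetectedError): ChainReason {
  const notFound =
    statusCode === 404 ||
    (detected.isError && detected.code === "resource_not_found");
  return notFound ? "resource_not_found" : "retry_failed";
}

// Stream branch: a client disconnect should not be logged as a system error.
function streamAbortChainReason(clientAborted: boolean): ChainReason {
  return clientAborted ? "client_abort" : "system_error";
}
```

Keeping the decision in a pure function would let both non-stream branches share one rule instead of duplicating the conditional.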

In `@tests/unit/proxy/response-handler-non200.test.ts`:
- Around line 269-367: Tests are only exercising local branching and calling
session.addProviderToChain directly, so mockRecordFailure assertions are
vacuous; instead invoke the real handling path (e.g.,
ProxyResponseHandler.handleNonStream or the forwarder method that processes an
upstream Response) with a constructed Response-like object containing status and
body so the handler calls detectUpstreamErrorFromSseOrJsonText and then
session.addProviderToChain/recordProviderFailure; update each test to call that
handler with the mocked provider and response (using the same statusCode and
responseText), assert mockRecordFailure called or not accordingly, and keep
references to detectUpstreamErrorFromSseOrJsonText, session.addProviderToChain,
ProxyResponseHandler.handleNonStream (or the forwarder method) and
mockRecordFailure to find the right code to change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5ef302ec-9adf-483e-8862-d56cdbfef7b4

📥 Commits

Reviewing files that changed from the base of the PR and between 79cbbc5 and 01b8ace.

📒 Files selected for processing (12)

.env.example
README.en.md
README.md
src/app/v1/_lib/proxy/errors.ts
src/app/v1/_lib/proxy/forwarder.ts
src/app/v1/_lib/proxy/response-handler.ts
src/app/v1/_lib/proxy/session.ts
src/lib/config/env.schema.ts
tests/unit/proxy/client-abort-vs-upstream-499.test.ts
tests/unit/proxy/proxy-forwarder-fake-200-html.test.ts
tests/unit/proxy/response-handler-endpoint-circuit-isolation.test.ts
tests/unit/proxy/response-handler-non200.test.ts

github-actions

Code Review Summary

This PR changes the circuit breaker failure recording policy so that only HTTP 524 (timeout) errors affect provider circuit state, removing circuit breaker recording for upstream 4xx/5xx responses, fake-200 errors, empty responses, and stream aborts. The behavioral change is consistent across all code paths (forwarder main loop, hedge path, response handler). The test suite is updated to match, but has a coverage gap for the new 524-specific logic.
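
The policy described in this summary reduces to a single predicate. The following is a minimal sketch: the `UpstreamOutcome` union is a hypothetical illustration that does not appear in the PR, and only the `statusCode === 524` check is taken from the review text.

```typescript
// Hypothetical classification of what the proxy observed from the upstream.
type UpstreamOutcome =
  | { kind: "http_error"; statusCode: number }
  | { kind: "fake_200" }
  | { kind: "empty_response" }
  | { kind: "stream_abort" };

// Under the policy summarized above, only an HTTP 524 counts against the provider.
function shouldRecordProviderCircuitFailure(outcome: UpstreamOutcome): boolean {
  return outcome.kind === "http_error" && outcome.statusCode === 524;
}
```

Because this one condition is the only remaining gate, a positive test for 524 matters as much as the negative tests for everything else.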

PR Size: M

Lines changed: 258
Files changed: 12

Issues Found

Category	High	Medium
Logic/Bugs	0	0
Security	0	0
Error Handling	0	0
Types	0	0
Comments/Docs	0	1
Tests	1	0
Simplification	0	0

High Priority Issues (Should Fix)

[TEST-MISSING-CRITICAL] No test verifies that 524 status codes trigger recordFailure (src/app/v1/_lib/proxy/forwarder.ts:1890). The shouldRecordProviderCircuitFailure = statusCode === 524 conditional is the sole remaining path for provider errors to affect circuit breaker state. Existing tests only verify the negative case (non-524 does not record). A test for the positive case is needed to prevent silent regression. The hedge path (forwarder.ts ~line 3782) has the same gap.

Medium Priority Issues

[COMMENT-OUTDATED] .env.example lines 91-96 contain stale comments after this PR's changes. The comments state "默认关闭" (default off) and describe the old semantics where false means "only 4xx/5xx count" and true means "all errors count." After this PR, 4xx/5xx no longer count towards the circuit breaker regardless of this flag, making both descriptions inaccurate. The .env.example comments should be updated to match the new behavior.

Review Coverage

Automated review by Claude AI

github-actions · 2026-04-28T11:44:58Z

@@ -1892,6 +1887,7 @@ export class ProxyForwarder {
            const proxyError = lastError as ProxyError;


[High] [TEST-MISSING-CRITICAL] No test verifies that 524 status codes still trigger recordFailure

Why this is a problem: This PR introduces a critical behavioral gate -- only HTTP 524 (timeout) errors now trigger provider circuit breaker recording in the PROVIDER_ERROR path. The existing tests were updated to assert recordFailure.not.toHaveBeenCalled() for non-524 errors, but no test verifies the positive case: that a 524 error does call recordFailure. This is the sole remaining path for provider errors to affect circuit breaker state. If this conditional breaks, the circuit breaker becomes silently non-functional for all provider errors.

The hedge path (~line 3782, changed from statusCode !== 404 to statusCode === 524) has the same gap.

Per CLAUDE.md: "All new features must have unit test coverage of at least 80%"

Suggested fix -- add a test in tests/unit/proxy/ that:

// Verify 524 timeout still triggers circuit breaker recording
it("should record circuit breaker failure for 524 timeout", async () => {
  // Setup: mock provider that throws ProxyError with statusCode 524
  doForward.mockRejectedValueOnce(
    new ProxyError("provider timeout", 524, { /* ... */ })
  );
  await ProxyForwarder.send(session);
  expect(mocks.recordFailure).toHaveBeenCalledWith(
    expect.any(Number),
    expect.objectContaining({ message: expect.stringContaining("timeout") })
  );
});

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/app/v1/_lib/proxy/response-handler.ts (1)
594-610: ⚠️ Potential issue | 🟡 Minor

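Record client-initiated aborts separately as client_abort.

The current !streamEndedNormally branch writes system_error whether the client disconnected or the upstream/network failed. This mixes user-initiated cancellations with genuine provider failures and reduces the precision of later troubleshooting and statistics. addProviderToChain already supports client_abort, so the reason should be split on clientAborted.
Suggested fix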
   if (!streamEndedNormally) {
     await clearSessionBinding();

+    const chainReason = clientAborted ? "client_abort" : "system_error";
     session.addProviderToChain(providerForChain, {
       endpointId: meta.endpointId,
       endpointUrl: meta.endpointUrl,
-      reason: "system_error",
+      reason: chainReason,
       attemptNumber: meta.attemptNumber,
       statusCode: effectiveStatusCode,
       errorMessage: errorMessage ?? undefined,
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/response-handler.ts` around lines 594 - 610, The branch
handling !streamEndedNormally currently records every abort as "system_error";
change it to distinguish client-initiated disconnects by checking the
clientAborted flag (or equivalent boolean that indicates the client closed the
connection) before calling session.addProviderToChain; if clientAborted is true,
pass reason: "client_abort", otherwise keep reason: "system_error". Keep the
surrounding logic (await clearSessionBinding(), use providerForChain,
meta.attemptNumber, effectiveStatusCode, errorMessage, and return values)
unchanged.
tests/unit/proxy/response-handler-non200.test.ts (1)
269-391: ⚠️ Potential issue | 🟠 Major

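This group of tests does not actually cover response-handler.ts.

The current cases only call session.addProviderToChain(...) manually and never drive the non-200 branch of ProxyResponseHandler, so these assertions would keep passing even if the handling logic regressed. Route the tests through the real handler, then assert on the provider chain, errorMessage, and that recordFailure is not triggered.

If you would like, I can help rewrite this group of tests to go through the real handler path.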
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/proxy/response-handler-non200.test.ts` around lines 269 - 391, The
tests are directly calling session.addProviderToChain instead of exercising the
real handler; replace the manual adds with invoking the actual
ProxyResponseHandler method that processes non-200 responses (e.g.,
ProxyResponseHandler.handleNonStream or the method used in tests to handle
responses) using a mocked response with the target statusCode and responseText
so the handler populates the session provider chain and potentially calls
recordFailure; assert on session.getProviderChain(), the added entry's
errorMessage/reason, and that mockRecordFailure was or was not called
accordingly, removing the manual session.addProviderToChain calls.
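
To illustrate why calling session.addProviderToChain directly makes the recordFailure assertions vacuous, here is a framework-free sketch. `handleNonStream`, `Deps`, `makeDeps`, and the recording rule inside the handler are hypothetical stand-ins, not the real ProxyResponseHandler signature; the point is only that the assertion becomes meaningful when the code under test decides whether recordFailure runs.

```typescript
interface ChainEntry {
  statusCode: number;
  reason: string;
}

interface Deps {
  chain: ChainEntry[];
  recordFailure: (providerId: number) => void;
}

// Hypothetical handler under test: it, not the test, decides what to record.
function handleNonStream(providerId: number, statusCode: number, deps: Deps): void {
  if (statusCode < 400) return;
  deps.chain.push({
    statusCode,
    reason: statusCode === 404 ? "resource_not_found" : "retry_failed",
  });
  // Assumed policy for this sketch: only 524 counts against the provider.
  if (statusCode === 524) deps.recordFailure(providerId);
}

// Tiny hand-rolled spy so the example needs no test framework.
function makeDeps(): Deps & { recorded: number[] } {
  const recorded: number[] = [];
  return {
    chain: [],
    recorded,
    recordFailure: (id: number): void => {
      recorded.push(id);
    },
  };
}

// Drive the handler, then inspect what it did.
const deps = makeDeps();
handleNonStream(1, 429, deps); // chain gains an entry, nothing recorded
handleNonStream(1, 524, deps); // chain gains an entry, provider 1 recorded
```

If the handler regressed to recording every status of 400 or above, an assertion on `deps.recorded` after the 429 call would fail, which a test that populates the chain by hand cannot detect.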

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 1974-1975: The interim increment of circuitFailureCount (using
health.failureCount + (shouldRecordProviderCircuitFailure ? 1 : 0)) is applied
even when recordFailure() may not actually run, causing mismatch with the real
circuit state; change the logic so circuitFailureCount is only incremented under
the exact same conditions used when calling recordFailure() — i.e., evaluate
probe status, shouldAccountCircuitBreaker, retry exhaustion, and
shouldRecordProviderCircuitFailure together (the same predicates used around
recordFailure()) or simply defer computing/setting circuitFailureCount until
after recordFailure() actually executes; update references in the forwarder code
where circuitFailureCount, shouldRecordProviderCircuitFailure,
shouldAccountCircuitBreaker, probe, and recordFailure() are used so the
displayed count always matches real writes.
- Line 1819: Remove emoji characters from the inline comments in
src/app/v1/_lib/proxy/forwarder.ts — specifically replace the comment containing
"⭐ 6. 供应商错误处理（所有 4xx/5xx HTTP 错误 + 空响应错误，重试耗尽后切换）" and the comments at the
nearby block around lines 2015-2016 with equivalent plain-text comments (e.g.,
remove "⭐" and any other emoji), and scan the forwarder.ts file for any other
emoji in comments or string literals and convert them to plain text to comply
with the repository rule.
- Around line 3782-3784: The hedge branch calls
recordFailure(attempt.provider.id, error) on a 524 provider error without
honoring the same circuit-breaker accounting guards used elsewhere; update the
hedge-path check to only call recordFailure when allowCircuitBreakerAccounting
is enabled and the request is not a probe (mirror the ordinary branch's
conditions), i.e. wrap the existing if (errorCategory ===
ErrorCategory.PROVIDER_ERROR && statusCode === 524) { recordFailure(...) } with
the same allowCircuitBreakerAccounting and probe-request guards (using the
existing probe flag/check and allowCircuitBreakerAccounting variable) so hedge
failures are accounted consistently.

---

Outside diff comments:
In `@src/app/v1/_lib/proxy/response-handler.ts`:
- Around line 594-610: The branch handling !streamEndedNormally currently
records every abort as "system_error"; change it to distinguish client-initiated
disconnects by checking the clientAborted flag (or equivalent boolean that
indicates the client closed the connection) before calling
session.addProviderToChain; if clientAborted is true, pass reason:
"client_abort", otherwise keep reason: "system_error". Keep the surrounding
logic (await clearSessionBinding(), use providerForChain, meta.attemptNumber,
effectiveStatusCode, errorMessage, and return values) unchanged.

In `@tests/unit/proxy/response-handler-non200.test.ts`:
- Around line 269-391: The tests are directly calling session.addProviderToChain
instead of exercising the real handler; replace the manual adds with invoking
the actual ProxyResponseHandler method that processes non-200 responses (e.g.,
ProxyResponseHandler.handleNonStream or the method used in tests to handle
responses) using a mocked response with the target statusCode and responseText
so the handler populates the session provider chain and potentially calls
recordFailure; assert on session.getProviderChain(), the added entry's
errorMessage/reason, and that mockRecordFailure was or was not called
accordingly, removing the manual session.addProviderToChain calls.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 16478241-0290-465a-971e-cc551af90d8c

📥 Commits

Reviewing files that changed from the base of the PR and between 01b8ace and 668daf6.

📒 Files selected for processing (12)

.env.example
README.en.md
README.md
src/app/v1/_lib/proxy/errors.ts
src/app/v1/_lib/proxy/forwarder.ts
src/app/v1/_lib/proxy/response-handler.ts
src/app/v1/_lib/proxy/session.ts
src/lib/config/env.schema.ts
tests/unit/proxy/client-abort-vs-upstream-499.test.ts
tests/unit/proxy/proxy-forwarder-fake-200-html.test.ts
tests/unit/proxy/response-handler-endpoint-circuit-isolation.test.ts
tests/unit/proxy/response-handler-non200.test.ts

✅ Files skipped from review due to trivial changes (4)

tests/unit/proxy/client-abort-vs-upstream-499.test.ts
README.md
src/app/v1/_lib/proxy/session.ts
src/app/v1/_lib/proxy/errors.ts

🚧 Files skipped from review as they are similar to previous changes (3)

src/lib/config/env.schema.ts
.env.example
README.en.md

coderabbitai · 2026-04-28T12:24:37Z

          }

-          // ⭐ 6. 供应商错误处理（所有 4xx/5xx HTTP 错误 + 空响应错误，计入熔断器，重试耗尽后切换）
+          // ⭐ 6. 供应商错误处理（所有 4xx/5xx HTTP 错误 + 空响应错误，重试耗尽后切换）


⚠️ Potential issue | 🟡 Minor

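Please remove the emoji characters from the newly added comments.

The comments at Line 1819 and Line 2015 contain emoji, which violates the repository rule; change them to plain-text comments.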

As per coding guidelines "**/*.{ts,tsx,js,jsx,json,css,scss,sql,py,go,sh,bash}: Never use emoji characters in any code, comments, or string literals".

Also applies to: 2015-2016

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/forwarder.ts` at line 1819, Remove emoji characters from the inline comments in src/app/v1/_lib/proxy/forwarder.ts — specifically replace the comment containing "⭐ 6. 供应商错误处理（所有 4xx/5xx HTTP 错误 + 空响应错误，重试耗尽后切换）" and the comments at the nearby block around lines 2015-2016 with equivalent plain-text comments (e.g., remove "⭐" and any other emoji), and scan the forwarder.ts file for any other emoji in comments or string literals and convert them to plain text to comply with the repository rule.

coderabbitai · 2026-04-28T12:24:37Z

+              circuitFailureCount:
+                health.failureCount + (shouldRecordProviderCircuitFailure ? 1 : 0),


⚠️ Potential issue | 🟠 Major

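circuitFailureCount is incremented by 1 ahead of time even when recordFailure() does not actually run.

Lines 1974-1975 pre-increment the count based only on statusCode === 524, but the real write is also gated by the probe check, shouldAccountCircuitBreaker, and whether retries are exhausted (Lines 2017-2025). This makes the decision-chain count inconsistent with the real circuit breaker state.

Suggested fix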

 const statusCode = proxyError.statusCode;
 const willRetry = attemptCount < maxAttemptsPerProvider;
 const shouldRecordProviderCircuitFailure = statusCode === 524;
+const isProbeRequest = session.isProbeRequest();
+const willRecordProviderCircuitFailure =
+  shouldRecordProviderCircuitFailure &&
+  shouldAccountCircuitBreaker &&
+  !isProbeRequest &&
+  !willRetry;
 ...
-  circuitFailureCount:
-    health.failureCount + (shouldRecordProviderCircuitFailure ? 1 : 0),
+  circuitFailureCount:
+    health.failureCount + (willRecordProviderCircuitFailure ? 1 : 0),
 ...
-if (session.isProbeRequest()) {
+if (isProbeRequest) {
   logger.debug("ProxyForwarder: Probe request error, skipping circuit breaker", {
     providerId: currentProvider.id,
     providerName: currentProvider.name,
     messagesCount: session.getMessagesLength(),
   });
 } else {
   if (shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker) {
     await recordFailure(currentProvider.id, lastError);
   }
 }

Also applies to: 2017-2025

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1974 - 1975, The interim increment of circuitFailureCount (using health.failureCount + (shouldRecordProviderCircuitFailure ? 1 : 0)) is applied even when recordFailure() may not actually run, causing mismatch with the real circuit state; change the logic so circuitFailureCount is only incremented under the exact same conditions used when calling recordFailure() — i.e., evaluate probe status, shouldAccountCircuitBreaker, retry exhaustion, and shouldRecordProviderCircuitFailure together (the same predicates used around recordFailure()) or simply defer computing/setting circuitFailureCount until after recordFailure() actually executes; update references in the forwarder code where circuitFailureCount, shouldRecordProviderCircuitFailure, shouldAccountCircuitBreaker, probe, and recordFailure() are used so the displayed count always matches real writes.
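
One way to keep the displayed count and the real write in lockstep is to derive both from a single pure function. This is a minimal sketch under stated assumptions: `CircuitAccountingInput`, `CircuitAccountingPlan`, and `planCircuitAccounting` are hypothetical names, and the four predicates mirror the ones named in the comment above.

```typescript
// Hypothetical inputs; in the forwarder these come from the session, policy, and retry state.
interface CircuitAccountingInput {
  statusCode: number;
  shouldAccountCircuitBreaker: boolean;
  isProbeRequest: boolean;
  willRetry: boolean;
  currentFailureCount: number;
}

interface CircuitAccountingPlan {
  willRecord: boolean;
  circuitFailureCount: number;
}

// Compute once whether a failure will be written, and derive the logged count from it.
function planCircuitAccounting(input: CircuitAccountingInput): CircuitAccountingPlan {
  const willRecord =
    input.statusCode === 524 &&
    input.shouldAccountCircuitBreaker &&
    !input.isProbeRequest &&
    !input.willRetry;
  return {
    willRecord,
    circuitFailureCount: input.currentFailureCount + (willRecord ? 1 : 0),
  };
}
```

A caller would log `circuitFailureCount` from the plan and call recordFailure() only when `willRecord` is true, so the two can no longer drift apart.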

coderabbitai · 2026-04-28T12:24:37Z

+      if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) {
        await recordFailure(attempt.provider.id, error);
      }


⚠️ Potential issue | 🟠 Major

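The hedge branch lacks the circuit breaker accounting guard.

Lines 3782-3784 call recordFailure() directly on a 524, whereas the ordinary branch is guarded by allowCircuitBreakerAccounting and the probe-request check. As a result, the same policy is bypassed on the hedge path.

Suggested fix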

 // in sendStreamingWithHedge(...) near function start
+const shouldAccountCircuitBreaker =
+  ProxyForwarder.getEndpointPolicy(session).allowCircuitBreakerAccounting;
+const isProbeRequest = session.isProbeRequest();
 // ...
-if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) {
+if (
+  errorCategory === ErrorCategory.PROVIDER_ERROR &&
+  statusCode === 524 &&
+  shouldAccountCircuitBreaker &&
+  !isProbeRequest
+) {
   await recordFailure(attempt.provider.id, error);
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 3782 - 3784, The hedge branch calls recordFailure(attempt.provider.id, error) on a 524 provider error without honoring the same circuit-breaker accounting guards used elsewhere; update the hedge-path check to only call recordFailure when allowCircuitBreakerAccounting is enabled and the request is not a probe (mirror the ordinary branch's conditions), i.e. wrap the existing if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) { recordFailure(...) } with the same allowCircuitBreakerAccounting and probe-request guards (using the existing probe flag/check and allowCircuitBreakerAccounting variable) so hedge failures are accounted consistently.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 668daf6d2f

chatgpt-codex-connector · 2026-04-28T12:24:38Z

 # - 默认关闭：适用于网络不稳定环境（如使用代理），避免因临时网络抖动触发熔断器
 # - 启用：适用于网络稳定环境，连续网络错误也应触发熔断保护，避免持续请求不可达的供应商
-ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=false
+ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true


Correct circuit-breaker env comments after default flip

This change sets ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true, but the surrounding .env.example guidance still describes false as the default and says provider HTTP errors are counted when disabled/enabled. After this commit, provider 4xx/5xx are generally not counted (except timeout-specific handling), so operators who rely on this example can deploy with an incorrect mental model and mis-tune production circuit-breaker behavior.

ding113 · 2026-05-13T06:10:11Z

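Hello, we do not plan to change this decision logic in the current project for now. A later refactored version is planned to provide a corresponding switch or configuration interface, so this PR will be closed.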

github-project-automation Bot added this to Claude Code Hub Roadmap Apr 28, 2026

coderabbitai Bot requested a review from ding113 April 28, 2026 11:27

github-actions Bot added bug Something isn't working area:provider labels Apr 28, 2026

gemini-code-assist Bot reviewed Apr 28, 2026

greptile-apps Bot reviewed Apr 28, 2026

coderabbitai Bot requested changes Apr 28, 2026

github-actions Bot added the size/M Medium PR (< 500 lines) label Apr 28, 2026

github-actions Bot reviewed Apr 28, 2026

fix(proxy): limit provider circuit breaker failures

668daf6

dofastted force-pushed the fix/provider-circuit-breaker-health branch from 01b8ace to 668daf6 Compare April 28, 2026 12:19

coderabbitai Bot requested changes Apr 28, 2026

chatgpt-codex-connector Bot reviewed Apr 28, 2026

ding113 closed this May 13, 2026

github-project-automation Bot moved this from Backlog to Done in Claude Code Hub Roadmap May 13, 2026


Greptile Summary

Confidence Score: 5/5
