Skip to content

fix(proxy): limit provider circuit breaker failures#1134

Closed
dofastted wants to merge 1 commit into
ding113:devfrom
dofastted:fix/provider-circuit-breaker-health
Closed

fix(proxy): limit provider circuit breaker failures#1134
dofastted wants to merge 1 commit into
ding113:devfrom
dofastted:fix/provider-circuit-breaker-health

Conversation

@dofastted
Copy link
Copy Markdown
Contributor

@dofastted dofastted commented Apr 28, 2026

Summary

Refines provider circuit breaker logic to distinguish between infrastructure failures (connectivity issues, timeouts) and request-level errors (HTTP 4xx/5xx, fake-200, empty responses). Only infrastructure failures now affect provider health scores, preventing healthy providers from being incorrectly penalized for transient or client-side issues.

Problem

Previously, the circuit breaker counted the following as provider failures:

  • All HTTP 4xx/5xx responses from upstream providers
  • Fake-200 errors (HTML error pages returned with HTTP 200)
  • Empty response errors
  • Stream abort errors

This caused several operational issues:

  1. Providers returning legitimate HTTP errors (like 429 rate limits, 401 auth failures) would be marked as unhealthy and excluded from rotation
  2. Client disconnections (499 errors) were polluting the circuit breaker metrics
  3. Transient model-specific errors could take down an entire provider endpoint
  4. The ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS default was false, meaning network errors weren't counted even when operators wanted strict health tracking

Related Issues:

Solution

Core Behavior Changes

  1. HTTP 4xx/5xx responses no longer count as circuit breaker failures

    • These indicate the upstream was reachable but returned an error (auth failure, rate limit, model error)
    • The proxy still retries/fails over, but doesn't penalize provider health
    • Exception: 524 timeout errors (Cloudflare timeout) still count as infrastructure failures
  2. Fake-200 responses (HTML error pages) excluded from circuit breaker

    • Previously recorded as recordFailure() - now only logged for debugging
    • Session binding is still cleared to avoid sticky sessions to problematic endpoints
  3. Empty response errors excluded

    • Empty responses may be request-specific; provider health shouldn't suffer
  4. Stream abort errors excluded

    • Client-side disconnections after response started don't indicate provider unhealth
  5. Network error config now defaults to true

    • ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS now defaults to true
    • When enabled, only connectivity failures and 524 timeouts affect provider circuit state

Error Category Refinements

Updated ErrorCategory classification in errors.ts:

Category Circuit Breaker Impact Examples
CLIENT_ABORT Never counted User disconnect, request cancelled
NON_RETRYABLE_CLIENT_ERROR Never counted Prompt too large, content filter
RESOURCE_NOT_FOUND (404) Never counted Model doesn't exist on this provider
PROVIDER_ERROR (4xx/5xx) No longer counted Rate limit, auth failure, bad request
SYSTEM_ERROR Conditionally counted DNS failure, connection refused, 524 timeout

Changes

Core Proxy Changes

  • src/app/v1/_lib/proxy/forwarder.ts - Only 524 status codes and network errors trigger recordFailure()
  • src/app/v1/_lib/proxy/response-handler.ts - Removed all recordFailure() calls for non-200/fake-200/stream errors
  • src/app/v1/_lib/proxy/errors.ts - Updated error category documentation
  • src/app/v1/_lib/proxy/session.ts - Updated chain reason comments

Configuration Changes

  • src/lib/config/env.schema.ts - ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS default: falsetrue
  • .env.example - Updated documentation for the new behavior
  • README.md / README.en.md - Clarified that only connectivity failures affect circuit state

Test Updates

  • tests/unit/proxy/client-abort-vs-upstream-499.test.ts - Updated expectations
  • tests/unit/proxy/proxy-forwarder-fake-200-html.test.ts - Removed recordFailure() assertions
  • tests/unit/proxy/response-handler-endpoint-circuit-isolation.test.ts - Verified no circuit recording
  • tests/unit/proxy/response-handler-non200.test.ts - Non-200 responses don't trigger failures

Breaking Changes

Behavior change for operators:

Scenario Before After
Provider returns HTTP 500 Counts as failure, may trip circuit Doesn't count; still retries/fails over
Provider returns HTTP 429 Counts as failure, may trip circuit Doesn't count; still retries/fails over
Fake-200 HTML error Counts as failure Doesn't count
Empty response Counts as failure Doesn't count
524 Cloudflare timeout Counts as failure Still counts
Network/DNS error Depends on ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS (default: false) Counts if config true (default: true)

Migration:

  • No code changes required
  • Monitor provider health metrics after deployment - providers that were incorrectly penalized may now show better health scores
  • If you prefer the old behavior (all 4xx/5xx count as failures), this PR does not provide a toggle - the old behavior was deemed too aggressive for production use

Testing

Automated Tests

  • bun run typecheck - pass
  • bun run test - All 4 updated test suites pass
    • client-abort-vs-upstream-499.test.ts (16 tests)
    • proxy-forwarder-fake-200-html.test.ts (7 tests)
    • response-handler-endpoint-circuit-isolation.test.ts (12 tests)
    • response-handler-non200.test.ts (14 tests)

Manual Verification

  1. Configure provider that returns HTTP 429
  2. Send multiple requests until rate limit triggered
  3. Verify provider is NOT marked unhealthy in circuit breaker
  4. Verify requests still fail over to alternative provider
  5. Verify provider_chain shows retry_failed but not circuit breaker penalties

Checklist

  • Code follows project conventions (no emoji, named exports, i18n strings)
  • Self-review completed
  • Tests updated to reflect new behavior
  • Documentation updated (env var descriptions)
  • No breaking API changes
  • Target branch is dev

Description enhanced by Claude AI

Greptile Summary

This PR refines the provider circuit breaker to count only infrastructure failures (524 Cloudflare timeouts and network-layer errors when ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true) rather than all 4xx/5xx HTTP responses, and flips that config default from false to true. The core forwarder.ts and response-handler.ts changes are clean and internally consistent; the circuitFailureCount log values are corrected to match the new non-recording behavior.

Confidence Score: 5/5

Safe to merge; only P2 findings remain and they are pre-existing test quality issues.

All findings are P2 style/test concerns. The production logic change is straightforward and well-scoped. The meaningful test suites (forwarder fake-200 and endpoint circuit isolation) do exercise real code paths and correctly reflect the new behavior. The default-flip for ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is a documented intentional breaking change.

tests/unit/proxy/response-handler-non200.test.ts — the handleNonStream describe block tests are trivially passing and do not exercise the real handler.

Important Files Changed

Filename Overview
src/app/v1/_lib/proxy/forwarder.ts Core change: sequential path now gates recordFailure on statusCode === 524 (previously all 4xx/5xx); empty-response path also stops calling recordFailure; circuitFailureCount log corrected accordingly. Concurrent path mirrors the 524-only guard but still lacks the shouldAccountCircuitBreaker check (pre-existing gap, already flagged).
src/app/v1/_lib/proxy/response-handler.ts Removes all recordFailure calls for fake-200, non-200 passthrough, and stream-abort paths; replaces them with comment-only explanations. Logic and cleanup are consistent with the stated design intent.
src/lib/config/env.schema.ts Default for ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS flipped from false to true; one-liner change, no schema-logic issues.
tests/unit/proxy/response-handler-non200.test.ts handleNonStream describe block never invokes the real handler; tests trivially pass and provide no coverage of the actual response-handler.ts change. Other test files (endpoint-circuit-isolation) do exercise the real dispatch path and are meaningful.
tests/unit/proxy/response-handler-endpoint-circuit-isolation.test.ts Test titles and expectations updated; all scenarios call ProxyResponseHandler.dispatch and meaningfully verify the new no-recordFailure behavior for fake-200 and non-200 paths.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Upstream error received] --> B{Error category?}
    B -->|CLIENT_ABORT| C[No recordFailure\nNo retry]
    B -->|NON_RETRYABLE_CLIENT_ERROR| D[No recordFailure\nNo retry]
    B -->|RESOURCE_NOT_FOUND 404| E[No recordFailure\nSwitch provider]
    B -->|PROVIDER_ERROR\n4xx/5xx + EmptyResponse| F{statusCode == 524?}
    F -->|Yes\nCloudflare timeout| G[recordFailure ✓\nSwitch provider]
    F -->|No\n429/500/502/etc.| H[No recordFailure\nRetry/switch provider]
    B -->|SYSTEM_ERROR\nNetwork/DNS| I{ENABLE_CIRCUIT_BREAKER\n_ON_NETWORK_ERRORS?}
    I -->|true\nnew default| J[recordFailure ✓\nRetry once]
    I -->|false| K[No recordFailure\nRetry once]
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: tests/unit/proxy/response-handler-non200.test.ts
Line: 270-368

Comment:
**Tests don't invoke the real handler — trivially pass**

The `handleNonStream with non-200 status code` describe block never calls `ProxyResponseHandler.dispatch` or any other production method. Each test manually runs `detectUpstreamErrorFromSseOrJsonText` and `session.addProviderToChain` inline, then asserts `expect(mockRecordFailure).not.toHaveBeenCalled()`. Because no production code is exercised, these assertions are trivially true regardless of whether `response-handler.ts` was changed or not. The companion tests in `response-handler-endpoint-circuit-isolation.test.ts` do call `ProxyResponseHandler.dispatch` and are meaningful; these four tests provide no additional coverage of the actual behavior change.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "fix(proxy): limit provider circuit break..." | Re-trigger Greptile

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 28, 2026

📝 Walkthrough

Walkthrough

ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS 默认由 false 改为 true,并将提供商断路器计数限定为超时/连接类故障(如 HTTP 524);移除针对 EmptyResponse/伪 200/多数非 2xx 路径的提供商断路器记录,仅在超时样式失败时计数。

Changes

Cohort / File(s) Summary
配置与示例
\.env.example, src/lib/config/env.schema.ts
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS 默认值从 false 改为 true
文档
README.en.md, README.md
更新配置表述:默认改为 true,并将描述收窄为仅由连接失败和超时影响断路器。
错误分类注释
src/app/v1/_lib/proxy/errors.ts
调整 JSDoc/comments,澄清 EmptyResponse 属于 PROVIDER_ERROR 不应累计提供商断路器;SYSTEM_ERROR(传输/网络)是否计入由配置决定。
转发逻辑
src/app/v1/_lib/proxy/forwarder.ts
移除 EmptyResponse 与多数 PROVIDER_ERROR 的断路器增量;仅当 statusCode === 524(超时样式)且配置允许时才增加 circuitFailureCount 与调用 recordFailure()
响应处理与流控
src/app/v1/_lib/proxy/response-handler.ts
从 SSE/伪 200、非 2xx 流和若干非流路径中移除对 recordFailure 的动态调用,仅将失败记录入提供商决策链并避免会话绑定更新。
会话注释
src/app/v1/_lib/proxy/session.ts
更新 ProviderSession.addProviderToChainmetadata.reason 的文档,说明 retry_failedsystem_error 在断路器计数上的不同语义(后者受配置影响)。
测试(注释/断言调整)
tests/unit/proxy/client-abort-vs-upstream-499.test.ts
更新文件注释,描述 499 被归类为 PROVIDER_ERROR、触发重试/回退而非断路器。
测试(行为断言修改)
tests/unit/proxy/proxy-forwarder-fake-200-html.test.ts, tests/unit/proxy/response-handler-endpoint-circuit-isolation.test.ts, tests/unit/proxy/response-handler-non200.test.ts
调整单元测试期望:移除或替换对 recordFailure/mockRecordFailure 被调用的断言为“未被调用”,保留提供商选择与 provider-chain 相关断言。

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title clearly and accurately summarizes the main change: limiting what counts as provider circuit breaker failures, which is the core purpose of this PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description clearly explains the rationale, core behavior changes, affected files, test updates, and breaking changes related to provider circuit breaker logic refinement.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot requested a review from ding113 April 28, 2026 11:27
@github-actions github-actions Bot added bug Something isn't working area:provider labels Apr 28, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refines the circuit breaker logic so that only connectivity failures and timeouts contribute to a provider's circuit breaker state, while application-level errors trigger fallbacks without penalizing overall health. It also updates the default configuration to enable circuit breakers on network errors. Feedback points out an inconsistency in the streaming hedge path regarding network error handling and endpoint policies, and recommends making failure recording non-blocking to improve performance.

Comment on lines +3782 to 3784
if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) {
await recordFailure(attempt.provider.id, error);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The circuit breaker logic in the streaming hedge path is inconsistent with the non-hedge path and the PR description. Connectivity failures (SYSTEM_ERROR) should also affect the provider circuit state when ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is enabled. Additionally, the allowCircuitBreakerAccounting endpoint policy should be respected. Note: In performance-sensitive paths like this, the recordFailure I/O operation should be fire-and-forget to avoid blocking the main logic.

      const shouldAccountCircuitBreaker = session.getEndpointPolicy().allowCircuitBreakerAccounting;
      const shouldRecordProviderCircuitFailure =
        (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) ||
        (errorCategory === ErrorCategory.SYSTEM_ERROR && getEnvConfig().ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS);

      if (shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker) {
        recordFailure(attempt.provider.id, error);
      }
References
  1. In performance-sensitive code paths like provider failover, non-critical I/O operations (e.g., releasing a session in Redis) should be executed as fire-and-forget tasks to avoid blocking the main logic.

Comment thread .env.example
Comment on lines 94 to +97
# 使用场景:
# - 默认关闭:适用于网络不稳定环境(如使用代理),避免因临时网络抖动触发熔断器
# - 启用:适用于网络稳定环境,连续网络错误也应触发熔断保护,避免持续请求不可达的供应商
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=false
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Stale "默认关闭" comment after default change

The comment block still reads - 默认关闭:适用于网络不稳定环境… (disabled by default: for unstable networks), but the default was just flipped to true. Operators who set up a fresh deployment will read "disabled by default" and assume false is the initial state, leading to misconfigured circuit breakers.

Suggested change
# 使用场景:
# - 默认关闭:适用于网络不稳定环境(如使用代理),避免因临时网络抖动触发熔断器
# - 启用:适用于网络稳定环境,连续网络错误也应触发熔断保护,避免持续请求不可达的供应商
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=false
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true
# 使用场景:
# - 关闭:适用于网络不稳定环境(如使用代理),避免因临时网络抖动触发熔断器
# - 启用(默认):适用于网络稳定环境,连续网络错误也应触发熔断保护,避免持续请求不可达的供应商
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true
Prompt To Fix With AI
This is a comment left during a code review.
Path: .env.example
Line: 94-97

Comment:
**Stale "默认关闭" comment after default change**

The comment block still reads `- 默认关闭:适用于网络不稳定环境…` (disabled by default: for unstable networks), but the default was just flipped to `true`. Operators who set up a fresh deployment will read "disabled by default" and assume `false` is the initial state, leading to misconfigured circuit breakers.

```suggestion
 # 使用场景:
 # - 关闭:适用于网络不稳定环境(如使用代理),避免因临时网络抖动触发熔断器
 # - 启用(默认):适用于网络稳定环境,连续网络错误也应触发熔断保护,避免持续请求不可达的供应商
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +3782 to 3784
if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) {
await recordFailure(attempt.provider.id, error);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Concurrent mode skips shouldAccountCircuitBreaker guard

The sequential path gates its 524-failure recording on shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker (line ~2024), but the concurrent/hedging path at this line does not check shouldAccountCircuitBreaker (endpointPolicy.allowCircuitBreakerAccounting). This means a 524 error in a hedge attempt will always record a circuit-breaker failure, even for endpoints whose policy explicitly opts out of circuit-breaker accounting.

This asymmetry pre-dates this PR, but since this PR now narrows the trigger to only 524, the missing guard becomes more visible. Consider adding the same guard for consistency:

if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524 && shouldAccountCircuitBreaker) {
  await recordFailure(attempt.provider.id, error);
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 3782-3784

Comment:
**Concurrent mode skips `shouldAccountCircuitBreaker` guard**

The sequential path gates its 524-failure recording on `shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker` (line ~2024), but the concurrent/hedging path at this line does not check `shouldAccountCircuitBreaker` (`endpointPolicy.allowCircuitBreakerAccounting`). This means a 524 error in a hedge attempt will always record a circuit-breaker failure, even for endpoints whose policy explicitly opts out of circuit-breaker accounting.

This asymmetry pre-dates this PR, but since this PR now narrows the trigger to only `524`, the missing guard becomes more visible. Consider adding the same guard for consistency:

```typescript
if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524 && shouldAccountCircuitBreaker) {
  await recordFailure(attempt.provider.id, error);
}
```

How can I resolve this? If you propose a fix, please make it concise.

Comment thread README.en.md
| `SESSION_TTL` | `300` | Session cache window (seconds) that drives vendor reuse. |
| `ENABLE_SECURE_COOKIES` | `true` | Browsers require HTTPS for Secure cookies; set to `false` when serving plain HTTP outside localhost. |
| `ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS` | `false` | When `true`, network errors also trip the circuit breaker for quicker isolation. |
| `ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS` | `true` | When `true`, only connectivity failures and timeouts affect provider circuit breaker state. |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Description ambiguous when variable is false

The new description "When true, only connectivity failures and timeouts affect provider circuit breaker state" implies that when false, other things also affect provider circuit state. In practice, HTTP 524 still triggers recordFailure regardless of this variable — it purely controls whether network-layer errors are additionally included.

Consider wording like: "When true, network-layer errors (DNS, connection refused, etc.) also trip the provider circuit breaker in addition to 524 timeouts."

Prompt To Fix With AI
This is a comment left during a code review.
Path: README.en.md
Line: 347

Comment:
**Description ambiguous when variable is `false`**

The new description "When `true`, only connectivity failures and timeouts affect provider circuit breaker state" implies that when `false`, other things also affect provider circuit state. In practice, HTTP 524 still triggers `recordFailure` regardless of this variable — it purely controls whether *network-layer* errors are additionally included.

Consider wording like: "When `true`, network-layer errors (DNS, connection refused, etc.) also trip the provider circuit breaker in addition to 524 timeouts."

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
.env.example (1)

90-97: ⚠️ Potential issue | 🟡 Minor

熔断器注释与当前行为不一致,易误导运维配置。

Line 97 已改为默认 true,但 Line 92-96 仍写“false 为默认、true 计入所有错误”。这与本 PR“仅连通性失败/超时类问题影响 provider 熔断状态”的语义冲突。

建议同步更新注释文案
 # 熔断器配置
-# 功能说明:控制网络错误是否计入熔断器失败计数
-# - false (默认):网络错误(DNS 解析失败、连接超时、代理连接失败等)不计入熔断器,仅供应商错误(4xx/5xx HTTP 响应)计入
-# - true:所有错误(包括网络错误)都计入熔断器失败计数
+# 功能说明:控制“连通性失败/超时类错误”是否影响 provider 熔断状态
+# - true (默认):连接失败、连接超时、超时风格错误(如 524)会计入 provider 熔断
+# - false:上述网络连通性错误不计入 provider 熔断(仍保留重试/切换逻辑)
 ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.env.example around lines 90 - 97, The comment for
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is inconsistent with the new default
and PR behavior; update the explanatory text so the default value shown (true)
matches the described behavior and clearly define what true vs false do given
the PR's change (specifically that, under the current implementation, only
connectivity failures/timeouts affect the provider circuit state or explain that
true/false now invert behavior as implemented). Edit the block mentioning "false
(默认)" / "true:所有错误计入熔断器失败计数" and rewrite it to explicitly state the current
default (true), what each setting does, and that only connectivity/time‑out
style network errors are considered for provider circuit status per this PR;
ensure ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is referenced so maintainers can
locate the setting.
tests/unit/proxy/response-handler-non200.test.ts (1)

269-367: ⚠️ Potential issue | 🟠 Major

这些 non-200 用例未走真实处理链路,当前断言无法证明行为变更。

在 Line 270-367,测试只做本地分支和 session.addProviderToChain(...),并没有调用 ProxyResponseHandler(或 forwarder)实际逻辑,所以 Line 291/319/343/367 的 not.toHaveBeenCalled() 基本是平凡成立,回归保护不足。

建议改为真实调用路径(示例)
+import { ProxyResponseHandler } from "@/app/v1/_lib/proxy/response-handler";
...
   describe("handleNonStream with non-200 status code", () => {
     it("should not record provider circuit breaker failure for 500 status", async () => {
       const session = createSession({
         provider: mockProvider,
         messageContext: mockMessageContext,
       });
-
-      const statusCode = 500;
-      const responseText = '{"error":"internal error"}';
-
-      if (statusCode >= 400) {
-        const detected = detectUpstreamErrorFromSseOrJsonText(responseText);
-        const errorMessageForDb = detected.isError ? detected.code : `HTTP ${statusCode}`;
-
-        session.addProviderToChain(mockProvider, {
-          reason: "retry_failed",
-          attemptNumber: 1,
-          statusCode: statusCode,
-          errorMessage: errorMessageForDb,
-        });
-      }
+      const response = new Response('{"error":"internal error"}', {
+        status: 500,
+        headers: { "content-type": "application/json" },
+      });
+
+      await ProxyResponseHandler.dispatch(session, response);
+      await Promise.all(asyncTasks.splice(0, asyncTasks.length));
 
       expect(mockRecordFailure).not.toHaveBeenCalled();
+      expect(
+        session
+          .getProviderChain()
+          .some((item) => item.reason === "retry_failed" && item.statusCode === 500),
+      ).toBe(true);
     });
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/proxy/response-handler-non200.test.ts` around lines 269 - 367,
Tests are only exercising local branching and calling session.addProviderToChain
directly, so mockRecordFailure assertions are vacuous; instead invoke the real
handling path (e.g., ProxyResponseHandler.handleNonStream or the forwarder
method that processes an upstream Response) with a constructed Response-like
object containing status and body so the handler calls
detectUpstreamErrorFromSseOrJsonText and then
session.addProviderToChain/recordProviderFailure; update each test to call that
handler with the mocked provider and response (using the same statusCode and
responseText), assert mockRecordFailure called or not accordingly, and keep
references to detectUpstreamErrorFromSseOrJsonText, session.addProviderToChain,
ProxyResponseHandler.handleNonStream (or the forwarder method) and
mockRecordFailure to find the right code to change.
src/app/v1/_lib/proxy/response-handler.ts (2)

926-938: ⚠️ Potential issue | 🟡 Minor

404 在这两个 non-stream 分支里也需要保留 resource_not_found

这个 PR 去掉了这里的 provider breaker 记账后,决策链已经变成这条路径里唯一的失败归因。但当前把所有 >= 400 都写成 retry_failed,会把 404 和普通上游失败混在一起,和本文件的流式分支、categorizeErrorAsync() 的 404 特判不一致。

建议修改
             if (statusCode >= 400) {
               const detected = detectUpstreamErrorFromSseOrJsonText(responseText);
               errorMessageForFinalize = detected.isError ? detected.code : `HTTP ${statusCode}`;
+              const chainReason =
+                statusCode === 404 ? "resource_not_found" : "retry_failed";

               // 记录到决策链
               session.addProviderToChain(provider, {
-                reason: "retry_failed",
+                reason: chainReason,
                 attemptNumber: 1,
                 statusCode: statusCode,
                 errorMessage: errorMessageForFinalize,
               });
             }
         if (statusCode >= 400) {
           const detected = detectUpstreamErrorFromSseOrJsonText(responseText);
           const errorMessageForDb = detected.isError ? detected.code : `HTTP ${statusCode}`;
+          const chainReason =
+            statusCode === 404 ? "resource_not_found" : "retry_failed";

           // 记录到决策链
           session.addProviderToChain(provider, {
-            reason: "retry_failed",
+            reason: chainReason,
             attemptNumber: 1,
             statusCode: statusCode,
             errorMessage: errorMessageForDb,
           });
         }

Also applies to: 1305-1316

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/response-handler.ts` around lines 926 - 938, The
non-stream error handling currently logs every >=400 as reason "retry_failed",
which incorrectly merges 404s with other failures; update the
session.addProviderToChain call in response-handler.ts (the block using
detectUpstreamErrorFromSseOrJsonText, variables
provider/statusCode/detected/errorMessageForFinalize) so that if the upstream
represents a resource-not-found (statusCode === 404 or detected.isError &&
detected.code === 'resource_not_found') you set reason to "resource_not_found",
otherwise keep "retry_failed"; apply the same conditional fix to the other
non-stream branch that mirrors this logic (the block around lines 1305-1316).

594-608: ⚠️ Potential issue | 🟡 Minor

客户端中断这里不应记成 system_error

clientAborted === true 时,这里仍然固定写入 reason: "system_error"。后续 updateMessageRequestDetails() 会把 provider chain 原样持久化,所以用户主动取消会被审计/监控误记成上游故障。

建议修改
     session.addProviderToChain(providerForChain, {
       endpointId: meta.endpointId,
       endpointUrl: meta.endpointUrl,
-      reason: "system_error",
+      reason: clientAborted ? "client_abort" : "system_error",
       attemptNumber: meta.attemptNumber,
       statusCode: effectiveStatusCode,
       errorMessage: errorMessage ?? undefined,
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/response-handler.ts` around lines 594 - 608, The code
currently records provider chain entries with reason: "system_error" even when
the client aborted; update the block that runs when !streamEndedNormally (around
streamEndedNormally, clearSessionBinding, and session.addProviderToChain) to
detect clientAborted (or equivalent flag) and set reason to "client_aborted" (or
a distinct non-system reason) instead of "system_error"; ensure
session.addProviderToChain receives the adjusted reason, leaving other fields
(endpointId, endpointUrl, attemptNumber, statusCode, errorMessage) unchanged so
updateMessageRequestDetails() persists the correct cause.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.en.md`:
- Line 347: The README's descriptions for
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS are inconsistent: one place says "only
connectivity failures and timeouts affect provider circuit breaker state" while
another still states "4xx/5xx or network errors will trigger"—pick the intended
behavior and make both descriptions match; update the other occurrence so the
text referencing ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS and any FAQ entry use
the same wording (either "only connectivity failures and timeouts" or "4xx/5xx
and network errors") and ensure the FAQ and the table entry convey the identical
policy.

In `@README.md`:
- Line 360: The README has inconsistent descriptions: the config table for
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS says only connection failures/timeouts
count toward provider circuit-breaking, but the FAQ still states "4xx/5xx or
network errors"—please reconcile by picking one authoritative rule and making
both places consistent; specifically update the FAQ text that mentions "4xx/5xx"
to match the table (or vice versa) so both state clearly whether 4xx/5xx
responses are included, and ensure the ENV name
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS and its explanatory Chinese wording
consistently describe the exact trigger set (e.g., "仅将连接失败、超时等网络不可达错误计入熔断" or "将
4xx/5xx 与网络错误一并计入熔断") across README.

---

Outside diff comments:
In @.env.example:
- Around line 90-97: The comment for ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is
inconsistent with the new default and PR behavior; update the explanatory text
so the default value shown (true) matches the described behavior and clearly
define what true vs false do given the PR's change (specifically that, under the
current implementation, only connectivity failures/timeouts affect the provider
circuit state or explain that true/false now invert behavior as implemented).
Edit the block mentioning "false (默认)" / "true:所有错误计入熔断器失败计数" and rewrite it to
explicitly state the current default (true), what each setting does, and that
only connectivity/time‑out style network errors are considered for provider
circuit status per this PR; ensure ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is
referenced so maintainers can locate the setting.

In `@src/app/v1/_lib/proxy/response-handler.ts`:
- Around line 926-938: The non-stream error handling currently logs every >=400
as reason "retry_failed", which incorrectly merges 404s with other failures;
update the session.addProviderToChain call in response-handler.ts (the block
using detectUpstreamErrorFromSseOrJsonText, variables
provider/statusCode/detected/errorMessageForFinalize) so that if the upstream
represents a resource-not-found (statusCode === 404 or detected.isError &&
detected.code === 'resource_not_found') you set reason to "resource_not_found",
otherwise keep "retry_failed"; apply the same conditional fix to the other
non-stream branch that mirrors this logic (the block around lines 1305-1316).
- Around line 594-608: The code currently records provider chain entries with
reason: "system_error" even when the client aborted; update the block that runs
when !streamEndedNormally (around streamEndedNormally, clearSessionBinding, and
session.addProviderToChain) to detect clientAborted (or equivalent flag) and set
reason to "client_aborted" (or a distinct non-system reason) instead of
"system_error"; ensure session.addProviderToChain receives the adjusted reason,
leaving other fields (endpointId, endpointUrl, attemptNumber, statusCode,
errorMessage) unchanged so updateMessageRequestDetails() persists the correct
cause.

In `@tests/unit/proxy/response-handler-non200.test.ts`:
- Around line 269-367: Tests are only exercising local branching and calling
session.addProviderToChain directly, so mockRecordFailure assertions are
vacuous; instead invoke the real handling path (e.g.,
ProxyResponseHandler.handleNonStream or the forwarder method that processes an
upstream Response) with a constructed Response-like object containing status and
body so the handler calls detectUpstreamErrorFromSseOrJsonText and then
session.addProviderToChain/recordProviderFailure; update each test to call that
handler with the mocked provider and response (using the same statusCode and
responseText), assert mockRecordFailure called or not accordingly, and keep
references to detectUpstreamErrorFromSseOrJsonText, session.addProviderToChain,
ProxyResponseHandler.handleNonStream (or the forwarder method) and
mockRecordFailure to find the right code to change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5ef302ec-9adf-483e-8862-d56cdbfef7b4

📥 Commits

Reviewing files that changed from the base of the PR and between 79cbbc5 and 01b8ace.

📒 Files selected for processing (12)
  • .env.example
  • README.en.md
  • README.md
  • src/app/v1/_lib/proxy/errors.ts
  • src/app/v1/_lib/proxy/forwarder.ts
  • src/app/v1/_lib/proxy/response-handler.ts
  • src/app/v1/_lib/proxy/session.ts
  • src/lib/config/env.schema.ts
  • tests/unit/proxy/client-abort-vs-upstream-499.test.ts
  • tests/unit/proxy/proxy-forwarder-fake-200-html.test.ts
  • tests/unit/proxy/response-handler-endpoint-circuit-isolation.test.ts
  • tests/unit/proxy/response-handler-non200.test.ts

Comment thread README.en.md
Comment thread README.md
@github-actions github-actions Bot added the size/M Medium PR (< 500 lines) label Apr 28, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This PR changes the circuit breaker failure recording policy so that only HTTP 524 (timeout) errors affect provider circuit state, removing circuit breaker recording for upstream 4xx/5xx responses, fake-200 errors, empty responses, and stream aborts. The behavioral change is consistent across all code paths (forwarder main loop, hedge path, response handler). The test suite is updated to match, but has a coverage gap for the new 524-specific logic.

PR Size: M

  • Lines changed: 258
  • Files changed: 12

Issues Found

Category Critical High Medium Low
Logic/Bugs 0 0 0 0
Security 0 0 0 0
Error Handling 0 0 0 0
Types 0 0 0 0
Comments/Docs 0 0 1 0
Tests 0 1 0 0
Simplification 0 0 0 0

High Priority Issues (Should Fix)

  1. [TEST-MISSING-CRITICAL] No test verifies that 524 status codes trigger recordFailure (src/app/v1/_lib/proxy/forwarder.ts:1890). The shouldRecordProviderCircuitFailure = statusCode === 524 conditional is the sole remaining path for provider errors to affect circuit breaker state. Existing tests only verify the negative case (non-524 does not record). A test for the positive case is needed to prevent silent regression. The hedge path (forwarder.ts ~line 3782) has the same gap.

Medium Priority Issues

  1. [COMMENT-OUTDATED] .env.example lines 91-96 contain stale comments after this PR's changes. The comments state "默认关闭" (default off) and describe the old semantics where false means "only 4xx/5xx count" and true means "all errors count." After this PR, 4xx/5xx no longer count towards the circuit breaker regardless of this flag, making both descriptions inaccurate. The .env.example comments should be updated to match the new behavior.

Review Coverage

  • Logic and correctness
  • Security (OWASP Top 10)
  • Error handling
  • Type safety
  • Documentation accuracy
  • Test coverage
  • Code clarity

Automated review by Claude AI

@@ -1892,6 +1887,7 @@ export class ProxyForwarder {
const proxyError = lastError as ProxyError;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[High] [TEST-MISSING-CRITICAL] No test verifies that 524 status codes still trigger recordFailure

Why this is a problem: This PR introduces a critical behavioral gate -- only HTTP 524 (timeout) errors now trigger provider circuit breaker recording in the PROVIDER_ERROR path. The existing tests were updated to assert recordFailure.not.toHaveBeenCalled() for non-524 errors, but no test verifies the positive case: that a 524 error does call recordFailure. This is the sole remaining path for provider errors to affect circuit breaker state. If this conditional breaks, the circuit breaker becomes silently non-functional for all provider errors.

The hedge path (~line 3782, changed from statusCode \!== 404 to statusCode === 524) has the same gap.

Per CLAUDE.md: "All new features must have unit test coverage of at least 80%"

Suggested fix -- add a test in tests/unit/proxy/ that:

// Verify 524 timeout still triggers circuit breaker recording
it("should record circuit breaker failure for 524 timeout", async () => {
  // Setup: mock provider that throws ProxyError with statusCode 524
  doForward.mockRejectedValueOnce(
    new ProxyError("provider timeout", 524, { /* ... */ })
  );

  await ProxyForwarder.send(session);

  expect(mocks.recordFailure).toHaveBeenCalledWith(
    expect.any(Number),
    expect.objectContaining({ message: expect.stringContaining("timeout") })
  );
});

@dofastted dofastted force-pushed the fix/provider-circuit-breaker-health branch from 01b8ace to 668daf6 Compare April 28, 2026 12:19
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/app/v1/_lib/proxy/response-handler.ts (1)

594-610: ⚠️ Potential issue | 🟡 Minor

把客户端主动中断单独标成 client_abort

当前 !streamEndedNormally 分支无论是客户端断开还是上游/网络异常,都会写成 system_error。这样会把用户主动取消和真正的供应商故障混在一起,降低后续排障与统计精度。addProviderToChain 已经支持 client_abort,建议按 clientAborted 分开记录。

建议修正
   if (!streamEndedNormally) {
     await clearSessionBinding();

+    const chainReason = clientAborted ? "client_abort" : "system_error";
     session.addProviderToChain(providerForChain, {
       endpointId: meta.endpointId,
       endpointUrl: meta.endpointUrl,
-      reason: "system_error",
+      reason: chainReason,
       attemptNumber: meta.attemptNumber,
       statusCode: effectiveStatusCode,
       errorMessage: errorMessage ?? undefined,
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/response-handler.ts` around lines 594 - 610, The branch
handling !streamEndedNormally currently records every abort as "system_error";
change it to distinguish client-initiated disconnects by checking the
clientAborted flag (or equivalent boolean that indicates the client closed the
connection) before calling session.addProviderToChain; if clientAborted is true,
pass reason: "client_abort", otherwise keep reason: "system_error". Keep the
surrounding logic (await clearSessionBinding(), use providerForChain,
meta.attemptNumber, effectiveStatusCode, errorMessage, and return values)
unchanged.
tests/unit/proxy/response-handler-non200.test.ts (1)

269-391: ⚠️ Potential issue | 🟠 Major

这组测试没有真正覆盖 response-handler.ts

当前用例只是手动调用 session.addProviderToChain(...),并没有驱动 ProxyResponseHandler 的非 200 分支;因此即使处理逻辑回退,这些断言也会继续通过。建议改成走真实 handler,再断言 provider chain / errorMessage / recordFailure 不被触发。

如果你愿意,我可以帮你把这组测试改回真实的 handler 路径。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/proxy/response-handler-non200.test.ts` around lines 269 - 391, The
tests are directly calling session.addProviderToChain instead of exercising the
real handler; replace the manual adds with invoking the actual
ProxyResponseHandler method that processes non-200 responses (e.g.,
ProxyResponseHandler.handleNonStream or the method used in tests to handle
responses) using a mocked response with the target statusCode and responseText
so the handler populates the session provider chain and potentially calls
recordFailure; assert on session.getProviderChain(), the added entry's
errorMessage/reason, and that mockRecordFailure was or was not called
accordingly, removing the manual session.addProviderToChain calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 1974-1975: The interim increment of circuitFailureCount (using
health.failureCount + (shouldRecordProviderCircuitFailure ? 1 : 0)) is applied
even when recordFailure() may not actually run, causing mismatch with the real
circuit state; change the logic so circuitFailureCount is only incremented under
the exact same conditions used when calling recordFailure() — i.e., evaluate
probe status, shouldAccountCircuitBreaker, retry exhaustion, and
shouldRecordProviderCircuitFailure together (the same predicates used around
recordFailure()) or simply defer computing/setting circuitFailureCount until
after recordFailure() actually executes; update references in the forwarder code
where circuitFailureCount, shouldRecordProviderCircuitFailure,
shouldAccountCircuitBreaker, probe, and recordFailure() are used so the
displayed count always matches real writes.
- Line 1819: Remove emoji characters from the inline comments in
src/app/v1/_lib/proxy/forwarder.ts — specifically replace the comment containing
"⭐ 6. 供应商错误处理(所有 4xx/5xx HTTP 错误 + 空响应错误,重试耗尽后切换)" and the comments at the
nearby block around lines 2015-2016 with equivalent plain-text comments (e.g.,
remove "⭐" and any other emoji), and scan the forwarder.ts file for any other
emoji in comments or string literals and convert them to plain text to comply
with the repository rule.
- Around line 3782-3784: The hedge branch calls
recordFailure(attempt.provider.id, error) on a 524 provider error without
honoring the same circuit-breaker accounting guards used elsewhere; update the
hedge-path check to only call recordFailure when allowCircuitBreakerAccounting
is enabled and the request is not a probe (mirror the ordinary branch's
conditions), i.e. wrap the existing if (errorCategory ===
ErrorCategory.PROVIDER_ERROR && statusCode === 524) { recordFailure(...) } with
the same allowCircuitBreakerAccounting and probe-request guards (using the
existing probe flag/check and allowCircuitBreakerAccounting variable) so hedge
failures are accounted consistently.

---

Outside diff comments:
In `@src/app/v1/_lib/proxy/response-handler.ts`:
- Around line 594-610: The branch handling !streamEndedNormally currently
records every abort as "system_error"; change it to distinguish client-initiated
disconnects by checking the clientAborted flag (or equivalent boolean that
indicates the client closed the connection) before calling
session.addProviderToChain; if clientAborted is true, pass reason:
"client_abort", otherwise keep reason: "system_error". Keep the surrounding
logic (await clearSessionBinding(), use providerForChain, meta.attemptNumber,
effectiveStatusCode, errorMessage, and return values) unchanged.

In `@tests/unit/proxy/response-handler-non200.test.ts`:
- Around line 269-391: The tests are directly calling session.addProviderToChain
instead of exercising the real handler; replace the manual adds with invoking
the actual ProxyResponseHandler method that processes non-200 responses (e.g.,
ProxyResponseHandler.handleNonStream or the method used in tests to handle
responses) using a mocked response with the target statusCode and responseText
so the handler populates the session provider chain and potentially calls
recordFailure; assert on session.getProviderChain(), the added entry's
errorMessage/reason, and that mockRecordFailure was or was not called
accordingly, removing the manual session.addProviderToChain calls.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 16478241-0290-465a-971e-cc551af90d8c

📥 Commits

Reviewing files that changed from the base of the PR and between 01b8ace and 668daf6.

📒 Files selected for processing (12)
  • .env.example
  • README.en.md
  • README.md
  • src/app/v1/_lib/proxy/errors.ts
  • src/app/v1/_lib/proxy/forwarder.ts
  • src/app/v1/_lib/proxy/response-handler.ts
  • src/app/v1/_lib/proxy/session.ts
  • src/lib/config/env.schema.ts
  • tests/unit/proxy/client-abort-vs-upstream-499.test.ts
  • tests/unit/proxy/proxy-forwarder-fake-200-html.test.ts
  • tests/unit/proxy/response-handler-endpoint-circuit-isolation.test.ts
  • tests/unit/proxy/response-handler-non200.test.ts
✅ Files skipped from review due to trivial changes (4)
  • tests/unit/proxy/client-abort-vs-upstream-499.test.ts
  • README.md
  • src/app/v1/_lib/proxy/session.ts
  • src/app/v1/_lib/proxy/errors.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/lib/config/env.schema.ts
  • .env.example
  • README.en.md

}

// ⭐ 6. 供应商错误处理(所有 4xx/5xx HTTP 错误 + 空响应错误,计入熔断器,重试耗尽后切换)
// ⭐ 6. 供应商错误处理(所有 4xx/5xx HTTP 错误 + 空响应错误,重试耗尽后切换)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

请移除新增注释中的 emoji 字符。

Line 1819、Line 2015 的注释含有 emoji,不符合仓库规则,建议改成纯文本注释。

As per coding guidelines "**/*.{ts,tsx,js,jsx,json,css,scss,sql,py,go,sh,bash}: Never use emoji characters in any code, comments, or string literals".

Also applies to: 2015-2016

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` at line 1819, Remove emoji characters
from the inline comments in src/app/v1/_lib/proxy/forwarder.ts — specifically
replace the comment containing "⭐ 6. 供应商错误处理(所有 4xx/5xx HTTP 错误 +
空响应错误,重试耗尽后切换)" and the comments at the nearby block around lines 2015-2016 with
equivalent plain-text comments (e.g., remove "⭐" and any other emoji), and scan
the forwarder.ts file for any other emoji in comments or string literals and
convert them to plain text to comply with the repository rule.

Comment on lines +1974 to +1975
circuitFailureCount:
health.failureCount + (shouldRecordProviderCircuitFailure ? 1 : 0),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

circuitFailureCount 会在未实际 recordFailure() 时被提前加 1。

Line 1974-1975 只按 statusCode===524 预加计数,但真正写入还受 probe、shouldAccountCircuitBreaker、以及是否重试耗尽控制(Line 2017-2025)。这会让决策链计数与真实熔断状态不一致。

建议修复
 const statusCode = proxyError.statusCode;
 const willRetry = attemptCount < maxAttemptsPerProvider;
 const shouldRecordProviderCircuitFailure = statusCode === 524;
+const isProbeRequest = session.isProbeRequest();
+const willRecordProviderCircuitFailure =
+  shouldRecordProviderCircuitFailure &&
+  shouldAccountCircuitBreaker &&
+  !isProbeRequest &&
+  !willRetry;
 ...
-  circuitFailureCount:
-    health.failureCount + (shouldRecordProviderCircuitFailure ? 1 : 0),
+  circuitFailureCount:
+    health.failureCount + (willRecordProviderCircuitFailure ? 1 : 0),
 ...
- if (session.isProbeRequest()) {
+ if (isProbeRequest) {
   logger.debug("ProxyForwarder: Probe request error, skipping circuit breaker", {
     providerId: currentProvider.id,
     providerName: currentProvider.name,
     messagesCount: session.getMessagesLength(),
   });
 } else {
   if (shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker) {
     await recordFailure(currentProvider.id, lastError);
   }
 }

Also applies to: 2017-2025

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1974 - 1975, The interim
increment of circuitFailureCount (using health.failureCount +
(shouldRecordProviderCircuitFailure ? 1 : 0)) is applied even when
recordFailure() may not actually run, causing mismatch with the real circuit
state; change the logic so circuitFailureCount is only incremented under the
exact same conditions used when calling recordFailure() — i.e., evaluate probe
status, shouldAccountCircuitBreaker, retry exhaustion, and
shouldRecordProviderCircuitFailure together (the same predicates used around
recordFailure()) or simply defer computing/setting circuitFailureCount until
after recordFailure() actually executes; update references in the forwarder code
where circuitFailureCount, shouldRecordProviderCircuitFailure,
shouldAccountCircuitBreaker, probe, and recordFailure() are used so the
displayed count always matches real writes.

Comment on lines +3782 to 3784
if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) {
await recordFailure(attempt.provider.id, error);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Hedge 分支缺少熔断记账开关保护。

Line 3782-3784 在 524 时直接 recordFailure();但普通分支会受 allowCircuitBreakerAccounting 与 probe 请求保护。这里会导致同一策略在 hedge 路径下被绕过。

建议修复
 // in sendStreamingWithHedge(...) near function start
+const shouldAccountCircuitBreaker =
+  ProxyForwarder.getEndpointPolicy(session).allowCircuitBreakerAccounting;
+const isProbeRequest = session.isProbeRequest();

 // ...
-if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) {
+if (
+  errorCategory === ErrorCategory.PROVIDER_ERROR &&
+  statusCode === 524 &&
+  shouldAccountCircuitBreaker &&
+  !isProbeRequest
+) {
   await recordFailure(attempt.provider.id, error);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 3782 - 3784, The hedge
branch calls recordFailure(attempt.provider.id, error) on a 524 provider error
without honoring the same circuit-breaker accounting guards used elsewhere;
update the hedge-path check to only call recordFailure when
allowCircuitBreakerAccounting is enabled and the request is not a probe (mirror
the ordinary branch's conditions), i.e. wrap the existing if (errorCategory ===
ErrorCategory.PROVIDER_ERROR && statusCode === 524) { recordFailure(...) } with
the same allowCircuitBreakerAccounting and probe-request guards (using the
existing probe flag/check and allowCircuitBreakerAccounting variable) so hedge
failures are accounted consistently.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 668daf6d2f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread .env.example
# - 默认关闭:适用于网络不稳定环境(如使用代理),避免因临时网络抖动触发熔断器
# - 启用:适用于网络稳定环境,连续网络错误也应触发熔断保护,避免持续请求不可达的供应商
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=false
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Correct circuit-breaker env comments after default flip

This change sets ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true, but the surrounding .env.example guidance still describes false as the default and says provider HTTP errors are counted when disabled/enabled. After this commit, provider 4xx/5xx are generally not counted (except timeout-specific handling), so operators who rely on this example can deploy with an incorrect mental model and mis-tune production circuit-breaker behavior.

Useful? React with 👍 / 👎.

@ding113
Copy link
Copy Markdown
Owner

ding113 commented May 13, 2026

您好,这一决策方式暂不打算在当前项目中做改动。后续重构版本中计划提供相应开关或配置接口,本 PR 将被关闭。

@ding113 ding113 closed this May 13, 2026
@github-project-automation github-project-automation Bot moved this from Backlog to Done in Claude Code Hub Roadmap May 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:provider bug Something isn't working size/M Medium PR (< 500 lines)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants