fix(proxy): limit provider circuit breaker failures#1134
Conversation
📝 WalkthroughWalkthrough将 Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refines the circuit breaker logic so that only connectivity failures and timeouts contribute to a provider's circuit breaker state, while application-level errors trigger fallbacks without penalizing overall health. It also updates the default configuration to enable circuit breakers on network errors. Feedback points out an inconsistency in the streaming hedge path regarding network error handling and endpoint policies, and recommends making failure recording non-blocking to improve performance.
| if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) { | ||
| await recordFailure(attempt.provider.id, error); | ||
| } |
There was a problem hiding this comment.
The circuit breaker logic in the streaming hedge path is inconsistent with the non-hedge path and the PR description. Connectivity failures (SYSTEM_ERROR) should also affect the provider circuit state when ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is enabled. Additionally, the allowCircuitBreakerAccounting endpoint policy should be respected. Note: In performance-sensitive paths like this, the recordFailure I/O operation should be fire-and-forget to avoid blocking the main logic.
const shouldAccountCircuitBreaker = session.getEndpointPolicy().allowCircuitBreakerAccounting;
const shouldRecordProviderCircuitFailure =
(errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) ||
(errorCategory === ErrorCategory.SYSTEM_ERROR && getEnvConfig().ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS);
if (shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker) {
recordFailure(attempt.provider.id, error);
}References
- In performance-sensitive code paths like provider failover, non-critical I/O operations (e.g., releasing a session in Redis) should be executed as fire-and-forget tasks to avoid blocking the main logic.
| # 使用场景: | ||
| # - 默认关闭:适用于网络不稳定环境(如使用代理),避免因临时网络抖动触发熔断器 | ||
| # - 启用:适用于网络稳定环境,连续网络错误也应触发熔断保护,避免持续请求不可达的供应商 | ||
| ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=false | ||
| ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true |
There was a problem hiding this comment.
Stale "默认关闭" comment after default change
The comment block still reads - 默认关闭:适用于网络不稳定环境… (disabled by default: for unstable networks), but the default was just flipped to true. Operators who set up a fresh deployment will read "disabled by default" and assume false is the initial state, leading to misconfigured circuit breakers.
| # 使用场景: | |
| # - 默认关闭:适用于网络不稳定环境(如使用代理),避免因临时网络抖动触发熔断器 | |
| # - 启用:适用于网络稳定环境,连续网络错误也应触发熔断保护,避免持续请求不可达的供应商 | |
| ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=false | |
| ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true | |
| # 使用场景: | |
| # - 关闭:适用于网络不稳定环境(如使用代理),避免因临时网络抖动触发熔断器 | |
| # - 启用(默认):适用于网络稳定环境,连续网络错误也应触发熔断保护,避免持续请求不可达的供应商 | |
| ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true |
Prompt To Fix With AI
This is a comment left during a code review.
Path: .env.example
Line: 94-97
Comment:
**Stale "默认关闭" comment after default change**
The comment block still reads `- 默认关闭:适用于网络不稳定环境…` (disabled by default: for unstable networks), but the default was just flipped to `true`. Operators who set up a fresh deployment will read "disabled by default" and assume `false` is the initial state, leading to misconfigured circuit breakers.
```suggestion
# 使用场景:
# - 关闭:适用于网络不稳定环境(如使用代理),避免因临时网络抖动触发熔断器
# - 启用(默认):适用于网络稳定环境,连续网络错误也应触发熔断保护,避免持续请求不可达的供应商
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true
```
How can I resolve this? If you propose a fix, please make it concise.| if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) { | ||
| await recordFailure(attempt.provider.id, error); | ||
| } |
There was a problem hiding this comment.
Concurrent mode skips
shouldAccountCircuitBreaker guard
The sequential path gates its 524-failure recording on shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker (line ~2024), but the concurrent/hedging path at this line does not check shouldAccountCircuitBreaker (endpointPolicy.allowCircuitBreakerAccounting). This means a 524 error in a hedge attempt will always record a circuit-breaker failure, even for endpoints whose policy explicitly opts out of circuit-breaker accounting.
This asymmetry pre-dates this PR, but since this PR now narrows the trigger to only 524, the missing guard becomes more visible. Consider adding the same guard for consistency:
if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524 && shouldAccountCircuitBreaker) {
await recordFailure(attempt.provider.id, error);
}Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 3782-3784
Comment:
**Concurrent mode skips `shouldAccountCircuitBreaker` guard**
The sequential path gates its 524-failure recording on `shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker` (line ~2024), but the concurrent/hedging path at this line does not check `shouldAccountCircuitBreaker` (`endpointPolicy.allowCircuitBreakerAccounting`). This means a 524 error in a hedge attempt will always record a circuit-breaker failure, even for endpoints whose policy explicitly opts out of circuit-breaker accounting.
This asymmetry pre-dates this PR, but since this PR now narrows the trigger to only `524`, the missing guard becomes more visible. Consider adding the same guard for consistency:
```typescript
if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524 && shouldAccountCircuitBreaker) {
await recordFailure(attempt.provider.id, error);
}
```
How can I resolve this? If you propose a fix, please make it concise.| | `SESSION_TTL` | `300` | Session cache window (seconds) that drives vendor reuse. | | ||
| | `ENABLE_SECURE_COOKIES` | `true` | Browsers require HTTPS for Secure cookies; set to `false` when serving plain HTTP outside localhost. | | ||
| | `ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS` | `false` | When `true`, network errors also trip the circuit breaker for quicker isolation. | | ||
| | `ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS` | `true` | When `true`, only connectivity failures and timeouts affect provider circuit breaker state. | |
There was a problem hiding this comment.
Description ambiguous when variable is
false
The new description "When true, only connectivity failures and timeouts affect provider circuit breaker state" implies that when false, other things also affect provider circuit state. In practice, HTTP 524 still triggers recordFailure regardless of this variable — it purely controls whether network-layer errors are additionally included.
Consider wording like: "When true, network-layer errors (DNS, connection refused, etc.) also trip the provider circuit breaker in addition to 524 timeouts."
Prompt To Fix With AI
This is a comment left during a code review.
Path: README.en.md
Line: 347
Comment:
**Description ambiguous when variable is `false`**
The new description "When `true`, only connectivity failures and timeouts affect provider circuit breaker state" implies that when `false`, other things also affect provider circuit state. In practice, HTTP 524 still triggers `recordFailure` regardless of this variable — it purely controls whether *network-layer* errors are additionally included.
Consider wording like: "When `true`, network-layer errors (DNS, connection refused, etc.) also trip the provider circuit breaker in addition to 524 timeouts."
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
.env.example (1)
90-97:⚠️ Potential issue | 🟡 Minor熔断器注释与当前行为不一致,易误导运维配置。
Line 97 已改为默认
true,但 Line 92-96 仍写“false 为默认、true 计入所有错误”。这与本 PR“仅连通性失败/超时类问题影响 provider 熔断状态”的语义冲突。建议同步更新注释文案
# 熔断器配置 -# 功能说明:控制网络错误是否计入熔断器失败计数 -# - false (默认):网络错误(DNS 解析失败、连接超时、代理连接失败等)不计入熔断器,仅供应商错误(4xx/5xx HTTP 响应)计入 -# - true:所有错误(包括网络错误)都计入熔断器失败计数 +# 功能说明:控制“连通性失败/超时类错误”是否影响 provider 熔断状态 +# - true (默认):连接失败、连接超时、超时风格错误(如 524)会计入 provider 熔断 +# - false:上述网络连通性错误不计入 provider 熔断(仍保留重试/切换逻辑) ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.env.example around lines 90 - 97, The comment for ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is inconsistent with the new default and PR behavior; update the explanatory text so the default value shown (true) matches the described behavior and clearly define what true vs false do given the PR's change (specifically that, under the current implementation, only connectivity failures/timeouts affect the provider circuit state or explain that true/false now invert behavior as implemented). Edit the block mentioning "false (默认)" / "true:所有错误计入熔断器失败计数" and rewrite it to explicitly state the current default (true), what each setting does, and that only connectivity/time‑out style network errors are considered for provider circuit status per this PR; ensure ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is referenced so maintainers can locate the setting.tests/unit/proxy/response-handler-non200.test.ts (1)
269-367:⚠️ Potential issue | 🟠 Major这些 non-200 用例未走真实处理链路,当前断言无法证明行为变更。
在 Line 270-367,测试只做本地分支和
session.addProviderToChain(...),并没有调用ProxyResponseHandler(或 forwarder)实际逻辑,所以 Line 291/319/343/367 的not.toHaveBeenCalled()基本是平凡成立,回归保护不足。建议改为真实调用路径(示例)
+import { ProxyResponseHandler } from "@/app/v1/_lib/proxy/response-handler"; ... describe("handleNonStream with non-200 status code", () => { it("should not record provider circuit breaker failure for 500 status", async () => { const session = createSession({ provider: mockProvider, messageContext: mockMessageContext, }); - - const statusCode = 500; - const responseText = '{"error":"internal error"}'; - - if (statusCode >= 400) { - const detected = detectUpstreamErrorFromSseOrJsonText(responseText); - const errorMessageForDb = detected.isError ? detected.code : `HTTP ${statusCode}`; - - session.addProviderToChain(mockProvider, { - reason: "retry_failed", - attemptNumber: 1, - statusCode: statusCode, - errorMessage: errorMessageForDb, - }); - } + const response = new Response('{"error":"internal error"}', { + status: 500, + headers: { "content-type": "application/json" }, + }); + + await ProxyResponseHandler.dispatch(session, response); + await Promise.all(asyncTasks.splice(0, asyncTasks.length)); expect(mockRecordFailure).not.toHaveBeenCalled(); + expect( + session + .getProviderChain() + .some((item) => item.reason === "retry_failed" && item.statusCode === 500), + ).toBe(true); }); });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/proxy/response-handler-non200.test.ts` around lines 269 - 367, Tests are only exercising local branching and calling session.addProviderToChain directly, so mockRecordFailure assertions are vacuous; instead invoke the real handling path (e.g., ProxyResponseHandler.handleNonStream or the forwarder method that processes an upstream Response) with a constructed Response-like object containing status and body so the handler calls detectUpstreamErrorFromSseOrJsonText and then session.addProviderToChain/recordProviderFailure; update each test to call that handler with the mocked provider and response (using the same statusCode and responseText), assert mockRecordFailure called or not accordingly, and keep references to detectUpstreamErrorFromSseOrJsonText, session.addProviderToChain, ProxyResponseHandler.handleNonStream (or the forwarder method) and mockRecordFailure to find the right code to change.src/app/v1/_lib/proxy/response-handler.ts (2)
926-938:⚠️ Potential issue | 🟡 Minor404 在这两个 non-stream 分支里也需要保留
resource_not_found。这个 PR 去掉了这里的 provider breaker 记账后,决策链已经变成这条路径里唯一的失败归因。但当前把所有
>= 400都写成retry_failed,会把 404 和普通上游失败混在一起,和本文件的流式分支、categorizeErrorAsync()的 404 特判不一致。建议修改
if (statusCode >= 400) { const detected = detectUpstreamErrorFromSseOrJsonText(responseText); errorMessageForFinalize = detected.isError ? detected.code : `HTTP ${statusCode}`; + const chainReason = + statusCode === 404 ? "resource_not_found" : "retry_failed"; // 记录到决策链 session.addProviderToChain(provider, { - reason: "retry_failed", + reason: chainReason, attemptNumber: 1, statusCode: statusCode, errorMessage: errorMessageForFinalize, }); }if (statusCode >= 400) { const detected = detectUpstreamErrorFromSseOrJsonText(responseText); const errorMessageForDb = detected.isError ? detected.code : `HTTP ${statusCode}`; + const chainReason = + statusCode === 404 ? "resource_not_found" : "retry_failed"; // 记录到决策链 session.addProviderToChain(provider, { - reason: "retry_failed", + reason: chainReason, attemptNumber: 1, statusCode: statusCode, errorMessage: errorMessageForDb, }); }Also applies to: 1305-1316
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/response-handler.ts` around lines 926 - 938, The non-stream error handling currently logs every >=400 as reason "retry_failed", which incorrectly merges 404s with other failures; update the session.addProviderToChain call in response-handler.ts (the block using detectUpstreamErrorFromSseOrJsonText, variables provider/statusCode/detected/errorMessageForFinalize) so that if the upstream represents a resource-not-found (statusCode === 404 or detected.isError && detected.code === 'resource_not_found') you set reason to "resource_not_found", otherwise keep "retry_failed"; apply the same conditional fix to the other non-stream branch that mirrors this logic (the block around lines 1305-1316).
594-608:⚠️ Potential issue | 🟡 Minor客户端中断这里不应记成
system_error。当
clientAborted === true时,这里仍然固定写入reason: "system_error"。后续updateMessageRequestDetails()会把 provider chain 原样持久化,所以用户主动取消会被审计/监控误记成上游故障。建议修改
session.addProviderToChain(providerForChain, { endpointId: meta.endpointId, endpointUrl: meta.endpointUrl, - reason: "system_error", + reason: clientAborted ? "client_abort" : "system_error", attemptNumber: meta.attemptNumber, statusCode: effectiveStatusCode, errorMessage: errorMessage ?? undefined, });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/response-handler.ts` around lines 594 - 608, The code currently records provider chain entries with reason: "system_error" even when the client aborted; update the block that runs when !streamEndedNormally (around streamEndedNormally, clearSessionBinding, and session.addProviderToChain) to detect clientAborted (or equivalent flag) and set reason to "client_aborted" (or a distinct non-system reason) instead of "system_error"; ensure session.addProviderToChain receives the adjusted reason, leaving other fields (endpointId, endpointUrl, attemptNumber, statusCode, errorMessage) unchanged so updateMessageRequestDetails() persists the correct cause.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.en.md`:
- Line 347: The README's descriptions for
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS are inconsistent: one place says "only
connectivity failures and timeouts affect provider circuit breaker state" while
another still states "4xx/5xx or network errors will trigger"—pick the intended
behavior and make both descriptions match; update the other occurrence so the
text referencing ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS and any FAQ entry use
the same wording (either "only connectivity failures and timeouts" or "4xx/5xx
and network errors") and ensure the FAQ and the table entry convey the identical
policy.
In `@README.md`:
- Line 360: The README has inconsistent descriptions: the config table for
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS says only connection failures/timeouts
count toward provider circuit-breaking, but the FAQ still states "4xx/5xx or
network errors"—please reconcile by picking one authoritative rule and making
both places consistent; specifically update the FAQ text that mentions "4xx/5xx"
to match the table (or vice versa) so both state clearly whether 4xx/5xx
responses are included, and ensure the ENV name
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS and its explanatory Chinese wording
consistently describe the exact trigger set (e.g., "仅将连接失败、超时等网络不可达错误计入熔断" or "将
4xx/5xx 与网络错误一并计入熔断") across README.
---
Outside diff comments:
In @.env.example:
- Around line 90-97: The comment for ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is
inconsistent with the new default and PR behavior; update the explanatory text
so the default value shown (true) matches the described behavior and clearly
define what true vs false do given the PR's change (specifically that, under the
current implementation, only connectivity failures/timeouts affect the provider
circuit state or explain that true/false now invert behavior as implemented).
Edit the block mentioning "false (默认)" / "true:所有错误计入熔断器失败计数" and rewrite it to
explicitly state the current default (true), what each setting does, and that
only connectivity/time‑out style network errors are considered for provider
circuit status per this PR; ensure ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is
referenced so maintainers can locate the setting.
In `@src/app/v1/_lib/proxy/response-handler.ts`:
- Around line 926-938: The non-stream error handling currently logs every >=400
as reason "retry_failed", which incorrectly merges 404s with other failures;
update the session.addProviderToChain call in response-handler.ts (the block
using detectUpstreamErrorFromSseOrJsonText, variables
provider/statusCode/detected/errorMessageForFinalize) so that if the upstream
represents a resource-not-found (statusCode === 404 or detected.isError &&
detected.code === 'resource_not_found') you set reason to "resource_not_found",
otherwise keep "retry_failed"; apply the same conditional fix to the other
non-stream branch that mirrors this logic (the block around lines 1305-1316).
- Around line 594-608: The code currently records provider chain entries with
reason: "system_error" even when the client aborted; update the block that runs
when !streamEndedNormally (around streamEndedNormally, clearSessionBinding, and
session.addProviderToChain) to detect clientAborted (or equivalent flag) and set
reason to "client_aborted" (or a distinct non-system reason) instead of
"system_error"; ensure session.addProviderToChain receives the adjusted reason,
leaving other fields (endpointId, endpointUrl, attemptNumber, statusCode,
errorMessage) unchanged so updateMessageRequestDetails() persists the correct
cause.
In `@tests/unit/proxy/response-handler-non200.test.ts`:
- Around line 269-367: Tests are only exercising local branching and calling
session.addProviderToChain directly, so mockRecordFailure assertions are
vacuous; instead invoke the real handling path (e.g.,
ProxyResponseHandler.handleNonStream or the forwarder method that processes an
upstream Response) with a constructed Response-like object containing status and
body so the handler calls detectUpstreamErrorFromSseOrJsonText and then
session.addProviderToChain/recordProviderFailure; update each test to call that
handler with the mocked provider and response (using the same statusCode and
responseText), assert mockRecordFailure called or not accordingly, and keep
references to detectUpstreamErrorFromSseOrJsonText, session.addProviderToChain,
ProxyResponseHandler.handleNonStream (or the forwarder method) and
mockRecordFailure to find the right code to change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5ef302ec-9adf-483e-8862-d56cdbfef7b4
📒 Files selected for processing (12)
.env.exampleREADME.en.mdREADME.mdsrc/app/v1/_lib/proxy/errors.tssrc/app/v1/_lib/proxy/forwarder.tssrc/app/v1/_lib/proxy/response-handler.tssrc/app/v1/_lib/proxy/session.tssrc/lib/config/env.schema.tstests/unit/proxy/client-abort-vs-upstream-499.test.tstests/unit/proxy/proxy-forwarder-fake-200-html.test.tstests/unit/proxy/response-handler-endpoint-circuit-isolation.test.tstests/unit/proxy/response-handler-non200.test.ts
There was a problem hiding this comment.
Code Review Summary
This PR changes the circuit breaker failure recording policy so that only HTTP 524 (timeout) errors affect provider circuit state, removing circuit breaker recording for upstream 4xx/5xx responses, fake-200 errors, empty responses, and stream aborts. The behavioral change is consistent across all code paths (forwarder main loop, hedge path, response handler). The test suite is updated to match, but has a coverage gap for the new 524-specific logic.
PR Size: M
- Lines changed: 258
- Files changed: 12
Issues Found
| Category | Critical | High | Medium | Low |
|---|---|---|---|---|
| Logic/Bugs | 0 | 0 | 0 | 0 |
| Security | 0 | 0 | 0 | 0 |
| Error Handling | 0 | 0 | 0 | 0 |
| Types | 0 | 0 | 0 | 0 |
| Comments/Docs | 0 | 0 | 1 | 0 |
| Tests | 0 | 1 | 0 | 0 |
| Simplification | 0 | 0 | 0 | 0 |
High Priority Issues (Should Fix)
- [TEST-MISSING-CRITICAL] No test verifies that 524 status codes trigger
recordFailure(src/app/v1/_lib/proxy/forwarder.ts:1890). TheshouldRecordProviderCircuitFailure = statusCode === 524conditional is the sole remaining path for provider errors to affect circuit breaker state. Existing tests only verify the negative case (non-524 does not record). A test for the positive case is needed to prevent silent regression. The hedge path (forwarder.ts~line 3782) has the same gap.
Medium Priority Issues
- [COMMENT-OUTDATED]
.env.examplelines 91-96 contain stale comments after this PR's changes. The comments state "默认关闭" (default off) and describe the old semantics wherefalsemeans "only 4xx/5xx count" andtruemeans "all errors count." After this PR, 4xx/5xx no longer count towards the circuit breaker regardless of this flag, making both descriptions inaccurate. The.env.examplecomments should be updated to match the new behavior.
Review Coverage
- Logic and correctness
- Security (OWASP Top 10)
- Error handling
- Type safety
- Documentation accuracy
- Test coverage
- Code clarity
Automated review by Claude AI
| @@ -1892,6 +1887,7 @@ export class ProxyForwarder { | |||
| const proxyError = lastError as ProxyError; | |||
There was a problem hiding this comment.
[High] [TEST-MISSING-CRITICAL] No test verifies that 524 status codes still trigger recordFailure
Why this is a problem: This PR introduces a critical behavioral gate -- only HTTP 524 (timeout) errors now trigger provider circuit breaker recording in the PROVIDER_ERROR path. The existing tests were updated to assert recordFailure.not.toHaveBeenCalled() for non-524 errors, but no test verifies the positive case: that a 524 error does call recordFailure. This is the sole remaining path for provider errors to affect circuit breaker state. If this conditional breaks, the circuit breaker becomes silently non-functional for all provider errors.
The hedge path (~line 3782, changed from statusCode \!== 404 to statusCode === 524) has the same gap.
Per CLAUDE.md: "All new features must have unit test coverage of at least 80%"
Suggested fix -- add a test in tests/unit/proxy/ that:
// Verify 524 timeout still triggers circuit breaker recording
it("should record circuit breaker failure for 524 timeout", async () => {
// Setup: mock provider that throws ProxyError with statusCode 524
doForward.mockRejectedValueOnce(
new ProxyError("provider timeout", 524, { /* ... */ })
);
await ProxyForwarder.send(session);
expect(mocks.recordFailure).toHaveBeenCalledWith(
expect.any(Number),
expect.objectContaining({ message: expect.stringContaining("timeout") })
);
});01b8ace to
668daf6
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/app/v1/_lib/proxy/response-handler.ts (1)
594-610:⚠️ Potential issue | 🟡 Minor把客户端主动中断单独标成
client_abort。当前
!streamEndedNormally分支无论是客户端断开还是上游/网络异常,都会写成system_error。这样会把用户主动取消和真正的供应商故障混在一起,降低后续排障与统计精度。addProviderToChain已经支持client_abort,建议按clientAborted分开记录。建议修正
if (!streamEndedNormally) { await clearSessionBinding(); + const chainReason = clientAborted ? "client_abort" : "system_error"; session.addProviderToChain(providerForChain, { endpointId: meta.endpointId, endpointUrl: meta.endpointUrl, - reason: "system_error", + reason: chainReason, attemptNumber: meta.attemptNumber, statusCode: effectiveStatusCode, errorMessage: errorMessage ?? undefined, });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/response-handler.ts` around lines 594 - 610, The branch handling !streamEndedNormally currently records every abort as "system_error"; change it to distinguish client-initiated disconnects by checking the clientAborted flag (or equivalent boolean that indicates the client closed the connection) before calling session.addProviderToChain; if clientAborted is true, pass reason: "client_abort", otherwise keep reason: "system_error". Keep the surrounding logic (await clearSessionBinding(), use providerForChain, meta.attemptNumber, effectiveStatusCode, errorMessage, and return values) unchanged.tests/unit/proxy/response-handler-non200.test.ts (1)
269-391:⚠️ Potential issue | 🟠 Major这组测试没有真正覆盖
response-handler.ts。当前用例只是手动调用
session.addProviderToChain(...),并没有驱动ProxyResponseHandler的非 200 分支;因此即使处理逻辑回退,这些断言也会继续通过。建议改成走真实 handler,再断言 provider chain / errorMessage /recordFailure不被触发。如果你愿意,我可以帮你把这组测试改回真实的 handler 路径。
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/proxy/response-handler-non200.test.ts` around lines 269 - 391, The tests are directly calling session.addProviderToChain instead of exercising the real handler; replace the manual adds with invoking the actual ProxyResponseHandler method that processes non-200 responses (e.g., ProxyResponseHandler.handleNonStream or the method used in tests to handle responses) using a mocked response with the target statusCode and responseText so the handler populates the session provider chain and potentially calls recordFailure; assert on session.getProviderChain(), the added entry's errorMessage/reason, and that mockRecordFailure was or was not called accordingly, removing the manual session.addProviderToChain calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 1974-1975: The interim increment of circuitFailureCount (using
health.failureCount + (shouldRecordProviderCircuitFailure ? 1 : 0)) is applied
even when recordFailure() may not actually run, causing mismatch with the real
circuit state; change the logic so circuitFailureCount is only incremented under
the exact same conditions used when calling recordFailure() — i.e., evaluate
probe status, shouldAccountCircuitBreaker, retry exhaustion, and
shouldRecordProviderCircuitFailure together (the same predicates used around
recordFailure()) or simply defer computing/setting circuitFailureCount until
after recordFailure() actually executes; update references in the forwarder code
where circuitFailureCount, shouldRecordProviderCircuitFailure,
shouldAccountCircuitBreaker, probe, and recordFailure() are used so the
displayed count always matches real writes.
- Line 1819: Remove emoji characters from the inline comments in
src/app/v1/_lib/proxy/forwarder.ts — specifically replace the comment containing
"⭐ 6. 供应商错误处理(所有 4xx/5xx HTTP 错误 + 空响应错误,重试耗尽后切换)" and the comments at the
nearby block around lines 2015-2016 with equivalent plain-text comments (e.g.,
remove "⭐" and any other emoji), and scan the forwarder.ts file for any other
emoji in comments or string literals and convert them to plain text to comply
with the repository rule.
- Around line 3782-3784: The hedge branch calls
recordFailure(attempt.provider.id, error) on a 524 provider error without
honoring the same circuit-breaker accounting guards used elsewhere; update the
hedge-path check to only call recordFailure when allowCircuitBreakerAccounting
is enabled and the request is not a probe (mirror the ordinary branch's
conditions), i.e. wrap the existing if (errorCategory ===
ErrorCategory.PROVIDER_ERROR && statusCode === 524) { recordFailure(...) } with
the same allowCircuitBreakerAccounting and probe-request guards (using the
existing probe flag/check and allowCircuitBreakerAccounting variable) so hedge
failures are accounted consistently.
---
Outside diff comments:
In `@src/app/v1/_lib/proxy/response-handler.ts`:
- Around line 594-610: The branch handling !streamEndedNormally currently
records every abort as "system_error"; change it to distinguish client-initiated
disconnects by checking the clientAborted flag (or equivalent boolean that
indicates the client closed the connection) before calling
session.addProviderToChain; if clientAborted is true, pass reason:
"client_abort", otherwise keep reason: "system_error". Keep the surrounding
logic (await clearSessionBinding(), use providerForChain, meta.attemptNumber,
effectiveStatusCode, errorMessage, and return values) unchanged.
In `@tests/unit/proxy/response-handler-non200.test.ts`:
- Around line 269-391: The tests are directly calling session.addProviderToChain
instead of exercising the real handler; replace the manual adds with invoking
the actual ProxyResponseHandler method that processes non-200 responses (e.g.,
ProxyResponseHandler.handleNonStream or the method used in tests to handle
responses) using a mocked response with the target statusCode and responseText
so the handler populates the session provider chain and potentially calls
recordFailure; assert on session.getProviderChain(), the added entry's
errorMessage/reason, and that mockRecordFailure was or was not called
accordingly, removing the manual session.addProviderToChain calls.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 16478241-0290-465a-971e-cc551af90d8c
📒 Files selected for processing (12)
.env.exampleREADME.en.mdREADME.mdsrc/app/v1/_lib/proxy/errors.tssrc/app/v1/_lib/proxy/forwarder.tssrc/app/v1/_lib/proxy/response-handler.tssrc/app/v1/_lib/proxy/session.tssrc/lib/config/env.schema.tstests/unit/proxy/client-abort-vs-upstream-499.test.tstests/unit/proxy/proxy-forwarder-fake-200-html.test.tstests/unit/proxy/response-handler-endpoint-circuit-isolation.test.tstests/unit/proxy/response-handler-non200.test.ts
✅ Files skipped from review due to trivial changes (4)
- tests/unit/proxy/client-abort-vs-upstream-499.test.ts
- README.md
- src/app/v1/_lib/proxy/session.ts
- src/app/v1/_lib/proxy/errors.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- src/lib/config/env.schema.ts
- .env.example
- README.en.md
| } | ||
|
|
||
| // ⭐ 6. 供应商错误处理(所有 4xx/5xx HTTP 错误 + 空响应错误,计入熔断器,重试耗尽后切换) | ||
| // ⭐ 6. 供应商错误处理(所有 4xx/5xx HTTP 错误 + 空响应错误,重试耗尽后切换) |
There was a problem hiding this comment.
请移除新增注释中的 emoji 字符。
Line 1819、Line 2015 的注释含有 emoji,不符合仓库规则,建议改成纯文本注释。
As per coding guidelines "**/*.{ts,tsx,js,jsx,json,css,scss,sql,py,go,sh,bash}: Never use emoji characters in any code, comments, or string literals".
Also applies to: 2015-2016
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/app/v1/_lib/proxy/forwarder.ts` at line 1819, Remove emoji characters
from the inline comments in src/app/v1/_lib/proxy/forwarder.ts — specifically
replace the comment containing "⭐ 6. 供应商错误处理(所有 4xx/5xx HTTP 错误 +
空响应错误,重试耗尽后切换)" and the comments at the nearby block around lines 2015-2016 with
equivalent plain-text comments (e.g., remove "⭐" and any other emoji), and scan
the forwarder.ts file for any other emoji in comments or string literals and
convert them to plain text to comply with the repository rule.
| circuitFailureCount: | ||
| health.failureCount + (shouldRecordProviderCircuitFailure ? 1 : 0), |
There was a problem hiding this comment.
circuitFailureCount 会在未实际 recordFailure() 时被提前加 1。
Line 1974-1975 只按 statusCode===524 预加计数,但真正写入还受 probe、shouldAccountCircuitBreaker、以及是否重试耗尽控制(Line 2017-2025)。这会让决策链计数与真实熔断状态不一致。
建议修复
const statusCode = proxyError.statusCode;
const willRetry = attemptCount < maxAttemptsPerProvider;
const shouldRecordProviderCircuitFailure = statusCode === 524;
+const isProbeRequest = session.isProbeRequest();
+const willRecordProviderCircuitFailure =
+ shouldRecordProviderCircuitFailure &&
+ shouldAccountCircuitBreaker &&
+ !isProbeRequest &&
+ !willRetry;
...
- circuitFailureCount:
- health.failureCount + (shouldRecordProviderCircuitFailure ? 1 : 0),
+ circuitFailureCount:
+ health.failureCount + (willRecordProviderCircuitFailure ? 1 : 0),
...
- if (session.isProbeRequest()) {
+ if (isProbeRequest) {
logger.debug("ProxyForwarder: Probe request error, skipping circuit breaker", {
providerId: currentProvider.id,
providerName: currentProvider.name,
messagesCount: session.getMessagesLength(),
});
} else {
if (shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker) {
await recordFailure(currentProvider.id, lastError);
}
}Also applies to: 2017-2025
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1974 - 1975, The interim
increment of circuitFailureCount (using health.failureCount +
(shouldRecordProviderCircuitFailure ? 1 : 0)) is applied even when
recordFailure() may not actually run, causing mismatch with the real circuit
state; change the logic so circuitFailureCount is only incremented under the
exact same conditions used when calling recordFailure() — i.e., evaluate probe
status, shouldAccountCircuitBreaker, retry exhaustion, and
shouldRecordProviderCircuitFailure together (the same predicates used around
recordFailure()) or simply defer computing/setting circuitFailureCount until
after recordFailure() actually executes; update references in the forwarder code
where circuitFailureCount, shouldRecordProviderCircuitFailure,
shouldAccountCircuitBreaker, probe, and recordFailure() are used so the
displayed count always matches real writes.
| if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) { | ||
| await recordFailure(attempt.provider.id, error); | ||
| } |
There was a problem hiding this comment.
Hedge 分支缺少熔断记账开关保护。
Line 3782-3784 在 524 时直接 recordFailure();但普通分支会受 allowCircuitBreakerAccounting 与 probe 请求保护。这里会导致同一策略在 hedge 路径下被绕过。
建议修复
// in sendStreamingWithHedge(...) near function start
+const shouldAccountCircuitBreaker =
+ ProxyForwarder.getEndpointPolicy(session).allowCircuitBreakerAccounting;
+const isProbeRequest = session.isProbeRequest();
// ...
-if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) {
+if (
+ errorCategory === ErrorCategory.PROVIDER_ERROR &&
+ statusCode === 524 &&
+ shouldAccountCircuitBreaker &&
+ !isProbeRequest
+) {
await recordFailure(attempt.provider.id, error);
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 3782 - 3784, The hedge
branch calls recordFailure(attempt.provider.id, error) on a 524 provider error
without honoring the same circuit-breaker accounting guards used elsewhere;
update the hedge-path check to only call recordFailure when
allowCircuitBreakerAccounting is enabled and the request is not a probe (mirror
the ordinary branch's conditions), i.e. wrap the existing if (errorCategory ===
ErrorCategory.PROVIDER_ERROR && statusCode === 524) { recordFailure(...) } with
the same allowCircuitBreakerAccounting and probe-request guards (using the
existing probe flag/check and allowCircuitBreakerAccounting variable) so hedge
failures are accounted consistently.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 668daf6d2f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| # - 默认关闭:适用于网络不稳定环境(如使用代理),避免因临时网络抖动触发熔断器 | ||
| # - 启用:适用于网络稳定环境,连续网络错误也应触发熔断保护,避免持续请求不可达的供应商 | ||
| ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=false | ||
| ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true |
There was a problem hiding this comment.
Correct circuit-breaker env comments after default flip
This change sets ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true, but the surrounding .env.example guidance still describes false as the default and says provider HTTP errors are counted when disabled/enabled. After this commit, provider 4xx/5xx are generally not counted (except timeout-specific handling), so operators who rely on this example can deploy with an incorrect mental model and mis-tune production circuit-breaker behavior.
Useful? React with 👍 / 👎.
|
您好,这一决策方式暂不打算在当前项目中做改动。后续重构版本中计划提供相应开关或配置接口,本 PR 将被关闭。 |
Summary
Refines provider circuit breaker logic to distinguish between infrastructure failures (connectivity issues, timeouts) and request-level errors (HTTP 4xx/5xx, fake-200, empty responses). Only infrastructure failures now affect provider health scores, preventing healthy providers from being incorrectly penalized for transient or client-side issues.
Problem
Previously, the circuit breaker counted the following as provider failures:
This caused several operational issues:
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORSdefault wasfalse, meaning network errors weren't counted even when operators wanted strict health trackingRelated Issues:
Solution
Core Behavior Changes
HTTP 4xx/5xx responses no longer count as circuit breaker failures
Fake-200 responses (HTML error pages) excluded from circuit breaker
recordFailure()- now only logged for debuggingEmpty response errors excluded
Stream abort errors excluded
Network error config now defaults to
trueENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORSnow defaults totrueError Category Refinements
Updated
ErrorCategoryclassification inerrors.ts:CLIENT_ABORTNON_RETRYABLE_CLIENT_ERRORRESOURCE_NOT_FOUND(404)PROVIDER_ERROR(4xx/5xx)SYSTEM_ERRORChanges
Core Proxy Changes
src/app/v1/_lib/proxy/forwarder.ts- Only 524 status codes and network errors triggerrecordFailure()src/app/v1/_lib/proxy/response-handler.ts- Removed allrecordFailure()calls for non-200/fake-200/stream errorssrc/app/v1/_lib/proxy/errors.ts- Updated error category documentationsrc/app/v1/_lib/proxy/session.ts- Updated chain reason commentsConfiguration Changes
src/lib/config/env.schema.ts-ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORSdefault:false→true.env.example- Updated documentation for the new behaviorREADME.md/README.en.md- Clarified that only connectivity failures affect circuit stateTest Updates
tests/unit/proxy/client-abort-vs-upstream-499.test.ts- Updated expectationstests/unit/proxy/proxy-forwarder-fake-200-html.test.ts- RemovedrecordFailure()assertionstests/unit/proxy/response-handler-endpoint-circuit-isolation.test.ts- Verified no circuit recordingtests/unit/proxy/response-handler-non200.test.ts- Non-200 responses don't trigger failuresBreaking Changes
Behavior change for operators:
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS(default: false)Migration:
Testing
Automated Tests
bun run typecheck- passbun run test- All 4 updated test suites passclient-abort-vs-upstream-499.test.ts(16 tests)proxy-forwarder-fake-200-html.test.ts(7 tests)response-handler-endpoint-circuit-isolation.test.ts(12 tests)response-handler-non200.test.ts(14 tests)Manual Verification
provider_chainshowsretry_failedbut not circuit breaker penaltiesChecklist
devDescription enhanced by Claude AI
Greptile Summary
This PR refines the provider circuit breaker to count only infrastructure failures (524 Cloudflare timeouts and network-layer errors when
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true) rather than all 4xx/5xx HTTP responses, and flips that config default fromfalsetotrue. The coreforwarder.tsandresponse-handler.tschanges are clean and internally consistent; thecircuitFailureCountlog values are corrected to match the new non-recording behavior.Confidence Score: 5/5
Safe to merge; only P2 findings remain and they are pre-existing test quality issues.
All findings are P2 style/test concerns. The production logic change is straightforward and well-scoped. The meaningful test suites (forwarder fake-200 and endpoint circuit isolation) do exercise real code paths and correctly reflect the new behavior. The default-flip for ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is a documented intentional breaking change.
tests/unit/proxy/response-handler-non200.test.ts — the handleNonStream describe block tests are trivially passing and do not exercise the real handler.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[Upstream error received] --> B{Error category?} B -->|CLIENT_ABORT| C[No recordFailure\nNo retry] B -->|NON_RETRYABLE_CLIENT_ERROR| D[No recordFailure\nNo retry] B -->|RESOURCE_NOT_FOUND 404| E[No recordFailure\nSwitch provider] B -->|PROVIDER_ERROR\n4xx/5xx + EmptyResponse| F{statusCode == 524?} F -->|Yes\nCloudflare timeout| G[recordFailure ✓\nSwitch provider] F -->|No\n429/500/502/etc.| H[No recordFailure\nRetry/switch provider] B -->|SYSTEM_ERROR\nNetwork/DNS| I{ENABLE_CIRCUIT_BREAKER\n_ON_NETWORK_ERRORS?} I -->|true\nnew default| J[recordFailure ✓\nRetry once] I -->|false| K[No recordFailure\nRetry once]Prompt To Fix All With AI
Reviews (2): Last reviewed commit: "fix(proxy): limit provider circuit break..." | Re-trigger Greptile