fix(proxy): classify errors by rule category to preserve failover for service errors#1103
Conversation
… service errors Previously, isNonRetryableClientErrorAsync (and the sync companion) treated any matched error rule as NON_RETRYABLE_CLIENT_ERROR, regardless of the rule's category. This caused user-added rules with category like service_error or network_error to short-circuit failover/circuit-breaker logic, even though their underlying errors should be retried or routed to a different provider. Worse, regex patterns without word boundaries (e.g. "503|service.unavailable", "529|service.overloaded") matched 3-digit substrings inside upstream response bodies (request IDs, price amounts), causing INSUFFICIENT_BALANCE responses to be misclassified as service-overloaded and silenced. This change introduces an explicit category white-list for client-input errors that should genuinely bypass retry/failover. Only matches whose category falls within this set are classified as NON_RETRYABLE_CLIENT_ERROR; others fall back to the standard ProxyError-based classification (typically PROVIDER_ERROR or SYSTEM_ERROR), preserving failover and circuit-breaker behaviour. The white-list reflects the categories actually used by DEFAULT_ERROR_RULES: prompt_limit, input_limit, context_limit, token_limit, content_filter, validation_error, thinking_error, parameter_error, pdf_limit, media_limit, cache_limit, invalid_request, model_error, store_error. Reproduction context (CCH 0.7.2 production, observed 2026-04-25): - User added rule pattern "529|service.overloaded" with category=service_error - Upstream returned 403 INSUFFICIENT_BALANCE; response body contained substring "529" inside the price "0.352942" - Old logic: matched -> NON_RETRYABLE -> stop_immediately -> client received generic permission_error with no upstream detail - New logic: matched but category not white-listed -> falls back to PROVIDER_ERROR -> failover to next provider -> standard behaviour Tests: 10 new unit tests covering the white-list and regression cases for service_error / network_error / rate_limit categories.
📝 WalkthroughWalkthrough该变更修改了错误分类逻辑,使其仅当检测到的匹配规则的 Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| /** | ||
| * 暴露内部白名单常量供测试使用 | ||
| * | ||
| * 注意:这是只读副本,外部不应修改运行时白名单。 | ||
| */ | ||
| export function getNonRetryableClientErrorCategoriesForTesting(): ReadonlySet<string> { | ||
| return NON_RETRYABLE_CLIENT_ERROR_CATEGORIES; | ||
| } |
There was a problem hiding this comment.
Misleading "read-only copy" comment; actual object reference is returned
The JSDoc says "这是只读副本" (this is a read-only copy), but the function returns the live NON_RETRYABLE_CLIENT_ERROR_CATEGORIES object, not a copy. ReadonlySet<string> is a TypeScript compile-time constraint only — a test that casts the return value to Set<string> can call .add() or .delete() and mutate the runtime whitelist, causing subtle cross-test pollution.
Either return a true copy, or update the comment to clarify that ReadonlySet is the only guard in place:
// Option A: return a real copy so the comment is accurate
export function getNonRetryableClientErrorCategoriesForTesting(): ReadonlySet<string> {
return new Set(NON_RETRYABLE_CLIENT_ERROR_CATEGORIES);
}Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/errors.ts
Line: 715-722
Comment:
**Misleading "read-only copy" comment; actual object reference is returned**
The JSDoc says "这是只读副本" (this is a read-only copy), but the function returns the live `NON_RETRYABLE_CLIENT_ERROR_CATEGORIES` object, not a copy. `ReadonlySet<string>` is a TypeScript compile-time constraint only — a test that casts the return value to `Set<string>` can call `.add()` or `.delete()` and mutate the runtime whitelist, causing subtle cross-test pollution.
Either return a true copy, or update the comment to clarify that `ReadonlySet` is the only guard in place:
```ts
// Option A: return a real copy so the comment is accurate
export function getNonRetryableClientErrorCategoriesForTesting(): ReadonlySet<string> {
return new Set(NON_RETRYABLE_CLIENT_ERROR_CATEGORIES);
}
```
How can I resolve this? If you propose a fix, please make it concise.| const NON_RETRYABLE_CLIENT_ERROR_CATEGORIES = new Set<string>([ | ||
| "prompt_limit", | ||
| "input_limit", | ||
| "context_limit", | ||
| "token_limit", | ||
| "content_filter", | ||
| "validation_error", | ||
| "thinking_error", | ||
| "parameter_error", | ||
| "pdf_limit", | ||
| "media_limit", | ||
| "cache_limit", | ||
| "invalid_request", | ||
| "model_error", | ||
| "store_error", | ||
| ]); |
There was a problem hiding this comment.
Whitelist is a silent maintenance hazard when new default categories are added
The set is hard-coded and has no compile-time link to DEFAULT_ERROR_RULES. If a future default rule is added with a new category (e.g. "citation_limit") and the whitelist is not updated, that rule's errors will silently fall back to PROVIDER_ERROR/SYSTEM_ERROR instead of short-circuiting retries — the opposite of the current bug, but still incorrect.
A lightweight guard could be a unit test that asserts every category used by DEFAULT_ERROR_RULES is present in NON_RETRYABLE_CLIENT_ERROR_CATEGORIES, catching drift at CI time rather than in production.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/errors.ts
Line: 594-609
Comment:
**Whitelist is a silent maintenance hazard when new default categories are added**
The set is hard-coded and has no compile-time link to `DEFAULT_ERROR_RULES`. If a future default rule is added with a new category (e.g. `"citation_limit"`) and the whitelist is not updated, that rule's errors will silently fall back to `PROVIDER_ERROR`/`SYSTEM_ERROR` instead of short-circuiting retries — the opposite of the current bug, but still incorrect.
A lightweight guard could be a unit test that asserts every category used by `DEFAULT_ERROR_RULES` is present in `NON_RETRYABLE_CLIENT_ERROR_CATEGORIES`, catching drift at CI time rather than in production.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/app/v1/_lib/proxy/errors.ts (1)
594-609: 白名单与DEFAULT_ERROR_RULES存在静默漂移风险当前白名单是硬编码集合,PR 描述声明其与
DEFAULT_ERROR_RULES对齐,但二者并未在代码层面建立强约束。若后续向DEFAULT_ERROR_RULES新增/调整 category(例如新增一个真正的客户端输入错误 category),此处容易被遗漏,导致默认规则匹配后被错误地降级为PROVIDER_ERROR,复现本 PR 试图修复的反向问题。建议二选一:
- 从
DEFAULT_ERROR_RULES中派生白名单(取所有isDefault=true且语义为客户端输入错误的 category),并辅以注释说明排除项;- 增加一个轻量单测,断言
DEFAULT_ERROR_RULES中所有“客户端输入类” category 都包含于本白名单(或反向断言),以保证未来变更时 CI 主动告警。🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/errors.ts` around lines 594 - 609, The hardcoded NON_RETRYABLE_CLIENT_ERROR_CATEGORIES set can drift from DEFAULT_ERROR_RULES; update the implementation to derive the whitelist programmatically from DEFAULT_ERROR_RULES (e.g., filter entries where isDefault===true and semantically denote client/input errors) and document any explicit exclusions, or alternatively add a unit test that asserts all client-input categories in DEFAULT_ERROR_RULES are included in NON_RETRYABLE_CLIENT_ERROR_CATEGORIES (or the inverse) so CI fails on future mismatches; reference DEFAULT_ERROR_RULES and NON_RETRYABLE_CLIENT_ERROR_CATEGORIES to locate where to implement the derivation or add the test.tests/unit/proxy/error-category-aware-classification.test.ts (2)
344-362: 同步路径的覆盖建议增加“白名单命中 -> true”用例当前同步用例仅验证了
service_error(非白名单)在缓存命中时返回false的方向。建议追加一个对称用例:先用白名单 category(如validation_error)的规则预热缓存,断言isNonRetryableClientError返回true,避免未来若有人误改shouldTreatMatchedRuleAsNonRetryable在同步分支的语义而无回归保护。建议追加用例
test("sync version respects white-list when cache hit", async () => { // ...existing assertion (false direction)... }); + + test("sync version returns true when cached match has white-listed category", async () => { + await loadDetectorWithRules([ + buildRule({ + id: 41, + pattern: "validation-token", + matchType: "contains", + category: "validation_error", + }), + ]); + + const { isNonRetryableClientError, isNonRetryableClientErrorAsync } = await import( + "@/app/v1/_lib/proxy/errors" + ); + const error = new Error("upstream said: validation-token"); + await isNonRetryableClientErrorAsync(error); + expect(isNonRetryableClientError(error)).toBe(true); + });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/proxy/error-category-aware-classification.test.ts` around lines 344 - 362, Add a symmetric test that primes the cache via the async path with a white-list category rule and asserts the sync path returns true: use loadDetectorWithRules with buildRule({ id: X, pattern: "validation-token", matchType: "contains", category: "validation_error" }), call isNonRetryableClientErrorAsync(new Error("...validation-token")) to warm the cache, then assert isNonRetryableClientError(new Error("...validation-token")) === true; place this alongside the existing "sync version respects white-list when cache hit" test and reference the same functions isNonRetryableClientError and isNonRetryableClientErrorAsync to protect the sync branch semantics.
104-112:setTimeout(0)的意图建议在注释中说明或移除
await new Promise((resolve) => setTimeout(resolve, 0));紧跟在设置 mock 之后、reload()之前,意图不明:是等待某个微任务/订阅注册,还是历史遗留?若用于等待subscribeCacheInvalidation的异步副作用,请补一行注释说明;若实际并无作用,可直接移除以减少测试中的玄学等待。🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/proxy/error-category-aware-classification.test.ts` around lines 104 - 112, The zero-delay timer in loadDetectorWithRules (the await new Promise(resolve => setTimeout(resolve, 0)) placed after mocking mocks.getActiveErrorRules and before calling errorRuleDetector.reload()) is unclear: either add a one-line comment explaining it waits for a microtask/async subscription (e.g. to allow subscribeCacheInvalidation or other async side-effects to register) or remove the timer if it has no effect; alternatively, replace it with an explicit, intention-revealing helper (e.g. flushPromises or await nextTick) and keep errorRuleDetector.reload() and mocks.getActiveErrorRules references intact so the test deterministically waits for the needed async registration.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/app/v1/_lib/proxy/errors.ts`:
- Around line 715-722: The JSDoc incorrectly claims
getNonRetryableClientErrorCategoriesForTesting returns a read-only copy while it
currently returns the original Set reference
NON_RETRYABLE_CLIENT_ERROR_CATEGORIES; either return an actual copy (e.g.,
construct and return a new Set from NON_RETRYABLE_CLIENT_ERROR_CATEGORIES) to
make it immutable at runtime, or update the JSDoc to state that the function
returns the same runtime Set and only TypeScript's ReadonlySet typing prevents
mutation; modify getNonRetryableClientErrorCategoriesForTesting and/or its JSDoc
accordingly and keep the symbol NON_RETRYABLE_CLIENT_ERROR_CATEGORIES as the
source of truth.
---
Nitpick comments:
In `@src/app/v1/_lib/proxy/errors.ts`:
- Around line 594-609: The hardcoded NON_RETRYABLE_CLIENT_ERROR_CATEGORIES set
can drift from DEFAULT_ERROR_RULES; update the implementation to derive the
whitelist programmatically from DEFAULT_ERROR_RULES (e.g., filter entries where
isDefault===true and semantically denote client/input errors) and document any
explicit exclusions, or alternatively add a unit test that asserts all
client-input categories in DEFAULT_ERROR_RULES are included in
NON_RETRYABLE_CLIENT_ERROR_CATEGORIES (or the inverse) so CI fails on future
mismatches; reference DEFAULT_ERROR_RULES and
NON_RETRYABLE_CLIENT_ERROR_CATEGORIES to locate where to implement the
derivation or add the test.
In `@tests/unit/proxy/error-category-aware-classification.test.ts`:
- Around line 344-362: Add a symmetric test that primes the cache via the async
path with a white-list category rule and asserts the sync path returns true: use
loadDetectorWithRules with buildRule({ id: X, pattern: "validation-token",
matchType: "contains", category: "validation_error" }), call
isNonRetryableClientErrorAsync(new Error("...validation-token")) to warm the
cache, then assert isNonRetryableClientError(new Error("...validation-token"))
=== true; place this alongside the existing "sync version respects white-list
when cache hit" test and reference the same functions isNonRetryableClientError
and isNonRetryableClientErrorAsync to protect the sync branch semantics.
- Around line 104-112: The zero-delay timer in loadDetectorWithRules (the await
new Promise(resolve => setTimeout(resolve, 0)) placed after mocking
mocks.getActiveErrorRules and before calling errorRuleDetector.reload()) is
unclear: either add a one-line comment explaining it waits for a microtask/async
subscription (e.g. to allow subscribeCacheInvalidation or other async
side-effects to register) or remove the timer if it has no effect;
alternatively, replace it with an explicit, intention-revealing helper (e.g.
flushPromises or await nextTick) and keep errorRuleDetector.reload() and
mocks.getActiveErrorRules references intact so the test deterministically waits
for the needed async registration.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e6f0f976-c382-430b-a66c-3e084bad6afb
📒 Files selected for processing (2)
src/app/v1/_lib/proxy/errors.tstests/unit/proxy/error-category-aware-classification.test.ts
| /** | ||
| * 暴露内部白名单常量供测试使用 | ||
| * | ||
| * 注意:这是只读副本,外部不应修改运行时白名单。 | ||
| */ | ||
| export function getNonRetryableClientErrorCategoriesForTesting(): ReadonlySet<string> { | ||
| return NON_RETRYABLE_CLIENT_ERROR_CATEGORIES; | ||
| } |
There was a problem hiding this comment.
JSDoc 与实现不一致:返回的不是“只读副本”
注释写的是“只读副本,外部不应修改运行时白名单”,但函数实际返回的是底层 Set 的同一引用,仅依赖 ReadonlySet<string> 的类型断言来阻止修改。调用方一旦使用 as Set<string> 类型断言或 JS 调用,就能直接修改运行时白名单,造成全局副作用。
建议二选一(任选其一即可消除歧义):
方案 A:真正返回副本
export function getNonRetryableClientErrorCategoriesForTesting(): ReadonlySet<string> {
- return NON_RETRYABLE_CLIENT_ERROR_CATEGORIES;
+ return new Set(NON_RETRYABLE_CLIENT_ERROR_CATEGORIES);
}方案 B:保留同一引用,但修正注释,明确仅靠类型层面保护
-/**
- * 暴露内部白名单常量供测试使用
- *
- * 注意:这是只读副本,外部不应修改运行时白名单。
- */
+/**
+ * 暴露内部白名单常量供测试使用
+ *
+ * 注意:返回值通过 ReadonlySet 类型仅在编译期阻止修改,
+ * 运行时仍是同一引用;测试中切勿通过类型断言绕过只读约束。
+ */📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| /** | |
| * 暴露内部白名单常量供测试使用 | |
| * | |
| * 注意:这是只读副本,外部不应修改运行时白名单。 | |
| */ | |
| export function getNonRetryableClientErrorCategoriesForTesting(): ReadonlySet<string> { | |
| return NON_RETRYABLE_CLIENT_ERROR_CATEGORIES; | |
| } | |
| /** | |
| * 暴露内部白名单常量供测试使用 | |
| * | |
| * 注意:这是只读副本,外部不应修改运行时白名单。 | |
| */ | |
| export function getNonRetryableClientErrorCategoriesForTesting(): ReadonlySet<string> { | |
| return new Set(NON_RETRYABLE_CLIENT_ERROR_CATEGORIES); | |
| } |
| /** | |
| * 暴露内部白名单常量供测试使用 | |
| * | |
| * 注意:这是只读副本,外部不应修改运行时白名单。 | |
| */ | |
| export function getNonRetryableClientErrorCategoriesForTesting(): ReadonlySet<string> { | |
| return NON_RETRYABLE_CLIENT_ERROR_CATEGORIES; | |
| } | |
| /** | |
| * 暴露内部白名单常量供测试使用 | |
| * | |
| * 注意:返回值通过 ReadonlySet 类型仅在编译期阻止修改, | |
| * 运行时仍是同一引用;测试中切勿通过类型断言绕过只读约束。 | |
| */ | |
| export function getNonRetryableClientErrorCategoriesForTesting(): ReadonlySet<string> { | |
| return NON_RETRYABLE_CLIENT_ERROR_CATEGORIES; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/app/v1/_lib/proxy/errors.ts` around lines 715 - 722, The JSDoc
incorrectly claims getNonRetryableClientErrorCategoriesForTesting returns a
read-only copy while it currently returns the original Set reference
NON_RETRYABLE_CLIENT_ERROR_CATEGORIES; either return an actual copy (e.g.,
construct and return a new Set from NON_RETRYABLE_CLIENT_ERROR_CATEGORIES) to
make it immutable at runtime, or update the JSDoc to state that the function
returns the same runtime Set and only TypeScript's ReadonlySet typing prevents
mutation; modify getNonRetryableClientErrorCategoriesForTesting and/or its JSDoc
accordingly and keep the symbol NON_RETRYABLE_CLIENT_ERROR_CATEGORIES as the
source of truth.
There was a problem hiding this comment.
Code Review
This pull request introduces a category-aware whitelist for non-retryable client errors in the proxy error handling logic. By restricting non-retryable status to specific categories like validation_error and prompt_limit, the system ensures that service or network-related errors correctly trigger failover and circuit-breaker mechanisms rather than stopping immediately. The changes include updates to both synchronous and asynchronous error detection functions and a new suite of unit tests to verify the classification logic and prevent regressions. I have no feedback to provide.
There was a problem hiding this comment.
Code Review Summary
This PR introduces a category-aware white-list for NON_RETRYABLE_CLIENT_ERROR classification. The fix is precisely scoped: a new NON_RETRYABLE_CLIENT_ERROR_CATEGORIES set and a shouldTreatMatchedRuleAsNonRetryable helper replace the previous result.matched short-circuit in both sync and async code paths. The white-list's 14 categories match DEFAULT_ERROR_RULES exactly. Test coverage is solid (10 tests including data-driven enumeration of all white-listed categories, regression tests for the reported production scenario, and sync-path cache-hit verification). No issues with confidence >= 80 were identified.
PR Size: M
- Lines changed: 438 (435 additions, 3 deletions)
- Files changed: 2
Review Coverage
- Logic and correctness - Clean
- Security (OWASP Top 10) - Clean
- Error handling - Clean
- Type safety - Clean
- Documentation accuracy - Clean
- Test coverage - Adequate
- Code clarity - Good
Automated review by Claude AI
Summary
isNonRetryableClientErrorAsync(and its sync companion) previously treated any matched error rule asNON_RETRYABLE_CLIENT_ERROR, ignoring the rule'scategoryfield. This caused user-added rules with categories likeservice_errorornetwork_errorto short-circuit failover/circuit-breaker logic for upstream/transport problems that should actually be retried or routed to a different provider.This PR adds an explicit category white-list of client-input categories. Only matches whose
categoryfalls within this set are classified asNON_RETRYABLE_CLIENT_ERROR; others fall back to the standardProxyError-based classification (typicallyPROVIDER_ERRORorSYSTEM_ERROR), preserving failover and circuit-breaker behaviour.Related Issues:
service_errorcategory were not triggering retry/failover, causing Codex to stopNON_RETRYABLE_CLIENT_ERRORReproduction (CCH 0.7.2 production, observed 2026-04-25)
A user added a regex rule with the pattern
529|service.overloadedandcategory=service_error(and another503|service.unavailableforservice_error). When an upstream provider returned403 INSUFFICIENT_BALANCE:{ "error": { "type": "new_api_error", "message": "预扣费额度失败, 用户剩余额度: ¥0.214220, 需要预扣费额度: ¥0.352942 (request id: 20260425055420369306260Kfzayl7j)" } }The substring
529matched inside the price0.352942(positions 3–5 of352942). The old logic:matched=true→isNonRetryableClientErrorAsyncreturnedtruecategorizeErrorAsyncreturnedNON_RETRYABLE_CLIENT_ERRORforwarder.tslogged"White-listed client error … stopping immediately"and aborted failoverpermission_error: 上游拒绝了本次请求 (cch_session_id: …)with no upstream detail. The user could not tell that the real cause was an exhausted upstream balance.Symmetric symptom for
503|service.unavailablematching503inside a request ID like2026042505**503**99959….After this PR:
category=service_erroris not in the white-list, so the error falls back toPROVIDER_ERROR. Failover and circuit-breaker logic resume normally.prompt_limit,validation_error,content_filter, etc.) continue to be white-listed and short-circuit retries as before.White-list contents
Aligned with the categories actually used by
DEFAULT_ERROR_RULES:User-added rules with categories outside this list (e.g.
service_error,network_error,rate_limit) will no longer be wrongly white-listed. Their underlying errors will be classified by the existingProxyError-based rules, which is the correct behaviour for upstream/transport failures.Changes
src/app/v1/_lib/proxy/errors.tsNON_RETRYABLE_CLIENT_ERROR_CATEGORIESsetshouldTreatMatchedRuleAsNonRetryable(result)helperisNonRetryableClientError(sync) andisNonRetryableClientErrorAsync(async) now use the helper instead of returningresult.matcheddirectlygetNonRetryableClientErrorCategoriesForTesting()accessor exposing a read-only view of the set for unit teststests/unit/proxy/error-category-aware-classification.test.ts(new) — 10 unit tests covering:validation_error) →NON_RETRYABLEservice_error,network_error,rate_limit) → notNON_RETRYABLENON_RETRYABLENON_RETRYABLE(data-driven)categorizeErrorAsyncregression: substring match onservice_errorrule still producesPROVIDER_ERRORcategorizeErrorAsyncregression:network_errorrule on plainErrorstill producesSYSTEM_ERRORcategorizeErrorAsync:validation_errorrule still producesNON_RETRYABLE_CLIENT_ERRORisNonRetryableClientErrorrespects white-list when cache-hitTest results
bun run typecheckpasses cleanly.bunx biome checkis clean on the two changed files.Backward compatibility / migration
service_error/network_error/etc. categories intending to short-circuit retries), the behaviour change is the entire point of this PR — those errors should retry/failover, which is what the existingProxyError-based classification already does correctly.non_retryable: booleancolumn onerror_rules), not a hidden side-effect ofcategory.Out of scope (tracked separately)
503|service.unavailable,529|service.overloaded) are user data, not part of CCH's defaults — the upstreamis_default=truerules are unaffected. Users should still tighten those patterns with\bboundaries; that is operator hygiene, not an upstream bug.上游拒绝了本次请求inerror-handler.tsand the aggressive sanitisation inclient-error-message.tsstrip operationally useful upstream details (e.g. balance amounts). This deserves a separate discussion / issue about how to safely expose business-error context to clients.PR Checklist
devbun run typecheckpassesbunx biome checkclean on changed filesDescription enhanced by Claude AI