fix(proxy): release failed provider sessions#1108
Conversation
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
📝 WalkthroughWalkthrough在转发器中集中化失败提供商记录为 Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to immediately release provider-level active sessions when a provider fails, preventing 'outage storms' from inflating concurrency counts. It adds a releaseProviderSession method to the RateLimitService and integrates it into the ProxyForwarder via a new markProviderFailed helper. Feedback suggests converting this operation to a fire-and-forget pattern to ensure that Redis latency or failures do not block the critical provider failover logic.
| private static async markProviderFailed( | ||
| session: ProxySession, | ||
| failedProviderIds: number[], | ||
| providerId: number | ||
| ): Promise<void> { | ||
| failedProviderIds.push(providerId); | ||
|
|
||
| if (!session.sessionId) { | ||
| return; | ||
| } | ||
|
|
||
| await RateLimitService.releaseProviderSession(providerId, session.sessionId); | ||
| } |
There was a problem hiding this comment.
For better performance during provider failover, consider making this a fire-and-forget operation. The releaseProviderSession call to Redis doesn't need to block the provider switching logic, especially during high load or Redis latency spikes. The releaseProviderSession method is designed to be safe for this as it handles its own errors.
By making this method synchronous and using void for the async call, you can improve the resilience of the failover process.
Note: Applying this suggestion will require removing await from all call sites of markProviderFailed.
| private static async markProviderFailed( | |
| session: ProxySession, | |
| failedProviderIds: number[], | |
| providerId: number | |
| ): Promise<void> { | |
| failedProviderIds.push(providerId); | |
| if (!session.sessionId) { | |
| return; | |
| } | |
| await RateLimitService.releaseProviderSession(providerId, session.sessionId); | |
| } | |
| private static markProviderFailed( | |
| session: ProxySession, | |
| failedProviderIds: number[], | |
| providerId: number | |
| ): void { | |
| failedProviderIds.push(providerId); | |
| if (!session.sessionId) { | |
| return; | |
| } | |
| // Fire-and-forget to avoid blocking failover on this Redis command. | |
| void RateLimitService.releaseProviderSession(providerId, session.sessionId); | |
| } |
There was a problem hiding this comment.
Fixed in a9fcbd6: markProviderFailed is now synchronous and uses fire-and-forget releaseProviderSession, so Redis cleanup no longer blocks provider failover.
🧪 测试结果
总体结果: ✅ 所有测试通过 |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 4262-4274: The helper markProviderFailed currently adds providerId
to failedProviderIds and releases the session via
RateLimitService.releaseProviderSession but is not invoked from the streaming
hedge path; update sendStreamingWithHedge so that whenever the first attempt's
provider fails or an alternative provider wins, it calls
markProviderFailed(session, failedProviderIds, providerId) (using the same
ProxySession and failedProviderIds passed into sendStreamingWithHedge) before
switching providers; ensure both the failure branch and the "alternative
provider won" completion path in sendStreamingWithHedge use this helper so the
original provider's active session is released and no stale
provider:{id}:active_sessions entries remain.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b5759987-5710-464d-972b-2096ee18b76a
📒 Files selected for processing (4)
src/app/v1/_lib/proxy/forwarder.tssrc/lib/rate-limit/service.tstests/unit/lib/rate-limit/provider-session-release.test.tstests/unit/proxy/proxy-forwarder-provider-session-release.test.ts
There was a problem hiding this comment.
Code Review Summary
No significant issues identified in this PR.
PR Size: M
- Lines changed: 203
- Files changed: 4
Review Coverage
- Logic and correctness - Clean
- Security (OWASP Top 10) - Clean
- Error handling - Clean
- Type safety - Clean
- Documentation accuracy - Clean
- Test coverage - Adequate
- Code clarity - Good
Automated review by Codex AI
There was a problem hiding this comment.
- Reviewed PR
#1108(fix(proxy): release failed provider sessions) diff across 4 files (src/app/v1/_lib/proxy/forwarder.ts,src/lib/rate-limit/service.ts, and 2 new unit tests). - Applied PR size label
size/M(203 lines changed, 4 files). - No diff-line issues met the reporting threshold, so no inline review comments were posted.
- Submitted the required summary review comment via
gh pr review(visible on the PR).
There was a problem hiding this comment.
Code Review Summary
No significant issues identified in this PR. The implementation correctly addresses the Redis ZSET churn problem by immediately releasing provider sessions when providers fail, preventing outage storms from leaving failed sessions resident until TTL cleanup.
PR Size: M
- Lines changed: 203 (196 additions, 7 deletions)
- Files changed: 4
Review Coverage
- Logic and correctness - Clean
- Security (OWASP Top 10) - Clean
- Error handling - Clean
- Type safety - Clean
- Documentation accuracy - Clean
- Test coverage - Adequate
- Code clarity - Good
Implementation Highlights
RateLimitService.releaseProviderSession() (src/lib/rate-limit/service.ts:859-886)
- Properly validates inputs (providerId > 0, non-empty sessionId)
- Defensively checks Redis readiness before operation
- Correctly uses
zremto remove session from provider ZSET - Errors are logged but not thrown (appropriate for cleanup operations)
- JSDoc comment accurately explains the purpose
ProxyForwarder.markProviderFailed() (src/app/v1/_lib/proxy/forwarder.ts)
- Replaces 7 direct
failedProviderIds.push()calls consistently - Checks for null sessionId before attempting Redis release
- Proper async/await handling without blocking failover flow
Test Coverage
- Unit tests cover: successful release, Redis unavailable, invalid inputs, error handling
- Integration tests cover: session release on failure, null sessionId handling
Notes
- The early-throw path (shouldSkipRawRetryAndProviderSwitch) intentionally does not call markProviderFailed as the request terminates immediately rather than falling back; session cleanup will occur via TTL.
- Async overhead in markProviderFailed is acceptable as it only occurs during failover scenarios.
Automated review by Claude AI
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a9fcbd6225
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| return; | ||
| } | ||
|
|
||
| void RateLimitService.releaseProviderSession(providerId, session.sessionId); |
There was a problem hiding this comment.
Preserve provider session membership for concurrent requests
markProviderFailed now always calls releaseProviderSession, which does a raw ZREM by sessionId; because provider concurrency is tracked as one ZSET member per session (via checkAndTrackProviderSession), this removes the provider session for all in-flight requests sharing that session. If one concurrent request fails over while another request from the same session is still actively using that provider, the active session count is under-reported and new sessions can be admitted past the configured provider concurrency limit until the next refresh/retrack.
Useful? React with 👍 / 👎.
🧪 测试结果
总体结果: ✅ 所有测试通过 |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/app/v1/_lib/proxy/forwarder.ts (1)
3442-3450:⚠️ Potential issue | 🟠 Major
abortAttempt只在 hedge loser 分支释放 provider session这里的清理条件太窄了。
abortAllAttempts()还会在 Line 3546 用"hedge_launch_failed"、在 Line 4013 用"client_abort"终止所有 in-flight attempt,但这两条路径都不会走markProviderFailed(),因此对应的provider:{id}:active_sessions仍会残留到 TTL。这样在 selector 故障或客户端提前断开时,新的并发计数膨胀问题还会继续出现。建议修改
attempts.delete(attempt); - if (reason === "hedge_loser") { + const shouldReleaseProviderSession = + reason === "hedge_loser" || + reason === "hedge_launch_failed" || + reason === "client_abort"; + + if (reason === "hedge_loser") { session.addProviderToChain(attempt.provider, { ...attempt.endpointAudit, reason: "hedge_loser_cancelled", attemptNumber: attempt.sequence, modelRedirect: getAttemptModelRedirect(attempt), }); - ProxyForwarder.markProviderFailed(session, failedProviderIds, attempt.provider.id); + } + + if (shouldReleaseProviderSession) { + ProxyForwarder.markProviderFailed(session, failedProviderIds, attempt.provider.id); }Also applies to: 3543-3547, 3994-4014
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 3442 - 3450, The abort logic only calls ProxyForwarder.markProviderFailed(...) when reason === "hedge_loser", leaving provider:{id}:active_sessions stuck for other abort reasons ("hedge_launch_failed", "client_abort"); update abortAttempt/abortAllAttempts to always perform the same provider-session cleanup: after session.addProviderToChain(...) (or wherever attempts are terminated) call ProxyForwarder.markProviderFailed(session, failedProviderIds, attempt.provider.id) (or collect attempt.provider.id into failedProviderIds and call markProviderFailed once for each aborted attempt) so every aborted attempt path (including hedge_launch_failed and client_abort) decrements/cleans provider active session state consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 3442-3450: The abort logic only calls
ProxyForwarder.markProviderFailed(...) when reason === "hedge_loser", leaving
provider:{id}:active_sessions stuck for other abort reasons
("hedge_launch_failed", "client_abort"); update abortAttempt/abortAllAttempts to
always perform the same provider-session cleanup: after
session.addProviderToChain(...) (or wherever attempts are terminated) call
ProxyForwarder.markProviderFailed(session, failedProviderIds,
attempt.provider.id) (or collect attempt.provider.id into failedProviderIds and
call markProviderFailed once for each aborted attempt) so every aborted attempt
path (including hedge_launch_failed and client_abort) decrements/cleans provider
active session state consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 78251726-0608-47e1-aac4-96d51fcb3ce0
📒 Files selected for processing (3)
src/app/v1/_lib/proxy/forwarder.tstests/unit/proxy/proxy-forwarder-hedge-first-byte.test.tstests/unit/proxy/proxy-forwarder-provider-session-release.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unit/proxy/proxy-forwarder-provider-session-release.test.ts
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
🧪 测试结果
总体结果: ✅ 所有测试通过 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5375ce4eb5
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const key = `provider:${providerId}:active_sessions`; | ||
| const refKey = `provider:${providerId}:active_session_refs`; | ||
| const now = Date.now(); |
There was a problem hiding this comment.
Hash-tag provider keys before multi-key Lua eval
checkAndTrackProviderSession now executes Lua with two keys (provider:${providerId}:active_sessions and provider:${providerId}:active_session_refs) that do not share a hash tag, so Redis Cluster deployments will return CROSSSLOT for this eval. Because this method catches the error and fail-opens, provider concurrent-session limits are silently bypassed (and the paired release script fails similarly), which can overload providers in cluster-backed environments.
Useful? React with 👍 / 👎.
| if not is_tracked or existing_refs > 0 then | ||
| redis.call('HINCRBY', ref_key, session_id, 1) | ||
| referenced = 1 |
There was a problem hiding this comment.
Stop ref-count growth across successful provider requests
The Lua tracking script increments active_session_refs for every request once a ref exists, but refs are only decremented on markProviderFailed/releaseProviderSession; successful requests never decrement. In a normal session that has successes before a failover, this causes ref counts to accumulate indefinitely, so a later failure decrements by only one and leaves the failed provider session resident in active_sessions, keeping concurrency counts inflated and defeating the intended immediate cleanup.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (5)
src/lib/rate-limit/service.ts (1)
877-910:releaseProviderSession在 Redis 暂时不可用时是静默 no-op,建议确认是否可接受。当
redis.status !== "ready"时直接return,不抛错也不记录任何日志(连 debug/warn 都没有)。这意味着如果 provider 被标记为失败时 Redis 恰好处于断连/重连阶段,对应的 active session 既不会被立即释放,也不会留下任何观测线索,只能等待 ZSET TTL 自然过期。考虑到本 PR 的目标恰恰是“避免 lingering sessions 等到 TTL”,建议在该分支至少打一条
logger.warn,便于后续排查为何释放未执行:📝 建议改动
const redis = RateLimitService.redis; if (!redis || redis.status !== "ready") { + logger.warn("[RateLimit] Redis not ready, skip releaseProviderSession", { + providerId, + sessionId, + }); return; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/rate-limit/service.ts` around lines 877 - 910, The function releaseProviderSession silently returns when Redis is not ready; change it to log a warning so failures to release sessions are observable: inside RateLimitService.releaseProviderSession, detect the early-return branch where RateLimitService.redis is missing or redis.status !== "ready" and add a logger.warn (including providerId and sessionId) before returning (do not convert this into throwing an error unless intended). Ensure the warning message clearly states Redis is unavailable and session release was skipped so operators can trace lingering sessions.tests/unit/lib/session-manager-terminate-session.test.ts (1)
30-30: 新增了hdelmock,但没有断言terminateSession对 active_session_refs 的清理行为。本 PR 在
SessionManager.terminateSession中新增了pipeline.hdel(\provider:${providerId}:active_session_refs`, sessionId)(src/lib/session-manager.ts L2436)。当前测试只在 mock 上加了hdel占位,但没有任何expect(pipelineRef.hdel).toHaveBeenCalledWith(...)`,这会让该新清理路径处于无测试覆盖状态,未来回归不易被发现。♻️ 建议在已有用例中追加断言
expect(pipelineRef.zrem).toHaveBeenCalledWith("provider:42:active_sessions", sessionId); + expect(pipelineRef.hdel).toHaveBeenCalledWith( + "provider:42:active_session_refs", + sessionId + ); expect(pipelineRef.zrem).toHaveBeenCalledWith(getKeyActiveSessionsKey(7), sessionId);Also applies to: 42-67
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/lib/session-manager-terminate-session.test.ts` at line 30, 测试新增了 pipeline.hdel 的 mock,但没有断言 SessionManager.terminateSession 对 active_session_refs 的清理行为;请在对应单元测试(tests/unit/lib/session-manager-terminate-session.test.ts)中为 pipelineRef.hdel 增加断言,验证 terminateSession 调用时 pipelineRef.hdel 被以正确参数调用(即键 "provider:${providerId}:active_session_refs" 和 sessionId),使用 expect(pipelineRef.hdel).toHaveBeenCalledWith(...) 或等效匹配以覆盖该清理路径。src/lib/redis/lua-scripts.ts (1)
23-27: 注释中拒绝路径未提及referenced=0,建议补全契约说明。第 27 行只描述返回值含义为
tracked=0,但实际返回元组是{0, count, 0, 0},最后一项也是referenced=0。为保持与其他三种返回情况一致,建议在注释中明确这一点,避免后续维护者误解第四个返回值的含义。📝 建议改动
- * - {0, count, 0, 0} - rejected (limit reached), returns current count and tracked=0 + * - {0, count, 0, 0} - rejected (limit reached), returns current count, tracked=0, referenced=0🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/redis/lua-scripts.ts` around lines 23 - 27, Update the return-value comment block that documents the Lua script's tuple responses so the rejection case explicitly notes referenced=0; specifically, change the fourth bullet from "{0, count, 0, 0} - rejected (limit reached), returns current count and tracked=0" to explicitly state referenced=0 as well (e.g. "{0, count, 0, 0} - rejected (limit reached), returns current count, tracked=0, referenced=0") so the documented contract for the tuple {flag, count, tracked, referenced} is complete and consistent with the other bullets.src/lib/session-tracker.ts (1)
439-492: 两阶段引用清理的索引/分支判断正确,但建议留意双 pipeline 顺序。将每个 provider 的清理命令从 2 条扩为 3 条(
zrangebyscore+zremrangebyscore+zrange),且使用i * 3/i * 3 + 2取偏移,与 pipeline 顺序一致;hasRefCleanup用于跳过空 pipeline 的exec()也合理。同样地,refCleanupPipeline.exec()与上一个 cleanupPipeline 不在同一 round-trip 内,存在与上一处相同的小窗口(被 zremrangebyscore 标记为过期但被并发刷新的 sessionId 仍会在此被 HDEL)。本路径属于批量计数热路径,若评估到对统计准确性敏感,建议结合 Lua 合并这两个 pipeline。🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/session-tracker.ts` around lines 439 - 492, The current two-pipeline approach (cleanupPipeline and refCleanupPipeline) leaves a small race where a session removed by zremrangebyscore can be concurrently refreshed and still get HDEL'd; fix by combining the per-provider cleanup and ref deletion into a single atomic Redis script/command so zrangebyscore, zremrangebyscore, zrange (and hdel of expired IDs) run in one round-trip. Replace the loop that builds cleanupPipeline + refCleanupPipeline with a single EVAL (or multi-key Lua) invocation that, for each providerId from providerIds, performs zrangebyscore(key, "-inf", cutoffMs) to collect expired IDs, zremrangebyscore(key, "-inf", cutoffMs) to remove them, zrange(key, 0, -1) to get remaining sessions, and hdel("provider:{providerId}:active_session_refs", expiredIds...) to remove their refs, and return both expired and remaining lists so you can populate expiredProviderSessions, providerSessionMap, and allSessionIds from the script result atomically.tests/unit/proxy/proxy-forwarder-provider-session-release.test.ts (1)
25-111: 测试场景对markProviderFailed的四个分支覆盖完整。
consumeProviderSessionRef → true/false、重复标记去重、sessionId === null都覆盖到了,且每个场景都同时断言failedProviderIds、consumeProviderSessionRef调用次数和releaseProviderSession是否被调用,能很好地兜住该 helper 的契约。一个可选的补充用例(不强求):在 sessionId 存在但consumeProviderSessionRef方法本身缺失(比如运行时 session 未升级到新接口)时,验证不抛异常——这正好对应 forwarder 中那段as { consumeProviderSessionRef?: ... }防御式断言所要保护的场景。🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/proxy/proxy-forwarder-provider-session-release.test.ts` around lines 25 - 111, Tests already cover four branches of markProviderFailed; add one more test case to ensure it doesn't throw when session lacks consumeProviderSessionRef: import ProxyForwarder, call forwarderInternals.markProviderFailed with session = { sessionId: "sess_no_consume" } cast as ProxySession (no consumeProviderSessionRef), failedProviderIds = [], and providerId like 42, then assert failedProviderIds contains 42 and mocks.releaseProviderSession behavior matches existing logic (should be not called because no consume ref or only called when sessionId present and consume returns true). This validates the defensive optional signature for consumeProviderSessionRef in markProviderFailed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 4291-4315: The code in markProviderFailed uses a defensive type
assertion to access consumeProviderSessionRef but ProxySession already exposes
that method; remove the cast and call the method directly (mirror the style used
for session.recordProviderSessionRef), e.g. replace the casted
providerSessionRefConsumer usage with a direct invocation of
session.consumeProviderSessionRef(providerId), keep the same null/boolean check
and the subsequent call to
RateLimitService.releaseProviderSession(session.sessionId) when it returns true.
In `@src/lib/redis/lua-scripts.ts`:
- Around line 103-122: The RELEASE_PROVIDER_SESSION Lua script currently returns
early when HGET(ref_key, session_id) is missing or zero, leaving ZSET members
orphaned; modify RELEASE_PROVIDER_SESSION so that when current_refs <= 0 it
still performs redis.call('ZREM', provider_key, session_id) (remove the member
from the ZSET) and then return {removed, 0}; keep the existing logic for
remaining_refs > 0 and for the HDEL path, but ensure the early-return case does
ZREM as a fallback cleanup so legacy ZSET-only sessions are removed atomically
by the script.
In `@src/lib/session-tracker.ts`:
- Around line 257-267: The post-pipeline redis.hdel call (after pipeline.exec())
can race with concurrent refreshSession/checkAndTrackProviderSession updates and
produce ZSET refs/Hashmap inconsistencies; move the hdel into the same Redis
pipeline or combine the read/remove/hdel logic into a single atomic operation
(e.g., include the hdel in the same pipeline used around
cleanupExpiredSessionsResultIndex and providerRefKey or replace the sequence
with a Lua script) so that expiredResult processing
(cleanupExpiredSessionsResultIndex -> expiredSessionIds -> redis.hdel) becomes
atomic; apply the same fix pattern to countFromZSet where zrangebyscore,
zremrangebyscore and hdel are separate calls.
---
Nitpick comments:
In `@src/lib/rate-limit/service.ts`:
- Around line 877-910: The function releaseProviderSession silently returns when
Redis is not ready; change it to log a warning so failures to release sessions
are observable: inside RateLimitService.releaseProviderSession, detect the
early-return branch where RateLimitService.redis is missing or redis.status !==
"ready" and add a logger.warn (including providerId and sessionId) before
returning (do not convert this into throwing an error unless intended). Ensure
the warning message clearly states Redis is unavailable and session release was
skipped so operators can trace lingering sessions.
In `@src/lib/redis/lua-scripts.ts`:
- Around line 23-27: Update the return-value comment block that documents the
Lua script's tuple responses so the rejection case explicitly notes
referenced=0; specifically, change the fourth bullet from "{0, count, 0, 0} -
rejected (limit reached), returns current count and tracked=0" to explicitly
state referenced=0 as well (e.g. "{0, count, 0, 0} - rejected (limit reached),
returns current count, tracked=0, referenced=0") so the documented contract for
the tuple {flag, count, tracked, referenced} is complete and consistent with the
other bullets.
In `@src/lib/session-tracker.ts`:
- Around line 439-492: The current two-pipeline approach (cleanupPipeline and
refCleanupPipeline) leaves a small race where a session removed by
zremrangebyscore can be concurrently refreshed and still get HDEL'd; fix by
combining the per-provider cleanup and ref deletion into a single atomic Redis
script/command so zrangebyscore, zremrangebyscore, zrange (and hdel of expired
IDs) run in one round-trip. Replace the loop that builds cleanupPipeline +
refCleanupPipeline with a single EVAL (or multi-key Lua) invocation that, for
each providerId from providerIds, performs zrangebyscore(key, "-inf", cutoffMs)
to collect expired IDs, zremrangebyscore(key, "-inf", cutoffMs) to remove them,
zrange(key, 0, -1) to get remaining sessions, and
hdel("provider:{providerId}:active_session_refs", expiredIds...) to remove their
refs, and return both expired and remaining lists so you can populate
expiredProviderSessions, providerSessionMap, and allSessionIds from the script
result atomically.
In `@tests/unit/lib/session-manager-terminate-session.test.ts`:
- Line 30: 测试新增了 pipeline.hdel 的 mock,但没有断言 SessionManager.terminateSession 对
active_session_refs
的清理行为;请在对应单元测试(tests/unit/lib/session-manager-terminate-session.test.ts)中为
pipelineRef.hdel 增加断言,验证 terminateSession 调用时 pipelineRef.hdel 被以正确参数调用(即键
"provider:${providerId}:active_session_refs" 和 sessionId),使用
expect(pipelineRef.hdel).toHaveBeenCalledWith(...) 或等效匹配以覆盖该清理路径。
In `@tests/unit/proxy/proxy-forwarder-provider-session-release.test.ts`:
- Around line 25-111: Tests already cover four branches of markProviderFailed;
add one more test case to ensure it doesn't throw when session lacks
consumeProviderSessionRef: import ProxyForwarder, call
forwarderInternals.markProviderFailed with session = { sessionId:
"sess_no_consume" } cast as ProxySession (no consumeProviderSessionRef),
failedProviderIds = [], and providerId like 42, then assert failedProviderIds
contains 42 and mocks.releaseProviderSession behavior matches existing logic
(should be not called because no consume ref or only called when sessionId
present and consume returns true). This validates the defensive optional
signature for consumeProviderSessionRef in markProviderFailed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3e068c22-192f-470f-b18f-cbd885a3f8d9
📒 Files selected for processing (13)
src/app/v1/_lib/proxy/forwarder.tssrc/app/v1/_lib/proxy/provider-selector.tssrc/app/v1/_lib/proxy/session.tssrc/lib/rate-limit/service.tssrc/lib/redis/lua-scripts.tssrc/lib/session-manager.tssrc/lib/session-tracker.tstests/unit/lib/rate-limit/provider-session-release.test.tstests/unit/lib/rate-limit/service-extra.test.tstests/unit/lib/session-manager-terminate-session.test.tstests/unit/lib/session-tracker-cleanup.test.tstests/unit/proxy/proxy-forwarder-hedge-first-byte.test.tstests/unit/proxy/proxy-forwarder-provider-session-release.test.ts
| private static markProviderFailed( | ||
| session: ProxySession, | ||
| failedProviderIds: number[], | ||
| providerId: number | ||
| ): void { | ||
| if (failedProviderIds.includes(providerId)) { | ||
| return; | ||
| } | ||
|
|
||
| failedProviderIds.push(providerId); | ||
|
|
||
| if (!session.sessionId) { | ||
| return; | ||
| } | ||
|
|
||
| const providerSessionRefConsumer = ( | ||
| session as { consumeProviderSessionRef?: (providerId: number) => boolean } | ||
| ).consumeProviderSessionRef; | ||
|
|
||
| if (!providerSessionRefConsumer?.call(session, providerId)) { | ||
| return; | ||
| } | ||
|
|
||
| void RateLimitService.releaseProviderSession(providerId, session.sessionId); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify ProxySession exposes recordProviderSessionRef and consumeProviderSessionRef on its public type.
fd -t f 'session.ts' src/app/v1/_lib/proxy/ --exec rg -nP -C2 'recordProviderSessionRef|consumeProviderSessionRef|providerSessionRefs' {}Repository: ding113/claude-code-hub
Length of output: 842
统一使用直接方法调用而非防御式类型断言
验证表明 recordProviderSessionRef 和 consumeProviderSessionRef 两个方法都已在 ProxySession 中公开定义(分别在 session.ts 的第 320 和 330 行),因此 forwarder.ts 第 4306-4310 行的防御式类型断言:
const providerSessionRefConsumer = (
session as { consumeProviderSessionRef?: (providerId: number) => boolean }
).consumeProviderSessionRef;是冗余的。建议统一使用直接调用方式,保持与第 3954 行 session.recordProviderSessionRef(provider.id) 的风格一致。
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 4291 - 4315, The code in
markProviderFailed uses a defensive type assertion to access
consumeProviderSessionRef but ProxySession already exposes that method; remove
the cast and call the method directly (mirror the style used for
session.recordProviderSessionRef), e.g. replace the casted
providerSessionRefConsumer usage with a direct invocation of
session.consumeProviderSessionRef(providerId), keep the same null/boolean check
and the subsequent call to
RateLimitService.releaseProviderSession(session.sessionId) when it returns true.
| export const RELEASE_PROVIDER_SESSION = ` | ||
| local provider_key = KEYS[1] | ||
| local ref_key = KEYS[2] | ||
| local session_id = ARGV[1] | ||
|
|
||
| local current_refs = tonumber(redis.call('HGET', ref_key, session_id) or '0') | ||
| if current_refs <= 0 then | ||
| return {0, 0} | ||
| end | ||
|
|
||
| local remaining_refs = current_refs - 1 | ||
| if remaining_refs > 0 then | ||
| redis.call('HSET', ref_key, session_id, remaining_refs) | ||
| return {0, remaining_refs} | ||
| end | ||
|
|
||
| redis.call('HDEL', ref_key, session_id) | ||
| local removed = redis.call('ZREM', provider_key, session_id) | ||
| return {removed, remaining_refs} | ||
| `; |
There was a problem hiding this comment.
遗留会员(在 ZSET 中但无引用计数)无法被 RELEASE_PROVIDER_SESSION 主动释放。
当 HGET ref_key session_id 返回 0 或缺失时,脚本直接 return {0, 0},不会执行 ZREM provider_key。这意味着部署本 PR 之前由旧脚本写入的活跃 session(即 ZSET 里有 member 但 refs hash 中没有计数)在 provider 被标记为失败时无法被立即清理,只能等待 TTL 过期。
部署过渡期内,这些遗留会员会让供应商的 active_sessions 计数虚高,与本 PR 试图修复的“lingering sessions 直到 TTL”问题部分相同。建议:
- 方案 A(推荐):当
current_refs <= 0时仍执行ZREM provider_key,作为兜底清理;将语义改为“无引用 → 直接移除 ZSET 会员”。 - 方案 B:保留当前行为,但在
RateLimitService.releaseProviderSession中检测到removed=0 && remainingRefs=0时,额外调用一次ZREM兜底(注意非原子)。
♻️ 建议改动(方案 A)
local current_refs = tonumber(redis.call('HGET', ref_key, session_id) or '0')
if current_refs <= 0 then
- return {0, 0}
+ -- Legacy member without ref tracking (created before ref-count rollout):
+ -- best-effort remove from ZSET so eager release path still works during migration.
+ local removed_legacy = redis.call('ZREM', provider_key, session_id)
+ return {removed_legacy, 0}
end📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| export const RELEASE_PROVIDER_SESSION = ` | |
| local provider_key = KEYS[1] | |
| local ref_key = KEYS[2] | |
| local session_id = ARGV[1] | |
| local current_refs = tonumber(redis.call('HGET', ref_key, session_id) or '0') | |
| if current_refs <= 0 then | |
| return {0, 0} | |
| end | |
| local remaining_refs = current_refs - 1 | |
| if remaining_refs > 0 then | |
| redis.call('HSET', ref_key, session_id, remaining_refs) | |
| return {0, remaining_refs} | |
| end | |
| redis.call('HDEL', ref_key, session_id) | |
| local removed = redis.call('ZREM', provider_key, session_id) | |
| return {removed, remaining_refs} | |
| `; | |
| export const RELEASE_PROVIDER_SESSION = ` | |
| local provider_key = KEYS[1] | |
| local ref_key = KEYS[2] | |
| local session_id = ARGV[1] | |
| local current_refs = tonumber(redis.call('HGET', ref_key, session_id) or '0') | |
| if current_refs <= 0 then | |
| -- Legacy member without ref tracking (created before ref-count rollout): | |
| -- best-effort remove from ZSET so eager release path still works during migration. | |
| local removed_legacy = redis.call('ZREM', provider_key, session_id) | |
| return {removed_legacy, 0} | |
| end | |
| local remaining_refs = current_refs - 1 | |
| if remaining_refs > 0 then | |
| redis.call('HSET', ref_key, session_id, remaining_refs) | |
| return {0, remaining_refs} | |
| end | |
| redis.call('HDEL', ref_key, session_id) | |
| local removed = redis.call('ZREM', provider_key, session_id) | |
| return {removed, remaining_refs} | |
| `; |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/redis/lua-scripts.ts` around lines 103 - 122, The
RELEASE_PROVIDER_SESSION Lua script currently returns early when HGET(ref_key,
session_id) is missing or zero, leaving ZSET members orphaned; modify
RELEASE_PROVIDER_SESSION so that when current_refs <= 0 it still performs
redis.call('ZREM', provider_key, session_id) (remove the member from the ZSET)
and then return {removed, 0}; keep the existing logic for remaining_refs > 0 and
for the HDEL path, but ensure the early-return case does ZREM as a fallback
cleanup so legacy ZSET-only sessions are removed atomically by the script.
| if (cleanupExpiredSessionsResultIndex !== null && results) { | ||
| const expiredResult = results[cleanupExpiredSessionsResultIndex]; | ||
| if (!expiredResult?.[0] && Array.isArray(expiredResult?.[1])) { | ||
| const expiredSessionIds = expiredResult[1].filter( | ||
| (value): value is string => typeof value === "string" && value.length > 0 | ||
| ); | ||
| if (expiredSessionIds.length > 0) { | ||
| await redis.hdel(providerRefKey, ...expiredSessionIds); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
post-pipeline hdel 与并发 zadd 之间存在小窗口的引用一致性偏差。
pipeline.exec() 完成后,再发起 redis.hdel(providerRefKey, ...expiredSessionIds) 是非原子的:在两次 round-trip 之间,并发的 refreshSession/checkAndTrackProviderSession 可能用更新的分数将其中某个被判过期的 sessionId 重新写回 ZSET(zremrangebyscore 不会删除它),但本次 hdel 仍会按缓存的 ID 列表删除其在 refs 中的引用计数,从而出现 “ZSET 里有该 sessionId、refs 里没有” 的孤儿记录。后续 releaseProviderSession 命中该会话时,引用计数会下溢。
考虑到清理只有 1% 的触发概率、时间窗很小,且属于尽力而为的清理,影响范围有限;但若希望彻底消除该窗口,可把 hdel 也合并进同一 pipeline(或封装为单条 Lua),以保证读取/删除/HDEL 的语义一致。countFromZSet 在行 602-609 的写法存在同样的窗口(zrangebyscore 与 zremrangebyscore、hdel 是三次独立调用)。
♻️ 可选:将 `hdel` 合并至同一 pipeline
if (cleanupExpiredSessionsResultIndex !== null && results) {
const expiredResult = results[cleanupExpiredSessionsResultIndex];
if (!expiredResult?.[0] && Array.isArray(expiredResult?.[1])) {
const expiredSessionIds = expiredResult[1].filter(
(value): value is string => typeof value === "string" && value.length > 0
);
if (expiredSessionIds.length > 0) {
- await redis.hdel(providerRefKey, ...expiredSessionIds);
+ // 合并到下一次 refresh 的 pipeline,或使用 Lua 脚本完成 zrange/zrem/hdel 的原子清理。
+ await redis.hdel(providerRefKey, ...expiredSessionIds);
}
}
}更彻底的做法是将 zrange/zrem/hdel 合并为一段 Lua(与 getProviderSessionCountBatch 共用)。
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/session-tracker.ts` around lines 257 - 267, The post-pipeline
redis.hdel call (after pipeline.exec()) can race with concurrent
refreshSession/checkAndTrackProviderSession updates and produce ZSET
refs/Hashmap inconsistencies; move the hdel into the same Redis pipeline or
combine the read/remove/hdel logic into a single atomic operation (e.g., include
the hdel in the same pipeline used around cleanupExpiredSessionsResultIndex and
providerRefKey or replace the sequence with a Lua script) so that expiredResult
processing (cleanupExpiredSessionsResultIndex -> expiredSessionIds ->
redis.hdel) becomes atomic; apply the same fix pattern to countFromZSet where
zrangebyscore, zremrangebyscore and hdel are separate calls.
Summary
Release provider-level active session ZSET entries immediately when a provider fails and the proxy falls back. Prevents outage storms from leaving failed provider sessions resident until TTL cleanup.
Problem
Provider concurrency is tracked before forwarding in
provider:{id}:active_sessions. When providers were marked as failed (exhausted retries, network errors, or invalid responses), their active session entries were not released immediately. This caused:Solution
Introduce a new
RateLimitService.releaseProviderSession()method that removes a session from the provider'sactive_sessionsZSET when the provider is marked as failed. TheProxyForwardernow calls a newmarkProviderFailed()helper (replacing directfailedProviderIds.push()) which:Changes
Core Changes
src/lib/rate-limit/service.ts(+30/-0) - AddreleaseProviderSession()static method with defensive checks and error loggingsrc/app/v1/_lib/proxy/forwarder.ts(+30/-7) - Replace 7 directfailedProviderIds.push()calls withmarkProviderFailed(); add new private static helperTest Coverage
tests/unit/lib/rate-limit/provider-session-release.test.ts(new, +82 lines) - Unit tests forreleaseProviderSessioncovering:tests/unit/proxy/proxy-forwarder-provider-session-release.test.ts(new, +54 lines) - Integration tests formarkProviderFailedcovering:Related Issues/PRs
Verification
Testing
Automated Tests
RateLimitService.releaseProviderSession()ProxyForwarder.markProviderFailed()Manual Testing
Not applicable - this is a background reliability fix with no UI changes.
Breaking Changes
None. This fix only affects internal Redis state cleanup timing; no APIs or behaviors change.
Checklist
devDescription enhanced by Claude AI
Greptile Summary
This PR fixes a Redis ZSET leak where failed provider sessions remained in
provider:{id}:active_sessionsuntil TTL expiry, inflating concurrency counts during outage storms. The fix introduces reference-counted release: a newactive_session_refshash tracks how many in-flight acquisitions each session holds for a given provider, andRELEASE_PROVIDER_SESSIONonly executesZREMwhen the last reference is dropped. All existing cleanup paths are updated to also clean the ref hash, maintaining consistency.Confidence Score: 5/5
Safe to merge; changes are additive, all failure cases fail-open, and the fix is well-tested.
No P0 or P1 issues found. The ref-counting design is carefully guarded: consumeProviderSessionRef prevents double-release within a request, failedProviderIds.includes() deduplicates calls to markProviderFailed, and releaseProviderSession is fire-and-forget with error logging so it cannot break the request path.
Minor attention to src/app/v1/_lib/proxy/session.ts (dead null guard) and src/app/v1/_lib/proxy/forwarder.ts (duck-typed consumeProviderSessionRef access in markProviderFailed).
Important Files Changed
Sequence Diagram
sequenceDiagram participant PS as provider-selector participant PF as ProxyForwarder participant RL as RateLimitService participant RD as Redis PS->>RL: checkAndTrackProviderSession RL->>RD: CHECK_AND_TRACK_SESSION Lua RD-->>RL: allowed=1 referenced=1 RL-->>PS: allowed true referenced true PS->>PS: session.recordProviderSessionRef Note over PF: Provider fails PF->>PF: markProviderFailed PF->>RL: releaseProviderSession RL->>RD: RELEASE_PROVIDER_SESSION Lua RD-->>RL: removed=1 remainingRefs=0 Note over PF: Session terminates PF->>RD: ZREM active_sessions + HDEL active_session_refsPrompt To Fix All With AI
Reviews (3): Last reviewed commit: "fix(rate-limit): preserve provider sessi..." | Re-trigger Greptile