Skip to content

fix(proxy): release failed provider sessions#1108

Merged
ding113 merged 4 commits into
devfrom
fix/redis-provider-failure-20260425
Apr 25, 2026
Merged

fix(proxy): release failed provider sessions#1108
ding113 merged 4 commits into
devfrom
fix/redis-provider-failure-20260425

Conversation

@ding113
Copy link
Copy Markdown
Owner

@ding113 ding113 commented Apr 25, 2026

Summary

Release provider-level active session ZSET entries immediately when a provider fails and the proxy falls back. Prevents outage storms from leaving failed provider sessions resident until TTL cleanup.

Problem

Provider concurrency is tracked before forwarding in provider:{id}:active_sessions. When providers were marked as failed (exhausted retries, network errors, or invalid responses), their active session entries were not released immediately. This caused:

  1. Redis ZSET churn - Failed sessions remained in the sorted set until TTL cleanup
  2. Inaccurate concurrency tracking - The active session count was inflated during provider outages
  3. Outage storm amplification - When all providers were unavailable, the accumulated failed sessions exacerbated the Redis load

Solution

Introduce a new RateLimitService.releaseProviderSession() method that removes a session from the provider's active_sessions ZSET when the provider is marked as failed. The ProxyForwarder now calls a new markProviderFailed() helper (replacing direct failedProviderIds.push()) which:

  1. Adds the provider to the failed list (existing behavior)
  2. Immediately releases the provider session from Redis (new behavior)

Changes

Core Changes

  • src/lib/rate-limit/service.ts (+30/-0) - Add releaseProviderSession() static method with defensive checks and error logging
  • src/app/v1/_lib/proxy/forwarder.ts (+30/-7) - Replace 7 direct failedProviderIds.push() calls with markProviderFailed(); add new private static helper

Test Coverage

  • tests/unit/lib/rate-limit/provider-session-release.test.ts (new, +82 lines) - Unit tests for releaseProviderSession covering:
    • Successful session release from ZSET
    • Redis unavailable / not ready handling
    • Invalid providerId / empty sessionId rejection
    • Error logging without throwing
  • tests/unit/proxy/proxy-forwarder-provider-session-release.test.ts (new, +54 lines) - Integration tests for markProviderFailed covering:
    • Session release on provider failure
    • Null sessionId handling (no Redis call)

Related Issues/PRs

Verification

# LSP diagnostics: no diagnostics for modified files
bunx vitest run \
  tests/unit/lib/rate-limit/provider-session-release.test.ts \
  tests/unit/proxy/proxy-forwarder-provider-session-release.test.ts \
  tests/unit/proxy/proxy-forwarder-retry-limit.test.ts \
  tests/unit/proxy/proxy-forwarder.test.ts

bun run lint      # Biome check
bun run typecheck # TypeScript validation
bun run test      # Full test suite
bun run build     # Production build

Testing

Automated Tests

  • Unit tests added for RateLimitService.releaseProviderSession()
  • Integration tests added for ProxyForwarder.markProviderFailed()
  • Regression tests pass (existing forwarder tests)

Manual Testing

Not applicable - this is a background reliability fix with no UI changes.

Breaking Changes

None. This fix only affects internal Redis state cleanup timing; no APIs or behaviors change.

Checklist

  • Code follows project conventions (no emoji, i18n where applicable)
  • Self-review completed
  • Tests added and passing
  • No breaking changes
  • No UI changes (screenshots N/A)
  • Target branch is dev

Description enhanced by Claude AI

Greptile Summary

This PR fixes a Redis ZSET leak where failed provider sessions remained in provider:{id}:active_sessions until TTL expiry, inflating concurrency counts during outage storms. The fix introduces reference-counted release: a new active_session_refs hash tracks how many in-flight acquisitions each session holds for a given provider, and RELEASE_PROVIDER_SESSION only executes ZREM when the last reference is dropped. All existing cleanup paths are updated to also clean the ref hash, maintaining consistency.

Confidence Score: 5/5

Safe to merge; changes are additive, all failure cases fail-open, and the fix is well-tested.

No P0 or P1 issues found. The ref-counting design is carefully guarded: consumeProviderSessionRef prevents double-release within a request, failedProviderIds.includes() deduplicates calls to markProviderFailed, and releaseProviderSession is fire-and-forget with error logging so it cannot break the request path.

Minor attention to src/app/v1/_lib/proxy/session.ts (dead null guard) and src/app/v1/_lib/proxy/forwarder.ts (duck-typed consumeProviderSessionRef access in markProviderFailed).

Important Files Changed

Filename Overview
src/app/v1/_lib/proxy/forwarder.ts Replaces 7 direct failedProviderIds.push() calls with markProviderFailed(); adds concurrent-session acquisition in startAttempt for hedge providers; converts launchingAlternative from one-shot to retry while-loop. Logic is sound with proper guards against double-release.
src/lib/redis/lua-scripts.ts Extends CHECK_AND_TRACK_SESSION to 2 keys (ZSET + ref hash), adds ref-count increment with referenced return value and expired-session cleanup of ref hash. Adds new RELEASE_PROVIDER_SESSION script using reference counting to safely ZREM when last ref is released.
src/lib/rate-limit/service.ts Extends checkAndTrackProviderSession return type with referenced field; adds releaseProviderSession static method with defensive input validation, Redis readiness check, and error logging without throw.
src/app/v1/_lib/proxy/session.ts Adds providerSessionRefs Set field initialized as class property; adds recordProviderSessionRef and consumeProviderSessionRef methods with a redundant null guard on the already-initialized field.
src/lib/session-tracker.ts Adds ref-hash cleanup alongside every ZSET cleanup path; refreshes ref-hash TTL in updateProvider and refreshSession. New getProviderActiveSessionRefsKey helper pattern-matches keys correctly.

Sequence Diagram

sequenceDiagram
    participant PS as provider-selector
    participant PF as ProxyForwarder
    participant RL as RateLimitService
    participant RD as Redis
    PS->>RL: checkAndTrackProviderSession
    RL->>RD: CHECK_AND_TRACK_SESSION Lua
    RD-->>RL: allowed=1 referenced=1
    RL-->>PS: allowed true referenced true
    PS->>PS: session.recordProviderSessionRef
    Note over PF: Provider fails
    PF->>PF: markProviderFailed
    PF->>RL: releaseProviderSession
    RL->>RD: RELEASE_PROVIDER_SESSION Lua
    RD-->>RL: removed=1 remainingRefs=0
    Note over PF: Session terminates
    PF->>RD: ZREM active_sessions + HDEL active_session_refs
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/session.ts
Line: 320-322

Comment:
**Unreachable null guard on class-initialized field**

`providerSessionRefs` is declared as a class field initializer (`private providerSessionRefs = new Set<number>()`), so it can never be `null` or `undefined` here. The `if (!this.providerSessionRefs)` branch is dead code and could mislead future readers into thinking the field might be absent.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 4300-4318

Comment:
**Duck-typed access to `consumeProviderSessionRef` bypasses TypeScript safety**

`markProviderFailed` casts `session` to an anonymous structural type to reach `consumeProviderSessionRef`, rather than referencing it through `ProxySession`'s declared interface. If the method is ever renamed or removed from `ProxySession`, TypeScript will not catch the silent no-op — the optional-chaining guard `?.call(...)` will evaluate to `undefined` (falsy) and skip the Redis release without any compile-time or runtime error.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/lib/redis/lua-scripts.ts
Line: 72-76

Comment:
**Ref increment for already-tracked sessions can accumulate across concurrent requests**

The condition `if not is_tracked or existing_refs > 0` increments refs when the session is already in the ZSET and has an existing ref count. Within a single request this is safe because `launchedProviderIds` prevents re-selecting the same provider. However, two independent HTTP requests sharing the same `sessionId` racing against the same provider will each increment refs, requiring two `RELEASE_PROVIDER_SESSION` calls before the ZSET member is removed. A clarifying inline comment on the multi-caller invariant would help future readers.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (3): Last reviewed commit: "fix(rate-limit): preserve provider sessi..." | Re-trigger Greptile

ding113 and others added 2 commits April 25, 2026 14:29
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 25, 2026

📝 Walkthrough

Walkthrough

在转发器中集中化失败提供商记录为 markProviderFailed(...),新增按会话引用释放提供商会话的 RateLimit 方法与 Redis Lua/引用计数支持,并在会话追踪、会话终止和多处 hedge 流程中使用该机制;相应单元测试已扩展覆盖新行为。

Changes

Cohort / File(s) Summary
Proxy forwarder
src/app/v1/_lib/proxy/forwarder.ts
引入私有 markProviderFailed(...),统一失败提供商去重记录并在有 session.sessionId 时调用释放;流式 hedge 路径重构以循环选择替代、把 startAttempt 改为 Promise<boolean> 并在并发受限时记录链路条目与标记失败。
Proxy session / selector
src/app/v1/_lib/proxy/session.ts, src/app/v1/_lib/proxy/provider-selector.ts
ProxySession 增加 providerSessionRefs 集合与 recordProviderSessionRef / consumeProviderSessionRef 方法;选择器在 ensure 成功且 checkResult.referenced 为真时记录会话引用。
Rate limit & Redis Lua
src/lib/rate-limit/service.ts, src/lib/redis/lua-scripts.ts
checkAndTrackProviderSession 返回包含 referenced 字段,改为使用两个 Redis key(active_sessions 与 active_session_refs);新增 releaseProviderSession(providerId, sessionId),并新增 RELEASE_PROVIDER_SESSION Lua 脚本以按会话引用递减并在需要时从 ZSET/HASH 中移除。
Session tracking/termination
src/lib/session-tracker.ts, src/lib/session-manager.ts
会话更新/批量统计与终止清理扩展为同时维护与删除 provider:<id>:active_session_refs 的引用条目并在清理流程中删除过期 refs。
Unit tests — rate limit & session release
tests/unit/lib/rate-limit/provider-session-release.test.ts, tests/unit/lib/rate-limit/service-extra.test.ts
新增 releaseProviderSession 测试覆盖成功、引用残留、Redis 不可用、无效输入与 Redis 错误日志;更新 checkAndTrackProviderSession 测试以断言新增的 referenced 语义与新的 Lua 调用签名。
Unit tests — proxy / hedge
tests/unit/proxy/proxy-forwarder-provider-session-release.test.ts, tests/unit/proxy/proxy-forwarder-hedge-first-byte.test.ts
markProviderFailed 添加去重与 sessionId 有/无场景测试;hedge 测试新增断言确保释放失败/中止/被跳过提供商的会话(例如 "sess-hedge"),并验证并发受限时跳过提供商行为。
Unit tests — session tracker/manager mocks
tests/unit/lib/session-tracker-cleanup.test.ts, tests/unit/lib/session-manager-terminate-session.test.ts
扩展测试 Redis mock(新增 hdel, zrangebyscore 等)以支持 refs 清理与终止路径的 pipeline 调用。

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed 标题清晰准确地总结了主要变更:即时释放失败提供者的会话,这正是整个PR的核心目标。
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed PR描述清晰完整,详细说明了问题、解决方案、修改内容、测试覆盖和验证步骤,与代码变更高度相关。

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/redis-provider-failure-20260425

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to immediately release provider-level active sessions when a provider fails, preventing 'outage storms' from inflating concurrency counts. It adds a releaseProviderSession method to the RateLimitService and integrates it into the ProxyForwarder via a new markProviderFailed helper. Feedback suggests converting this operation to a fire-and-forget pattern to ensure that Redis latency or failures do not block the critical provider failover logic.

Comment thread src/app/v1/_lib/proxy/forwarder.ts Outdated
Comment on lines +4262 to +4274
private static async markProviderFailed(
session: ProxySession,
failedProviderIds: number[],
providerId: number
): Promise<void> {
failedProviderIds.push(providerId);

if (!session.sessionId) {
return;
}

await RateLimitService.releaseProviderSession(providerId, session.sessionId);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For better performance during provider failover, consider making this a fire-and-forget operation. The releaseProviderSession call to Redis doesn't need to block the provider switching logic, especially during high load or Redis latency spikes. The releaseProviderSession method is designed to be safe for this as it handles its own errors.

By making this method synchronous and using void for the async call, you can improve the resilience of the failover process.

Note: Applying this suggestion will require removing await from all call sites of markProviderFailed.

Suggested change
private static async markProviderFailed(
session: ProxySession,
failedProviderIds: number[],
providerId: number
): Promise<void> {
failedProviderIds.push(providerId);
if (!session.sessionId) {
return;
}
await RateLimitService.releaseProviderSession(providerId, session.sessionId);
}
private static markProviderFailed(
session: ProxySession,
failedProviderIds: number[],
providerId: number
): void {
failedProviderIds.push(providerId);
if (!session.sessionId) {
return;
}
// Fire-and-forget to avoid blocking failover on this Redis command.
void RateLimitService.releaseProviderSession(providerId, session.sessionId);
}

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a9fcbd6: markProviderFailed is now synchronous and uses fire-and-forget releaseProviderSession, so Redis cleanup no longer blocks provider failover.

@github-actions github-actions Bot added the size/M Medium PR (< 500 lines) label Apr 25, 2026
@github-actions
Copy link
Copy Markdown
Contributor

🧪 测试结果

测试类型 状态
代码质量
单元测试
集成测试
API 测试

总体结果: ✅ 所有测试通过

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 4262-4274: The helper markProviderFailed currently adds providerId
to failedProviderIds and releases the session via
RateLimitService.releaseProviderSession but is not invoked from the streaming
hedge path; update sendStreamingWithHedge so that whenever the first attempt's
provider fails or an alternative provider wins, it calls
markProviderFailed(session, failedProviderIds, providerId) (using the same
ProxySession and failedProviderIds passed into sendStreamingWithHedge) before
switching providers; ensure both the failure branch and the "alternative
provider won" completion path in sendStreamingWithHedge use this helper so the
original provider's active session is released and no stale
provider:{id}:active_sessions entries remain.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b5759987-5710-464d-972b-2096ee18b76a

📥 Commits

Reviewing files that changed from the base of the PR and between 1381474 and f454dcd.

📒 Files selected for processing (4)
  • src/app/v1/_lib/proxy/forwarder.ts
  • src/lib/rate-limit/service.ts
  • tests/unit/lib/rate-limit/provider-session-release.test.ts
  • tests/unit/proxy/proxy-forwarder-provider-session-release.test.ts

Comment thread src/app/v1/_lib/proxy/forwarder.ts Outdated
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

No significant issues identified in this PR.

PR Size: M

  • Lines changed: 203
  • Files changed: 4

Review Coverage

  • Logic and correctness - Clean
  • Security (OWASP Top 10) - Clean
  • Error handling - Clean
  • Type safety - Clean
  • Documentation accuracy - Clean
  • Test coverage - Adequate
  • Code clarity - Good

Automated review by Codex AI

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Reviewed PR #1108 (fix(proxy): release failed provider sessions) diff across 4 files (src/app/v1/_lib/proxy/forwarder.ts, src/lib/rate-limit/service.ts, and 2 new unit tests).
  • Applied PR size label size/M (203 lines changed, 4 files).
  • No diff-line issues met the reporting threshold, so no inline review comments were posted.
  • Submitted the required summary review comment via gh pr review (visible on the PR).

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

No significant issues identified in this PR. The implementation correctly addresses the Redis ZSET churn problem by immediately releasing provider sessions when providers fail, preventing outage storms from leaving failed sessions resident until TTL cleanup.

PR Size: M

  • Lines changed: 203 (196 additions, 7 deletions)
  • Files changed: 4

Review Coverage

  • Logic and correctness - Clean
  • Security (OWASP Top 10) - Clean
  • Error handling - Clean
  • Type safety - Clean
  • Documentation accuracy - Clean
  • Test coverage - Adequate
  • Code clarity - Good

Implementation Highlights

RateLimitService.releaseProviderSession() (src/lib/rate-limit/service.ts:859-886)

  • Properly validates inputs (providerId > 0, non-empty sessionId)
  • Defensively checks Redis readiness before operation
  • Correctly uses zrem to remove session from provider ZSET
  • Errors are logged but not thrown (appropriate for cleanup operations)
  • JSDoc comment accurately explains the purpose

ProxyForwarder.markProviderFailed() (src/app/v1/_lib/proxy/forwarder.ts)

  • Replaces 7 direct failedProviderIds.push() calls consistently
  • Checks for null sessionId before attempting Redis release
  • Proper async/await handling without blocking failover flow

Test Coverage

  • Unit tests cover: successful release, Redis unavailable, invalid inputs, error handling
  • Integration tests cover: session release on failure, null sessionId handling

Notes

  • The early-throw path (shouldSkipRawRetryAndProviderSwitch) intentionally does not call markProviderFailed as the request terminates immediately rather than falling back; session cleanup will occur via TTL.
  • Async overhead in markProviderFailed is acceptable as it only occurs during failover scenarios.

Automated review by Claude AI

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a9fcbd6225

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

return;
}

void RateLimitService.releaseProviderSession(providerId, session.sessionId);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve provider session membership for concurrent requests

markProviderFailed now always calls releaseProviderSession, which does a raw ZREM by sessionId; because provider concurrency is tracked as one ZSET member per session (via checkAndTrackProviderSession), this removes the provider session for all in-flight requests sharing that session. If one concurrent request fails over while another request from the same session is still actively using that provider, the active session count is under-reported and new sessions can be admitted past the configured provider concurrency limit until the next refresh/retrack.

Useful? React with 👍 / 👎.

@github-actions
Copy link
Copy Markdown
Contributor

🧪 测试结果

测试类型 状态
代码质量
单元测试
集成测试
API 测试

总体结果: ✅ 所有测试通过

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/app/v1/_lib/proxy/forwarder.ts (1)

3442-3450: ⚠️ Potential issue | 🟠 Major

abortAttempt 只在 hedge loser 分支释放 provider session

这里的清理条件太窄了。abortAllAttempts() 还会在 Line 3546 用 "hedge_launch_failed"、在 Line 4013 用 "client_abort" 终止所有 in-flight attempt,但这两条路径都不会走 markProviderFailed(),因此对应的 provider:{id}:active_sessions 仍会残留到 TTL。这样在 selector 故障或客户端提前断开时,新的并发计数膨胀问题还会继续出现。

建议修改
       attempts.delete(attempt);
-      if (reason === "hedge_loser") {
+      const shouldReleaseProviderSession =
+        reason === "hedge_loser" ||
+        reason === "hedge_launch_failed" ||
+        reason === "client_abort";
+
+      if (reason === "hedge_loser") {
         session.addProviderToChain(attempt.provider, {
           ...attempt.endpointAudit,
           reason: "hedge_loser_cancelled",
           attemptNumber: attempt.sequence,
           modelRedirect: getAttemptModelRedirect(attempt),
         });
-        ProxyForwarder.markProviderFailed(session, failedProviderIds, attempt.provider.id);
+      }
+
+      if (shouldReleaseProviderSession) {
+        ProxyForwarder.markProviderFailed(session, failedProviderIds, attempt.provider.id);
       }

Also applies to: 3543-3547, 3994-4014

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 3442 - 3450, The abort logic
only calls ProxyForwarder.markProviderFailed(...) when reason === "hedge_loser",
leaving provider:{id}:active_sessions stuck for other abort reasons
("hedge_launch_failed", "client_abort"); update abortAttempt/abortAllAttempts to
always perform the same provider-session cleanup: after
session.addProviderToChain(...) (or wherever attempts are terminated) call
ProxyForwarder.markProviderFailed(session, failedProviderIds,
attempt.provider.id) (or collect attempt.provider.id into failedProviderIds and
call markProviderFailed once for each aborted attempt) so every aborted attempt
path (including hedge_launch_failed and client_abort) decrements/cleans provider
active session state consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 3442-3450: The abort logic only calls
ProxyForwarder.markProviderFailed(...) when reason === "hedge_loser", leaving
provider:{id}:active_sessions stuck for other abort reasons
("hedge_launch_failed", "client_abort"); update abortAttempt/abortAllAttempts to
always perform the same provider-session cleanup: after
session.addProviderToChain(...) (or wherever attempts are terminated) call
ProxyForwarder.markProviderFailed(session, failedProviderIds,
attempt.provider.id) (or collect attempt.provider.id into failedProviderIds and
call markProviderFailed once for each aborted attempt) so every aborted attempt
path (including hedge_launch_failed and client_abort) decrements/cleans provider
active session state consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 78251726-0608-47e1-aac4-96d51fcb3ce0

📥 Commits

Reviewing files that changed from the base of the PR and between f454dcd and a9fcbd6.

📒 Files selected for processing (3)
  • src/app/v1/_lib/proxy/forwarder.ts
  • tests/unit/proxy/proxy-forwarder-hedge-first-byte.test.ts
  • tests/unit/proxy/proxy-forwarder-provider-session-release.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/proxy/proxy-forwarder-provider-session-release.test.ts

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
@github-actions
Copy link
Copy Markdown
Contributor

🧪 测试结果

测试类型 状态
代码质量
单元测试
集成测试
API 测试

总体结果: ✅ 所有测试通过

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5375ce4eb5

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 831 to 833
const key = `provider:${providerId}:active_sessions`;
const refKey = `provider:${providerId}:active_session_refs`;
const now = Date.now();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Hash-tag provider keys before multi-key Lua eval

checkAndTrackProviderSession now executes Lua with two keys (provider:${providerId}:active_sessions and provider:${providerId}:active_session_refs) that do not share a hash tag, so Redis Cluster deployments will return CROSSSLOT for this eval. Because this method catches the error and fail-opens, provider concurrent-session limits are silently bypassed (and the paired release script fails similarly), which can overload providers in cluster-backed environments.

Useful? React with 👍 / 👎.

Comment on lines +73 to +75
if not is_tracked or existing_refs > 0 then
redis.call('HINCRBY', ref_key, session_id, 1)
referenced = 1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Stop ref-count growth across successful provider requests

The Lua tracking script increments active_session_refs for every request once a ref exists, but refs are only decremented on markProviderFailed/releaseProviderSession; successful requests never decrement. In a normal session that has successes before a failover, this causes ref counts to accumulate indefinitely, so a later failure decrements by only one and leaves the failed provider session resident in active_sessions, keeping concurrency counts inflated and defeating the intended immediate cleanup.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (5)
src/lib/rate-limit/service.ts (1)

877-910: releaseProviderSession 在 Redis 暂时不可用时是静默 no-op,建议确认是否可接受。

redis.status !== "ready" 时直接 return,不抛错也不记录任何日志(连 debug/warn 都没有)。这意味着如果 provider 被标记为失败时 Redis 恰好处于断连/重连阶段,对应的 active session 既不会被立即释放,也不会留下任何观测线索,只能等待 ZSET TTL 自然过期。

考虑到本 PR 的目标恰恰是“避免 lingering sessions 等到 TTL”,建议在该分支至少打一条 logger.warn,便于后续排查为何释放未执行:

📝 建议改动
   const redis = RateLimitService.redis;
   if (!redis || redis.status !== "ready") {
+    logger.warn("[RateLimit] Redis not ready, skip releaseProviderSession", {
+      providerId,
+      sessionId,
+    });
     return;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/rate-limit/service.ts` around lines 877 - 910, The function
releaseProviderSession silently returns when Redis is not ready; change it to
log a warning so failures to release sessions are observable: inside
RateLimitService.releaseProviderSession, detect the early-return branch where
RateLimitService.redis is missing or redis.status !== "ready" and add a
logger.warn (including providerId and sessionId) before returning (do not
convert this into throwing an error unless intended). Ensure the warning message
clearly states Redis is unavailable and session release was skipped so operators
can trace lingering sessions.
tests/unit/lib/session-manager-terminate-session.test.ts (1)

30-30: 新增了 hdel mock,但没有断言 terminateSession 对 active_session_refs 的清理行为。

本 PR 在 SessionManager.terminateSession 中新增了 pipeline.hdel(\provider:${providerId}:active_session_refs`, sessionId)(src/lib/session-manager.ts L2436)。当前测试只在 mock 上加了 hdel占位,但没有任何expect(pipelineRef.hdel).toHaveBeenCalledWith(...)`,这会让该新清理路径处于无测试覆盖状态,未来回归不易被发现。

♻️ 建议在已有用例中追加断言
     expect(pipelineRef.zrem).toHaveBeenCalledWith("provider:42:active_sessions", sessionId);
+    expect(pipelineRef.hdel).toHaveBeenCalledWith(
+      "provider:42:active_session_refs",
+      sessionId
+    );
     expect(pipelineRef.zrem).toHaveBeenCalledWith(getKeyActiveSessionsKey(7), sessionId);

Also applies to: 42-67

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/lib/session-manager-terminate-session.test.ts` at line 30, 测试新增了
pipeline.hdel 的 mock,但没有断言 SessionManager.terminateSession 对 active_session_refs
的清理行为;请在对应单元测试(tests/unit/lib/session-manager-terminate-session.test.ts)中为
pipelineRef.hdel 增加断言,验证 terminateSession 调用时 pipelineRef.hdel 被以正确参数调用(即键
"provider:${providerId}:active_session_refs" 和 sessionId),使用
expect(pipelineRef.hdel).toHaveBeenCalledWith(...) 或等效匹配以覆盖该清理路径。
src/lib/redis/lua-scripts.ts (1)

23-27: 注释中拒绝路径未提及 referenced=0,建议补全契约说明。

第 27 行只描述返回值含义为 tracked=0,但实际返回元组是 {0, count, 0, 0},最后一项也是 referenced=0。为保持与其他三种返回情况一致,建议在注释中明确这一点,避免后续维护者误解第四个返回值的含义。

📝 建议改动
- * - {0, count, 0, 0} - rejected (limit reached), returns current count and tracked=0
+ * - {0, count, 0, 0} - rejected (limit reached), returns current count, tracked=0, referenced=0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/redis/lua-scripts.ts` around lines 23 - 27, Update the return-value
comment block that documents the Lua script's tuple responses so the rejection
case explicitly notes referenced=0; specifically, change the fourth bullet from
"{0, count, 0, 0} - rejected (limit reached), returns current count and
tracked=0" to explicitly state referenced=0 as well (e.g. "{0, count, 0, 0} -
rejected (limit reached), returns current count, tracked=0, referenced=0") so
the documented contract for the tuple {flag, count, tracked, referenced} is
complete and consistent with the other bullets.
src/lib/session-tracker.ts (1)

439-492: 两阶段引用清理的索引/分支判断正确,但建议留意双 pipeline 顺序。

将每个 provider 的清理命令从 2 条扩为 3 条(zrangebyscore + zremrangebyscore + zrange),且使用 i * 3 / i * 3 + 2 取偏移,与 pipeline 顺序一致;hasRefCleanup 用于跳过空 pipeline 的 exec() 也合理。同样地,refCleanupPipeline.exec() 与上一个 cleanupPipeline 不在同一 round-trip 内,存在与上一处相同的小窗口(被 zremrangebyscore 标记为过期但被并发刷新的 sessionId 仍会在此被 HDEL)。本路径属于批量计数热路径,若评估到对统计准确性敏感,建议结合 Lua 合并这两个 pipeline。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/session-tracker.ts` around lines 439 - 492, The current two-pipeline
approach (cleanupPipeline and refCleanupPipeline) leaves a small race where a
session removed by zremrangebyscore can be concurrently refreshed and still get
HDEL'd; fix by combining the per-provider cleanup and ref deletion into a single
atomic Redis script/command so zrangebyscore, zremrangebyscore, zrange (and hdel
of expired IDs) run in one round-trip. Replace the loop that builds
cleanupPipeline + refCleanupPipeline with a single EVAL (or multi-key Lua)
invocation that, for each providerId from providerIds, performs
zrangebyscore(key, "-inf", cutoffMs) to collect expired IDs,
zremrangebyscore(key, "-inf", cutoffMs) to remove them, zrange(key, 0, -1) to
get remaining sessions, and hdel("provider:{providerId}:active_session_refs",
expiredIds...) to remove their refs, and return both expired and remaining lists
so you can populate expiredProviderSessions, providerSessionMap, and
allSessionIds from the script result atomically.
tests/unit/proxy/proxy-forwarder-provider-session-release.test.ts (1)

25-111: 测试场景对 markProviderFailed 的四个分支覆盖完整。

consumeProviderSessionRef → true/false、重复标记去重、sessionId === null 都覆盖到了,且每个场景都同时断言 failedProviderIdsconsumeProviderSessionRef 调用次数和 releaseProviderSession 是否被调用,能很好地兜住该 helper 的契约。一个可选的补充用例(不强求):在 sessionId 存在但 consumeProviderSessionRef 方法本身缺失(比如运行时 session 未升级到新接口)时,验证不抛异常——这正好对应 forwarder 中那段 as { consumeProviderSessionRef?: ... } 防御式断言所要保护的场景。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/proxy/proxy-forwarder-provider-session-release.test.ts` around
lines 25 - 111, Tests already cover four branches of markProviderFailed; add one
more test case to ensure it doesn't throw when session lacks
consumeProviderSessionRef: import ProxyForwarder, call
forwarderInternals.markProviderFailed with session = { sessionId:
"sess_no_consume" } cast as ProxySession (no consumeProviderSessionRef),
failedProviderIds = [], and providerId like 42, then assert failedProviderIds
contains 42 and mocks.releaseProviderSession behavior matches existing logic
(should be not called because no consume ref or only called when sessionId
present and consume returns true). This validates the defensive optional
signature for consumeProviderSessionRef in markProviderFailed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 4291-4315: The code in markProviderFailed uses a defensive type
assertion to access consumeProviderSessionRef but ProxySession already exposes
that method; remove the cast and call the method directly (mirror the style used
for session.recordProviderSessionRef), e.g. replace the casted
providerSessionRefConsumer usage with a direct invocation of
session.consumeProviderSessionRef(providerId), keep the same null/boolean check
and the subsequent call to
RateLimitService.releaseProviderSession(session.sessionId) when it returns true.

In `@src/lib/redis/lua-scripts.ts`:
- Around line 103-122: The RELEASE_PROVIDER_SESSION Lua script currently returns
early when HGET(ref_key, session_id) is missing or zero, leaving ZSET members
orphaned; modify RELEASE_PROVIDER_SESSION so that when current_refs <= 0 it
still performs redis.call('ZREM', provider_key, session_id) (remove the member
from the ZSET) and then return {removed, 0}; keep the existing logic for
remaining_refs > 0 and for the HDEL path, but ensure the early-return case does
ZREM as a fallback cleanup so legacy ZSET-only sessions are removed atomically
by the script.

In `@src/lib/session-tracker.ts`:
- Around line 257-267: The post-pipeline redis.hdel call (after pipeline.exec())
can race with concurrent refreshSession/checkAndTrackProviderSession updates and
produce ZSET refs/Hashmap inconsistencies; move the hdel into the same Redis
pipeline or combine the read/remove/hdel logic into a single atomic operation
(e.g., include the hdel in the same pipeline used around
cleanupExpiredSessionsResultIndex and providerRefKey or replace the sequence
with a Lua script) so that expiredResult processing
(cleanupExpiredSessionsResultIndex -> expiredSessionIds -> redis.hdel) becomes
atomic; apply the same fix pattern to countFromZSet where zrangebyscore,
zremrangebyscore and hdel are separate calls.

---

Nitpick comments:
In `@src/lib/rate-limit/service.ts`:
- Around line 877-910: The function releaseProviderSession silently returns when
Redis is not ready; change it to log a warning so failures to release sessions
are observable: inside RateLimitService.releaseProviderSession, detect the
early-return branch where RateLimitService.redis is missing or redis.status !==
"ready" and add a logger.warn (including providerId and sessionId) before
returning (do not convert this into throwing an error unless intended). Ensure
the warning message clearly states Redis is unavailable and session release was
skipped so operators can trace lingering sessions.

In `@src/lib/redis/lua-scripts.ts`:
- Around line 23-27: Update the return-value comment block that documents the
Lua script's tuple responses so the rejection case explicitly notes
referenced=0; specifically, change the fourth bullet from "{0, count, 0, 0} -
rejected (limit reached), returns current count and tracked=0" to explicitly
state referenced=0 as well (e.g. "{0, count, 0, 0} - rejected (limit reached),
returns current count, tracked=0, referenced=0") so the documented contract for
the tuple {flag, count, tracked, referenced} is complete and consistent with the
other bullets.

In `@src/lib/session-tracker.ts`:
- Around line 439-492: The current two-pipeline approach (cleanupPipeline and
refCleanupPipeline) leaves a small race where a session removed by
zremrangebyscore can be concurrently refreshed and still get HDEL'd; fix by
combining the per-provider cleanup and ref deletion into a single atomic Redis
script/command so zrangebyscore, zremrangebyscore, zrange (and hdel of expired
IDs) run in one round-trip. Replace the loop that builds cleanupPipeline +
refCleanupPipeline with a single EVAL (or multi-key Lua) invocation that, for
each providerId from providerIds, performs zrangebyscore(key, "-inf", cutoffMs)
to collect expired IDs, zremrangebyscore(key, "-inf", cutoffMs) to remove them,
zrange(key, 0, -1) to get remaining sessions, and
hdel("provider:{providerId}:active_session_refs", expiredIds...) to remove their
refs, and return both expired and remaining lists so you can populate
expiredProviderSessions, providerSessionMap, and allSessionIds from the script
result atomically.

In `@tests/unit/lib/session-manager-terminate-session.test.ts`:
- Line 30: 测试新增了 pipeline.hdel 的 mock,但没有断言 SessionManager.terminateSession 对
active_session_refs
的清理行为;请在对应单元测试(tests/unit/lib/session-manager-terminate-session.test.ts)中为
pipelineRef.hdel 增加断言,验证 terminateSession 调用时 pipelineRef.hdel 被以正确参数调用(即键
"provider:${providerId}:active_session_refs" 和 sessionId),使用
expect(pipelineRef.hdel).toHaveBeenCalledWith(...) 或等效匹配以覆盖该清理路径。

In `@tests/unit/proxy/proxy-forwarder-provider-session-release.test.ts`:
- Around line 25-111: Tests already cover four branches of markProviderFailed;
add one more test case to ensure it doesn't throw when session lacks
consumeProviderSessionRef: import ProxyForwarder, call
forwarderInternals.markProviderFailed with session = { sessionId:
"sess_no_consume" } cast as ProxySession (no consumeProviderSessionRef),
failedProviderIds = [], and providerId like 42, then assert failedProviderIds
contains 42 and mocks.releaseProviderSession behavior matches existing logic
(should be not called because no consume ref or only called when sessionId
present and consume returns true). This validates the defensive optional
signature for consumeProviderSessionRef in markProviderFailed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3e068c22-192f-470f-b18f-cbd885a3f8d9

📥 Commits

Reviewing files that changed from the base of the PR and between a9fcbd6 and 5375ce4.

📒 Files selected for processing (13)
  • src/app/v1/_lib/proxy/forwarder.ts
  • src/app/v1/_lib/proxy/provider-selector.ts
  • src/app/v1/_lib/proxy/session.ts
  • src/lib/rate-limit/service.ts
  • src/lib/redis/lua-scripts.ts
  • src/lib/session-manager.ts
  • src/lib/session-tracker.ts
  • tests/unit/lib/rate-limit/provider-session-release.test.ts
  • tests/unit/lib/rate-limit/service-extra.test.ts
  • tests/unit/lib/session-manager-terminate-session.test.ts
  • tests/unit/lib/session-tracker-cleanup.test.ts
  • tests/unit/proxy/proxy-forwarder-hedge-first-byte.test.ts
  • tests/unit/proxy/proxy-forwarder-provider-session-release.test.ts

Comment on lines +4291 to +4315
private static markProviderFailed(
session: ProxySession,
failedProviderIds: number[],
providerId: number
): void {
if (failedProviderIds.includes(providerId)) {
return;
}

failedProviderIds.push(providerId);

if (!session.sessionId) {
return;
}

const providerSessionRefConsumer = (
session as { consumeProviderSessionRef?: (providerId: number) => boolean }
).consumeProviderSessionRef;

if (!providerSessionRefConsumer?.call(session, providerId)) {
return;
}

void RateLimitService.releaseProviderSession(providerId, session.sessionId);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify ProxySession exposes recordProviderSessionRef and consumeProviderSessionRef on its public type.
fd -t f 'session.ts' src/app/v1/_lib/proxy/ --exec rg -nP -C2 'recordProviderSessionRef|consumeProviderSessionRef|providerSessionRefs' {}

Repository: ding113/claude-code-hub

Length of output: 842


统一使用直接方法调用而非防御式类型断言

验证表明 recordProviderSessionRefconsumeProviderSessionRef 两个方法都已在 ProxySession 中公开定义(分别在 session.ts 的第 320 和 330 行),因此 forwarder.ts 第 4306-4310 行的防御式类型断言:

const providerSessionRefConsumer = (
  session as { consumeProviderSessionRef?: (providerId: number) => boolean }
).consumeProviderSessionRef;

是冗余的。建议统一使用直接调用方式,保持与第 3954 行 session.recordProviderSessionRef(provider.id) 的风格一致。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 4291 - 4315, The code in
markProviderFailed uses a defensive type assertion to access
consumeProviderSessionRef but ProxySession already exposes that method; remove
the cast and call the method directly (mirror the style used for
session.recordProviderSessionRef), e.g. replace the casted
providerSessionRefConsumer usage with a direct invocation of
session.consumeProviderSessionRef(providerId), keep the same null/boolean check
and the subsequent call to
RateLimitService.releaseProviderSession(session.sessionId) when it returns true.

Comment on lines +103 to +122
export const RELEASE_PROVIDER_SESSION = `
local provider_key = KEYS[1]
local ref_key = KEYS[2]
local session_id = ARGV[1]

local current_refs = tonumber(redis.call('HGET', ref_key, session_id) or '0')
if current_refs <= 0 then
return {0, 0}
end

local remaining_refs = current_refs - 1
if remaining_refs > 0 then
redis.call('HSET', ref_key, session_id, remaining_refs)
return {0, remaining_refs}
end

redis.call('HDEL', ref_key, session_id)
local removed = redis.call('ZREM', provider_key, session_id)
return {removed, remaining_refs}
`;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

遗留会员(在 ZSET 中但无引用计数)无法被 RELEASE_PROVIDER_SESSION 主动释放。

HGET ref_key session_id 返回 0 或缺失时,脚本直接 return {0, 0},不会执行 ZREM provider_key。这意味着部署本 PR 之前由旧脚本写入的活跃 session(即 ZSET 里有 member 但 refs hash 中没有计数)在 provider 被标记为失败时无法被立即清理,只能等待 TTL 过期。

部署过渡期内,这些遗留会员会让供应商的 active_sessions 计数虚高,与本 PR 试图修复的“lingering sessions 直到 TTL”问题部分相同。建议:

  • 方案 A(推荐):当 current_refs <= 0 时仍执行 ZREM provider_key,作为兜底清理;将语义改为“无引用 → 直接移除 ZSET 会员”。
  • 方案 B:保留当前行为,但在 RateLimitService.releaseProviderSession 中检测到 removed=0 && remainingRefs=0 时,额外调用一次 ZREM 兜底(注意非原子)。
♻️ 建议改动(方案 A)
 local current_refs = tonumber(redis.call('HGET', ref_key, session_id) or '0')
 if current_refs <= 0 then
-  return {0, 0}
+  -- Legacy member without ref tracking (created before ref-count rollout):
+  -- best-effort remove from ZSET so eager release path still works during migration.
+  local removed_legacy = redis.call('ZREM', provider_key, session_id)
+  return {removed_legacy, 0}
 end
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
export const RELEASE_PROVIDER_SESSION = `
local provider_key = KEYS[1]
local ref_key = KEYS[2]
local session_id = ARGV[1]
local current_refs = tonumber(redis.call('HGET', ref_key, session_id) or '0')
if current_refs <= 0 then
return {0, 0}
end
local remaining_refs = current_refs - 1
if remaining_refs > 0 then
redis.call('HSET', ref_key, session_id, remaining_refs)
return {0, remaining_refs}
end
redis.call('HDEL', ref_key, session_id)
local removed = redis.call('ZREM', provider_key, session_id)
return {removed, remaining_refs}
`;
export const RELEASE_PROVIDER_SESSION = `
local provider_key = KEYS[1]
local ref_key = KEYS[2]
local session_id = ARGV[1]
local current_refs = tonumber(redis.call('HGET', ref_key, session_id) or '0')
if current_refs <= 0 then
-- Legacy member without ref tracking (created before ref-count rollout):
-- best-effort remove from ZSET so eager release path still works during migration.
local removed_legacy = redis.call('ZREM', provider_key, session_id)
return {removed_legacy, 0}
end
local remaining_refs = current_refs - 1
if remaining_refs > 0 then
redis.call('HSET', ref_key, session_id, remaining_refs)
return {0, remaining_refs}
end
redis.call('HDEL', ref_key, session_id)
local removed = redis.call('ZREM', provider_key, session_id)
return {removed, remaining_refs}
`;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/redis/lua-scripts.ts` around lines 103 - 122, The
RELEASE_PROVIDER_SESSION Lua script currently returns early when HGET(ref_key,
session_id) is missing or zero, leaving ZSET members orphaned; modify
RELEASE_PROVIDER_SESSION so that when current_refs <= 0 it still performs
redis.call('ZREM', provider_key, session_id) (remove the member from the ZSET)
and then return {removed, 0}; keep the existing logic for remaining_refs > 0 and
for the HDEL path, but ensure the early-return case does ZREM as a fallback
cleanup so legacy ZSET-only sessions are removed atomically by the script.

Comment on lines +257 to +267
if (cleanupExpiredSessionsResultIndex !== null && results) {
const expiredResult = results[cleanupExpiredSessionsResultIndex];
if (!expiredResult?.[0] && Array.isArray(expiredResult?.[1])) {
const expiredSessionIds = expiredResult[1].filter(
(value): value is string => typeof value === "string" && value.length > 0
);
if (expiredSessionIds.length > 0) {
await redis.hdel(providerRefKey, ...expiredSessionIds);
}
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

post-pipeline hdel 与并发 zadd 之间存在小窗口的引用一致性偏差。

pipeline.exec() 完成后,再发起 redis.hdel(providerRefKey, ...expiredSessionIds) 是非原子的:在两次 round-trip 之间,并发的 refreshSession/checkAndTrackProviderSession 可能用更新的分数将其中某个被判过期的 sessionId 重新写回 ZSET(zremrangebyscore 不会删除它),但本次 hdel 仍会按缓存的 ID 列表删除其在 refs 中的引用计数,从而出现 “ZSET 里有该 sessionId、refs 里没有” 的孤儿记录。后续 releaseProviderSession 命中该会话时,引用计数会下溢。

考虑到清理只有 1% 的触发概率、时间窗很小,且属于尽力而为的清理,影响范围有限;但若希望彻底消除该窗口,可把 hdel 也合并进同一 pipeline(或封装为单条 Lua),以保证读取/删除/HDEL 的语义一致。countFromZSet 在行 602-609 的写法存在同样的窗口(zrangebyscorezremrangebyscorehdel 是三次独立调用)。

♻️ 可选:将 `hdel` 合并至同一 pipeline
       if (cleanupExpiredSessionsResultIndex !== null && results) {
         const expiredResult = results[cleanupExpiredSessionsResultIndex];
         if (!expiredResult?.[0] && Array.isArray(expiredResult?.[1])) {
           const expiredSessionIds = expiredResult[1].filter(
             (value): value is string => typeof value === "string" && value.length > 0
           );
           if (expiredSessionIds.length > 0) {
-            await redis.hdel(providerRefKey, ...expiredSessionIds);
+            // 合并到下一次 refresh 的 pipeline,或使用 Lua 脚本完成 zrange/zrem/hdel 的原子清理。
+            await redis.hdel(providerRefKey, ...expiredSessionIds);
           }
         }
       }

更彻底的做法是将 zrange/zrem/hdel 合并为一段 Lua(与 getProviderSessionCountBatch 共用)。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/session-tracker.ts` around lines 257 - 267, The post-pipeline
redis.hdel call (after pipeline.exec()) can race with concurrent
refreshSession/checkAndTrackProviderSession updates and produce ZSET
refs/Hashmap inconsistencies; move the hdel into the same Redis pipeline or
combine the read/remove/hdel logic into a single atomic operation (e.g., include
the hdel in the same pipeline used around cleanupExpiredSessionsResultIndex and
providerRefKey or replace the sequence with a Lua script) so that expiredResult
processing (cleanupExpiredSessionsResultIndex -> expiredSessionIds ->
redis.hdel) becomes atomic; apply the same fix pattern to countFromZSet where
zrangebyscore, zremrangebyscore and hdel are separate calls.

@ding113 ding113 merged commit 331e6b7 into dev Apr 25, 2026
17 checks passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in Claude Code Hub Roadmap Apr 25, 2026
@github-actions github-actions Bot mentioned this pull request Apr 26, 2026
@ding113 ding113 deleted the fix/redis-provider-failure-20260425 branch May 13, 2026 06:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:provider area:Rate Limit bug Something isn't working size/M Medium PR (< 500 lines)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

1 participant