优化公开状态页 Redis 聚合与轮询性能#1211
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthrough迁移公共状态聚合到按时间桶 rollup 存储并引入 v2 Redis schema 与 legacy 回退;提取并复用模型厂商规则;更新前端时间线交互与轮询复用逻辑;在消息仓库添加 rollup 事件入队与去重缓存;大量单测覆盖。 ChangesModel vendor system extraction
Public status rollup aggregation system
Message repository rollup event integration
Frontend component updates for timeline state management
Comprehensive test coverage for rollup system and updates
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review Summary
This XL PR redesigns the public status page data pipeline from DB-scanned aggregation to Redis-based incremental rollup (v2 namespace), with a backward-compatible v1 fallback during upgrade. The frontend reduces polling payload size by excluding timeline data and reusing local state. The implementation is thorough: error handling is consistent (fire-and-forget writes use catch+log, reads use safe fallbacks), the v1/v2 migration path is well-structured with stale markers and rebuild hints, and new functionality has dedicated test coverage.
PR Size: XL
- Lines changed: 2985 (2640 additions + 345 deletions)
- Files changed: 26
- Split suggestion: This PR could be split into: (1)
model-vendor-rules.tsextraction +aggregation-core.tsrefactor, (2)rollup-store.ts+redis-contract.tsv2 keys +rebuild-worker.tsrollup integration, (3)message.tsrollup hooks +read-store.tsv1/v2 fallback, (4) frontend polling + timeline changes. However, the v2 rollout depends on all backend pieces working together, so the monolithic approach is understandable.
Issues Found
| Category | Critical | High | Medium | Low |
|---|---|---|---|---|
| Logic/Bugs | 0 | 0 | 0 | 0 |
| Security | 0 | 0 | 0 | 0 |
| Error Handling | 0 | 0 | 0 | 0 |
| Types | 0 | 0 | 0 | 0 |
| Comments/Docs | 0 | 0 | 0 | 0 |
| Tests | 0 | 0 | 0 | 0 |
| Simplification | 0 | 0 | 0 | 0 |
Observations (below reporting threshold)
public-status-view.tsxusesas ViewModelSnapshotcasts to access the injectedtimelineReusedFromPreviousfield. This bypasses type checking but is a pragmatic tradeoff to avoid leaking a view-layer concern into the sharedPublicStatusPayloadtype. (Confidence: 72)- Module-level mutable caches (
publicStatusRequestSeedCacheinmessage.ts,cachedConfiguredGroupsinrollup-store.ts) are bounded by size/TTL respectively and are acceptable in the server-side Next.js context. (Confidence: 65) - Non-pipeline Redis fallback in
writePublicStatusRollupEventcould produce partial writes on individualhincrbyfloatfailures, but errors propagate toqueuePublicStatusRollupWritewhich catches and logs them, and ioredis always supports pipelines in production. (Confidence: 62)
Review Coverage
- Logic and correctness - Clean
- Security (OWASP Top 10) - Clean (no SQL injection vectors introduced, Redis keys use proper encoding, no user input reaches eval/lua scripts)
- Error handling - Clean (all new async paths have catch+log, fire-and-forget writes use void IIFE with try/catch)
- Type safety - Clean (no
anyusage in new code, discriminated unions used for read outcomes) - Documentation accuracy - Clean (v1 comment references updated to version-agnostic)
- Test coverage - Adequate (new rollup-store 324-line test file, new message-rollup 199-line test file, updated tests for all modified modules)
- Code clarity - Good (helper extraction in read-store reduces duplication, single active timeline summary reduces DOM overhead)
Automated review by Claude AI
| const publicStatusRequestSeedCache = new Map<number, PublicStatusRequestSeed>(); | ||
| const PUBLIC_STATUS_REQUEST_SEED_CACHE_MAX_SIZE = 10_000; | ||
|
|
||
| function rememberPublicStatusRequestSeed(id: number, seed: PublicStatusRequestSeed): void { | ||
| publicStatusRequestSeedCache.set(id, seed); | ||
| if (publicStatusRequestSeedCache.size <= PUBLIC_STATUS_REQUEST_SEED_CACHE_MAX_SIZE) { | ||
| return; | ||
| } | ||
|
|
||
| const firstKey = publicStatusRequestSeedCache.keys().next().value as number | undefined; | ||
| if (firstKey !== undefined) { | ||
| publicStatusRequestSeedCache.delete(firstKey); | ||
| } |
There was a problem hiding this comment.
Process-local seed cache silently drops rollup in multi-pod deployments
publicStatusRequestSeedCache is a module-level Map that lives only in the running process. In a horizontally-scaled setup (multiple pods/workers), createMessageRequest and updateMessageRequestDetails can hit different pods; when they do, consumePublicStatusRequestSeed on the second pod finds nothing and the rollup write is silently skipped — no error, no counter, just lost data. The codebase already accepts this tradeoff for enqueueMessageRequestUpdate, but it's worth a comment making the single-process assumption explicit so future scaling work knows to revisit this path.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/repository/message.ts
Line: 38-50
Comment:
**Process-local seed cache silently drops rollup in multi-pod deployments**
`publicStatusRequestSeedCache` is a module-level `Map` that lives only in the running process. In a horizontally-scaled setup (multiple pods/workers), `createMessageRequest` and `updateMessageRequestDetails` can hit different pods; when they do, `consumePublicStatusRequestSeed` on the second pod finds nothing and the rollup write is silently skipped — no error, no counter, just lost data. The codebase already accepts this tradeoff for `enqueueMessageRequestUpdate`, but it's worth a comment making the single-process assumption explicit so future scaling work knows to revisit this path.
How can I resolve this? If you propose a fix, please make it concise.| const coverageStartKey = buildPublicStatusRollupCoverageStartKey(); | ||
| const bucketStartIso = alignBucketStartUtc(createdAtIso, PUBLIC_STATUS_ROLLUP_BUCKET_MINUTES); | ||
| const redis = input.redis ?? getRedisClient({ allowWhenRateLimitDisabled: true }); | ||
| if ( | ||
| !redis || | ||
| ("status" in redis && redis.status && redis.status !== "ready") || | ||
| typeof redis.hincrbyfloat !== "function" | ||
| ) { | ||
| return { written: false, incrementCount: increments.length, key }; | ||
| } | ||
|
|
||
| if (typeof redis.pipeline === "function") { | ||
| const pipeline = redis.pipeline(); | ||
| for (const increment of increments) { | ||
| pipeline.hincrbyfloat(key, buildPublicStatusRollupField(increment), increment.value); | ||
| } | ||
| if (typeof pipeline.set === "function") { | ||
| pipeline.set(coverageStartKey, bucketStartIso, "NX"); | ||
| } | ||
| pipeline.expire(key, ROLLUP_TTL_SECONDS); | ||
| await pipeline.exec(); | ||
| } else { | ||
| for (const increment of increments) { | ||
| const field = buildPublicStatusRollupField(increment); | ||
| await redis.hincrbyfloat(key, field, increment.value); | ||
| } | ||
| if (typeof redis.set === "function") { | ||
| await redis.set(coverageStartKey, bucketStartIso, "NX"); | ||
| } | ||
| await redis.expire?.(key, ROLLUP_TTL_SECONDS); | ||
| } |
There was a problem hiding this comment.
Coverage-start key has no TTL — stale coverage claim after bucket expiry
coverageStartKey is written with NX and no expiry, so it persists indefinitely. Rollup bucket hashes use ROLLUP_TTL_SECONDS = 60 * 60 * 24 * 32 (32 days). If the service is offline for longer than 32 days (or Redis is rebuilt from a backup that lacks recent data), all rollup buckets expire while the coverage-start key remains. The next rebuild calls isRollupCoverageComplete with a coverageStartedAt from 32+ days ago and a coveredFrom only ~24 h ago — the check passes, rollupCoverageComplete is stored as true, and the read side serves an empty v2 snapshot instead of falling back to v1. Setting the same ROLLUP_TTL_SECONDS on this key would keep it in sync with the data it covers.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/lib/public-status/rollup-store.ts
Line: 358-388
Comment:
**Coverage-start key has no TTL — stale coverage claim after bucket expiry**
`coverageStartKey` is written with `NX` and no expiry, so it persists indefinitely. Rollup bucket hashes use `ROLLUP_TTL_SECONDS = 60 * 60 * 24 * 32` (32 days). If the service is offline for longer than 32 days (or Redis is rebuilt from a backup that lacks recent data), all rollup buckets expire while the coverage-start key remains. The next rebuild calls `isRollupCoverageComplete` with a `coverageStartedAt` from 32+ days ago and a `coveredFrom` only ~24 h ago — the check passes, `rollupCoverageComplete` is stored as `true`, and the read side serves an empty v2 snapshot instead of falling back to v1. Setting the same `ROLLUP_TTL_SECONDS` on this key would keep it in sync with the data it covers.
How can I resolve this? If you propose a fix, please make it concise.| return { | ||
| ...rule, | ||
| icon: MODEL_VENDOR_ICON_BY_KEY[rule.i18nKey] ?? OpenAI, | ||
| }; |
There was a problem hiding this comment.
Silent OpenAI fallback for unrecognised
i18nKey could show wrong branding
The original array kept icon and rule together, making divergence impossible. Now that they live in separate files, any vendor added to model-vendor-rules.ts without a matching entry in MODEL_VENDOR_ICON_BY_KEY silently renders the OpenAI logo. A dev-mode console.warn would surface the mismatch early.
| return { | |
| ...rule, | |
| icon: MODEL_VENDOR_ICON_BY_KEY[rule.i18nKey] ?? OpenAI, | |
| }; | |
| const icon = MODEL_VENDOR_ICON_BY_KEY[rule.i18nKey]; | |
| if (!icon && process.env.NODE_ENV !== "production") { | |
| console.warn(`[model-vendor-icons] No icon registered for i18nKey "${rule.i18nKey}"`); | |
| } | |
| return { | |
| ...rule, | |
| icon: icon ?? OpenAI, | |
| }; |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/lib/model-vendor-icons.tsx
Line: 76-79
Comment:
**Silent OpenAI fallback for unrecognised `i18nKey` could show wrong branding**
The original array kept icon and rule together, making divergence impossible. Now that they live in separate files, any vendor added to `model-vendor-rules.ts` without a matching entry in `MODEL_VENDOR_ICON_BY_KEY` silently renders the OpenAI logo. A dev-mode `console.warn` would surface the mismatch early.
```suggestion
const icon = MODEL_VENDOR_ICON_BY_KEY[rule.i18nKey];
if (!icon && process.env.NODE_ENV !== "production") {
console.warn(`[model-vendor-icons] No icon registered for i18nKey "${rule.i18nKey}"`);
}
return {
...rule,
icon: icon ?? OpenAI,
};
```
How can I resolve this? If you propose a fix, please make it concise.| export function buildPublicStatusRollupKey(input: { | ||
| bucketStartIso: string; | ||
| bucketMinutes?: number; | ||
| prefix?: string; | ||
| }): string { | ||
| const bucketMinutes = input.bucketMinutes ?? PUBLIC_STATUS_ROLLUP_BUCKET_MINUTES; | ||
| assertPositiveInteger(bucketMinutes, "bucketMinutes"); | ||
| return [ | ||
| resolvePrefix(input.prefix), | ||
| "rollup", | ||
| `${bucketMinutes}m`, | ||
| encodeKeyPart(alignBucketStartUtc(input.bucketStartIso, bucketMinutes)), | ||
| ].join(":"); |
There was a problem hiding this comment.
alignBucketStartUtc call site lacks documentation of alignment semantics
alignBucketStartUtc is called inside buildPublicStatusRollupKey without a visible definition nearby. An inline comment clarifying that it floors to the nearest bucketMinutes UTC boundary would help readers understand the key structure.
| export function buildPublicStatusRollupKey(input: { | |
| bucketStartIso: string; | |
| bucketMinutes?: number; | |
| prefix?: string; | |
| }): string { | |
| const bucketMinutes = input.bucketMinutes ?? PUBLIC_STATUS_ROLLUP_BUCKET_MINUTES; | |
| assertPositiveInteger(bucketMinutes, "bucketMinutes"); | |
| return [ | |
| resolvePrefix(input.prefix), | |
| "rollup", | |
| `${bucketMinutes}m`, | |
| encodeKeyPart(alignBucketStartUtc(input.bucketStartIso, bucketMinutes)), | |
| ].join(":"); | |
| export function buildPublicStatusRollupKey(input: { | |
| bucketStartIso: string; | |
| bucketMinutes?: number; | |
| prefix?: string; | |
| }): string { | |
| const bucketMinutes = input.bucketMinutes ?? PUBLIC_STATUS_ROLLUP_BUCKET_MINUTES; | |
| assertPositiveInteger(bucketMinutes, "bucketMinutes"); | |
| // alignBucketStartUtc floors the ISO timestamp to the nearest bucketMinutes boundary in UTC. | |
| return [ | |
| resolvePrefix(input.prefix), | |
| "rollup", | |
| `${bucketMinutes}m`, | |
| encodeKeyPart(alignBucketStartUtc(input.bucketStartIso, bucketMinutes)), | |
| ].join(":"); |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/lib/public-status/redis-contract.ts
Line: 147-159
Comment:
**`alignBucketStartUtc` call site lacks documentation of alignment semantics**
`alignBucketStartUtc` is called inside `buildPublicStatusRollupKey` without a visible definition nearby. An inline comment clarifying that it floors to the nearest `bucketMinutes` UTC boundary would help readers understand the key structure.
```suggestion
export function buildPublicStatusRollupKey(input: {
bucketStartIso: string;
bucketMinutes?: number;
prefix?: string;
}): string {
const bucketMinutes = input.bucketMinutes ?? PUBLIC_STATUS_ROLLUP_BUCKET_MINUTES;
assertPositiveInteger(bucketMinutes, "bucketMinutes");
// alignBucketStartUtc floors the ISO timestamp to the nearest bucketMinutes boundary in UTC.
return [
resolvePrefix(input.prefix),
"rollup",
`${bucketMinutes}m`,
encodeKeyPart(alignBucketStartUtc(input.bucketStartIso, bucketMinutes)),
].join(":");
```
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (1)
src/lib/public-status/rollup-store.ts (1)
9-20: ⚡ Quick win把
src/内部导入统一改成@/别名。这个文件里同时混用了
@/和./...。既然这些模块都在src/下,继续保留相对路径会让后续移动文件时更脆弱,也不符合仓库约定。 As per coding guidelines,**/*.{ts,tsx,js,jsx}: Use path alias@/to map to ./src/ for imports.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/public-status/rollup-store.ts` around lines 9 - 20, The imports in rollup-store.ts mix relative paths with the project alias; replace all local-src imports that use "./..." with the "`@/`..." alias so they consistently map to ./src/ — e.g., change imports that bring in computeTokensPerSecond, PublicStatusConfiguredGroup, InternalPublicStatusConfigSnapshot, readCurrentInternalPublicStatusConfigSnapshot, PublicStatusPayload, PublicStatusTimelineBucket, alignBucketStartUtc, buildPublicStatusRollupCoverageStartKey, buildPublicStatusRollupKey, and PUBLIC_STATUS_ROLLUP_BUCKET_MINUTES to use the "`@/`..." path prefix; ensure export names and types are unchanged and tests/build still resolve the updated module specifiers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/app/`[locale]/status/_components/public-status-view.tsx:
- Around line 158-163: deriveCurrentModelState currently falls back to
deriveLatestModelState even when model.timeline is an empty array, which
downgrades newly-polled models to no_data and drops availabilityPct; change
deriveCurrentModelState (and other spots that call deriveLatestModelState for
initial/partial polls) to return model.latestState whenever model.timeline is
empty or model.timelineReusedFromPrevious is true (i.e., if
model.timeline?.length === 0 || model.timelineReusedFromPrevious) so the
server-provided latestState and availabilityPct are preserved instead of
recalculating from a missing timeline.
In `@src/lib/model-vendor-icons.tsx`:
- Line 36: Replace the relative import in model-vendor-icons.tsx with the
project alias: update the import that currently brings in getModelVendor as
getModelVendorRule and type ModelVendorRule from "./model-vendor-rules" to use
"`@/lib/model-vendor-rules`" instead, preserving the same named imports and alias
(getModelVendor as getModelVendorRule) and the type import (ModelVendorRule).
In `@src/lib/model-vendor-rules.ts`:
- Around line 10-113: The rules array should be a readonly literal so `prefix`
and `i18nKey` keep their literal types and the array cannot be mutated; remove
the explicit `ModelVendorRule[]` annotation on MODEL_VENDOR_RULES, declare it as
a const readonly literal (using `const ... = [...] as const`) so prefixes remain
string literal types and the array is immutable, and if you need the existing
ModelVendorRule type derive/align it from the literal (e.g. `type
ModelVendorRule = typeof MODEL_VENDOR_RULES[number]`) so downstream exhaustive
checks over `prefix`/`i18nKey` work and callers cannot reorder or mutate the
list.
In `@src/lib/public-status/rollup-store.ts`:
- Around line 520-538: buildBucketStarts silently approximates unsupported
intervalMinutes (e.g., 16) by rounding to bucket factors, causing misaligned
coveredTo/bucket boundaries; add a fail-fast validation at the top of
buildBucketStarts that rejects non-supported intervalMinutes (accept only
multiples of PUBLIC_STATUS_ROLLUP_BUCKET_MINUTES or the explicit whitelist
5,15,30,60) and throw a clear error (include interval value) instead of
rounding; also apply the same strict validation to the related rollup boundary
logic that calls alignBucketStartUtc/uses intervalMinutes so all callers enforce
the same allowed intervals.
- Around line 298-337: The code is incorrectly attributing request-level
TTFB/TPS to every non-excluded group; only the group that actually succeeded
should receive ttfb_sum/ttfb_count and tps_sum/tps_count. Change the logic in
the loop over groupOutcome (where ttfbMs, tps, increments, outcome, groupId,
modelKey are used) so that the pushes of ttfb_* and tps_* happen only when
outcome === "success" (i.e., guard the ttfbMs !== null and tps !== null branches
with outcome === "success"); also add a unit/integration test covering a
failure-then-fallback chain ("A" fails, "B" succeeds) to ensure TTFB/TPS are
only recorded for the successful group.
- Around line 369-378: The code currently calls pipeline.exec() in
rollup-store.ts and ignores its per-command results, so any individual
HINCRBYFLOAT/EXPIRE failures are silently treated as success; change the logic
after calling pipeline.exec() (the result of redis.pipeline().exec()) to iterate
the returned array of [error, result] tuples, and if any tuple has a non-null
error convert that into an explicit failure (reject/throw or return written:
false and log via the same logger used in queuePublicStatusRollupWrite),
including context like key and which buildPublicStatusRollupField(increment)
failed; ensure you still preserve setting coverageStartKey and TTL behavior
(ROLLUP_TTL_SECONDS) but fail the outer operation when any pipeline entry
errored.
In `@src/repository/message.ts`:
- Around line 71-105: Currently queuePublicStatusRollupForFinalDetails consumes
the seed immediately via consumePublicStatusRequestSeed(id) which can
permanently lose the seed if later steps fail; change the flow to first
read/peek the seed (use a non-destructive read or a new
getPublicStatusRequestSeed/peekPublicStatusRequestSeed helper), then mark the
seed as in-flight (introduce or call an "markPublicStatusRequestInFlight"
operation tied to the id), perform getConfiguredPublicStatusGroupsForRollup()
and await queuePublicStatusRollupWrite(...), and only after
queuePublicStatusRollupWrite succeeds call consumePublicStatusRequestSeed(id) to
remove the seed; ensure on error you clear the in-flight mark (or leave it for
retry policies) and keep logging in the catch block (reference functions:
queuePublicStatusRollupForFinalDetails, consumePublicStatusRequestSeed,
queuePublicStatusRollupWrite).
---
Nitpick comments:
In `@src/lib/public-status/rollup-store.ts`:
- Around line 9-20: The imports in rollup-store.ts mix relative paths with the
project alias; replace all local-src imports that use "./..." with the "`@/`..."
alias so they consistently map to ./src/ — e.g., change imports that bring in
computeTokensPerSecond, PublicStatusConfiguredGroup,
InternalPublicStatusConfigSnapshot,
readCurrentInternalPublicStatusConfigSnapshot, PublicStatusPayload,
PublicStatusTimelineBucket, alignBucketStartUtc,
buildPublicStatusRollupCoverageStartKey, buildPublicStatusRollupKey, and
PUBLIC_STATUS_ROLLUP_BUCKET_MINUTES to use the "`@/`..." path prefix; ensure
export names and types are unchanged and tests/build still resolve the updated
module specifiers.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0738d741-65b2-44fe-89cd-04c4d0515b75
📒 Files selected for processing (26)
src/app/[locale]/status/[slug]/page.tsxsrc/app/[locale]/status/_components/public-status-timeline.tsxsrc/app/[locale]/status/_components/public-status-view.tsxsrc/lib/model-vendor-icons.tsxsrc/lib/model-vendor-rules.tssrc/lib/public-status/aggregation-core.tssrc/lib/public-status/aggregation.tssrc/lib/public-status/config-publisher.tssrc/lib/public-status/config-snapshot.tssrc/lib/public-status/read-store.tssrc/lib/public-status/rebuild-hints.tssrc/lib/public-status/rebuild-worker.tssrc/lib/public-status/redis-contract.tssrc/lib/public-status/rollup-store.tssrc/lib/public-status/scheduler.tssrc/lib/public-status/vendor-icon-key.tssrc/repository/message.tstests/unit/public-status/config-publisher.test.tstests/unit/public-status/config-snapshot.test.tstests/unit/public-status/no-db-import-guard.test.tstests/unit/public-status/public-status-view.test.tsxtests/unit/public-status/read-store.test.tstests/unit/public-status/rebuild-worker.test.tstests/unit/public-status/redis-contract.test.tstests/unit/public-status/rollup-store.test.tstests/unit/repository/message-public-status-rollup.test.ts
There was a problem hiding this comment.
Code Review
This pull request introduces a major architectural shift for the public status system, moving from on-demand database queries to a Redis-based rollup mechanism (v2). It includes a robust migration path that serves legacy v1 projections until the new rollup window is fully covered. Key changes include the introduction of a rollup-store for high-frequency metric increments, UI refactoring of the status timeline to improve accessibility and performance, and logic to reuse timeline data during polling. Feedback focuses on the reliability of the in-memory metadata cache in distributed environments, performance optimizations for rollup aggregation loops, and ensuring the reliability of migration markers in Redis.
| model?: string; | ||
| }; | ||
|
|
||
| const publicStatusRequestSeedCache = new Map<number, PublicStatusRequestSeed>(); |
There was a problem hiding this comment.
Using an in-memory Map (publicStatusRequestSeedCache) to store request metadata between createMessageRequest and updateMessageRequestDetails is unreliable in distributed or serverless environments. If the request lifecycle spans multiple instances or if the process restarts, the 'seed' data will be lost, resulting in missing rollup metrics. Consider storing this temporary state in Redis with a short TTL or fetching the necessary fields (createdAt, originalModel) from the database during the update call if they are not already available in the context.
| const rollupSampleCount = rollupBuckets.reduce((sum, bucket) => { | ||
| for (const [field, value] of bucket.values) { | ||
| const parsedField = parsePublicStatusRollupField(field); | ||
| if (parsedField?.metric === "success" || parsedField?.metric === "failure") { | ||
| sum += value; | ||
| } | ||
| } | ||
| return sum; | ||
| }, 0); |
There was a problem hiding this comment.
The current logic for calculating rollupSampleCount parses every field in every bucket using parsePublicStatusRollupField, which involves string splitting and URI decoding. Given that this loop can run hundreds of thousands of times (buckets × groups × models × metrics), this is quite expensive. Since we only need to identify 'success' and 'failure' metrics, a simple string suffix check would be significantly more efficient.
const rollupSampleCount = rollupBuckets.reduce((sum, bucket) => {
for (const [field, value] of bucket.values) {
if (field.endsWith("|success") || field.endsWith("|failure")) {
sum += value;
}
}
return sum;
}, 0);| pipeline.hincrbyfloat(key, buildPublicStatusRollupField(increment), increment.value); | ||
| } | ||
| if (typeof pipeline.set === "function") { | ||
| pipeline.set(coverageStartKey, bucketStartIso, "NX"); |
There was a problem hiding this comment.
The coverageStartKey is critical for the migration logic that determines when to switch from legacy v1 projections to v2 rollups. In the pipeline block, the set command is guarded by a type check (typeof pipeline.set === "function"). If this check fails, the coverage start timestamp will never be recorded, potentially causing the status page to stay on legacy data indefinitely. While most Redis clients support set, it's safer to ensure this operation is performed or logged if it fails.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d75f47279b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (!isPublicStatusFinalDetails(details) || !claimPublicStatusFinalization(id)) { | ||
| return; |
There was a problem hiding this comment.
Delay finalization claim until rollup write is actually possible
The request is marked finalized before the code verifies that rollup prerequisites exist, so transient config unavailability can permanently drop data. In queuePublicStatusRollupForFinalDetails, claimPublicStatusFinalization(id) runs first, but if getConfiguredPublicStatusGroupsForRollup() returns an empty list (for example during startup/bootstrap or a temporary Redis/config miss), the function returns early and later calls for the same request are blocked by the finalized cache. That means success/failure metrics for those requests are never written to rollups, creating silent gaps in public status accuracy.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/unit/public-status/public-status-view.test.tsx (1)
610-677: ⚡ Quick win确保 fake timers 清理总能执行,避免后续用例被污染。
当前
vi.useRealTimers()/unmount()放在断言后;若中途断言失败,清理不会执行,可能导致后续测试串扰。建议用try/finally包裹。建议修改
- await act(async () => { - vi.advanceTimersByTime(30_000); - await Promise.resolve(); - }); - - const text = container.textContent || ""; - expect(text).toContain("GPT-4.2"); - expect(text).toContain("Failed"); - expect(text).toContain("0.00%"); - - vi.useRealTimers(); - unmount(); + try { + await act(async () => { + vi.advanceTimersByTime(30_000); + await Promise.resolve(); + }); + + const text = container.textContent || ""; + expect(text).toContain("GPT-4.2"); + expect(text).toContain("Failed"); + expect(text).toContain("0.00%"); + } finally { + vi.useRealTimers(); + unmount(); + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/public-status/public-status-view.test.tsx` around lines 610 - 677, The test "uses server summary metrics when polling returns a new model without timeline" currently calls vi.useRealTimers() and unmount() after assertions, which can leak fake timers or mounted components if an assertion throws; wrap the polling/advance and assertions in a try/finally so that vi.useRealTimers() and unmount() are invoked in the finally block (ensure unmount() and vi.useRealTimers() are still called even on failure), locating the cleanup around the render(...) result (container, unmount) and the vi.useFakeTimers()/vi.advanceTimersByTime(...) section.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/unit/public-status/public-status-view.test.tsx`:
- Around line 610-677: The test "uses server summary metrics when polling
returns a new model without timeline" currently calls vi.useRealTimers() and
unmount() after assertions, which can leak fake timers or mounted components if
an assertion throws; wrap the polling/advance and assertions in a try/finally so
that vi.useRealTimers() and unmount() are invoked in the finally block (ensure
unmount() and vi.useRealTimers() are still called even on failure), locating the
cleanup around the render(...) result (container, unmount) and the
vi.useFakeTimers()/vi.advanceTimersByTime(...) section.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a8c9e5ec-75fc-40f2-b2fa-7f390f2cf099
📒 Files selected for processing (12)
src/app/[locale]/status/_components/public-status-view.tsxsrc/lib/model-vendor-icons.test.tssrc/lib/model-vendor-icons.tsxsrc/lib/model-vendor-rules.tssrc/lib/public-status/aggregation.tssrc/lib/public-status/rebuild-worker.tssrc/lib/public-status/rollup-store.tssrc/repository/message.tstests/unit/public-status/aggregation.test.tstests/unit/public-status/public-status-view.test.tsxtests/unit/public-status/rollup-store.test.tstests/unit/repository/message-public-status-rollup.test.ts
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2b9e35c6fc
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const results = await pipeline.exec(); | ||
| for (let batchIndex = 0; batchIndex < batchStarts.length; batchIndex++) { | ||
| const [error, raw] = results?.[batchIndex] ?? [null, null]; | ||
| buckets.push(parseBucket(batchStarts[batchIndex]!, error ? null : raw)); |
There was a problem hiding this comment.
Fail rebuild when rollup pipeline returns null results
Treating pipeline.exec() returning null as an all-empty batch silently publishes a fresh snapshot with zeroed rollup data, which can temporarily wipe status visibility to no_data even when real rollup data exists. In this code path, each bucket is parsed from results?.[batchIndex] ?? [null, null], so a null exec is downgraded into empty buckets instead of a rebuild failure/retry signal.
Useful? React with 👍 / 👎.
| await input.triggerRebuildHint("rollup-coverage-incomplete"); | ||
| await input.triggerRebuildHint("legacy-generation"); | ||
| if ( |
There was a problem hiding this comment.
Coalesce rebuild hints on rollup-coverage fallback path
This branch awaits multiple triggerRebuildHint calls sequentially for a single read, which turns one status-page poll into multiple blocking hint round-trips during the rollout window (rollupCoverageComplete === false). Because hint scheduling now short-circuits via TTL checks, these extra awaited calls mostly add latency on the hot read path without improving correctness; they should be coalesced into one hint or made non-blocking.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/unit/public-status/public-status-view.test.tsx (2)
407-513: ⚡ Quick win建议添加 try/finally 以确保清理代码执行。
此测试使用了 fake timers,但清理逻辑(
vi.useRealTimers()和unmount())未包裹在finally块中。为保证测试可靠性并与本 PR 中其他测试的模式保持一致,建议采用 try/finally 包裹。♻️ 建议的重构方案
- await act(async () => { - await Promise.resolve(); - }); - - expect(fetchMock).toHaveBeenCalledWith( - "/api/public-status?groupSlug=platform&include=meta%2Cdefaults%2Cgroups", - { cache: "no-store" } - ); - - await act(async () => { - vi.advanceTimersByTime(30_000); - await Promise.resolve(); - }); - - const text = container.textContent || ""; - expect(text).toContain("Platform Model"); - expect(text).not.toContain("OpenAI Model"); - expect(container.querySelectorAll('[data-testid="sortable-group-panel"]')).toHaveLength(1); - - vi.useRealTimers(); - unmount(); + try { + await act(async () => { + await Promise.resolve(); + }); + + expect(fetchMock).toHaveBeenCalledWith( + "/api/public-status?groupSlug=platform&include=meta%2Cdefaults%2Cgroups", + { cache: "no-store" } + ); + + await act(async () => { + vi.advanceTimersByTime(30_000); + await Promise.resolve(); + }); + + const text = container.textContent || ""; + expect(text).toContain("Platform Model"); + expect(text).not.toContain("OpenAI Model"); + expect(container.querySelectorAll('[data-testid="sortable-group-panel"]')).toHaveLength(1); + } finally { + vi.useRealTimers(); + unmount(); + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/public-status/public-status-view.test.tsx` around lines 407 - 513, The test "keeps filterSlug-scoped default group after polling refresh" uses vi.useFakeTimers() but leaves cleanup (vi.useRealTimers() and unmount()) outside a finally block; wrap the main test body (everything after vi.useFakeTimers()) in a try block and move vi.useRealTimers() and unmount() into a finally so the fake timers are always restored and the component always unmounted (references: vi.useFakeTimers, vi.useRealTimers, unmount, and the fetchMock/container usage inside the test).
277-325: ⚡ Quick win建议添加 try/finally 以确保清理代码执行。
此测试使用了 fake timers,但未将清理逻辑(
vi.useRealTimers()和unmount())包裹在finally块中。如果断言失败,清理代码将不会执行,可能导致测试污染。建议采用与 lines 595-609 和 667-680 相同的模式。♻️ 建议的重构方案
- await act(async () => { - await Promise.resolve(); - }); - - expect(fetchMock).toHaveBeenCalledWith( - "/api/public-status?interval=5&rangeHours=24&include=meta%2Cdefaults%2Cgroups", - { cache: "no-store" } - ); - - await act(async () => { - vi.advanceTimersByTime(30_000); - await Promise.resolve(); - }); - - expect(container.textContent).toContain("Refresh delayed"); - expect(container.textContent).toContain("Preparing first snapshot"); - - vi.useRealTimers(); - unmount(); + try { + await act(async () => { + await Promise.resolve(); + }); + + expect(fetchMock).toHaveBeenCalledWith( + "/api/public-status?interval=5&rangeHours=24&include=meta%2Cdefaults%2Cgroups", + { cache: "no-store" } + ); + + await act(async () => { + vi.advanceTimersByTime(30_000); + await Promise.resolve(); + }); + + expect(container.textContent).toContain("Refresh delayed"); + expect(container.textContent).toContain("Preparing first snapshot"); + } finally { + vi.useRealTimers(); + unmount(); + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/public-status/public-status-view.test.tsx` around lines 277 - 325, Wrap the fake-timer test body in a try/finally to guarantee cleanup: start by calling vi.useFakeTimers(), render PublicStatusView and run assertions inside the try block, and in the finally block always call vi.useRealTimers() and unmount() (referencing the existing vi.useFakeTimers(), vi.useRealTimers(), unmount() and fetchMock usage) so timers and DOM are restored even if an assertion fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/unit/public-status/public-status-view.test.tsx`:
- Around line 407-513: The test "keeps filterSlug-scoped default group after
polling refresh" uses vi.useFakeTimers() but leaves cleanup (vi.useRealTimers()
and unmount()) outside a finally block; wrap the main test body (everything
after vi.useFakeTimers()) in a try block and move vi.useRealTimers() and
unmount() into a finally so the fake timers are always restored and the
component always unmounted (references: vi.useFakeTimers, vi.useRealTimers,
unmount, and the fetchMock/container usage inside the test).
- Around line 277-325: Wrap the fake-timer test body in a try/finally to
guarantee cleanup: start by calling vi.useFakeTimers(), render PublicStatusView
and run assertions inside the try block, and in the finally block always call
vi.useRealTimers() and unmount() (referencing the existing vi.useFakeTimers(),
vi.useRealTimers(), unmount() and fetchMock usage) so timers and DOM are
restored even if an assertion fails.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6c2fc430-8d77-49e5-92dc-e166084bafbd
📒 Files selected for processing (1)
tests/unit/public-status/public-status-view.test.tsx
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4da03aa142
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const [error, raw] = results?.[batchIndex] ?? [null, null]; | ||
| buckets.push(parseBucket(batchStarts[batchIndex]!, error ? null : raw)); |
There was a problem hiding this comment.
Fail rollup rebuild on Redis pipeline command errors
When batched HGETALL calls return per-command errors, this code silently treats those buckets as empty (error ? null : raw) and continues rebuilding, which can publish a new snapshot with partial/zeroed data instead of preserving the last good projection. In production, transient Redis command failures during a rebuild will therefore degrade public status accuracy without surfacing a rebuild failure signal.
Useful? React with 👍 / 👎.
| if (!previousGroup) { | ||
| return group; | ||
| } |
There was a problem hiding this comment.
Load timeline when transitioning from no snapshot state
Polling always requests include=meta,defaults,groups (no timeline), and when a group appears for the first time this branch returns it as-is, leaving timeline empty. If a user opens the page while the system is rebuilding (initial payload has no groups), then later receives the first ready payload via polling, charts never get real history data and remain empty until a full page reload.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: add2a1cb5c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (groupsByRollupId.get(groupId) !== group) { | ||
| continue; |
There was a problem hiding this comment.
Handle rollup ID collisions instead of silently dropping increments
When two configured groups resolve to the same rollup key string (for example one group has sourceGroupId = 5 while another falls back to sourceGroupName = "5"), groupsByRollupId becomes last-write-wins and this guard drops one group’s samples entirely. In that collision case, increments for the overwritten group are skipped with no error/log, so success/failure and latency rollups are silently undercounted for affected public groups.
Useful? React with 👍 / 👎.
相关 PR
背景
公开
/status页面之前依赖按窗口从message_request回扫并重建投影。数据量上来后,这条路径会把公开状态页的刷新成本放大到"时间窗口 × 端点/请求历史 × 模型/分组",并且读侧轮询会反复拉完整 timeline,导致 Redis/DB/浏览器都承压。本 PR 重新设计公开状态页的数据面:写入时增量写 Redis 5 分钟 rollup,后台 worker 只从 rollup 物化 public-safe snapshot,公开 API 只读 snapshot,不再扫 DB。
设计要点
public-status:v2Redis namespace,保留public-status:v1投影作为升级期 fallback。public-status:v2:rollup:5m:{bucketStart}。{providerGroupId}|{publicModelKey}|metric,不包含 endpoint/internal provider URL,避免"端点 × 模型 × 时间"爆炸。success/failure/ttfb_sum/ttfb_count/tps_sum/tps_count,worker 按公开配置的5/15/30/60m展示间隔合并。public-status:v2:rollup:coverage-start:5m,窗口覆盖不完整时公开读侧优先服务 v1 snapshot,并标记 stale;没有 v1 时服务 v2 stale。provider_groups.id,旧 internal snapshot 没有 id 时才 fallback 到 group name。前端性能
include=meta,defaults,groups,不再每 30 秒拉 timeline。兼容与升级
验证
bun run typecheckbun run lint:fixbun run lintvitest.cmd run tests\\unit\\public-status\\public-status-view.test.tsx tests\\unit\\public-status\\rollup-store.test.ts tests\\unit\\public-status\\read-store.test.ts tests\\unit\\public-status\\rebuild-worker.test.ts tests\\unit\\repository\\message-public-status-rollup.test.tsbun run buildbun run test全量测试结果:
684个测试文件通过,6099个测试通过,13个跳过。构建通过;Edge Runtime / Redis 未配置 / DSN fallback 等日志为现有测试和构建环境噪音。Description enhanced by Claude AI
Greptile Summary
This PR replaces the
public-statusread path's per-request DB back-scan with an incremental v2 Redis rollup architecture: 5-minuteHINCRBYFLOATrollup buckets written at request finalization, a background worker that materializes snapshots from those buckets, and a v1-fallback during the coverage warm-up window. The front-end polling is also optimized to omit per-bucket timeline data and reuse locally cached timelines between polls.rollup-store.ts(write/read/aggregate rollup buckets), extendsrebuild-worker.tsto build projections from rollups instead of DB queries, and adds dual-namespace (v2primary /v1legacy fallback) support toread-store.ts,config-snapshot.ts, andredis-contract.ts.include=meta,defaults,groups(no timeline), with client-side timeline reuse logic and a single shared hover summary replacing per-cell Radix Tooltip instances.Confidence Score: 5/5
The PR is safe to merge. The rollup write path is fire-and-forget with dedup guards, the v1 fallback activates whenever v2 coverage is incomplete, and no database migrations are needed.
The core rollup pipeline — write, aggregate, serve — is internally consistent: bucket alignment uses the same alignBucketStartUtc constant throughout, the rollupCoverageComplete flag correctly gates the legacy fallback, and the frontend merge logic handles every edge case. The two noted issues are purely about code clarity, not runtime correctness.
No files require special attention. The read-store.ts fallback branch worth a comment is the only area that could confuse a future reader.
Important Files Changed
Sequence Diagram
sequenceDiagram participant Req as Request Handler participant Msg as message.ts participant RS as rollup-store.ts participant Redis as Redis (v2) participant Worker as rebuild-worker.ts participant ReadStore as read-store.ts participant Client as Browser Req->>Msg: createMessageRequest() Msg->>Msg: rememberPublicStatusRequestSeed(id, seed) Req->>Msg: updateMessageRequestDetails(id, details) Msg->>RS: queuePublicStatusRollupWrite(event, groups) RS->>Redis: "HINCRBYFLOAT rollup:5m:{bucket} field value" RS->>Redis: SET coverage-start:5m (NX) RS->>Redis: EXPIRE keys (32 days) Worker->>Redis: GET coverage-start key Worker->>Redis: HGETALL rollup buckets (batched pipeline) Worker->>Worker: buildPublicStatusPayloadFromRollups() Worker->>Worker: isRollupCoverageComplete() Worker->>Redis: "SET snapshot:{gen}" Worker->>Redis: SET manifest:current Client->>ReadStore: readPublicStatusPayload() ReadStore->>Redis: GET manifest (v2) alt "rollupCoverageComplete === false" ReadStore->>Redis: GET manifest (v1 legacy) ReadStore-->>Client: "legacy projection (rebuildState=stale)" else coverage complete ReadStore->>Redis: GET snapshot (v2) ReadStore-->>Client: v2 projection endPrompt To Fix All With AI
Reviews (4): Last reviewed commit: "完善公开状态页轮询测试清理" | Re-trigger Greptile