Conversation
- Integrated Rolldown for hot module replacement (HMR) in `arkor dev`, allowing real-time updates to the training interface without page refresh. - Implemented `requestEarlyStop` and `replaceCallbacks` methods in the Trainer API to facilitate graceful stopping of training jobs and dynamic callback updates during execution. - Updated documentation to reflect new features and usage patterns for improved developer guidance. - Adjusted cleanup logic for better resource management during development sessions.
…l shutdown - Introduced `replaceCallbacks` method in the Trainer API to allow dynamic updates of lifecycle callbacks during training runs. - Enhanced signal handling for graceful early stopping, ensuring in-flight checkpoints are preserved during HMR rebuilds. - Added `registerCleanupHook` for streamlined resource management on process exit, improving cleanup logic across development commands. - Updated documentation to reflect new features and usage patterns for better developer guidance.
…internal callback swapping - Removed the public `replaceCallbacks` method from the Trainer interface to prevent exposure of the hot-swapping functionality. - Introduced an internal mechanism for callback swapping using a `Symbol.for`-keyed brand, allowing for dynamic updates during training runs without affecting the public API. - Updated signal handling to ensure seamless integration with the new callback swapping logic, enhancing the hot module replacement (HMR) experience. - Revised documentation to reflect changes in the Trainer API and clarify the internal callback management process.
…lement internal early-stop handling - Removed the public `requestEarlyStop` method from the Trainer interface to prevent exposure of the early-stop functionality. - Introduced an internal mechanism for early stopping using a `Symbol.for`-keyed brand, allowing for graceful stopping after the next checkpoint without affecting the public API. - Updated signal handling to ensure seamless integration with the new early-stop logic, enhancing the hot module replacement (HMR) experience. - Revised documentation to reflect changes in the Trainer API and clarify the internal early-stop management process.
…llback replacement - Updated the Trainer API to remove public exposure of `requestEarlyStop` and `replaceCallbacks` methods, enhancing encapsulation. - Implemented internal mechanisms for early stopping and callback swapping using `Symbol.for`-keyed brands, ensuring seamless integration during training runs. - Revised signal handling to improve the hot module replacement (HMR) experience and maintain clean resource management. - Updated documentation to reflect these changes and clarify the internal management processes for developers.
- Replaced esbuild with Rolldown for building the project, ensuring external dependencies are resolved correctly. - Implemented a cleanup hook system to manage resource disposal on process termination signals. - Enhanced the development experience with hot module replacement (HMR) capabilities, allowing for seamless updates during training runs. - Updated documentation to reflect changes in the development loop and HMR behavior. - Added tests for new signal handling and configuration hashing functionality.
|
Preview deployment for your docs. Learn more about Mintlify Previews.
💡 Tip: Enable Workflows to automatically generate PRs for you. |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8abc594369
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const hotSwapTargets = activeTrains.notifyCallbackReload(nextHash); | ||
| const restartTargets = | ||
| activeTrains.requestEarlyStopOnMismatch(nextHash); |
There was a problem hiding this comment.
Dispatch train signals only once per rebuild
notifyCallbackReload/requestEarlyStopOnMismatch run inside each /api/dev/events subscriber callback, so one rebuild sends signals once per connected SSE client (and again on reconnect when createHmrCoordinator.subscribe replays lastEvent). In the common case of two open Studio tabs, the same child can receive two SIGTERMs for one config-changing rebuild; the second signal triggers the runner’s forced-exit path (exit(143)) instead of checkpoint-preserving early stop. Move signal dispatch out of per-subscriber delivery (or dedupe per rebuild hash) so each child is signaled at most once per rebuild event.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
This PR upgrades the local Studio dev experience by migrating the trainer bundling pipeline from esbuild to Rolldown, adding an HMR/SSE channel to propagate rebuild events to the Studio SPA, and introducing signal-based “hot-swap callbacks vs early-stop + restart” behavior for in-flight training subprocesses.
Changes:
- Replace esbuild-based trainer bundling with Rolldown (one-shot build + watch/HMR).
- Add
/api/dev/eventsSSE stream + SPA client wiring to refresh manifest and coordinate restarts/hot-swaps. - Introduce internal trainer inspection/callback replacement/early-stop hooks (via
Symbol.forbrands), plus cleanup utilities and updated docs/tests.
Reviewed changes
Copilot reviewed 27 out of 28 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pnpm-lock.yaml | Removes esbuild and adds rolldown to the lockfile graph. |
| packages/arkor/package.json | Swaps dependency from esbuild to rolldown. |
| packages/arkor/src/core/rolldownConfig.ts | Centralizes Rolldown input options + entry/outDir resolution and Node target derivation. |
| packages/arkor/src/cli/commands/build.ts | Implements arkor build via Rolldown bundling instead of esbuild. |
| packages/arkor/src/studio/hmr.ts | Adds lazy Rolldown watcher coordinator that emits ready/rebuild/error events. |
| packages/arkor/src/studio/hmr.test.ts | Tests HMR coordinator behavior (ready/rebuild/error/replay/dispose). |
| packages/arkor/src/studio/manifest.ts | Adds configHash to manifest summary and splits “summarise built manifest” vs “build + summarise”. |
| packages/arkor/src/studio/trainRegistry.ts | Adds per-child registry + policy for SIGUSR2 hot-swap vs SIGTERM early-stop. |
| packages/arkor/src/studio/trainRegistry.test.ts | Tests TrainRegistry signaling decisions and error tolerance. |
| packages/arkor/src/studio/server.ts | Adds /api/dev/events SSE route and integrates TrainRegistry into rebuild handling. |
| packages/arkor/src/studio/server.test.ts | Adds tests for /api/dev/events token rules, loopback guard, and SSE framing. |
| packages/arkor/src/core/configHash.ts | Adds stable hashing of JobConfig for “callbacks-only vs restart” decisions. |
| packages/arkor/src/core/configHash.test.ts | Tests determinism and sensitivity of hashJobConfig. |
| packages/arkor/src/core/trainerInspection.ts | Introduces internal inspection + callback replacement + early-stop brands via Symbol.for. |
| packages/arkor/src/core/trainer.ts | Implements callback hot-swap support + early-stop latch; attaches internal brands. |
| packages/arkor/src/core/trainer.test.ts | Adds coverage for early-stop and callback hot-swap behavior via internal brands. |
| packages/arkor/src/core/runnerSignals.ts | Adds signal handlers: SIGTERM graceful early-stop + SIGUSR2 callback reload. |
| packages/arkor/src/core/runnerSignals.test.ts | Tests SIGTERM/SIGUSR2 handler behaviors with branded trainers. |
| packages/arkor/src/core/runner.ts | Installs/removes new signal handlers around trainer.start()/trainer.wait(). |
| packages/arkor/src/core/runner.test.ts | Adds a test for SIGTERM early-stop behavior in the runner. |
| packages/arkor/src/cli/cleanupHooks.ts | Adds reusable cleanup-hook registration for exit/signals. |
| packages/arkor/src/cli/commands/dev.ts | Wires HMR coordinator into Studio server and refactors shutdown cleanup via cleanup hooks. |
| packages/arkor/src/cli/commands/start.test.ts | Updates test comment to refer to Rolldown instead of esbuild. |
| packages/studio-app/src/lib/api.ts | Adds typed DevEvent + openDevEvents() for SSE notifications. |
| packages/studio-app/src/components/RunTraining.tsx | SPA listens to dev SSE events, refreshes manifest, and coordinates restart/hot-swap UI state. |
| docs/concepts/studio.mdx | Documents HMR workflow, including hash-based hot-swap vs restart behavior (EN). |
| docs/ja/concepts/studio.mdx | Documents HMR workflow (JA). |
| AGENTS.md | Updates repository policy/docs for new SSE allow-list, HMR internals, and Rolldown build pipeline. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const [log, setLog] = useState(""); | ||
| const [manifest, setManifest] = useState<ManifestResult | null>(null); | ||
| const [hmrStatus, setHmrStatus] = useState< | ||
| "idle" | "rebuilding" | "early-stopping" | "restarting" | "hot-swapped" |
| import { ensureProjectState } from "../core/projectState"; | ||
| import { readState } from "../core/state"; | ||
| import { readManifestSummary } from "./manifest"; | ||
| import { readManifestSummary, summariseBuiltManifest } from "./manifest"; |
| function scheduleHmrCleanup(hmr: { dispose: () => Promise<void> }): void { | ||
| // Registered before the studio-token cleanup so it runs first on | ||
| // shutdown — Node fires signal handlers in registration order, and we | ||
| // want the watcher to release file handles before the outermost | ||
| // process.exit. | ||
| registerCleanupHook({ cleanup: () => hmr.dispose() }); | ||
| } |
| requestEarlyStopOnMismatch( | ||
| nextConfigHash: string | null, | ||
| ): RestartTarget[] { | ||
| const targets: RestartTarget[] = []; | ||
| for (const [pid, entry] of this.entries) { | ||
| if ( | ||
| nextConfigHash === null || | ||
| entry.configHash === null || | ||
| entry.configHash !== nextConfigHash | ||
| ) { | ||
| try { | ||
| entry.child.kill("SIGTERM"); | ||
| } catch { | ||
| // child already exited; close handler will clean up. | ||
| } | ||
| targets.push({ pid, trainFile: entry.trainFile }); | ||
| } |
| await callbacks.onCheckpoint?.(ctx); | ||
| // Early-stop latch: a checkpoint just landed, so the in-flight work | ||
| // is durable. Cancel the cloud job and end `wait()` cleanly. | ||
| if (earlyStopRequested && earlyStopDeferred) { | ||
| await trainer.cancel(); | ||
| if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer); | ||
| earlyStopDeferred.resolve(); | ||
| earlyStopDeferred = null; | ||
| return { terminal: true, artifacts: terminalResult?.artifacts ?? [] }; | ||
| } |
| function startWatcher(): void { | ||
| if (watcher || disposed) return; | ||
| if (!existsSync(resolved.entry)) { | ||
| broadcast({ | ||
| type: "error", | ||
| message: `Build entry not found: ${resolved.entry}. Create ${BUILD_DEFAULTS.entry} or pass an explicit entry argument.`, | ||
| }); | ||
| return; | ||
| } |
…ous BUNDLE_END on macOS - Modified the test to assert that the late subscriber receives the same event as the prior subscriber, addressing issues with spurious BUNDLE_END events in rolldown@1.0.0-rc.17 on macOS. - Added comments to clarify the reasoning behind the change and noted a future task to revisit the assertion after rolldown stabilizes.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 31f510869c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| restartPendingRef.current = true; | ||
| setHmrStatus(runningRef.current ? "early-stopping" : "idle"); |
There was a problem hiding this comment.
Ignore restart hints when no local run is active
restartPendingRef is set for every payload.restart event even when this tab is not running a training stream. Since /api/dev/events broadcasts restart events to all connected tabs, a passive tab can cache restartPendingRef=true from another tab’s run; the next time the user runs training in this tab, finally will auto-trigger an extra run, creating an unintended duplicate job and extra cloud/GPU spend. Only mark restart pending when runningRef.current is true (or otherwise correlate the restart target to the local run).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 27 out of 28 changed files in this pull request and generated 4 comments.
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } catch { | ||
| // child already exited; close handler will clean up. | ||
| } | ||
| targets.push({ pid, trainFile: entry.trainFile }); |
| import { ensureProjectState } from "../core/projectState"; | ||
| import { readState } from "../core/state"; | ||
| import { readManifestSummary } from "./manifest"; | ||
| import { readManifestSummary, summariseBuiltManifest } from "./manifest"; |
| const [log, setLog] = useState(""); | ||
| const [manifest, setManifest] = useState<ManifestResult | null>(null); | ||
| const [hmrStatus, setHmrStatus] = useState< | ||
| "idle" | "rebuilding" | "early-stopping" | "restarting" | "hot-swapped" |
| // Early-stop latch: a checkpoint just landed, so the in-flight work | ||
| // is durable. Cancel the cloud job and end `wait()` cleanly. | ||
| if (earlyStopRequested && earlyStopDeferred) { | ||
| await trainer.cancel(); | ||
| if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer); | ||
| earlyStopDeferred.resolve(); | ||
| earlyStopDeferred = null; | ||
| return { terminal: true, artifacts: terminalResult?.artifacts ?? [] }; | ||
| } | ||
| return { terminal: false, artifacts: terminalResult?.artifacts ?? [] }; |
…ce failures - Added a test to verify that the SIGINT exit handler is armed even if the `persistStudioToken` function fails, preventing the dev server from idling indefinitely. - Updated the `runDev` function to register the studio-token cleanup hook unconditionally, ensuring it executes on process termination signals regardless of token persistence success. - Enhanced cleanup logic to maintain proper signal handling during development sessions, improving overall stability and user experience.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 55b4a2e8f6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (payload.restart) { | ||
| // `/api/dev/events` is a broadcast — every open Studio tab gets | ||
| // this event. Only flip the auto-restart latch when *this* tab | ||
| // is actually running a stream right now; otherwise a passive | ||
| // tab would silently auto-spawn an extra job the next time the | ||
| // user clicks Run training, doubling cloud spend. | ||
| if (runningRef.current) { | ||
| // Training run is early-stopping; the active stream will |
There was a problem hiding this comment.
Scope restart latch to runs actually signaled for restart
The handler sets restartPendingRef for any payload.restart while this tab is running, but rebuild events are broadcast across all tabs and can contain mixed outcomes (restartTargets and hotSwapTargets) for different subprocesses. In a multi-run setup, a tab whose run was only hot-swapped can still latch restartPendingRef=true, then auto-spawn an extra training job when its current run eventually finishes, increasing unintended cloud/GPU usage.
Useful? React with 👍 / 👎.
| entry.child.kill("SIGTERM"); | ||
| entry.earlyStopRequested = true; |
There was a problem hiding this comment.
Check signal delivery before marking restart targets
ChildProcess.kill() returns a boolean and does not throw when a process is already gone, but this path treats any non-throw as success. If kill("SIGTERM") returns false during an exit race, the entry is still marked earlyStopRequested and added to targets, so the UI may wait for/restart a run that was never actually signaled.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 28 out of 29 changed files in this pull request and generated 4 comments.
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| useEffect(() => { | ||
| const es = openDevEvents(); | ||
| const onMessage = (raw: MessageEvent) => { |
| setHmrStatus("hot-swapped"); | ||
| window.setTimeout(() => { | ||
| setHmrStatus((s) => (s === "hot-swapped" ? "idle" : s)); | ||
| }, 1500); | ||
| } else { |
| }, | ||
| cancel() { | ||
| activeTrains.unregister(child.pid); | ||
| child.kill(); |
| // Synchronous wrapper so signal handlers preserve "cleanup landed | ||
| // before this function returns" — important for sync cleanups (e.g. | ||
| // `unlinkSync`) and for tests that assert the side effect right after | ||
| // invoking the handler. Async cleanups are fire-and-forget with a | ||
| // catch so a hung dispose doesn't block exit. | ||
| const run = (): void => { | ||
| if (done) return; | ||
| done = true; | ||
| try { | ||
| const result = options.cleanup(); | ||
| if (result && typeof (result as Promise<void>).catch === "function") { | ||
| (result as Promise<void>).catch(() => { | ||
| // best-effort: shutdown is racing other cleanup paths | ||
| }); | ||
| } | ||
| } catch { | ||
| // best-effort | ||
| } | ||
| }; | ||
|
|
||
| process.on("exit", run); | ||
|
|
||
| for (const sig of TERMINATING_SIGNALS) { | ||
| process.on(sig, () => { | ||
| run(); | ||
| if (options.exitOnSignal) process.exit(0); |
Code Coverage OverviewLanguages: TypeScript TypeScript / code-coverage/arkorThe overall coverage in the branch is 96%. The coverage in the branch is 99%. Show a code coverage summary of the most impacted files.
TypeScript / code-coverage/create-arkorThe overall coverage remains at 56%, unchanged from the branch. TypeScript / code-coverage/cli-internalThe overall coverage remains at 97%, unchanged from the branch. TypeScript / code-coverage/studio-appThe overall coverage in the branch is 57%. The coverage in the branch is 53%. Show a code coverage summary of the most impacted files.
Updated |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aa6a826f27
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const configHash: string | null = options.hmr | ||
| ? options.hmr.getCurrentConfigHash() | ||
| : null; |
There was a problem hiding this comment.
Route explicit train files with their own HMR hash
When /api/train is called with body.file, the spawned arkor start <file> rebuilds that explicit entry into .arkor/build/index.mjs, but this snapshot still records the HMR coordinator's current hash for the default src/arkor/index.ts. In HMR-enabled Studio, a run launched from an alternate file can therefore be classified for SIGUSR2/restart based on a different artifact, and the default watcher may overwrite the same index.mjs that the child reloads from; e.g. editing only the default entry can hot-swap callbacks into an alternate-entry run from the wrong bundle. Treat explicit files as unhashable/restart-only or build/watch a separate artifact for that entry.
Useful? React with 👍 / 👎.
| try { | ||
| await trainer.cancel(); | ||
| } catch { | ||
| // intentionally ignored; see comment above. | ||
| } |
There was a problem hiding this comment.
Propagate fallback cancel failures
For hand-rolled trainers that the runner accepts but that lack the internal early-stop brand, a SIGTERM/HMR stop falls back to trainer.cancel() here. If that cancel rejects (for example, a transient cloud/network failure), this catch resolves requestTrainerEarlyStop() successfully, so installShutdownHandlers exits with code 0 even though the remote job may still be running; this reintroduces the orphaned-job false-success path for the unbranded trainer shape.
Useful? React with 👍 / 👎.
| const manifest = await readManifestSummary(trainCwd, { | ||
| prebuiltOutFile: hmrOutFile, | ||
| }); |
There was a problem hiding this comment.
Avoid serving stale artifacts before HMR is ready
On HMR-enabled startup, this always passes the prebuilt path to readManifestSummary, which imports it whenever .arkor/build/index.mjs already exists. If that file is left over from a previous build and the watcher has not emitted its first ready yet, the first /api/manifest poll can show an old trainer/config; because arkor start also skips rebuilding when the artifact exists, a quick Run click can launch the stale artifact until the watcher catches up. Gate this fast path on the coordinator having a successful current build, not just file existence.
Useful? React with 👍 / 👎.
| if (process.platform !== "win32") { | ||
| const r = safeKill(entry.child, "SIGUSR2"); | ||
| if (r === "ok") { | ||
| hotSwapTargets.push(target); |
There was a problem hiding this comment.
Delay hot-swap signals until the child is ready
A same-hash rebuild can hit this branch immediately after /api/train registers a newly spawned child, before that child has imported runner.ts and installed its SIGUSR2 callback-reload handler. In that startup window on Unix, child.kill("SIGUSR2") can terminate the Node process with the default signal action while the server reports it as a hot-swap target, so the SPA does not schedule a restart and the user sees the run die instead of continue. Add a child-ready marker before allowing SIGUSR2, or route pre-ready children through restart.
Useful? React with 👍 / 👎.
| "@types/node": "^24", | ||
| "@vitest/coverage-v8": "^4.1.5", | ||
| "tsdown": "^0.22.0", | ||
| "typescript": "^5", | ||
| "vitest": "^4.1.5" |
There was a problem hiding this comment.
Keep published devDependencies catalog-bound
The repository's AGENTS.md dependency-catalog rule says the published packages' runtime dependencies are the only literal-semver carve-out and that their devDependencies stay catalog:-bound. Replacing these with literals here (and the same pattern in packages/create-arkor/package.json) breaks the single-source-of-truth catalog policy, so future bumps of TypeScript/Vitest/tsdown can drift between the published packages and the rest of the workspace instead of being one-line catalog updates.
Useful? React with 👍 / 👎.
… for process termination
- Introduced a NO_OP_DISPOSE function for consistent signal registration failure handling. - Enhanced shutdown signal handling to ensure accurate exit codes based on received signals. - Updated requestTrainerEarlyStop to handle errors more gracefully and ensure proper exit codes. - Improved callback reload handler to avoid crashing on unsupported platforms. - Refactored trainer test cases to utilize Response.json for consistency and clarity. - Updated various promises to use void for better readability and to avoid unhandled promise rejections. - Enhanced type safety and clarity in several areas, including the use of readonly arrays and improved error handling. - Cleaned up unnecessary comments and improved code readability across multiple files.
WalkthroughAdds Rolldown-based Studio HMR: build/watch plumbing, deterministic config hashing, trainer inspection and Symbol.for-branded callbacks/early-stop, signal-driven hot-swap vs graceful restart, server SSE/dev routes and train registry, CLI cleanup integration, frontend auto-restart, and comprehensive tests/docs. ChangesStudio HMR End-to-End Flow
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
✨ Simplify code
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d4030b9c43
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Clearing here makes the user's broken-state edit the | ||
| // source of truth: no auto-restart fires until the next | ||
| // successful rebuild re-arms the latch. | ||
| restartPendingRef.current = false; |
There was a problem hiding this comment.
Preserve restart after a fixed HMR error
When an HMR error arrives after a prior rebuild already put this child into the registry's earlyStopRequested state, clearing restartPendingRef here loses the only restart signal for that child. The server's rebuild dispatch skips entries with earlyStopRequested, so if the user fixes the compile error before the early-stop child exits, the successful rebuild will not include this pid in restartTargets and the SPA cannot re-arm the latch; the child then exits and stays idle instead of restarting with the fixed bundle.
Useful? React with 👍 / 👎.
| void fetchManifest() | ||
| .then(setManifest) |
There was a problem hiding this comment.
Ignore stale manifest fetches after HMR errors
Fresh evidence beyond the server-side /api/manifest guard: a ready/rebuild event can start this async fetchManifest(), then a newer HMR error event can set the manifest to { error } before this request resolves. Because the promise is not versioned or cancelled, its later setManifest can overwrite the error with the last-good manifest, re-enabling the Run button for stale code until a later poll restores the error state.
Useful? React with 👍 / 👎.
| async function inspectBundle(outFile: string): Promise<InspectionResult> { | ||
| try { | ||
| const mod = (await import(moduleCacheBustUrl(outFile))) as Record< | ||
| string, | ||
| unknown | ||
| >; | ||
| const inspection = findInspectableTrainer(mod); | ||
| if (!inspection) return null; | ||
| return { | ||
| configHash: hashJobConfig(inspection.config), | ||
| trainerName: inspection.name, | ||
| }; | ||
| } catch { | ||
| return null; |
There was a problem hiding this comment.
Propagate inspection failures to HMR error state
If a rebuilt module throws during evaluation (for example top-level user code throws), Rolldown still emits BUNDLE_END, but this catch converts the failed import into configHash: null and a successful ready/rebuild event. That means /api/manifest will not stay in the HMR error state and the registry may SIGTERM-restart active runs against an artifact that arkor start cannot import, leaving the SPA to churn into a failed restart instead of surfacing the build/import failure as an error.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
packages/arkor/src/core/trainerInspection.test.ts (1)
91-121: ⚡ Quick winAdd the missing
default: trainerinspection case.This block calls out shape
#4in the regression note, but it never assertsfindInspectableTrainer({ default: trainer }). A directexport default createTrainer(...)path can regress without failing this suite.As per coding guidelines, "Add vitest test cases under `packages/*/src/**/*.test.ts` for SDK/CLI/scaffold logic changes."Proposed test addition
const trainerC = brandedTrainer("default-arkor"); const inspectionC = findInspectableTrainer({ default: createArkor({ trainer: trainerC }), }); expect(inspectionC?.name).toBe("default-arkor"); - const trainerD = brandedTrainer("default-nested"); + const trainerD = brandedTrainer("default-direct"); + const inspectionD = findInspectableTrainer({ default: trainerD }); + expect(inspectionD?.name).toBe("default-direct"); + + const trainerE = brandedTrainer("default-nested"); + const inspectionE = findInspectableTrainer({ + default: { trainer: trainerE }, + }); + expect(inspectionE?.name).toBe("default-nested"); - const inspectionD = findInspectableTrainer({ - default: { trainer: trainerD }, - }); - expect(inspectionD?.name).toBe("default-nested");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/arkor/src/core/trainerInspection.test.ts` around lines 91 - 121, Add a new assertion to cover the missing "default: trainer" shape by calling findInspectableTrainer with an object whose default property is the bare trainer; specifically create a trainer via brandedTrainer (e.g., brandedTrainer("default-bare")) and assert that findInspectableTrainer({ default: trainer }) returns the expected inspection (check .name), similar to the other cases in the test that use findInspectableTrainer, brandedTrainer, and createArkor.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/arkor/src/core/configHash.ts`:
- Around line 64-85: stableStringifyRec currently calls a value's toJSON before
adding the object to the seen set and uses Array.prototype.map which skips
sparse array holes; change it so that when maybeToJSON is a function you add the
current value to seen before invoking (so circular returns from toJSON are
detected like JSON.stringify), then call (maybeToJSON as (key: string) =>
unknown).call(value, key) and recurse on that result; for arrays, stop using
value.map and instead iterate numeric indices 0..value.length-1, treat missing
slots (i in value === false) and slots that serialize to undefined as "null"
(i.e. use stableStringifyRec per index and fall back to "null") to match
JSON.stringify behavior and ensure circulars are detected via the seen set.
In `@packages/arkor/src/core/trainer.test.ts`:
- Around line 1817-1849: The test's fetcher mock expects events for job id
"j-falsy" but the create-job stub returns minimalJobRow.id (currently "j-stop"),
causing wait() to hit the wrong stream URL; update the create-job response to
return a job with id "j-falsy" (i.e. ensure minimalJobRow.id or the object
returned in the POST branch matches "j-falsy") so the GET to
/v1/jobs/j-falsy/events/stream in the fetcher mock is exercised by the
Trainer.wait() flow.
---
Nitpick comments:
In `@packages/arkor/src/core/trainerInspection.test.ts`:
- Around line 91-121: Add a new assertion to cover the missing "default:
trainer" shape by calling findInspectableTrainer with an object whose default
property is the bare trainer; specifically create a trainer via brandedTrainer
(e.g., brandedTrainer("default-bare")) and assert that findInspectableTrainer({
default: trainer }) returns the expected inspection (check .name), similar to
the other cases in the test that use findInspectableTrainer, brandedTrainer, and
createArkor.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 429d904f-75a0-497d-bfbb-9ea6225bd3e4
⛔ Files ignored due to path filters (1)
pnpm-lock.yamlis excluded by!**/pnpm-lock.yaml
📒 Files selected for processing (43)
AGENTS.mddocs/concepts/studio.mdxdocs/ja/concepts/studio.mdxdocs/ja/studio/jobs.mdxe2e/cli/vitest.config.tse2e/studio/src/specs/hmr.spec.tseslint.config.tspackages/arkor/package.jsonpackages/arkor/src/cli/cleanupHooks.test.tspackages/arkor/src/cli/cleanupHooks.tspackages/arkor/src/cli/commands/build.tspackages/arkor/src/cli/commands/dev.test.tspackages/arkor/src/cli/commands/dev.tspackages/arkor/src/cli/commands/start.test.tspackages/arkor/src/core/configHash.test.tspackages/arkor/src/core/configHash.tspackages/arkor/src/core/moduleCacheBust.test.tspackages/arkor/src/core/moduleCacheBust.tspackages/arkor/src/core/projectState.test.tspackages/arkor/src/core/rolldownConfig.tspackages/arkor/src/core/runner.test.tspackages/arkor/src/core/runner.tspackages/arkor/src/core/runnerSignals.test.tspackages/arkor/src/core/runnerSignals.tspackages/arkor/src/core/schemas.test.tspackages/arkor/src/core/signalExit.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/core/trainerInspection.test.tspackages/arkor/src/core/trainerInspection.tspackages/arkor/src/studio/hmr.test.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/studio/manifest.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/studio/server.tspackages/arkor/src/studio/trainRegistry.test.tspackages/arkor/src/studio/trainRegistry.tspackages/cli-internal/src/templates.tspackages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/studio-app/src/lib/api.tspackages/studio-app/vite.config.tspnpm-workspace.yaml
…andling fix: improve inspectBundle error handling and manifest fetching logic in HMR test: add assertions for default-exported trainer shapes in trainerInspection tests feat: implement monotonic counter for HMR SSE handler to prevent stale state
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e726548bca
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // restored the error. The polling effect's own fetches don't need | ||
| // this because they observe `cancelled` synchronously inside the | ||
| // `tick()` closure; this counter only protects the cross-handler |
There was a problem hiding this comment.
Version manifest polling against HMR errors
When HMR is enabled, a regular 5s manifest poll that started before a later error SSE can still resolve after the error handler has set { error }, because this code explicitly leaves polling fetches outside the manifestFetchSeqRef guard. In projects whose manifest import/build is slow (for example top-level await or filesystem contention), that stale poll can overwrite the current build-error state and re-enable the Run button for the last-good artifact until the next poll restores the error.
Useful? React with 👍 / 👎.
| if (restartPendingRef.current) { | ||
| // Fast path: SSE event already landed before exit. Fire the | ||
| // restart synchronously without waiting for the grace | ||
| // window so the common case has no perceptible delay. | ||
| restartPendingRef.current = false; |
There was a problem hiding this comment.
Suppress auto-restart after failed early-stop exits
When an HMR-triggered early-stop fails to cancel the cloud job, the runner now exits nonzero (exit=143), but streamTraining() still resolves normally for any /api/train stream that closed after a 200 response. If restartPendingRef is set, this branch auto-spawns a new run without checking that exit marker, so the fixed nonzero exit path is still treated as restartable and can overlap a new cloud job with the one whose cancel failed.
Useful? React with 👍 / 👎.
| const { baseUrl: rpcBaseUrl, token: rpcToken } = | ||
| await resolveCredentialsAndBaseUrl(); |
There was a problem hiding this comment.
Capture cancel credentials at spawn time
This server-side Stop path re-reads credentials/base URL at cancel time, while the child created the cloud job using whatever credentials it loaded at trainer.start(). If the user logs in/out or otherwise changes ~/.arkor credentials while a run is in flight, Stop can send the cancel POST to the wrong account/control plane, swallow the failure, and then SIGKILL the child that still held the original credentials, leaving the original cloud job running.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e41acc7f28
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
|
|
||
| // Hash mismatch (or one side is null): graceful restart. | ||
| const r = safeKill(entry.child, "SIGTERM"); |
There was a problem hiding this comment.
Avoid SIGTERM restarts on Windows
When Studio HMR runs on Windows and a rebuild requires a restart, this SIGTERM does not reach the runner's installShutdownHandlers; Node's subprocess.kill() docs state that on Windows these signals cause the process to be killed "forcefully and abruptly" (https://nodejs.org/api/child_process.html#subprocesskillsignal). In that environment the child cannot early-stop or send cancel(), so a config-changing rebuild can orphan the cloud job and lose the checkpoint-preservation behavior this path is meant to provide.
Useful? React with 👍 / 👎.
|
|
||
| async function emitBuildSucceeded(): Promise<void> { | ||
| if (disposed) return; | ||
| const seq = ++buildSeq; |
There was a problem hiding this comment.
Invalidate stale inspections when rebuild starts
This sequence guard only advances after a later build reaches BUNDLE_END (or ERROR), so if the previous bundle's inspectBundle() is slow (for example user top-level await during dynamic import) and a new edit has already started compiling but has not ended yet, the older inspection still passes seq === buildSeq and broadcasts/signals active children for stale source. Rolldown's watch cycle includes BUNDLE_START before BUNDLE_END, so the in-flight inspection should be invalidated when the newer rebuild starts, not only when it finishes.
Useful? React with 👍 / 👎.
| } | ||
| watcher = watch({ | ||
| ...rolldownInputOptions(resolved), | ||
| output: { file: resolved.outFile, format: "esm" }, |
There was a problem hiding this comment.
Avoid importing while the watcher writes
With HMR enabled, the watcher writes .arkor/build/index.mjs at this same path while /api/train can concurrently spawn arkor start, whose runStart skips rebuilding whenever that artifact already exists. If the user clicks Run during the watcher's non-atomic write window, the child can import a partially-written bundle and fail the run instead of starting from a coherent artifact. Gate training while a rebuild is in progress or publish the watched bundle via temp-file-and-rename before making it visible to arkor start.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
packages/arkor/src/studio/server.ts (1)
1620-1624:⚠️ Potential issue | 🟠 Major | ⚡ Quick winConstrain static asset reads to
assetsDir.
readAsset()joins the request path directly ontoassetsDir, andapp.get("*")feeds itc.req.pathverbatim. Any traversal sequence that survives routing normalization can therefore escape the bundled assets tree and read arbitrary local files. Because these GETs are only host-guarded, not Studio-token gated, this is a local file disclosure on the loopback server.🔒 Suggested hardening
async function readAsset(relPath: string): Promise<Response | null> { const cleaned = relPath.replace(/^\/+/, ""); try { - const file = await readFile(join(assetsDir, cleaned)); + const filePath = resolve(assetsDir, cleaned); + if ( + filePath !== assetsDir && + !filePath.startsWith(`${assetsDir}${sep}`) + ) { + return null; + } + const file = await readFile(filePath); const ext = cleaned.slice(cleaned.lastIndexOf(".") + 1); if (ext === "html") {Also applies to: 1647-1655
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/arkor/src/studio/server.ts` around lines 1620 - 1624, readAsset currently concatenates the request path onto assetsDir and can be abused for path traversal; to fix, decode and normalize the incoming relPath, then compute the absolute target with path.resolve(assetsDir, cleaned) (use the same resolved base: const base = path.resolve(assetsDir)), and verify the resulting target is inside base (e.g. use path.relative(base, target) and ensure it does not start with ".." and does not contain path.sep at the front); if the check fails return null (or an appropriate error) and do not call readFile. Also reject suspicious inputs (null bytes, absolute paths) before resolving. Apply this same guard to the other occurrence around lines 1647-1655.packages/arkor/src/cli/commands/dev.test.ts (1)
663-682:⚠️ Potential issue | 🟠 Major | ⚡ Quick winThese persist-failure tests do not actually fail on Windows.
Line 671 and Line 738 rely on
chmodSync(..., 0o555)to block writes to~/.arkor, but NTFS does not enforce directory writes that way. On Windows both tests can go green without ever hitting thepersistStudioTokencatch path, so this regression coverage disappears on part of the CI matrix. Use a platform-neutral failure trigger here, such as makingstudioTokenPath()a directory or mocking the write boundary.Also applies to: 726-747
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/arkor/src/cli/commands/dev.test.ts` around lines 663 - 682, The tests relying on chmodSync to make ~/.arkor read-only are flaky on Windows; replace the platform-specific permission trick with a platform-neutral write-failure trigger: locate the token path via studioTokenPath(...) (or the exported persistStudioToken/writeCredentials function) and either (a) create a directory at that exact path before calling runDev({ port: 4203 }) so the token write will fail with ENOTDIR, or (b) mock/spyon the module method that writes the token (persistStudioToken or writeCredentials) to throw an EACCES-like error for this test; ensure you restore the original behavior in the finally block and keep assertions (expect(runDev...).resolves.toBeUndefined and expect(serve).toHaveBeenCalledTimes(1)) unchanged.packages/studio-app/src/components/RunTraining.tsx (2)
141-167:⚠️ Potential issue | 🟠 Major | ⚡ Quick winThe poller can still overwrite an HMR error with stale manifest data.
Line 151 is only gated by
cancelled, which flips on unmount, not on newer HMR events. If a/api/manifestpoll started before an SSEerrorframe resolves afterward, this effect will restore the last-good manifest and re-enable Run against broken code until the next refresh. Reuse the same seq/generation gate for poll fetches, or invalidate in-flight polls when HMR events land.Possible fix
async function tick() { + const seq = ++manifestFetchSeqRef.current; try { const m = await fetchManifest(); - if (!cancelled) setManifest(m); + if (!cancelled && seq === manifestFetchSeqRef.current) { + setManifest(m); + } } catch (err: unknown) { - if (!cancelled) { + if (!cancelled && seq === manifestFetchSeqRef.current) { setManifest({ error: err instanceof Error ? err.message : String(err), }); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/studio-app/src/components/RunTraining.tsx` around lines 141 - 167, The poller can overwrite a newer HMR error because it only checks cancelled; modify the effect to use a generation/sequence gate: create a ref (e.g., manifestSeqRef) that you increment whenever an SSE/HMR error frame or any HMR event invalidates the manifest, capture the current seq at the start of tick (before calling fetchManifest), then before calling setManifest or scheduling the next timer ensure the captured seq still equals manifestSeqRef.current; if it differs, discard the poll result (and do not schedule another timeout). Keep using the existing symbols (useEffect, cancelled, timer, tick, fetchManifest, setManifest) but add and check the manifestSeqRef around the async fetch and before state updates so in-flight polls are invalidated when HMR events arrive.
334-510: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy liftAdd a jsdom test suite for this restart state machine.
This component now owns the risky HMR logic - buffered pre-spawn events, late-SSE grace windows, and "Stop beats auto-restart" arbitration - but the provided coverage only hits API helpers and a caption-refresh E2E. A small Testing Library suite here would catch regressions before they become flaky cross-process Playwright failures. As per coding guidelines: Include jsdom-based Testing Library cases for Studio components. Based on learnings: Do not split implementation across multiple PRs; land docs, tests, and code in the same PR unless the user explicitly says otherwise.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/studio-app/src/components/RunTraining.tsx` around lines 334 - 510, Add a jsdom Testing Library suite for the RunTraining component exercising the run() state machine: write tests to simulate streamTraining callbacks and SSE-like pre-spawn events to validate that pendingPreSpawnEventsRef is drained, restartPendingRef wins over hot-swap, user Abort (trainingAbortRef)/ac.signal.aborted prevents logging and cancels restarts, and the 250ms restart grace window behavior (currentPidRef, restartGraceTimerRef) toggles hmrStatus correctly; stub/mock streamTraining and isHmrEnabled, drive onSpawn via the pid callback, and assert UI/state transitions (setHmrStatus outcomes "idle"/"early-stopping"/"restarting"/"hot-swapped") and that run(file) re-invokes when appropriate. Ensure tests run under jsdom (Testing Library + jest) and are added alongside the component.Sources: Coding guidelines, Learnings
packages/arkor/src/cli/commands/dev.ts (1)
199-230:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftRead-then-unlink still lets a concurrent
arkor devinstance lose its token.Line 224 and Line 230 are a TOCTOU pair: another process can rewrite
~/.arkor/studio-tokenafter the read succeeds but before this process unlinks it. In that interleaving, we compare our bytes and then delete the other process's fresh token anyway, so the shared-path 403 regression is still reachable. This needs an atomic ownership/lease scheme, not a best-effort content check.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/arkor/src/cli/commands/dev.ts` around lines 199 - 230, The current read-then-unlink in registerCleanupHook (readFileSync → tokensEqual → unlinkSync) is TOCTOU and can delete a freshly written token from another process; replace the best-effort content check with an explicit ownership/lease: when creating the token (persistStudioToken or wherever the token is written) also atomically acquire a per-process lease by creating a companion lock/ownership file using fs.openSync(lockPath, 'wx') (or writeFileSync with flag 'wx') containing a unique owner id (pid/uuid); on cleanup, instead of just reading the token file, read the lock file and compare the owner id to our stored id and only then unlink the token and remove the lock; update registerCleanupHook, persistStudioToken (or the token-writing site), and use tokensEqual only for debug — use lockPath, persistStudioToken, registerCleanupHook, and the stored owner id to locate the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@packages/arkor/src/cli/commands/dev.test.ts`:
- Around line 663-682: The tests relying on chmodSync to make ~/.arkor read-only
are flaky on Windows; replace the platform-specific permission trick with a
platform-neutral write-failure trigger: locate the token path via
studioTokenPath(...) (or the exported persistStudioToken/writeCredentials
function) and either (a) create a directory at that exact path before calling
runDev({ port: 4203 }) so the token write will fail with ENOTDIR, or (b)
mock/spyon the module method that writes the token (persistStudioToken or
writeCredentials) to throw an EACCES-like error for this test; ensure you
restore the original behavior in the finally block and keep assertions
(expect(runDev...).resolves.toBeUndefined and
expect(serve).toHaveBeenCalledTimes(1)) unchanged.
In `@packages/arkor/src/cli/commands/dev.ts`:
- Around line 199-230: The current read-then-unlink in registerCleanupHook
(readFileSync → tokensEqual → unlinkSync) is TOCTOU and can delete a freshly
written token from another process; replace the best-effort content check with
an explicit ownership/lease: when creating the token (persistStudioToken or
wherever the token is written) also atomically acquire a per-process lease by
creating a companion lock/ownership file using fs.openSync(lockPath, 'wx') (or
writeFileSync with flag 'wx') containing a unique owner id (pid/uuid); on
cleanup, instead of just reading the token file, read the lock file and compare
the owner id to our stored id and only then unlink the token and remove the
lock; update registerCleanupHook, persistStudioToken (or the token-writing
site), and use tokensEqual only for debug — use lockPath, persistStudioToken,
registerCleanupHook, and the stored owner id to locate the changes.
In `@packages/arkor/src/studio/server.ts`:
- Around line 1620-1624: readAsset currently concatenates the request path onto
assetsDir and can be abused for path traversal; to fix, decode and normalize the
incoming relPath, then compute the absolute target with path.resolve(assetsDir,
cleaned) (use the same resolved base: const base = path.resolve(assetsDir)), and
verify the resulting target is inside base (e.g. use path.relative(base, target)
and ensure it does not start with ".." and does not contain path.sep at the
front); if the check fails return null (or an appropriate error) and do not call
readFile. Also reject suspicious inputs (null bytes, absolute paths) before
resolving. Apply this same guard to the other occurrence around lines 1647-1655.
In `@packages/studio-app/src/components/RunTraining.tsx`:
- Around line 141-167: The poller can overwrite a newer HMR error because it
only checks cancelled; modify the effect to use a generation/sequence gate:
create a ref (e.g., manifestSeqRef) that you increment whenever an SSE/HMR error
frame or any HMR event invalidates the manifest, capture the current seq at the
start of tick (before calling fetchManifest), then before calling setManifest or
scheduling the next timer ensure the captured seq still equals
manifestSeqRef.current; if it differs, discard the poll result (and do not
schedule another timeout). Keep using the existing symbols (useEffect,
cancelled, timer, tick, fetchManifest, setManifest) but add and check the
manifestSeqRef around the async fetch and before state updates so in-flight
polls are invalidated when HMR events arrive.
- Around line 334-510: Add a jsdom Testing Library suite for the RunTraining
component exercising the run() state machine: write tests to simulate
streamTraining callbacks and SSE-like pre-spawn events to validate that
pendingPreSpawnEventsRef is drained, restartPendingRef wins over hot-swap, user
Abort (trainingAbortRef)/ac.signal.aborted prevents logging and cancels
restarts, and the 250ms restart grace window behavior (currentPidRef,
restartGraceTimerRef) toggles hmrStatus correctly; stub/mock streamTraining and
isHmrEnabled, drive onSpawn via the pid callback, and assert UI/state
transitions (setHmrStatus outcomes
"idle"/"early-stopping"/"restarting"/"hot-swapped") and that run(file)
re-invokes when appropriate. Ensure tests run under jsdom (Testing Library +
jest) and are added alongside the component.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: be0f3b24-0b07-46b0-87c3-4a8c45682ee7
📒 Files selected for processing (18)
AGENTS.mddocs/concepts/studio.mdxe2e/cli/vitest.config.tse2e/studio/src/specs/hmr.spec.tseslint.config.tspackages/arkor/src/cli/commands/dev.test.tspackages/arkor/src/cli/commands/dev.tspackages/arkor/src/cli/commands/start.test.tspackages/arkor/src/core/runner.test.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/studio/server.tspackages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/studio-app/src/lib/api.tspackages/studio-app/vite.config.ts
📜 Review details
🧰 Additional context used
📓 Path-based instructions (19)
docs/**/*.mdx
📄 CodeRabbit inference engine (AGENTS.md)
Verify Mintlify heading slugs with curl before adding cross-page anchor links. Mintlify preserves
/,=, and full-width JP parens()in ids, strips ASCII parens()and backticks`, and converts spaces to-. Confirm actual rendered ids before committing links.
Files:
docs/concepts/studio.mdx
**
⚙️ CodeRabbit configuration file
**: # Arkor Development GuideNote: Claude Code automatically loads this file.
Repository shape
pnpm + Turbo monorepo. Workspaces are declared in
pnpm-workspace.yaml(packages/*,e2e/*,docs).
Path Role packages/arkor Published arkorSDK + CLI + bundled local Studio server (Hono).bin/arkor→dist/bin.mjs. Library entry →dist/index.mjs.packages/create-arkor Published create-arkorscaffolder (pnpm create arkor).packages/cli-internal Private workspace package. Source is bundled into arkorandcreate-arkorvia tsdown'sdeps.alwaysBundle. Never appears as a runtime dependency on npm.packages/studio-app Private Vite + React 19 SPA. pnpm --filter@arkor/studio-appbundlebuilds it;packages/arkor/scripts/copy-studio-assets.mjscopiesdist/intopackages/arkor/dist/assets/.e2e/cli Private vitest suite that spawns the built dist/bin.mjsof both CLIs in temp dirs.e2e/studio Private Playwright suite that spawns arkor devagainst an in-process fake cloud-api and drives the Studio SPA in Chromium.docs Mintlify sources for docs.arkor.ai. Common commands
Root scripts fan out via Turbo (which respects
^builddeps in turbo.json):pnpm install pnpm build # turbo run build across all packages pnpm typecheck # tsc --noEmit across all packages pnpm lint # oxlint --deny-warnings, then strict ESLint 10 (single root config each) pnpm format # oxfmt --write across the repo (config in oxfmt.config.ts) pnpm format:check # oxfmt --check; CI gates on this (no writes) pnpm test # vitest run across all packages (incl. e2e) pnpm test:coverage # writes lcov + cobertura + junit per package; CI uploads lcov to Codecov, cobertura to GitHub nativ...
Files:
docs/concepts/studio.mdxpackages/arkor/src/cli/commands/start.test.tspackages/studio-app/vite.config.tse2e/cli/vitest.config.tspackages/studio-app/src/lib/api.tspackages/arkor/src/studio/server.tsAGENTS.mdeslint.config.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tse2e/studio/src/specs/hmr.spec.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/cli/commands/dev.tspackages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/cli/commands/dev.test.ts
**/*.{ts,tsx,js,jsx,mts,mjs,cts,cjs}
📄 CodeRabbit inference engine (CONTRIBUTING.md)
**/*.{ts,tsx,js,jsx,mts,mjs,cts,cjs}: Avoid the em dash (U+2014) in code comments, string literals, and template literals. Use colons, periods, commas, parentheses, spaced hyphens (" - "), or reworded sentences instead
Runpnpm lint(oxlint --deny-warnings followed by strict ESLint 10 with type-aware rules) before submitting code
Files:
packages/arkor/src/cli/commands/start.test.tspackages/studio-app/vite.config.tse2e/cli/vitest.config.tspackages/studio-app/src/lib/api.tspackages/arkor/src/studio/server.tseslint.config.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tse2e/studio/src/specs/hmr.spec.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/cli/commands/dev.tspackages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/cli/commands/dev.test.ts
**/*.{ts,tsx,js,jsx,mts,mjs,cts,cjs,json,css,html}
📄 CodeRabbit inference engine (CONTRIBUTING.md)
Run
pnpm formatusing oxfmt to format all JS/TS/JSON/CSS/HTML files before submitting code
Files:
packages/arkor/src/cli/commands/start.test.tspackages/studio-app/vite.config.tse2e/cli/vitest.config.tspackages/studio-app/src/lib/api.tspackages/arkor/src/studio/server.tseslint.config.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tse2e/studio/src/specs/hmr.spec.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/cli/commands/dev.tspackages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/cli/commands/dev.test.ts
**/*.test.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (CONTRIBUTING.md)
Include vitest cases for SDK/CLI/scaffolder logic, jsdom-based Testing Library cases for Studio components, or visual UI documentation when submitting PRs
Files:
packages/arkor/src/cli/commands/start.test.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tspackages/arkor/src/core/trainer.test.tspackages/studio-app/src/lib/api.test.tspackages/arkor/src/cli/commands/dev.test.ts
**/*.{ts,tsx,js,jsx,mts,mjs}
📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)
Do not use em-dashes (U+2014) anywhere in code comments, strings, and template literals. Instead use colons, periods, commas, parentheses, space-separated hyphens (
-), or rephrase sentences.
Files:
packages/arkor/src/cli/commands/start.test.tspackages/studio-app/vite.config.tse2e/cli/vitest.config.tspackages/studio-app/src/lib/api.tspackages/arkor/src/studio/server.tseslint.config.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tse2e/studio/src/specs/hmr.spec.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/cli/commands/dev.tspackages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/cli/commands/dev.test.ts
packages/{arkor,create-arkor,cli-internal}/**/*.{ts,tsx,js,mts}
📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)
CLI runtime messages, generated file templates, and test names must also follow the no em-dash rule
Files:
packages/arkor/src/cli/commands/start.test.tspackages/arkor/src/studio/server.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/cli/commands/dev.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/cli/commands/dev.test.ts
**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.{ts,tsx}: Preferinterfacefor defining object shapes in TypeScript (TypeScript-specific type definition pattern)
Use camelCase for variable names in TypeScript/JavaScript code
Always use async/await for promises in TypeScript/JavaScript code
UsetimingSafeEqualfor security-critical token comparisons (e.g., Studio CSRF token validation)
Files:
packages/arkor/src/cli/commands/start.test.tspackages/studio-app/vite.config.tse2e/cli/vitest.config.tspackages/studio-app/src/lib/api.tspackages/arkor/src/studio/server.tseslint.config.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tse2e/studio/src/specs/hmr.spec.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/cli/commands/dev.tspackages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/cli/commands/dev.test.ts
packages/arkor/src/**/*.test.ts
📄 CodeRabbit inference engine (AGENTS.md)
Add vitest test cases for SDK/CLI/scaffold logic changes
Files:
packages/arkor/src/cli/commands/start.test.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/cli/commands/dev.test.ts
**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
When bumping
oxlintoroxfmtin the catalog, runpnpm install --config.minimum-release-age=0once so platform bindings land inpnpm-lock.yaml. Regularpnpm installthereafter will resolve them normally.
Files:
packages/arkor/src/cli/commands/start.test.tspackages/studio-app/vite.config.tse2e/cli/vitest.config.tspackages/studio-app/src/lib/api.tspackages/arkor/src/studio/server.tseslint.config.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tse2e/studio/src/specs/hmr.spec.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/cli/commands/dev.tspackages/studio-app/src/lib/api.test.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/cli/commands/dev.test.ts
packages/studio-app/vite.config.ts
📄 CodeRabbit inference engine (AGENTS.md)
Configure the
arkor-studio-tokenVite plugin withapply: "serve"(dev-only) to inject<meta name="arkor-studio-token">into index.html on each request. Never apply this plugin duringvite buildto prevent baking a stale token into production.
Files:
packages/studio-app/vite.config.ts
e2e/cli/**/*.{ts,tsx,js,mts}
📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)
E2E CLI tests must consume the built dist/bin.mjs from arkor and create-arkor packages, ensuring pnpm build is run first when executing tests standalone
Files:
e2e/cli/vitest.config.ts
packages/arkor/src/studio/server.ts
📄 CodeRabbit inference engine (AGENTS.md)
Validate every
/api/*request viaX-Arkor-Studio-Tokenheader (or?studioToken=query forEventSource) usingtimingSafeEqualcomparison. Only GET stream-only routes can be added to theeventStreamPathPatternallow-list regex; never add mutation endpoints.
Files:
packages/arkor/src/studio/server.ts
AGENTS.md
📄 CodeRabbit inference engine (CLAUDE.md)
Maintain AGENTS.md as a living document with current agent status and architectural decisions
Files:
AGENTS.md
**/*.md
📄 CodeRabbit inference engine (AGENTS.md)
Maintain English/Japanese documentation pairs:
README.md↔README.ja.md,CONTRIBUTING.md↔CONTRIBUTING.ja.md, anddocs/↔docs/ja/. Update both sides in the same PR.
Files:
AGENTS.md
eslint.config.ts
📄 CodeRabbit inference engine (AGENTS.md)
ESLint 10 reads
eslint.config.tsflat config at repo root (type-aware viaprojectService). Configure strict rules: typescript-eslintstrictTypeChecked+stylisticTypeChecked, unicorn, import-x, promise, n, react-hooks, jsx-a11y, vitest. Add per-file overrides for SPA (React), tests, Playwright fixtures, and scripts. Keepauto-fixunsafe rules (unicorn/no-typeof-undefined,no-unnecessary-type-assertion) set tooffwith comments explaining the failure modes.
Files:
eslint.config.ts
e2e/studio/**/*.{ts,tsx,js,mts}
📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)
E2E Studio tests must consume the built dist/bin.mjs from arkor and the bundled Studio assets, requiring playwright chromium installation and pnpm build as prerequisites
Files:
e2e/studio/src/specs/hmr.spec.ts
packages/arkor/src/cli/commands/dev.ts
📄 CodeRabbit inference engine (AGENTS.md)
Generate a 32-byte base64url CSRF token per dev server launch and pass it to
buildStudioApp({ studioToken, hmr })
Files:
packages/arkor/src/cli/commands/dev.ts
packages/arkor/src/studio/hmr.ts
📄 CodeRabbit inference engine (AGENTS.md)
Keep a Rolldown watcher over
src/arkor/to track rebuilds and push events over/api/dev/events(SSE). Dynamic-import artifacts on success, extractTrainerInspectionsnapshots viaSymbol.for("arkor.trainer.inspect")brand, and compute stableconfigHashfrom cloud-sideJobConfig.
Files:
packages/arkor/src/studio/hmr.ts
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Use the Repository pattern for data access
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Treat `createArkor` as a function that returns a frozen, opaque manifest tagged with `_kind: "arkor"`; pass it to tooling, not as a programmable client for user code
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Both e2e/cli and e2e/studio declare `arkor` (and `create-arkor`) as `workspace:*` devDependencies so Turbo's `^build` produces `dist/bin.mjs` exactly once before test runs (no pretest hooks, no concurrent races). Standalone runs need a prior `pnpm build`.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: E2E coverage uses `c8` wrapping vitest (not vitest's native coverage) so child CLI V8 coverage is captured and remapped through tsdown sourcemaps back to `src/`. Set `CREATE_ARKOR_BUILD_SOURCEMAP=1` for E2E coverage builds; CI declares it in turbo.json so it propagates and busts the no-sourcemap cache variant. Release builds do NOT set this and ship without `.map` files.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Release SBOMs are generated with `pnpm sbom --prod --sbom-spec-version 1.7` (CycloneDX 1.7 and SPDX 2.3 formats), attested under the trusted builder's OIDC identity, and uploaded to both Codecov and GitHub native code coverage. Bundled runtime inputs of `arkor/studio-app` (e.g. `react`, `fontsource-variable/geist*`) must be `dependencies`, not `devDependencies`, so they appear in the production SBOM.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Do NOT split implementation (features, schema, behavior changes) across multiple PRs. Land docs (English + Japanese), tests, and code in the same PR unless the user explicitly says otherwise.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Node version: published packages declare `engines.node >=22.22.0` (raised from 22.6 to dodge the Jan 2026 async-hooks CVE). Use Node 24 (latest preferred) for development.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Preserve the SIGTERM-and-let-the-child-handle-it pattern in HMR subprocess lifecycle. Do not replace it with SIGKILL escalation (orphans Cloud jobs) or always-hot-swap (risks stale JobConfig). Do not surface `requestEarlyStop()` or `replaceCallbacks()` as public Trainer interface methods; keep them behind `Symbol.for` brands (dev-only HMR primitives).
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: HMR Symbols (`Symbol.for("arkor.trainer.inspect")`, `Symbol.for("arkor.trainer.replaceCallbacks")`, `Symbol.for("arkor.trainer.requestEarlyStop")`) are dev-only HMR primitives and must never be exposed in the published Trainer public interface. User code that wants similar semantics should compose `abortSignal` + `cancel()` per the cookbook.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: When touching the Studio server or SPA fetch layer, preserve: token via header for `fetch`, query param for `EventSource`, host-header guard, no CORS, timing-safe compare. The Vite plugin is dev-only (`apply: "serve"`); running it during `vite build` would bake a stale per-launch token into production.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: CORS is intentionally NOT configured on `/api/*`. The SPA is same-origin so reflecting `*` would let "simple" cross-origin POSTs reach handlers. The token check rejects those; cross-origin tabs cannot read the SPA's `<meta>`. This prevents another browser tab on the same machine from POSTing `/api/train` (RCE-grade exposure).
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Generated files copied into package directories are gitignored: `packages/*/CONTRIBUTING.md` (from root), `packages/arkor/docs/` (from root `docs/`). Always edit the source under repo root, not the copies.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Do not call a HuggingFace model name "non-existent" based on training data alone. Verify (e.g. WebFetch) before flagging in issues or PR comments. If unverifiable, hedge ("could not confirm") rather than asserting absence.
🪛 LanguageTool
AGENTS.md
[style] ~100-~100: Consider using the typographical ellipsis character here instead.
Context: ...emoved. The build / coverage lanes set ARKOR_INTERNAL_SCAFFOLD_ARKOR_SPEC=file:.../packages/arkor so scaffolded fixtures ...
(ELLIPSIS)
[uncategorized] ~132-~132: The official name of this software platform is spelled with a capital “H”.
Context: ...TOCOL`. - Release SBOMs (build.yaml): each release sh...
(GITHUB)
[uncategorized] ~132-~132: The official name of this software platform is spelled with a capital “H”.
Context: ...ifact, and postflight in release.yaml attaches them t...
(GITHUB)
[uncategorized] ~132-~132: The official name of this software platform is spelled with a capital “H”.
Context: ...npm emits 2.3), so release-dry-run.yaml regener...
(GITHUB)
🔇 Additional comments (11)
packages/arkor/src/core/trainer.ts (1)
32-32: LGTM!Also applies to: 157-157, 248-248
eslint.config.ts (1)
4-4: LGTM!Also applies to: 31-38, 40-98, 154-167, 176-176, 334-334, 357-357
packages/arkor/src/core/runner.test.ts (1)
171-171: LGTM!packages/arkor/src/core/trainer.test.ts (1)
279-279: LGTM!Also applies to: 674-675, 856-858, 1841-1841
packages/arkor/src/studio/hmr.ts (1)
108-108: LGTM!Also applies to: 111-111, 347-347
packages/studio-app/src/components/RunTraining.tsx (1)
241-251: Em-dash check: no U+2014 characters in RunTraining.tsx
packages/studio-app/src/components/RunTraining.tsxcontains no U+2014 em-dash characters, so the em-dash guideline issue is not present here.> Likely an incorrect or invalid review comment.packages/studio-app/src/lib/api.ts (1)
174-183: LGTM!Also applies to: 262-268, 414-424, 447-453, 476-485, 510-515
packages/studio-app/vite.config.ts (1)
61-63: LGTM!AGENTS.md (1)
100-100: LGTM!Also applies to: 132-132
docs/concepts/studio.mdx (1)
15-15: LGTM!e2e/cli/vitest.config.ts (1)
22-25: LGTM!Also applies to: 45-46
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c8c110d181
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const snap = win32CancelSnapshots.find( | ||
| (s) => s.pid === target.pid, | ||
| ); | ||
| if (!snap?.jobId || !snap.scope) continue; |
There was a problem hiding this comment.
Fall back to late scope before Windows HMR cancel
On Windows, child.kill("SIGTERM") forcefully terminates the runner (Node documents this as a forceful/abrupt kill), so this server-side cancel is the only chance to release the cloud job for HMR restarts. In first-run anonymous projects the registry scope is commonly null at spawn because .arkor/state.json is created inside the child during trainer.start(), but the job id may be known by the time a rebuild forces restart; this !snap.scope guard then skips the cancel POST and the abrupt kill can leave the remote job running. Fresh evidence versus the earlier Windows SIGTERM finding: the new mitigation still only uses the spawn-time snapshot and does not do the late state-file fallback that the manual Stop path added.
Useful? React with 👍 / 👎.
| const { baseUrl: rpcBaseUrl, token: rpcToken } = | ||
| await resolveCredentialsAndBaseUrl(); |
There was a problem hiding this comment.
Use the spawn credentials for Windows HMR cancels
When a Windows HMR restart hits this fallback, the child has just been killed abruptly, so this POST is the only process that can cancel the cloud job. Resolving credentials here uses whatever account/control plane is on disk at rebuild time; if the user logged in/out or switched hosts after the run was spawned, the cancel goes to the wrong place, the error is swallowed, and the original job keeps running. Fresh evidence versus the earlier credential-capture report: the manual Stop path now uses its spawn-time spawnRpc, but this new Windows HMR fallback still bypasses that snapshot.
Useful? React with 👍 / 👎.
| void trainer | ||
| .cancel() |
There was a problem hiding this comment.
Disarm the timeout latch before fallback cancel
When the early-stop timeout fires, the latch stays armed until this fallback cancel() settles. If a checkpoint.saved event arrives during that in-flight cancel POST, the checkpoint branch still sees earlyStopRequested && earlyStopDeferred and issues a second trainer.cancel(); on APIs that reject duplicate/already-cancelled requests, that second POST can mark the run failed and make the SIGTERM handler exit nonzero even though the timeout fallback cancel may have succeeded. Clear or mark the latch before starting the fallback cancel so late checkpoints cannot re-enter the cancel path.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/arkor/src/studio/server.ts (1)
354-362:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftThe manifest fast path drifts from custom HMR entries.
options.hmrcan be watching an explicit entry/output, but this code always re-deriveshmrOutFilefromcwdalone. In anarkor devsession started against a non-default entry,/api/manifestwill keep reading the default.arkor/build/index.mjswhile the watcher is publishing a different artifact, so the SPA polls stale or missing manifest data even though HMR is healthy. Please plumb the coordinator's resolvedoutFileintobuildStudioAppinstead of recomputing it here.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/arkor/src/studio/server.ts` around lines 354 - 362, The current hmrOutFile computation recomputes the build entry from trainCwd (using resolveBuildEntry) and ignores a coordinator-resolved output; update the logic so buildStudioApp receives and uses the coordinator's resolved outFile instead of recomputing it here: stop deriving hmrOutFile from resolveBuildEntry({ cwd: trainCwd }) when options.hmr is true and instead accept a resolvedOutFile/outFile parameter (the coordinator's value) and assign hmrOutFile = coordinatorOutFile (or the passed-in resolvedOutFile) so readManifestSummary and the /api/manifest fast path use the coordinator-provided artifact; keep fallback behavior for undefined/out-of-date files as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/arkor/src/cli/commands/dev.ts`:
- Around line 261-278: Replace the non-atomic existsSync+renameSync path restore
with an atomic "create only if absent" operation: in the try block where you
currently call renameSync(reapPath, path), use fs.linkSync(reapPath, path) (or
fs.copyFileSync(reapPath, path, fs.constants.COPYFILE_EXCL)) so the creation
will fail with EEXIST if path was recreated, and in the catch for that EEXIST
simply unlinkSync(reapPath); keep the unlinkSync(reapPath) branch when path
already existed before attempting the atomic create; reference existsSync,
unlinkSync, renameSync (replace), linkSync or copyFileSync with COPYFILE_EXCL,
reapPath and path to locate the change.
In `@packages/arkor/src/studio/hmr.ts`:
- Around line 431-445: The staging rename currently changes mtime/ctime and
breaks inspectBundle() caching because moduleCacheBustUrl(outFile) is used; fix
by comparing content hashes and skipping the atomic rename when the staging
content equals the existing artifact: compute a checksum (e.g., SHA256) of
stagingFile and resolved.outFile at the BUNDLE_END/rename path (the code around
stagingFile and watcher), and only call renameSync when hashes differ;
alternatively, update inspectBundle() and moduleCacheBustUrl() to use a
content-hash-based cache key (read file bytes and derive the cache key) so
identical-byte rebuilds map to the same inspection record.
In `@packages/arkor/src/studio/server.ts`:
- Around line 1074-1121: The win32 restart path snapshots only pid/jobId/scope
and re-resolves credentials later, reintroducing races (null jobId, null scope,
auth/host switches) — instead attach a spawn-time RPC cancel context (baseUrl,
token and any resolved RPC client info) to each registry entry when the child is
spawned (augment whatever structure activeTrains.list()/activeTrains registry
returns), then in the restart loop (the restartTargets handling after
activeTrains.dispatchRebuild) look up that stored RPC snapshot (instead of
calling resolveCredentialsAndBaseUrl/createRpc again) and call
rpc.v1.jobs[":id"].cancel.$post with the stored snapshot; still keep the
existing late fallback to jobId/state when the spawn-time snapshot is missing,
but prefer the stored spawn-time RPC context to avoid races.
In `@packages/studio-app/src/components/RunTraining.tsx`:
- Around line 463-489: The current guard only suppresses auto-restart for
explicit nonzero exit codes; change it so a missing trailer (exitCode === null)
also suppresses restart when a restart was pending. Update the condition around
the block that manipulates restartPendingRef.current, currentPidRef.current,
restartGraceTimerRef.current, setHmrStatus and setLog (the existing if that
checks exitCode) to: if ((exitCode !== null && exitCode !== 0) || (exitCode ===
null && restartPendingRef.current)) { ... } so the same suppression/cleanup and
log (via appendCapped) runs for null when a restart was pending.
---
Outside diff comments:
In `@packages/arkor/src/studio/server.ts`:
- Around line 354-362: The current hmrOutFile computation recomputes the build
entry from trainCwd (using resolveBuildEntry) and ignores a coordinator-resolved
output; update the logic so buildStudioApp receives and uses the coordinator's
resolved outFile instead of recomputing it here: stop deriving hmrOutFile from
resolveBuildEntry({ cwd: trainCwd }) when options.hmr is true and instead accept
a resolvedOutFile/outFile parameter (the coordinator's value) and assign
hmrOutFile = coordinatorOutFile (or the passed-in resolvedOutFile) so
readManifestSummary and the /api/manifest fast path use the coordinator-provided
artifact; keep fallback behavior for undefined/out-of-date files as before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 3eac6c93-4022-4549-b988-df0e2aa0795a
📒 Files selected for processing (8)
packages/arkor/src/cli/commands/dev.test.tspackages/arkor/src/cli/commands/dev.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/studio/server.tspackages/studio-app/src/components/RunTraining.hmr.test.tsxpackages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/studio-app/src/lib/api.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (52)
- GitHub Check: install · yarn-berry · macos-latest · node 26.0.0
- GitHub Check: install · npm · macos-latest · node >=26.0.0 <27
- GitHub Check: install · bun · macos-latest · node >=26.0.0 <27
- GitHub Check: install · yarn-berry · macos-latest · node >=26.0.0 <27
- GitHub Check: install · pnpm-9 · macos-latest · node 26.0.0
- GitHub Check: install · yarn · macos-latest · node >=26.0.0 <27
- GitHub Check: install · pnpm-11 · macos-latest · node >=26.0.0 <27
- GitHub Check: install · pnpm-10 · macos-latest · node >=26.0.0 <27
- GitHub Check: install · pnpm-10 · macos-latest · node 26.0.0
- GitHub Check: install · pnpm-9 · macos-latest · node >=26.0.0 <27
- GitHub Check: install · npm · macos-latest · node >=24.12.0 <25
- GitHub Check: install · yarn-berry · macos-latest · node >=24.12.0 <25
- GitHub Check: install · bun · macos-latest · node >=24.12.0 <25
- GitHub Check: install · pnpm-10 · windows-latest · node 24.0.0
- GitHub Check: install · pnpm-11 · macos-latest · node 24.12.0
- GitHub Check: install · pnpm-11 · windows-latest · node >=26.0.0 <27
- GitHub Check: install · yarn-berry · windows-latest · node 24.12.0
- GitHub Check: install · bun · windows-latest · node 24.12.0
- GitHub Check: install · pnpm-11 · macos-latest · node >=22.22.0 <23
- GitHub Check: install · yarn · windows-latest · node >=26.0.0 <27
- GitHub Check: install · pnpm-10 · macos-latest · node 24.0.0
- GitHub Check: install · bun · windows-latest · node >=24.12.0 <25
- GitHub Check: install · pnpm-11 · macos-latest · node 22.22.0
- GitHub Check: install · bun · windows-latest · node 26.0.0
- GitHub Check: install · bun · windows-latest · node >=26.0.0 <27
- GitHub Check: install · pnpm-10 · windows-latest · node 26.0.0
- GitHub Check: install · npm · windows-latest · node >=26.0.0 <27
- GitHub Check: install · pnpm-11 · windows-latest · node >=24.12.0 <25
- GitHub Check: install · yarn · windows-latest · node 26.0.0
- GitHub Check: install · pnpm-10 · windows-latest · node >=26.0.0 <27
- GitHub Check: install · yarn-berry · windows-latest · node >=24.12.0 <25
- GitHub Check: install · yarn · windows-latest · node 24.12.0
- GitHub Check: install · pnpm-9 · windows-latest · node >=26.0.0 <27
- GitHub Check: install · npm · windows-latest · node 24.12.0
- GitHub Check: install · yarn-berry · windows-latest · node 26.0.0
- GitHub Check: install · yarn · windows-latest · node >=24.12.0 <25
- GitHub Check: install · pnpm-10 · windows-latest · node >=24.12.0 <25
- GitHub Check: install · pnpm-11 · windows-latest · node 24.12.0
- GitHub Check: install · yarn · windows-latest · node 24.0.0
- GitHub Check: install · pnpm-9 · windows-latest · node >=24.12.0 <25
- GitHub Check: install · pnpm-9 · windows-latest · node 24.12.0
- GitHub Check: install · npm · windows-latest · node 26.0.0
- GitHub Check: install · pnpm-11 · windows-latest · node 24.0.0
- GitHub Check: install · npm · windows-latest · node >=24.12.0 <25
- GitHub Check: install · pnpm-10 · windows-latest · node 24.12.0
- GitHub Check: install · npm · windows-latest · node 24.0.0
- GitHub Check: install · npm · windows-latest · node >=22.22.0 <23
- GitHub Check: install · pnpm-9 · windows-latest · node 22.22.0
- GitHub Check: install · npm · windows-latest · node 22.22.0
- GitHub Check: install · yarn · windows-latest · node >=22.22.0 <23
- GitHub Check: install · npm · windows-latest · node 26.0.0
- GitHub Check: typecheck · lint · test · build · windows-latest · node 22.22.0
🧰 Additional context used
📓 Path-based instructions (11)
**/*.{ts,tsx,js,jsx,mts,mjs,cts,cjs}
📄 CodeRabbit inference engine (CONTRIBUTING.md)
**/*.{ts,tsx,js,jsx,mts,mjs,cts,cjs}: Avoid the em dash (U+2014) in code comments, string literals, and template literals. Use colons, periods, commas, parentheses, spaced hyphens (" - "), or reworded sentences instead
Runpnpm lint(oxlint --deny-warnings followed by strict ESLint 10 with type-aware rules) before submitting code
Files:
packages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/studio-app/src/lib/api.tspackages/studio-app/src/components/RunTraining.hmr.test.tsxpackages/arkor/src/studio/server.tspackages/arkor/src/cli/commands/dev.tspackages/arkor/src/cli/commands/dev.test.tspackages/arkor/src/studio/hmr.ts
**/*.{ts,tsx,js,jsx,mts,mjs,cts,cjs,json,css,html}
📄 CodeRabbit inference engine (CONTRIBUTING.md)
Run
pnpm formatusing oxfmt to format all JS/TS/JSON/CSS/HTML files before submitting code
Files:
packages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/studio-app/src/lib/api.tspackages/studio-app/src/components/RunTraining.hmr.test.tsxpackages/arkor/src/studio/server.tspackages/arkor/src/cli/commands/dev.tspackages/arkor/src/cli/commands/dev.test.tspackages/arkor/src/studio/hmr.ts
**/*.{ts,tsx,js,jsx,mts,mjs}
📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)
Do not use em-dashes (U+2014) anywhere in code comments, strings, and template literals. Instead use colons, periods, commas, parentheses, space-separated hyphens (
-), or rephrase sentences.
Files:
packages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/studio-app/src/lib/api.tspackages/studio-app/src/components/RunTraining.hmr.test.tsxpackages/arkor/src/studio/server.tspackages/arkor/src/cli/commands/dev.tspackages/arkor/src/cli/commands/dev.test.tspackages/arkor/src/studio/hmr.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
TypeScript: prefer interface for defining object shapes; use camelCase for variable names; always use async/await for promises
Files:
packages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/studio-app/src/lib/api.tspackages/studio-app/src/components/RunTraining.hmr.test.tsxpackages/arkor/src/studio/server.tspackages/arkor/src/cli/commands/dev.tspackages/arkor/src/cli/commands/dev.test.tspackages/arkor/src/studio/hmr.ts
**
⚙️ CodeRabbit configuration file
**: # Arkor Development GuideNote: Claude Code automatically loads this file.
Repository shape
pnpm + Turbo monorepo. Workspaces are declared in
pnpm-workspace.yaml(packages/*,e2e/*,docs).
Path Role packages/arkor Published arkorSDK + CLI + bundled local Studio server (Hono).bin/arkor→dist/bin.mjs. Library entry →dist/index.mjs.packages/create-arkor Published create-arkorscaffolder (pnpm create arkor).packages/cli-internal Private workspace package. Source is bundled into arkorandcreate-arkorvia tsdown'sdeps.alwaysBundle. Never appears as a runtime dependency on npm.packages/studio-app Private Vite + React 19 SPA. pnpm --filter@arkor/studio-appbundlebuilds it;packages/arkor/scripts/copy-studio-assets.mjscopiesdist/intopackages/arkor/dist/assets/.e2e/cli Private vitest suite that spawns the built dist/bin.mjsof both CLIs in temp dirs.e2e/studio Private Playwright suite that spawns arkor devagainst an in-process fake cloud-api and drives the Studio SPA in Chromium.docs Mintlify sources for docs.arkor.ai. Common commands
Root scripts fan out via Turbo (which respects
^builddeps in turbo.json):pnpm install pnpm build # turbo run build across all packages pnpm typecheck # tsc --noEmit across all packages pnpm lint # oxlint --deny-warnings, then strict ESLint 10 (single root config each) pnpm format # oxfmt --write across the repo (config in oxfmt.config.ts) pnpm format:check # oxfmt --check; CI gates on this (no writes) pnpm test # vitest run across all packages (incl. e2e) pnpm test:coverage # writes lcov + cobertura + junit per package; CI uploads lcov to Codecov, cobertura to GitHub nativ...
Files:
packages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/studio-app/src/lib/api.tspackages/studio-app/src/components/RunTraining.hmr.test.tsxpackages/arkor/src/studio/server.tspackages/arkor/src/cli/commands/dev.tspackages/arkor/src/cli/commands/dev.test.tspackages/arkor/src/studio/hmr.ts
**/*.test.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (CONTRIBUTING.md)
Include vitest cases for SDK/CLI/scaffolder logic, jsdom-based Testing Library cases for Studio components, or visual UI documentation when submitting PRs
Files:
packages/studio-app/src/lib/api.test.tspackages/studio-app/src/components/RunTraining.hmr.test.tsxpackages/arkor/src/cli/commands/dev.test.ts
**/*.test.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Test files must be placed under packages//src/**/.test.ts with vitest convention; e2e scenarios in e2e/cli/src and e2e/studio/**/*.test.ts
Files:
packages/studio-app/src/lib/api.test.tspackages/studio-app/src/components/RunTraining.hmr.test.tsxpackages/arkor/src/cli/commands/dev.test.ts
packages/{arkor,create-arkor,cli-internal}/**/*.{ts,tsx,js,mts}
📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)
CLI runtime messages, generated file templates, and test names must also follow the no em-dash rule
Files:
packages/arkor/src/studio/server.tspackages/arkor/src/cli/commands/dev.tspackages/arkor/src/cli/commands/dev.test.tspackages/arkor/src/studio/hmr.ts
**/studio/server.ts
📄 CodeRabbit inference engine (AGENTS.md)
**/studio/server.ts: Validate every /api/* request via X-Arkor-Studio-Token header (or ?studioToken= query for EventSource) using timingSafeEqual; enforce host-header allow-list (127.0.0.1/localhost) for DNS-rebinding defence; do NOT configure CORS
Restrict EventSource ?studioToken= query param allow-list to GET stream-only routes via eventStreamPathPattern regex in studio/server.ts; adding entries is CSRF-sensitive and must never include mutation endpoints
Files:
packages/arkor/src/studio/server.ts
**/cli/commands/dev.ts
📄 CodeRabbit inference engine (AGENTS.md)
Generate a 32-byte base64url CSRF token per arkor dev launch in dev.ts; persist to ~/.arkor/studio-token (mode 0600) and inject into Studio SPA via Vite plugin
Files:
packages/arkor/src/cli/commands/dev.ts
**/studio/hmr.ts
📄 CodeRabbit inference engine (AGENTS.md)
Keep Rolldown watcher over src/arkor/ in packages/arkor/src/studio/hmr.ts; on rebuild, pull TrainerInspection snapshot and compute configHash for hot-swap decision logic
Files:
packages/arkor/src/studio/hmr.ts
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-12T10:25:25.551Z
Learning: Use pnpm + Turbo monorepo structure with workspaces declared in pnpm-workspace.yaml (packages/*, e2e/*, docs)
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-12T10:25:25.551Z
Learning: For all new features, SDK/CLI/Studio behaviour changes, or schema additions, include docs in both English and Japanese (README.md ↔ README.ja.md, docs/ ↔ docs/ja/) AND vitest test cases in the same PR; do not defer docs or tests to follow-ups
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-12T10:25:25.551Z
Learning: Set ARKOR_INTERNAL_SCAFFOLD_ARKOR_SPEC=file:.../packages/arkor in build/coverage CI lanes so fixtures resolve to in-workspace arkor; install-matrix does NOT set this and packs fresh arkor-*.tgz instead
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-12T10:25:25.551Z
Learning: When bumping oxlint or oxfmt in catalog, run pnpm install --config.minimum-release-age=0 once so platform bindings land in pnpm-lock.yaml; fixes pnpm 10.x minimumReleaseAge interaction
🪛 OpenGrep (1.22.0)
packages/studio-app/src/lib/api.ts
[ERROR] 473-473: Dynamic command passed to child_process.exec/execSync. Use child_process.execFile or spawn with an argument array instead.
(coderabbit.command-injection.exec-js)
| // Foreign token: restore it, UNLESS the other process already | ||
| // re-wrote `path` after our rename claimed the old copy. Rename | ||
| // REPLACES an existing destination rather than failing, so a | ||
| // bare rename-back would clobber that fresher token with the | ||
| // older claimed one. The existence probe shrinks the clobber | ||
| // window from "read → unlink" (previous design, milliseconds | ||
| // spanning a token comparison) to the few instructions between | ||
| // existsSync and renameSync, and the losing outcome in that | ||
| // residual window is restore-the-older-token (the other dev | ||
| // server 403s until its own next rewrite), not delete-the-token | ||
| // outright. | ||
| try { | ||
| if (existsSync(path)) { | ||
| // Newest writer wins: discard the claimed older copy. | ||
| unlinkSync(reapPath); | ||
| } else { | ||
| renameSync(reapPath, path); | ||
| } |
There was a problem hiding this comment.
Restore the foreign token without a replace-on-exist race.
This fallback still does existsSync(path) and then renameSync(reapPath, path). If another arkor dev recreates path after the existence check but before the rename-back, this branch overwrites that newer token with the older claimed one, so the sibling Studio can still fall back to 403s. Restore with an atomic "create only if absent" step instead, then discard reapPath on EEXIST.
Possible fix
-import { existsSync, readFileSync, renameSync, unlinkSync } from "node:fs";
+import {
+ constants as fsConstants,
+ copyFileSync,
+ readFileSync,
+ renameSync,
+ unlinkSync,
+} from "node:fs";
...
try {
- if (existsSync(path)) {
- // Newest writer wins: discard the claimed older copy.
- unlinkSync(reapPath);
- } else {
- renameSync(reapPath, path);
- }
- } catch {
+ copyFileSync(reapPath, path, fsConstants.COPYFILE_EXCL);
+ unlinkSync(reapPath);
+ } catch (err) {
+ const code =
+ typeof err === "object" && err !== null && "code" in err
+ ? String(err.code)
+ : "";
+ if (code === "EEXIST") {
+ try {
+ unlinkSync(reapPath);
+ } catch {
+ // best-effort
+ }
+ return;
+ }
try {
unlinkSync(reapPath);
} catch {
// best-effort
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/arkor/src/cli/commands/dev.ts` around lines 261 - 278, Replace the
non-atomic existsSync+renameSync path restore with an atomic "create only if
absent" operation: in the try block where you currently call
renameSync(reapPath, path), use fs.linkSync(reapPath, path) (or
fs.copyFileSync(reapPath, path, fs.constants.COPYFILE_EXCL)) so the creation
will fail with EEXIST if path was recreated, and in the catch for that EEXIST
simply unlinkSync(reapPath); keep the unlinkSync(reapPath) branch when path
already existed before attempting the atomic create; reference existsSync,
unlinkSync, renameSync (replace), linkSync or copyFileSync with COPYFILE_EXCL,
reapPath and path to locate the change.
| // The watcher writes to a STAGING file beside the real artefact, | ||
| // which BUNDLE_END atomically renames onto `outFile` (Codex P2, | ||
| // round 81). Writing `outFile` directly would let a concurrent | ||
| // `/api/train` spawn race the watcher's non-atomic write: | ||
| // `runStart` skips its own rebuild whenever the artefact exists, | ||
| // so a Run click landing inside the write window would | ||
| // dynamic-import a partially-written bundle and crash the run. | ||
| // `renameSync` within the same directory is atomic on POSIX and | ||
| // uses MoveFileEx(MOVEFILE_REPLACE_EXISTING) on Windows, so | ||
| // readers only ever observe the previous complete artefact or | ||
| // the new complete artefact, never a torn write. Bonus: a build | ||
| // that errors before BUNDLE_END leaves the last-good artefact | ||
| // in place instead of clobbering it with partial output. | ||
| const stagingFile = `${resolved.outFile}.hmr-staging`; | ||
| watcher = watch({ |
There was a problem hiding this comment.
Identical-byte rebuilds no longer reuse the cached inspection module.
Replacing resolved.outFile with a freshly written staging file changes mtime/ctime on every BUNDLE_END, even when the bytes are unchanged. inspectBundle() still cache-busts through moduleCacheBustUrl(outFile), so duplicate or no-op rebuilds now import a brand-new URL, rerun top-level module code, and retain another permanent ESM-cache record. That breaks the "same bytes collapse to the same inspection record" invariant this file relies on, especially for the duplicate BUNDLE_END case already documented in hmr.test.ts. Please either skip the rename when the staged content hash matches the current artifact, or switch the inspection cache key to a content hash.
Also applies to: 467-472
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/arkor/src/studio/hmr.ts` around lines 431 - 445, The staging rename
currently changes mtime/ctime and breaks inspectBundle() caching because
moduleCacheBustUrl(outFile) is used; fix by comparing content hashes and
skipping the atomic rename when the staging content equals the existing
artifact: compute a checksum (e.g., SHA256) of stagingFile and resolved.outFile
at the BUNDLE_END/rename path (the code around stagingFile and watcher), and
only call renameSync when hashes differ; alternatively, update inspectBundle()
and moduleCacheBustUrl() to use a content-hash-based cache key (read file bytes
and derive the cache key) so identical-byte rebuilds map to the same inspection
record.
| // win32 (Codex P1, round 81): Node's `subprocess.kill("SIGTERM")` | ||
| // on Windows terminates the child "forcefully and abruptly" | ||
| // (per Node's child_process docs), so the runner's | ||
| // `installShutdownHandlers` never runs and the child can't | ||
| // issue its own graceful `cancel()` POST. Snapshot each active | ||
| // child's parsed jobId + spawn scope BEFORE dispatch (after | ||
| // dispatch the kill may already be reaping entries via close | ||
| // handlers) so the server can fire the cloud cancel on the | ||
| // child's behalf for every restart target. | ||
| const win32CancelSnapshots = | ||
| process.platform === "win32" | ||
| ? activeTrains.list().map((entry) => ({ | ||
| pid: entry.child.pid, | ||
| jobId: entry.jobId, | ||
| scope: entry.scope, | ||
| })) | ||
| : []; | ||
| const { hotSwapTargets, restartTargets } = activeTrains.dispatchRebuild( | ||
| nextHash, | ||
| nextArtifactContentHash, | ||
| ); | ||
| if (process.platform === "win32" && restartTargets.length > 0) { | ||
| for (const target of restartTargets) { | ||
| const snap = win32CancelSnapshots.find( | ||
| (s) => s.pid === target.pid, | ||
| ); | ||
| if (!snap?.jobId || !snap.scope) continue; | ||
| const { jobId: snapJobId, scope: snapScope } = snap; | ||
| // Fire-and-forget per child, mirroring the manual-Stop | ||
| // cancel path: SIGTERM-as-forceful-kill on win32 means | ||
| // nobody else will release the cloud job. Best-effort; | ||
| // cloud reaper / TTL is the safety net on failure. | ||
| void (async () => { | ||
| try { | ||
| const { baseUrl: rpcBaseUrl, token: rpcToken } = | ||
| await resolveCredentialsAndBaseUrl(); | ||
| const rpc = createRpc(rpcBaseUrl, rpcToken); | ||
| await rpc.v1.jobs[":id"].cancel.$post({ | ||
| param: { id: snapJobId }, | ||
| query: { | ||
| orgSlug: snapScope.orgSlug, | ||
| projectSlug: snapScope.projectSlug, | ||
| }, | ||
| }); | ||
| } catch { | ||
| // best-effort | ||
| } | ||
| })(); |
There was a problem hiding this comment.
Windows restart cancel dropped the robust spawn-time cancel context.
This branch snapshots only pid/jobId/scope and then resolves credentials again at rebuild time. That reopens the same races the manual Stop path above already accounted for: jobId can still be null when the rebuild lands, first-run anonymous runs still have scope === null, and a login or control-plane switch mid-run makes the cancel POST target the wrong account/host. On win32 the child is already being terminated, so any of those cases leaves the original cloud job orphaned. Please reuse the same spawn-time RPC snapshot plus late jobId/state fallback here, ideally by storing the cancel context on the registry entry instead of re-implementing a weaker cancel flow in this subscriber.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/arkor/src/studio/server.ts` around lines 1074 - 1121, The win32
restart path snapshots only pid/jobId/scope and re-resolves credentials later,
reintroducing races (null jobId, null scope, auth/host switches) — instead
attach a spawn-time RPC cancel context (baseUrl, token and any resolved RPC
client info) to each registry entry when the child is spawned (augment whatever
structure activeTrains.list()/activeTrains registry returns), then in the
restart loop (the restartTargets handling after activeTrains.dispatchRebuild)
look up that stored RPC snapshot (instead of calling
resolveCredentialsAndBaseUrl/createRpc again) and call
rpc.v1.jobs[":id"].cancel.$post with the stored snapshot; still keep the
existing late fallback to jobId/state when the spawn-time snapshot is missing,
but prefer the stored spawn-time RPC context to avoid races.
| // Codex P1 (PR #101 round 80): a NONZERO exit suppresses auto- | ||
| // restart. The runner exits 143 when an HMR-triggered early-stop | ||
| // failed to cancel the cloud job (cancel POST rejected); spawning | ||
| // a fresh run on top of that would overlap a new cloud job with | ||
| // the one whose cancel never landed (double GPU spend, no | ||
| // operator signal). Any other nonzero exit is a crash, and | ||
| // restarting a crashing run in a loop would churn cloud jobs | ||
| // with the same broken artefact. Exit 0 (graceful early-stop | ||
| // succeeded) and `null` (no marker parsed: spawn failure path | ||
| // already surfaced its own error) keep the existing restart | ||
| // behaviour. | ||
| if (exitCode !== null && exitCode !== 0) { | ||
| restartPendingRef.current = false; | ||
| currentPidRef.current = null; | ||
| setHmrStatus("idle"); | ||
| if (restartGraceTimerRef.current !== null) { | ||
| clearTimeout(restartGraceTimerRef.current); | ||
| restartGraceTimerRef.current = null; | ||
| } | ||
| setLog((prev) => | ||
| appendCapped( | ||
| prev, | ||
| `\n[error] ${err instanceof Error ? err.message : String(err)}\n`, | ||
| `\n[hmr] run exited with code ${exitCode}; auto-restart suppressed. Fix the issue and press Run to start a new job.\n`, | ||
| ), | ||
| ); | ||
| } finally { | ||
| if (trainingAbortRef.current === ac) trainingAbortRef.current = null; | ||
| // Always release the running flag, including the user-initiated | ||
| // abort path. setState on an already-unmounted component is a | ||
| // no-op in React 18+, so the unmount-cleanup case handles itself. | ||
| setRunning(false); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Treat a missing exit= trailer as restart-unsafe.
streamTraining() returns null when the train stream ends without a parsed exit marker. This branch only suppresses restarts for explicit nonzero codes, so an HMR-triggered early-stop that loses its trailer still falls through to the restart path when restartPendingRef.current is set. That recreates the overlapping-job risk this guard is meant to stop. Only auto-restart after an explicit exitCode === 0; treat null as suppression too when a restart is pending.
Possible fix
- if (exitCode !== null && exitCode !== 0) {
+ const restartUnsafe = restartPendingRef.current
+ ? exitCode !== 0
+ : exitCode !== null && exitCode !== 0;
+ if (restartUnsafe) {
restartPendingRef.current = false;
currentPidRef.current = null;
setHmrStatus("idle");
if (restartGraceTimerRef.current !== null) {
clearTimeout(restartGraceTimerRef.current);
restartGraceTimerRef.current = null;
}
setLog((prev) =>
appendCapped(
prev,
- `\n[hmr] run exited with code ${exitCode}; auto-restart suppressed. Fix the issue and press Run to start a new job.\n`,
+ `\n[hmr] run exited with ${
+ exitCode === null ? "an unknown status" : `code ${exitCode}`
+ }; auto-restart suppressed. Fix the issue and press Run to start a new job.\n`,
),
);
return;
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/studio-app/src/components/RunTraining.tsx` around lines 463 - 489,
The current guard only suppresses auto-restart for explicit nonzero exit codes;
change it so a missing trailer (exitCode === null) also suppresses restart when
a restart was pending. Update the condition around the block that manipulates
restartPendingRef.current, currentPidRef.current, restartGraceTimerRef.current,
setHmrStatus and setLog (the existing if that checks exitCode) to: if ((exitCode
!== null && exitCode !== 0) || (exitCode === null && restartPendingRef.current))
{ ... } so the same suppression/cleanup and log (via appendCapped) runs for null
when a restart was pending.
| while (true) { | ||
| const { value, done } = await reader.read(); | ||
| if (done) break; | ||
| onChunk(decoder.decode(value, { stream: true })); | ||
| const text = decoder.decode(value, { stream: true }); | ||
| trackTail(text); | ||
| onChunk(text); | ||
| } | ||
| // Flush any bytes the streaming decoder buffered for a multi-byte | ||
| // UTF-8 sequence that landed split across the final two chunks. | ||
| // Without this, the last character of the trainer's output gets | ||
| // silently dropped when it happens to be non-ASCII (Japanese log | ||
| // lines, emoji progress bars, etc.). | ||
| const tail = decoder.decode(); | ||
| if (tail) onChunk(tail); | ||
| if (tail) { | ||
| trackTail(tail); | ||
| onChunk(tail); | ||
| } | ||
| } finally { | ||
| signal?.removeEventListener("abort", onAbort); | ||
| // Release the reader lock so a subsequent caller can re-acquire | ||
| // the body if needed. Mirrors `iterateSseFrames`'s finally clause. | ||
| reader.releaseLock(); | ||
| } | ||
| // `null` when no marker was found: aborted streams, `error=` paths, | ||
| // truncated bodies. Callers treat null as "unknown" and apply their | ||
| // own conservative default. | ||
| const m = EXIT_MARKER_PATTERN.exec(tailBuf); | ||
| return m?.[1] === undefined ? null : Number.parseInt(m[1], 10); |
There was a problem hiding this comment.
Don't forward the synthetic exit= trailer into onChunk.
streamTraining parses \n---\nexit=<code>\n only after the read loop, but every decoded chunk has already been passed to onChunk. That means RunTraining will log the protocol footer verbatim on normal exits. Keep a small holdback tail, strip the trailer once the stream closes, and only emit the non-marker bytes.
🧰 Tools
🪛 OpenGrep (1.22.0)
[ERROR] 473-473: Dynamic command passed to child_process.exec/execSync. Use child_process.execFile or spawn with an argument array instead.
(coderabbit.command-injection.exec-js)
This pull request migrates the Arkor build and development workflow from esbuild to Rolldown, introduces a robust hot module replacement (HMR) system for the Studio dev server, and refines the cleanup and shutdown logic for long-lived resources. Documentation has been updated to explain the new HMR behavior in both English and Japanese, and dependency management has been adjusted accordingly. These changes streamline the developer experience, enable live code updates during training runs, and improve resource handling during shutdown.
Build system migration and HMR integration:
arkor build, ensuring bundles target the current Node version and keeping external dependencies unbundled. (packages/arkor/package.json,packages/arkor/src/cli/commands/build.ts,packages/arkor/src/core/rolldownConfig) [1] [2] [3]src/arkor/in dev mode, enabling HMR for trainer code. The watcher notifies the SPA via SSE (/api/dev/events), and the dev server can hot-swap callbacks or gracefully early-stop training runs based on config changes. (packages/arkor/src/cli/commands/dev.ts,packages/arkor/src/studio/hmr.ts,packages/arkor/src/studio/trainRegistry.ts) [1] [2] [3] [4] [5] [6]Cleanup and shutdown improvements:
registerCleanupHookutility to handle resource cleanup on process exit and signals, ensuring proper teardown order (e.g., HMR watcher before token file removal). (packages/arkor/src/cli/cleanupHooks.ts,packages/arkor/src/cli/commands/dev.ts) [1] [2]Documentation updates:
docs/concepts/studio.mdx,docs/ja/concepts/studio.mdx,AGENTS.md) [1] [2] [3]AGENTS.md)Dependency and test adjustments:
packages/arkor/package.json,packages/arkor/src/cli/commands/start.test.ts) [1] [2]Security and internal API notes:
Symbol.forbrands, not exposed on the public SDK surface. (AGENTS.md) [1] [2]Summary by CodeRabbit