Skip to content

Enhance Trainer API with HMR, early-stop, and cleanup improvements#101

Open
k-taro56 wants to merge 77 commits into
mainfrom
eng-615
Open

Enhance Trainer API with HMR, early-stop, and cleanup improvements#101
k-taro56 wants to merge 77 commits into
mainfrom
eng-615

Conversation

@k-taro56

@k-taro56 k-taro56 commented May 2, 2026

Copy link
Copy Markdown
Contributor

This pull request migrates the Arkor build and development workflow from esbuild to Rolldown, introduces a robust hot module replacement (HMR) system for the Studio dev server, and refines the cleanup and shutdown logic for long-lived resources. Documentation has been updated to explain the new HMR behavior in both English and Japanese, and dependency management has been adjusted accordingly. These changes streamline the developer experience, enable live code updates during training runs, and improve resource handling during shutdown.

Build system migration and HMR integration:

  • Migrated the build system from esbuild to Rolldown in arkor build, ensuring bundles target the current Node version and keeping external dependencies unbundled. (packages/arkor/package.json, packages/arkor/src/cli/commands/build.ts, packages/arkor/src/core/rolldownConfig) [1] [2] [3]
  • Added a persistent Rolldown watcher for src/arkor/ in dev mode, enabling HMR for trainer code. The watcher notifies the SPA via SSE (/api/dev/events), and the dev server can hot-swap callbacks or gracefully early-stop training runs based on config changes. (packages/arkor/src/cli/commands/dev.ts, packages/arkor/src/studio/hmr.ts, packages/arkor/src/studio/trainRegistry.ts) [1] [2] [3] [4] [5] [6]

Cleanup and shutdown improvements:

  • Introduced a reusable registerCleanupHook utility to handle resource cleanup on process exit and signals, ensuring proper teardown order (e.g., HMR watcher before token file removal). (packages/arkor/src/cli/cleanupHooks.ts, packages/arkor/src/cli/commands/dev.ts) [1] [2]

Documentation updates:

  • Updated English and Japanese docs to describe the new HMR workflow, explaining how live code changes are handled during training and how the system differentiates between callback-only and config changes. (docs/concepts/studio.mdx, docs/ja/concepts/studio.mdx, AGENTS.md) [1] [2] [3]
  • Clarified the use of Rolldown in the build process and updated package policy documentation accordingly. (AGENTS.md)

Dependency and test adjustments:

  • Removed esbuild from dependencies and added Rolldown. Updated test comments to reference Rolldown instead of esbuild. (packages/arkor/package.json, packages/arkor/src/cli/commands/start.test.ts) [1] [2]

Security and internal API notes:

  • Documented the security-sensitive allow-list for SSE token query parameters and emphasized that HMR/early-stop internals remain behind Symbol.for brands, not exposed on the public SDK surface. (AGENTS.md) [1] [2]

Summary by CodeRabbit

  • New Features
    • HMR for Studio dev: SSE rebuild events, callback hot-swap, and automatic restart/hot-swap of running trainings; build tooling migrated to Rolldown.
  • Bug Fixes / Reliability
    • Safer shutdown/cleanup and deterministic restart/early-stop behavior; improved token/CSRF handling and manifest fast-paths.
  • Documentation
    • Expanded Studio dev-loop and security docs (including Japanese translation updates).
  • Tests
    • Added extensive unit and E2E coverage for HMR, signal handling, trainer inspection, and CLI/server flows.

k-taro56 added 7 commits May 2, 2026 21:27
- Integrated Rolldown for hot module replacement (HMR) in `arkor dev`, allowing real-time updates to the training interface without page refresh.
- Implemented `requestEarlyStop` and `replaceCallbacks` methods in the Trainer API to facilitate graceful stopping of training jobs and dynamic callback updates during execution.
- Updated documentation to reflect new features and usage patterns for improved developer guidance.
- Adjusted cleanup logic for better resource management during development sessions.
…l shutdown

- Introduced `replaceCallbacks` method in the Trainer API to allow dynamic updates of lifecycle callbacks during training runs.
- Enhanced signal handling for graceful early stopping, ensuring in-flight checkpoints are preserved during HMR rebuilds.
- Added `registerCleanupHook` for streamlined resource management on process exit, improving cleanup logic across development commands.
- Updated documentation to reflect new features and usage patterns for better developer guidance.
…internal callback swapping

- Removed the public `replaceCallbacks` method from the Trainer interface to prevent exposure of the hot-swapping functionality.
- Introduced an internal mechanism for callback swapping using a `Symbol.for`-keyed brand, allowing for dynamic updates during training runs without affecting the public API.
- Updated signal handling to ensure seamless integration with the new callback swapping logic, enhancing the hot module replacement (HMR) experience.
- Revised documentation to reflect changes in the Trainer API and clarify the internal callback management process.
…lement internal early-stop handling

- Removed the public `requestEarlyStop` method from the Trainer interface to prevent exposure of the early-stop functionality.
- Introduced an internal mechanism for early stopping using a `Symbol.for`-keyed brand, allowing for graceful stopping after the next checkpoint without affecting the public API.
- Updated signal handling to ensure seamless integration with the new early-stop logic, enhancing the hot module replacement (HMR) experience.
- Revised documentation to reflect changes in the Trainer API and clarify the internal early-stop management process.
…llback replacement

- Updated the Trainer API to remove public exposure of `requestEarlyStop` and `replaceCallbacks` methods, enhancing encapsulation.
- Implemented internal mechanisms for early stopping and callback swapping using `Symbol.for`-keyed brands, ensuring seamless integration during training runs.
- Revised signal handling to improve the hot module replacement (HMR) experience and maintain clean resource management.
- Updated documentation to reflect these changes and clarify the internal management processes for developers.
- Replaced esbuild with Rolldown for building the project, ensuring external dependencies are resolved correctly.
- Implemented a cleanup hook system to manage resource disposal on process termination signals.
- Enhanced the development experience with hot module replacement (HMR) capabilities, allowing for seamless updates during training runs.
- Updated documentation to reflect changes in the development loop and HMR behavior.
- Added tests for new signal handling and configuration hashing functionality.
Copilot AI review requested due to automatic review settings May 2, 2026 17:02
@mintlify

mintlify Bot commented May 2, 2026

Copy link
Copy Markdown

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
arkor-92aeef0e 🟢 Ready View Preview May 2, 2026, 5:03 PM

💡 Tip: Enable Workflows to automatically generate PRs for you.

Comment thread packages/arkor/src/studio/server.ts Fixed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8abc594369

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/arkor/src/studio/server.ts Outdated
Comment on lines +421 to +423
const hotSwapTargets = activeTrains.notifyCallbackReload(nextHash);
const restartTargets =
activeTrains.requestEarlyStopOnMismatch(nextHash);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Dispatch train signals only once per rebuild

notifyCallbackReload/requestEarlyStopOnMismatch run inside each /api/dev/events subscriber callback, so one rebuild sends signals once per connected SSE client (and again on reconnect when createHmrCoordinator.subscribe replays lastEvent). In the common case of two open Studio tabs, the same child can receive two SIGTERMs for one config-changing rebuild; the second signal triggers the runner’s forced-exit path (exit(143)) instead of checkpoint-preserving early stop. Move signal dispatch out of per-subscriber delivery (or dedupe per rebuild hash) so each child is signaled at most once per rebuild event.

Useful? React with 👍 / 👎.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR upgrades the local Studio dev experience by migrating the trainer bundling pipeline from esbuild to Rolldown, adding an HMR/SSE channel to propagate rebuild events to the Studio SPA, and introducing signal-based “hot-swap callbacks vs early-stop + restart” behavior for in-flight training subprocesses.

Changes:

  • Replace esbuild-based trainer bundling with Rolldown (one-shot build + watch/HMR).
  • Add /api/dev/events SSE stream + SPA client wiring to refresh manifest and coordinate restarts/hot-swaps.
  • Introduce internal trainer inspection/callback replacement/early-stop hooks (via Symbol.for brands), plus cleanup utilities and updated docs/tests.

Reviewed changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
pnpm-lock.yaml Removes esbuild and adds rolldown to the lockfile graph.
packages/arkor/package.json Swaps dependency from esbuild to rolldown.
packages/arkor/src/core/rolldownConfig.ts Centralizes Rolldown input options + entry/outDir resolution and Node target derivation.
packages/arkor/src/cli/commands/build.ts Implements arkor build via Rolldown bundling instead of esbuild.
packages/arkor/src/studio/hmr.ts Adds lazy Rolldown watcher coordinator that emits ready/rebuild/error events.
packages/arkor/src/studio/hmr.test.ts Tests HMR coordinator behavior (ready/rebuild/error/replay/dispose).
packages/arkor/src/studio/manifest.ts Adds configHash to manifest summary and splits “summarise built manifest” vs “build + summarise”.
packages/arkor/src/studio/trainRegistry.ts Adds per-child registry + policy for SIGUSR2 hot-swap vs SIGTERM early-stop.
packages/arkor/src/studio/trainRegistry.test.ts Tests TrainRegistry signaling decisions and error tolerance.
packages/arkor/src/studio/server.ts Adds /api/dev/events SSE route and integrates TrainRegistry into rebuild handling.
packages/arkor/src/studio/server.test.ts Adds tests for /api/dev/events token rules, loopback guard, and SSE framing.
packages/arkor/src/core/configHash.ts Adds stable hashing of JobConfig for “callbacks-only vs restart” decisions.
packages/arkor/src/core/configHash.test.ts Tests determinism and sensitivity of hashJobConfig.
packages/arkor/src/core/trainerInspection.ts Introduces internal inspection + callback replacement + early-stop brands via Symbol.for.
packages/arkor/src/core/trainer.ts Implements callback hot-swap support + early-stop latch; attaches internal brands.
packages/arkor/src/core/trainer.test.ts Adds coverage for early-stop and callback hot-swap behavior via internal brands.
packages/arkor/src/core/runnerSignals.ts Adds signal handlers: SIGTERM graceful early-stop + SIGUSR2 callback reload.
packages/arkor/src/core/runnerSignals.test.ts Tests SIGTERM/SIGUSR2 handler behaviors with branded trainers.
packages/arkor/src/core/runner.ts Installs/removes new signal handlers around trainer.start()/trainer.wait().
packages/arkor/src/core/runner.test.ts Adds a test for SIGTERM early-stop behavior in the runner.
packages/arkor/src/cli/cleanupHooks.ts Adds reusable cleanup-hook registration for exit/signals.
packages/arkor/src/cli/commands/dev.ts Wires HMR coordinator into Studio server and refactors shutdown cleanup via cleanup hooks.
packages/arkor/src/cli/commands/start.test.ts Updates test comment to refer to Rolldown instead of esbuild.
packages/studio-app/src/lib/api.ts Adds typed DevEvent + openDevEvents() for SSE notifications.
packages/studio-app/src/components/RunTraining.tsx SPA listens to dev SSE events, refreshes manifest, and coordinates restart/hot-swap UI state.
docs/concepts/studio.mdx Documents HMR workflow, including hash-based hot-swap vs restart behavior (EN).
docs/ja/concepts/studio.mdx Documents HMR workflow (JA).
AGENTS.md Updates repository policy/docs for new SSE allow-list, HMR internals, and Rolldown build pipeline.
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

const [log, setLog] = useState("");
const [manifest, setManifest] = useState<ManifestResult | null>(null);
const [hmrStatus, setHmrStatus] = useState<
"idle" | "rebuilding" | "early-stopping" | "restarting" | "hot-swapped"
Comment thread packages/arkor/src/studio/server.ts Outdated
import { ensureProjectState } from "../core/projectState";
import { readState } from "../core/state";
import { readManifestSummary } from "./manifest";
import { readManifestSummary, summariseBuiltManifest } from "./manifest";
Comment on lines +190 to 196
function scheduleHmrCleanup(hmr: { dispose: () => Promise<void> }): void {
// Registered before the studio-token cleanup so it runs first on
// shutdown — Node fires signal handlers in registration order, and we
// want the watcher to release file handles before the outermost
// process.exit.
registerCleanupHook({ cleanup: () => hmr.dispose() });
}
Comment on lines +97 to +113
requestEarlyStopOnMismatch(
nextConfigHash: string | null,
): RestartTarget[] {
const targets: RestartTarget[] = [];
for (const [pid, entry] of this.entries) {
if (
nextConfigHash === null ||
entry.configHash === null ||
entry.configHash !== nextConfigHash
) {
try {
entry.child.kill("SIGTERM");
} catch {
// child already exited; close handler will clean up.
}
targets.push({ pid, trainFile: entry.trainFile });
}
Comment thread packages/arkor/src/core/trainer.ts Outdated
Comment on lines +275 to +284
await callbacks.onCheckpoint?.(ctx);
// Early-stop latch: a checkpoint just landed, so the in-flight work
// is durable. Cancel the cloud job and end `wait()` cleanly.
if (earlyStopRequested && earlyStopDeferred) {
await trainer.cancel();
if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer);
earlyStopDeferred.resolve();
earlyStopDeferred = null;
return { terminal: true, artifacts: terminalResult?.artifacts ?? [] };
}
Comment on lines +129 to +137
function startWatcher(): void {
if (watcher || disposed) return;
if (!existsSync(resolved.entry)) {
broadcast({
type: "error",
message: `Build entry not found: ${resolved.entry}. Create ${BUILD_DEFAULTS.entry} or pass an explicit entry argument.`,
});
return;
}
@k-taro56 k-taro56 self-assigned this May 3, 2026
…ous BUNDLE_END on macOS

- Modified the test to assert that the late subscriber receives the same event as the prior subscriber, addressing issues with spurious BUNDLE_END events in rolldown@1.0.0-rc.17 on macOS.
- Added comments to clarify the reasoning behind the change and noted a future task to revisit the assertion after rolldown stabilizes.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 31f510869c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +71 to +72
restartPendingRef.current = true;
setHmrStatus(runningRef.current ? "early-stopping" : "idle");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Ignore restart hints when no local run is active

restartPendingRef is set for every payload.restart event even when this tab is not running a training stream. Since /api/dev/events broadcasts restart events to all connected tabs, a passive tab can cache restartPendingRef=true from another tab’s run; the next time the user runs training in this tab, finally will auto-trigger an extra run, creating an unintended duplicate job and extra cloud/GPU spend. Only mark restart pending when runningRef.current is true (or otherwise correlate the restart target to the local run).

Useful? React with 👍 / 👎.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 27 out of 28 changed files in this pull request and generated 4 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +109 to +112
} catch {
// child already exited; close handler will clean up.
}
targets.push({ pid, trainFile: entry.trainFile });
Comment thread packages/arkor/src/studio/server.ts Outdated
import { ensureProjectState } from "../core/projectState";
import { readState } from "../core/state";
import { readManifestSummary } from "./manifest";
import { readManifestSummary, summariseBuiltManifest } from "./manifest";
const [log, setLog] = useState("");
const [manifest, setManifest] = useState<ManifestResult | null>(null);
const [hmrStatus, setHmrStatus] = useState<
"idle" | "rebuilding" | "early-stopping" | "restarting" | "hot-swapped"
Comment on lines +276 to 285
// Early-stop latch: a checkpoint just landed, so the in-flight work
// is durable. Cancel the cloud job and end `wait()` cleanly.
if (earlyStopRequested && earlyStopDeferred) {
await trainer.cancel();
if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer);
earlyStopDeferred.resolve();
earlyStopDeferred = null;
return { terminal: true, artifacts: terminalResult?.artifacts ?? [] };
}
return { terminal: false, artifacts: terminalResult?.artifacts ?? [] };
…ce failures

- Added a test to verify that the SIGINT exit handler is armed even if the `persistStudioToken` function fails, preventing the dev server from idling indefinitely.
- Updated the `runDev` function to register the studio-token cleanup hook unconditionally, ensuring it executes on process termination signals regardless of token persistence success.
- Enhanced cleanup logic to maintain proper signal handling during development sessions, improving overall stability and user experience.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 55b4a2e8f6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +66 to +73
if (payload.restart) {
// `/api/dev/events` is a broadcast — every open Studio tab gets
// this event. Only flip the auto-restart latch when *this* tab
// is actually running a stream right now; otherwise a passive
// tab would silently auto-spawn an extra job the next time the
// user clicks Run training, doubling cloud spend.
if (runningRef.current) {
// Training run is early-stopping; the active stream will

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Scope restart latch to runs actually signaled for restart

The handler sets restartPendingRef for any payload.restart while this tab is running, but rebuild events are broadcast across all tabs and can contain mixed outcomes (restartTargets and hotSwapTargets) for different subprocesses. In a multi-run setup, a tab whose run was only hot-swapped can still latch restartPendingRef=true, then auto-spawn an extra training job when its current run eventually finishes, increasing unintended cloud/GPU usage.

Useful? React with 👍 / 👎.

Comment on lines +135 to +136
entry.child.kill("SIGTERM");
entry.earlyStopRequested = true;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check signal delivery before marking restart targets

ChildProcess.kill() returns a boolean and does not throw when a process is already gone, but this path treats any non-throw as success. If kill("SIGTERM") returns false during an exit race, the entry is still marked earlyStopRequested and added to targets, so the UI may wait for/restart a run that was never actually signaled.

Useful? React with 👍 / 👎.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 28 out of 29 changed files in this pull request and generated 4 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +44 to +46
useEffect(() => {
const es = openDevEvents();
const onMessage = (raw: MessageEvent) => {
Comment on lines +86 to +90
setHmrStatus("hot-swapped");
window.setTimeout(() => {
setHmrStatus((s) => (s === "hot-swapped" ? "idle" : s));
}, 1500);
} else {
Comment thread packages/arkor/src/studio/server.ts Outdated
},
cancel() {
activeTrains.unregister(child.pid);
child.kill();
Comment thread packages/arkor/src/cli/cleanupHooks.ts Outdated
Comment on lines +34 to +59
// Synchronous wrapper so signal handlers preserve "cleanup landed
// before this function returns" — important for sync cleanups (e.g.
// `unlinkSync`) and for tests that assert the side effect right after
// invoking the handler. Async cleanups are fire-and-forget with a
// catch so a hung dispose doesn't block exit.
const run = (): void => {
if (done) return;
done = true;
try {
const result = options.cleanup();
if (result && typeof (result as Promise<void>).catch === "function") {
(result as Promise<void>).catch(() => {
// best-effort: shutdown is racing other cleanup paths
});
}
} catch {
// best-effort
}
};

process.on("exit", run);

for (const sig of TERMINATING_SIGNALS) {
process.on(sig, () => {
run();
if (options.exitOnSignal) process.exit(0);

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@github-code-quality

github-code-quality Bot commented May 28, 2026

Copy link
Copy Markdown

Code Coverage Overview

Languages: TypeScript

TypeScript / code-coverage/arkor

The overall coverage in the branch is 96%. The coverage in the branch is 99%.

Show a code coverage summary of the most impacted files.
File b061d17 c8c110d +/-
src/studio/server.ts 97% 92% -5%
src/cli/commands/dev.ts 100% 95% -5%
src/studio/hmr.ts 0% 84% +84%
src/studio/trainRegistry.ts 0% 94% +94%
src/core/runnerSignals.ts 0% 96% +96%
src/cli/cleanupHooks.ts 0% 98% +98%
src/core/trainerInspection.ts 0% 98% +98%
src/core/configHash.ts 0% 100% +100%
src/core/rolldownConfig.ts 0% 100% +100%
src/core/moduleCacheBust.ts 0% 100% +100%

TypeScript / code-coverage/create-arkor

The overall coverage remains at 56%, unchanged from the branch.

TypeScript / code-coverage/cli-internal

The overall coverage remains at 97%, unchanged from the branch.

TypeScript / code-coverage/studio-app

The overall coverage in the branch is 57%. The coverage in the branch is 53%.

Show a code coverage summary of the most impacted files.
File b061d17 c8c110d +/-
src/components/RunTraining.tsx 95% 87% -8%
src/lib/api.ts 99% 97% -2%

Updated June 12, 2026 10:26 UTC
Code Coverage is in Public Preview. Learn more and provide us with your feedback.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa6a826f27

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +496 to +498
const configHash: string | null = options.hmr
? options.hmr.getCurrentConfigHash()
: null;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Route explicit train files with their own HMR hash

When /api/train is called with body.file, the spawned arkor start <file> rebuilds that explicit entry into .arkor/build/index.mjs, but this snapshot still records the HMR coordinator's current hash for the default src/arkor/index.ts. In HMR-enabled Studio, a run launched from an alternate file can therefore be classified for SIGUSR2/restart based on a different artifact, and the default watcher may overwrite the same index.mjs that the child reloads from; e.g. editing only the default entry can hot-swap callbacks into an alternate-entry run from the wrong bundle. Treat explicit files as unhashable/restart-only or build/watch a separate artifact for that entry.

Useful? React with 👍 / 👎.

Comment on lines +189 to +193
try {
await trainer.cancel();
} catch {
// intentionally ignored; see comment above.
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate fallback cancel failures

For hand-rolled trainers that the runner accepts but that lack the internal early-stop brand, a SIGTERM/HMR stop falls back to trainer.cancel() here. If that cancel rejects (for example, a transient cloud/network failure), this catch resolves requestTrainerEarlyStop() successfully, so installShutdownHandlers exits with code 0 even though the remote job may still be running; this reintroduces the orphaned-job false-success path for the unbranded trainer shape.

Useful? React with 👍 / 👎.

Comment on lines +382 to +384
const manifest = await readManifestSummary(trainCwd, {
prebuiltOutFile: hmrOutFile,
});

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid serving stale artifacts before HMR is ready

On HMR-enabled startup, this always passes the prebuilt path to readManifestSummary, which imports it whenever .arkor/build/index.mjs already exists. If that file is left over from a previous build and the watcher has not emitted its first ready yet, the first /api/manifest poll can show an old trainer/config; because arkor start also skips rebuilding when the artifact exists, a quick Run click can launch the stale artifact until the watcher catches up. Gate this fast path on the coordinator having a successful current build, not just file existence.

Useful? React with 👍 / 👎.

Comment on lines +376 to +379
if (process.platform !== "win32") {
const r = safeKill(entry.child, "SIGUSR2");
if (r === "ok") {
hotSwapTargets.push(target);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Delay hot-swap signals until the child is ready

A same-hash rebuild can hit this branch immediately after /api/train registers a newly spawned child, before that child has imported runner.ts and installed its SIGUSR2 callback-reload handler. In that startup window on Unix, child.kill("SIGUSR2") can terminate the Node process with the default signal action while the server reports it as a hot-swap target, so the SPA does not schedule a restart and the user sees the run die instead of continue. Add a child-ready marker before allowing SIGUSR2, or route pre-ready children through restart.

Useful? React with 👍 / 👎.

Comment thread packages/arkor/package.json Outdated
Comment on lines +66 to +70
"@types/node": "^24",
"@vitest/coverage-v8": "^4.1.5",
"tsdown": "^0.22.0",
"typescript": "^5",
"vitest": "^4.1.5"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep published devDependencies catalog-bound

The repository's AGENTS.md dependency-catalog rule says the published packages' runtime dependencies are the only literal-semver carve-out and that their devDependencies stay catalog:-bound. Replacing these with literals here (and the same pattern in packages/create-arkor/package.json) breaks the single-source-of-truth catalog policy, so future bumps of TypeScript/Vitest/tsdown can drift between the published packages and the rest of the workspace instead of being one-line catalog updates.

Useful? React with 👍 / 👎.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

- Introduced a NO_OP_DISPOSE function for consistent signal registration failure handling.
- Enhanced shutdown signal handling to ensure accurate exit codes based on received signals.
- Updated requestTrainerEarlyStop to handle errors more gracefully and ensure proper exit codes.
- Improved callback reload handler to avoid crashing on unsupported platforms.
- Refactored trainer test cases to utilize Response.json for consistency and clarity.
- Updated various promises to use void for better readability and to avoid unhandled promise rejections.
- Enhanced type safety and clarity in several areas, including the use of readonly arrays and improved error handling.
- Cleaned up unnecessary comments and improved code readability across multiple files.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@coderabbitai

coderabbitai Bot commented May 31, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

Adds Rolldown-based Studio HMR: build/watch plumbing, deterministic config hashing, trainer inspection and Symbol.for-branded callbacks/early-stop, signal-driven hot-swap vs graceful restart, server SSE/dev routes and train registry, CLI cleanup integration, frontend auto-restart, and comprehensive tests/docs.

Changes

Studio HMR End-to-End Flow

Layer / File(s) Summary
Build migration and shared runtime contracts
packages/arkor/package.json, pnpm-workspace.yaml, packages/arkor/src/core/rolldownConfig.ts, packages/arkor/src/core/moduleCacheBust.ts, packages/arkor/src/core/configHash.ts, packages/arkor/src/core/signalExit.ts, packages/arkor/src/cli/commands/build.ts
Replaces esbuild with rolldown (^1.0.0); introduces shared build config helpers (resolveBuildEntry, rolldownInputOptions, resolveNodeTarget), ESM cache-busting via file stat (mtimeMs-ctimeMs-size), deterministic JSON-based config hashing (SHA-256 truncated), and POSIX signal-exit code mapping (128 + signo). Config hash and cache-bust key logic tested for JSON parity, cycles, and sparse arrays.
Trainer lifecycle: cleanup hooks, signals, inspection, and callback replacement
packages/arkor/src/cli/cleanupHooks.ts, packages/arkor/src/cli/cleanupHooks.test.ts, packages/arkor/src/core/runnerSignals.ts, packages/arkor/src/core/runnerSignals.test.ts, packages/arkor/src/core/trainerInspection.ts, packages/arkor/src/core/trainerInspection.test.ts, packages/arkor/src/core/trainer.ts, packages/arkor/src/core/trainer.test.ts, packages/arkor/src/core/runner.ts, packages/arkor/src/core/runner.test.ts, eslint.config.ts
Adds module-scoped cleanup-hook coordinator for signals with idempotent callbacks and async settling before exit; two-stage shutdown semantics (early-stop then POSIX-exit); Symbol.for-based trainer inspection and callback-replacer/early-stopper attachments; trainer internals updated to support callback rotation, early-stop latch, and start-in-flight coordination. Extensive tests added.
HMR coordinator and train registry: rebuild dispatch policy
packages/arkor/src/studio/hmr.ts, packages/arkor/src/studio/hmr.test.ts, packages/arkor/src/studio/trainRegistry.ts, packages/arkor/src/studio/trainRegistry.test.ts, packages/arkor/src/studio/manifest.ts
Implements HMR event model (ready/rebuild/error) with artifact fingerprints, content hash, and stable configHash caching. Coordinator watches rolldown builds, inspects bundles via cache-busted import, and uses a monotonic buildSeq to drop stale inspections. TrainRegistry tracks active children and decides per-child hot-swap (SIGUSR2) vs graceful restart (SIGTERM), including pre-ready backfill and platform-aware fallbacks. Manifest supports HMR fast-path and rejects when coordinator is in error state.
Studio server HMR routes, train orchestration, and comprehensive server tests
packages/arkor/src/studio/server.ts, packages/arkor/src/studio/server.test.ts
Adds optional hmr to server options; injects studio token and HMR-enabled meta into HTML; tightens CSRF to allow ?studioToken= only on specific GET SSE paths; rewrites /api/train to register children with TrainRegistry, pin spawn-time hashes and scope, pass per-spawn nonce, parse nonce-anchored Started job stdout, stream output robustly, implement manual cancel (cloud cancel + SIGKILL), and expose x-arkor-train-pid. Adds /api/dev/events SSE that augments rebuild events with restart/hot-swap targets and replays latest event. Extensive server tests added.
CLI dev: HMR coordinator wiring, token cleanup, and test updates
packages/arkor/src/cli/commands/dev.ts, packages/arkor/src/cli/commands/dev.test.ts, packages/arkor/src/cli/commands/start.test.ts
Integrates lazy createHmrCoordinator into runDev, registers disposal via registerCleanupHook, reworks token persistence cleanup to use timing-safe comparison and on-disk re-read gating, and registers cleanup with exitOnSignal: true. Adds deterministic cleanup tests and concurrency assertions.
Frontend HMR support: detection, dev-event streaming, spawn PID, and auto-restart
packages/studio-app/src/lib/api.ts, packages/studio-app/src/lib/api.test.ts, packages/studio-app/src/components/RunTraining.tsx, packages/studio-app/vite.config.ts
Adds isHmrEnabled(), openDevEvents() (EventSource to /api/dev/events), and DevEvent payload typing. streamTraining now parses x-arkor-train-pid and returns parsed exit codes; includes capped error-body reading for non-2xx responses. RunTraining implements HMR state/ref bookkeeping: pre-spawn event buffering, per-PID restart latch, grace-window restart (250ms), manifest-seq fencing, and UI status for early-stopping/restarting/hot-swapped. Vite dev plugin injects HMR meta tag.
Documentation, templates, and end-to-end validation
AGENTS.md, docs/concepts/studio.mdx, docs/ja/concepts/studio.mdx, e2e/cli/vitest.config.ts, e2e/studio/src/specs/hmr.spec.ts
Updates docs to describe Rolldown watcher, per-launch studio token propagation, SSE vs header token semantics, CSRF event-stream allow-list, HMR rebuild routing (callback hot-swap vs graceful early-stop), and Rolldown bundling. Adds Playwright E2E HMR spec and updates Vitest hook-timeout comments.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • soleil-colza
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch eng-615
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch eng-615

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d4030b9c43

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// Clearing here makes the user's broken-state edit the
// source of truth: no auto-restart fires until the next
// successful rebuild re-arms the latch.
restartPendingRef.current = false;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve restart after a fixed HMR error

When an HMR error arrives after a prior rebuild already put this child into the registry's earlyStopRequested state, clearing restartPendingRef here loses the only restart signal for that child. The server's rebuild dispatch skips entries with earlyStopRequested, so if the user fixes the compile error before the early-stop child exits, the successful rebuild will not include this pid in restartTargets and the SPA cannot re-arm the latch; the child then exits and stays idle instead of restarting with the fixed bundle.

Useful? React with 👍 / 👎.

Comment on lines +221 to +222
void fetchManifest()
.then(setManifest)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Ignore stale manifest fetches after HMR errors

Fresh evidence beyond the server-side /api/manifest guard: a ready/rebuild event can start this async fetchManifest(), then a newer HMR error event can set the manifest to { error } before this request resolves. Because the promise is not versioned or cancelled, its later setManifest can overwrite the error with the last-good manifest, re-enabling the Run button for stale code until a later poll restores the error state.

Useful? React with 👍 / 👎.

Comment thread packages/arkor/src/studio/hmr.ts Outdated
Comment on lines +218 to +231
async function inspectBundle(outFile: string): Promise<InspectionResult> {
try {
const mod = (await import(moduleCacheBustUrl(outFile))) as Record<
string,
unknown
>;
const inspection = findInspectableTrainer(mod);
if (!inspection) return null;
return {
configHash: hashJobConfig(inspection.config),
trainerName: inspection.name,
};
} catch {
return null;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate inspection failures to HMR error state

If a rebuilt module throws during evaluation (for example top-level user code throws), Rolldown still emits BUNDLE_END, but this catch converts the failed import into configHash: null and a successful ready/rebuild event. That means /api/manifest will not stay in the HMR error state and the registry may SIGTERM-restart active runs against an artifact that arkor start cannot import, leaving the SPA to churn into a failed restart instead of surfacing the build/import failure as an error.

Useful? React with 👍 / 👎.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
packages/arkor/src/core/trainerInspection.test.ts (1)

91-121: ⚡ Quick win

Add the missing default: trainer inspection case.

This block calls out shape #4 in the regression note, but it never asserts findInspectableTrainer({ default: trainer }). A direct export default createTrainer(...) path can regress without failing this suite.

Proposed test addition
   const trainerC = brandedTrainer("default-arkor");
   const inspectionC = findInspectableTrainer({
     default: createArkor({ trainer: trainerC }),
   });
   expect(inspectionC?.name).toBe("default-arkor");

-  const trainerD = brandedTrainer("default-nested");
+  const trainerD = brandedTrainer("default-direct");
+  const inspectionD = findInspectableTrainer({ default: trainerD });
+  expect(inspectionD?.name).toBe("default-direct");
+
+  const trainerE = brandedTrainer("default-nested");
+  const inspectionE = findInspectableTrainer({
+    default: { trainer: trainerE },
+  });
+  expect(inspectionE?.name).toBe("default-nested");
-  const inspectionD = findInspectableTrainer({
-    default: { trainer: trainerD },
-  });
-  expect(inspectionD?.name).toBe("default-nested");
As per coding guidelines, "Add vitest test cases under `packages/*/src/**/*.test.ts` for SDK/CLI/scaffold logic changes."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/arkor/src/core/trainerInspection.test.ts` around lines 91 - 121, Add
a new assertion to cover the missing "default: trainer" shape by calling
findInspectableTrainer with an object whose default property is the bare
trainer; specifically create a trainer via brandedTrainer (e.g.,
brandedTrainer("default-bare")) and assert that findInspectableTrainer({
default: trainer }) returns the expected inspection (check .name), similar to
the other cases in the test that use findInspectableTrainer, brandedTrainer, and
createArkor.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/arkor/src/core/configHash.ts`:
- Around line 64-85: stableStringifyRec currently calls a value's toJSON before
adding the object to the seen set and uses Array.prototype.map which skips
sparse array holes; change it so that when maybeToJSON is a function you add the
current value to seen before invoking (so circular returns from toJSON are
detected like JSON.stringify), then call (maybeToJSON as (key: string) =>
unknown).call(value, key) and recurse on that result; for arrays, stop using
value.map and instead iterate numeric indices 0..value.length-1, treat missing
slots (i in value === false) and slots that serialize to undefined as "null"
(i.e. use stableStringifyRec per index and fall back to "null") to match
JSON.stringify behavior and ensure circulars are detected via the seen set.

In `@packages/arkor/src/core/trainer.test.ts`:
- Around line 1817-1849: The test's fetcher mock expects events for job id
"j-falsy" but the create-job stub returns minimalJobRow.id (currently "j-stop"),
causing wait() to hit the wrong stream URL; update the create-job response to
return a job with id "j-falsy" (i.e. ensure minimalJobRow.id or the object
returned in the POST branch matches "j-falsy") so the GET to
/v1/jobs/j-falsy/events/stream in the fetcher mock is exercised by the
Trainer.wait() flow.

---

Nitpick comments:
In `@packages/arkor/src/core/trainerInspection.test.ts`:
- Around line 91-121: Add a new assertion to cover the missing "default:
trainer" shape by calling findInspectableTrainer with an object whose default
property is the bare trainer; specifically create a trainer via brandedTrainer
(e.g., brandedTrainer("default-bare")) and assert that findInspectableTrainer({
default: trainer }) returns the expected inspection (check .name), similar to
the other cases in the test that use findInspectableTrainer, brandedTrainer, and
createArkor.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 429d904f-75a0-497d-bfbb-9ea6225bd3e4

📥 Commits

Reviewing files that changed from the base of the PR and between 0b0eee1 and d4030b9.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (43)
  • AGENTS.md
  • docs/concepts/studio.mdx
  • docs/ja/concepts/studio.mdx
  • docs/ja/studio/jobs.mdx
  • e2e/cli/vitest.config.ts
  • e2e/studio/src/specs/hmr.spec.ts
  • eslint.config.ts
  • packages/arkor/package.json
  • packages/arkor/src/cli/cleanupHooks.test.ts
  • packages/arkor/src/cli/cleanupHooks.ts
  • packages/arkor/src/cli/commands/build.ts
  • packages/arkor/src/cli/commands/dev.test.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/arkor/src/cli/commands/start.test.ts
  • packages/arkor/src/core/configHash.test.ts
  • packages/arkor/src/core/configHash.ts
  • packages/arkor/src/core/moduleCacheBust.test.ts
  • packages/arkor/src/core/moduleCacheBust.ts
  • packages/arkor/src/core/projectState.test.ts
  • packages/arkor/src/core/rolldownConfig.ts
  • packages/arkor/src/core/runner.test.ts
  • packages/arkor/src/core/runner.ts
  • packages/arkor/src/core/runnerSignals.test.ts
  • packages/arkor/src/core/runnerSignals.ts
  • packages/arkor/src/core/schemas.test.ts
  • packages/arkor/src/core/signalExit.ts
  • packages/arkor/src/core/trainer.test.ts
  • packages/arkor/src/core/trainer.ts
  • packages/arkor/src/core/trainerInspection.test.ts
  • packages/arkor/src/core/trainerInspection.ts
  • packages/arkor/src/studio/hmr.test.ts
  • packages/arkor/src/studio/hmr.ts
  • packages/arkor/src/studio/manifest.ts
  • packages/arkor/src/studio/server.test.ts
  • packages/arkor/src/studio/server.ts
  • packages/arkor/src/studio/trainRegistry.test.ts
  • packages/arkor/src/studio/trainRegistry.ts
  • packages/cli-internal/src/templates.ts
  • packages/studio-app/src/components/RunTraining.tsx
  • packages/studio-app/src/lib/api.test.ts
  • packages/studio-app/src/lib/api.ts
  • packages/studio-app/vite.config.ts
  • pnpm-workspace.yaml

Comment thread packages/arkor/src/core/configHash.ts Outdated
Comment thread packages/arkor/src/core/trainer.test.ts

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

…andling

fix: improve inspectBundle error handling and manifest fetching logic in HMR

test: add assertions for default-exported trainer shapes in trainerInspection tests

feat: implement monotonic counter for HMR SSE handler to prevent stale state

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e726548bca

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +96 to +98
// restored the error. The polling effect's own fetches don't need
// this because they observe `cancelled` synchronously inside the
// `tick()` closure; this counter only protects the cross-handler

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Version manifest polling against HMR errors

When HMR is enabled, a regular 5s manifest poll that started before a later error SSE can still resolve after the error handler has set { error }, because this code explicitly leaves polling fetches outside the manifestFetchSeqRef guard. In projects whose manifest import/build is slow (for example top-level await or filesystem contention), that stale poll can overwrite the current build-error state and re-enable the Run button for the last-good artifact until the next poll restores the error.

Useful? React with 👍 / 👎.

Comment on lines +452 to +456
if (restartPendingRef.current) {
// Fast path: SSE event already landed before exit. Fire the
// restart synchronously without waiting for the grace
// window so the common case has no perceptible delay.
restartPendingRef.current = false;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Suppress auto-restart after failed early-stop exits

When an HMR-triggered early-stop fails to cancel the cloud job, the runner now exits nonzero (exit=143), but streamTraining() still resolves normally for any /api/train stream that closed after a 200 response. If restartPendingRef is set, this branch auto-spawns a new run without checking that exit marker, so the fixed nonzero exit path is still treated as restartable and can overlap a new cloud job with the one whose cancel failed.

Useful? React with 👍 / 👎.

Comment thread packages/arkor/src/studio/server.ts Outdated
Comment on lines +911 to +912
const { baseUrl: rpcBaseUrl, token: rpcToken } =
await resolveCredentialsAndBaseUrl();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Capture cancel credentials at spawn time

This server-side Stop path re-reads credentials/base URL at cancel time, while the child created the cloud job using whatever credentials it loaded at trainer.start(). If the user logs in/out or otherwise changes ~/.arkor credentials while a run is in flight, Stop can send the cancel POST to the wrong account/control plane, swallow the failure, and then SIGKILL the child that still held the original credentials, leaving the original cloud job running.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e41acc7f28

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

}

// Hash mismatch (or one side is null): graceful restart.
const r = safeKill(entry.child, "SIGTERM");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid SIGTERM restarts on Windows

When Studio HMR runs on Windows and a rebuild requires a restart, this SIGTERM does not reach the runner's installShutdownHandlers; Node's subprocess.kill() docs state that on Windows these signals cause the process to be killed "forcefully and abruptly" (https://nodejs.org/api/child_process.html#subprocesskillsignal). In that environment the child cannot early-stop or send cancel(), so a config-changing rebuild can orphan the cloud job and lose the checkpoint-preservation behavior this path is meant to provide.

Useful? React with 👍 / 👎.


async function emitBuildSucceeded(): Promise<void> {
if (disposed) return;
const seq = ++buildSeq;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Invalidate stale inspections when rebuild starts

This sequence guard only advances after a later build reaches BUNDLE_END (or ERROR), so if the previous bundle's inspectBundle() is slow (for example user top-level await during dynamic import) and a new edit has already started compiling but has not ended yet, the older inspection still passes seq === buildSeq and broadcasts/signals active children for stale source. Rolldown's watch cycle includes BUNDLE_START before BUNDLE_END, so the in-flight inspection should be invalidated when the newer rebuild starts, not only when it finishes.

Useful? React with 👍 / 👎.

Comment thread packages/arkor/src/studio/hmr.ts Outdated
}
watcher = watch({
...rolldownInputOptions(resolved),
output: { file: resolved.outFile, format: "esm" },

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid importing while the watcher writes

With HMR enabled, the watcher writes .arkor/build/index.mjs at this same path while /api/train can concurrently spawn arkor start, whose runStart skips rebuilding whenever that artifact already exists. If the user clicks Run during the watcher's non-atomic write window, the child can import a partially-written bundle and fail the run instead of starting from a coherent artifact. Gate training while a rebuild is in progress or publish the watched bundle via temp-file-and-rename before making it visible to arkor start.

Useful? React with 👍 / 👎.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
packages/arkor/src/studio/server.ts (1)

1620-1624: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Constrain static asset reads to assetsDir.

readAsset() joins the request path directly onto assetsDir, and app.get("*") feeds it c.req.path verbatim. Any traversal sequence that survives routing normalization can therefore escape the bundled assets tree and read arbitrary local files. Because these GETs are only host-guarded, not Studio-token gated, this is a local file disclosure on the loopback server.

🔒 Suggested hardening
 async function readAsset(relPath: string): Promise<Response | null> {
   const cleaned = relPath.replace(/^\/+/, "");
   try {
-      const file = await readFile(join(assetsDir, cleaned));
+      const filePath = resolve(assetsDir, cleaned);
+      if (
+        filePath !== assetsDir &&
+        !filePath.startsWith(`${assetsDir}${sep}`)
+      ) {
+        return null;
+      }
+      const file = await readFile(filePath);
       const ext = cleaned.slice(cleaned.lastIndexOf(".") + 1);
       if (ext === "html") {

Also applies to: 1647-1655

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/arkor/src/studio/server.ts` around lines 1620 - 1624, readAsset
currently concatenates the request path onto assetsDir and can be abused for
path traversal; to fix, decode and normalize the incoming relPath, then compute
the absolute target with path.resolve(assetsDir, cleaned) (use the same resolved
base: const base = path.resolve(assetsDir)), and verify the resulting target is
inside base (e.g. use path.relative(base, target) and ensure it does not start
with ".." and does not contain path.sep at the front); if the check fails return
null (or an appropriate error) and do not call readFile. Also reject suspicious
inputs (null bytes, absolute paths) before resolving. Apply this same guard to
the other occurrence around lines 1647-1655.
packages/arkor/src/cli/commands/dev.test.ts (1)

663-682: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These persist-failure tests do not actually fail on Windows.

Line 671 and Line 738 rely on chmodSync(..., 0o555) to block writes to ~/.arkor, but NTFS does not enforce directory writes that way. On Windows both tests can go green without ever hitting the persistStudioToken catch path, so this regression coverage disappears on part of the CI matrix. Use a platform-neutral failure trigger here, such as making studioTokenPath() a directory or mocking the write boundary.

Also applies to: 726-747

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/arkor/src/cli/commands/dev.test.ts` around lines 663 - 682, The
tests relying on chmodSync to make ~/.arkor read-only are flaky on Windows;
replace the platform-specific permission trick with a platform-neutral
write-failure trigger: locate the token path via studioTokenPath(...) (or the
exported persistStudioToken/writeCredentials function) and either (a) create a
directory at that exact path before calling runDev({ port: 4203 }) so the token
write will fail with ENOTDIR, or (b) mock/spyon the module method that writes
the token (persistStudioToken or writeCredentials) to throw an EACCES-like error
for this test; ensure you restore the original behavior in the finally block and
keep assertions (expect(runDev...).resolves.toBeUndefined and
expect(serve).toHaveBeenCalledTimes(1)) unchanged.
packages/studio-app/src/components/RunTraining.tsx (2)

141-167: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The poller can still overwrite an HMR error with stale manifest data.

Line 151 is only gated by cancelled, which flips on unmount, not on newer HMR events. If a /api/manifest poll started before an SSE error frame resolves afterward, this effect will restore the last-good manifest and re-enable Run against broken code until the next refresh. Reuse the same seq/generation gate for poll fetches, or invalidate in-flight polls when HMR events land.

Possible fix
   async function tick() {
+    const seq = ++manifestFetchSeqRef.current;
     try {
       const m = await fetchManifest();
-      if (!cancelled) setManifest(m);
+      if (!cancelled && seq === manifestFetchSeqRef.current) {
+        setManifest(m);
+      }
     } catch (err: unknown) {
-      if (!cancelled) {
+      if (!cancelled && seq === manifestFetchSeqRef.current) {
         setManifest({
           error: err instanceof Error ? err.message : String(err),
         });
       }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/studio-app/src/components/RunTraining.tsx` around lines 141 - 167,
The poller can overwrite a newer HMR error because it only checks cancelled;
modify the effect to use a generation/sequence gate: create a ref (e.g.,
manifestSeqRef) that you increment whenever an SSE/HMR error frame or any HMR
event invalidates the manifest, capture the current seq at the start of tick
(before calling fetchManifest), then before calling setManifest or scheduling
the next timer ensure the captured seq still equals manifestSeqRef.current; if
it differs, discard the poll result (and do not schedule another timeout). Keep
using the existing symbols (useEffect, cancelled, timer, tick, fetchManifest,
setManifest) but add and check the manifestSeqRef around the async fetch and
before state updates so in-flight polls are invalidated when HMR events arrive.

334-510: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Add a jsdom test suite for this restart state machine.

This component now owns the risky HMR logic - buffered pre-spawn events, late-SSE grace windows, and "Stop beats auto-restart" arbitration - but the provided coverage only hits API helpers and a caption-refresh E2E. A small Testing Library suite here would catch regressions before they become flaky cross-process Playwright failures. As per coding guidelines: Include jsdom-based Testing Library cases for Studio components. Based on learnings: Do not split implementation across multiple PRs; land docs, tests, and code in the same PR unless the user explicitly says otherwise.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/studio-app/src/components/RunTraining.tsx` around lines 334 - 510,
Add a jsdom Testing Library suite for the RunTraining component exercising the
run() state machine: write tests to simulate streamTraining callbacks and
SSE-like pre-spawn events to validate that pendingPreSpawnEventsRef is drained,
restartPendingRef wins over hot-swap, user Abort
(trainingAbortRef)/ac.signal.aborted prevents logging and cancels restarts, and
the 250ms restart grace window behavior (currentPidRef, restartGraceTimerRef)
toggles hmrStatus correctly; stub/mock streamTraining and isHmrEnabled, drive
onSpawn via the pid callback, and assert UI/state transitions (setHmrStatus
outcomes "idle"/"early-stopping"/"restarting"/"hot-swapped") and that run(file)
re-invokes when appropriate. Ensure tests run under jsdom (Testing Library +
jest) and are added alongside the component.

Sources: Coding guidelines, Learnings

packages/arkor/src/cli/commands/dev.ts (1)

199-230: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Read-then-unlink still lets a concurrent arkor dev instance lose its token.

Line 224 and Line 230 are a TOCTOU pair: another process can rewrite ~/.arkor/studio-token after the read succeeds but before this process unlinks it. In that interleaving, we compare our bytes and then delete the other process's fresh token anyway, so the shared-path 403 regression is still reachable. This needs an atomic ownership/lease scheme, not a best-effort content check.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/arkor/src/cli/commands/dev.ts` around lines 199 - 230, The current
read-then-unlink in registerCleanupHook (readFileSync → tokensEqual →
unlinkSync) is TOCTOU and can delete a freshly written token from another
process; replace the best-effort content check with an explicit ownership/lease:
when creating the token (persistStudioToken or wherever the token is written)
also atomically acquire a per-process lease by creating a companion
lock/ownership file using fs.openSync(lockPath, 'wx') (or writeFileSync with
flag 'wx') containing a unique owner id (pid/uuid); on cleanup, instead of just
reading the token file, read the lock file and compare the owner id to our
stored id and only then unlink the token and remove the lock; update
registerCleanupHook, persistStudioToken (or the token-writing site), and use
tokensEqual only for debug — use lockPath, persistStudioToken,
registerCleanupHook, and the stored owner id to locate the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@packages/arkor/src/cli/commands/dev.test.ts`:
- Around line 663-682: The tests relying on chmodSync to make ~/.arkor read-only
are flaky on Windows; replace the platform-specific permission trick with a
platform-neutral write-failure trigger: locate the token path via
studioTokenPath(...) (or the exported persistStudioToken/writeCredentials
function) and either (a) create a directory at that exact path before calling
runDev({ port: 4203 }) so the token write will fail with ENOTDIR, or (b)
mock/spyon the module method that writes the token (persistStudioToken or
writeCredentials) to throw an EACCES-like error for this test; ensure you
restore the original behavior in the finally block and keep assertions
(expect(runDev...).resolves.toBeUndefined and
expect(serve).toHaveBeenCalledTimes(1)) unchanged.

In `@packages/arkor/src/cli/commands/dev.ts`:
- Around line 199-230: The current read-then-unlink in registerCleanupHook
(readFileSync → tokensEqual → unlinkSync) is TOCTOU and can delete a freshly
written token from another process; replace the best-effort content check with
an explicit ownership/lease: when creating the token (persistStudioToken or
wherever the token is written) also atomically acquire a per-process lease by
creating a companion lock/ownership file using fs.openSync(lockPath, 'wx') (or
writeFileSync with flag 'wx') containing a unique owner id (pid/uuid); on
cleanup, instead of just reading the token file, read the lock file and compare
the owner id to our stored id and only then unlink the token and remove the
lock; update registerCleanupHook, persistStudioToken (or the token-writing
site), and use tokensEqual only for debug — use lockPath, persistStudioToken,
registerCleanupHook, and the stored owner id to locate the changes.

In `@packages/arkor/src/studio/server.ts`:
- Around line 1620-1624: readAsset currently concatenates the request path onto
assetsDir and can be abused for path traversal; to fix, decode and normalize the
incoming relPath, then compute the absolute target with path.resolve(assetsDir,
cleaned) (use the same resolved base: const base = path.resolve(assetsDir)), and
verify the resulting target is inside base (e.g. use path.relative(base, target)
and ensure it does not start with ".." and does not contain path.sep at the
front); if the check fails return null (or an appropriate error) and do not call
readFile. Also reject suspicious inputs (null bytes, absolute paths) before
resolving. Apply this same guard to the other occurrence around lines 1647-1655.

In `@packages/studio-app/src/components/RunTraining.tsx`:
- Around line 141-167: The poller can overwrite a newer HMR error because it
only checks cancelled; modify the effect to use a generation/sequence gate:
create a ref (e.g., manifestSeqRef) that you increment whenever an SSE/HMR error
frame or any HMR event invalidates the manifest, capture the current seq at the
start of tick (before calling fetchManifest), then before calling setManifest or
scheduling the next timer ensure the captured seq still equals
manifestSeqRef.current; if it differs, discard the poll result (and do not
schedule another timeout). Keep using the existing symbols (useEffect,
cancelled, timer, tick, fetchManifest, setManifest) but add and check the
manifestSeqRef around the async fetch and before state updates so in-flight
polls are invalidated when HMR events arrive.
- Around line 334-510: Add a jsdom Testing Library suite for the RunTraining
component exercising the run() state machine: write tests to simulate
streamTraining callbacks and SSE-like pre-spawn events to validate that
pendingPreSpawnEventsRef is drained, restartPendingRef wins over hot-swap, user
Abort (trainingAbortRef)/ac.signal.aborted prevents logging and cancels
restarts, and the 250ms restart grace window behavior (currentPidRef,
restartGraceTimerRef) toggles hmrStatus correctly; stub/mock streamTraining and
isHmrEnabled, drive onSpawn via the pid callback, and assert UI/state
transitions (setHmrStatus outcomes
"idle"/"early-stopping"/"restarting"/"hot-swapped") and that run(file)
re-invokes when appropriate. Ensure tests run under jsdom (Testing Library +
jest) and are added alongside the component.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: be0f3b24-0b07-46b0-87c3-4a8c45682ee7

📥 Commits

Reviewing files that changed from the base of the PR and between e726548 and e41acc7.

📒 Files selected for processing (18)
  • AGENTS.md
  • docs/concepts/studio.mdx
  • e2e/cli/vitest.config.ts
  • e2e/studio/src/specs/hmr.spec.ts
  • eslint.config.ts
  • packages/arkor/src/cli/commands/dev.test.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/arkor/src/cli/commands/start.test.ts
  • packages/arkor/src/core/runner.test.ts
  • packages/arkor/src/core/trainer.test.ts
  • packages/arkor/src/core/trainer.ts
  • packages/arkor/src/studio/hmr.ts
  • packages/arkor/src/studio/server.test.ts
  • packages/arkor/src/studio/server.ts
  • packages/studio-app/src/components/RunTraining.tsx
  • packages/studio-app/src/lib/api.test.ts
  • packages/studio-app/src/lib/api.ts
  • packages/studio-app/vite.config.ts
📜 Review details
🧰 Additional context used
📓 Path-based instructions (19)
docs/**/*.mdx

📄 CodeRabbit inference engine (AGENTS.md)

Verify Mintlify heading slugs with curl before adding cross-page anchor links. Mintlify preserves /, =, and full-width JP parens in ids, strips ASCII parens ( ) and backticks `, and converts spaces to -. Confirm actual rendered ids before committing links.

Files:

  • docs/concepts/studio.mdx
**

⚙️ CodeRabbit configuration file

**: # Arkor Development Guide

Note: Claude Code automatically loads this file.

Repository shape

pnpm + Turbo monorepo. Workspaces are declared in pnpm-workspace.yaml (packages/*, e2e/*, docs).

Path Role
packages/arkor Published arkor SDK + CLI + bundled local Studio server (Hono). bin/arkordist/bin.mjs. Library entry → dist/index.mjs.
packages/create-arkor Published create-arkor scaffolder (pnpm create arkor).
packages/cli-internal Private workspace package. Source is bundled into arkor and create-arkor via tsdown's deps.alwaysBundle. Never appears as a runtime dependency on npm.
packages/studio-app Private Vite + React 19 SPA. pnpm --filter @arkor/studio-app bundle builds it; packages/arkor/scripts/copy-studio-assets.mjs copies dist/ into packages/arkor/dist/assets/.
e2e/cli Private vitest suite that spawns the built dist/bin.mjs of both CLIs in temp dirs.
e2e/studio Private Playwright suite that spawns arkor dev against an in-process fake cloud-api and drives the Studio SPA in Chromium.
docs Mintlify sources for docs.arkor.ai.

Common commands

Root scripts fan out via Turbo (which respects ^build deps in turbo.json):

pnpm install
pnpm build          # turbo run build across all packages
pnpm typecheck      # tsc --noEmit across all packages
pnpm lint           # oxlint --deny-warnings, then strict ESLint 10 (single root config each)
pnpm format         # oxfmt --write across the repo (config in oxfmt.config.ts)
pnpm format:check   # oxfmt --check; CI gates on this (no writes)
pnpm test           # vitest run across all packages (incl. e2e)
pnpm test:coverage  # writes lcov + cobertura + junit per package; CI uploads lcov to Codecov, cobertura to GitHub nativ...

Files:

  • docs/concepts/studio.mdx
  • packages/arkor/src/cli/commands/start.test.ts
  • packages/studio-app/vite.config.ts
  • e2e/cli/vitest.config.ts
  • packages/studio-app/src/lib/api.ts
  • packages/arkor/src/studio/server.ts
  • AGENTS.md
  • eslint.config.ts
  • packages/arkor/src/studio/server.test.ts
  • packages/arkor/src/core/runner.test.ts
  • e2e/studio/src/specs/hmr.spec.ts
  • packages/arkor/src/core/trainer.test.ts
  • packages/arkor/src/core/trainer.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/studio-app/src/components/RunTraining.tsx
  • packages/studio-app/src/lib/api.test.ts
  • packages/arkor/src/studio/hmr.ts
  • packages/arkor/src/cli/commands/dev.test.ts
**/*.{ts,tsx,js,jsx,mts,mjs,cts,cjs}

📄 CodeRabbit inference engine (CONTRIBUTING.md)

**/*.{ts,tsx,js,jsx,mts,mjs,cts,cjs}: Avoid the em dash (U+2014) in code comments, string literals, and template literals. Use colons, periods, commas, parentheses, spaced hyphens (" - "), or reworded sentences instead
Run pnpm lint (oxlint --deny-warnings followed by strict ESLint 10 with type-aware rules) before submitting code

Files:

  • packages/arkor/src/cli/commands/start.test.ts
  • packages/studio-app/vite.config.ts
  • e2e/cli/vitest.config.ts
  • packages/studio-app/src/lib/api.ts
  • packages/arkor/src/studio/server.ts
  • eslint.config.ts
  • packages/arkor/src/studio/server.test.ts
  • packages/arkor/src/core/runner.test.ts
  • e2e/studio/src/specs/hmr.spec.ts
  • packages/arkor/src/core/trainer.test.ts
  • packages/arkor/src/core/trainer.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/studio-app/src/components/RunTraining.tsx
  • packages/studio-app/src/lib/api.test.ts
  • packages/arkor/src/studio/hmr.ts
  • packages/arkor/src/cli/commands/dev.test.ts
**/*.{ts,tsx,js,jsx,mts,mjs,cts,cjs,json,css,html}

📄 CodeRabbit inference engine (CONTRIBUTING.md)

Run pnpm format using oxfmt to format all JS/TS/JSON/CSS/HTML files before submitting code

Files:

  • packages/arkor/src/cli/commands/start.test.ts
  • packages/studio-app/vite.config.ts
  • e2e/cli/vitest.config.ts
  • packages/studio-app/src/lib/api.ts
  • packages/arkor/src/studio/server.ts
  • eslint.config.ts
  • packages/arkor/src/studio/server.test.ts
  • packages/arkor/src/core/runner.test.ts
  • e2e/studio/src/specs/hmr.spec.ts
  • packages/arkor/src/core/trainer.test.ts
  • packages/arkor/src/core/trainer.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/studio-app/src/components/RunTraining.tsx
  • packages/studio-app/src/lib/api.test.ts
  • packages/arkor/src/studio/hmr.ts
  • packages/arkor/src/cli/commands/dev.test.ts
**/*.test.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (CONTRIBUTING.md)

Include vitest cases for SDK/CLI/scaffolder logic, jsdom-based Testing Library cases for Studio components, or visual UI documentation when submitting PRs

Files:

  • packages/arkor/src/cli/commands/start.test.ts
  • packages/arkor/src/studio/server.test.ts
  • packages/arkor/src/core/runner.test.ts
  • packages/arkor/src/core/trainer.test.ts
  • packages/studio-app/src/lib/api.test.ts
  • packages/arkor/src/cli/commands/dev.test.ts
**/*.{ts,tsx,js,jsx,mts,mjs}

📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)

Do not use em-dashes (U+2014) anywhere in code comments, strings, and template literals. Instead use colons, periods, commas, parentheses, space-separated hyphens (-), or rephrase sentences.

Files:

  • packages/arkor/src/cli/commands/start.test.ts
  • packages/studio-app/vite.config.ts
  • e2e/cli/vitest.config.ts
  • packages/studio-app/src/lib/api.ts
  • packages/arkor/src/studio/server.ts
  • eslint.config.ts
  • packages/arkor/src/studio/server.test.ts
  • packages/arkor/src/core/runner.test.ts
  • e2e/studio/src/specs/hmr.spec.ts
  • packages/arkor/src/core/trainer.test.ts
  • packages/arkor/src/core/trainer.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/studio-app/src/components/RunTraining.tsx
  • packages/studio-app/src/lib/api.test.ts
  • packages/arkor/src/studio/hmr.ts
  • packages/arkor/src/cli/commands/dev.test.ts
packages/{arkor,create-arkor,cli-internal}/**/*.{ts,tsx,js,mts}

📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)

CLI runtime messages, generated file templates, and test names must also follow the no em-dash rule

Files:

  • packages/arkor/src/cli/commands/start.test.ts
  • packages/arkor/src/studio/server.ts
  • packages/arkor/src/studio/server.test.ts
  • packages/arkor/src/core/runner.test.ts
  • packages/arkor/src/core/trainer.test.ts
  • packages/arkor/src/core/trainer.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/arkor/src/studio/hmr.ts
  • packages/arkor/src/cli/commands/dev.test.ts
**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.{ts,tsx}: Prefer interface for defining object shapes in TypeScript (TypeScript-specific type definition pattern)
Use camelCase for variable names in TypeScript/JavaScript code
Always use async/await for promises in TypeScript/JavaScript code
Use timingSafeEqual for security-critical token comparisons (e.g., Studio CSRF token validation)

Files:

  • packages/arkor/src/cli/commands/start.test.ts
  • packages/studio-app/vite.config.ts
  • e2e/cli/vitest.config.ts
  • packages/studio-app/src/lib/api.ts
  • packages/arkor/src/studio/server.ts
  • eslint.config.ts
  • packages/arkor/src/studio/server.test.ts
  • packages/arkor/src/core/runner.test.ts
  • e2e/studio/src/specs/hmr.spec.ts
  • packages/arkor/src/core/trainer.test.ts
  • packages/arkor/src/core/trainer.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/studio-app/src/components/RunTraining.tsx
  • packages/studio-app/src/lib/api.test.ts
  • packages/arkor/src/studio/hmr.ts
  • packages/arkor/src/cli/commands/dev.test.ts
packages/arkor/src/**/*.test.ts

📄 CodeRabbit inference engine (AGENTS.md)

Add vitest test cases for SDK/CLI/scaffold logic changes

Files:

  • packages/arkor/src/cli/commands/start.test.ts
  • packages/arkor/src/studio/server.test.ts
  • packages/arkor/src/core/runner.test.ts
  • packages/arkor/src/core/trainer.test.ts
  • packages/arkor/src/cli/commands/dev.test.ts
**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

When bumping oxlint or oxfmt in the catalog, run pnpm install --config.minimum-release-age=0 once so platform bindings land in pnpm-lock.yaml. Regular pnpm install thereafter will resolve them normally.

Files:

  • packages/arkor/src/cli/commands/start.test.ts
  • packages/studio-app/vite.config.ts
  • e2e/cli/vitest.config.ts
  • packages/studio-app/src/lib/api.ts
  • packages/arkor/src/studio/server.ts
  • eslint.config.ts
  • packages/arkor/src/studio/server.test.ts
  • packages/arkor/src/core/runner.test.ts
  • e2e/studio/src/specs/hmr.spec.ts
  • packages/arkor/src/core/trainer.test.ts
  • packages/arkor/src/core/trainer.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/studio-app/src/lib/api.test.ts
  • packages/arkor/src/studio/hmr.ts
  • packages/arkor/src/cli/commands/dev.test.ts
packages/studio-app/vite.config.ts

📄 CodeRabbit inference engine (AGENTS.md)

Configure the arkor-studio-token Vite plugin with apply: "serve" (dev-only) to inject <meta name="arkor-studio-token"> into index.html on each request. Never apply this plugin during vite build to prevent baking a stale token into production.

Files:

  • packages/studio-app/vite.config.ts
e2e/cli/**/*.{ts,tsx,js,mts}

📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)

E2E CLI tests must consume the built dist/bin.mjs from arkor and create-arkor packages, ensuring pnpm build is run first when executing tests standalone

Files:

  • e2e/cli/vitest.config.ts
packages/arkor/src/studio/server.ts

📄 CodeRabbit inference engine (AGENTS.md)

Validate every /api/* request via X-Arkor-Studio-Token header (or ?studioToken= query for EventSource) using timingSafeEqual comparison. Only GET stream-only routes can be added to the eventStreamPathPattern allow-list regex; never add mutation endpoints.

Files:

  • packages/arkor/src/studio/server.ts
AGENTS.md

📄 CodeRabbit inference engine (CLAUDE.md)

Maintain AGENTS.md as a living document with current agent status and architectural decisions

Files:

  • AGENTS.md
**/*.md

📄 CodeRabbit inference engine (AGENTS.md)

Maintain English/Japanese documentation pairs: README.mdREADME.ja.md, CONTRIBUTING.mdCONTRIBUTING.ja.md, and docs/docs/ja/. Update both sides in the same PR.

Files:

  • AGENTS.md
eslint.config.ts

📄 CodeRabbit inference engine (AGENTS.md)

ESLint 10 reads eslint.config.ts flat config at repo root (type-aware via projectService). Configure strict rules: typescript-eslint strictTypeChecked + stylisticTypeChecked, unicorn, import-x, promise, n, react-hooks, jsx-a11y, vitest. Add per-file overrides for SPA (React), tests, Playwright fixtures, and scripts. Keep auto-fix unsafe rules (unicorn/no-typeof-undefined, no-unnecessary-type-assertion) set to off with comments explaining the failure modes.

Files:

  • eslint.config.ts
e2e/studio/**/*.{ts,tsx,js,mts}

📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)

E2E Studio tests must consume the built dist/bin.mjs from arkor and the bundled Studio assets, requiring playwright chromium installation and pnpm build as prerequisites

Files:

  • e2e/studio/src/specs/hmr.spec.ts
packages/arkor/src/cli/commands/dev.ts

📄 CodeRabbit inference engine (AGENTS.md)

Generate a 32-byte base64url CSRF token per dev server launch and pass it to buildStudioApp({ studioToken, hmr })

Files:

  • packages/arkor/src/cli/commands/dev.ts
packages/arkor/src/studio/hmr.ts

📄 CodeRabbit inference engine (AGENTS.md)

Keep a Rolldown watcher over src/arkor/ to track rebuilds and push events over /api/dev/events (SSE). Dynamic-import artifacts on success, extract TrainerInspection snapshots via Symbol.for("arkor.trainer.inspect") brand, and compute stable configHash from cloud-side JobConfig.

Files:

  • packages/arkor/src/studio/hmr.ts
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-11T14:01:54.065Z
Learning: Use the Repository pattern for data access
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-11T14:01:54.065Z
Learning: Treat `createArkor` as a function that returns a frozen, opaque manifest tagged with `_kind: "arkor"`; pass it to tooling, not as a programmable client for user code
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-11T14:01:54.065Z
Learning: Both e2e/cli and e2e/studio declare `arkor` (and `create-arkor`) as `workspace:*` devDependencies so Turbo's `^build` produces `dist/bin.mjs` exactly once before test runs (no pretest hooks, no concurrent races). Standalone runs need a prior `pnpm build`.
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-11T14:01:54.065Z
Learning: E2E coverage uses `c8` wrapping vitest (not vitest's native coverage) so child CLI V8 coverage is captured and remapped through tsdown sourcemaps back to `src/`. Set `CREATE_ARKOR_BUILD_SOURCEMAP=1` for E2E coverage builds; CI declares it in turbo.json so it propagates and busts the no-sourcemap cache variant. Release builds do NOT set this and ship without `.map` files.
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-11T14:01:54.065Z
Learning: Release SBOMs are generated with `pnpm sbom --prod --sbom-spec-version 1.7` (CycloneDX 1.7 and SPDX 2.3 formats), attested under the trusted builder's OIDC identity, and uploaded to both Codecov and GitHub native code coverage. Bundled runtime inputs of `arkor/studio-app` (e.g. `react`, `fontsource-variable/geist*`) must be `dependencies`, not `devDependencies`, so they appear in the production SBOM.
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-11T14:01:54.065Z
Learning: Do NOT split implementation (features, schema, behavior changes) across multiple PRs. Land docs (English + Japanese), tests, and code in the same PR unless the user explicitly says otherwise.
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-11T14:01:54.065Z
Learning: Node version: published packages declare `engines.node >=22.22.0` (raised from 22.6 to dodge the Jan 2026 async-hooks CVE). Use Node 24 (latest preferred) for development.
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-11T14:01:54.065Z
Learning: Preserve the SIGTERM-and-let-the-child-handle-it pattern in HMR subprocess lifecycle. Do not replace it with SIGKILL escalation (orphans Cloud jobs) or always-hot-swap (risks stale JobConfig). Do not surface `requestEarlyStop()` or `replaceCallbacks()` as public Trainer interface methods; keep them behind `Symbol.for` brands (dev-only HMR primitives).
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-11T14:01:54.065Z
Learning: HMR Symbols (`Symbol.for("arkor.trainer.inspect")`, `Symbol.for("arkor.trainer.replaceCallbacks")`, `Symbol.for("arkor.trainer.requestEarlyStop")`) are dev-only HMR primitives and must never be exposed in the published Trainer public interface. User code that wants similar semantics should compose `abortSignal` + `cancel()` per the cookbook.
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-11T14:01:54.065Z
Learning: When touching the Studio server or SPA fetch layer, preserve: token via header for `fetch`, query param for `EventSource`, host-header guard, no CORS, timing-safe compare. The Vite plugin is dev-only (`apply: "serve"`); running it during `vite build` would bake a stale per-launch token into production.
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-11T14:01:54.065Z
Learning: CORS is intentionally NOT configured on `/api/*`. The SPA is same-origin so reflecting `*` would let "simple" cross-origin POSTs reach handlers. The token check rejects those; cross-origin tabs cannot read the SPA's `<meta>`. This prevents another browser tab on the same machine from POSTing `/api/train` (RCE-grade exposure).
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-11T14:01:54.065Z
Learning: Generated files copied into package directories are gitignored: `packages/*/CONTRIBUTING.md` (from root), `packages/arkor/docs/` (from root `docs/`). Always edit the source under repo root, not the copies.
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-11T14:01:54.065Z
Learning: Do not call a HuggingFace model name "non-existent" based on training data alone. Verify (e.g. WebFetch) before flagging in issues or PR comments. If unverifiable, hedge ("could not confirm") rather than asserting absence.
🪛 LanguageTool
AGENTS.md

[style] ~100-~100: Consider using the typographical ellipsis character here instead.
Context: ...emoved. The build / coverage lanes set ARKOR_INTERNAL_SCAFFOLD_ARKOR_SPEC=file:.../packages/arkor so scaffolded fixtures ...

(ELLIPSIS)


[uncategorized] ~132-~132: The official name of this software platform is spelled with a capital “H”.
Context: ...TOCOL`. - Release SBOMs (build.yaml): each release sh...

(GITHUB)


[uncategorized] ~132-~132: The official name of this software platform is spelled with a capital “H”.
Context: ...ifact, and postflight in release.yaml attaches them t...

(GITHUB)


[uncategorized] ~132-~132: The official name of this software platform is spelled with a capital “H”.
Context: ...npm emits 2.3), so release-dry-run.yaml regener...

(GITHUB)

🔇 Additional comments (11)
packages/arkor/src/core/trainer.ts (1)

32-32: LGTM!

Also applies to: 157-157, 248-248

eslint.config.ts (1)

4-4: LGTM!

Also applies to: 31-38, 40-98, 154-167, 176-176, 334-334, 357-357

packages/arkor/src/core/runner.test.ts (1)

171-171: LGTM!

packages/arkor/src/core/trainer.test.ts (1)

279-279: LGTM!

Also applies to: 674-675, 856-858, 1841-1841

packages/arkor/src/studio/hmr.ts (1)

108-108: LGTM!

Also applies to: 111-111, 347-347

packages/studio-app/src/components/RunTraining.tsx (1)

241-251: Em-dash check: no U+2014 characters in RunTraining.tsx
packages/studio-app/src/components/RunTraining.tsx contains no U+2014 em-dash characters, so the em-dash guideline issue is not present here.

			> Likely an incorrect or invalid review comment.
packages/studio-app/src/lib/api.ts (1)

174-183: LGTM!

Also applies to: 262-268, 414-424, 447-453, 476-485, 510-515

packages/studio-app/vite.config.ts (1)

61-63: LGTM!

AGENTS.md (1)

100-100: LGTM!

Also applies to: 132-132

docs/concepts/studio.mdx (1)

15-15: LGTM!

e2e/cli/vitest.config.ts (1)

22-25: LGTM!

Also applies to: 45-46

@socket-security

Copy link
Copy Markdown

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Addedjsdom@​29.1.18110010093100
Addedglobals@​17.6.01001008691100
Addedhono@​4.12.23991009796100

View full report

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c8c110d181

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const snap = win32CancelSnapshots.find(
(s) => s.pid === target.pid,
);
if (!snap?.jobId || !snap.scope) continue;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fall back to late scope before Windows HMR cancel

On Windows, child.kill("SIGTERM") forcefully terminates the runner (Node documents this as a forceful/abrupt kill), so this server-side cancel is the only chance to release the cloud job for HMR restarts. In first-run anonymous projects the registry scope is commonly null at spawn because .arkor/state.json is created inside the child during trainer.start(), but the job id may be known by the time a rebuild forces restart; this !snap.scope guard then skips the cancel POST and the abrupt kill can leave the remote job running. Fresh evidence versus the earlier Windows SIGTERM finding: the new mitigation still only uses the spawn-time snapshot and does not do the late state-file fallback that the manual Stop path added.

Useful? React with 👍 / 👎.

Comment on lines +1108 to +1109
const { baseUrl: rpcBaseUrl, token: rpcToken } =
await resolveCredentialsAndBaseUrl();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use the spawn credentials for Windows HMR cancels

When a Windows HMR restart hits this fallback, the child has just been killed abruptly, so this POST is the only process that can cancel the cloud job. Resolving credentials here uses whatever account/control plane is on disk at rebuild time; if the user logged in/out or switched hosts after the run was spawned, the cancel goes to the wrong place, the error is swallowed, and the original job keeps running. Fresh evidence versus the earlier credential-capture report: the manual Stop path now uses its spawn-time spawnRpc, but this new Windows HMR fallback still bypasses that snapshot.

Useful? React with 👍 / 👎.

Comment on lines +757 to +758
void trainer
.cancel()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Disarm the timeout latch before fallback cancel

When the early-stop timeout fires, the latch stays armed until this fallback cancel() settles. If a checkpoint.saved event arrives during that in-flight cancel POST, the checkpoint branch still sees earlyStopRequested && earlyStopDeferred and issues a second trainer.cancel(); on APIs that reject duplicate/already-cancelled requests, that second POST can mark the run failed and make the SIGTERM handler exit nonzero even though the timeout fallback cancel may have succeeded. Clear or mark the latch before starting the fallback cancel so late checkpoints cannot re-enter the cancel path.

Useful? React with 👍 / 👎.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/arkor/src/studio/server.ts (1)

354-362: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

The manifest fast path drifts from custom HMR entries.

options.hmr can be watching an explicit entry/output, but this code always re-derives hmrOutFile from cwd alone. In an arkor dev session started against a non-default entry, /api/manifest will keep reading the default .arkor/build/index.mjs while the watcher is publishing a different artifact, so the SPA polls stale or missing manifest data even though HMR is healthy. Please plumb the coordinator's resolved outFile into buildStudioApp instead of recomputing it here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/arkor/src/studio/server.ts` around lines 354 - 362, The current
hmrOutFile computation recomputes the build entry from trainCwd (using
resolveBuildEntry) and ignores a coordinator-resolved output; update the logic
so buildStudioApp receives and uses the coordinator's resolved outFile instead
of recomputing it here: stop deriving hmrOutFile from resolveBuildEntry({ cwd:
trainCwd }) when options.hmr is true and instead accept a
resolvedOutFile/outFile parameter (the coordinator's value) and assign
hmrOutFile = coordinatorOutFile (or the passed-in resolvedOutFile) so
readManifestSummary and the /api/manifest fast path use the coordinator-provided
artifact; keep fallback behavior for undefined/out-of-date files as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/arkor/src/cli/commands/dev.ts`:
- Around line 261-278: Replace the non-atomic existsSync+renameSync path restore
with an atomic "create only if absent" operation: in the try block where you
currently call renameSync(reapPath, path), use fs.linkSync(reapPath, path) (or
fs.copyFileSync(reapPath, path, fs.constants.COPYFILE_EXCL)) so the creation
will fail with EEXIST if path was recreated, and in the catch for that EEXIST
simply unlinkSync(reapPath); keep the unlinkSync(reapPath) branch when path
already existed before attempting the atomic create; reference existsSync,
unlinkSync, renameSync (replace), linkSync or copyFileSync with COPYFILE_EXCL,
reapPath and path to locate the change.

In `@packages/arkor/src/studio/hmr.ts`:
- Around line 431-445: The staging rename currently changes mtime/ctime and
breaks inspectBundle() caching because moduleCacheBustUrl(outFile) is used; fix
by comparing content hashes and skipping the atomic rename when the staging
content equals the existing artifact: compute a checksum (e.g., SHA256) of
stagingFile and resolved.outFile at the BUNDLE_END/rename path (the code around
stagingFile and watcher), and only call renameSync when hashes differ;
alternatively, update inspectBundle() and moduleCacheBustUrl() to use a
content-hash-based cache key (read file bytes and derive the cache key) so
identical-byte rebuilds map to the same inspection record.

In `@packages/arkor/src/studio/server.ts`:
- Around line 1074-1121: The win32 restart path snapshots only pid/jobId/scope
and re-resolves credentials later, reintroducing races (null jobId, null scope,
auth/host switches) — instead attach a spawn-time RPC cancel context (baseUrl,
token and any resolved RPC client info) to each registry entry when the child is
spawned (augment whatever structure activeTrains.list()/activeTrains registry
returns), then in the restart loop (the restartTargets handling after
activeTrains.dispatchRebuild) look up that stored RPC snapshot (instead of
calling resolveCredentialsAndBaseUrl/createRpc again) and call
rpc.v1.jobs[":id"].cancel.$post with the stored snapshot; still keep the
existing late fallback to jobId/state when the spawn-time snapshot is missing,
but prefer the stored spawn-time RPC context to avoid races.

In `@packages/studio-app/src/components/RunTraining.tsx`:
- Around line 463-489: The current guard only suppresses auto-restart for
explicit nonzero exit codes; change it so a missing trailer (exitCode === null)
also suppresses restart when a restart was pending. Update the condition around
the block that manipulates restartPendingRef.current, currentPidRef.current,
restartGraceTimerRef.current, setHmrStatus and setLog (the existing if that
checks exitCode) to: if ((exitCode !== null && exitCode !== 0) || (exitCode ===
null && restartPendingRef.current)) { ... } so the same suppression/cleanup and
log (via appendCapped) runs for null when a restart was pending.

---

Outside diff comments:
In `@packages/arkor/src/studio/server.ts`:
- Around line 354-362: The current hmrOutFile computation recomputes the build
entry from trainCwd (using resolveBuildEntry) and ignores a coordinator-resolved
output; update the logic so buildStudioApp receives and uses the coordinator's
resolved outFile instead of recomputing it here: stop deriving hmrOutFile from
resolveBuildEntry({ cwd: trainCwd }) when options.hmr is true and instead accept
a resolvedOutFile/outFile parameter (the coordinator's value) and assign
hmrOutFile = coordinatorOutFile (or the passed-in resolvedOutFile) so
readManifestSummary and the /api/manifest fast path use the coordinator-provided
artifact; keep fallback behavior for undefined/out-of-date files as before.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 3eac6c93-4022-4549-b988-df0e2aa0795a

📥 Commits

Reviewing files that changed from the base of the PR and between e41acc7 and c8c110d.

📒 Files selected for processing (8)
  • packages/arkor/src/cli/commands/dev.test.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/arkor/src/studio/hmr.ts
  • packages/arkor/src/studio/server.ts
  • packages/studio-app/src/components/RunTraining.hmr.test.tsx
  • packages/studio-app/src/components/RunTraining.tsx
  • packages/studio-app/src/lib/api.test.ts
  • packages/studio-app/src/lib/api.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (52)
  • GitHub Check: install · yarn-berry · macos-latest · node 26.0.0
  • GitHub Check: install · npm · macos-latest · node >=26.0.0 <27
  • GitHub Check: install · bun · macos-latest · node >=26.0.0 <27
  • GitHub Check: install · yarn-berry · macos-latest · node >=26.0.0 <27
  • GitHub Check: install · pnpm-9 · macos-latest · node 26.0.0
  • GitHub Check: install · yarn · macos-latest · node >=26.0.0 <27
  • GitHub Check: install · pnpm-11 · macos-latest · node >=26.0.0 <27
  • GitHub Check: install · pnpm-10 · macos-latest · node >=26.0.0 <27
  • GitHub Check: install · pnpm-10 · macos-latest · node 26.0.0
  • GitHub Check: install · pnpm-9 · macos-latest · node >=26.0.0 <27
  • GitHub Check: install · npm · macos-latest · node >=24.12.0 <25
  • GitHub Check: install · yarn-berry · macos-latest · node >=24.12.0 <25
  • GitHub Check: install · bun · macos-latest · node >=24.12.0 <25
  • GitHub Check: install · pnpm-10 · windows-latest · node 24.0.0
  • GitHub Check: install · pnpm-11 · macos-latest · node 24.12.0
  • GitHub Check: install · pnpm-11 · windows-latest · node >=26.0.0 <27
  • GitHub Check: install · yarn-berry · windows-latest · node 24.12.0
  • GitHub Check: install · bun · windows-latest · node 24.12.0
  • GitHub Check: install · pnpm-11 · macos-latest · node >=22.22.0 <23
  • GitHub Check: install · yarn · windows-latest · node >=26.0.0 <27
  • GitHub Check: install · pnpm-10 · macos-latest · node 24.0.0
  • GitHub Check: install · bun · windows-latest · node >=24.12.0 <25
  • GitHub Check: install · pnpm-11 · macos-latest · node 22.22.0
  • GitHub Check: install · bun · windows-latest · node 26.0.0
  • GitHub Check: install · bun · windows-latest · node >=26.0.0 <27
  • GitHub Check: install · pnpm-10 · windows-latest · node 26.0.0
  • GitHub Check: install · npm · windows-latest · node >=26.0.0 <27
  • GitHub Check: install · pnpm-11 · windows-latest · node >=24.12.0 <25
  • GitHub Check: install · yarn · windows-latest · node 26.0.0
  • GitHub Check: install · pnpm-10 · windows-latest · node >=26.0.0 <27
  • GitHub Check: install · yarn-berry · windows-latest · node >=24.12.0 <25
  • GitHub Check: install · yarn · windows-latest · node 24.12.0
  • GitHub Check: install · pnpm-9 · windows-latest · node >=26.0.0 <27
  • GitHub Check: install · npm · windows-latest · node 24.12.0
  • GitHub Check: install · yarn-berry · windows-latest · node 26.0.0
  • GitHub Check: install · yarn · windows-latest · node >=24.12.0 <25
  • GitHub Check: install · pnpm-10 · windows-latest · node >=24.12.0 <25
  • GitHub Check: install · pnpm-11 · windows-latest · node 24.12.0
  • GitHub Check: install · yarn · windows-latest · node 24.0.0
  • GitHub Check: install · pnpm-9 · windows-latest · node >=24.12.0 <25
  • GitHub Check: install · pnpm-9 · windows-latest · node 24.12.0
  • GitHub Check: install · npm · windows-latest · node 26.0.0
  • GitHub Check: install · pnpm-11 · windows-latest · node 24.0.0
  • GitHub Check: install · npm · windows-latest · node >=24.12.0 <25
  • GitHub Check: install · pnpm-10 · windows-latest · node 24.12.0
  • GitHub Check: install · npm · windows-latest · node 24.0.0
  • GitHub Check: install · npm · windows-latest · node >=22.22.0 <23
  • GitHub Check: install · pnpm-9 · windows-latest · node 22.22.0
  • GitHub Check: install · npm · windows-latest · node 22.22.0
  • GitHub Check: install · yarn · windows-latest · node >=22.22.0 <23
  • GitHub Check: install · npm · windows-latest · node 26.0.0
  • GitHub Check: typecheck · lint · test · build · windows-latest · node 22.22.0
🧰 Additional context used
📓 Path-based instructions (11)
**/*.{ts,tsx,js,jsx,mts,mjs,cts,cjs}

📄 CodeRabbit inference engine (CONTRIBUTING.md)

**/*.{ts,tsx,js,jsx,mts,mjs,cts,cjs}: Avoid the em dash (U+2014) in code comments, string literals, and template literals. Use colons, periods, commas, parentheses, spaced hyphens (" - "), or reworded sentences instead
Run pnpm lint (oxlint --deny-warnings followed by strict ESLint 10 with type-aware rules) before submitting code

Files:

  • packages/studio-app/src/components/RunTraining.tsx
  • packages/studio-app/src/lib/api.test.ts
  • packages/studio-app/src/lib/api.ts
  • packages/studio-app/src/components/RunTraining.hmr.test.tsx
  • packages/arkor/src/studio/server.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/arkor/src/cli/commands/dev.test.ts
  • packages/arkor/src/studio/hmr.ts
**/*.{ts,tsx,js,jsx,mts,mjs,cts,cjs,json,css,html}

📄 CodeRabbit inference engine (CONTRIBUTING.md)

Run pnpm format using oxfmt to format all JS/TS/JSON/CSS/HTML files before submitting code

Files:

  • packages/studio-app/src/components/RunTraining.tsx
  • packages/studio-app/src/lib/api.test.ts
  • packages/studio-app/src/lib/api.ts
  • packages/studio-app/src/components/RunTraining.hmr.test.tsx
  • packages/arkor/src/studio/server.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/arkor/src/cli/commands/dev.test.ts
  • packages/arkor/src/studio/hmr.ts
**/*.{ts,tsx,js,jsx,mts,mjs}

📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)

Do not use em-dashes (U+2014) anywhere in code comments, strings, and template literals. Instead use colons, periods, commas, parentheses, space-separated hyphens (-), or rephrase sentences.

Files:

  • packages/studio-app/src/components/RunTraining.tsx
  • packages/studio-app/src/lib/api.test.ts
  • packages/studio-app/src/lib/api.ts
  • packages/studio-app/src/components/RunTraining.hmr.test.tsx
  • packages/arkor/src/studio/server.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/arkor/src/cli/commands/dev.test.ts
  • packages/arkor/src/studio/hmr.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (AGENTS.md)

TypeScript: prefer interface for defining object shapes; use camelCase for variable names; always use async/await for promises

Files:

  • packages/studio-app/src/components/RunTraining.tsx
  • packages/studio-app/src/lib/api.test.ts
  • packages/studio-app/src/lib/api.ts
  • packages/studio-app/src/components/RunTraining.hmr.test.tsx
  • packages/arkor/src/studio/server.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/arkor/src/cli/commands/dev.test.ts
  • packages/arkor/src/studio/hmr.ts
**

⚙️ CodeRabbit configuration file

**: # Arkor Development Guide

Note: Claude Code automatically loads this file.

Repository shape

pnpm + Turbo monorepo. Workspaces are declared in pnpm-workspace.yaml (packages/*, e2e/*, docs).

Path Role
packages/arkor Published arkor SDK + CLI + bundled local Studio server (Hono). bin/arkordist/bin.mjs. Library entry → dist/index.mjs.
packages/create-arkor Published create-arkor scaffolder (pnpm create arkor).
packages/cli-internal Private workspace package. Source is bundled into arkor and create-arkor via tsdown's deps.alwaysBundle. Never appears as a runtime dependency on npm.
packages/studio-app Private Vite + React 19 SPA. pnpm --filter @arkor/studio-app bundle builds it; packages/arkor/scripts/copy-studio-assets.mjs copies dist/ into packages/arkor/dist/assets/.
e2e/cli Private vitest suite that spawns the built dist/bin.mjs of both CLIs in temp dirs.
e2e/studio Private Playwright suite that spawns arkor dev against an in-process fake cloud-api and drives the Studio SPA in Chromium.
docs Mintlify sources for docs.arkor.ai.

Common commands

Root scripts fan out via Turbo (which respects ^build deps in turbo.json):

pnpm install
pnpm build          # turbo run build across all packages
pnpm typecheck      # tsc --noEmit across all packages
pnpm lint           # oxlint --deny-warnings, then strict ESLint 10 (single root config each)
pnpm format         # oxfmt --write across the repo (config in oxfmt.config.ts)
pnpm format:check   # oxfmt --check; CI gates on this (no writes)
pnpm test           # vitest run across all packages (incl. e2e)
pnpm test:coverage  # writes lcov + cobertura + junit per package; CI uploads lcov to Codecov, cobertura to GitHub nativ...

Files:

  • packages/studio-app/src/components/RunTraining.tsx
  • packages/studio-app/src/lib/api.test.ts
  • packages/studio-app/src/lib/api.ts
  • packages/studio-app/src/components/RunTraining.hmr.test.tsx
  • packages/arkor/src/studio/server.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/arkor/src/cli/commands/dev.test.ts
  • packages/arkor/src/studio/hmr.ts
**/*.test.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (CONTRIBUTING.md)

Include vitest cases for SDK/CLI/scaffolder logic, jsdom-based Testing Library cases for Studio components, or visual UI documentation when submitting PRs

Files:

  • packages/studio-app/src/lib/api.test.ts
  • packages/studio-app/src/components/RunTraining.hmr.test.tsx
  • packages/arkor/src/cli/commands/dev.test.ts
**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Test files must be placed under packages//src/**/.test.ts with vitest convention; e2e scenarios in e2e/cli/src and e2e/studio/**/*.test.ts

Files:

  • packages/studio-app/src/lib/api.test.ts
  • packages/studio-app/src/components/RunTraining.hmr.test.tsx
  • packages/arkor/src/cli/commands/dev.test.ts
packages/{arkor,create-arkor,cli-internal}/**/*.{ts,tsx,js,mts}

📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)

CLI runtime messages, generated file templates, and test names must also follow the no em-dash rule

Files:

  • packages/arkor/src/studio/server.ts
  • packages/arkor/src/cli/commands/dev.ts
  • packages/arkor/src/cli/commands/dev.test.ts
  • packages/arkor/src/studio/hmr.ts
**/studio/server.ts

📄 CodeRabbit inference engine (AGENTS.md)

**/studio/server.ts: Validate every /api/* request via X-Arkor-Studio-Token header (or ?studioToken= query for EventSource) using timingSafeEqual; enforce host-header allow-list (127.0.0.1/localhost) for DNS-rebinding defence; do NOT configure CORS
Restrict EventSource ?studioToken= query param allow-list to GET stream-only routes via eventStreamPathPattern regex in studio/server.ts; adding entries is CSRF-sensitive and must never include mutation endpoints

Files:

  • packages/arkor/src/studio/server.ts
**/cli/commands/dev.ts

📄 CodeRabbit inference engine (AGENTS.md)

Generate a 32-byte base64url CSRF token per arkor dev launch in dev.ts; persist to ~/.arkor/studio-token (mode 0600) and inject into Studio SPA via Vite plugin

Files:

  • packages/arkor/src/cli/commands/dev.ts
**/studio/hmr.ts

📄 CodeRabbit inference engine (AGENTS.md)

Keep Rolldown watcher over src/arkor/ in packages/arkor/src/studio/hmr.ts; on rebuild, pull TrainerInspection snapshot and compute configHash for hot-swap decision logic

Files:

  • packages/arkor/src/studio/hmr.ts
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-12T10:25:25.551Z
Learning: Use pnpm + Turbo monorepo structure with workspaces declared in pnpm-workspace.yaml (packages/*, e2e/*, docs)
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-12T10:25:25.551Z
Learning: For all new features, SDK/CLI/Studio behaviour changes, or schema additions, include docs in both English and Japanese (README.md ↔ README.ja.md, docs/ ↔ docs/ja/) AND vitest test cases in the same PR; do not defer docs or tests to follow-ups
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-12T10:25:25.551Z
Learning: Set ARKOR_INTERNAL_SCAFFOLD_ARKOR_SPEC=file:.../packages/arkor in build/coverage CI lanes so fixtures resolve to in-workspace arkor; install-matrix does NOT set this and packs fresh arkor-*.tgz instead
Learnt from: CR
Repo: arkorlab/arkor

Timestamp: 2026-06-12T10:25:25.551Z
Learning: When bumping oxlint or oxfmt in catalog, run pnpm install --config.minimum-release-age=0 once so platform bindings land in pnpm-lock.yaml; fixes pnpm 10.x minimumReleaseAge interaction
🪛 OpenGrep (1.22.0)
packages/studio-app/src/lib/api.ts

[ERROR] 473-473: Dynamic command passed to child_process.exec/execSync. Use child_process.execFile or spawn with an argument array instead.

(coderabbit.command-injection.exec-js)

Comment on lines +261 to +278
// Foreign token: restore it, UNLESS the other process already
// re-wrote `path` after our rename claimed the old copy. Rename
// REPLACES an existing destination rather than failing, so a
// bare rename-back would clobber that fresher token with the
// older claimed one. The existence probe shrinks the clobber
// window from "read → unlink" (previous design, milliseconds
// spanning a token comparison) to the few instructions between
// existsSync and renameSync, and the losing outcome in that
// residual window is restore-the-older-token (the other dev
// server 403s until its own next rewrite), not delete-the-token
// outright.
try {
if (existsSync(path)) {
// Newest writer wins: discard the claimed older copy.
unlinkSync(reapPath);
} else {
renameSync(reapPath, path);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore the foreign token without a replace-on-exist race.

This fallback still does existsSync(path) and then renameSync(reapPath, path). If another arkor dev recreates path after the existence check but before the rename-back, this branch overwrites that newer token with the older claimed one, so the sibling Studio can still fall back to 403s. Restore with an atomic "create only if absent" step instead, then discard reapPath on EEXIST.

Possible fix
-import { existsSync, readFileSync, renameSync, unlinkSync } from "node:fs";
+import {
+  constants as fsConstants,
+  copyFileSync,
+  readFileSync,
+  renameSync,
+  unlinkSync,
+} from "node:fs";
...
       try {
-        if (existsSync(path)) {
-          // Newest writer wins: discard the claimed older copy.
-          unlinkSync(reapPath);
-        } else {
-          renameSync(reapPath, path);
-        }
-      } catch {
+        copyFileSync(reapPath, path, fsConstants.COPYFILE_EXCL);
+        unlinkSync(reapPath);
+      } catch (err) {
+        const code =
+          typeof err === "object" && err !== null && "code" in err
+            ? String(err.code)
+            : "";
+        if (code === "EEXIST") {
+          try {
+            unlinkSync(reapPath);
+          } catch {
+            // best-effort
+          }
+          return;
+        }
         try {
           unlinkSync(reapPath);
         } catch {
           // best-effort
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/arkor/src/cli/commands/dev.ts` around lines 261 - 278, Replace the
non-atomic existsSync+renameSync path restore with an atomic "create only if
absent" operation: in the try block where you currently call
renameSync(reapPath, path), use fs.linkSync(reapPath, path) (or
fs.copyFileSync(reapPath, path, fs.constants.COPYFILE_EXCL)) so the creation
will fail with EEXIST if path was recreated, and in the catch for that EEXIST
simply unlinkSync(reapPath); keep the unlinkSync(reapPath) branch when path
already existed before attempting the atomic create; reference existsSync,
unlinkSync, renameSync (replace), linkSync or copyFileSync with COPYFILE_EXCL,
reapPath and path to locate the change.

Comment on lines +431 to +445
// The watcher writes to a STAGING file beside the real artefact,
// which BUNDLE_END atomically renames onto `outFile` (Codex P2,
// round 81). Writing `outFile` directly would let a concurrent
// `/api/train` spawn race the watcher's non-atomic write:
// `runStart` skips its own rebuild whenever the artefact exists,
// so a Run click landing inside the write window would
// dynamic-import a partially-written bundle and crash the run.
// `renameSync` within the same directory is atomic on POSIX and
// uses MoveFileEx(MOVEFILE_REPLACE_EXISTING) on Windows, so
// readers only ever observe the previous complete artefact or
// the new complete artefact, never a torn write. Bonus: a build
// that errors before BUNDLE_END leaves the last-good artefact
// in place instead of clobbering it with partial output.
const stagingFile = `${resolved.outFile}.hmr-staging`;
watcher = watch({

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Identical-byte rebuilds no longer reuse the cached inspection module.

Replacing resolved.outFile with a freshly written staging file changes mtime/ctime on every BUNDLE_END, even when the bytes are unchanged. inspectBundle() still cache-busts through moduleCacheBustUrl(outFile), so duplicate or no-op rebuilds now import a brand-new URL, rerun top-level module code, and retain another permanent ESM-cache record. That breaks the "same bytes collapse to the same inspection record" invariant this file relies on, especially for the duplicate BUNDLE_END case already documented in hmr.test.ts. Please either skip the rename when the staged content hash matches the current artifact, or switch the inspection cache key to a content hash.

Also applies to: 467-472

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/arkor/src/studio/hmr.ts` around lines 431 - 445, The staging rename
currently changes mtime/ctime and breaks inspectBundle() caching because
moduleCacheBustUrl(outFile) is used; fix by comparing content hashes and
skipping the atomic rename when the staging content equals the existing
artifact: compute a checksum (e.g., SHA256) of stagingFile and resolved.outFile
at the BUNDLE_END/rename path (the code around stagingFile and watcher), and
only call renameSync when hashes differ; alternatively, update inspectBundle()
and moduleCacheBustUrl() to use a content-hash-based cache key (read file bytes
and derive the cache key) so identical-byte rebuilds map to the same inspection
record.

Comment on lines +1074 to +1121
// win32 (Codex P1, round 81): Node's `subprocess.kill("SIGTERM")`
// on Windows terminates the child "forcefully and abruptly"
// (per Node's child_process docs), so the runner's
// `installShutdownHandlers` never runs and the child can't
// issue its own graceful `cancel()` POST. Snapshot each active
// child's parsed jobId + spawn scope BEFORE dispatch (after
// dispatch the kill may already be reaping entries via close
// handlers) so the server can fire the cloud cancel on the
// child's behalf for every restart target.
const win32CancelSnapshots =
process.platform === "win32"
? activeTrains.list().map((entry) => ({
pid: entry.child.pid,
jobId: entry.jobId,
scope: entry.scope,
}))
: [];
const { hotSwapTargets, restartTargets } = activeTrains.dispatchRebuild(
nextHash,
nextArtifactContentHash,
);
if (process.platform === "win32" && restartTargets.length > 0) {
for (const target of restartTargets) {
const snap = win32CancelSnapshots.find(
(s) => s.pid === target.pid,
);
if (!snap?.jobId || !snap.scope) continue;
const { jobId: snapJobId, scope: snapScope } = snap;
// Fire-and-forget per child, mirroring the manual-Stop
// cancel path: SIGTERM-as-forceful-kill on win32 means
// nobody else will release the cloud job. Best-effort;
// cloud reaper / TTL is the safety net on failure.
void (async () => {
try {
const { baseUrl: rpcBaseUrl, token: rpcToken } =
await resolveCredentialsAndBaseUrl();
const rpc = createRpc(rpcBaseUrl, rpcToken);
await rpc.v1.jobs[":id"].cancel.$post({
param: { id: snapJobId },
query: {
orgSlug: snapScope.orgSlug,
projectSlug: snapScope.projectSlug,
},
});
} catch {
// best-effort
}
})();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Windows restart cancel dropped the robust spawn-time cancel context.

This branch snapshots only pid/jobId/scope and then resolves credentials again at rebuild time. That reopens the same races the manual Stop path above already accounted for: jobId can still be null when the rebuild lands, first-run anonymous runs still have scope === null, and a login or control-plane switch mid-run makes the cancel POST target the wrong account/host. On win32 the child is already being terminated, so any of those cases leaves the original cloud job orphaned. Please reuse the same spawn-time RPC snapshot plus late jobId/state fallback here, ideally by storing the cancel context on the registry entry instead of re-implementing a weaker cancel flow in this subscriber.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/arkor/src/studio/server.ts` around lines 1074 - 1121, The win32
restart path snapshots only pid/jobId/scope and re-resolves credentials later,
reintroducing races (null jobId, null scope, auth/host switches) — instead
attach a spawn-time RPC cancel context (baseUrl, token and any resolved RPC
client info) to each registry entry when the child is spawned (augment whatever
structure activeTrains.list()/activeTrains registry returns), then in the
restart loop (the restartTargets handling after activeTrains.dispatchRebuild)
look up that stored RPC snapshot (instead of calling
resolveCredentialsAndBaseUrl/createRpc again) and call
rpc.v1.jobs[":id"].cancel.$post with the stored snapshot; still keep the
existing late fallback to jobId/state when the spawn-time snapshot is missing,
but prefer the stored spawn-time RPC context to avoid races.

Comment on lines +463 to 489
// Codex P1 (PR #101 round 80): a NONZERO exit suppresses auto-
// restart. The runner exits 143 when an HMR-triggered early-stop
// failed to cancel the cloud job (cancel POST rejected); spawning
// a fresh run on top of that would overlap a new cloud job with
// the one whose cancel never landed (double GPU spend, no
// operator signal). Any other nonzero exit is a crash, and
// restarting a crashing run in a loop would churn cloud jobs
// with the same broken artefact. Exit 0 (graceful early-stop
// succeeded) and `null` (no marker parsed: spawn failure path
// already surfaced its own error) keep the existing restart
// behaviour.
if (exitCode !== null && exitCode !== 0) {
restartPendingRef.current = false;
currentPidRef.current = null;
setHmrStatus("idle");
if (restartGraceTimerRef.current !== null) {
clearTimeout(restartGraceTimerRef.current);
restartGraceTimerRef.current = null;
}
setLog((prev) =>
appendCapped(
prev,
`\n[error] ${err instanceof Error ? err.message : String(err)}\n`,
`\n[hmr] run exited with code ${exitCode}; auto-restart suppressed. Fix the issue and press Run to start a new job.\n`,
),
);
} finally {
if (trainingAbortRef.current === ac) trainingAbortRef.current = null;
// Always release the running flag, including the user-initiated
// abort path. setState on an already-unmounted component is a
// no-op in React 18+, so the unmount-cleanup case handles itself.
setRunning(false);
return;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat a missing exit= trailer as restart-unsafe.

streamTraining() returns null when the train stream ends without a parsed exit marker. This branch only suppresses restarts for explicit nonzero codes, so an HMR-triggered early-stop that loses its trailer still falls through to the restart path when restartPendingRef.current is set. That recreates the overlapping-job risk this guard is meant to stop. Only auto-restart after an explicit exitCode === 0; treat null as suppression too when a restart is pending.

Possible fix
-    if (exitCode !== null && exitCode !== 0) {
+    const restartUnsafe = restartPendingRef.current
+      ? exitCode !== 0
+      : exitCode !== null && exitCode !== 0;
+    if (restartUnsafe) {
       restartPendingRef.current = false;
       currentPidRef.current = null;
       setHmrStatus("idle");
       if (restartGraceTimerRef.current !== null) {
         clearTimeout(restartGraceTimerRef.current);
         restartGraceTimerRef.current = null;
       }
       setLog((prev) =>
         appendCapped(
           prev,
-          `\n[hmr] run exited with code ${exitCode}; auto-restart suppressed. Fix the issue and press Run to start a new job.\n`,
+          `\n[hmr] run exited with ${
+            exitCode === null ? "an unknown status" : `code ${exitCode}`
+          }; auto-restart suppressed. Fix the issue and press Run to start a new job.\n`,
         ),
       );
       return;
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/studio-app/src/components/RunTraining.tsx` around lines 463 - 489,
The current guard only suppresses auto-restart for explicit nonzero exit codes;
change it so a missing trailer (exitCode === null) also suppresses restart when
a restart was pending. Update the condition around the block that manipulates
restartPendingRef.current, currentPidRef.current, restartGraceTimerRef.current,
setHmrStatus and setLog (the existing if that checks exitCode) to: if ((exitCode
!== null && exitCode !== 0) || (exitCode === null && restartPendingRef.current))
{ ... } so the same suppression/cleanup and log (via appendCapped) runs for null
when a restart was pending.

Comment on lines 447 to +474
while (true) {
const { value, done } = await reader.read();
if (done) break;
onChunk(decoder.decode(value, { stream: true }));
const text = decoder.decode(value, { stream: true });
trackTail(text);
onChunk(text);
}
// Flush any bytes the streaming decoder buffered for a multi-byte
// UTF-8 sequence that landed split across the final two chunks.
// Without this, the last character of the trainer's output gets
// silently dropped when it happens to be non-ASCII (Japanese log
// lines, emoji progress bars, etc.).
const tail = decoder.decode();
if (tail) onChunk(tail);
if (tail) {
trackTail(tail);
onChunk(tail);
}
} finally {
signal?.removeEventListener("abort", onAbort);
// Release the reader lock so a subsequent caller can re-acquire
// the body if needed. Mirrors `iterateSseFrames`'s finally clause.
reader.releaseLock();
}
// `null` when no marker was found: aborted streams, `error=` paths,
// truncated bodies. Callers treat null as "unknown" and apply their
// own conservative default.
const m = EXIT_MARKER_PATTERN.exec(tailBuf);
return m?.[1] === undefined ? null : Number.parseInt(m[1], 10);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don't forward the synthetic exit= trailer into onChunk.

streamTraining parses \n---\nexit=<code>\n only after the read loop, but every decoded chunk has already been passed to onChunk. That means RunTraining will log the protocol footer verbatim on normal exits. Keep a small holdback tail, strip the trailer once the stream closes, and only emit the non-marker bytes.

🧰 Tools
🪛 OpenGrep (1.22.0)

[ERROR] 473-473: Dynamic command passed to child_process.exec/execSync. Use child_process.execFile or spawn with an argument array instead.

(coderabbit.command-injection.exec-js)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants