feat(supervisor): compute workload manager #3114

Open
nicktrn wants to merge 70 commits into main from feat/compute-workload-manager


@nicktrn nicktrn commented Feb 23, 2026

Adds the ComputeWorkloadManager for routing task execution through the compute gateway, including full checkpoint/restore support, OTel trace integration, and template pre-warming.

Changes

Compute workload manager (apps/supervisor/src/workloadManager/compute.ts)

  • Routes instance create, snapshot, delete, and restore through the compute gateway API
  • Wide event logging on create with full timing and context
  • Configurable gateway timeout, auth token, image digest stripping

Compute snapshot service (apps/supervisor/src/services/computeSnapshotService.ts)

  • Timer wheel for delayed snapshot dispatch (avoids wasted work on short-lived waitpoints)
  • Configurable dispatch concurrency limit (COMPUTE_SNAPSHOT_DISPATCH_LIMIT)
  • Snapshot-complete callback handler with suspend completion reporting
  • Trace context management and OTel span emission for snapshot operations
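The delayed-dispatch idea in the snapshot service list can be illustrated with a small single-level timer wheel. This is only a sketch under assumptions: the class name `TimerWheel`, its methods, and the slot layout are hypothetical and are not taken from computeSnapshotService.ts. It shows why cancelling a short-lived waitpoint is cheap: the entry is removed from its slot before the wheel ever reaches it, so no snapshot work is started.

```typescript
// Hypothetical single-level timer wheel; names and sizing are assumptions,
// not the code in computeSnapshotService.ts.
type WheelEntry<T> = { value: T; rounds: number };

class TimerWheel<T> {
  private readonly slots: Map<string, WheelEntry<T>>[];
  private readonly slotOfKey = new Map<string, number>();
  private readonly onExpire: (key: string, value: T) => void;
  private cursor = 0;

  constructor(slotCount: number, onExpire: (key: string, value: T) => void) {
    this.slots = Array.from({ length: slotCount }, () => new Map<string, WheelEntry<T>>());
    this.onExpire = onExpire;
  }

  /** Schedule `value` to expire after `delayTicks` calls to tick() (minimum 1). */
  schedule(key: string, value: T, delayTicks: number): void {
    this.cancel(key); // re-scheduling replaces any pending entry for the key
    const ticks = Math.max(1, Math.floor(delayTicks));
    const slot = (this.cursor + ticks) % this.slots.length;
    // Number of full wheel rotations to skip before the entry is due.
    const rounds = Math.floor((ticks - 1) / this.slots.length);
    this.slots[slot].set(key, { value, rounds });
    this.slotOfKey.set(key, slot);
  }

  /** Cancel a pending entry, e.g. when a waitpoint completes before the delay elapses. */
  cancel(key: string): boolean {
    const slot = this.slotOfKey.get(key);
    if (slot === undefined) return false;
    this.slotOfKey.delete(key);
    return this.slots[slot].delete(key);
  }

  /** Advance one tick and fire every entry in the new slot whose rounds are exhausted. */
  tick(): void {
    this.cursor = (this.cursor + 1) % this.slots.length;
    const slot = this.slots[this.cursor];
    for (const [key, entry] of Array.from(slot)) {
      if (entry.rounds > 0) {
        entry.rounds -= 1;
        continue;
      }
      slot.delete(key);
      this.slotOfKey.delete(key);
      this.onExpire(key, entry.value);
    }
  }
}
```

In a service, `tick()` would typically be driven by a fixed-interval timer, so the configured snapshot delay maps to a tick count; how the real service sizes its wheel is not stated in this thread.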

OTel trace service (apps/supervisor/src/services/otlpTraceService.ts)

  • Fire-and-forget OTLP span emission for compute operations (provision, restore, snapshot)
  • BigInt nanosecond conversion preserving sub-ms precision for span ordering
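The BigInt conversion matters because a Unix timestamp in nanoseconds (around 1.7e18) exceeds `Number.MAX_SAFE_INTEGER` (about 9e15), so multiplying milliseconds by 1e6 as a plain number loses the low digits that order closely spaced spans. The sketch below shows one way to do the conversion; the helper name `msToUnixNano` is an assumption, not the function in otlpTraceService.ts.

```typescript
// Hypothetical helper; the name and signature are assumptions.
// Converts epoch milliseconds (possibly fractional) to epoch nanoseconds
// without passing through an unsafe Number intermediate.
function msToUnixNano(epochMs: number): bigint {
  const wholeMs = Math.trunc(epochMs);
  // The fractional part is small, so scaling it as a Number is safe.
  const fractionalNs = Math.round((epochMs - wholeMs) * 1_000_000);
  return BigInt(wholeMs) * BigInt(1_000_000) + BigInt(fractionalNs);
}
```

When the value is serialised as OTLP/JSON, 64-bit integer fields are carried as decimal strings, so the result would be passed through `.toString()` rather than converted back to a number.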

Template creation (apps/webapp/app/v3/services/computeTemplateCreation.server.ts)

  • Three-mode rollout: required (MICROVM projects), shadow (feature flag / percentage), skip
  • Integrated into deploy finalize flow
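The three-mode rollout can be pictured as a small pure function. This is a sketch under assumptions: the input shape, the precedence between the feature flag and the percentage, and the hash-based bucketing are all hypothetical, since the thread only names the three modes and their triggers.

```typescript
type TemplateMode = "required" | "shadow" | "skip";

// Hypothetical input shape; field names are assumptions for illustration.
interface RolloutInput {
  isMicrovmProject: boolean; // project uses the MICROVM workload type
  hasComputeAccess: boolean; // feature flag named in the PR description
  shadowPercentage: number;  // 0-100, share of other projects to shadow
  projectId: string;
}

// Deterministic bucket in [0, 100) so a project keeps the same decision
// across deploys. The hash choice is an assumption, not the PR's method.
function bucketOf(id: string): number {
  let hash = 0;
  for (let i = 0; i < id.length; i++) {
    hash = (hash * 31 + id.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}

function resolveTemplateMode(input: RolloutInput): TemplateMode {
  if (input.isMicrovmProject) return "required";
  if (input.hasComputeAccess) return "shadow";
  if (bucketOf(input.projectId) < input.shadowPercentage) return "shadow";
  return "skip";
}
```

What each mode does downstream (for example, whether a failed template creation fails the deploy) is outside this sketch.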

Shared compute package (internal-packages/compute/)

  • Gateway client with namespace-based API (instances, templates, snapshots)
  • Zod schemas for all gateway request/response types

Database

  • COMPUTE variant added to TaskRunCheckpointType enum
  • WorkloadType enum and column on WorkerInstanceGroup
  • hasComputeAccess feature flag

Env / config

  • Compute gateway URL, auth token, timeout
  • Snapshot enable flag, delay, dispatch limit
  • Dedicated OTLP endpoint for compute spans (COMPUTE_TRACE_OTLP_ENDPOINT)
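As an illustration of what a dispatch limit such as COMPUTE_SNAPSHOT_DISPATCH_LIMIT could bound, here is a minimal concurrency limiter. It is a sketch only: the class name `DispatchLimiter` is hypothetical, and whether the real service queues or rejects over-limit dispatches is not stated in this thread. This version queues.

```typescript
// Hypothetical limiter: at most `limit` tasks run at once, extra tasks wait in FIFO order.
class DispatchLimiter {
  private readonly limit: number;
  private active = 0;
  private readonly waiting: Array<() => void> = [];

  constructor(limit: number) {
    this.limit = Math.max(1, Math.floor(limit));
  }

  private acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active += 1;
      return Promise.resolve();
    }
    // Wait until a finishing task hands its slot over.
    return new Promise<void>((resolve) => this.waiting.push(resolve));
  }

  private release(): void {
    const next = this.waiting.shift();
    if (next) {
      next(); // slot passes directly to the next waiter, so `active` is unchanged
    } else {
      this.active -= 1;
    }
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}
```

Handing the slot directly to the next waiter, instead of decrementing and re-incrementing the counter, avoids a window in which a newly arriving caller could exceed the limit.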

Add a third WorkloadManager implementation that creates sandboxes via
the compute gateway HTTP API (POST /api/sandboxes). Uses native fetch
with no new dependencies. Enabled by setting COMPUTE_GATEWAY_URL, which
takes priority over Kubernetes and Docker providers.

The fetch() call had no timeout, causing infinite hangs when the gateway
accepted requests but never returned responses. Adds AbortSignal.timeout
(30s) and consolidates all logging into a single structured event per
create() call with timing, status, and error context.
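The timeout fix described in this commit can be sketched as follows. The function name `postJsonWithTimeout` and the injected `FetchLike` type are assumptions made so the example runs without a network; only `AbortSignal.timeout` and the 30s default come from the commit message.

```typescript
// Minimal structural type so the sketch does not depend on a specific fetch typing.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string; signal: AbortSignal }
) => Promise<{ ok: boolean; status: number }>;

// Hypothetical wrapper: every request carries a deadline, so a gateway that
// accepts the connection but never answers cannot hang the caller forever.
async function postJsonWithTimeout(
  fetchImpl: FetchLike,
  url: string,
  payload: unknown,
  timeoutMs = 30_000
): Promise<{ ok: boolean; status: number }> {
  return fetchImpl(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(payload),
    // Aborts with a DOMException named "TimeoutError" once timeoutMs elapses.
    signal: AbortSignal.timeout(timeoutMs),
  });
}
```

A caller would likely pass the runtime's global `fetch` as `fetchImpl` and check for `error.name === "TimeoutError"` to distinguish a gateway timeout from other failures.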

Emit a single canonical log line in a finally block instead of scattered
log calls at each early return. Adds business context (envId, envType,
orgId, projectId, deploymentVersion, machine) and instanceName to the
event. Always emits at info level with ok=true/false for queryability.
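The canonical-log-line pattern from this commit might look roughly like the sketch below. The function name `createWithWideEvent`, the injected `emit` callback, and the `durationMs` and `error` field names are assumptions; the `ok` and `instanceName` fields and the use of a `finally` block come from the commit message.

```typescript
type WideEvent = Record<string, unknown>;

// Hypothetical wrapper: one event object is filled in along every code path
// and emitted exactly once, in `finally`, whether create succeeds or fails.
async function createWithWideEvent(
  context: WideEvent,
  doCreate: () => Promise<{ instanceName: string }>,
  emit: (event: WideEvent) => void
): Promise<string | undefined> {
  const startedAt = Date.now();
  const event: WideEvent = { ...context, ok: false };
  try {
    const result = await doCreate();
    event["instanceName"] = result.instanceName;
    event["ok"] = true;
    return result.instanceName;
  } catch (error) {
    event["error"] = error instanceof Error ? error.message : String(error);
    return undefined;
  } finally {
    // Runs after both the success return and the catch return.
    event["durationMs"] = Date.now() - startedAt;
    emit(event);
  }
}
```

Because the event is always emitted at the same level with `ok` set either way, success and failure rates can be queried from a single event stream instead of being reconstructed from several log lines.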

Pass business context (runId, envId, orgId, projectId, machine, etc.)
as metadata on CreateSandboxRequest instead of relying on env vars.
This enables wide event logging in the compute stack without parsing
env or leaking secrets.

Passes machine preset cpu and memory as top-level fields on the
CreateSandboxRequest so the compute stack can use them for admission
control and resource allocation.
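The two commits above describe a request that carries business context as metadata and resource sizes as top-level fields. A rough sketch of such a payload builder follows; the interface names, the `image` field, and the unit of `memory` are assumptions, and the real CreateSandboxRequest schema is not shown in this thread.

```typescript
// Hypothetical shapes for illustration only.
interface MachinePreset {
  name: string;
  cpu: number;
  memory: number; // unit is an assumption; the thread does not state it
}

interface RunContext {
  runId: string;
  envId: string;
  orgId: string;
  projectId: string;
  machine: MachinePreset;
}

interface CreateRequestSketch {
  image: string;
  cpu: number;
  memory: number;
  metadata: Record<string, string>;
}

function buildCreateRequest(image: string, ctx: RunContext): CreateRequestSketch {
  return {
    image,
    // Top-level so the compute stack can use them for admission control.
    cpu: ctx.machine.cpu,
    memory: ctx.machine.memory,
    // Identifiers only: no tokens or secrets travel in metadata.
    metadata: {
      runId: ctx.runId,
      envId: ctx.envId,
      orgId: ctx.orgId,
      projectId: ctx.projectId,
      machine: ctx.machine.name,
    },
  };
}
```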

Thread timing context from queue consumer through to the compute
workload manager's wide event:

- dequeueResponseMs: platform dequeue HTTP round-trip
- pollingIntervalMs: which polling interval was active (idle vs active)
- warmStartCheckMs: warm start check duration

All fields are optional to avoid breaking existing consumers.

- Fix instance creation URL from /api/sandboxes to /api/instances
- Pass name: runnerId when creating compute instances
- Add snapshot(), deleteInstance(), and restore() methods to ComputeWorkloadManager
- Add /api/v1/compute/snapshot-complete callback endpoint to WorkloadServer
- Handle suspend requests in compute mode via fire-and-forget snapshot with callback
- Handle restore in compute mode by calling gateway restore API directly
- Wire computeManager into WorkloadServer for compute mode suspend/restore

…re request

Restore calls now send a request body with the runner name, env override metadata,
cpu, and memory so the agent can inject them before the VM resumes. The runner
fetches these overrides from TRIGGER_METADATA_URL at restore time.

runnerId is derived per restore cycle as runner-{runIdShort}-{checkpointSuffix},
matching iceman's pattern.
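The naming rule and restore body described above can be sketched as below. How `runIdShort` and `checkpointSuffix` are derived is not stated in this thread, so both are taken as inputs; the `RestoreRequestSketch` shape and its field names are assumptions.

```typescript
// Name format comes from the commit message: runner-{runIdShort}-{checkpointSuffix}.
function deriveRunnerId(runIdShort: string, checkpointSuffix: string): string {
  return `runner-${runIdShort}-${checkpointSuffix}`;
}

// Hypothetical restore body; field names are assumptions.
interface RestoreRequestSketch {
  name: string;
  metadata: Record<string, string>; // env overrides the agent injects before resume
  cpu: number;
  memory: number;
}

function buildRestoreRequest(
  runIdShort: string,
  checkpointSuffix: string,
  envOverrides: Record<string, string>,
  cpu: number,
  memory: number
): RestoreRequestSketch {
  return {
    name: deriveRunnerId(runIdShort, checkpointSuffix),
    metadata: { ...envOverrides },
    cpu,
    memory,
  };
}
```

Deriving a fresh name per restore cycle means a restored VM never reuses the runner name of the instance it was snapshotted from.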

Gates snapshot/restore behaviour independently of compute mode.
When disabled, VMs won't receive the metadata URL and suspend/restore
are no-ops. Defaults to off so compute mode can be used without snapshots.

changeset-bot bot commented Feb 23, 2026

🦋 Changeset detected

Latest commit: 5b188a5

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 29 packages
Name Type
trigger.dev Patch
d3-chat Patch
references-d3-openai-agents Patch
references-nextjs-realtime Patch
references-realtime-hooks-test Patch
references-realtime-streams Patch
references-telemetry Patch
@trigger.dev/build Patch
@trigger.dev/core Patch
@trigger.dev/python Patch
@trigger.dev/react-hooks Patch
@trigger.dev/redis-worker Patch
@trigger.dev/rsc Patch
@trigger.dev/schema-to-json Patch
@trigger.dev/sdk Patch
@trigger.dev/database Patch
@trigger.dev/otlp-importer Patch
@internal/cache Patch
@internal/clickhouse Patch
@internal/llm-model-catalog Patch
@internal/redis Patch
@internal/replication Patch
@internal/run-engine Patch
@internal/schedule-engine Patch
@internal/testcontainers Patch
@internal/tracing Patch
@internal/tsql Patch
@internal/zod-worker Patch
@internal/sdk-compat-tests Patch


coderabbitai bot commented Feb 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds end-to-end compute support: a new internal package @internal/compute (client, types, imageRef), supervisor compute workload manager and wiring (create/snapshot/restore), OTLP trace payload/dispatch, timer-wheel-based delayed snapshot orchestration and HTTP callback route, environment schema extensions, webapp compute template creation service with feature-flag and rollout logic, a DB migration adding WorkloadType and WorkerInstanceGroup.workloadType, propagation of dequeue/polling timing through the run queue, a CLI local-build --load behavior fix, and new tests and logging verbosity adjustments.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 37.50%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check: ✅ Passed. The title 'feat(supervisor): compute workload manager' accurately and clearly describes the main change: adding a ComputeWorkloadManager to the supervisor component.
  • Description check: ✅ Passed. The PR description provides comprehensive technical details about ComputeWorkloadManager, checkpoint/restore support, OTel integration, and template pre-warming, with clear sections outlining major changes across supervisor, webapp, compute package, database, and env/config.

…nabled

Remove the silent `localhost` fallback for the snapshot callback URL,
which would be unreachable from external compute gateways. Add env
validation and a runtime guard matching the existing metadata URL pattern.
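The guard described in this commit could be sketched as follows. The environment variable names `EXAMPLE_SNAPSHOTS_ENABLED` and `EXAMPLE_WORKLOAD_API_URL` are placeholders invented for this example, since the real keys are not shown in this thread; the callback path is the one named in an earlier commit.

```typescript
const SNAPSHOT_COMPLETE_PATH = "/api/v1/compute/snapshot-complete";

// Hypothetical resolver: returns undefined when snapshots are off, and fails
// loudly instead of silently falling back to localhost when they are on.
function resolveSnapshotCallbackUrl(env: Record<string, string | undefined>): string | undefined {
  if (env["EXAMPLE_SNAPSHOTS_ENABLED"] !== "true") return undefined;

  const base = env["EXAMPLE_WORKLOAD_API_URL"];
  if (!base) {
    throw new Error(
      "Snapshots are enabled but no externally reachable callback base URL is configured"
    );
  }
  // new URL() throws a TypeError if `base` is not a valid absolute URL.
  return new URL(SNAPSHOT_COMPLETE_PATH, base).toString();
}
```

Failing at startup or at first use surfaces the misconfiguration immediately, whereas a localhost fallback would only show up later as snapshot callbacks that never arrive.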

nicktrn commented Mar 29, 2026

ready


2 participants