[core] [world] Lazy run creation on start #1537
Conversation
🦋 Changeset detected. Latest commit: 90b9a7f. The changes in this PR will be included in the next version bump. This PR includes changesets to release 20 packages.
📊 Benchmark Results

Benchmarks ran in 💻 Local Development and ▲ Production (Vercel), with 🔍 Observability links for Next.js (Turbopack), Express, and Nitro, across these scenarios:

- workflow with no steps
- workflow with 1 step
- workflow with 10 / 25 / 50 sequential steps
- Promise.all with 10 / 25 / 50 concurrent steps
- Promise.race with 10 / 25 / 50 concurrent steps
- workflow with 10 / 25 / 50 sequential data payload steps (10KB)
- workflow with 10 / 25 / 50 concurrent data payload steps (10KB)
- Stream Benchmarks (includes TTFB metrics): workflow with stream; stream pipeline with 5 transform steps (1MB); 10 parallel streams (1MB each); fan-out fan-in 10 streams (1MB each)

Summary: fastest framework by world and fastest world by framework, with the winner determined by most benchmark wins.
🧪 E2E Test Results: ❌ Some tests failed

❌ Failed Tests: 🌍 Community Worlds (63 failed)
- mongodb: 3 failed
- redis: 3 failed
- turso: 57 failed

Details by Category:
- ✅ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ❌ 🌍 Community Worlds
- ✅ 📋 Other
```typescript
const encodedInput =
  workflowArguments instanceof Uint8Array
    ? Array.from(workflowArguments)
    : workflowArguments;
```
Concrete suggestion for the queue transport issue: @vercel/queue offers a BufferTransport out of the box. The fix here is:
- CBOR-encode the WorkflowInvokePayload before dispatch
- Use BufferTransport instead of the default JsonTransport
- CBOR-decode on the receive side
This eliminates the entire encode/decode problem — Uint8Array values survive CBOR natively. No base64, no Array.from, no discriminant fields. The input field just carries the raw Uint8Array through transparently.
This would also be a net improvement for all queue messages (not just resilient start), since CBOR is more compact than JSON for binary-heavy payloads.
Implemented. We went with a custom CborTransport (rather than BufferTransport) so the transport handles both encode and decode internally — call sites pass plain objects, the handler receives decoded objects. See world-vercel/queue.ts for the implementation. world-local and world-postgres use TypedJsonTransport with tagged envelopes since they don't have VQS's BufferTransport available.
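The self-contained shape described here (call sites pass plain objects, the handler receives decoded objects) might look roughly like the sketch below. The `Transport` interface and class names are illustrative rather than the actual @vercel/queue API, and Node's built-in `v8` serializer stands in for a CBOR codec, since both round-trip Uint8Array values natively:

```typescript
import { serialize, deserialize } from 'node:v8';

// Hypothetical transport contract, mirroring the serialize/deserialize
// split described in the thread; names are illustrative.
interface Transport<T> {
  serialize(message: T): Buffer;
  deserialize(raw: Buffer): T;
}

// Binary-safe transport: encode/decode live inside the transport, so call
// sites pass plain objects and handlers receive decoded objects. The real
// CborTransport uses CBOR; v8.serialize stands in here because it also
// preserves Uint8Array without base64 or tagged envelopes.
class BinaryTransport<T> implements Transport<T> {
  serialize(message: T): Buffer {
    return serialize(message);
  }
  deserialize(raw: Buffer): T {
    return deserialize(raw) as T;
  }
}

const transport = new BinaryTransport<{ input: Uint8Array }>();
const wire = transport.serialize({ input: new Uint8Array([1, 2, 3]) });
const decoded = transport.deserialize(wire);
console.log(decoded.input instanceof Uint8Array); // true
```

The key design point is the same one the review lands on: the codec is an internal detail of the transport, so no caller can forget to encode.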
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
…ts.js' causes TS2300 build failure.
This commit fixes the issue reported at packages/core/src/runtime.ts:18
**Bug:** Lines 18-22 of `packages/core/src/runtime.ts` contain two separate import statements from `'./runtime/constants.js'`:
1. Line 18: `import { MAX_QUEUE_DELIVERIES } from './runtime/constants.js';`
2. Lines 19-22: `import { MAX_QUEUE_DELIVERIES, REPLAY_TIMEOUT_MS } from './runtime/constants.js';`
Both import `MAX_QUEUE_DELIVERIES`, creating a duplicate identifier. This is a merge conflict artifact - the first import was from the PR's reordering of imports, and the second was added by a merge from main that introduced `REPLAY_TIMEOUT_MS`. The TypeScript compiler rejects this with error TS2300 on both lines 18 and 20, causing the `@workflow/core` package build to fail, which in turn fails the entire Vercel deployment.
The build log confirms:
```
src/runtime.ts(18,10): error TS2300: Duplicate identifier 'MAX_QUEUE_DELIVERIES'.
src/runtime.ts(20,3): error TS2300: Duplicate identifier 'MAX_QUEUE_DELIVERIES'.
```
**Fix:** Consolidated the two import statements into a single import:
```typescript
import { MAX_QUEUE_DELIVERIES, REPLAY_TIMEOUT_MS } from './runtime/constants.js';
```
Both symbols are used in the file - `MAX_QUEUE_DELIVERIES` for queue delivery limits and `REPLAY_TIMEOUT_MS` on lines 174, 185, and 198 for replay timeout configuration.
Co-authored-by: Vercel <vercel[bot]@users.noreply.github.com>
Co-authored-by: VaguelySerious <mittgfu@gmail.com>
Replace stock JsonTransport with a custom transport that encodes
Uint8Array values as { __type: 'Uint8Array', data: '<base64>' }
during JSON serialization. Without this, runInput.input (a Uint8Array)
gets corrupted to a plain object when sent through the postgres queue,
causing 'Invalid input' errors in the resilient start e2e test.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
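The tagged-envelope encoding described in that commit can be sketched as a JSON replacer/reviver pair. The helper names are hypothetical, but the `{ __type: 'Uint8Array', data: '<base64>' }` envelope follows the commit message:

```typescript
import { Buffer } from 'node:buffer';

// Sketch of the tagged-envelope encoding: Uint8Array values become
// { __type: 'Uint8Array', data: '<base64>' } during JSON serialization
// and are revived on parse. Helper names are illustrative.
function stringifyWithBinary(value: unknown): string {
  return JSON.stringify(value, (_key, v) =>
    v instanceof Uint8Array
      ? { __type: 'Uint8Array', data: Buffer.from(v).toString('base64') }
      : v
  );
}

function parseWithBinary<T>(json: string): T {
  return JSON.parse(json, (_key, v) =>
    v && typeof v === 'object' && v.__type === 'Uint8Array'
      ? new Uint8Array(Buffer.from(v.data, 'base64'))
      : v
  ) as T;
}

const roundTripped = parseWithBinary<{ input: Uint8Array }>(
  stringifyWithBinary({ input: new Uint8Array([7, 8, 9]) })
);
console.log(roundTripped.input instanceof Uint8Array); // true
```

Without the envelope, plain JSON.stringify turns a Uint8Array into an index-keyed object, which is exactly the corruption the commit describes.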
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>

# Conflicts:
#   packages/core/src/runtime/start.test.ts
#   packages/world-local/src/storage.test.ts
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
# Conflicts:
#   packages/core/e2e/e2e.test.ts
…t duplicate events

The normal run_created path used writeJSON (fs.access + temp+rename), which has a TOCTOU race with the resilient start path's writeExclusive. On the local world, both events.create(run_created) and events.create(run_started) run concurrently in the same event loop. Both could pass the existence check simultaneously, resulting in two run_created events — causing "Unconsumed event in event log" errors during replay.

Switch the normal run_created entity write to writeExclusive (O_CREAT|O_EXCL) so exactly one writer wins atomically. Fixes consistent Windows CI failures in world-testing embedded tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
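The exclusive-create write described in that commit can be sketched in Node, where the `'wx'` flag maps to O_CREAT|O_EXCL. The function name and boolean return convention are illustrative, not the actual world-local API:

```typescript
import { writeFile } from 'node:fs/promises';

// Sketch of an exclusive entity write: with flag 'wx' (O_CREAT | O_EXCL),
// the write fails with EEXIST if the file already exists, so when two
// writers race on the same path, exactly one wins atomically — no
// check-then-write (TOCTOU) window.
async function writeExclusive(path: string, data: string): Promise<boolean> {
  try {
    await writeFile(path, data, { flag: 'wx' });
    return true; // this writer created the entity
  } catch (err: any) {
    if (err.code === 'EEXIST') {
      return false; // another writer already created it
    }
    throw err;
  }
}
```

Two concurrent calls against the same path then yield exactly one `true`, which is what prevents the duplicate run_created event.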
…okups
The Vite-builder workbenches (astro, sveltekit) have committed step.js
bundles that register builtins with bare names (e.g. "__builtin_response_text").
The workflow VM looks them up with full IDs from builtinStepId(). The previous
suffix match (endsWith("//{name}")) missed bare-name registrations since they
don't contain "//".
Fix by also checking for exact bare-name matches, and extracting the function
name from fully-qualified step IDs before matching. Also expand the builtin
allowlist to cover start() and Run.* methods.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
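The fixed lookup might look like the sketch below. The registry shape and function names are assumptions, but the matching order (exact full-ID match, then a bare-name fallback extracted from the "…//name" suffix) follows the commit message:

```typescript
// Sketch of the builtin lookup described above. Registrations may use a
// fully-qualified ID ("…//name") or a bare name (as in the committed
// Vite-builder step.js bundles); the VM looks up by full ID, so we fall
// back to the extracted function name. Names here are illustrative.
function resolveStep(
  registry: Map<string, Function>,
  stepId: string
): Function | undefined {
  // Exact match on the fully-qualified ID.
  const exact = registry.get(stepId);
  if (exact) return exact;

  // Extract the bare function name from "…//name" style IDs.
  const sep = stepId.lastIndexOf('//');
  const bareName = sep === -1 ? stepId : stepId.slice(sep + 2);

  // Bare-name registration, e.g. "__builtin_response_text" registered
  // without any "//" prefix.
  return registry.get(bareName);
}
```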
…ll-ID lookups"

This reverts commit 64797f1.
CborTransport was a pass-through wrapper — serialize() was an identity function and deserialize() returned raw Buffers. The actual CBOR encode/decode happened at call sites (queue() pre-encoded, the handler post-decoded). This violated the transport abstraction and required callers to remember to handle encoding.

Move encode()/decode() into CborTransport.serialize()/deserialize() so the transport is self-contained, matching TypedJsonTransport (world-local) and the inline transport (world-postgres). Call sites now pass plain objects; the handler receives decoded objects.

Also update follow-up items in resilient-start.mdx:
- Mark Local Prod flakiness as resolved
- Close the events optimization for re-enqueue (won't-do: unsafe with at-least-once delivery)
- Mark the CborTransport refactor as done

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TooTallNate
left a comment
Re-reviewing the latest state: most of the previous blockers are resolved (changeset is patch, binary-safe queue transports are in place, eventData ||= eventDataJson fallback exists, and EntityConflictError handling is restored).
I found one remaining correctness issue that should be fixed before merge:
- Blocking: resilient start can lose the caller's requested specVersion.

start() includes runInput.specVersion in the queue payload, but workflowEntrypoint still always emits run_started with specVersion: SPEC_VERSION_CURRENT. If run_created fails and the run is created through the resilient path, world-local/world-postgres use the event's specVersion (current), not the one originally requested by start().
This breaks the contract for callers explicitly using legacy spec versions.
Suggested direction:
- In workflowEntrypoint, when runInput is present, set run_started.specVersion from runInput.specVersion (validated); otherwise default to current.
- Keep runInput schema validation strict (specVersion required) and rely on existing world storage logic that uses effectiveSpecVersion from the event.
```diff
@@ -258,6 +250,20 @@ export function workflowEntrypoint(
   {
     eventType: 'run_started',
     specVersion: SPEC_VERSION_CURRENT,
```
Blocking: This always uses SPEC_VERSION_CURRENT, even when runInput is present and includes its own specVersion.
In resilient start, run_created may fail and the run gets created from run_started in the world layer. Both world-local and world-postgres resilient creation paths use effectiveSpecVersion from the event (data.specVersion), so this line effectively forces the run to current spec version.
That means a caller doing start(..., { specVersion: 1 }) can be silently upgraded to current spec when resilient start kicks in.
Please use runInput.specVersion when runInput is present, and only default to SPEC_VERSION_CURRENT when it's absent.
workflowEntrypoint hardcoded specVersion: SPEC_VERSION_CURRENT on run_started events. When the resilient start path creates the run from run_started (because run_created failed), the run was always created with the current spec version, ignoring the version originally requested by start(). This breaks callers using legacy spec versions.

Use runInput.specVersion (carried through the queue from start()) when available, falling back to SPEC_VERSION_CURRENT for re-enqueue cycles where runInput is absent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
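A minimal sketch of that fallback, assuming SPEC_VERSION_CURRENT is 2 purely for illustration and reducing the run_started construction to a helper:

```typescript
// Illustrative value; the real constant lives in @workflow/core.
const SPEC_VERSION_CURRENT = 2;

// Shape of the resilient-start payload carried through the queue;
// fields other than specVersion are omitted here.
interface RunInput {
  specVersion: number;
}

// Preserve the caller-requested spec version when the resilient-start
// payload is present; default to current for re-enqueue cycles where
// runInput is absent. Helper name is hypothetical.
function runStartedSpecVersion(runInput?: RunInput): number {
  return runInput?.specVersion ?? SPEC_VERSION_CURRENT;
}

console.log(runStartedSpecVersion({ specVersion: 1 })); // 1
console.log(runStartedSpecVersion()); // 2
```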
TooTallNate
left a comment
Re-reviewed after 90b9a7f5 — the remaining blocker is resolved.
workflowEntrypoint now uses runInput?.specVersion ?? SPEC_VERSION_CURRENT when creating run_started, so resilient-start run creation preserves the caller-requested spec version instead of always forcing current. That fixes the legacy specVersion drift issue I flagged.
Given that, LGTM.
Minor follow-up (non-blocking): adding a regression test for the resilient path with start(..., { specVersion: 1 }) would be great, so this behavior stays locked in.
This PR makes start resilient: as long as the queue is up, the world storage layer being down will not affect run creations, only defer them. This works by sending the run creation payload into the queue and allowing the runtime to retry creating the run it was invoked for.
See the live docs
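The flow the description outlines can be sketched with hypothetical queue and storage interfaces; none of these names are the actual @workflow/core API:

```typescript
// Illustrative sketch of resilient start; interfaces are assumptions.
interface RunPayload {
  runId: string;
  specVersion: number;
  input: Uint8Array;
}

interface World {
  queue(message: unknown): Promise<void>;      // assumed always available
  createRun(input: RunPayload): Promise<void>; // may be down
  runExists(runId: string): Promise<boolean>;
}

// start(): enqueue the full run-creation payload first, then attempt the
// storage write. If storage is down, the run is deferred, not lost.
async function start(world: World, payload: RunPayload): Promise<string> {
  await world.queue({ kind: 'invoke', runInput: payload });
  try {
    await world.createRun(payload);
  } catch {
    // Storage down: the queued payload lets the entrypoint create the
    // run lazily when the message is delivered.
  }
  return payload.runId;
}

// Entrypoint: lazily create the run from the queued payload if it is
// still missing when the invocation arrives.
async function workflowEntrypoint(world: World, msg: { runInput: RunPayload }) {
  if (!(await world.runExists(msg.runInput.runId))) {
    await world.createRun(msg.runInput);
  }
  // ...proceed to execute the workflow
}
```

The invariant this sketch demonstrates is the one the PR claims: queue availability, not storage availability, decides whether a run is eventually created.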