fix: ExponentialBackoff execute() bypasses maxElapsed timeout by deepshekhardas · Pull Request #3750 · triggerdotdev/trigger.dev

deepshekhardas · 2026-05-26T07:05:01Z

Fixes issue #3726.

The execute() method tracked elapsedMs but never checked it against maxElapsed, allowing the retry loop to continue indefinitely regardless of the configured time limit. Added check after each retry attempt.

…entry (triggerdotdev#2920)

…t build server failures (triggerdotdev#2913)

…erve log chain (triggerdotdev#2900)

…erdotdev#2911)

)

- Include reproduction scripts for Sentry (triggerdotdev#2900) and engine strictness (triggerdotdev#2913) - Include PR body drafts for consolidated tracking

When the underlying logical-replication client errored (e.g. after a Postgres failover), the runs and sessions replication services logged the error and left the stream stopped. The host process kept running, the WAL backed up, and ClickHouse silently fell behind. Both services now run a configurable recovery strategy on stream errors, defaulting to in-process reconnect with exponential backoff so a fresh self-hosted setup heals on its own: - "reconnect" (default) re-subscribes via the existing subscribe(lastLsn) path with exponential backoff (1s -> 60s cap, unlimited attempts), which re-validates the publication, re-acquires the leader lock, and resumes from the last acknowledged LSN. - "exit" calls process.exit after a short flush window so a host's supervisor (Docker restart=always, systemd, k8s, etc.) can replace the process. - "log" preserves the historical behaviour. Per-service strategy + exit knobs are env-driven via RUN_REPLICATION_ERROR_STRATEGY / SESSION_REPLICATION_ERROR_STRATEGY plus matching *_EXIT_DELAY_MS / *_EXIT_CODE. Reconnect tuning is shared across both services via REPLICATION_RECONNECT_INITIAL_DELAY_MS / _MAX_DELAY_MS / _MAX_ATTEMPTS (0 = unlimited).

Addresses PR review feedback: - LogicalReplicationClient.subscribe() can throw before its internal "error" listener is wired up (notably when pg client.connect() fails mid-failover). The reconnect strategy's catch block only logged, so recovery silently stopped. Now also calls scheduleReconnect(err) — the pendingReconnect guard makes it idempotent if an error event was also emitted. - Reject negative values for the new replication-recovery env vars and cap exit codes at 255. - Convert the new ReplicationErrorRecovery{Deps,} interfaces to type aliases to match the repo's TypeScript style. - Tighten the reconnect dep comment to drop a stale "lastAcknowledgedLsn" reference (the wrapper-tracked resume LSN is what callers actually pass). - Restore process.exit after service.shutdown() in the exit-strategy test so a delayed exit timer can't terminate the test worker.

LogicalReplicationClient.subscribe() can resolve without throwing or emitting an "error" event when leader-lock acquisition fails — it just calls this.stop() and returns. The reconnect callback now checks isStopped after subscribe() and throws so the recovery handler can schedule the next attempt instead of silently giving up.

…rough handle() The previous post-subscribe() isStopped check was always true on the happy path: subscribe() calls stop() up front (setting _isStopped=true) and only resets the flag inside the replicationStart event, which fires asynchronously after subscribe() returns. So the check threw on every successful reconnect, the catch rescheduled, the next attempt tore down the just-built client, and the cycle continued — replication briefly worked between teardowns, which is why the integration test passed. Replace it with the correct nudge: subscribe to leaderElection and call the recovery handler on isLeader=false. That's the only subscribe() exit path that doesn't either throw or emit an "error" event (the other silent-return paths emit "error" first via createPublication/createSlot failures).

The previous commit routed leaderElection(false) through handle(), which under the exit strategy schedules process.exit. In a multi-instance deployment that turns lost leader election — a normal operational state — into a restart loop: exit, supervisor restarts, election fails again, exit, and so on. Add a dedicated notifyLeaderElectionLost() on ReplicationErrorRecovery that the reconnect strategy treats as another retry trigger, while exit and log strategies no-op. Wire the wrapper services through the new method.

fix(webapp): auto-recover replication services after stream errors

The execute() method tracked elapsedMs across retries but never checked it against maxElapsed, allowing the retry loop to continue indefinitely regardless of the configured time limit.

changeset-bot · 2026-05-26T07:05:10Z

🦋 Changeset detected

Latest commit: bf858f4

The changes in this PR will be included in the next version bump.

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

github-actions · 2026-05-26T07:05:17Z

Hi @deepshekhardas, thanks for your interest in contributing!

This project requires that pull request authors are vouched, and you are not in the list of vouched users.

This PR will be closed automatically. See https://github.com/triggerdotdev/trigger.dev/blob/main/CONTRIBUTING.md for more details.

coderabbitai · 2026-05-26T07:07:19Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 66a52d98-10fb-46c1-b60e-b63bb5500aa4

📥 Commits

Reviewing files that changed from the base of the PR and between 37eeaa3 and bf858f4.

📒 Files selected for processing (29)

.changeset/fix-console-interceptor-2900.md
.changeset/fix-docker-hub-rate-limit-2911.md
.changeset/fix-github-install-node-version-2913.md
.changeset/fix-orphaned-workers-2909.md
.changeset/fix-sentry-oom-2920.md
.server-changes/replication-error-recovery.md
apps/webapp/app/env.server.ts
apps/webapp/app/services/replicationErrorRecovery.server.ts
apps/webapp/app/services/runsReplicationInstance.server.ts
apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/app/services/sessionsReplicationInstance.server.ts
apps/webapp/app/services/sessionsReplicationService.server.ts
apps/webapp/test/runsReplicationService.errorRecovery.test.ts
consolidated_pr_body.md
packages/cli-v3/src/cli/common.ts
packages/cli-v3/src/commands/deploy.ts
packages/cli-v3/src/commands/dev.ts
packages/cli-v3/src/commands/login.ts
packages/cli-v3/src/commands/update.test.ts
packages/cli-v3/src/commands/update.ts
packages/cli-v3/src/deploy/buildImage.ts
packages/cli-v3/src/entryPoints/dev-index-worker.ts
packages/cli-v3/src/entryPoints/dev-run-worker.ts
packages/cli-v3/src/entryPoints/managed-index-worker.ts
packages/cli-v3/src/entryPoints/managed-run-worker.ts
packages/cli-v3/src/utilities/sourceMaps.test.ts
packages/cli-v3/src/utilities/sourceMaps.ts
packages/core/src/v3/apps/backoff.ts
packages/core/src/v3/consoleInterceptor.ts

Walkthrough

This PR consolidates eight bug fixes and replication improvements. The primary addition is a configurable error-recovery system for replication stream failures in both runs and sessions services, supporting three strategies: reconnect with exponential backoff (default), exit for supervisor restart, or log-only. The PR also adds a CLI ignoreEngines option to bypass Node version checks during deployment, implements Docker Hub authentication for native builds, extracts source-map configuration into a reusable utility supporting environment-driven modes, refactors the dev command to use SIGINT/SIGTERM handlers for cleanup, fixes ConsoleInterceptor to delegate to original console methods, and enforces maxElapsed time limits in exponential backoff retry loops.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

⚔️ Resolve merge conflicts

Resolve merge conflict in branch fix/3726-backoff-maxelapsed

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

devin-ai-integration

Devin Review found 4 potential issues.

View 6 additional findings in Devin Review.

devin-ai-integration · 2026-05-26T07:11:42Z

@@ -260,8 +261,7 @@ export async function updateTriggerPackages(
    await installDependencies({ cwd: projectPath, silent: true });


🔴 ignoreEngines option added but never consumed — engine-strict flags never passed to installDependencies

The ignoreEngines option was added to CommonCommandOptions and UpdateCommandOptions, and deploy.ts:262 passes { ...options, ignoreEngines: true } to updateTriggerPackages. However, the installDependencies call at packages/cli-v3/src/commands/update.ts:261 is still await installDependencies({ cwd: projectPath, silent: true }) — it never reads options.ignoreEngines, never computes the package-manager-specific args (--no-engine-strict for npm, --config.engine-strict=false for pnpm, --ignore-engines for yarn), and never passes an args property. The stated fix for issue #2913 ("Ignore engine checks during deployment install phase") is completely non-functional. The tests in update.test.ts assert on args that the implementation never provides, so they would fail as well.

Prompt for agents

In packages/cli-v3/src/commands/update.ts, the `installDependencies` call at line 261 needs to compute and pass engine-strictness flags based on `options.ignoreEngines` and the detected package manager. The `packageManager` is already detected at line 254 via `detectPackageManager(projectPath)`. When `options.ignoreEngines` is true, the code should compute an `args` array based on `packageManager.name`: - npm: ["--no-engine-strict"] - pnpm: ["--config.engine-strict=false"] - yarn: ["--ignore-engines"] - default/unknown: [] When `options.ignoreEngines` is false or undefined, `args` should be `[]`. Then pass it to installDependencies: `await installDependencies({ cwd: projectPath, silent: true, args })`. The tests in update.test.ts already assert the expected behavior — the implementation just needs to match.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-05-26T07:11:44Z

    await devInstance.waitUntilExit();
  } finally {
-    await watcher?.stop();
+    process.off("SIGINT", signalHandler);
+    process.off("SIGTERM", signalHandler);
+    await cleanup();
  }


🔴 devCommand cleanup in finally block immediately kills the dev session because waitUntilExit is a no-op

In the new devCommand(), the finally block calls cleanup() which calls devInstance.stop() — this stops the inner DevSessionInstance, the config watcher, and removes the lock file. Since waitUntilExit is defined as async () => { } at packages/cli-v3/src/commands/dev.ts:312 (a no-op that resolves immediately), the finally block executes immediately after startDev completes. This stops the entire dev session right after starting it.

In the old code (git show 6c9f1f19:packages/cli-v3/src/commands/dev.ts), the finally block only called watcher?.stop(), which stopped just the config file watcher — the dev session (IPC connections, child processes, file watchers) stayed alive and kept the Node.js event loop running. The new code calls devInstance.stop() which stops everything, causing trigger dev to start and immediately exit. The SIGINT/SIGTERM signal handlers registered at lines 209-210 are also removed in the finally block before they can ever fire.

Prompt for agents

The core problem is that `waitUntilExit` (defined in startDev at line 312) is a no-op, so the finally block at lines 214-218 runs immediately after the dev session starts, calling cleanup() which stops the dev session. The old code's finally block only called `watcher?.stop()`, leaving the dev session alive. The new code calls `devInstance.stop()` which kills everything. There are two possible approaches: 1. Make `waitUntilExit` an actual blocking promise that only resolves when a signal is received. For example, return a promise that resolves from the signal handler. This way the finally block only runs after the signal handler has done cleanup. 2. Don't call `cleanup()` in the finally block. Only call it from the signal handlers. The finally block should only remove the signal handlers (to avoid double-cleanup). The old code's pattern of only stopping the watcher in finally worked because the dev session stayed alive. Approach 1 is cleaner — the signal handler should resolve the waitUntilExit promise (after cleanup), and the finally block removes the handlers as a safety measure.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-05-26T07:11:45Z

+        switch (severityNumber) {
+          case SeverityNumber.INFO:
+            this.originalConsole.log(...args);
+            break;
+          case SeverityNumber.WARN:
+            this.originalConsole.warn(...args);
+            break;
+          case SeverityNumber.ERROR:
+            this.originalConsole.error(...args);
+            break;
+          case SeverityNumber.DEBUG:
+            this.originalConsole.debug(...args);
+            break;
+          default:
+            this.originalConsole.log(...args);
+            break;
+        }


🟡 console.info() calls are routed to original console.log() instead of original console.info()

Both log() and info() methods use SeverityNumber.INFO as their severity number. In the new switch statement at line 97, case SeverityNumber.INFO delegates to this.originalConsole.log(...args). This means when user code calls console.info(...), it flows through info() → #handleLog(SeverityNumber.INFO, ...) → this.originalConsole.log(...) instead of this.originalConsole.info(...). Interceptors like Sentry that separately patch console.info will not see these calls, partially defeating the fix's stated purpose of "preserving log chain when other interceptors are present". The fix needs to distinguish the originating method name rather than relying solely on severity number.

Prompt for agents

The issue is that both log() and info() use SeverityNumber.INFO, so the switch statement can't distinguish them. The fix needs to pass the originating method name (e.g., as an additional parameter to #handleLog) and use it to pick the correct originalConsole method. One approach: add a parameter like `methodName: 'log' | 'info' | 'warn' | 'error' | 'debug'` to #handleLog, then use it to dispatch: - log() calls #handleLog with methodName='log' - info() calls #handleLog with methodName='info' Then in the sendToStdIO block, use methodName to pick the right original method: this.originalConsole[methodName](...args) This avoids the severity-number ambiguity entirely.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-05-26T07:11:47Z

  });

  return service;


🚩 Sessions replication instance missing start()/shutdown signal handlers unlike runs instance

The sessionsReplicationInstance.server.ts creates the service and returns it without calling service.start() or registering SIGTERM/SIGINT handlers for shutdown. In contrast, runsReplicationInstance.server.ts:84-97 calls service.start() when RUN_REPLICATION_ENABLED === '1' and registers signal handlers. This asymmetry predates this PR (the sessions instance was already like this before). Presumably the sessions service is started/stopped elsewhere, but this is worth confirming since the new error recovery features won't activate if the service is never started.

(Refers to lines 78-81)

Was this helpful? React with 👍 or 👎 to provide feedback.

Deploy Bot and others added 20 commits February 2, 2026 16:16

fix(cli-v3): allow disabling source-map-support to prevent OOM with S…

ed41f0a

…entry (triggerdotdev#2920)

fix(cli-v3): ignore engine checks during deployment install to preven…

023c3fd

…t build server failures (triggerdotdev#2913)

fix(core): delegate to original console in ConsoleInterceptor to pres…

93aa053

…erve log chain (triggerdotdev#2900)

fix(cli-v3): authenticate to Docker Hub to prevent rate limits (trigg…

8b684e1

…erdotdev#2911)

fix(cli-v3): ensure worker cleanup on SIGINT/SIGTERM (triggerdotdev#2909

737ad56

)

verify: add reproduction scripts and PR details for all major fixes

c97cbcc

- Include reproduction scripts for Sentry (triggerdotdev#2900) and engine strictness (triggerdotdev#2913) - Include PR body drafts for consolidated tracking

verify: add reproduction scripts and PR details for all major fixes

aa90db9

- Include reproduction scripts for Sentry (triggerdotdev#2900) and engine strictness (triggerdotdev#2913) - Include PR body drafts for consolidated tracking

docs: add consolidated PR body description

f5ce2bc

chore: remove reproduction scripts and temporary files

8c986db

Merge remote-tracking branch 'remotes/origin/fix/sentry-oom-2920'

82f198f

Merge branch 'fix/issue-2909-orphaned-workers'

9a3e8d0

chore: remove reproduction scripts after verification

e101f8e

fix: resolve typecheck errors after merge

d01d438

Merge pull request #10 from deepshekhardas/pr/3613-replication-fix

d35bf04

fix(webapp): auto-recover replication services after stream errors

Fix ExponentialBackoff maxElapsed timeout bypass in execute()

bf858f4

The execute() method tracked elapsedMs across retries but never checked it against maxElapsed, allowing the retry loop to continue indefinitely regardless of the configured time limit.

github-actions Bot closed this May 26, 2026

devin-ai-integration Bot reviewed May 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: ExponentialBackoff execute() bypasses maxElapsed timeout#3750

fix: ExponentialBackoff execute() bypasses maxElapsed timeout#3750
deepshekhardas wants to merge 20 commits into
triggerdotdev:mainfrom
deepshekhardas:fix/3726-backoff-maxelapsed

deepshekhardas commented May 26, 2026

Uh oh!

changeset-bot Bot commented May 26, 2026

Uh oh!

github-actions Bot commented May 26, 2026

Uh oh!

coderabbitai Bot commented May 26, 2026 •

edited

Loading

Review failed

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot May 26, 2026

Uh oh!

devin-ai-integration Bot May 26, 2026

Uh oh!

devin-ai-integration Bot May 26, 2026

Uh oh!

devin-ai-integration Bot May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -260,8 +261,7 @@ export async function updateTriggerPackages(
		await installDependencies({ cwd: projectPath, silent: true });

Uh oh!

Conversation

deepshekhardas commented May 26, 2026

Uh oh!

changeset-bot Bot commented May 26, 2026

🦋 Changeset detected

Uh oh!

github-actions Bot commented May 26, 2026

Uh oh!

coderabbitai Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Estimated code review effort

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 26, 2026 •

edited

Loading