Skip to content

Commit 98b16b3

Browse files
d-csclaude
andcommitted
fix(webapp): explicit opt-in for mollifier stale-sweep + SYSTEM_FAILURE fallback on cancel-bifurcation
Two fixes from CodeRabbit on PR #3754: 1. `TRIGGER_MOLLIFIER_STALE_SWEEP_ENABLED` was inheriting `TRIGGER_MOLLIFIER_ENABLED`, so any deployment already running the mollifier would auto-start the sweep worker on upgrade — defeating the point of having a separate kill switch. Hard-default to "0" so the sweep is an explicit ops decision. 2. In the cancel-bifurcation path, a non-conflict + non-retryable error from `createCancelledRun` rethrew out of the handler. The drainer's `onTerminalFailure` handler gates on `cause === "max-attempts-exhausted"` and skips "non-retryable", so `buffer.fail()` deleted the entry without ever writing a PG row — the cancelled run disappeared silently. Mirror the SYSTEM_FAILURE fallback the non-cancel trigger path already uses: write a terminal row via the shared helper, falling back to the original throw only when the write also fails non-retryably (and re-throw retryable write errors so the drainer requeues). UX trade: the customer initiated a cancel but sees SYSTEM_FAILURE if the cancel write itself was structurally rejected. Terminal-but- mislabelled beats terminal silence — the alternative was the run disappearing from the dashboard entirely. The path is narrow (only non-conflict, non-retryable createCancelledRun failures). Two new unit tests cover the cancel branch: non-retryable creates the SYSTEM_FAILURE row + ACKs the entry; retryable rethrows so the drainer requeues. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 8cab7c8 commit 98b16b3

3 files changed

Lines changed: 125 additions & 7 deletions

File tree

apps/webapp/app/env.server.ts

Lines changed: 13 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1104,13 +1104,19 @@ const EnvironmentSchema = z
11041104
// Periodic sweep that scans buffer queue LISTs for entries whose
11051105
// dwell exceeds the stale threshold. Independent of the drainer —
11061106
// its job is exactly to make a stuck/offline drainer visible to
1107-
// ops. Defaults: enabled when the mollifier is enabled, run every
1108-
// 5 minutes, alert on anything that's been dwelling for 5+ minutes
1109-
// (matches the sweep interval — "anything still here when we
1110-
// check" is the simplest threshold that converges).
1111-
TRIGGER_MOLLIFIER_STALE_SWEEP_ENABLED: z
1112-
.string()
1113-
.default(process.env.TRIGGER_MOLLIFIER_ENABLED ?? "0"),
1107+
// ops. Defaults: explicitly opt-in (a separate kill switch from
1108+
// the mollifier itself), run every 5 minutes, alert on anything
1109+
// that's been dwelling for 5+ minutes (matches the sweep interval
1110+
// — "anything still here when we check" is the simplest threshold
1111+
// that converges).
1112+
//
1113+
// The sweep was previously defaulting to inherit
1114+
// `TRIGGER_MOLLIFIER_ENABLED`, which meant any deployment already
1115+
// running with the mollifier on would auto-start the sweep worker
1116+
// on upgrade — turning on new background load with no explicit
1117+
// rollout step. Hard-defaulting to "0" preserves the intent of
1118+
// exposing the sweep as a separate switch.
1119+
TRIGGER_MOLLIFIER_STALE_SWEEP_ENABLED: z.string().default("0"),
11141120
TRIGGER_MOLLIFIER_STALE_SWEEP_INTERVAL_MS: z.coerce
11151121
.number()
11161122
.int()

apps/webapp/app/v3/mollifier/mollifierDrainerHandler.server.ts

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -103,6 +103,39 @@ export function createDrainerHandler(deps: {
103103
const isConflict =
104104
err instanceof Error && err.message.startsWith("createCancelledRun conflict");
105105
if (!isConflict) {
106+
// Mirror the SYSTEM_FAILURE fallback the non-cancelled
107+
// trigger path uses below. Without this branch, a
108+
// non-retryable createCancelledRun failure rethrows, the
109+
// drainer's onTerminalFailure handler skips because it
110+
// gates on `cause === "max-attempts-exhausted"` (and the
111+
// outer drainer classifies non-retryable failures with
112+
// `cause: "non-retryable"`), and buffer.fail() deletes
113+
// the entry — leaving NO PG row. The cancellation
114+
// disappears silently from the customer's dashboard.
115+
// Writing a SYSTEM_FAILURE row gives the run a terminal,
116+
// visible state.
117+
if (isRetryablePgError(err)) {
118+
throw err;
119+
}
120+
span.setAttribute("mollifier.cancel_terminal_failure_reason",
121+
err instanceof Error ? err.message : String(err));
122+
try {
123+
const wrote = await writeMollifierTerminalFailureRow(deps, {
124+
friendlyId: input.runId,
125+
snapshot: input.payload as Record<string, unknown>,
126+
reason: err instanceof Error ? err.message : String(err),
127+
});
128+
if (wrote) return;
129+
} catch (writeErr) {
130+
if (isRetryablePgError(writeErr)) {
131+
span.setAttribute("mollifier.cancel_terminal_write_retryable", true);
132+
throw writeErr;
133+
}
134+
span.setAttribute(
135+
"mollifier.cancel_terminal_write_error",
136+
writeErr instanceof Error ? writeErr.message : String(writeErr)
137+
);
138+
}
106139
throw err;
107140
}
108141
span.setAttribute("mollifier.cancel_conflict", true);

apps/webapp/test/mollifierDrainerHandler.test.ts

Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -348,6 +348,85 @@ describe("createDrainerHandler", () => {
348348
).rejects.toThrow("Can't reach database server");
349349
});
350350

351+
it("writes a SYSTEM_FAILURE row when createCancelledRun fails non-retryably (cancel bifurcation)", async () => {
352+
// Without this guard a non-conflict, non-retryable failure from
353+
// createCancelledRun rethrows out of the handler. The drainer's
354+
// onTerminalFailure gates on cause==="max-attempts-exhausted" and
355+
// skips "non-retryable", so buffer.fail() deletes the entry with
356+
// no PG row written — the cancellation disappears silently.
357+
// Mirror the non-cancel path's SYSTEM_FAILURE fallback so the
358+
// customer always sees a terminal row.
359+
const friendlyId = RunId.generate().friendlyId;
360+
const cancelErr = new Error("validation failed: bad cancel snapshot");
361+
const createCancelledRun = vi.fn(async () => {
362+
throw cancelErr;
363+
});
364+
const createFailedTaskRun = vi.fn(async () => ({ id: "internal_x" }));
365+
const handler = createDrainerHandler({
366+
engine: { createCancelledRun, createFailedTaskRun } as any,
367+
prisma: {} as any,
368+
});
369+
370+
await handler({
371+
runId: friendlyId,
372+
envId: "env_a",
373+
orgId: "org_1",
374+
payload: {
375+
friendlyId,
376+
taskIdentifier: "t",
377+
environment: envFixture,
378+
cancelledAt: new Date().toISOString(),
379+
cancelReason: "Canceled by user",
380+
},
381+
attempts: 0,
382+
createdAt: new Date(),
383+
} as any);
384+
385+
// SYSTEM_FAILURE row was written via the shared helper. Handler
386+
// returns cleanly so the drainer ACKs the entry instead of
387+
// buffer.fail()ing it.
388+
expect(createFailedTaskRun).toHaveBeenCalledOnce();
389+
expect(createFailedTaskRun.mock.calls[0][0].friendlyId).toBe(friendlyId);
390+
expect(createFailedTaskRun.mock.calls[0][0].error.raw).toContain(
391+
"validation failed: bad cancel snapshot"
392+
);
393+
});
394+
395+
it("requeues when createCancelledRun fails with a retryable PG error (cancel bifurcation)", async () => {
396+
// Retryable PG failures must rethrow so the drainer requeues the
397+
// entry — writing a SYSTEM_FAILURE row when PG is transiently
398+
// unreachable would still fail. The drainer's existing retry loop
399+
// handles the requeue.
400+
const friendlyId = RunId.generate().friendlyId;
401+
const cancelErr = new Error("Can't reach database server");
402+
const createCancelledRun = vi.fn(async () => {
403+
throw cancelErr;
404+
});
405+
const createFailedTaskRun = vi.fn();
406+
const handler = createDrainerHandler({
407+
engine: { createCancelledRun, createFailedTaskRun } as any,
408+
prisma: {} as any,
409+
});
410+
411+
await expect(
412+
handler({
413+
runId: friendlyId,
414+
envId: "env_a",
415+
orgId: "org_1",
416+
payload: {
417+
friendlyId,
418+
taskIdentifier: "t",
419+
environment: envFixture,
420+
cancelledAt: new Date().toISOString(),
421+
cancelReason: "Canceled by user",
422+
},
423+
attempts: 0,
424+
createdAt: new Date(),
425+
} as any)
426+
).rejects.toThrow("Can't reach database server");
427+
expect(createFailedTaskRun).not.toHaveBeenCalled();
428+
});
429+
351430
it("rethrows the original error when the snapshot lacks an environment block", async () => {
352431
const triggerErr = new Error("engine rejected the snapshot");
353432
const trigger = vi.fn(async () => {

0 commit comments

Comments
 (0)