fix(worker): run stuck-job recovery on a fixed cadence (DEV-35) by RndmCodeGuy20 · Pull Request #7 · RndmCodeGuy20/mpiper

RndmCodeGuy20 · 2026-06-16T17:21:28Z

Closes DEV-35.

Recovery (_recover_stuck_pending) was only called on the idle path of consume() (when xreadgroup returned nothing). Under sustained load the worker never idles, so the sole mechanism for re-queueing jobs left in_progress by a crashed worker never fired — precisely when stuck jobs are most likely.
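The starvation described above can be reproduced with a small stand-in. This is only a sketch: `IdleOnlyWorker` and its `batches` argument are hypothetical and replace the real Redis `xreadgroup` call, which is not shown in this PR. Only `_recover_stuck_pending` and `consume` are names taken from the description.

```python
class IdleOnlyWorker:
    """Hypothetical stand-in for the pre-fix consume() control flow."""

    def __init__(self, batches):
        # Each item plays the role of one xreadgroup result (assumption).
        self._batches = iter(batches)
        self.recoveries = 0
        self.processed = []

    def _recover_stuck_pending(self):
        # The real method re-queues stale in_progress jobs; here we only count calls.
        self.recoveries += 1

    def consume(self):
        messages = next(self._batches, [])
        if not messages:
            # Idle path: the only place recovery ran before this change.
            self._recover_stuck_pending()
            return
        self.processed.extend(messages)


# Sustained load: every poll returns work, so recovery never runs.
busy = IdleOnlyWorker([["job-1"], ["job-2"], ["job-3"]])
for _ in range(3):
    busy.consume()

# One empty poll in between is enough to trigger recovery once.
mixed = IdleOnlyWorker([["job-1"], [], ["job-2"]])
for _ in range(3):
    mixed.consume()
```

In this sketch `busy.recoveries` stays at 0 no matter how many polls run, which is the failure mode the PR addresses.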

Fix

- Add _maybe_recover(), a time-gated wrapper (_recovery_interval = 120s, matching the 2-min staleness threshold in the recovery query)
- Call it at the top of every consume(), decoupled from message availability; remove the idle-path-only call
- _last_recovery starts at 0 so leftovers from a prior crash are swept on the first poll
- Tests: recovery fires on the load path when the interval has elapsed, and is gated when it hasn't
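A minimal sketch of the time gate listed above, not the merged code. The names `_maybe_recover`, `_recover_stuck_pending`, `_recovery_interval`, `_last_recovery`, and `consume` come from this description; the `Worker` class, the `read_messages` callable, and the injectable `clock` are assumptions made so the example runs without Redis. The sketch also assumes a wall-clock source (`time.time`), for which a starting value of 0 always lets the first poll through; the PR does not say which clock the worker uses.

```python
import time


class Worker:
    """Hypothetical worker showing only the recovery gate."""

    def __init__(self, read_messages, clock=time.time):
        self._read_messages = read_messages  # stands in for xreadgroup (assumption)
        self._clock = clock                  # injectable for tests (assumption)
        self._recovery_interval = 120.0      # seconds, matches the 2-min staleness threshold
        self._last_recovery = 0.0            # 0 so the first poll sweeps leftovers
        self.recoveries = 0

    def _recover_stuck_pending(self):
        # The real method re-queues stale in_progress jobs; here we only count calls.
        self.recoveries += 1

    def _maybe_recover(self):
        now = self._clock()
        if now - self._last_recovery < self._recovery_interval:
            return False  # gated: the interval has not elapsed yet
        self._recover_stuck_pending()
        self._last_recovery = now
        return True

    def consume(self):
        # Runs on every poll, whether or not messages are available.
        self._maybe_recover()
        return self._read_messages()


# Simulated load path: every poll returns a message.
fake_now = [1000.0]
worker = Worker(read_messages=lambda: ["job"], clock=lambda: fake_now[0])
worker.consume()        # first poll: 1000 - 0 >= 120, recovery runs
fake_now[0] += 30
worker.consume()        # 30 s later: gated
fake_now[0] += 120
worker.consume()        # 150 s after the last recovery: runs again
```

Injecting the clock mirrors the two tests mentioned in the list: one poll where the interval has elapsed and one where it has not, both while messages keep arriving.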

🤖 Generated with Claude Code

- Time-gate recovery via _maybe_recover and call it every consume()
- Recover regardless of load instead of only on the idle path
- Stop in_progress jobs from a crashed worker getting stuck under load
- Sweep leftovers at startup (_last_recovery starts at 0)
- Add tests for the load-path firing and the interval gate

RndmCodeGuy20 merged commit ca32b39 into staging Jun 16, 2026

RndmCodeGuy20 merged 1 commit into staging from feat/dev-35-periodic-stuck-job-recovery
