fix(worker): run stuck-job recovery on a fixed cadence (DEV-35)#7

Merged
RndmCodeGuy20 merged 1 commit into staging from feat/dev-35-periodic-stuck-job-recovery
Jun 16, 2026

@RndmCodeGuy20

Owner

Closes DEV-35.

Recovery (_recover_stuck_pending) was only called on the idle path of consume() (when xreadgroup returned nothing). Under sustained load the worker never idles, so the sole mechanism for re-queueing jobs left in_progress by a crashed worker never fired — precisely when stuck jobs are most likely.
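The starvation described above can be reproduced with a small stand-in. This is only a sketch of the pre-fix shape, not the project's actual worker: `IdleOnlyWorker`, `read_batch`, and the `recoveries` counter are hypothetical names, and `read_batch` stands in for the `xreadgroup` call.

```python
class IdleOnlyWorker:
    """Hypothetical pre-fix shape: recovery runs only when a poll is empty."""

    def __init__(self, read_batch):
        self._read_batch = read_batch  # stand-in for the xreadgroup call
        self.recoveries = 0            # counts recovery sweeps, for illustration

    def _recover_stuck_pending(self):
        # The real method would re-queue stale in_progress jobs.
        self.recoveries += 1

    def consume(self):
        messages = self._read_batch()
        if not messages:
            # Idle path: the only place recovery was triggered.
            self._recover_stuck_pending()
            return []
        return messages


# Under sustained load every poll returns work, so recovery never runs.
busy = IdleOnlyWorker(read_batch=lambda: ["job"])
for _ in range(1000):
    busy.consume()

# An idle worker sweeps on its first empty poll.
idle = IdleOnlyWorker(read_batch=lambda: [])
idle.consume()
```

Here `busy.recoveries` stays at 0 after a thousand polls, while `idle.recoveries` is 1 after a single empty one.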

Fix

  • Add _maybe_recover() — a time-gated wrapper (_recovery_interval = 120s, matching the 2-min staleness threshold in the recovery query)
  • Call it at the top of every consume(), decoupled from message availability; remove the idle-path-only call
  • _last_recovery starts at 0 so leftovers from a prior crash are swept on the first poll
  • Tests: recovery fires on the load path when the interval has elapsed, and is gated when it hasn't
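The time gate in the list above might look roughly like the following. This is a minimal sketch under stated assumptions, not the merged code: the `Worker` class, the injectable `clock` and `read_batch` parameters, the `recoveries` counter, and the `>=` comparison are illustrative. Only `_maybe_recover`, `_recover_stuck_pending`, `_recovery_interval`, `_last_recovery`, and `consume` come from the description.

```python
import time


class Worker:
    """Sketch of a worker that gates recovery on elapsed time, not idleness."""

    def __init__(self, read_batch, clock=time.time, recovery_interval=120.0):
        self._read_batch = read_batch      # stand-in for the xreadgroup call
        self._clock = clock                # injectable so tests can control time
        self._recovery_interval = recovery_interval
        self._last_recovery = 0.0          # 0 so the first poll always sweeps
        self.recoveries = 0                # counts sweeps, for illustration

    def _recover_stuck_pending(self):
        # The real method would re-queue stale in_progress jobs.
        self.recoveries += 1

    def _maybe_recover(self):
        now = self._clock()
        if now - self._last_recovery >= self._recovery_interval:
            self._recover_stuck_pending()
            self._last_recovery = now

    def consume(self):
        # Runs on every poll, whether or not messages are available.
        self._maybe_recover()
        return self._read_batch()


# Simulated load path: every poll returns a message; time is advanced by hand.
now = [1000.0]
w = Worker(read_batch=lambda: ["job"], clock=lambda: now[0])
w.consume()      # first poll: interval since 0 has elapsed, recovery runs
now[0] += 60
w.consume()      # only 60s since the last sweep: gated
now[0] += 60
w.consume()      # 120s since the last sweep: recovery runs again
```

After these three polls `w.recoveries` is 2, which mirrors the two test cases listed: recovery fires on the load path once the interval has elapsed and is skipped when it has not. The sketch defaults to wall-clock `time.time` because a first-poll sweep with `_last_recovery = 0` relies on the clock value already exceeding the interval. A monotonic clock on a freshly booted host may not satisfy that.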

🤖 Generated with Claude Code

- Time-gate recovery via _maybe_recover and call it every consume()
- Recover regardless of load instead of only on the idle path
- Stop in_progress jobs from a crashed worker getting stuck under load
- Sweep leftovers at startup (_last_recovery starts at 0)
- Add tests for the load-path firing and the interval gate
@RndmCodeGuy20 RndmCodeGuy20 merged commit ca32b39 into staging Jun 16, 2026