fix: lock dead_letter rows in dlq_replay #287
Draft
NikolayS wants to merge 1 commit into
Two concurrent pgque.dlq_replay() calls for the same dl_id could both pass the unlocked existence select, both call insert_event(), and re-enqueue the dead-lettered event twice; the second delete then silently removed 0 rows. pgque.dlq_replay_all() had the same unlocked-select shape.

dlq_replay() now locks the dead_letter row with 'for update of dl': the second caller blocks, re-evaluates after the first commits, finds no row, and raises the existing 'dead letter entry not found' error.

dlq_replay_all() uses 'for update of dl skip locked' so a bulk replay skips rows already being replayed by a concurrent session instead of blocking or double-replaying them.

Adds tests/two_session_dlq_replay_race.sh, a deterministic two-session harness that fails on the unfixed code (event enqueued twice) and passes with the row lock (one event, clean error for the loser).

Verification:

    bash build/transform.sh
    psql -d <db> -v ON_ERROR_STOP=1 -f sql/pgque.sql
    PGQUE_TEST_DSN=postgresql:///<db> tests/two_session_dlq_replay_race.sh
    psql -d <db> -v ON_ERROR_STOP=1 -f tests/run_all.sql

Addresses finding A3 of #283.

https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv
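The skip-locked behaviour described above for the bulk replay can be pictured with a small in-process model. This is only a sketch under stated assumptions: the dictionary, the `locked_by_others` set, and the `replay_all_model` function are hypothetical stand-ins and not pgque code, and a Python set is a much cruder thing than a PostgreSQL row lock.

```python
def replay_all_model(dlq, queue, locked_by_others):
    """Hypothetical model of a bulk replay that skips locked rows.

    dlq: dict mapping dl_id -> payload (stands in for the dead_letter table)
    queue: list receiving re-enqueued payloads (stands in for insert_event)
    locked_by_others: set of dl_ids another session is currently replaying
    Returns the number of rows replayed by this call.
    """
    replayed = 0
    # sorted() takes a snapshot of the ids, so removing entries while
    # looping is safe.
    for dl_id in sorted(dlq):
        if dl_id in locked_by_others:
            # Models 'skip locked': leave the row to the session that owns it.
            continue
        queue.append(dlq.pop(dl_id))  # re-enqueue, then delete the row
        replayed += 1
    return replayed
```

In this model a row that is skipped stays in `dlq`, so if its owner rolls back (modeled by removing the id from `locked_by_others`), a later call picks it up.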
Bug
`pgque.dlq_replay(i_dead_letter_id)` selected the `pgque.dead_letter` row (joined to `pgque.queue`) without `for update`, then called `pgque.insert_event()`, then deleted the row. Two concurrent `dlq_replay()` calls for the same `dl_id` (the function is granted to `pgque_writer`, so any two writers) could both pass the unlocked select and both call `insert_event()`, so the dead-lettered event was re-enqueued twice. Both deletes then "succeeded" (the loser's delete removed 0 rows, silently). `pgque.dlq_replay_all()` had the same unlocked-select shape in its loop query.

Fix
- `dlq_replay()`: the initial select now ends with `for update of dl`, locking only the `dead_letter` row (not the joined `pgque.queue` row). The second concurrent caller blocks on the row lock; after the first transaction commits its delete, the second's select re-evaluates under read committed, finds no row, and the existing `if not found` branch raises the existing error: `dead letter entry not found: <id>`. No behavior change for non-concurrent callers.
- `dlq_replay_all()`: the loop query now uses `for update of dl skip locked`.

Locking-choice rationale for `dlq_replay_all`: `skip locked` fits the "replay everything" semantics better than a blocking `for update`. A row locked by a concurrent `dlq_replay()` / `dlq_replay_all()` is already being handled by that session; blocking would only make this call wait so it could either replay the row a second time (the exact race being fixed) or count a guaranteed `not found`-style failure. Skipping leaves the row to its owner; if that owner rolls back, the row is still in the DLQ for the next replay pass.

Error messages, grants, `security definer` + `set search_path = pgque, pg_catalog`, and SQL style are unchanged. Generated files `sql/pgque.sql` and `sql/pgque-tle.sql` are regenerated via `bash build/transform.sh` and committed together with the source change (only the dlq chunk differs).

TDD / verification
New deterministic two-session harness, following the pattern of `tests/two_session_receive_lock.sh`: `tests/two_session_dlq_replay_race.sh`. Session 1 runs `begin; select pgque.dlq_replay(<dl_id>); pg_sleep(4); commit;`; once session 1 is observed inside `pg_sleep` via `pg_stat_activity`, session 2 calls `pgque.dlq_replay(<dl_id>)` for the same id. The harness then asserts session 2 fails with `dead letter entry not found`, exactly 1 event of the replayed type is received, and the DLQ is empty.

Red (unfixed code, origin/main install):
Both sessions re-enqueued the same dead letter (two new event ids).
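The double enqueue in the red run comes from a check-then-act sequence with nothing serializing the two sessions. The sketch below models that interleaving in-process with generators so it is deterministic; it is an illustration only, not pgque code. Every name in it (`replay_model`, `run_race`, `NotFound`, the `locks` set) is hypothetical, and the set-based lock is a simplification of PostgreSQL row locking under read committed.

```python
class NotFound(Exception):
    """Stands in for the 'dead letter entry not found' error."""


def replay_model(dlq, queue, locks, dl_id, use_lock):
    """Hypothetical model of one replay call.

    Each yield marks a point where the other session may run.
    """
    if use_lock:
        # Models 'for update': wait while another session holds the row.
        while dl_id in locks:
            yield "blocked"
        if dl_id in dlq:
            locks.add(dl_id)
    # Existence check; with the lock it runs only after any wait is over.
    if dl_id not in dlq:
        raise NotFound(dl_id)
    payload = dlq[dl_id]
    yield "selected"
    queue.append(payload)   # re-enqueue the event
    dlq.pop(dl_id, None)    # delete; removes nothing if the row is gone
    yield "replayed"
    locks.discard(dl_id)    # commit releases the lock


def run_race(use_lock):
    """Interleave two model sessions replaying the same dead letter."""
    dlq, queue, locks = {7: "event-payload"}, [], set()
    s1 = replay_model(dlq, queue, locks, 7, use_lock)
    s2 = replay_model(dlq, queue, locks, 7, use_lock)
    outcome = "ok"
    next(s1)                # session 1 passes its select
    try:
        next(s2)            # session 2 starts while session 1 is in flight
        for _ in s1:        # session 1 finishes and commits
            pass
        for _ in s2:        # session 2 continues
            pass
    except NotFound:
        outcome = "not found"
    return queue, dlq, outcome
```

In this model, `run_race(False)` leaves two copies of the payload in `queue` with both sessions reporting success, while `run_race(True)` leaves one copy and the second session ends with the not-found outcome, mirroring the red and green results of the harness.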
Green (fixed build, fresh install):
Full regression suite (fresh DB, fixed build):
Generated-file sync: `bash build/transform.sh` after the source edit; `git status` showed only `sql/pgque-additions/dlq.sql`, `sql/pgque.sql`, `sql/pgque-tle.sql`, and the new test changed; the embedded chunks match the source edit.

Manual verification command for reviewers:
Addresses finding A3 of #283
https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv
Generated by Claude Code