Skip to content

fix: lock dead_letter rows in dlq_replay#287

Draft
NikolayS wants to merge 1 commit into
mainfrom
claude/fix-dlq-replay-race-oqxpbr
Draft

fix: lock dead_letter rows in dlq_replay#287
NikolayS wants to merge 1 commit into
mainfrom
claude/fix-dlq-replay-race-oqxpbr

Conversation

@NikolayS

Copy link
Copy Markdown
Owner

Bug

pgque.dlq_replay(i_dead_letter_id) selected the pgque.dead_letter row (joined to pgque.queue) without for update, then called pgque.insert_event(), then deleted the row. Two concurrent dlq_replay() calls for the same dl_id (the function is granted to pgque_writer, so any two writers) could both pass the unlocked select and both call insert_event() — the dead-lettered event was re-enqueued twice. Both deletes then "succeeded" (the loser's delete removed 0 rows, silently). pgque.dlq_replay_all() had the same unlocked-select shape in its loop query.

Fix

  • dlq_replay(): the initial select now ends with for update of dl, locking only the dead_letter row (not the joined pgque.queue row). The second concurrent caller blocks on the row lock; after the first transaction commits its delete, the second's select re-evaluates under read committed, finds no row, and the existing if not found branch raises the existing error: dead letter entry not found: <id>. No behavior change for non-concurrent callers.
  • dlq_replay_all(): the loop query now uses for update of dl skip locked.

Locking-choice rationale for dlq_replay_all: skip locked fits the "replay everything" semantics better than a blocking for update. A row locked by a concurrent dlq_replay()/dlq_replay_all() is already being handled by that session; blocking would only make this call wait so it could either replay the row a second time (the exact race being fixed) or count a guaranteed not found-style failure. Skipping leaves the row to its owner; if that owner rolls back, the row is still in the DLQ for the next replay pass.

Error messages, grants, security definer + set search_path = pgque, pg_catalog, and SQL style are unchanged. Generated files sql/pgque.sql and sql/pgque-tle.sql are regenerated via bash build/transform.sh and committed together with the source change (only the dlq chunk differs).

TDD / verification

New deterministic two-session harness, following the pattern of tests/two_session_receive_lock.sh: tests/two_session_dlq_replay_race.sh. Session 1 runs begin; select pgque.dlq_replay(<dl_id>); pg_sleep(4); commit;; once session 1 is observed inside pg_sleep via pg_stat_activity, session 2 calls pgque.dlq_replay(<dl_id>) for the same id. The harness then asserts session 2 fails with dead letter entry not found, exactly 1 event of the replayed type is received, and the DLQ is empty.

Red (unfixed code, origin/main install):

$ PGQUE_TEST_DSN=postgresql:///pgque_dlq_oqxpbr tests/two_session_dlq_replay_race.sh
FAIL: session2 replay succeeded; expected 'dead letter entry not found' after waiting on session1
--- session1.out ---
 s1_new_eid=2003
--- session2.out ---
 s2_new_eid=2004

Both sessions re-enqueued the same dead letter (two new event ids).

Green (fixed build, fresh install):

$ psql -d pgque_dlq_oqxpbr -v ON_ERROR_STOP=1 -f sql/pgque.sql   # exit 0
$ PGQUE_TEST_DSN=postgresql:///pgque_dlq_oqxpbr tests/two_session_dlq_replay_race.sh
PASS: concurrent dlq_replay serialized; second caller got 'dead letter entry not found' and the event was re-enqueued exactly once

Full regression suite (fresh DB, fixed build):

$ psql -d pgque_dlq_oqxpbr -v ON_ERROR_STOP=1 -f tests/run_all.sql   # exit 0
153 PASS notices, including test_api_dlq and test_dlq_edge_cases

Generated-file sync: bash build/transform.sh after the source edit; git status showed only sql/pgque-additions/dlq.sql, sql/pgque.sql, sql/pgque-tle.sql, and the new test changed; the embedded chunks match the source edit.

Manual verification command for reviewers:

createdb pgque_dlq_check
psql -d pgque_dlq_check -v ON_ERROR_STOP=1 -f sql/pgque.sql
PGQUE_TEST_DSN=postgresql:///pgque_dlq_check tests/two_session_dlq_replay_race.sh
psql -d pgque_dlq_check -v ON_ERROR_STOP=1 -f tests/run_all.sql

Addresses finding A3 of #283

https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv


Generated by Claude Code

Two concurrent pgque.dlq_replay() calls for the same dl_id could
both pass the unlocked existence select, both call insert_event(),
and re-enqueue the dead-lettered event twice; the second delete then
silently removed 0 rows. pgque.dlq_replay_all() had the same
unlocked-select shape.

dlq_replay() now locks the dead_letter row with 'for update of dl':
the second caller blocks, re-evaluates after the first commits, finds
no row, and raises the existing 'dead letter entry not found' error.
dlq_replay_all() uses 'for update of dl skip locked' so a bulk replay
skips rows already being replayed by a concurrent session instead of
blocking or double-replaying them.

Adds tests/two_session_dlq_replay_race.sh, a deterministic two-session
harness that fails on the unfixed code (event enqueued twice) and
passes with the row lock (one event, clean error for the loser).

Verification:
  bash build/transform.sh
  psql -d <db> -v ON_ERROR_STOP=1 -f sql/pgque.sql
  PGQUE_TEST_DSN=postgresql:///<db> tests/two_session_dlq_replay_race.sh
  psql -d <db> -v ON_ERROR_STOP=1 -f tests/run_all.sql

Addresses finding A3 of #283.

https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants