fix(python): consumer loop drain, error survival, dead Event.extra #291

Draft
NikolayS wants to merge 3 commits into main from claude/fix-python-consumer-loop-oqxpbr

@NikolayS
Owner

Bugs

B2 (medium) — backlog drained at one batch per poll_interval.
Consumer.start() unconditionally ran _poll_once(conn) then _wait_for_notify_or_stop(conn). After a non-empty batch the consumer still waited up to poll_interval (default 30s) for a NOTIFY — but notifies fire only on new ticks, and the notifies for accumulated batches were emitted while the consumer was not listening. A consumer that was down with N batches accumulated drained them at ~1 batch per poll_interval and could never catch up.
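The drain rate described above can be made concrete with a small simulation. This is a sketch, not the client's code: waits_to_drain and its parameters are hypothetical names, and each counted wait stands in for one sleep of up to poll_interval.

```python
from collections import deque


def waits_to_drain(n_batches: int, repoll_after_batch: bool) -> int:
    """Count how many poll_interval-style waits occur before the backlog is empty.

    Hypothetical model: one loop iteration processes one batch, then either
    re-polls immediately or waits for a NOTIFY that never arrives.
    """
    queue = deque(range(n_batches))
    waits = 0
    while queue:
        queue.popleft()  # stands in for _poll_once processing one batch
        if queue and not repoll_after_batch:
            waits += 1  # old behaviour: wait even though batches remain
    return waits


if __name__ == "__main__":
    print("old loop waits:", waits_to_drain(3, repoll_after_batch=False))
    print("new loop waits:", waits_to_drain(3, repoll_after_batch=True))
```

With 3 accumulated batches, the old loop waits twice before it reaches the third batch, which at the default poll_interval can take up to about 60s; the re-polling loop does not wait at all while batches remain.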

B3 (medium) — a transient SQL error killed the consumer.
There was no try/except around the poll iteration, so a transient error from receive or ack (failover, restart, network blip) propagated out of start() and the consumer died, contradicting the documented "blocks until SIGTERM/SIGINT" contract. Nack failures were already handled, making this an oversight.

B5 (low) — Event.extra was silently dropped.
pgque/types.py declared Event.extra: dict[str, str], but client.send() unpacks only type and payload; extra was referenced nowhere. Data placed there was silently lost.

Fixes

  • B2: _poll_once now returns whether it processed a batch; the loop re-polls immediately after a non-empty batch and only waits for NOTIFY/timeout when the queue came back empty (matches the Go consumer). A batch left unacked due to a nack failure reports False, so redelivery waits a poll cycle instead of hot-looping.
  • B3: start() now wraps each connection session: on psycopg.Error / PgqueError it logs, sleeps poll_interval in short slices (so stop() stays prompt), reconnects, re-LISTENs, and resumes — paralleling the Go consumer's log-and-retry on receive errors. KeyboardInterrupt (BaseException) and stop() semantics are unchanged; handler exceptions and nack failures remain contained inside _poll_once.
  • B5: I checked the SQL surface first. The pgque.send overloads are (queue, payload) and (queue, type, payload) only, with no ev_extra1..4 parameters (those exist only on the PgQ primitive pgque.insert_event(7)), and a dict[str, str] has no natural mapping to four positional extra columns anyway. There is therefore nothing to wire extra into through pgque.send, so the dead field is removed. A never-working field that silently loses data is worse than a removed field: construction with extra= now fails loudly with TypeError.
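The control flow of the B2 and B3 fixes can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: SketchConsumer, TransientError, and the injected connect, poll_once, and wait_for_notify callables are all hypothetical stand-ins, and TransientError takes the place of psycopg.Error / PgqueError so the block runs without a database.

```python
import logging
import threading
import time

log = logging.getLogger("consumer_sketch")


class TransientError(Exception):
    """Stand-in for psycopg.Error / PgqueError in this sketch."""


class SketchConsumer:
    def __init__(self, connect, poll_once, wait_for_notify, poll_interval=30.0):
        self._connect = connect                  # hypothetical: opens a fresh session
        self._poll_once = poll_once              # hypothetical: True if a batch was processed
        self._wait_for_notify = wait_for_notify  # hypothetical: blocks up to poll_interval
        self._poll_interval = poll_interval
        self._stop = threading.Event()

    def stop(self):
        self._stop.set()

    def _sleep_in_slices(self, total, slice_s=0.1):
        # Wait in short slices so a stop() request is noticed quickly.
        deadline = time.monotonic() + total
        while not self._stop.is_set():
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return
            self._stop.wait(min(slice_s, remaining))

    def start(self):
        while not self._stop.is_set():
            try:
                # New session per attempt; a real client would re-LISTEN here.
                conn = self._connect()
                while not self._stop.is_set():
                    if self._poll_once(conn):
                        continue  # non-empty batch: re-poll immediately (B2)
                    self._wait_for_notify(conn)  # empty queue: wait for NOTIFY/timeout
            except TransientError:
                # Log and retry instead of letting the error escape start() (B3).
                # BaseException (e.g. KeyboardInterrupt) is deliberately not caught.
                log.exception("consumer session failed; retrying")
                self._sleep_in_slices(self._poll_interval)
```

Injecting the three callables keeps the sketch testable without PostgreSQL. In the sketch, a poll_once that returns False (for example after a nack failure) falls through to the wait, which is what prevents a hot loop on redelivery.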

All three fixes were done red/green: each new test was verified failing against the unfixed code, then turned green by the fix (separate commits per finding).

New tests (clients/python/tests/)

  • test_consumer_drains_backlog_within_one_poll_interval: 3 pre-accumulated batches, poll_interval=30; all must be consumed within 10s. Red on old code: 1/3 messages in 10.0s.
  • test_consumer_survives_transient_receive_error: first receive raises, consumer must recover and process the message.
  • test_consumer_survives_killed_backend: pg_terminate_backend on the consumer's session (what a DB restart/failover looks like to the client); consumer must reconnect, re-LISTEN, and consume an event produced after the kill.
  • test_consumer_stop_is_prompt_during_error_retry_wait: persistent receive failure with poll_interval=30; thread must stay alive and stop() must return in <2s.
  • test_event_rejects_extra_kwarg: Event(..., extra={...}) raises TypeError.
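The behaviour checked by the last test can be sketched as below. It assumes Event is a dataclass with type and payload fields; the class shown here is a hypothetical stand-in, not a copy of pgque/types.py, and rejects_extra_kwarg is an illustrative helper rather than the actual test.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical shape after the fix: the `extra` field is gone.
    type: str
    payload: str


def rejects_extra_kwarg() -> bool:
    """Return True if constructing Event with extra= raises TypeError."""
    try:
        Event(type="user.created", payload="{}", extra={"k": "v"})
    except TypeError:
        # The generated __init__ rejects unknown keyword arguments.
        return True
    return False


if __name__ == "__main__":
    print("extra= rejected:", rejects_extra_kwarg())
```

Because the generated constructor rejects unknown keywords, a caller that still passes extra= gets an immediate error at construction time instead of silently losing the data at send time.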

Verification

Against a fresh scratch database (PostgreSQL 16, sql/pgque.sql installed via psql -f):

$ createdb pgque_pyconsumer
$ psql -d pgque_pyconsumer -v ON_ERROR_STOP=1 -q -f sql/pgque.sql
$ pip install -e 'clients/python[dev]'
$ cd clients/python
$ PGQUE_TEST_DSN="postgresql:///pgque_pyconsumer" python -m pytest tests/ -q
76 passed in 49.68s

Red-state evidence before each fix:

  • B2: FAILED tests/test_consumer_resilience.py::test_consumer_drains_backlog_within_one_poll_interval — AssertionError: backlog not drained: 1/3 messages in 10.0s
  • B3: 3 failed, 1 passed (test_consumer_survives_transient_receive_error, test_consumer_survives_killed_backend, test_consumer_stop_is_prompt_during_error_retry_wait)
  • B5: FAILED tests/test_send.py::test_event_rejects_extra_kwarg - Failed: DID NOT RAISE

Addresses findings B2, B3, B5 of #283

https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv


Generated by Claude Code

claude added 3 commits June 10, 2026 13:33

After a non-empty batch, the consumer waited up to poll_interval
(default 30s) for a NOTIFY before polling again. Notifies fire only
on new ticks, so a consumer that was down with N batches accumulated
drained them at one batch per poll_interval and could never catch up.

_poll_once now reports whether it processed a batch; the loop
re-polls immediately after a non-empty batch and only waits for
NOTIFY/timeout when the queue came back empty. Matches the Go
consumer's behavior. Nack-failure batches still wait before
redelivery to avoid a hot loop.

Addresses finding B2 of #283.

https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv

A transient SQL error from receive or ack (failover, restart,
network blip) propagated out of start() and killed the consumer,
contradicting the documented 'blocks until SIGTERM/SIGINT' contract.
Nack failures were already handled, making this an oversight.

start() now wraps each connection session: on psycopg.Error or
PgqueError it logs, waits poll_interval (in short slices so stop()
stays prompt), reconnects, re-LISTENs, and resumes -- matching the
Go consumer's log-and-retry behavior. KeyboardInterrupt and stop()
handling are unchanged (BaseException is not caught).

Tested with a flaky receive, a pg_terminate_backend'ed session, and
a persistent failure during which stop() must return promptly.

Addresses finding B3 of #283.

https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv

Event.extra was declared but never transmitted: send() unpacks only
type and payload, and the SQL pgque.send overloads carry no
ev_extra1..4 parameters. A dict field also has no natural mapping to
the four positional extra columns. Data placed there was silently
lost, which is worse than not offering the field -- so remove it and
fail loudly (TypeError) on construction.

Addresses finding B5 of #283.

https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv