fix(ts): back off on nack/ack failure in consumer#289

Draft
NikolayS wants to merge 2 commits into main from claude/fix-ts-consumer-backoff-oqxpbr

@NikolayS (Owner)

Bug

Consumer.start() in clients/typescript/src/consumer.ts slept pollIntervalMs only on receive error or empty batch. When a batch was received but finalization failed — a nack failed so ack was intentionally skipped, or ack() threw — the while loop re-polled immediately. Because the batch was never finished, pgque.next_batch returns the same batch instantly, so:

  • a persistent nack/ack failure (e.g. partial grants where receive works but nack/ack don't) becomes a tight loop at full speed: re-receive → re-run ALL handlers (duplicate side effects) → fail → repeat, hammering both the application and Postgres;
  • even a single transient ack failure re-executes the whole batch's handlers with zero delay.

The comment on the skip-ack path even claimed redelivery happens "on the next poll" — a poll interval that did not exist on that path.

Fix

await sleep(this.pollIntervalMs, signal) before re-polling on both the anyNackFailed path and the ack-throw path. The existing sleep helper is abort-aware, so shutdown latency is unchanged. ack returning 0 without throwing stays warning-only with no sleep (the batch is finished in that case; the loop should keep draining).
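As a minimal sketch of the control flow described above (not the actual contents of consumer.ts), the loop below sleeps on both finalization-failure paths and keeps draining when ack returns 0. The `QueueClient` and `Batch` shapes, the `runLoop` name, the `null` return for an empty poll, and the idea that nack is called when a handler throws are all assumptions made for illustration.

```typescript
// Hypothetical shapes; the real client API may differ.
interface Batch { id: number; events: string[] }
interface QueueClient {
  receive(): Promise<Batch | null>;          // null: nothing to process (assumed)
  nack(batchId: number, event: string): Promise<void>;
  ack(batchId: number): Promise<number>;     // 0: nothing was finished (assumed)
}

// Abort-aware sleep: resolves after `ms`, or as soon as `signal` aborts.
function sleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise((resolve) => {
    if (signal.aborted) return resolve();
    const onAbort = () => { clearTimeout(timer); resolve(); };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

async function runLoop(
  client: QueueClient,
  handler: (event: string) => Promise<void>,
  pollIntervalMs: number,
  signal: AbortSignal,
): Promise<void> {
  while (!signal.aborted) {
    let batch: Batch | null;
    try {
      batch = await client.receive();
    } catch {
      await sleep(pollIntervalMs, signal);   // receive error: back off
      continue;
    }
    if (batch === null) {
      await sleep(pollIntervalMs, signal);   // empty poll: back off
      continue;
    }

    let anyNackFailed = false;
    for (const event of batch.events) {
      try {
        await handler(event);
      } catch {
        try {
          await client.nack(batch.id, event);
        } catch {
          anyNackFailed = true;
        }
      }
    }

    if (anyNackFailed) {
      // Ack is skipped so the batch is redelivered. Sleeping here is what
      // makes "redelivered on the next poll" true instead of "immediately".
      await sleep(pollIntervalMs, signal);
      continue;
    }

    try {
      const finished = await client.ack(batch.id);
      // Warning only, no sleep: the batch is finished, keep draining.
      if (finished === 0) console.warn(`ack of batch ${batch.id} returned 0`);
    } catch {
      await sleep(pollIntervalMs, signal);   // ack threw: batch is still open
    }
  }
}
```

Because `sleep` resolves on abort, calling `abort()` on the controller ends the loop without waiting out the poll interval, which matches the claim that shutdown latency is unchanged.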

Red/green TDD: two new mock-based tests (pollInterval: 60_000, receive always returns the same batch, nack/ack fail persistently) assert receive is called exactly once within a 300 ms observation window. Unfixed, the hot loop is so tight that the vitest worker dies of OOM recording mock calls — that was the red run. Green after the fix.
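The observation-window idea can be sketched without a test framework. The helper below is not the PR's vitest code; `countReceivesDuring`, `makeStandInLoop`, and `abortableDelay` are hypothetical names, and the stand-in loop simply treats every finalization as failed. It counts how many times `receive` is called while a loop runs for a fixed window, then aborts the loop.

```typescript
type Receive = () => Promise<string[]>;
type StartLoop = (receive: Receive, signal: AbortSignal) => Promise<void>;

// Resolves after `ms`, or as soon as `signal` aborts.
function abortableDelay(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise((resolve) => {
    if (signal.aborted) return resolve();
    const timer = setTimeout(resolve, ms);
    signal.addEventListener(
      "abort",
      () => { clearTimeout(timer); resolve(); },
      { once: true },
    );
  });
}

// Runs the loop for `windowMs`, aborts it, and reports how many times
// `receive` was called. `receive` always returns the same batch.
async function countReceivesDuring(
  startLoop: StartLoop,
  windowMs: number,
): Promise<number> {
  let calls = 0;
  const receive: Receive = async () => { calls += 1; return ["same-batch"]; };
  const controller = new AbortController();
  const loop = startLoop(receive, controller.signal);
  await new Promise<void>((resolve) => setTimeout(resolve, windowMs));
  controller.abort();
  await loop;
  return calls;
}

// Stand-in consumer: every iteration "fails finalization" and then waits
// `backoffMs`. A delay of 0 models the missing backoff.
function makeStandInLoop(backoffMs: number): StartLoop {
  return async (receive, signal) => {
    while (!signal.aborted) {
      await receive();
      await abortableDelay(backoffMs, signal);
    }
  };
}
```

With a 60_000 ms backoff and a short window, `countReceivesDuring(makeStandInLoop(60_000), 50)` should observe a single call, while `makeStandInLoop(0)` re-polls repeatedly. A zero delay still yields to the timer queue, so this stand-in stays observable; the PR reports that the real unfixed loop exhausted the heap instead.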

Second commit (informational note from the same review): src/producer_bench.ts spliced the table name from pgque.current_event_table() into select count(*) from ${table} via a template literal — not exploitable (self-generated value, dev-only script) but the only non-parameterized query in the repo. Now quoted part by part with pg's escapeIdentifier. Also fixed the result generic: the pgque pool parses int8 (OID 20) to BigInt, so count(*) arrives as bigint, not string; the old <{ count: string }> + Number.parseInt only worked by coercion.
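A rough illustration of the two changes in that commit, under stated assumptions: `quoteIdent` is a local stand-in that mirrors the quoting rule of pg's escapeIdentifier (wrap in double quotes, double any embedded double quote) so the snippet runs without a database driver, and `quoteQualifiedName`, `countToNumber`, and the table name `pgque.event_1_0` are hypothetical. It is not the code in producer_bench.ts.

```typescript
// Stand-in for pg's escapeIdentifier: quote one identifier part.
function quoteIdent(part: string): string {
  return '"' + part.replace(/"/g, '""') + '"';
}

// Quote a schema-qualified name part by part.
// Assumption: parts are separated by "." and contain no dots themselves.
function quoteQualifiedName(name: string): string {
  return name.split(".").map(quoteIdent).join(".");
}

// count(*) is int8; with int8 parsed to BigInt the field arrives as a bigint,
// so convert explicitly instead of relying on Number.parseInt coercion.
function countToNumber(count: bigint): number {
  if (count > BigInt(Number.MAX_SAFE_INTEGER)) {
    throw new RangeError(`count ${count} exceeds Number.MAX_SAFE_INTEGER`);
  }
  return Number(count);
}

// Hypothetical value standing in for what pgque.current_event_table() returns.
const table = quoteQualifiedName("pgque.event_1_0");
const sql = `select count(*) from ${table}`;

// Hypothetical row, typed the way the corrected generic describes it.
const row: { count: bigint } = { count: BigInt(42) };
const total = countToNumber(row.count);
```

Quoting each part separately matters because quoting the whole string at once would produce a single identifier containing a dot rather than a schema and a table.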

Verification

All run in clients/typescript/ (deps via bun install):

bun run check
# tsc --noEmit -p tsconfig.json && tsc --noEmit -p tsconfig.test.json — clean

npx vitest run    # without PGQUE_TEST_DSN
# Test Files  7 passed | 1 skipped (8)
#      Tests  35 passed | 50 skipped (85)

# Integration: fresh scratch DB on local PG 16, pgque installed via \i sql/pgque.sql
PGQUE_TEST_DSN="postgres://root:***@localhost:5432/pgque_ts_b1_oqxpbr" npx vitest run
# Test Files  8 passed (8)
#      Tests  85 passed (85)

# Bench script live run (exercises the quoted-identifier path; verifyCount passes):
PGQUE_TEST_DSN=... PGQUE_BENCH_REPEATS=1 bun src/producer_bench.ts
# table + CSV output produced, no verification errors

Red run (before the fix, same test file): the consumer.test.ts worker crashed with FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory — direct evidence of the unbounded hot loop.

Addresses finding B1 (TypeScript) and the producer_bench informational note of #283.

https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv


Generated by Claude Code

claude added 2 commits June 10, 2026 13:34
The Consumer.start() loop slept pollInterval only on receive error or
empty batch. When a batch was received but finalization failed (a nack
failed so ack was skipped, or ack threw), the loop re-polled instantly;
since the batch was never finished, pgque.next_batch returned the same
batch immediately. A persistent nack/ack failure (e.g. partial grants
where receive works but nack/ack do not) therefore hot-looped at full
speed, re-running every handler with duplicate side effects, and even a
single transient ack failure re-executed the batch with zero delay.

Sleep pollIntervalMs (abort-aware) before re-polling on both paths.
Ack returning 0 without throwing stays warning-only with no sleep.

Red/green: the two new mock-based tests hot-loop so hard unfixed that
the vitest worker dies of OOM recording mock calls; green after fix.

Addresses finding B1 (TypeScript) of #283.

https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv

producer_bench.ts spliced the table name returned by
pgque.current_event_table() into 'select count(*) from ...' via a raw
template literal -- the only non-parameterized query in the repo. The
value is self-generated and the script is dev-only, so this is not
exploitable, but quote it properly anyway with pg's escapeIdentifier,
part by part for the schema-qualified name.

Also fix the result generic: the pgque pool parses int8 (OID 20) to
BigInt, so count(*) arrives as bigint, not string; the old code only
worked because Number.parseInt coerces its argument.

Verified: bun run check, plus a live run against local PG 16
(PGQUE_BENCH_REPEATS=1 bun src/producer_bench.ts) where verifyCount
passes for all batch sizes.

Addresses the producer_bench informational note of #283.

https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv